Functional Overview

Versal ACAP AIE-ML Architecture Manual (AM020)

Document ID
AM020
Release Date
2022-09-28
Revision
1.0 English

The AIE-ML is a highly optimized single-instruction, multiple-data (SIMD) and very-long instruction word (VLIW) processor that supports both fixed-point and floating-point precision. As shown in the following figure, the AIE-ML has a memory interface, a scalar unit, a vector unit, two load units, one store unit, and an instruction fetch and decode unit.

Figure 1. AIE-ML

The features of the AIE-ML include:

  • Instruction-based VLIW SIMD processor
  • 32-bit scalar RISC processor
    • Scalar register files and special registers
    • 32 x 32-bit multiplier (signed and unsigned)
    • 32-bit add/subtract
    • ALU operations like shifts, compares, and logical operations
    • No floating-point unit; floating-point operations are supported through emulation
  • Three address generator units (AGU)
    • Two 256-bit load and one 256-bit store units with aligned addresses
    • Supports 2D/3D addressing modes for ML functionality
    • On-the-fly decompression during loading of sparse weights
    • One AGU dedicated for the store unit
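The 2D/3D addressing modes the AGUs provide can be pictured as nested loops over a base address, per-dimension strides, and per-dimension counts. The following Python sketch is an illustrative model only; the function name and tuple-based pattern description are assumptions for the example, not the hardware register layout:

```python
def addresses_3d(base, strides, counts):
    """Generate the byte addresses a 3D access pattern would touch.

    base    -- starting byte address
    strides -- (outer, middle, inner) byte strides
    counts  -- (outer, middle, inner) iteration counts
    Illustrative model only; the real AGUs encode access patterns in
    dedicated modifier registers, not Python tuples.
    """
    addrs = []
    for i in range(counts[0]):
        for j in range(counts[1]):
            for k in range(counts[2]):
                addrs.append(base
                             + i * strides[0]
                             + j * strides[1]
                             + k * strides[2])
    return addrs

# Example: walk a 2 x 2 x 4 pattern of 32-byte (256-bit) vectors.
print(addresses_3d(0, (256, 128, 32), (2, 2, 4))[:4])  # [0, 32, 64, 96]
```

Expressing the pattern this way shows why a single load instruction can stream through a tiled tensor layout without per-element address computation in the scalar unit.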
  • Vector fixed-point/integer unit
    • Supports FFT processing and sparsity for ML inference applications, including cint32 x cint16 multiplication (data in cint32 and twiddle factors in cint16), control support for complex operations and conjugation, a new permute mode, and a shuffle mode.
    • Accommodates multiple precisions for complex and real operands (see Table 1).

      Table 1. Supported Precision Width of the Vector Data Path

      Precision 1   | Precision 2 | Number of Accumulator Lanes | Bits per Accumulator Lane | Number of MACs
      int8          | int4        | 32                          | 32                        | 512
      int8          | int8        | 32                          | 32                        | 256
      int16         | int8        | 32                          | 32                        | 128
      int16         | int8        | 16                          | 64                        | 128
      int16         | int16       | 32                          | 32                        | 64
      int16         | int16       | 16                          | 64                        | 64
      int32 (1)     | int16       | 16                          | 64                        | 32
      cint16        | cint16      | 8                           | 64                        | 16
      cint32        | cint16      | 8                           | 64                        | 8
      bfloat16 (3)  | bfloat16    | 16                          | SPFP 32 (2)               | 128

      1. int32 x int32 can be emulated at half the performance of int32 x int16, that is, 16 multiplications per cycle.
      2. Single-precision floating point (SPFP) per the IEEE standard.
      3. float32 x float32 can be emulated. The emulation deviates from the IEEE-754 standard. See Answer Record 34376 for more information.
    • Each of the two multipliers can be signed or unsigned. The accumulator is always signed.
    • The accumulation can be performed in two operation modes, with either 32 lanes of 32 bits or 16 lanes of 64 bits.
    • The total number of multipliers and the number of accumulation lanes determine the depth of the post-adding.
    • In terms of component use, consider the first row in Table 1. Depending on whether or not sparsity is used, the multiplier inputs can be 1024 x 512 or 512 x 512 bits. The number of int8 multipliers is 256. The accumulation is on 32 lanes of 32 bits.
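The relationship between the table columns can be checked with plain arithmetic: the post-adding depth per accumulator lane is the number of multipliers divided by the number of accumulator lanes. The sketch below is a simple consistency check over a few Table 1 rows, not a model of the hardware:

```python
# A subset of Table 1 rows:
# (precision 1, precision 2, accumulator lanes, bits per lane, MACs)
rows = [
    ("int8",   "int4",   32, 32, 512),
    ("int8",   "int8",   32, 32, 256),
    ("int16",  "int8",   32, 32, 128),
    ("int16",  "int16",  32, 32, 64),
    ("cint16", "cint16",  8, 64, 16),
]

for p1, p2, lanes, bits, macs in rows:
    depth = macs // lanes  # multiplier results post-added into each lane
    print(f"{p1:>6} x {p2:<6}: {lanes} lanes, post-add depth {depth}")
```

For example, the int8 x int8 configuration post-adds 256 products into 32 lanes, a depth of 8 per lane, while the int8 x int4 configuration doubles that depth to 16.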
  • Single-precision floating-point (SPFP) vector unit:
    • Supports 128 bfloat16 MAC operations with FP32 accumulation by reusing the integer multipliers and post-adders, along with additional blocks for floating-point exponent computation and mantissa shifting and normalization.
    • Concurrent operation of multiple vector lanes.
    • Supports multiplying bfloat16 numbers (16-bit vector lanes) and accumulating in SPFP (32-bit register lanes). Only 16 accumulator lanes are used in this mode.
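The bfloat16 format is simply the top 16 bits of an IEEE float32 (same sign and 8-bit exponent, 7-bit mantissa), which is what makes FP32 accumulation of bfloat16 products natural. The Python sketch below illustrates the data format and the multiply-in-bfloat16/accumulate-wider pattern; the truncation rounding shown is an assumption for simplicity, not a statement about the hardware rounding mode:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a float to bfloat16 precision by truncating the low
    16 mantissa bits of its float32 representation.
    Illustrates the data format only, not the hardware rounding mode."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def bf16_dot(a, b):
    """Multiply bfloat16-precision inputs, accumulate at wider precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_bfloat16(x) * to_bfloat16(y)
    return acc

print(to_bfloat16(1.0))  # 1.0 (exactly representable)
print(to_bfloat16(0.1))  # slightly below 0.1: low mantissa bits dropped
```

Accumulating at the wider precision keeps the small per-product rounding errors from compounding across a long dot product, which is why the mode dedicates 32-bit register lanes to the accumulator.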
  • Balanced pipeline:
    • A different pipeline depth for each functional unit (eight stages maximum)
    • Load and store units manage the 5-cycle latency of data memory.
  • Three data memory ports:
    • Two load ports and one store port
    • Each port operates in 256-bit/128-bit vector register mode. Scalar accesses (32-bit/16-bit/8-bit) are supported by only one load port and one store port. The 8-bit and 16-bit stores are implemented as read-modify-write instructions.
    • Concurrent operation of all three ports
    • A bank conflict on any port stalls the entire data path
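Because the memory ports are vector-width, the 8-bit and 16-bit stores mentioned above are implemented as read-modify-write: the surrounding word is read, the narrow value is merged in, and the word is written back. A minimal Python sketch of the merge step (little-endian byte numbering assumed for illustration):

```python
def rmw_store8(mem_word: int, byte_offset: int, value: int) -> int:
    """Model an 8-bit store into a wider memory word as read-modify-write:
    read the word, merge the byte at byte_offset, return the word to write
    back. Little-endian byte numbering; illustrative only."""
    shift = 8 * byte_offset
    mask = 0xFF << shift
    return (mem_word & ~mask) | ((value & 0xFF) << shift)

word = 0
word = rmw_store8(word, 0, 0xAB)  # byte 0
word = rmw_store8(word, 3, 0xCD)  # byte 3, bytes 1-2 untouched
print(hex(word))  # 0xcd0000ab
```

The read-modify-write behavior also explains why a narrow store occupies both a load-side read and the store port's write rather than being a single cheap access.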
  • Very-long instruction word (VLIW) function:
    • Concurrent issuing of operation to all functional units
    • Support for multiple instruction formats and variable length instructions
    • Up to six operations can be issued in parallel using one VLIW word
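A VLIW word can be thought of as a bundle of operations, each bound to a distinct functional-unit slot. The toy model below illustrates the issue constraint; the slot names are invented for the example and do not correspond to the documented instruction formats:

```python
# Illustrative slot set only; the real instruction formats are
# variable-length and defined by the ISA, not modeled here.
SLOTS = {"load0", "load1", "store", "scalar", "vector", "move"}

def valid_bundle(ops):
    """A bundle is valid if at most six operations are issued and each
    operation targets a distinct functional-unit slot."""
    slots = [slot for slot, _ in ops]
    return (len(ops) <= 6
            and len(set(slots)) == len(slots)
            and all(s in SLOTS for s in slots))

print(valid_bundle([("load0", "vld a"), ("load1", "vld b"),
                    ("store", "vst c"), ("vector", "vmac")]))  # True
print(valid_bundle([("load0", "vld a"), ("load0", "vld b")]))  # False
```

The distinct-slot rule is what lets the two loads, the store, and a vector MAC proceed concurrently in a single cycle.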
  • Direct stream interface:
    • One input stream and one output stream
    • Each stream is 32-bits wide
    • 512-bit cascade stream in and stream out, vertical in addition to horizontal
  • Interface to the following modules:
    • Lock module
    • Stall module
    • Debug and trace module
  • Event interface is a 16-bit wide output interface from the AIE-ML.
  • Processor bus interface:
    • Allows the AIE-ML to perform direct read/write access to the local tile's memory-mapped registers

Relative to the AI Engine, the AIE-ML removes some advanced DSP functionality, including:

  • 32-bit floating-point vector data path: not directly supported, but can be emulated via decomposition into multiple 16 x 16-bit multiplications
  • Scalar floating point/integer conversions
  • Complex circular addressing and FFT addressing modes (some level of FFT and complex support is still provided)
  • Limited support for 128-bit load/store
  • Non-aligned memory addresses
  • Non-blocking stream access
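The decomposition idea behind the emulation paths above (the 16 x 16-bit decomposition of a 32-bit multiply, and the emulated int32 x int32 of Table 1, footnote 1) can be shown with plain integer arithmetic. The sketch below covers the unsigned case for simplicity; the signed case needs additional sign handling and is omitted:

```python
def mul32_via_16(a: int, b: int) -> int:
    """Compute a 32 x 32-bit unsigned multiply from four 16 x 16-bit
    partial products -- the decomposition underlying the emulation
    schemes described above (unsigned case only)."""
    a_lo, a_hi = a & 0xFFFF, a >> 16
    b_lo, b_hi = b & 0xFFFF, b >> 16
    return ((a_hi * b_hi << 32)                    # high x high
            + ((a_hi * b_lo + a_lo * b_hi) << 16)  # cross terms
            + a_lo * b_lo)                         # low x low

x, y = 0x12345678, 0x9ABCDEF0
assert mul32_via_16(x, y) == x * y
print(hex(mul32_via_16(x, y)))
```

Each 32-bit product costs four 16-bit multiplies plus shifts and adds, which is why the emulated modes run at a fraction of the native 16-bit throughput.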