Functional Overview

Versal ACAP AIE-ML Architecture Manual (AM020)

Version: 1.0 English

The AIE-ML is a highly optimized single-instruction multiple-data (SIMD) and very-long instruction word (VLIW) processor that supports both fixed-point and floating-point precision. As shown in the following figure, the AIE-ML has a memory interface, a scalar unit, a vector unit, two load units, one store unit, and an instruction fetch and decode unit.

Figure 1. AIE-ML

The features of the AIE-ML include:

  • Instruction-based VLIW SIMD processor
  • 32-bit scalar RISC processor
    • Scalar register files and special registers
    • 32 x 32-bit multiplier (signed and unsigned)
    • 32-bit add/subtract
    • ALU operations like shifts, compares, and logical operations
    • No floating-point unit; floating-point operations are supported through emulation
  • Three address generator units (AGUs)
    • Two 256-bit load and one 256-bit store units with aligned addresses
    • Supports 2D/3D addressing modes for ML functionality
    • On-the-fly decompression during loading of sparse weights
    • One AGU dedicated for the store unit
  • Vector fixed-point/integer unit
    • Supports FFT processing and sparsity for ML inference applications, including cint32 x cint16 multiplication (data in cint32 and twiddle factor in cint16), control support for complex and conjugation, new permute mode, and shuffle mode.
    • Accommodates multiple precisions for complex and real operands (see Table 1).
      Table 1. Supported Precision Width of the Vector Data Path

      Precision 1   Precision 2  Number of Accumulator Lanes  Bits per Accumulator Lane  Number of MACs
      int8          int4         32                           32                         512
      int8          int8         32                           32                         256
      int16         int8         32                           32                         128
      int16         int8         16                           64                         128
      int16         int16        32                           32                         64
      int16         int16        16                           64                         64
      int32 (1)     int16        16                           64                         32
      cint16        cint16       8                            64                         16
      cint32        cint16       8                            64                         8
      bfloat16 (3)  bfloat16     16                           SPFP 32 (2)                128

      Notes:
      1. int32 x int32 can be emulated. The operation should have half the performance of int32 x int16 and there should be 16 multiplications per cycle.
      2. Single precision floating point (SPFP) per the IEEE standard.
      3. float32 x float32 can be emulated. Emulation deviates from the IEEE-754 standard. See Answer Record 34376 for more information.
    • Each of the two multipliers can be signed or unsigned. The accumulator is always signed.
    • The accumulation can be performed in two operation modes, with either 32 lanes of 32 bits or 16 lanes of 64 bits.
    • The total number of multipliers and the number of accumulation lanes determine the depth of the post-adding.
    • In terms of component use, consider the first row in Table 1. Depending on whether or not sparsity is used, the multiplier inputs can be 1024 x 512 or 512 x 512 bits. The number of int8 multipliers is 256. The accumulation is on 32 lanes of 32 bits.
  • Single-precision floating-point (SPFP) vector unit:
    • Supports 128 bfloat16 MAC operations with FP32 accumulation by reusing the integer multipliers and post-adders, along with additional blocks for floating-point exponent computation, mantissa shifting, and normalization.
    • Concurrent operation of multiple vector lanes.
    • Supports multiplying bfloat16 numbers (16-bit vector lanes) and accumulating in SPFP (32-bit register lanes). Only 16 accumulator lanes are used in this mode.
  • Balanced pipeline:
    • Different pipeline on each functional unit (eight stages maximum).
    • Load and store units manage the 5-cycle latency of data memory.
  • Three data memory ports:
    • Two load ports and one store port
    • Each port operates in 256-bit/128-bit vector register mode. Scalar accesses (32-bit/16-bit/8-bit) are supported by only one load port and one store port. The 8-bit and 16-bit stores are implemented as read-modify-write instructions.
    • Concurrent operation of all three ports
    • A bank conflict on any port stalls the entire data path
  • Very-long instruction word (VLIW) function:
    • Concurrent issuing of operations to all functional units
    • Support for multiple instruction formats and variable length instructions
    • Up to six operations can be issued in parallel using one VLIW word
  • Direct stream interface:
    • One input stream and one output stream
    • Each stream is 32 bits wide
    • Vertical, in addition to horizontal, cascade stream in and stream out, each 512 bits wide
  • Interface to the following modules:
    • Lock module
    • Stall module
    • Debug and trace module
  • Event interface is a 16-bit wide output interface from the AIE-ML.
  • Processor bus interface:
    • Allows the AIE-ML processor to perform direct read/write access to the memory-mapped registers of the local tile.
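
The statement that the number of multipliers and the number of accumulation lanes determine the post-adding depth can be illustrated with the rows of Table 1. The following Python sketch is not taken from the manual: the names `TABLE_1`, `products_per_lane`, and `accumulator_bits` are hypothetical, and reading the post-adding depth as "MACs divided by accumulator lanes" is an assumption made for illustration.

```python
# Rows of Table 1: (precision 1, precision 2, accumulator lanes,
# bits per accumulator lane, MACs per cycle). The bfloat16 row
# accumulates in 32-bit single-precision floating point.
TABLE_1 = [
    ("int8", "int4", 32, 32, 512),
    ("int8", "int8", 32, 32, 256),
    ("int16", "int8", 32, 32, 128),
    ("int16", "int8", 16, 64, 128),
    ("int16", "int16", 32, 32, 64),
    ("int16", "int16", 16, 64, 64),
    ("int32", "int16", 16, 64, 32),
    ("cint16", "cint16", 8, 64, 16),
    ("cint32", "cint16", 8, 64, 8),
    ("bfloat16", "bfloat16", 16, 32, 128),
]


def products_per_lane(lanes: int, macs: int) -> int:
    """Products summed into one accumulator lane per cycle.

    Assumption: the post-adding depth is the MAC count divided
    evenly across the accumulator lanes.
    """
    if macs % lanes != 0:
        raise ValueError("MACs do not divide evenly across lanes")
    return macs // lanes


def accumulator_bits(lanes: int, bits_per_lane: int) -> int:
    """Total accumulator bits written by one operation."""
    return lanes * bits_per_lane


if __name__ == "__main__":
    for p1, p2, lanes, bits, macs in TABLE_1:
        print(f"{p1} x {p2}: {products_per_lane(lanes, macs)} products per lane, "
              f"{accumulator_bits(lanes, bits)} accumulator bits")
```

Under this reading, the int8 x int8 row sums eight products into each of its 32 lanes, while the cint32 x cint16 row produces one complex product per lane.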

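The bfloat16 mode described above (16-bit inputs, 32-bit single-precision accumulation) can be modeled numerically in plain Python. This is a rough software sketch, not a description of the hardware: the helper names are hypothetical, round-to-nearest-even is an assumed rounding mode, NaN and out-of-range inputs are not handled, and the sequential accumulation with rounding after every step is a simplification of whatever the post-adder actually does.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bfloat16(x: float) -> float:
    """Round a finite value to bfloat16 and return it as a Python float.

    bfloat16 keeps the upper 16 bits of the float32 encoding. Assumed
    rounding: round-to-nearest-even on the discarded lower 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_mac_lane(acc: float, a_values, b_values) -> float:
    """Accumulate bfloat16 products into one float32 accumulator lane."""
    for a, b in zip(a_values, b_values):
        # The product of two bfloat16 values is exact in a double.
        product = to_bfloat16(a) * to_bfloat16(b)
        acc = to_float32(acc + product)
    return acc


if __name__ == "__main__":
    print(to_bfloat16(3.14159))
    print(bf16_mac_lane(0.0, [1.5, 2.0], [2.0, 0.5]))
```

A model like this is mainly useful for estimating how much precision is lost when inputs are quantized to bfloat16 while the running sum stays in single precision.
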
The AIE-ML removes some advanced DSP functionality used in the AI Engine, including:

  • The 32-bit floating-point vector data path is not directly supported, but it can be emulated by decomposition into multiple 16 x 16-bit multiplications.
  • Scalar floating point/integer conversions
  • Complex circular addressing and FFT addressing modes. Some level of FFT and complex support is still provided.
  • Limited support for 128-bit load/store
  • Non-aligned memory address
  • Non-blocking stream access
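
Emulation by decomposition, mentioned above for the 32-bit floating-point path and in note 1 of Table 1 for int32 x int32, can be illustrated with the integer case. The manual does not spell out the decomposition, so the following Python sketch is an assumption: it splits one operand into a signed high half and an unsigned low half and uses two int32 x int16 products, which would be consistent with the stated half performance relative to int32 x int16. The function names are hypothetical.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def split_int32(b: int):
    """Split a signed 32-bit value so that b == hi * 65536 + lo.

    hi is a signed 16-bit value and lo is an unsigned 16-bit value,
    relying on multiplier operands being signed or unsigned.
    """
    if not INT32_MIN <= b <= INT32_MAX:
        raise ValueError("operand is not a signed 32-bit integer")
    lo = b & 0xFFFF   # unsigned low half, 0..65535
    hi = b >> 16      # arithmetic shift, -32768..32767
    return hi, lo


def mul_int32_emulated(a: int, b: int) -> int:
    """Multiply two signed 32-bit integers using two 32 x 16 products."""
    if not INT32_MIN <= a <= INT32_MAX:
        raise ValueError("operand is not a signed 32-bit integer")
    hi, lo = split_int32(b)
    high_part = a * hi   # int32 x signed int16
    low_part = a * lo    # int32 x unsigned int16
    # The result fits in a 64-bit accumulator lane.
    return (high_part << 16) + low_part


if __name__ == "__main__":
    print(mul_int32_emulated(123456789, -987654321) == 123456789 * -987654321)
```

The same split-and-recombine pattern extends to mantissa products in floating-point emulation, although exponent handling and rounding add further steps that are not modeled here.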