Functional Overview

Versal Adaptive SoC AI Engine Architecture Manual (AM009)

Document ID: AM009
Release Date: 2023-08-18
Revision: 1.3 English

The AI Engine is a highly optimized single-instruction multiple-data (SIMD) and very-long instruction word (VLIW) processor that supports both fixed-point and floating-point precision. As shown in the following figure, the AI Engine has a memory interface, a scalar unit, a vector unit, two load units, one store unit, and an instruction fetch and decode unit.

Figure 1. AI Engine

The features of the AI Engine include:

  • 32-bit scalar RISC processor
    • General purpose pointer and configuration register files
    • Supports non-linear functions (for example: sqrt, Sin/Cos, and InvSqrt)
    • A scalar ALU, including a 32 x 32-bit scalar multiplier
    • Supports conversion of the data type between scalar fixed point and scalar floating point
  • Three address generator units (AGU)
    • Support for multiple addressing modes: Fixed, indirect, post-incremental, or cyclic
    • Supports Fast Fourier Transform (FFT) address generation
    • Two AGUs dedicated to the two load units
    • One AGU dedicated to the store unit
  • Vector fixed-point/integer unit
    • Concurrent operations on multiple vector lanes
    • Accommodates multiple precisions for complex and real operands (see Table 1).
      Note: While cfloat is a vector data type, it is not directly supported by the AI Engine vector processor. Two instructions must be issued.
      Table 1. Supported Precision Width of the Vector Data Path
      X Operand (bits)   Z Operand (bits)   Output (bits)   Number of MACs
      8 real             8 real             48 real         128
      16 real            8 real             48 real         64
      16 real            16 real            48 real         32
      16 real            16 complex         48 complex      16
      16 complex         16 real            48 complex      16
      16 complex         16 complex         48 complex      8
      16 real            32 real            48/80 real      16
      16 real            32 complex         48/80 complex   8
      16 complex         32 real            48/80 complex   8
      16 complex         32 complex         48/80 complex   4
      32 real            16 real            48/80 real      16
      32 real            16 complex         48/80 complex   8
      32 complex         16 real            48/80 complex   8
      32 complex         16 complex         48/80 complex   4
      32 real            32 real            80 real         8
      32 real            32 complex         80 complex      4
      32 complex         32 real            80 complex      4
      32 complex         32 complex         80 complex      2
      32 SPFP            32 SPFP            32 SPFP         8
    • Can be configured to perform eight complex 16-bit multiplications
    • Full permute unit with 32-bit granularity
    • Shift, round, and saturate with multiple rounding and saturation modes
    • Two-step post-adding with 768-bit intermediate results
    • The X operand is 1024 bits wide and the Z operand is 256 bits wide. As an example of component use, consider the first row in Table 1: the multiplier operands come from the same 1024-bit and 256-bit input registers, but some values are broadcast to multiple multipliers. There are 128 single 8-bit multipliers, and their results are post-added and accumulated into 16 or 8 accumulator lanes of 48 bits each.
  • Single-precision floating-point (SPFP) vector unit
    • Uses the same permute unit as the fixed-point vector unit
    • Concurrent operation of multiple vector lanes
    • Eight single-precision multiplier–accumulators (MACs) per cycle
  • Balanced pipeline
    • Different pipeline on each functional unit (eight stages maximum)
    • Load and store units manage the 5-cycle latency of data memory
  • Three data memory ports
    • Two load ports and one store port
    • Each port operates in 256-bit/128-bit vector register mode or 32-bit/16-bit/8-bit scalar register mode. The 8-bit and 16-bit stores are implemented as read-modify-write instructions.
    • Concurrent operation of all three ports
    • A bank conflict on any port stalls the entire data path
  • Very-long instruction word (VLIW) function
    • Concurrent issuing of operations to all functional units
    • Support for multiple instruction formats and variable length instructions
    • Up to seven operations can be issued in parallel using one VLIW word
  • Direct stream interface
    • Two input streams and two output streams
    • Each stream can be configured to be either 32-bit or 128-bit wide
    • One cascade stream in, one cascade stream out (384-bit)
  • Interface to the following modules
    • Lock module
    • Stall module
    • Debug and trace module
  • The event interface is a 16-bit wide output interface from the AI Engine
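
The MAC counts in Table 1 can be cross-checked against the 128 8-bit multipliers mentioned for the vector fixed-point unit. The following Python sketch is an illustrative consistency check only, not a description of the hardware implementation. It assumes that a W-bit by V-bit real multiply decomposes into (W/8) x (V/8) 8-bit partial products, and that a complex operand doubles the number of real multiplies per MAC (schoolbook complex multiplication). The names TABLE_1, real_multiplies_per_mac, equivalent_8bit_multipliers, and check_table are hypothetical and do not come from any AMD tool or API.

```python
# Fixed-point rows of Table 1 as:
# (x_bits, x_is_complex, z_bits, z_is_complex, number_of_macs)
TABLE_1 = [
    (8, False, 8, False, 128),
    (16, False, 8, False, 64),
    (16, False, 16, False, 32),
    (16, False, 16, True, 16),
    (16, True, 16, False, 16),
    (16, True, 16, True, 8),
    (16, False, 32, False, 16),
    (16, False, 32, True, 8),
    (16, True, 32, False, 8),
    (16, True, 32, True, 4),
    (32, False, 16, False, 16),
    (32, False, 16, True, 8),
    (32, True, 16, False, 8),
    (32, True, 16, True, 4),
    (32, False, 32, False, 8),
    (32, False, 32, True, 4),
    (32, True, 32, False, 4),
    (32, True, 32, True, 2),
]


def real_multiplies_per_mac(x_is_complex, z_is_complex):
    """Real multiplies in one MAC, assuming schoolbook complex arithmetic:
    real*real = 1, real*complex = 2, complex*complex = 4."""
    return (2 if x_is_complex else 1) * (2 if z_is_complex else 1)


def equivalent_8bit_multipliers(x_bits, x_is_complex, z_bits, z_is_complex, macs):
    """8-bit x 8-bit multiplier equivalents consumed by one table row.

    Assumption: a wider real multiply is built from 8-bit partial products.
    """
    partial_products = (x_bits // 8) * (z_bits // 8)
    return macs * real_multiplies_per_mac(x_is_complex, z_is_complex) * partial_products


def check_table():
    """Return the 8-bit multiplier equivalent for every fixed-point row."""
    return [equivalent_8bit_multipliers(*row) for row in TABLE_1]


if __name__ == "__main__":
    for row, total in zip(TABLE_1, check_table()):
        print(row, "->", total)
```

Under these assumptions, every fixed-point row works out to the same budget of 128 8-bit multiplier equivalents, which is one way to read why the MAC count halves each time an operand width doubles or an operand becomes complex. The SPFP row is left out because it uses the floating-point vector unit. As a separate arithmetic observation, 16 accumulator lanes of 48 bits total 768 bits, the same figure quoted for the intermediate results.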