AI Engine Tile Architecture - 2022.1 English

Versal ACAP AI Engine Programming Environment User Guide (UG1076)

Document ID: UG1076
Release Date: 2022-05-25
Version: 2022.1 English

The AI Engine array consists of a 2D array of AI Engine tiles, where each AI Engine tile contains an AI Engine, a memory module, and a tile interconnect module. An overview of such an AI Engine tile is shown in the following figure.

AI Engine
Each AI Engine is a very long instruction word (VLIW) processor containing a scalar unit, a vector unit, two load units, and a single store unit.
AI Engine Tile
An AI Engine tile contains an AI Engine and a local memory module, together with several communication paths that facilitate data exchange between tiles.
AI Engine Array
AI Engine array refers to the complete 2D array of AI Engine tiles.
AI Engine Program
The AI Engine program consists of a data flow graph specification written in C/C++. This program is compiled and executed using the AI Engine tool chain.
AI Engine Kernels
Kernels are written in C/C++ using AI Engine vector data types and intrinsic functions. These are the computation functions running on an AI Engine. The kernels form the fundamental building blocks of a data flow graph specification.
Figure 1. AI Engine Tile Block Diagram

The following illustration shows the architecture of a single AI Engine.

Figure 2. AI Engine

Each AI Engine is a very long instruction word (VLIW) processor containing a scalar unit, a vector unit, two load units, and one store unit. The main compute power is provided by the vector unit. The vector unit contains a fixed-point unit with 128 8-bit fixed-point multipliers and a floating-point unit with eight single-precision floating-point multipliers. The vector registers and permute network are shared between the floating-point and fixed-point vector units. The peak performance depends on the size of the data types used by the operands. The following table provides the number of MAC operations that can be performed by the vector processor per instruction.

Table 1. Supported Precision Bit Width of the Vector Datapath
X Operand  | Z Operand  | Output        | Number of MACs
8 real     | 8 real     | 48 real       | 128
16 real    | 8 real     | 48 real       | 64
16 real    | 16 real    | 48 real       | 32
16 real    | 16 complex | 48 complex    | 16
16 complex | 16 real    | 48 complex    | 16
16 complex | 16 complex | 48 complex    | 8
16 real    | 32 real    | 48/80 real    | 16
16 real    | 32 complex | 48/80 complex | 8
16 complex | 32 real    | 48/80 complex | 8
16 complex | 32 complex | 48/80 complex | 4
32 real    | 16 real    | 48/80 real    | 16
32 real    | 16 complex | 48/80 complex | 8
32 complex | 16 real    | 48/80 complex | 8
32 complex | 16 complex | 48/80 complex | 4
32 real    | 32 real    | 80 real       | 8
32 real    | 32 complex | 80 complex    | 4
32 complex | 32 real    | 80 complex    | 4
32 complex | 32 complex | 80 complex    | 2
32 SPFP    | 32 SPFP    | 32 SPFP       | 8

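The fixed-point rows of Table 1 follow a simple pattern: the MAC count equals 8192 divided by the product of the two operand bit widths, halved once for each complex operand. The following sketch encodes that observation; the helper name `macs_per_instruction` is illustrative, not part of any Xilinx API.

```cpp
// Illustrative model (not a Xilinx API): reproduces the
// "Number of MACs" column of Table 1 for the fixed-point datapath.
// Pattern observed in the table: 8192 / (x_bits * z_bits),
// halved once per complex operand.
int macs_per_instruction(int x_bits, bool x_complex,
                         int z_bits, bool z_complex) {
    int macs = 8192 / (x_bits * z_bits);
    if (x_complex) macs /= 2;
    if (z_complex) macs /= 2;
    return macs;
}
```

For example, 8-bit real by 8-bit real gives 8192 / 64 = 128 MACs, and 32-bit complex by 32-bit complex gives 8192 / 1024 / 4 = 2 MACs, matching the first and last fixed-point rows. The SPFP row is a fixed 8 and does not follow this pattern.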
To calculate the maximum performance for a given datapath, multiply the number of MACs per instruction by the clock frequency of the AI Engine kernel. For example, with 16-bit real input vectors X and Z, the vector processor can achieve 32 MACs per instruction. Using the 1 GHz clock frequency of the slowest speed grade results in:

32 MACs * 1 GHz clock frequency = 32 Giga MAC operations/second

In most cases, 32 MACs/instruction remains a theoretical upper bound because the algorithm to be implemented cannot continuously use the full capabilities of the AI Engine or might be constrained by I/O bandwidth.
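The peak-throughput calculation above can be written out directly; the function name is illustrative, and the 1 GHz clock is the slowest-speed-grade figure quoted in the text.

```cpp
// Theoretical peak MAC throughput in GMAC/s, assuming one vector
// instruction issues every cycle (the upper bound discussed above;
// real kernels are often limited by the algorithm or I/O bandwidth).
double peak_gmacs(int macs_per_instr, double clock_ghz) {
    return macs_per_instr * clock_ghz;
}
```

With the Table 1 values, `peak_gmacs(32, 1.0)` reproduces the 32 GMAC/s of the 16-bit example, and `peak_gmacs(128, 1.0)` gives the 128 GMAC/s upper bound for 8-bit real operands.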

The main I/O interfaces for reading and writing AI Engine compute data are the data memory interfaces, the stream interfaces, and the cascade stream interfaces. A complete list of interfaces, including the program memory interface and the debug interface, is available in Versal ACAP AI Engine Architecture Manual (AM009).

  • The data memory interface sees one contiguous memory, consisting of the data memory modules in all four directions, with a total capacity of 128 KB. The AI Engine has two 256-bit wide load units and one 256-bit wide store unit.
  • The AI Engine has two 32-bit input AXI4-Stream interfaces and two 32-bit output AXI4-Stream interfaces. Each of these streams allows the AI Engine either a 128-bit access every four clock cycles or a 32-bit access every cycle.
  • The 384-bit accumulator data from one AI Engine can be forwarded to the neighboring AI Engine through the cascade stream interfaces to form a chain. The cascade stream interface is unidirectional, and its direction depends on the row in which the AI Engine is located. A small, two-deep, 384-bit wide FIFO on each of the input and output streams allows storing up to four values between AI Engines. Each cycle, 384 bits can be sent and received by the chained AI Engines.
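Assuming the same 1 GHz clock used in the MAC example, the per-cycle widths listed above translate into raw bandwidths as follows. This is a back-of-the-envelope sketch; the function name is illustrative.

```cpp
// Raw bandwidth in GB/s for an interface that moves `bits_per_cycle`
// bits every cycle at `clock_ghz` GHz: (bits/8) bytes per cycle.
double raw_gbytes_per_s(int bits_per_cycle, double clock_ghz) {
    return bits_per_cycle / 8.0 * clock_ghz;
}
// At an assumed 1 GHz clock:
//   one 256-bit load unit   -> raw_gbytes_per_s(256, 1.0) = 32 GB/s
//                              (64 GB/s total for two load units)
//   one 32-bit stream       -> raw_gbytes_per_s(32, 1.0)  =  4 GB/s
//   384-bit cascade stream  -> raw_gbytes_per_s(384, 1.0) = 48 GB/s
```

These are peak figures for the interfaces themselves; sustained rates depend on the access pattern and any contention in the interconnect.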

The program memory on the AI Engine is 16 KB, which allows storing 1,024 instructions of 128 bits each. AI Engine instructions support multiple instruction formats and variable-length encodings to reduce program memory usage; many instructions outside of the optimized inner loop can use the shorter formats.
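The 1,024-instruction capacity follows directly from the sizes given: 16 KB of program memory divided by 128-bit (16-byte) instruction slots.

```cpp
// 16 KB of program memory divided by 128-bit (16-byte) full-size
// instruction slots yields the 1,024-instruction capacity above.
constexpr int program_memory_bytes = 16 * 1024;
constexpr int instruction_bytes    = 128 / 8;
constexpr int max_instructions     = program_memory_bytes / instruction_bytes;
static_assert(max_instructions == 1024, "16 KB / 16 B = 1024 slots");
```

Because shorter variable-length formats exist, a real program can fit more than 1,024 instructions in practice; this figure is the bound for full-width 128-bit instructions.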