AI Engine Architecture Overview - 2021.2 English

Versal ACAP AI Engine Programming Environment User Guide (UG1076)

Document ID: UG1076
Locale: English (United States)
Release Date: 2021-12-17
Version: 2021.2 English

Programming the AI Engine array requires a thorough understanding of the algorithm to be implemented, the capabilities of the AI Engines, and the overall data flow between individual functional units. The AI Engine array supports three levels of parallelism:

SIMD: Through vector registers that allow multiple data elements to be computed in parallel.
Instruction level: Through the VLIW architecture, which allows multiple instructions to be executed in a single clock cycle.
Multicore: Through the AI Engine array, in which up to 400 AI Engines can execute in parallel.

While most standard C code can be compiled for the AI Engine, it might need substantial restructuring to achieve optimal performance on the AI Engine array. The power of an AI Engine lies in its ability, in each clock cycle, to execute a vector MAC operation, load two 256-bit vectors for the next operation, store a 256-bit vector from the previous operation, and increment a pointer or execute another scalar operation. The AI Engine compiler does not perform any automatic or pragma-based vectorization; to achieve optimal performance, the code must be rewritten to use SIMD intrinsic data types (for example, v8int32) and vector intrinsic functions (for example, mac(…)), executed within a pipelined loop. The 32-bit scalar RISC processor provides an ALU, some non-linear functions, and data type conversions. Each AI Engine has access to only a limited amount of local memory, so large data sets must be partitioned.
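The restructuring described above can be illustrated with a host-side sketch in standard C++. This is not AI Engine intrinsic code: the real v8int32 type and mac(…) intrinsic are modeled here with a plain 8-element array and a helper function, purely to show how a scalar loop is reshaped so that each iteration performs one 8-lane multiply-accumulate on freshly loaded vectors.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of an 8-lane vector MAC. On the AI Engine, the lanes
// would live in a v8int32 register and this whole body would be a single
// mac(...) intrinsic issued once per clock cycle in a pipelined loop.
using Vec8 = std::array<int32_t, 8>;

// Multiply-accumulate one 8-element chunk: acc[i] += a[i] * b[i].
static Vec8 mac8(Vec8 acc, const Vec8& a, const Vec8& b) {
    for (int i = 0; i < 8; ++i) acc[i] += a[i] * b[i];
    return acc;
}

// Dot product restructured to consume 8 lanes per iteration.
// Assumes the length is a multiple of 8, mirroring the alignment and
// size restrictions typical of vectorized AI Engine kernels.
int64_t dot8(const std::vector<int32_t>& a, const std::vector<int32_t>& b) {
    Vec8 acc{};                        // 8 independent accumulator lanes
    for (std::size_t i = 0; i < a.size(); i += 8) {
        Vec8 va, vb;                   // models two 256-bit vector loads
        for (int l = 0; l < 8; ++l) { va[l] = a[i + l]; vb[l] = b[i + l]; }
        acc = mac8(acc, va, vb);       // models one vector MAC per iteration
    }
    int64_t sum = 0;                   // final horizontal reduction
    for (int32_t v : acc) sum += v;
    return sum;
}
```

The key point is structural: the loop body does a fixed amount of vector work per trip, with no data-dependent branches, which is what allows the AI Engine compiler to software-pipeline it.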

AI Engine kernels are functions that run on an AI Engine and form the fundamental building blocks of a data flow graph specification. The data flow graph is a Kahn process network with deterministic behavior that does not depend on the various computational or communication delays. AI Engine kernels are declared as void C/C++ functions whose window or stream arguments define the graph connectivity. Kernels can also have static data and run-time parameter arguments, which can be either asynchronous or triggering. Each kernel should be defined in its own source file.
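The kernel shape described above can be sketched as follows. In real AI Engine code the window arguments would be ADF types such as input_window_int32 and output_window_int32 from the AI Engine headers, accessed with the window_readincr/window_writeincr APIs; the stand-in structs here exist only so the sketch runs on a host compiler, and scale_kernel itself is a hypothetical example kernel, not one from UG1076.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal host-side stand-ins for ADF window types. They mimic the
// incrementing read/write access pattern of window_readincr and
// window_writeincr on the actual AI Engine.
struct InputWindow {
    const std::vector<int32_t>* buf;
    std::size_t pos = 0;
    int32_t readincr() { return (*buf)[pos++]; }   // read, advance pointer
};
struct OutputWindow {
    std::vector<int32_t>* buf;
    void writeincr(int32_t v) { buf->push_back(v); } // write, advance pointer
};

// A kernel in the style the text describes: a void function whose window
// arguments define its graph connectivity. 'gain' plays the role of a
// run-time parameter argument.
void scale_kernel(InputWindow* in, OutputWindow* out, int32_t gain) {
    constexpr std::size_t WINDOW_SIZE = 8;  // samples consumed per invocation
    for (std::size_t i = 0; i < WINDOW_SIZE; ++i)
        out->writeincr(in->readincr() * gain);
}
```

Because the kernel communicates only through its window arguments, the ADF graph can schedule it deterministically: each invocation consumes one full input window and produces one full output window, independent of timing.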

Achieving overall system performance requires additional reading and hands-on experience with the architecture, application partitioning, AI Engine data flow graph generation, and the optimization of data flow connectivity. The Versal ACAP AI Engine Architecture Manual (AM009) contains more detailed information.

Xilinx provides DSP and communications libraries with optimized code for the AI Engine that should be used whenever possible. The supplied source code is also a great resource for learning about AI Engine kernel coding.