Overview

AI Engine Kernel Coding Best Practices Guide (UG1079)

Document ID: UG1079
Release Date: 2021-11-10
Version: 2021.2 English

The Versal® AI Core series delivers breakthrough artificial intelligence (AI) inference acceleration with AI Engines that provide over 100x greater compute performance than current server-class CPUs. This series is designed for a breadth of applications, including cloud for dynamic workloads and network for massive bandwidth, all while delivering advanced safety and security features. AI and data scientists, as well as software and hardware developers, can all take advantage of the high compute density to accelerate the performance of any application. Given the AI Engine's advanced signal processing compute capability, it is well-suited for highly optimized wireless applications such as radio, 5G, backhaul, and other high-performance DSP applications.

Note: This version of the document covers the essential hardware details specific to AI Engines. The software programming techniques using the AI Engine API, and the optimization skills they develop, are intended to extend to newer architectures.

AI Engines are an array of very long instruction word (VLIW) processors with single instruction, multiple data (SIMD) vector units that are highly optimized for compute-intensive applications, specifically digital signal processing (DSP), 5G wireless applications, and AI technology such as machine learning (ML).

The AI Engine array supports three levels of parallelism:

Instruction Level Parallelism (ILP)
Through the VLIW architecture allowing multiple operations to be executed in a single clock cycle.
SIMD
Through vector registers allowing multiple elements (for example, eight) to be computed in parallel.
Multicore
Through the AI Engine array, allowing up to 400 AI Engines to execute in parallel.

Instruction-level parallelism allows a scalar operation, up to two moves, two vector reads (loads), one vector write (store), and one vector instruction to execute in a single clock cycle, for a total of a 7-way VLIW instruction. Data-level parallelism is achieved through vector operations, in which multiple sets of data elements are processed in each clock cycle.
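As a minimal sketch of data-level parallelism using the AI Engine API (the function name, pointer names, and element count below are illustrative, not taken from this document), a single vector operation can add eight 32-bit elements at once:

#include <aie_api/aie.hpp>

// Adds eight int32 elements per call; pointers are assumed vector-aligned.
void vadd8(const int32 *a, const int32 *b, int32 *c) {
    aie::vector<int32, 8> va = aie::load_v<8>(a); // one vector load (eight lanes)
    aie::vector<int32, 8> vb = aie::load_v<8>(b); // a second vector load
    aie::store_v(c, aie::add(va, vb));            // eight additions in one vector op
}

The two loads, the vector add, and the store map onto separate VLIW slots, which is what allows the compiler to schedule them in parallel.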

Each AI Engine contains a vector processor, a scalar processor, dedicated program memory, and a local 32 KB data memory. It can also access the local data memory of three neighboring AI Engines; which neighbors are accessible depends on the row the AI Engine is in. In addition, each AI Engine has access to DMA engines and AXI4 interconnect switches to communicate via streams with other AI Engines, with the programmable logic (PL), or with the DMA. Refer to the Versal ACAP AI Engine Architecture Manual (AM009) for specific details on the AI Engine array and interfaces.

While most standard C code can be compiled for the AI Engine, the code might need restructuring to take full advantage of the parallelism provided by the hardware. The power of an AI Engine is in its ability to execute a multiply-accumulate (MAC) operation using two vectors, load two vectors for the next operation, store a vector from the previous operation, and increment a pointer or execute another scalar operation, all in each clock cycle. Specialized functions called intrinsics let you target the AI Engine vector and scalar processors and provide implementations of several common vector and scalar functions, so you can focus on the target algorithm. In addition to its vector unit, an AI Engine also includes a scalar unit that can be used for non-linear functions and data type conversions.
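As one hedged example of this pattern, the following sketch expresses a multiply-accumulate loop with the AI Engine API over window interfaces (the kernel name, sample counts, and shift value are assumptions for illustration; intrinsics could express the same operations at a lower level):

#include <adf.h>
#include <aie_api/aie.hpp>
#include <aie_api/aie_adf.hpp>

// Multiplies and accumulates 128 int16 samples, 16 lanes at a time.
void mac16(input_window_int16 *wa, input_window_int16 *wb,
           output_window_int16 *wout) {
    aie::vector<int16, 16> a = window_readincr_v<16>(wa);
    aie::vector<int16, 16> b = window_readincr_v<16>(wb);
    auto acc = aie::mul(a, b);                       // 48-bit accumulation
    for (unsigned i = 1; i < 8; ++i) {
        a = window_readincr_v<16>(wa);               // two vector loads and one
        b = window_readincr_v<16>(wb);               // MAC scheduled per iteration
        acc = aie::mac(acc, a, b);
    }
    window_writeincr(wout, acc.to_vector<int16>(0)); // shift-round-saturate to int16
}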

An AI Engine program consists of an adaptive data flow (ADF) graph specification written in C++. This specification can be compiled and executed using the AI Engine compiler. An ADF graph application consists of nodes and edges, where nodes represent compute kernel functions and edges represent data connections. Kernels in the application can be compiled to run on the AI Engines and are the fundamental building blocks of an ADF graph specification. An ADF graph is a Kahn process network in which the AI Engine kernels operate in parallel on data streams: kernels consume input blocks of data and produce output blocks of data. Kernels can also have static data or run-time parameter (RTP) arguments, which can be either asynchronous or synchronous.
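As a minimal sketch (the kernel, class, and file names are illustrative assumptions, not from this document), an ADF graph that wraps one kernel between an input port and an output port might look like the following:

#include <adf.h>

using namespace adf;

// Kernel function compiled to run on one AI Engine; in practice this
// declaration lives in a shared header.
void my_kernel(input_window_int32 *in, output_window_int32 *out);

class simple_graph : public graph {
private:
    kernel k;
public:
    input_port in;
    output_port out;

    simple_graph() {
        k = kernel::create(my_kernel);
        source(k) = "my_kernel.cc";               // file containing the kernel body
        runtime<ratio>(k) = 0.9;                  // budget of one core's cycles
        connect<window<128>> net0(in, k.in[0]);   // 128-byte input blocks
        connect<window<128>> net1(k.out[0], out); // 128-byte output blocks
    }
};

Here window<128> declares that the kernel consumes and produces 128-byte blocks of data per invocation, matching the block-based execution model described above.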

The following figure shows the conceptual view of the ADF graph and its interfaces with the processing system (PS), programmable logic (PL), and DDR memory. It consists of the following:

AI Engine
Each AI Engine is a VLIW processor containing a scalar unit, a vector unit, two load units, and a single store unit.
AI Engine Kernel
Kernels are written in C/C++ and run on an AI Engine.
ADF Graph
An ADF graph is a network of one or more AI Engine kernels connected by data streams. It interacts with the PL, global memory, and the PS through specific constructs: PLIO (a port attribute in graph programming that is used to make stream connections to or from the programmable logic), GMIO (a port attribute in graph programming that is used to make external memory-mapped connections to or from global memory), and RTP. A top-level connection sketch follows Figure 1.
Figure 1. Conceptual Overview of the ADF Graph
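For example, a top-level wrapper that binds the graph ports to PLIO and drives the graph in simulation might look like the following sketch (the file names and the simulation::platform style shown are assumptions based on the 2021.x tools):

#include <adf.h>
#include "simple_graph.h" // hypothetical header holding the graph class above

using namespace adf;

// PLIO objects make stream connections to or from the PL; in simulation
// the streams are backed by text files.
PLIO *pl_in  = new PLIO("DataIn",  plio_32_bits, "data/input.txt");
PLIO *pl_out = new PLIO("DataOut", plio_32_bits, "data/output.txt");

simulation::platform<1, 1> platform(pl_in, pl_out);
simple_graph g;

connect<> net_in(platform.src[0], g.in);
connect<> net_out(g.out, platform.sink[0]);

int main() {
    g.init();  // configure and load the AI Engine array
    g.run(4);  // execute four iterations of the graph
    g.end();
    return 0;
}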

This document focuses on AI Engine kernel programming and covers some aspects beyond single-kernel programming, such as data communication between kernels, which is an essential concept when partitioning an application into multiple kernels to achieve overall system performance.

For additional details about constructing graphs, compiling and simulating graphs, and the hardware flow, refer to the Versal ACAP AI Engine Programming Environment User Guide (UG1076).