Mapping Algorithm onto the AI Engine

AI Engine Kernel Coding Best Practices Guide (UG1079)

Document ID: UG1079
Release Date: 2021-07-19
Version: 2021.1 English

When starting from a pure software model, one must first identify the application boundary. With the application boundary identified, the inputs and outputs of the application can be defined. The application can then be divided into components (processing units) that each perform a specific operation. Data flows from the inputs, through one or more components, to the outputs. The diagram that captures this flow is called the data flow diagram (DFD).

The DFD can be mapped into a graph, which is the actual design running on the AI Engine. When performing the mapping, the input and output sources, as well as their throughputs, must be taken into consideration in accordance with the system performance requirements. High-throughput data should go through PLIO or GMIO. The PS can handle lower-bandwidth configuration data through RTP ports of the graph. Depending on the bandwidth the PL kernels can provide, the AI Engine-to-PL interfaces can be determined.
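For example, a minimal graph sketch along these lines might look as follows. The kernel and port names are hypothetical, and the binding of the data ports to PLIO or GMIO happens at the system level, so only the graph-side ports are shown here. The high-throughput data path uses window ports, while a low-bandwidth gain parameter is exposed to the PS as an asynchronous RTP:

#include <adf.h>
using namespace adf;

// Hypothetical kernel, defined elsewhere:
// void filter(input_window_int32 *in, output_window_int32 *out, int32 gain);
class filter_graph : public graph {
private:
    kernel k;
public:
    port<input>  data_in;   // high-throughput data, bound to PLIO/GMIO at the system level
    port<output> data_out;
    port<input>  gain;      // low-bandwidth configuration, driven by the PS as an RTP

    filter_graph() {
        k = kernel::create(filter);
        source(k) = "filter.cc";
        runtime<ratio>(k) = 0.6;                  // see the runtime ratio discussion below
        connect<window<512>>(data_in, k.in[0]);   // 512-byte input window
        connect<window<512>>(k.out[0], data_out);
        connect<parameter>(gain, async(k.in[1])); // asynchronous runtime parameter (RTP)
    }
};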

After the graph interface is defined, elements in the graph can be further divided into different kernels. This kernel partitioning process is typically throughput-driven. Each single AI Engine kernel must stay within the hard limits of AI Engine compute capacity, data memory size, and program memory size. In addition, the data transfer between kernels must be measured and balanced. Multiple data transfer methods between AI Engine kernels are available, such as ping-pong buffers, streams, and cascade streams. This requires accounting for stream interface availability, memory availability, and the locations of the buffers, because those considerations directly affect data communication between kernels and the data fetch throughput. When a kernel is split into multiple kernels chained by cascade streams, these kernels are tightly coupled and synchronized by producing and consuming data on the cascade stream.
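The connection type for each link is expressed in the graph specification. The following is a minimal sketch, assuming four hypothetical kernels (stage1 through stage4) whose interfaces match the chosen connection types; in particular, a cascade connection requires the producer and consumer to use cascade (accumulator) stream interfaces:

#include <adf.h>
using namespace adf;

// Hypothetical kernels declared in kernels.h; stage3 and stage4 use
// accumulator stream interfaces so they can be chained by a cascade.
class partitioned_graph : public graph {
private:
    kernel k1, k2, k3, k4;
public:
    port<input>  in;
    port<output> out;

    partitioned_graph() {
        k1 = kernel::create(stage1); source(k1) = "stage1.cc"; runtime<ratio>(k1) = 0.8;
        k2 = kernel::create(stage2); source(k2) = "stage2.cc"; runtime<ratio>(k2) = 0.8;
        k3 = kernel::create(stage3); source(k3) = "stage3.cc"; runtime<ratio>(k3) = 0.8;
        k4 = kernel::create(stage4); source(k4) = "stage4.cc"; runtime<ratio>(k4) = 0.8;

        connect<window<2048>>(in, k1.in[0]);
        connect<window<2048>>(k1.out[0], k2.in[0]); // ping-pong buffer in shared data memory
        connect<stream>(k2.out[0], k3.in[0]);       // stream connection, no shared buffer
        connect<cascade>(k3.out[0], k4.in[0]);      // cascade stream between neighboring tiles
        connect<window<2048>>(k4.out[0], out);
    }
};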

After kernel partitioning, the I/O interfaces and the throughput requirement for each kernel should be clear. When the compute, memory, and interface resources can meet the requirements in theory, it is advisable to move on to the vectorization of each single kernel.

Application partitioning and kernel programming can be iterated quickly in simulation. As the process goes deeper, engineers might need to redo the partitioning or restructure the data flow between kernels. This iterative method of mapping the algorithm into a graph, partitioning the application into AI Engine kernels, and refining the data flow driven by throughput is shown in the following figure.

Figure 1. Data Flow Diagram to Graph

Depending on the runtime ratios of the kernels, one or multiple kernels can be mapped onto one AI Engine. The runtime ratio of a kernel can be computed using the following equation:

runtime ratio = (cycles for one run of the kernel)/(cycle budget)

The cycle budget is the number of cycles allowed for one invocation of the kernel, which depends on how much data the invocation processes and on the system throughput requirement. The cycles for one run of the kernel can be estimated in the initial design stage and profiled in the AI Engine simulator once vectorized code is available.
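For example, using hypothetical numbers: if the system requires a kernel to consume 1024 samples every 2 µs and the AI Engine runs at 1 GHz, the cycle budget for one invocation is 2000 cycles. If one run of the kernel takes 1200 cycles, the runtime ratio is 1200/2000 = 0.6, which leaves room to map a second kernel with a runtime ratio of up to 0.4 onto the same AI Engine.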

Runtime ratio setting is required in graph specification and it affects how data is communicated between kernels. Refer to Data Movement Between AI Engines for more information about data communication. See the Versal ACAP AI Engine Programming Environment User Guide (UG1076) for more considerations on runtime ratio.

When the graph can be constructed and the kernels are achievable in AI Engine tiles, vectorization is done for each individual AI Engine kernel. Initially, some kernels can be left unvectorized as a proof of concept.

The process of vectorization is based on vector data types and vector intrinsic functions. This requires accounting for the available vector registers and the data that can be loaded for computation. Usually, the target of the optimization is to achieve the best initiation interval of the main loop in the kernel. This involves interleaving data loads, computation, and stores, as well as resolving data dependencies and compute dependencies in the loop. When optimizing hierarchical loops, it is helpful to pipeline the outer loop and unroll the inner loop when the inner loop is not large. The techniques introduced in Single Kernel Programming are applicable to the vectorization and optimization of the single kernel.
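As a structural sketch only: the following hypothetical kernel reads and writes eight int32 lanes per loop iteration and is annotated for software pipelining. The pass-through assignment stands in for real vector compute (for example, mul/mac intrinsic operations), which would be interleaved with the loads and stores to reach the targeted initiation interval:

#include <adf.h>

// Hypothetical kernel: processes a 256-sample int32 window, 8 lanes at a time.
// The pass-through is a placeholder for real vector compute on v8int32 operands.
void passthrough_v8(input_window_int32 *in, output_window_int32 *out) {
    for (unsigned i = 0; i < 256 / 8; ++i)
        chess_prepare_for_pipelining
        chess_loop_range(32,)
    {
        v8int32 v = window_readincr_v8(in);  // vector load: 8 x int32
        window_writeincr(out, v);            // vector store: 8 x int32
    }
}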