Single Kernel Programming using Intrinsics

Single Kernel Programming using Intrinsics - 2021.2 English

AI Engine Kernel Coding Best Practices Guide (UG1079)

Document ID

UG1079

Release Date

2021-11-10

Version

2021.2 English

CAUTION:

It is strongly recommended that you use AI Engine APIs for your designs. Usage of intrinsics must only be considered for situations where the stringent performance needs of the design require capabilities that are not covered by the AI Engine API. For example, the AI Engine API does not currently support functionality provided by some intrinsics such as, fft_data_incr and cyclic_add. While AI Engine APIs support and abstract the main permute use cases, not all permute capabilities are covered. Using intrinsics may allow you to close the performance gap required by your design.

An AI Engine kernel is a C/C++ program which is written using native C/C++ language and specialized intrinsic functions that target the VLIW scalar and vector processors. The AI Engine kernel code is compiled using the AI Engine compiler (aiecompiler) that is included in the Vitis™ core development kit. The AI Engine compiler compiles the kernels to produce ELF files that are run on the AI Engine processors.

For more information on intrinsic functions, see the Versal ACAP AI Engine Intrinsics Documentation (UG1078). AI Engine compiler and simulator are covered in the first few sections of this chapter.

AI Engine supports specialized data types and intrinsic functions for vector programming. By restructuring the scalar application code with these intrinsic functions and vector data types as needed, you can implement the vectorized application code. The AI Engine compiler takes care of mapping intrinsic functions to operations, vector or scalar register allocation and data movement, automatic scheduling, and generation of microcode that is efficiently packed in VLIW instructions.

The following sections introduce the data types supported and registers available for use by the AI Engine kernel. In addition, the vector intrinsic functions that initialize, load, and store, as well as operate on the vector registers using the appropriate data types are also described.

To achieve the highest performance on the AI Engine, the primary goal of single kernel programming is to ensure that the usage of the vector processor approaches its theoretical maximum. Vectorization of the algorithm is important, but managing the vector registers, memory access, and software pipelining are also required. The programmer must strive to make the data for the new operation available while the current operation is executing because the vector processor is capable of an operation every clock cycle. Optimizations using software pipelining in loops is available using pragmas. For instance, when the inner loop has sequential or loop carried dependencies it might be possible to unroll an outer loop and compute multiple values in parallel. The following sections go over these concepts as well.