Improving Performance in the AI Engine - 2023.1 English

Versal Adaptive SoC System Integration and Validation Methodology Guide (UG1388)

There are several techniques to profile and improve the performance of AI Engine graphs and kernels.

You can use the Xilinx® Runtime (XRT) APIs to measure performance metrics such as platform I/O port bandwidth, graph throughput, and graph latency. Use these APIs in the host application code with the AI Engine graph object, which is used to initialize, run, update, and exit graphs. You can also use these APIs to profile graph objects and measure bandwidth, throughput, and latency. For more information, see the AI Engine Tools and Flows User Guide (UG1076).

AI Engine performance analysis typically involves system-level issues such as missing or mismatched locks, buffer overruns, and incorrect programming of direct memory access (DMA) buffers, as well as memory/core stalls, deadlocks, and hot-spot analysis. The AI Engine architecture has direct support for the generation, collection, and streaming of events as trace data during simulation, hardware emulation, or hardware execution. This data can then be analyzed for functional issues, latency problems between kernels, memory stalls, and deadlocks. For more information, see the following:
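When running on hardware, event trace and profiling collection are commonly enabled through the XRT runtime configuration file rather than through code. The fragment below is a sketch of an xrt.ini placed next to the host executable; the exact option names and sections are assumptions here, so confirm them against UG1076 and the XRT documentation for your tool version.

```ini
; xrt.ini - assumed option names, verify against UG1076/XRT docs
[Debug]
aie_trace = true      ; stream AI Engine event trace data during execution
aie_profile = true    ; collect AI Engine profiling counters
```

The collected trace can then be opened in the Vitis analyzer to inspect kernel activity, stalls, and lock behavior over time.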

AI Engine APIs versus Intrinsics

The AI Engine API is a portable programming interface for AI Engine accelerators. It is implemented as a C++ header-only library that provides types and operations that are translated into efficient low-level intrinsics. AMD strongly recommends using the AI Engine API for your designs. Consider intrinsics only when the stringent performance needs of the design require capabilities not covered by the AI Engine API. For example, the AI Engine API does not currently support the functionality provided by some intrinsics, such as fft_data_incr and cyclic_add. While the AI Engine API supports and abstracts the main permute use cases, not all permute capabilities are covered. In those cases, using intrinsics might allow you to close the performance gap required by your design.
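To illustrate the recommended style, the kernel sketch below multiplies two int16 buffers element-wise using AI Engine API types (aie::vector, aie::accum, aie::mul) instead of raw intrinsics. It is a minimal sketch that requires the AI Engine compiler and aie_api headers to build; the buffer size of 256 samples is an assumption for illustration.

```cpp
#include <aie_api/aie.hpp>
#include <aie_api/aie_adf.hpp>

// Element-wise multiply: the aie::mul call is lowered by the compiler
// to an efficient vector MAC intrinsic for the target AI Engine.
void vect_mul(adf::input_buffer<int16>& inA,
              adf::input_buffer<int16>& inB,
              adf::output_buffer<int16>& out) {
    auto pA = aie::begin_vector<16>(inA);
    auto pB = aie::begin_vector<16>(inB);
    auto pO = aie::begin_vector<16>(out);

    for (unsigned i = 0; i < 256 / 16; ++i) {   // assumed 256-sample buffers
        aie::vector<int16, 16> a = *pA++;
        aie::vector<int16, 16> b = *pB++;
        aie::accum<acc48, 16> acc = aie::mul(a, b);
        *pO++ = acc.to_vector<int16>(0);        // shift-round-saturate back to int16
    }
}
```

Because the API abstracts the vector width and accumulator types, the same source can be retargeted across AI Engine generations, which is the portability benefit that intrinsics-based code gives up.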

For more information on the usage of AI Engine APIs and intrinsics, see AI Engine Kernel and Graph Programming Guide (UG1079).