AI Engine Power - 2020.2 English

Xilinx Power Estimator User Guide for Versal ACAP (UG1275)


The AI Engine array introduced in the Xilinx® Versal™ architecture targets high-compute and complex DSP-intensive applications, such as 5G wireless and machine learning algorithms. Each AI Engine is a high-performance VLIW vector (SIMD) processor with integrated memory and interconnects that communicate with other AI Engine cores connected in a two-dimensional array across the device.

Power Estimation by XPE

The AI Engine sheet in XPE for Versal ACAP is available for the AI Core Series family. XPE helps you estimate the power consumption of AI Engine blocks for a particular configuration. The following figure shows the AI Engine Power interface:
Figure 1. AI Engine Power Sheet

For an early power estimation, provide the configuration details of the AI Engine array, such as the clock frequency, number of cores, kernel type, and the average Vector Load percentage for the cores. The supported kernel types are Int8, Int16, and Floating Point.

Tip: Enter the average Vector Load percentage. The kernel may use 100% of the available core run time; however, overhead from prefetch, memory accesses, NOPs, and stream and lock stalls should be factored in. The recommended range is 30% to 70%.
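The tip above can be sketched as simple arithmetic: the Vector Load entered in XPE is the fraction of cycles doing useful vector work, not the fraction of time the kernel occupies the core. The function name and cycle counts below are hypothetical, for illustration only.

```python
# Hypothetical sketch: deriving an average Vector Load figure for XPE entry.
# The cycle counts are illustrative, not taken from any real profile.

def average_vector_load(active_cycles: int, total_cycles: int) -> float:
    """Percentage of core cycles doing useful vector work.

    Overhead cycles (prefetch, memory accesses, NOPs, stream and lock
    stalls) count toward total_cycles but not active_cycles, so even a
    kernel that occupies 100% of core run time typically lands in the
    30-70% range recommended by XPE.
    """
    return 100.0 * active_cycles / total_cycles

# Example: the kernel runs the whole time, but 45% of cycles are stalls/NOPs.
load = average_vector_load(active_cycles=5500, total_cycles=10000)
# load == 55.0, within the recommended 30-70% entry range
```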

The Data Memory and Interconnect Load fields are auto-populated based on the number of AI Engine cores used and can be overridden to match the application requirements. There are eight memory banks in an AI Engine tile (each bank is 4 KB in size, totaling 32 KB per tile). By default, XPE uses all of them; this can be overridden if the application requires fewer bank accesses. The Memory R/W rate is the average read/write memory access rate for each bank.

Tip: The Memory R/W rate is an average value. XPE uses 20% by default. The recommended range is 10% to 30%.
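The memory figures above reduce to straightforward arithmetic. The sketch below uses only the numbers stated in the text (8 banks per tile, 4 KB per bank); the function names are hypothetical and do not reflect XPE internals.

```python
# Illustrative arithmetic for the Data Memory fields, using values from
# the text: 8 banks per AI Engine tile, 4 KB per bank, all banks by default.

BANKS_PER_TILE = 8
BANK_SIZE_KB = 4

def tile_memory_kb() -> int:
    """Total data memory per AI Engine tile: 8 banks x 4 KB = 32 KB."""
    return BANKS_PER_TILE * BANK_SIZE_KB

def active_memory_kb(tiles: int, banks_used: int = BANKS_PER_TILE) -> int:
    """Memory covered by the estimate; a smaller banks_used can be
    entered when the application accesses fewer banks per tile."""
    return tiles * banks_used * BANK_SIZE_KB

print(tile_memory_kb())         # 32
print(active_memory_kb(10, 4))  # 160
```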

The AI Engine array interface provides access to the rest of the Versal™ ACAP. There are interface tiles for both the programmable logic (PL) and the network on chip (NoC), and these interface tiles are represented as streams. You can override the PL/NoC streams based on your design and application. The interconnect fields are read-only and are calculated from your input. The PL streams field shows the streams available in the first row of AIE tiles and lets you specify the number of 64-bit PL streams that are utilized. It is recommended to keep the default of 14 PL streams per 20 AIE tiles used. However, the PL stream count can be changed; a DRC appears (the cell turns orange) when the PL streams exceed the streams available in the total AIE array. The Interconnect Load is averaged to a fixed value of 12%, has minimal impact on power, and can be overridden by the import flow described in the next section. The maximum clock speed depends on the speed grade of the device, with 1300 MHz for the -3H grade. For more information, see the Versal ACAP AI Engine Architecture Manual (AM009).
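The PL stream guidance above (14 streams per 20 AIE tiles, with an orange DRC on over-subscription) can be sketched as a small check. The scaling formula and function names are assumptions for illustration, not XPE's internal rule.

```python
import math

# Hypothetical DRC-style check mirroring the PL stream guidance in the
# text: a default of 14 streams per 20 AIE tiles, and a DRC (orange cell)
# when the requested streams exceed what the array provides.

DEFAULT_STREAMS_PER_20_TILES = 14

def default_pl_streams(tiles_used: int) -> int:
    """Recommended default: 14 64-bit PL streams per 20 AIE tiles used."""
    return math.ceil(tiles_used / 20) * DEFAULT_STREAMS_PER_20_TILES

def pl_stream_drc(requested: int, available: int) -> str:
    """Return 'ok', or 'orange' as XPE flags an over-subscribed entry."""
    return "orange" if requested > available else "ok"

print(default_pl_streams(40))  # 28
print(pl_stream_drc(30, 28))   # orange
```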

Tip: There are multi-level DRCs for Vector Load and Memory R/W Rate. Yellow indicates values in the higher range of typical applications; orange indicates values higher than expected for typical applications.

Import Flow

The Vitis™ software platform generates an .xpe file that can be imported to provide an accurate starting point for AI Engine power estimation. Once imported, the full configuration is populated, and power can be estimated more accurately than in manual entry mode. When imported, the .xpe file averages the Vector Load and Memory R/W rate for each kernel type. For example, the vector load and R/W rate of all cores of kernel type INT8 are averaged and populated in a single row of XPE. The Interconnect Load does not default to 12% in the import flow; instead, it is computed by the tool from the stream utilization of each AI Engine tile. For more information, see AR 75509.
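The per-kernel-type averaging described above can be sketched as a simple group-and-average pass. The data layout and values below are hypothetical examples, not the actual .xpe file format.

```python
# Sketch of how the import flow condenses per-core data, as described
# above: cores sharing a kernel type (e.g. INT8) have their Vector Load
# and Memory R/W rates averaged into a single XPE row.
# The tuples are hypothetical example data, not the .xpe format.

from collections import defaultdict
from statistics import mean

cores = [  # (kernel_type, vector_load_pct, mem_rw_rate_pct)
    ("INT8", 52.0, 18.0),
    ("INT8", 48.0, 22.0),
    ("FP32", 61.0, 25.0),
]

by_type = defaultdict(list)
for ktype, vload, rw in cores:
    by_type[ktype].append((vload, rw))

# One row per kernel type: (average vector load, average R/W rate)
rows = {
    ktype: (mean(v for v, _ in samples), mean(r for _, r in samples))
    for ktype, samples in by_type.items()
}
print(rows["INT8"])  # (50.0, 20.0)
```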