Stage 2: System Profiling - 2022.1 English

Versal ACAP AI Engine Programming Environment User Guide (UG1076)

Document ID

UG1076

Release Date

2022-05-25

Version

2022.1 English

The goal of this stage is to profile the design and determine which domain (AI Engine, PL, or NoC) is causing the throughput drop that makes the design stall.

The following figure shows the tasks and techniques available in this stage.

Figure 1. System Profiling

The sections below describe the techniques available in this stage.

Profiling AI Engine Core, Interface and Memory Module

You can profile the AI Engine Core, Interface, and Memory modules in XRT or XSDB flows. Profiling is a non-intrusive feature that can be enabled at runtime using the xrt.ini file or by running scripts in XSDB. The feature uses performance counters available in the AI Engine array to gather profile data. The amount and type of data gathered is limited by the number of performance counters available.

Profiling AI Engine Core

The profile metric sets available for profiling the AI Engine are as follows:

heat map
stalls
stream puts/gets
exceptions
tile execution
read/write bandwidth related metrics

Memory Module Profiling

The profile metric sets available for profiling the memory module are as follows:

conflicts
DMA locks
DMA stalls

Some examples of AI Engine and Memory Module profiling information displayed in Vitis Analyzer can be found in Figure 3 and Figure 4.

Interface Bandwidth Profiling

Profile metrics to collect interface bandwidth information are also available. Depending on the direction of the port and whether the port is mostly stalled or mostly idle, you can identify whether the PL is stalling and limiting the throughput of the AI Engine, or vice versa.

In the following tables, the bandwidth metric set is indicated in the first column and the stall/idle metric set in the header row:

Table 1. Interface Profiling Metrics: input_bandwidths and input_stalls_idle
Metric set: input_bandwidths	Metric set: input_stalls_idle, Stalls High	Metric set: input_stalls_idle, Idle High
Low bandwidth	AI Engine does not consume samples at the right rate. Proceed to stage 4.	PL kernel does not produce samples at the right rate. Proceed to stage 3.

Table 2. Interface Profiling Metrics: output_bandwidths and output_stalls_idle
Metric set: output_bandwidths	Metric set: output_stalls_idle, Stalls High	Metric set: output_stalls_idle, Idle High
Low bandwidth	PL kernel does not consume samples at the right rate. Proceed to stage 3.	AI Engine does not produce samples at the right rate. Proceed to stage 4.
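The decision logic in Table 1 and Table 2 can be written as a small lookup, which may help when triaging many ports at once. The following Python sketch is illustrative only: the names Observation, DIAGNOSIS, and diagnose are hypothetical and not part of any Xilinx tool, and the lookup assumes you have already established that the port shows low bandwidth and whether stalls or idles dominate.

```python
from enum import Enum


class Observation(Enum):
    """What dominates on a low-bandwidth interface port (hypothetical helper type)."""
    STALLS_HIGH = "stalls high"
    IDLE_HIGH = "idle high"


# (port direction, dominant observation) -> (likely cause, next stage).
# "input" follows Table 1 (input_* metric sets) and "output" follows
# Table 2 (output_* metric sets). Both tables apply only when bandwidth is low.
DIAGNOSIS = {
    ("input", Observation.STALLS_HIGH):
        ("AI Engine does not consume samples at the right rate", 4),
    ("input", Observation.IDLE_HIGH):
        ("PL kernel does not produce samples at the right rate", 3),
    ("output", Observation.STALLS_HIGH):
        ("PL kernel does not consume samples at the right rate", 3),
    ("output", Observation.IDLE_HIGH):
        ("AI Engine does not produce samples at the right rate", 4),
}


def diagnose(direction, observation):
    """Return (likely cause, next stage) for a low-bandwidth interface port."""
    try:
        return DIAGNOSIS[(direction, observation)]
    except KeyError:
        raise ValueError(f"unknown combination: {direction!r}, {observation!r}")


if __name__ == "__main__":
    cause, stage = diagnose("output", Observation.STALLS_HIGH)
    print(f"{cause}. Proceed to stage {stage}.")
```

Deciding what counts as "high" stalls or idles is left to the reader. The tables do not give thresholds, so none are encoded here.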

You can run the design multiple times, rebooting the board between runs, with different parameters in the xrt.ini file. Vitis Analyzer allows you to consolidate the resulting xrt.run.summary reports so that you have a global view of the bandwidths, stalls, and idle periods at the interface level.
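One way to prepare such a series of runs is to generate one xrt.ini per interface metric set ahead of time. The Python sketch below is a minimal illustration under stated assumptions: the four metric set names come from Table 1 and Table 2, but the [Debug] section and the aie_profile and aie_profile_interface_metrics keys are assumptions about the 2022.1 xrt.ini syntax, and should be checked against Profiling the AI Engine before use. The function names are hypothetical.

```python
from pathlib import Path

# Interface metric sets named in Table 1 and Table 2.
INTERFACE_METRIC_SETS = [
    "input_bandwidths",
    "input_stalls_idle",
    "output_bandwidths",
    "output_stalls_idle",
]


def render_xrt_ini(metric_set):
    """Return the text of an xrt.ini that selects one interface metric set.

    ASSUMPTION: the section name and key names below are not taken from
    this page; verify them against the profiling documentation.
    """
    if metric_set not in INTERFACE_METRIC_SETS:
        raise ValueError(f"unknown interface metric set: {metric_set!r}")
    lines = [
        "[Debug]",
        "aie_profile = true",
        f"aie_profile_interface_metrics = {metric_set}",
    ]
    return "\n".join(lines) + "\n"


def write_run_configs(out_dir):
    """Write <out_dir>/<metric_set>/xrt.ini for every interface metric set."""
    paths = []
    for metric_set in INTERFACE_METRIC_SETS:
        run_dir = Path(out_dir) / metric_set
        run_dir.mkdir(parents=True, exist_ok=True)
        ini_path = run_dir / "xrt.ini"
        ini_path.write_text(render_xrt_ini(metric_set))
        paths.append(ini_path)
    return paths


if __name__ == "__main__":
    for path in write_run_configs("profile_runs"):
        print(path)
```

Each generated file would then be copied next to the host application for one run, with the board rebooted before the next run, and the resulting reports consolidated in Vitis Analyzer.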

For details on how to enable profiling in hardware and interpret the results, see Profiling the AI Engine.

The profile results allow you to quickly identify the exact AI Engine, input stream or output stream involved in the design performance drop.

Next Stage:

Proceed to stage 3 if you determine that a PL kernel is causing the performance drop. In stage 3, you can identify the exact PL kernel(s) with the sub-par performance.
Proceed to stage 4 if you determine that an AI Engine kernel is causing the throughput drop.