Stage 2: System Profiling - 2022.1 English

Versal ACAP AI Engine Programming Environment User Guide (UG1076)

The goal of this stage is to profile the design and determine which domain (AI Engine, PL, or NoC) is causing the throughput drop that stalls the design.

The following figure shows the tasks and techniques available in this stage.

Figure 1. System Profiling

The following section describes the technique available in this stage.

Profiling AI Engine Core, Interface and Memory Module

You can profile the AI Engine Core, Interface, and Memory modules in either the XRT or the XSDB flow. Profiling is non-intrusive and can be enabled at runtime by using the xrt.ini file or by running scripts in XSDB. The feature uses the performance counters available in the AI Engine array to gather profile data, so the amount and type of data gathered is limited by the number of performance counters available.

Profiling AI Engine Core
The profile metric sets available for profiling the AI Engine are as follows:
  • heat map
  • stalls
  • stream puts/gets
  • exceptions
  • tile execution
  • read/write bandwidth related metrics
Memory Module Profiling
The profile metric sets available for profiling the memory module are as follows:
  • conflicts
  • DMA locks
  • DMA stalls

Some examples of AI Engine and Memory Module profiling information displayed in Vitis Analyzer can be found in Figure 3 and Figure 4.
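As a sketch of the runtime setup described above, the Python snippet below renders an xrt.ini that enables AI Engine profiling and selects one core metric set and one memory metric set. The section and key names ([Debug] aie_profile, [AIE_profile_settings], interval_us, aie_profile_core_metrics, aie_profile_memory_metrics) and the underscore spellings heat_map, stalls, and dma_locks are assumptions modeled on the 2022.1-era XRT format and are not defined in this section; confirm the exact names in Profiling the AI Engine before relying on them. The function name is hypothetical.

```python
import configparser
import io


def render_aie_profile_ini(core_metrics="heat_map", memory_metrics="dma_locks",
                           interval_us=1000):
    """Return xrt.ini text enabling AI Engine profiling (keys are assumptions)."""
    config = configparser.ConfigParser()
    config.optionxform = str  # preserve key spelling exactly as written
    # Assumed runtime switch that turns AI Engine profiling on.
    config["Debug"] = {"aie_profile": "true"}
    # Assumed section: counter sampling interval plus one metric set per module type.
    config["AIE_profile_settings"] = {
        "interval_us": str(interval_us),
        "aie_profile_core_metrics": core_metrics,
        "aie_profile_memory_metrics": memory_metrics,
    }
    buf = io.StringIO()
    config.write(buf, space_around_delimiters=False)  # emit key=value lines
    return buf.getvalue()


if __name__ == "__main__":
    # Save this output as xrt.ini next to the host application before running.
    print(render_aie_profile_ini(core_metrics="stalls"))
```

Generating the file from a script, rather than editing it by hand, makes it easy to switch metric sets between runs without typos in the key names.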

Interface Bandwidth Profiling
Profile metrics that collect interface bandwidth information are also available. By combining the direction of the port (input or output) with the type of inactivity observed (idle or stalled), you can identify whether the PL is stalling and limiting the throughput of the AI Engine, or vice versa.

The following tables show how to interpret the interface profiling metrics. In each table, the first column gives the bandwidth metric set and the header row gives the stall/idle metric set:

Table 1. Interface Profiling Metrics: input_bandwidths and input_stalls_idle

Metric set: input_bandwidths | input_stalls_idle: Stalls High | input_stalls_idle: Idle High
Low bandwidth | AI Engine does not consume samples at the right rate. Proceed to stage 4. | PL kernel does not produce samples at the right rate. Proceed to stage 3.

Table 2. Interface Profiling Metrics: output_bandwidths and output_stalls_idle

Metric set: output_bandwidths | output_stalls_idle: Stalls High | output_stalls_idle: Idle High
Low bandwidth | PL kernel does not consume samples at the right rate. Proceed to stage 3. | AI Engine does not produce samples at the right rate. Proceed to stage 4.
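The decision logic in Table 1 and Table 2 can be captured in a small Python function. This is a sketch: the function name and its boolean inputs are hypothetical, and deciding what counts as "low" bandwidth or "high" stalls or idles is left to your own design requirements. Here "input" means the PL produces samples that the AI Engine consumes, and "output" means the AI Engine produces samples that the PL consumes, matching the roles in the tables.

```python
def diagnose_interface(direction, low_bandwidth, stalls_high, idle_high):
    """Map interface profiling observations to (culprit, next_stage).

    direction: "input" (PL -> AI Engine) or "output" (AI Engine -> PL).
    Returns None when the tables give no verdict.
    """
    if direction not in ("input", "output"):
        raise ValueError("direction must be 'input' or 'output'")
    if not low_bandwidth or stalls_high == idle_high:
        # Bandwidth is fine, or stalls and idles are both high/both low:
        # neither table column applies.
        return None
    if direction == "input":
        # Stalled input: AI Engine is not consuming. Idle input: PL is not producing.
        return ("AI Engine", 4) if stalls_high else ("PL kernel", 3)
    # Stalled output: PL is not consuming. Idle output: AI Engine is not producing.
    return ("PL kernel", 3) if stalls_high else ("AI Engine", 4)
```

For example, a low-bandwidth input port that reports high stalls points at the AI Engine, so the function returns ("AI Engine", 4), meaning proceed to stage 4.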

Because the number of performance counters limits the data collected in a single run, you can run the design multiple times with different parameters in the xrt.ini file, rebooting the board between runs. Vitis Analyzer allows you to consolidate the resulting reports so that you have a global view of the various bandwidths, stalls, and idles at the interface level.
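One way to organize these repeated runs is to generate one xrt.ini per interface metric set, as in the Python sketch below. The four metric set names come from Table 1 and Table 2; the [Debug] aie_profile switch, the [AIE_profile_settings] section, and the aie_profile_interface_metrics key are assumptions about the XRT ini format, and the function and run names are hypothetical.

```python
import configparser
import io

# Interface metric sets named in Tables 1 and 2; each gets its own run.
INTERFACE_METRIC_SETS = ("input_bandwidths", "input_stalls_idle",
                         "output_bandwidths", "output_stalls_idle")


def interface_run_configs(metric_sets=INTERFACE_METRIC_SETS, interval_us=1000):
    """Return {run_name: xrt.ini text}, one configuration per board run."""
    runs = {}
    for i, metric_set in enumerate(metric_sets, start=1):
        config = configparser.ConfigParser()
        config.optionxform = str  # keep key case unchanged
        config["Debug"] = {"aie_profile": "true"}  # assumed key
        config["AIE_profile_settings"] = {  # assumed section and keys
            "interval_us": str(interval_us),
            "aie_profile_interface_metrics": metric_set,
        }
        buf = io.StringIO()
        config.write(buf, space_around_delimiters=False)
        runs[f"run{i}_{metric_set}"] = buf.getvalue()
    return runs


if __name__ == "__main__":
    for name, text in interface_run_configs().items():
        print(f"# {name}")
        print(text)
```

Copy each generated configuration to the board as xrt.ini, run the design, reboot, and repeat; the resulting reports can then be consolidated in Vitis Analyzer.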

For details on how to enable profiling in hardware and interpret the results, see Profiling the AI Engine.

The profile results allow you to quickly identify the exact AI Engine, input stream, or output stream responsible for the drop in design performance.
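Once bandwidth figures for each stream are available, ranking the streams by how far they fall short of their target is one way to single out the stream involved in the drop. The Python helper below is hypothetical and not part of Vitis Analyzer: it assumes you have transcribed measured bandwidths from the reports and know the expected bandwidth of each port from your design requirements. The port names, the 0.9 threshold, and the numbers in the usage note are illustrative only.

```python
def worst_interfaces(measured, expected, threshold=0.9):
    """Return (port, achieved/expected ratio) pairs below threshold, worst first.

    measured, expected: dicts mapping a port name to a bandwidth, both in
    the same unit. Ports missing from `measured` are treated as 0.
    """
    shortfalls = []
    for port, target in expected.items():
        if target <= 0:
            continue  # no meaningful target to compare against
        ratio = measured.get(port, 0.0) / target
        if ratio < threshold:
            shortfalls.append((port, ratio))
    shortfalls.sort(key=lambda item: item[1])  # largest shortfall first
    return shortfalls
```

For example, worst_interfaces({"in0": 900.0, "in1": 450.0}, {"in0": 1000.0, "in1": 1000.0}) returns [("in1", 0.45)], flagging in1 as the stream to investigate first.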

Next Stage:

  • Proceed to stage 3 if you determine that a PL kernel is causing the performance drop. In stage 3, you can identify the exact PL kernel(s) with sub-par performance.
  • Proceed to stage 4 if you determine that an AI Engine kernel is causing the throughput drop.