Stage 2: System Profiling - 2023.2 English

AI Engine Tools and Flows User Guide (UG1076)

Document ID: UG1076

Release Date: 2023-12-04

Version: 2023.2 English

The goal of this stage is to profile the design and determine which domain (AI Engine, PL, or NoC) is causing the throughput drop that leads the design to stall.

The following figure shows the tasks and techniques available in this stage.

Figure 1. System Profiling

The following section describes the technique available in this stage.

Profiling AI Engine Core, Interface and Memory Module

You can profile the AI Engine core, interface, and memory modules in the XRT or XSDB flows. Profiling is a non-intrusive feature that you enable at runtime, either through the xrt.ini file or by running scripts in XSDB. The feature uses the performance counters available in the AI Engine array to gather profile data, so the amount and type of data gathered is limited by the number of performance counters available.

Table 1. AI Engine Metrics
AI Engine metrics
heat_map	Profiles active, stall, vector instruction, and cumulative instruction time. These metrics reveal the efficiency of the AI Engine, not only in terms of code efficiency (vector instructions) but also in terms of interaction with memory and streams.
stalls	Profiles memory, stream, lock, and cascade stalls. These metrics allow a deeper analysis of the reasons for the stalls detected with the `heat_map` metric set.
execution	Profiles vector, load, and store instruction time. With these metrics you can determine the efficiency of your kernel code.
floating-point	Profiles all floating-point exceptions. If you are using floating-point arithmetic, these metrics highlight the exceptions that occur in the code.
aie_trace	Profiles AI Engine and memory module trace word count and stall count. This is useful to determine if you have congestion in the trace stream when you use the event trace feature.
write_bandwidths	Profiles stream write, cascade write, and stall time. This is an indicator of the efficiency of the stream and cascade output. Many stalls indicate that the next kernel in the graph cannot consume data quickly enough, which could impact the design throughput.
read_bandwidth	Profiles stream read, cascade read, and stall time. This is an indicator of the efficiency of the stream and cascade input. Many stalls indicate that the previous kernel in the graph cannot provide data quickly enough, which could impact the design throughput.
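
As a minimal sketch of how the `heat_map` figures can be read, the following Python snippet expresses each counter as a fraction of the profiled interval and flags a tile for follow-up with the `stalls` metric set. The `HeatMapCounts` container, its field names, the cycle values, and the 0.2 threshold are all hypothetical; they are not part of any tool output format and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class HeatMapCounts:
    """Hypothetical container for one tile's heat_map counters, in cycles."""
    total_cycles: int   # length of the profiled interval
    active_cycles: int  # cycles the core was active
    stall_cycles: int   # cycles the core was stalled
    vector_cycles: int  # cycles spent on vector instructions


def heat_map_ratios(counts: HeatMapCounts) -> dict:
    """Return each counter as a fraction of the profiled interval."""
    if counts.total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return {
        "active": counts.active_cycles / counts.total_cycles,
        "stall": counts.stall_cycles / counts.total_cycles,
        "vector": counts.vector_cycles / counts.total_cycles,
    }


def needs_stall_analysis(ratios: dict, threshold: float = 0.2) -> bool:
    """Flag a tile whose stall fraction exceeds an (assumed) threshold."""
    return ratios["stall"] > threshold


# Illustrative values only, not measured data.
ratios = heat_map_ratios(HeatMapCounts(1000, 800, 300, 400))
print(ratios)
print("rerun with the stalls metric set:", needs_stall_analysis(ratios))
```

A tile flagged this way is a candidate for a second run with the `stalls` metric set, which separates memory, stream, lock, and cascade stalls.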

Table 2. Memory Module Metrics
Memory Module Metrics
conflicts	Profiles memory conflicts and memory errors. Memory conflicts happen when two memory chunks reside in the same memory bank and are accessed either by the same AI Engine (using the two read ports) or by two different AI Engines. A potential solution is to constrain the locations of these memories to different banks. To find out which bank is causing these conflicts, analyze the events from an emulation-AI Engine simulation or perform event trace in hardware.
dma_locks	Profiles lock activities on both DMAs. The four DMA channels (2xS2MM and 2xMM2S) are driven by Buffer Descriptors (BDs). The Cumulative DMA Activity is the total time spent in stalled lock acquire events on all channels. Together, these DMA events help you understand why some connections through the device are slower than expected.
dma_stalls_s2mm	Profiles DMA stalls on the S2MM channels due to a lock acquisition conflict. A stalling S2MM DMA indicates that there is a conflict when accessing the target memory. This may be due to another S2MM or MM2S DMA accessing the same bank, or to a kernel performing a memory access that leads to a lock acquisition conflict.
dma_stalls_mm2s	Profiles DMA stalls on the MM2S channels due to a lock acquisition conflict. A stalling MM2S DMA indicates that there is a conflict when accessing the source memory. This may be due to another S2MM or MM2S DMA accessing the same bank, or to a kernel performing a memory access that leads to a lock acquisition conflict.
write_bandwidths	Profiles bandwidth used by the S2MM DMA. Use it to evaluate whether you achieve your bandwidth goals.
read_bandwidths	Profiles bandwidth used by the MM2S DMA. Use it to evaluate whether you achieve your bandwidth goals.
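
The `conflicts` entry above attributes memory conflicts to buffers that share a memory bank. The following Python sketch lists the buffer pairs that share a bank, which are the candidates for a location constraint. The placement dictionary, the buffer names, and the (tile, bank) encoding are hypothetical; in practice you would take the placement from your own compilation results.

```python
from collections import defaultdict
from itertools import combinations


def shared_bank_pairs(placements):
    """Return (location, buffer_a, buffer_b) for buffers sharing a bank.

    `placements` maps a buffer name to a (tile, bank) tuple.
    """
    by_location = defaultdict(list)
    for name, location in placements.items():
        by_location[location].append(name)

    pairs = []
    for location, names in by_location.items():
        # Every pair of buffers in the same bank is a conflict candidate.
        for first, second in combinations(sorted(names), 2):
            pairs.append((location, first, second))
    return sorted(pairs)


# Hypothetical placement: two buffers in bank 0 and one in bank 2 of tile (24, 0).
placements = {
    "buf_in": ((24, 0), 0),
    "buf_coeff": ((24, 0), 0),
    "buf_out": ((24, 0), 2),
}
candidates = shared_bank_pairs(placements)
print(candidates)
```

Sharing a bank does not guarantee a conflict, because the accesses must also coincide in time. The list only narrows down which placements to inspect with event trace.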

Table 3. Interface Tile Metrics
Interface Tile Metrics
input_bandwidths	Profiles input PLIO channel bandwidth in addition to stalls and idle time. If input bandwidth is too low, this may be due to a high stall rate, which means that the AI Engine array does not consume the samples at the right rate. Proceed to AI Engine event trace (stage 4). It may also be due to a high idle rate, which means that the PL side of the design does not produce samples at the right rate. Proceed to PL kernel analysis (stage 3).
output_bandwidths	Profiles output PLIO channel bandwidth in addition to stalls and idle time. If output bandwidth is too low, this may be due to a high idle rate, which means that the AI Engine array does not produce the samples at the right rate. Proceed to AI Engine event trace (stage 4). It may also be due to a high stall rate, which means that the PL side of the design does not consume samples at the right rate. Proceed to PL kernel analysis (stage 3).
packets	Profiles the number of input and output packets.
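
The decision rules for `input_bandwidths` and `output_bandwidths` can be sketched as a small Python function. The function name, the ratio inputs, and the 0.2 threshold are assumptions made for illustration. The tools do not define a fixed threshold, so choose one that matches your throughput target.

```python
def next_stages(direction, stall_ratio, idle_ratio, threshold=0.2):
    """Suggest follow-up stages for one PLIO channel with low bandwidth.

    direction   -- "input" (PL to AI Engine) or "output" (AI Engine to PL)
    stall_ratio -- fraction of time the channel was stalled
    idle_ratio  -- fraction of time the channel was idle
    threshold   -- assumed limit above which a ratio counts as high
    """
    if direction not in ("input", "output"):
        raise ValueError("direction must be 'input' or 'output'")

    # Input: stalls point at the AI Engine array, idle time at the PL.
    # Output: idle time points at the AI Engine array, stalls at the PL.
    if direction == "input":
        aie_ratio, pl_ratio = stall_ratio, idle_ratio
    else:
        aie_ratio, pl_ratio = idle_ratio, stall_ratio

    stages = []
    if pl_ratio > threshold:
        stages.append("stage 3: PL kernel analysis")
    if aie_ratio > threshold:
        stages.append("stage 4: AI Engine event trace")
    return stages


# Illustrative ratios only, not measured data.
print(next_stages("input", stall_ratio=0.5, idle_ratio=0.05))
print(next_stages("output", stall_ratio=0.5, idle_ratio=0.05))
```

The same stall ratio leads to different stages depending on the channel direction, which is why the direction must be part of the analysis.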

You can run the design multiple times with different parameters in the xrt.ini file, rebooting the board between runs. The Vitis IDE allows you to consolidate the resulting xrt.run.summary reports so that you have a global view of the bandwidth, stall, and idle figures at the interface level.
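
One way to prepare such a series of runs is to generate one xrt.ini per interface tile metric set. The Python sketch below does this for the three metric sets in Table 3. The section names `[Debug]` and `[AIE_profile_settings]` and the keys `aie_profile` and `tile_based_interface_tile_metrics` are assumptions about the XRT configuration syntax and are not confirmed by this section. Verify them against Profiling the AI Engine for your release before use.

```python
# Interface tile metric sets listed in Table 3.
INTERFACE_METRIC_SETS = ["input_bandwidths", "output_bandwidths", "packets"]


def make_xrt_ini(metric_set, tiles="all"):
    """Build xrt.ini text that profiles one interface tile metric set.

    Section and key names are assumed, see the note above.
    """
    return (
        "[Debug]\n"
        "aie_profile = true\n"
        "\n"
        "[AIE_profile_settings]\n"
        f"tile_based_interface_tile_metrics = {tiles}:{metric_set}\n"
    )


# One configuration per run; reboot the board between runs.
configs = {name: make_xrt_ini(name) for name in INTERFACE_METRIC_SETS}

for name, text in configs.items():
    print(f"# xrt.ini for the {name} run")
    print(text)
```

Copy the matching text to xrt.ini next to the host executable before each run, and keep each run's report in its own directory so that the reports can be consolidated afterwards.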

For details on how to enable profiling in hardware and interpreting the results, see Profiling the AI Engine.

The profile results allow you to quickly identify the exact AI Engine, input stream or output stream involved in the design performance drop.

Next Stage:

Proceed to stage 3 if you determine that a PL kernel is causing the performance drop. In stage 3, you can identify the exact PL kernel(s) with the sub-par performance.
Proceed to stage 4 if you determine that an AI Engine kernel is causing the throughput drop.