Viewing Profiling Results Using Vitis Analyzer - 2022.2 English

AI Engine Tools and Flows User Guide (UG1076)

Document ID
UG1076
Release Date
2022-10-19
Version
2022.2 English

To launch vitis_analyzer and view the profiling information in the XRT flow, use the following command:

vitis_analyzer xrt.run_summary

To launch vitis_analyzer and view the profiling information in the XSDB flow, use the following command:

vitis_analyzer aie_trace_profile.run_summary

Example of heat_map Core Metrics and conflicts Memory Metrics

The following image shows the active time, stall time, cumulative instruction count, and vector instruction count from the heat_map metric set, together with the memory conflict time and cumulative memory error time from the conflicts metric set, for ten tiles of an example design.

Figure 1. Example of heat_map and conflicts Metrics

Note: Click on this icon in the upper-right corner to enable/disable charts.

Consider the AI Engine located at (24,2). The stall time (0.043 ms) is 20% of the active time (0.214 ms). During this active time, it performs 179200 vector instructions, which represents 95% of the active time. This is excellent performance and indicates a well-optimized core.
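The ratio quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the tool: the function name stall_ratio is hypothetical, and the two inputs are the stall and active times read from the table for the AI Engine at (24,2).

```python
def stall_ratio(stall_ms: float, active_ms: float) -> float:
    """Return stall time as a fraction of active time."""
    if active_ms <= 0:
        raise ValueError("active_ms must be positive")
    return stall_ms / active_ms


# Values quoted in the text for the AI Engine at (24,2).
ratio = stall_ratio(0.043, 0.214)
print(f"stall/active = {ratio:.1%}")  # prints "stall/active = 20.1%"
```

The same helper can be applied to every row of the table to spot tiles whose stall share is noticeably higher than that of their neighbors.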

Example of stalls Core Metrics and dma_locks Memory Metrics

The following image shows the memory stall time, stream stall time, cascade stall time, and lock stall time from the stalls metric set, together with the cumulative DMA activity time and cumulative DMA locks count from the dma_locks metric set, for ten tiles of an example design.

Figure 2. Example of stalls and dma_locks Metrics

On core (24,2), the DMA has been active for 70.645 ms (77.8 million instructions), but has been stalled 298 times.

Example of execution Core Metrics and conflicts Memory Metrics

The following image shows the cumulative instruction count, vector instruction count, load instruction count, and store instruction count from the execution metric set, together with the memory conflict time and cumulative memory error time from the conflicts metric set, for ten tiles of an example design.

Figure 3. Example of execution and conflicts Metrics

Core (24,2) suffers from a few memory conflicts. Although they are minor, they must be identified. Because they occur so rarely, they might be caused by interference from DMA accesses or from accesses by another kernel.

Example of read_bandwidths and write_bandwidths AI Engine Metrics and dma_stalls_s2mm and dma_stalls_mm2s AI Engine Memory Metrics

The following image shows the stream and cascade read and write instruction counts from the read_bandwidths and write_bandwidths metric sets, together with the s2mm and mm2s channel 0 and channel 1 stall times from the dma_stalls_s2mm and dma_stalls_mm2s metric sets, for ten tiles of an example design.

Figure 4. Data table for read and write bandwidths of the AI Engines and stalls of all mm2s and s2mm channels of the AI Engine Memories
Figure 5. Charts for Cascade Read and Stream Read Instruction Time in Percent

These charts show that cascade reads and stream reads occur more than 45% of the time in the AI Engine kernels. This is necessary to keep the AI Engine active, because the stream bandwidth is much lower than the memory bandwidth.

Example of heat_map Core Metrics and dma_locks Memory Metrics

The following image shows the active time, stall time, cumulative instruction count, and vector instruction count from the heat_map metric set, together with the cumulative DMA activity time and cumulative DMA locks count from the dma_locks metric set, for ten tiles of an example design.

Figure 6. Example of heat_map and dma_locks Metrics

The cumulative DMA activity time, taken together with the cumulative DMA locks count, lets you see whether there is any discrepancy between the number of lock acquisitions and the amount of data transferred through the DMAs. The relative lock counts can also be used to infer the relative number of iterations of each core.
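One way to read relative iteration counts out of the lock counts is to normalize every tile against the smallest nonzero count. The sketch below assumes the counts have been copied by hand from the table. The function name relative_iterations and the sample values are hypothetical and are not taken from the figure.

```python
def relative_iterations(lock_counts: dict) -> dict:
    """Scale each tile's lock count by the smallest nonzero count."""
    nonzero = [count for count in lock_counts.values() if count > 0]
    if not nonzero:
        raise ValueError("no tile has a nonzero lock count")
    baseline = min(nonzero)
    return {tile: count / baseline for tile, count in lock_counts.items()}


# Hypothetical lock counts keyed by (column, row); not real profiling data.
locks = {(24, 2): 400, (24, 3): 800, (25, 2): 400}
relative = relative_iterations(locks)
for tile, factor in sorted(relative.items()):
    print(tile, factor)
```

With these sample values, tile (24,3) acquires locks twice as often as the other two tiles, which would suggest that it runs twice as many iterations if every iteration acquires the same number of locks.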

Example of input_bandwidths Interface Metrics

The following image shows the input bandwidth at the PLIO level, from the input_bandwidths:0 metric set, in an 8 x 8 cascaded tiles design.

Figure 7. Example of input_bandwidths:0 Interface Metrics

In this graph, the channel 0 bandwidth for all input PLIOs is approximately 95%, which is close to the achievable maximum. After this profiling step, verify that the AI Engines are not starving for data.

Report Consolidation in Vitis Analyzer

Not all metrics can be collected in the same run. You can run the design in hardware multiple times, rebooting the board between runs and selecting a different profile metric set in xrt.ini for each run. Typically, for AI Engine interface bandwidth profiling, only a single channel (the same one for all PLIOs) can be profiled in a given run. Profiling multiple channels requires multiple runs.
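Because each run covers one interface channel, it can help to list the runs before going to hardware. The following sketch only builds the list of metric-set strings, one per run, in the metric:channel form of the input_bandwidths:0 example shown earlier. The helper name plan_interface_runs is hypothetical, and the sketch does not write xrt.ini or launch anything.

```python
def plan_interface_runs(metric_set: str, channels) -> list:
    """Return one 'metric_set:channel' string per hardware run."""
    unique_channels = sorted(set(channels))
    for channel in unique_channels:
        if channel < 0:
            raise ValueError(f"invalid channel: {channel}")
    return [f"{metric_set}:{channel}" for channel in unique_channels]


# One run per channel of interest; duplicates are collapsed.
runs = plan_interface_runs("input_bandwidths", [4, 0, 4])
for index, setting in enumerate(runs, start=1):
    print(f"run {index}: {setting}")
```

Each string in the list corresponds to one board reboot and one run of the design.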

vitis_analyzer can consolidate multiple reports from different runs of the same design. This lets you display, for example, the bandwidth of multiple interface channels. After vitis_analyzer is opened with the xrt.run_summary of one run of the design, other xrt.run_summary reports can be added by clicking the + button in the main toolbar or in a window toolbar, as shown below.

Figure 8. The Add (+) Button in the Main Toolbar
Figure 9. The Add (+) Button in a Window Toolbar

After consolidating the profiling data for input PLIO channels 0 and 4 and output PLIO channel 0, vitis_analyzer can display the following table:

Figure 10. Channel 0 and 4 Input and Channel 0 Output PLIO Bandwidth