Viewing the Run Summary in the Vitis Analyzer - 2021.2 English

Versal ACAP AI Engine Programming Environment User Guide (UG1076)

Document ID
UG1076
Release Date
2021-12-17
Version
2021.2 English

After the system runs, whether in simulation, hardware emulation, or hardware, a run_summary report is generated, provided the application has been properly configured.

During simulation of the AI Engine graph, the AI Engine simulator or hardware emulation captures performance and activity metrics and writes the report to its output directory: ./aiesimulator_output for the AI Engine simulator, or ./sim/behav_waveform/xsim for hardware emulation. The generated summary is called default.aierun_summary.

The run_summary can be viewed in the Vitis analyzer. The summary contains a collection of reports that capture the performance profile of the AI Engine application as it runs. For example, to open the AI Engine simulator run summary, use the following command:

vitis_analyzer ./aiesimulator_output/default.aierun_summary
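
Because the summary can land in either of the two output directories named above, a small wrapper can locate it before launching the tool. The following is a minimal sketch, not part of the tool flow: the helper name find_run_summary is hypothetical, and it assumes the summary file is named default.aierun_summary in both directories, as the text above states.

```shell
# Print the path of the first run summary that exists; return 1 if none is found.
# find_run_summary is a hypothetical helper name; the two directories are the
# output locations named in this section.
find_run_summary() {
  for dir in ./aiesimulator_output ./sim/behav_waveform/xsim; do
    if [ -f "$dir/default.aierun_summary" ]; then
      printf '%s\n' "$dir/default.aierun_summary"
      return 0
    fi
  done
  return 1
}

# Open the summary only when both the file and the tool are available.
if summary=$(find_run_summary) && command -v vitis_analyzer >/dev/null 2>&1; then
  vitis_analyzer "$summary"
else
  echo "No run summary found, or vitis_analyzer is not on PATH." >&2
fi
```

The simulator directory is checked first, so a hardware emulation summary is opened only when no AI Engine simulator summary is present. Reorder the list if the opposite preference suits your flow.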

The Vitis analyzer opens and displays the Summary page of the report. The Report Navigator view of the tool lists the different reports that are available in the summary. For a complete understanding of the Vitis analyzer, see Using the Vitis Analyzer in the Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393).

Note: The default.aierun_summary also contains some of the same reports as <GRAPH_TB_FILE_NAME>.aiecompile_summary. These reports are Graph and Array. For details on those reports, see Viewing Compilation Results in the Vitis Analyzer.

Report Summary

This is the top level of the report. It shows the details of the run, such as the date, tool version, and the command line used to launch the simulator.

Profile Summary

When the aiesimulator --profile option is specified, the simulator collects profiling data on the AI Engine graph and kernels. The Profile Summary presents a high-level view of the AI Engine graphs and the kernels mapped to processors, with tables and graphical presentation of metric data.
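
As a minimal sketch of a profiled run, the following wraps the simulator call in a function and guards it so the script does not fail on a machine where the tools are not set up. The --profile option is the one described above; --pkg-dir=./Work is borrowed from the simulator command shown later in this section and is an assumption about where your compiled graph lives. The function name run_profiled_sim and the PKG_DIR variable are hypothetical.

```shell
# Directory holding the compiled graph; ./Work is assumed, override as needed.
PKG_DIR=${PKG_DIR:-./Work}

# Run the AI Engine simulator with profiling enabled (hypothetical wrapper).
run_profiled_sim() {
  aiesimulator --pkg-dir="$PKG_DIR" --profile
}

# Only launch the simulator when it is actually available on PATH.
if command -v aiesimulator >/dev/null 2>&1; then
  run_profiled_sim
else
  echo "aiesimulator is not on PATH; set up the tool environment first." >&2
fi
```

After the run completes, the Profile Summary appears among the reports in default.aierun_summary.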

The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the application is grouped into categories. The Profile Summary lets you examine processor/DMA memory stalls, deadlock, interference, critical paths, and maximum contention. This is useful for system-level performance tuning and debug. System performance is presented in terms of latency (number of cycles taken to execute the system) and throughput (data/time taken). Sub-optimal system performance forces you to examine and control (through constraints) mapping and buffer packing, stream and packet switch allocation, interaction with neighboring processors, and external interfaces. An example of the raw Profile Summary report is shown in the following figure.
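
To make the throughput definition (data/time taken) concrete, the following sketch converts a byte count and a cycle count into MB/s. It is an illustration only: the function name throughput_mb_per_s is hypothetical, the clock frequency must be supplied by you, and the numbers in the example call are made up rather than taken from a real report.

```shell
# throughput_mb_per_s BYTES CYCLES CLOCK_MHZ
# Elapsed time in microseconds is CYCLES / CLOCK_MHZ, so bytes per microsecond
# (equal to MB/s, with 1 MB = 10^6 bytes) is BYTES * CLOCK_MHZ / CYCLES.
throughput_mb_per_s() {
  awk -v b="$1" -v c="$2" -v f="$3" 'BEGIN { printf "%.1f\n", b * f / c }'
}

# Hypothetical numbers: 4096 bytes moved in 2048 cycles at a 1000 MHz clock.
throughput_mb_per_s 4096 2048 1000   # prints: 2000.0
```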

Figure 1. Profile Summary

Note: The row value in a tile name from the profile report is one higher than the actual tile row. For example, tile_38_1 in the previous screenshot is tile(38, 0).
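
The off-by-one in the row value is easy to forget when cross-referencing the profile report with the Array view. A minimal sketch of the conversion follows, assuming profile names always have the form tile_<column>_<row+1> as in the example; the function name profile_tile_to_actual is hypothetical.

```shell
# Convert a profile-report tile name (tile_<col>_<row+1>) to the actual
# coordinate tile(<col>, <row>). Hypothetical helper, POSIX shell only.
profile_tile_to_actual() {
  name=${1#tile_}     # strip the "tile_" prefix, e.g. "38_1"
  col=${name%%_*}     # text before the first underscore, e.g. "38"
  row=${name#*_}      # text after the first underscore, e.g. "1"
  printf 'tile(%s, %s)\n' "$col" "$((row - 1))"
}

profile_tile_to_actual tile_38_1   # prints: tile(38, 0)
```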

Individual tables show profile information specific to the kernels. This information is presented as a chart together with a table showing what is running on the tiles. The following is an example chart.

Figure 2. Example Chart

In this view, the chart shows the Total Function Time, which is the total number of cycles the function used while running the graph. The y-axis shows the ID of the function, which can be looked up in the table that follows the chart. This information can be useful in determining where time is being spent in a function and helps with potential optimization or debug.

Trace Report

Issues such as missing or mismatching locks, buffer overruns, and incorrect programming of DMA buffers are difficult to debug using traditional interactive debug techniques. Event trace provides a systematic way of collecting system level traces for the program events, providing direct support for generation, collection, and streaming of hardware events as a trace. The following image shows the Trace report open in the Vitis analyzer.

Figure 3. Trace Report

Note: This example illustrates a kernel function and the functions that are added by the compiler:
_main
Core main function. This is different from the function used in the top-level file.
_main_init
Kernel init function that runs once per graph execution.
_cxa_finalize
Calls destructors of global C++ objects.
_fini
This section holds executable instructions that terminate the process. When a program exits normally, the system runs the code in this section.

Note: If the VCD file is so large that the Vitis analyzer takes too long to analyze it and open the Trace view, you can perform an online analysis of the VCD while running the AI Engine simulator. The Vitis analyzer then opens the existing WDB and CTF files instead of analyzing the VCD file. The command for the AI Engine simulator is as follows.
aiesimulator --pkg-dir=./Work --online -wdb -ctf

Features of the trace report include the following.

  • Each tile is reported. Within each tile the report includes core, DMA, locks, and I/O if there are PL blocks in the graph.
  • There is a separate timeline for each kernel mapped to a core. It shows when the kernel is executing (blue) or stalled (red) due to memory conflicts or waiting for stream data.
  • By using lock IDs in the core, DMA, and locks sections, you can identify how cores and DMAs interact with one another by acquiring and releasing locks.
  • The lock section shows the activities of the locks in the tile, both the allocation and release for read and write lock requests. A particular lock can be allocated by nearby tiles. Thus, this section does not necessarily match the core lock requests of the core shown in the left pane of the image.
  • If a lock is not released, a red bar extends through the end of simulation time.
  • Clicking the left or right arrows takes you to the start and end of a state, respectively.
  • The data view shows the data flowing through the stream switch network, with slave entry points and master exit points at each hop. This is most useful for finding routing delays, as well as network congestion effects with packet switching, where one packet might be delayed behind another packet when sharing the same stream channel.