After running the system, whether in simulation, hardware emulation, or in hardware, a run_summary report is generated when the application has been properly configured.
During simulation of the AI Engine graph, the AI Engine simulator or hardware emulation captures performance and activity metrics and writes the report to the output directories ./aiesimulator_output and ./sim/behav_waveform/xsim, respectively. The generated summary is called default.aierun_summary.
The run_summary can be viewed in the Vitis Analyzer. The summary contains a collection of reports that capture the performance profile of the AI Engine application as it runs. For example, the AI Engine simulator run summary can be opened from the command line.
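A minimal sketch of that invocation, assuming the vitis_analyzer executable is on your PATH and the summary sits in the default ./aiesimulator_output directory named above. The existence checks are an illustrative addition, not something the tool requires:

```shell
# Assumed location: default simulator output directory and summary name.
SUMMARY=./aiesimulator_output/default.aierun_summary

# Open the summary only when both the tool and the file are available.
if command -v vitis_analyzer >/dev/null 2>&1 && [ -e "$SUMMARY" ]; then
    vitis_analyzer "$SUMMARY"
else
    echo "vitis_analyzer or $SUMMARY not found" >&2
fi
```

For a hardware emulation run, point SUMMARY at the copy under ./sim/behav_waveform/xsim instead.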
The Vitis Analyzer opens displaying the Summary page of the report. The Report Navigator view of the tool lists the different reports that are available in the summary. For a complete understanding of the Vitis Analyzer, see Using the Vitis Analyzer in the Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393).
default.aierun_summary also contains some of the same reports as <GRAPH_TB_FILE_NAME>.aiecompile_summary, namely the Graph and Array reports. For details on those reports, see Viewing Compilation Results in the Vitis Analyzer.
The Summary page is the top level of the report. It lists the details of the run, such as the date, tool version, and the command line used to launch the simulator.
When the aiesimulator --profile option is specified, the simulator collects profiling data on the AI Engine graph and kernels. The resulting report presents a high-level view of the AI Engine graphs and the kernels mapped to processors, with tables and graphic presentation of metric data.
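As a sketch of a profiling run, the --profile option can be combined with the --pkg-dir option used elsewhere in this section. The ./Work directory and the PATH check are assumptions for illustration; adjust them to your own build:

```shell
# Assumed compiler output directory, matching the trace example in this section.
PKG_DIR=./Work

# Run the simulator with profiling enabled if the tool is installed.
if command -v aiesimulator >/dev/null 2>&1; then
    aiesimulator --pkg-dir="$PKG_DIR" --profile
else
    echo "aiesimulator not found on PATH" >&2
fi
```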
The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the application is grouped into categories. The Profile Summary lets you examine processor/DMA memory stalls, deadlock, interference, critical paths, and maximum contention. This is useful for system-level performance tuning and debugging. System performance is presented in terms of latency (the number of cycles taken to execute the system) and throughput (data divided by the time taken). When system performance is sub-optimal, examine and control (through constraints) mapping and buffer packing, stream and packet switch allocation, interaction with neighboring processors, and external interfaces. An example of the raw Profile Summary report is shown.
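To make the relationship between latency and throughput concrete, the following sketch converts a cycle count into a throughput figure. All three input values are hypothetical placeholders, and the clock frequency in particular is an assumption that depends on your device and settings:

```shell
# Hypothetical values for illustration only; read the real ones from your report.
CYCLES=2000       # latency in cycles
BYTES=8192        # amount of data processed in that time
CLOCK_MHZ=1000    # assumed AI Engine clock frequency in MHz

# time in microseconds = cycles / MHz; throughput = bytes per microsecond,
# which equals MB/s when 1 MB is taken as 10^6 bytes.
THROUGHPUT_MBPS=$(awk -v c="$CYCLES" -v b="$BYTES" -v f="$CLOCK_MHZ" \
    'BEGIN { printf "%.1f", b / (c / f) }')

echo "Throughput: $THROUGHPUT_MBPS MB/s"
```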
Specific tables can be used to see profile information specific to the kernels. This is shown as a chart with a table showing what is running on the tiles. The following is an example chart.
In this view, the chart shows the Total Function Time, which is the total number of cycles the function used while running the graph. The y-axis shows the ID of the function, which can be looked up in the following table. This information can be useful in determining where time is being spent in a function and helps with potential optimization or debugging.
Issues such as missing or mismatching locks, buffer overruns, and incorrect programming of DMA buffers are difficult to debug using traditional interactive debug techniques. Event trace provides a systematic way of collecting system level traces for the program events, providing direct support for generation, collection, and streaming of hardware events as a trace. The following image shows the Trace report open in the Vitis Analyzer.
- main function. This is different from the function used in the top-level file.
- init function that runs once per graph execution.
- Calls destructors of global C++ objects.
- This section holds executable instructions that terminate the process. When a program exits normally, the system runs the code in this section.
aiesimulator --pkg-dir=./Work --online -wdb -ctf
Features of the trace report include the following.
- Each tile is reported. Within each tile the report includes core, DMA, locks, and I/O if there are PL blocks in the graph.
- There is a separate timeline for each kernel mapped to a core. It shows when the kernel is executing (blue) or stalled (red) due to memory conflicts or waiting for stream data.
- By using lock IDs in the core, DMA, and locks sections you can identify how cores and DMAs interact with one another by acquiring and releasing locks.
- The lock section shows the activities of the locks in the tile, both the allocation and release for read and write lock requests. A particular lock can be allocated by nearby tiles. Thus, this section does not necessarily match the core lock requests of the core shown in the left pane of the image.
- If a lock is not released, a red bar extends through the end of simulation time.
- Clicking the left or right arrows takes you to the start and end of a state, respectively.
- The data view shows the data flowing through the stream switch network, with slave entry points and master exit points at each hop. This is most useful for finding routing delays and network congestion effects with packet switching, where one packet might be delayed behind another when both share the same stream channel.