After running the system, whether in simulation, hardware emulation, or in hardware, a run_summary report is generated when the application has been properly configured.
During simulation of the AI Engine graph, the AI Engine simulator captures performance and activity metrics and writes the report to the output directory ./aiesimulator_output. The AI Engine simulator run_summary is named default.aierun_summary.
The run_summary can be viewed in the Vitis analyzer. The summary contains a collection of reports that capture the performance profile of the AI Engine application as it runs. For example, to open the AI Engine simulator run summary, pass the default.aierun_summary file to the vitis_analyzer command.
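As a minimal sketch, assuming the simulator was run from the current directory and the vitis_analyzer executable is on the PATH, the run summary can be opened as shown here. The path is assembled from the output directory and file name given above; the existence checks and messages are illustrative additions, not part of the tool.

```shell
#!/bin/sh
# Path built from the output directory and file name stated in the text.
SUMMARY="./aiesimulator_output/default.aierun_summary"

if [ ! -f "$SUMMARY" ]; then
    # Nothing to open yet: the simulator has not produced a summary here.
    echo "run summary not found: $SUMMARY"
elif command -v vitis_analyzer >/dev/null 2>&1; then
    # Open the summary in the Vitis analyzer.
    vitis_analyzer "$SUMMARY"
else
    echo "vitis_analyzer was not found on PATH"
fi
```

Checking for the file first avoids launching the analyzer when the simulation was run from a different directory.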
The Vitis analyzer opens displaying the Summary page of the report. The Report Navigator view of the tool lists the different reports that are available in the summary. For a complete understanding of the Vitis analyzer, see Using the Vitis Analyzer in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416).
Note: Set the $AIE_COMPILER_WORKDIR environment variable prior to launching hardware emulation. This ensures that the correct path is set in the run_summary file, which the Vitis analyzer uses to locate, process, and display trace data. If the environment variable is not specified, the Vitis analyzer looks for the ./Work directory inside the current directory and uses the first one found.
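A sketch of setting the variable in a POSIX shell before launching hardware emulation. It assumes the AI Engine compiler wrote its Work directory into the current directory; on a real project, use whichever directory the compiler actually produced. The launch step is left as a comment because it is project-specific.

```shell
#!/bin/sh
# Assumption: the AI Engine compiler output directory is ./Work under the
# current directory. An absolute path keeps it valid from any directory.
AIE_COMPILER_WORKDIR="$(pwd)/Work"
export AIE_COMPILER_WORKDIR

echo "AIE_COMPILER_WORKDIR=$AIE_COMPILER_WORKDIR"

# Launch hardware emulation from this same shell so that it inherits the
# exported variable (the project-specific launch command goes here).
```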
The listed reports include:
- Summary: This is the top level of the report. It gives the details of the run, such as the date, tool version, and the command line used to launch the simulator.
- Profile: When the aiesimulator --profile option is specified, the simulator collects profiling data on the AI Engine graph and kernels. The report presents a high-level view of the AI Engine graphs and the kernels mapped to processors, with tables and graphical presentation of metric data.
The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the application is grouped into categories. The Profile Summary lets you examine processor/DMA memory stalls, deadlock, interference, critical paths, and maximum contention. This is useful for system-level performance tuning and debug. System performance is presented in terms of latency (number of cycles taken to execute the system) and throughput (data/time taken). Sub-optimal system performance forces you to examine and control (through constraints) mapping and buffer packing, stream and packet switch allocation, interaction with neighboring processors, and external interfaces. An example of the Profile Summary report is shown:
- Trace: Problems such as missing or mismatching locks, buffer overruns, and incorrect programming of DMA buffers are difficult to debug using traditional interactive debug techniques. Event trace provides a systematic way of collecting system-level traces for the program events, providing direct support for generation, collection, and streaming of hardware events as a trace. The following image shows the Trace report open in the Vitis analyzer.
Features of the trace report include:
- Each tile is reported. Within each tile, the report includes the core, DMA, locks, and, if there are PL blocks in the graph, I/O.
- There is a separate timeline for each kernel mapped to a core. It shows when the kernel is executing (blue) or stalled (red) due to memory conflicts or waiting for stream data.
- By using lock IDs in the core, DMA, and locks sections you can identify how cores and DMAs interact with one another by acquiring and releasing locks.
- The lock section shows the activities of the locks in the tile, both the allocation and release for read and write lock requests. A particular lock can be allocated by nearby tiles. Thus, this section does not necessarily match the core lock requests of the core shown in the left pane of the image.
- If a lock is not released, a red bar extends through the end of simulation time.
- Clicking the left or right arrows takes you to the start and end of a state, respectively.
- The data view shows the data flowing through the stream switch network, with slave entry points and master exit points at each hop. This is most useful for finding routing delays and network congestion effects with packet switching, where one packet might be delayed behind another when both share the same stream channel.
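The Profile report described earlier is only populated when the simulator runs with --profile. The following is a minimal sketch of such a run, assuming the compiled graph package is in ./Work and is passed with the simulator's --pkg-dir option; the PKG_DIR and SIM_CMD variable names are illustrative. When the simulator or the package directory is not available, the script only prints the command it would have run.

```shell
#!/bin/sh
# Assumption: the compiled AI Engine graph package is in ./Work.
PKG_DIR="./Work"
SIM_CMD="aiesimulator --pkg-dir=$PKG_DIR --profile"

if command -v aiesimulator >/dev/null 2>&1 && [ -d "$PKG_DIR" ]; then
    # Run the simulation with profiling enabled; the run summary is
    # written under ./aiesimulator_output as described above.
    $SIM_CMD
else
    echo "would run: $SIM_CMD"
fi
```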