Graph Performance Measurement

Vitis Tutorials: AI Engine (XD100)

Document ID: XD100
Release Date: 2024-03-05
Version: 2023.2 English

There are multiple ways to measure performance:

The AI Engine simulator output contains a timestamp for each piece of output data, so performance can be calculated either manually or with a script. For example, the output of this example (aiesimulator_output/data/output.txt) looks like the following:
```
T 652800 ps
2 
......
T 10889600 ps
30 
```
The first samples come out at 652800 ps, and the last samples come out at 10889600 ps. The throughput can therefore be calculated as follows:
```
 Total time = 10889600 - 652800 = 10236800 ps
 Total bytes = 128 * 100 = 12800 bytes
 Throughput = 12800/(10236800*1e-6) = 1250.4 MB/s
```
This method does not measure the latency of the first kernel execution to produce the output data. Make sure that the graph runs enough iterations that this overhead can be neglected.
AMD provides event APIs for performance profiling. These APIs use the performance counters in shim tiles. This tutorial introduces the following enumerations:
- event::io_stream_start_to_bytes_transferred_cycles: One performance counter records the start of a running event, and a second performance counter records the event that a specified amount of data has been transferred. The value returned with this enumeration is therefore the total number of cycles required to receive that amount of data. The profiled stream should be stopped after this amount of data has been transferred.
- event::io_stream_running_event_count: This counts how many running events have occurred between start_profiling and read_profiling. It can be used to count how much data has been transferred, whether or not the graph runs infinitely.
Take a look at aie/graph.cpp. The code to perform profiling is as follows:
```
int iterations = 100;
int bytes_per_iteration = 128;
int total_bytes = bytes_per_iteration * iterations;

// Count the cycles from the start of stream activity on dout until total_bytes have been transferred.
event::handle handle = event::start_profiling(*dout, event::io_stream_start_to_bytes_transferred_cycles, total_bytes);
if (handle == event::invalid_handle) {
    printf("ERROR:Invalid handle. Only two performance counter in a AIE-PL interface tile\n");
    return 1;
}

gr.run(iterations);
gr.wait();

long long cycle_count = event::read_profiling(handle);
std::cout << "cycle count:" << cycle_count << std::endl;
event::stop_profiling(handle); // Release the performance counters.

// Each AI Engine cycle is 0.8 ns (1.25 GHz) on a production board.
double throughput = (double)total_bytes / (cycle_count * 0.8 * 1e-3);
printf("Throughput of the graph: %f MB/s\n", throughput);
```
The output of AI Engine simulator looks like the following:
```
cycle count:12665
Throughput of the graph: 1263.324122 MB/s
```
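As a sanity check, the reported throughput can be reproduced from the cycle count alone and compared with the timestamp-based figure. This is a small sketch with hypothetical helper names (`throughput_mbps`, `relative_difference`). It assumes the 0.8 ns cycle period used in the code above, which holds only when the AI Engine runs at 1.25 GHz.

```cpp
#include <cmath>

// Hypothetical helper: throughput in MB/s from a byte count and a cycle count.
// ns_per_cycle is the AI Engine clock period (0.8 ns is assumed in this tutorial).
double throughput_mbps(long long total_bytes, long long cycle_count, double ns_per_cycle) {
    // cycles * ns per cycle gives ns; scaling by 1e-3 gives microseconds.
    double elapsed_us = static_cast<double>(cycle_count) * ns_per_cycle * 1e-3;
    // Bytes per microsecond equals MB/s.
    return static_cast<double>(total_bytes) / elapsed_us;
}

// Hypothetical helper: relative difference of a measurement against a reference.
double relative_difference(double measured, double reference) {
    return std::fabs(measured - reference) / reference;
}
```

For 12800 bytes and 12665 cycles, `throughput_mbps` returns about 1263.32 MB/s, which is roughly 1% above the value derived from the simulator timestamps.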
The event API can be used in the AI Engine simulator, hardware emulation, and hardware flows.
The performance result can also be found in the AI Engine simulator profile report. Add the --profile option to aiesimulator, open Vitis Analyzer, and open the Profile view. The profile result can be viewed as shown in the following figure:

Kernel aie_dest1 takes 8368 cycles for 100 iterations. The main function takes 4383 cycles for 100 iterations of the graph, which is about 44 cycles of overhead for each iteration. This overhead includes acquiring the buffer locks and the cost of the API calls.

The performance of aie_dest1 is bounded by the stream interface throughput. The theoretical limit is 4 bytes per cycle (5 GB/s), and each run consumes 128 bytes of input, so the main loop needs at least 32 cycles (128 / 4). It actually takes 80 cycles, which indicates that the loop is not well pipelined.
Check performance-related events to see whether they match expectations. For example, check the PL_TO_SHIM event to see whether the PL sends data at the best achievable performance for a single stream interface. Look for the PL_TO_SHIM event in the Trace view in Vitis Analyzer:

Search for the event several times. Once the pattern is stable, the event occurs once every eight cycles. This is because the PL frequency has been set to 312.5 MHz with the AI Engine compiler option --pl-freq=312.5 and the AI Engine-PL interface is 32 bits wide, which gives one-fourth of the best achievable performance.

For the best performance and resource usage, you might select a 64-bit interface at 625 MHz if timing allows. If not, the PL can run at 312.5 MHz with a 128-bit interface.
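The trade-off between interface width and PL frequency follows from a one-line bandwidth calculation. The sketch below uses a hypothetical helper name and assumes that the interface transfers one full word on every PL clock cycle.

```cpp
// Hypothetical helper: peak bandwidth in MB/s of a PL interface that
// transfers width_bits on every PL clock cycle at pl_freq_mhz.
double pl_interface_bandwidth_mbps(int width_bits, double pl_freq_mhz) {
    // (bits / 8) bytes per cycle times millions of cycles per second gives MB/s.
    return (width_bits / 8.0) * pl_freq_mhz;
}
```

Under that assumption, a 32-bit interface at 312.5 MHz peaks at 1250 MB/s, whereas both 64 bits at 625 MHz and 128 bits at 312.5 MHz reach 5000 MB/s, which matches the 5 GB/s stream limit quoted earlier.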
The following methods are introduced in AI Engine Performance Profile:
- Profiling by C++ class API
- Profiling by AI Engine cycles from AI Engine kernels
- Profiling by event API