Profiling Kernel Code - 2022.2 English

AI Engine Kernel and Graph Programming Guide (UG1079)

Document ID: UG1079
Release Date: 2022-10-19
Version: 2022.2 English
Every AI Engine has a 64-bit cycle counter. The AI Engine API class aie::tile provides the cycles() method to read the counter value. For example:
aie::tile tile=aie::tile::current(); //get the tile of the kernel
unsigned long long time=tile.cycles(); //read the cycle counter of the tile
The counter runs continuously, and there is no limit on how many times it can be read. The value read back by the kernel can be written to memory, or it can be streamed out for further analysis. For example, to profile the latency of the code below, the counter value is read before the code being profiled, and again after the code has run:
aie::tile tile=aie::tile::current();
unsigned long long time=tile.cycles(); //first time
writeincr(out,time);

for(...){...}

time=tile.cycles();
writeincr(out,time); //second time

The latency of the loop in the kernel can then be determined in the host application by subtracting the first counter value from the second.
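
The host-side subtraction can be sketched as follows. This is a minimal illustration, not part of the AI Engine API: the function names elapsed_cycles and cycles_to_ns are hypothetical, and the clock frequency must be supplied by the caller for the actual device. Because the counter is read as an unsigned 64-bit value, the subtraction is performed modulo 2^64, so the result stays correct even if the counter wraps once between the two readings.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side helper: elapsed cycles between two readings of the
// free-running 64-bit counter. Unsigned subtraction is modulo 2^64, so a
// single wrap-around between the readings still yields the right difference.
std::uint64_t elapsed_cycles(std::uint64_t first, std::uint64_t second) {
    return second - first;
}

// Hypothetical helper: convert a cycle count to nanoseconds.
// clock_hz is an assumption provided by the caller (the AI Engine clock
// frequency of the target design); it is not read from the device here.
double cycles_to_ns(std::uint64_t cycles, double clock_hz) {
    assert(clock_hz > 0.0);
    return static_cast<double>(cycles) * 1e9 / clock_hz;
}
```

Note that in the kernel code above, the first writeincr sits between the two counter reads, so the measured difference includes the cost of that stream write in addition to the loop itself.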

Latency can also be calculated by comparing the counter values read back from different executions of the kernel, or from different iterations of a loop. For example, the following code measures the latency of certain operations on an asynchronous window:

aie::tile tile=aie::tile::current();
for(...){//outer loop
  unsigned long long time=tile.cycles(); //read counter value
  writeincr(out,time);
  window_acquire(win_in);
  for(...){...} //inner loop
  window_release(win_in);
}

The combined latency of acquiring and releasing the asynchronous window, plus the inner loop execution time, can then be calculated by subtracting the counter value written in one outer-loop iteration from the value written in the next iteration.
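
On the host, the sequence of counter values streamed out by the outer loop can be turned into per-iteration latencies by taking differences of consecutive values. The following is a minimal sketch under the assumption that the host has already collected the values into a vector in the order they were written; the function name iteration_latencies is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side helper: given counter values written once per
// outer-loop iteration, return the cycle count between each pair of
// consecutive iterations. N readings produce N-1 latencies.
std::vector<std::uint64_t> iteration_latencies(
        const std::vector<std::uint64_t>& stamps) {
    std::vector<std::uint64_t> deltas;
    if (stamps.size() < 2) {
        return deltas;  // fewer than two readings: nothing to compare
    }
    deltas.reserve(stamps.size() - 1);
    for (std::size_t i = 1; i < stamps.size(); ++i) {
        // Unsigned subtraction (modulo 2^64) tolerates one counter wrap.
        deltas.push_back(stamps[i] - stamps[i - 1]);
    }
    return deltas;
}
```

Each resulting value covers one full outer-loop iteration: the stream write, the window acquire, the inner loop, and the window release.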

The counter value can also be written to data memory. The value can then be read back with printf in simulation, or by host code in hardware. If the stored value is not used by any other code, the volatile qualifier can be used to force the counter value to be stored. This qualifier ensures that compiler optimizations do not remove the writes to this variable. For example:
static unsigned long long cycle_num[2];
aie::tile tile=aie::tile::current();
volatile unsigned long long *p_cycle=cycle_num;
*p_cycle=tile.cycles();//cycle_num[0]

for(...){...}

*(p_cycle+1)=tile.cycles();//cycle_num[1]
printf("cycles=%llu\n",cycle_num[1]-cycle_num[0]); //%llu matches unsigned long long