The following tables list the predefined metric set configurations available
for AI Engine, in the order of priority in which they are assigned to the available
counters.
Table 1. Heat_map
Metric Name | Event ID | Description |
Active Time | 28 | Time the AI Engine was active since it was enabled. |
Stall Time | 22 | Time the AI Engine was stalled. This includes AI Engine memory, stream, cascade, and lock stalls. |
Vector Instruction Time | 37 | Time the AI Engine spent executing instructions in the vector processor. |
Cumulative Instruction Time | 32 | Time the AI Engine spent executing load/store, stream get/put, and lock acquire/release instructions. |
These indicators help you understand the efficiency of the kernels
implemented in the AI Engines. Comparing stall time with active time for each
AI Engine shows whether that AI Engine has a data communication issue.
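The comparison described above can be sketched as a small post-processing step. This is a minimal illustration, assuming you already have per-tile Stall Time (event 22) and Active Time (event 28) values as plain numbers; the tile names, the cycle counts, and the 0.25 threshold are all hypothetical and are not produced by or documented for any AMD tool.

```python
# Sketch: compare Stall Time (event 22) with Active Time (event 28) per tile.
# Tile names and cycle counts are hypothetical, not real profiler output.
heat_map = {
    "tile_24_0": {"active_time": 10_000, "stall_time": 1_200},
    "tile_25_0": {"active_time": 10_000, "stall_time": 6_500},
    "tile_26_0": {"active_time": 8_000, "stall_time": 400},
}

# Assumed cut-off for "worth investigating"; not a documented value.
STALL_THRESHOLD = 0.25


def stall_to_active_ratio(active_time: int, stall_time: int) -> float:
    """Ratio of stall time to active time; 0.0 if the tile never ran."""
    if active_time == 0:
        return 0.0
    return stall_time / active_time


def communication_suspects(counters: dict, threshold: float = STALL_THRESHOLD) -> list:
    """Return, sorted by name, the tiles whose stall/active ratio exceeds threshold."""
    return sorted(
        tile
        for tile, c in counters.items()
        if stall_to_active_ratio(c["active_time"], c["stall_time"]) > threshold
    )


if __name__ == "__main__":
    for tile, c in heat_map.items():
        ratio = stall_to_active_ratio(c["active_time"], c["stall_time"])
        print(f"{tile}: stall/active = {ratio:.2f}")
    print("Possible data communication issue:", communication_suspects(heat_map))
```

With these made-up numbers only `tile_25_0` is flagged. A natural next step would be to profile that tile with the Stalls metric set to find out which kind of stall dominates.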
Table 2. Stalls
Metric Name | Event ID | Description |
Memory Stall Time | 23 | Time the AI Engine was not active due to a memory stall. |
Stream Stall Time | 24 | Time the AI Engine was not active due to a stream stall. |
Lock Stall Time | 26 | Time the AI Engine was in a lock stall. |
Cascade Stall Time | 25 | Time the AI Engine was in a cascade stall. |
A stall in an AI Engine can occur in various situations:
- A memory stall happens when one core, multiple cores, and/or DMAs
request multiple accesses to the same memory bank.
- A stream stall occurs when data is produced and consumed on a
stream at different rates, leading to input stream starvation or output
stream overflow.
- A cascade stall occurs when the cascade writer runs at a different
rate than the cascade reader.
- A lock stall happens when the window data producer does not have
the same iteration rate as the window consumer.
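The list above maps each stall type to a likely cause, so one way to use the Stalls metric set is to find the dominant stall type for a tile and look up its cause. The sketch below assumes the four counter values are already available as numbers keyed by the metric names from the Stalls table; the values are invented, and the share it reports is relative to the sum of these four metrics only.

```python
# Hypothetical Stalls-set readings (cycles) for one AI Engine tile.
# Keys follow the metric names in the Stalls table; values are invented.
stalls = {
    "Memory Stall Time": 300,
    "Stream Stall Time": 4_200,
    "Lock Stall Time": 900,
    "Cascade Stall Time": 0,
}

# Likely cause for each stall type, paraphrasing the list of stall situations.
LIKELY_CAUSE = {
    "Memory Stall Time": "concurrent accesses to the same memory bank (core(s) and/or DMAs)",
    "Stream Stall Time": "producer/consumer rate mismatch on a stream (starvation or overflow)",
    "Lock Stall Time": "window producer and consumer iterate at different rates",
    "Cascade Stall Time": "cascade writer and reader run at different rates",
}


def dominant_stall(readings: dict) -> tuple:
    """Return (metric name, share of the summed stall cycles) for the largest stall metric."""
    total = sum(readings.values())
    if total == 0:
        return ("none", 0.0)
    name = max(readings, key=readings.get)
    return (name, readings[name] / total)


if __name__ == "__main__":
    name, share = dominant_stall(stalls)
    if name == "none":
        print("No stalls recorded")
    else:
        print(f"{name} accounts for {share:.0%} of stall cycles")
        print("Likely cause:", LIKELY_CAUSE[name])
```

Here stream stalls account for about 78% of the recorded stall cycles, which would point toward checking the data rates of the streams connected to this kernel.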
Table 3. Execution
Metric Name | Event ID | Description |
Vector Instruction Time | 37 | Time spent by the AI Engine on vector instructions: vector processor instructions and vector data loads/stores. |
Load Instruction Time | 38 | Time spent by the AI Engine on load instructions (moving data from memory to registers). |
Store Instruction Time | 39 | Time spent by the AI Engine on store instructions (moving data from registers to memory). |
Cumulative Instruction Time | 32 | Time spent by the AI Engine on memory and stream accesses and lock acquire/release. |
Together, these indicators let you estimate the efficiency of your
kernel. To increase efficiency, optimize data access, favor vector instructions
over scalar instructions, and use 128-bit stream accesses whenever possible.
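The benefit of 128-bit stream accesses can be illustrated with simple arithmetic: each access moves four times as many bits as a 32-bit access, so moving the same buffer takes a quarter of the stream instructions. The sketch below only counts instructions; the 4096-byte buffer size is a hypothetical example, and the sketch ignores stall cycles and the actual cycle cost of each access, which are not stated here.

```python
def stream_accesses(num_bytes: int, access_bits: int) -> int:
    """Stream read/write instructions needed to move num_bytes when each
    instruction transfers access_bits bits (rounded up for a partial last access)."""
    if access_bits <= 0 or access_bits % 8 != 0:
        raise ValueError("access_bits must be a positive multiple of 8")
    total_bits = num_bytes * 8
    return (total_bits + access_bits - 1) // access_bits


if __name__ == "__main__":
    buffer_bytes = 4096  # hypothetical buffer size
    narrow = stream_accesses(buffer_bytes, 32)
    wide = stream_accesses(buffer_bytes, 128)
    print(f"32-bit accesses:  {narrow}")    # 1024
    print(f"128-bit accesses: {wide}")      # 256
    print(f"reduction: {narrow // wide}x")  # 4x
```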
Table 4. Floating-Point
Metric Name | Event ID | Description |
Floating-Point Overflow Exception | 50 | Number of floating-point overflow exceptions generated by the AI Engine. |
Floating-Point Underflow Exception | 51 | Number of floating-point underflow exceptions generated by the AI Engine. |
Floating-Point Invalid Exception | 52 | Number of floating-point invalid exceptions generated by the AI Engine. |
Floating-Point Divide by Zero Exception | 53 | Number of floating-point divide-by-zero exceptions generated by the AI Engine. |
Floating-point exceptions lead to erroneous results. You might have
to recode your floating-point algorithm if you get many exceptions, or even a
single exception in a critical area of the code.
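Because even a single exception can matter, a post-processing check can report every floating-point exception type with a nonzero count rather than applying a threshold. The sketch below uses the event IDs from the Floating-Point table; it assumes the counter values are already available as a plain mapping from event ID to count, and the sample values are hypothetical.

```python
# Event IDs and metric names from the Floating-Point table.
FP_EVENTS = {
    50: "Floating-Point Overflow Exception",
    51: "Floating-Point Underflow Exception",
    52: "Floating-Point Invalid Exception",
    53: "Floating-Point Divide by Zero Exception",
}


def fp_exception_report(counts: dict) -> list:
    """List every floating-point exception type with a nonzero count,
    ordered by event ID. Any nonzero count is reported, since even one
    exception in a critical region can corrupt results."""
    return [
        f"{FP_EVENTS[event_id]} (event {event_id}): {count}"
        for event_id, count in sorted(counts.items())
        if event_id in FP_EVENTS and count > 0
    ]


if __name__ == "__main__":
    # Hypothetical counter values for one tile.
    sample = {50: 0, 51: 12, 52: 0, 53: 1}
    for line in fp_exception_report(sample) or ["No floating-point exceptions"]:
        print(line)
```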
Table 5. Stream_put_get
Metric Name | Event ID | Description |
Cascade Read Instruction Time | 42 | Time the AI Engine spent executing read instructions on the cascade stream. |
Cascade Write Instruction Time | 43 | Time the AI Engine spent executing write instructions on the cascade stream. |
Stream Read Instruction Time | 40 | Time the AI Engine spent executing read instructions on data streams. |
Stream Write Instruction Time | 41 | Time the AI Engine spent executing write instructions on data streams. |