Profiling for AI Engine - 2021.2 English

Versal ACAP AI Engine Programming Environment User Guide (UG1076)

Document ID: UG1076
Release Date: 2021-12-17
Version: 2021.2 English

The following tables list the predefined metric set configurations available for the AI Engine, in the order of priority in which they are assigned to the available counters.

Table 1. Heat_map
Metric Name	Event ID	Description
Active Time	28	Time AI Engine was active since it was enabled.
Stall Time	22	Time AI Engine was stalled. This stall includes AI Engine memory, stream, cascade, and lock stalls.
Vector Instruction Time	37	Time AI Engine spent executing instructions in the vector processor.
Cumulative Instruction Time	32	Time AI Engine spent executing load/store, stream get/put, and lock acquire/release instructions.

These indicators help you understand the efficiency of the kernels that are implemented in the AI Engines. You can compare stall time with active time to determine if there is a data communication issue for each AI Engine.
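As a rough illustration of the comparison described above, the following Python sketch computes stall time as a fraction of active time for each AI Engine. The `readings` dictionary, its tile coordinates, the counter values, and the 0.5 threshold are all hypothetical; real values would come from your own profiling report.

```python
def stall_ratio(active_time: float, stall_time: float) -> float:
    """Return stall time as a fraction of active time."""
    if active_time <= 0:
        raise ValueError("active_time must be positive")
    return stall_time / active_time


# Hypothetical Heat_map counter readings, keyed by (column, row) of the tile.
readings = {
    (24, 0): {"active": 10_000, "stall": 1_200},
    (25, 0): {"active": 10_000, "stall": 6_500},
}

THRESHOLD = 0.5  # arbitrary cutoff chosen only for this illustration

for tile, r in readings.items():
    ratio = stall_ratio(r["active"], r["stall"])
    note = "check data communication" if ratio > THRESHOLD else "ok"
    print(f"tile {tile}: stall/active = {ratio:.2f} ({note})")
```

A tile whose ratio stands out from its neighbors is a reasonable first candidate for a closer look with the Stalls metric set.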

Table 2. Stalls
Metric Name	Event ID	Description
Memory Stall Time	23	Time the AI Engine was not active due to a memory stall.
Stream Stall Time	24	Time the AI Engine was not active due to a stream stall.
Lock Stall Time	26	Time the AI Engine was in a lock stall.
Cascade Stall Time	25	Time the AI Engine was in a cascade stall.

A stall in an AI Engine can occur in various situations:

A memory stall happens when multiple accesses to the same memory bank are requested from one core, multiple cores, and/or DMAs.
Stream stalls occur when data production and consumption on a stream do not have the same rate, leading to input stream starvation or output stream overflow.
A cascade stall is generated when the cascade writer does not have the same rate as the cascade reader.
A lock stall happens if the window data producer does not have the same iteration rate as the window consumer.
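
The four stall situations above can be turned into a simple triage step. The sketch below, written under the assumption that you have already extracted the four Stalls counters for one AI Engine into a dictionary, picks the largest category and prints the cause listed in this section. The `sample` values and the dictionary keys are hypothetical.

```python
from typing import Dict, Tuple

# Likely causes, paraphrased from the stall descriptions in this section.
CAUSES = {
    "memory": "multiple accesses to the same memory bank",
    "stream": "mismatched production and consumption rates on a stream",
    "cascade": "cascade writer and reader running at different rates",
    "lock": "window producer and consumer iterating at different rates",
}


def dominant_stall(stalls: Dict[str, int]) -> Tuple[str, float]:
    """Return the largest stall category and its share of all stall time."""
    total = sum(stalls.values())
    if total == 0:
        return ("none", 0.0)
    name = max(stalls, key=lambda k: stalls[k])
    return (name, stalls[name] / total)


# Hypothetical counter values for one AI Engine.
sample = {"memory": 150, "stream": 600, "cascade": 0, "lock": 250}
name, share = dominant_stall(sample)
cause = CAUSES.get(name, "no stalls recorded")
print(f"dominant stall: {name} ({share:.0%}), likely cause: {cause}")
```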

Table 3. Execution
Metric Name	Event ID	Description
Vector Instruction Time	37	Time spent by the AI Engine on vector instructions: vector processor instructions and vector data loads/stores.
Load Instruction Time	38	Time spent by the AI Engine on load instructions (moving data from memory to registers).
Store Instruction Time	39	Time spent by the AI Engine on store instructions (moving data from registers to memory).
Cumulative Instruction Time	32	Time spent by the AI Engine on memory and stream accesses and on lock acquire/release.

Together, these indicators let you estimate the efficiency of your kernel. To increase efficiency, optimize data access, favor vector instructions over scalar instructions, and use 128-bit access to streams whenever possible.
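
One way to read the Execution counters is to express each as a fraction of a common reference time. The sketch below does that in Python. It assumes the reference is an Active Time value taken from a comparable Heat_map run, which is an assumption of this example rather than something the metric set provides; the counter values and dictionary keys are hypothetical.

```python
from typing import Dict


def instruction_shares(counters: Dict[str, int], reference_time: int) -> Dict[str, float]:
    """Express each instruction-time counter as a fraction of reference_time."""
    if reference_time <= 0:
        raise ValueError("reference_time must be positive")
    return {name: value / reference_time for name, value in counters.items()}


# Hypothetical values; reference_time stands in for Active Time from a comparable run.
execution = {"vector": 4_000, "load": 2_500, "store": 1_000, "cumulative": 5_000}
shares = instruction_shares(execution, reference_time=10_000)

# Print the largest contributors first.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<10} {share:.0%}")
```

A low vector share alongside a high load or store share would suggest looking at data access first, in line with the guidance above.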

Table 4. Floating-Point
Metric Name	Event ID	Description
Floating-Point Overflow Exception	50	Number of floating-point overflow exceptions generated by the AI Engine.
Floating-Point Underflow Exception	51	Number of floating-point underflow exceptions generated by the AI Engine.
Floating-Point Invalid Exception	52	Number of floating-point invalid exceptions generated by the AI Engine.
Floating-Point Divide-by-Zero Exception	53	Number of floating-point divide-by-zero exceptions generated by the AI Engine.

Floating-point exceptions lead to erroneous results. You might have to recode your floating-point algorithm if you get too many exceptions, or even a single one in a critical area of the code.
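
Because even one exception can matter, a check on these counters can simply report every exception type with a nonzero count. The following Python sketch assumes the four counts have been collected into a dictionary; the keys and values shown are hypothetical.

```python
from typing import Dict, List


def raised_exceptions(counts: Dict[str, int]) -> List[str]:
    """Return the names of exception types with a nonzero count, largest first."""
    hits = [(name, n) for name, n in counts.items() if n > 0]
    hits.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in hits]


# Hypothetical exception counts for one AI Engine.
fp_counts = {"overflow": 0, "underflow": 12, "invalid": 1, "divide_by_zero": 0}
flagged = raised_exceptions(fp_counts)
if flagged:
    print("review floating-point code for:", ", ".join(flagged))
else:
    print("no floating-point exceptions recorded")
```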

Table 5. Stream_put_get
Metric Name	Event ID	Description
Cascade Read Instruction Time	42	Time AI Engine spent executing read instructions on the cascade stream.
Cascade Write Instruction Time	43	Time AI Engine spent executing write instructions on the cascade stream.
Stream Read Instruction Time	40	Time AI Engine spent executing read instructions on data streams.
Stream Write Instruction Time	41	Time AI Engine spent executing write instructions on data streams.
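
The introduction to these tables states that metrics are assigned to the available counters in priority order. The Python sketch below shows one possible reading of that statement, using the Stream_put_get rows above: when fewer counters are free than there are metrics, only the first rows receive a counter. The `assign_counters` helper and the choice of two free counters are hypothetical and do not describe the tool's actual assignment logic.

```python
from typing import Dict, List, Tuple

# (metric name, event ID) in the order listed in the Stream_put_get table.
STREAM_PUT_GET: List[Tuple[str, int]] = [
    ("Cascade Read Instruction Time", 42),
    ("Cascade Write Instruction Time", 43),
    ("Stream Read Instruction Time", 40),
    ("Stream Write Instruction Time", 41),
]


def assign_counters(metrics: List[Tuple[str, int]], free_counters: int) -> Dict[int, Tuple[str, int]]:
    """Map counter index to metric, taking metrics in their listed priority order."""
    if free_counters < 0:
        raise ValueError("free_counters must not be negative")
    return {i: metric for i, metric in enumerate(metrics[:free_counters])}


# Hypothetical situation: only two counters are free.
for counter, (name, event_id) in assign_counters(STREAM_PUT_GET, 2).items():
    print(f"counter {counter}: event {event_id} ({name})")
```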