Interpreting the Profile Summary - 2021.2 English

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID: UG1393
Locale: English (United States)
Release Date: 2021-12-15
Version: 2021.2 English

The profile summary includes a number of useful statistics for your host application and kernels, and provides a general idea of the functional bottlenecks in your application. The following tables describe the contents of each section of the profile summary.

Settings

This displays the report and XRT configuration settings.

Summary

This displays summary statistics including device execution time and device power.

Kernels & Compute Units

The following table displays the profile summary data for all kernel functions scheduled and executed.

Table 1. Kernel Execution
Name Description
Kernel Name of kernel
Enqueues Number of times the kernel is enqueued. When the kernel is enqueued only once, the following statistics are all the same.
Total Time Sum of runtimes of all enqueues (measured from START to END in OpenCL execution model) (in ms)
Minimum Time Minimum runtime of all enqueues (in ms)
Average Time Average kernel runtime (in ms)

(Total time) / (Number of enqueues)

Maximum Time Maximum runtime of all enqueues (in ms)
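The relationships among the Table 1 statistics can be sketched as follows. This is an illustrative computation over hypothetical per-enqueue runtimes, not XRT's actual implementation.

```python
# Illustrative sketch (not XRT source): deriving the Table 1 statistics
# from a list of per-enqueue kernel runtimes, measured START to END in ms.
def kernel_execution_stats(durations_ms):
    """Return (enqueues, total, minimum, average, maximum), times in ms."""
    total = sum(durations_ms)
    return (
        len(durations_ms),          # Enqueues
        total,                      # Total Time
        min(durations_ms),          # Minimum Time
        total / len(durations_ms),  # Average Time = (Total time) / (Number of enqueues)
        max(durations_ms),          # Maximum Time
    )

# Hypothetical runtimes for a kernel enqueued three times
print(kernel_execution_stats([2.0, 3.0, 7.0]))  # (3, 12.0, 2.0, 4.0, 7.0)
```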

The following table displays the profile summary data for top kernel functions.

Table 2. Top Kernel Execution
Name Description
Kernel Name of kernel
Kernel Instance Address Host address of kernel instance (in hex)
Context ID Context ID on host
Command Queue ID Command queue ID on host
Device Name of device where kernel was executed (format: <device>-<ID>)
Start Time Start time of execution (in ms)
Duration Duration of execution (in ms)

The following table displays the profile summary data for all compute units on the device.

Table 3. Compute Unit Utilization
Name Description
Compute Unit Name of compute unit
Kernel Kernel this compute unit is associated with
Device Name of the device (format: <device>-<ID>)
Calls Number of times the compute unit is called
Dataflow Execution Specifies whether the CU is executed with dataflow
Max Parallel Executions Number of executions in the dataflow region
Dataflow Acceleration Shows the performance improvement due to dataflow execution
CU Utilization (%) Shows the percent of the total kernel runtime that is consumed by the CU
Total Time Sum of the runtimes of all calls (in ms)
Minimum Time Minimum runtime of all calls (in ms)
Average Time Average runtime of all calls (in ms):

(Total time) / (Number of work groups)
Maximum Time Maximum runtime of all calls (in ms)
Clock Frequency Clock frequency used for a given accelerator (in MHz)
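The CU Utilization (%) column of Table 3 can be sketched as the compute unit's share of the total kernel runtime. The function name and values below are illustrative assumptions, not XRT source.

```python
# Illustrative sketch: CU Utilization (%) from Table 3, i.e., the percent
# of the total kernel runtime consumed by one compute unit. Both inputs
# must use the same time unit (ms here); values are hypothetical.
def cu_utilization_pct(cu_total_time_ms, kernel_total_time_ms):
    return 100.0 * cu_total_time_ms / kernel_total_time_ms

print(cu_utilization_pct(30.0, 120.0))  # 25.0
```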

The following table displays the profile summary data for running times and stalls for compute units on the device.

Table 4. Compute Unit Running Times & Stalls
Name Description
Compute Unit Name of compute unit
Execution Count Execution count of the compute unit
Running Time Total time compute unit was running (in µs)
Intra-Kernel Dataflow Stalls (%) Percent time the compute unit was stalling from intra-kernel streams
External Memory Stalls (%) Percent time the compute unit was stalling from external memory accesses
Inter-Kernel Pipe Stalls (%) Percent time the compute unit was stalling from inter-kernel pipe accesses

Kernel Data Transfers

The following table displays the data transfers from kernels to the global memory.

Table 5. Data Transfer
Name Description
Compute Unit Port Name of compute unit/port
Kernel Arguments List of kernel arguments attached to this port
Device Name of device (format: <device>-<ID>)
Memory Resources Memory resource accessed by this port
Transfer Type Type of kernel data transfers
Number of Transfers Number of kernel data transfers (in AXI transactions)
Note: This might contain printf transfers.
Transfer Rate Rate of kernel data transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total CU Execution Time)

Where total CU execution time is the total time the CU was active
Avg Bandwidth Utilization (%) Average bandwidth of kernel data transfers:

Bandwidth Utilization (%) = (100 * Transfer Rate) / (0.6 * Max. Theoretical Rate)

Avg Size Average size of kernel data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Avg Latency Average latency of kernel data transfers (in ns)
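The Table 5 formulas can be sketched as below. The CU execution time is assumed to be in microseconds, so that bytes per microsecond equals decimal MB/s; the function names are illustrative.

```python
# Illustrative sketches of the Table 5 formulas (not XRT source).

def transfer_rate_mbps(total_bytes, cu_active_time_us):
    # Bytes per microsecond equals MB/s with decimal (10^6-byte) megabytes.
    # Assumes the total time the CU was active is given in microseconds.
    return total_bytes / cu_active_time_us

def avg_bandwidth_utilization_pct(rate_mbps, max_theoretical_mbps):
    # Table 5 derates the theoretical maximum by a factor of 0.6.
    return (100.0 * rate_mbps) / (0.6 * max_theoretical_mbps)

def avg_size_kb(total_kb, num_transfers):
    return total_kb / num_transfers

print(transfer_rate_mbps(4_800_000, 1_000))            # 4800.0 MB/s
print(avg_bandwidth_utilization_pct(4800.0, 16000.0))  # 50.0
print(avg_size_kb(1024.0, 256))                        # 4.0 KB
```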

The following table displays the top data transfers from kernels to the global memory.

Table 6. Top Data Transfer
Name Description
Compute Unit Name of compute unit
Device Name of device
Number of Transfers Number of write and read data transfers
Avg Bytes per Transfer Average bytes of kernel data transfers:

Average Bytes = (Total Bytes) / (Number of Transfers)

Transfer Efficiency (%) Efficiency of kernel data transfers:

Efficiency (%) = 100 * (Average Bytes) / min((Memory Byte Width * 256), 4096)

Total Data Transfer Total data transferred by kernels (in MB):

Total Data = (Total Write) + (Total Read)

Total Write Total data written by kernels (in MB)
Total Read Total data read by kernels (in MB)
Total Transfer Rate Average total data transfer rate (in MB/s):

Total Transfer Rate = (Total Data Transfer) / (Total CU Execution Time)

Where total CU execution time is the total time the CU was active
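The Transfer Efficiency formula of Table 6 can be sketched as follows. Because the column is reported as a percentage, the ratio is scaled by 100 here; the cap of 4096 bytes matches the min() term in the formula, and the example values are hypothetical.

```python
# Illustrative sketch of Transfer Efficiency (%) from Table 6.
def transfer_efficiency_pct(avg_bytes, memory_byte_width):
    # The ideal transfer size is capped at min(width * 256, 4096) bytes,
    # per the Table 6 formula; the ratio is scaled by 100 for a percentage.
    ideal_bytes = min(memory_byte_width * 256, 4096)
    return 100.0 * avg_bytes / ideal_bytes

# Hypothetical: 64-byte-wide memory interface, 2048-byte average transfers
print(transfer_efficiency_pct(2048, 64))  # 50.0
```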

The following table displays the data transfer streams.

Note: This table is only shown if there is stream data.
Table 7. Data Transfer Streams
Name Description
Master Port Name of master compute unit and port
Master Kernel Arguments List of kernel arguments attached to this port
Slave Port Name of slave compute unit and port
Slave Kernel Arguments List of kernel arguments attached to this port
Device Name of device (format: <device>-<ID>)
Number of Transfers Number of stream data packets
Transfer Rate Rate of stream data transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total CU Execution Time)

Where total CU execution time is the total time the CU was active

Avg Size Average size of kernel data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Link Utilization (%) Link utilization (%):

Link Utilization = 100 * (Link Busy Cycles - Link Stall Cycles - Link Starve Cycles) / (Link Busy Cycles)

Link Starve (%) Link starve (%):

Link Starve = 100 * (Link Starve Cycles) / (Link Busy Cycles)

Link Stall (%) Link stall (%):

Link Stall = 100 * (Link Stall Cycles) / (Link Busy Cycles)
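The three link percentages in Table 7 are all ratios over the link's busy cycles and can be sketched together; the cycle counts below are hypothetical.

```python
# Illustrative sketch of the Table 7 link percentages (not XRT source).
def link_metrics_pct(busy_cycles, stall_cycles, starve_cycles):
    """Return (utilization, starve, stall) as percentages of busy cycles."""
    return (
        100.0 * (busy_cycles - stall_cycles - starve_cycles) / busy_cycles,
        100.0 * starve_cycles / busy_cycles,
        100.0 * stall_cycles / busy_cycles,
    )

# Hypothetical cycle counts: 1000 busy, 100 stalled, 150 starved
print(link_metrics_pct(1000, 100, 150))  # (75.0, 15.0, 10.0)
```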

Host Data Transfers

The following table displays profile data for all write transfers between the host and device memory through the PCI Express® link.

Table 8. Top Memory Writes
Name Description
Buffer Address Specifies the address location for the buffer
Context ID OpenCL Context ID on host
Command Queue ID OpenCL Command queue ID on host
Start Time Start time of write operation (in ms)
Duration Duration of write operation (in ms)
Buffer Size Amount of data being transferred (in KB)
Writing Rate Data transfer rate (in MB/s):

(Buffer Size)/(Duration)

The following table displays profile data for all read transfers between the host and device memory through the PCI Express® link.

Table 9. Top Memory Reads
Name Description
Buffer Address Specifies the address location for the buffer
Context ID Context ID on host
Command Queue ID Command queue ID on host
Start Time Start time of read operation (in ms)
Duration Duration of read operation (in ms)
Buffer Size Amount of data being transferred (in KB)
Reading Rate Data transfer rate (in MB/s):

(Buffer Size) / (Duration)
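Both the writing rate (Table 8) and reading rate (Table 9) are (Buffer Size) / (Duration). With the buffer size in KB and the duration in ms, the quotient is already MB/s under decimal unit prefixes, as the sketch below assumes; the values are hypothetical.

```python
# Illustrative sketch of the Table 8/9 rate formula (not XRT source).
def host_transfer_rate_mbps(buffer_size_kb, duration_ms):
    # KB/ms equals MB/s when decimal (10^3) unit prefixes are assumed.
    return buffer_size_kb / duration_ms

# Hypothetical: a 1000 KB buffer transferred in 0.5 ms
print(host_transfer_rate_mbps(1000.0, 0.5))  # 2000.0 MB/s
```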

The following table displays the data transfers from the host to the global memory.

Table 10. Data Transfer
Name Description
Context:Number of Devices Context ID and number of devices in context
Transfer Type Type of kernel host transfers
Number of Buffer Transfers Number of host buffer transfers
Note: This might contain printf transfers.
Transfer Rate Rate of host buffer transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total Time in µs)

Avg Bandwidth Utilization (%) Average bandwidth of host buffer transfers:

Bandwidth Utilization (%) = (100 * Transfer Rate) / (Max. Theoretical Rate)

Avg Size Average size of host buffer transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Total Time Sum of host buffer transfer durations (in ms)
Avg Time Average of host buffer transfer durations (in ms)
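Note that the host-side bandwidth utilization in Table 10 divides by the full theoretical maximum, unlike the kernel-side formula in Table 5, which derates it by 0.6. A sketch of the host-side form, with hypothetical values:

```python
# Illustrative sketch of Avg Bandwidth Utilization (%) from Table 10.
def host_bandwidth_utilization_pct(rate_mbps, max_theoretical_mbps):
    # No 0.6 derating factor here, unlike the kernel-side Table 5 formula.
    return 100.0 * rate_mbps / max_theoretical_mbps

print(host_bandwidth_utilization_pct(8000.0, 16000.0))  # 50.0
```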

API Calls

The following table displays the profile data for all OpenCL host API function calls executed in the host application. A bar graph at the top shows each API call's time as a percentage of the total time.

Table 11. API Calls
Name Description
API Name Name of the API function (for example, clCreateProgramWithBinary, clEnqueueNDRangeKernel)
Calls Number of calls to this API made by the host application
Total Time Sum of runtimes of all calls (in ms)
Minimum Time Minimum runtime of all calls (in ms)
Average Time Average Time (in ms)

(Total time) / (Number of calls)

Maximum Time Maximum runtime of all calls (in ms)

Device Power

The following table displays the profile data for device power.

Table 12. Device Power
Name Description
Power Used By Platform Shows a line graph of the three power rails on a Data Center acceleration card:
  • 12V Auxiliary
  • 12V PCIe
  • Internal power
These show the power (W) usage of the card over time.
Temperature One chart is created for each device that has non-zero temperature readings. Displays one line for each temperature sensor with readouts in (°C).
Fan Speed One chart is created for each device that has non-zero fan speed readings. The fan speed is measured in RPM.

Kernel Internals

The following table displays the running time for compute units in microseconds (µs) and reports stall time as a percentage of the running time.

Tip: The Kernel Internals tab reports time in µs, while the rest of the Profile Summary reports time in milliseconds (ms).
Table 13. CU Runtime and Stalls
Name Description
Compute Unit Indicates the compute unit instance name
Running Time Reports the total running time for the CU (in µs)
Intra-Kernel Dataflow Stalls (%) Reports the percentage of running time consumed in stalls when streaming data between kernels
External Memory Stalls (%) Reports the percentage of running time consumed in stalls for memory transfers outside the CU
Inter-Kernel Pipe Stalls (%) Reports the percentage of running time consumed in stalls when streaming data to or from outside the CU

The following table displays the data transfers for specific ports on the compute unit.

Table 14. CU Port Data Transfers
Name Description
Port Indicates the port name on the compute unit
Compute Unit Indicates the compute unit instance name
Write Time Specifies the total data write time on the port (in µs)
Outstanding Write (%) Specifies the percentage of the runtime consumed in the write process
Read Time Specifies the total data read time on the port (in µs)
Outstanding Read (%) Specifies the percentage of the runtime consumed in the read process

The following table displays the functional port data transfers on the compute unit.

Table 15. Functional Port Data Transfers
Name Description
Port Name of port
Function Name of function
Compute Unit Name of compute unit
Write Time Total time the port had an outstanding write (in µs)
Outstanding Write (%) Percent time the port had an outstanding write
Read Time Total time the port had an outstanding read (in µs)
Outstanding Read (%) Percent time the port had an outstanding read

The following table displays the running time and stalls for functions on the compute unit.

Table 16. Functions
Name Description
Compute Unit Name of compute unit
Function Name of function
Running Time Total time function was running (in µs)
Intra-Kernel Dataflow Stalls (%) Percent time the function was stalling from intra-kernel streams
External Memory Stalls (%) Percent time the function was stalling from external memory accesses
Inter-Kernel Pipe Stalls (%) Percent time the function was stalling from inter-kernel pipe accesses

Shell Data Transfers

The following table displays the DMA data transfers.

Table 17. DMA Data Transfer
Name Description
Device Name of device (format: <device>-<ID>)
Transfer Type Type of data transfers
Number of Transfers Number of data transfers (in AXI transactions)
Transfer Rate Rate of data transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total Time in µs)

Total Data Transfer Total amount of data transferred (in MB)
Total Time Total duration of data transfers (in ms)
Avg Size Average size of data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Avg Latency Average latency of data transfers (in ns)

For DMA bypass and Global Memory to Global Memory data transfers, see the DMA Data Transfer table above.

NoC Counters

The NoC Counters section displays NoC Counters Read and NoC Counters Write. These sections are only displayed if there is non-zero NoC counter data.

Each section has a table containing summary data with line graphs for transfer rate and latency. The graphs can have multiple NoC counters, so you can toggle the counters ON/OFF through check boxes in the Chart column of the table.

Depending on the design, it can be possible to correlate NoC counters to CU ports. In this case, the CU port appears in the table, and selecting it cross-probes to the system diagram, profile summary, and any other views that include CU ports as selectable objects.

Table 18. NoC Counters Read or Write
Name Description
Compute Unit Port Name of compute unit/port
Name Name of NoC port
Traffic Class Traffic class type
Requested QoS (MB/s) Requested quality of service (in MB/s)
Min Transfer Rate Minimum rate of data transfers (in MB/s)
Avg Transfer Rate Average rate of data transfers (in MB/s)
Max Transfer Rate Maximum rate of data transfers (in MB/s)
Avg Size Average size of data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Min Latency Minimum latency of data transfers (in ns)
Avg Latency Average latency of data transfers (in ns)
Max Latency Maximum latency of data transfers (in ns)

AI Engine Counters

AI Engine counters are displayed if there is non-zero AI Engine counter data. If the AI Engine counters have an incompatible configuration, this section displays a message stating that the configuration does not support performance profiling.

This section has a table containing summary data with line graphs for active time and usage. The usage chart is only available if stall profiling is enabled.

The graphs can have multiple AI Engine counters, so you can toggle the counters ON/OFF through check boxes in the Chart column of the table.

It is possible to cross-probe tiles to the AI Engine array and graph views.

Note: Depending on how the AI Engine counters are configured, one or more metric columns might appear. These include memory stall, stream stall, call inst time, group error time, etc. For more information, see Versal ACAP AI Engine Programming Environment User Guide (UG1076).
Table 19. AI Engine Counters
Name Description
Tile AI Engine Tile [Column, Row]
Clock Frequency (MHz) Frequency (in MHz) of clock used for AI Engine tiles