Interpreting the Profile Summary - 2022.2 English

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID: UG1393
Release Date: 2022-12-07
Version: 2022.2 English

The profile summary includes a number of useful statistics for your host application and kernels. The report provides a general idea of the functional bottlenecks in your application. The following tables describe the contents of each section of the profile summary.

Settings

This displays the report and XRT configuration settings.

Summary

This displays summary statistics including device execution time and device power.

Kernels & Compute Units

The following table displays the profile summary data for all kernel functions scheduled and executed.

Table 1. Kernel Execution
Name Description
Kernel Name of kernel
Enqueues Number of times kernel is enqueued. When the kernel is enqueued only once, the following stats are all the same.
Total Time Sum of runtimes of all enqueues (measured from START to END in OpenCL execution model) (in ms)
Minimum Time Minimum runtime of all enqueues (in ms)
Average Time Average kernel runtime (in ms)

(Total time) / (Number of enqueues)

Maximum Time Maximum runtime of all enqueues (in ms)
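
As a quick worked example of the Total/Minimum/Average/Maximum pattern used here and in later tables, the Average Time column follows directly from Total Time and Enqueues. The values and variable names below are hypothetical, for illustration only:

```
#include <iostream>

int main() {
    // Hypothetical values taken from one Kernel Execution row
    double total_time_ms = 8.0; // Total Time: sum of all enqueue runtimes (ms)
    int    num_enqueues  = 4;   // Enqueues: number of times the kernel was enqueued

    // Average Time = (Total time) / (Number of enqueues)
    double average_time_ms = total_time_ms / num_enqueues;
    std::cout << "Average Time: " << average_time_ms << " ms\n"; // 2 ms

    return 0;
}
```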

The following table displays the profile summary data for top kernel functions.

Table 2. Top Kernel Execution
Name Description
Kernel Name of kernel
Kernel Instance Address Host address of kernel instance (in hex)
Context ID Context ID on host
Command Queue ID Command queue ID on host
Device Name of device where kernel was executed (format: <device>-<ID>)
Start Time Start time of execution (in ms)
Duration Duration of execution (in ms)

The following table displays the profile summary data for all compute units on the device.

Table 3. Compute Unit Utilization
Name Description
Compute Unit Name of compute unit
Kernel Kernel this compute unit is associated with
Device Name of the device (format: <device>-<ID>)
Calls Number of times the compute unit is called
Dataflow Execution Specifies whether the CU is executed with dataflow
Max Parallel Executions Number of executions in the dataflow region
Dataflow Acceleration Shows the performance improvement due to dataflow execution
CU Utilization (%) Shows the percent of the total kernel runtime that is consumed by the CU
Total Time Sum of the runtimes of all calls (in ms)
Minimum Time Minimum runtime of all calls (in ms)
Average Time Average runtime of all calls (in ms): (Total time) / (Number of work groups)
Maximum Time Maximum runtime of all calls (in ms)
Clock Frequency Clock frequency used for a given accelerator (in MHz)
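
One way to read the CU Utilization (%) definition above is as the ratio of the compute unit's Total Time to the Total Time of its parent kernel. A minimal sketch with hypothetical values (variable names are illustrative only):

```
#include <iostream>

int main() {
    // Hypothetical values from the Kernel Execution and Compute Unit Utilization tables
    double kernel_total_time_ms = 10.0; // Total Time of the parent kernel (ms)
    double cu_total_time_ms     = 7.5;  // Total Time of this compute unit (ms)

    // CU Utilization (%): percent of the total kernel runtime consumed by this CU
    double cu_utilization_pct = 100.0 * cu_total_time_ms / kernel_total_time_ms;
    std::cout << "CU Utilization: " << cu_utilization_pct << " %\n"; // 75 %

    return 0;
}
```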

The following table displays the profile summary data for running times and stalls for compute units on the device.

Table 4. Compute Unit Running Times & Stalls
Name Description
Compute Unit Name of compute unit
Execution Count Execution count of the compute unit
Running Time Total time compute unit was running (in µs)
Intra-Kernel Dataflow Stalls (%) Percent time the compute unit was stalling from intra-kernel streams
External Memory Stalls (%) Percent time the compute unit was stalling from external memory accesses
Inter-Kernel Pipe Stalls (%) Percent time the compute unit was stalling from inter-kernel pipe accesses
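
Because Running Time is reported in µs and the stall columns are percentages of that time, the absolute time lost to each stall category can be estimated from a row of this table. A hypothetical sketch, assuming the stall categories are reported independently:

```
#include <iostream>

int main() {
    // Hypothetical values from a Compute Unit Running Times & Stalls row
    double running_time_us    = 1200.0; // Running Time (µs)
    double ext_mem_stall_pct  = 25.0;   // External Memory Stalls (%)
    double dataflow_stall_pct = 5.0;    // Intra-Kernel Dataflow Stalls (%)

    // Convert each stall percentage back into absolute time spent stalled
    double ext_mem_stall_us  = running_time_us * ext_mem_stall_pct  / 100.0; // 300 µs
    double dataflow_stall_us = running_time_us * dataflow_stall_pct / 100.0; // 60 µs

    std::cout << "External memory stall time: " << ext_mem_stall_us  << " us\n";
    std::cout << "Dataflow stall time:        " << dataflow_stall_us << " us\n";
    return 0;
}
```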

Kernel Data Transfers

The following table displays the data transfers between kernels and global memory.

Table 5. Data Transfer
Name Description
Compute Unit Port Name of compute unit/port
Kernel Arguments List of kernel arguments attached to this port
Device Name of device (format: <device>-<ID>)
Memory Resources Memory resource accessed by this port
Transfer Type Type of kernel data transfers
Number of Transfers Number of kernel data transfers (in AXI transactions)
Note: This might contain printf transfers.
Transfer Rate Rate of kernel data transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total CU Execution Time)

Where total CU execution time is the total time the CU was active
Bandwidth Utilization with regard to Current Port Configuration Application bandwidth usage on this port with respect to the current configuration:

Bandwidth Utilization (%) = (100 * Transfer Rate) / (Max Achievable BW)

where Max Achievable BW is based on the bit width of the port and the clock speed of the kernel in the design
Maximum Bandwidth with regard to Current Port Configuration Maximum achievable bandwidth on the current port configuration:

Bandwidth (MB/s) = (Current port bit width / 8) * (Running PL clock rate in MHz)

Bandwidth Utilization with regard to Ideal Port Configuration Application bandwidth usage against the maximum possible under ideal conditions:

Bandwidth Utilization (%) = (100 * Transfer Rate) / (Max Possible BW)

where Max Possible BW is based on the maximum bit width of a port (512 bits) and the maximum clock speed of a kernel on this platform
Maximum Bandwidth with regard to Ideal Port Configuration Maximum theoretical bandwidth on an ideal port configuration:

Bandwidth (MB/s) = (Maximum possible port bit width / 8) * (Highest possible PL clock rate in MHz)
Avg Size Average size of kernel data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Avg Latency Average latency of kernel data transfers (in ns)
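
The bandwidth columns above all derive from the same transfer rate. The following sketch plugs hypothetical port parameters (bit width and clock rates, not taken from any specific platform) into the formulas to show how the current-configuration and ideal-configuration figures differ:

```
#include <iostream>

int main() {
    // Hypothetical values from a Data Transfer row and the design configuration
    double transfer_rate_mbps = 3200.0; // Transfer Rate reported for this port (MB/s)
    double port_bits          = 256.0;  // current port bit width
    double pl_clock_mhz       = 300.0;  // running PL clock rate (MHz)

    // Maximum Bandwidth with regard to Current Port Configuration (MB/s)
    double max_achievable_bw = (port_bits / 8.0) * pl_clock_mhz;            // 9600 MB/s
    // Bandwidth Utilization with regard to Current Port Configuration (%)
    double util_current = 100.0 * transfer_rate_mbps / max_achievable_bw;   // ~33 %

    // Ideal configuration: 512-bit port at an assumed highest PL clock (illustrative)
    double ideal_bits      = 512.0;
    double ideal_clock_mhz = 500.0;
    double max_possible_bw = (ideal_bits / 8.0) * ideal_clock_mhz;          // 32000 MB/s
    double util_ideal      = 100.0 * transfer_rate_mbps / max_possible_bw;  // 10 %

    std::cout << "Utilization (current config): " << util_current << " %\n";
    std::cout << "Utilization (ideal config):   " << util_ideal   << " %\n";
    return 0;
}
```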

The following table displays the top data transfers between kernels and global memory.

Table 6. Top Data Transfer
Name Description
Compute Unit Name of compute unit
Device Name of device
Number of Transfers Number of write and read data transfers
Avg Bytes per Transfer Average bytes of kernel data transfers:

Average Bytes = (Total Bytes) / (Number of Transfers)

Transfer Efficiency (%) Efficiency of kernel data transfers:

Efficiency = (Average Bytes) / min((Memory Byte Width * 256), 4096)

Total Data Transfer Total data transferred by kernels (in MB):

Total Data = (Total Write) + (Total Read)

Total Write Total data written by kernels (in MB)
Total Read Total data read by kernels (in MB)
Total Transfer Rate Average total data transfer rate (in MB/s):

Total Transfer Rate = (Total Data Transfer) / (Total CU Execution Time)

Where total CU execution time is the total time the CU was active
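
Transfer Efficiency compares the average transfer size against the largest burst the memory interface can accept (capped at 4 KB). A minimal sketch with hypothetical values; the final scaling by 100 assumes the report expresses the ratio as the percentage shown in the Transfer Efficiency (%) column:

```
#include <algorithm>
#include <iostream>

int main() {
    // Hypothetical values from a Top Data Transfer row
    double total_bytes    = 4.0e6;  // total bytes moved by the compute unit
    double num_transfers  = 2000.0; // Number of Transfers
    double mem_byte_width = 64.0;   // memory interface width in bytes (512-bit)

    // Avg Bytes per Transfer = (Total Bytes) / (Number of Transfers)
    double avg_bytes = total_bytes / num_transfers; // 2000 bytes

    // Efficiency = (Average Bytes) / min((Memory Byte Width * 256), 4096)
    double max_burst_bytes = std::min(mem_byte_width * 256.0, 4096.0); // capped at 4096
    double efficiency_pct  = 100.0 * avg_bytes / max_burst_bytes;      // ~48.8 %

    std::cout << "Avg Bytes per Transfer: " << avg_bytes << "\n";
    std::cout << "Transfer Efficiency:    " << efficiency_pct << " %\n";
    return 0;
}
```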

The following table displays the data transfer streams.

Note: This table is only shown if there is stream data.
Table 7. Data Transfer Streams
Name Description
Master Port Name of master compute unit and port
Master Kernel Arguments List of kernel arguments attached to this port
Slave Port Name of slave compute unit and port
Slave Kernel Arguments List of kernel arguments attached to this port
Device Name of device (format: <device>-<ID>)
Number of Transfers Number of stream data packets
Transfer Rate Rate of stream data transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total CU Execution Time)

Where total CU execution time is the total time the CU was active

Avg Size Average size of kernel data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Link Utilization (%) Link utilization (%):

Link Utilization = 100 * (Link Busy Cycles - Link Stall Cycles - Link Starve Cycles) / (Link Busy Cycles)

Link Starve (%) Link starve (%):

Link Starve = 100 * (Link Starve Cycles) / (Link Busy Cycles)

Link Stall (%) Link stall (%):

Link Stall = 100 * (Link Stall Cycles) / (Link Busy Cycles)
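
The link statistics are all ratios of the same cycle counters. A short sketch with hypothetical counter values:

```
#include <iostream>

int main() {
    // Hypothetical stream link cycle counters
    double busy_cycles   = 100000.0; // Link Busy Cycles
    double stall_cycles  = 15000.0;  // Link Stall Cycles
    double starve_cycles = 5000.0;   // Link Starve Cycles

    double utilization = 100.0 * (busy_cycles - stall_cycles - starve_cycles) / busy_cycles; // 80 %
    double starve      = 100.0 * starve_cycles / busy_cycles;                                //  5 %
    double stall       = 100.0 * stall_cycles  / busy_cycles;                                // 15 %

    std::cout << "Link Utilization: " << utilization << " %\n";
    std::cout << "Link Starve:      " << starve      << " %\n";
    std::cout << "Link Stall:       " << stall       << " %\n";
    return 0;
}
```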

Host Data Transfers

The following table displays profile data for all write transfers between host and device memory over the PCI Express® link.

Table 8. Top Memory Writes
Name Description
Buffer Address Specifies the address location for the buffer
Context ID OpenCL Context ID on host
Command Queue ID OpenCL Command queue ID on host
Start Time Start time of write operation (in ms)
Duration Duration of write operation (in ms)
Buffer Size Amount of data being transferred (in KB)
Writing Rate Data transfer rate (in MB/s):

(Buffer Size) / (Duration)

The following table displays profile data for all read transfers between host and device memory over the PCI Express® link.

Table 9. Top Memory Reads
Name Description
Buffer Address Specifies the address location for the buffer
Context ID Context ID on host
Command Queue ID Command queue ID on host
Start Time Start time of read operation (in ms)
Duration Duration of read operation (in ms)
Buffer Size Amount of data being transferred (in KB)
Reading Rate Data transfer rate (in MB/s):

(Buffer Size) / (Duration)

The following table displays the data transfers between the host and global memory.

Table 10. Data Transfer
Name Description
Context:Number of Devices Context ID and number of devices in context
Transfer Type Type of host data transfers
Number of Buffer Transfers Number of host buffer transfers
Note: This might contain printf transfers.
Transfer Rate Rate of host buffer transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total Time in µs)

Avg Bandwidth Utilization (%) Average bandwidth of host buffer transfers:

Bandwidth Utilization (%) = (100 * Transfer Rate) / (Max. Theoretical Rate)

Avg Size Average size of host buffer transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Total Time Sum of host buffer transfer durations (in ms)
Avg Time Average of host buffer transfer durations (in ms)
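
Host transfer rate and bandwidth utilization follow the same pattern as the kernel data transfer tables, with the maximum theoretical rate being that of the host-device link. A sketch with hypothetical values; the PCIe rate used here is an assumption for illustration, not a platform specification:

```
#include <iostream>

int main() {
    // Hypothetical values aggregated over all host buffer transfers
    double total_bytes   = 512.0e6;  // total bytes moved between host and device
    double total_time_us = 80000.0;  // sum of transfer durations (µs)

    // Transfer Rate = (Total Bytes) / (Total Time in µs); bytes/µs is numerically MB/s
    double transfer_rate_mbps = total_bytes / total_time_us; // 6400 MB/s

    // Avg Bandwidth Utilization (%) = (100 * Transfer Rate) / (Max. Theoretical Rate)
    double max_theoretical_mbps = 15750.0; // assumed PCIe Gen3 x16 rate, for illustration
    double utilization = 100.0 * transfer_rate_mbps / max_theoretical_mbps; // ~40.6 %

    std::cout << "Transfer Rate:         " << transfer_rate_mbps << " MB/s\n";
    std::cout << "Bandwidth Utilization: " << utilization << " %\n";
    return 0;
}
```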

API Calls

The following table displays the profile data for all OpenCL host API function calls executed in the host application. A bar graph at the top shows each API call's time as a percentage of the total time.

Table 11. API Calls
Name Description
API Name Name of the API function (for example, clCreateProgramWithBinary, clEnqueueNDRangeKernel)
Calls Number of calls to this API made by the host application
Total Time Sum of runtimes of all calls (in ms)
Minimum Time Minimum runtime of all calls (in ms)
Average Time Average runtime of all calls (in ms)

(Total time) / (Number of calls)

Maximum Time Maximum runtime of all calls (in ms)
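
The API names in this table are standard OpenCL host calls. As a cross-check when interpreting the report, a host-side timer around a blocking dispatch covers roughly the interval the profiler attributes to the corresponding enqueue and wait calls. A minimal, hypothetical sketch (error handling omitted; queue and kernel are assumed to have been created earlier with the usual clCreateCommandQueue and clCreateKernel calls):

```
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <chrono>

// Hypothetical helper: time a single blocking kernel dispatch from the host side.
double time_enqueue_ms(cl_command_queue queue, cl_kernel kernel) {
    size_t global = 1, local = 1; // single work-item dispatch, common for Vitis kernels
    auto t0 = std::chrono::steady_clock::now();

    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, &local,
                           0, nullptr, nullptr); // counted as one call in the API Calls table
    clFinish(queue);                             // block so the timer covers execution

    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```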

Device Power

The following table displays the profile data for device power.

Table 12. Device Power
Name Description
Power Used By Platform Shows a line graph of the three power rails on a Data Center acceleration card:
  • 12V Auxiliary
  • 12V PCIe
  • Internal power
These show the power (W) usage of the card over time.
Temperature One chart is created for each device that has non-zero temperature readings. Displays one line for each temperature sensor with readouts in (°C).
Fan Speed One chart is created for each device that has non-zero fan speed readings. The fan speed is measured in RPM.

Kernel Internals

The following table displays the running time for compute units in microseconds (µs) and reports stall time as a percentage of the running time.

Tip: The Kernel Internals tab reports time in µs, while the rest of the Profile Summary reports time in milliseconds (ms).
Table 13. CU Runtime and Stalls
Name Description
Compute Unit Indicates the compute unit instance name
Running Time Reports the total running time for the CU (in µs)
Intra-Kernel Dataflow Stalls (%) Reports the percentage of running time consumed in stalls when streaming data between kernels
External Memory Stalls (%) Reports the percentage of running time consumed in stalls for memory transfers outside the CU
Inter-Kernel Pipe Stalls (%) Reports the percentage of running time consumed in stalls when streaming data to or from outside the CU

The following table displays the data transfers for specific ports on the compute unit.

Table 14. CU Port Data Transfers
Name Description
Port Indicates the port name on the compute unit
Compute Unit Indicates the compute unit instance name
Write Time Specifies the total data write time on the port (in µs)
Outstanding Write (%) Specifies the percentage of the runtime consumed in the write process
Read Time Specifies the total data read time on the port (in µs)
Outstanding Read (%) Specifies the percentage of the runtime consumed in the read process
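
One way to read Outstanding Write (%) and Outstanding Read (%) is as the Write Time and Read Time columns expressed as a fraction of the compute unit's Running Time from Table 13. A hypothetical sketch:

```
#include <iostream>

int main() {
    // Hypothetical values from the CU Runtime and Stalls and CU Port Data Transfers tables
    double cu_running_time_us = 2000.0; // Running Time of the compute unit (µs)
    double write_time_us      = 500.0;  // Write Time on this port (µs)
    double read_time_us       = 1200.0; // Read Time on this port (µs)

    double outstanding_write_pct = 100.0 * write_time_us / cu_running_time_us; // 25 %
    double outstanding_read_pct  = 100.0 * read_time_us  / cu_running_time_us; // 60 %

    std::cout << "Outstanding Write: " << outstanding_write_pct << " %\n";
    std::cout << "Outstanding Read:  " << outstanding_read_pct  << " %\n";
    return 0;
}
```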

The following table displays the functional port data transfers on the compute unit.

Table 15. Functional Port Data Transfers
Name Description
Port Name of port
Function Name of function
Compute Unit Name of compute unit
Write Time Total time the port had an outstanding write (in µs)
Outstanding Write (%) Percent time the port had an outstanding write
Read Time Total time the port had an outstanding read (in µs)
Outstanding Read (%) Percent time the port had an outstanding read

The following table displays the running time and stalls for each function in the compute unit.

Table 16. Functions
Name Description
Compute Unit Name of compute unit
Function Name of function
Running Time Total time function was running (in ms)
Intra-Kernel Dataflow Stalls (%) Percent time the function was stalling from intra-kernel streams
External Memory Stalls (%) Percent time the function was stalling from external memory accesses
Inter-Kernel Pipe Stalls (%) Percent time the function was stalling from inter-kernel pipe accesses

Shell Data Transfers

The following table displays the DMA data transfers.

Table 17. DMA Data Transfer
Name Description
Device Name of device (format: <device>-<ID>)
Transfer Type Type of data transfers
Number of Transfers Number of data transfers (in AXI transactions)
Transfer Rate Rate of data transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total Time in µs)

Total Data Transfer Total amount of data transferred (in MB)
Total Time Total duration of data transfers (in ms)
Avg Size Average size of data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Avg Latency Average latency of data transfers (in ns)

For DMA bypass and Global Memory to Global Memory data transfers, see the DMA Data Transfer table above.

NoC Counters

Tip: This data is not displayed unless it has been specifically generated during implementation.

This section displays the NoC Counters Read and NoC Counters Write data. These sections are only displayed if there is non-zero NoC counter data.

Each section has a table containing summary data with line graphs for transfer rate and latency. The graphs can have multiple NoC counters, so you can toggle the counters ON/OFF through check boxes in the Chart column of the table.

Depending on the design, it can be possible to correlate NoC counters to CU ports. In this case, the CU port appears in the table, and selecting it cross-probes to the system diagram, profile summary, and any other views that include CU ports as selectable objects.

Table 18. NoC Counters Read or Write
Name Description
Compute Unit Port Name of compute unit/port
Name Name of NoC port
Traffic Class Traffic class type
Requested QoS (MB/s) Requested quality of service (in MB/s)
Min Transfer Rate Minimum rate of data transfers (in MB/s)
Avg Transfer Rate Average rate of data transfers (in MB/s)
Max Transfer Rate Maximum rate of data transfers (in MB/s)
Avg Size Average size of data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Min Latency Minimum latency of data transfers (in ns)
Avg Latency Average latency of data transfers (in ns)
Max Latency Maximum latency of data transfers (in ns)

AI Engine Counters

The AI Engine Counters section is displayed only if there is non-zero AI Engine counter data. If the AI Engine counters have an incompatible configuration, this section displays a message stating that the configuration does not support performance profiling.

This section has a table containing summary data with line graphs for active time and usage. The usage chart is only available if stall profiling is enabled.

The graphs can have multiple AI Engine counters, so you can toggle the counters ON/OFF through check boxes in the Chart column of the table.

It is possible to cross-probe tiles to the AI Engine array and graph views.

Note: Depending on how the AI Engine counters are configured, one or more metric columns might appear. These include memory stall, stream stall, call inst time, group error time, etc. For more information, see AI Engine Tools and Flows User Guide (UG1076).
Table 19. AI Engine Counters
Name Description
Tile AI Engine Tile [Column, Row]
Clock Frequency (MHz) Frequency (in MHz) of clock used for AI Engine tiles