Panorama-View Profiling - 1.4 English

Vitis AI User Guide (UG1414)

Document ID
UG1414
Release Date
2021-07-22
Version
1.4 English

DSight presents profiling statistics in a visual format, giving users a panorama view of DPU core utilization so that they can locate an application’s bottleneck and further optimize performance. Ideally, the models should be compiled by VAI_C into normal mode DPU kernels before performing panorama view profiling.

The following steps describe how to conduct profiling with DSight:

  • Switch N2Cube into profile mode using the command dexplorer -m profile.
  • Run the DPU application and stop the process after it has run at its typical performance level for several seconds. A profile file named dpu_trace_[PID].prof is generated in the application’s directory for further processing, where PID is the process ID of the launched DPU application.
  • Launch the DSight tool with the command dsight -p dpu_trace_[PID].prof. DSight generates an HTML file named dpu_trace_[PID].html.
  • Open the generated HTML page in any web browser to view the visual charts. One profiling example for multi-threading ResNet-50 over triple DPU cores is shown in the following figure.
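
The steps above can be sketched as a small Python helper. This is only an illustration, assuming the profile files follow the dpu_trace_[PID].prof naming described above. The function names (find_profiles, dsight_command, expected_report_name) are hypothetical and not part of Vitis AI; only the dexplorer and dsight command lines are taken from the steps.

```python
import re
from pathlib import Path

# Command from step 1: switch N2Cube into profile mode.
PROFILE_MODE_COMMAND = ["dexplorer", "-m", "profile"]

# File name pattern from step 2: dpu_trace_[PID].prof
PROF_PATTERN = re.compile(r"^dpu_trace_(\d+)\.prof$")


def find_profiles(app_dir):
    """Return (pid, path) pairs for profile files in app_dir, sorted by PID."""
    found = []
    for path in Path(app_dir).iterdir():
        match = PROF_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found)


def dsight_command(prof_path):
    """Build the argument list for step 3: dsight -p <profile file>."""
    return ["dsight", "-p", str(prof_path)]


def expected_report_name(prof_path):
    """File name of the HTML report that DSight generates for a profile."""
    return Path(prof_path).with_suffix(".html").name
```

On the target board, each argument list could be passed to subprocess.run(..., check=True) from the standard library. The helper returns only the report file name because the steps do not state which directory DSight writes the report to.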
DPU Utilization (Y-axis)
Lists the utilization of each DPU core. A higher percentage means that the DPU computing power is more fully utilized to accelerate the model’s execution. If the percentage is low, users can try changing the DPU configuration to reduce the logic resources it requires, or redesign the algorithm so that the DPU computing resources better match the algorithm’s requirements.
Schedule Efficiency (X-axis)
Indicates what percentage of each DPU core is scheduled by the N2Cube runtime. If the percentage is low, users can try increasing the number of application threads so that the DPU cores have more chances to be triggered. To further improve the schedule efficiency of the DPU cores, users should try to optimize the other computation workloads running on the Arm CPU side, for example by using NEON intrinsics, assembly instructions, or Vitis accelerated libraries (e.g., xfOpenCV). Typically, such non-DPU workloads include pre-processing, post-processing, and deep learning operators that the DPU does not support.
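
The guidance for the two metrics can be summarized as a simple lookup, sketched below in Python. The function name triage and the 50% cut-off are assumptions made purely for illustration. DSight does not define a threshold for a "low" percentage, so choose a value that suits the application.

```python
def triage(dpu_utilization, schedule_efficiency, low_threshold=50.0):
    """Map the two chart percentages (0-100) to the tuning hints in the text.

    low_threshold is an arbitrary, hypothetical cut-off for "low".
    """
    hints = []
    if schedule_efficiency < low_threshold:
        # The runtime rarely schedules the core: feed it more work.
        hints.append("increase the number of application threads")
        hints.append("optimize CPU-side pre-processing, post-processing, "
                     "and operators the DPU does not support")
    if dpu_utilization < low_threshold:
        # The core is scheduled but its computing power is underused.
        hints.append("change the DPU configuration to reduce logic resources")
        hints.append("redesign the algorithm to better match DPU resources")
    return hints
```

For example, triage(85.0, 30.0) returns only the two scheduling hints, which points to the application threads and CPU-side workloads rather than to the DPU configuration.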