After the models are compiled and deployed on the edge DPU, the DExplorer utility can be used to perform fine-grained profiling, which reports layer-by-layer execution time and DDR memory bandwidth. This is useful for analyzing a model's performance bottlenecks.
Note: The model must be compiled by the Vitis AI compiler into a debug-mode kernel; fine-grained profiling is not available for normal-mode kernels.
There are two approaches to enable fine-grained profiling for a debug-mode kernel:
- Run dexplorer -m profile before launching the DPU application. This changes the N2Cube global running mode, so all DPU tasks (in debug mode) run in profiling mode.
- Call dpuEnableTaskProfile() to enable profiling mode for a specific DPU task only; other tasks are not affected.
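The per-task approach can be sketched as follows. dpuEnableTaskProfile() is called on the task handle before the task runs; the surrounding dpuOpen/dpuLoadKernel/dpuCreateTask/dpuRunTask calls are the standard N2Cube task lifecycle. Since those functions live in the DNNDK runtime on the board, they are stubbed out here so the flow can be compiled and shown off-target; the kernel name "resnet50" is illustrative.

```c
#include <stdio.h>

/* Stand-in types and stubs for the N2Cube runtime (normally provided
 * by the DNNDK headers/library on the target board). Only the call
 * order and the per-task profiling flag are modeled here. */
typedef struct { const char *name; } DPUKernel;
typedef struct { int profile_enabled; } DPUTask;

static DPUKernel kernel_storage;
static DPUTask task_storage;

static int dpuOpen(void) { return 0; }                  /* stub */
static DPUKernel *dpuLoadKernel(const char *name) {     /* stub */
    kernel_storage.name = name;
    return &kernel_storage;
}
static DPUTask *dpuCreateTask(DPUKernel *k, int mode) { /* stub */
    (void)k; (void)mode;
    task_storage.profile_enabled = 0;
    return &task_storage;
}
static int dpuEnableTaskProfile(DPUTask *t) {           /* stub */
    t->profile_enabled = 1;
    return 0;
}
static int dpuRunTask(DPUTask *t) {                     /* stub */
    printf("running task, profiling %s\n",
           t->profile_enabled ? "on" : "off");
    return 0;
}
static int dpuDestroyTask(DPUTask *t) { (void)t; return 0; }
static int dpuDestroyKernel(DPUKernel *k) { (void)k; return 0; }
static int dpuClose(void) { return 0; }

/* Typical flow: create the task from a debug-mode kernel, enable
 * profiling on this task only, then run it. Returns 1 when profiling
 * was enabled on the task before it ran. */
int run_resnet50_with_profiling(void) {
    dpuOpen();
    DPUKernel *kernel = dpuLoadKernel("resnet50"); /* debug-mode kernel */
    DPUTask *task = dpuCreateTask(kernel, 0);

    dpuEnableTaskProfile(task); /* affects this task only */
    int profiling = task->profile_enabled;

    dpuRunTask(task); /* layer-by-layer timings are emitted at run time */

    dpuDestroyTask(task);
    dpuDestroyKernel(kernel);
    dpuClose();
    return profiling;
}
```

On the board, the stubs are replaced by the real runtime, and the same call sequence applies; tasks created without the dpuEnableTaskProfile() call keep the default (non-profiling) mode.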
The following figure shows a profiling screenshot for the ResNet50 model. The profiling information for each DPU layer (or node) of the ResNet50 kernel is listed.
Note: Each DPU node may include several layers or operators from the original Caffe or TensorFlow model, because the Vitis AI compiler performs layer/operator fusion to optimize execution performance and DDR memory access.
Figure 1. Fine-grained Profiling for ResNet50
The following fields are included:
- ID: the index of the DPU node.
- Node Name: the name of the DPU node.
- Workload (MOP): the computation workload of the node (one MAC counts as two operations).
- Mem (MB): the memory size for the code, parameters, and feature maps of this DPU node.
- Runtime (ms): the execution time in milliseconds.
- Perf (GOPS): the DPU performance in giga-operations per second.
- Utilization (%): the DPU utilization in percent.
- MB/S: the average DDR memory access bandwidth.
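The derived columns follow directly from the measured ones, as the sketch below shows. Conveniently, MOP divided by ms gives GOPS (1e6 ops / 1e-3 s = 1e9 ops/s). The sample numbers and the peak-GOPS figure are illustrative assumptions, not values taken from the figure; the actual peak depends on the DPU configuration and clock.

```c
/* How the profiler's derived columns relate to the measured ones.
 * All sample values below are made up for illustration. */

/* Perf (GOPS) = Workload (MOP) / Runtime (ms). */
double perf_gops(double workload_mop, double runtime_ms) {
    return workload_mop / runtime_ms;
}

/* Utilization (%) relative to an assumed device peak in GOPS. */
double utilization_pct(double perf_gops_val, double peak_gops) {
    return 100.0 * perf_gops_val / peak_gops;
}

/* Average DDR bandwidth (MB/S) = Mem (MB) / runtime in seconds. */
double bandwidth_mbs(double mem_mb, double runtime_ms) {
    return mem_mb * 1000.0 / runtime_ms;
}
```

For example, a node with a 231 MOP workload that runs in 0.5 ms achieves 462 GOPS, and moving 2.6 MB in that time corresponds to an average bandwidth of 5200 MB/S.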