After the models are compiled and deployed on the edge DPU, the DExplorer utility can be used to perform fine-grained profiling, which reports layer-by-layer execution time and DDR memory bandwidth. This is useful for analyzing a model's performance bottlenecks.
Note: The model must be compiled by the Vitis AI compiler into a debug mode kernel; fine-grained profiling is not available for normal mode kernels.
There are two approaches to enable fine-grained profiling for a debug mode kernel:
- Run dexplorer -m profile before launching the DPU application. This changes the N2Cube global running mode, and all DPU tasks (debug mode) then run in profiling mode.
- Use dpuCreateTask() with the flag T_MODE_PROF, or dpuEnableTaskProfile(), to enable profiling mode for a specific DPU task only. Other tasks are not affected.
The following figure shows a profiling screen capture for the ResNet50 model. The profiling information for each DPU layer (or node) of the ResNet50 kernel is listed.
Note: Each DPU node may include several layers or operators from the original Caffe or TensorFlow model, because the Vitis AI compiler performs layer/operator fusion to optimize execution performance and DDR memory access.
Figure 1. Fine-grained Profiling for ResNet50
The following fields are included:
- ID: The index ID of the DPU node.
- NodeName: The DPU node name.
- Workload (MOP): The computation workload in millions of operations (one MAC counts as two operations).
- Mem (MB): The memory size for code, parameters, and feature maps for this DPU node.
- Runtime (ms): The execution time in milliseconds.
- Perf (GOPS): The DPU performance in giga-operations per second.
- Utilization (%): The DPU utilization in percent.
- MB/S: The average DDR memory access bandwidth.
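The units of these fields imply a simple relationship that can help when reading the table: one MOP per millisecond equals one GOPS, so Perf (GOPS) should be close to Workload (MOP) divided by Runtime (ms). The Python sketch below illustrates that arithmetic and ranks nodes by runtime to highlight likely bottlenecks. It is not part of DExplorer. The node names, the numbers, and the peak throughput value are made up for illustration, and the utilization formula (achieved throughput divided by peak throughput) is an assumption about how the column could be derived, not a documented definition.

```python
from typing import List, NamedTuple


class NodeProfile(NamedTuple):
    """One row of the profiling table (hypothetical container type)."""
    node_id: int
    name: str
    workload_mop: float  # Workload (MOP)
    runtime_ms: float    # Runtime (ms)


def perf_gops(node: NodeProfile) -> float:
    # MOP / ms = 1e6 ops / 1e-3 s = 1e9 ops/s, which is GOPS.
    return node.workload_mop / node.runtime_ms


def utilization_pct(node: NodeProfile, peak_gops: float) -> float:
    # Assumption: utilization is achieved throughput over peak throughput.
    return 100.0 * perf_gops(node) / peak_gops


def slowest_nodes(nodes: List[NodeProfile], top_n: int = 3) -> List[NodeProfile]:
    # Nodes with the longest runtime are the first candidates for tuning.
    return sorted(nodes, key=lambda n: n.runtime_ms, reverse=True)[:top_n]


# Made-up rows for illustration only; not taken from a real profiling run.
EXAMPLE_NODES = [
    NodeProfile(1, "conv1", 236.0, 0.5),
    NodeProfile(2, "res2a_branch2a", 25.6, 0.1),
    NodeProfile(3, "fc1000", 4.0, 2.0),
]

# Hypothetical peak throughput; the real value depends on the DPU configuration.
ASSUMED_PEAK_GOPS = 1000.0

if __name__ == "__main__":
    for node in slowest_nodes(EXAMPLE_NODES):
        print(
            f"{node.name}: {node.runtime_ms} ms, "
            f"{perf_gops(node):.1f} GOPS, "
            f"{utilization_pct(node, ASSUMED_PEAK_GOPS):.1f}% of assumed peak"
        )
```

A node that has a long runtime but low Perf and Utilization is more likely limited by memory access than by computation, so its MB/S value is worth checking next.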