The detailed hardware architecture of the DPUCVDX8H is shown in the following figure. Each implementation can have one DPU instance, and each DPU instance can have two, four, six, or eight processing engine instances. The number of DPU instances depends on the available FPGA resources.
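The processing engine counts listed above can be captured as a small configuration check. The following is a minimal sketch only: `DpuConfig` and `VALID_PE_COUNTS` are hypothetical names chosen for illustration and are not part of any vendor tool or API.

```python
from dataclasses import dataclass

# Processing engine counts per DPU instance, taken from the text above.
VALID_PE_COUNTS = (2, 4, 6, 8)


@dataclass(frozen=True)
class DpuConfig:
    """Hypothetical description of one DPU instance (illustration only)."""

    num_processing_engines: int

    def __post_init__(self):
        # Reject any count the text does not list as supported.
        if self.num_processing_engines not in VALID_PE_COUNTS:
            raise ValueError(
                f"unsupported processing engine count: "
                f"{self.num_processing_engines}; expected one of {VALID_PE_COUNTS}"
            )


# Usage: a DPU instance with six processing engines.
config = DpuConfig(num_processing_engines=6)
```

Whether a given count actually fits on a device still depends on the available FPGA resources, which this sketch does not model.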
The Conv computing unit is implemented on the AI Engine. The Conv control unit, Load unit, Save unit, and MISC unit (pooling and element-wise processing) are implemented in programmable logic. All processing engines share the weight unit and the scheduler unit, which are also implemented in programmable logic. DRAM serves as system memory and stores network instructions, input images, output results, and intermediate data. After bring-up, the DPU fetches instructions from system memory to control the operation of the computing engine.
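The placement of each unit described above can be summarized as a lookup table. This is an illustrative sketch: the unit keys, the `Fabric` enum, and the helper functions are hypothetical names, and the `shared` flag is set only for the two units that the text explicitly describes as shared.

```python
from enum import Enum
from typing import NamedTuple


class Fabric(Enum):
    AI_ENGINE = "AI Engine"
    PROGRAMMABLE_LOGIC = "programmable logic"


class Placement(NamedTuple):
    fabric: Fabric
    shared: bool  # True only where the text says all processing engines share the unit


# Unit names are illustrative keys; the placement follows the paragraph above.
UNIT_PLACEMENT = {
    "conv_compute": Placement(Fabric.AI_ENGINE, False),
    "conv_control": Placement(Fabric.PROGRAMMABLE_LOGIC, False),
    "load": Placement(Fabric.PROGRAMMABLE_LOGIC, False),
    "save": Placement(Fabric.PROGRAMMABLE_LOGIC, False),
    "misc": Placement(Fabric.PROGRAMMABLE_LOGIC, False),  # pooling, element-wise
    "weight": Placement(Fabric.PROGRAMMABLE_LOGIC, True),
    "scheduler": Placement(Fabric.PROGRAMMABLE_LOGIC, True),
}


def units_on(fabric):
    """Return the sorted names of units implemented on the given fabric."""
    return sorted(name for name, p in UNIT_PLACEMENT.items() if p.fabric is fabric)


def shared_units():
    """Return the sorted names of units shared by all processing engines."""
    return sorted(name for name, p in UNIT_PLACEMENT.items() if p.shared)
```

In this table, only the Conv computing unit maps to the AI Engine; every other unit is in programmable logic.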
On-chip memory buffers weights, biases, and intermediate data to achieve high throughput. Feature map banks are private to each batch engine. All processing engines in the same DPU instance share the weights buffer. Data is reused as much as possible to reduce the required memory bandwidth. The Conv processing engines (PEs) take full advantage of the computing power of the AI Engine to achieve high performance.
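To see why sharing the weights buffer reduces memory bandwidth, consider a deliberately simplified model. It assumes that the weights of a layer are read from DRAM once per buffer that must hold them, and it ignores buffer capacity, tiling, and feature map traffic. The function name and the byte counts are hypothetical and serve only as an illustration.

```python
def weight_traffic_bytes(weight_bytes: int, num_engines: int, shared: bool) -> int:
    """Estimate DRAM bytes read to deliver one layer's weights to all engines.

    Simplified assumption: each distinct weights buffer is filled from DRAM
    exactly once per layer.
    """
    if weight_bytes < 0 or num_engines < 1:
        raise ValueError("weight_bytes must be >= 0 and num_engines >= 1")
    if shared:
        # One buffer serves every processing engine in the DPU instance.
        return weight_bytes
    # Hypothetical alternative: every engine keeps its own private copy.
    return weight_bytes * num_engines


# Usage: a layer with 1 MiB of weights on an eight-engine DPU instance.
layer_weights = 1024 * 1024
shared_traffic = weight_traffic_bytes(layer_weights, 8, shared=True)
private_traffic = weight_traffic_bytes(layer_weights, 8, shared=False)
```

Under these assumptions, the shared buffer cuts weight traffic by a factor equal to the number of processing engines, while the private feature map banks let each engine work on its own input independently.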