Hardware Architecture - 1.0 English

DPUCAHX8H for Convolutional Neural Networks (PG367)

The detailed hardware architecture of the DPUCAHX8H is shown in the following figure. Each implementation has one to three DPU cores, and each DPU core has one to five processing engines (PEs). The number of cores and PEs per core is chosen to balance throughput requirements against FPGA resource usage. The HBM memory space is divided into virtual banks and system memory: the virtual banks store temporary data, while the system memory stores instructions, input images, output results, and user data. After start-up, the DPU fetches model instructions from system memory to control the operation of the computing engine. These instructions are generated by the Vitis AI compiler, which runs on the host server and performs substantial optimizations.
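The memory split and configuration space described above can be sketched as a small Python model. This is purely illustrative: the class and region names below are assumptions for clarity, not actual DPUCAHX8H address-map or register definitions; only the core/PE count limits and the division into virtual banks and system memory come from the text.

```python
from dataclasses import dataclass, field

# Hypothetical model of the HBM memory split described above.
# Region names are illustrative, not real address-map identifiers.

@dataclass
class SystemMemory:
    instructions: list = field(default_factory=list)   # compiled model code
    input_images: list = field(default_factory=list)
    output_results: list = field(default_factory=list)
    user_data: dict = field(default_factory=dict)

@dataclass
class HbmSpace:
    virtual_banks: list       # temporary (intermediate) data, one list per bank
    system: SystemMemory

def make_hbm(num_virtual_banks: int) -> HbmSpace:
    return HbmSpace(virtual_banks=[[] for _ in range(num_virtual_banks)],
                    system=SystemMemory())

# Per the text, an implementation has 1-3 DPU cores with 1-5 PEs each.
def valid_config(cores: int, pes_per_core: int) -> bool:
    return 1 <= cores <= 3 and 1 <= pes_per_core <= 5
```

The `valid_config` helper encodes only the scaling limits stated in this section; the actual choice within those limits is made at build time against throughput and FPGA resource targets.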

On-chip memory is used to buffer weights, biases, intermediate data, and output data to achieve high throughput and efficiency. The local buffer is private to each PE, while the global buffer is shared by all PEs in the same DPU core. The computing engine uses a deeply pipelined design. The PEs, which include the conv engine, the depthwise conv engine, and miscellaneous logic, take full advantage of fine-grained building blocks such as the multipliers, adders, and accumulators in Xilinx® devices.
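The two-level buffer hierarchy (a private local buffer per PE, one global buffer shared per core) can be sketched as follows. All class and method names here are assumptions for illustration; they are not Xilinx API identifiers.

```python
# Hypothetical sketch of the on-chip buffer hierarchy: each PE owns a
# private local buffer; all PEs in a core share one global buffer.

class ProcessingEngine:
    def __init__(self, pe_id: int):
        self.pe_id = pe_id
        # Private to this PE: e.g., weights, biases, partial results.
        self.local_buffer = {}

class DpuCore:
    def __init__(self, core_id: int, num_pes: int):
        self.core_id = core_id
        # Shared by every PE in this core.
        self.global_buffer = {}
        self.pes = [ProcessingEngine(i) for i in range(num_pes)]

    def share(self, key, value):
        # Data placed in the global buffer is visible to all PEs.
        self.global_buffer[key] = value

core = DpuCore(core_id=0, num_pes=5)
core.share("fmap_tile", [0.5, 0.25])
# Each PE can read the same shared tile from the core's global buffer,
# while its local_buffer remains private.
```

The design point this models is the trade-off the section describes: shared data lives once per core in the global buffer, while per-PE working data stays in each PE's local buffer for fast private access.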

Figure 1. DPU Hardware Architecture