The DPU can be configured with predefined options, which include the number of DPU cores, the convolution architecture, DSP cascade, DSP usage, and UltraRAM usage. These options allow you to set the DSP slice, LUT, block RAM, and UltraRAM usage. The following figure shows the configuration page of the DPU.
- Number of DPU Cores
- A maximum of four cores can be selected in one DPU IP. Using multiple DPU cores achieves higher performance, at the cost of more programmable logic resources.
Contact your local Xilinx sales representative if more than four cores are required.
- Architecture of DPU
- The DPU IP can be
configured with various convolution architectures which are related to the
parallelism of the convolution unit. The architectures for the DPU IP include B512, B800, B1024,
B1152, B1600, B2304, B3136, and B4096.
There are three dimensions of parallelism in the DPU convolution architecture: pixel parallelism, input channel parallelism, and output channel parallelism. The input channel parallelism is always equal to the output channel parallelism (this is equivalent to channel_parallel in the previous table). The different architectures require different programmable logic resources. The larger architectures can achieve higher performance with more resources. The parallelism for the different architectures is listed in the following table.
Table 1. Parallelism for Different Convolution Architectures

DPU Architecture | Pixel Parallelism (PP) | Input Channel Parallelism (ICP) | Output Channel Parallelism (OCP) | Peak Ops (operations per clock) |
---|---|---|---|---|
B512 | 4 | 8 | 8 | 512 |
B800 | 4 | 10 | 10 | 800 |
B1024 | 8 | 8 | 8 | 1024 |
B1152 | 4 | 12 | 12 | 1152 |
B1600 | 8 | 10 | 10 | 1600 |
B2304 | 8 | 12 | 12 | 2304 |
B3136 | 8 | 14 | 14 | 3136 |
B4096 | 8 | 16 | 16 | 4096 |

- In each clock cycle, the convolution array performs a multiplication and an accumulation, which are counted as two operations. Thus, the peak number of operations per cycle is equal to PP*ICP*OCP*2.
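The peak-operations formula can be checked with a short script. This is only an illustrative sketch: the function names and the `clock_mhz` parameter are hypothetical helpers, and any clock frequency passed in is an assumption of the caller, not a figure from this section.

```python
# (PP, ICP, OCP) for each architecture, copied from Table 1.
PARALLELISM = {
    "B512": (4, 8, 8),
    "B800": (4, 10, 10),
    "B1024": (8, 8, 8),
    "B1152": (4, 12, 12),
    "B1600": (8, 10, 10),
    "B2304": (8, 12, 12),
    "B3136": (8, 14, 14),
    "B4096": (8, 16, 16),
}


def peak_ops_per_cycle(arch):
    """Peak operations per clock: PP * ICP * OCP * 2 (multiply + accumulate)."""
    pp, icp, ocp = PARALLELISM[arch]
    return pp * icp * ocp * 2


def peak_gops(arch, clock_mhz):
    """Theoretical peak GOPS for one core at a caller-supplied clock (hypothetical helper)."""
    return peak_ops_per_cycle(arch) * clock_mhz / 1000.0


if __name__ == "__main__":
    for name in PARALLELISM:
        print(name, peak_ops_per_cycle(name))
```

Because the architecture name encodes the peak operation count, comparing `peak_ops_per_cycle(name)` with the number in the name is a quick sanity check on the table.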
- Resource Utilization
- The resource utilization of a reference single-core DPU project is as follows. The data is based on the ZCU102 platform with Low RAM Usage, Depthwise Convolution, Average Pooling, Channel Augmentation, Leaky ReLU + ReLU6 features, and Low DSP Usage.
Table 2. Resources of Different DPU Architectures

DPU Architecture | LUT | Register | Block RAM | DSP |
---|---|---|---|---|
B512 (4x8x8) | 27893 | 35435 | 73.5 | 78 |
B800 (4x10x10) | 30468 | 42773 | 91.5 | 117 |
B1024 (8x8x8) | 34471 | 50763 | 105.5 | 154 |
B1152 (4x12x12) | 33238 | 49040 | 123 | 164 |
B1600 (8x10x10) | 38716 | 63033 | 127.5 | 232 |
B2304 (8x12x12) | 42842 | 73326 | 167 | 326 |
B3136 (8x14x14) | 47667 | 85778 | 210 | 436 |
B4096 (8x16x16) | 53540 | 105008 | 257 | 562 |
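For early planning, the single-core figures above can be turned into a rough multi-core estimate. The sketch below assumes that resources scale linearly with the number of cores, which is a simplification and not a statement from this section; the function names and the `budget` dictionary (device capacity supplied by the caller) are hypothetical.

```python
# Single-core figures from Table 2: (LUT, register, block RAM, DSP).
SINGLE_CORE = {
    "B512": (27893, 35435, 73.5, 78),
    "B800": (30468, 42773, 91.5, 117),
    "B1024": (34471, 50763, 105.5, 154),
    "B1152": (33238, 49040, 123, 164),
    "B1600": (38716, 63033, 127.5, 232),
    "B2304": (42842, 73326, 167, 326),
    "B3136": (47667, 85778, 210, 436),
    "B4096": (53540, 105008, 257, 562),
}
FIELDS = ("lut", "register", "bram", "dsp")


def estimate(arch, cores=1):
    """Rough estimate; ASSUMES linear scaling with the number of cores."""
    if not 1 <= cores <= 4:
        raise ValueError("one DPU IP supports 1 to 4 cores")
    return {f: v * cores for f, v in zip(FIELDS, SINGLE_CORE[arch])}


def fits(arch, cores, budget):
    """True if the estimate stays within a caller-supplied (hypothetical) budget."""
    need = estimate(arch, cores)
    return all(need[f] <= budget[f] for f in FIELDS)


if __name__ == "__main__":
    print(estimate("B4096", cores=2))
```

Actual utilization depends on the enabled features and on the implementation run, so the result is a starting point rather than a guarantee.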
- RAM Usage
- The weights, bias, and intermediate features are buffered in
the on-chip memory. The on-chip memory consists of RAM which can be instantiated
as block RAM and UltraRAM. The RAM Usage option determines the total amount of
on-chip memory used in different DPU architectures, and the setting is for all the
DPU cores in the DPU IP. High RAM Usage means that the
on-chip memory block will be larger, allowing the DPU more flexibility in handling the intermediate data.
High RAM Usage implies higher performance in each DPU
core. The number of BRAM36K
blocks used in different architectures for low and high RAM Usage is illustrated
in the following table.
Note: The DPU instruction set for different options of RAM Usage is different. When the RAM Usage option is modified, the DPU instructions file should be regenerated by recompiling the neural network. The following results are based on a DPU with depthwise convolution.
Table 3. Number of BRAM36K Blocks in Different Architectures for Each DPU Core

DPU Architecture | Low RAM Usage | High RAM Usage |
---|---|---|
B512 (4x8x8) | 73.5 | 89.5 |
B800 (4x10x10) | 91.5 | 109.5 |
B1024 (8x8x8) | 105.5 | 137.5 |
B1152 (4x12x12) | 123 | 145 |
B1600 (8x10x10) | 127.5 | 163.5 |
B2304 (8x12x12) | 167 | 211 |
B3136 (8x14x14) | 210 | 262 |
B4096 (8x16x16) | 257 | 317.5 |

- Channel Augmentation
- Channel augmentation is an optional feature for improving the
efficiency of the DPU when the
number of input channels is much lower than the available channel parallelism.
For example, the input channel of the first layer in most CNNs is three, which
does not fully use all the available hardware channels. However, when the number
of input channels is larger than the channel parallelism, enabling channel
augmentation provides no such benefit.
Thus, channel augmentation can improve the total efficiency for most CNNs, but it costs extra logic resources. The following table lists the extra LUT resources used with channel augmentation; the figures are for reference only.
Table 4. Extra LUTs of DPU with Channel Augmentation

DPU Architecture | Extra LUTs with Channel Augmentation |
---|---|
B512(4x8x8) | 3121 |
B800(4x10x10) | 2624 |
B1024(8x8x8) | 3133 |
B1152(4x12x12) | 1744 |
B1600(8x10x10) | 2476 |
B2304(8x12x12) | 1710 |
B3136(8x14x14) | 1946 |
B4096(8x16x16) | 1701 |

- DepthwiseConv
- In standard convolution, each input channel is convolved with its own
specific kernel, and the result is then obtained by combining the results of
all channels.
In depthwise separable convolution, the operation is performed in two steps: depthwise convolution and pointwise convolution. Depthwise convolution is performed for each feature map separately as shown on the left side of the following figure. The next step is to perform pointwise convolution, which is the same as standard convolution with kernel size 1x1. The parallelism of depthwise convolution is half that of the pixel parallelism.
Figure 2. Depthwise Convolution and Pointwise Convolution
Table 5. Extra Resources of DPU with Depthwise Convolution

DPU Architecture | Extra LUTs | Extra BRAMs | Extra DSPs |
---|---|---|---|
B512(4x8x8) | 1734 | 4 | 12 |
B800(4x10x10) | 2293 | 4.5 | 15 |
B1024(8x8x8) | 2744 | 4 | 24 |
B1152(4x12x12) | 2365 | 5.5 | 18 |
B1600(8x10x10) | 3392 | 4.5 | 30 |
B2304(8x12x12) | 3943 | 5.5 | 36 |
B3136(8x14x14) | 4269 | 6.5 | 42 |
B4096(8x16x16) | 4930 | 7.5 | 48 |
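The two-step depthwise separable convolution described in this item can be sketched in plain Python. This is a software illustration only, not the DPU's hardware implementation; it assumes stride 1 and no padding, and the function names and the nested-list `[channel][row][column]` layout are choices made for the example.

```python
def depthwise_conv(x, dw_kernels):
    """Step 1: filter each channel with its own KxK kernel.

    x: [C][H][W], dw_kernels: [C][K][K]; stride 1, no padding (assumed).
    """
    out = []
    for channel, kernel in zip(x, dw_kernels):
        k = len(kernel)
        h_out = len(channel) - k + 1
        w_out = len(channel[0]) - k + 1
        out.append([
            [sum(channel[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(w_out)]
            for i in range(h_out)
        ])
    return out


def pointwise_conv(x, pw_weights):
    """Step 2: 1x1 convolution that mixes channels.

    x: [C_in][H][W], pw_weights: [C_out][C_in].
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wt * x[c][i][j] for c, wt in enumerate(row))
          for j in range(w)]
         for i in range(h)]
        for row in pw_weights
    ]


def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Depthwise convolution followed by pointwise convolution."""
    return pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)
```

The depthwise step never mixes channels; all cross-channel combination happens in the 1x1 pointwise step.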
- ElementWise Multiply
- The ElementWise Multiply (EM) option can perform dot multiplication on at most two input
feature maps. The number of input channels for EM ranges from 1 to 256 * channel_parallel.
Note: ElementWise Multiply is currently not supported on Zynq-7000 devices.
The extra resources with ElementWise Multiply are listed in the following table.
Table 6. Extra Resources of DPU with ElementWise Multiply

DPU Architecture | Extra LUTs | Extra FFs (1) | Extra DSPs |
---|---|---|---|
B512(4x8x8) | 159 | -113 | 8 |
B800(4x10x10) | 295 | -93 | 10 |
B1024(8x8x8) | 211 | -65 | 8 |
B1152(4x12x12) | 364 | -274 | 12 |
B1600(8x10x10) | 111 | 292 | 10 |
B2304(8x12x12) | 210 | -158 | 12 |
B3136(8x14x14) | 329 | -267 | 14 |
B4096(8x16x16) | 287 | 78 | 16 |

1. Negative numbers imply a relative decrease.
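The channel limit for ElementWise Multiply can be expressed as a small check. The sketch below assumes that "dot multiplication" means an element-by-element product of two equally shaped feature maps, as the option name suggests; the function names and the nested-list `[channel][row][column]` layout are hypothetical, and the code models only the arithmetic, not the hardware.

```python
# channel_parallel per architecture (equal to ICP and OCP).
CHANNEL_PARALLEL = {
    "B512": 8, "B800": 10, "B1024": 8, "B1152": 12,
    "B1600": 10, "B2304": 12, "B3136": 14, "B4096": 16,
}


def em_channel_range(arch):
    """Supported input channel range: 1 to 256 * channel_parallel."""
    return 1, 256 * CHANNEL_PARALLEL[arch]


def elementwise_multiply(a, b, arch):
    """Element-by-element product of two feature maps shaped [C][H][W]."""
    low, high = em_channel_range(arch)
    if len(a) != len(b):
        raise ValueError("both feature maps need the same number of channels")
    if not low <= len(a) <= high:
        raise ValueError("channel count outside the supported range")
    return [
        [[p * q for p, q in zip(row_a, row_b)]
         for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]
```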
- AveragePool
- The AveragePool option determines whether the average pooling
operation is performed on the DPU. The supported sizes range from 2x2 to 8x8
(2x2, 3x3, …, 8x8); only square sizes are supported.
The extra resources with Average Pool are listed in the following table.
Table 7. Extra LUTs of DPU with Average Pool

DPU Architecture | Extra LUTs |
---|---|
B512(4x8x8) | 1507 |
B800(4x10x10) | 2016 |
B1024(8x8x8) | 1564 |
B1152(4x12x12) | 2352 |
B1600(8x10x10) | 1862 |
B2304(8x12x12) | 2338 |
B3136(8x14x14) | 2574 |
B4096(8x16x16) | 3081 |
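The size constraint on average pooling is easy to encode when deciding which pooling layers of a network fall inside the supported range. The helper below is a hypothetical illustration of that rule only; it does not model stride, padding, or any other constraint.

```python
def is_supported_avgpool(kernel_h, kernel_w):
    """True for the square sizes 2x2 through 8x8 named in this section."""
    return kernel_h == kernel_w and 2 <= kernel_h <= 8


if __name__ == "__main__":
    # Hypothetical pooling layers of a network, as (height, width) pairs.
    layers = [(2, 2), (7, 7), (3, 2), (9, 9)]
    for size in layers:
        print(size, is_supported_avgpool(*size))
```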
- ReLU Type
- The ReLU Type option determines which kind of ReLU function can be used in
the DPU. ReLU and ReLU6 are
supported by default.
The option “ReLU + LeakyReLU + ReLU6” means that LeakyReLU becomes available as an activation function.
Note: The LeakyReLU coefficient is fixed to 0.1.

Table 8. Extra LUTs with ReLU + LeakyReLU + ReLU6 Compared to ReLU + ReLU6

DPU Architecture | Extra LUTs |
---|---|
B512(4x8x8) | 347 |
B800(4x10x10) | 725 |
B1024(8x8x8) | 451 |
B1152(4x12x12) | 780 |
B1600(8x10x10) | 467 |
B2304(8x12x12) | 706 |
B3136(8x14x14) | 831 |
B4096(8x16x16) | 925 |

- Softmax
- This option allows the softmax function to be implemented in hardware. The
hardware implementation of softmax can be 160 times faster than a software
implementation. Enabling this option depends on the available hardware resources
and desired throughput.
When softmax is enabled, an AXI master interface named SFM_M_AXI and an interrupt port named sfm_interrupt will appear in the DPU IP. The softmax module uses m_axi_dpu_aclk as the AXI clock for SFM_M_AXI as well as for computation. The softmax function is not supported on DPUs targeting Zynq®-7000 devices.
The extra resources with Softmax enabled are listed in the following table.
Table 9. Extra Resources with Softmax

IP Name | Extra LUTs | Extra FFs | Extra BRAMs | Extra DSPs |
---|---|---|---|---|
Softmax | 9580 | 8019 | 4 | 14 |
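For reference, the function that this option moves into hardware can be written in software as follows. This is a generic, numerically stable floating-point softmax, shown only to make the operation concrete; it is not the DPU's hardware implementation, and the function name is chosen for the example.

```python
import math


def softmax(logits):
    """Softmax over a list of floats.

    Subtracting the maximum first keeps exp() from overflowing and does not
    change the result.
    """
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(softmax([1.0, 2.0, 3.0]))
```

A software version like this is the usual fallback when the hardware option is disabled or unavailable, such as on Zynq-7000 targets.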