In this section, the reference design is in `testcase_dmafifo_opt`. The analysis above shows that the design has the following bottlenecks:
The interface bandwidth is not optimal. The design uses a PLIO width of 32 bits running at 312.5 MHz. Change it to 128 bits running at 312.5 MHz. The relevant code is in `aie/graph.h`:

```
in=input_plio::create("Datain0", plio_128_bits, "data/input.txt");
dataout=output_plio::create("Dataout0", plio_128_bits, "data/output.txt");
```
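The benefit of the wider PLIO follows from simple arithmetic: peak interface bandwidth is the width in bytes multiplied by the PL clock frequency. The following is a minimal sketch of that calculation; `plio_bandwidth_mbps` is a hypothetical helper written for illustration and is not part of the ADF API.

```cpp
// Hypothetical helper (not part of the ADF API): peak PLIO bandwidth in MB/s
// for a given interface width and PL clock frequency.
constexpr double plio_bandwidth_mbps(int width_bits, double freq_mhz) {
    return (width_bits / 8.0) * freq_mhz;  // bytes per cycle * 10^6 cycles per second
}

// 32-bit PLIO at 312.5 MHz: 4 bytes per cycle -> 1250 MB/s
static_assert(plio_bandwidth_mbps(32, 312.5) == 1250.0, "32-bit PLIO peak");
// 128-bit PLIO at 312.5 MHz: 16 bytes per cycle -> 5000 MB/s
static_assert(plio_bandwidth_mbps(128, 312.5) == 5000.0, "128-bit PLIO peak");
```

At 32 bits the interface peaks at 1250 MB/s, so it caps the design well below its 4 GBps limit. At 128 bits the peak is 5000 MB/s, and the PLIO is no longer the limiting factor.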
The overhead of the graph iterations is too large. The hierarchy of the design should not be touched. Increase the buffer size from 128 bytes to 4096 bytes (1024 `int32` samples, matching `extents<1024>` in the kernel). To avoid deadlock, the FIFO size also needs to be increased. The relevant code is in `aie/graph.h`:

```
connect< >net0(in, k[0].in[0]);
connect< stream >net1(k[0].out[0], k[1].in[0]);
connect< >net2(k[0].out[1], k[1].in[1]);
connect< stream >net3(k[1].out[0], dataout);
fifo_depth(net1)=1024;
```
Note: When the FIFO depth is large, the DMA FIFO is used. Keep the FIFO depth below 8192 for a single DMA FIFO.
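To see why a larger buffer reduces iteration overhead, consider how many graph iterations are needed to move a fixed amount of data. The sketch below is illustrative arithmetic only: the total sample count is a hypothetical workload, and `iterations_needed` is a helper introduced for this example. It relies only on the 32x growth in buffer size (4096 / 128) and on the 1024-sample buffer used by the optimized kernel.

```cpp
// Illustrative arithmetic only; the total sample count below is hypothetical.
constexpr int kGrowthFactor = 4096 / 128;                 // buffer grows 32x
constexpr int kNewSamples   = 1024;                       // extents<1024> in the optimized kernel
constexpr int kOldSamples   = kNewSamples / kGrowthFactor; // 32 samples per buffer before

// Graph iterations needed to move `total_samples` through one buffer port.
constexpr int iterations_needed(int total_samples, int samples_per_buffer) {
    return (total_samples + samples_per_buffer - 1) / samples_per_buffer;  // round up
}

constexpr int kTotalSamples = 102400;  // hypothetical workload
static_assert(iterations_needed(kTotalSamples, kOldSamples) == 3200, "before");
static_assert(iterations_needed(kTotalSamples, kNewSamples) == 100, "after");
```

For the same data, the fixed per-iteration cost (lock acquisition, kernel entry and exit) is paid 32 times less often with the larger buffer.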
The kernel is not well pipelined. In addition to increasing the loop count to process more data, add more instructions to the loop body and add the `__restrict` keyword to the ports so that the tool can schedule instructions more freely. The optimized code for `aie_dest1` is as follows:

```
using namespace adf;
__attribute__ ((noinline)) void aie_dest1(input_buffer<int32,extents<1024>> & __restrict in,
        output_stream<int32> * __restrict out,
        output_buffer<int32,extents<1024>> & __restrict outm){
    auto inIter=aie::begin_vector<4>(in);
    auto outmIter=aie::begin_vector<4>(outm);
    aie::vector<int32,4> tmp;
    // 128 iterations x 2 vectors x 4 lanes = 1024 samples per buffer
    for(int i=0;i<128;i++)
    chess_prepare_for_pipelining
    {
        // Two vector transfers per iteration give the scheduler more work to overlap
        tmp=*inIter++;
        writeincr(out,tmp);
        *outmIter++=tmp;
        tmp=*inIter++;
        writeincr(out,tmp);
        *outmIter++=tmp;
    }
}
```
A similar optimization is applied to `aie_dest2`. For more information about loop analysis and optimization, refer to the AI Engine Kernel Coding Best Practices Guide.
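When the loop body is enlarged this way, the loop count must still cover the buffer exactly once. The following compile-time check is a small sketch of that bookkeeping for the `aie_dest1` kernel; the constant names and `loop_count_for` are hypothetical and exist only for this example.

```cpp
// Hypothetical names; the values mirror the optimized aie_dest1 kernel.
constexpr int kLanes          = 4;     // aie::vector<int32,4>
constexpr int kVectorsPerIter = 2;     // two vector transfers in the loop body
constexpr int kLoopCount      = 128;   // for(int i=0;i<128;i++)
constexpr int kBufferSamples  = 1024;  // extents<1024>

// The loop must consume the input buffer exactly once.
static_assert(kLoopCount * kVectorsPerIter * kLanes == kBufferSamples,
              "loop does not cover the buffer exactly");

// Loop count required for a given buffer size and loop-body shape.
constexpr int loop_count_for(int buffer_samples, int lanes, int vectors_per_iter) {
    return buffer_samples / (lanes * vectors_per_iter);
}
static_assert(loop_count_for(kBufferSamples, kLanes, kVectorsPerIter) == kLoopCount,
              "loop count mismatch");
```

If a third vector transfer were added to the loop body, 1024 would no longer divide evenly by 12 samples per iteration, so the loop shape and the buffer size have to be chosen together.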
After making these optimizations, run the following command:
```
make aiesim
```
The design performance increases from around 828 MBps to around 3748 MBps, which approaches the theoretical limit of the design (4 GBps).
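These figures can be expressed as a speedup and as an efficiency relative to the 4 GBps limit. The sketch below only restates the numbers quoted above; `speedup` and `efficiency` are hypothetical helpers, and 4 GBps is taken as 4000 MBps.

```cpp
// Hypothetical helpers that restate the simulation results as ratios.
constexpr double speedup(double before_mbps, double after_mbps) {
    return after_mbps / before_mbps;
}
constexpr double efficiency(double measured_mbps, double limit_mbps) {
    return measured_mbps / limit_mbps;
}

constexpr double kBefore = 828.0;   // MBps, before optimization
constexpr double kAfter  = 3748.0;  // MBps, after optimization
constexpr double kLimit  = 4000.0;  // MBps, theoretical limit (4 GBps)

// Roughly a 4.5x improvement ...
static_assert(speedup(kBefore, kAfter) > 4.52 && speedup(kBefore, kAfter) < 4.53, "speedup");
// ... reaching about 94% of the theoretical limit.
static_assert(efficiency(kAfter, kLimit) > 0.936 && efficiency(kAfter, kLimit) < 0.938, "efficiency");
```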
Next, run the design in hardware emulation:
```
make run_hw_emu
```
In QEMU, run the following commands:
```
mount /dev/mmcblk0p1 /mnt
cd /mnt
./host.exe a.xclbin
```
Build the design for hardware:
```
make package TARGET=hw
```
The performance in hardware is similar:
```
cycle count:110610
Throughput of the graph: 4628.88 MB/s
```
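The host output reports a cycle count together with a throughput. How the host derives the throughput is not shown in this section; one common approach, assumed for this sketch, is to divide the number of bytes transferred by the elapsed time, where elapsed time is the cycle count divided by the counter clock frequency. The helper name and all input values below are hypothetical and are not taken from the design.

```cpp
// Hypothetical helper: throughput in MB/s from a byte count, a cycle count,
// and the frequency (in MHz) of the clock that the cycle counter runs on.
// This is an assumed formula, not the host application's actual code.
constexpr double throughput_mbps(double bytes, double cycles, double freq_mhz) {
    return bytes / cycles * freq_mhz;  // bytes per cycle * 10^6 cycles per second
}

// Hypothetical example: 1,000,000 bytes in 100,000 cycles of a 250 MHz clock
// is 10 bytes per cycle, that is, 2500 MB/s.
static_assert(throughput_mbps(1000000.0, 100000.0, 250.0) == 2500.0, "example");
```

To reproduce the reported figure with this formula, substitute the actual byte count moved by the graph and the actual clock frequency of the counter used by the host.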