Design Optimization Considerations - 2022.2 English

Vitis Tutorials: AI Engine Development

Document ID
XD100
Release Date
2022-12-01
Version
2022.2 English

In this section, the reference design is in testcase_dmafifo_opt. The preceding analysis shows that the design has the following bottlenecks:

  • The interface bandwidth is not optimal. The design uses a PLIO width of 32 bits running at 250 MHz, which caps the interface at 1 GB/s. Change the width to 128 bits at the same 250 MHz. The relevant code is in aie/graph.h:

    in=input_plio::create("Datain0", plio_128_bits,  "data/input.txt");
    dataout=output_plio::create("Dataout0", plio_128_bits,  "data/output.txt");
    
  • The overhead of the graph iterations is too large. Without touching the hierarchy of the design, increase the window buffer size from 128 bytes to 4096 bytes so that each iteration processes more data. To avoid deadlock, the FIFO size must also be increased. The relevant code is in aie/graph.h:

    connect< window<4096> >net0(in, k[0].in[0]);
    connect< stream >net1(k[0].out[0], k[1].in[0]);
    connect< window<4096> >net2(k[0].out[1], k[1].in[1]);
    connect< stream >net3(k[1].out[0], dataout);
    fifo_depth(net1)=1024;
    

    Note: When the FIFO depth is large, a DMA FIFO is used. For a single DMA FIFO, keep the FIFO depth below 8192.

  • The kernel is not well pipelined. In addition to increasing the loop count to process more data, add more instructions to the loop body and add the __restrict keyword to the ports so that the tool can schedule instructions more freely. The optimized code for aie_dest1 is as follows:

    __attribute__ ((noinline)) void aie_dest1(input_window<int32> * __restrict in, 
            output_stream<int32> * __restrict out, output_window<int32> * __restrict outm){
    	aie::vector<int32,4> tmp;
    	for(int i=0;i<128;i++)
    	chess_prepare_for_pipelining
    	{
    		tmp=window_readincr_v<4>(in);
    		writeincr(out,tmp);
    		window_writeincr(outm,tmp);
    		tmp=window_readincr_v<4>(in);
    		writeincr(out,tmp);
    		window_writeincr(outm,tmp);
    	}
    }
    

    A similar optimization is applied to aie_dest2. For more information about loop analysis and optimization, refer to the AI Engine Kernel Coding Best Practices Guide.
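
As a quick sanity check on the numbers above, the amount of data that the optimized aie_dest1 loop consumes per invocation can be compared with the window size declared in the graph. The following is a minimal host-side C++ sketch, not AI Engine kernel code. The constant and function names are illustrative, and it assumes that the window<> template argument is a size in bytes.

```cpp
#include <cstdint>

// Values copied from the optimized aie_dest1 kernel and from aie/graph.h above.
constexpr int kLoopCount       = 128;   // for(int i=0;i<128;i++)
constexpr int kReadsPerIter    = 2;     // two window_readincr_v<4> calls per iteration
constexpr int kLanesPerRead    = 4;     // aie::vector<int32,4>
constexpr int kWindowSizeBytes = 4096;  // connect< window<4096> > (assumed to be bytes)
constexpr int kFifoDepth       = 1024;  // fifo_depth(net1)=1024
constexpr int kDmaFifoLimit    = 8192;  // single DMA FIFO depth must stay below this

// Bytes read from the input window in one kernel invocation.
constexpr int bytes_consumed_per_invocation() {
    return kLoopCount * kReadsPerIter * kLanesPerRead *
           static_cast<int>(sizeof(std::int32_t));
}

// The loop must consume exactly one window per invocation.
static_assert(bytes_consumed_per_invocation() == kWindowSizeBytes,
              "loop does not consume exactly one window");

// The requested FIFO depth must fit in a single DMA FIFO.
static_assert(kFifoDepth < kDmaFifoLimit,
              "FIFO depth too large for a single DMA FIFO");
```

Because both checks are static_assert statements, a mismatch between the loop bounds and the window size shows up as a compile error in this sketch rather than as a hang in simulation.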

After making these optimizations, run the following command:

```
make aiesim
```

The design performance increases from around 828 MB/s to around 3748 MB/s, which approaches the theoretical limit of the design (4 GB/s).
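
The 4 GB/s limit follows from the PLIO width and frequency: 128 bits at 250 MHz is 16 bytes per cycle times 250 million cycles per second. The following is a minimal host-side C++ sketch of that arithmetic. The function names are illustrative, and it assumes 1 MB = 10^6 bytes and that the PLIO interface is the limiting factor.

```cpp
#include <cstdint>

// Peak PLIO bandwidth in MB/s (1 MB = 10^6 bytes) for a given
// interface width in bits and clock frequency in MHz.
constexpr std::uint64_t plio_peak_mb_per_s(std::uint64_t width_bits,
                                           std::uint64_t freq_mhz) {
    return (width_bits / 8) * freq_mhz;  // bytes per cycle * million cycles per second
}

// Measured throughput as a whole-number percentage of the peak (rounded down).
constexpr std::uint64_t percent_of_peak(std::uint64_t measured_mb_per_s,
                                        std::uint64_t peak_mb_per_s) {
    return measured_mb_per_s * 100 / peak_mb_per_s;
}

constexpr std::uint64_t kOriginalPeak  = plio_peak_mb_per_s(32, 250);   // 32-bit PLIO
constexpr std::uint64_t kOptimizedPeak = plio_peak_mb_per_s(128, 250);  // 128-bit PLIO

static_assert(kOriginalPeak == 1000, "32 bits at 250 MHz is 1000 MB/s");
static_assert(kOptimizedPeak == 4000, "128 bits at 250 MHz is 4000 MB/s");
```

With these helpers, percent_of_peak(3748, kOptimizedPeak) evaluates to 93, so the optimized design reaches roughly 93% of the interface limit.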

Next, run the design in hardware emulation:

```
make run_hw_emu
```

In QEMU, run the following commands:

```
mount /dev/mmcblk0p1 /mnt
cd /mnt
./host.exe a.xclbin
```

Build the design for hardware:

```
make package TARGET=hw
```

The performance in hardware is similar:

```
cycle count:109435
Throughput of the graph: 3742.86 MB/s
```
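
The host code that prints this report is not shown in this section, so the relationship between the cycle count and the throughput is sketched here under stated assumptions only. The sketch assumes that the cycle count is measured on a 1 GHz clock and that 409,600 bytes are transferred during the measurement. Neither value is stated in this section, and the names below are hypothetical.

```cpp
// Throughput in MB/s (1 MB = 10^6 bytes) from a byte count, a cycle count,
// and the frequency of the clock the cycles were counted on.
constexpr double throughput_mb_per_s(double bytes, double cycles, double clock_hz) {
    return bytes / (cycles / clock_hz) / 1.0e6;
}

constexpr double kCycleCount     = 109435.0;  // from the report above
constexpr double kAssumedClockHz = 1.0e9;     // ASSUMPTION: 1 GHz, not stated in this section
constexpr double kAssumedBytes   = 409600.0;  // ASSUMPTION: total bytes, not stated in this section

constexpr double kThroughput =
    throughput_mb_per_s(kAssumedBytes, kCycleCount, kAssumedClockHz);

// Under the assumed inputs, the result lands on the reported figure.
static_assert(kThroughput > 3742.8 && kThroughput < 3742.9,
              "assumed inputs do not reproduce the reported throughput");
```

Under these assumed inputs the formula gives approximately 3742.86 MB/s, which is consistent with the reported value. The actual clock and data size used by the host application may differ.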