Design Optimization Considerations - 2023.2 English

Vitis Tutorials: AI Engine (XD100)

Document ID
XD100
Release Date
2024-03-05
Version
2023.2 English

In this section, the reference design is in testcase_dmafifo_opt. The preceding analysis shows that the design's performance is limited by the following issues:

  • The interface bandwidth is not optimal. The design uses a 32-bit PLIO running at 312.5 MHz; change it to a 128-bit PLIO at the same frequency. The relevant code is in aie/graph.h:

    in=input_plio::create("Datain0", plio_128_bits,  "data/input.txt");
    dataout=output_plio::create("Dataout0", plio_128_bits,  "data/output.txt");
    
  • The overhead of each graph iteration is too large. Without changing the design hierarchy, increase the buffer size from 128 bits to 4096 bits. To avoid deadlock, the FIFO depth on the stream connection must also be increased. The relevant code is in aie/graph.h:

    connect< >net0(in, k[0].in[0]);
    connect< stream >net1(k[0].out[0], k[1].in[0]);
    connect< >net2(k[0].out[1], k[1].in[1]);
    connect< stream >net3(k[1].out[0], dataout);
    fifo_depth(net1)=1024;   // deeper FIFO on the stream connection avoids deadlock
    

    Note: When the FIFO depth is large, a DMA FIFO is used. Do not set the depth of a single DMA FIFO to 8192 or larger.

  • The kernel is not well pipelined. In addition to increasing the loop count to process more data, unroll the loop body so that each iteration issues more instructions, and add the __restrict keyword to the kernel ports so that the tool can schedule instructions more freely. The optimized code for aie_dest1 is as follows:

    #include <adf.h>
    #include "aie_api/aie.hpp"   // AIE API vector types and iterators

    using namespace adf;
    __attribute__ ((noinline)) void aie_dest1(input_buffer<int32,extents<1024>> & __restrict in,
        output_stream<int32> * __restrict out, output_buffer<int32,extents<1024>> & __restrict outm){
    	auto inIter=aie::begin_vector<4>(in);
    	auto outmIter=aie::begin_vector<4>(outm);
    	aie::vector<int32,4> tmp;
    	// 128 iterations x 2 vectors x 4 lanes = 1024 samples per buffer
    	for(int i=0;i<128;i++)
    	chess_prepare_for_pipelining
    	{
    		tmp=*inIter++;
    		writeincr(out,tmp);   // copy the vector to the stream output
    		*outmIter++=tmp;      // and to the buffer output
    		tmp=*inIter++;
    		writeincr(out,tmp);
    		*outmIter++=tmp;
    	}
    }
    

    A similar optimization is applied to aie_dest2. For more information about loop analysis and optimization, refer to the AI Engine Kernel Coding Best Practices guide.
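The data movement performed by the kernel above can be sketched as a functional model in plain Python (this is not AIE code; the model name is illustrative). It shows that aie_dest1 is a pure fan-out: every input sample is duplicated to both the stream output and the buffer output.

```python
# Functional model of aie_dest1's data movement (illustrative, not AIE code).
def aie_dest1_model(samples):
    stream_out = []
    buffer_out = []
    for s in samples:           # the real kernel moves 4-lane int32 vectors
        stream_out.append(s)    # writeincr(out, tmp)
        buffer_out.append(s)    # *outmIter++ = tmp
    return stream_out, buffer_out

data = list(range(1024))        # one 1024-sample buffer iteration
s, b = aie_dest1_model(data)
print(s == data and b == data)  # True: both outputs are exact copies
```

Because the kernel only copies data, its throughput is bounded by the interfaces rather than by compute, which is why the interface and pipelining changes above dominate the optimization.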

After making these optimizations, run the following command:

```
make aiesim
```

It can be seen that the design throughput increases from around 828 MB/s to around 3748 MB/s. This is approaching the theoretical limit of the 128-bit PLIO interface (128 bits × 312.5 MHz = 5 GB/s).
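The interface ceiling can be checked with simple arithmetic: peak PLIO bandwidth is the interface width times the clock frequency. A minimal sketch (decimal MB/s; the helper name is illustrative, not a tool API):

```python
# Peak PLIO bandwidth = width (bits) / 8 * clock frequency (MHz) -> MB/s.
def plio_bandwidth_mbps(width_bits, freq_mhz):
    return width_bits / 8 * freq_mhz

before = plio_bandwidth_mbps(32, 312.5)    # original 32-bit interface
after  = plio_bandwidth_mbps(128, 312.5)   # widened 128-bit interface
print(before, after, after / before)       # 1250.0 5000.0 4.0
```

Widening the PLIO from 32 to 128 bits therefore quadruples the interface ceiling, which is what allows the measured throughput to climb well past the original 1250 MB/s bound.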

Next, run the design in hardware emulation:

```
make run_hw_emu
```

In QEMU, run the following commands:

```
mount /dev/mmcblk0p1 /mnt
cd /mnt
./host.exe a.xclbin
```

Build the design for hardware:

```
make package TARGET=hw
```

The performance in hardware is similar:

```
cycle count:110610
Throughput of the graph: 4628.88 MB/s
```
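The reported figure can be reproduced from the cycle count, assuming a 1.25 GHz AI Engine clock and 100 graph iterations of a 4096-byte (1024 × int32) buffer. Both assumptions are illustrative; neither is stated in the log above:

```python
# Back-of-envelope check of the reported hardware throughput.
# The 1.25 GHz clock and the 100-iteration / 4096-byte-per-iteration data
# volume are assumptions for illustration, not values taken from the log.
AIE_CLOCK_HZ = 1.25e9
cycles = 110610
total_bytes = 100 * 1024 * 4           # iterations x samples x sizeof(int32)

elapsed_s = cycles / AIE_CLOCK_HZ
throughput_mbps = total_bytes / elapsed_s / 1e6
print(round(throughput_mbps, 2))       # 4628.88, matching the log
```

Under those assumptions the computed value matches the reported 4628.88 MB/s, which is consistent with the design operating close to the PLIO interface limit.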