Designing the Kernel - 2023.2 English

Vitis Tutorials: AI Engine (XD100)

Document ID: XD100

Release Date: 2024-03-05

Version: 2023.2 English

As in the single-kernel tutorial, this design uses streaming input and output, but its performance must be improved. The limitation can come from two sources:

A limit on the bandwidth side
A limit on the compute performance side

In the single-kernel section of the tutorial, the maximum throughput was 225 Msps, which shows that the streams are starved because of a limitation in compute performance. The cint16 data type is 32 bits wide, and the maximum bandwidth of an AXI-Stream connection in the array is one cint16 sample per clock cycle on a single stream. In the single-kernel design, four samples were read in four clock cycles, but the computation for the 32 taps took 16 clock cycles. For the optimal trade-off, the computation should take only four clock cycles for each group of four input samples read from the stream. Because only eight taps can be processed in four clock cycles, the complete 32-tap filtering operation should be split across four AI Engines.
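The cycle budget above can be checked with a few lines of plain C++. This is only a sketch of the arithmetic: the struct and function names are hypothetical and are not part of the tutorial or of any AI Engine API. The numbers (32 taps, 4 samples per iteration, 16 compute cycles) are the ones quoted in the paragraph above.

```cpp
#include <cassert>

// Hypothetical helper types for illustration only; not an AI Engine API.
struct KernelBudget {
    int taps;              // filter length
    int samples_per_iter;  // samples read from the stream per iteration
    int cycles_per_iter;   // compute cycles per iteration on one AI Engine
};

// Taps that one AI Engine can process for one group of samples
// in the given number of clock cycles.
constexpr int taps_in_cycles(const KernelBudget& b, int cycles) {
    return b.taps * cycles / b.cycles_per_iter;
}

// AI Engines needed so that compute takes no longer than reading the
// samples from a single stream (assumed: one cint16 per clock cycle).
constexpr int engines_needed(const KernelBudget& b) {
    const int stream_cycles = b.samples_per_iter;
    const int per_engine = taps_in_cycles(b, stream_cycles);
    return (b.taps + per_engine - 1) / per_engine;  // ceiling division
}

// Figures from the single-kernel design described in the text.
constexpr KernelBudget single_kernel{32, 4, 16};

static_assert(taps_in_cycles(single_kernel, 4) == 8,
              "eight taps fit in four clock cycles");
static_assert(engines_needed(single_kernel) == 4,
              "the 32 taps are split across four AI Engines");
```

Both checks are evaluated at compile time, so the file compiles only if the numbers in the text are mutually consistent.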

The Single-Kernel Filter can be represented by this convolution:

missing image

After subdivision into four kernels, each one running on a different AI Engine, the filter can be represented by four smaller filters running in parallel on the same data stream, except that some of these kernels discard the beginning of the stream.
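The original figures are not available here, so the following is a reconstruction of the decomposition under a stated assumption rather than the tutorial's own equation. It assumes the output sample $y_n$ is computed from the coefficients $c_k$ and the input samples $x_{n+k}$; the symbols $m$ (kernel index) and $j$ (tap index within a kernel) are introduced for illustration. With a different indexing convention, the offsets would change sign, but the split into four groups of eight taps would be the same.

```latex
y_n \;=\; \sum_{k=0}^{31} c_k \, x_{n+k}
    \;=\; \sum_{m=0}^{3} \; \underbrace{\sum_{j=0}^{7} c_{8m+j} \, x_{(n+8m)+j}}_{\text{kernel } m}
```

Under this convention, kernel $m$ applies its eight coefficients to the stream starting at sample $x_{n+8m}$, which is the same stream with its first $8m$ samples discarded. Kernel 0 therefore discards nothing, and kernels 1 to 3 discard 8, 16, and 24 samples respectively.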

Each of the four AI Engines performs the computations for a subset of the coefficients. Their results must be added together to get the overall result. The AI Engine architecture allows a number of accumulators to be sent to a neighboring AI Engine, where they serve as the starting point for further MAC operations. For computations performed on four lanes, the accumulator vector is v4cacc48, a 384-bit vector that can be sent to the next AI Engine in the chain in one clock cycle.
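The accumulator chaining described above can be modeled in plain C++. This is a behavioral sketch only: it does not use the AI Engine API or the real v4cacc48 type, and every name in it (CInt, stage, cascade, direct, cascade_matches_direct) is hypothetical. It also assumes the indexing convention in which output n uses input samples n to n+31. Each stage adds its eight taps to the four accumulators it receives and passes them on, and the final result is compared with a direct 32-tap computation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain C++ stand-in for a complex integer; not the AI Engine cint16 type.
struct CInt {
    std::int64_t re;
    std::int64_t im;
};

// Complex multiply-accumulate: returns acc + c * x.
inline CInt mac(CInt acc, CInt c, CInt x) {
    return {acc.re + c.re * x.re - c.im * x.im,
            acc.im + c.re * x.im + c.im * x.re};
}

constexpr std::size_t kTaps = 32;
constexpr std::size_t kStages = 4;                      // one per AI Engine
constexpr std::size_t kTapsPerStage = kTaps / kStages;  // 8
constexpr std::size_t kLanes = 4;                       // outputs per iteration

// 4 lanes x (real + imaginary) x 48 bits = 384 bits, as stated in the text.
static_assert(kLanes * 2 * 48 == 384, "width of a four-lane cacc48 vector");

using Acc4 = std::array<CInt, kLanes>;
using Coeffs = std::array<CInt, kTaps>;

// One stage of the chain: adds its 8 taps onto the incoming accumulators
// for outputs n .. n+3. Assumed convention: y[n] = sum_k c[k] * x[n + k].
Acc4 stage(std::size_t s, const std::vector<CInt>& x, const Coeffs& c,
           std::size_t n, Acc4 acc) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        for (std::size_t j = 0; j < kTapsPerStage; ++j) {
            const std::size_t k = s * kTapsPerStage + j;
            acc[lane] = mac(acc[lane], c[k], x[n + lane + k]);
        }
    }
    return acc;
}

// Four stages chained: each one starts from the previous accumulators.
Acc4 cascade(const std::vector<CInt>& x, const Coeffs& c, std::size_t n) {
    Acc4 acc{};  // zero-initialized
    for (std::size_t s = 0; s < kStages; ++s) {
        acc = stage(s, x, c, n, acc);
    }
    return acc;
}

// Reference: all 32 taps computed in one place.
Acc4 direct(const std::vector<CInt>& x, const Coeffs& c, std::size_t n) {
    Acc4 acc{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        for (std::size_t k = 0; k < kTaps; ++k) {
            acc[lane] = mac(acc[lane], c[k], x[n + lane + k]);
        }
    }
    return acc;
}

// Compares the chained and direct results on arbitrary test data.
bool cascade_matches_direct() {
    std::vector<CInt> x(64);
    for (std::size_t i = 0; i < x.size(); ++i) {
        const auto v = static_cast<std::int64_t>(i);
        x[i] = {3 * v - 50, 7 - v};
    }
    Coeffs c{};
    for (std::size_t k = 0; k < kTaps; ++k) {
        const auto v = static_cast<std::int64_t>(k);
        c[k] = {v - 16, 2 * v + 1};
    }
    for (std::size_t n = 0; n + kLanes + kTaps - 1 <= x.size(); n += kLanes) {
        const Acc4 a = cascade(x, c, n);
        const Acc4 b = direct(x, c, n);
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            if (a[lane].re != b[lane].re || a[lane].im != b[lane].im) {
                return false;
            }
        }
    }
    return true;
}
```

Calling cascade_matches_direct() from a test harness exercises the whole chain. The model ignores timing: on the hardware the four stages run concurrently on different AI Engines, whereas here they run one after another.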

missing image