Data and Coefficients Management - 2023.2 English

Vitis Tutorials: AI Engine (XD100)

Document ID: XD100
Release Date: 2024-03-05
Version: 2023.2 English

The data register is limited to 1024 bits (v32cint16) and the coefficient register is limited to 256 bits (v8cint16). With a streaming interface (a single stream to start with), four cint16 samples can be read in one instruction, but it takes four clock cycles before the same operation can be performed again. Reading four samples at a time allows the use of the mul4 and mac4 intrinsics.
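The register and stream figures above follow from simple bit arithmetic. The sketch below is plain host C++, not AI Engine code. It assumes that one stream delivers 32 bits per clock cycle, which the text does not state explicitly. The names `Cint16`, `vector_bits`, and `stream_cycles` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Plain C++ stand-in for a cint16 sample: 16-bit real + 16-bit imaginary.
struct Cint16 {
    std::int16_t real;
    std::int16_t imag;
};

constexpr int kBitsPerCint16 = static_cast<int>(8 * sizeof(Cint16));  // 32 bits

// Bit width of a vector holding `lanes` cint16 samples.
constexpr int vector_bits(int lanes) { return lanes * kBitsPerCint16; }

// Assumption: a single stream moves 32 bits per clock cycle.
constexpr int kStreamBitsPerCycle = 32;

// Clock cycles one stream needs to deliver `samples` cint16 values.
inline int stream_cycles(int samples) {
    assert(vector_bits(samples) % kStreamBitsPerCycle == 0);
    return vector_bits(samples) / kStreamBitsPerCycle;
}

static_assert(vector_bits(32) == 1024, "a v32cint16 is 1024 bits wide");
static_assert(vector_bits(8) == 256, "a v8cint16 is 256 bits wide");
```

Under that assumption, `stream_cycles(4)` evaluates to 4, which matches the four clock cycles needed between two consecutive four-sample reads.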

Not all intrinsics exist for the AI Engine. Only two intrinsics handle four lanes for complex 16 bits x complex 16 bits:

missing image

In this tutorial, finite-length loops (by default 512 input/output samples) are assumed for ease of debugging. The number of iterations can be increased as desired, up to an infinite loop (while(1) { ...}). Between two calls of the kernel, the state of the filter's delay-line must be maintained. This delay-line must hold at least 31 samples for a 32-tap filter. 32 samples fit in a Y register, which is why a v32cint16 variable is used to hold the delay-line. At the beginning of the kernel call, the delay-line is loaded from memory, and at the end it is stored back there. For the coefficients, there is no choice: they are held in a v8cint16.
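The load-at-entry, store-at-exit handling of the delay-line can be illustrated with a scalar model. The sketch below is plain host C++ and makes no attempt to reproduce the vectorized AI Engine kernel. `FirState`, `fir_block`, and `Cplx` are hypothetical names, and `Cplx` uses `int` fields, so cint16 saturation and accumulator widths are not modeled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified complex sample (stand-in for cint16, no saturation modeled).
struct Cplx { int re; int im; };

inline Cplx cmul(Cplx a, Cplx b) {
    return {a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re};
}

constexpr std::size_t kTaps = 32;    // 32-tap filter
constexpr std::size_t kDelay = 32;   // delay-line length, as in a v32cint16
static_assert(kDelay >= kTaps - 1, "delay-line must hold at least 31 samples");

// Hypothetical persistent state: the delay-line kept in memory between calls.
struct FirState {
    std::array<Cplx, kDelay> delay{};   // delay[kDelay - 1] is the newest sample
};

// Filters one block of samples (for example 512 per kernel call).
std::vector<Cplx> fir_block(FirState& state,
                            const std::array<Cplx, kTaps>& coeffs,
                            const std::vector<Cplx>& in) {
    std::array<Cplx, kDelay> d = state.delay;   // load at the start of the call
    std::vector<Cplx> out;
    out.reserve(in.size());
    for (const Cplx& x : in) {
        // Shift the delay-line by one sample and append the new input.
        for (std::size_t i = 0; i + 1 < kDelay; ++i) d[i] = d[i + 1];
        d[kDelay - 1] = x;
        // y[n] = sum over k of coeffs[k] * x[n - k]
        Cplx acc{0, 0};
        for (std::size_t k = 0; k < kTaps; ++k) {
            const Cplx p = cmul(coeffs[k], d[kDelay - 1 - k]);
            acc.re += p.re;
            acc.im += p.im;
        }
        out.push_back(acc);
    }
    state.delay = d;                            // store at the end of the call
    assert(out.size() == in.size());
    return out;
}
```

Calling `fir_block` twice on two consecutive 512-sample blocks with the same `FirState` gives the same output as one call on all 1024 samples. That is the property the saved delay-line is meant to guarantee.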

A mul4 operating on cint16 x cint16 can perform eight complex multiplications in one clock cycle, which amounts to two operations on each of the four lanes.
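The "four lanes, two operations per lane" arithmetic can be made concrete with a scalar model. The sketch below is plain host C++ and is not the real intrinsic: `mul4_model` is a hypothetical name, and its addressing (lane `l` reads data samples `xstart + l` and `xstart + l + 1`, with both coefficients taken from `zstart` and `zstart + 1`) is a simplifying assumption rather than the actual offset and step parameters of mul4.

```cpp
#include <array>
#include <cassert>

// Simplified complex value (wide fields stand in for an accumulator lane).
struct Cplx { long re; long im; };

inline Cplx cmul(Cplx a, Cplx b) {
    return {a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re};
}

struct Mul4Result {
    std::array<Cplx, 4> lanes;   // one accumulator per output lane
    int complex_mults;           // complex multiplications performed
};

// Hypothetical scalar model of a 4-lane multiply on cint16 x cint16:
// each lane sums two data-times-coefficient products.
Mul4Result mul4_model(const std::array<Cplx, 32>& x, int xstart,
                      const std::array<Cplx, 8>& z, int zstart) {
    assert(xstart >= 0 && zstart >= 0 && zstart + 1 < 8);
    Mul4Result r{};
    for (int lane = 0; lane < 4; ++lane) {
        for (int k = 0; k < 2; ++k) {
            // The modulo only keeps the index in range in this model.
            const Cplx p = cmul(x[(xstart + lane + k) % 32], z[zstart + k]);
            r.lanes[lane].re += p.re;
            r.lanes[lane].im += p.im;
            ++r.complex_mults;
        }
    }
    return r;
}
```

One call performs 4 x 2 = 8 complex multiplications. Under this model, covering all 32 taps for four output samples would take 16 such calls, with the first acting as a mul4 and the remaining 15 accumulating like a mac4.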