Implementation Details

Block-by-Block Configurable Fast Fourier Transform Implementation on AI Engine (XAPP1356)

Document ID: XAPP1356
Release Date: 2021-01-11
Revision: 1.0 English

The input to the FFT consists of a three-bit control word and four data blocks of 1024 samples each. The two least significant bits of the control word, Sz[1:0], select the FFT size N = 4096 / 2^Sz, so the supported sizes are 4096, 2048, 1024, and 512 points. Bit 2 of the control word specifies whether an FFT or an IFFT is performed on the data. The following figure shows how the data blocks are defined for the various FFT sizes.

Figure 1. Input Data Format
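
The control word decoding described above can be sketched as a small host-side Python model. This is an illustration only, not code from the reference design: the function name decode_control_word is hypothetical, and the polarity of bit 2 (1 = IFFT) is an assumption, because the text states only that this bit selects between FFT and IFFT.

```python
def decode_control_word(ctrl):
    """Split a 3-bit control word into (fft_size, inverse).

    Hypothetical helper for illustration; not part of the reference design.
    """
    if not 0 <= ctrl <= 0b111:
        raise ValueError("control word must fit in three bits")
    sz = ctrl & 0b11                    # Sz[1:0]: size selector
    fft_size = 4096 >> sz               # N = 4096 / 2^Sz
    inverse = bool((ctrl >> 2) & 1)     # bit 2; ASSUMED polarity: 1 = IFFT
    return fft_size, inverse
```

For example, under the assumed polarity, a control word of 0b101 would select a 2048-point IFFT.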

The control word and Data Blocks 0 and 1 are multiplexed onto one 1 GSPS AXI packet stream, and Data Blocks 2 and 3 are multiplexed onto a second packet stream. Packet IDs are allocated by the Xilinx tools during compilation and reported in a JSON file for easy post-processing. For details on AI Engine array packet switching, see the Versal ACAP AI Engine Programming Environment User Guide (UG1076). The following figure shows a simplified timing diagram of the input packet streams of the FFT design.

Figure 2. Input Packet Streams Timing Diagram
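
The split of the input across the two packet streams can be modeled roughly as follows. This is a sketch under stated assumptions: the function name build_input_streams, the packet_ids dictionary and its keys, and the ordering of packets within each stream are all hypothetical. In the real design, the packet IDs come from the JSON file generated at compile time, whose layout is not described here.

```python
BLOCK_LEN = 1024  # samples per data block, as stated in the text


def build_input_streams(ctrl, blocks, packet_ids):
    """Group the control word and four data blocks into two packet streams.

    packet_ids is a caller-supplied mapping (hypothetical keys) standing in
    for the IDs that the tools report after compilation.
    """
    if len(blocks) != 4 or any(len(b) != BLOCK_LEN for b in blocks):
        raise ValueError("expected four data blocks of 1024 samples each")
    # First stream: control word, then Data Blocks 0 and 1 (order ASSUMED).
    stream_a = [
        (packet_ids["ctrl"], [ctrl]),
        (packet_ids["blk0"], list(blocks[0])),
        (packet_ids["blk1"], list(blocks[1])),
    ]
    # Second stream: Data Blocks 2 and 3.
    stream_b = [
        (packet_ids["blk2"], list(blocks[2])),
        (packet_ids["blk3"], list(blocks[3])),
    ]
    return stream_a, stream_b
```

Each stream is returned as a list of (packet ID, payload) pairs, which keeps the model independent of the actual header encoding used on the AXI stream.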

The FFT output consists of two 1 GSPS sample streams whose data format depends on the FFT size. For some applications, a different output format might simplify the programmable logic design; the C function of FFTz can be modified accordingly.

Figure 3. FFT Output Format

The linear phase rotation between the 1024-point and 4-point FFTs is implemented in the FFTb, FFTc, and FFTd kernels using the sincos() function built into the AI Engine scalar unit instead of a large ROM. This approach saves 3 AI Engines x 1024 samples x 4 bytes/sample = 12,288 bytes of memory, equivalent to 1.5 memory banks. More specifically, a vector of eight twiddle factors is computed in parallel as follows.



On the right-hand side of the equation, the first term is computed by the sincos() function in the AI Engine scalar unit and the second term is a vector pre-computed before the loop, also using sincos().
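
The two-term structure described above (one scalar rotation per loop iteration multiplied by a fixed eight-element vector) can be illustrated with a floating-point Python model. This is a sketch under assumptions, not the kernel code: the rotation for branch r is taken to be exp(-j*2*pi*r*k/4096) for k = 0..1023, the forward-transform sign is used, and math.cos/math.sin stand in for the AI Engine sincos() function. The names twiddle and twiddles_factored are hypothetical.

```python
import cmath
import math

N = 4096     # full transform size (Sz = 0)
LANES = 8    # twiddle factors produced per loop iteration


def twiddle(r, k, n=N):
    """Direct reference value exp(-j*2*pi*r*k/n)."""
    return cmath.exp(-2j * math.pi * r * k / n)


def twiddles_factored(r, count=1024, n=N, lanes=LANES):
    """Generate `count` twiddles for branch r, eight at a time.

    Uses W^(r*(base+i)) = W^(r*base) * W^(r*i), i = 0..7 (ASSUMED form).
    """
    # Second term: per-lane offsets, computed once before the loop.
    lane_vec = [twiddle(r, i, n) for i in range(lanes)]
    out = []
    for base in range(0, count, lanes):
        # First term: a single sine/cosine evaluation per iteration.
        angle = -2.0 * math.pi * r * base / n
        scalar = complex(math.cos(angle), math.sin(angle))
        # One scalar-by-vector multiply yields eight twiddles.
        out.extend(scalar * v for v in lane_vec)
    return out
```

In this model, each loop iteration needs only one sine/cosine evaluation and one scalar-by-vector complex multiply, which is why no 1024-entry twiddle table has to be stored. The actual kernels operate on fixed-point samples, so their rounding behavior differs from this floating-point model.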

Location pinning of AI Engines and buffers is used to pack two FFT modules into a 5x2 AI Engine array with minimal memory conflicts. As shown in Figure 4, the design uses all memory banks and AI Engines inside the 5x2 AI Engine array to maximize throughput. In Figure 4, the shaded AI Engines form one set of five AI Engines, and the unshaded AI Engines form the other set.

Figure 4. Two FFT Modules Packed into a 5x2 AI Engine Array