FFT on a Single AI Engine

Block-by-Block Configurable Fast Fourier Transform Implementation on AI Engine (XAPP1356)

Document ID
XAPP1356
Release Date
2021-01-11
Revision
1.0 English

A Discrete Fourier Transform (DFT) takes time-domain samples {x0, x1, … xN-1} and outputs frequency-domain signal on the subcarriers {y0, y1, … yN-1} as follows:

Direct implementation of this equation requires N2 complex multiply-and-accumulate (MAC) operations, which can be reduced to (N log2N) by the algorithm Cooley and Tukey proposed (see J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series", Math. Comput., vol. 19, no. 90, pp. 297-301 ).

Figure 1. Block Diagram of FFT in DSP Library

The Xilinx DSP Library ( Xilinx Digital Signal Processing Library for AI Engine (UG1295) (Registration required)) has included FFT/IFFT kernels of various sizes that are not currently configurable during run-time. The preceding figure shows the block diagram of the FFT/IFFT implementation on a single AI Engine. The sizes of most memory buffers are proportional to the FFT size N, and the total memory size is (36N + 2076) bytes. In the 3GPP 5G NR standard, N can be up to 4096 and a single-AI Engine implementation would require 150 KB memory for data storage, more than 128 KB one AI Engine has access to. Although it is possible to reduce the memory footprint at the cost of 50% throughput, a multi-AI Engine architecture should be considered for higher efficiency and performance.