Block-by-Block Configurable Fast Fourier Transform Implementation on AI Engine (XAPP1356)

1.0 English

Orthogonal frequency division multiplexing (OFDM) has been adopted by many wireless communication systems, ranging from digital video broadcasting (DVB) to the latest 5G New Radio (NR) access network (3GPP Std TS 38.211). Recent wireless systems often have dynamically allocated component carriers with various subcarrier spacings on each antenna, which requires the FFT size to be configurable on a block-by-block basis. The throughput of the FFT must also be high enough to meet system bandwidth requirements. In a 64-antenna 5G NR system with two 100 MHz carriers and 200 MHz occupied bandwidth, 1792 FFTs of 4096 points need to be performed in 0.5 ms, which is equivalent to a minimum throughput of 1792 x 4096 / 0.5 ms = 14.68 GSPS. For FFTs with a fully pipelined architecture in programmable logic (PL) running at 491.52 MHz, it takes at least 14.68 GSPS / 491.52 MSPS = 29.87, rounded up to 30 instances, to achieve this throughput. The FFTs would therefore occupy a large amount of logic resources.
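The throughput arithmetic above can be reproduced with a short script. This is a minimal sketch rather than part of the reference design: the function and variable names are illustrative, and it assumes each pipelined PL FFT instance processes one sample per clock at 491.52 MHz.

```python
import math

def required_throughput_sps(num_ffts: int, fft_size: int, period_s: float) -> float:
    """Samples per second needed to finish num_ffts transforms within period_s."""
    return num_ffts * fft_size / period_s

def pl_instances_needed(throughput_sps: float, pl_rate_sps: float) -> int:
    """Number of PL FFT instances, rounded up to a whole instance."""
    return math.ceil(throughput_sps / pl_rate_sps)

# Figures from the 64-antenna, 200 MHz 5G NR example in the text.
throughput = required_throughput_sps(num_ffts=1792, fft_size=4096, period_s=0.5e-3)

# Assumption: one sample per clock cycle per instance at 491.52 MHz.
instances = pl_instances_needed(throughput, pl_rate_sps=491.52e6)

print(f"{throughput / 1e9:.2f} GSPS, {instances} PL FFT instances")
```

The ratio is about 29.87, so rounding up gives the 30 instances quoted above.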

The AI Engine is designed for compute-intensive workloads in various use cases, including but not limited to 5G wireless. One AI Engine tile consists of one AI Engine, 32 KB of data memory, and two DMA engines for automatic data transportation. Every AI Engine is equipped with a vector processor that is capable of 32 real-by-real 16-bit multiply-and-accumulate (MAC) operations in one clock cycle. The memory access unit inside the AI Engine reads 512 bits of operands and writes 256 bits of computation results every clock cycle to match the capability of the vector processor. In one Versal AI Core device, there are hundreds of AI Engine tiles interconnected through cascading buses, AXI streams, and shared local memory according to the dataflow defined by the user at compilation time. For more detailed information about the AI Engine, see Xilinx AI Engines and Their Applications (WP506).

Figure 1. Block Diagram of One AI Engine Tile

This application note shows one method to implement an FFT with 1.85 GSPS throughput on five AI Engines and 20 memory banks. The input consists of four data blocks of 1024 samples each; a three-bit control word specifies the operation to be performed on them (see the following table).

Table 1. Control Word Definition
Control Word   Operation   FFT/IFFT Size   No. Data Blocks
000            FFT         4096            1
001            FFT         2048            2
010            FFT         1024            4
011            FFT         512             8
100            IFFT        4096            1
101            IFFT        2048            2
110            IFFT        1024            4
111            IFFT        512             8
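The mapping in Table 1 can be expressed as a small decoder. This is a hedged sketch: the names `FftConfig` and `decode_control_word` are hypothetical, and reading the most significant bit as the FFT/IFFT selector and the lower two bits as a size selector is an interpretation of the table pattern, not a bit-field definition taken from the reference design.

```python
from typing import NamedTuple

class FftConfig(NamedTuple):
    inverse: bool      # True for IFFT, False for FFT
    size: int          # transform size in points
    num_blocks: int    # number of transforms covering the 4096 input samples

# Four input data blocks of 1024 samples each, as stated in the text.
TOTAL_SAMPLES = 4 * 1024

def decode_control_word(word: int) -> FftConfig:
    """Decode a three-bit control word according to Table 1."""
    if not 0 <= word <= 0b111:
        raise ValueError(f"control word must fit in three bits, got {word}")
    inverse = bool(word >> 2)        # assumed: MSB selects FFT (0) or IFFT (1)
    size = 4096 >> (word & 0b11)     # assumed: low bits select 4096/2048/1024/512
    return FftConfig(inverse, size, TOTAL_SAMPLES // size)

for word in range(8):
    cfg = decode_control_word(word)
    op = "IFFT" if cfg.inverse else "FFT"
    print(f"{word:03b} {op:4s} {cfg.size:4d} {cfg.num_blocks}")
```

In every row the transform size multiplied by the number of data blocks equals 4096, so each control word consumes the same amount of input data.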

Packet switching is used to multiplex all input data blocks onto two AXI streams of 1 GSPS each from the PL to the AI Engine. The output is sent back to the PL on two AXI streams of 1 GSPS each to match the FFT throughput of 1.85 GSPS. Detailed input and output formats are shown in Figure 2 and Figure 3, respectively. Two FFT modules fit into a 5x2 AI Engine array, which makes the design highly scalable in large systems. For the 64-antenna, 200 MHz 5G NR example above, it takes eight FFT modules in an array of 5 x 8 = 40 AI Engine tiles to achieve 14.68 GSPS throughput. This is only 10% of the AI Engine tiles available in the Xilinx VC1902 device, which leaves room for other compute-intensive functions such as random access channel processing, beamforming, and channel filtering.
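The scaling figures above follow from a few lines of arithmetic. The sketch below is illustrative only. The constant and function names are hypothetical, and the count of 400 AI Engine tiles in the VC1902 is an assumption consistent with the statement that 40 tiles are 10% of the device.

```python
import math

MODULE_THROUGHPUT_GSPS = 1.85   # one FFT module, from the text
TILES_PER_MODULE = 5            # five AI Engines per FFT module
STREAM_RATE_GSPS = 1.0          # each AXI stream between PL and AI Engine
STREAMS_PER_DIRECTION = 2
VC1902_AIE_TILES = 400          # assumption: 40 tiles are stated to be 10%

def modules_needed(required_gsps: float) -> int:
    """FFT modules required to sustain the given aggregate throughput."""
    return math.ceil(required_gsps / MODULE_THROUGHPUT_GSPS)

modules = modules_needed(14.68)
tiles = modules * TILES_PER_MODULE
utilization = tiles / VC1902_AIE_TILES

# Two 1 GSPS streams per direction must carry at least the module throughput.
io_headroom_ok = STREAM_RATE_GSPS * STREAMS_PER_DIRECTION >= MODULE_THROUGHPUT_GSPS

print(f"{modules} modules, {tiles} tiles, {utilization:.0%} of the array")
print(f"I/O streams sufficient: {io_headroom_ok}")
```

Eight modules provide 8 x 1.85 = 14.8 GSPS, slightly above the 14.68 GSPS requirement.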

AI Engine designs are guaranteed to run at a minimum of 1 GHz on Versal AI Core devices, so timing closure is not a concern. The C functions of the AI Engine reference design can also be modified to integrate more functionality into the AI Engines, for example, delay and phase compensation, carrier extraction, and subcarrier mapping.