Novel Filter Design on AI Engine

The 5G NR and LTE carriers never coexist in the system. In addition, the peak utilization of either channel filter, whether for 5G NR or for five-carrier (5C) LTE, stays below 80% of the AI Engine capacity. (As shown in the previous figure, the utilization of AI Engine #0 is about 55% and that of AI Engine #1 is about 77%.) This headroom allows the FIR89 and FIR199 filter chain to be packed onto a single AI Engine, so that a one-antenna design uses two AI Engines instead of three. The proposed design partitioning is illustrated in the following figure.

Figure 1. DDC System Partition (Two AI Engines)

A header at the front of the input window indicates the carrier configuration; this header can change on a block-by-block basis. The output window has a fixed size, but it can contain the data for either one 5G NR 100 MHz carrier or five LTE carriers. The input/output data format is shown in the following figure.

Figure 2. DDC Interface Format

The optimized design implementation is explained step by step below.

First, the memory footprint has to be reduced by manually managing filter overlaps. Consider the buffer allocated to one filter kernel: only the overlap portion of the buffer is unique to that filter, while the data portion can be shared by multiple filters. The memory saving is especially large for short filters. The following figure shows the new approach, in which each kernel is assigned a pair of ping-pong buffers without overlap and a separate memory is allocated for the overlap only. Separating data from overlap makes it possible for filters with different tap counts and sample rates to share a single data buffer.

Figure 3. Novel Filter Kernel Behavior
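The saving can be estimated with simple arithmetic. The sketch below is only an illustration under stated assumptions: every filter uses the same window size, each filter needs an overlap of (taps - 1) samples, and the filters sharing a data buffer never run at the same time. The function names and the example numbers are hypothetical and do not come from the reference design.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Traditional scheme (assumed model): every kernel owns a ping-pong pair of
// windows, and each window carries its own overlap of (taps - 1) samples.
std::size_t traditional_bytes(const std::vector<std::size_t>& taps,
                              std::size_t window_samples,
                              std::size_t bytes_per_sample) {
    std::size_t total = 0;
    for (std::size_t t : taps) {
        assert(t >= 1);
        total += 2 * (window_samples + (t - 1)) * bytes_per_sample;
    }
    return total;
}

// Shared scheme (assumed model): one ping-pong data pair without overlap is
// shared by all filters, plus one overlap-only buffer per filter.
std::size_t shared_bytes(const std::vector<std::size_t>& taps,
                         std::size_t window_samples,
                         std::size_t bytes_per_sample) {
    std::size_t total = 2 * window_samples * bytes_per_sample;
    for (std::size_t t : taps) {
        assert(t >= 1);
        total += (t - 1) * bytes_per_sample;
    }
    return total;
}
```

For example, with two filters of 11 and 23 taps, a 256-sample window, and 4 bytes per sample (an illustrative configuration), the first function gives 4352 bytes and the second gives 2176 bytes.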

The following figure illustrates how a half-band filter (HBF23) works with this overlap buffer scheme. Unlike the traditional method, the input window carries no overlap region, which eliminates the automatic overlap copy. During the first three cycles, the kernel initializes its registers by loading data from the overlap buffer rather than from the input window. The main loop runs from the fourth to the Nth cycle, during which the kernel reads data from the input window, shuffles it, and performs MAC operations. Before the function returns, the data from the last few cycles (three in this case) is stored back into the overlap buffer.

Figure 4. Concept of Manually Managed Overlap
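The three phases described above (load from overlap, main loop on the input window, store back to overlap) can be modeled in scalar C++. This is a minimal sketch, not AI Engine kernel code: it uses no vector intrinsics, omits the decimation and the zero-coefficient shortcuts of a real half-band filter, and the names `OverlapFir` and `fir_reference` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Scalar model of a FIR kernel with a manually managed overlap buffer.
class OverlapFir {
public:
    explicit OverlapFir(std::vector<int> coeffs) : coeffs_(std::move(coeffs)) {
        assert(!coeffs_.empty());
        overlap_.assign(coeffs_.size() - 1, 0);  // history starts at zero
    }

    // Filters one input window that carries no overlap region of its own.
    std::vector<int> process(const std::vector<int>& in) {
        const std::size_t hist = overlap_.size();
        assert(in.size() >= hist);
        std::vector<int> out(in.size());

        // Prologue: the first `hist` outputs still need samples of the
        // previous block, which are read from the overlap buffer.
        for (std::size_t n = 0; n < hist; ++n) {
            long long acc = 0;
            for (std::size_t k = 0; k < coeffs_.size(); ++k) {
                const int x = (k <= n) ? in[n - k] : overlap_[hist + n - k];
                acc += static_cast<long long>(coeffs_[k]) * x;
            }
            out[n] = static_cast<int>(acc);
        }

        // Main loop: every tap now reads from the input window only.
        for (std::size_t n = hist; n < in.size(); ++n) {
            long long acc = 0;
            for (std::size_t k = 0; k < coeffs_.size(); ++k) {
                acc += static_cast<long long>(coeffs_[k]) * in[n - k];
            }
            out[n] = static_cast<int>(acc);
        }

        // Epilogue: keep the last `hist` samples for the next block.
        for (std::size_t i = 0; i < hist; ++i) {
            overlap_[i] = in[in.size() - hist + i];
        }
        return out;
    }

private:
    std::vector<int> coeffs_;
    std::vector<int> overlap_;  // oldest sample first
};

// Reference: one-shot convolution over the whole signal, zero initial history.
std::vector<int> fir_reference(const std::vector<int>& coeffs,
                               const std::vector<int>& x) {
    std::vector<int> y(x.size());
    for (std::size_t n = 0; n < x.size(); ++n) {
        long long acc = 0;
        for (std::size_t k = 0; k < coeffs.size() && k <= n; ++k) {
            acc += static_cast<long long>(coeffs[k]) * x[n - k];
        }
        y[n] = static_cast<int>(acc);
    }
    return y;
}
```

Processing a signal block by block with `OverlapFir::process` should give the same samples as `fir_reference` applied to the whole signal at once, which is a convenient check that the overlap is saved and restored correctly.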

Separating the overlap from the data buffers creates an implementation challenge. If the overlap is short enough to be loaded into registers once, without reloading, as is the case for HBF11 and HBF23, it is easy to handle: the overlap is loaded into register space at the beginning, as shown in the preceding figure. However, when the filter is long, a memory copy is required to merge the memory space. FIR89 and FIR199 are such cases. This application note proposes using the overlap memory buffer for filtering and running the data copy in parallel with the computation to maximize throughput.

The Versal ACAP AI Engine is a very long instruction word (VLIW) vector engine. The VLIW-based instruction-level parallelism implemented in the AI Engine allows up to seven different operations to execute in one cycle: two loads, one store, one vector MAC, one scalar ALU operation, and two data move instructions. This application note proposes a method that uses the AI Engine VLIW instruction bundling to find spare cycles in which data from the input window is written into an internal overlap buffer in parallel with other operations. Refer to the following figure, which uses a 64-sample input window. The first two cycles load data into the register files (left and right buffers, 16 samples each). The third to the Nth cycle forms the main body of the for loop, which is the key part of the filter design. The idea is to find a spare cycle in which to write new data from the input window into the internal buffer. Here, the fourth or fifth cycle can serve as the spare cycle. From the sixth cycle onward, the new data is overwritten by another load operation. In this way, the costly overlap copy is fully merged with the MAC operations and uses no extra cycles.

Figure 5. Spare Cycle
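The idea of merging the copy into the MAC loop can be illustrated with a scalar model. The sketch below contrasts a two-pass version (bulk copy, then MAC) with a version in which each loop iteration stores one new sample into an internal buffer before computing the output. It is an illustration only: a scalar loop cannot show VLIW slot scheduling, the circular internal buffer is an assumption made for this sketch rather than the layout used in the reference design, and the names `run_copy_then_mac` and `DelayLineFir` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Two-pass version: first merge history and new data with a bulk copy,
// then run the MAC loop over the merged region.
std::vector<int> run_copy_then_mac(const std::vector<int>& coeffs,
                                   std::vector<int>& history,
                                   const std::vector<int>& in) {
    const std::size_t hist = coeffs.size() - 1;
    assert(history.size() == hist && in.size() >= hist);

    std::vector<int> merged(history);                   // pass 1: copy
    merged.insert(merged.end(), in.begin(), in.end());

    std::vector<int> out(in.size());
    for (std::size_t n = 0; n < in.size(); ++n) {       // pass 2: MAC
        long long acc = 0;
        for (std::size_t k = 0; k < coeffs.size(); ++k) {
            acc += static_cast<long long>(coeffs[k]) * merged[hist + n - k];
        }
        out[n] = static_cast<int>(acc);
    }
    history.assign(merged.end() - static_cast<std::ptrdiff_t>(hist),
                   merged.end());
    return out;
}

// Merged version: the store of each new sample into the internal buffer
// happens inside the MAC loop, so no separate copy pass exists.
struct DelayLineFir {
    std::vector<int> coeffs;
    std::vector<int> delay;  // circular internal buffer, one slot per tap
    std::size_t head = 0;    // slot that receives the next sample

    explicit DelayLineFir(std::vector<int> c)
        : coeffs(std::move(c)), delay(coeffs.size(), 0) {
        assert(!coeffs.empty());
    }

    std::vector<int> run(const std::vector<int>& in) {
        const std::size_t taps = coeffs.size();
        std::vector<int> out(in.size());
        for (std::size_t n = 0; n < in.size(); ++n) {
            delay[head] = in[n];  // store that rides along with the MACs
            long long acc = 0;
            for (std::size_t k = 0; k < taps; ++k) {
                acc += static_cast<long long>(coeffs[k]) *
                       delay[(head + taps - k) % taps];
            }
            out[n] = static_cast<int>(acc);
            head = (head + 1) % taps;
        }
        return out;
    }
};
```

Both versions should produce identical outputs block after block. On the AI Engine the benefit comes from the store occupying an otherwise idle slot of the instruction bundle, which this scalar model only mirrors structurally.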