Design Challenges

This section reviews the traditional method of filter design on the AI Engine and analyzes the challenges it poses for the DDC chain implementation.

In a traditional AI Engine design, the input window contains the incoming data block, and the Vitis™ unified software platform automatically places an overlap in front of it. For example, a filter with 89 taps needs an overlap of 88 samples. With a data block of 64 samples, the physical buffer size is 88 + 64 = 152 samples. The following figure shows the overlap and the pointer movement during the processing of the data. Note that the traditional filter works only if the overlap is placed directly in front of the data in a circular buffer.

Figure 1. Traditional FIR89 Pointers Behavior
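
The buffer layout described above can be modeled with a short scalar sketch. This is not AI Engine kernel code: the function name `fir_with_overlap` is hypothetical, plain `int` samples stand in for the real sample type, and a linear array stands in for the circular buffer. It only shows why each output needs `taps - 1` history samples directly in front of the new data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar model, not AI Engine intrinsics.
// `buf` holds (taps - 1) overlap samples followed by `block` new samples.
std::vector<int> fir_with_overlap(const std::vector<int>& buf,
                                  const std::vector<int>& taps,
                                  std::size_t block) {
    assert(!taps.empty());
    const std::size_t overlap = taps.size() - 1;
    assert(buf.size() == overlap + block);

    std::vector<int> out(block, 0);
    for (std::size_t n = 0; n < block; ++n) {
        // The newest sample for output n sits at index overlap + n;
        // older samples reach back into the overlap region.
        for (std::size_t k = 0; k < taps.size(); ++k) {
            out[n] += taps[k] * buf[overlap + n - k];
        }
    }
    return out;
}
```

For the 89-tap example, `buf` would hold 88 + 64 = 152 samples and the call would produce 64 outputs.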

When the data comes from another AI Engine or the PL, a ping-pong buffer is implemented for the input window of the kernel, and the overlap is automatically copied to the front of the data before each run of the filter function. This memory copy operation is illustrated in the following figure. When a filter has more taps than the data window has samples, copying the overlap before every execution of the filter function can consume a considerable amount of time.

Figure 2. Conventional Filter Kernel Behavior
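
The overlap copy between the two halves of a ping-pong buffer can be sketched as follows. The names `copy_overlap` and `copy_ratio` are hypothetical, and the model uses ordinary `std::vector` storage rather than AI Engine memory. It is meant only to show what is copied on each run and how the copy cost scales against the useful work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: before a run on `next`, the last `overlap` samples
// of the previously processed buffer are copied to the front of `next`.
void copy_overlap(const std::vector<int>& prev, std::vector<int>& next,
                  std::size_t overlap) {
    assert(prev.size() == next.size());
    assert(overlap <= prev.size());
    std::copy(prev.end() - static_cast<std::ptrdiff_t>(overlap), prev.end(),
              next.begin());
}

// Samples copied per run divided by new samples processed per run.
double copy_ratio(std::size_t overlap, std::size_t block) {
    assert(block > 0);
    return static_cast<double>(overlap) / static_cast<double>(block);
}
```

With the numbers from the 89-tap example, `copy_ratio(88, 64)` is 1.375, so more samples are copied per run than new samples are filtered.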

Note that the overlap is determined by the number of filter taps, and its size must be a multiple of 256 bits for maximum memory access efficiency. The size of the overlap is subject to the following equation:
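
One way to express the two constraints stated here is a round-up, sketched below. This is an illustration under assumptions rather than the note's own formula: it assumes the overlap must cover at least `taps - 1` samples, and the sample width is a caller-supplied parameter (32 bits, as for a complex 16-bit sample, is used only as an example). The function name `overlap_samples` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Alignment requirement taken from the text: overlap is a multiple of 256 bits.
const std::size_t kAlignBits = 256;

// Hypothetical helper: smallest overlap (in samples) that covers taps - 1
// history samples and occupies a whole number of 256-bit words.
// Assumes bits_per_sample divides 256 evenly.
inline std::size_t overlap_samples(std::size_t taps, std::size_t bits_per_sample) {
    assert(taps >= 1);
    assert(bits_per_sample > 0 && kAlignBits % bits_per_sample == 0);
    const std::size_t per_word = kAlignBits / bits_per_sample;  // samples per 256 bits
    const std::size_t history = taps - 1;                        // samples needed
    return (history + per_word - 1) / per_word * per_word;       // round up
}
```

Under these assumptions, `overlap_samples(89, 32)` returns 88, which agrees with the 89-tap example given earlier.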

The sizes of the overlap and the window must be specified at compilation time, which means that filters with different numbers of taps cannot share data buffers. With the traditional filter architecture, the DDC must be partitioned across three AI Engines for each antenna, as shown in the following figure.

Figure 3. DDC System Partition (Three AI Engines)

Though it is possible to bypass some DDC filters depending on the carrier configuration, the traditional filter design approach under consideration has the following challenges:

The memory footprint is very high because every filter of every channel needs its own buffer for the input samples plus the overlap.
The number of output windows is high because they cannot be shared by multiple filters.
Copying the overlap of the long channel filters, that is, FIR89 and FIR199, leads to considerable efficiency loss.
The AI Engine efficiency of the long half-band decimation filter is low because of the difficulty in achieving a perfect inner loop.

Though it is possible to design the filter chain using the traditional method, the design will have a large footprint and require at least three AI Engines instead of two for each DDC. This 50% increase in AI Engines is significant when the number of antennas is high.

This application note proposes an innovative design approach for the stated carrier configuration that reduces AI Engine resource utilization by 30% by leveraging the adaptable and scalable compute capabilities designed into the Versal AI Engine. The savings will be even higher in 5G NR wireless radio systems that support a larger number of carrier configurations. The proposed approach is also applicable to the digital up-conversion (DUC) chain and other modules in the system.