Systolic FIR Filter

Versal ACAP DSP Engine Architecture Manual (AM004)

Document ID: AM004
Release Date: 2022-09-11
Revision: 1.2.1 English

The systolic FIR filter is considered an optimal solution for parallel filter architectures. It is built on adder chains, which lets it take full advantage of the DSP58 architecture (see the following figure).

Figure 1. Systolic FIR Filter

The input data is fed into a cascade of registers acting as a data buffer. Each register delivers a sample to a multiplier, where it is multiplied by the respective coefficient. The coefficients are aligned from left to right, with the first coefficients on the left side of the structure. The adder chain accumulates the partial inner products stage by stage to form the final result. No external logic is required to support the filter, and the structure is extendable to support any number of coefficients.
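The data flow described above can be sketched as a small cycle-by-cycle model. This is a simplified illustration, not the DSP58 register configuration: it assumes two data registers and one sum register per processing element, and the names `systolic_fir` and `direct_fir` are hypothetical helpers introduced only for this example.

```python
def direct_fir(samples, coeffs):
    """Reference result: y[n] = sum over k of coeffs[k] * x[n - k]."""
    return [
        sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        for n in range(len(samples))
    ]


def systolic_fir(samples, coeffs):
    """Simplified cycle-by-cycle model of a systolic FIR filter.

    Assumption (not taken from the manual): each processing element has
    two data registers (d1, d2) and one sum register (p).
    """
    n_taps = len(coeffs)
    d1 = [0] * n_taps  # first data register of each element
    d2 = [0] * n_taps  # second data register: feeds the multiplier and the next element
    p = [0] * n_taps   # sum register of each element (its cascade output)
    out = []
    for x in samples:
        # Next-state values are computed from the current register contents
        # only, so all registers update together, as on a clock edge.
        new_d1 = [x] + d2[:-1]   # data cascade: element k takes d2 of element k - 1
        new_d2 = d1[:]
        new_p = [
            coeffs[k] * d2[k] + (p[k - 1] if k > 0 else 0)  # product + cascaded sum
            for k in range(n_taps)
        ]
        d1, d2, p = new_d1, new_d2, new_p
        out.append(p[-1])        # the last element holds the filter output
    return out
```

In this model the output matches the direct-form result delayed by `len(coeffs) + 1` clock cycles, which illustrates why latency grows with the number of coefficients. The exact latency of a real design depends on the pipeline registers enabled in each DSP58.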

Note: Dedicated cascade connections (PCOUT and PCIN) are leveraged to achieve maximum performance (adder chain structure versus adder tree).

The configuration of DSP58 for each segment of the systolic FIR filter is shown in the following figure. Apart from the very first segment, all processing elements have the same structure. If rounding is performed, the ALU in the first segment must be driven by the C input (dynamic/static rounding) or RND attribute (static rounding) with the correct value. For all DSP instances, except the first instance, OPMODE is set to feed the ALU with the multiplier result of the same instance and the result from the previous DSP in the chain through the dedicated cascade path (PCOUT → PCIN). Notice that the two leftmost bits of OPMODE (through the WMUX) can be used if rounding is implemented. The dedicated cascade input in the first DSP instance (BCIN) and dedicated cascade output (BCOUT) are used to create the necessary input data buffer cascade.

Note: This design is supported by inference; therefore, the A and B inputs can be swapped depending on the tool choice. This means that ACIN and ACOUT (instead of BCIN and BCOUT) can be used to create the cascade.

Figure 2. Systolic Multiply-Add Processing Element
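The role of the rounding value in the first segment can be illustrated with a short arithmetic sketch. It assumes round-half-up rounding, where the constant 2^(n-1) is added before the n least significant bits are dropped. The function name `chain_with_rounding`, the `frac_bits` parameter, and the final shift are assumptions made for this example and are not part of the DSP58 configuration described above.

```python
def chain_with_rounding(products, frac_bits):
    """Sum multiplier results along an adder chain, rounding once.

    Assumption: the first segment adds the rounding constant
    2**(frac_bits - 1) to its own product (as the C input or RND value
    would), and every later segment adds its product to the cascaded sum.
    """
    if not products:
        raise ValueError("at least one product is required")
    rnd = (1 << (frac_bits - 1)) if frac_bits > 0 else 0
    p = products[0] + rnd      # first segment: product + rounding constant
    for m in products[1:]:
        pcin = p               # cascaded result from the previous segment
        p = m + pcin           # later segments: product + cascaded sum
    # Hypothetical final step: drop the fractional bits (arithmetic shift).
    return p >> frac_bits
```

Because the constant is injected only once, at the head of the chain, it travels through the cascade with the partial sums, and the later segments need no extra adder input for rounding. For example, `chain_with_rounding([5, 6], 2)` returns 3, which is 11/4 = 2.75 rounded to the nearest integer.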

The advantages of using the systolic FIR filter are as follows.

Highest Performance
Maximum performance can be achieved with this structure because there is no high fanout input signal. Dedicated cascading avoids the need to pass through programmable interconnect. Larger filters can be routing-limited if the number of coefficients exceeds the number of DSP Engines in a column on a device.
Efficient Mapping to the DSP58
Mapping is enabled by the adder chain structure of the systolic FIR filter. This extendable structure supports large and small FIR filters.
No External Logic
No external programmable logic is required, thus enabling the highest possible performance.

The disadvantages of using the systolic FIR filter are as follows.

Higher Latency
The latency of the filter is a function of the number of coefficients present in the filter. The larger the filter, the higher the latency.
More Resource Usage
A larger number of DSPs is used compared to the MACC FIR filter.

Coding examples will be available in Language Templates from the Vivado® Integrated Design Environment (IDE) in a subsequent release.