Coding for Optimal DSP and Arithmetic Inference

Coding for Optimal DSP and Arithmetic Inference - 2023.2 English

UltraFast Design Methodology Guide for FPGAs and SoCs (UG949)

Document ID

UG949

Release Date

2023-11-29

Version

2023.2 English

The DSP blocks within the AMD devices can perform many different functions, including:

Multiplication
Addition and subtraction
Comparators
Counters
General logic

The DSP blocks are highly pipelined blocks with multiple register stages allowing for high-speed operation while reducing the overall power footprint of the resource. AMD recommends that you fully pipeline the code intended to map into the DSP48, so that all pipeline stages are utilized. To allow the flexibility of use of this additional resource, a set condition cannot exist in the function for it to properly map to this resource.

DSP48 slice registers within AMD devices contain only resets, and not sets. Accordingly, unless necessary, do not code a set (value equals logic 1 upon an applied signal) around multipliers, adders, counters, or other logic that can be implemented within a DSP48 slice. Additionally, avoid asynchronous resets, since the DSP slice only supports synchronous reset operations. Code resulting in sets or asynchronous resets may produce suboptimal results in terms of area, performance, or power.

Many DSP designs are well-suited for the AMD architecture. To obtain best use of the architecture, you must be familiar with the underlying features and capabilities so that design entry code can take advantage of these resources.

The DSP48 blocks use a signed arithmetic implementation. AMD recommends code using signed values in the HDL source to best match the resource capabilities and, in general, obtain the most efficient mapping. If unsigned bus values are used in the code, the synthesis tools may still be able to use this resource, but might not obtain the full bit precision of the component due to the unsigned-to-signed conversion.

If the target design is expected to contain a large number of adders, AMD recommends that you evaluate the design to make greater use of the DSP48 slice pre-adders and post-adders. For example, with FIR filters, the adder cascade can be used to build a systolic filter rather than using multiple successive add functions (adder trees). If the filter is symmetric, you can evaluate using the dedicated pre-adder to further consolidate the function into both fewer LUTs and flip-flops and also fewer DSP slices as well (in most cases, half the resources).

If adder trees are necessary, the 6-input LUT architecture can efficiently create ternary addition (A + B + C = D) using the same amount of resources as a simple 2-input addition. This can help save and conserve carry logic resources. In many cases, there is no need to use these techniques.

By knowing these capabilities, the proper trade-offs can be acknowledged up front and accounted for in the RTL code to allow for a smoother and more efficient implementation from the start. In most cases, AMD recommends inferring DSP resources.

For more information about the features and capabilities of the DSP48 slice, and how to best leverage this resource for your design needs, see the 7 Series DSP48E1 Slice User Guide (UG479) and UltraScale Architecture DSP Slice User Guide (UG579).