Optimizing Paths with Dedicated Blocks and Macro Primitives


Versal Adaptive SoC System Integration and Validation Methodology Guide (UG1388)

Document ID: UG1388
Release Date: 2023-11-15
Version: 2023.2 English

Paths from, to, or between dedicated blocks and macro primitives (for example, DSP, block RAM, UltraRAM, NoC master unit (NMU) and NoC slave unit (NSU), AI Engines, and XPIO) need special attention because these primitives usually have the following timing characteristics:

Higher setup/hold/clock-to-output timing arc values for some pins. For example, a block RAM has a clock-to-output delay around 1.2 ns without the optional output register and 0.3 ns with the optional output register. Review the data sheet of your target device series for complete details.
Higher clock-to-output timing arc values for NoC output pins. For example, a NoC NSU has a clock-to-output delay around 0.65 ns.
Higher routing delays than regular FD/LUT connections.
Higher clock skew variation than regular FD-FD paths.
Higher routing delays between the fabric and dedicated blocks at the top or bottom of the device (for example, AI Engines and dedicated blocks within the XPIO, such as XPHY logic, I/O logic, and clock-modifying blocks).

Also, their availability and site locations are restricted compared to CLB slices, which usually makes their placement more challenging and often incurs some QoR penalty.
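To see how these clock-to-output values consume the timing budget, the following minimal Python sketch subtracts the launch delay from a clock period and reports what remains for routing, logic, and destination setup. The 1.2 ns, 0.3 ns, and 0.65 ns figures are the approximate values quoted above. The 2.5 ns clock period (400 MHz) and the function name are assumptions chosen only for illustration, and clock skew and uncertainty are ignored.

```python
# Approximate clock-to-output delays quoted in the text, in nanoseconds.
CLK_TO_OUT_NS = {
    "block RAM, no output register": 1.2,
    "block RAM, with output register": 0.3,
    "NoC NSU": 0.65,
}


def remaining_budget_ns(clock_period_ns, clk_to_out_ns):
    """Time left for routing, logic, and destination setup after launch."""
    return clock_period_ns - clk_to_out_ns


# Assumed clock period for illustration only (400 MHz); not from the guide.
PERIOD_NS = 2.5

for source, delay in CLK_TO_OUT_NS.items():
    left = remaining_budget_ns(PERIOD_NS, delay)
    share = 100.0 * delay / PERIOD_NS
    print(f"{source}: {left:.2f} ns left ({share:.0f}% of period used at launch)")
```

Under these assumptions, an unregistered block RAM output spends roughly half of the period before the data leaves the block, which is why the optional output register and additional pipelining matter on these paths.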

For these reasons, AMD recommends the following:

Pipeline paths from and to dedicated blocks and macro primitives as much as possible.
Restructure the combinational logic connected to these cells to reduce the logic levels by at least 1 or 2 cells if latency incurred by pipelining is a concern.
Meet setup timing with at least 500 ps of positive slack on these paths before placement.
Replicate cones of logic that are connected to many dedicated blocks or macro primitives when those blocks must be placed far apart.
When the design has tight timing requirements to, within, or from a DSP block, run opt_design -dsp_register_opt to move registers to a more timing-optimal position.
Note: Because timing is approximate during opt_design, you might also need to run phys_opt_design -dsp_register_opt to correct movements where timing was not accurately represented at the pre-placement stage.
Use the boundary logic interface (BLI) flip-flops for the placement of pipeline flip-flops interfacing with AI Engines and dedicated blocks within the XPIO, such as XPHY logic, I/O logic, and clock-modifying blocks. Some IP cores provide an option to use the BLI flip-flops.
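The 500 ps pre-placement margin recommended above can be expressed as a simple filter over per-path setup slack. The sketch below is a minimal illustration, not a Vivado API: the function name, path names, and slack values are hypothetical sample data, and in practice the slack numbers would come from a timing report generated after opt_design.

```python
REQUIRED_MARGIN_NS = 0.5  # 500 ps margin recommended in the text


def paths_below_margin(setup_slack_ns, margin_ns=REQUIRED_MARGIN_NS):
    """Return (path, slack) pairs whose setup slack is under the margin, worst first."""
    short = [(path, slack) for path, slack in setup_slack_ns.items() if slack < margin_ns]
    return sorted(short, key=lambda item: item[1])


# Hypothetical pre-placement setup slack values in ns; names are illustrative only.
sample = {
    "bram_0 -> dsp_3": 0.72,
    "dsp_3 -> uram_1": 0.31,
    "nsu_0 -> bram_2": -0.05,
}

for path, slack in paths_below_margin(sample):
    shortfall = REQUIRED_MARGIN_NS - slack
    print(f"{path}: slack {slack:+.2f} ns, short of the margin by {shortfall:.2f} ns")
```

Note that a path with positive slack, such as the 0.31 ns entry in the sample data, is still flagged: it meets timing before placement but lacks the headroom that the recommendation asks for.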