Subsystem Assembly and Verification Using Hardware Emulation - 2020.2 English

Versal ACAP Design Guide (UG1273)

Document ID: UG1273
Release Date: 2021-03-26
Version: 2020.2 English

In the second step of this design flow, you gradually assemble the subsystem components (PS, PL, and AI Engine) on top of the target platform and use the Vitis hardware emulation flow to simulate the integrated system. Hardware emulation is a cycle-approximate simulation of the system: the AI Engine graph runs in the SystemC simulator (aiesimulator), RTL behavioral models of the PL run in the Vivado simulator or a supported third-party simulator, and the software code executing on the PS is simulated using the Xilinx Quick Emulator (QEMU).

The target platform contains the hardware and software infrastructure resources required for the project. You can target a standard Xilinx platform or a custom platform. At this step in the flow, Xilinx recommends using a standard, pre-verified platform to reduce uncertainty in the process and focus effort on the system components (graph and kernels).

The Vitis linker (v++ --link) is used to assemble the compiled AI Engine graph (libadf.a) and PL kernels (.xo) with the target platform. The Vitis linker establishes connections between the AI Engine ports, PL kernels, and other platform resources.
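As a sketch of this link step (the platform, kernel, and port names below are examples, not part of any specific design), a hardware emulation link command might look like:

```shell
# Link the compiled AI Engine graph (libadf.a) and a PL kernel (.xo)
# against the target platform for hardware emulation.
# Platform, kernel, and file names are illustrative.
v++ --link --target hw_emu \
    --platform xilinx_vck190_base_202020_1 \
    --config system.cfg \
    mm2s.xo libadf.a \
    --output my_system.xsa
```

The connections between PL kernels and AI Engine graph ports can be described in the configuration file passed with --config, for example:

```
[connectivity]
# Instantiate one compute unit of the (hypothetical) mm2s kernel.
nk=mm2s:1:mm2s_1
# Stream-connect its AXI4-Stream output to an AI Engine graph input.
sc=mm2s_1.s:ai_engine_0.DataIn1
```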

Because this design flow progresses gradually, certain elements might not exist in early iterations, so you might need to terminate unconnected signals, drive inputs, or provide sinks. For example, unterminated streaming connections between the AI Engine graph and PL kernels (PLIOs and AXI4-Stream interfaces) require simulation I/Os and traffic generator IP for emulation, which you can add during the linking process using v++ options.

The Vitis linker automatically inserts FIFOs on streaming interfaces, as well as clock domain converters (CDCs) and data width converters (DWCs) between the AI Engine and PL kernels, as needed. On the Versal ACAP, the AI Engine array clock can run at 1 GHz, while the PL region clock runs at a different, lower frequency. As a result, the data throughput of the AI Engine kernels and the PL kernels can differ based on their clock frequencies. When linking the subsystem, the Vitis compiler can insert CDCs, DWCs, and FIFOs to match the throughput capacities of the PL and AI Engine regions.
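The width conversion needed to balance the two clock domains follows from equating throughput on both sides of the boundary. As an illustrative calculation (the frequencies and widths below are example values, not requirements of the flow):

```python
def matched_pl_width(aie_clk_hz, aie_width_bits, pl_clk_hz):
    """Return the PL-side data width (bits) needed to match AI Engine
    streaming throughput: bits per second must be equal on both sides."""
    aie_throughput = aie_clk_hz * aie_width_bits  # bits per second
    return aie_throughput / pl_clk_hz             # bits per PL clock cycle

# A 32-bit PLIO clocked at 1 GHz moves 32 Gb/s; a PL kernel running at
# 250 MHz needs a 128-bit interface to sustain the same rate, which is
# the kind of gap a DWC (plus a FIFO to absorb bursts) bridges.
print(matched_pl_width(1e9, 32, 250e6))  # -> 128.0
```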

The Vitis packager (v++ --package) is used to add the PS application and firmware and to generate the required setup to run hardware emulation. The PS application controls the AI Engine graph, including how it is loaded, initialized, run, and updated, and the PL kernels. To control the AI Engine graph, you must use the graph APIs generated by the aiecompiler or the standard Xilinx Runtime (XRT) APIs. To control the PL kernels, Xilinx recommends using the standard XRT APIs. XRT is an open-source library that makes it easy to interact with PL kernels and AI Engine graphs from a software application, either embedded or x86-based.
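As a sketch of the package step for an embedded Linux target (file names, rootfs, and kernel image paths are examples, not fixed requirements), the command might look like:

```shell
# Package the linked design (.xsa) with the AI Engine graph and the
# PS host executable for hardware emulation.
# All file names and paths here are illustrative.
v++ --package --target hw_emu \
    --platform xilinx_vck190_base_202020_1 \
    --package.out_dir package_out \
    --package.rootfs rootfs.ext4 \
    --package.kernel_image Image \
    --package.boot_mode sd \
    --package.sd_file host.exe \
    my_system.xsa libadf.a \
    --output my_system.xclbin
```

The packager emits the emulation launch setup in the output directory, from which hardware emulation can then be started.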

Optionally, you can build higher-level functionality on top of the graph and PL drivers. For the PS subsystem, this step introduces code that did not fully exist in the first step: drivers or firmware that interact directly with the kernels, and a higher-level application that uses those drivers.

You can develop PS firmware, graph drivers, and PL kernels as follows:

PS firmware
Use the test bench from the first step in the design flow, which drives and manages the graph using graph APIs.
Graph drivers
Use the graph APIs to test the graph and to interact with RTPs and GMIOs.
PL kernel drivers
Use XRT APIs or UIO drivers to interact with the PL kernels.
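The driver approaches above can be sketched in PS host code using the XRT native C++ API. This is a minimal sketch, not a complete application: the kernel name (mm2s), graph name (mygraph), buffer size, and iteration count are hypothetical, and error handling is omitted.

```cpp
// Minimal PS host-code sketch using the XRT native C++ API.
// Kernel/graph names and sizes are hypothetical examples.
#include "xrt/xrt_device.h"
#include "xrt/xrt_bo.h"
#include "xrt/xrt_kernel.h"
#include "experimental/xrt_graph.h"

int main() {
    // Open the device and load the packaged image.
    auto device = xrt::device(0);
    auto uuid   = device.load_xclbin("my_system.xclbin");

    // PL kernel control through the XRT kernel API.
    auto mm2s = xrt::kernel(device, uuid, "mm2s");
    auto bo   = xrt::bo(device, 4096, mm2s.group_id(0));
    auto run  = mm2s(bo, 4096 / sizeof(int));

    // AI Engine graph control through the XRT graph API.
    auto graph = xrt::graph(device, uuid, "mygraph");
    graph.run(16);   // run the graph for 16 iterations
    graph.end();     // wait for completion and disable the graph

    run.wait();      // wait for the PL kernel to finish
    return 0;
}
```

The same sequence applies unchanged whether the executable runs under QEMU in hardware emulation or on the PS in hardware, which is one reason Xilinx recommends the XRT APIs over ad hoc register access.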

In this step, most models are cycle accurate; however, some models are only approximate, and others are transaction-level models (TLMs). PL kernels are simulated using the target clock, which is not guaranteed to be met during implementation. Interactions between the AI Engine graph and PL kernels are modeled at the cycle level, but overall accuracy depends on how faithfully the traffic generators and other test bench modules reproduce real traffic patterns. The impact of other subsystems or complex I/O interactions cannot be accurately modeled, and the slower performance of the emulation environment limits the amount of traffic and the number of vectors that can be tested.

Note: Meeting performance targets in hardware emulation is necessary but does not guarantee final results. Hardware emulation is cycle approximate, with better performance accuracy than the first step in the design flow; however, performance results are still not final at this stage.