Hardware and Software Organization

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID: UG1393
Release Date: 2022-05-25
Version: 2022.1 English

A good system design model makes it easy to apply hardware acceleration to specific functions in an existing application, with minimal changes needed to instantiate the compute hardware and run it efficiently. In the Vitis HLS-based acceleration flow, the efficiency of the compute hardware still depends on the modeling and coding style and on the pragmas used. In the RTL flow, it depends on the chosen architecture. Invoking the accelerated function, or compute unit (CU), and interacting with the host should be automated as much as possible. This includes pipelining data through the hardware and using and composing multiple CUs.

VSC provides a way to compile your accelerator design and the application software interface from a unified C++ model. As the right side of the figure shows, the hardware design is a system built on the AXI4 framework and plugged into the dynamic region of a standard Vitis platform. This user-defined composition can consist of replicable compute units (CUs), where each CU can be a data-pipelined network of processing elements (PEs), and each PE can work on data:

in the device memory, typically DDR, local to the accelerator card that hosts the FPGA
in a SmartSSD connected to the FPGA over PCIe
arriving at the PE input through one or more AXI4-Stream interfaces.
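
To make the idea of a CU as a data-pipelined network of PEs concrete, the following is a minimal software-only sketch. It does not use any VSC headers or pragmas. The names pe_scale, pe_offset, run_cu, and the Stream alias are hypothetical, and a std::queue merely stands in for an AXI4-Stream connection between PEs.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical stand-in for a stream connection between two PEs.
using Stream = std::queue<int>;

// Hypothetical PE: doubles every value it reads from its input stream.
void pe_scale(Stream& in, Stream& out) {
    while (!in.empty()) {
        out.push(in.front() * 2);
        in.pop();
    }
}

// Hypothetical PE: adds one to every value it reads from its input stream.
void pe_offset(Stream& in, Stream& out) {
    while (!in.empty()) {
        out.push(in.front() + 1);
        in.pop();
    }
}

// Hypothetical CU: feeds a buffer (standing in for device memory) through
// the two PEs in sequence and collects the output of the last PE.
std::vector<int> run_cu(const std::vector<int>& device_buffer) {
    Stream s0, s1, s2;
    for (int v : device_buffer) s0.push(v);  // read from "device memory"

    pe_scale(s0, s1);   // first pipeline stage
    pe_offset(s1, s2);  // second pipeline stage

    // Every intermediate stream must have been fully consumed.
    assert(s0.empty() && s1.empty());

    std::vector<int> result;
    while (!s2.empty()) {
        result.push_back(s2.front());
        s2.pop();
    }
    return result;
}
```

In this sketch the stages run one after another. In hardware the PEs would run concurrently, each consuming data as soon as the previous stage produces it.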

The CUs must connect to platform ports, which are typically memory-mapped AXI4 (M_AXI) ports for data transfers to and from a host CPU through DDR, or AXI4-Lite ports for low-bandwidth scalar transfers. The CUs can operate on independent data sets to exploit the macro-level parallelism inherent in the application, which can yield substantial acceleration. VSC can attach a data mover (DM) to each M_AXI port. The DM is an RTL IP that implements DDR transfers efficiently by automating well-defined protocols such as AXI bursting. The CUs can also transfer data to the CUs of another user-defined accelerator through the device memory.
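
As an illustration of the kind of bookkeeping that AXI bursting involves, the sketch below splits one memory transfer into incrementing bursts. It relies on two AXI4 rules: a burst has at most 256 beats, and a burst must not cross a 4 KB address boundary. This is not the implementation of the DM IP, which is not described here. The Burst struct and split_into_bursts function are hypothetical names, and the sketch assumes the address and length are aligned to the beat size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one burst: start address and number of beats.
struct Burst {
    std::uint64_t addr;
    std::uint32_t beats;
};

// Split the byte range [addr, addr + bytes) into bursts that respect the
// AXI4 limits of 256 beats per burst and no crossing of a 4 KB boundary.
// Assumption: beat_bytes is a power of two no larger than 128, and both
// addr and bytes are multiples of beat_bytes.
std::vector<Burst> split_into_bursts(std::uint64_t addr, std::uint64_t bytes,
                                     std::uint32_t beat_bytes) {
    constexpr std::uint64_t kBoundary = 4096;  // 4 KB address boundary
    constexpr std::uint64_t kMaxBeats = 256;   // AXI4 maximum burst length

    assert(beat_bytes > 0 && beat_bytes <= 128);
    assert((beat_bytes & (beat_bytes - 1)) == 0);
    assert(addr % beat_bytes == 0 && bytes % beat_bytes == 0);

    std::vector<Burst> bursts;
    while (bytes > 0) {
        const std::uint64_t to_boundary = kBoundary - (addr % kBoundary);
        const std::uint64_t max_bytes = kMaxBeats * beat_bytes;
        const std::uint64_t len = std::min({bytes, to_boundary, max_bytes});
        bursts.push_back({addr, static_cast<std::uint32_t>(len / beat_bytes)});
        addr += len;
        bytes -= len;
    }
    return bursts;
}
```

For example, an 8192-byte transfer that starts at address 0x0F00 with 64-byte beats first needs a short burst to reach the next 4 KB boundary, and the remainder is then split at each following boundary.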

VSC provides an application-layer interface, shown on the left side of the figure. This C++ API consists primarily of two threads for each hardware accelerator, or cluster of CUs. The send thread forwards data and launches jobs on the accelerator, while the receive thread gathers results from the accelerator. The send thread calls a function named compute(), which acts as the software interface for launching the corresponding job on the accelerator. The runtime layer automates the details of scheduling such jobs onto the CU group and of managing efficient transfers of the compute() arguments. Because these threads are independent, the software can interact asynchronously with the hardware execution, which lets it model the application-specific computation and data transfers efficiently. The VSC software interface also provides several controls for user-driven synchronization with the hardware.
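
The send-thread and receive-thread pattern can be sketched in standard C++ as follows. This is a software-only emulation under stated assumptions and not the VSC API: MockAccelerator, close(), receive(), and run_jobs are hypothetical names, and only compute() mirrors a name used in this section. The "hardware" computation is faked by squaring the input on the receive side.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical software stand-in for an accelerator (not the VSC API).
class MockAccelerator {
public:
    // Called from the send thread: launch one job by queuing its input.
    void compute(int input) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(input);
        }
        ready_.notify_one();
    }

    // Called from the send thread once no more jobs will be launched.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Called from the receive thread: wait for the next job result.
    // Returns false once all launched jobs have been received.
    bool receive(int& result) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !jobs_.empty() || closed_; });
        if (jobs_.empty()) return false;
        const int input = jobs_.front();
        jobs_.pop();
        lock.unlock();
        result = input * input;  // stands in for the hardware computation
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<int> jobs_;
    bool closed_ = false;
};

// Launch one job per input from a send thread and gather the results,
// in launch order, from an independent receive thread.
std::vector<int> run_jobs(const std::vector<int>& inputs) {
    MockAccelerator acc;
    std::vector<int> results;

    std::thread sender([&] {
        for (int v : inputs) acc.compute(v);
        acc.close();
    });
    std::thread receiver([&] {
        int r = 0;
        while (acc.receive(r)) results.push_back(r);
    });

    sender.join();
    receiver.join();
    assert(results.size() == inputs.size());
    return results;
}
```

The two threads share only the job queue, so the sender can keep launching jobs while earlier results are still being gathered. That is the asynchronous interaction the paragraph above describes.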

In summary, VSC provides a unified system composition paradigm in C++ and a runtime layer that supports hardware compositions with streamlined data transfer between CUs and device memory, and it implements hardware-software interactions efficiently out of the box.

Because hardware compilation is very time-consuming, it is important that changes confined to the software code do not trigger recompilation of the hardware. VSC avoids this when the user follows a specific coding style, and it allows the creation of reusable user-space libraries. These libraries also act as a software stack of C++ APIs on top of the hardware accelerator system specified by the user. Such a library can even be built as a shared library that is loaded at run time and integrated with a third-party software application.