This chapter is intended for C/C++ software developers who want to accelerate their data center applications using Xilinx FPGA-based Alveo™ Accelerator Cards. The goal of this guide is to introduce key concepts and provide a pathway for software developers to begin accelerating applications using the Vitis compiler and integrated development environment (IDE).
FPGAs offer many advantages over traditional CPU/GPU acceleration, including a custom architecture capable of implementing any function that can run on a processor, with better performance at lower power dissipation. Compared with processor architectures, the structures that comprise the programmable logic (PL) fabric in a Xilinx device enable a high degree of parallelism in application execution.
The following key concepts apply when creating FPGA-accelerated applications that outperform their CPU-only counterparts:
- Applications written for the CPU and the FPGA are structured quite differently, and functions targeted for FPGA acceleration must be rewritten. Functions execute sequentially on the CPU, whereas code targeting the FPGA must be written so that the compiler can infer parallelism to achieve greater performance.
- For application acceleration, the software program is split into a host application that runs on the CPU and compute functions, or kernels, that run on the Alveo data center accelerator card. The Xilinx Runtime library (XRT) provides an API enabling the host application to interact with the kernels on the accelerator cards.
- Data transfers between the host and global memory introduce latency, which can be costly to the overall application. To achieve acceleration in a real system, the performance achieved by the hardware acceleration kernels must outweigh the added latency of the data transfers.
- The software developer should profile the original application and identify functions with the potential to be accelerated. Once the target functions are identified, a performance budget for each kernel should be determined to meet the overall application performance goal.
- The memory hierarchy plays a key role in overall application performance. The memory accessed by kernels should be grouped as memory reads and writes in separate functions using a load-compute-store architecture. The kernels should access contiguous memory if possible and the number of accesses should be optimized by removing redundant accesses or by creating a local cache.
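To make the host/kernel split concrete, the following sketch shows a typical host-side flow using the XRT native C++ API: open the device, program it with the kernel binary, move data to device global memory, run the kernel, and read back the results. The kernel name `vadd` and the binary file name `vadd.xclbin` are illustrative, and error handling is omitted for brevity.

```cpp
// Minimal host-side flow using the XRT native C++ API.
// Assumes a hypothetical vector-add kernel "vadd" built into "vadd.xclbin".
#include <xrt/xrt_bo.h>
#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>

#include <cstddef>
#include <vector>

int main() {
    constexpr std::size_t n = 1024;
    constexpr std::size_t bytes = n * sizeof(int);
    std::vector<int> a(n, 1), b(n, 2), out(n, 0);

    // Open the Alveo card and program it with the kernel binary.
    xrt::device device(0);
    auto uuid = device.load_xclbin("vadd.xclbin");
    xrt::kernel krnl(device, uuid, "vadd");

    // Allocate device buffers in the memory banks the kernel arguments map to.
    xrt::bo bo_a(device, bytes, krnl.group_id(0));
    xrt::bo bo_b(device, bytes, krnl.group_id(1));
    xrt::bo bo_out(device, bytes, krnl.group_id(2));

    // Transfer inputs from the host to device global memory.
    bo_a.write(a.data());
    bo_b.write(b.data());
    bo_a.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    bo_b.sync(XCL_BO_SYNC_BO_TO_DEVICE);

    // Launch the kernel and wait for completion.
    auto run = krnl(bo_a, bo_b, bo_out, n);
    run.wait();

    // Transfer results back to the host.
    bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
    bo_out.read(out.data());
    return 0;
}
```

Note that each `sync` call is an explicit host-to-device or device-to-host transfer; these are exactly the latencies the third bullet above says the kernel's speedup must outweigh.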
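A quick way to reason about the transfer-latency trade-off described above: acceleration only pays off when kernel execution time plus data transfer time is less than the original CPU execution time. A minimal sketch, with illustrative function names and timings rather than measured values:

```cpp
// Effective speedup of an accelerated function, accounting for the
// latency of moving data between host and device global memory.
// All times are in the same unit (e.g., milliseconds); the specific
// numbers used with these helpers are illustrative, not measurements.
double effective_speedup(double cpu_time, double kernel_time, double transfer_time) {
    return cpu_time / (kernel_time + transfer_time);
}

// Acceleration is worthwhile only if the effective speedup exceeds 1.0,
// i.e., the kernel's gain outweighs the added transfer latency.
bool acceleration_pays_off(double cpu_time, double kernel_time, double transfer_time) {
    return effective_speedup(cpu_time, kernel_time, transfer_time) > 1.0;
}
```

For example, a function that takes 100 ms on the CPU, 5 ms in the kernel, and 20 ms in transfers yields a 4x effective speedup; the same kernel with 200 ms of transfers would be a net slowdown.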
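The load-compute-store architecture from the last bullet can be sketched in plain C++. In actual Vitis HLS code the three stages would typically stream data through FIFO channels and run concurrently as a dataflow pipeline; this software-only sketch, with illustrative function names and a hypothetical vector-add kernel, only shows the separation of memory access from computation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical vector-add kernel structured as load-compute-store.
// Keeping global-memory reads and writes in dedicated functions lets
// the tools infer contiguous burst accesses; the compute stage then
// works only on local data.

static void load(const int* src, std::vector<int>& local, std::size_t n) {
    // Read a contiguous block from global memory into local storage.
    local.assign(src, src + n);
}

static void compute(const std::vector<int>& a, const std::vector<int>& b,
                    std::vector<int>& out) {
    // The only stage that operates on the data values.
    out.resize(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] + b[i];
}

static void store(const std::vector<int>& local, int* dst) {
    // Write the results back to global memory as one contiguous block.
    std::copy(local.begin(), local.end(), dst);
}

void vadd(const int* in1, const int* in2, int* out, std::size_t n) {
    std::vector<int> a, b, result;
    load(in1, a, n);
    load(in2, b, n);
    compute(a, b, result);
    store(result, out);
}
```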
You are encouraged to review this material, as well as the extended material referenced in the following topics. After reviewing the key concepts and examples in this document, along with the extended reference material, you should have a practical understanding of how to develop new functions, or modify existing ones, for acceleration with an architecture that meets your performance needs.