One of the most important aspects for performance, if not the most important, is the data width required by the implementation. The tool propagates port widths throughout the algorithm. In some cases, especially when starting from an algorithmic description, the C/C++/OpenCL code might use only large data types, such as integers, even at the ports of the design. However, as the algorithm is mapped to a fully configurable implementation, smaller data types, such as 10- or 12-bit, often suffice. It is beneficial to check the size of basic operations in the HLS Synthesis report during optimization.
In general, when the Vitis core development kit maps an algorithm onto the FPGA, it must first analyze the C/C++/OpenCL code structure and extract operational dependencies. To perform this mapping, the Vitis core development kit partitions the source code into operational units, which are then mapped onto the FPGA. Several aspects influence the number and size of these operational units (ops) as seen by the tool.
The following figure reports the basic operations and their bit widths.
Look for bit widths of 16, 32, and 64 bits, which are commonly used in algorithmic descriptions, and verify that the associated operation in the C/C++/OpenCL source actually requires a bit width this large. Reducing these widths can considerably improve the implementation of the algorithm, because smaller operations require less computation time.
Some applications use floating-point computation only because they were optimized for another hardware architecture. For applications such as deep learning, using fixed-point arithmetic instead can significantly improve power efficiency and reduce area while maintaining the same level of accuracy.
It is sometimes advantageous to think in terms of larger computational elements. The tool then operates on such a block of code independently of the remaining source code, effectively mapping the algorithm onto the FPGA without consideration of the surrounding operations. When applied, the Vitis technology keeps the operational boundaries, effectively creating macro operations for the specific code. This uses the following principles:
- Operational locality to the mapping process
- Reduction in complexity for the heuristics
This might create vastly different results when applied. In C/C++, macro operations are created with the help of `#pragma HLS inline off`. In the OpenCL API, the same kind of macro operation can be generated by not specifying the inline attribute when defining a function. For more information, see pragma HLS inline.
Using Optimized Libraries
The OpenCL specification provides many math built-in functions. All math built-in functions with the `native_` prefix are mapped to one or more native device instructions and typically have better performance than the corresponding functions (without the `native_` prefix). The accuracy, and in some cases the input ranges, of these functions is implementation-defined. In the Vitis technology, these `native_` built-in functions use the equivalent functions in the Vitis HLS tool Math library, which are already optimized for Xilinx FPGAs in terms of area and performance. Use `native_` built-in functions or the HLS tool Math library if the accuracy meets the application requirements.
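As a sketch, a kernel can call the `native_` variant wherever the implementation-defined accuracy is acceptable. The fragment below is illustrative device code (it requires an OpenCL runtime to execute); `native_exp` and `exp` are both standard OpenCL C built-ins, and the kernel name and buffers are hypothetical:

```c
// Illustrative OpenCL C kernel (device code only).
__kernel void exp_stage(__global const float *in,
                        __global float *out,
                        const int n)
{
    int i = get_global_id(0);
    if (i < n)
        out[i] = native_exp(in[i]);  // faster; accuracy implementation-defined
        // out[i] = exp(in[i]);      // fully accurate reference version
}
```

Switching between the two variants is a one-line change, which makes it easy to verify against the accurate version that the `native_` accuracy meets the application requirements.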