Floats and Doubles - 2021.2 English

Vitis High-Level Synthesis User Guide (UG1399)

Document ID
English (United States)
Release Date
2021.2 English

Vitis HLS supports float and double types for synthesis. Both data types are synthesized with IEEE-754 standard partial compliance (see Floating-Point Operator LogiCORE IP Product Guide (PG060)).

  • Single-precision 32-bit
    • 24-bit fraction
    • 8-bit exponent
  • Double-precision 64-bit
    • 53-bit fraction
    • 11-bit exponent

In addition to using floats and doubles for standard arithmetic operations (such as +, -, * ) floats and doubles are commonly used with the math.h (and cmath.h for C++). This section discusses support for standard operators.

The following code example shows the header file used with Standard Types updated to define the data types to be double and float types.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

#define N 9

typedef double din_A;
typedef double din_B;
typedef double din_C;
typedef float din_D;

typedef double dout_1;
typedef double dout_2;
typedef double dout_3;
typedef float dout_4;

void types_float_double(din_A inA,din_B inB,din_C inC,din_D inD,dout_1 
*out1,dout_2 *out2,dout_3 *out3,dout_4 *out4);

This updated header file is used with the following code example where a sqrtf() function is used.

#include "types_float_double.h"

void types_float_double(
 din_A  inA,
 din_B  inB,
 din_C  inC,
 din_D  inD,
 dout_1 *out1,
 dout_2 *out2,
 dout_3 *out3,
 dout_4 *out4
 ) {

 // Basic arithmetic & math.h sqrtf() 
 *out1 = inA * inB;
 *out2 = inB + inA;
 *out3 = inC / inA;
 *out4 = sqrtf(inD);


When the example above is synthesized, it results in 64-bit double-precision multiplier, adder, and divider operators. These operators are implemented by the appropriate floating-point Xilinx IP catalog cores.

The square-root function used sqrtf() is implemented using a 32-bit single-precision floating-point core.

If the double-precision square-root function sqrt() was used, it would result in additional logic to cast to and from the 32-bit single-precision float types used for inD and out4: sqrt() is a double-precision (double) function, while sqrtf() is a single precision (float) function.

In C functions, be careful when mixing float and double types as float-to-double and double-to-float conversion units are inferred in the hardware.

float foo_f    = 3.1459;
float var_f = sqrt(foo_f); 

The above code results in the following hardware:

-> Float-to-Double Converter unit
-> Double-Precision Square Root unit
-> Double-to-Float Converter unit
-> wire (var_f)

Using a sqrtf() function:

  • Removes the need for the type converters in hardware
  • Saves area
  • Improves timing

When synthesizing float and double types, Vitis HLS maintains the order of operations performed in the C code to ensure that the results are the same as the C simulation. Due to saturation and truncation, the following are not guaranteed to be the same in single and double precision operations:

       A=B*C; A=B*F;
       D=E*F; D=E*C;
       O1=A*D O2=A*D;

With float and double types, O1 and O2 are not guaranteed to be the same.

Tip: In some cases (design dependent), optimizations such as unrolling or partial unrolling of loops, might not be able to take full advantage of parallel computations as Vitis HLS maintains the strict order of the operations when synthesizing float and double types. This restriction can be overridden using config_compile -unsafe_math_optimizations.

For C++ designs, Vitis HLS provides a bit-approximate implementation of the most commonly used math functions.

Floating-Point Accumulator and MAC

Floating point accumulators (facc), multiply and accumulate (fmacc), and multiply and add (fmadd) can be enabled using the config_op command shown below:

config_op <facc|fmacc|fmadd> -impl <none|auto> -precision <low|standard|high>

Vitis HLS supports different levels of precision for these operators that tradeoff between performance, area, and precision on both Versal and non-Versal devices.

  • Low-precision accumulation is suitable for high-throughput low-precision floating point accumulation and multiply-accumulation, this mode is only available in non-Versal devices.
    • It uses an integer accumulator with a pre-scaler and a post-scaler (to convert input and output to single-precision or double-precision floating point).
      • It uses a 60 bit and 100 bit accumulator for single and double precision inputs respectively.
      • It can cause cosim mismatches due to insufficient precision with respect to C++ simulation
    • It can always be pipelined with an II=1 without source code changes
    • It uses approximately 3X the resources of standard-precision floating point accumulation, which achieves an II that is typically between 3 and 5, depending on clock frequency and target device.
    Using low-precision, accumulation for floats and doubles is supported with an initiation interval (II) of 1 on all devices. This means that the following code can be pipelined with an II of 1 without any additional coding:
    float foo(float A[10], float B[10]) {
     float sum = 0.0;
     for (int i = 0; i < 10; i++) {
     sum += A[i] * B[i];
     return sum;
  • Standard-precision accumulation and multiply-add is suitable for most uses of floating-point, and is available on Versal and non-Versal devices.
    • It always uses a true floating-point accumulator
    • It can be pipelined with an II=1 on Versal devices.
    • It can be pipelined with an II that is typically between 3 and 5 (depending on clock frequency and target device) on non-Versal devices. The standard precision mode is more efficient on Versal devices than on non-Versal devices.
  • High-precision fused multiply-add is suitable for high-precision applications and is available on Versal devices.
    • It uses one extra bit of precision
    • It always uses a single fused multiply-add, with a single rounding at the end, although it uses more resources than the unfused multiply-add
    • It can cause cosim mismatches due to the extra precision with respect to C++ simulation
Tip: To achieve the same results as in prior releases, use the following configuration:
config_op facc -impl auto -precision low