Dataflow Optimization Limitations - 2021.2 English

Vitis High-Level Synthesis User Guide (UG1399)

Document ID: UG1399
Locale: English (United States)
Release Date: 2021-12-15
Version: 2021.2 English

The DATAFLOW optimization optimizes the flow of data between tasks (functions and loops), and ideally pipelines functions and loops for maximum performance. The tasks do not need to be chained one after the other; however, there are limitations on how data can be transferred between them.

The following behaviors can prevent or limit the overlapping that Vitis HLS can perform with DATAFLOW optimization:

  • Reading from function inputs or writing to function outputs in the middle of the dataflow region
  • Single-producer-consumer violations
  • Bypassing tasks and channel sizing
  • Feedback between tasks
  • Conditional execution of tasks
  • Loops with multiple exit conditions
Important: If any of these coding styles are present, Vitis HLS issues a message describing the situation.
Note: You can use the Dataflow viewer in the Analysis perspective to view the structure when the DATAFLOW directive is applied.

Reading from Inputs/Writing to Outputs

Function inputs should be read at the start of the dataflow region, and function outputs should be written at the end. Reading or writing the ports of the function in the middle of the region can cause the processes to execute in sequence rather than in an overlapped fashion, adversely impacting performance.
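For example, a scalar input port can be read once into a local variable at the start of the region so that no task touches the port mid-region. The following is a minimal sketch (the function and loop names are illustrative, not from UG1399):

```cpp
#define N 8

// Sketch: the scalar port 'scale' is read once at the start of the region.
// Each task then uses the local copy, so no task re-reads a function input
// in the middle of the dataflow region.
void foo_cached(int data_in[N], int scale, int data_out[N]) {
#pragma HLS dataflow
  int temp[N];
  const int scale_local = scale; // read the input port once, up front

  Loop1: for (int i = 0; i < N; i++)
    temp[i] = data_in[i] * scale_local;

  Loop2: for (int j = 0; j < N; j++)
    data_out[j] = temp[j] + scale_local; // uses the cached copy, not the port
}
```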

Single-producer-consumer Violations

For Vitis HLS to perform the DATAFLOW optimization, all elements passed between tasks must follow a single-producer-consumer model. Each variable must be driven from a single task and only be consumed by a single task. In the following code example, temp1 fans out and is consumed by both Loop2 and Loop3. This violates the single-producer-consumer model.

void foo(int data_in[N], int scale, int data_out1[N], int data_out2[N]) {
   int temp1[N];

   Loop1: for(int i = 0; i < N; i++) {
     temp1[i] = data_in[i] * scale;
   }
   Loop2: for(int j = 0; j < N; j++) {
     data_out1[j] = temp1[j] * 123;
   }
   Loop3: for(int k = 0; k < N; k++) {
     data_out2[k] = temp1[k] * 456;
   }
}

A modified version of this code uses function Split to create a single-producer-consumer design. The following code block example shows how the data flows with the function Split. The data now flows between all four tasks, and Vitis HLS can perform the DATAFLOW optimization.

void Split(int in[N], int out1[N], int out2[N]) {
// Duplicate the data into two arrays, one per consumer
 L1: for(int i = 0; i < N; i++) {
   out1[i] = in[i];
   out2[i] = in[i];
 }
}
void foo(int data_in[N], int scale, int data_out1[N], int data_out2[N]) {

  int temp1[N], temp2[N], temp3[N];
  Loop1: for(int i = 0; i < N; i++) {
    temp1[i] = data_in[i] * scale;
  }
  Split(temp1, temp2, temp3);
  Loop2: for(int j = 0; j < N; j++) {
    data_out1[j] = temp2[j] * 123;
  }
  Loop3: for(int k = 0; k < N; k++) {
    data_out2[k] = temp3[k] * 456;
  }
}

Bypassing Tasks and Channel Sizing

In addition, data should generally flow from one task to another. If you bypass tasks, this can reduce the performance of the DATAFLOW optimization. In the following example, Loop1 generates the values for temp1 and temp2. However, the next task, Loop2, only uses the value of temp1. The value of temp2 is not consumed until after Loop2. Therefore, temp2 bypasses the next task in the sequence, which can limit the performance of the DATAFLOW optimization.

void foo(int data_in[N], int scale, int data_out[N]) {
  int temp1[N], temp2[N], temp3[N];
  Loop1: for(int i = 0; i < N; i++) {
    temp1[i] = data_in[i] * scale;
    temp2[i] = data_in[i] >> scale;
  }
  Loop2: for(int j = 0; j < N; j++) {
    temp3[j] = temp1[j] + 123;
  }
  Loop3: for(int k = 0; k < N; k++) {
    data_out[k] = temp2[k] + temp3[k];
  }
}
In this case, you should increase the depth of the PIPO buffer used to store temp2 to be 3, instead of the default depth of 2. This lets the buffer store the value intended for Loop3, while Loop2 is being executed. Similarly, a PIPO that bypasses two processes should have a depth of 4. Set the depth of the buffer with the STREAM pragma or directive:
#pragma HLS STREAM type=pipo variable=temp2 depth=3
Important: Channel sizing can also similarly affect performance. Having mismatched FIFO/PIPO depths can inadvertently cause synchronization points inside the dataflow region because of back pressure from the FIFO/PIPO.
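Applied to the bypass example above, the STREAM pragma goes inside the function body alongside the declaration of temp2. The following sketch shows the placement (the pragmas are interpreted by the HLS tool; a plain compiler ignores them):

```cpp
#define N 8

// Bypass example with the PIPO depth increased: temp2 skips Loop2, so its
// buffer needs a depth of 3 (one extra slot) instead of the default 2.
void foo_bypass(int data_in[N], int scale, int data_out[N]) {
#pragma HLS dataflow
  int temp1[N], temp2[N], temp3[N];
#pragma HLS STREAM type=pipo variable=temp2 depth=3

  Loop1: for (int i = 0; i < N; i++) {
    temp1[i] = data_in[i] * scale;
    temp2[i] = data_in[i] >> scale;  // temp2 bypasses Loop2
  }
  Loop2: for (int j = 0; j < N; j++)
    temp3[j] = temp1[j] + 123;
  Loop3: for (int k = 0; k < N; k++)
    data_out[k] = temp2[k] + temp3[k];
}
```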

Feedback between Tasks

Feedback occurs when the output from a task is consumed by a previous task in the DATAFLOW region. Feedback between tasks is not recommended in a DATAFLOW region. When Vitis HLS detects feedback, it issues a warning, depending on the situation, and might not perform the DATAFLOW optimization.

However, DATAFLOW can support feedback when used with hls::streams. The following example demonstrates this exception.

#include "ap_axi_sdata.h"
#include "hls_stream.h"

void firstProc(hls::stream<int> &forwardOUT, hls::stream<int> &backwardIN) {
  static bool first = true;
  int fromSecond;

  //Initialize stream
  if (first) 
    fromSecond = 10; // Initial stream value
  else
    //Read from stream
    fromSecond = backwardIN.read(); //Feedback value
  first = false;

  //Write to stream
  forwardOUT.write(fromSecond*2);
}

void secondProc(hls::stream<int> &forwardIN, hls::stream<int> &backwardOUT) {
  backwardOUT.write(forwardIN.read() + 1);
}

void top(...) {
#pragma HLS dataflow
  hls::stream<int> forward, backward;
  firstProc(forward, backward);
  secondProc(forward, backward);
}

In this simple design, when firstProc is executed, it uses 10 as an initial value for input. Because hls::streams do not support an initial value, this technique can be used to provide one without violating the single-producer-consumer rule. In subsequent iterations firstProc reads from the hls::stream through the backwardIN interface.

firstProc processes the value and sends it to secondProc, via a stream that goes forward in terms of the original C++ function execution order. secondProc reads the value on forwardIN, adds 1 to it, and sends it back to firstProc via the feedback stream that goes backwards in the execution order.

From the second execution, firstProc uses the value read from the stream to do its computation, and the two processes can keep going forever, with both forward and feedback communication, using an initial value for the first execution.

Conditional Execution of Tasks

The DATAFLOW optimization does not optimize tasks that are conditionally executed. The following example highlights this limitation. In this example, the conditional execution of Loop1 and Loop2 prevents Vitis HLS from optimizing the data flow between these loops, because the data does not flow from one loop into the next.

void foo(int data_in[N], int data_out[N], int sel) {

  int temp1[N], temp2[N];

  if (sel) {
    Loop1: for(int i = 0; i < N; i++) {
      temp1[i] = data_in[i] * 123;
      temp2[i] = data_in[i];
    }
  } else {
    Loop2: for(int j = 0; j < N; j++) {
      temp1[j] = data_in[j] * 321;
      temp2[j] = data_in[j];
    }
  }
  Loop3: for(int k = 0; k < N; k++) {
    data_out[k] = temp1[k] * temp2[k];
  }
}

To ensure each loop is executed in all cases, you must transform the code as shown in the following example. In this example, the conditional statement is moved into the first loop. Both loops are always executed, and data always flows from one loop to the next.

void foo(int data_in[N], int data_out[N], int sel) {

  int temp1[N], temp2[N];

  Loop1: for(int i = 0; i < N; i++) {
    if (sel) {
      temp1[i] = data_in[i] * 123;
    } else {
      temp1[i] = data_in[i] * 321;
    }
  }
  Loop2: for(int j = 0; j < N; j++) {
    temp2[j] = data_in[j];
  }
  Loop3: for(int k = 0; k < N; k++) {
    data_out[k] = temp1[k] * temp2[k];
  }
}

Loops with Multiple Exit Conditions

Loops with multiple exit points cannot be used in a DATAFLOW region. In the following example, Loop2 has three exit conditions:

  • An exit defined by the value of N; the loop will exit when k>=N.
  • An exit defined by the break statement.
  • An exit defined by the continue statement.
#include "ap_int.h"
#define N 16

typedef ap_int<8> din_t;
typedef ap_int<15> dout_t;
typedef ap_uint<8> dsc_t;
typedef ap_uint<2> dsel_t;

void multi_exit(din_t data_in[N], dsc_t scale, dsel_t select, dout_t data_out[N]) {
  dout_t temp1[N], temp2[N];
  int i, k;

  Loop1: for(i = 0; i < N; i++) {
    temp1[i] = data_in[i] * scale;
    temp2[i] = data_in[i] >> scale;
  }

  Loop2: for(k = 0; k < N; k++) {
    if (select == 0)
      data_out[k] = temp1[k] + temp2[k];
    else if (select == 1)
      continue; // skip to the next iteration
    else
      break;    // leave the loop before k reaches N
  }
}

Because a loop’s exit condition is always defined by the loop bounds, the use of break or continue statements will prohibit the loop from being used in a DATAFLOW region.
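One common fix is to fold the early exits into the loop body so that the loop bound is the only exit. The following sketch rewrites Loop2 in that style (plain int types stand in for the ap_int types, and a done flag models the former break path; these substitutions are illustrative):

```cpp
#define N 16

// Single-exit rewrite: the loop always runs until k == N. A 'done' flag
// models the former 'break' path, and the former 'continue' path simply
// performs no write for that iteration.
void multi_exit_fixed(int data_in[N], int scale, int select, int data_out[N]) {
  int temp1[N], temp2[N];

  Loop1: for (int i = 0; i < N; i++) {
    temp1[i] = data_in[i] * scale;
    temp2[i] = data_in[i] >> scale;
  }

  const bool done = (select != 0 && select != 1); // former 'break' case
  Loop2: for (int k = 0; k < N; k++) {
    if (!done && select == 0)
      data_out[k] = temp1[k] + temp2[k];
    // select == 1: former 'continue' case, no write this iteration
  }
}
```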

Finally, the DATAFLOW optimization has no hierarchical implementation. If a sub-function or loop contains additional tasks that might benefit from the DATAFLOW optimization, you must apply the optimization to that loop or sub-function as well, or inline the sub-function.
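For example, a DATAFLOW pragma in a parent function does not descend into a called sub-function; the sub-function needs its own pragma for its internal loops to overlap. A sketch with illustrative names:

```cpp
#define N 8

// The sub-function carries its own DATAFLOW pragma; the parent's pragma
// applies only to the tasks at its own level (Pre and the call to sub).
void sub(int in[N], int out[N]) {
#pragma HLS dataflow
  int mid[N];
  Stage1: for (int i = 0; i < N; i++) mid[i] = in[i] + 1;
  Stage2: for (int j = 0; j < N; j++) out[j] = mid[j] * 2;
}

void parent(int in[N], int out[N]) {
#pragma HLS dataflow
  int tmp[N];
  Pre: for (int i = 0; i < N; i++) tmp[i] = in[i] - 1;
  sub(tmp, out);
}
```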

You can also use std::complex inside the DATAFLOW region. However, these objects should be declared with an __attribute__((no_ctor)), as shown in the following example:
void proc_1(std::complex<float> (&buffer)[50], const std::complex<float> *in);
void proc_2(hls::stream<std::complex<float>> &fifo, const std::complex<float> (&buffer)[50], std::complex<float> &acc);
void proc_3(std::complex<float> *out, hls::stream<std::complex<float>> &fifo, const std::complex<float> acc);

void top(std::complex<float> *out, const std::complex<float> *in) {
#pragma HLS DATAFLOW

  std::complex<float> acc __attribute__((no_ctor)); // Here
  std::complex<float> buffer[50] __attribute__((no_ctor)); // Here
  hls::stream<std::complex<float>, 5> fifo; // Not here

  proc_1(buffer, in);
  proc_2(fifo, buffer, acc);
  proc_3(out, fifo, acc);
}