Improve Performance Using Stream-of-Blocks

Improve Performance Using Stream-of-Blocks - 2022.1 English

Vitis High-Level Synthesis User Guide (UG1399)

Document ID

UG1399

Release Date

2022-06-07

Version

2022.1 English

The hls::stream_of_blocks type provides a user-synchronized stream that supports streaming blocks of data for process-level interfaces in a dataflow context, where each block is an array or multidimensional array. The intended use of stream-of blocks is to replace array-based communication between a pair of processes within a dataflow region. Refer to Vitis-HLS-Introductory-Examples/Dataflow/Channels/using_stream_of_blocks on Github for an example.

Currently, Vitis HLS implements arrays written by a producer process and read by a consumer process in a dataflow region by mapping them to ping pong buffers (PIPOs). The buffer exchange for a PIPO buffer is driven by the ap_done/ap_continue handshake of the producer process, and by the ap_start/ap_ready handshake of the consumer process. In other words, the exchange occurs at the return of the producer function and the calling of the consumer function in C++.

While this ensures a concurrent communication semantic that is fully compliant with the sequential C++ execution semantics, it also implies that the consumer cannot start until the producer is done, as shown in the following code example.

void producer (int b[M][N], ...) {
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      b[i][f(j)] = ...;
}
 
void consumer(int b[M][N], ...) {
   for (int i = 0; i < M; i++)
     for (int j = 0; j < N; j++)
       ... = b[i][g(j)] ...;;
}
 
void top(...) {
#pragma HLS dataflow
  int b[M][N];
#pragma HLS stream off variable=b
 
  producer(b, ...);
  consumer(b, ...);
}

This can unnecessarily limit throughput if the producer generates data for the consumer in smaller blocks, for example by writing one row of the buffer output inside a nested loop, and the consumer uses the data in smaller blocks by reading one row of the buffer input inside a nested loop, as the example above does. In this example, due to the non-sequential buffer column access in the inner loop you cannot simply stream the array b. However, the row access in the outer loop is sequential thus supporting hls::stream_of_blocks communication where each block is a 1-dimensional array of size N.

The main purpose of the hls::stream_of_blocks feature is to provide PIPO-like functionality, but with user-managed explicit synchronization, accesses, and better coding style. Stream-of-blocks lets you avoid the use of dataflow in a loop containing the producer and consumer, which would have been a way to optimize the example above. However, in this case, the use of the dataflow loop containing the producer and consumer requires the use of a very large PIPO buffer (2xMxN) as shown in the following example:

void producer (int b[N], ...) {
  for (int j = 0; j < N; j++)
    b[f(j)] = ...;
}
 
void consumer(int b[N], ...) {
   for (int j = 0; j < N; j++)
     ... = b[g(j)];
}
 
void top(...) {
// The loop below is very constrained in terms of how it must be written
  for (int i = 0; i < M; i++) {
#pragma HLS dataflow
    int b[N];
#pragma HLS stream off variable=b
 
    producer(b, ...); // writes b
    consumer(b, ...); // reads b
  }
}

The dataflow-in-a-loop code above is also not desirable because this structure has several limitations in Vitis HLS, such as the loop structure must be very constrained (single induction variable, starting from 0 and compared with a constant or a function argument and incremented by 1).