Specifying Arrays as Stream-of-Blocks - 2022.2 English

Vitis High-Level Synthesis User Guide (UG1399)

Document ID: UG1399
Release Date: 2022-12-07
Version: 2022.2 English

The hls::stream_of_blocks type provides a user-synchronized stream that supports streaming blocks of data for process-level interfaces in a dataflow context, where each block is an array or multidimensional array. The intended use of stream-of-blocks is to replace array-based communication between a pair of processes within a dataflow region. Refer to the using_stream_of_blocks example on GitHub.

Currently, Vitis HLS implements arrays written by a producer process and read by a consumer process in a dataflow region by mapping them to ping-pong buffers (PIPOs). The buffer exchange for a PIPO occurs at the return of the producer function and the call of the consumer function in C++.

Stream-of-Blocks Modeling Style

With a stream-of-blocks, by contrast, the communication between the producer and the consumer is modeled as a stream of array-like objects, which provides several advantages over array transfer through a PIPO.

The use of stream-of-blocks in your code requires the following include file:

#include "hls_streamofblocks.h"

The stream-of-blocks object template is:

hls::stream_of_blocks<block_type, depth> v

Where:

  • <block_type> specifies the datatype of the array or multidimensional array held by the stream-of-blocks
  • <depth> is an optional argument that provides depth control just like hls::stream or PIPOs, and specifies the total number of blocks, including the one acquired by the producer and the one acquired by the consumer at any given time. The default value is 2
  • v specifies the variable name for the stream-of-blocks object

Use the following steps to access a block in a stream of blocks:

  1. The producer or consumer process that wants to access the stream first needs to acquire access to it, using an hls::write_lock or hls::read_lock object.
  2. After the producer has acquired the lock, it can start writing (or reading) the acquired block. Once the block has been fully initialized, the producer releases it when the write_lock object goes out of scope.

    Note: The producer process with a write_lock can also read the block as long as it only reads from already written locations, because the newly acquired buffer must be assumed to contain uninitialized data. The ability to write and read the block is unique to the producer process, and is not supported for the consumer.

  3. Then the block is queued in the stream-of-blocks in a FIFO fashion, and when the consumer acquires a read_lock object, the block can be read by the consumer process.
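
The acquire, write, release, and read steps above can be sketched as a software model in standard C++. This is only an illustration of the scope-based locking pattern, not the Vitis HLS implementation: BlockStreamModel, WriteLock, ReadLock, and run_one_block are hypothetical names, the model is single-threaded, and it uses assert where real hardware would stall until a block is free or available.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Simplified, single-threaded stand-in for a stream of blocks.
// NOT the Vitis HLS implementation; every name here is hypothetical.
template <typename T, std::size_t N, std::size_t Depth = 2>
class BlockStreamModel {
    std::array<std::array<T, N>, Depth> blocks_{};
    std::size_t head_ = 0;   // next block the consumer will read
    std::size_t tail_ = 0;   // next block the producer will write
    std::size_t ready_ = 0;  // blocks released by the producer and not yet
                             // released by the consumer

public:
    std::size_t ready() const { return ready_; }

    // Models a write lock: construction acquires a free block,
    // destruction queues it for the consumer in FIFO order.
    class WriteLock {
        BlockStreamModel &s_;
    public:
        explicit WriteLock(BlockStreamModel &s) : s_(s) {
            assert(s_.ready_ < Depth && "no free block");
        }
        WriteLock(const WriteLock &) = delete;
        ~WriteLock() {
            s_.tail_ = (s_.tail_ + 1) % Depth;
            ++s_.ready_;
        }
        T &operator[](std::size_t i) { return s_.blocks_[s_.tail_][i]; }
    };

    // Models a read lock: construction acquires the oldest queued block,
    // destruction hands it back to the producer for reuse.
    class ReadLock {
        BlockStreamModel &s_;
    public:
        explicit ReadLock(BlockStreamModel &s) : s_(s) {
            assert(s_.ready_ > 0 && "no block available");
        }
        ReadLock(const ReadLock &) = delete;
        ~ReadLock() {
            s_.head_ = (s_.head_ + 1) % Depth;
            --s_.ready_;
        }
        const T &operator[](std::size_t i) const { return s_.blocks_[s_.head_][i]; }
    };
};

// Writes one 4-element block, then reads it back and sums it.
int run_one_block() {
    using Stream = BlockStreamModel<int, 4>;
    Stream s;
    {
        Stream::WriteLock b(s);  // step 1: producer acquires a block
        for (std::size_t j = 0; j < 4; ++j)
            b[j] = static_cast<int>(10 * j);  // step 2: producer fills it
    }  // write lock leaves scope: block is queued for the consumer
    int sum = 0;
    {
        Stream::ReadLock b(s);  // step 3: consumer acquires the queued block
        for (std::size_t j = 0; j < 4; ++j)
            sum += b[j];
    }  // read lock leaves scope: block can be reused by the producer
    return sum;  // 0 + 10 + 20 + 30 = 60
}
```

In this model the block only becomes visible to the consumer once the write lock is destroyed, which mirrors the release-on-scope-exit behavior described in step 2.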

The main difference between hls::stream_of_blocks and the PIPO mechanism seen in the prior examples is that the block becomes available to the consumer as soon as the write_lock goes out of scope, rather than only at the return of the producer process. Hence, the storage required for the original example (without the dataflow loop) is much smaller with stream-of-blocks than with PIPOs alone: 2N instead of 2 x M x N in the example.
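
As a rough check of the storage comparison above, the two sizes can be computed directly. The sketch assumes that the prior example transfers a full M x N array through the PIPO, and that the stream-of-blocks uses the default depth of 2 with N-element blocks. The function names and the values of kM and kN are hypothetical, chosen only for illustration.

```cpp
#include <cstddef>

// Elements stored by a PIPO holding the whole M x N array:
// two full copies (the "ping" and the "pong" buffer).
constexpr std::size_t pipo_elements(std::size_t m, std::size_t n) {
    return 2 * m * n;
}

// Elements stored by a stream-of-blocks at the default depth of 2,
// where each block is a single N-element array.
constexpr std::size_t stream_of_blocks_elements(std::size_t n) {
    return 2 * n;
}

// Hypothetical sizes, not taken from the user guide.
constexpr std::size_t kM = 8;
constexpr std::size_t kN = 1024;

static_assert(pipo_elements(kM, kN) == 16384, "2 x M x N elements");
static_assert(stream_of_blocks_elements(kN) == 2048, "2N elements");
```

With these assumed sizes, the stream-of-blocks needs a factor of M less storage than the PIPO.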

The example below rewrites the prior example to use hls::stream_of_blocks. The producer acquires the block by constructing an hls::write_lock object called b, passing it a reference to the stream-of-blocks object, called s. The write_lock object provides an overloaded array access operator, so it can be indexed like an array to access the underlying storage in random order.

The acquisition of the lock is performed by constructing the write_lock/read_lock object, and the release occurs automatically when that object is destructed as it goes out of scope. This approach uses the common Resource Acquisition Is Initialization (RAII) style of locking and unlocking.

#include "hls_streamofblocks.h"
typedef int buf[N];
void producer (hls::stream_of_blocks<buf> &s, ...) {
  for (int i = 0; i < M; i++) {
    // Allocation of hls::write_lock acquires the block for the producer
    hls::write_lock<buf> b(s);
    for (int j = 0; j < N; j++)
      b[f(j)] = ...;
    // Deallocation of hls::write_lock releases the block for the consumer
  }
}
  
void consumer(hls::stream_of_blocks<buf> &s, ...) {
  for (int i = 0; i < M; i++) {
    // Allocation of hls::read_lock acquires the block for the consumer
    hls::read_lock<buf> b(s);
    for (int j = 0; j < N; j++)
      ... = b[g(j)] ...;
    // Deallocation of hls::read_lock releases the block to be reused by the producer
  }
}
  
void top(...) {
#pragma HLS dataflow
  hls::stream_of_blocks<buf> s;
  
  producer(s, ...);
  consumer(s, ...);
}

The key features of this approach include:

  • The outer loop in the producer above is expected to achieve an overall Initiation Interval (II) of 1.
  • A locked block can be used as though it were private to the producer or the consumer process until it is released.
  • The initial state of the acquired block is undefined for the producer; for the consumer, it contains the values written by the producer.
  • The principal advantage of stream-of-blocks is to provide overlapped execution of multiple iterations of the consumer and the producer to increase throughput.

Resource Usage

The resource cost of increasing the depth beyond the default value of 2 is similar to the resource cost of PIPOs. Each increment of 1 requires enough memory for one additional block; in the example above, that is N 32-bit words.
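
The depth scaling described above can be written as a small estimate. This sketch counts raw storage bits only and assumes 8-bit bytes; how that storage maps onto physical memory primitives is decided by the tool. The function name stream_of_blocks_bits and the value of kBlockLen are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Raw storage, in bits, for `depth` blocks of `n` elements of type T.
// This is an estimate of data storage only, not a resource report.
template <typename T>
constexpr std::size_t stream_of_blocks_bits(std::size_t depth, std::size_t n) {
    return depth * n * sizeof(T) * 8;
}

// Hypothetical block length, chosen only for illustration.
constexpr std::size_t kBlockLen = 1024;

// Raising the depth from 2 to 3 adds exactly one block of N 32-bit words.
static_assert(stream_of_blocks_bits<std::int32_t>(3, kBlockLen) -
                  stream_of_blocks_bits<std::int32_t>(2, kBlockLen) ==
              kBlockLen * 32,
              "one extra block per depth increment");
```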

The stream-of-blocks object can be bound to a specific RAM type by placing the BIND_STORAGE pragma where the stream-of-blocks is declared, for example in the top-level function. By default, the stream-of-blocks uses 2-port BRAM (type=RAM_2P).