HLS Split/Merge Library - 2023.2 English

Vitis High-Level Synthesis User Guide (UG1399)

Document ID

UG1399

Release Date

2023-12-18

Version

2023.2 English

Important: To use hls::split<> or hls::merge<> objects in your code, include the header file hls_np_channel.h as shown in the example below.

For use in dataflow processes, split/merge channels let you create one-to-many or many-to-one channels that distribute data to multiple tasks or aggregate data from multiple tasks. These channels have a built-in job scheduler that uses either a round-robin approach, in which data is distributed or gathered across the ports in a fixed sequence, or a load balancing approach, in which the port is chosen based on channel availability.

Tip: Load balancing can lead to non-deterministic results in RTL/Co-simulation. In this case, you will need to write a test bench that is agnostic as to the order of results.
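One way to make a test bench agnostic to result order is to compare the output and the golden data as unordered collections. The following is a minimal sketch in plain C++, not part of the HLS library; the function names same_results_any_order and check_results are hypothetical, and it assumes the results are int values that have already been drained from the output stream into a vector.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns true when both vectors hold the same values with the same
// multiplicities, regardless of the order in which they appear.
bool same_results_any_order(std::vector<int> actual, std::vector<int> expected) {
  if (actual.size() != expected.size()) return false;
  std::sort(actual.begin(), actual.end());
  std::sort(expected.begin(), expected.end());
  return actual == expected;
}

// Test bench helper: aborts if the output does not match the golden data.
void check_results(const std::vector<int>& actual, const std::vector<int>& expected) {
  assert(same_results_any_order(actual, expected) && "output mismatch");
}
```

Sorting copies of both vectors keeps the check independent of arrival order while still catching missing, extra, or duplicated results.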

As shown in the figure below, data is read from an input stream, split by the round-robin scheduler, and distributed to the associated worker tasks. After a worker completes its task, it writes its output, which is merged into a single stream, also by a round-robin scheduler.

Figure 1. Split/Merge Dataflow

A split channel has one producer and many consumers. It is typically used to distribute tasks to a set of workers; the distribution logic is abstracted and implemented in RTL, which leads to both better performance and lower resource usage. The distribution of an input to one of the N outputs can be:

Round-robin, where the consumers read the input data in a fixed rotating order, thus ensuring deterministic behavior, but not allowing load sharing with dynamically varying computational loads for the workers.
Load balancing, where the first consumer to attempt a read receives the next input data item, thus ensuring good load balancing, but with non-deterministic results.

A merge channel has many producers and a single consumer, and operates based on the reverse logic:

Round-robin, where the producer output data is merged in a fixed rotating order, thus ensuring deterministic behavior, but not allowing load sharing with dynamically varying computational loads for the workers.
Load balancing, where the first producer to complete its work writes first into the channel, with non-deterministic results.
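The effect of load balancing on result order can be illustrated with a small timing model. This is a sketch in plain C++ and not the HLS library: it assumes all inputs are available at time 0, that item k takes cost[k] time units on any worker, and that ties go to the lowest-numbered worker. The function name load_balancing_output_order is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Idealized model: each item is taken by whichever worker becomes idle first,
// and the merge emits results in order of completion. Returns the input
// indices in the order they leave the merge.
std::vector<std::size_t> load_balancing_output_order(
    const std::vector<int>& cost, std::size_t num_workers) {
  assert(num_workers > 0);
  std::vector<long> free_at(num_workers, 0);       // time each worker is next idle
  std::vector<std::pair<long, std::size_t>> done;  // (finish time, item index)

  for (std::size_t k = 0; k < cost.size(); ++k) {
    // The first worker to attempt a read is the one that is idle earliest.
    auto w = std::min_element(free_at.begin(), free_at.end());
    *w += cost[k];
    done.emplace_back(*w, k);
  }

  // Earlier finishers write into the merge first; ties keep input order.
  std::stable_sort(done.begin(), done.end(),
                   [](const std::pair<long, std::size_t>& a,
                      const std::pair<long, std::size_t>& b) {
                     return a.first < b.first;
                   });

  std::vector<std::size_t> order;
  for (const auto& d : done) order.push_back(d.second);
  return order;
}
```

In this model, costs of {5, 1, 1, 1} on two workers produce the output order 1, 2, 3, 0: the slow first item leaves the merge last. In hardware the timing is not fixed in advance, which is why the order is described as non-deterministic.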

The general idea of split and merge is that, with the round_robin scheduler, data is distributed to the workers by the split and read back from the workers by the merge in a deterministic order. So if all workers compute the same function, the result is the same as with a single worker, but the performance is better.

If the workers perform different functions, then your design must ensure that the correct data item is sent to the correct function in the round-robin order of workers, starting from out[0] or in[0] respectively.
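The round-robin ordering described above can be checked with a software-only model. The sketch below is plain C++ using std::queue in place of the channel ports; it is not the HLS library, it runs the workers one after another rather than in parallel, and the name round_robin_model is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Models a round-robin split feeding num_workers identical workers,
// followed by a round-robin merge.
std::vector<int> round_robin_model(const std::vector<int>& input,
                                   std::size_t num_workers,
                                   const std::function<int(int)>& worker) {
  assert(num_workers > 0);
  std::vector<std::queue<int>> split_out(num_workers);  // models out[0..N-1]
  std::vector<std::queue<int>> merge_in(num_workers);   // models in[0..N-1]

  // Split: item k goes to port k % num_workers, starting from out[0].
  for (std::size_t k = 0; k < input.size(); ++k)
    split_out[k % num_workers].push(input[k]);

  // Workers: each one drains its own port and applies the same function.
  for (std::size_t w = 0; w < num_workers; ++w) {
    while (!split_out[w].empty()) {
      merge_in[w].push(worker(split_out[w].front()));
      split_out[w].pop();
    }
  }

  // Merge: result k is read from port k % num_workers, starting from in[0].
  std::vector<int> output;
  for (std::size_t k = 0; k < input.size(); ++k) {
    std::queue<int>& port = merge_in[k % num_workers];
    output.push_back(port.front());
    port.pop();
  }
  return output;
}
```

Because the split and the merge visit the ports in the same rotating order, the output order matches the input order for any number of workers, including a single worker.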

Specification

Specification of split/merge channels is as follows:

hls::split::load_balancing<DATATYPE, NUM_PORTS[, DEPTH, N_PORT_DEPTH]> name;
hls::split::round_robin<DATATYPE, NUM_PORTS[, DEPTH]> name;
hls::merge::load_balancing<DATATYPE, NUM_PORTS[, DEPTH]> name;
hls::merge::round_robin<DATATYPE, NUM_PORTS[, DEPTH]> name;

Where:

round_robin/load_balancing: Specifies the type of scheduler mechanism used for the channel.
DATATYPE: Specifies the data type on the channel. This has the same restrictions as standard hls::stream. The DATATYPE can be:
- Any C++ native data type
- A Vitis HLS arbitrary precision type (for example, ap_int<>, ap_ufixed<>)
- A user-defined struct containing either of the above types
NUM_PORTS: Indicates the number of write ports required for a split (1:NUM_PORTS) or read ports required for a merge (NUM_PORTS:1) operation.
DEPTH: Optional. Specifies the depth of the main buffer, located before the split or after the merge. The default depth is 2.
N_PORT_DEPTH: Optional. For round-robin, specifies the depth of the output buffers applied after the split or before the merge. The default depth is 0.
Tip: To specify the optional N_PORT_DEPTH value, you must also specify DEPTH.
name: Indicates the name of the created channel object.

The following example is taken from mixed_control_and_data_driven, available on GitHub:

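#include "hls_np_channel.h"
#include "hls_task.h" // hls::task

const int N = 16; // size of the input and output arrays
const int NP = 4; // number of workers, and of split/merge ports

// read_in, worker, and write_out are defined elsewhere in the example.
void dut(int in[N], int out[N], int n) {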
#pragma HLS dataflow
  hls::split::round_robin<int, NP> split1;
  hls::merge::round_robin<int, NP> merge1;
 
  read_in(in, n, split1.in);
 
  // Task-Channels
  hls_thread_local hls::task t[NP];
  for (int i=0; i<NP; i++) {
#pragma HLS unroll
    t[i](worker, split1.out[i], merge1.in[i]);
  }
 
  write_out(merge1.out, out, n);
}

Tip: The example above shows the workers implemented as hls::task objects. However, this is simply a feature of the example and not a requirement of split/merge channels.

Application of Split/Merge

The main use of split and merge is to instantiate multiple compute engines so that the design fully exploits the bandwidth of a DDR or HBM port. In this case, the producer is a load process that reads a burst of data from MAXI, and then passes the individual packets of data to a number of workers through the split channel. Use the round-robin protocol if the workers take similar amounts of time, or load balancing if the execution time per input is variable. The consumer performs the reverse, writing data back into DRAM.
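The choice between the two protocols can be reasoned about with an idealized model of total execution time. The sketch below is plain C++, not the HLS library; it ignores buffer depths, back-pressure from the merge, and memory latency, and it assumes item k costs cost[k] time units on any worker. The function names makespan_round_robin and makespan_load_balancing are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Round-robin: item k is statically assigned to worker k % num_workers.
// Returns the time at which the busiest worker finishes.
long makespan_round_robin(const std::vector<int>& cost, std::size_t num_workers) {
  assert(num_workers > 0);
  std::vector<long> busy(num_workers, 0);
  for (std::size_t k = 0; k < cost.size(); ++k) busy[k % num_workers] += cost[k];
  return *std::max_element(busy.begin(), busy.end());
}

// Load balancing: each item goes to the worker that becomes idle first.
// Returns the time at which the last worker finishes.
long makespan_load_balancing(const std::vector<int>& cost, std::size_t num_workers) {
  assert(num_workers > 0);
  std::vector<long> free_at(num_workers, 0);
  for (int c : cost) *std::min_element(free_at.begin(), free_at.end()) += c;
  return *std::max_element(free_at.begin(), free_at.end());
}
```

With uniform costs such as {2, 2, 2, 2} on two workers, both functions return 4. With one expensive item, as in {8, 1, 1, 1, 1, 1, 1, 1}, the round-robin model returns 11 while the load balancing model returns 8, because the second worker absorbs the short items while the first is busy.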

Tip: The write back address can be passed through the split and the merge, along with the data, in the case of load balancing.
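A sketch of why carrying the address with the data helps: if each result arrives with its own write-back location, the store process does not depend on arrival order. The code below is a plain C++ model, not the HLS library; the Packet struct and the write_back function are hypothetical names, and a std::vector stands in for DRAM.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical packet type: the payload travels together with the index
// (address offset) it must be written back to.
struct Packet {
  std::size_t addr;
  int data;
};

// Models the write-back process: results may arrive in any order, and each
// one is stored at the address it carries.
std::vector<int> write_back(const std::vector<Packet>& results, std::size_t size) {
  std::vector<int> memory(size, 0);
  for (const Packet& p : results) {
    assert(p.addr < size);
    memory[p.addr] = p.data;
  }
  return memory;
}
```

For example, packets arriving as {2, 30}, {0, 10}, {3, 40}, {1, 20} leave the modeled memory holding 10, 20, 30, 40, the same as if they had arrived in order.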

Each end of a split or merge channel is modeled as an hls::stream object. This means that a split or merge channel end can be connected to any process that takes an hls::stream as an input or an output. The process does not need to be aware of the type of channel connection. Therefore, these channels can be used both for standard dataflow and for hls::task objects.

The following example shows how a split channel can be used by a single producer and multiple consumers:

#include "hls_np_channel.h"
 
void producer(hls::stream<int> &s) {
  s.write(xxx);
}
 
void consumer1(hls::stream<int> &s) {
  ... = s.read();
}
 
void consumer2(hls::stream<int> &s) {
  ... = s.read();
}

// consumer3 and consumer4 are assumed to follow the same pattern.

void top_func() {
#pragma HLS dataflow
  hls::split::load_balancing<int, 4, 6> s; // NUM_PORTS=4, DEPTH=6
 
  producer(s.in, ...);
  consumer1(s.out[0], ...);
  consumer2(s.out[1], ...);
  consumer3(s.out[2], ...);
  consumer4(s.out[3], ...);
}