Abstract Parallel Programming Model for HLS

Vitis High-Level Synthesis User Guide (UG1399)

Document ID: UG1399
Release Date: 2023-12-18
Version: 2023.2 English

To achieve high-performance hardware, the HLS tool must infer parallelism from sequential code and exploit it. This is not an easy problem to solve. In addition, good software design often relies on well-established techniques such as runtime type information (RTTI), recursion, and dynamic memory allocation. Many of these techniques have no direct equivalent in hardware and present challenges for the HLS tool. As a result, off-the-shelf software generally cannot be converted efficiently into hardware. At a bare minimum, such software must be examined for non-synthesizable constructs and refactored to make it synthesizable. Even when a software program can be automatically converted (or synthesized) into hardware, you need to understand the best practices for writing software that executes well on the FPGA device in order to assist the tool.
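As an illustration of the kind of refactoring mentioned above, the following sketch rewrites a recursive routine as a bounded loop over a fixed-size local buffer. It is plain C++ with no HLS headers or pragmas. The function names and the bound `MAX_N` are hypothetical, and whether a particular construct is accepted for synthesis depends on the tool and its version.

```cpp
#include <cassert>
#include <cstddef>

// Recursive form: natural in software, but the call depth depends on
// the runtime value of n.
unsigned sum_recursive(const unsigned* data, std::size_t n) {
    return n == 0 ? 0u : data[n - 1] + sum_recursive(data, n - 1);
}

// Hypothetical compile-time upper bound on the input size.
constexpr std::size_t MAX_N = 64;

// Refactored form: a fixed-size local buffer replaces any need for
// heap allocation, and a loop with a constant trip count replaces the
// recursion.
unsigned sum_iterative(const unsigned* data, std::size_t n) {
    assert(n <= MAX_N);
    unsigned local[MAX_N] = {0};           // unused entries stay zero
    for (std::size_t i = 0; i < n; ++i) {
        local[i] = data[i];                // copy input into local storage
    }
    unsigned acc = 0;
    for (std::size_t i = 0; i < MAX_N; ++i) {
        acc += local[i];                   // zero padding does not change the sum
    }
    return acc;
}
```

Both functions return the same result for any input of at most `MAX_N` elements. The second form trades generality for a structure whose storage and loop bounds are known at compile time.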

The Design Principles section introduced the three main paradigms that need to be understood for writing good software for FPGA platforms: producer-consumer, streaming data, and pipelining. The underlying parallel programming model that these paradigms work on is as follows:

- The design/program is constructed as a collection of tasks that communicate by sending messages to each other through communication links (also known as channels).
- Tasks can be control-driven, waiting for some signal to start execution, or data-driven, in which case the presence of data on the channel drives the execution of the task.
- A task consists of an executable unit that has some local storage/memory and a collection of input/output (I/O) ports.
- The local memory contains private data, that is, data to which the task has exclusive access.
- Access to this private memory is called local data access, for example access to data stored in block RAM/URAM. This type of access is fast. The only way that a task can send copies of its local data to other tasks is through its output ports, and conversely, it can only receive data through its input ports.
- An I/O port is an abstraction; it corresponds to a channel that the task uses for sending or receiving data, and it is connected by the caller of the module, or at the system integration level if it is a top-level port.
- Data sent or received through a channel is called non-local data access. A channel is a data queue that connects one task's output port to another task's input port.
- A channel is assumed to be reliable and has the following behaviors:
  - For FIFOs, data written at the output port of the producer is read at the input port of the consumer in the same order. For PIPOs, data can be read and written in random order.
  - No data values are lost.
- Both blocking and non-blocking read and write semantics are supported for channels, as described in HLS Stream Library.
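The task and channel model in the list above can be sketched in plain C++ as follows. This is a software illustration only. The names `Channel`, `Producer`, `Consumer`, and `run_pipeline` are hypothetical and are not part of the Vitis HLS libraries. Each task owns private local data and touches other tasks only through a bounded FIFO channel.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical bounded FIFO channel (a software model, not hls::stream).
template <typename T>
class Channel {
public:
    explicit Channel(std::size_t depth) : depth_(depth) { assert(depth > 0); }
    bool empty() const { return q_.empty(); }
    bool full() const { return q_.size() >= depth_; }
    // Callers check full()/empty() first; order is preserved, nothing is lost.
    void write(const T& v) { assert(!full()); q_.push_back(v); }
    T read() { assert(!empty()); T v = q_.front(); q_.pop_front(); return v; }
private:
    std::size_t depth_;
    std::deque<T> q_;
};

// Producer task: private local data plus one output port.
struct Producer {
    std::vector<int> local;   // private memory, visible only to this task
    std::size_t next;         // index of the next value to send
    Channel<int>& out;        // output port, bound by the caller
    bool done() const { return next == local.size(); }
    void step() {
        if (!done() && !out.full()) out.write(local[next++]);
    }
};

// Consumer task: one input port plus private storage for received copies.
struct Consumer {
    Channel<int>& in;         // input port, bound by the caller
    std::vector<int> local;   // private memory
    void step() {
        if (!in.empty()) local.push_back(in.read() * 2);
    }
};

// The caller connects the ports to a channel and runs both tasks.
std::vector<int> run_pipeline(const std::vector<int>& data, std::size_t depth) {
    Channel<int> ch(depth);
    Producer p{data, 0, ch};
    Consumer c{ch, {}};
    while (!p.done() || !ch.empty()) {
        p.step();
        c.step();
    }
    return c.local;
}
```

Calling `run_pipeline` with the values 1, 2, 3 and any depth of at least 1 returns 2, 4, 6. The consumer never reads the producer's `local` vector directly; it only sees copies delivered through the channel.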

Figure 1. Blocking/Non-Blocking Semantics

When blocking semantics are used in the model, a read from an empty channel blocks the reading process. Similarly, a write to a full channel blocks the writing process. The resulting process/channel network exhibits deterministic behavior that does not depend on the timing of computation or on communication delays. This style of model has proven convenient for modeling embedded systems, high-performance computing systems, signal processing systems, stream processing systems, dataflow programming languages, and other computational tasks.
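The determinism property can be illustrated with a small single-threaded model, sketched below under stated assumptions. A blocked task is modeled as a task that simply does nothing when the scheduler visits it. The `Fifo` struct, the `run_network` function, and the three-stage network are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical bounded FIFO used only for this model.
struct Fifo {
    std::size_t depth;
    std::deque<int> q;
    bool empty() const { return q.empty(); }
    bool full() const { return q.size() >= depth; }
};

// Runs source -> scale -> sink with blocking semantics.
// `order` lists task ids (0 = source, 1 = scale, 2 = sink) in the
// sequence the scheduler visits them in each round. A task whose read
// or write would block is skipped until a later visit.
std::vector<int> run_network(const std::vector<int>& input,
                             const std::vector<int>& order,
                             std::size_t depth) {
    Fifo a{depth, {}};
    Fifo b{depth, {}};
    std::size_t sent = 0;
    std::vector<int> result;
    while (result.size() < input.size()) {
        bool progress = false;
        for (int task : order) {
            if (task == 0 && sent < input.size() && !a.full()) {
                a.q.push_back(input[sent++]);    // write succeeds: channel not full
                progress = true;
            } else if (task == 1 && !a.empty() && !b.full()) {
                b.q.push_back(a.q.front() * 3);  // read, compute, write
                a.q.pop_front();
                progress = true;
            } else if (task == 2 && !b.empty()) {
                result.push_back(b.q.front());   // read succeeds: channel not empty
                b.q.pop_front();
                progress = true;
            }
        }
        if (!progress) break;                    // every task is blocked: stop
    }
    return result;
}
```

For the input 1, 2, 3, 4 the function returns 3, 6, 9, 12 whether the scheduler visits the tasks in the order 0, 1, 2 or 2, 1, 0, and whether the depth is 1 or larger. The schedule changes when each value moves, not which values arrive or in what order.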

The blocking style of modeling can result in deadlocks due to insufficient sizing of the channel queue (when the channels are FIFOs) and/or due to differing production and consumption rates between producers and consumers. If non-blocking semantics are used in the model, a read from an empty channel returns uninitialized data or re-reads the last data item. Similarly, a write to a full queue can result in that data being lost. To avoid such loss of data, the design must check the status of the queue before performing the read/write. However, this makes the simulation of such models non-deterministic, because behavior then depends on decisions based on the runtime status of the channel. This makes verifying the results of the model much more challenging.
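The data-loss hazard of non-blocking writes can be sketched as follows. The `NbFifo` class is a hypothetical stand-in written for this example; its `write_nb` and `read_nb` methods are named after the non-blocking style of access but are not the HLS Stream Library implementation. The `transfer` function and its `burst` parameter are likewise assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical FIFO with non-blocking access that reports success or failure.
class NbFifo {
public:
    explicit NbFifo(std::size_t depth) : depth_(depth) { assert(depth > 0); }
    bool empty() const { return q_.empty(); }
    bool full() const { return q_.size() >= depth_; }
    // Returns false and stores nothing when the queue is full.
    bool write_nb(int v) {
        if (full()) return false;
        q_.push_back(v);
        return true;
    }
    // Returns false and leaves v untouched when the queue is empty.
    bool read_nb(int& v) {
        if (empty()) return false;
        v = q_.front();
        q_.pop_front();
        return true;
    }
private:
    std::size_t depth_;
    std::deque<int> q_;
};

// The producer attempts `burst` writes for every single consumer read.
// With check_status == false the result of write_nb is ignored, so a
// value that meets a full queue is skipped and lost. With
// check_status == true the same value is retried in a later round.
std::vector<int> transfer(const std::vector<int>& data, std::size_t depth,
                          std::size_t burst, bool check_status) {
    assert(burst > 0);
    NbFifo ch(depth);
    std::vector<int> received;
    std::size_t i = 0;
    while (i < data.size() || !ch.empty()) {
        for (std::size_t k = 0; k < burst && i < data.size(); ++k) {
            bool ok = ch.write_nb(data[i]);
            if (ok || !check_status) {
                ++i;       // unchecked mode advances even after a failed write
            } else {
                break;     // checked mode keeps the value for the next round
            }
        }
        int v = 0;
        if (ch.read_nb(v)) received.push_back(v);
    }
    return received;
}
```

With the input 1, 2, 3, 4, a depth of 1, and a burst of 2, the unchecked variant delivers only 1 and 3, while the checked variant delivers all four values in order. The checked variant's behavior now depends on the queue status observed at each step, which is the source of the non-determinism described above.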

Both blocking and non-blocking semantics are supported by the Vitis HLS abstract parallel programming model.