To achieve high-performance hardware, the HLS tool must infer parallelism from sequential code and exploit it in the generated design. This is not an easy problem to solve. In addition, good software design often relies on well-established practices such as run-time type information (RTTI), recursion, and dynamic memory allocation. Many of these techniques have no direct equivalent in hardware and present challenges for the HLS tool. This generally means that off-the-shelf software cannot be efficiently converted into hardware. At a minimum, such software needs to be examined for non-synthesizable constructs, and the code needs to be refactored to make it synthesizable. Even if a software program can be automatically converted (or synthesized) into hardware, achieving good results requires understanding the best practices for writing software for execution on the FPGA device.
The Design Principles section introduced the three main paradigms that need to be understood for writing good software for FPGA platforms: producer-consumer, streaming data, and pipelining. The parallel programming model underlying these paradigms is as follows:
- The design/program needs to be constructed as a collection of tasks that communicate by sending messages to each other through communication links (also known as channels).
- Tasks can be control-driven, waiting for some signal to start execution, or data-driven, in which the presence of data on the channel drives the execution of the task.
- A task consists of an executable unit that has some local storage/memory and a collection of input/output (I/O) ports.
- The local memory contains private data, that is, the data to which the task has exclusive access.
- Access to this private memory is called local data access, like data stored in BRAM/URAM. This type of access is fast. The only way that a task can send copies of its local data to other tasks is through its output ports, and conversely, it can only receive data through its input ports.
- An I/O port is an abstraction: it corresponds to a channel that the task uses for sending or receiving data, and it is connected by the caller of the module, or at the system integration level if it is a top-level port.
- Data sent or received through a channel is called non-local data access. A channel is a data queue that connects one task's output port to another task's input port.
- A channel is assumed to be reliable and has the following properties:
  - Data written at the output port of the producer is read at the input port of the consumer in the same order for FIFOs. Data can be read/written in random order for PIPOs.
  - No data values are lost.
- Both blocking and non-blocking read and write semantics are supported for channels, as described in HLS Stream Library.
When blocking semantics are used in the model, a read from an empty channel blocks the reading process, and a write to a full channel blocks the writing process. The resulting process/channel network exhibits deterministic behavior that does not depend on the timing of computation or on communication delays. This style of model has proven convenient for modeling embedded systems, high-performance computing systems, signal processing systems, stream processing systems, dataflow programming languages, and other computational tasks.
The blocking style of modeling can result in deadlocks due to insufficient sizing of the channel queue (when the channels are FIFOs) and/or due to differing rates of production and consumption between producers and consumers. If non-blocking semantics are used in the model, a read from an empty channel results in the reading of uninitialized data or in the re-reading of the last data item. Similarly, a write to a full queue can result in that data being lost. To avoid such loss of data, the design must first check the status of the queue before performing the read/write. However, this makes the simulation of such models non-deterministic, because it relies on decisions made based on the run-time status of the channel, which makes verifying the results of the model much more challenging.
Both blocking and non-blocking semantics are supported by the Vitis HLS abstract parallel programming model.