Pipelining Loops - 2022.1 English

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID: UG1393
Release Date: 2022-05-25
Version: 2022.1 English

Pipelining a loop allows you to overlap its iterations in time, as discussed in Loop Pipelining. Letting loop iterations operate concurrently is often a good approach: resources can be shared between iterations (lower resource utilization than unrolling), while the loop requires less execution time than one that is not pipelined.

In C/C++, pipelining is enabled with the HLS PIPELINE pragma:

#pragma HLS PIPELINE
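As a minimal sketch of where the pragma is placed, the following C++ function pipelines a simple element-wise loop. The function name `vadd`, the size `N`, and the `II=1` target are illustrative assumptions and are not taken from the example discussed in this section. A standard C++ compiler ignores the HLS pragma, so the code also runs unchanged in software.

```cpp
#include <cstddef>

// Illustrative size (assumption; not taken from the original example).
constexpr std::size_t N = 256;

// Hypothetical kernel-style function: out[i] = a[i] + b[i].
void vadd(const int a[N], const int b[N], int out[N]) {
    for (std::size_t i = 0; i < N; ++i) {
// Placed inside the loop body, the pragma asks HLS to pipeline this loop.
// II=1 requests that a new iteration start every clock cycle; the tool
// reports the II it actually achieves.
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];
    }
}
```

The pragma applies to the loop whose body contains it, which is why it appears as the first statement inside the `for` body rather than before the loop.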

In OpenCL kernels, the equivalent is the xcl_pipeline_loop attribute:

__attribute__((xcl_pipeline_loop))

Note: The OpenCL API has an additional method of specifying loop pipelining. Because work-item loops are not explicitly stated in the kernel code, pipelining them requires this attribute:

__attribute__((xcl_pipeline_workitems))

In this example, the Schedule Viewer in the HLS Project produces the following information:

Figure 1. Pipelining Loops in Schedule Viewer

The overall performance estimates are:

Figure 2. Performance Estimates

Because each iteration of the loop takes only two cycles of latency, only a single additional iteration can be in flight at any time. This cuts the total latency roughly in half compared to the original, to 257 cycles. Moreover, this reduction in latency is achieved with fewer resources than unrolling requires.
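The 257-cycle figure is consistent with the usual latency estimate for a pipelined loop, (trip count - 1) x II + iteration latency. The sketch below works through that arithmetic. The trip count of 256 is an assumption inferred from the reported numbers (it is not stated in this section), and the helper names `pipelined_latency` and `sequential_latency` are hypothetical. Tool reports can also include a few extra cycles for loop entry and exit, so treat this as an estimate.

```cpp
// First iteration takes `iteration_latency` cycles; each of the remaining
// (trip_count - 1) iterations completes `ii` cycles after the previous one.
constexpr unsigned pipelined_latency(unsigned trip_count, unsigned ii,
                                     unsigned iteration_latency) {
    return trip_count == 0 ? 0 : (trip_count - 1) * ii + iteration_latency;
}

// Without pipelining, iterations run back to back.
constexpr unsigned sequential_latency(unsigned trip_count,
                                      unsigned iteration_latency) {
    return trip_count * iteration_latency;
}

// Assumed trip count of 256, II of 1, and 2 cycles per iteration.
static_assert(pipelined_latency(256, 1, 2) == 257, "pipelined estimate");
static_assert(sequential_latency(256, 2) == 512, "non-pipelined estimate");
```

Under these assumptions the non-pipelined loop takes 512 cycles, so 257 cycles is roughly half.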

In most cases, loop pipelining by itself can improve overall performance. However, its effectiveness depends on the structure of the loop. Some common limitations are:

Resources with limited availability, such as memory ports or process channels, can limit how much iterations overlap, which increases the initiation interval (II).
Loop-carried dependencies, such as a condition or value computed in one iteration that affects the next, might increase the II of the pipeline.
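The second limitation can be illustrated with an accumulation loop. This is a sketch under stated assumptions: the add is assumed to take several cycles (typical for floating-point addition, though the exact latency depends on the device and clock), and the names `sum_serial`, `sum_interleaved`, `N`, and `LANES` are hypothetical. Interleaving partial sums is one common way to lengthen the dependency distance. Whether the tool then reaches the requested II should be confirmed in the synthesis report, and the `type=complete` spelling of the partition pragma may differ between tool versions.

```cpp
#include <cstddef>

constexpr std::size_t N = 256;    // illustrative array size (assumption)
constexpr std::size_t LANES = 4;  // assumed number of partial sums

// Each iteration needs `acc` from the previous one. If the add takes
// several cycles, the next iteration has to wait for it, so the
// achievable II is likely greater than 1.
float sum_serial(const float in[N]) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        acc += in[i];
    }
    return acc;
}

// Consecutive iterations update different accumulators, so a given
// accumulator is only reused every LANES iterations.
float sum_interleaved(const float in[N]) {
    float partial[LANES];
#pragma HLS ARRAY_PARTITION variable=partial type=complete
    for (std::size_t l = 0; l < LANES; ++l) {
        partial[l] = 0.0f;
    }
    for (std::size_t i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        partial[i % LANES] += in[i];
    }
    // Combine the partial sums after the main loop.
    float acc = 0.0f;
    for (std::size_t l = 0; l < LANES; ++l) {
        acc += partial[l];
    }
    return acc;
}
```

Reordering floating-point additions can change rounding, so the two functions are only guaranteed to agree exactly when every intermediate sum is exactly representable.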

These limitations are reported by the tool during high-level synthesis and can be examined in the Schedule Viewer. For the best possible performance, you might have to modify the code to remove the limiting factors, or instruct the tool to eliminate a dependency, either by restructuring the memory implementation of an array or by breaking the dependency altogether.
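As a sketch of restructuring the memory implementation of an array, the loop below reads three elements of the same buffer per iteration. A block RAM offers at most two ports, so without further guidance the three reads cannot all be issued in one cycle. The function name `window_sum`, the size `N`, and the choice of a cyclic partition with factor 3 are illustrative assumptions, and the exact pragma spelling may differ between tool versions.

```cpp
#include <cstddef>

constexpr std::size_t N = 256;  // illustrative array size (assumption)

// out[i] = buf[i] + buf[i + 1] + buf[i + 2]
void window_sum(const int in[N], int out[N - 2]) {
    int buf[N];
// Assumed fix: split buf cyclically into 3 banks so that indices i, i + 1,
// and i + 2 always fall in different banks and can be read in the same cycle.
#pragma HLS ARRAY_PARTITION variable=buf type=cyclic factor=3

    // Copy the input into the local buffer.
    for (std::size_t i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        buf[i] = in[i];
    }
    // Three reads of buf per iteration.
    for (std::size_t i = 0; i < N - 2; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = buf[i] + buf[i + 1] + buf[i + 2];
    }
}
```

Partitioning trades additional memory resources for port bandwidth, so the factor is usually chosen to match the number of accesses per iteration rather than made as large as possible.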