Pipelining loops allow you to overlap iterations of a loop in time, as discussed in Loop Pipelining. Allowing loop iterations to operate concurrently is often a good approach, as resources can be shared between iterations (less resource utilization), while requiring less execution time compared to loops that are not unrolled.
Pipelining is enabled in C/C++ through the pragma HLS pipeline :
#pragma HLS PIPELINE
While the OpenCL API uses the xcl_pipeline_loop attribute:
In this example, the Schedule Viewer in the HLS Project produces the following information:
With the overall estimates being:
Because each iteration of a loop consumes only two cycles of latency, there can only be a single iteration overlap. This enables the total latency to be cut into half compared to the original, resulting in 257 cycles of total latency. However, this reduction in latency was achieved using fewer resources when compared to unrolling.
In most cases, loop pipelining by itself can improve overall performance. Yet, the effectiveness of the pipelining depends on the structure of the loop. Some common limitations are:
- Resources with limited availability such as memory ports or process channels can limit the overlap of the iterations (Initiation Interval).
- Loop-carry dependencies, such as those created by variable conditions computed in one iteration affecting the next, might increase the II of the pipeline.
These are reported by the tool during high-level synthesis and can be observed and examined in the Schedule Viewer. For the best possible performance, the code might have to be modified to remove these limiting factors, or the tool needs to be instructed to eliminate some dependency by restructuring the memory implementation of an array, or breaking the dependencies all together.