Unrolling a loop lets the tool exploit the full parallelism available in the design. Mark the loop to be unrolled, and the tool creates the implementation with the most parallelism possible. An OpenCL loop can be marked with the unroll attribute:
__attribute__((opencl_unroll_hint))
Or a C/C++ loop can use the unroll pragma:
#pragma HLS UNROLL
For more information, see Loop Unrolling.
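As a minimal sketch of where the pragma goes, assume the example is an element-wise addition into an array named out. The function name vadd_unrolled, the inputs in1 and in2, and the size N are hypothetical and chosen only for illustration; the original example's code is not shown here.

```cpp
// Hypothetical element count; the size used in the original example is not given.
constexpr int N = 16;

// Sketch: element-wise add with the loop marked for full unrolling.
// A standard C++ compiler ignores the HLS pragma (possibly with a warning),
// so the function can also be run as ordinary software.
void vadd_unrolled(const int in1[N], const int in2[N], int out[N]) {
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL
        out[i] = in1[i] + in2[i];  // each iteration is independent of the others
    }
}
```

The pragma is placed inside the loop body it applies to. Because no iteration depends on another, the tool is free to instantiate one adder per iteration.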
When unrolling is applied to this example, the Schedule Viewer in the HLS project shows the following:
The following figure shows the estimated performance:
As a result, the total latency improves considerably, to 127 cycles, and, as expected, the computational hardware grows to 4845 LUTs in order to perform the same computation in parallel.
However, if you analyze the for-loop, you might ask why this algorithm cannot be implemented in a single cycle, since each addition is completely independent of the previous loop iteration. The reason is the memory interface used for the variable out. The Vitis core development kit uses dual-port memory by default for an array, which means that at most two values can be written to the memory per cycle. Thus, to see a fully parallel implementation, you must specify that the variable out should be kept in registers, as in this example:
#pragma HLS array_partition variable=out complete dim=0
For more information, see pragma HLS array_partition.
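The effect of the write-port limit, and the placement of the partition pragma, can be sketched as follows. The helper min_write_cycles, the function vadd_partitioned, the inputs in1 and in2, and the size N are all hypothetical names introduced for illustration; the helper only restates the two-writes-per-cycle bound as arithmetic and is not a tool estimate.

```cpp
// Hypothetical element count; the size used in the original example is not given.
constexpr int N = 16;

// Lower bound on the cycles needed to write n values through a memory
// that accepts `ports` writes per cycle (2 for a dual-port memory).
constexpr int min_write_cycles(int n, int ports) {
    return (n + ports - 1) / ports;  // ceiling division
}

// Sketch: the same element-wise add, with `out` completely partitioned so
// that every element can be written in the same cycle.
// Assumption: the inputs may need similar treatment in a real design;
// the text above only discusses `out`.
void vadd_partitioned(const int in1[N], const int in2[N], int out[N]) {
#pragma HLS array_partition variable=out complete dim=0
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL
        out[i] = in1[i] + in2[i];
    }
}
```

With a dual-port memory, min_write_cycles(N, 2) cycles are needed just to store the results, no matter how many adders run in parallel. Complete partitioning removes that bound by giving each element its own register.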
The results of this transformation can be observed in the following Schedule Viewer:
The associated estimates are:
Accordingly, this code can be implemented as a combinatorial function that requires only a fraction of a clock cycle to complete.