Description
Some variables are accessed by more instructions than their hardware implementation can sustain in a single cycle, preventing some loops from being accelerated. Partition these variables to accelerate your design.
Explanation
The parallelism in the loop is limited by the number of memory ports available. Higher performance can be reached if more memory ports are made available.
When dependencies allow it, memory accesses are tentatively scheduled in parallel (at the same clock cycle). However, in order for all the memory accesses to execute at the same time, the memory they access must have at least one port available for each access.
The number of ports of a memory can be increased indirectly through memory partitioning, typically with the array_partition pragma.
In some cases, the bind_storage pragma can also be used to control the number of ports available on the memory that stores a particular variable.
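As a rough illustration of the bind_storage approach, the sketch below requests a true dual-port BRAM for a local array so that two writes per iteration can each use their own port. The function name store_pairs, the array size, and the values written are hypothetical, and the type=ram_t2p impl=bram options are assumed to be supported by the tool version and target device in use. A standard C++ compiler ignores the pragma.

```cpp
// Hypothetical kernel, for illustration only.
void store_pairs(int out[16]) {
    int A[16];
    // Assumption: a true dual-port BRAM is available on the target device.
#pragma HLS bind_storage variable=A type=ram_t2p impl=bram
    // Two writes per iteration: one per memory port.
    for (int i = 0; i < 16; i += 2) {
        A[i] = i;
        A[i + 1] = i + 1;
    }
    // Copy the local array to the output argument.
    for (int i = 0; i < 16; ++i) {
        out[i] = A[i];
    }
}
```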
In the following example, all writes to A can be performed in parallel only if at least four ports are available for this variable. The default implementations of BRAM-backed memory tend to have only one or two ports, which prevents all the memory accesses from executing in parallel. Here, cyclic partitioning of the array A with a factor of 2 splits it into two banks. With two ports per bank, each bank serves two of the four writes, which resolves the contention and accelerates the loop.
for (int i = 0; i < 16; i += 4) {
    A[i] = ...
    A[i + 1] = ...
    A[i + 2] = ...
    A[i + 3] = ...
}
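A minimal sketch of the loop with the partitioning pragma applied might look like the following. The function name fill_array, the local declaration of A, and the values written are hypothetical, and the pragma uses the type=cyclic option form, which is assumed to match the tool version in use. With cyclic partitioning and a factor of 2, even indices land in one bank and odd indices in the other, so each bank receives two writes per iteration.

```cpp
// Hypothetical kernel, for illustration only.
void fill_array(int out[16]) {
    int A[16];
    // Cyclic, factor 2: bank 0 holds even indices, bank 1 holds odd indices.
#pragma HLS array_partition variable=A type=cyclic factor=2 dim=1
    for (int i = 0; i < 16; i += 4) {
        A[i]     = 10 * i;        // bank 0
        A[i + 1] = 10 * (i + 1);  // bank 1
        A[i + 2] = 10 * (i + 2);  // bank 0
        A[i + 3] = 10 * (i + 3);  // bank 1
    }
    // Copy the local array to the output argument.
    for (int i = 0; i < 16; ++i) {
        out[i] = A[i];
    }
}
```

Block partitioning with the same factor would place A[i] through A[i + 3] in the same bank for every iteration of this loop, so it would not relieve the port contention here.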
Recommendation
Partition the variables to accelerate your design.