Array Accesses and Performance - 2022.2 English

Vitis High-Level Synthesis User Guide (UG1399)

In a previous section, we introduced optimization concepts such as loop unrolling and pipelining as a means for exploring parallelism. However, this was done without considering how array access patterns may prevent such optimizations when the arrays are mapped to memories instead of registers. Arrays mapped to memories can become the bottleneck in a design’s performance.

Vitis HLS provides a number of optimizations, such as array reshaping and array partitioning, that can remove these memory bottlenecks. Whenever possible, these automatic memory optimizations should be used, minimizing the number of code modifications. However, there may be situations where explicitly coding the memory architecture is either required to meet performance or may allow designers to achieve an even better quality of results. In these cases, it is essential that array accesses are coded in such a way as to not limit performance. This means analyzing array access patterns and organizing the memories in a design so that the desired throughput and area can be achieved.

The following code example shows a case in which access to an array can limit performance in the final RTL design. In this example, there are three accesses to the array mem[N] to create a summed result. Refer to Vitis-HLS-Introductory-Examples/Interface/Memory/memory_bottleneck for the full version of this example.

#include "array_mem_bottleneck.h"
dout_t array_mem_bottleneck(din_t mem[N]) {
  dout_t sum=0;
  int i;
  /* Three reads of mem in every iteration */
  SUM_LOOP:for (i = 2; i < N; i++) {
    sum += mem[i] + mem[i-1] + mem[i-2];
  }
  return sum;
}

During synthesis, the array is implemented as a RAM. If the RAM is specified as a single-port RAM, it is impossible to pipeline loop SUM_LOOP to process a new loop iteration every clock cycle.

Trying to pipeline SUM_LOOP with an initiation interval of 1 results in the following message (after failing to achieve a throughput of 1, Vitis HLS relaxes the constraint):

INFO: [SCHED 61] Pipelining loop 'SUM_LOOP'.
WARNING: [SCHED 69] Unable to schedule 'load' operation ('mem_load_2', 
bottleneck.c:62) on array 'mem' due to limited memory ports.
INFO: [SCHED 61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.

The issue here is that the single-port RAM has only a single data port: only one read (or one write) can be performed in each clock cycle.

  • SUM_LOOP Cycle1: read mem[i];
  • SUM_LOOP Cycle2: read mem[i-1], sum values;
  • SUM_LOOP Cycle3: read mem[i-2], sum values;

A dual-port RAM could be used, but this allows only two accesses per clock cycle. Three reads are required to calculate the value of sum, so three accesses per clock cycle are required to pipeline the loop with a new iteration every clock cycle.
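The port-count argument above can be written as a small calculation. The following is a minimal sketch, not part of Vitis HLS: the helper name min_ii_for_memory is hypothetical, and it computes only the lower bound on the initiation interval implied by the number of memory ports, assuming each port serves one access per clock cycle. The II reported by the scheduler can be higher because of other dependencies.

```c
#include <assert.h>

/* Hypothetical helper: lower bound on the initiation interval (II)
   imposed by one memory. Each port serves one access per clock cycle,
   so `accesses` accesses per iteration need at least
   ceil(accesses / ports) cycles between iteration starts. */
int min_ii_for_memory(int accesses, int ports) {
    assert(accesses >= 0 && ports > 0);
    int ii = (accesses + ports - 1) / ports;  /* integer ceiling */
    return ii < 1 ? 1 : ii;                   /* II is never below 1 */
}
```

For the loop above, min_ii_for_memory(3, 1) gives 3 and min_ii_for_memory(3, 2) gives 2. Only three ports, or a single access per iteration, bring the bound down to 1.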

The code in the example above can be rewritten as shown in the following code example to allow the loop to be pipelined with a throughput of 1. By performing pre-reads and manually pipelining the data accesses, only one array read is specified in each iteration of the loop. This ensures that only a single-port RAM is required to achieve the performance.

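#include "array_mem_perform.h"
dout_t array_mem_perform(din_t mem[N]) {
  din_t tmp0, tmp1, tmp2;
  dout_t sum=0;
  int i;
  /* Pre-read the first two elements before entering the loop */
  tmp0 = mem[0];
  tmp1 = mem[1];
  SUM_LOOP:for (i = 2; i < N; i++) {
    tmp2 = mem[i];               /* the only array read in the loop */
    sum += tmp2 + tmp1 + tmp0;
    tmp0 = tmp1;                 /* shift the window for the next iteration */
    tmp1 = tmp2;
  }
  return sum;
}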
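Before relying on a rewrite like this, it is worth confirming in C simulation that both forms compute the same value. The following is a minimal sketch of such a check. The definitions of N, din_t and dout_t are hypothetical stand-ins for the contents of the example headers, which are not shown here, and the function names are chosen only for this illustration.

```c
/* Hypothetical stand-ins for the definitions in the example headers. */
#define N 8
typedef int din_t;
typedef int dout_t;

/* Original form: three reads of mem per iteration. */
dout_t sum_three_reads(const din_t mem[N]) {
    dout_t sum = 0;
    for (int i = 2; i < N; i++)
        sum += mem[i] + mem[i - 1] + mem[i - 2];
    return sum;
}

/* Rewritten form: one read of mem per iteration; the two previous
   values are carried in local variables. */
dout_t sum_one_read(const din_t mem[N]) {
    din_t tmp0 = mem[0], tmp1 = mem[1], tmp2;
    dout_t sum = 0;
    for (int i = 2; i < N; i++) {
        tmp2 = mem[i];
        sum += tmp2 + tmp1 + tmp0;
        tmp0 = tmp1;
        tmp1 = tmp2;
    }
    return sum;
}

/* Returns 1 when both forms agree on a simple ramp input. */
int forms_agree(void) {
    din_t mem[N];
    for (int i = 0; i < N; i++)
        mem[i] = i * 3 + 1;
    return sum_three_reads(mem) == sum_one_read(mem);
}
```

A test bench could call forms_agree() and fail when it returns 0. A single ramp input is only a smoke test, so a real test bench would normally use more varied data.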

Such changes to the source code as shown above are not always required. The more typical case is to use optimization directives/pragmas to achieve the same result. Vitis HLS includes optimization directives for changing how arrays are implemented and accessed. There are two main classes of optimization:

  • Array Partition splits apart the original array into smaller arrays or into individual registers.
  • Array Reshape reorganizes the array into a different memory arrangement to increase parallelism but without splitting apart the original array.
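As an illustration of the directive-based approach, the following sketch applies array partitioning to the original three-read loop instead of rewriting it. The typedefs, the value of N, the function name array_mem_partitioned, and the choice of cyclic partitioning with a factor of 3 are assumptions made for this example, not settings taken from the guide. With a cyclic factor of 3, mem[i], mem[i-1] and mem[i-2] always fall in three different memories, which may allow the tool to schedule all three reads in the same cycle. Because mem is a function argument here, partitioning it would also be expected to change the memory ports on the interface.

```c
/* Hypothetical stand-ins for the definitions in the example header. */
#define N 8
typedef int din_t;
typedef int dout_t;

dout_t array_mem_partitioned(din_t mem[N]) {
    /* Assumed setting: split mem cyclically into 3 memories so that
       three consecutive indices never share a memory. */
#pragma HLS ARRAY_PARTITION variable=mem type=cyclic factor=3 dim=1
    dout_t sum = 0;
    int i;
    SUM_LOOP: for (i = 2; i < N; i++) {
        /* Request a new iteration every clock cycle. */
#pragma HLS PIPELINE II=1
        sum += mem[i] + mem[i - 1] + mem[i - 2];
    }
    return sum;
}
```

A standard C compiler ignores the HLS pragmas, usually with a warning, so the function can still be compiled and checked in software. Whether the target II is met should be confirmed in the synthesis report.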