Automatic Port Width Resizing

Automatic Port Width Resizing - 2022.2 English

Vitis High-Level Synthesis User Guide (UG1399)

Document ID

UG1399

Release Date

2022-12-07

Version

2022.2 English

In the Vitis tool flow Vitis HLS provides the ability to automatically re-size m_axi interface ports to 512-bits to improve burst access. However, automatic port width resizing only supports standard C data types and does not support aggregate types such as ap_int, ap_uint, struct, or array.

Important: Structs on the interface prevent automatic widening of the port. You must break the struct into individual elements to enable this feature.

Vitis HLS controls automatic port width resizing using the following two commands:

config_interface -m_axi_max_widen_bitwidth <N>: Directs the tool to automatically widen bursts on M-AXI interfaces up to the specified bitwidth. The value of <N> must be a power-of-two between 0 and 1024.
config_interface -m_axi_alignment_byte_size <N>: Note that burst widening also requires strong alignment properties. Assume pointers that are mapped to m_axi interfaces are at least aligned to the provided width in bytes (power of two). This can help automatic burst widening.

In the Vitis Kernel flow automatic port width resizing is enabled by default with the following:

config_interface -m_axi_max_widen_bitwidth 512
config_interface -m_axi_alignment_byte_size 64

In the Vivado IP flow this feature is disabled by default:

config_interface -m_axi_max_widen_bitwidth 0
config_interface -m_axi_alignment_byte_size 0

Automatic port width resizing will only re-size the port if a burst access can be seen by the tool. Therefore all the preconditions needed for bursting, as described in AXI Burst Transfers, are also needed for port resizing. These conditions include:

Must be a monotonically increasing order of access (both in terms of the memory location being accessed as well as in time). You cannot access a memory location that is in between two previously accessed memory locations- aka no overlap.
The access pattern from the global memory should be in sequential order, and with the following additional requirements:
- The sequential accesses need to be on a non-vector type
- The start of the sequential accesses needs to be aligned to the widen word size
- The length of the sequential accesses needs to be divisible by the widen factor

The following code example is used in the calculations that follow:

vadd_pipeline:
  for (int i = 0; i < iterations; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_len/c_n max = c_len/c_n

  // Pipelining loops that access only one variable is the ideal way to
  // increase the global memory bandwidth.
  read_a:
    for (int x = 0; x < N; ++x) {
#pragma HLS LOOP_TRIPCOUNT min = c_n max = c_n
#pragma HLS PIPELINE II = 1
      result[x] = a[i * N + x];
    }

  read_b:
    for (int x = 0; x < N; ++x) {
#pragma HLS LOOP_TRIPCOUNT min = c_n max = c_n
#pragma HLS PIPELINE II = 1
      result[x] += b[i * N + x];
    }

  write_c:
    for (int x = 0; x < N; ++x) {
#pragma HLS LOOP_TRIPCOUNT min = c_n max = c_n
#pragma HLS PIPELINE II = 1
      c[i * N + x] = result[x];
    }
  }
}
}

The width of the automatic optimization for the code above is performed in three steps:

The tool checks for the number of access patterns in the read_a loop. There is one access during one loop iteration, so the optimization determines the interface bit-width as 32= 32 *1 (bitwidth of the int variable * accesses).
The tool tries to reach the default max specified by the config_interface -m_axi_max_widen_bitwidth 512, using the following expression terms:
```
length = (ceil((loop-bound of index inner loops) * 
(loop-bound of index - outer loops)) * #(of access-patterns))
```
- In the above code, the outer loop is an imperfect loop so there will not be burst transfers on the outer-loop. Therefore the length will only include the inner-loop. Therefore the formula will be shortened to:
```
length = (ceil((loop-bound of index inner loops)) * #(of access-patterns))
```
  or: length = ceil(128) *32 = 4096
Is the calculated length a power of 2? If Yes, then the length will be capped to the width specified by -m_axi_max_widen_bitwidth.

There are some pros and cons to using the automatic port width resizing which you should consider when using this feature. This feature improves the read latency from the DDR as the tool is reading a big vector, instead of the data type size. It also adds more resources as it needs to buffer the huge vector and shift the data accordingly to the data path size.