Using Manual Burst - 2023.2 English

Burst transfers improve the I/O throughput of the kernel by reading or writing large chunks of data to the global memory. The larger the burst, the higher the throughput, which is calculated as (# of bytes transferred) * (kernel frequency) / (time). The maximum kernel interface bitwidth is 512 bits, so a kernel compiled at 300 MHz can theoretically achieve (80-95% efficiency of the DDR) * (512 bits * 300 MHz) / 1 sec = ~17-19 GBps for a DDR. As explained, Vitis HLS performs automatic burst optimizations, which intelligently aggregate the memory accesses of the loops and functions in the user code and perform a read or write of a particular size in a single burst request. However, burst transfer also has requirements that can sometimes be burdensome or difficult to meet, as discussed in Preconditions and Limitations of Burst Transfer.
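
The arithmetic behind the throughput estimate can be sketched as follows. This is only an illustration: the function names are hypothetical, and the efficiency factor is an assumed input rather than a measured value. A 512-bit interface moves 64 bytes per cycle, so at 300 MHz the raw peak is 19.2 GB/s; scaling by 80-95% gives roughly 15.4-18.2 GB/s, in the neighborhood of the rounded figure quoted above.

```cpp
#include <cstdint>

// Peak bytes per second for an interface that moves `bus_bits` bits per clock cycle.
constexpr double peak_bytes_per_second(std::uint32_t bus_bits, double clock_hz) {
    return (bus_bits / 8.0) * clock_hz;
}

// Scale the peak by an assumed memory efficiency in the range (0, 1].
constexpr double effective_bytes_per_second(double peak, double efficiency) {
    return peak * efficiency;
}
```

For example, `peak_bytes_per_second(512, 300e6)` evaluates to 19.2e9 bytes per second before any efficiency factor is applied.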

In some cases where automatic burst access has failed, an efficient solution is to rewrite the code or use manual burst. In such cases, if you are familiar with the AXI4 m_axi protocol and understand hardware transaction modeling, you can implement manual burst transfers using the hls::burst_maxi class as described below. Refer to Vitis-HLS-Introductory-Examples/Interface/Memory/manual_burst on GitHub for examples of these concepts. Another solution might be to use cache memory in the AXI4 interface using the CACHE pragma or directive.

hls::burst_maxi Class

The hls::burst_maxi class provides a mechanism to perform read/write access to the DDR memory. Its methods translate each call into the corresponding AXI4 protocol transactions, sending and receiving requests on the AXI4 bus signals (AW, AR, WDATA, BVALID, RDATA). These methods control the burst behavior of the HLS scheduler. The adapter, which receives the commands from the scheduler, is responsible for sending the data to the DDR memory. These requests adhere to the user-specified INTERFACE pragma options, such as max_read_burst_length and max_write_burst_length. The class methods should only be used in the kernel code, and not in the test bench (except for the class constructor, as described below).

Constructors:
- ```
burst_maxi(const burst_maxi &B) : Ptr(B.Ptr) {}
```
- ```
burst_maxi(T *p) : Ptr(p) {}
```
  Important: The HLS design and test bench must be in different files, because the constructor burst_maxi(T *p) is only available in C-simulation model.
Read Methods:
- ```
void read_request(size_t offset, size_t len);
```
  This method is used to perform a read request to the m_axi adapter. The function returns immediately if the read request queue inside m_axi adapter is not full, otherwise it waits until space becomes available.
  - offset: Specify the memory offset from which to read the data
  - len: Specify the scheduler burst length. This burst length is sent to the adapter, which can then convert it to the standard AXI AMBA protocol
- ```
T read();
```
  This method is used to transfer the data from the m_axi adapter to the scheduler FIFO. If the data is not available, read() will be blocking. The read() method should be called len number of times, as specified in the read_request().
Write Methods:
- ```
void write_request(size_t offset, size_t len);
```
  This method is used to perform a write request to the m_axi adapter. The function returns immediately if the write request queue inside m_axi adapter is not full.
  - offset: Specify the memory offset into which the data should be written
  - len: Specify the scheduler burst length. This burst length is sent to the adapter, which can then convert it to the standard AXI AMBA protocol
- ```
void write(const T &val, ap_int<sizeof(T)> byteenable_mask = -1); 
```
  This method is used to transfer data from the internal buffer of the scheduler to the m_axi adapter. It blocks if the internal write buffer is full. The byteenable_mask is used to enable the bytes in the WDATA. By default it will enable all the bytes of the transfer. The write() method should be called len number of times, as specified in the write_request().
- ```
void write_response();
```
  This method blocks until all write responses are back from the global memory. This method should be called the same number of times as write_request().
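
The request/data pairing described by these methods can be sketched in plain C++ with a small software stand-in. `MockBurstMaxi` is a hypothetical class written for illustration; it is not the real hls::burst_maxi from hls_burst_maxi.h, it does not model blocking, FIFO depths, or the byte-enable argument, and it assumes that offsets are element indices, as the comments in the examples on this page suggest.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical software stand-in; NOT the real hls::burst_maxi class.
template <typename T>
class MockBurstMaxi {
public:
    explicit MockBurstMaxi(T *p) : ptr_(p) {}

    // Queue `len` elements starting at element index `offset`.
    void read_request(std::size_t offset, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i) pending_reads_.push_back(offset + i);
    }

    // Return the next requested element; a request must have been issued first.
    T read() {
        assert(!pending_reads_.empty() && "read() without a matching read_request()");
        const std::size_t idx = pending_reads_.front();
        pending_reads_.pop_front();
        return ptr_[idx];
    }

    // Queue `len` write slots and expect one response for this request.
    void write_request(std::size_t offset, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i) pending_writes_.push_back(offset + i);
        ++outstanding_responses_;
    }

    // Store `val` into the next requested write slot.
    void write(const T &val) {
        assert(!pending_writes_.empty() && "write() without a matching write_request()");
        ptr_[pending_writes_.front()] = val;
        pending_writes_.pop_front();
    }

    // Consume one response; call once per write_request().
    void write_response() {
        assert(outstanding_responses_ > 0 && "write_response() without write_request()");
        --outstanding_responses_;
    }

    // True when every request has been fully consumed and acknowledged.
    bool idle() const {
        return pending_reads_.empty() && pending_writes_.empty() &&
               outstanding_responses_ == 0;
    }

private:
    T *ptr_;
    std::deque<std::size_t> pending_reads_;
    std::deque<std::size_t> pending_writes_;
    std::size_t outstanding_responses_ = 0;
};
```

In this model, calling read() exactly len times per read_request(), write() exactly len times per write_request(), and write_response() once per write_request() leaves the object idle, which mirrors the call-count rules stated above.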

Using Manual Burst in HLS Design

In the HLS design, when you find that automatic burst transfers are not working as desired, and you cannot optimize the design as needed, you can implement the read and write transactions using the hls::burst_maxi object. In this case, you will need to modify your code to replace the original pointer argument with burst_maxi as a function argument. These arguments must be accessed by the explicit read and write methods of the burst_maxi class, as shown in the following examples.

The following shows an original code sample, which uses a pointer argument to read data from global memory.

void dut(int *A) {
  for (int i = 0; i < 64; i++) {
  #pragma HLS pipeline II=1
      ... = A[i];
  }
}

In the modified code below, the pointer is replaced with the hls::burst_maxi<> class objects and methods. In the example, the HLS scheduler puts 4 requests of len 16 from port A to the m_axi adapter. The Adapter stores them inside a FIFO and whenever the AW/AR bus is available it will send the request to the global memory. In the 64 loop iterations, the read() command issues a blocking call that will wait for the data to come back from the global memory. After the data becomes available the HLS scheduler will read it from the m_axi adapter FIFO.

#include "hls_burst_maxi.h"
void dut(hls::burst_maxi<int> A) {
  // Issue 4 burst requests
  A.read_request(0, 16); // request 16 elements, starting from A[0]
  A.read_request(128, 16); // request 16 elements, starting from A[128]
  A.read_request(256, 16); // request 16 elements, starting from A[256]
  A.read_request(384, 16); // request 16 elements, starting from A[384]
  for (int i = 0; i < 64; i++) {
  #pragma HLS pipeline II=1
      ... = A.read(); // Read the requested data
  }
}

In the second example below, the HLS scheduler/kernel puts 2 write requests from port A to the adapter: the first request of len 2, and the second request of len 1. It then issues 3 corresponding write commands, because the total burst length is 3. The adapter stores these requests inside a FIFO, and whenever the AW and W buses are available it sends the request and data to the global memory. Finally, two write_response commands are used to await the responses for the two write_requests.

void trf(hls::burst_maxi<int> A) {
  A.write_request(0, 2);
  A.write(x); // write A[0]
  A.write_request(10, 1);
  A.write(x, 2); // write A[1] with byte enable 0010
  A.write(x); // write A[10]
  A.write_response(); // response of write_request(0, 2)
  A.write_response(); // response of write_request(10, 1)
}
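
The effect of a byte-enable mask such as the 0010 used above can be sketched in software. This is a hedged illustration: `apply_byte_enable` is a hypothetical helper, and it assumes the usual AXI write-strobe convention in which bit n of the mask enables byte lane n (data bits 8n+7 down to 8n) of a 32-bit word.

```cpp
#include <cstdint>

// Merge `incoming` into `stored`, updating only the bytes whose enable bit is set.
// Assumption: bit n of `byte_enable` controls byte lane n (bits 8n+7 .. 8n).
std::uint32_t apply_byte_enable(std::uint32_t stored, std::uint32_t incoming,
                                std::uint8_t byte_enable) {
    std::uint32_t result = stored;
    for (int lane = 0; lane < 4; ++lane) {
        if (byte_enable & (1u << lane)) {
            const std::uint32_t lane_mask = 0xFFu << (8 * lane);
            result = (result & ~lane_mask) | (incoming & lane_mask);
        }
    }
    return result;
}
```

Under that assumption, a mask of 0010 replaces only the second-lowest byte of the stored word, and a mask with all four bits set replaces the whole word.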

Using Manual Burst in C-Simulation

You can pass a regular array to the top function, and the array will be transformed to hls::burst_maxi automatically by the constructor.

Important: The HLS design and test bench must be in different files, because the burst_maxi(T *p) constructor is only valid for use in C simulation model.

#include "hls_burst_maxi.h"
void dut(hls::burst_maxi<int> A);
 
int main() {
  int Array[1000];
  dut(Array);
  ......
}

Using Manual Burst to Optimize Performance

Vitis HLS characterizes two types of burst behaviors: pipeline burst, and sequential burst.

Pipeline Burst

Pipeline burst improves throughput by reading or writing the maximum amount of data in a single request. The compiler infers pipeline burst if the read_request, write_request, and write_response calls are outside the loop, as shown in the following code example. In this example, size is a variable that is passed in from the test bench.

int buf[8192];
in.read_request(0, size);   // one read request, issued outside the loop
for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II=1
    buf[i] = in.read();
}
out.write_request(0, size * NT);   // one write request, issued outside the loops
for (int i = 0; i < NT; i++) {
    for (int j = 0; j < size; j++) {
#pragma HLS PIPELINE II=1
        int a = buf[j];
        out.write(a);
    }
}
out.write_response();   // one response for the single write request

Figure 1. Synthesis Results

As you can see from the preceding figure, the tool has inferred the burst from the user code, and the burst length is reported as variable because it is not known at compile time.

Figure 2. Performance Benefits

At run time, the kernel sends a burst request of length = size, and the adapter partitions it into bursts of the user-specified burst length set in the INTERFACE pragma (max_read_burst_length and max_write_burst_length). In this case the default burst length of 16 is used, which appears on the ARLEN and AWLEN channels. The read/write channel achieves maximum throughput because there are no bubbles during the transfer.

Figure 3. Co-sim Results
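
The partitioning that the adapter performs on a long request, as described above, can be sketched as follows. `Burst` and `split_request` are hypothetical names, and the model is deliberately simplified: it only caps each burst at the maximum length and ignores other AXI rules, such as the requirement that a burst not cross a 4 KB address boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One burst as it might appear on the bus: start element and number of beats.
struct Burst {
    std::size_t offset;
    std::size_t len;
};

// Split a request of `len` elements into bursts of at most `max_burst_len`.
std::vector<Burst> split_request(std::size_t offset, std::size_t len,
                                 std::size_t max_burst_len) {
    assert(max_burst_len > 0);
    std::vector<Burst> bursts;
    while (len > 0) {
        const std::size_t chunk = len < max_burst_len ? len : max_burst_len;
        bursts.push_back({offset, chunk});
        offset += chunk;
        len -= chunk;
    }
    return bursts;
}
```

With a maximum burst length of 16, a request of 40 elements starting at offset 0 becomes three bursts in this model: two of length 16 and a final one of length 8.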

Sequential Burst

A sequential burst consists of many bursts of smaller data sizes, where the read requests, write requests, and write responses are inside the loop body, as shown in the snippet below. The drawback of sequential burst is that each request (i+1) depends on the previous request (i) finishing, because the loop waits for the read request, write request, and write response to complete. This causes gaps between requests. Sequential burst is not as effective as pipeline burst because it reads or writes a small amount of data many times, once per loop iteration. Although this limits the improvement to throughput, sequential burst is still better than no burst.

void transfer_kernel(hls::burst_maxi<int> in, hls::burst_maxi<int> out, const int size) {
#pragma HLS INTERFACE m_axi port=in depth=512 latency=32 offset=slave
#pragma HLS INTERFACE m_axi port=out depth=5120 offset=slave latency=32

    int buf[8192];

    for (int i = 0; i < size; i++) {
        in.read_request(i, 1);
#pragma HLS PIPELINE II=1
        buf[i] = in.read();
    }

    for (int i = 0; i < NT; i++) {
        for (int j = 0; j < size; j++) {
            out.write_request(j, 1);
#pragma HLS PIPELINE II=1
            int a = buf[j];
            out.write(a);
            out.write_response();
        }
    }
}

Figure 4. Synthesis Results

As you can see from the report sample above, the tool achieved a burst of length 1.

Figure 5. Performance Impacts

The R/WDATA channel of the read/write loop has gaps equal to the read/write latency, as discussed in AXI4 Master Interface. For the read channel, the loop waits for all the read data to come back from the global memory. For the write channel, the innermost loop waits for the response (BVALID) to come back from the global memory. This results in performance degradation. The co-sim results also show a 2x degradation in performance for this burst semantic.

Figure 6. Performance Estimates

Features and Limitations

If the m_axi element is a struct:
- The struct will be packed into a wide int. Disaggregation of the struct is not allowed.
- The size of struct must be a power-of-2, and should not exceed 1024 bits or the max width specified by the config_interface -m_axi_max_bitwidth command.
ARRAY_PARTITION and ARRAY_RESHAPE of burst_maxi ports are not allowed.
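
The struct size rule above can be checked at compile time with a small helper. This is a sketch under stated assumptions: `is_power_of_two`, `fits_burst_maxi_element`, `Pixel4`, and `Pixel3` are hypothetical names, the check assumes 8-bit bytes, and it assumes the packed width equals sizeof(T) * 8, which might not hold for structs with padding or arbitrary-precision members.

```cpp
#include <cstddef>

constexpr bool is_power_of_two(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// True if T's size in bits is a power of two and does not exceed `max_bits`.
template <typename T>
constexpr bool fits_burst_maxi_element(std::size_t max_bits = 1024) {
    return is_power_of_two(sizeof(T) * 8) && sizeof(T) * 8 <= max_bits;
}

struct Pixel4 { unsigned char r, g, b, a; };  // 32 bits: passes the check
struct Pixel3 { unsigned char r, g, b; };     // 24 bits: fails the check

static_assert(fits_burst_maxi_element<Pixel4>(), "32-bit struct should pass");
static_assert(!fits_burst_maxi_element<Pixel3>(), "24-bit struct should fail");
```

The `max_bits` parameter stands in for the 1024-bit limit or a smaller width set through config_interface -m_axi_max_bitwidth.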

You can apply the INTERFACE pragma or directive to hls::burst_maxi, defining an m_axi interface. If the burst_maxi port is bundled with other ports, all ports in this bundle must be hls::burst_maxi and must have the same element type.

void dut(hls::burst_maxi<int> A, hls::burst_maxi<int> B, int *C, hls::burst_maxi<short> D) {
  #pragma HLS interface m_axi port=A offset=slave bundle=gmem // OK
  #pragma HLS interface m_axi port=B offset=slave bundle=gmem // OK
  #pragma HLS interface m_axi port=C offset=slave bundle=gmem // Bad. C must also be hls::burst_maxi type, because it shares the same bundle 'gmem' with A and B
  #pragma HLS interface m_axi port=D offset=slave bundle=gmem  // Bad. D should have 'int' element type,  because it shares the same bundle 'gmem' with A and B
}

You can use the INTERFACE pragma or directive to specify the num_read_outstanding and num_write_outstanding, and the max_read_burst_length and max_write_burst_length to define the size of the internal buffer of the m_axi adapter.
```
void dut(hls::burst_maxi<int> A) {
  #pragma HLS interface m_axi port=A num_read_outstanding=32 num_write_outstanding=32 max_read_burst_length=16 max_write_burst_length=16
}
```
The INTERFACE pragma or directive max_widen_bitwidth is not supported, because HLS will not change the bit width of hls::burst_maxi ports.

You must make a read_request before read, or write_request before write:

void dut(hls::burst_maxi<int> A) {
  ... = A.read();  // Bad because read() before read_request(). You can catch this error in C-sim.
  A.read_request(0, 1); 
}

If the address ranges and lifetimes of a read group (read_request() followed by read()) and a write group (write_request() followed by write() and write_response()) overlap, the tool cannot guarantee the access order. C-simulation will report an error.

void dut(hls::burst_maxi<int> A) {
  A.write_request(0, 1);
  A.write(x);
  A.read_request(0, 1);
  ... = A.read();  // What value is read? It is undefined. It could be original A[0] or updated A[0].
  A.write_response();
}
 
void dut(hls::burst_maxi<int> A) {
  A.write_request(0, 1);
  A.write(x);
  A.write_response();
  A.read_request(0, 1);
  ... = A.read();  // this will read the updated A[0].
}
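
The address part of this overlap condition can be sketched as a simple range test. `Range` and `ranges_overlap` are hypothetical names; the sketch treats each request as a half-open range of element indices and says nothing about lifetimes, which depend on where the matching read(), write(), and write_response() calls occur.

```cpp
#include <cstddef>

// Half-open element range [offset, offset + len) covered by one burst request.
struct Range {
    std::size_t offset;
    std::size_t len;
};

// True if the two requests touch at least one common element.
bool ranges_overlap(const Range &a, const Range &b) {
    if (a.len == 0 || b.len == 0) return false;
    return a.offset < b.offset + b.len && b.offset < a.offset + a.len;
}
```

In the first example above, the write and read requests both cover element 0, so their ranges overlap; ordering the write_response() before the read_request(), as in the second example, separates their lifetimes.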

If multiple hls::burst_maxi ports are bundled to the same m_axi adapter and their transaction lifetimes overlap, the behavior can be unexpected.

void dut(hls::burst_maxi<int> A, hls::burst_maxi<int> B) {
    #pragma HLS INTERFACE m_axi port=A bundle=gmem depth = 10
    #pragma HLS INTERFACE m_axi port=B bundle=gmem depth = 10 
    A.read_request(0, 10);
    B.read_request(0, 10);
     
    for (int i = 0; i < 10; i++) {
        #pragma HLS pipeline II=1
        …… = A.read(); // get value of A[0], A[2], A[4] …
        …… = B.read();  // get value of A[1], A[3], A[5] …
    }
}

Issuing read or write requests and performing the corresponding reads or writes in different dataflow processes is not supported. The dataflow checker will report an error: multiple writes in different dataflow processes are not allowed.
For example:
```
void transfer(hls::burst_maxi<int> A)  {
#pragma HLS dataflow
   IssueRequests(A); // issue multiple write_request() of A
   Write(A); // multiple writes to A
   GetResponse(A); // write_response() of  A
}
```

Potential Pitfalls

The following are some concerns you must be aware of when implementing manual burst techniques:

Deadlock: Improper use of manual burst can lead to deadlocks.
Issuing too many read_request() calls before the read() commands causes deadlock, because the read_request loop pushes requests into the read request FIFO, and this FIFO is only emptied after the read from the global memory is completed. The job of the read() command is to read the data from the adapter FIFO and mark the request done, after which the read_request is popped from the FIFO and a new request can be pushed onto it.
```
// Reads: will deadlock if N is large
for (int i = 0; i < N; i++) {
    A.read_request(i * 128, 16);
}
for (int i = 0; i < 16 * N; i++) {
    … = A.read();
}

// Writes: will deadlock if N is large
for (int i = 0; i < N; i++) {
    p.write_request(i * 128, 16);
}

for (int i = 0; i < N * 16; i++) {
    p.write(i);
}

for (int i = 0; i < N; i++) {
    p.write_response();
}
```
In the example above, if N is large, the read_request and read FIFOs will be full as the loop tends to N/2. The read_request loop then cannot finish and the read() loop cannot start, which results in deadlock.

Note: This case is also true for write_request() and write() commands.
AXI protocol violation: There should be an equal number of write requests and write responses. An unequal number of requests and responses leads to an AXI protocol violation.
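
The deadlock scenario described above can be illustrated with a toy model of a bounded request queue. The function name and the `queue_depth` parameter are hypothetical; the actual depth of the adapter's request FIFO depends on the interface configuration and is not modeled here. The sketch only captures the stated rule that a request leaves the queue once its data has been consumed by read().

```cpp
#include <cstddef>

// Returns true if issuing `num_requests` read_request() calls back to back,
// before any read(), would block forever on a queue of `queue_depth` entries.
bool requests_before_reads_deadlock(std::size_t num_requests,
                                    std::size_t queue_depth) {
    std::size_t queued = 0;
    for (std::size_t i = 0; i < num_requests; ++i) {
        if (queued == queue_depth) {
            return true;  // queue is full and no read() has run to drain it
        }
        ++queued;
    }
    return false;
}
```

Under this model, one way to avoid the problem is to issue requests in groups no larger than the queue depth and drain each group with read() before issuing the next.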