AXI Performance Case Study

Introduction

This case study shows a step-by-step optimization that improves the throughput of a kernel's read and write loops, guided by HLS interface metrics. The optimizations reduce kernel time and raise system throughput by making data transfers between global memory and the kernel more efficient. The transfer_kernel example below performs a simple DDR read and write, with variable size and a variable number of iterations (NUM_ITERATIONS).

Tip: The host code, which is not shown, only transfers the data and enqueues the kernel in an in-order queue.

#include "config.h"
#include "assert.h"
extern "C" {
   void transfer_kernel(wd* in, wd* out, const int size, const int iter) {
···
      wd buf[256];
      int off = (size/16);

      read_loop: for (int i = 0; i < off; i++)
      {
         buf[i] = in[i];
      }

      write_loop: L1: for (int i = 0; i < iter; i++) {
         L2: for (int j = 0; j < off; j++) {
         #pragma HLS PIPELINE II=1
            out[j+off*i] = buf[j];
         }
      }
···
   }
}
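
Because config.h and the host code are not shown, the following is only a minimal software sketch of what the kernel computes, which could serve as a golden reference in a C++ test bench. The wd_model type is a stand-in and an assumption: a struct of sixteen 32-bit words, chosen to match the 512-bit port width and the size/16 offset. The names wd_model and transfer_model are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the 512-bit `wd` type from config.h
// (assumption: sixteen 32-bit words per wide word).
struct wd_model {
    std::uint32_t lane[16];
};

// Plain C++ model of transfer_kernel: read `off` wide words once, then
// write them back `iter` times, one copy after another.
std::vector<wd_model> transfer_model(const std::vector<wd_model>& in,
                                     int size, int iter) {
    const int off = size / 16;        // number of wide words, as in the kernel
    assert(iter >= 0);
    assert(off >= 0 && off <= 256);   // the kernel's local buffer holds 256 words
    assert(static_cast<int>(in.size()) >= off);

    // read_loop: copy the input into the local buffer.
    std::vector<wd_model> buf(in.begin(), in.begin() + off);

    // write_loop: L1 repeats the buffer, L2 walks through it.
    std::vector<wd_model> out(static_cast<std::size_t>(off) * iter);
    for (int i = 0; i < iter; i++) {
        for (int j = 0; j < off; j++) {
            out[j + off * i] = buf[j];
        }
    }
    return out;
}
```

The bounds check on off reflects the fixed buf[256] array in the listing; whether the original code guards this in the elided lines is not known.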

This case study is divided into four steps:

Baseline the kernel run time with the port width set to 512 bits.
Improve performance by changing the latency parameter.
Improve the automatic burst inference of the write loop.
Confirm that no further improvement is available from multiple ports or the number of outstanding writes.

Step 1: Baseline the Kernel with 512-bit Port Width

Baseline the kernel time using the default settings. During this run, automatic burst inference produces the following results for the read and write loops:

The read loop achieves pipelined burst because the tool can predict the consecutive memory access pattern. The loop therefore issues pipelined read requests of variable size to the DDR.
The outer write loop, L1, gets only sequential burst. The compiler iterates over all the combinations and, because the size is unknown at compile time, inserts an if condition in the L1 loop before the start of the L2 loop. The innermost loop, L2, still achieves pipelined burst. L2 issues a write request of variable size, while L1 waits for all the data of the L2 loop to come back from the DDR before it starts its next iteration.
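
The difference between the two burst kinds can be illustrated with a first-order cycle estimate. This is a deliberately simplified model and an assumption, not a formula used by the tool: each burst request pays the interface latency once and then moves one word per cycle. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Sequential burst: L1 waits for each L2 burst to finish, so the latency
// is paid once per outer iteration.
std::int64_t sequential_cycles(std::int64_t off, std::int64_t iter,
                               std::int64_t latency) {
    assert(off >= 0 && iter >= 0 && latency >= 0);
    return iter * (latency + off);
}

// Pipelined burst: requests overlap, so the latency is paid roughly once
// for the whole transfer.
std::int64_t pipelined_cycles(std::int64_t off, std::int64_t iter,
                              std::int64_t latency) {
    assert(off >= 0 && iter >= 0 && latency >= 0);
    return latency + iter * off;
}
```

Under this model, with off = 256 and iter = 100, a latency of 64 gives 32000 cycles for the sequential case and 25664 for the pipelined case. Lowering the latency to 21 changes the sequential estimate to 27700 but the pipelined estimate only to 25621. These are model outputs, not measurements.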

After building and running the application, evaluate the performance in Vitis Analyzer by viewing the reports generated by the build process or the run summary. Review the Burst Summary in the Synthesis Report from Vitis HLS, which shows where burst inference succeeded and where it failed for the read and write loops.

Figure 1. Synthesis Report - Burst Summary

In Vitis Analyzer, the Profile Summary and Timeline Trace reports are also useful for analyzing the performance of the FPGA-accelerated application. In the Profile Summary, the Kernels & Compute Unit: Kernel Execution section reports the total time required by the transfer_kernel in the baseline build.

Figure 2. Profile Summary - Kernel Execution

Step 2: Improve Performance Latency

Vitis HLS uses a default latency of 64 kernel cycles, which may be too high in some cases. The appropriate latency depends on the system characteristics. For this example, the latency is reduced from the default to 21 kernel cycles. The code can be changed to specify the latency using the INTERFACE pragma or directive, as shown in the following example:

#include "config.h"
#include "assert.h"
extern "C" {
   void transfer_kernel(wd* in, wd* out, const int size, const int iter) {
   #pragma HLS INTERFACE m_axi port=in offset=slave latency=21
   #pragma HLS INTERFACE m_axi port=out offset=slave latency=21

...

Build and run the application and use Vitis Analyzer to review the reports generated by the build process or the run summary. Review the Synthesis Report from Vitis HLS, and examine the HW Interface table to see the specified latency has been applied.

Figure 3. Synthesis Report - HW Interface

Review the Burst Summary to check which bursts succeeded and which failed with the new latency setting.

Figure 4. Synthesis Report - Burst Summary 2

Examine the Kernel Execution in the Profile Summary report, and notice the performance improvement due to setting the latency for the interface.

Figure 5. Profile Summary - Kernel Execution 2

Step 3: Improve the Automatic Burst Inference of the Write Loop

The compiler is pessimistic in auto burst inference because size and loop trip counts are unknown at compile time. You can modify the code to help the compiler infer pipelined burst, as shown below.

#include "config.h"
#include "assert.h"
extern "C" {
   void transfer_kernel(wd* in, wd* out, const int size, const int iter) {
   #pragma HLS INTERFACE m_axi port=in offset=slave latency=21
   #pragma HLS INTERFACE m_axi port=out offset=slave latency=21

      int k=0;
      wd buf[256];
      int off = (size/16);

      read_loop: for (int i = 0; i < off; i++)
      {
         buf[i] = in[i];
      }

      write_loop: for (int j = 0; j < off*iter; j++) {
      #pragma HLS PIPELINE II=1
         out[k++] = buf[j%off];
      }
   }
}
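
The rewrite relies on the single flattened loop writing the same values to the same addresses, in the same order, as the nested L1/L2 pair. A minimal host-side sketch that checks this equivalence follows, with plain int standing in for wd (an assumption) and hypothetical function names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Nested form from the baseline: out[j + off*i] = buf[j].
std::vector<int> write_nested(const std::vector<int>& buf, int off, int iter) {
    assert(off >= 0 && iter >= 0 && static_cast<int>(buf.size()) >= off);
    std::vector<int> out(static_cast<std::size_t>(off) * iter);
    for (int i = 0; i < iter; i++) {
        for (int j = 0; j < off; j++) {
            out[j + off * i] = buf[j];
        }
    }
    return out;
}

// Flattened form from Step 3: one loop with a running output index k
// and a wrapped buffer index j % off.
std::vector<int> write_flat(const std::vector<int>& buf, int off, int iter) {
    assert(off >= 0 && iter >= 0 && static_cast<int>(buf.size()) >= off);
    std::vector<int> out(static_cast<std::size_t>(off) * iter);
    int k = 0;
    for (int j = 0; j < off * iter; j++) {
        out[k++] = buf[j % off];
    }
    return out;
}
```

Both functions write addresses 0 through off*iter - 1 consecutively. The flattened form exposes that as a single loop with a unit-stride output index, which is the pattern the compiler needs to see in order to infer one pipelined burst.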

Build and run the application and use Vitis Analyzer to review the reports generated by the build process or the run summary. The Synthesis Report confirms that the burst hints to the compiler fixed the sequential burst of the write loop. The Burst and Widening Missed messages are related to widening ports to 512 bits. Because this example already has a 512-bit port width, they can be ignored. If the port width is not 512 bits in your code, you might need to focus on resolving these messages.

Figure 6. Synthesis Report - Burst Summary 3

Examine the Kernel Execution in the Profile Summary report, and notice the performance improvement due to the latency change from Step 2, and the pipeline burst for the write loop in the current step.

Figure 7. Profile Summary - Kernel Execution 3

Summary

There are no further improvements that can be made from the Vitis HLS interface metrics. The case study example does not have concurrent read or write, so targeting multiple ports will not help in this case. In this example the tool has achieved pipeline burst for the maximum throughput, so the number of outstanding reads and writes can also be ignored. No further improvements can be confirmed from the kernel time.
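
For contrast, a kernel that does read two arrays concurrently is the kind of design where separate ports could matter. The following sketch is hypothetical and not part of the case study: the kernel name vadd_sketch and the bundle names gmem0, gmem1, and gmem2 are assumptions, and num_read_outstanding=32 is shown only to illustrate where that option goes. It uses the bundle and num_read_outstanding options of the m_axi INTERFACE pragma. A standard C++ compiler ignores the HLS pragmas, so the function can also be exercised on the host.

```cpp
#include <cassert>

// Hypothetical kernel: a and b are read in the same loop iteration, so each
// is mapped to its own AXI port through a distinct bundle name.
extern "C" void vadd_sketch(const int* a, const int* b, int* out, const int n) {
#pragma HLS INTERFACE m_axi port=a   offset=slave bundle=gmem0 num_read_outstanding=32
#pragma HLS INTERFACE m_axi port=b   offset=slave bundle=gmem1 num_read_outstanding=32
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem2
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];   // two concurrent reads, one write
    }
}
```

In transfer_kernel the read loop finishes before the write loop starts, so there is no such concurrency to exploit.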

As seen in the case study, implementing efficient load-store functions is dependent on the HLS interface metrics of port width, burst access, latency, multiple ports, and the number of outstanding reads and writes. AMD recommends the following guidelines for improving your system performance:

Port width: Maximize the port width of the interface, i.e., the bit-width of each AXI port, by using hls::vector or ap_(u)int<512> as the data type of the port.
Multiple ports: Analyze the concurrent memory reads/writes and have a dedicated/independent port for concurrent accesses.
Pipeline burst: The AXI latency parameter does not have an impact on pipelined burst. Write code that achieves pipelined burst, which can significantly improve the performance.
Sequential burst: The AXI latency parameter has a significant impact on sequential burst. Decreasing the latency from the default latency of the tool will improve the performance.
Num outstanding: In most cases with a burst length >= 16, the default num outstanding should be sufficient. For bursts shorter than 16, AMD recommends doubling num outstanding from the default (16).
Data re-ordering: Achieving pipelined burst is always recommended, but at times the memory access pattern allows the compiler to achieve only a sequential burst. To improve the performance, the developer can also consider different ways of storing the data in memory. For instance, accessing data in DRAM in a column-major fashion can be very inefficient. Rather than implementing a dedicated data mover in the kernel, it may be better to transpose the data in software and store it in row-major order instead, which will greatly simplify the hardware access patterns.
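
The data re-ordering guideline can be sketched on the host side as a simple transpose. This is a minimal illustration under assumptions: the element type int and the function name to_row_major are hypothetical, and a real application would choose a layout to match its own kernel access pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convert a rows x cols matrix from column-major to row-major storage so
// that a kernel walking along a row reads consecutive addresses.
std::vector<int> to_row_major(const std::vector<int>& col_major,
                              std::size_t rows, std::size_t cols) {
    assert(col_major.size() == rows * cols);
    std::vector<int> row_major(rows * cols);
    for (std::size_t r = 0; r < rows; r++) {
        for (std::size_t c = 0; c < cols; c++) {
            // Element (r, c) sits at c*rows + r in column-major order
            // and at r*cols + c in row-major order.
            row_major[r * cols + c] = col_major[c * rows + r];
        }
    }
    return row_major;
}
```

After the transpose, consecutive elements of a row are adjacent in memory, which is the access pattern that burst inference can recognize.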

AXI Performance Case Study - 2023.2 English

Vitis High-Level Synthesis User Guide (UG1399)
