Reducing Kernel to Kernel Communication Latency in OpenCL Kernels

Reducing Kernel to Kernel Communication Latency in OpenCL Kernels - 2021.1 English

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID

UG1393

Release Date

2022-03-29

Version

2021.1 English

The OpenCL API 2.0 specification introduces a new memory object called a pipe. A pipe stores data organized as a FIFO. Pipe objects can only be accessed using built-in functions that read from and write to a pipe. Pipe objects are not accessible from the host. Pipes can be used to stream data from one kernel to another inside the FPGA without having to use the external memory, which greatly improves the overall system latency. For more information, see Pipe Functions on Version 2.0 of the OpenCL C Specification from Khronos Group.

In the Vitis IDE, pipes must be statically defined outside of all kernel functions. Dynamic pipe allocation using the OpenCL 2.x clCreatePipe API is not supported. The depth of a pipe must be specified by using the OpenCL attribute xcl_reqd_pipe_depth in the pipe declaration. For more information, see xcl_reqd_pipe_depth.

As specified in xcl_reqd_pipe_depth, the valid depth values are as follows: 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768.

A given pipe can have one and only one producer and consumer in different kernels.

pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));

Pipes can be accessed using the Xilinx extended read_pipe_block() and write_pipe_block() functions in blocking mode.

Tip: Reading or writing to pipes using the non-blocking read_pipe() or write_pipe() functions is not supported.

The status of pipes can be queried using OpenCL get_pipe_num_packets() and get_pipe_max_packets() built-in functions.

The following function signatures are the currently supported pipe functions, where gentype indicates the built-in OpenCL C scalar integer or floating-point data types.

int read_pipe_block (pipe gentype p, gentype *ptr) 
int write_pipe_block (pipe gentype p, const gentype *ptr)

The following “dataflow/dataflow_pipes_ocl” from Xilinx Getting Started Examples on GitHub uses pipes to pass data from one processing stage to the next using blocking read_pipe_block() and write_pipe_block() functions:

pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
pipe int p1 __attribute__((xcl_reqd_pipe_depth(32)));
// Input Stage Kernel : Read Data from Global Memory and write into Pipe P0
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void input_stage(__global int *input, int size)
{
    __attribute__((xcl_pipeline_loop)) 
    mem_rd: for (int i = 0 ; i < size ; i++)
    {
        //blocking Write command to pipe P0
        write_pipe_block(p0, &input[i]);
    }
}
// Adder Stage Kernel: Read Input data from Pipe P0 and write the result 
// into Pipe P1
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void adder_stage(int inc, int size)
{
    __attribute__((xcl_pipeline_loop))
    execute: for(int i = 0 ; i < size ;  i++)
    {
        int input_data, output_data;
        //blocking read command to Pipe P0
        read_pipe_block(p0, &input_data);
        output_data = input_data + inc;
        //blocking write command to Pipe P1
        write_pipe_block(p1, &output_data);
    }
}
// Output Stage Kernel: Read result from Pipe P1 and write the result to Global
// Memory
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void output_stage(__global int *output, int size)
{
    __attribute__((xcl_pipeline_loop))
    mem_wr: for (int i = 0 ; i < size ; i++)
    {
        //blocking read command to Pipe P1
        read_pipe_block(p1, &output[i]);
    }
}

The Device Traceline view shows the detailed activities and stalls on the OpenCL pipes after hardware emulation is run. This information can be used to choose the correct FIFO sizes to achieve the optimal application area and performance.

Figure 1. Device Traceline View