Reducing Kernel-to-Kernel Communication Latency with OpenCL Pipes


Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID: UG1393
Release Date: 2020-02-28
Version: 2019.2 (Japanese edition)

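The OpenCL API 2.0 specification introduces a new memory object called a pipe. A pipe stores data organized as a FIFO. Pipe objects can be accessed only through the built-in functions that read from and write to a pipe, and they are not accessible from the host. Pipes let you stream data from one kernel to another inside the FPGA without going through external memory, which can substantially reduce overall system latency. For details, see Pipe Functions in version 2.0 of the OpenCL C Specification from the Khronos Group.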

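In the Vitis IDE, pipes must be defined statically outside of all kernel functions. Dynamic pipe allocation using the OpenCL 2.x clCreatePipe API is not supported. The depth of a pipe must be specified in the pipe declaration using the OpenCL attribute xcl_reqd_pipe_depth. For details, see xcl_reqd_pipe_depth.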

As specified in xcl_reqd_pipe_depth, the valid depth values are 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, and 32768.
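The valid depths listed above are exactly the powers of two from 16 through 32768, so a candidate depth can be checked or rounded up before it is written into a pipe declaration. The following is a minimal plain C sketch of that check. The names `pipe_depth_is_valid` and `pipe_depth_round_up` are hypothetical helpers invented for illustration; they are not part of any Xilinx or OpenCL API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helpers for illustration; not part of any Xilinx or OpenCL API. */
enum { PIPE_DEPTH_MIN = 16, PIPE_DEPTH_MAX = 32768 };

/* True when depth is a power of two in the range [16, 32768]. */
bool pipe_depth_is_valid(unsigned depth)
{
    bool power_of_two = depth != 0 && (depth & (depth - 1)) == 0;
    return power_of_two && depth >= PIPE_DEPTH_MIN && depth <= PIPE_DEPTH_MAX;
}

/* Smallest valid depth that can hold at least `packets` packets,
 * or 0 when even the largest valid depth is too small. */
unsigned pipe_depth_round_up(unsigned packets)
{
    unsigned depth = PIPE_DEPTH_MIN;
    while (depth < packets && depth < PIPE_DEPTH_MAX)
        depth *= 2;
    if (depth < packets)
        return 0;
    assert(pipe_depth_is_valid(depth));
    return depth;
}
```

For example, a stage that needs to buffer up to 1000 packets would round up to a depth of 1024 under this sketch.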

A given pipe can have only one producer and one consumer, and they must be in different kernels.

pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));

Pipes can be accessed in non-blocking mode using the standard OpenCL read_pipe() and write_pipe() built-in functions, or in blocking mode using the Xilinx extension functions read_pipe_block() and write_pipe_block().

The status of a pipe can be queried using the OpenCL get_pipe_num_packets() and get_pipe_max_packets() built-in functions.
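The return-value convention of the non-blocking calls can be illustrated off-device. The following is a minimal plain C sketch, not OpenCL kernel code. `pipe_model` and the `model_*` functions are hypothetical names invented here; they only mimic the behavior that the OpenCL C 2.0 specification describes for read_pipe(), write_pipe(), get_pipe_num_packets(), and get_pipe_max_packets(), namely 0 on success and a negative value when the pipe is empty or full. The fixed depth of 16 is an assumption chosen to match the smallest valid xcl_reqd_pipe_depth value.

```c
#include <assert.h>

/* Hypothetical host-side model of one int pipe; illustrative only. */
#define MODEL_DEPTH 16u   /* assumed depth: smallest valid xcl_reqd_pipe_depth */

typedef struct {
    int      slots[MODEL_DEPTH];
    unsigned head;    /* index of the oldest packet */
    unsigned count;   /* packets currently stored */
} pipe_model;

/* Mimics get_pipe_num_packets(): packets currently available to read. */
unsigned model_num_packets(const pipe_model *p) { return p->count; }

/* Mimics get_pipe_max_packets(): capacity fixed when the pipe is declared. */
unsigned model_max_packets(const pipe_model *p) { (void)p; return MODEL_DEPTH; }

/* Mimics non-blocking write_pipe(): 0 on success, negative when full. */
int model_write_pipe(pipe_model *p, const int *src)
{
    assert(p != 0 && src != 0);
    if (p->count == MODEL_DEPTH)
        return -1;                      /* full: caller must retry later */
    p->slots[(p->head + p->count) % MODEL_DEPTH] = *src;
    p->count++;
    return 0;
}

/* Mimics non-blocking read_pipe(): 0 on success, negative when empty. */
int model_read_pipe(pipe_model *p, int *dst)
{
    assert(p != 0 && dst != 0);
    if (p->count == 0)
        return -1;                      /* empty: nothing was read */
    *dst = p->slots[p->head];
    p->head = (p->head + 1) % MODEL_DEPTH;
    p->count--;
    return 0;
}
```

The point of the sketch is that code using the non-blocking calls has to check every return value and decide what to do on failure, whereas the blocking variants wait until the transfer can complete.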

The following function signatures are the currently supported pipe functions, where gentype indicates a built-in OpenCL C scalar integer or floating-point data type.

int read_pipe_block (pipe gentype p, gentype *ptr) 
int write_pipe_block (pipe gentype p, const gentype *ptr)

The following dataflow/dataflow_pipes_ocl example from the Xilinx Getting Started Examples on GitHub uses pipes to pass data from one processing stage to the next with the blocking read_pipe_block() and write_pipe_block() functions.

pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
pipe int p1 __attribute__((xcl_reqd_pipe_depth(32)));
// Input Stage Kernel : Read Data from Global Memory and write into Pipe P0
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void input_stage(__global int *input, int size)
{
    __attribute__((xcl_pipeline_loop)) 
    mem_rd: for (int i = 0 ; i < size ; i++)
    {
        //blocking Write command to pipe P0
        write_pipe_block(p0, &input[i]);
    }
}
// Adder Stage Kernel: Read Input data from Pipe P0 and write the result 
// into Pipe P1
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void adder_stage(int inc, int size)
{
    __attribute__((xcl_pipeline_loop))
    execute: for(int i = 0 ; i < size ;  i++)
    {
        int input_data, output_data;
        //blocking read command to Pipe P0
        read_pipe_block(p0, &input_data);
        output_data = input_data + inc;
        //blocking write command to Pipe P1
        write_pipe_block(p1, &output_data);
    }
}
// Output Stage Kernel: Read result from Pipe P1 and write the result to Global
// Memory
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void output_stage(__global int *output, int size)
{
    __attribute__((xcl_pipeline_loop))
    mem_wr: for (int i = 0 ; i < size ; i++)
    {
        //blocking read command to Pipe P1
        read_pipe_block(p1, &output[i]);
    }
}

The Device Traceline view shows the detailed activity and stalls on the OpenCL pipes after a hardware emulation run. This information can be used to choose FIFO sizes that achieve the best balance of application area and performance.

Figure 1. Device Traceline View
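The trade-off behind the FIFO sizing decision above can be sketched with a toy cycle model in plain C. This is not a substitute for the stall data reported by hardware emulation, and every name and parameter here is an assumption made for illustration: the producer tries to write one packet per cycle, the consumer starts reading one packet per cycle after `consumer_delay` cycles, and a stall is counted whenever the producer finds the pipe full.

```c
#include <assert.h>

/* Toy model, illustrative only: counts cycles in which the producer wants
 * to write but the pipe is full. All parameters are assumptions. */
unsigned count_producer_stalls(unsigned depth, unsigned consumer_delay,
                               unsigned total)
{
    assert(depth > 0);                  /* depth 0 could never make progress */
    unsigned occupancy = 0, sent = 0, received = 0, stalls = 0;

    for (unsigned cycle = 0; received < total; cycle++) {
        /* Consumer: one read per cycle once its start-up delay has passed. */
        if (cycle >= consumer_delay && occupancy > 0) {
            occupancy--;
            received++;
        }
        /* Producer: one write attempt per cycle until all packets are sent. */
        if (sent < total) {
            if (occupancy < depth) {
                occupancy++;
                sent++;
            } else {
                stalls++;
            }
        }
    }
    return stalls;
}

/* Smallest valid xcl_reqd_pipe_depth value (16 to 32768, powers of two)
 * with zero stalls in this toy model, or 0 if none qualifies. */
unsigned smallest_stall_free_depth(unsigned consumer_delay, unsigned total)
{
    for (unsigned depth = 16; depth <= 32768; depth *= 2)
        if (count_producer_stalls(depth, consumer_delay, total) == 0)
            return depth;
    return 0;
}
```

Under these assumptions, a consumer that starts 40 cycles late stalls the producer for 24 cycles at depth 16 and 8 cycles at depth 32, and not at all at depth 64. A deeper FIFO than that would cost area without removing any further stalls in this model.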