Step 1 - Synchronous GMIO Transfer - 2023.2 English

Vitis Tutorials: AI Engine

Document ID: XD100
Release Date: 2023-11-29
Version: 2023.2 English

This step introduces the synchronous GMIO transfer mode. Change the working directory to single_aie_gmio/step1. The graph code in aie/graph.h shows that the design has one output, gmioOut, of type output_gmio; one input, gmioIn, of type input_gmio; and one AI Engine kernel, weighted_sum_with_margin.

	class mygraph: public adf::graph
	{
	private:
	  adf::kernel k_m;

	public:
	  adf::output_gmio gmioOut;
	  adf::input_gmio gmioIn;

	  mygraph()
	  {
		k_m = adf::kernel::create(weighted_sum_with_margin);
		gmioOut = adf::output_gmio::create("gmioOut",64,1000);
		gmioIn = adf::input_gmio::create("gmioIn",64,1000);

		adf::connect<>(gmioIn.out[0], k_m.in[0]);
		adf::connect<>(k_m.out[0], gmioOut.in[0]);
		adf::source(k_m) = "weighted_sum.cc";
		adf::runtime<adf::ratio>(k_m)= 0.9;
	  };
	};

The GMIO ports gmioIn and gmioOut are created and connected as follows:

	gmioOut = adf::output_gmio::create("gmioOut",64,1000);
	gmioIn = adf::input_gmio::create("gmioIn",64,1000);

	adf::connect<>(gmioIn.out[0], k_m.in[0]);
	adf::connect<>(k_m.out[0], gmioOut.in[0]);

The GMIO instance gmioIn represents the DDR memory space to be read by the AI Engine, and gmioOut represents the DDR memory space to be written by the AI Engine. The create function takes the logical name of the GMIO, the burst length of the memory-mapped AXI4 transaction (64, 128, or 256 bytes), and the required bandwidth in MB/s (here 1000 MB/s).
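
To make the burst length and bandwidth parameters more concrete, the following is a minimal sketch of the arithmetic they imply for the 1024-byte blocks used in this step. The helper names (burst_count, is_valid_burst_length, ideal_transfer_us) are hypothetical and not part of the ADF API. The sketch assumes that a transaction is split into fixed-size bursts and that 1 MB equals 10^6 bytes; how the hardware actually schedules bursts is not described in this step.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true if the burst length is one of the values
// the text lists for GMIO (64, 128, or 256 bytes).
constexpr bool is_valid_burst_length(std::size_t burst_len) {
    return burst_len == 64 || burst_len == 128 || burst_len == 256;
}

// Hypothetical helper: number of bursts needed to move `bytes`,
// assuming fixed-size bursts and rounding up for a partial last burst.
inline std::size_t burst_count(std::size_t bytes, std::size_t burst_len) {
    assert(is_valid_burst_length(burst_len));
    return (bytes + burst_len - 1) / burst_len;
}

// Hypothetical helper: ideal transfer time in microseconds at the
// requested bandwidth. With 1 MB = 10^6 bytes (an assumption),
// bytes / (MB/s) gives microseconds directly.
constexpr double ideal_transfer_us(std::size_t bytes, double mb_per_s) {
    return static_cast<double>(bytes) / mb_per_s;
}
```

Under these assumptions, a 1024-byte block with a 64-byte burst length corresponds to 16 bursts, and at 1000 MB/s the ideal transfer time is about 1.024 microseconds.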

Inside the main function of aie/graph.cpp, two 256-element int32 arrays (1024 bytes each) are allocated by GMIO::malloc. The dinArray pointer refers to the memory space to be read by the AI Engine, and the doutArray pointer refers to the memory space to be written by the AI Engine. In Linux, any virtual address passed to GMIO::gm2aie_nb, GMIO::aie2gm_nb, GMIO::gm2aie, or GMIO::aie2gm must be allocated by GMIO::malloc. After the input buffer is allocated, it can be initialized.

    int32* dinArray=(int32*)GMIO::malloc(BLOCK_SIZE_in_Bytes);
    int32* doutArray=(int32*)GMIO::malloc(BLOCK_SIZE_in_Bytes);

doutRef holds the golden reference output. It can be allocated with the standard malloc because it is not involved in any GMIO transfer.

    int32* doutRef=(int32*)malloc(BLOCK_SIZE_in_Bytes);

GMIO::gm2aie and GMIO::gm2aie_nb initiate read transfers, in which the AI Engine reads data from DDR memory using memory-mapped AXI transactions. The first argument of GMIO::gm2aie and GMIO::gm2aie_nb is a pointer to the start address of the memory space for the transaction (here dinArray). The second argument is the transaction size in bytes. The memory space for the transaction must lie within the memory space allocated by GMIO::malloc. Similarly, GMIO::aie2gm and GMIO::aie2gm_nb initiate write transfers, in which the AI Engine writes data to DDR memory. GMIO::gm2aie_nb and GMIO::aie2gm_nb are non-blocking functions: they return as soon as the transaction is issued and do not wait for it to complete. In contrast, GMIO::gm2aie and GMIO::aie2gm are blocking functions.

    gr.gmioIn.gm2aie(dinArray,BLOCK_SIZE_in_Bytes);
    gr.run(ITERATION);
    gr.gmioOut.aie2gm(doutArray,BLOCK_SIZE_in_Bytes);

The blocking transfer (gmioIn.gm2aie) must complete before gr.run() is called because the GMIO transfer is synchronous here. However, the buffer input of the graph (ping-pong by default) has only two buffers in which to store received data. Consequently, at most two blocks of input data can be transferred by a blocking GMIO transfer before the graph runs; beyond that, GMIO::gm2aie blocks the design. In this example program, ITERATION is set to one.
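
The two-buffer limit described above can be illustrated with a toy host-side model. This is only a sketch for reasoning about the constraint: the class PingPongModel and its methods are hypothetical, it does not use the ADF API, and it does not model the real hardware. A return value of false from try_transfer_block marks the point where a real blocking GMIO::gm2aie call would stall because no input buffer is free.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy model of a kernel input with ping-pong buffers.
class PingPongModel {
public:
    explicit PingPongModel(std::size_t buffers = 2)
        : capacity_(buffers), filled_(0) {
        assert(buffers > 0);
    }

    // Try to place one block of input data into a free buffer.
    // Returns false when every buffer is already full, which is where
    // a blocking transfer would wait indefinitely if the graph is not running.
    bool try_transfer_block() {
        if (filled_ == capacity_) return false;
        ++filled_;
        return true;
    }

    // Model one kernel iteration consuming one filled buffer.
    // Returns false when there is no data to consume.
    bool run_iteration() {
        if (filled_ == 0) return false;
        --filled_;
        return true;
    }

    std::size_t filled() const { return filled_; }

private:
    std::size_t capacity_;
    std::size_t filled_;
};
```

In this model, two calls to try_transfer_block succeed before any iteration runs, and a third fails until run_iteration frees a buffer, which mirrors why no more than two blocks should be sent before gr.run() in synchronous mode.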

Because GMIO::aie2gm() works in synchronous mode, the output can be processed as soon as the call returns.

Note: The memory is non-cacheable for GMIO in Linux.

In the example program, the transfer-and-run sequence is repeated four times in a loop. In each pass, pre-processing is done before the data transfer and post-processing is done after it.

    for(int i=0;i<4;i++){
      //pre-processing
      for(int j=0;j<ITERATION*1024/4;j++){
        dinArray[j]=j+i;
      }

      gr.gmioIn.gm2aie(dinArray,BLOCK_SIZE_in_Bytes);
      gr.run(ITERATION);
      gr.gmioOut.aie2gm(doutArray,BLOCK_SIZE_in_Bytes);

      //post-processing
      ref_func(dinArray,coeff,doutRef,ITERATION*1024/4);
      for(int j=0;j<ITERATION*1024/4;j++){
        if(doutArray[j]!=doutRef[j]){
          std::cout<<"ERROR:dout["<<j<<"]="<<doutArray[j]<<",gold="<<doutRef[j]<<std::endl;
          error++;
        }
      }
    }
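
The post-processing comparison in the loop above can be factored into a small helper, sketched below. The function names count_mismatches and elements_per_block are hypothetical and do not appear in the tutorial sources; std::int32_t stands in for the int32 type used in graph.cpp. The element count follows the expression ITERATION*1024/4 from the loop, assuming 4-byte elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of int32 elements for a given ITERATION
// value, mirroring the expression ITERATION*1024/4 used in the loop.
constexpr std::size_t elements_per_block(std::size_t iterations) {
    return iterations * 1024 / sizeof(std::int32_t);
}

// Hypothetical helper: count the positions at which the AI Engine
// output differs from the golden reference.
inline std::size_t count_mismatches(const std::int32_t* actual,
                                    const std::int32_t* expected,
                                    std::size_t n) {
    assert(actual != nullptr && expected != nullptr);
    std::size_t errors = 0;
    for (std::size_t j = 0; j < n; ++j) {
        if (actual[j] != expected[j]) {
            ++errors;
        }
    }
    return errors;
}
```

In the loop, a call such as count_mismatches(doutArray, doutRef, elements_per_block(ITERATION)) could then be added to the running error total, at the cost of no longer printing each mismatching element.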

When the PS has completed processing, the memory space allocated by GMIO::malloc can be released with GMIO::free.

    GMIO::free(dinArray);
    GMIO::free(doutArray);
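
One way to keep each GMIO::malloc paired with a GMIO::free is to wrap the buffer in a std::unique_ptr with a custom deleter. The following is a sketch only, not the tutorial's method. So that it builds on a host without the ADF headers, it uses the hypothetical stand-ins gmio_malloc_stub and gmio_free_stub, which simply call std::malloc and std::free; in graph.cpp these would be replaced by GMIO::malloc and GMIO::free. The names GmioDeleter, GmioBuffer, and make_gmio_buffer are likewise hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Hypothetical stand-ins for GMIO::malloc and GMIO::free (assumption:
// used here only so the sketch compiles without the ADF headers).
inline void* gmio_malloc_stub(std::size_t bytes) { return std::malloc(bytes); }
inline void gmio_free_stub(void* p) { std::free(p); }

// Deleter that releases the buffer through the matching free function.
struct GmioDeleter {
    void operator()(std::int32_t* p) const { gmio_free_stub(p); }
};

// Owning handle: the buffer is freed automatically when it goes out of scope.
using GmioBuffer = std::unique_ptr<std::int32_t[], GmioDeleter>;

// Allocate a buffer of `bytes` bytes, which must be a whole number of int32 elements.
inline GmioBuffer make_gmio_buffer(std::size_t bytes) {
    assert(bytes % sizeof(std::int32_t) == 0);
    return GmioBuffer(static_cast<std::int32_t*>(gmio_malloc_stub(bytes)));
}
```

With such a wrapper, the raw pointer needed by the transfer calls would be obtained with get(), and the explicit free calls at the end of main would no longer be required.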