Programming Model for AI Engine–DDR Memory Connection

AI Engine Kernel and Graph Programming Guide (UG1079)

Document ID: UG1079
Release Date: 2022-10-19
Version: 2022.2 English

The input_gmio/output_gmio port attribute can be used to initiate AI Engine–DDR memory read and write transactions from the PS program. This enables data transfer between an AI Engine and the DDR controller through APIs called in the PS program. The following example shows how to use the GMIO APIs to send data to an AI Engine for processing and to write the processed data back to DDR memory under control of the PS program.

graph.h

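#include <adf.h>
// weighted_sum_with_margin must be declared before this point,
// for example in a kernel header included here.

class myGraph: public adf::graph
{
private:
  adf::kernel k_m;

public:
  adf::output_gmio gmioOut;
  adf::input_gmio gmioIn;
  myGraph()
  {
    k_m = adf::kernel::create(weighted_sum_with_margin);
    // logical name, burst length in bytes, required bandwidth in MB/s
    gmioOut = adf::output_gmio::create("gmioOut", 64, 1000);
    gmioIn = adf::input_gmio::create("gmioIn", 64, 1000);

    // window connections between the GMIO ports and the kernel
    adf::connect<adf::window<1024,32>>(gmioIn.out[0], k_m.in[0]);
    adf::connect<adf::window<1024>>(k_m.out[0], gmioOut.in[0]);
    adf::source(k_m) = "weighted_sum.cc";
    adf::runtime<adf::ratio>(k_m) = 0.9;
  }
};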

graph.cpp

#include "graph.h"
using namespace adf; // lets the listing refer to GMIO::malloc and GMIO::free unqualified

myGraph gr;
int main(int argc, char ** argv)
{
    const int BLOCK_SIZE=256;
    int32 *inputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
    int32 *outputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));

    // provide input data to AI Engine in inputArray
    for (int i=0; i<BLOCK_SIZE; i++) {
        inputArray[i] = i;
    }

    gr.init();

    // issue the read and write transactions; both calls return immediately
    gr.gmioIn.gm2aie_nb(inputArray, BLOCK_SIZE*sizeof(int32));
    gr.gmioOut.aie2gm_nb(outputArray, BLOCK_SIZE*sizeof(int32));

    gr.run(8);

    // block until the output transaction has completed
    gr.gmioOut.wait();

    // can start to access output data from AI Engine in outputArray
    // ...

    GMIO::free(inputArray);
    GMIO::free(outputArray);
    gr.end();
    return 0;
}

This example declares two I/O objects: gmioIn represents the DDR memory space to be read by the AI Engine, and gmioOut represents the DDR memory space to be written by the AI Engine. Each create() call specifies the logical name of the GMIO, the burst length of the memory-mapped AXI4 transaction (64, 128, or 256 bytes), and the required bandwidth (in MB/s).

gmioOut = adf::output_gmio::create("gmioOut", 64, 1000);
gmioIn  = adf::input_gmio::create("gmioIn", 64, 1000);
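Since only three burst lengths are listed as valid, a host-side build might want to catch a bad value before it reaches create(). The following is a minimal sketch of such a check. Both is_valid_gmio_burst and checked_burst are hypothetical helper names introduced for illustration; they are not part of the ADF API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an ADF function): true when burst_bytes is one
// of the burst lengths the text lists for GMIO (64, 128, or 256 bytes).
constexpr bool is_valid_gmio_burst(std::size_t burst_bytes) {
    return burst_bytes == 64 || burst_bytes == 128 || burst_bytes == 256;
}

// Hypothetical helper: returns burst_bytes unchanged, asserting in debug
// builds that it is one of the listed values.
inline std::size_t checked_burst(std::size_t burst_bytes) {
    assert(is_valid_gmio_burst(burst_bytes));
    return burst_bytes;
}

// Compile-time checks of the listed values.
static_assert(is_valid_gmio_burst(64), "64-byte bursts are listed");
static_assert(!is_valid_gmio_burst(32), "32 bytes is not a listed burst length");
```

Under these assumptions, a graph constructor could pass checked_burst(64) as the second argument to create() so that an unsupported value fails early in a debug build.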

The application graph (myGraph) has an input port (myGraph::gmioIn) connected to the processing kernel. The kernel writes its results to the output port (myGraph::gmioOut).

adf::connect<adf::window<1024,32>>(gmioIn.out[0], k_m.in[0]);
adf::connect<adf::window<1024>>(k_m.out[0], gmioOut.in[0]);

Inside the main function, two 256-element int32 arrays are allocated by GMIO::malloc. inputArray points to the memory space to be read by the AI Engine and outputArray points to the memory space to be written by the AI Engine. On Linux, the virtual address passed to GMIO::gm2aie_nb, GMIO::aie2gm_nb, GMIO::gm2aie, and GMIO::aie2gm must be allocated by GMIO::malloc. After the input array is allocated, it can be initialized with data.

const int BLOCK_SIZE=256; 
int32 *inputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
int32 *outputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));

gr.gmioIn.gm2aie_nb() is used to initiate memory-mapped AXI4 transactions for the AI Engine to read from DDR memory spaces. The first argument in gr.gmioIn.gm2aie_nb() is the pointer to the start address of the memory space for the transaction. The second argument is the transaction size in bytes. The memory space for the transaction must be within the memory space allocated by GMIO::malloc. Similarly, gr.gmioOut.aie2gm_nb() is used to initiate memory-mapped AXI4 transactions for the AI Engine to write to DDR memory spaces. gr.gmioIn.gm2aie_nb() and gr.gmioOut.aie2gm_nb() are non-blocking in the sense that they return as soon as the transaction is issued; they do not wait for the transaction to complete. By contrast, gr.gmioIn.gm2aie() and gr.gmioOut.aie2gm() are blocking calls.
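Because a transaction must stay inside a GMIO::malloc allocation, a PS program may want to validate the pointer and size before issuing it. The sketch below shows one way to do that on the host. GmioAllocation and transaction_in_bounds are hypothetical names used only for illustration; the ADF API is not known to provide such a check, so the program would have to record the base address and size of each allocation itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bookkeeping record: base address and size of one buffer
// previously returned by GMIO::malloc.
struct GmioAllocation {
    const void* base;
    std::size_t size_bytes;
};

// Returns true when [ptr, ptr + size_bytes) lies entirely inside alloc.
// The comparison is done on offsets so that it cannot overflow.
inline bool transaction_in_bounds(const GmioAllocation& alloc,
                                  const void* ptr,
                                  std::size_t size_bytes) {
    const std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(alloc.base);
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(ptr);
    if (start < begin) {
        return false; // starts before the allocation
    }
    const std::size_t offset = static_cast<std::size_t>(start - begin);
    if (offset > alloc.size_bytes) {
        return false; // starts past the end of the allocation
    }
    return size_bytes <= alloc.size_bytes - offset;
}
```

A caller could wrap each transfer in assert(transaction_in_bounds(...)) during development and remove the check in release builds.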

In this example, assume that in one iteration the graph consumes 32 int32 values from the input port and produces 32 int32 values on the output port. Over eight iterations, the graph therefore consumes 256 int32 values and produces 256 int32 values. The corresponding memory-mapped AXI4 transactions are initiated by the following code: one gr.gmioIn.gm2aie_nb() call issues a read transaction for eight iterations' worth of data, and one gr.gmioOut.aie2gm_nb() call issues a write transaction for eight iterations' worth of data.

gr.gmioIn.gm2aie_nb(inputArray, BLOCK_SIZE*sizeof(int32));
gr.gmioOut.aie2gm_nb(outputArray, BLOCK_SIZE*sizeof(int32));
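The size passed to each call follows from the per-iteration sample count stated above: 32 samples × 8 iterations × 4 bytes per int32 = 1024 bytes, which is the same as BLOCK_SIZE*sizeof(int32) with BLOCK_SIZE equal to 256. The following sketch spells out that arithmetic; transaction_bytes and iterations_covered are hypothetical helper names, and std::int32_t stands in for the int32 type used in the listings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes that one transaction must cover for the given number of iterations.
constexpr std::size_t transaction_bytes(std::size_t samples_per_iteration,
                                        std::size_t iterations,
                                        std::size_t bytes_per_sample) {
    return samples_per_iteration * iterations * bytes_per_sample;
}

// Inverse view: how many graph iterations a transfer of total_bytes feeds.
// The transfer is expected to be a whole number of iterations.
inline std::size_t iterations_covered(std::size_t total_bytes,
                                      std::size_t samples_per_iteration,
                                      std::size_t bytes_per_sample) {
    const std::size_t per_iteration = samples_per_iteration * bytes_per_sample;
    assert(per_iteration != 0 && total_bytes % per_iteration == 0);
    return total_bytes / per_iteration;
}

// 32 int32 samples per iteration, eight iterations: 1024 bytes in total.
static_assert(transaction_bytes(32, 8, sizeof(std::int32_t)) == 1024,
              "eight iterations of 32 int32 samples occupy 1024 bytes");
```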

gr.run(8) is also a non-blocking call to run the graph for eight iterations. To synchronize between the PS and AI Engine for DDR memory read/write access, you can use gr.gmioOut.wait() to block PS execution until the GMIO transaction is complete. In this example, gr.gmioOut.wait() is called to wait for the output data to be written to outputArray DDR memory space.

Note: The memory is non-cacheable for GMIO in Linux.

After that, the PS program can access the data. When the PS has finished processing, the memory space allocated by GMIO::malloc can be released with GMIO::free.

GMIO::free(inputArray);
GMIO::free(outputArray);
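The relationship between a non-blocking transfer and wait() can be explored on a host without AI Engine hardware. The class below is a host-only mock, not the ADF API: MockGmio, transfer_nb, transfer, and outstanding are hypothetical names. As a deliberate simplification, the mock performs the queued copies only when wait() is called, which makes it easy to see that the destination should not be read until wait() has returned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Host-only mock of a GMIO-style port (assumption: illustrative only).
class MockGmio {
public:
    // Non-blocking: records the transfer and returns immediately.
    void transfer_nb(void* dst, const void* src, std::size_t bytes) {
        pending_.push_back([=] { std::memcpy(dst, src, bytes); });
    }

    // Blocks until every outstanding transfer has completed.
    // In this mock, the queued copies are carried out here, in issue order.
    void wait() {
        for (auto& job : pending_) {
            job();
        }
        pending_.clear();
    }

    // Blocking: the non-blocking call followed immediately by wait().
    void transfer(void* dst, const void* src, std::size_t bytes) {
        transfer_nb(dst, src, bytes);
        wait();
    }

    // Number of transfers issued but not yet waited for.
    std::size_t outstanding() const { return pending_.size(); }

private:
    std::vector<std::function<void()>> pending_;
};
```

In this model, data written by transfer_nb becomes visible only after wait(), whereas transfer leaves nothing outstanding when it returns. Real hardware may complete a transaction earlier, but the PS program can rely on completion only after the wait.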

The input_gmio/output_gmio APIs can be combined in various ways to give different levels of control over read/write access and synchronization between the AI Engine, PS, and DDR memory. Any of input_gmio::gm2aie, output_gmio::aie2gm, input_gmio::gm2aie_nb, or output_gmio::aie2gm_nb can be called multiple times to associate different memory spaces with the same input_gmio/output_gmio object during different phases of graph execution. Different input_gmio/output_gmio objects can be associated with the same memory space for in-place AI Engine–DDR read/write access. The blocking input_gmio::gm2aie and output_gmio::aie2gm APIs are themselves synchronization points for data transfer and kernel execution. Calling input_gmio::gm2aie (or output_gmio::aie2gm) is equivalent to calling input_gmio::gm2aie_nb (or output_gmio::aie2gm_nb) followed immediately by input_gmio::wait (or output_gmio::wait). The following example combines these use cases.

myGraph gr;

int main(int argc, char ** argv)
{
    const int BLOCK_SIZE=256;
    // dynamically allocate memory spaces for in-place AI Engine read/write access
    int32* inoutArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
    gr.init();

    for (int k=0; k<4; k++)
    {
        // provide input data to AI Engine in inoutArray
        for (int i=0; i<BLOCK_SIZE; i++) {
            inoutArray[i]=i;
        }

        gr.run(8);
        for (int i=0; i<8; i++)
        {
            // blocking call to ensure transaction data is read from DDR to AI Engine
            gr.gmioIn.gm2aie(inoutArray+i*32, 32*sizeof(int32));
            gr.gmioOut.aie2gm_nb(inoutArray+i*32, 32*sizeof(int32));
        }
        gr.gmioOut.wait();

        // can start to access output data from AI Engine in inoutArray
        // ...
    }
    GMIO::free(inoutArray);
    gr.end();
    return 0;
}

In the previous example, the two GMIO objects gmioIn and gmioOut use the same memory space, which is allocated by GMIO::malloc and pointed to by inoutArray, for in-place read and write access.

Because the PS program does not know the data flow dependencies among the kernels inside the graph, it must itself enforce write-after-read ordering on the inoutArray memory space. The blocking gr.gmioIn.gm2aie() call ensures that the transaction data has been copied from DDR memory to AI Engine local memory before gr.gmioOut.aie2gm_nb() issues a write transaction to the same memory space.

gr.gmioIn.gm2aie(inoutArray+i*32, 32*sizeof(int32)); //blocking call to ensure transaction data is read from DDR to AI Engine
gr.gmioOut.aie2gm_nb(inoutArray+i*32, 32*sizeof(int32));
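The write-after-read ordering can be illustrated with a host-only model of the chunked loop. This is a sketch under stated assumptions, not AI Engine code: process_sample is a hypothetical stand-in for the kernel (it doubles each value), and process_in_place is a hypothetical helper that copies each chunk out completely before anything is written back to the same location.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the kernel: doubles each sample.
inline std::int32_t process_sample(std::int32_t x) { return 2 * x; }

// Processes data in place, one chunk at a time. Each chunk is read in full
// into a local buffer first (the role of the blocking read), and only then
// are the results written back to the same offsets (the write).
inline void process_in_place(std::vector<std::int32_t>& data, std::size_t chunk) {
    assert(chunk != 0 && data.size() % chunk == 0);
    std::vector<std::int32_t> local(chunk);
    for (std::size_t offset = 0; offset < data.size(); offset += chunk) {
        for (std::size_t i = 0; i < chunk; ++i) {
            local[i] = data[offset + i];                  // read completes first
        }
        for (std::size_t i = 0; i < chunk; ++i) {
            data[offset + i] = process_sample(local[i]);  // then write back
        }
    }
}
```

With 256 elements and a chunk of 32, the loop visits offsets 0, 32, ..., 224, matching the inoutArray+i*32 addressing for i from 0 to 7.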

gr.gmioOut.wait() ensures that the output data has been written to DDR memory. After it returns, the PS can access the output data for post-processing.

The graph execution is divided into four phases in the for loop, for (int k=0; k<4; k++). inoutArray can be re-initialized in the for loop with different data to be processed in different phases.