The GMIO port attribute can be used to initiate AI Engine–DDR memory read and write transactions in the PS program. This enables data transfer between an AI Engine and the DDR controller through APIs written in the PS program. The following example shows how to use GMIO APIs to send data to an AI Engine for processing and retrieve the processed data back to the DDR through the PS program.
GMIO gmioIn("gmioIn", 64, 1000);
GMIO gmioOut("gmioOut", 64, 1000);
simulation::platform<1,1> plat(&gmioIn, &gmioOut);
myGraph gr;
connect<> c0(plat.src[0], gr.in);
connect<> c1(gr.out, plat.sink[0]);
int main(int argc, char ** argv)
{
    const int BLOCK_SIZE=256;
    int32 *inputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
    int32 *outputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
    // provide input data to AI Engine in inputArray
    for(int i=0;i<BLOCK_SIZE;i++){
        inputArray[i]=i;
    }
    gr.init();
    gmioIn.gm2aie_nb(inputArray, BLOCK_SIZE*sizeof(int32));
    gmioOut.aie2gm_nb(outputArray, BLOCK_SIZE*sizeof(int32));
    gr.run(8);
    gmioOut.wait();
    // can start to access output data from AI Engine in outputArray
    ...
    GMIO::free(inputArray);
    GMIO::free(outputArray);
    gr.end();
}
This example declares two GMIO objects. gmioIn represents the DDR memory space to be read by the AI Engine and gmioOut represents the DDR memory space to be written by the AI Engine. The constructor specifies the logical name of the GMIO, the burst length (64, 128, or 256 bytes) of the memory-mapped AXI4 transaction, and the required bandwidth (in MB/s).
GMIO gmioIn("gmioIn", 64, 1000);
GMIO gmioOut("gmioOut", 64, 1000);
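The burst-length constraint stated above can be captured as a compile-time check. The following is a minimal sketch only: is_valid_burst_length is a hypothetical helper introduced for illustration and is not part of the GMIO API.

```cpp
#include <cstddef>

// Hypothetical helper, not part of the GMIO API: returns true when the
// burst length is one of the sizes this section lists (64, 128, or 256 bytes).
constexpr bool is_valid_burst_length(std::size_t bytes) {
    return bytes == 64 || bytes == 128 || bytes == 256;
}

// The value used in the example constructors passes the check at compile time.
static_assert(is_valid_burst_length(64), "64-byte bursts are listed as valid");
static_assert(!is_valid_burst_length(32), "32-byte bursts are not listed");
```

A check like this could guard a project-specific constant before it is passed to the GMIO constructor.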
Assuming the application graph (myGraph) has an input port (myGraph::in) connecting to the processing kernels and an output port (myGraph::out) producing the processed data from the kernels, the following code connects gmioIn (as a platform source) to the input port of the graph and connects gmioOut (as a platform sink) to the output port of the graph.
simulation::platform<1,1> plat(&gmioIn, &gmioOut);
connect<> c0(plat.src[0], gr.in);
connect<> c1(gr.out, plat.sink[0]);
Inside the main function, two 256-element int32 arrays are allocated by GMIO::malloc. inputArray points to the memory space to be read by the AI Engine and outputArray points to the memory space to be written by the AI Engine. In Linux, the virtual address passed to GMIO::gm2aie_nb, GMIO::aie2gm_nb, GMIO::gm2aie, and GMIO::aie2gm must be allocated by GMIO::malloc. After the input array is allocated, it can be initialized.
const int BLOCK_SIZE=256;
int32 *inputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
int32 *outputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
GMIO::gm2aie_nb is used to initiate memory-mapped AXI4 transactions for the AI Engine to read from DDR memory spaces. The first argument of GMIO::gm2aie_nb is the pointer to the start address of the memory space for the transaction. The second argument is the transaction size in bytes. The memory space for the transaction must be within the memory space allocated by GMIO::malloc. Similarly, GMIO::aie2gm_nb is used to initiate memory-mapped AXI4 transactions for the AI Engine to write to DDR memory spaces. GMIO::gm2aie_nb and GMIO::aie2gm_nb are non-blocking in the sense that they return immediately when the transaction is issued; they do not wait for the transaction to complete. By contrast, GMIO::gm2aie and GMIO::aie2gm behave in a blocking manner. In this example, one gmioIn.gm2aie_nb call issues a read transaction for eight iterations' worth of data, and one gmioOut.aie2gm_nb call issues a write transaction for eight iterations' worth of data.
gmioIn.gm2aie_nb(inputArray, BLOCK_SIZE*sizeof(int32));
gmioOut.aie2gm_nb(outputArray, BLOCK_SIZE*sizeof(int32));
gr.run(8) is also a non-blocking call; it runs the graph for eight iterations. To synchronize between the PS and the AI Engine for DDR memory read/write access, you can use GMIO::wait to block PS execution until the GMIO transaction is complete. In this example, gmioOut.wait() is called to wait for the output data to be written to the outputArray DDR memory space. After that, the PS program can access the data. When the PS has completed processing, the memory space allocated by GMIO::malloc can be released by GMIO::free.
GMIO::free(inputArray);
GMIO::free(outputArray);
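The ordering between issuing a non-blocking transaction and waiting for it can be modeled in plain C++. The sketch below is not the ADF GMIO class: MockPort, transfer_nb, transfer, and outstanding are hypothetical names, and the model is single-threaded, so the queued copies run inside wait(), whereas real hardware would move the data concurrently with PS execution.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical stand-in for a GMIO port. This is NOT the ADF GMIO class;
// it only models the issue/wait ordering in a single thread.
class MockPort {
public:
    // Non-blocking issue: record the copy and return immediately.
    void transfer_nb(void* dst, const void* src, std::size_t bytes) {
        assert(dst != nullptr && src != nullptr);
        pending_.push_back([dst, src, bytes] { std::memcpy(dst, src, bytes); });
    }

    // Block until every issued transfer has completed. In this model the
    // queued copies run here; real hardware would move data concurrently.
    void wait() {
        for (auto& copy : pending_) copy();
        pending_.clear();
    }

    // Blocking transfer: issue, then wait immediately.
    void transfer(void* dst, const void* src, std::size_t bytes) {
        transfer_nb(dst, src, bytes);
        wait();
    }

    // Number of issued transfers that have not been waited on yet.
    std::size_t outstanding() const { return pending_.size(); }

private:
    std::vector<std::function<void()>> pending_;
};
```

In this model, the destination buffer is only guaranteed to hold the data after wait() returns, which mirrors why the PS program calls gmioOut.wait() before reading outputArray.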
GMIO APIs can be used in various ways to exercise different levels of control over read/write access and synchronization between the AI Engine, PS, and DDR memory. GMIO::gm2aie, GMIO::aie2gm, GMIO::gm2aie_nb, or GMIO::aie2gm_nb can be called multiple times to associate different memory spaces with the same GMIO object during different phases of graph execution. Different GMIO objects can be associated with the same memory space for in-place AI Engine–DDR read/write access. The blocking GMIO::gm2aie and GMIO::aie2gm APIs are themselves synchronization points for data transport and kernel execution. Calling GMIO::gm2aie (or GMIO::aie2gm) is equivalent to calling GMIO::gm2aie_nb (or GMIO::aie2gm_nb) followed immediately by GMIO::wait. The following example combines these use cases.
GMIO gmioIn("gmioIn", 64, 1000);
GMIO gmioOut("gmioOut", 64, 1000);
simulation::platform<1,1> plat(&gmioIn, &gmioOut);
myGraph gr;
connect<> c0(plat.src[0], gr.in);
connect<> c1(gr.out, plat.sink[0]);
int main(int argc, char ** argv)
{
    const int BLOCK_SIZE=256;
    // dynamically allocate memory spaces for in-place AI Engine read/write access
    int32* inoutArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
    gr.init();
    for (int k=0; k<4; k++)
    {
        // provide input data to AI Engine in inoutArray
        for(int i=0;i<BLOCK_SIZE;i++){
            inoutArray[i]=i;
        }
        gr.run(8);
        for (int i=0; i<8; i++)
        {
            gmioIn.gm2aie(inoutArray+i*32, 32*sizeof(int32)); //blocking call to ensure transaction data is read from DDR to AI Engine
            gmioOut.aie2gm_nb(inoutArray+i*32, 32*sizeof(int32));
        }
        gmioOut.wait();
        // can start to access output data from AI Engine in inoutArray
        ...
    }
    GMIO::free(inoutArray);
    gr.end();
}
In the previous example, the two GMIO objects gmioIn and gmioOut use the same memory space, pointed to by inoutArray, for in-place read and write access. Because the data flow dependencies among the kernels inside the graph are not known, and to ensure write-after-read ordering for the inoutArray memory space, the blocking version gmioIn.gm2aie is called so that the transaction data is copied from DDR memory to AI Engine local memory before a write transaction to the same memory space is issued by gmioOut.aie2gm_nb.
gmioIn.gm2aie(inoutArray+i*32, 32*sizeof(int32)); //blocking call to ensure transaction data is read from DDR to AI Engine
gmioOut.aie2gm_nb(inoutArray+i*32, 32*sizeof(int32));
gmioOut.wait() ensures that the output data has been written to DDR memory. After it returns, the PS can access the output data for post-processing.
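The offsets and sizes used in the inner loop follow from dividing the block evenly across the graph iterations: 256 int32 elements over eight iterations gives 32 elements (128 bytes) per transaction. The sketch below reproduces that arithmetic. Chunk and split_block are hypothetical names introduced for illustration, and the even split assumes each graph iteration consumes the same number of samples, as the example implies.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one transaction: where it starts in the
// buffer (in elements) and how many bytes it covers.
struct Chunk {
    std::size_t offset_elems;
    std::size_t size_bytes;
};

// Split a block of block_elems elements evenly across the given number of
// iterations. Returns an empty vector when the split is not even.
std::vector<Chunk> split_block(std::size_t block_elems,
                               std::size_t iterations,
                               std::size_t elem_bytes) {
    std::vector<Chunk> chunks;
    if (iterations == 0 || block_elems % iterations != 0) {
        return chunks;
    }
    const std::size_t per_iter = block_elems / iterations;
    for (std::size_t i = 0; i < iterations; ++i) {
        chunks.push_back({i * per_iter, per_iter * elem_bytes});
    }
    return chunks;
}
```

For the values in the example, split_block(256, 8, sizeof(std::int32_t)) yields eight chunks whose offsets match inoutArray+i*32 and whose sizes match 32*sizeof(int32).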
The graph execution is divided into four phases by the for loop, for (int k=0; k<4; k++). inoutArray can be re-initialized in the for loop with different data to be processed in each phase.