The graph in this design contains 32 AI Engine kernels, each with one input and one output. Thus, 32 AI Engine GMIO inputs and 32 AI Engine GMIO outputs are connected to the graph.
Change the working directory to perf_profile_aie_gmio. Take a look at the graph code in aie/graph.h.
static const int col[32]={6,13,14,45,18,42,4,30,48,49,9,16,29,39,40,31,2,3,46,0,43,27,41,26,11,17,47,1,19,10,34,7};
class mygraph: public adf::graph
{
private:
    adf::kernel k[32];
public:
    adf::input_gmio gmioIn[32];
    adf::output_gmio gmioOut[32];
    mygraph()
    {
        for(int i=0;i<32;i++){
            gmioIn[i]=adf::input_gmio::create("gmioIn"+std::to_string(i),/*size_t burst_length*/256,/*size_t bandwidth*/100);
            gmioOut[i]=adf::output_gmio::create("gmioOut"+std::to_string(i),/*size_t burst_length*/256,/*size_t bandwidth*/100);
            k[i] = adf::kernel::create(vec_incr);
            adf::connect<adf::window<1024>>(gmioIn[i].out[0], k[i].in[0]); //1024-byte input window
            adf::connect<adf::window<1032>>(k[i].out[0], gmioOut[i].in[0]); //1032-byte output window
            adf::source(k[i]) = "vec_incr.cc";
            adf::runtime<adf::ratio>(k[i])= 1;
            adf::location<adf::kernel>(k[i])=adf::tile(col[i],0); //place kernel i in column col[i], row 0
        }
    };
};
The previous code sets a location constraint (adf::location) for each kernel. This is to save time for aiecompiler. Note that each kernel has an input window size of 1024 bytes and an output window size of 1032 bytes.
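The two window sizes are related: 1024 bytes hold 256 int32 values, and the 8 extra output bytes leave room for the 64-bit cycle counter that the kernel appends. A minimal sketch of that arithmetic in plain C++ follows; the constant names are illustrative and do not appear in the tutorial sources.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative names; they are not part of the tutorial sources.
constexpr std::size_t kSamplesPerWindow = 256;  // int32 values per kernel invocation
constexpr std::size_t kInWindowBytes  = kSamplesPerWindow * sizeof(std::int32_t);
constexpr std::size_t kOutWindowBytes = kInWindowBytes + sizeof(std::uint64_t);

// These checks fail at compile time if the sizes stop matching the graph code.
static_assert(kInWindowBytes == 1024, "matches adf::window<1024>");
static_assert(kOutWindowBytes == 1032, "matches adf::window<1032>");
```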
Next, examine the kernel code aie/vec_incr.cc. It adds one to each int32 input and additionally outputs the cycle counter of the AI Engine tile. As explained later, this counter can be used to calculate the system throughput.
#include <aie_api/aie.hpp>
#include <aie_api/aie_adf.hpp>
#include <aie_api/utils.hpp>
void vec_incr(input_window<int32>* data,output_window<int32>* out){
    aie::vector<int32,16> vec1=aie::broadcast<int32>(1);
    for(int i=0;i<16;i++)
    chess_prepare_for_pipelining
    chess_loop_range(4,)
    {
        aie::vector<int32,16> vdata=window_readincr_v<16>(data);
        aie::vector<int32,16> vresult=aie::add(vdata,vec1);
        window_writeincr(out,vresult);
    }
    aie::tile tile=aie::tile::current();
    unsigned long long time=tile.cycles();//cycle counter of the AI Engine tile
    window_writeincr(out,time);
}
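To make the output layout concrete, the following host-only sketch models one invocation of the kernel in plain C++. It is not tutorial code: model_vec_incr is a hypothetical helper, the cycles argument stands in for the value the real kernel reads from tile.cycles(), and the sketch assumes the counter occupies the last 8 bytes of each output block, as the window sizes suggest.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Host-side stand-in for one vec_incr invocation (hypothetical helper).
std::vector<std::int32_t> model_vec_incr(const std::vector<std::int32_t>& in,
                                         std::uint64_t cycles) {
    // Output block: one word per input word, plus two words for the counter.
    std::vector<std::int32_t> out(in.size() + 2);
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] + 1;  // same arithmetic as aie::add(vdata, vec1)
    }
    // Assumption: the 64-bit counter fills the last 8 bytes of the block.
    std::memcpy(&out[in.size()], &cycles, sizeof(cycles));
    return out;
}
```

With a 256-word input, the result has 258 words: indices 0 to 255 carry the incremented data, and indices 256 and 257 carry the counter.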
Next, examine the host code aie/graph.cpp. The concepts introduced in AIE GMIO Programming Model apply here. This section explains the new concepts and how performance profiling is done. Some constants defined in the code are as follows:
#if !defined(__AIESIM__) && !defined(__X86SIM__) && !defined(__ADF_FRONTEND__)
const int ITERATION=512;
#else
const int ITERATION=4;
#endif
const int BLOCK_SIZE_in_Bytes=1024*ITERATION;
const int BLOCK_SIZE_out_Bytes=1032*ITERATION;
For the hardware flow, ITERATION is 512; otherwise, it is 4. This ensures that the AI Engine simulator can finish in a short amount of time.
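The resulting buffer sizes for both flows can be worked out directly from these constants. The sketch below does so in plain C++; BlockSizes and block_sizes are hypothetical names introduced for illustration, and the element counts assume a 4-byte int, as the host code does when it divides by sizeof(int).

```cpp
#include <cstddef>

// Hypothetical helper mirroring the BLOCK_SIZE_* constants for a given iteration count.
struct BlockSizes {
    std::size_t in_bytes;   // total input bytes per GMIO
    std::size_t out_bytes;  // total output bytes per GMIO
    std::size_t in_words;   // number of int elements in the input buffer
    std::size_t out_words;  // number of int elements in the output buffer
};

constexpr BlockSizes block_sizes(std::size_t iterations) {
    return BlockSizes{1024 * iterations, 1032 * iterations,
                      1024 * iterations / sizeof(int),
                      1032 * iterations / sizeof(int)};
}

static_assert(block_sizes(4).in_bytes == 4096, "simulation flow input");
static_assert(block_sizes(512).out_bytes == 528384, "hardware flow output");
```

In the hardware flow, each GMIO therefore moves 512 KiB in and 516 KiB out per graph run; in simulation, the figures are 4096 and 4128 bytes.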
In the main function, the PS code profiles num GMIO inputs and outputs, where num takes the values 1, 2, 4, 8, 16, and 32. Non-blocking GMIO APIs (GMIO::gm2aie_nb and GMIO::aie2gm_nb) are used for GMIO transactions, and GMIO::wait is used for output data synchronization. A kernel can finish only after its input and output data have been transferred. Because the graph is started for all the AI Engine kernels, but only some of the kernels are profiled, the remaining kernels stall until they receive data. After the code for profiling, these remaining kernels are flushed by transferring data to and from them.
for(int num=1;num<=32;num*=2){
    //Pre-processing
    for(int i=0;i<32;i++){
        for(int j=0;j<BLOCK_SIZE_in_Bytes/sizeof(int);j++){
            dinArray[i][j]=j+num;
        }
    }
    gr.run(ITERATION);
    //Profile starts here
    for(int i=0;i<num;i++){
        gr.gmioIn[i].gm2aie_nb(dinArray[i], BLOCK_SIZE_in_Bytes);
        gr.gmioOut[i].aie2gm_nb(doutArray[i], BLOCK_SIZE_out_Bytes);
    }
    for(int i=0;i<num;i++){
        gr.gmioOut[i].wait();
    }
    //Profile ends here
    //check output correctness (words 256 and 257 of each 258-word block hold the cycle counter)
    for(int i=0;i<num;i++){
        for(int j=0;j<BLOCK_SIZE_out_Bytes/sizeof(int);j++){
            if(j%258!=256 && j%258!=257 && doutArray[i][j]!=j+num+1-j/258*2){
                std::cout<<"ERROR:dout["<<i<<"]["<<j<<"]="<<doutArray[i][j]<<std::endl;
                error++;
                break;
            }
        }
    }
    //flush the remaining stalled kernels
    for(int i=num;i<32;i++){
        gr.gmioIn[i].gm2aie_nb(dinArray[i], BLOCK_SIZE_in_Bytes);
        gr.gmioOut[i].aie2gm_nb(doutArray[i], BLOCK_SIZE_out_Bytes);
    }
    gr.wait();
}
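The correctness check in the loop above relies on the output layout: every 258-word output block holds 256 data words followed by two words of cycle counter, so output index j corresponds to input index j - 2*(j/258). The sketch below spells out that mapping in host-only C++; the helper names are hypothetical and the code is an illustration of the index arithmetic, not part of the tutorial sources.

```cpp
#include <cstddef>

constexpr std::size_t kOutWordsPerBlock = 258;  // 256 data words + 2 cycle-counter words

// True when output word j belongs to the cycle counter rather than the data.
constexpr bool is_counter_word(std::size_t j) {
    return j % kOutWordsPerBlock >= 256;
}

// Expected value of data word j when the input was filled with dinArray[i][k] = k + num.
constexpr long expected_data_word(std::size_t j, int num) {
    // Each earlier block contributed two counter words that have no input counterpart,
    // so subtract them to recover the input index, then add num and the kernel's +1.
    return static_cast<long>(j - 2 * (j / kOutWordsPerBlock)) + num + 1;
}
```

For example, output word 258 is the first data word of the second block. It maps to input index 256, so with num equal to 1 its expected value is 256 + 1 + 1 = 258, which agrees with the expression j+num+1-j/258*2 in the host code.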