Design Introduction - 2022.2 English

Vitis Tutorials: AI Engine Development

Document ID: XD100
Release Date: 2022-12-01
Version: 2022.2 English

The graph in this design contains 32 AI Engine kernels, each with one input and one output. The graph therefore connects to 32 AI Engine GMIO inputs and 32 AI Engine GMIO outputs.

Change the working directory to perf_profile_aie_gmio. Take a look at the graph code in aie/graph.h.

static const int col[32]={6,13,14,45,18,42,4,30,48,49,9,16,29,39,40,31,2,3,46,0,43,27,41,26,11,17,47,1,19,10,34,7};

class mygraph: public adf::graph
{
private:
  adf::kernel k[32];

public:
  adf::input_gmio gmioIn[32];
  adf::output_gmio gmioOut[32];

  mygraph()
  {
	for(int i=0;i<32;i++){
		gmioIn[i]=adf::input_gmio::create("gmioIn"+std::to_string(i),/*size_t burst_length*/256,/*size_t bandwidth*/100);
		gmioOut[i]=adf::output_gmio::create("gmioOut"+std::to_string(i),/*size_t burst_length*/256,/*size_t bandwidth*/100);
		k[i] = adf::kernel::create(vec_incr);
		adf::connect<adf::window<1024>>(gmioIn[i].out[0], k[i].in[0]);
		adf::connect<adf::window<1032>>(k[i].out[0], gmioOut[i].in[0]);
		adf::source(k[i]) = "vec_incr.cc";
		adf::runtime<adf::ratio>(k[i])= 1;
		adf::location<adf::kernel>(k[i])=adf::tile(col[i],0);
	}
  };
};

The code above sets an adf::location constraint on each kernel, which saves aiecompiler run time. Note that each kernel has an input window of 1024 bytes and an output window of 1032 bytes; the extra 8 bytes hold the 64-bit cycle counter that the kernel appends to its output.
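As a quick sanity check on the placement table, the following sketch verifies on the host that the 32 column indices are distinct, so that no two kernels are pinned to the same tile in row 0. The names kCol and all_columns_distinct are hypothetical and not part of the ADF API; the array values are copied from graph.h.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Column indices copied from aie/graph.h; kernel i is placed at tile (kCol[i], 0).
constexpr std::array<int, 32> kCol = {6, 13, 14, 45, 18, 42, 4, 30, 48, 49, 9,
                                      16, 29, 39, 40, 31, 2, 3, 46, 0, 43, 27,
                                      41, 26, 11, 17, 47, 1, 19, 10, 34, 7};

// Hypothetical helper: returns true when no column index appears twice.
constexpr bool all_columns_distinct(const std::array<int, 32>& cols) {
    for (std::size_t a = 0; a < cols.size(); ++a)
        for (std::size_t b = a + 1; b < cols.size(); ++b)
            if (cols[a] == cols[b]) return false;
    return true;
}

// Fails at compile time if two kernels would share a tile in row 0.
static_assert(all_columns_distinct(kCol), "two kernels share a tile in row 0");
```

Because each kernel also requests a run-time ratio of 1, it presumably cannot share a core with another kernel, which is why a duplicated column in row 0 would be a placement error.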

Next, examine the kernel code aie/vec_incr.cc. It adds one to each int32 input and also outputs the cycle counter of the AI Engine tile. As introduced later in this tutorial, this counter can be used to calculate the system throughput.

#include <aie_api/aie.hpp>
#include <aie_api/aie_adf.hpp>
#include <aie_api/utils.hpp>

void vec_incr(input_window<int32>* data,output_window<int32>* out){
	aie::vector<int32,16> vec1=aie::broadcast<int32>(1);
	for(int i=0;i<16;i++)
	chess_prepare_for_pipelining
	chess_loop_range(4,)
	{
		aie::vector<int32,16> vdata=window_readincr_v<16>(data);
		aie::vector<int32,16> vresult=aie::add(vdata,vec1);
		window_writeincr(out,vresult);
	}
	aie::tile tile=aie::tile::current();
	unsigned long long time=tile.cycles();//cycle counter of the AI Engine tile
	window_writeincr(out,time);
}
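To make the output layout concrete, here is a minimal host-side model of the kernel, assuming the 64-bit counter fills the last two int32 slots of each 1032-byte output window (the host check later in this section skips exactly those slots). The function model_vec_incr and the constants are hypothetical names for illustration; the cycles argument stands in for tile.cycles().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr int kDataWords = 256;  // 16 loop iterations x 16 int32 lanes = 1024 bytes
constexpr int kOutWords  = 258;  // 256 data words + 2 words for the counter = 1032 bytes

// Hypothetical model of vec_incr over several graph iterations.
// 'in' holds 256 int32 values per iteration; 'cycles' holds one counter value per iteration.
std::vector<int32_t> model_vec_incr(const std::vector<int32_t>& in,
                                    const std::vector<uint64_t>& cycles) {
    assert(in.size() == cycles.size() * kDataWords);
    std::vector<int32_t> out(cycles.size() * kOutWords);
    for (std::size_t it = 0; it < cycles.size(); ++it) {
        // Data part: every input word is incremented by one.
        for (int w = 0; w < kDataWords; ++w)
            out[it * kOutWords + w] = in[it * kDataWords + w] + 1;
        // Counter part: the 8-byte value occupies the last two int32 slots.
        std::memcpy(&out[it * kOutWords + kDataWords], &cycles[it], sizeof(uint64_t));
    }
    return out;
}
```

In this model, output word j (outside the counter slots) corresponds to input word j - 2*(j/258), because every completed iteration adds two counter words to the output stream.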

Next, examine the host code aie/graph.cpp. The concepts introduced in AIE GMIO Programming Model apply here. This section explains new concepts and how performance profiling is done. Some constants defined in the code are as follows:

#if !defined(__AIESIM__) && !defined(__X86SIM__) && !defined(__ADF_FRONTEND__)
const int ITERATION=512;
#else
const int ITERATION=4;
#endif
const int BLOCK_SIZE_in_Bytes=1024*ITERATION;
const int BLOCK_SIZE_out_Bytes=1032*ITERATION;

In the hardware flow, ITERATION is 512; otherwise, it is 4. The smaller value ensures that the AI Engine simulator finishes in a short amount of time.
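The resulting per-port block sizes can be worked out directly from the window sizes. The sketch below reproduces that arithmetic for both flows; block_in_bytes and block_out_bytes are hypothetical helper names, and only the 1024 and 1032 byte window sizes and the iteration counts come from the code above.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kInWindowBytes  = 1024;  // input window size from graph.h
constexpr std::size_t kOutWindowBytes = 1032;  // output window size from graph.h

// Bytes moved through one GMIO port for a given number of graph iterations.
constexpr std::size_t block_in_bytes(std::size_t iterations) {
    return kInWindowBytes * iterations;
}
constexpr std::size_t block_out_bytes(std::size_t iterations) {
    return kOutWindowBytes * iterations;
}

// Hardware flow (ITERATION = 512).
static_assert(block_in_bytes(512) == 524288, "512 KiB in per port");
static_assert(block_out_bytes(512) == 528384, "516 KiB out per port");
// Simulation flows (ITERATION = 4).
static_assert(block_in_bytes(4) == 4096, "4 KiB in per port");
static_assert(block_out_bytes(4) == 4128, "4128 bytes out per port");
```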

In the main function, the PS code profiles num GMIO inputs and outputs, where num takes the values 1, 2, 4, 8, 16, and 32. Non-blocking GMIO APIs (GMIO::gm2aie_nb and GMIO::aie2gm_nb) are used for GMIO transactions, and GMIO::wait is used for output data synchronization. A kernel can finish only after its input and output data have been transferred. Because the graph is started for all the AI Engine kernels but only num of them are profiled, the remaining kernels are flushed after the profiling code by transferring data to and from them.

for(int num=1;num<=32;num*=2){
  //Pre-processing
  for(int i=0;i<32;i++){
    for(int j=0;j<BLOCK_SIZE_in_Bytes/sizeof(int);j++){
      dinArray[i][j]=j+num;
    }
  }
  gr.run(ITERATION);

  //Profile starts here
  for(int i=0;i<num;i++){
    gr.gmioIn[i].gm2aie_nb(dinArray[i], BLOCK_SIZE_in_Bytes);
    gr.gmioOut[i].aie2gm_nb(doutArray[i], BLOCK_SIZE_out_Bytes);
  }
  for(int i=0;i<num;i++){
    gr.gmioOut[i].wait();
  }
  //Profile ends here

  //Check output correctness. Each iteration emits 258 words;
  //words 256 and 257 hold the cycle counter and are skipped.
  for(int i=0;i<num;i++){
    for(int j=0;j<BLOCK_SIZE_out_Bytes/sizeof(int);j++){
      if(j%258!=256 && j%258!=257 && doutArray[i][j]!=j+num+1-j/258*2){
        std::cout<<"ERROR:dout["<<i<<"]["<<j<<"]="<<doutArray[i][j]<<std::endl;
        error++;
        break;
      }
    }
  }

  //Flush the remaining stalled kernels
  for(int i=num;i<32;i++){
    gr.gmioIn[i].gm2aie_nb(dinArray[i], BLOCK_SIZE_in_Bytes);
    gr.gmioOut[i].aie2gm_nb(doutArray[i], BLOCK_SIZE_out_Bytes);
  }
  gr.wait();
}
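One possible way to turn the profiled region into a throughput figure is to time it with a monotonic clock and divide the bytes moved by the elapsed time. The sketch below assumes this approach; throughput_mb_per_s and elapsed_seconds are hypothetical helpers, not ADF or XRT APIs, and the callable stands in for the gm2aie_nb, aie2gm_nb, and wait sequence between the two profile comments.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: bytes moved in 'seconds', reported in MB/s (1 MB = 1e6 bytes).
double throughput_mb_per_s(std::size_t total_bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(total_bytes) / seconds / 1e6;
}

// Hypothetical helper: runs 'region' once and returns its wall-clock duration in seconds.
template <typename F>
double elapsed_seconds(F&& region) {
    const auto start = std::chrono::steady_clock::now();
    region();  // on the PS, this would be the profiled GMIO transfers
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

For a given num, the input bytes covered by the profiled region are num * BLOCK_SIZE_in_Bytes, so the input throughput under these assumptions would be throughput_mb_per_s(num * BLOCK_SIZE_in_Bytes, t), where t is the value returned by elapsed_seconds.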
