Design Introduction - 2023.2 English

Vitis Tutorials: AI Engine (XD100)

Document ID: XD100
Release Date: 2024-03-05
Version: 2023.2 English

The graph in this design contains four AI Engine kernels, each with one input and one output. Four AI Engine GMIO inputs and four AI Engine GMIO outputs are therefore connected to the graph.

Change the working directory to perf_profile_aie_gmio. Take a look at the graph code in aie/graph.h.

static const int col[8]={2,6,10,18,26,34,42,46};
static const int NUM=4;

class topgraph: public adf::graph
	{
	public:
		adf::kernel k[NUM];
		adf::input_gmio gmioIn[NUM];	
		adf::output_gmio gmioOut[NUM];
		
		topgraph(){
			for(int i=0;i<NUM;i++){
				k[i] = adf::kernel::create(vec_incr);
				adf::source(k[i]) = "vec_incr.cc";
				adf::runtime<adf::ratio>(k[i])= 1;
				gmioIn[i]=adf::input_gmio::create("gmioIn"+std::to_string(i),/*size_t burst_length*/256,/*size_t bandwidth*/100);
				gmioOut[i]=adf::output_gmio::create("gmioOut"+std::to_string(i),/*size_t burst_length*/256,/*size_t bandwidth*/100);
				adf::connect<>(gmioIn[i].out[0], k[i].in[0]);	
				adf::connect<>(k[i].out[0], gmioOut[i].in[0]);
	
				adf::location<adf::kernel>(k[i])=adf::tile(col[i],0);
				location<GMIO>(gmioIn[i]) = location<kernel>(k[i]) + relative_offset({.col_offset=0});	
				location<GMIO>(gmioOut[i]) = location<kernel>(k[i]) + relative_offset({.col_offset=1});
			}
		}
	};

The preceding code sets an adf::location constraint for each kernel and places each GMIO input and GMIO output relative to its kernel. Because the GMIO ports are placed in different columns, the performance counters do not run out when all ports are profiled with the event API at the same time.
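The column arithmetic behind these constraints can be checked with a small host-side sketch. It is plain C++, not ADF code, and it assumes that a GMIO port lands in the kernel's column plus its col_offset. The names kCol, gmio_in_column, gmio_out_column, and gmio_columns_are_distinct are hypothetical helpers introduced only for illustration.

```cpp
#include <array>
#include <cstddef>
#include <set>

// Kernel columns copied from graph.h; only the first kNum entries are used.
constexpr std::array<int, 8> kCol = {2, 6, 10, 18, 26, 34, 42, 46};
constexpr int kNum = 4;

// Assumed placement model: port column = kernel column + col_offset.
constexpr int gmio_in_column(int i)  { return kCol[i] + 0; }
constexpr int gmio_out_column(int i) { return kCol[i] + 1; }

// Returns true when all 2 * kNum GMIO ports land in distinct columns.
inline bool gmio_columns_are_distinct() {
    std::set<int> used;
    for (int i = 0; i < kNum; ++i) {
        used.insert(gmio_in_column(i));
        used.insert(gmio_out_column(i));
    }
    return used.size() == static_cast<std::size_t>(2 * kNum);
}
```

Under this model the eight ports occupy columns 2, 3, 6, 7, 10, 11, 18, and 19, so no two ports share a column.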

Next, examine the kernel code aie/vec_incr.cc. It increments each int32 input by one and appends the 64-bit cycle counter of the AI Engine tile to the output as two 32-bit words, which is why the output buffer holds 258 words for 256 inputs. As introduced later, this counter can be used to calculate the system throughput.

using namespace adf;
void vec_incr(input_buffer<int32,extents<256>>& __restrict data,output_buffer<int32,extents<258>>& __restrict out){
	auto inIter=aie::begin_vector<16>(data);
	auto outIter=aie::begin_vector<16>(out);
	aie::vector<int32,16> vec1=aie::broadcast<int32>(1);
	for(int i=0;i<16;i++)
	chess_prepare_for_pipelining
	{
		aie::vector<int32,16> vdata=*inIter++;
		aie::vector<int32,16> vresult=aie::add(vdata,vec1);
		*outIter++=vresult;
	}
	aie::tile tile=aie::tile::current();
	unsigned long long time=tile.cycles();//cycle counter of the AI Engine tile
	decltype(aie::begin(out)) p=*(decltype(aie::begin(out))*)&outIter;
	*p++=time&0xffffffff;
	*p++=(time>>32)&0xffffffff;
}
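The layout of one output block can be illustrated with a host-side model. This is a sketch in plain C++ rather than AI Engine code: vec_incr_model and read_cycles are hypothetical names, and the cycles argument stands in for the value that tile.cycles() returns on the device.

```cpp
#include <array>
#include <cstdint>

constexpr int kInWords = 256;   // extents of the input buffer
constexpr int kOutWords = 258;  // 256 results plus two counter words

// Models one kernel iteration: add one to each input, then append the counter.
inline std::array<std::int32_t, kOutWords> vec_incr_model(
        const std::array<std::int32_t, kInWords>& in, std::uint64_t cycles) {
    std::array<std::int32_t, kOutWords> out{};
    for (int i = 0; i < kInWords; ++i) {
        out[i] = in[i] + 1;
    }
    // Low 32 bits first, then high 32 bits, as in the kernel.
    out[256] = static_cast<std::int32_t>(static_cast<std::uint32_t>(cycles & 0xffffffffu));
    out[257] = static_cast<std::int32_t>(static_cast<std::uint32_t>(cycles >> 32));
    return out;
}

// Rebuilds the 64-bit counter from the last two words of an output block.
inline std::uint64_t read_cycles(const std::array<std::int32_t, kOutWords>& out) {
    const std::uint64_t lo = static_cast<std::uint32_t>(out[256]);
    const std::uint64_t hi = static_cast<std::uint32_t>(out[257]);
    return (hi << 32) | lo;
}
```

Going through std::uint32_t before widening avoids sign extension when the low word has its top bit set.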

Next, examine the host code sw/host.cpp. The concepts introduced in AIE GMIO Programming Model apply here. This section explains new concepts and how performance profiling is done. Some constants defined in the code are as follows:

const int NUM=4;
int ITERATION=8192;
char* emu_mode = getenv("XCL_EMULATION_MODE");
if (emu_mode != nullptr) {
	ITERATION=4;
}
const int BLOCK_SIZE_in_Bytes=1024*ITERATION;
const int BLOCK_SIZE_out_Bytes=1032*ITERATION;

In the hardware flow, ITERATION is 8192. When the XCL_EMULATION_MODE environment variable is set, ITERATION is reduced to 4 so that simulation completes quickly.
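The factors 1024 and 1032 follow from the kernel buffer sizes: 256 int32 words in and 258 int32 words out per iteration. A minimal sketch of that arithmetic, with hypothetical helper names block_size_in_bytes and block_size_out_bytes:

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kWordsIn = 256;   // input_buffer extents in vec_incr
constexpr std::size_t kWordsOut = 258;  // output_buffer extents in vec_incr

constexpr std::size_t kBytesInPerIter = kWordsIn * sizeof(std::int32_t);
constexpr std::size_t kBytesOutPerIter = kWordsOut * sizeof(std::int32_t);

static_assert(kBytesInPerIter == 1024, "256 int32 words per input block");
static_assert(kBytesOutPerIter == 1032, "258 int32 words per output block");

// Total bytes moved through one GMIO port for a given iteration count.
constexpr std::size_t block_size_in_bytes(std::size_t iteration) {
    return kBytesInPerIter * iteration;
}
constexpr std::size_t block_size_out_bytes(std::size_t iteration) {
    return kBytesOutPerIter * iteration;
}
```

With ITERATION set to 8192, each input port transfers 8388608 bytes and each output port transfers 8454144 bytes. With ITERATION set to 4, the sizes are 4096 and 4128 bytes.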

In the main function, the PS code will profile NUM GMIO inputs and outputs, where NUM is 4. Non-blocking GMIO APIs (GMIO::gm2aie_nb and GMIO::aie2gm_nb) are used for GMIO transactions, and GMIO::wait is used for output data synchronization.

//Pre-processing
......

//start graph and GMIO output ports first
gr.run(ITERATION);
for(int i=0;i<NUM;i++){
	gr.gmioOut[i].aie2gm_nb(doutArray[i], BLOCK_SIZE_out_Bytes);
}

//Profile starts here
......

//start GMIO inputs and wait for GMIO outputs to complete
for(int i=0;i<NUM;i++){
	gr.gmioIn[i].gm2aie_nb(dinArray[i], BLOCK_SIZE_in_Bytes);
}
for(int i=0;i<NUM;i++){
	gr.gmioOut[i].wait();
}

//Profile ends here
......

//check output correctness 
......
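The elided steps are not reproduced here. As a rough illustration only, the following sketch shows one way a host could check an output buffer laid out as described above and derive a throughput figure from the embedded cycle counters. It is not the tutorial's host code: the std::vector buffers stand in for dinArray[i] and doutArray[i], and block_cycles, output_is_correct, and input_bytes_per_cycle are hypothetical helpers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kDataWords = 256;   // data words per output block
constexpr std::size_t kBlockWords = 258;  // data words plus two counter words

// Reads the 64-bit cycle counter stored in the last two words of block b.
inline std::uint64_t block_cycles(const std::vector<std::int32_t>& dout, std::size_t b) {
    const std::size_t base = b * kBlockWords + kDataWords;
    const std::uint64_t lo = static_cast<std::uint32_t>(dout[base]);
    const std::uint64_t hi = static_cast<std::uint32_t>(dout[base + 1]);
    return (hi << 32) | lo;
}

// Returns true when every data word equals the matching input word plus one.
inline bool output_is_correct(const std::vector<std::int32_t>& din,
                              const std::vector<std::int32_t>& dout,
                              std::size_t iterations) {
    if (din.size() < iterations * kDataWords || dout.size() < iterations * kBlockWords) {
        return false;
    }
    for (std::size_t b = 0; b < iterations; ++b) {
        for (std::size_t i = 0; i < kDataWords; ++i) {
            if (dout[b * kBlockWords + i] != din[b * kDataWords + i] + 1) {
                return false;
            }
        }
    }
    return true;
}

// Input bytes processed per AI Engine cycle between the first and last counter.
inline double input_bytes_per_cycle(const std::vector<std::int32_t>& dout,
                                    std::size_t iterations) {
    if (iterations < 2) return 0.0;
    const std::uint64_t first = block_cycles(dout, 0);
    const std::uint64_t last = block_cycles(dout, iterations - 1);
    if (last <= first) return 0.0;
    // iterations - 1 blocks complete between the two timestamps.
    const double bytes =
        static_cast<double>((iterations - 1) * kDataWords * sizeof(std::int32_t));
    return bytes / static_cast<double>(last - first);
}
```

Multiplying the result by the AI Engine clock frequency of the target platform, which is not stated in this section, converts it to bytes per second.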