Data Communication via AXI4-Stream Interconnect - 2021.1 English

AI Engine Kernel Coding Best Practices Guide (UG1079)

Document ID
UG1079
Release Date
2021-07-19
Version
2021.1 English

AI Engines can directly communicate through the AXI4-Stream interconnect without any DMA and memory interaction. Data can be sent from one AI Engine to another or broadcast through the streaming interface. The data bandwidth of a streaming connection is 32-bit per cycle and built-in handshake and backpressure mechanisms are available.

For streaming input and output interfaces, when the performance is limited by the stream number, the AI Engine is able to use two streaming inputs or two streaming outputs in parallel, instead of one streaming input or output. To use two parallel streams, it is recommended to use the following pairs of macros, where idx1 and idx2 are the two streams. Add the __restrict keyword to stream ports to ensure they are optimized for parallel processing.

READINCR(SS_rsrc1, idx1) and READINCR(SS_rsrc2, idx2) 
READINCRW(WSS_rsrc1, idx1) and READINCRW(WSS_rsrc2, idx2) 
WRITEINCR(MS_rsrc1, idx1, val) and WRITEINCR(MS_rsrc2, idx2, val) 
WRITEINCRW(WMS_rsrc1, idx1, val) and WRITEINCRW(WMS_rsrc2, idx2, val)

Following is a sample code to use two parallel input streams to achieve pipelining with interval 1. Interval 1 means that two read, one write, and one add are in every cycle.

void simple(   input_stream_int32 * __restrict data0,  
			input_stream_int32 * __restrict data1,   			
			output_stream_int32 * __restrict out) { 
	for(int i=0; i<1024; i++) 
	chess_prepare_for_pipelining 
	{ 
		int32_t d = READINCR(SS_rsrc1, data0) ; 
		int32_t e = READINCR(SS_rsrc2, data1) ; 
		WRITEINCR(MS_rsrc1,out,d+e); 
	} 
}

Intrinsics can be used to perform stream operations directly, but it is important to not swap the two streams based on the mapping AI Engine tools has found.

v16float off = *(v16float*)offset; 
v8float v8in = undef_v8float(); 
v8float v8out = undef_v8float(); 
for(int i=0;i<128/4;i++) 
chess_prepare_for_pipelining 
{ 
	v8in = concat(getf_wss(0),getf_wss(1)); // reads 8 float values 
	v8out = fpadd(v8in,off,0,0); 
	put_wms(0,ext_v(v8out,0)); 
	put_wms(1,ext_v(v8out,1)); 
} 

The stream connection can be unicast or multicast. Note that in the case of multicast communication, the data is sent to all the destination ports at the same time and only when all destinations are ready to receive data.