Vectorization - 2021.2 English

AI Engine Kernel Coding Best Practices Guide (UG1079)

Document ID: UG1079

Release Date: 2021-11-10

Version: 2021.2 English

The scalar reference code for this matrix multiplication example is shown below. Note that the data is stored in column-major order, so consecutive elements of a column are adjacent in memory.

void matmul_mat8_scalar(input_window_int16* matA,
		input_window_int16* matB,
		output_window_int16* matC){

	for(int i=0; i<M; i++){//M=64
		for(int j=0;j<L;j++){//L=2
			int temp = 0 ;
			for(int k=0; k<N; k++){//N=8
				temp += window_read(matA)*window_readincr(matB);//B is circular buffer, size N*L
				window_incr(matA,64); //Jump of 64 elements to access the next element of the same row
			}
			window_write(matC,(int16_t)(temp>>15)) ;
			window_incr(matC,64); //Jump to the next column
		}
		window_incr(matA,1); //Jump of one element for moving to the next row.
		window_incr(matC,1); //Jump to the next row
	}
}
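To make the column-major indexing easier to follow, here is a minimal host-side sketch of the same computation on plain arrays instead of windows. It is not AI Engine code: the names `matmul_ref` and `cm` are hypothetical, and the sizes M=64, N=8, and L=2 are taken from the comments in the kernel above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Dimensions taken from the comments in the scalar kernel (assumed fixed here).
constexpr int M = 64; // rows of A and C
constexpr int N = 8;  // columns of A, rows of B
constexpr int L = 2;  // columns of B and C

// Column-major index of element (row, col) in a matrix with `rows` rows.
// `cm` is a hypothetical helper, not part of any AI Engine API.
constexpr int cm(int row, int col, int rows) { return col * rows + row; }

// Plain-array version of the scalar reference: C = (A * B) >> 15.
std::vector<int16_t> matmul_ref(const std::vector<int16_t>& A,
                                const std::vector<int16_t>& B) {
    assert(A.size() == std::size_t(M * N));
    assert(B.size() == std::size_t(N * L));
    std::vector<int16_t> C(M * L);
    for (int i = 0; i < M; i++) {         // output row
        for (int j = 0; j < L; j++) {     // output column
            int temp = 0;
            for (int k = 0; k < N; k++) {
                // A(i,k) is 64 elements away from A(i,k-1), matching the
                // window_incr(matA,64) step in the kernel.
                temp += A[cm(i, k, M)] * B[cm(k, j, N)];
            }
            C[cm(i, j, M)] = static_cast<int16_t>(temp >> 15);
        }
    }
    return C;
}
```

Calling `matmul_ref(A, B)` with a 512-element `A` and a 16-element `B` returns the 128-element `C`, also in column-major order.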

As analyzed in the previous example, the mac16 intrinsic is the best choice for computing 16 lanes together, because 16 int16 values from a column can be loaded at once. Each mac16 operation consumes two columns of A, so computing 16 output values in one column takes four mac16 operations (N=8). The same data in vector "a" is used twice, once for each of the two output columns. Thus, each step loads two columns of data and issues two mac16 operations, one accumulating into each output column. These two loads and two MACs are repeated four times to get the results for the two output columns. This method is shown in the following pseudo-code.

C_[0:15,0] = A_[0:15,0:1]*B_[0:1,0] 
C_[0:15,1] = A_[0:15,0:1]*B_[0:1,1] 

C_[0:15,0]+= A_[0:15,2:3]*B_[2:3,0] 
C_[0:15,1]+= A_[0:15,2:3]*B_[2:3,1]
 
C_[0:15,0]+= A_[0:15,4:5]*B_[4:5,0] 
C_[0:15,1]+= A_[0:15,4:5]*B_[4:5,1]

C_[0:15,0]+= A_[0:15,6:7]*B_[6:7,0] 
C_[0:15,1]+= A_[0:15,6:7]*B_[6:7,1]

In the previous code, each "*" denotes a MAC operation. C_[0:15,0] and C_[0:15,1] denote the two output columns, which are accumulated separately. A_[0:15,0:1] denotes columns 0 and 1 of A, with 16 elements per column. B_[0:1,0] denotes the first two elements (rows 0 and 1) of column 0 of B. In the real vectorized code, this sequence is placed inside a loop because there are 64 output rows. The mac16 intrinsic function to be used has the following interface.
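The schedule described by the pseudo-code can be checked with a plain C++ emulation that uses ordinary loops instead of intrinsics. This is only a sketch under stated assumptions: `matmul_blocked` and `LANES` are hypothetical names, two 16-element integer arrays stand in for the two accumulators, and the final `>> 15` truncates as in the scalar reference, which may differ from the rounding and saturation applied on the device.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int M = 64, N = 8, L = 2; // sizes from the scalar reference
constexpr int LANES = 16;           // output rows produced per block (hypothetical name)

// Emulates the blocked schedule with plain loops (no AI Engine intrinsics).
// All matrices are column-major: A(i,k) = A[k*M + i], B(k,j) = B[j*N + k].
std::vector<int16_t> matmul_blocked(const std::vector<int16_t>& A,
                                    const std::vector<int16_t>& B) {
    assert(A.size() == std::size_t(M * N));
    assert(B.size() == std::size_t(N * L));
    std::vector<int16_t> C(M * L);
    for (int r0 = 0; r0 < M; r0 += LANES) {        // one block of 16 output rows
        std::array<int64_t, LANES> acc0{}, acc1{}; // stand-ins for two accumulators
        for (int k0 = 0; k0 < N; k0 += 2) {        // four steps, two columns of A each
            for (int lane = 0; lane < LANES; lane++) {
                for (int k = k0; k < k0 + 2; k++) {
                    int a = A[k * M + r0 + lane];  // same A data feeds both columns
                    acc0[lane] += a * B[0 * N + k]; // C_[..,0] += A * B_[..,0]
                    acc1[lane] += a * B[1 * N + k]; // C_[..,1] += A * B_[..,1]
                }
            }
        }
        for (int lane = 0; lane < LANES; lane++) { // truncating shift, as in the scalar code
            C[0 * M + r0 + lane] = static_cast<int16_t>(acc0[lane] >> 15);
            C[1 * M + r0 + lane] = static_cast<int16_t>(acc1[lane] >> 15);
        }
    }
    return C;
}
```

With M=64 and 16 lanes, the outer loop runs four times, and each iteration performs the four pseudo-code steps for both output columns.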

v16acc48 mac16(v16acc48     acc,
               v64int16     xbuff,
               int          xstart,
               unsigned int xoffsets,
               unsigned int xoffsets_hi,
               unsigned int xsquare,
               v16int16     zbuff,
               int          zstart,
               unsigned int zoffsets,
               unsigned int zoffsets_hi,
               int          zstep)

Each buffer argument is accompanied by parameters (start, offsets, square, and step) that determine how the lanes index into the buffers (vector registers). For details about the lane addressing scheme with these parameters, see MAC Intrinsics.

Note that this mac16 intrinsic function prototype differs from the one introduced in the previous matrix-vector multiplication example. Here, xbuff is a v64int16, which allows two sets of data to be stored and used in an interleaved way.

The following section shows the code written with MAC intrinsics.