Vector Registers - 2020.2 English

AI Engine Kernel Coding Best Practices Guide (UG1079)

Document ID: UG1079
Release Date: 2021-02-04
Version: 2020.2 English

All vector intrinsic functions require their operands to be present in the AI Engine vector registers. The following table shows the set of vector registers and how smaller registers are combined to form larger registers.

Table 1. Vector Registers
128-bit       256-bit   512-bit   1024-bit   1024-bit
vrl0, vrh0    wr0       xa        ya         N/A
vrl1, vrh1    wr1       xa        ya         N/A
vrl2, vrh2    wr2       xb        ya         yd (MSBs)
vrl3, vrh3    wr3       xb        ya         yd (MSBs)
vcl0, vch0    wc0       xc        N/A        N/A
vcl1, vch1    wc1       xc        N/A        N/A
vdl0, vdh0    wd0       xd        N/A        yd (LSBs)
vdl1, vdh1    wd1       xd        N/A        yd (LSBs)

The underlying hardware registers are 128 bits wide and prefixed with the letter v. Two v registers can be grouped to form a 256-bit register prefixed with w. The wr, wc, and wd registers are grouped in pairs to form the 512-bit registers xa, xb, xc, and xd. xa and xb form the 1024-bit ya register, while xd and xb form the 1024-bit yd register. The xb register is therefore shared between ya and yd, and it holds the most significant bits (MSBs) of both.
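The grouping in Table 1 can be sketched as a small host-side lookup in standard C++. This is only an illustrative model, not AI Engine code: the function names `register_groups`, `expand_to_128bit`, and `registers_overlap` are hypothetical, and the low-to-high ordering inside each group is an assumption inferred from the l/h register names and the LSBs/MSBs notes in the table.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical host-side model of Table 1 (not an AI Engine API).
// Each grouped register maps to its two halves, assumed low half first.
const std::map<std::string, std::vector<std::string>>& register_groups() {
    static const std::map<std::string, std::vector<std::string>> groups = {
        {"wr0", {"vrl0", "vrh0"}}, {"wr1", {"vrl1", "vrh1"}},
        {"wr2", {"vrl2", "vrh2"}}, {"wr3", {"vrl3", "vrh3"}},
        {"wc0", {"vcl0", "vch0"}}, {"wc1", {"vcl1", "vch1"}},
        {"wd0", {"vdl0", "vdh0"}}, {"wd1", {"vdl1", "vdh1"}},
        {"xa", {"wr0", "wr1"}},    {"xb", {"wr2", "wr3"}},
        {"xc", {"wc0", "wc1"}},    {"xd", {"wd0", "wd1"}},
        {"ya", {"xa", "xb"}},      // xb holds the MSBs of ya
        {"yd", {"xd", "xb"}},      // xb holds the MSBs of yd
    };
    return groups;
}

// Expand a register name into the 128-bit v registers it is built from.
// Names that are not in the map are returned unchanged.
std::vector<std::string> expand_to_128bit(const std::string& name) {
    const auto& groups = register_groups();
    const auto it = groups.find(name);
    if (it == groups.end()) return {name};
    std::vector<std::string> out;
    for (const auto& half : it->second) {
        const auto sub = expand_to_128bit(half);
        out.insert(out.end(), sub.begin(), sub.end());
    }
    return out;
}

// Two registers overlap when they share at least one 128-bit register.
bool registers_overlap(const std::string& a, const std::string& b) {
    const auto va = expand_to_128bit(a);
    const auto vb = expand_to_128bit(b);
    return std::any_of(va.begin(), va.end(), [&vb](const std::string& r) {
        return std::find(vb.begin(), vb.end(), r) != vb.end();
    });
}
```

In this model, `registers_overlap("ya", "yd")` is true because both contain xb, while `registers_overlap("xa", "xd")` is false. That is the practical consequence of the shared xb register: data held in ya and yd cannot both be live at full width at the same time.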

The vector register name can be used with the chess_storage directive to force vector data to be stored in a particular vector register. For example:

v8int32 chess_storage(wr0) bufA;
v8int32 chess_storage(WR) bufB;

An upper-case name in the chess_storage directive refers to a register file, so the compiler can choose any register in that file (for example, any of the four wr registers). A lower-case name refers to one particular register (for example, wr0 in the previous code example).

Vector registers are a valuable resource. If the compiler runs out of available vector registers during code generation, then it generates code to spill the register contents into local memory and read the contents back when needed. This consumes extra clock cycles.
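A rough way to reason about register pressure is to compare the total width of the vectors that are live at the same time with the capacity implied by Table 1 (16 registers of 128 bits each). The following standard C++ sketch does only that arithmetic. It is a deliberately simplified, hypothetical model: the names `live_bits` and `exceeds_vector_capacity` are invented for illustration, and the real compiler also has to respect which register each intrinsic operand must use, so spilling can occur well before this bound is reached.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Capacity implied by Table 1: 8 vr + 4 vc + 4 vd registers, 128 bits each.
constexpr int kRegisterBits = 128;
constexpr int kRegisterCount = 16;
constexpr int kTotalVectorBits = kRegisterBits * kRegisterCount;  // 2048

// Sum of the widths (in bits) of all vectors that are live at once.
int live_bits(const std::vector<int>& widths_in_bits) {
    return std::accumulate(widths_in_bits.begin(), widths_in_bits.end(), 0);
}

// True when the live vectors cannot all fit in the vector registers,
// so at least one of them would have to be spilled to local memory.
// This is an optimistic bound, not a prediction of compiler behavior.
bool exceeds_vector_capacity(const std::vector<int>& widths_in_bits) {
    return live_bits(widths_in_bits) > kTotalVectorBits;
}
```

For example, four live 512-bit vectors fill the 2048 bits exactly, and adding one more 256-bit vector pushes the model over capacity.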

The kernel microcode shows the names of the vector registers used by vector load/store and other vector-based instructions. This microcode is available in the disassembly view in Vitis IDE. For additional details on Vitis IDE usage, see Using Vitis IDE and Reports.

Many intrinsic functions accept only specific vector data types, even when not all values in the vector are required. For example, certain intrinsic functions accept only 512-bit vectors. If the kernel code has smaller data, one technique is to use the concat() intrinsic to concatenate the smaller data with an undefined vector (a vector whose type is defined but whose contents are not initialized).

For example, the lmul8 intrinsic only accepts a v16int32 or v32int32 vector for its xbuff parameter. The intrinsic prototype is:


v8acc80 lmul8(v16int32     xbuff,
              int          xstart,
              unsigned int xoffsets,
              v8int32      zbuff,
              int          zstart,
              unsigned int zoffsets)

The xbuff parameter expects a 16-element vector (v16int32). In the following example, rva is an eight-element vector (v8int32). The concat() intrinsic is used to widen it to a 16-element vector. After concatenation, the lower half of the 16-element vector holds the contents of rva. The upper half is uninitialized because it comes from the undefined v8int32 vector.

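int32 a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
v8int32 rva = *((v8int32*)a);  // read the eight int32 values as one 256-bit vector
// rvb is a v8int32 defined elsewhere in the kernel.
// Lower eight lanes of xbuff hold rva; the upper eight lanes are undefined.
v8acc80 acc = lmul8(concat(rva, undef_v8int32()), 0, 0x76543210, rvb, 0, 0x76543210);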
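To see why the undefined upper half is harmless in this call, the following host-side model in standard C++ tracks which lanes are defined. It is a sketch, not AI Engine code: `Lane`, `concat_with_undef`, and `model_lmul8` are hypothetical names, and the lane rule used here (output lane i reads element start + nibble i of the offsets word) is an assumption made for illustration, not the documented behavior of lmul8.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Host-side model only: a lane carries a value and a "defined" flag.
struct Lane {
    std::int32_t value = 0;
    bool defined = false;
};

// Models concat(rva, undef_v8int32()): lanes 0-7 come from lo,
// lanes 8-15 stay undefined.
std::array<Lane, 16> concat_with_undef(const std::array<std::int32_t, 8>& lo) {
    std::array<Lane, 16> out{};  // every lane starts undefined
    for (std::size_t i = 0; i < lo.size(); ++i) {
        out[i].value = lo[i];
        out[i].defined = true;
    }
    return out;
}

// ASSUMED lane rule: output lane i multiplies x[xstart + nibble i of xoffsets]
// by z[zstart + nibble i of zoffsets]. Throws if an undefined x lane is read.
std::array<std::int64_t, 8> model_lmul8(const std::array<Lane, 16>& x,
                                        unsigned xstart, std::uint32_t xoffsets,
                                        const std::array<std::int32_t, 8>& z,
                                        unsigned zstart, std::uint32_t zoffsets) {
    std::array<std::int64_t, 8> acc{};
    for (unsigned i = 0; i < 8; ++i) {
        const std::size_t xi = (xstart + ((xoffsets >> (4 * i)) & 0xFu)) % x.size();
        const std::size_t zi = (zstart + ((zoffsets >> (4 * i)) & 0xFu)) % z.size();
        if (!x[xi].defined) throw std::logic_error("selected an undefined lane");
        acc[i] = static_cast<std::int64_t>(x[xi].value) * z[zi];
    }
    return acc;
}
```

Under this assumed rule, a start of 0 with offsets 0x76543210 selects only lanes 0 to 7, so the model never touches the undefined half. Changing the start to 8 makes the model throw, which is the situation to avoid when padding with an undefined vector.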

For more information about how vector-based intrinsic functions work, refer to Vector Register Lane Permutations.