Accumulator Registers - 2020.2 English

AI Engine Kernel Coding Best Practices Guide (UG1079)

Document ID: UG1079
Release Date: 2021-02-04
Version: 2020.2 English

The accumulator registers are 384 bits wide and can be viewed as eight vector lanes of 48 bits each. Each lane is sized to hold 32-bit multiplication results and to accumulate them without overflow: the 16 guard bits above the 32-bit result allow up to 2^16 accumulations. The output of fixed-point vector MAC and MUL intrinsic functions is stored in the accumulator registers. The following table shows the set of accumulator registers and how smaller registers are combined to form larger registers.
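The guard-bit headroom can be checked with ordinary 64-bit arithmetic. The following is a minimal scalar sketch, not AI Engine code: the helper fits_in_lane and the constants are hypothetical names, and the model assumes each lane holds a signed two's-complement value.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one 48-bit accumulator lane (hypothetical, not an intrinsic).
// Assumption: the lane stores a signed two's-complement value.
constexpr int kLaneBits = 48;
constexpr int kProductBits = 32;
constexpr int kGuardBits = kLaneBits - kProductBits;  // 16 guard bits

// Range of a signed 48-bit lane.
constexpr int64_t kLaneMax = (int64_t{1} << (kLaneBits - 1)) - 1;
constexpr int64_t kLaneMin = -(int64_t{1} << (kLaneBits - 1));

// Returns true if adding the same 32-bit product `count` times stays
// inside the 48-bit lane. The sum itself is computed in 64 bits.
bool fits_in_lane(int32_t product, int64_t count) {
    assert(count >= 0 && count <= (int64_t{1} << 20));  // keeps the 64-bit sum safe
    const int64_t sum = static_cast<int64_t>(product) * count;
    return sum >= kLaneMin && sum <= kLaneMax;
}
```

In this model, 2^16 copies of the largest or smallest 32-bit value still fit in the lane, while one more accumulation does not.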

Table 1. Accumulator Registers
384-bit         768-bit
aml0, amh0      bm0
aml1, amh1      bm1
aml2, amh2      bm2
aml3, amh3      bm3

The 384-bit accumulator registers are prefixed with the letters 'am'. Each pair of them (for example, aml0 and amh0) is aliased to form one 768-bit register that is prefixed with 'bm' (bm0).

The shift-round-saturate srs() intrinsic is used to move a value from an accumulator register to a vector register with any required shifting and rounding.

v8int32 res = srs(acc, 8); // shift right 8 bits, from accumulator register to vector register
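As a rough illustration of what a shift-round-saturate step does to a single lane, here is a scalar sketch in portable C++. It is not the srs() intrinsic: the helper srs_model is a hypothetical name, and the model assumes truncating (floor) rounding and saturation to the 32-bit range, whereas the hardware behavior depends on the configured modes.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Scalar model of shift-round-saturate for one lane (hypothetical helper).
// Assumptions: rounding is truncation (floor) and saturation is enabled.
int32_t srs_model(int64_t acc, int shift) {
    assert(shift >= 0 && shift < 48);
    // Arithmetic right shift: floors toward negative infinity
    // (guaranteed since C++20, and the behavior of mainstream compilers before).
    const int64_t shifted = acc >> shift;
    const int64_t hi = std::numeric_limits<int32_t>::max();
    const int64_t lo = std::numeric_limits<int32_t>::min();
    if (shifted > hi) return static_cast<int32_t>(hi);  // saturate high
    if (shifted < lo) return static_cast<int32_t>(lo);  // saturate low
    return static_cast<int32_t>(shifted);
}
```

With a shift of 8, the model drops the low eight bits of the accumulator value; values that do not fit in 32 bits after the shift are clamped rather than wrapped.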

The upshift ups() intrinsic is used to move a value from a vector register to an accumulator register with upshifting:

v8acc48 acc = ups(v, 8); //shift left 8 bits, from vector register to accumulator register
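The upshift direction can be sketched for a single lane in the same way. This is a scalar model under stated assumptions, not the ups() intrinsic: ups_model is a hypothetical name, and the shift is limited to 16 so that a 32-bit input always fits in a 48-bit lane.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of an upshift into one 48-bit lane (hypothetical helper).
// Assumption: shift <= 16, so a 32-bit input cannot exceed 48 bits.
int64_t ups_model(int32_t v, int shift) {
    assert(shift >= 0 && shift <= 16);
    // Multiply by 2^shift instead of using <<, which avoids
    // left-shifting a negative value.
    return static_cast<int64_t>(v) * (int64_t{1} << shift);
}
```

Upshifting by 8 scales the value by 256, so dividing the result by 256 recovers the original input in this model.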

The set_rnd() and set_sat() intrinsics are used to set the rounding and saturation mode of the accumulation result, while the clr_rnd() and clr_sat() intrinsics are used to clear the rounding and saturation mode, that is, to truncate the accumulation result.
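To make the effect of these modes concrete, the sketch below compares truncation with one possible round-to-nearest mode, and saturation with keeping only the low 32 bits. It is a scalar model, not the hardware: srs_with_mode and its two flags are hypothetical, the rounding shown (add half an LSB, then shift) is only one of several possible rounding modes, and wrapping when saturation is off is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Scalar model of one lane with selectable modes (hypothetical helper).
// round_nearest == false: truncate (floor). saturate == false: keep low 32 bits.
int32_t srs_with_mode(int64_t acc, int shift, bool round_nearest, bool saturate) {
    assert(shift >= 0 && shift < 48);
    if (round_nearest && shift > 0) {
        acc += int64_t{1} << (shift - 1);  // add half an LSB before shifting
    }
    const int64_t shifted = acc >> shift;  // arithmetic shift on mainstream compilers
    const int64_t hi = std::numeric_limits<int32_t>::max();
    const int64_t lo = std::numeric_limits<int32_t>::min();
    if (saturate) {
        if (shifted > hi) return static_cast<int32_t>(hi);
        if (shifted < lo) return static_cast<int32_t>(lo);
        return static_cast<int32_t>(shifted);
    }
    // No saturation: reinterpret the low 32 bits as a signed value.
    const uint32_t low = static_cast<uint32_t>(shifted);
    if (low <= static_cast<uint32_t>(hi)) return static_cast<int32_t>(low);
    return static_cast<int32_t>(static_cast<int64_t>(low) - (int64_t{1} << 32));
}
```

For an accumulator value of 0x180 and a shift of 8, truncation yields 1 and this rounding mode yields 2. For a value of 2^32 with no shift, the saturating variant returns the largest 32-bit value while the non-saturating variant returns 0.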

Note that the shifting, rounding, and saturation modes take effect only when an operation goes through the shift-round-saturate data path. Some intrinsics use only the vector pre-adder, which has no shifting, rounding, or saturation configuration. Such operations are adds, subs, abs, vector compares, and vector selections or shuffles. To perform a subtraction with shifting, rounding, or saturation applied, it is possible to use a MAC or MUL intrinsic instead. The following code computes the difference between va and vb with a mul intrinsic instead of a sub intrinsic.

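v16cint16 va, vb;
int32 zbuff[8]={1,1,1,1,1,1,1,1}; // coefficients of 1 for all eight lanes
v8int32 coeff=*(v8int32*)zbuff;
// antisymmetric multiply: the difference of va and vb is multiplied by coeff
v8acc48 acc = mul8_antisym(va, 0, 0x76543210, vb, 0, false, coeff, 0 , 0x76543210);
v8int32 res = srs(acc,0); // shift by 0, through the shift-round-saturate data path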
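The idea behind the listing above can be sketched with a simplified scalar model: form the difference in a wide accumulator, multiply by a coefficient of 1, then narrow with saturation. This is not the mul8_antisym() intrinsic. The names Lanes32 and sub_via_mul_model are hypothetical, and the model uses eight real 32-bit lanes for simplicity, which differs from the data types in the listing.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <limits>

constexpr int kLanes = 8;
using Lanes32 = std::array<int32_t, kLanes>;

// Scalar model (hypothetical): acc[i] = (a[i] - b[i]) * 1, then shift and
// saturate to 32 bits. Assumptions: real 32-bit lanes, truncating shift,
// saturation enabled.
Lanes32 sub_via_mul_model(const Lanes32& a, const Lanes32& b, int shift) {
    assert(shift >= 0 && shift < 48);
    const int64_t coeff = 1;  // mirrors the all-ones coefficient buffer
    const int64_t hi = std::numeric_limits<int32_t>::max();
    const int64_t lo = std::numeric_limits<int32_t>::min();
    Lanes32 out{};
    for (int i = 0; i < kLanes; ++i) {
        // The difference needs up to 33 bits, so it is formed in 64 bits.
        const int64_t diff = static_cast<int64_t>(a[i]) - static_cast<int64_t>(b[i]);
        const int64_t acc = diff * coeff;
        int64_t shifted = acc >> shift;  // arithmetic shift on mainstream compilers
        if (shifted > hi) shifted = hi;  // saturate instead of wrapping
        if (shifted < lo) shifted = lo;
        out[i] = static_cast<int32_t>(shifted);
    }
    return out;
}
```

In this model, a difference that does not fit in 32 bits is clamped to the nearest representable value, which a plain 32-bit subtraction cannot do because it has no saturation step.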

Floating-point intrinsic functions do not have separate accumulation registers and instead return their results in a vector register.