Fixed-Point Vector Unit

The fixed-point vector unit contains three separate and largely independent data paths.

Multiply Accumulator (MAC) Path: The main multiplication path reads values from the vector registers, permutes them in a user-controllable manner, performs optional pre-adding, multiplies them, and, after some post-adding, adds the result to the previous value of the accumulator register.
Upshift Path: This path runs in parallel with the MAC path. It reads data from the permute units in the MAC path or from the vector registers, left-shifts it, and feeds it to the accumulator registers.
Shift-Round-Saturate (SRS) Path: This path reads from the accumulator registers and stores to the vector registers or the data memory. It is needed because the accumulators are 48 or 80 bits wide per lane, whereas the vector registers and the data memory use power-of-two widths of 8, 16, 32, or 64 bits. Therefore, the data needs to be right-shifted on a lane-by-lane basis.

The SRS unit uses the saturation and rounding control register MD, with its fields Q and R, to influence its behavior, and the status register MC to provide information back to the environment. The shift control register S determines the shift amount. The unit supports various rounding modes based on the value of the field R in register MD. If R is set to 0, the value is truncated on the LSB side. If R is set to 1, ceiling behavior is achieved, which means that there is no actual rounding. For R = 2 to 7, the modes are PosInf, NegInf, SymInf, SymZero, ConvEven, and ConvOdd, respectively.
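The shift, round, and saturate steps for a single accumulator lane can be sketched in Python as follows. This is an illustrative model only, not the hardware definition. The function name `srs_lane`, the `out_bits` parameter, and the `saturate` flag (a hypothetical stand-in for the saturation control in MD) are assumptions. The tie-breaking behavior of modes 2 to 7 is inferred from the mode names: round to nearest, with ties going toward positive infinity, negative infinity, away from zero, toward zero, to the even value, or to the odd value.

```python
# Assumed numbering of the R field; the mode names follow the text.
FLOOR, CEIL, POS_INF, NEG_INF, SYM_INF, SYM_ZERO, CONV_EVEN, CONV_ODD = range(8)


def srs_lane(acc, shift, r_mode, out_bits=16, saturate=True):
    """Right-shift one accumulator lane, round per r_mode, then clamp or wrap."""
    if shift < 0:
        raise ValueError("this sketch only models right shifts")

    floor_q = acc >> shift            # arithmetic shift: rounds toward -infinity
    rem = acc & ((1 << shift) - 1)    # discarded LSBs, always non-negative
    half = (1 << shift) >> 1          # value of the discarded MSB (0 if shift == 0)
    up = floor_q + 1

    if rem == 0 or r_mode == FLOOR:
        q = floor_q                   # exact result, or plain truncation
    elif r_mode == CEIL:
        q = up                        # any discarded bit set: go up
    elif rem != half:
        q = up if rem > half else floor_q   # not a tie: nearest value wins
    else:
        # Exact tie: the mode decides (assumed semantics, see lead-in).
        round_up = {
            POS_INF: True,
            NEG_INF: False,
            SYM_INF: floor_q >= 0,        # away from zero
            SYM_ZERO: floor_q < 0,        # toward zero
            CONV_EVEN: floor_q % 2 != 0,  # land on the even neighbor
            CONV_ODD: floor_q % 2 == 0,   # land on the odd neighbor
        }[r_mode]
        q = up if round_up else floor_q

    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    if saturate:
        return max(lo, min(hi, q))    # clamp to the signed output range
    q &= (1 << out_bits) - 1          # otherwise wrap (two's complement)
    return q - (1 << out_bits) if q > hi else q
```

For example, `srs_lane(5, 1, CONV_EVEN)` models 5/2 = 2.5 and returns 2, while `srs_lane(5, 1, CONV_ODD)` returns 3.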

The following figure is the pipeline diagram of the main multiplication and upshift paths. After the instruction decode stage (ID), the six execute stages are numbered E1 to E6. The dark gray boxes, which always cross two stages, are registers. The light gray boxes, which can span multiple stages, are the functional units. The white box represents hardware registers that are internal to the processor description. The arrows connecting the boxes are nML transitories, which are pure non-storing wires. In addition to the elements shown in the diagram, there are multiplexers that realize different connectivity depending on the instruction being executed. The to UPS unit implies a multiplexer that selects among the three permute units and the VD register. There is an internal unit that reads the inputs and pre-adds two values before outputting the data to the UPS unit.

Figure 1. Pipeline Diagram of AI Engine Fixed-point Vector Unit Multiplication and Upshift Paths

The following table shows the functional units in the main multiplication path. The pre-adder unit PRA can also be used for some elementary vector functions, such as determining the minimum or maximum of two vectors or comparing two vectors. In these cases, the PRA is configured for subtraction and the sign bit of the result is checked. For MIN and MAX, the sign bit chooses which input is selected. For a pure vector comparison, the sign bit is written to register R.

Table 1. Functional Units in the Multiplication Path
Functional Unit	Description
Permute Units
PMXL	Permutes the data from the vector registers for the left input of the pre-adder PRA.
PMXR	Permutes the data from the vector registers for the right input of the pre-adder PRA or alternatively for the input of the YMX unit.
PMC	Permutes the data from the vector registers for the input of the YMX unit.
Pre-adder Units
PRA	Pre-adds or pre-subtracts the PMXL and PMXR outputs to form the first multiplier argument. Additionally, some restricted permute takes place to compensate for the 32-bit granularity of the PMXL/PMXR units.
YMX	Performs special operations, such as inserting the constant 1 to multiply the pre-adder result by 1, or sign-extending the individual lanes. Additionally, some restricted permute takes place to compensate for the 32-bit granularity of the PMXR unit.
Multiplier Unit
MPY	Multiplies the PRA and YMX outputs.
Post-adder Units
PSA	First post-adding stage that reduces the 32 MPY output lanes to 16 lanes.
PSB	Second post-adding stage that further reduces the lanes to 8. Alternatively, it forwards the inputs to the output.
Accumulator
ACM	Multiplexes the data that is to be added to the post-adder output. The source can be the old accumulator value, the output of the upshift path, or the cascade stream input.
ACC	Adds or subtracts the ACM and PSB outputs.
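The data flow through the units in the table can be sketched in Python as follows. The sketch is a behavioral illustration under stated assumptions, not a description of the hardware: the permute units and YMX special operations are omitted, the post-adders are assumed to sum adjacent lane pairs, and the subtraction order in ACC is assumed. All function names are hypothetical. `pra_select` illustrates the MIN/MAX and comparison use of the pre-adder described before the table, where a subtraction is performed and the sign bit selects the input.

```python
def pra(left, right, subtract=False):
    """Pre-adder: lane-wise add or subtract of the PMXL and PMXR outputs."""
    return [l - r if subtract else l + r for l, r in zip(left, right)]


def pra_select(left, right, want_max=False):
    """MIN/MAX via the pre-adder: subtract, then let the sign bit pick an input.

    Returns the selected lanes and the sign bits; the sign bits alone are
    the result of a pure vector comparison (True where left < right).
    """
    diff = pra(left, right, subtract=True)
    sign = [d < 0 for d in diff]
    if want_max:
        picked = [r if s else l for l, r, s in zip(left, right, sign)]
    else:
        picked = [l if s else r for l, r, s in zip(left, right, sign)]
    return picked, sign


def post_add(lanes):
    """One post-adding stage: halve the lane count.

    Summing adjacent pairs is an assumption; the text only states the
    lane counts (32 -> 16 for PSA, 16 -> 8 for PSB).
    """
    return [lanes[i] + lanes[i + 1] for i in range(0, len(lanes), 2)]


def mac_step(x_left, x_right, y, acm, pre_subtract=False,
             use_psb=True, acc_subtract=False):
    """PRA -> MPY -> PSA -> PSB -> ACC for one instruction.

    `acm` is whatever the ACM multiplexer selected: the old accumulator
    value, the upshift output, or the cascade stream input.
    """
    first_arg = pra(x_left, x_right, pre_subtract)       # PRA
    products = [a * b for a, b in zip(first_arg, y)]     # MPY, 32 lanes
    reduced = post_add(products)                         # PSA, 16 lanes
    if use_psb:
        reduced = post_add(reduced)                      # PSB, 8 lanes
    # ACC: add or subtract (operand order for subtraction is assumed).
    return [a - p if acc_subtract else a + p for a, p in zip(acm, reduced)]
```

When `use_psb` is false, the sketch models PSB forwarding its inputs, so `acm` must then supply 16 lanes instead of 8.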

The following table shows the functional units in the upshift path.

Table 2. Upshift Path
Upshift Units	Description
to UPS	Reads vector register and selects only certain lanes.
UPS	Performs the actual upshifting and outputs to the ACM unit in the main data path.
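The two upshift units can be sketched in Python as a lane selection followed by a left shift. This is a minimal sketch with hypothetical function names. The text does not describe overflow handling, so the sketch rejects values that no longer fit the accumulator lane rather than guessing at wrap or saturation behavior. The 48-bit default matches the narrower accumulator lane width mentioned earlier.

```python
def to_ups(vector, lanes):
    """Select only certain lanes of a vector register (indices are illustrative)."""
    return [vector[i] for i in lanes]


def ups(values, shift, acc_bits=48):
    """Left-shift each selected lane into accumulator precision."""
    lo, hi = -(1 << (acc_bits - 1)), (1 << (acc_bits - 1)) - 1
    out = []
    for v in values:
        shifted = v << shift
        if not lo <= shifted <= hi:
            # Overflow behavior is not specified in the text; do not guess.
            raise OverflowError("shifted value does not fit the accumulator lane")
        out.append(shifted)
    return out
```

For example, `ups(to_ups([1, -2, 3, 4], [0, 1, 3]), 4)` selects lanes 0, 1, and 3 and returns `[16, -32, 64]`.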

The following figure is a pipeline diagram of the shift-round-saturate path. An accumulator register is read, the shift-round-saturate operation occurs, and the output is written either to any vector register or to the data memory. For a store to memory, the store is issued in the E3 stage and the value arrives in memory in the E6 stage.

Figure 2. Pipeline Diagram of AI Engine Shift-Round-Saturate Data Path

The following table shows the functional units in the shift-round-saturate path.

Table 3. Shift-Round-Saturate Path
Shift-Round-Saturate Units	Description
SRSA	Combines the two parts of an 80-bit accumulator. It bypasses the data when 48-bit accumulators are to be shifted. Works on eight 48-bit lanes or four 80-bit lanes in parallel. The functionality is split into a low and a high part that perform the same operation in parallel.
SRSB	Performs the actual shifting of the lanes. The functionality is split into a low and a high part that perform the same operation in parallel.
SRS Interleaver	Interleaves the outputs of the SRSB high and low units when accumulator interleaving is required (controlled by the MSB of the shift amount register S).
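The interleaver's role can be sketched in Python as follows. Several details are assumptions because the text does not give them: the width of S (`s_bits` is a hypothetical parameter), the output order when interleaving is enabled (low lane first), and simple concatenation when it is disabled.

```python
def srs_interleave(low, high, s_value, s_bits=6):
    """Merge the SRSB low and high outputs.

    The MSB of the shift amount register S selects interleaving.
    `s_bits` is a placeholder for the width of S, which is not stated
    in the text.
    """
    interleave = (s_value >> (s_bits - 1)) & 1
    if interleave:
        out = []
        for l, h in zip(low, high):
            out.extend([l, h])        # assumed order: low lane, then high lane
        return out
    return list(low) + list(high)     # assumed: plain concatenation otherwise
```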