The fixed-point vector unit contains three separate and largely independent data paths.
- Multiply Accumulator (MAC) Path
  - The main multiplication path reads values from the vector registers, permutes them in a user-controllable manner, optionally pre-adds them, multiplies them, and, after some post-adding, accumulates the result onto the previous value of the accumulator register.
- Upshift Path
  - This path runs in parallel to the MAC path. It reads data from the permute units in the MAC path or from a vector register, left-shifts it, and feeds it to the accumulator registers.
- Shift-Round-Saturate (SRS) Path
  - This path reads from the accumulator registers and stores to the vector registers or the data memory. It is needed because the accumulators are 48 or 80 bits wide per lane, whereas the vector registers and the data memory use power-of-two widths of 8, 16, 32, or 64 bits. Therefore, the data must be right-shifted on a lane-by-lane basis.
  - The SRS unit is controlled by the fields Q and R of the saturation and rounding control register MD, and it reports status back to the environment through the status register MC. The shift control register S determines the shift amount. The rounding mode is selected by the field R in register MD. If R is 0, the value is truncated on the LSB side, which means that no actual rounding takes place. If R is 1, ceiling behavior is achieved. For R = 2 to 7, the modes are PosInf, NegInf, SymInf, SymZero, ConvEven, and ConvOdd, respectively.
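
The shift-round-saturate behavior for a single lane can be sketched in Python as follows. This is only an illustrative model, not the hardware definition. The function name `srs`, the `saturate` flag standing in for the Q field, the signed output range, and the tie-breaking interpretation of modes 2 to 7 (round to nearest, with ties going toward +infinity, toward -infinity, away from zero, toward zero, to even, and to odd) are all assumptions made for this example.

```python
def srs(acc, shift, rnd=0, out_bits=16, saturate=True):
    """Model one SRS lane: right-shift, round according to rnd, then saturate.

    Assumed interpretation of rnd (field R): 0 truncate, 1 ceiling,
    2..7 round to nearest with ties resolved per mode.
    """
    if shift > 0:
        floor = acc >> shift              # arithmetic shift rounds toward -inf
        rem = acc - (floor << shift)      # discarded bits, 0 <= rem < 2**shift
        half = 1 << (shift - 1)
        if rnd == 0:                      # truncate on the LSB side
            up = False
        elif rnd == 1:                    # ceiling
            up = rem != 0
        elif rem != half:                 # not a tie: plain round to nearest
            up = rem > half
        else:                             # exact tie: mode decides
            up = {
                2: True,                  # PosInf
                3: False,                 # NegInf
                4: acc >= 0,              # SymInf (away from zero)
                5: acc < 0,               # SymZero (toward zero)
                6: (floor & 1) == 1,      # ConvEven
                7: (floor & 1) == 0,      # ConvOdd
            }[rnd]
        value = floor + int(up)
    else:
        value = acc

    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    if saturate:                          # assumed effect of the Q field
        return max(lo, min(hi, value))
    wrapped = value & ((1 << out_bits) - 1)   # otherwise wrap to out_bits
    return wrapped - (1 << out_bits) if wrapped > hi else wrapped
```

For example, under these assumptions `srs(5, 1, rnd=0)` yields 2 (2.5 truncated), while `srs(5, 1, rnd=2)` yields 3 and `srs(-5, 1, rnd=4)` yields -3.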
The following figure is the pipeline diagram of the main multiplication and upshift path. After the instruction decode stage (ID), the six execute stages are numbered E1 to E6. The dark gray boxes, which always cross two stages, are registers. The light gray boxes, which can span multiple stages, are the functional units. The white box represents hardware registers that are internal to the processor description. All boxes are connected by arrows; these are nML transitories, which are pure non-storing wires. In addition to the elements shown in the diagram, there are multiplexers that realize different connectivity depending on the instruction being executed. The "to UPS" unit implies a multiplexer that selects among the three permute units and the VD register. There is also an internal unit that reads the inputs and pre-adds two values before passing the data to the UPS unit.
The following table shows the functional units in the main multiplication path. The pre-adder (PRA) can also be used for some elementary vector functions, such as determining the minimum or maximum of two vectors or comparing two vectors. In these cases, the PRA is configured for subtraction, and the sign bit of the result is used to select which input is passed on (for MIN and MAX); for a pure vector comparison, the sign bit is written to register R.
| Unit | Description |
|------|-------------|
| PMXL | Permutes the data from the vector registers for the left input of the pre-adder PRA. |
| PMXR | Permutes the data from the vector registers for the right input of the pre-adder PRA or, alternatively, for the input of the YMX unit. |
| | Permutes the data from the vector registers for the input of the YMX unit. |
| PRA | Pre-adds or pre-subtracts the PMXL and PMXR outputs to form the first multiplier argument. Additionally, some restricted permutation takes place to compensate for the 32-bit granularity of the PMXL/PMXR units. |
| YMX | Performs special operations, such as inserting the constant 1 to multiply the pre-adder result by 1, or sign-extending the individual lanes. Additionally, some restricted permutation takes place to compensate for the 32-bit granularity of the PMXR unit. |
| MPY | Multiplies the PRA and YMX outputs. |
| | First post-adding stage, which reduces the 32 MPY output lanes to 16 lanes. |
| PSB | Second post-adding stage, which further reduces the lanes to 8. Alternatively, it forwards the inputs to the output. |
| ACM | Multiplexes the data that is to be added to the post-adder output. This can be the old accumulator value, the output of the upshift path, or the cascade stream input. |
| | Adds or subtracts the ACM and PSB outputs. |
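
The data flow through these units, including the MIN/MAX use of the pre-adder, can be sketched in plain Python. This is a simplified behavioral model under stated assumptions, not the actual hardware behavior: all function names are hypothetical, the post-adder is assumed to sum adjacent lane pairs, the accumulate step is assumed to compute ACM plus or minus the post-adder output, and Python integers do not model lane widths or overflow.

```python
def permute(vec, idx):
    """Model a permute unit: output lane i takes input lane idx[i]."""
    return [vec[i] for i in idx]


def pre_add(left, right, subtract=False):
    """Model the pre-adder: lane-wise add or subtract of the two inputs."""
    return [l - r if subtract else l + r for l, r in zip(left, right)]


def post_add(lanes):
    """Halve the lane count by summing adjacent pairs (pairing is assumed)."""
    return [lanes[i] + lanes[i + 1] for i in range(0, len(lanes), 2)]


def mac(left, right, y, acc, subtract_pre=False, subtract_acc=False):
    """Pre-add, multiply 32 lanes, reduce 32 -> 16 -> 8, then accumulate."""
    products = [p * m for p, m in zip(pre_add(left, right, subtract_pre), y)]
    reduced = post_add(post_add(products))
    return [a - r if subtract_acc else a + r for a, r in zip(acc, reduced)]


def vector_min_max(a, b):
    """Use the sign of a - b to pick the smaller and larger lane values."""
    negative = [d < 0 for d in pre_add(a, b, subtract=True)]
    vmin = [x if s else y for x, y, s in zip(a, b, negative)]
    vmax = [y if s else x for x, y, s in zip(a, b, negative)]
    return vmin, vmax, negative  # the sign bits double as a comparison result
```

In this model, `mac(list(range(32)), [1] * 32, [2] * 32, [10] * 8)` reduces 32 products to 8 accumulator lanes, and `vector_min_max` returns the per-lane sign bits alongside the selected values.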
The following table shows the functional units in the upshift path.
| Unit | Description |
|------|-------------|
| | Reads a vector register and selects only certain lanes. |
| UPS | Performs the actual upshifting and outputs to the ACM unit in the main data path. |
The following figure is a pipeline diagram of the shift-round-saturate path. An accumulator register is read, the shift-round-saturate operation occurs, and the output is written either to a vector register or to the data memory. The store to memory is issued in the E3 stage, and the value arrives in memory in the E6 stage.
The following table shows the functional units in the shift-round-saturate path.
| Unit | Description |
|------|-------------|
| | Combines the two parts of an 80-bit accumulator. It bypasses the data when 48-bit accumulators are to be shifted. It works on eight 48-bit lanes or four 80-bit lanes in parallel. The functionality is split into a low and a high part that perform the same operation in parallel. |
| SRSB | Performs the actual shifting of the lanes. The functionality is split into a low and a high part that perform the same operation in parallel. |
| | Interleaves the outputs of the SRSB high and low units when accumulator interleaving is required (controlled by the MSB of the shift amount register S). |
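
The combine and interleave steps can be illustrated with a short Python sketch. The bit layout used here is an assumption: the low part is taken to hold bits 0 to 47 of the 80-bit value and the high part to hold the signed upper bits. The interleave order (low lane first, then high lane) and both function names are likewise hypothetical.

```python
def combine_80(lo48, hi):
    """Join two accumulator parts into one 80-bit lane value.

    Assumed layout: lo48 carries bits 0..47 (treated as unsigned),
    hi carries the signed upper bits.
    """
    return (hi << 48) | (lo48 & ((1 << 48) - 1))


def interleave(low, high, enable=True):
    """Interleave low and high unit outputs lane by lane when enabled;
    otherwise concatenate them (the disabled behavior is an assumption)."""
    if not enable:
        return list(low) + list(high)
    out = []
    for l, h in zip(low, high):
        out.extend((l, h))
    return out
```

In this model, the `enable` flag stands in for the MSB of the shift amount register S.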