Floating-Point Vector Unit

Versal Adaptive SoC AI Engine Architecture Manual (AM009)

1.3 English

The AI Engine provides eight lanes of single-precision floating-point multiplication and accumulation. The unit reuses the vector register files and permute network of the fixed-point data path. In general, only one vector instruction, either fixed-point or floating-point, can be performed per cycle.

The following figure shows the pipeline diagram of the single-precision floating-point data flow. Compared to the fixed-point vector unit, only the PMXL and PMC units are used (the PMXR unit is removed). FPYMX is in the style of YMX, and the results from FPYMX and PMXL are forwarded to a single-precision multiplier unit (FPMPY) that can compute eight products in parallel. The operation in FPMPY has a three-cycle latency and a one-cycle throughput. Next, there is an FPSGN unit that allows sign negation of the results on a per-lane basis.
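As a rough illustration of what a three-cycle latency with a one-cycle throughput means for back-to-back operations, the following sketch applies standard pipeline arithmetic to the FPMPY stage alone. The function name and parameters are hypothetical, and the model ignores every other pipeline stage and any stalls.

```python
def pipelined_cycles(n_ops: int, latency: int = 3, interval: int = 1) -> int:
    """Cycles until the last of n_ops independent operations leaves a stage.

    latency  : cycles from issue to result for one operation (3 for FPMPY)
    interval : cycles between successive issues (1 for FPMPY)
    """
    if n_ops <= 0:
        return 0
    # The first result appears after `latency` cycles; each further
    # operation adds one issue interval.
    return latency + (n_ops - 1) * interval


print(pipelined_cycles(1))  # 3
print(pipelined_cycles(8))  # 10
```

Under these assumptions, eight independent vector multiplies occupy FPMPY for 10 cycles rather than 24, which is why keeping the pipeline filled with independent operations matters.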

After the FPSGN unit there is a two-stage accumulator unit called FPACC. It accumulates the multiplication results with values from various sources, such as zeros or values taken directly from another vector register. However, it is not possible to add lanes within the same vector directly. The accumulator itself does not support subtraction; subtraction is achieved by negating the products in the FPSGN unit.
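The multiply, negate, and accumulate flow described above can be sketched as a small behavioral model. This is not AI Engine code: the names `f32` and `fp_mac` are hypothetical, and the rounding applied (round-to-nearest, as performed by Python's `struct` module) is an assumption, since the rounding mode is not stated here. The product is rounded to single precision before accumulation to reflect that there is no increased precision between multiplier and accumulator.

```python
import struct

LANES = 8  # eight single-precision lanes


def f32(x: float) -> float:
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fp_mac(acc, a, b, negate):
    """Per-lane model: acc[i] + (+/-)(a[i] * b[i]).

    negate[i] selects sign negation of the product in lane i (FPSGN).
    Lanes never interact, mirroring the fact that lanes of the same
    vector cannot be added to each other directly.
    """
    assert len(acc) == len(a) == len(b) == len(negate) == LANES
    out = []
    for i in range(LANES):
        product = f32(f32(a[i]) * f32(b[i]))    # FPMPY, rounded to single precision
        if negate[i]:
            product = -product                  # FPSGN, per-lane sign flip
        out.append(f32(f32(acc[i]) + product))  # FPACC, addition only
    return out


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [0.5] * LANES
acc = [10.0] * LANES
negate = [False, True] * 4  # subtract the product in odd lanes
print(fp_mac(acc, a, b, negate))
```

Passing a vector of zeros as `acc` gives a plain multiply, and setting every `negate` entry gives a multiply-subtract, without the accumulator ever performing a subtraction itself.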

Figure 1. Pipeline Diagram of the AI Engine Floating-point Vector Unit Single-precision Floating-point Data Path

The AI Engine supports several vector elementary functions for the floating-point format. These functions include vector comparison, minimum, and maximum. They operate element-wise on two vectors. The hardware needed is very similar to that used for fixed-point vector comparison. The fixed-point unit PRA is extended to handle floating-point comparison, and the operation is done at the same time as the FPYMX block.

The floating-point data path supports vector fixed-point to single-precision floating-point conversion, as well as the reverse floating-point to fixed-point conversion, but only at lower performance through the scalar unit. In that situation, extract the elements from the vector, perform the scalar conversion, and push the results back into a vector. When implemented in an efficiently pipelined loop, a conversion rate close to one sample per cycle can be achieved.
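The extract, convert, and push-back pattern for conversions can be sketched in plain Python as follows. All names here (`fix_to_float`, `float_to_fix`, `convert_vector`) are hypothetical, and the number of fractional bits, the round-half-up rule, and the saturation to a 32-bit range are assumptions made for illustration, not documented hardware behavior.

```python
import math
import struct


def f32(x: float) -> float:
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fix_to_float(value: int, frac_bits: int) -> float:
    """Scalar fixed-point to single-precision conversion (assumed scaling)."""
    return f32(value / (1 << frac_bits))


def float_to_fix(value: float, frac_bits: int, width: int = 32) -> int:
    """Scalar single-precision to fixed-point conversion.

    Rounds half up and saturates to a signed `width`-bit range;
    both choices are assumptions of this sketch.
    """
    scaled = math.floor(value * (1 << frac_bits) + 0.5)
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return max(lo, min(hi, scaled))


def convert_vector(vec, scalar_op):
    """Extract each element, convert it on the scalar path, push it back."""
    out = list(vec)
    for lane in range(len(vec)):
        element = vec[lane]             # extract one element from the vector
        out[lane] = scalar_op(element)  # scalar conversion, then push back
    return out


fixed = [256, -128, 64, 0, 512, 1, -1, 384]  # Q8 values (8 fractional bits)
floats = convert_vector(fixed, lambda v: fix_to_float(v, 8))
print(floats)
print(convert_vector(floats, lambda v: float_to_fix(v, 8)))
```

The loop handles one element per iteration, which matches the shape of a loop that could approach one sample per cycle once pipelined.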

The floating-point unit can issue events that correspond to standard floating-point exceptions, and the status registers MC keep track of the events. There are eight exception bits per floating-point functional unit. The exceptions are (from bit 0 to 7): Zero, Infinity, Invalid, Tiny (Underflow), Huge (Overflow), Inexact, Huge Int, and Divide by Zero. Of the eight exceptions, Tiny, Huge, Invalid, and Divide by Zero can be converted into an event that can be broadcast to the AI Engine array interface and then sent to the PS/PMC as an interrupt.
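A minimal sketch of decoding such an eight-bit exception field is shown below. It assumes the bit order Zero, Infinity, Invalid, Tiny, Huge, Inexact, Huge Int, Divide by Zero for bits 0 to 7; the placement of Invalid at bit 2 in particular should be treated as an assumption of this sketch. The function names are hypothetical, and no register access is modeled: the status value is simply an integer.

```python
# Assumed bit order, bit 0 first. The position of "Invalid" is an assumption.
EXCEPTION_NAMES = (
    "Zero",
    "Infinity",
    "Invalid",
    "Tiny",
    "Huge",
    "Inexact",
    "Huge Int",
    "Divide by Zero",
)

# Exceptions that can be converted into a broadcast event.
EVENT_CAPABLE = {"Tiny", "Huge", "Invalid", "Divide by Zero"}


def decode_exceptions(status: int):
    """Return the names of all exception bits set in an 8-bit status value."""
    return [name for bit, name in enumerate(EXCEPTION_NAMES) if status & (1 << bit)]


def event_exceptions(status: int):
    """Return only the set exceptions that can raise an event."""
    return [name for name in decode_exceptions(status) if name in EVENT_CAPABLE]


status = 0b10001001  # bits 0, 3, and 7 set
print(decode_exceptions(status))  # ['Zero', 'Tiny', 'Divide by Zero']
print(event_exceptions(status))   # ['Tiny', 'Divide by Zero']
```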

The AI Engine floating-point data path does not support the following features:

  • Double-precision operations
  • Half-precision operations
  • Custom floating-point formats, for example, a 2-bit exponent and a 14-bit mantissa (E2:M14)
  • Pre-adding before multiplication
  • Post-adding between multiplication and accumulator
  • Increased precision between multiplier and accumulator
  • Denormalized and subnormal floating-point numbers
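Because denormalized numbers are not supported, it can be useful to check host-side test data for subnormal single-precision values before it is fed to the floating-point data path. The following is a minimal sketch of such a check. The helper names are hypothetical, the check says nothing about how the hardware treats such inputs, and values outside the finite single-precision range are assumed not to occur (Python's `struct` raises `OverflowError` for them).

```python
import struct

F32_MIN_NORMAL = 2.0 ** -126  # smallest positive normal single-precision value


def f32(x: float) -> float:
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def is_subnormal_f32(x: float) -> bool:
    """True if x, stored as single precision, is nonzero but below the normal range."""
    v = f32(x)
    return v != 0.0 and abs(v) < F32_MIN_NORMAL


def find_subnormals(values):
    """Indices of all values that become subnormal in single precision."""
    return [i for i, v in enumerate(values) if is_subnormal_f32(v)]


data = [1.0, 1e-40, 0.0, -1e-39, 2.0 ** -126]
print(find_subnormals(data))  # [1, 3]
```

Values so small that they round to zero in single precision are not flagged, since they are stored as an ordinary zero.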