Functional Overview

The AI Engine is a highly optimized processor that combines single-instruction multiple-data (SIMD) and very-long instruction word (VLIW) execution and supports both fixed-point and floating-point precision. As shown in the following figure, the AI Engine has a memory interface, a scalar unit, a vector unit, two load units, one store unit, and an instruction fetch and decode unit.

Figure 1. AI Engine

The features of the AI Engine include:

32-bit scalar RISC processor
- General purpose pointer and configuration register files
- Supports non-linear functions (for example: sqrt, Sin/Cos, and InvSqrt)
- A scalar ALU, including a 32 x 32-bit scalar multiplier
- Supports conversion of the data type between scalar fixed point and scalar floating point
Three address generator units (AGU)
- Support for multiple addressing modes: fixed, indirect, post-incremental, or cyclic
- Supports Fast Fourier Transform (FFT) address generation
- Two AGUs dedicated to the two load units
- One AGU dedicated to the store unit
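
The cyclic post-increment mode listed above can be modeled in a few lines of Python. This is a behavioral sketch only, not the AGU implementation: the function name, the byte-addressed buffer, and the example base address and sizes are all assumptions made for illustration.

```python
def cyclic_post_increment(base, size, start, step, count):
    """Return `count` addresses that wrap inside [base, base + size).

    Hypothetical model: `base`, `size`, and `step` are in bytes.
    """
    addresses = []
    offset = (start - base) % size
    for _ in range(count):
        addresses.append(base + offset)   # address used for this access
        offset = (offset + step) % size   # post-increment, then wrap
    return addresses


# Example: a 64-byte circular buffer walked in 16-byte steps,
# starting halfway through the buffer.
addresses = cyclic_post_increment(
    base=0x1000, size=64, start=0x1020, step=16, count=6
)
print([hex(a) for a in addresses])
```

The pointer is used first and updated afterwards, so the wrap back to the buffer base happens on the access after the last in-range address.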

Vector fixed-point/integer unit
- Concurrent operations on multiple vector lanes
- Accommodates multiple precisions for complex and real operands (see Table 1)

Note: While cfloat is a vector data type, it is not directly supported by the AI Engine vector processor. Two instructions must be issued.

Table 1. Supported Precision Width of the Vector Data Path
X Operand (bits)	Z Operand (bits)	Output (bits)	Number of MACs
8 real	8 real	48 real	128
16 real	8 real	48 real	64
16 real	16 real	48 real	32
16 real	16 complex	48 complex	16
16 complex	16 real	48 complex	16
16 complex	16 complex	48 complex	8
16 real	32 real	48/80 real	16
16 real	32 complex	48/80 complex	8
16 complex	32 real	48/80 complex	8
16 complex	32 complex	48/80 complex	4
32 real	16 real	48/80 real	16
32 real	16 complex	48/80 complex	8
32 complex	16 real	48/80 complex	8
32 complex	16 complex	48/80 complex	4
32 real	32 real	80 real	8
32 real	32 complex	80 complex	4
32 complex	32 real	80 complex	4
32 complex	32 complex	80 complex	2
32 SPFP	32 SPFP	32 SPFP	8

- Can be configured to perform eight complex 16-bit multiplications
- Full permute unit with 32-bit granularity
- Shift, round, and saturate with multiple rounding and saturation modes
- Two-step post-adding with 768-bit intermediate results
The X operand is 1024 bits wide and the Z operand is 256 bits wide. To see how the components are used, consider the first row of Table 1. The multiplier operands come from the same 1024-bit and 256-bit input registers, but some values are broadcast to multiple multipliers. The 128 8-bit multipliers produce products that are post-added and accumulated into either 16 or 8 accumulator lanes of 48 bits each.
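
The post-add step for the first row of Table 1 can be sketched as follows. This is a simplified Python model under stated assumptions: the operands are supplied already paired (that is, after any broadcast or permute has taken place), each lane sums a consecutive group of products, and the function name is hypothetical. How the hardware actually selects operands is not modeled.

```python
ACC_BITS = 48       # accumulator lane width from Table 1
MULTIPLIERS = 128   # 8-bit x 8-bit multipliers (first row of Table 1)


def mac_post_add(x_vals, z_vals, acc):
    """Multiply 128 operand pairs and post-add them into len(acc) lanes.

    Assumption: lane i receives the i-th consecutive group of products.
    """
    lanes = len(acc)
    assert lanes in (16, 8)
    assert len(x_vals) == len(z_vals) == MULTIPLIERS
    per_lane = MULTIPLIERS // lanes
    result = []
    for lane in range(lanes):
        start = lane * per_lane
        xs = x_vals[start:start + per_lane]
        zs = z_vals[start:start + per_lane]
        total = acc[lane] + sum(x * z for x, z in zip(xs, zs))
        # The running sum must fit in a signed 48-bit accumulator lane.
        assert -(1 << (ACC_BITS - 1)) <= total < (1 << (ACC_BITS - 1))
        result.append(total)
    return result


x = [3] * MULTIPLIERS
z = [-2] * MULTIPLIERS
acc16 = mac_post_add(x, z, [0] * 16)    # 8 products per lane
acc8 = mac_post_add(x, z, [100] * 8)    # 16 products per lane
```

With 16 lanes, each accumulator receives 8 products per call; with 8 lanes, it receives 16.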

Single-precision floating-point (SPFP) vector unit
- Uses the same permute unit as the fixed-point vector unit
- Concurrent operation of multiple vector lanes
- Eight single-precision multiplier–accumulators (MACs) per cycle
Balanced pipeline
- Different pipeline on each functional unit (eight stages maximum)
- Load and store units manage the 5-cycle latency of data memory
Three data memory ports
- Two load ports and one store port
- Each port operates in 256-bit/128-bit vector register mode or 32-bit/16-bit/8-bit scalar register mode. The 8-bit and 16-bit stores are implemented as read-modify-write instructions
- Concurrent operation of all three ports
- A bank conflict on any port stalls the entire data path
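
Because narrow stores are implemented as read-modify-write, an 8-bit or 16-bit store can be pictured as in the sketch below. It is an illustrative Python model, not the hardware sequence: the 32-bit containing word, the little-endian byte placement, and the `store_narrow` name are assumptions that the text above does not specify.

```python
def store_narrow(memory, byte_addr, value, width_bits):
    """Store an 8-bit or 16-bit value into a list of 32-bit words.

    Assumptions: the containing unit is a 32-bit word and bytes are
    placed little-endian within it.
    """
    assert width_bits in (8, 16)
    assert byte_addr % (width_bits // 8) == 0   # naturally aligned
    word_index, byte_offset = divmod(byte_addr, 4)
    shift = 8 * byte_offset
    mask = ((1 << width_bits) - 1) << shift
    word = memory[word_index]                             # read
    word = (word & ~mask & 0xFFFFFFFF) | ((value << shift) & mask)  # modify
    memory[word_index] = word                             # write


memory = [0x11223344, 0xAABBCCDD]
store_narrow(memory, 1, 0xEE, 8)      # replaces one byte of word 0
store_narrow(memory, 6, 0x1234, 16)   # replaces the upper half of word 1
print([hex(w) for w in memory])
```

Only the addressed bits change; the rest of the containing word is written back unmodified.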
Very-long instruction word (VLIW) function
- Concurrent issuing of operations to all functional units
- Support for multiple instruction formats and variable-length instructions
- Up to seven operations can be issued in parallel using one VLIW word
Direct stream interface
- Two input streams and two output streams
- Each stream can be configured to be either 32-bit or 128-bit wide
- One cascade stream in, one cascade stream out (384-bit)
Interface to the following modules
- Lock module
- Stall module
- Debug and trace module
Event interface is a 16-bit wide output interface from the AI Engine