Arithmetic Logic Unit, Scalar Functions, and Data Type Conversions

The arithmetic logic unit (ALU) in the AI Engine manages the following operations. In all cases the issue rate is one instruction per cycle.

- Integer addition and subtraction: 32 bits. The operation has a one cycle latency.
- Bit-wise logical operation on 32-bit integer numbers (BAND, BOR, BXOR). The operation has a one cycle latency.
- Integer multiplication: 32 x 32 bit with output result of 32 bits stored in the R register file. The operation has a three cycle latency.
- Shift operation: Both left and right shift are supported. A positive shift amount is used for left shift and a negative shift amount is used for right shift. The shift amount is passed through a general purpose register. A one bit operand to the shift operation indicates whether a positive or negative shift is required. The operation has a one cycle latency.
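
The sign convention for the shift amount can be illustrated with a minimal Python sketch. The names `to_signed32` and `shift32` are hypothetical and are not part of any AI Engine API. The sketch assumes that right shifts are arithmetic (sign-extending) and that bits shifted past bit 31 are discarded; the description above does not specify either detail.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def shift32(value, amount):
    """Model of a signed-amount shift on a 32-bit value.

    amount > 0 shifts left, amount < 0 shifts right.
    Assumption: the right shift is arithmetic and overflowed bits are dropped.
    """
    if amount >= 0:
        return to_signed32((value & MASK32) << amount)
    # Python's >> on a negative int is already sign-extending.
    return to_signed32(value) >> (-amount)
```

For example, `shift32(1, 4)` gives 16 and `shift32(-16, -2)` gives -4 under these assumptions.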

There are two types of scalar elementary functions in the AI Engine: fixed-point and floating-point. The following describes each function.

Fixed-point non-linear functions
- Sine and cosine:
  - Input is from the upper 20 bits of a 32-bit input
  - Output is a concatenated word with the upper 16-bit sine and the lower 16-bit cosine
  - The operations have a four cycle latency
- Absolute value (ABS): For a negative input, invert the bits of the input number and add one (two's-complement negation). The operation has a one cycle latency
- Count leading zeroes (CLZ): Count leading zeroes in a 32-bit input. The operation has a one cycle latency
- Minimum/maximum (less than (LT)/greater than (GT)): Two inputs are compared to find the minimum or maximum. The operation has a one cycle latency
- Square Root, Inverse Square Root, and Inverse: These operations are implemented with floating-point precision. For fixed-point implementation the input needs to be first converted to floating-point precision and then passed as input to these non-linear operations. Also, the output is in a floating-point format that needs to be converted back to fixed-point integer format. The operations have a four cycle latency
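
The bit-level behavior of several of the fixed-point functions above can be sketched in Python. This is an illustrative model only: the function names are hypothetical, and treating the sine and cosine halves as signed 16-bit values is an assumption, since the text only states where each half sits in the output word.

```python
def unpack_sincos(word):
    """Split a 32-bit sine/cosine result into (sine, cosine).

    Upper 16 bits hold the sine, lower 16 bits hold the cosine.
    Assumption: each half is a signed 16-bit two's-complement value.
    """
    def s16(v):
        v &= 0xFFFF
        return v - 0x10000 if v & 0x8000 else v

    return s16(word >> 16), s16(word)


def abs32(x):
    """Absolute value by inverting the bits and adding one when x is negative.

    The most negative input (0x80000000) has no positive counterpart in
    32 bits; how the hardware treats it is not covered by this sketch.
    """
    x &= 0xFFFFFFFF
    if x & 0x80000000:
        x = (~x + 1) & 0xFFFFFFFF
    return x


def clz32(x):
    """Count leading zeroes of a 32-bit value (32 for an input of zero)."""
    x &= 0xFFFFFFFF
    return 32 - x.bit_length()
```

Under these assumptions, `unpack_sincos(0x7FFF0000)` yields a sine of 32767 and a cosine of 0, `abs32(-5)` yields 5, and `clz32(1)` yields 31.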

Floating-point non-linear functions
- Square root: Both input and output are single precision floating-point numbers operating on the R register file. The operation has a four cycle latency
- Inverse square root: Both input and output are single precision floating-point numbers operating on the R register file. The operation has a four cycle latency
- Inverse: Both input and output are single precision floating-point numbers operating on the R register file. The operation has a four cycle latency
- Absolute value (ABS): The operation has a one cycle latency
- Minimum/maximum: The operation has a one cycle latency

There is no floating-point unit in the scalar unit. Floating-point operations are supported through emulation. In general, it is preferable to perform floating-point add and multiply in the vector unit.

The AI Engine scalar unit supports data-type conversion operations that convert input data from fixed-point to floating-point and from floating-point to fixed-point. The fix2float and float2fix operations support a variable decimal point position for a 32-bit input value. The decimal point position is passed as a second input alongside the value, and the operations scale the value up or down as required. Both operations have a one cycle latency.
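
The scaling performed by the two conversions can be illustrated with a small Python model. This is a sketch of the arithmetic only, not the hardware implementation: the names `float2fix_model` and `fix2float_model` are hypothetical, the decimal point position is taken to be the number of fractional bits, and round-half-up rounding and the scaling direction of fix2float are assumptions. Inputs are expected to be finite.

```python
import math

INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def float2fix_model(n, sft):
    """Convert a finite float to a 32-bit fixed-point integer with sft fractional bits.

    Computes n * 2**sft, rounds, and saturates to the signed 32-bit range.
    Assumption: round-half-up; the actual rounding mode is not stated.
    """
    scaled = math.ldexp(n, sft)      # n * 2**sft
    q = math.floor(scaled + 0.5)
    return max(INT32_MIN, min(INT32_MAX, q))


def fix2float_model(x, sft):
    """Convert a fixed-point integer with sft fractional bits back to a float."""
    return math.ldexp(float(x), -sft)  # x * 2**(-sft)
```

For instance, `float2fix_model(1.5, 8)` gives 384 (1.5 × 256), and `fix2float_model(384, 8)` recovers 1.5.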

The AI Engine floating-point implementation is not fully compliant with the IEEE standards, and some functionality is restricted. The exceptions are outlined in this section.

When the float2fix function is called with a very large positive or negative number and the additional exponent increment is greater than zero, the instruction returns 0 instead of the correct saturation value of either 2³¹–1 or –2³¹.
The float2fix function takes two input parameters:
- n: The floating-point input value to be converted.
- sft: A 6-bit signed number representing the number of fractional bits in fixed-point representation (from –32 to 31).
Consider the two scenarios:
- If n*2^sft > 2¹²⁹, the output should return 0x7FFFFFFF. Instead, it returns 0x00000000.
- If n*2^sft < –2¹²⁹, the output should return 0x80000000. Instead, it returns 0x00000000.
In general, when sft > 0, ensure that the floating-point input value n stays in the bug-free range of –2^(129–sft) < n < 2^(129–sft).
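
The range condition above can be expressed as a simple guard. The following Python sketch is illustrative only; the function name `in_bug_free_range` is hypothetical. It treats any sft of zero or less as unrestricted, because the restriction is stated only for sft > 0.

```python
import math


def in_bug_free_range(n, sft):
    """Return True if n satisfies -2**(129 - sft) < n < 2**(129 - sft).

    Assumption: inputs with sft <= 0 are always considered in range,
    since the restriction above is stated only for sft > 0.
    """
    if sft <= 0:
        return True
    limit = math.ldexp(1.0, 129 - sft)  # 2**(129 - sft)
    return abs(n) < limit
```

A caller that cannot show this check always holds for its data would be a candidate for the slower, always-correct conversion path rather than the fast one.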

Two implementations are introduced to provide a workaround:
- float2fix_safe: This is the default mode if you specify float2fix without any option. The implementation returns the correct value for any range, but is slower.
- float2fix_fast: This implementation returns the correct value only in the bug-free range, so you must ensure that the input stays within that range. To choose the float2fix_fast implementation, add the preprocessor macro FLOAT2FIX_FAST to the project file.
A fixed-point value has a legal range of –2³¹ to 2³¹–1. When the float2fix function returns a value of –2³¹, the value is within range but an overflow exception is incorrectly set. There is no workaround for this overflow exception.