# Compute Optimization - 2023.2 English

Document ID: XD100
Release Date: 2023-11-29
Version: 2023.2 English

## AI Engine-ML Matrix Multiplication Instruction Set

The AI Engine-ML has dedicated hardware instructions for matrix multiplication. Depending on the bit width of the operands, various matrix sizes are supported. In the following tables, the notation `MxKxN` means that a matrix multiplication with a first operand of size M rows x K columns and a second operand of size K rows x N columns is supported.

Matrix Multiplication modes for real types

| 8b x 4b | 8b x 8b | 16b x 8b | 8b x 16b | 16b x 16b | 32b x 16b | 16b x 32b | 32b x 32b | bfloat16 x bfloat16 |
|---------|---------|----------|----------|-----------|-----------|-----------|-----------|---------------------|
| 4x16x8  | 4x8x4   | 4x4x4    | 4x4x8    | 4x4x4     | 2x4x8     | 2x4x8     | 4x2x4     | 4x8x4               |
| 8x16x8  | 4x16x4  | 8x4x4    | 4x4x4    | 2x4x8     | 4x4x4     | 4x4x4     | 4x2x4     |                     |
| 4x32x8  | 8x8x4   | 4x8x4    |          | 4x4x8     | 4x2x4     | 8x2x4     |           |                     |
|         | 2x8x8   | 4x4x8    |          | 4x2x8     |           |           |           |                     |
|         | 4x8x8   |          |          |           |           |           |           |                     |
|         | 2x16x8  |          |          |           |           |           |           |                     |
|         | 4x16x8  |          |          |           |           |           |           |                     |

Matrix Multiplication modes for complex types

| c16b x 16b | c16b x c16b | c32b x c16b | c32b x c32b |
|------------|-------------|-------------|-------------|
| 2x4x8      | 1x4x8       | 1x2x4       | 1x2x8       |
| 4x4x4      | 1x2x8       |             |             |
|            | 2x2x8       |             |             |
|            | 1x4x8       |             |             |
|            | 2x4x8       |             |             |
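As a quick illustration of the `MxKxN` notation, the sketch below (plain Python, purely illustrative, not AI Engine code) computes the 4x16x8 shape: a 4x16 first operand times a 16x8 second operand yields a 4x8 result, for M x K x N = 512 multiply-accumulates in total.

```python
# The MxKxN notation: C (M x N) = A (M x K) times B (K x N).
# Example: the 4x16x8 mode.
M, K, N = 4, 16, 8
A = [[1] * K for _ in range(M)]          # 4x16 first operand
B = [[1] * N for _ in range(K)]          # 16x8 second operand
C = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
     for i in range(M)]
print(len(C), len(C[0]))                 # a 4x8 result; M*K*N = 512 MACs
```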

## IO or Compute bound?

Supporting a matrix multiply of a given size is one thing; verifying that the 2 operand loads, the result store, and the compute are equally well optimized is another.

A complete table of matrix multiply efficiency, including the matrix loads and the vector compute, is available here: Performance Table

### Example 1

For example, let's take the first entry of the table, which is 8b x 4b with a matrix size of 4x16x8:

• The sub-matrix A is of size 4x16 on 8 bits, which is 512 bits: 2 clock cycles are necessary to load it.

• The sub-matrix B is of size 16x8 on 4 bits, which is 512 bits: 2 clock cycles are necessary to load it.

• The sub-matrix C is of size 4x8 on 16 or 32 bits, which is 512 or 1024 bits: 2 or 4 clock cycles are necessary to store it.

• Finally, 512 MACs must be performed for this matrix, which can be done in 1 clock cycle.

The overall efficiency is 50% (results in 16 bits) or 25% (results in 32 bits): 2 or 4 clock cycles for load/store versus 1 clock cycle for the compute.
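The cycle counts above follow directly from the operand sizes. A small script (illustrative only) reproduces the arithmetic under the assumptions stated in the text: each load/store port moves 256 bits per cycle (so 512 bits take 2 cycles), A and B are loaded on separate ports in parallel, and this mode completes its 512 MACs in a single cycle.

```python
# Cycle estimate for the 8b x 4b, 4x16x8 mode (Example 1).
BITS_PER_CYCLE = 256     # assumed bits moved per load/store port per cycle

def cycles(bits):
    return bits // BITS_PER_CYCLE

M, K, N = 4, 16, 8
a_cycles = cycles(M * K * 8)   # A: 4x16 on 8 bits = 512 bits -> 2 cycles
b_cycles = cycles(K * N * 4)   # B: 16x8 on 4 bits = 512 bits -> 2 cycles
compute_cycles = 1             # 512 MACs in one cycle for this mode

for out_bits in (16, 32):
    c_cycles = cycles(M * N * out_bits)          # store of the 4x8 C tile
    bottleneck = max(a_cycles, b_cycles, c_cycles)
    efficiency = compute_cycles / bottleneck
    print(f"{out_bits}-bit output: {bottleneck} IO cycles, "
          f"efficiency {efficiency:.0%}")
```

With a 16-bit result the load/store path takes 2 cycles against 1 compute cycle (50%); with a 32-bit result the store stretches to 4 cycles (25%), matching the figures above.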

### Tutorial Example

In this tutorial, the matrix sizes are the same, but the input data type is `int8` for both A and B matrices, and the output data type can be either `int16` or `int32`.

• The sub-matrix A is of size 4x16 on 8 bits, which is 512 bits: 2 clock cycles are necessary to load it.

• The sub-matrix B is of size 16x8 on 8 bits, which is 1024 bits: 4 clock cycles are necessary to load it.

• The sub-matrix C is of size 4x8 on 16 or 32 bits, which is 512 or 1024 bits: 2 or 4 clock cycles are necessary to store it, once every 4 sub-matrix multiplication-accumulations.

• Finally, 512 MACs must be performed for this matrix, which can be done in 2 clock cycles (256 int8 x int8 multiplication-accumulations can be performed each cycle).

The overall maximum efficiency is 50%: The limitation comes from the load operation of the B sub-matrix.

A simple way to balance load/compute/store operations is to load 2 A sub-matrices and 1 B sub-matrix, performing 2 multiplication-accumulations for each loaded B.

## Code analysis

In this new version of the kernel, we want to load 2 A sub-matrices while loading a single B sub-matrix. The 2 A sub-matrices must belong to the same tile column so that they are multiplied by the same B sub-matrix.

The simplest choice is to take 2 A tiles just one above the other and multiply them by the same B sub-matrix. On the C side, the 2 computed tiles will also be just one above the other.

In order to avoid too many pointer manipulations, the A tiles are read 2 by 2 from the Memory Tile so that they are stored right next to each other in AI Engine-ML memory. B tiles are read as in the previous basic solution. Similarly to A, C tiles are stored side by side in AI Engine-ML memory and are reorganized when copied back into the Memory Tile.

This approach offloads the pointer manipulation to the DMA programming, freeing some scalar processor cycles.
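As a sketch of this "super tile" ordering (illustrative only, not the tutorial's actual DMA descriptor programming), the function below lists A-tile coordinates in a delivery order where each vertically stacked pair lands contiguously in AI Engine-ML memory; the traversal across tile columns is a hypothetical choice.

```python
# Illustrative super-tile ordering: A tiles are fetched from the Memory
# Tile two at a time, one above the other, so each pair is contiguous
# in AI Engine-ML memory.
def super_tile_order(tile_rows, tile_cols):
    """Return (row, col) tile coordinates in delivery order:
    pairs stacked two-high, then moving across columns (hypothetical)."""
    order = []
    for r in range(0, tile_rows, 2):       # step by super tiles of 2 rows
        for c in range(tile_cols):
            order.append((r, c))           # upper sub-matrix
            order.append((r + 1, c))       # lower sub-matrix, right after
    return order

print(super_tile_order(4, 2))
```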

The next 2 animated GIFs show how the A matrix is read from the Memory Tile and how the C matrix is written back to it. You can see that I chose to have super tiles consisting of 2 sub-matrices, one above the other: