AI Engine-ML Programming - 2023.2 English

Vitis Tutorials: AI Engine

Version: Vitis 2023.2


Xilinx introduced the Versal™ AI Edge series, designed to enable AI innovation from the edge to the endpoint. This new series is mainly based on the AI Engine-ML, which delivers 4X the machine learning compute of the previous AI Engine architecture and integrates new accelerator RAM with an enhanced memory hierarchy for evolving AI algorithms.

IMPORTANT: Before beginning the tutorial, make sure you have installed the Vitis 2023.2 software. The Vitis release includes all the embedded base platforms, including the VEK280 base platform that is used in this tutorial. In addition, ensure you have downloaded the Common Images for Embedded Vitis Platforms from this link. The ‘common image’ package contains a prebuilt Linux kernel and root file system that can be used with the Versal board for embedded design development using Vitis. Before starting this tutorial, run the following steps:

  1. Go to the directory where you have unzipped the Versal Common Image package.

  2. In a Bash shell run the /Common Images Dir/xilinx-versal-common-v2023.2/environment-setup-cortexa72-cortexa53-xilinx-linux script. This script sets up the SDKTARGETSYSROOT and CXX variables. If the script is not present, you must run the /Common Images Dir/xilinx-versal-common-v2023.2/

  3. Set up your ROOTFS and IMAGE variables to point to the rootfs.ext4 and Image files located in the /Common Images Dir/xilinx-versal-common-v2023.2 directory.

  4. Set up your PLATFORM_REPO_PATHS environment variable to $XILINX_VITIS/lin64/Vitis/2023.2/base_platforms. This tutorial targets the VEK280 es1 board for the 2023.2 version.

Data generation for this tutorial requires Python 3. The following packages are required:

  • math

  • sys

  • numpy

  • random


After completing this tutorial, you will be able to:

  • Understand the differences between AI Engine and AI Engine-ML architecture.

  • Declare and use shared buffers (memory tiles).

  • Declare and use external buffers (external memory).

  • Program buffer descriptors using tiling parameters.

This tutorial is based on matrix multiplication, which is a common algorithm in Machine Learning applications.

Prerequisite knowledge

To follow this tutorial, you need to understand the architecture of the AI Engine-ML as well as the art of buffer descriptor programming:

  • AI Engine-ML Architecture: AM020

  • Programming Buffer Descriptors with Tiling parameters: UG1603

A short introduction to AI Engine-ML architecture is available here.

The various memory levels contain DMAs used to transfer data to and from memory or Programmable Logic. These DMAs use Buffer Descriptors (BDs) that contain the parameters of these transfers. The best way to program these BDs is to use Tiling Parameters, which are introduced here.

Matrix Multiplication

Matrix multiplication is a very common algorithm that can be found in numerous standard applications. The basic equation is:

$$ C = A \cdot B $$
$$ c_{ij} = \sum_{k=0}^{K-1} a_{ik} \, b_{kj}, \qquad 0 \leq i < M, \quad 0 \leq j < N $$

Matrix Multiplication
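
As a point of reference, the equation above can be written as a plain triple loop. This is only a minimal Python sketch for checking results on small matrices; the function name `matmul` and the list-of-rows representation are illustrative choices, not part of the tutorial sources.

```python
def matmul(a, b):
    """Multiply an M x K matrix by a K x N matrix, both given as lists of rows."""
    m, k_dim, n = len(a), len(b), len(b[0])
    assert all(len(row) == k_dim for row in a), "inner dimensions must match"
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # c_ij is the sum over k of a_ik * b_kj
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(k_dim))
    return c


a = [[1, 2, 3],
     [4, 5, 6]]          # M = 2, K = 3
b = [[7, 8],
     [9, 10],
     [11, 12]]           # K = 3, N = 2
print(matmul(a, b))      # [[58, 64], [139, 154]]
```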

Natural storage for a matrix is row by row: all columns of row 0 are stored sequentially in memory, then row 1, and so on up to the last row of the matrix. In the following image, the index in each box shows the increasing address:

Matrix Storage
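
The storage order described above, with one row stored after the other, maps an element position to a linear address with a single multiply and add. The sketch below illustrates this in Python; `linear_address` and `flatten` are hypothetical helper names used only for this illustration.

```python
def linear_address(row, col, num_cols):
    """Address of element (row, col) when rows are stored one after the other."""
    return row * num_cols + col


def flatten(matrix):
    """Lay out a matrix (list of rows) in memory order."""
    return [value for row in matrix for value in row]


# In a matrix with 64 columns, each new row starts 64 addresses further on.
print(linear_address(0, 8, 64))   # 8
print(linear_address(1, 8, 64))   # 72
print(linear_address(3, 8, 64))   # 200
print(flatten([[1, 2], [3, 4]]))  # [1, 2, 3, 4]
```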

Taking advantage of AI Engine-ML architecture

The AI Engine-ML has specific hardware instructions for matrix multiplications. Depending on the bitwidth of the operands, various matrix sizes are supported. In the following table the notation MxKxN means that matrix multiplication with a first operand of size M rows x K columns and a second operand of size K rows x N columns is supported.

Matrix Multiplication modes for real types

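| 8b x 4b | 8b x 8b | 16b x 8b | 8b x 16b | 16b x 16b | 32b x 16b | 16b x 32b | 32b x 32b | bfloat16 x bfloat16 |
|---------|---------|----------|----------|-----------|-----------|-----------|-----------|---------------------|
| 4x16x8  | 4x8x4   | 4x4x4    | 4x4x8    | 4x4x4     | 2x4x8     | 2x4x8     | 4x2x4     | 4x8x4               |
| 8x16x8  | 4x16x4  | 8x4x4    | 4x4x4    | 2x4x8     | 4x4x4     | 4x4x4     | 4x2x4     |                     |
| 4x32x8  | 8x8x4   | 4x8x4    |          | 4x4x8     | 4x2x4     |           | 8x2x4     |                     |
|         | 2x8x8   | 4x4x8    |          | 4x2x8     |           |           |           |                     |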

Matrix Multiplication modes for complex types

c16b x 16b c16b x c16b c32b x c16b c32b x c32b
2x4x8 1x4x8 1x2x4 1x2x8
4x4x4 1x2x8

In the example developed in this tutorial the 3 matrices A, B and C are all 64x64 with 8-bit data:

$$A_{64 \times 64} \cdot B_{64 \times 64} = C_{64 \times 64}$$

The mode 4x16x8 will be used, so we need to decompose matrix A into 4x16 sub-matrices and matrix B into 16x8 sub-matrices in order to compute C using 4x8 sub-results:

Matrix Multiplication using sub-matrices
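
The decomposition into sub-matrices can be checked in software before moving to hardware. The following Python sketch accumulates 4x16 by 16x8 block products into 4x8 blocks of C and compares the result with a plain triple loop. It is a functional model only: the function names are illustrative, the random data is an assumption, and it says nothing about how the AI Engine-ML instructions are actually scheduled.

```python
import random


def matmul_ref(a, b):
    """Plain triple-loop reference on lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matmul_blocked(a, b, sub_m=4, sub_k=16, sub_n=8):
    """Compute C = A.B one (sub_m x sub_k) by (sub_k x sub_n) block product at a time."""
    m, k_dim, n = len(a), len(b), len(b[0])
    assert m % sub_m == 0 and k_dim % sub_k == 0 and n % sub_n == 0
    c = [[0] * n for _ in range(m)]
    for bi in range(0, m, sub_m):              # block rows of C
        for bj in range(0, n, sub_n):          # block columns of C
            for bk in range(0, k_dim, sub_k):  # blocks along the shared dimension
                # Accumulate one block product into the sub_m x sub_n block of C.
                for i in range(bi, bi + sub_m):
                    for j in range(bj, bj + sub_n):
                        acc = 0
                        for k in range(bk, bk + sub_k):
                            acc += a[i][k] * b[k][j]
                        c[i][j] += acc
    return c


random.seed(0)
size = 64
a = [[random.randint(-128, 127) for _ in range(size)] for _ in range(size)]
b = [[random.randint(-128, 127) for _ in range(size)] for _ in range(size)]
print(matmul_blocked(a, b) == matmul_ref(a, b))  # True
```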

To use these matrix multiplication modes, one sub-matrix must be stored in a register and the other sub-matrix in another register. Unfortunately, when an AI Engine-ML reads memory, it reads 256 contiguous bits. With the natural storage order, multiple reads would be necessary to gather a sub-matrix of the right size. A solution is to re-arrange the data so that each sub-matrix occupies contiguous memory addresses. The ADF graph API provides a very handy way to perform such data reordering.

Let’s first look at the architecture chosen for this small matrix multiplication application:

Block Diagram

Multiple A and B matrices are stored in DDR and copied to a memory tile using ping-pong buffering. These matrices are then copied to AI Engine-ML memory, again using ping-pong buffering. The kernel operates on the two stored matrices to compute the output matrix C, which is copied to a memory tile and then to DDR. Data reordering can be done either between DDR and the memory tile, or between the memory tile and AI Engine-ML memory. This tutorial uses the latter.

The goal of the reordering is to place the sub-matrices needed by the block-based matrix multiplication at adjacent addresses. Because the resulting matrix C is computed one block row at a time, the sub-blocks of matrix A are stored row by row and those of matrix B column by column. Computing the first block row of C requires reading the first block row of A 8 times (once per block column of C) and the full matrix B, block column by block column.
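
The read counts quoted above follow directly from the block sizes. The Python sketch below counts, for one block row of C, how often each block of A and B is needed, using the default sizes of this tutorial (64x64 matrices, 4x16x8 mode). The function name and the counting model are illustrative assumptions, not tutorial code.

```python
from collections import Counter


def reads_for_one_c_block_row(size_k=64, size_n=64, sub_k=16, sub_n=8):
    """Count block reads needed to compute a single block row of C."""
    a_reads = Counter()   # keyed by block column index of A within the block row
    b_reads = Counter()   # keyed by (block row, block column) of B
    for bj in range(size_n // sub_n):        # each block of the C block row
        for bk in range(size_k // sub_k):    # blocks along the shared dimension
            a_reads[bk] += 1
            b_reads[(bk, bj)] += 1
    return a_reads, b_reads


a_reads, b_reads = reads_for_one_c_block_row()
print(dict(a_reads))                          # {0: 8, 1: 8, 2: 8, 3: 8}
print(len(b_reads), set(b_reads.values()))    # 32 {1}
```

Each of the 4 blocks in the block row of A is read 8 times, while each of the 32 blocks of B is read once.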

First, each block must be extracted by the memory tile DMA and stored in the AI Engine-ML memory. The tiling has to occur when reading from the memory tile because it is currently not possible to specify a read or write access pattern for the AI Engine-ML memory.


The first block, at the top left of the picture, is extracted first and stored row by row in the AI Engine-ML memory. The second block, starting with the column vector (8, 72, 136, 200), is then extracted from the memory tile and stored in the AI Engine-ML memory. This finally yields the following re-arrangement of the data:
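
The re-arrangement can be modelled in a few lines of Python. The sketch below assumes a matrix with 64 columns cut into blocks of 4 rows by 8 columns, which is what the column vector (8, 72, 136, 200) quoted above suggests; the 8-row matrix height and the function name `reorder_into_tiles` are illustrative assumptions.

```python
def reorder_into_tiles(flat, num_rows, num_cols, tile_rows, tile_cols):
    """Return the elements tile by tile (left to right, then top to bottom),
    with each tile stored row by row."""
    assert num_rows % tile_rows == 0 and num_cols % tile_cols == 0
    out = []
    for tile_row in range(0, num_rows, tile_rows):
        for tile_col in range(0, num_cols, tile_cols):
            for row in range(tile_row, tile_row + tile_rows):
                start = row * num_cols + tile_col
                out.extend(flat[start:start + tile_cols])
    return out


num_rows, num_cols = 8, 64
addresses = list(range(num_rows * num_cols))   # value = original address
tiled = reorder_into_tiles(addresses, num_rows, num_cols, tile_rows=4, tile_cols=8)

first_tile, second_tile = tiled[0:32], tiled[32:64]
print(first_tile[0::8])    # first column of the first block:  [0, 64, 128, 192]
print(second_tile[0::8])   # first column of the second block: [8, 72, 136, 200]
```

After reordering, the 32 elements of each block sit at consecutive addresses, so a block can be fetched with contiguous reads.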


AI Engine-ML code analysis

This tutorial is built so that you can easily change the matrix and sub-matrix sizes. With matrix A of size (M,K) and matrix B of size (K,N), the resulting matrix C has size (M,N). The Makefile sets the default of each of these sizes (sizeM, sizeK, sizeN) to 64. The size of the sub-matrices used by the AIE API is also defined (subM, subK, subN). All these values can be overridden on the make command line.

In this part we focus on a straightforward implementation of the matrix multiply, which is selected by the macro OPTIMIZED_SOURCE = 0. The make command is invoked using make OPT=0 ....

# Default values for A, B, C matrix sizes
# A:MxK    B:KxN    C:MxN
sizeM ?= 64
sizeK ?= 64
sizeN ?= 64

# Default for A, B and C sub matrices
# 4x16x8
subM ?= 4
subK ?= 16
subN ?= 8

#Default Number of iterations
NIterations ?= 16

The system_settings.h header file defines all the sizes that will be used internally by the kernel:

// Multiply 2 matrices   (MxK) x (KxN)
#define A_ROWS sizeM
#define A_COLS sizeK

#define B_ROWS A_COLS
#define B_COLS sizeN

#define C_ROWS A_ROWS
#define C_COLS B_COLS

// Non Sparse Tiling: 4x16x8
#define ATILES_ROWS_NS subM
#define ATILES_COLS_NS subK
#define BTILES_COLS_NS subN
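
When overriding the sizes, it can help to check how many sub-matrices each matrix splits into. The Python sketch below mirrors the Makefile variable names; the requirement that each size be a multiple of the corresponding sub-matrix size is an assumption made here (the block counts are computed with integer division), and the function `tile_counts` is hypothetical.

```python
def tile_counts(sizeM=64, sizeK=64, sizeN=64, subM=4, subK=16, subN=8):
    """Return (block rows, block columns) for matrices A, B and C."""
    for name, size, sub in (("M", sizeM, subM), ("K", sizeK, subK), ("N", sizeN, subN)):
        if size % sub != 0:
            # Assumption: sizes must be whole multiples of the sub-matrix sizes.
            raise ValueError(f"size{name}={size} is not a multiple of sub{name}={sub}")
    return {
        "A": (sizeM // subM, sizeK // subK),
        "B": (sizeK // subK, sizeN // subN),
        "C": (sizeM // subM, sizeN // subN),
    }


print(tile_counts())  # {'A': (16, 4), 'B': (4, 8), 'C': (16, 8)}
```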

As explained in the previous section, the matrices are transferred from DDR to the memory tile without any change, and then from the memory tile to AI Engine-ML memory with a reordering of the data that makes them easier for the kernel to read.

Even though the write access pattern to the memory tile on the input side and the read access pattern on the output side are just linear contiguous addressing, they need to be specified in the graph. All these tiling parameters are defined in the file tiling_parameters.h. Let’s look at the parameters for the input matrix A:

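adf::tiling_parameters WriteAns_pattern = {  // other fields (buffer_dimension, tiling_dimension, offset) are not shown in this excerpt
        .tile_traversal={
                {.dimension=1, .stride=1, .wrap=A_ROWS}
        }
};

adf::tiling_parameters ReadAns_pattern = {
        .tile_traversal={
                {.dimension=0, .stride=ATILES_COLS_NS, .wrap=A_COLS/ATILES_COLS_NS},
                {.dimension=1, .stride=ATILES_ROWS_NS, .wrap=A_ROWS/ATILES_ROWS_NS}
        }
};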

The matrix is a 2D data set: dimension 0 indexes the columns and dimension 1 indexes the rows. When writing to the memory tile, data is stored row by row in memory (dimension 0 varies fastest). The read access of matrix A is completely different: the data is read block by block, each block being a sub-matrix of the size expected by the matrix multiplication of the API, and the blocks are traversed along dimension 0 first, then dimension 1. For matrix B it is the same, except that the blocks are traversed along dimension 1 first, then dimension 0. Matrix C is written block by block, traversing dimension 0 first, then dimension 1. The following animated GIF shows the order in which the various A, B and C blocks are read from and written to memory.
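
To make the traversal order concrete, here is a simplified Python model of the read pattern for matrix A. It is a sketch under stated assumptions, not the ADF implementation: the first traversal entry is treated as the innermost loop, the buffer dimensions are taken as (A_COLS, A_ROWS) and the tile dimensions as (ATILES_COLS_NS, ATILES_ROWS_NS), and offsets or any other fields are ignored. The dictionary keys simply mirror the field names shown above.

```python
def tile_origins(tile_traversal):
    """Yield (dim0, dim1) tile origins; entry 0 is assumed to be the innermost loop."""
    def walk(level, origin):
        if level < 0:
            yield tuple(origin)
            return
        step = tile_traversal[level]
        for i in range(step["wrap"]):
            nxt = list(origin)
            nxt[step["dimension"]] += i * step["stride"]
            yield from walk(level - 1, nxt)
    yield from walk(len(tile_traversal) - 1, [0, 0])


def read_order(buffer_dimension, tiling_dimension, tile_traversal):
    """Linear addresses in the order they are read, one tile at a time."""
    num_cols = buffer_dimension[0]
    order = []
    for col0, row0 in tile_origins(tile_traversal):
        for row in range(row0, row0 + tiling_dimension[1]):
            for col in range(col0, col0 + tiling_dimension[0]):
                order.append(row * num_cols + col)
    return order


A_ROWS, A_COLS = 64, 64
ATILES_ROWS_NS, ATILES_COLS_NS = 4, 16
read_a = read_order(
    buffer_dimension=(A_COLS, A_ROWS),                  # assumed value
    tiling_dimension=(ATILES_COLS_NS, ATILES_ROWS_NS),  # assumed value
    tile_traversal=[
        {"dimension": 0, "stride": ATILES_COLS_NS, "wrap": A_COLS // ATILES_COLS_NS},
        {"dimension": 1, "stride": ATILES_ROWS_NS, "wrap": A_ROWS // ATILES_ROWS_NS},
    ],
)
print(read_a[:3], read_a[16:18], read_a[64:66])  # [0, 1, 2] [64, 65] [16, 17]
```

In this model the first 4x16 block is read row by row (addresses 0 to 15, then 64 to 79, and so on), and the next block starts at column 16 of the same rows.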