template class xf::dsp::aie::blas::matrix_mult::matrix_mult_graph - 2023.2 English

Vitis Libraries

Release Date
2023-12-20
Version
2023.2 English
#include "matrix_mult_graph.hpp"

Overview

matrix_mult performs a GEneral Matrix Multiply (GEMM), taking two input matrices of configurable dimensions and data type.

These are the templates to configure the Matrix Multiply graph class.

Parameters:

TT_DATA_A

describes the type of the individual data samples of Matrix A that are input to the GEMM function. This is a typename and must be one of the following:

int16, cint16, int32, cint32, float, cfloat.

TT_DATA_B

describes the type of the individual data samples of Matrix B that are input to the GEMM function. This is a typename and must be one of the following:

int16, cint16, int32, cint32, float, cfloat. The following rules apply:

  • must be an integer type if TT_DATA_A is an integer type
  • must be a float type if TT_DATA_A is a float type.
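The pairing rule above can be expressed as a compile-time check. The following is a minimal sketch, not library code: the `*_stub` structs are hypothetical stand-ins for the AIE complex types (the real `cint16`, `cint32` and `cfloat` come from the AIE tool headers), and `types_compatible` is a helper name invented for this illustration.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for the AIE complex sample types. The real cint16,
// cint32 and cfloat definitions come from the AIE tool headers.
struct cint16_stub { std::int16_t real, imag; };
struct cint32_stub { std::int32_t real, imag; };
struct cfloat_stub { float real, imag; };

// Integer family: int16, cint16, int32, cint32.
template <typename T>
struct is_int_family
    : std::integral_constant<bool,
                             std::is_same<T, std::int16_t>::value ||
                             std::is_same<T, cint16_stub>::value ||
                             std::is_same<T, std::int32_t>::value ||
                             std::is_same<T, cint32_stub>::value> {};

// Float family: float, cfloat.
template <typename T>
struct is_float_family
    : std::integral_constant<bool,
                             std::is_same<T, float>::value ||
                             std::is_same<T, cfloat_stub>::value> {};

// True when TT_DATA_A and TT_DATA_B come from the same family.
template <typename A, typename B>
constexpr bool types_compatible() {
    return (is_int_family<A>::value && is_int_family<B>::value) ||
           (is_float_family<A>::value && is_float_family<B>::value);
}

static_assert(types_compatible<std::int16_t, cint32_stub>(),
              "integer with integer is allowed");
static_assert(!types_compatible<std::int32_t, float>(),
              "integer with float is rejected");
```

A check of this kind could sit next to a graph instantiation so that an invalid type pair fails with a readable message.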

TP_DIM_A is an unsigned integer which describes the number of elements along the unique dimension (rows) of Matrix A.

TP_DIM_AB is an unsigned integer which describes the number of elements along the common dimension of Matrix A (columns) and Matrix B (rows).

TP_DIM_B is an unsigned integer which describes the number of elements along the unique dimension (columns) of Matrix B.

TP_SHIFT describes the power-of-2 shift down applied to the accumulation of product terms before each output. TP_SHIFT must be in the range 0 to 61.

TP_RND

describes the selection of rounding to be applied during the shift down stage of processing. Although TP_RND accepts unsigned integer values, descriptive macros are recommended, where:

  • rnd_floor = Truncate LSB, always round down (towards negative infinity).

  • rnd_ceil = Always round up (towards positive infinity).

  • rnd_sym_floor = Truncate LSB, always round towards 0.

  • rnd_sym_ceil = Always round up, away from zero (towards infinity of the same sign).

  • rnd_pos_inf = Round halfway towards positive infinity.

  • rnd_neg_inf = Round halfway towards negative infinity.

  • rnd_sym_inf = Round halfway towards infinity (away from zero).

  • rnd_sym_zero = Round halfway towards zero (away from infinity).

  • rnd_conv_even = Round halfway towards nearest even number.

  • rnd_conv_odd = Round halfway towards nearest odd number.

    No rounding is performed on ceil or floor mode variants.

    Other modes round to the nearest integer. They differ only in how they round values that lie exactly halfway between two integers (a fractional part of 0.5).

    Note: Rounding modes rnd_sym_floor and rnd_sym_ceil are only supported on AIE-ML devices.
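The rounding definitions above can be modelled on a scalar accumulator. The sketch below is an illustration of those definitions only, not the hardware or library implementation: `RoundMode` and `shift_round` are hypothetical names, and the enumerator order is not meant to match the numeric values of the rnd_* macros.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical enum for this sketch only. The enumerator order does not claim
// to match the numeric values of the library's rnd_* macros.
enum class RoundMode {
    Floor, Ceil, SymFloor, SymCeil,
    PosInf, NegInf, SymInf, SymZero, ConvEven, ConvOdd
};

// Scalar model of "shift down by `shift` bits, then round" on an accumulator.
std::int64_t shift_round(std::int64_t acc, unsigned shift, RoundMode mode) {
    assert(shift <= 61);  // TP_SHIFT range given in the text
    if (shift == 0) return acc;

    const std::int64_t unit = std::int64_t{1} << shift;
    // Floor quotient and non-negative remainder, computed without relying on
    // how the compiler shifts negative numbers.
    std::int64_t lower = acc / unit;
    if (acc % unit < 0) --lower;
    const std::int64_t rem = acc - lower * unit;  // 0 <= rem < unit
    const std::int64_t upper = lower + 1;
    const std::int64_t half = unit / 2;
    const bool exact = (rem == 0);
    const bool negative = (acc < 0);

    switch (mode) {
        case RoundMode::Floor:    return lower;
        case RoundMode::Ceil:     return exact ? lower : upper;
        case RoundMode::SymFloor: return (exact || !negative) ? lower : upper;
        case RoundMode::SymCeil:  return (exact || negative) ? lower : upper;
        default: break;
    }

    // The remaining modes round to nearest and differ only on exact ties.
    if (rem < half) return lower;
    if (rem > half) return upper;
    switch (mode) {
        case RoundMode::PosInf:   return upper;
        case RoundMode::NegInf:   return lower;
        case RoundMode::SymInf:   return negative ? lower : upper;
        case RoundMode::SymZero:  return negative ? upper : lower;
        case RoundMode::ConvEven: return (lower % 2 == 0) ? lower : upper;
        default:                  return (lower % 2 != 0) ? lower : upper;  // ConvOdd
    }
}
```

With a shift of 1, an accumulator of 5 represents 2.5 and an accumulator of -5 represents -2.5, which makes these two values convenient for comparing how each mode treats a tie.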

TP_DIM_A_LEADING describes the scheme in which the Matrix A data should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1. Note, a COL_MAJOR matrix can be transposed to become a ROW_MAJOR matrix.

TP_DIM_B_LEADING describes the scheme in which the Matrix B data should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1.

TP_DIM_OUT_LEADING describes the scheme in which the output matrix data should be stored in memory. ROW_MAJOR = 0, COL_MAJOR = 1.

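To make the two storage schemes concrete, the sketch below computes the flat memory index of element (row, col) under each scheme. The constants `kRowMajor` and `kColMajor` and the function `flat_index` are hypothetical names for this illustration; only the values 0 and 1 are taken from the text.

```cpp
#include <cstddef>

// Values taken from the text: ROW_MAJOR = 0, COL_MAJOR = 1. The k-prefixed
// names are hypothetical and avoid clashing with the library's own macros.
constexpr unsigned kRowMajor = 0;
constexpr unsigned kColMajor = 1;

// Flat index of element (row, col) in a rows x cols matrix.
constexpr std::size_t flat_index(std::size_t row, std::size_t col,
                                 std::size_t rows, std::size_t cols,
                                 unsigned leading) {
    return leading == kRowMajor ? row * cols + col   // rows are contiguous
                                : col * rows + row;  // columns are contiguous
}

// Element (0, 1) of a 2 x 3 matrix.
static_assert(flat_index(0, 1, 2, 3, kRowMajor) == 1, "row-major index");
static_assert(flat_index(0, 1, 2, 3, kColMajor) == 2, "column-major index");

// A column-major matrix occupies memory exactly like the row-major storage
// of its transpose (a 3 x 2 matrix addressed at (1, 0)).
static_assert(flat_index(0, 1, 2, 3, kColMajor) ==
                  flat_index(1, 0, 3, 2, kRowMajor),
              "COL_MAJOR equals ROW_MAJOR of the transpose");
```

The last assertion is the transposition remark from the TP_DIM_A_LEADING description written as an index identity.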
TP_ADD_TILING_A

describes whether or not to add an additional kernel to rearrange the matrix samples into their required positions.

Setting this option to 0 indicates that the re-arrangement will be done externally to the AIE matrix multiply graph.

TP_ADD_TILING_B

describes whether or not to add an additional kernel to rearrange the matrix samples into their required positions.

Setting this option to 0 indicates that the re-arrangement will be done externally to the AIE matrix multiply graph.

TP_ADD_DETILING_OUT

describes whether or not to add an additional kernel to rearrange the matrix samples into their required positions.

Setting this option to 0 indicates that the re-arrangement will be done externally to the AIE matrix multiply graph.
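When the tiling options are set to 0, the rearrangement has to happen outside the graph. The sketch below shows one plausible form of such a rearrangement, under loudly stated assumptions: the tile dimensions are free parameters here, tiles are visited row by row, and samples inside each tile are stored row-major. The layout actually required depends on the data types and is defined by the library's tiling scheme, so `tile_row_major` is a hypothetical helper, not a drop-in replacement for the tiler kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rearranges a row-major rows x cols matrix into tile order.
// Assumption of this sketch: tiles are visited row by row, and the samples
// inside each tile are stored row-major.
template <typename T>
std::vector<T> tile_row_major(const std::vector<T>& in,
                              std::size_t rows, std::size_t cols,
                              std::size_t tile_rows, std::size_t tile_cols) {
    assert(in.size() == rows * cols);
    assert(tile_rows > 0 && tile_cols > 0);
    assert(rows % tile_rows == 0 && cols % tile_cols == 0);

    std::vector<T> out;
    out.reserve(in.size());
    for (std::size_t tr = 0; tr < rows; tr += tile_rows)          // tile row
        for (std::size_t tc = 0; tc < cols; tc += tile_cols)      // tile column
            for (std::size_t r = 0; r < tile_rows; ++r)           // row in tile
                for (std::size_t c = 0; c < tile_cols; ++c)       // col in tile
                    out.push_back(in[(tr + r) * cols + (tc + c)]);
    return out;
}
```

For a 4 x 4 matrix holding 0 to 15 and 2 x 2 tiles, the first tile in the output is 0, 1, 4, 5, followed by 2, 3, 6, 7.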

TP_INPUT_WINDOW_VSIZE_A

describes the number of samples in the window API used for input to Matrix A.

It must be of size TP_DIM_A*TP_DIM_AB*N. Typical use has N=1, however N>1 can be utilised to minimise overhead of window API.

This parameter is optional and has a default value of TP_DIM_A*TP_DIM_AB (N=1).

TP_INPUT_WINDOW_VSIZE_B

describes the number of samples in the window API used for input to Matrix B.

It must be of size TP_DIM_B*TP_DIM_AB*M. Typical use has M=1, however M>1 can be utilised to minimise overhead of window API.

This parameter is optional and has a default value of TP_DIM_B*TP_DIM_AB (M=1).

Note, the output window will be of size (TP_INPUT_WINDOW_VSIZE_A/TP_DIM_AB * TP_INPUT_WINDOW_VSIZE_B/TP_DIM_AB). When N and M are both 1, the output window size will be TP_DIM_A * TP_DIM_B.
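The window size relations above are simple arithmetic and can be checked at compile time. In the sketch below the function names are hypothetical helpers, and the dimensions 16, 32 and 8 are arbitrary example values rather than recommended settings.

```cpp
// Samples per Matrix A input window: TP_DIM_A * TP_DIM_AB * N.
constexpr unsigned window_vsize_a(unsigned dim_a, unsigned dim_ab, unsigned n = 1) {
    return dim_a * dim_ab * n;
}

// Samples per Matrix B input window: TP_DIM_B * TP_DIM_AB * M.
constexpr unsigned window_vsize_b(unsigned dim_b, unsigned dim_ab, unsigned m = 1) {
    return dim_b * dim_ab * m;
}

// Samples per output window, using the formula stated in the note above.
constexpr unsigned window_vsize_out(unsigned vsize_a, unsigned vsize_b, unsigned dim_ab) {
    return (vsize_a / dim_ab) * (vsize_b / dim_ab);
}

// Example values: TP_DIM_A = 16, TP_DIM_AB = 32, TP_DIM_B = 8, N = M = 1.
static_assert(window_vsize_a(16, 32) == 512, "default A window");
static_assert(window_vsize_b(8, 32) == 256, "default B window");
static_assert(window_vsize_out(512, 256, 32) == 16 * 8,
              "with N = M = 1 the output window is TP_DIM_A * TP_DIM_B");
```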

TP_CASC_LEN

describes the number of AIE Tiles to split the GEMM operation into.

TP_CASC_LEN splits the operation over TP_DIM_AB, where each kernel utilises the cascade stream to pass partial accumulation results to the next kernel. In effect, each kernel computes dot(A,B) + C, where C is the partial result received from the previous kernel.

Note, it is also possible to tile the operation over multiple AIE tiles by instantiating multiple GEMM graphs with smaller dimensions.
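The cascade split over the common dimension can be illustrated with a scalar reference model. This is a sketch under assumptions, not the kernel code: it uses row-major integer matrices, it assumes TP_DIM_AB is divisible by TP_CASC_LEN, and `gemm_stage` and `gemm_cascaded` are hypothetical names. Each stage adds the products for its own slice of the common dimension to the partial result handed over by the previous stage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Matrix = std::vector<std::int64_t>;  // row-major storage

// One cascade stage: adds the products for the common-dimension slice
// [k_begin, k_end) to the partial result received from the previous stage.
Matrix gemm_stage(const Matrix& a, const Matrix& b, const Matrix& casc_in,
                  std::size_t dim_a, std::size_t dim_ab, std::size_t dim_b,
                  std::size_t k_begin, std::size_t k_end) {
    Matrix acc = casc_in;
    for (std::size_t i = 0; i < dim_a; ++i)
        for (std::size_t j = 0; j < dim_b; ++j)
            for (std::size_t k = k_begin; k < k_end; ++k)
                acc[i * dim_b + j] += a[i * dim_ab + k] * b[k * dim_b + j];
    return acc;
}

// Chains casc_len stages; the last stage holds the complete product.
Matrix gemm_cascaded(const Matrix& a, const Matrix& b,
                     std::size_t dim_a, std::size_t dim_ab, std::size_t dim_b,
                     std::size_t casc_len) {
    assert(a.size() == dim_a * dim_ab && b.size() == dim_ab * dim_b);
    assert(casc_len > 0 && dim_ab % casc_len == 0);  // assumption of this sketch
    const std::size_t slice = dim_ab / casc_len;
    Matrix acc(dim_a * dim_b, 0);  // the first stage has no cascade input
    for (std::size_t s = 0; s < casc_len; ++s)
        acc = gemm_stage(a, b, acc, dim_a, dim_ab, dim_b,
                         s * slice, (s + 1) * slice);
    return acc;
}
```

Because integer addition is associative, the result is the same for every valid cascade length; only the amount of work per stage changes.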

TP_SAT

describes the selection of saturation to be applied during the shift down stage of processing. TP_SAT accepts unsigned integer values, where:

  • 0: none = No saturation is performed and the value is truncated on the MSB side.
  • 1: saturate = Default. Saturation rounds an n-bit signed value to the range [- ( 2^(n-1) ) : +2^(n-1) - 1 ].
  • 3: symmetric = Controls symmetric saturation. Symmetric saturation rounds an n-bit signed value to the range [- ( 2^(n-1) -1 ) : +2^(n-1) - 1 ].
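The three saturation settings can be compared with a scalar model. The sketch below is an illustration only: `SatMode` and `apply_saturation` are hypothetical names (the enumerator values follow the TP_SAT values listed above), and mode 0 is modelled as keeping the low n bits in two's complement, which is one reading of "truncated on the MSB side".

```cpp
#include <cassert>
#include <cstdint>

// Enumerator values follow the TP_SAT values listed above; the enum itself is
// a hypothetical helper for this sketch.
enum SatMode : unsigned { kSatNone = 0, kSatSaturate = 1, kSatSymmetric = 3 };

// Scalar model of fitting `value` into an n-bit signed result.
std::int64_t apply_saturation(std::int64_t value, unsigned n_bits, SatMode mode) {
    assert(n_bits >= 2 && n_bits <= 62);
    const std::int64_t max_val = (std::int64_t{1} << (n_bits - 1)) - 1;
    const std::int64_t min_val = -max_val - 1;

    if (mode == kSatNone) {
        // Keep the low n bits and reinterpret them as two's complement.
        const std::uint64_t mask = (std::uint64_t{1} << n_bits) - 1;
        std::int64_t low =
            static_cast<std::int64_t>(static_cast<std::uint64_t>(value) & mask);
        if (low > max_val) low -= (std::int64_t{1} << n_bits);
        return low;
    }

    // Modes 1 and 3 clamp; they differ only in the most negative value.
    const std::int64_t lowest = (mode == kSatSymmetric) ? -max_val : min_val;
    if (value < lowest) return lowest;
    if (value > max_val) return max_val;
    return value;
}
```

For a 16-bit result, a value of -40000 becomes -32768 with mode 1 and -32767 with mode 3, while mode 0 wraps it around to 25536.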
template <
    typename TT_DATA_A,
    typename TT_DATA_B,
    unsigned int TP_DIM_A,
    unsigned int TP_DIM_AB,
    unsigned int TP_DIM_B,
    unsigned int TP_SHIFT,
    unsigned int TP_RND,
    unsigned int TP_DIM_A_LEADING = ROW_MAJOR,
    unsigned int TP_DIM_B_LEADING = COL_MAJOR,
    unsigned int TP_DIM_OUT_LEADING = ROW_MAJOR,
    unsigned int TP_ADD_TILING_A = 1,
    unsigned int TP_ADD_TILING_B = 1,
    unsigned int TP_ADD_DETILING_OUT = 1,
    unsigned int TP_INPUT_WINDOW_VSIZE_A = TP_DIM_A * TP_DIM_AB,
    unsigned int TP_INPUT_WINDOW_VSIZE_B = TP_DIM_B * TP_DIM_AB,
    unsigned int TP_CASC_LEN = 1,
    unsigned int TP_SAT = 1
    >
class matrix_mult_graph: public graph

// typedefs

typedef matrix_mult <TT_DATA_A, TT_DATA_B, TP_DIM_A, (TP_DIM_AB/TP_CASC_LEN), TP_DIM_B, TP_SHIFT, TP_RND, TP_SAT, TP_DIM_A_LEADING, TP_DIM_B_LEADING, TP_DIM_OUT_LEADING, (TP_INPUT_WINDOW_VSIZE_A/TP_CASC_LEN), (TP_INPUT_WINDOW_VSIZE_B/TP_CASC_LEN), cascIn, cascOut> matMultCasc
typedef typename std::conditional < (TP_CASC_LEN==1), matMultCasc <false, false>, no_kernel>::type onlyMatMult
typedef typename std::conditional < (TP_CASC_LEN > 1), matMultCasc <false, true>, onlyMatMult>::type firstMatMult
typedef typename std::conditional < (TP_CASC_LEN > 1), matMultCasc <true, false>, firstMatMult>::type lastMatMult
typedef typename std::conditional < (TP_CASC_LEN > 2), matMultCasc <true, true>, lastMatMult>::type middleMatMult
typedef tilerKernelClass <tilingScheme.Atile, tilingScheme.ABtile, dimAPerKernel, (TP_DIM_AB/TP_CASC_LEN), TP_DIM_A_LEADING, TT_DATA_A> TilerClassA
typedef tilerKernelClass <tilingScheme.ABtile, tilingScheme.Btile, (TP_DIM_AB/TP_CASC_LEN), dimBPerKernel, TP_DIM_B_LEADING, TT_DATA_B> TilerClassB
typedef untilerKernelClass <tilingScheme.Atile, tilingScheme.Btile, dimAPerKernel, dimBPerKernel, TP_DIM_OUT_LEADING, outType_t <TT_DATA_A, TT_DATA_B>> DetilerClassOut

// structs

struct no_kernel

// fields

port <input> inA[TP_CASC_LEN]
port <input> inB[TP_CASC_LEN]
port <output> out[1]
kernel m_MatmultKernels[TP_CASC_LEN]
kernel untiler
kernel tilerA[TP_CASC_LEN]
kernel tilerB[TP_CASC_LEN]
static constexpr middleMatMult::tilingStruct tilingScheme
static constexpr unsigned int dimAPerKernel
static constexpr unsigned int dimBPerKernel
static constexpr bool isRedundantTilerA
static constexpr bool isRedundantTilerB
static constexpr bool isRedundantTilerOut

Fields

port <input> inA [TP_CASC_LEN]

The input A data to the function. This input is a window of samples of TT_DATA_A type. The number of samples in the window is described by TP_INPUT_WINDOW_VSIZE_A, which is derived from TP_DIM_A and TP_DIM_AB.

port <input> inB [TP_CASC_LEN]

The input B data to the function. This input is a window of samples of TT_DATA_B type. The number of samples in the window is described by TP_INPUT_WINDOW_VSIZE_B, which is derived from TP_DIM_AB and TP_DIM_B.

port <output> out [1]

A window API of TP_INPUT_WINDOW_VSIZE_A/TP_DIM_AB * TP_INPUT_WINDOW_VSIZE_B/TP_DIM_AB samples, or simply TP_DIM_A * TP_DIM_B samples of a derived output type.

kernel m_MatmultKernels [TP_CASC_LEN]

The array of kernels that will be created and mapped onto AIE tiles. The TP_CASC_LEN kernels will be connected to each other through the cascade interface.
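The typedefs onlyMatMult, firstMatMult, middleMatMult and lastMatMult in the class synopsis select the cascIn and cascOut flags according to a kernel's position in the chain. The sketch below restates that selection as a small function; `CascPorts` and `casc_ports` are hypothetical names, and the mapping is inferred from the template arguments shown in those typedefs.

```cpp
// Cascade flags of one kernel, mirroring the <cascIn, cascOut> template
// arguments that appear in the typedefs of the class synopsis.
struct CascPorts {
    bool casc_in;   // kernel receives a partial result from its predecessor
    bool casc_out;  // kernel passes a partial result to its successor
};

// Flags for the kernel at position `index` in a chain of `casc_len` kernels.
constexpr CascPorts casc_ports(unsigned index, unsigned casc_len) {
    return CascPorts{index > 0, index + 1 < casc_len};
}

// Single kernel: <false, false>.
static_assert(!casc_ports(0, 1).casc_in && !casc_ports(0, 1).casc_out,
              "only kernel");
// Chain of three: first <false, true>, middle <true, true>, last <true, false>.
static_assert(!casc_ports(0, 3).casc_in && casc_ports(0, 3).casc_out,
              "first kernel");
static_assert(casc_ports(1, 3).casc_in && casc_ports(1, 3).casc_out,
              "middle kernel");
static_assert(casc_ports(2, 3).casc_in && !casc_ports(2, 3).casc_out,
              "last kernel");
```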

kernel untiler

The kernel that will be created when output tiling is enabled ( TP_ADD_DETILING_OUT = 1 ).

kernel tilerA [TP_CASC_LEN]

The array of kernels that will be created when tiling on input A is enabled ( TP_ADD_TILING_A = 1 ). The kernels will pre-process the data and send it through the cascade interface to the corresponding m_MatmultKernels.

kernel tilerB [TP_CASC_LEN]

The array of kernels that will be created when tiling on input B is enabled ( TP_ADD_TILING_B = 1 ). The kernels will pre-process the data and send it through the cascade interface to the corresponding m_MatmultKernels.

Methods

getKernels

kernel* getKernels ()

Access function to get pointer to kernel (or first kernel in a chained configuration).

matrix_mult_graph

matrix_mult_graph ()

This is the constructor function for the Matrix Multiply graph.