The GEMM kernel here implements the operation C = A * B + X, where A, B, X and C are matrices. The kernel is composed of three major parts: data movers, transpose and buffers, and a systolic array, as shown in the figure below.
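As a functional reference for the operation only (it says nothing about the kernel's hardware dataflow), the following is a minimal sketch in plain Python. The function name `gemm_reference` and the list-of-rows matrix layout are assumptions made for illustration and are not part of the kernel's interface. It assumes A is M x K, B is K x N, and X is M x N, so that C is M x N.

```python
def gemm_reference(a, b, x):
    """Return C = A * B + X for matrices given as lists of rows.

    Hypothetical helper for illustration; not the kernel's API.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])

    # The inner dimensions of A and B must agree.
    if len(b) != k:
        raise ValueError("A has %d columns but B has %d rows" % (k, len(b)))
    # X must have the same shape as the product A * B.
    if len(x) != m or any(len(row) != n for row in x):
        raise ValueError("X must be %d x %d" % (m, n))

    # Start from a copy of X, then accumulate the products on top of it.
    c = [[x[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c
```

For example, `gemm_reference([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[1, 1], [1, 1]])` returns `[[20, 23], [44, 51]]`. A reference like this can be useful for checking a kernel's output on small inputs.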