Reducing Window Buffer Sizes for Very High Memory Density Designs - 2021.2 English

Versal ACAP AI Engine Programming Environment User Guide (UG1076)

Document ID: UG1076
Release Date: 2021-12-17
Version: 2021.2 English

A key consideration when choosing window sizes for a design is balancing the number of cycles required to load the data against the number of compute cycles required by the kernel. When the two are balanced, loading the ping and pong buffers can be pipelined with the kernel compute. For very high memory density designs, prefer smaller window sizes that still balance the kernel compute, because larger window sizes might lead to mapper failure.

The following table shows the number of cycles required to multiply two matrices of 16-bit data. Example 1 and Example 2 use different matrix sizes, but in both cases the compute and data loading cycles are balanced. Note that only the larger of the A and B matrices determines the data loading time, whereas the kernel compute time is determined by both sizes. Example 1 uses smaller window sizes than Example 2, yet its compute and data loading are still balanced and can be pipelined.
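The balance condition described above can be written out as a short derivation. This is a sketch, not a formula from the guide: the symbols M, K, N (matrix A is M x K, matrix B is K x N) and b (bits per element) are notation introduced here for illustration, and the rates of 32 multiplications per cycle and 32 bits per cycle are the ones stated in the table header.

```latex
\begin{align*}
C_{\text{compute}} &= \frac{M \cdot K \cdot N}{32}, \\
C_{\text{load}} &= \frac{\max(M \cdot K,\; K \cdot N) \cdot b}{32}, \\
C_{\text{compute}} = C_{\text{load}}
  &\iff M \cdot N = \max(M, N) \cdot b
  \iff \min(M, N) = b .
\end{align*}
```

Under these assumptions, with 16-bit data (b = 16) the two cycle counts match whenever the smaller of M and N is 16. Both examples in the table satisfy this, which is why growing matrix B from 64x16 to 64x32 doubles both the compute and the data loading cycles.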

Table 1. Matrix Multiplication Examples
| | Matrix A Size | Matrix B Size | # of Multiplication Operations (MultOps) | # Cycles for Compute (32 ops/cycle) | # Cycles for Data Loading (32 bits/cycle) |
|---|---|---|---|---|---|
| Example 1 | 16x64 | 64x16 | 16384 | 512 (16384/32) | 512 (64x16x16/32) |
| Example 2 | 16x64 | 64x32 | 32768 | 1024 (32768/32) | 1024 (64x32x16/32) |
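The arithmetic in the table can be reproduced with a few lines of Python. This is a minimal sketch for checking candidate window sizes, not part of the AI Engine tools: the function name `matmul_cycles` and its parameters are hypothetical, and the two rates are taken from the table header.

```python
# Rates taken from the table header. All names here are illustrative only.
OPS_PER_CYCLE = 32    # multiplications the kernel performs per cycle
BITS_PER_CYCLE = 32   # bits loaded into a window buffer per cycle


def ceil_div(value, divisor):
    """Integer division rounded up, so partial cycles count as a full cycle."""
    return (value + divisor - 1) // divisor


def matmul_cycles(m, k, n, bits_per_element=16):
    """Return (mult_ops, compute_cycles, load_cycles) for an (m x k) by (k x n) product."""
    mult_ops = m * k * n
    compute_cycles = ceil_div(mult_ops, OPS_PER_CYCLE)
    # Per the text, only the larger of the A and B windows determines loading time.
    larger_window_elements = max(m * k, k * n)
    load_cycles = ceil_div(larger_window_elements * bits_per_element, BITS_PER_CYCLE)
    return mult_ops, compute_cycles, load_cycles


example_1 = matmul_cycles(16, 64, 16)  # A is 16x64, B is 64x16
example_2 = matmul_cycles(16, 64, 32)  # A is 16x64, B is 64x32
print(example_1)  # (16384, 512, 512)
print(example_2)  # (32768, 1024, 1024)
```

A design is balanced in the sense used here when the second and third values of the returned tuple are equal, as they are for both examples.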