The inner loop is where the main work takes place, with the VLIW architecture keeping the AI Engine busy. Upon entering the loop, the first operation must already have valid data to process. For the vector register data mover, this means we need to preload the first 256-bit vector of data before entering the loop. A similar argument applies to the vector multiplication data mover, but in this case we require two 256-bit vectors of data. This is because we process 8 lanes of data each clock cycle, and the permutation for each lane offsets the data in the vector register as shown in the figure below.
To complete the full circle, the inner loop is manually unrolled so that every iteration of the for loop starts at the same position in the vector register. This is possible because the vector register automatically wraps around according to its type declaration.
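The preload and unrolling pattern can be illustrated with a plain C++ model. This is only a sketch and does not use AI Engine intrinsics: the 8 lanes and 256-bit loads come from the text above, while the 32-bit element type, the four-slot (1024-bit) register, the 8-tap filter, and the names `VecReg`, `load_chunk`, `mac8`, `fir_unrolled`, and `fir_reference` are assumptions made for illustration. The modulo index in `mac8` stands in for the wrap-around that the hardware register provides through its type.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed sizes for this model (only LANES = 8 and 256-bit loads come from the text).
constexpr std::size_t LANES = 8;                 // outputs produced per step
constexpr std::size_t CHUNK = 8;                 // 8 x 32-bit = one 256-bit load
constexpr std::size_t SLOTS = 4;                 // assumed: register holds 4 chunks
constexpr std::size_t REG_SIZE = CHUNK * SLOTS;  // 32 elements, modeled as circular
constexpr std::size_t TAPS = 8;                  // assumed filter length

using VecReg = std::array<std::int32_t, REG_SIZE>;
using Coeffs = std::array<std::int32_t, TAPS>;

// Models one 256-bit load into a fixed slot of the vector register.
inline void load_chunk(VecReg& reg, std::size_t slot, const std::int32_t* src) {
    for (std::size_t i = 0; i < CHUNK; ++i) reg[slot * CHUNK + i] = src[i];
}

// Models one 8-lane multiply-accumulate. Lane l reads elements
// start+l .. start+l+TAPS-1, so the 8 lanes together span two chunks.
// The modulo models the register wrapping around.
inline void mac8(const VecReg& reg, std::size_t start, const Coeffs& coeff,
                 std::int32_t* out) {
    for (std::size_t lane = 0; lane < LANES; ++lane) {
        std::int32_t acc = 0;
        for (std::size_t k = 0; k < TAPS; ++k)
            acc += coeff[k] * reg[(start + lane + k) % REG_SIZE];
        out[lane] = acc;
    }
}

std::vector<std::int32_t> fir_unrolled(const std::vector<std::int32_t>& data,
                                       const Coeffs& coeff, std::size_t n_out) {
    assert(n_out % REG_SIZE == 0);            // whole unrolled iterations only
    assert(data.size() >= n_out + 2 * CHUNK); // room for the look-ahead loads
    std::vector<std::int32_t> out(n_out);
    VecReg reg{};
    const std::int32_t* in = data.data();
    std::int32_t* o = out.data();

    // Preload two chunks so the first mac8 has valid data in slots 0 and 1.
    load_chunk(reg, 0, in); in += CHUNK;
    load_chunk(reg, 1, in); in += CHUNK;

    // Unrolled by SLOTS: each loop iteration starts at register position 0.
    for (std::size_t i = 0; i < n_out / REG_SIZE; ++i) {
        mac8(reg, 0 * CHUNK, coeff, o); o += LANES;  // reads slots 0 and 1
        load_chunk(reg, 2, in); in += CHUNK;
        mac8(reg, 1 * CHUNK, coeff, o); o += LANES;  // reads slots 1 and 2
        load_chunk(reg, 3, in); in += CHUNK;
        mac8(reg, 2 * CHUNK, coeff, o); o += LANES;  // reads slots 2 and 3
        load_chunk(reg, 0, in); in += CHUNK;
        mac8(reg, 3 * CHUNK, coeff, o); o += LANES;  // wraps: reads slots 3 and 0
        load_chunk(reg, 1, in); in += CHUNK;
    }
    return out;
}

// Straightforward reference: out[n] = sum over k of coeff[k] * data[n + k].
std::vector<std::int32_t> fir_reference(const std::vector<std::int32_t>& data,
                                        const Coeffs& coeff, std::size_t n_out) {
    std::vector<std::int32_t> out(n_out, 0);
    for (std::size_t n = 0; n < n_out; ++n)
        for (std::size_t k = 0; k < TAPS; ++k)
            out[n] += coeff[k] * data[n + k];
    return out;
}
```

In this model, every load targets a fixed slot and every multiply starts at a fixed offset, which is what the manual unrolling buys: after four steps the register has gone full circle and the next iteration again finds its data in slots 0 and 1. Calling `fir_unrolled` and `fir_reference` on the same input (padded by two chunks) should give identical results.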