Pointer Locations

One major difference between the novel filter design and the traditional method is that samples for MAC operations are read from the overlap buffer instead of the data buffer. For every eight output results, only one read from the input window is needed; all the other data come from the overlap buffer. During the MAC operations, the eight newly read input samples are written to the overlap memory for the next iteration. Every overlap buffer has three pointers: a read pointer, a symmetry pointer, and a write pointer. The starting locations of the overlap buffer pointers can differ in each iteration, depending on the size of the input window.

In the case of FIR89, an overlap depth of 80 samples is needed. The following figure illustrates the behavior of each pointer. At first, the read pointer points to address 0, the symmetry pointer points to address overlap-depth - 8, the write pointer points to address overlap-size, and the input window pointer points to the beginning of the input window. In each iteration, the read pointer and the symmetry pointer move toward each other in steps of eight samples until all the data in the delay line is processed. At the beginning of the next iteration, all the pointers are reset to their initial locations with an offset of eight samples relative to the initial locations of the previous iteration.

Figure 1. Novel FIR89 Pointers
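
The read and symmetry pointers walk toward each other because a symmetric filter lets two samples share one coefficient. The following scalar C++ sketch illustrates that folding on plain integers. It is not the AI Engine kernel: the function names fir_direct and fir_symmetric are hypothetical, the filter is assumed to have an odd number of symmetric taps, and no vectorization or overlap buffer is modeled.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Direct-form reference: y[n] = sum over k of h[k] * x[n + k].
// Assumes x.size() >= h.size().
std::vector<int64_t> fir_direct(const std::vector<int32_t>& h,
                                const std::vector<int32_t>& x) {
    const std::size_t taps = h.size();
    std::vector<int64_t> y(x.size() - taps + 1, 0);
    for (std::size_t n = 0; n < y.size(); ++n)
        for (std::size_t k = 0; k < taps; ++k)
            y[n] += static_cast<int64_t>(h[k]) * x[n + k];
    return y;
}

// Folded form for an odd-length symmetric filter (h[k] == h[taps-1-k]).
// Two indices start at opposite ends of the delay line and meet at the
// centre tap, so each coefficient is multiplied only once.
std::vector<int64_t> fir_symmetric(const std::vector<int32_t>& h,
                                   const std::vector<int32_t>& x) {
    const std::size_t taps = h.size();   // assumed odd, e.g. 89
    const std::size_t half = taps / 2;
    std::vector<int64_t> y(x.size() - taps + 1, 0);
    for (std::size_t n = 0; n < y.size(); ++n) {
        std::size_t rd = n;              // plays the role of the read pointer
        std::size_t sym = n + taps - 1;  // plays the role of the symmetry pointer
        for (std::size_t k = 0; k < half; ++k, ++rd, --sym)
            y[n] += static_cast<int64_t>(h[k]) *
                    (static_cast<int64_t>(x[rd]) + x[sym]);
        y[n] += static_cast<int64_t>(h[half]) * x[rd];  // centre tap, rd == sym
    }
    return y;
}
```

For any symmetric coefficient set, both functions should return identical outputs; the folded version simply performs about half as many multiplications.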

When an overlap pointer reaches the bottom of the overlap buffer, roll-over occurs. As illustrated in the following figure, the write pointer reaches the bottom and rolls over to the beginning of the overlap buffer. The pointer update implements this with the cyclic_add() function.

Figure 2. Write Pointer Rolls Over

The following figure is an example of pointer movement from the kernel execution point of view for FIR89. Each kernel execution contains several inner loop (for loop) iterations. Assuming the input window holds 64 samples and each inner loop iteration consumes eight samples, one kernel execution has 64/8 = 8 inner loop iterations. At first, the read pointer points to #0 (*v8cint16), the symmetry pointer points to #10 (*v8cint16), and the write pointer points to #11 (*v8cint16). Each pointer advances in steps of 8 samples (8 × 32 bits = 256 bits, for maximum memory access efficiency). As shown in the following figure, at the beginning of the second inner loop iteration, the read pointer points to #1 (*v8cint16), the symmetry pointer points to #11 (*v8cint16), and the write pointer points to #12 (*v8cint16). If any pointer reaches the bottom of the overlap buffer, it rolls over to the beginning.

Figure 3. Overlap Buffer Pointer Movement

At the beginning of the second kernel execution, the read, symmetry, and write pointer locations should be initialized to 8/2/3 respectively: each pointer has advanced by eight v8cint16 locations, with roll-over at the 16-location buffer depth. The locations are 0/10/11 again at the beginning of the third execution. This pattern keeps repeating as the data processing continues.
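
The 0/10/11, 8/2/3, 0/10/11 sequence can be reproduced with a few lines of index arithmetic. The sketch below is a host-side illustration only; the names PointerSlots, wrap_slot, and start_slots are hypothetical, and the constants are the ones used in this example (64-sample window, eight samples per inner loop iteration, 16 × v8cint16 buffer depth).

```cpp
// Constants taken from the FIR89 example in the text.
const int kDepthVec = 16;        // overlap buffer depth in v8cint16 locations
const int kWindowSamples = 64;   // input window size in samples
const int kSamplesPerLoop = 8;   // samples consumed per inner loop iteration
const int kInnerLoops = kWindowSamples / kSamplesPerLoop;  // 64/8 = 8

// Locations of the three overlap pointers, in v8cint16 units.
struct PointerSlots {
    int rd;   // read pointer
    int sym;  // symmetry pointer
    int wr;   // write pointer
};

// Roll over at the end of the buffer.
inline int wrap_slot(int slot) { return slot % kDepthVec; }

// Pointer locations at the start of kernel execution `exec` (0-based),
// assuming every inner loop iteration advances each pointer by one location.
inline PointerSlots start_slots(int exec) {
    const int advance = exec * kInnerLoops;
    PointerSlots s;
    s.rd = wrap_slot(0 + advance);
    s.sym = wrap_slot(10 + advance);
    s.wr = wrap_slot(11 + advance);
    return s;
}
```

Because 2 × 8 equals the buffer depth of 16 locations, the start locations repeat with a period of two kernel executions in this configuration.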

The Versal AI Engine software tools support a function called cyclic_add in cardano.h. It can be used to implement the cyclic roll-over that is needed when a pointer reaches the end of the buffer. For example, the following code defines an inline cyclic-increment function for a buffer of depth 16 × v8cint16.

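struct buffer_internal
{
  buffer_datatype * restrict head;  // starting address of the overlap
  buffer_datatype * restrict ptr;   // current pointer to the overlap
};
// Advance w->ptr by count steps (v8cint16), rolling over within 16 steps.
inline __attribute__((always_inline)) void buffer128_incr_v8(buffer_internal * w, int count) {
  w->ptr = cyclic_add(w->ptr, count, w->head, 16);
}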

where

w represents the overlap structure instance.
w->ptr is the current pointer to the overlap.
w->head refers to the starting address of the overlap.
count is the number of steps (in units of v8cint16) to advance.
The constant, 16, means it is an overlap with a fixed 128 sample depth (16 × v8cint16).
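
To experiment with the roll-over behavior outside the AI Engine tool chain, the pointer update can be modeled on a host machine. The sketch below is an assumption-based stand-in, not the intrinsic itself: cyclic_add_model is a hypothetical name, its argument order simply mirrors the call shown above, and its handling of negative steps is a choice of this model that may not match the real function.

```cpp
#include <cstddef>

// Host-side model of a cyclic pointer update (hypothetical, for illustration).
// Returns ptr advanced by count elements, wrapped into [head, head + len).
// Assumes ptr already lies inside that range and len > 0.
template <typename T>
T* cyclic_add_model(T* ptr, int count, T* head, int len) {
    const std::ptrdiff_t pos = ptr - head;      // current offset from the start
    std::ptrdiff_t next = (pos + count) % len;  // wrap at the buffer depth
    if (next < 0) next += len;                  // keep the offset non-negative
    return head + next;
}
```

With a 16-element buffer, stepping from location 15 by one lands on location 0, and stepping from location 10 by eight lands on location 2, which matches the symmetry pointer moving from #10 to #2 between kernel executions.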

The following figure shows the microcode of FIR89. The inner loop is fully utilized: every cycle of the inner loop contains a MAC operation, as indicated in the green box. As indicated in the blue box, the overlap buffer update operation (VST in microcode) is absorbed into a cycle that also performs a MAC operation, so it adds no extra cycles.

Figure 4. FIR89 Kernel Compile Result