IO Wrapper (m76_kernel.cpp)

The wrapper takes a parameter array as input and iterates through it, calling the Engine for each entry. The results are also returned as an array in order to make full use of DMA in the FPGA, because a single batch data transaction is much faster than multiple single transactions. The data is first read from global memory into local memory, then processed in the kernel, and finally written from local memory back to global memory. The extra time required by these copies is more than compensated for by the speedup the Engine gains from accessing local memory.
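The read, process, and write-back flow can be sketched in plain C++ as follows. This is an illustrative sketch only, not the contents of m76_kernel.cpp: the names io_wrapper_sketch, engine_stub and kMaxBatch are hypothetical, the float data type is an assumption, and the stub merely stands in for the real Engine.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical batch limit; the text states 2048 calculations per batch.
constexpr std::size_t kMaxBatch = 2048;

// Placeholder for the real Engine, which is not shown in this section.
static float engine_stub(float param) { return param * param; }

// Sketch of the wrapper: batch read, compute from local memory, batch write.
void io_wrapper_sketch(const float* global_in, float* global_out, std::size_t n) {
    assert(n <= kMaxBatch);

    float local_in[kMaxBatch];
    float local_out[kMaxBatch];

    // 1. One contiguous copy from global memory into local memory.
    for (std::size_t i = 0; i < n; ++i) {
        local_in[i] = global_in[i];
    }

    // 2. Call the Engine once per entry, touching only local memory.
    for (std::size_t i = 0; i < n; ++i) {
        local_out[i] = engine_stub(local_in[i]);
    }

    // 3. One contiguous copy of the results back to global memory.
    for (std::size_t i = 0; i < n; ++i) {
        global_out[i] = local_out[i];
    }
}
```

Keeping the two copy loops separate from the compute loop is what gives each transfer a simple sequential access pattern, which is the property a batch transaction relies on.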

The wrapper processes up to 2048 calculations in one batch. This number can be increased at the expense of memory in the FPGA, and a larger batch gives a performance advantage when processing large amounts of data because of the kernel's pipelining.
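As a rough illustration of what the batch limit means for a caller, the sketch below splits a larger workload into batches of at most 2048 entries. It assumes the caller is responsible for the splitting, which this section does not state; plan_batches and kBatchLimit are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Batch limit stated in the text; the name is hypothetical.
constexpr std::size_t kBatchLimit = 2048;

// Returns the size of each batch needed to cover `total` calculations.
std::vector<std::size_t> plan_batches(std::size_t total) {
    std::vector<std::size_t> sizes;
    while (total > 0) {
        const std::size_t n = (total < kBatchLimit) ? total : kBatchLimit;
        assert(n > 0 && n <= kBatchLimit);
        sizes.push_back(n);
        total -= n;
    }
    return sizes;
}
```

For example, 5000 calculations would be covered by two full batches of 2048 and a final batch of 904.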

To speed up kernel execution for a large number of calculations, the input array and the sum array are partitioned by a factor of 8. The inner loop is unrolled, which creates 8 engines working in parallel. The array partitioning is needed to make multiple read accesses possible in the same clock cycle. A partition factor of 8 was chosen as a balance between efficiency and the amount of resources required in the FPGA.
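A minimal sketch of this partition-and-unroll pattern is shown below. It is not the actual kernel source: the function and variable names are hypothetical, the cyclic partition type is an assumption (the text gives only the factor), and lane_engine_stub stands in for the real Engine. A standard C++ compiler ignores the HLS pragmas, so the sketch also runs in software.

```cpp
#include <cassert>

constexpr int kEntries = 2048;  // batch size stated in the text
constexpr int kLanes = 8;       // partition and unroll factor stated in the text
static_assert(kEntries % kLanes == 0, "batch must divide evenly into lanes");

// Placeholder for the real Engine, which is not shown in this section.
static float lane_engine_stub(float param) { return param + 1.0f; }

void partitioned_batch_sketch(const float* global_in, float* global_sum) {
    assert(global_in != nullptr && global_sum != nullptr);

    float local_in[kEntries];
    float local_sum[kEntries];
    // Assumed cyclic partitioning: consecutive elements land in different
    // memory banks, so 8 neighbours can be accessed in the same clock cycle.
#pragma HLS ARRAY_PARTITION variable=local_in type=cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=local_sum type=cyclic factor=8 dim=1

    for (int i = 0; i < kEntries; ++i) {
        local_in[i] = global_in[i];
    }

    for (int i = 0; i < kEntries; i += kLanes) {
        for (int j = 0; j < kLanes; ++j) {
            // Fully unrolling the inner loop yields 8 parallel engine instances.
#pragma HLS UNROLL
            local_sum[i + j] = lane_engine_stub(local_in[i + j]);
        }
    }

    for (int i = 0; i < kEntries; ++i) {
        global_sum[i] = local_sum[i];
    }
}
```

Without the partitioning, the unrolled loop body would still compete for the limited ports of a single memory, so the 8 engine instances could not all read their inputs in one cycle.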

