Implementation - 2023.2 English

Computing the covariance matrix and applying covariance regularization are common, straightforward operations, so they are not elaborated here. This section instead introduces the key FPGA-oriented optimization of the covariance matrix computation. As the formula and the implementation code show, the core computation requires three nested loops, which dominate the latency. It therefore needs to be optimized to improve throughput.

First, the bottom (innermost) and middle loops are unrolled to increase throughput. However, because the loop body contains an accumulation (self-addition), each iteration depends on the result of the previous one, so unrolling alone does not reduce latency much. The bottom loop is therefore split into two processes, the core calculation and the accumulation, which pass intermediate results through a stream under pragma dataflow to improve throughput. See the function covCoreWrapper for details.