Implementation Details - 2023.2 English

The optimization happens in the first part. We do not store the full history of generated vectors, only the last \(n\) vectors. These values are stored in a circular queue implemented in BRAMs. The BRAM depth is set to the smallest power of 2 that is larger than \(n\), which simplifies address calculation: wrapping around the queue reduces to keeping the low-order address bits. By keeping and updating the address of the starting vector, we can always calculate the address of any vector we need to access.
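The addressing scheme can be illustrated with a minimal C++ sketch. It is not the library's code: the names `depthFor`, `addrOf` and `advance` are hypothetical, and \(n = 624\) is only an assumed example value. It shows why a power-of-2 depth lets the wrap-around be a single bit mask.

```cpp
#include <cassert>
#include <cstdint>

// Smallest power of two strictly larger than n (hypothetical helper).
constexpr std::uint32_t depthFor(std::uint32_t n) {
    std::uint32_t d = 1;
    while (d <= n) d <<= 1;
    return d;
}

constexpr std::uint32_t N = 624;              // assumed example value for n
constexpr std::uint32_t DEPTH = depthFor(N);  // queue depth
constexpr std::uint32_t MASK = DEPTH - 1;     // low-order address bits

static_assert((DEPTH & MASK) == 0, "DEPTH must be a power of two");

// Address of the vector located `offset` positions after the starting vector.
// With a power-of-two depth, the modulo is a plain bit mask.
inline std::uint32_t addrOf(std::uint32_t start, std::uint32_t offset) {
    assert(offset < DEPTH);
    return (start + offset) & MASK;
}

// Move the starting address forward by one vector, wrapping at DEPTH.
inline std::uint32_t advance(std::uint32_t start) {
    return (start + 1) & MASK;
}
```

In hardware the mask costs nothing beyond truncating the address bus, whereas a depth of exactly \(n\) would require a comparison and subtraction on every access.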

To generate the \(k\)-th vector, we need 3 reads: \(X_{k}\), \(X_{k + 1}\) and \(X_{k + m}\). The next iteration needs \(X_{k + 1}\), \(X_{k + 2}\) and \(X_{k + m + 1}\). Because \(X_{k + 1}\) can be saved in a register, only \(X_{k + 2}\) and \(X_{k + m + 1}\) have to be read from memory. Each iteration therefore needs 2 read accesses to different vectors plus 1 write access for the new vector. A BRAM allows only 2 accesses (read or write) per cycle, so a single BRAM cannot produce a new vector every clock cycle. In the implementation, we keep identical copies of the vectors in different BRAMs, so that enough read and write ports are available.
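The access pattern can be modeled in software as a rough sketch. Everything here is an assumption made for illustration: the `DualCopyGenerator` type, the use of exactly two copies, the port assignment (one read and one write per copy), the placement of the new vector \(n\) positions after the start, and the `combine` function, which is a placeholder rather than the real recurrence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder recurrence (hypothetical): stands in for the real update rule.
inline std::uint32_t combine(std::uint32_t xk, std::uint32_t xk1, std::uint32_t xkm) {
    return (xk ^ (xk1 << 1) ^ (xkm >> 1)) + 0x9E3779B9u;
}

// Software model of two identical memory copies plus one register.
struct DualCopyGenerator {
    std::uint32_t n = 0, m = 0, mask = 0, start = 0;
    std::uint32_t regXk = 0;                  // X_k, kept from the previous iteration
    std::vector<std::uint32_t> copyA, copyB;  // identical contents

    DualCopyGenerator(const std::vector<std::uint32_t>& seed, std::uint32_t mOffset)
        : n(static_cast<std::uint32_t>(seed.size())), m(mOffset) {
        assert(n >= 2 && m > 0 && m < n);
        std::uint32_t depth = 1;
        while (depth <= n) depth <<= 1;       // smallest power of 2 larger than n
        mask = depth - 1;
        copyA.assign(depth, 0);
        copyB.assign(depth, 0);
        for (std::size_t i = 0; i < seed.size(); ++i) copyA[i] = copyB[i] = seed[i];
        regXk = seed[0];
    }

    std::uint32_t next() {
        const std::uint32_t xk1 = copyA[(start + 1) & mask];  // read port, copy A
        const std::uint32_t xkm = copyB[(start + m) & mask];  // read port, copy B
        const std::uint32_t xnew = combine(regXk, xk1, xkm);
        const std::uint32_t wr = (start + n) & mask;
        copyA[wr] = xnew;                     // write port, copy A
        copyB[wr] = xnew;                     // write port, copy B
        regXk = xk1;                          // X_{k+1} becomes X_k for the next call
        start = (start + 1) & mask;
        return xnew;
    }
};
```

Under these assumptions each copy sees one read and one write per call to `next()`, which fits within the 2 accesses per cycle that a BRAM allows.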