The goal of kernel optimization is to create processing logic that can consume all the data as soon as it arrives at the kernel interfaces. The key metric is the initiation interval (II), or the number of clock cycles before the kernel can accept new input data. Optimizing the II is generally achieved by expanding the processing code to match the data path with techniques such as function pipelining, loop unrolling, array partitioning, data flowing, etc.
Figure 1. Optimizing Kernel Computation Flow