In the example, the filter has 16 coefficients, which do not fit within a single 256-bit register. The register must therefore be updated partway through the computation.
For data storage, a small 512-bit register is used. It is divided into two 256-bit parts, W0 and W1.
First iteration
Part W0 is loaded with the first 8 samples (0…7)
Part W1 is loaded with the next 8 samples (8…15)
Part W0 is then reloaded with the following 8 samples (16…23)
Second iteration
Part W0: samples 8…15
Part W1: samples 16…23
Part W0: samples 24…31
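
The loading pattern of the two iterations above can be reproduced with a short Python sketch. It is only an illustration of the schedule, not the actual kernel code: the names SAMPLES_PER_PART, LOADS_PER_ITERATION and load_schedule are hypothetical, and the sketch assumes that each iteration starts 8 samples after the previous one and always performs three loads in the order W0, W1, W0, as the two listed iterations suggest.

```python
# Illustrative sketch only: names and the generalisation to later
# iterations are assumptions, not taken from the original kernel.
SAMPLES_PER_PART = 8      # one 256-bit part holds 8 samples in this example
LOADS_PER_ITERATION = 3   # W0, W1, then W0 again


def load_schedule(iteration):
    """Return [(part, first_sample, last_sample), ...] for a 0-based iteration."""
    # Assumption: each iteration starts 8 samples after the previous one.
    start = iteration * SAMPLES_PER_PART
    schedule = []
    for load in range(LOADS_PER_ITERATION):
        # Loads alternate between the two halves of the 512-bit register.
        part = "W0" if load % 2 == 0 else "W1"
        first = start + load * SAMPLES_PER_PART
        schedule.append((part, first, first + SAMPLES_PER_PART - 1))
    return schedule


if __name__ == "__main__":
    for it in range(2):
        print(f"Iteration {it + 1}")
        for part, first, last in load_schedule(it):
            print(f"  Part {part}: samples {first}..{last}")
```

Under these assumptions, each iteration reads a window of 24 consecutive samples, and consecutive windows overlap by 16 samples, which is why the second and third loads of one iteration cover the same samples as the first and second loads of the next.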