cfloat x cfloat multiplications take two cycles to perform due to the absence of the post-add. These two parts can be interleaved with the two-cycle latency of the accumulator.
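One way to picture the two parts is to split the complex product by the real and imaginary parts of the coefficient, so that each part is a plain multiply-accumulate with no addition after the multiplier. The sketch below is ordinary scalar C++ with `std::complex<float>` standing in for `cfloat`. The function names are hypothetical, and the split shown is an assumption about how the two cycles divide the work, not the hardware's documented behavior.

```cpp
#include <complex>

using cfloat = std::complex<float>;  // stand-in for the AI Engine cfloat type

// Part 1 (assumed split): accumulate d * Re(c).
inline cfloat mac_part_re(cfloat acc, cfloat d, cfloat c) {
    return acc + cfloat(d.real() * c.real(), d.imag() * c.real());
}

// Part 2 (assumed split): accumulate d * j*Im(c) = (-Im(d)*Im(c), Re(d)*Im(c)).
inline cfloat mac_part_im(cfloat acc, cfloat d, cfloat c) {
    return acc + cfloat(-d.imag() * c.imag(), d.real() * c.imag());
}

// Full complex multiply-accumulate, acc + d * c, done as two passes.
inline cfloat cmac_two_pass(cfloat acc, cfloat d, cfloat c) {
    return mac_part_im(mac_part_re(acc, d, c), d, c);
}
```

Because each pass only adds into the accumulator, two independent accumulation chains can alternate cycle by cycle, which is what lets the two parts hide the two-cycle accumulator latency.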
There are still 16 coefficients, but they are now complex and therefore take twice the storage. The coefficients have to be updated four times for a complete iteration. The data transfer is also slightly more complex.
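The storage arithmetic can be checked with a few constants. This is a minimal sketch: the constant names are hypothetical, `std::complex<float>` stands in for `cfloat`, and the assumption that the four updates each load an equal share of the coefficients is mine rather than something stated above.

```cpp
#include <complex>
#include <cstddef>

using cfloat = std::complex<float>;  // stand-in for the AI Engine cfloat type

constexpr std::size_t kNumCoeffs = 16;                              // taps, unchanged
constexpr std::size_t kRealBytes = kNumCoeffs * sizeof(float);      // real coefficients
constexpr std::size_t kComplexBytes = kNumCoeffs * sizeof(cfloat);  // complex coefficients

constexpr std::size_t kUpdatesPerIteration = 4;  // from the text
// Assumption: every update loads the same number of coefficients.
constexpr std::size_t kCoeffsPerUpdate = kNumCoeffs / kUpdatesPerIteration;

static_assert(kComplexBytes == 2 * kRealBytes, "complex taps take twice the storage");
```

With 4-byte floats this gives 64 bytes for the real coefficients and 128 bytes for the complex ones, and under the equal-share assumption each update carries four complex coefficients.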