The AI Engine API is intended to improve productivity by increasing the level of abstraction relative to the low-level intrinsics. We recommend using the AI Engine API and only low-level intrinsics to achieve more performance to meet target specifications.
Throughput may be improved using the following techniques:
Reduce function call overhead by processing as many samples within the function as possible.
For floating-point accumulation, use two accumulators with low-level intrinsics.
Can the throughput be improved even further?
Floating-point allows 8 MACs per cycle. Using 32-bit fixed-point coefficients with 16-bit data allow 16 MACs per cycle, potentially doubling the throughput. 16-bit fixed-point coefficients with 16-bit data allow 32 MACs per cycle, potentially quadrupling the throughput. 16-bit fixed-point coefficients with 8-bit data allow 64 MACs per cycle, potentially improving the throughput by 8x.
Assuming that we stick with a floating-point implementation, doubling the number of processed samples and the number of AI Engines (That is, two AI Engines, each processing eight samples from a 16-sample window) may double the throughput.