The AI Engine processes data block by block and uses a data structure called a window to describe one block of input or output data. The window size, that is, the number of samples in each block of input or output data, represents a tradeoff between efficiency and processing latency. Long windows lead to high efficiency, but latency increases proportionally with the window size. Sometimes a short latency is preferable at the cost of a 5-10% loss in AI Engine processing efficiency.
For example, in this application note, the input window size is set to 512 samples to limit the latency to within 2.1 μs. The window sizes and sample rates of DDC filters are listed in the following table.
Filter | Input Sample Rate (MSPS) | Output Sample Rate (MSPS) | Input Window | Output Window |
---|---|---|---|---|
HB47 | 245.76 | 122.88 | 512 | 256 |
HB11 | 122.88 | 61.44 | 256 | 128 |
FIR199 | 122.88 | 122.88 | 256 | 256 |
HB23 | 61.44 | 30.72 | 128 | 64 |
FIR89 | 30.72 | 30.72 | 64 | 64 |
Mixer | 122.88 | 122.88 | 256 | 1280 (256 × 5 carriers) |
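The 2.1 μs latency bound quoted above follows directly from the input window size and sample rate. A quick sanity check in plain Python (illustrative only, not part of any AI Engine tool flow):

```python
# Buffering latency of one input window is window_size / sample_rate.
# Values are taken from the table above.

def window_latency_us(window_size, sample_rate_msps):
    """Time to fill one input window, in microseconds."""
    return window_size / sample_rate_msps  # samples / (Msamples/s) = us

# 512 samples at 245.76 MSPS -> about 2.08 us, i.e. within 2.1 us.
print(f"{window_latency_us(512, 245.76):.2f} us")
```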
A cycle budget is the number of instruction cycles a function can take to compute a block of output data, given by:

Cycle Budget = (Input Window Size / Input Sample Rate) × AI Engine Clock Frequency
At a 1 GHz AI Engine clock in the lowest speed-grade device, the processing of 512 samples at 245.76 MSPS has a cycle budget of 2083 cycles.
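The 2083-cycle figure can be reproduced from the cycle budget formula (illustrative Python, using the numbers given in the text):

```python
# Cycle budget = (window_size / sample_rate) * clock_frequency:
# the time to receive one block of samples, expressed in AI Engine cycles.

def cycle_budget(window_size, sample_rate_hz, clock_hz):
    """Instruction cycles available to process one window of samples."""
    return window_size / sample_rate_hz * clock_hz

# 512 samples at 245.76 MSPS with a 1 GHz AI Engine clock.
print(round(cycle_budget(512, 245.76e6, 1e9)))  # 2083
```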
Suppose every output needs P 16-bit-real by 16-bit-real multiplications. The AI Engine can compute 32 such real-by-real multiplications every cycle. For an ideal implementation, the utilization lower bound is given by:

Utilization Lower Bound = (P × Output Window Size) / (32 × Cycle Budget)
Take FIR199 as an example. FIR199 has 199 real symmetric filter taps, so it takes 100 16-bit-complex by 16-bit-real multiplications to compute each output. Therefore, every output of FIR199 needs 200 16-bit-real by 16-bit-real multiplications at 122.88 MSPS, and the utilization lower bound is (200 multiplications × 256 samples) / (32 multiplications/cycle × 2083 cycles) = 76.8%. Similarly, the utilization lower bounds of the other DDC filters are calculated and listed in the following table.
Filter | Input Window Size | Output Window Size | Number of Taps | Number of MACs/Output | Utilization per Instance | Number of Instances | Utilization Lower Bound |
---|---|---|---|---|---|---|---|
FIR199 | 256 | 256 | 199 | 200 | 76.8% | 1 | 76.8% |
FIR89 | 64 | 64 | 89 | 96 | 9.3% | 5 | 46.5% |
HB47 | 512 | 256 | 47 | 32 | 12.3% | 1 | 12.3% |
HB11 | 256 | 128 | 11 | 8 | 1.6% | 5 | 8% |
HB23 | 128 | 64 | 23 | 16 | 1.6% | 5 | 8% |
Mixer 1 | 256 | 1280 | - | 8 | 23% | 1 | 23% |
Total | | | | | | | 174.6% |
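The total is the sum of per-instance utilization weighted by instance count; halving it gives the per-engine figure for a two-engine mapping. A minimal check in plain Python, with the values copied from the table above:

```python
# (per-instance utilization %, number of instances), from the table above.
filters = {
    "FIR199": (76.8, 1),
    "FIR89":  (9.3, 5),
    "HB47":   (12.3, 1),
    "HB11":   (1.6, 5),
    "HB23":   (1.6, 5),
    "Mixer":  (23.0, 1),
}

total = sum(util * count for util, count in filters.values())
print(f"total utilization: {total:.1f}%")        # 174.6%
print(f"per engine (2 engines): {total / 2:.1f}%")  # 87.3%
```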
Although in theory this DDC can be implemented on two AI Engines at 87.3% utilization each, such high utilization requires very long windows and therefore undesirable latency. One way to reduce the utilization is to exploit the fact that 5G NR and 4G LTE carriers do not co-exist in this case, so the filters for unused carriers can be disabled at run time, depending on the carrier configuration. Detailed analysis and explanation are provided in the following sections.