AI Engine Utilization Estimation

Digital Down-conversion Chain Implementation on AI Engine (XAPP1351)

Document ID: XAPP1351
Release Date: 2021-02-15
Revision: 1.0 (English)

The AI Engine processes data block by block and uses a data structure called a window to describe one block of input or output data. The window size, which is the number of samples in each block of input or output data, represents a tradeoff between efficiency and processing latency. Long windows lead to high efficiency, but latency increases proportionally with the window size. Sometimes a short latency is preferable at the cost of 5-10% of AI Engine processing efficiency.
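As a rough illustration of this tradeoff, the short C++ sketch below computes the block latency and the cycles available per block for several window sizes at the 245.76 MSPS input rate used here. The fixed 100-cycle per-invocation overhead is an assumption made only for illustration, not a measured value from this design.

```cpp
// Sketch: window size vs. latency and relative overhead at 245.76 MSPS.
// The 100-cycle per-invocation overhead is an assumed illustrative value.
#include <cstdio>

int main() {
    const double sample_rate_msps = 245.76;   // DDC input sample rate
    const double aie_clock_mhz    = 1000.0;   // 1 GHz AI Engine clock
    const double overhead_cycles  = 100.0;    // assumed fixed cost per kernel invocation
    const int window_sizes[] = {128, 256, 512, 1024, 2048};

    for (int w : window_sizes) {
        double latency_us = w / sample_rate_msps;              // time to collect one window
        double budget     = latency_us * aie_clock_mhz;        // cycles available per window
        double overhead   = 100.0 * overhead_cycles / budget;  // fixed overhead as % of budget
        std::printf("window %4d: latency %5.2f us, budget %6.0f cycles, overhead %4.1f%%\n",
                    w, latency_us, budget, overhead);
    }
    return 0;
}
```

Longer windows amortize the fixed per-block cost over more samples (higher efficiency) but take proportionally longer to fill (higher latency).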

For example, in this application note, the input window size is set to 512 samples to limit the latency to within 2.1 μs. The window sizes and sample rates of DDC filters are listed in the following table.

Table 1. Window Size and Sample Rate of DDC Filters

Filter | Input Sample Rate (MSPS) | Output Sample Rate (MSPS) | Input Window (samples) | Output Window (samples)
HB47   | 245.76 | 122.88 | 512 | 256
HB11   | 122.88 | 61.44  | 256 | 128
FIR199 | 122.88 | 122.88 | 256 | 256
HB23   | 61.44  | 30.72  | 128 | 64
FIR89  | 30.72  | 30.72  | 64  | 64
Mixer  | 122.88 | 122.88 | 256 | 1280 (5 carriers)
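The Table 1 window sizes ultimately appear as buffer sizes on the ADF graph connections between kernels. The fragment below is a minimal sketch, assuming the window-based connection style of the 2021-era AI Engine tools in which adf::window sizes are given in bytes; the graph class, kernel names, source file names, and runtime ratios are placeholders and are not taken from the reference design.

```cpp
// Sketch only: first two decimating halfband stages with Table 1 window sizes.
#include <adf.h>

using namespace adf;

// Placeholder kernel prototypes (window-based kernel interface).
void fir_hb47(input_window_cint16 *in, output_window_cint16 *out);
void fir_hb11(input_window_cint16 *in, output_window_cint16 *out);

class ddc_front_graph : public graph {
private:
    kernel hb47;
    kernel hb11;
public:
    port<input>  in;
    port<output> out;

    ddc_front_graph() {
        hb47 = kernel::create(fir_hb47);
        hb11 = kernel::create(fir_hb11);

        // Window sizes are specified in bytes: 512 cint16 samples = 2048 bytes, etc.
        connect< window<512 * sizeof(cint16)> > net0(in,          hb47.in[0]); // 512-sample input window
        connect< window<256 * sizeof(cint16)> > net1(hb47.out[0], hb11.in[0]); // 256-sample window
        connect< window<128 * sizeof(cint16)> > net2(hb11.out[0], out);        // 128-sample output window

        source(hb47) = "fir_hb47.cc";   // placeholder source file names
        source(hb11) = "fir_hb11.cc";
        runtime<ratio>(hb47) = 0.9;     // placeholder run-time ratios
        runtime<ratio>(hb11) = 0.9;
    }
};
```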

A cycle budget is the number of instruction cycles a function can take to compute a block of output data, given by:

Cycle Budget = (Window Size / Sample Rate) × AI Engine Clock Frequency

At a 1 GHz AI Engine clock in the lowest speed-grade device, the processing of 512 samples at 245.76 MSPS has a cycle budget of 512 / 245.76 MSPS × 1 GHz ≈ 2083 cycles.

Suppose every output needs P 16-bit-real by 16-bit-real multiplications. The AI Engine can compute 32 such real-by-real multiplications every cycle. For an ideal implementation, the utilization lower bound is given by:

Utilization Lower Bound = (P × Output Window Size) / (32 × Cycle Budget)

Take FIR199 as an example. FIR199 has 199 real symmetric filter taps, and it takes 100 16-bit-complex by 16-bit-real multiplications to compute each output. Therefore, every output of FIR199 needs 200 16-bit-real by 16-bit-real multiplications at 122.88 MSPS, and its utilization lower bound is 200 multiplications × 256 samples / (32 multiplications per cycle × 2083-cycle budget) = 76.8%. Similarly, the utilization lower bounds of the other DDC filters are calculated and listed in the following table.
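The per-instance figures in the next table can be reproduced directly from the two expressions above. The sketch below recomputes the cycle budget and utilization lower bound for the FIR and halfband filters from the Table 1 and Table 2 parameters; the struct and variable names are invented for this sketch, and the mixer is left out because its cost also includes the DDS overhead noted under the table. Small differences from the tabulated values (for example, 9.2% versus 9.3%) are due to rounding of the 2083-cycle budget.

```cpp
// Recompute Table 2's utilization lower bounds from window size, sample rate,
// and multiplications per output (names here are only for this sketch).
#include <cstdio>

struct Filter {
    const char *name;
    double out_rate_msps;    // output sample rate (MSPS)
    int    out_window;       // output window size (samples)
    int    macs_per_output;  // 16-bit-real x 16-bit-real multiplications per output
    int    instances;        // number of kernel instances
};

int main() {
    const double aie_clock_mhz  = 1000.0;  // 1 GHz, lowest speed grade
    const double macs_per_cycle = 32.0;    // real-by-real multiplications per cycle

    const Filter filters[] = {
        {"FIR199", 122.88, 256, 200, 1},
        {"FIR89",   30.72,  64,  96, 5},
        {"HB47",   122.88, 256,  32, 1},
        {"HB11",    61.44, 128,   8, 5},
        {"HB23",    30.72,  64,  16, 5},
    };

    for (const Filter &f : filters) {
        double budget = f.out_window / f.out_rate_msps * aie_clock_mhz;            // ~2083 cycles
        double util   = f.macs_per_output * f.out_window / (macs_per_cycle * budget);
        std::printf("%-7s budget %.0f cycles, %.1f%% per instance, %.1f%% for %d instances\n",
                    f.name, budget, 100.0 * util, 100.0 * util * f.instances, f.instances);
    }
    return 0;
}
```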

Table 2. AI Engine Utilization Lower Bound Analysis

Filter    | Input Window Size | Output Window Size | Number of Taps | Number of MACs/Output | Utilization/Instance | Number of Instances | Utilization Lower Bound
FIR199    | 256 | 256  | 199 | 200 | 76.8% | 1 | 76.8%
FIR89     | 64  | 64   | 89  | 96  | 9.3%  | 5 | 46.5%
HB47      | 512 | 256  | 47  | 32  | 12.3% | 1 | 12.3%
HB11      | 256 | 128  | 11  | 8   | 1.6%  | 5 | 8%
HB23      | 128 | 64   | 23  | 16  | 1.6%  | 5 | 8%
Mixer (1) | 256 | 1280 | -   | 8   | 23%   | 1 | 23%
Total     |     |      |     |     |       |   | 174.6%
  1. To support a configurable carrier frequency at run time, an on-line DDS calculation consumes an extra 180 cycles in each mixer kernel execution.
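The extra cycles come from generating the mixing carrier on the fly so that its frequency can be changed at run time. The plain C++ sketch below only illustrates the underlying idea of a per-sample DDS phase accumulator with a programmable phase increment; the function names and the 32-bit accumulator width are assumptions for this sketch, it is not the reference design's vectorized AI Engine mixer code, and the 180-cycle figure applies to that implementation rather than to this scalar model.

```cpp
// Conceptual DDS mixer model: a 32-bit phase accumulator advanced once per
// sample, with the increment set from the requested carrier offset.
#include <cstdint>
#include <cstddef>
#include <complex>
#include <cmath>
#include <vector>

static const double kTwoPi      = 6.283185307179586;
static const double kPhaseScale = 4294967296.0;   // 2^32, full accumulator range

// Phase increment for a carrier offset, as a fraction of 2*pi in 32-bit fixed point.
uint32_t phase_increment(double carrier_hz, double sample_rate_hz) {
    return static_cast<uint32_t>(carrier_hz / sample_rate_hz * kPhaseScale);
}

// Mix one block: rotate every input sample by the running DDS phase.
void dds_mix(const std::vector<std::complex<float>> &in,
             std::vector<std::complex<float>> &out,
             uint32_t &phase, uint32_t inc) {
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        double ang = phase * (kTwoPi / kPhaseScale);   // accumulator -> radians
        out[i] = in[i] * std::complex<float>(static_cast<float>(std::cos(ang)),
                                             static_cast<float>(std::sin(ang)));
        phase += inc;                                  // unsigned wrap = modulo 2*pi
    }
}
```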

Although in theory this DDC could be implemented on two AI Engines at 87.3% utilization each, such high utilization would require very long windows and undesirable latency. One way to reduce the utilization is to take advantage of the fact that 5G NR and 4G LTE carriers do not co-exist in this case, so the filters for unused carriers can be disabled at run time, depending on the carrier configuration. A detailed analysis and explanation is provided in the following sections.
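As a conceptual illustration of that idea in plain C++ (the configuration structure and function names below are invented for this sketch and are not the mechanism used by the reference design), a per-carrier enable mask can be checked on every block so that the filter work for inactive carriers is simply skipped:

```cpp
// Sketch: skip per-carrier filter work for carriers disabled by the current
// 5G NR / LTE carrier configuration.
#include <cstddef>
#include <cstdint>
#include <vector>

struct CarrierConfig {
    uint32_t active_mask = 0;  // bit c set => carrier c is enabled
    bool is_active(std::size_t carrier) const { return (active_mask >> carrier) & 1u; }
};

// 'process' stands in for the per-carrier filter chain (FIR89, HB23, and so on).
void run_carrier_filters(const CarrierConfig &cfg,
                         std::vector<std::vector<int16_t>> &carrier_blocks,
                         void (*process)(std::vector<int16_t> &)) {
    for (std::size_t c = 0; c < carrier_blocks.size(); ++c) {
        if (!cfg.is_active(c))
            continue;                   // inactive carrier: no filter work, cycles are freed
        process(carrier_blocks[c]);     // active carrier: run its channel filters
    }
}
```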