AI Engine-to-PL Rate Matching

AI Engine-to-PL Rate Matching - 2021.2 English

Versal ACAP AI Engine Programming Environment User Guide (UG1076)

Document ID

UG1076

Release Date

2021-12-17

Version

2021.2 English

The AI Engine runs at 1 GHz (or more, depending on the device) and can write at most two streams with a 32-bit data width per cycle. In contrast, a PL kernel can run at 500 MHz (half the frequency of the AI Engine), while consuming a larger bit width. Rate matching is concerned with balancing the throughput from the producer to the consumer, and is used to ensure that neither of the processes creates a bottleneck with respect to the total performance. The following equation shows the rate matching for each channel:

Frequency AI Engine × Data AI Engine per cycle = Frequency PL × Data PL per cycle

The following table shows a PL rate matching example for a 32-bit channel written to each cycle by the AI Engine at 1 GHz for -1L speed grade devices. As shown, the PL IP has to consume two times the data at half the frequency or four times the data at one quarter of the frequency.

Table 1. Frequency Response of AI Engine Compared to PL Region
AI Engine		PL
Frequency	Data per Cycle	Frequency	Data per Cycle
1 GHz	32 bit	500 MHz	64 bit
		250 MHz	128 bit

Because the need to match frequency and adjust data-path width is well understood by the Vitis compiler (v++), the tool automatically extracts the port width from the PL kernel, the frequency from the clock specification, and introduces an upsizer/downsizer to temporarily store the data exchanged between the AI Engine and the PL regions to manage the rate match.

To avoid deadlocks, it is important to ensure that if multiple channels are read or written between the AI Engine and the PL, the data rate per cycle is concurrently achieved on all channels. For example, if one channel requires 32 bits, and the second 64 bits, the AI Engine code must ensure that both channels are written adequately to avoid back pressure or starvation on the channel. Additionally, to avoid deadlock, writing/reading from the AI Engine and reading/writing in the PL code must follow the same chronological order.

The number of interfaces used in the graph function definition for the PL defines the number of AXI4-Stream interfaces. Each argument results in the creation of a separate stream.