Case Studies

The generic matrix multiplication architecture proposed in this application note scales to various beamforming configurations. For illustration, this section studies a 5G NR 100 MHz system, and the following figure shows the beamformer implementation for several use cases. All the designs are based on a highly scalable architecture built from a small number of kernels that differ only in their input and output interfaces. The key parameters are:

Each AI Engine performs an (8 × 8) × (8 × 12) submatrix multiplication, that is, u = v = 8 and L = 12.
Every input and output data AXI4-Stream carries 8 × 100 = 800 MSPS, which is 80% of its 1 GSPS capacity.
Every coefficient AXI4-Stream delivers one 8 × 8 coefficient submatrix every 12 samples, that is, 8 × 8/(12 × 1/100 MHz) ≈ 533 MSPS.
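The stream rates listed above can be checked with a few lines of Python. This is an illustrative sketch, not part of the application note's design flow: it assumes a 1 GSPS capacity per AXI4-Stream and a per-channel sample rate equal to B = 100 MSPS, and the function names are hypothetical.

```python
# Stream-throughput check for the 5G NR 100 MHz case study.
# Assumptions: each AXI4-Stream moves one complex sample per cycle at
# 1 GHz (1 GSPS capacity), and each antenna/layer channel runs at B MSPS.
STREAM_CAPACITY_MSPS = 1000.0


def data_stream_msps(channels: int, b_msps: float) -> float:
    """Rate of one data stream carrying `channels` interleaved channels."""
    return channels * b_msps


def coeff_stream_msps(u: int, v: int, l: int, b_msps: float) -> float:
    """Rate needed to deliver one u x v coefficient block every L samples.

    L samples last L / B microseconds when B is in MSPS, so the rate is
    u * v / (L / B) = u * v * B / L samples per microsecond (MSPS).
    """
    return u * v * b_msps / l


u, v, l, b = 8, 8, 12, 100.0
data = data_stream_msps(v, b)
coeff = coeff_stream_msps(u, v, l, b)
print(f"data:  {data:.0f} MSPS ({data / STREAM_CAPACITY_MSPS:.0%} of capacity)")
print(f"coeff: {coeff:.1f} MSPS")
```

With u = v = 8, the input and output streams have the same rate, so a single helper covers both.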

Because each AI Engine accesses only the memory within its own tile, no memory access conflicts with other AI Engines are expected.

Figure 1. 5G NR 100 MHz Beamforming Implementation on AI Engine

In narrowband systems, Equation 10 gives a higher upper bound on the product uv and therefore more freedom in selecting the system parameters. One strategy is to select u = v ≤ √(8G/B) so that the downlink and uplink can share the same set of kernels. For an LTE 20 MHz system with 64 antennas and 16 streams (B = 20 MHz), it is possible to select u = v = 16 and construct the beamformer as shown in Figure 2 (a) and (b).
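The following sketch makes the kernel-size rule concrete. It assumes that Equation 10 takes the form uv ≤ 8G/B, with G the AI Engine clock frequency (taken here as 1 GHz) and 8 the number of complex MACs per cycle. This form is inferred from the u = v ≤ √(8G/B) rule above rather than quoted from the note, and the helper names are hypothetical.

```python
import math

# Assumed form of Equation 10: u * v <= 8 * G / B, where G is the AI Engine
# clock (1 GHz here) and 8 is the assumed number of complex MACs per cycle.
G_HZ = 1_000_000_000
MACS_PER_CYCLE = 8


def fits(u: int, v: int, b_hz: int) -> bool:
    """True if u * v * B complex MACs per second fit the 8G MAC/s budget."""
    return u * v * b_hz <= MACS_PER_CYCLE * G_HZ


def max_square_block(b_hz: int) -> int:
    """Largest integer u = v with u * u <= 8G/B, i.e. u <= sqrt(8G/B)."""
    return math.isqrt(MACS_PER_CYCLE * G_HZ // b_hz)


for label, b_hz in [("LTE 20 MHz", 20_000_000), ("5G NR 100 MHz", 100_000_000)]:
    print(f"{label}: u = v <= {max_square_block(b_hz)}")
```

Under these assumptions the largest square kernel is 20 × 20 for B = 20 MHz, so u = v = 16 fits with margin, whereas B = 100 MHz limits it to 8 × 8, which matches the 5G NR case study.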

Another strategy is to combine the data and coefficient AXI4-Streams into one, which is equivalent to changing the parallel transfer of X and H to a serial one. To keep the vector processor fully occupied, the kernel parameters must satisfy the following condition (Equation 11):

In 3GPP systems, where L = 12, Equation 11 can be simplified to:

Figure 2 (c) shows an alternative downlink beamforming architecture built from AI Engines that each handle u = 32 antennas and v = 8 layers. However, because the number of layers in this system (16) is less than 32, the uplink requires a different set of kernels, as shown in Figure 2 (b).
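The following Python sketch uses a simple occupancy model of the combined-stream kernel to show why the u = 32, v = 8 downlink kernel can keep the vector processor busy while a kernel limited to 16 outputs cannot. The model is an assumption for illustration and not necessarily the exact form of Equation 11: it takes one complex sample per cycle on the single input stream, 8 complex MACs per cycle, and, for each block of L samples, vL data samples plus uv coefficients on the stream and uvL MACs in the kernel. The function names are hypothetical.

```python
# Assumed occupancy model for a kernel fed by one combined AXI4-Stream
# (illustrative only; not necessarily the exact form of Equation 11):
#   - the stream delivers one complex sample per cycle,
#   - the vector processor performs 8 complex MACs per cycle,
#   - each block of L samples carries v*L data samples plus u*v
#     coefficients, and the kernel performs u*v*L MACs on it.
MACS_PER_CYCLE = 8


def transfer_cycles(u: int, v: int, l: int) -> int:
    """Cycles to stream one block of data plus its coefficients."""
    return v * l + u * v


def keeps_vector_unit_busy(u: int, v: int, l: int) -> bool:
    """True if computing a block takes at least as long as streaming it.

    Compares u*v*L / 8 >= transfer cycles in integer form to avoid rounding.
    """
    return u * v * l >= MACS_PER_CYCLE * transfer_cycles(u, v, l)


L = 12  # samples per coefficient block in 3GPP systems
for u, v in [(32, 8), (16, 16)]:
    print(f"u={u}, v={v}: busy={keeps_vector_unit_busy(u, v, L)}")

# Smallest u that keeps the vector unit busy for L = 12 (v cancels out).
min_u = next(u for u in range(1, 129) if keeps_vector_unit_busy(u, 8, L))
print(f"minimum u for L={L}: {min_u}")
```

In this model v cancels, leaving L + u ≤ uL/8, which for L = 12 becomes u ≥ 24. That is consistent with the u = 32 downlink kernel, and it suggests why an uplink kernel that can produce at most 16 layers cannot reuse the same combined-stream design.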

Figure 2. LTE 20 MHz 64 Antenna 16 Stream Beamforming Implementation on AI Engine

When u and v must take specific values, a wideband beamformer can be constructed from multiple instances of narrowband ones, at the cost of additional programmable logic (PL) for data multiplexing and demultiplexing. The following figure shows an example of a 5G NR 100 MHz beamformer made up of four instances of the 25 MHz beamforming units shown in the previous figure. Although the total number of AI Engines is the same as that of a single wideband beamformer instance, the demux block at 1.6 GSPS and the mux block at 6.4 GSPS require considerable logic resources.
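The resource trade-off described above can be tallied with a short sketch. It assumes, for illustration only, that the 5G NR system has 64 antennas and 16 layers (consistent with 6.4 GSPS and 1.6 GSPS at 100 MSPS per channel), that each 25 MHz unit uses u = v = 16 kernels (which satisfies uv ≤ 8G/B for G = 1 GHz if Equation 10 takes that form), and that a beamformer built from u × v kernels uses (antennas/u) × (layers/v) AI Engines. These figures are assumptions, not values read from the figures, and `engine_count` is a hypothetical helper.

```python
# Illustrative assumptions (not taken from the figures): 64 antennas,
# 16 layers, 100 MSPS per channel for the full 100 MHz band, and a
# beamformer built from u x v kernels uses (antennas/u) * (layers/v)
# AI Engines.
ANTENNAS, LAYERS = 64, 16
CHANNEL_MSPS = 100


def engine_count(u: int, v: int) -> int:
    """AI Engines needed to tile the antenna x layer matrix with u x v kernels."""
    if ANTENNAS % u or LAYERS % v:
        raise ValueError("u and v must divide the antenna and layer counts")
    return (ANTENNAS // u) * (LAYERS // v)


wideband = engine_count(8, 8)            # one 100 MHz instance, u = v = 8
four_narrow = 4 * engine_count(16, 16)   # four 25 MHz units, assumed u = v = 16

# Assumed downlink: the demux splits the layer inputs, the mux merges antenna outputs.
demux_gsps = LAYERS * CHANNEL_MSPS / 1000
mux_gsps = ANTENNAS * CHANNEL_MSPS / 1000

print(f"AI Engines: wideband={wideband}, 4 x 25 MHz={four_narrow}")
print(f"demux {demux_gsps:.1f} GSPS, mux {mux_gsps:.1f} GSPS")
```

Under this simple tiling model both designs need 16 AI Engines, so the cost of the multi-instance approach shows up entirely in the PL mux and demux logic.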

Figure 3. 5G NR 100 MHz Beamformer Using Four 25 MHz Beamforming Units