Beamforming Formulation

Beamforming Implementation on AI Engine (XAPP1352)

Document ID: XAPP1352
Release Date: 2021-01-11
Revision: 1.0 English

As shown in Figure 1, beamforming can be described as linear operations. In the downlink, the transmit signal on each antenna is a weighted summation of the layers, and in the uplink the equalized signal on each layer is a linear combination of the signals received on the antennas.

Write the vector of layers on subcarrier k as $X_k = [x_{k,0}, x_{k,1}, x_{k,2}, \ldots, x_{k,M-1}]^T$, where M is the number of streams (layers), and the vector of frequency-domain transmit signals on the antennas as $Y_k = [y_{k,0}, y_{k,1}, y_{k,2}, \ldots, y_{k,N-1}]^T$, where N is the number of antennas and $[\cdot]^T$ denotes the vector transpose. Beamforming can be formulated as a matrix multiplication:

$$[Y_k, Y_{k+1}, \ldots, Y_{k+L-1}] = H \, [X_k, X_{k+1}, \ldots, X_{k+L-1}] \qquad \text{(Equation 1)}$$

where H is a complex N × M matrix, often called the beamforming coefficient matrix, and L is the number of subcarriers sharing the same coefficient matrix H. Similarly, in the uplink the beamforming function can be written as

The preceding two equations show that both downlink and uplink beamforming can be formulated as matrix multiplications. In the downlink, the matrix dimension is (N × M) times (M × L), which requires N × M × L complex multiply-accumulate (CMAC) operations. In the uplink, the matrix dimension becomes (M × N) times (N × L), so the number of CMACs is the same, N × M × L.
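The CMAC count above can be checked with a short sketch. This is plain Python written only for illustration; the helper `matmul_count`, the toy dimensions, and the uplink names `W` and `R` are assumptions and do not come from the application note.

```python
def matmul_count(A, B):
    """Multiply A (rows x inner) by B (inner x cols).

    Returns the product and the number of complex multiply-accumulates used.
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    assert all(len(row) == inner for row in A), "inner dimensions must match"
    out = [[0j] * cols for _ in range(rows)]
    cmacs = 0
    for i in range(rows):
        for j in range(cols):
            acc = 0j
            for t in range(inner):
                acc += A[i][t] * B[t][j]  # one CMAC
                cmacs += 1
            out[i][j] = acc
    return out, cmacs


# Toy sizes (illustrative only): N antennas, M layers, L subcarriers.
N, M, L = 4, 2, 3

# Downlink: (N x M) coefficient matrix times (M x L) layer matrix.
H = [[complex(n + 1, m) for m in range(M)] for n in range(N)]
X = [[complex(m, l) for l in range(L)] for m in range(M)]
Y, downlink_cmacs = matmul_count(H, X)

# Uplink: (M x N) coefficient matrix times (N x L) received-signal matrix.
# W and R are hypothetical names for the uplink operands.
W = [[complex(m, -n) for n in range(N)] for m in range(M)]
R = [[complex(n, l) for l in range(L)] for n in range(N)]
Z, uplink_cmacs = matmul_count(W, R)

print(downlink_cmacs, uplink_cmacs)  # both equal N * M * L
```

Both directions use the same number of CMACs because the three dimensions N, M, and L enter the count symmetrically.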

According to the definition of OFDM, the time duration of one OFDM symbol with occupied bandwidth B is equal to the inverse of the subcarrier spacing, which is given by K/B, where K is the number of subcarriers. For downlink beamforming, (K/L) matrix multiplications of Equation 1, each requiring N × M × L CMACs, must be performed in (K/B) time, so the number of CMACs in one second is given by:

$$\text{CMACs per second} = \frac{K}{L} \cdot \frac{N M L}{K/B} = N M B \qquad \text{(Equation 3)}$$

In 3GPP OFDM systems, the cyclic prefix lengthens the symbol period and some subcarriers do not require beamforming, so the number of CMACs given by the preceding equation is higher than the minimum requirement. It is nevertheless a useful figure for system dimensioning.

In a 100 MHz 5G system with 64 antennas and 32 layers, downlink beamforming requires as many as (64 × 32 × 100e6) = 204.8G CMACs per second. At a 400 MHz clock, two DSP58s can compute 400M CMACs in one second, so it takes (204.8G / 400M) × 2 = 1024 DSP58s to implement the same functionality.

In Versal™ AI Core devices, one AI Engine is capable of 8G CMACs per second. For the previous example, assuming an 80% runtime ratio of the AI Engines, it takes (204.8G / 8G / 80%) = 32 AI Engines to implement the matrix multiplication. In other words, one AI Engine is equivalent to (1024 / 32) = 32 DSP58 blocks in this example.
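The dimensioning arithmetic of the last two paragraphs can be reproduced with a few lines of Python. This is a minimal sketch that only restates the figures quoted in the text (400 MHz clock, two DSP58s per CMAC per cycle, 8G CMACs per second per AI Engine, 80% runtime ratio); the function names are hypothetical, and integer arithmetic is used to avoid floating-point rounding.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def cmacs_per_second(n_antennas, n_layers, bandwidth_hz):
    """N x M x B, the per-second CMAC requirement."""
    return n_antennas * n_layers * bandwidth_hz


def dsp58_needed(total_cmacs, clock_hz=400_000_000, dsp58_per_cmac=2):
    """Two DSP58s deliver one CMAC per clock cycle (figure from the text)."""
    return ceil_div(total_cmacs * dsp58_per_cmac, clock_hz)


def ai_engines_needed(total_cmacs, engine_cmacs=8_000_000_000, runtime_percent=80):
    """One AI Engine delivers 8G CMACs/s, derated by the runtime ratio."""
    return ceil_div(total_cmacs * 100, engine_cmacs * runtime_percent)


total = cmacs_per_second(64, 32, 100_000_000)
dsp58 = dsp58_needed(total)
engines = ai_engines_needed(total)
print(total, dsp58, engines, dsp58 // engines)
# prints: 204800000000 1024 32 32
```

Changing the antenna count, layer count, or bandwidth in the call to `cmacs_per_second` gives a quick first estimate for other configurations under the same assumptions.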

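For a MIMO system with M layers and N antennas, the matrix multiplication of Equation 1 can be divided into (N/u) sub-matrix multiplication chains. Each chain consists of (M/v) sub-matrix multiplication units, and each unit handles one (u-by-v) times (v-by-L) matrix multiplication. More specifically, Equation 1 can be written as follows:

$$\begin{bmatrix} Y_0 \\ Y_1 \\ \vdots \\ Y_{N/u-1} \end{bmatrix} = \begin{bmatrix} H_{0,0} & H_{0,1} & \cdots & H_{0,M/v-1} \\ H_{1,0} & H_{1,1} & \cdots & H_{1,M/v-1} \\ \vdots & \vdots & \ddots & \vdots \\ H_{N/u-1,0} & H_{N/u-1,1} & \cdots & H_{N/u-1,M/v-1} \end{bmatrix} \begin{bmatrix} X_0 \\ X_1 \\ \vdots \\ X_{M/v-1} \end{bmatrix} \qquad \text{(Equation 4)}$$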

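where $Y_k$ is a matrix of (u-by-L) complex entries, $X_m$ is a matrix of (v-by-L) complex entries, $H_{k,m}$ is a (u-by-v) complex matrix, and they must satisfy

$$Y_k = \sum_{m=0}^{M/v-1} H_{k,m} X_m, \qquad k = 0, 1, \ldots, N/u - 1 \qquad \text{(Equation 5)}$$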

The preceding equation has a unified submatrix multiplication dimension of (u-by-v) times (v-by-L) that can be implemented on a single AI Engine. A chain of (M/v) AI Engines can be cascaded to accumulate the partial CMAC results for the final output.
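The following plain-Python sketch illustrates the decomposition numerically: it splits a small matrix product into (u-by-v) times (v-by-L) units, accumulates the partial results along each chain, and compares the result with the direct product. It only models the arithmetic, not the AI Engine hardware or its cascade interface, and the function names and toy dimensions are assumptions made for illustration.

```python
def matmul(A, B):
    """Plain complex matrix product of A (r x t) and B (t x c)."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def blocked_matmul(H, X, u, v):
    """Compute H @ X as N/u chains of M/v sub-matrix multiplications.

    Returns the product and the number of (u x v) by (v x L) units used.
    """
    N, M, L = len(H), len(X), len(X[0])
    if N % u or M % v:
        raise ValueError("u must divide N and v must divide M")
    Y, units = [], 0
    for k in range(N // u):              # one chain per block row k
        acc = [[0j] * L for _ in range(u)]
        for m in range(M // v):          # one unit per block column m
            H_km = [row[m * v:(m + 1) * v] for row in H[k * u:(k + 1) * u]]
            X_m = X[m * v:(m + 1) * v]
            partial = matmul(H_km, X_m)
            # Accumulate the partial result, as the chain does.
            acc = [[a + p for a, p in zip(ra, rp)]
                   for ra, rp in zip(acc, partial)]
            units += 1
        Y.extend(acc)                    # Y_k leaves the end of the chain
    return Y, units


# Toy sizes (illustrative only); integer-valued entries keep the check exact.
N, M, L, u, v = 4, 4, 3, 2, 2
H = [[complex(n - m, n + m) for m in range(M)] for n in range(N)]
X = [[complex(m + 1, l - 1) for l in range(L)] for m in range(M)]

Y_blocked, units = blocked_matmul(H, X, u, v)
Y_direct = matmul(H, X)
print(units, Y_blocked == Y_direct)
```

The unit count equals (N/u) × (M/v), which is the number of AI Engines the array needs when each engine handles one unit.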

Figure 1. Matrix Multiplication on AI Engine Array

The previous figure shows a possible implementation of Equation 4 on the AI Engine array. Each AI Engine handles (u-by-v) times (v-by-L) matrix multiplication, and the cascading bus connects the accumulation register of one AI Engine to another to form a full-precision accumulation chain. Every row of AI Engines implements Equation 5 for a given k, and the output matrix Yk is written into the memory of the last AI Engine tile in a row. The input to each AI Engine is provided by programmable logic and consists of a (u-by-v) matrix Hk,m and a (v-by-L) data matrix Xm, both of which are stored in the local memory by DMAs while the AI Engine is computing the product of the previous H and X in the double buffer. All AI Engines in one column of the array share the same data matrix Xm, so one input data stream can be multicast to all of them through AXI switches built in the AI Engine array. Note that the DMAs and AXI switches are configured by Xilinx tools automatically according to the dataflow defined by the user at compilation time.

Figure 2. Timing Diagram of Pipelining in One AI Engine

In one AI Engine, data transfer and computation are pipelined for high throughput. As shown in the previous figure, the time needed for data transfer and computation depends on the design parameters u, v, and L. To achieve 100% MAC efficiency in the AI Engine, the time needed for data transfer should not exceed that of computation, which means the following equation must hold.

The solution is:

In 3GPP LTE and 5G NR systems, the minimum value of L is 12, and setting (u = v = 8) is a good strategy to make sure the overall throughput is not limited by data transfer bandwidth.

Moreover, the input and output AXI buses carry time-multiplexed data of v and u channels, respectively. Because the throughput of each AXI bus is limited to 1 Gs/s, the maximum signal bandwidth is bounded above by the sample rate of each data channel, given by

All 5G NR carriers transmitted in FR1 frequency bands (below 7.125 GHz) are at most 100 MHz wide, which fits well within the above bandwidth range. When there are multiple carriers and the total bandwidth exceeds 100 MHz, multiple instances of the beamforming module are used to meet the throughput requirement.

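Larger u and v values reduce the number of AI Engines. However, the (N/u) × (M/v) AI Engines, each capable of 8G CMACs per second, must together deliver the total amount of compute given by Equation 3, that is:

$$\frac{N}{u} \cdot \frac{M}{v} \cdot 8 \times 10^{9} \ge N M B \qquad \text{(Equation 9)}$$

After simplification:

$$u \, v \le \frac{8 \times 10^{9}}{B} \qquad \text{(Equation 10)}$$

With B expressed in MHz, the right-hand side is 8000/B.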
Equation 10 gives an upper bound of the product of u and v, which can be translated into a lower bound of the number of AI Engines or the overall MAC efficiency of each AI Engine. For 100 MHz 5G NR carriers where B=100 MHz, the upper bound is u v ≤ (8000/100) = 80. The selection of (u = v = 8) represents a MAC efficiency of the vector processor in a single AI Engine of (8 × 8/80) = 80% and is optimal for many wireless systems where the numbers of antennas and layers are both integer multiples of eight.
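As a quick numerical check of this bound, the sketch below evaluates the limit on the product of u and v, the resulting MAC efficiency, and the AI Engine count for the 64-antenna, 32-layer example. It assumes the 8G CMACs per second per AI Engine figure quoted earlier; the function names are hypothetical.

```python
ENGINE_CMACS_PER_SECOND = 8_000_000_000  # per AI Engine, figure from the text


def uv_upper_bound(bandwidth_hz):
    """Largest u*v for which the array still meets N*M*B CMACs per second."""
    return ENGINE_CMACS_PER_SECOND / bandwidth_hz


def mac_efficiency(u, v, bandwidth_hz):
    """Fraction of one AI Engine's CMAC capacity used by a (u, v) tiling."""
    return u * v * bandwidth_hz / ENGINE_CMACS_PER_SECOND


def engine_count(n_antennas, n_layers, u, v):
    """Number of AI Engines: (N/u) chains of (M/v) units each."""
    if n_antennas % u or n_layers % v:
        raise ValueError("u must divide N and v must divide M")
    return (n_antennas // u) * (n_layers // v)


B = 100_000_000  # 100 MHz carrier
print(uv_upper_bound(B))            # bound on u*v
print(mac_efficiency(8, 8, B))      # efficiency for u = v = 8
print(engine_count(64, 32, 8, 8))   # engines for 64 antennas, 32 layers
```

For u = v = 8 the engine count matches the 32 AI Engines obtained earlier from the 80% runtime ratio, since 64/80 is the same 80% figure.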