Kernel Design Details

A key feature of the proposed beamforming architecture is that a small set of kernels covers a wide range of system configurations. For instance, in 5G NR 100 MHz systems all the beamformers shown in Figure 1 can be built with three kernels, as shown in the following figure. Depending on their position in the cascading chain, the kernels are named first, middle, and last. All the kernels implement an (8 x 8) times (8 x 12) matrix multiplication and differ only in their input and output interfaces. The first kernel in the cascading chain has no cascading input, while the last one writes its output to local memory instead of the cascading bus.

Figure 1. Three Kernels Required by 5G NR Beamforming
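
The following is a minimal host-side sketch, in plain C rather than AI Engine intrinsics, of how partial sums could flow through such a chain. It assumes that each kernel holds its own 8 x 8 coefficient block and 8 x 12 input block, and it uses real integers instead of complex samples to stay short. The names bf8x8_model and bf_chain_model and the flat row-major layout are hypothetical and do not come from the kernel source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROWS 8   /* outputs per kernel */
#define COLS 8   /* inputs per kernel */
#define NSC  12  /* data columns processed per call */

/* Hypothetical model of one kernel: acc += h * x.
 * h is ROWS x COLS, x is COLS x NSC, acc is ROWS x NSC, all row-major.
 * acc stands in for the partial sums carried on the cascading bus. */
static void bf8x8_model(const int32_t *h, const int32_t *x, int64_t *acc)
{
    for (int r = 0; r < ROWS; ++r)
        for (int n = 0; n < NSC; ++n)
            for (int c = 0; c < COLS; ++c)
                acc[r * NSC + n] += (int64_t)h[r * COLS + c] * x[c * NSC + n];
}

/* Chain of num_kernels kernels. The first kernel starts from a zero
 * accumulator (no cascading input); after the last kernel, y holds the
 * final sums that would be written to local memory. */
static void bf_chain_model(int num_kernels, const int32_t *h,
                           const int32_t *x, int64_t *y)
{
    assert(num_kernels >= 1);
    memset(y, 0, sizeof(int64_t) * ROWS * NSC);
    for (int k = 0; k < num_kernels; ++k)
        bf8x8_model(h + k * ROWS * COLS, x + k * COLS * NSC, y);
}
```

With three kernels, all coefficients set to 1, and all inputs set to 2, every output of this model is 3 x 8 x 2 = 48, which illustrates that the chain computes one long dot product per output.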

Every beamforming kernel performs eight mul4/mac4 operations on one column of eight inputs {x0, x1, x2, …, x7} to compute eight outputs {y0, y1, y2, …, y7}. Each operation takes eight coefficients and two inputs, and stores the result in a 384-bit accumulation register. Two accumulation registers are allocated, one for {y0, …, y3} and one for {y4, …, y7}. At the end of the computation, the partial sums are either sent to the next AI Engine for further accumulation or written to local memory after shift, round, and saturation. The following figures illustrate this process, which repeats L times until all the subcarriers sharing the same coefficient matrix have been processed.

Figure 2. Two mul4 Operations on Input Data x0 and x1

Figure 3. Two mac4 Operations on Input Data x2 and x3

Figure 4. Two mac4 Operations on Input Data x4 and x5

Figure 5. Two mac4 Operations on Input Data x6 and x7
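
The decomposition in Figures 2 to 5 can be sketched as a plain C model. This is an illustration under stated assumptions, not the kernel source: the type cint, the functions cmul, mac4_model, and bf8x8_column_model, and the column-major coefficient layout (h[c * 8 + r]) are all hypothetical. Starting from zeroed accumulators stands in for the initial mul4 pair.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int64_t re, im; } cint;   /* hypothetical complex integer */

static cint cmul(cint a, cint b)
{
    cint p = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return p;
}

/* Model of one four-lane operation: eight coefficients and two inputs.
 * acc[lane] += h[row0+lane][col] * x[col] + h[row0+lane][col+1] * x[col+1]
 * h is assumed column-major: h[c * 8 + r]. */
static void mac4_model(cint acc[4], const cint *h, int row0,
                       const cint *x, int col)
{
    assert(col >= 0 && col + 1 < 8 && (row0 == 0 || row0 == 4));
    for (int lane = 0; lane < 4; ++lane) {
        for (int k = 0; k < 2; ++k) {
            cint p = cmul(h[(col + k) * 8 + row0 + lane], x[col + k]);
            acc[lane].re += p.re;
            acc[lane].im += p.im;
        }
    }
}

/* One input column: eight operations in total, two per input pair.
 * acca collects y0..y3 and accb collects y4..y7. */
static void bf8x8_column_model(const cint *h, const cint *x, cint *y)
{
    cint acca[4] = {{0, 0}};
    cint accb[4] = {{0, 0}};
    for (int col = 0; col < 8; col += 2) {
        mac4_model(acca, h, 0, x, col);
        mac4_model(accb, h, 4, x, col);
    }
    for (int i = 0; i < 4; ++i) {
        y[i] = acca[i];
        y[4 + i] = accb[i];
    }
}
```

Each call to mac4_model performs 4 x 2 = 8 complex multiplications, so the eight calls per column cover all 64 coefficients of the 8 x 8 matrix.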

The following figure is a timing diagram of the inner loop of the bf8x8_fst kernel. Before the loop, registers bufa and bufb are initialized with the first half of the coefficients {h0, h1, …, h31}, and one column of input data {x0, x1, …, x7} is loaded into the register dat. During the first four clock cycles of the loop, the second half of the coefficients is read into the registers in parallel with the MAC operations on the first half. At clock cycle 7, {y0, y1, y2, y3} are computed and sent to the next AI Engine via the cascading bus, followed by {y4, y5, y6, y7}. From clock cycle 9 to 16, the next column of eight inputs is processed in reverse order: the MAC operations start with the second half of the coefficients, which is already in the registers, and the first half is then loaded at cycles 13 and 14. The inner loop takes 16 clock cycles, during which 16 mul4/mac4 operations, 10 memory loads, and four cascading bus pushes are executed in parallel. The vector processor is fully occupied, with no idle cycles. For L subcarriers, the inner loop runs for L/2 iterations.

Figure 6. Timing Diagram of Inner Loop of bf8x8_fst Kernel
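
The loop arithmetic described above can be checked with a small C sketch. It only restates the numbers in the text (16 cycles per iteration, two input columns per iteration, alternating order of the coefficient halves). The function names are hypothetical, and the model ignores loop prologue and epilogue cycles.

```c
#include <assert.h>

/* Inner-loop cycle count for num_subcarriers input columns:
 * each 16-cycle iteration processes two columns. */
static unsigned inner_loop_cycles(unsigned num_subcarriers)
{
    assert(num_subcarriers % 2u == 0u);
    return 16u * (num_subcarriers / 2u);
}

/* Coefficient half used in a given phase of a given column.
 * Half 0 is {h0..h31}, half 1 is {h32..h63}; phase 0 is the first four
 * operations on the column and phase 1 is the last four.
 * Even columns run 0 then 1; odd columns run in reverse, 1 then 0. */
static int coeff_half_used(unsigned column, int phase)
{
    assert(phase == 0 || phase == 1);
    return (column % 2u == 0u) ? phase : 1 - phase;
}
```

In this model, the half used last on one column is always the half used first on the next column, so only one half of the coefficients has to be reloaded per column. For example, inner_loop_cycles(12) returns 96.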

The kernel bf8x8_mid reads the partial sums from the previous AI Engine before starting the first MAC operation. In the following C code, the intrinsic getc_scd() loads the data on the cascading bus into an accumulation register, and the intrinsic mac4() resumes the accumulation without wasting any clock cycles.

acca = mac4(getc_scd(), bufa, 0, 0x3210, 8, dat, 0, 0x0000, 1); 
accb = mac4(getc_scd(), bufa, 4, 0x3210, 8, dat, 0, 0x0000, 1);

The kernel bf8x8_lst writes the final computation result into local memory. The vector {y0, y1, …, y7} is 256 bits wide and can be written into memory in one clock cycle if the data come from a 768-bit, 8-lane accumulation register. Because each mac4 operation updates only four lanes, the intrinsics ext_lo, ext_hi, upd_lo, and upd_hi are needed to read and update the lower and upper halves of that register. The first four instructions of the loop are shown below for comparison with those of the other kernels:

acc = upd_lo(acc, mac4(getc_scd(),  bufa, 0, 0x3210, 8, dat, 0, 0x0000, 1));
acc = upd_hi(acc, mac4(getc_scd(),  bufa, 4, 0x3210, 8, dat, 0, 0x0000, 1));
acc = upd_lo(acc, mac4(ext_lo(acc), bufb, 0, 0x3210, 8, dat, 2, 0x0000, 1));
acc = upd_hi(acc, mac4(ext_hi(acc), bufb, 4, 0x3210, 8, dat, 2, 0x0000, 1));
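
The shift, round, and saturate step that precedes the memory write can be sketched in plain C as follows. This is a host-side model under stated assumptions, not the hardware behavior. It assumes round-half-up rounding (the rounding mode actually configured on the device may differ) and 16-bit output components, which is consistent with eight complex outputs occupying 256 bits. The names cacc, srs_model, and store_y_model are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int64_t re, im; } cacc;   /* one accumulator lane (model) */

/* Shift right with round-half-up, then saturate to the int16 range.
 * Floor division is used so that negative values are handled portably. */
static int16_t srs_model(int64_t acc, unsigned shift)
{
    assert(shift < 48u);
    int64_t d = (int64_t)1 << shift;
    int64_t t = acc + d / 2;        /* add half an output LSB before shifting */
    int64_t q = t / d;
    if (t % d < 0)
        --q;                        /* turn truncation into floor */
    if (q > INT16_MAX) return INT16_MAX;
    if (q < INT16_MIN) return INT16_MIN;
    return (int16_t)q;
}

/* Convert eight accumulator lanes into eight 16-bit complex samples,
 * stored as interleaved re/im: 8 x 2 x 16 bits = 256 bits. */
static void store_y_model(const cacc acc[8], unsigned shift, int16_t out[16])
{
    for (int lane = 0; lane < 8; ++lane) {
        out[2 * lane]     = srs_model(acc[lane].re, shift);
        out[2 * lane + 1] = srs_model(acc[lane].im, shift);
    }
}
```

In this model, srs_model(3, 1) yields 2 and srs_model(-3, 1) yields -1, while any value beyond the 16-bit range is clamped to 32767 or -32768.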