One feature of the proposed beamforming architecture is that the various system configurations require only a small number of kernels. For instance, in 5G NR 100 MHz systems, all the beamformers shown in Figure 1 can be built with three kernels, as shown in the following figure. The kernels are named first, middle, and last according to their position in the cascading chain. All the kernels implement an (8 x 8) times (8 x 12) matrix multiplication and differ only in their input and output interfaces. The first kernel in the cascading chain has no cascading input, while the last one writes its output to local memory instead of the cascading bus.
Every beamforming kernel performs eight MAC4 operations on one column of eight inputs {x0, x1, x2, …, x7} to compute eight outputs {y0, y1, y2, …, y7}. Each MAC4 operation takes eight coefficients and two inputs, and stores the result in a 384-bit register. Two accumulation registers are allocated, one for {y0, …, y3} and one for {y4, …, y7}. At the end of the computation, the partial summations are either sent to the next AI Engine for further accumulation or written to local memory after shift, round, and saturation. The following figure illustrates this process, which repeats L times until all the subcarriers sharing the same coefficient matrix have been processed.
[Figure: MAC4 sequence for one input column: mul4 operations on input data x0 and x1, followed by mac4 operations on input data x2 and x3, x4 and x5, and x6 and x7]
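The decomposition described above can be sketched as a scalar C model. This is only an illustration, not the AI Engine intrinsic itself: the types cint16_model and cacc_model, the functions mac4_model and bf8x8_column_model, and the row-major coefficient layout are all assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for complex 16-bit data and a wide accumulator lane. */
typedef struct { int16_t re, im; } cint16_model;
typedef struct { int64_t re, im; } cacc_model;

/* One MAC4-style step: 4 output lanes, 2 inputs, 8 coefficients.
   h points at a 4x2 block of a row-major matrix with row stride `stride`
   (layout is an assumption). clear != 0 models mul4, clear == 0 models mac4. */
void mac4_model(cacc_model acc[4], const cint16_model *h, int stride,
                const cint16_model x[2], int clear)
{
    assert(stride >= 2);
    for (int lane = 0; lane < 4; ++lane) {
        if (clear) { acc[lane].re = 0; acc[lane].im = 0; }
        for (int k = 0; k < 2; ++k) {
            const cint16_model c = h[lane * stride + k];
            /* complex multiply-accumulate: acc += c * x[k] */
            acc[lane].re += (int64_t)c.re * x[k].re - (int64_t)c.im * x[k].im;
            acc[lane].im += (int64_t)c.re * x[k].im + (int64_t)c.im * x[k].re;
        }
    }
}

/* 8x8 matrix times one 8-element column, built from eight steps:
   four on acca for {y0..y3} and four on accb for {y4..y7}. */
void bf8x8_column_model(const cint16_model h[64], const cint16_model x[8],
                        cacc_model acca[4], cacc_model accb[4])
{
    for (int pair = 0; pair < 4; ++pair) {
        int clear = (pair == 0);  /* first step is a mul4, the rest are mac4 */
        mac4_model(acca, &h[0 * 8 + 2 * pair], 8, &x[2 * pair], clear);
        mac4_model(accb, &h[4 * 8 + 2 * pair], 8, &x[2 * pair], clear);
    }
}
```

After the eight steps, acca holds the sums for rows 0 to 3 and accb the sums for rows 4 to 7, which together equal the full matrix-vector product for that column.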
The following figure is a timing diagram of the inner loop of the bf8x8_fst kernel. Before the loop, registers bufa and bufb are initialized with the first half of the coefficients {h0, h1, …, h31}. One column of input data {x0, x1, …, x7} is loaded into the register dat. During the first four clock cycles of the loop, the second half of the coefficients is read into the registers in parallel with the MAC operations on the first half. At clock cycle 7, {y0, y1, y2, y3} are computed and sent to the next AI Engine via the cascading bus, followed by {y4, y5, y6, y7}. From clock cycle 9 to 16, the next eight data are processed in reverse order: the MAC operations start from the second half of the coefficients, which is already available in the registers, and the first half is then loaded at cycles 13 and 14. The inner loop takes 16 clock cycles, during which 16 mul4/mac4 operations, 10 memory loads, and four cascading bus pushes are executed in parallel. The vector processor is fully occupied, with no idle cycles. For L subcarriers, the inner loop runs for L/2 iterations.
[Figure: bf8x8_fst Kernel]
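The loop budget quoted above can be checked with a few lines of C. The sketch below only restates the figures from the text (16 cycles and two input columns per iteration, eight mul4/mac4 operations per column); the function names are hypothetical, and prologue and epilogue cycles are ignored.

```c
#include <assert.h>

/* Figures quoted in the text for one inner-loop iteration of bf8x8_fst. */
enum {
    CYCLES_PER_ITERATION  = 16, /* inner-loop length in clock cycles */
    COLUMNS_PER_ITERATION = 2,  /* two input columns (subcarriers) per iteration */
    MACS_PER_COLUMN       = 8   /* eight mul4/mac4 operations per column */
};

/* Number of inner-loop iterations for L subcarriers (L assumed even). */
unsigned bf_iterations(unsigned L)
{
    assert(L % COLUMNS_PER_ITERATION == 0);
    return L / COLUMNS_PER_ITERATION;
}

/* Inner-loop clock cycles for L subcarriers, ignoring prologue and epilogue. */
unsigned bf_inner_loop_cycles(unsigned L)
{
    return bf_iterations(L) * CYCLES_PER_ITERATION;
}

/* Vector-unit occupancy in percent: MAC operations issued per loop cycle. */
unsigned bf_vector_occupancy_percent(void)
{
    return 100u * (MACS_PER_COLUMN * COLUMNS_PER_ITERATION) / CYCLES_PER_ITERATION;
}

/* Coefficient half used first for a given column: 0 = first half, 1 = second half.
   Odd columns start from the half left in the registers by the previous column. */
unsigned bf_first_coeff_half(unsigned column)
{
    return column % 2u;
}
```

With these figures the inner loop costs eight cycles per subcarrier, and one MAC operation is issued in every cycle of the loop.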
The kernel bf8x8_mid reads the partial summation from the previous AI Engine before starting the first MAC operation. In the following C code, the intrinsic getc_scd() loads the data on the cascading bus into an accumulation register, and the intrinsic mac4() resumes the accumulation without wasting any clock cycles.
acca = mac4(getc_scd(), bufa, 0, 0x3210, 8, dat, 0, 0x0000, 1);
accb = mac4(getc_scd(), bufa, 4, 0x3210, 8, dat, 0, 0x0000, 1);
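The effect of chaining first, middle, and last kernels can be sketched on a host in plain C. This model is an assumption-laden simplification: it uses real-valued int16_t data instead of complex samples, represents the cascading bus as a plain array, and the names bf_stage_model and bf_chain_model and the per-stage coefficient layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BF_OUT 8 /* outputs per kernel */
#define BF_IN  8 /* inputs per kernel */

/* One kernel stage: adds its 8x8 partial product to the sums arriving on
   the cascade. cascade_in == NULL models the first kernel, which has no
   cascading input. */
void bf_stage_model(const int16_t h[BF_OUT * BF_IN], const int16_t x[BF_IN],
                    const int64_t *cascade_in, int64_t cascade_out[BF_OUT])
{
    for (int r = 0; r < BF_OUT; ++r) {
        int64_t acc = cascade_in ? cascade_in[r] : 0;
        for (int c = 0; c < BF_IN; ++c)
            acc += (int32_t)h[r * BF_IN + c] * x[c];
        cascade_out[r] = acc;
    }
}

/* Chain of n_stages kernels (first, optional middle stages, last).
   Stage s is assumed to own coefficients h[s*64 .. s*64+63] and inputs
   x[s*8 .. s*8+7]; y receives the fully accumulated outputs. */
void bf_chain_model(const int16_t *h, const int16_t *x, int n_stages,
                    int64_t y[BF_OUT])
{
    int64_t bus[BF_OUT]; /* stand-in for the cascading bus between stages */
    assert(n_stages >= 1);
    for (int s = 0; s < n_stages; ++s) {
        bf_stage_model(&h[s * BF_OUT * BF_IN], &x[s * BF_IN],
                       s ? bus : NULL, y);
        for (int r = 0; r < BF_OUT; ++r)
            bus[r] = y[r];
    }
}
```

Each stage contributes the product for its own eight inputs, so a chain of three stages produces eight outputs that each sum over 24 inputs.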
The kernel bf8x8_lst writes the final computation result into local memory. The vector {y0, y1, …, y7} is 256 bits wide and can be written into memory in one clock cycle if the data come from a 768-bit, 8-lane accumulation register. Because every mac4 operation updates only four lanes, the intrinsics ext_lo, ext_hi, upd_lo, and upd_hi are needed. The first four instructions of the loop are shown below for comparison with those of the other kernels:
acc = upd_lo(acc, mac4(getc_scd(), bufa, 0, 0x3210, 8, dat, 0, 0x0000, 1));
acc = upd_hi(acc, mac4(getc_scd(), bufa, 4, 0x3210, 8, dat, 0, 0x0000, 1));
acc = upd_lo(acc, mac4(ext_lo(acc), bufb, 0, 0x3210, 8, dat, 2, 0x0000, 1));
acc = upd_hi(acc, mac4(ext_hi(acc), bufb, 4, 0x3210, 8, dat, 2, 0x0000, 1));
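The lane handling in these instructions, and the shift, round, and saturation step applied before the write to memory, can be modeled on a host as follows. This is a rough sketch under stated assumptions: the *_model names are hypothetical stand-ins rather than the intrinsics themselves, lanes are real-valued for brevity, and the round-half-up mode is an assumption, since the text does not say which rounding mode is used.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical models of an 8-lane accumulator and its 4-lane halves. */
typedef struct { int64_t lane[8]; } acc8_model;
typedef struct { int64_t lane[4]; } acc4_model;

/* Extract lanes 0..3 (lo) or 4..7 (hi) of the 8-lane accumulator. */
acc4_model ext_lo_model(acc8_model a)
{
    acc4_model r;
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i];
    return r;
}

acc4_model ext_hi_model(acc8_model a)
{
    acc4_model r;
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i + 4];
    return r;
}

/* Return a copy of a with lanes 0..3 (lo) or 4..7 (hi) replaced by v. */
acc8_model upd_lo_model(acc8_model a, acc4_model v)
{
    for (int i = 0; i < 4; ++i) a.lane[i] = v.lane[i];
    return a;
}

acc8_model upd_hi_model(acc8_model a, acc4_model v)
{
    for (int i = 0; i < 4; ++i) a.lane[i + 4] = v.lane[i];
    return a;
}

/* Shift, round, and saturate one lane to 16 bits.
   Rounding is round-half-up (an assumption), implemented with floor
   division so that the result does not depend on how the compiler
   shifts negative values. */
int16_t srs_model(int64_t v, unsigned shift)
{
    assert(shift < 48);
    const int64_t d = (int64_t)1 << shift;
    const int64_t t = v + d / 2;
    int64_t q = t / d;
    if (t % d < 0) --q; /* turn truncation into floor for negative values */
    if (q > INT16_MAX) return INT16_MAX;
    if (q < INT16_MIN) return INT16_MIN;
    return (int16_t)q;
}
```

In this model, two 4-lane results are merged into one 8-lane accumulator with upd_lo_model and upd_hi_model, mirroring how the last kernel gathers {y0, …, y7} so that the whole vector can be converted and stored together.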