Vectorized Version Using a Single Kernel - 2022.2 English

AI Engine Kernel and Graph Programming Guide (UG1079)

AI Engine naturally supports multiple lanes of MAC operations. For variations of FIR applications, the group of aie::sliding_mul* classes and functions introduced in Multiple Lanes Multiplications - sliding_mul can be used.

In this section, we will choose aie::sliding_mul and aie::sliding_mac functions with Lanes=8 and Points=8. Both data and coefficient step sizes are 1, which is the default. For example, acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,8); performs:
Lane 0: acc[0]=acc[0]+coe[1][0]*buff[8]+coe[1][1]*buff[9]+...+coe[1][7]*buff[15];
Lane 1: acc[1]=acc[1]+coe[1][0]*buff[9]+coe[1][1]*buff[10]+...+coe[1][7]*buff[16];
...
Lane 7: acc[7]=acc[7]+coe[1][0]*buff[15]+coe[1][1]*buff[16]+...+coe[1][7]*buff[22];

Notice that the data buff starts from a different index in each lane. The operation therefore needs more than 8 samples: all 15 samples from buff[8] to buff[22] must be ready before execution.
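The lane indexing described above can be illustrated with a scalar model in plain C++. This is only a sketch: the helper names sliding_mac_ref and last_index are hypothetical, plain integers stand in for cint16 and cacc48, both step sizes are fixed at 1, and the data vector is assumed to be indexed circularly.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar model of
//   acc = aie::sliding_mac<Lanes,Points>(acc, coeff, coeff_start, data, data_start)
// with coefficient and data step sizes of 1. Not the AI Engine API itself.
template <std::size_t Lanes, std::size_t Points>
std::array<long, Lanes> sliding_mac_ref(std::array<long, Lanes> acc,
                                        const std::vector<long>& coeff,
                                        std::size_t coeff_start,
                                        const std::vector<long>& data,
                                        std::size_t data_start) {
    assert(coeff_start + Points <= coeff.size());
    assert(!data.empty());
    for (std::size_t lane = 0; lane < Lanes; ++lane) {
        for (std::size_t p = 0; p < Points; ++p) {
            // Every lane uses the same coefficients, but lane `lane` reads
            // data starting `lane` samples later than lane 0.
            // Assumption: indices wrap around the end of the data vector.
            const std::size_t d = (data_start + lane + p) % data.size();
            acc[lane] += coeff[coeff_start + p] * data[d];
        }
    }
    return acc;
}

// Highest data index (before wrapping) read by one such operation.
constexpr std::size_t last_index(std::size_t data_start,
                                 std::size_t lanes,
                                 std::size_t points) {
    return data_start + (lanes - 1) + (points - 1);
}
```

With Lanes=8, Points=8 and data_start=8, last_index gives 22, which matches the range buff[8] to buff[22] stated above.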

Since it has 32 taps, the FIR requires one aie::sliding_mul<8,8> operation and three aie::sliding_mac<8,8> operations to calculate eight lanes of output. The data buffer is updated from stream port by buff.insert.
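To see why one multiply and three multiply-accumulate operations cover all 32 taps, the following plain C++ sketch compares a direct 32-tap sum with the same sum split into four 8-point blocks. The function names fir_direct and fir_blocked are illustrative, integers replace the complex types, and the convention y[n] = sum of c[k]*x[n+k] is assumed to match the kernel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTaps = 32;    // filter length
constexpr std::size_t kLanes = 8;    // outputs computed together
constexpr std::size_t kPoints = 8;   // taps handled per operation

// Direct form for eight consecutive outputs: y[n] = sum_{k=0}^{31} c[k] * x[n + k].
std::array<long, kLanes> fir_direct(const std::vector<long>& c,
                                    const std::vector<long>& x) {
    assert(c.size() == kTaps && x.size() >= kTaps + kLanes - 1);
    std::array<long, kLanes> y{};
    for (std::size_t lane = 0; lane < kLanes; ++lane)
        for (std::size_t k = 0; k < kTaps; ++k)
            y[lane] += c[k] * x[lane + k];
    return y;
}

// Blocked form: block 0 plays the role of sliding_mul (it initializes acc),
// blocks 1 to 3 play the role of sliding_mac (they accumulate into acc).
std::array<long, kLanes> fir_blocked(const std::vector<long>& c,
                                     const std::vector<long>& x) {
    assert(c.size() == kTaps && x.size() >= kTaps + kLanes - 1);
    std::array<long, kLanes> acc{};
    for (std::size_t block = 0; block < kTaps / kPoints; ++block) {
        const std::size_t start = block * kPoints;  // coefficient and data offset
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            long partial = 0;
            for (std::size_t p = 0; p < kPoints; ++p)
                partial += c[start + p] * x[start + lane + p];
            acc[lane] = (block == 0) ? partial : acc[lane] + partial;
        }
    }
    return acc;
}
```

Both functions read 39 input samples (32 taps plus 7 extra lanes), which is why the 32-sample buffer in the kernel has to be refilled while a group of eight outputs is still being computed.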

The vectorized kernel code is as follows:

//keep margin data between different executions of graph
static aie::vector<cint16,32> delay_line;

alignas(aie::vector_decl_align) static cint16 eq_coef[32]={{1,2},{3,4},...};

__attribute__((noinline)) void fir_32tap_vector(input_stream<cint16> * sig_in, output_stream<cint16> * sig_out){
  const int LSIZE=(SAMPLES/32);
  aie::accum<cacc48,8> acc;
  const aie::vector<cint16,8> coe[4] = {aie::load_v<8>(eq_coef),aie::load_v<8>(eq_coef+8),aie::load_v<8>(eq_coef+16),aie::load_v<8>(eq_coef+24)};
  aie::vector<cint16,32> buff=delay_line;
  for(int i=0;i<LSIZE;i++){
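    //compute 1st 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,0);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,8);
    //buff[0..7] is no longer needed by this group: refill it from the stream
    buff.insert(0,readincr_v<4>(sig_in));
    buff.insert(1,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,16);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,24);
    //SHIFT is assumed to be a compile-time constant defined alongside SAMPLES
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //compute 2nd 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,8);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,16);
    buff.insert(2,readincr_v<4>(sig_in));
    buff.insert(3,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,24);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,0);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //compute 3rd 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,16);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,24);
    buff.insert(4,readincr_v<4>(sig_in));
    buff.insert(5,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,0);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,8);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));

    //compute 4th 8 samples
    acc = aie::sliding_mul<8,8>(coe[0],0,buff,24);
    acc = aie::sliding_mac<8,8>(acc,coe[1],0,buff,0);
    buff.insert(6,readincr_v<4>(sig_in));
    buff.insert(7,readincr_v<4>(sig_in));
    acc = aie::sliding_mac<8,8>(acc,coe[2],0,buff,8);
    acc = aie::sliding_mac<8,8>(acc,coe[3],0,buff,16);
    writeincr(sig_out,acc.to_vector<cint16>(SHIFT));
  }
  //save the margin data for the next execution of the graph
  delay_line=buff;
}

void fir_32tap_vector_init()
{
  //initialize data
  for (int i=0;i<8;i++){
    aie::vector<int16,8> tmp=get_wss(0);
    delay_line.insert(i,tmp.cast_to<cint16>());
  }
}
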
  • alignas(aie::vector_decl_align) can be used to ensure data is aligned for vector load and store.
  • Each iteration of the main loop computes 32 output samples (four groups of eight lanes). Consequently, the loop count is reduced to SAMPLES/32.
  • Data update, calculation, and data write are interleaved in the code. The data_start argument of aie::sliding_mul and aie::sliding_mac selects which portion of the data buffer buff is read.
  • For more information about supported data types and lane numbers for aie::sliding_mul, see AI Engine API User Guide (UG1529).
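The interleaving of buffer updates and rotating data_start values can be checked with a small host-side simulation. The sketch below is not AI Engine code: fir_circular and fir_reference are hypothetical names, integers replace cint16, the 32-entry buffer is assumed to be read circularly, and the eight new samples of each group are assumed to be written after the second of the four 8-point operations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulates the main loop: a 32-entry circular buffer, four groups of eight
// outputs per iteration, and a refill of 8 entries in the middle of each group.
std::vector<long> fir_circular(const std::vector<long>& c,   // 32 coefficients
                               std::vector<long> buff,       // initial delay line
                               const std::vector<long>& in)  // new input samples
{
    assert(c.size() == 32 && buff.size() == 32 && in.size() % 32 == 0);
    std::vector<long> out;
    std::size_t rd = 0;  // next unread input sample
    for (std::size_t i = 0; i < in.size() / 32; ++i) {
        for (std::size_t g = 0; g < 4; ++g) {          // group of 8 outputs
            long acc[8] = {};
            for (std::size_t b = 0; b < 4; ++b) {      // 8-point block
                if (b == 2) {
                    // Entries 8g..8g+7 were last needed by block 0 of this
                    // group, so they can now take the next 8 input samples.
                    for (std::size_t j = 0; j < 8; ++j) buff[8 * g + j] = in[rd++];
                }
                const std::size_t data_start = (8 * g + 8 * b) % 32;
                for (std::size_t lane = 0; lane < 8; ++lane)
                    for (std::size_t p = 0; p < 8; ++p)
                        acc[lane] += c[8 * b + p] * buff[(data_start + lane + p) % 32];
            }
            out.insert(out.end(), acc, acc + 8);
        }
    }
    return out;
}

// Straightforward reference over the concatenated sequence delay_line + input.
std::vector<long> fir_reference(const std::vector<long>& c,
                                const std::vector<long>& delay,
                                const std::vector<long>& in) {
    std::vector<long> s = delay;
    s.insert(s.end(), in.begin(), in.end());
    std::vector<long> y(in.size(), 0);
    for (std::size_t n = 0; n < in.size(); ++n)
        for (std::size_t k = 0; k < 32; ++k)
            y[n] += c[k] * s[n + k];
    return y;
}
```

Under these assumptions the two functions produce identical outputs, which shows that the data_start sequences 0/8/16/24, 8/16/24/0, 16/24/0/8, and 24/0/8/16 always read a consistent 39-sample window.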

The initiation interval of the main loop should be identified. To locate the initiation interval of the loop:

  1. Add the -v option to aiecompiler to output a verbose report of kernel compilation.
  2. Open the kernel compilation log, for example, Work/aie/<COL_ROW>/<COL_ROW>.log.
  3. In the log, search for keywords such as do-loop to find the initiation interval of the loop.
    An example result follows:
    HW do-loop #2821 in ".../", line 21: (loop #3) :
    critical cycle of length 130 : ...
    minimum length due to resources: 128
    scheduling HW do-loop #2821
    (algo 2) -> # cycles: ......
    NOTE: automatically decreased the number of used priority functions to 3 to reduce runtime
    -> # cycles: .....183 (exceeds -k 110) -> no folding: 183
    -> HW do-loop #2821 in ".../Vitis/2022.2/aietools/include/adf/stream/me/accessors.h", line 870: (loop #3) : 183 cycles
    • The initiation interval of the loop is 183 cycles. Because each loop iteration produces 32 samples, one sample is produced roughly every 183/32~=6 cycles.
    • The message (exceeds -k 110) -> no folding indicates that the scheduler did not attempt software pipelining because the loop cycle count (183) exceeds the limit of 110.
  4. To override the loop cycle limit, add a user constraint, such as --Xchess="fir_32tap_vector:backend.mist2.maxfoldk=200", to the aiecompiler command.

    The example result is then as follows:

    scheduling HW do-loop #2821
    (algo 2) -> # cycles: ......
    NOTE: automatically decreased the number of used priority functions to 3 to reduce runtime
    -> # cycles: .....183 
    (modulo) -> # cycles: ... ok (required budget ratio: 2)
    (resume algo) -> after folding: 161 (folded over 1 iterations)
    -> HW do-loop #2821 in ".../Vitis/2022.2/aietools/include/adf/stream/me/accessors.h", line 870: (loop #3) : 161 cycles

    After folding, the loop takes 161 cycles per iteration, so the kernel requires roughly 161/32~=5 cycles to produce a sample.
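The cycles-per-sample figures above can be reproduced with a few lines of host-side C++. This is a minimal sketch: loop_cycles and cycles_per_sample are hypothetical helpers, and the parser only assumes that the summary line of the log ends in ": N cycles", as in the examples shown here.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Returns N for a log line that ends in "<N> cycles", or -1 otherwise.
int loop_cycles(const std::string& line) {
    const std::string suffix = " cycles";
    if (line.size() < suffix.size() ||
        line.compare(line.size() - suffix.size(), suffix.size(), suffix) != 0)
        return -1;
    const std::size_t end = line.size() - suffix.size();
    std::size_t begin = end;
    // Walk backwards over the digits that precede " cycles".
    while (begin > 0 && std::isdigit(static_cast<unsigned char>(line[begin - 1])))
        --begin;
    if (begin == end) return -1;  // no number found
    return std::stoi(line.substr(begin, end - begin));
}

// Average cycles per output sample for a loop that produces
// `samples_per_iteration` samples every `initiation_interval` cycles.
double cycles_per_sample(int initiation_interval, int samples_per_iteration) {
    assert(samples_per_iteration > 0);
    return static_cast<double>(initiation_interval) / samples_per_iteration;
}
```

For the two logs in this section, the helpers give 183/32 = 5.71875 cycles per sample without folding and 161/32 = 5.03125 with folding, consistent with the rounded values of 6 and 5.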