We begin by opening the launch.json file under Settings in the Vitis Components pane. Select Part2a_aiesim_1 to view the AIE Simulator parameters and check the box for Enable Profile. Build, then run the simulation.
After the simulation completes, the “goodness” of the result can be checked by running:
$ julia check.jl aie
The result is “good” when maximum(abs.(err)) is less than eps(Float32).
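The acceptance criterion can also be written out in plain C++. The sketch below is only an illustration of the comparison; it is not the contents of check.jl, and the function names and the two input vectors are hypothetical. It relies on the fact that Julia's eps(Float32) and std::numeric_limits<float>::epsilon() are both 2^-23.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest absolute difference between two equally sized sample vectors.
// "actual" and "expected" are hypothetical stand-ins for the simulator
// output and the reference data.
float max_abs_error(const std::vector<float>& actual,
                    const std::vector<float>& expected) {
    assert(actual.size() == expected.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        worst = std::max(worst, std::fabs(actual[i] - expected[i]));
    }
    return worst;
}

// Mirrors the criterion maximum(abs.(err)) < eps(Float32).
bool is_good(const std::vector<float>& actual,
             const std::vector<float>& expected) {
    return max_abs_error(actual, expected) <
           std::numeric_limits<float>::epsilon();
}
```

Calling is_good with two identical vectors returns true; any element that differs by 2^-23 or more makes it return false.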
To view the profiler result, in the FLOW pane, under AIE SIMULATOR / HARDWARE, expand REPORTS (below the Debug icon) and click on Profile.
In the AIE SIMULATION pane, click on Total Function Time to show the number of cycles consumed by each function.
Note: The kernel function, SecondOrderSection<1>, was executed 32 times and ran for a total of 2,313 cycles, so each call consumed 2,313/32 = 72.28 cycles on average. The minimum function time is 72 cycles and the maximum is 81 cycles. This implies that the first call consumed nine more cycles than each of the remaining calls (81 + 31 * 72 = 2,313).
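The arithmetic in this note can be reproduced with a few lines of C++. This is a minimal sketch; the CallProfile struct and both function names are hypothetical and simply hold the four numbers read from the profiler table.

```cpp
#include <cassert>
#include <cstdint>

// Numbers read from one row of the profiler's function table.
struct CallProfile {
    std::uint64_t total_cycles;  // total cycles over all calls
    std::uint64_t calls;         // number of calls
    std::uint64_t min_cycles;    // shortest single call
    std::uint64_t max_cycles;    // longest single call
};

// Average cycles per call.
double average_cycles_per_call(const CallProfile& p) {
    assert(p.calls > 0);
    return static_cast<double>(p.total_cycles) /
           static_cast<double>(p.calls);
}

// Total expected if exactly one call takes max_cycles and every other
// call takes min_cycles.
std::uint64_t one_slow_call_total(const CallProfile& p) {
    assert(p.calls > 0);
    return p.max_cycles + (p.calls - 1) * p.min_cycles;
}
```

With CallProfile{2313, 32, 72, 81}, the average is 72.28125 cycles and one_slow_call_total returns 2313, which matches the measured total and supports the single-slow-call reading of the data.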
Another item of interest is the top-level main function, which calls my_graph.run(), which in turn calls SecondOrderSection<1>. The Total Function + Descendants Time (cycles) column shows the number of cycles consumed by a function, including all other routines called within it. For main, this includes setting up the heap and stack, initialization, actual processing, etc. For this implementation, 4,579 cycles were used to process 256 samples, or 4,579/256 = 17.89 cycles/sample. Assuming that the AI Engine runs with a 1 GHz clock, the throughput would be 1e9 cycles/sec / 17.89 cycles/sample = 55.897 Msamples/sec.
Note: The main processing occurs in SecondOrderSection<1>, which consumes 2,313 cycles. Thus, 4,579 - 2,313 = 2,266 “overhead” cycles are not used for sample processing.
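The throughput and overhead figures follow from three small calculations, sketched below in C++. The function names are hypothetical, and the 1 GHz clock is the same assumption made in the text.

```cpp
#include <cassert>
#include <cstdint>

// Cycles spent per processed sample.
double cycles_per_sample(std::uint64_t total_cycles, std::uint64_t samples) {
    assert(samples > 0);
    return static_cast<double>(total_cycles) / static_cast<double>(samples);
}

// Throughput in Msamples/sec for a given clock rate in Hz.
double throughput_msps(double clock_hz, double cyc_per_sample) {
    assert(cyc_per_sample > 0.0);
    return clock_hz / cyc_per_sample / 1.0e6;
}

// Cycles spent outside the kernel (heap/stack setup, initialization, ...).
std::uint64_t overhead_cycles(std::uint64_t total_cycles,
                              std::uint64_t kernel_cycles) {
    assert(total_cycles >= kernel_cycles);
    return total_cycles - kernel_cycles;
}
```

For example, throughput_msps(1.0e9, cycles_per_sample(4579, 256)) evaluates to roughly 55.9 Msamples/sec, and overhead_cycles(4579, 2313) returns 2266. Carrying the unrounded cycles/sample ratio gives a value that differs from 55.897 only in the second decimal place.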
Click Profile Details to view the generated assembly code. Scroll down to where the VFPMAC assembler mnemonics become visible.
In the kernel code, the following statement generates the VFPMAC (vector floating-point multiply-accumulate) mnemonic. The for loop is unrolled, and a NOP (no operation) is placed between consecutive VFPMAC instructions to account for the two-cycle floating-point accumulation latency:
acc = aie::mac(acc, coeff, xval); // acc[] += coeff[] * xval
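For readers without the AI Engine tools at hand, the effect of that statement can be modeled in portable C++. The sketch below is a scalar reference, not the AIE API: it assumes eight float lanes (one 256-bit register of 32-bit floats) and a scalar xval, as the comment acc[] += coeff[] * xval suggests. The names kLanes, Vec8, and mac_reference are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Assumption: eight 32-bit float lanes per vector.
constexpr std::size_t kLanes = 8;
using Vec8 = std::array<float, kLanes>;

// Lane-by-lane reference for acc[] += coeff[] * xval.
// Takes acc by value and returns the updated accumulator.
Vec8 mac_reference(Vec8 acc, const Vec8& coeff, float xval) {
    for (std::size_t i = 0; i < kLanes; ++i) {
        acc[i] += coeff[i] * xval;
    }
    return acc;
}
```

On the hardware all eight lanes are updated by a single VFPMAC, whereas this model loops over them. Rounding on the hardware may also differ slightly from the model, so it is best treated as a functional reference only.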
VFPMAC uses a 1024-bit y register as an input buffer (see Table 9 of AM009).
The ya register is composed of the four 256-bit wr[0:3] registers. For this example, the wr0 register is updated with the columns of the coefficient matrix using the VLDA (vector load A) mnemonic. The VLDA mnemonic transfers eight floating-point values from data memory to a vector register. In this example, there is a seven- to eight-cycle latency between the VLDA mnemonic (loading data into wr0) and the point at which the data is used for computation with VFPMAC.
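The register sizes quoted above are consistent with one another, which a few compile-time checks can confirm. This sketch assumes 32-bit single-precision floats; the constant and function names are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kFloatBits = 32;        // assumed single-precision width
constexpr std::size_t kWrRegisterBits = 256;  // one wr register
constexpr std::size_t kWrRegistersPerYa = 4;  // wr0 through wr3

// Number of floats that fit in one wr register (one VLDA transfer).
constexpr std::size_t floats_per_wr() {
    return kWrRegisterBits / kFloatBits;
}

// Width of the ya register built from the four wr registers.
constexpr std::size_t ya_bits() {
    return kWrRegisterBits * kWrRegistersPerYa;
}

static_assert(floats_per_wr() == 8, "one VLDA moves eight floats");
static_assert(ya_bits() == 1024, "four wr registers form 1024 bits");
```

Both static_assert statements pass, so a single VLDA into wr0 fills exactly one quarter of the 1024-bit input buffer.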