The following table describes the total number of floating-point operations (FLOP) for one iteration of a single `nbody()` AI Engine kernel:
Section of Code | mac | mul | add | sub | invsqr | Total FLOP |
---|---|---|---|---|---|---|
Step 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Step 2 | 96 | 0 | 0 | 0 | 0 | 192 |
Step 3 | 2,470,400 | 1,228,800 | 51,200 | 1,228,800 | 3,276,800 | 10,726,400 |
Note: Each section is clearly commented in the `nbody.cc` source file.
Note: To calculate the total, each `mac` is counted as two operations (a `mul` and an `add`).
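To make the counting rule concrete, here is a minimal standalone sketch (hypothetical variable names, not part of the design sources) that reproduces the Step 3 "Total FLOP" entry from the per-operation counts in the table above:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Per-operation counts for Step 3, taken from the table above.
    const uint64_t mac    = 2470400;  // each mac counts as 2 FLOP (mul + add)
    const uint64_t mul    = 1228800;
    const uint64_t add    = 51200;
    const uint64_t sub    = 1228800;
    const uint64_t invsqr = 3276800;

    const uint64_t total = 2 * mac + mul + add + sub + invsqr;
    std::cout << "Step 3 total FLOP: " << total << "\n";  // prints 10726400
    return 0;
}
```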
Thus, each `nbody()` kernel executes ~10.7 million FLOP per iteration. Since we have 400 AI Engine tiles (i.e., 400 `nbody()` kernels) that execute simultaneously, the total for the entire AI Engine array becomes ~4.29 billion FLOP per iteration. We calculated that each iteration of the entire design (including data movement from DDR to the AI Engine) takes an average of 0.004657468 seconds. Therefore, the effective throughput of the entire design is ~921.2 GFLOP/s.
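For reference, a minimal sketch (hypothetical variable names) that reproduces the throughput arithmetic from the numbers above:

```cpp
#include <cstdio>

int main() {
    // Per-kernel FLOP per iteration, from the table above (Step 2 + Step 3).
    const double flop_per_kernel = 192.0 + 10726400.0;
    const double num_kernels     = 400.0;        // 400 AI Engine tiles
    const double sec_per_iter    = 0.004657468;  // measured average per iteration

    const double flop_per_iter = flop_per_kernel * num_kernels;
    const double gflops        = flop_per_iter / sec_per_iter / 1e9;

    std::printf("FLOP per iteration:   %.0f\n", flop_per_iter);   // ~4.29e9
    std::printf("Effective throughput: %.1f GFLOP/s\n", gflops);  // ~921.2
    return 0;
}
```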
The theoretical peak throughput the AI Engine array alone can achieve is ~8 TFLOP/s, and we're only using about 1/10th of its potential!
Effective Throughput | Theoretical Peak Throughput |
---|---|
0.9212 TFLOP/s | 8 TFLOP/s |
This design of an N-Body Simulator on the AI Engine is a straightforward implementation without any major optimizations. To further maximize the throughput of the entire design:

* You can explore increasing the `FMAX` of the PL kernels from 200 MHz to closer to 500 MHz to reduce the latency of moving data from DDR to the AI Engine.
* The PL kernels currently implement a round-robin method of transmitting data. They could be redesigned to cache and schedule data in an optimized way to increase data bandwidth.
* You can refactor the `nbody()` kernel to reduce its reliance on the scalar processor and use only the vector processor in each AI Engine tile by approximating the inverse square root (see the sketch after this list).
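As an illustration of that last point, here is a minimal scalar sketch (not the tutorial's code) of one common way to approximate the inverse square root with a bit-level initial guess plus Newton-Raphson refinement; the refinement uses only multiplies and adds, which map naturally onto the vector processor, while the iteration count and initial-guess constant are assumptions:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Illustration only: approximate 1/sqrt(x) without a scalar sqrt call.
static float approx_inv_sqrt(float x) {
    // Initial estimate via the classic bit-level trick.
    float y = x;
    std::uint32_t i;
    std::memcpy(&i, &y, sizeof(i));
    i = 0x5f3759df - (i >> 1);
    std::memcpy(&y, &i, sizeof(y));
    // Two Newton-Raphson steps: y <- y * (1.5 - 0.5 * x * y * y),
    // expressed entirely with multiplies and adds.
    y = y * (1.5f - 0.5f * x * y * y);
    y = y * (1.5f - 0.5f * x * y * y);
    return y;
}

int main() {
    const float x = 42.0f;
    std::printf("approx: %f  exact: %f\n", approx_inv_sqrt(x), 1.0f / std::sqrt(x));
    return 0;
}
```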