Design Throughput Calculations (Effective vs. Theoretical) - 2022.2 English

Vitis Tutorials: AI Engine Development

Document ID
XD100
Release Date
2022-12-01
Version
2022.2 English

The following table describes the total number of floating-point operations (FLOP) for 1 iteration of a single nbody() AI Engine kernel:

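| Section of Code | mac       | mul       | add    | sub       | invsqr    | Total FLOP |
|-----------------|-----------|-----------|--------|-----------|-----------|------------|
| Step 1          | 0         | 0         | 0      | 0         | 0         | 0          |
| Step 2          | 96        | 0         | 0      | 0         | 0         | 192        |
| Step 3          | 2,470,400 | 1,228,800 | 51,200 | 1,228,800 | 3,276,800 | 10,726,400 |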

Note: Each section is clearly commented in the nbody.cc source file.

Note: To calculate the total, each mac is considered 2 operations (mul and add).
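The totals in the table can be re-derived with a few lines of arithmetic. The following is a minimal sketch, not code from the tutorial: the names `FLOP_PER_OP`, `op_counts`, and `section_flop` are hypothetical, and it assumes (as the table's totals imply) that every operation other than mac, including invsqr, counts as one FLOP.

```python
# Operation counts per section, copied from the table above.
# Assumption: a mac counts as 2 FLOP (mul + add); every other
# operation, including invsqr, counts as 1 FLOP.
FLOP_PER_OP = {"mac": 2, "mul": 1, "add": 1, "sub": 1, "invsqr": 1}

op_counts = {
    "Step 1": {"mac": 0, "mul": 0, "add": 0, "sub": 0, "invsqr": 0},
    "Step 2": {"mac": 96, "mul": 0, "add": 0, "sub": 0, "invsqr": 0},
    "Step 3": {
        "mac": 2_470_400,
        "mul": 1_228_800,
        "add": 51_200,
        "sub": 1_228_800,
        "invsqr": 3_276_800,
    },
}


def section_flop(counts):
    """Weight each operation count by its FLOP cost and sum the result."""
    return sum(FLOP_PER_OP[op] * n for op, n in counts.items())


totals = {name: section_flop(counts) for name, counts in op_counts.items()}
kernel_flop = sum(totals.values())  # FLOP for one iteration of one kernel

for name, total in totals.items():
    print(f"{name}: {total:,} FLOP")
print(f"One nbody() kernel, one iteration: {kernel_flop:,} FLOP")
```

Step 3 dominates the count, so the per-kernel figure is effectively the Step 3 total.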

Thus, each nbody() kernel executes ~10.7 million FLOP per iteration. Because the design uses 400 AI Engine tiles (that is, 400 nbody() kernels) executing simultaneously, the total for the entire AI Engine array is ~4.29 billion FLOP per iteration. We calculated that each iteration of the entire design (including data movement from DDR to the AI Engine) takes an average of 0.004657468 seconds. Therefore, the effective throughput of the entire design is ~921.2 GFLOP/s.

The theoretical peak throughput that the AI Engine array alone can achieve is ~8 TFLOP/s, so this design uses only about one tenth (roughly 11.5%) of its potential!

| Effective Throughput | Theoretical Peak Throughput |
|----------------------|-----------------------------|
| 0.9212 TFLOP/s       | 8 TFLOP/s                   |
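The effective-throughput figure and its ratio to the theoretical peak follow directly from the numbers quoted above. This is a minimal sketch of that arithmetic; the constant and variable names are hypothetical, and the per-kernel FLOP count is taken as the sum of the three table totals (0 + 192 + 10,726,400).

```python
# Inputs quoted in the text above.
KERNEL_FLOP = 10_726_592         # FLOP per iteration of one nbody() kernel
NUM_KERNELS = 400                # one kernel per AI Engine tile
ITERATION_SECONDS = 0.004657468  # average time per iteration, incl. DDR transfers
PEAK_FLOPS = 8e12                # ~8 TFLOP/s theoretical peak of the array

# Total work done by the whole array in one iteration.
array_flop = KERNEL_FLOP * NUM_KERNELS

# Effective throughput: work per iteration divided by time per iteration.
effective_flops = array_flop / ITERATION_SECONDS

# Fraction of the theoretical peak that the design reaches.
utilization = effective_flops / PEAK_FLOPS

print(f"Array FLOP per iteration: {array_flop:,}")
print(f"Effective throughput:     {effective_flops / 1e9:.1f} GFLOP/s")
print(f"Share of peak:            {utilization:.1%}")
```

Because the iteration time includes data movement between DDR and the AI Engine, this ratio reflects the whole system rather than the AI Engine kernels alone.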

This design of an N-Body Simulator on the AI Engine is a straightforward implementation without any major optimizations. To further increase the throughput of the entire design:

  • You can explore increasing the FMAX of the PL kernels from 200 MHz to closer to 500 MHz, which reduces the latency of moving data from DDR to the AI Engine.

  • The PL kernels currently transmit data using a round-robin method. They could instead be designed to cache and schedule transfers in an optimized way to increase data bandwidth.

  • You can refactor the nbody() kernel to reduce its reliance on the scalar processor, using only the vector processor in each AI Engine tile, by approximating the inverse square root.
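To illustrate the last point, an inverse square root can be approximated using only multiplies and subtracts, which are the kind of operations a vector unit handles well. The sketch below is a generic Newton-Raphson refinement in plain Python, not the tutorial's method and not AI Engine code: the function names are hypothetical, the initial guess uses the widely known bit-level trick with the constant 0x5F3759DF, and a real kernel would express the same arithmetic with vector operations instead.

```python
import math
import struct


def rsqrt_initial_guess(x: float) -> float:
    """Rough estimate of 1/sqrt(x) for positive x, via the float32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = 0x5F3759DF - (bits >> 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rsqrt_approx(x: float, iterations: int = 2) -> float:
    """Approximate 1/sqrt(x) using only multiplies and subtracts.

    Each Newton-Raphson step y <- y * (1.5 - 0.5 * x * y * y)
    roughly squares the relative error of the current estimate.
    """
    y = rsqrt_initial_guess(x)
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return y


for value in (0.25, 1.0, 2.0, 10.0, 12345.678):
    exact = 1.0 / math.sqrt(value)
    approx = rsqrt_approx(value)
    rel_err = abs(approx - exact) / exact
    print(f"x={value:<10} approx={approx:.8f} exact={exact:.8f} rel_err={rel_err:.2e}")
```

The number of refinement steps trades accuracy for operation count, so any such change should be validated against the accuracy the simulation requires.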