Viewing Dataflow Performance using Waveforms

Vitis Tutorials: Hardware Acceleration (XD099)

Document ID: XD099
Release Date: 2023-11-13
Version: 2023.2 English

The Dataflow viewer can, by design, show only a static view of the dataflow optimization. The graph shows the call-graph-like structure of the dataflow region (as shown below). In this graph, you can get a sense of the throughput of your design by observing the initiation interval (II) and latency of each function along a given path.

Throughput
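As a rough illustration of how the II and latency figures in the graph can be combined, the following C++ sketch estimates two first-order quantities for a path: the steady-state interval of the region, taken here as the largest II among its processes, and the path latency, taken as the sum of the process latencies. The struct, the function names, and any numbers fed into them are hypothetical; they are not produced by the tool, and the estimate ignores channel effects and stalls.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical per-process figures, as they might be read off the Dataflow viewer.
struct ProcessStats {
    int ii;       // initiation interval, in clock cycles
    int latency;  // latency, in clock cycles
};

// First-order estimate: the region cannot accept new inputs faster than
// its slowest process, so take the largest II on the path.
int region_interval(const std::vector<ProcessStats>& path) {
    int worst = 0;
    for (const ProcessStats& p : path) {
        assert(p.ii > 0);
        worst = std::max(worst, p.ii);
    }
    return worst;
}

// First-order estimate: latency of one path through the graph,
// taken as the sum of the latencies of the processes on that path.
int path_latency(const std::vector<ProcessStats>& path) {
    int total = 0;
    for (const ProcessStats& p : path) {
        assert(p.latency >= 0);
        total += p.latency;
    }
    return total;
}
```

For a hypothetical three-process path with IIs of 100, 80, and 100 cycles and latencies of 120, 90, and 110 cycles, `region_interval` returns 100 and `path_latency` returns 320. These are static estimates only; the waveform view described next shows what actually happens at run time.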

From this static view, however, it is difficult to see how the functions inside the dataflow region are executed in parallel and how their executions overlap. To visualize this dynamic timeline, you can use the AMD Vivado™ XSIM simulator and waveform viewer.

To launch the simulator waveform viewer you need to re-run RTL co-simulation with a few new settings:

From the menu, select the Solutions > Run C/RTL Co-Simulation command.

The Co-simulation dialog box displays as shown in the following figure.
Make the following selections:
1. Ensure that the Vivado XSIM simulator is chosen.
2. Select all for the Dump Trace option to trace all ports and signals. Note: Because this is a small design, all signals can be dumped and traced. For a large design, this can increase simulation run time and create a large waveform database.
3. Enable the Wave Debug option to interactively launch the XSIM waveform viewer during simulation.
4. Enable the Channel (PIPO/FIFO) Profiling checkbox.
5. Click OK.

At this point, the Vitis HLS GUI reinvokes RTL co-simulation. This time, when simulation completes, it opens the Vivado XSIM waveform viewer (because of the Wave Debug option) so that you can inspect the waveforms generated during simulation (by the Dump Trace option). You will see something like the following figure:

RTL CoSim Waveform Summary

To make it easier to see how the dataflow optimization executes the functions inside the dataflow region in parallel, the waveforms are analyzed to track process starts and stops, and a summary of this activity is presented in the waveform viewer. In the above diagram, note the following details:

The top function in the design is the diamond function. In the waveform viewer, this is shown as AESL_inst_diamond.
Note that the first item in the Name column is the HLS Process Summary. This section shows the activity traces (using cyan bars) of the dataflow region inside the diamond function. It is in fact a replica of the activity traces found under the AESL_inst_diamond_activity item; the HLS Process Summary simply brings the function activity waveforms together in one section of the waveform viewer. The first line shows a summary of the number of active iterations of the diamond function that are executing in parallel at each time point (1, 2, 3, 2, 1).
Expand this level to show the individual active invocations of the functions (funcA, funcB, funcC, and funcD). In the provided testbench, the top-level function diamond is called three times, so the activity traces for each function show when each of the three calls to that function executes. The order in which the functions execute inside the body of diamond is also visible: funcA starts first, followed by the parallel execution of funcB and funcC, and once these functions are done, funcD starts executing. Small gaps in execution, indicated by the yellow ellipses, can be situations where execution is stalled and are worth a closer look. This view shows how the functions inside the dataflow region are executed in a pipelined manner, except that it is a dynamic pipeline instead of a static pipeline.
Expand the AESL_inst_diamond_activity level to see a much more detailed view and to see how the three calls to the top-level function are executed (#0, #1, #2). These are shown with green bars. The iteration count starts at zero and ends at two for this particular testbench. You can compare the time taken for each iteration to complete, and you can also see how the iterations overlap in time. So even the multiple calls to the top-level function are dynamically pipelined.
You can investigate the activity traces for each of the sub-functions to see when each invocation of the sub-function starts and stops (shown by the green #0, #1, #2 bars, while the cyan (1, 1, 1) bars just show the number of active iterations at the given time point).
Additional details, such as the StallNoContinue signal, are shown to highlight any back pressure that can stall the function executions. In the above diagram, back pressure from funcD can be seen for funcB and funcC (highlighted on the various StallNoContinue waveforms by the red ellipses).
The RTL level signals are also available for inspection when you expand the RTL Signals section.
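The execution order described above (funcA first, then funcB and funcC in parallel, then funcD) corresponds to a dataflow region of roughly the following shape. This is a minimal sketch and not the tutorial's actual source: the element type `data_t`, the length `N`, the arithmetic inside each function, and the helper `run_testbench` are all assumptions made for illustration. Only the function names and the three calls to diamond come from the text.

```cpp
#include <cassert>

typedef int data_t;   // assumed element type
const int N = 100;    // assumed vector length

// Producer: feeds both branches of the diamond.
void funcA(const data_t in[N], data_t out1[N], data_t out2[N]) {
    for (int i = 0; i < N; ++i) {
        data_t t = in[i] * 3;   // placeholder arithmetic
        out1[i] = t;
        out2[i] = t;
    }
}

// Left branch.
void funcB(const data_t in[N], data_t out[N]) {
    for (int i = 0; i < N; ++i) out[i] = in[i] + 25;   // placeholder arithmetic
}

// Right branch.
void funcC(const data_t in[N], data_t out[N]) {
    for (int i = 0; i < N; ++i) out[i] = in[i] * 2;    // placeholder arithmetic
}

// Consumer: joins the two branches.
void funcD(const data_t in1[N], const data_t in2[N], data_t out[N]) {
    for (int i = 0; i < N; ++i) out[i] = in1[i] + in2[i] * 2;   // placeholder arithmetic
}

// Top-level function. The intermediate arrays c1..c4 are the channels
// between the processes of the dataflow region.
void diamond(const data_t vecIn[N], data_t vecOut[N]) {
    data_t c1[N], c2[N], c3[N], c4[N];
#pragma HLS dataflow
    funcA(vecIn, c1, c2);
    funcB(c1, c3);
    funcC(c2, c4);
    funcD(c3, c4, vecOut);
}

// Hypothetical testbench helper: calls diamond three times, as the text
// describes, and returns the number of mismatching output elements.
int run_testbench() {
    int mismatches = 0;
    for (int call = 0; call < 3; ++call) {
        data_t in[N], out[N];
        for (int i = 0; i < N; ++i) in[i] = i + call;
        diamond(in, out);
        for (int i = 0; i < N; ++i) {
            data_t a = in[i] * 3;
            data_t expected = (a + 25) + (a * 2) * 2;
            if (out[i] != expected) ++mismatches;
        }
    }
    return mismatches;
}
```

A standard C++ compiler ignores the HLS pragma and runs the four functions sequentially, so calling `run_testbench()` from a `main` function only checks functional behavior. The overlap between calls #0, #1, and #2 appears only in the RTL co-simulation waveforms.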
Note that in this default form of HLS dataflow (that is, with PIPO channels only), successive communicating tasks in a kernel run do not overlap: funcB and funcC can start only once their buffer from funcA (ping or pong) is released. funcB and funcC could possibly start earlier if FIFOs were used as an alternative channel to ping-pong buffers, provided that the data are consumed in the same order in which they are produced. PIPOs are generally used when data is written into the buffer in random order; the entire buffer is therefore locked until all processing has been completed, and only then is access to the buffer released. FIFOs are generally used in streaming applications, where data is consumed in the order in which it is created. This allows the consumer to start processing as soon as there is data in the FIFO.
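The difference in consumer start time can be illustrated with a toy cycle-stepped model. This is a sketch under strong assumptions, not a description of how the tool schedules hardware: the producer writes one element per cycle, the consumer reads one element per cycle, a PIPO buffer is released only after all n elements are written, and a FIFO element is readable the cycle after it is written. The names `Channel` and `finish_cycle` are hypothetical.

```cpp
#include <cassert>
#include <queue>

enum class Channel { PIPO, FIFO };

// Returns the cycle count at which the consumer has read all n elements,
// with the producer starting at cycle 0.
int finish_cycle(Channel ch, int n) {
    assert(n > 0);
    std::queue<int> buffer;   // elements written but not yet read
    int produced = 0;
    int consumed = 0;
    int cycle = 0;
    while (consumed < n) {
        // The consumer sees only data that was available at the start of the cycle.
        // FIFO: any buffered element can be read.
        // PIPO: nothing can be read until the producer has written the whole buffer.
        bool can_read = (ch == Channel::FIFO) ? !buffer.empty() : (produced == n);
        if (can_read) {
            buffer.pop();
            ++consumed;
        }
        // The producer writes one element per cycle until it is done.
        if (produced < n) {
            buffer.push(produced);
            ++produced;
        }
        ++cycle;
    }
    return cycle;
}
```

For n = 4, the model returns 8 for `Channel::PIPO` (the consumer starts only after all four writes) and 5 for `Channel::FIFO` (the consumer trails the producer by one cycle). Under these assumptions, that is the overlap the FIFO channel makes possible when data is consumed in production order.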