Task Parallelism - 2021.2 English

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)

Document ID
UG1393
ft:locale
English (United States)
Release Date
2021-12-15
Version
2021.2 English

Task parallelism allows you to take advantage of dataflow parallelism. In contrast to loop parallelism, when task parallelism is deployed, full execution units (tasks) are allowed to operate in parallel taking advantage of extra buffering introduced between the tasks.

See the following example:

void run (ap_uint<16> in[1024],
	  ap_uint<16> out[1024]
	  ) {
  ap_uint<16> tmp[128];
  for(int i = 0; i<8; i++) {
    processA(&(in[i*128]), tmp);
    processB(tmp, &(out[i*128]));
  }
}

When this code is executed, the function processA and processB are executed sequentially 128 times in a row. Given the combined latency for processA and processB, the loop is set to 278 and the total latency can be estimated as:

Figure 1. Performance Estimates

The extra cycle is due to loop setup and can be observed in the Schedule Viewer.

For C/C++ code, task parallelism is performed by adding the DATAFLOW pragma into the for-loop:

#pragma HLS DATAFLOW

For OpenCL API code, add the attribute before the for-loop:

__attribute__ ((xcl_dataflow))

Refer to Dataflow Optimization, HLS Pragmas, and OpenCL Attributes for more details on this topic.

As illustrated by the estimates in the HLS report, applying the transformation will considerably improve the overall performance effectively using a double (ping-pong) buffer scheme between the tasks:

Figure 2. Performances Estimates

The overall latency of the design has almost halved in this case due to concurrent execution of the different tasks of the different iterations. Given the 139 cycles per processing function and the full overlap of the 128 iterations, this allows the total latency to be:

(1x only processA + 127x both processes + 1x only processB) * 139 cycles = 17931 cycles 

Using task parallelism is a powerful method to improve performance when it comes to implementation. However, the effectiveness of applying the DATAFLOW pragma to a specific and arbitrary piece of code might vary vastly. It is often necessary to look at the execution pattern of the individual tasks to understand the final implementation of the DATAFLOW pragma. Finally, the Vitis core development kit provides the Detailed Kernel Trace, which illustrates concurrent execution.

Figure 3. Detailed Kernel Trace

For this Detailed Kernel Trace, the tool displays the start of the dataflow loop, as shown in the previous figure. It illustrates how processA is starting up right away with the beginning of the loop, while processB waits until the completion of the processA before it can start up its first iteration. However, while processB completes the first iteration of the loop, processA begins operating on the second iteration, etc.

A more abstract representation of this information is presented in Timeline Trace for the host and device activity.