Lab 2: Kernel and Host Code Synchronization

Vitis Tutorials: Hardware Acceleration (XD099)

Document ID: XD099
Release Date: 2023-11-13
Version: 2023.2 English

For this step, look at the source code in src/sync_host.cpp, and examine the execution loop (line 55). This is the same code used in the previous section of this tutorial.

```
  // -- Execution -----------------------------------------------------------

  for(unsigned int i=0; i < numBuffers; i++) {
    tasks[i].run(api);
  }
  clFinish(api.getQueue());
```

In this example, the code implements a free-running pipeline. No synchronization is performed until the end, when clFinish is called on the command queue. While this creates an effective pipeline, the implementation has two issues: one related to buffer allocation and one related to execution order. The buffer issue arises because a buffer can only be released once it is no longer needed, and establishing that requires a synchronization point.

For example, there could be issues if the numBuffers variable is increased to a large number, as would be the case when processing a video stream. Buffer allocation and memory usage then become problematic because the host memory is pre-allocated and shared with the FPGA. In such a case, this example will probably run out of memory.

Similarly, because each call to execute the accelerator is independent and unsynchronized (out-of-order queue), the order of execution between the different invocations is likely not aligned with the enqueue order. As a result, if the host code is waiting for a specific block to finish, this might not occur until much later than expected. This effectively disables any host code parallelism while the accelerator is operating.
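The memory pressure can be estimated with simple arithmetic. The following is a minimal sketch, assuming each task keeps one input and one output buffer of equal size alive until it is released; the function name and the sizes used in the comments are hypothetical and are not taken from sync_host.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes of host memory held by `liveTasks` tasks,
// ASSUMING one input and one output buffer of `bytesPerBuffer` each.
std::uint64_t liveBufferBytes(std::uint64_t liveTasks,
                              std::uint64_t bytesPerBuffer) {
  assert(bytesPerBuffer > 0);
  return liveTasks * 2 * bytesPerBuffer;
}

// Free-running pipeline: every task stays live until the final clFinish,
//   so liveTasks == numBuffers, e.g. liveBufferBytes(1000, 1 << 20).
// With a synchronization point, only the tasks in flight stay live,
//   e.g. liveBufferBytes(3, 1 << 20).
```

Under these assumptions, 1000 tasks with 1 MiB buffers hold about 2 GB of host memory, whereas a window of three tasks in flight holds 6 MiB regardless of the stream length.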

To alleviate these issues, the OpenCL framework provides two methods of synchronization.

clFinish call
clWaitForEvents call

Open the src/sync_host.cpp file in an editor, and look at the Execution region. To illustrate the behavior, make the following modifications to the execution loop.

```
// -- Execution -----------------------------------------------------------

int count = 0;
for(unsigned int i=0; i < numBuffers; i++) {
  count++;
  tasks[i].run(api);
  if(count == 3) {
    count = 0;
    clFinish(api.getQueue());
  }
}
clFinish(api.getQueue());
```

Compile and execute the sync_host.cpp code.

```
make run TARGET=hw DEVICE=xilinx_u200_gen3x16_xdma_2_202110_1 LAB=sync
```

After the run completes, open the run summary in the Vitis analyzer, then click Application Timeline in the left-side panel.
```
vitis_analyzer sync/xrt.run_summary
```
If you zoom in on the Application Timeline, an image is displayed similar to the following figure.

In the figure, the key elements are the red box named clFinish, and the large gap in kernel enqueues that appears after every three invocations of the accelerator.

The call to clFinish creates a synchronization point on the entire OpenCL command queue. This implies that all commands enqueued onto the given queue must complete before clFinish returns control to the host program. As a result, all activities, including the buffer communication, need to complete before the next set of three accelerator invocations can start. This is effectively a barrier synchronization.

While this enables a synchronization point where buffers can be released, and all processes are guaranteed to have completed, it also prevents overlap at the synchronization point.

Look at an alternative synchronization scheme, where the synchronization is performed based on the completion of a previous execution of a call to the accelerator. Edit the sync_host.cpp file to change the execution loop as follows.

```
  // -- Execution -----------------------------------------------------------

  for(unsigned int i=0; i < numBuffers; i++) {
    if(i < 3) {
      tasks[i].run(api);
    } else {
      tasks[i].run(api, tasks[i-3].getDoneEv());
    }
  }
  clFinish(api.getQueue());
```

Recompile the application, rerun the program, and review the run_summary in the Vitis analyzer:
```
make run TARGET=hw DEVICE=xilinx_u200_gen3x16_xdma_2_202110_1 LAB=sync
vitis_analyzer sync/xrt.run_summary
```
If you zoom in on the Application Timeline, an image is displayed similar to the following figure.

In the later part of the timeline, five executions of pass run without any unnecessary gaps. Even more telling are the data transfers at the point of the marker. At this point, three packages have been sent to the accelerator for processing, and one has already been received back. Because the next Write/Execute/Read sequence is scheduled on the completion of the first accelerator invocation, another write operation occurs before the third pass has even completed. This clearly identifies an overlapping execution.

In this case, you synchronized each full accelerator execution on the completion of the execution scheduled three invocations earlier. This is done with the following event synchronization in the run method of the task class.
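The difference between the two schemes can be illustrated with a toy timing model. This is only a sketch under strong assumptions: each task is a write, an execute, and a read stage; each stage handles one task at a time in enqueue order; enqueueing costs no host time; and every stage takes one time unit by default. The names makespan and Sync are hypothetical, and the numbers are not measurements from the Application Timeline.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class Sync { BarrierEvery3, EventOnTaskMinus3 };

// Toy model: returns the time at which the last read-back finishes.
int makespan(int numTasks, Sync mode, int w = 1, int e = 1, int r = 1) {
  assert(numTasks > 0);
  std::vector<int> done(numTasks, 0);   // completion time of each task's read
  int wFree = 0, eFree = 0, rFree = 0;  // time at which each stage is idle
  for (int i = 0; i < numTasks; ++i) {
    int ready = 0;  // earliest time the write of task i may start
    if (mode == Sync::BarrierEvery3 && i >= 3 && i % 3 == 0) {
      ready = done[i - 1];  // clFinish-like barrier: all earlier work is done
    } else if (mode == Sync::EventOnTaskMinus3 && i >= 3) {
      ready = done[i - 3];  // wait only on the task three invocations back
    }
    wFree = std::max(wFree, ready) + w;  // write (host to device)
    eFree = std::max(eFree, wFree) + e;  // execute
    rFree = std::max(rFree, eFree) + r;  // read (device to host)
    done[i] = rFree;
  }
  return done.back();
}
```

In this model, makespan(6, Sync::BarrierEvery3) is 10 time units and makespan(6, Sync::EventOnTaskMinus3) is 8, because the barrier drains the whole pipeline before the fourth write can start while the event dependency lets that write begin as soon as the first task has been read back.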
```
    if(prevEvent != nullptr) {
      clEnqueueMigrateMemObjects(api.getQueue(), 1, &m_inBuffer[0],
                                 0, 1, prevEvent, &m_inEv);
    } else {
      clEnqueueMigrateMemObjects(api.getQueue(), 1, &m_inBuffer[0],
                                 0, 0, nullptr, &m_inEv);
    }
```
While this is the common synchronization scheme between enqueued objects in OpenCL, you can alternatively synchronize the host code by calling the following API.
```
  clWaitForEvents(1,prevEvent);
```
This allows the host code to perform additional computation while the accelerator is operating on earlier enqueued tasks. This is not explored further here, but is left to you as an additional exercise.

NOTE: Because this synchronization scheme allows the host code to resume after the completion of an event, it is possible to implement a buffer management scheme on top of it. This avoids running out of memory in long-running applications.
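One possible shape for such a buffer management scheme is a small ring of reusable buffers. The sketch below does not use OpenCL: std::future stands in for the task's completion event, future.get() plays the role of clWaitForEvents, and fakeAccelerator is a placeholder for the kernel. All names and the window size of three are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

constexpr std::size_t kSlots = 3;  // hypothetical window, mirrors the i-3 rule

// Placeholder for one accelerator invocation: doubles every element in place.
void fakeAccelerator(std::vector<int>& buf) {
  for (int& v : buf) v *= 2;
}

// Processes numFrames frames with only kSlots buffers; returns the sum of
// all processed values so the result can be checked.
long processStream(std::size_t numFrames, std::size_t frameSize) {
  std::array<std::vector<int>, kSlots> buffers;
  std::array<std::future<void>, kSlots> done;
  long total = 0;

  // Wait for the task that last used slot s, then consume its result.
  auto drain = [&](std::size_t s) {
    if (done[s].valid()) {
      done[s].get();  // stand-in for clWaitForEvents on that task's event
      for (int v : buffers[s]) total += v;
    }
  };

  for (std::size_t i = 0; i < numFrames; ++i) {
    const std::size_t s = i % kSlots;
    drain(s);                                           // slot is free again
    buffers[s].assign(frameSize, static_cast<int>(i));  // reuse the buffer
    done[s] = std::async(std::launch::async, fakeAccelerator,
                         std::ref(buffers[s]));
  }
  for (std::size_t s = 0; s < kSlots; ++s) drain(s);  // final clFinish analogue
  return total;
}
```

Because a slot is only refilled after the task that last used it has completed, memory use stays bounded by kSlots buffers no matter how many frames are processed, and the host is free to do other work between the wait and the next enqueue.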