Asynchronous Host Control of Accelerator - 2022.1 English

Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)


The VSC mode allows compilation of accelerators with CUs that contain user-defined hardware pipelines, as described in Building Hardware. Such a pipeline is composed of PEs that connect to each other through AXI4-Stream. The PEs can also connect to platform ports that are AXI4 connections, such as global memory, or to I/O interfaces such as an Ethernet QSFP port. The platform provides IP that translates such interfaces into AXI4-Stream ports, which can be connected to PEs in the user-defined pipeline.

Using VSC, such hardware pipelines can be configured at run time, from an application running on the host CPU, to change their processing behavior dynamically. The following describes how such an accelerator can be created. An example system composition is shown in the following figure.

Figure 1. Accelerator Implementation

The Eth_Rx and Eth_Tx modules are typically platform IP that translate between Ethernet packets and AXI4-Stream words. These can also be custom IP with user-defined AXI4 interfaces.

The rest of the accelerator pipeline, shown in the white box, is created with VSC using AXI4-Stream connections. The PEs in the pipeline implement user-defined functionality, such as packet processing in an internet protocol packet filter. In this example, the pipeline is built from two tasks, the PEs called mod and smp. The system also contains a control PE that has AXI4-Stream connections to these pipeline PEs. Example accelerator code is provided below, with the .hpp file first, followed by the .cpp file.

// -- file: ETH.hpp --
#include "vpp_acc.hpp"
class ETH : public VPP_ACC<ETH,1>
{
    // Map the host-visible buffer arguments to global memory banks
    SYS_PORT(dIn,  MEM_BANK0);
    SYS_PORT(dOut, MEM_BANK1);
public:
    // Accelerator entry point called from the host
    static void compute(int cmd, Pkt* dIn, Pkt* dOut);

    // PE declarations
    static void control(...
    static void fsk_mod(...
    static void fsk_smp(...
    static void eth_tx(...
    static void eth_rx(...
};

// -- file: ETH.cpp --
void ETH::compute(int cmd, Pkt* dIn, Pkt* dOut)
{
    // Command and response streams between the control PE and the other PEs
    static IntStream dropS("drop");
    static IntStream addS("add");
    static IntStream reqS("req");
    static PktStream smpS("smp");
    static BitStream getS("get");
    static IntStream sntS("snt");
    // Packet data path: eth_rx -> fsk_mod -> fsk_smp -> eth_tx
    static PktStream Ax("Ax", /*post_check=*/false);
    static PktStream Bx("Bx", /*post_check=*/false);
    static PktStream Cx("Cx", /*post_check=*/false);
    control(cmd, dIn, dOut,
            dropS, addS, reqS, smpS, getS, sntS);
    eth_rx (Ax);
    fsk_mod(Ax, Bx, dropS, addS);
    fsk_smp(Bx, Cx, reqS, smpS);
    eth_tx (Cx, getS, sntS);
}

In this example, five PEs are defined, including eth_tx and eth_rx, which mock the behavior of the platform IP in transmitting and receiving words on the AXI4-Stream. The compute() scope implements the accelerator pipeline using AXI4-Stream connections between these PEs. The control PE can send command words on these streams, and the task PEs (fsk_mod and fsk_smp) monitor their command streams and react by changing behavior. The fsk_smp PE reacts by sampling a requested number of packets back to the control PE. The fsk_mod PE reacts by adding a value to the packet data or by dropping packets that are being passed from Eth_Rx to Eth_Tx.
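The reaction of a task PE to command words can be sketched in plain C++, outside of VSC. The following is a minimal software model, not the example's fsk_mod implementation: std::queue stands in for the AXI4-Stream connections, the Pkt layout, ModState, try_read, and mod_step are hypothetical names, and the meaning assumed for a drop word (discard the next N packets) is an illustration only.

```cpp
#include <cassert>
#include <queue>

// Hypothetical packet layout; the real Pkt type comes from the example's headers.
struct Pkt { int nr; int dt; };

// std::queue stands in for an AXI4-Stream connection in this software model.
template <typename T> using Fifo = std::queue<T>;

// State that a free-running PE would keep between packets (assumed).
struct ModState { int drop_left = 0; int add_value = 0; };

// Non-blocking read: pops the front word into `out` if one is queued.
template <typename T>
bool try_read(Fifo<T>& s, T& out) {
    if (s.empty()) return false;
    out = s.front();
    s.pop();
    return true;
}

// One iteration of a mod-like stage: poll the command streams first,
// then handle at most one packet from the data stream.
void mod_step(Fifo<Pkt>& in, Fifo<Pkt>& out,
              Fifo<int>& dropS, Fifo<int>& addS, ModState& st) {
    int word = 0;
    if (try_read(dropS, word)) {
        assert(word >= 0);        // assumed meaning: drop the next `word` packets
        st.drop_left = word;
    }
    if (try_read(addS, word)) st.add_value = word;

    Pkt p{};
    if (!try_read(in, p)) return; // no packet available in this iteration
    if (st.drop_left > 0) {       // discard the packet while a drop is pending
        --st.drop_left;
        return;
    }
    p.dt += st.add_value;         // otherwise modify the payload and forward it
    out.push(p);
}
```

Calling mod_step() repeatedly models the never-ending loop of the hardware PE: a command word changes the stored state once, and every later packet is processed under the new state until another command arrives.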

The pipeline PEs, fsk_mod and fsk_smp, are FREE_RUNNING as described in Guidance Macros because they are never-ending PEs driven to operation by the words in their input streams.

The control PE talks to the host CPU through two SYS_PORT connections for interface argument data pointers for input (dIn) and output (dOut), as well as the scalar command argument (cmd). The control PE is not free-running and reacts to compute() calls from the host CPU. This system composition is entirely user-defined including the nature of the commands and corresponding PE functionality.
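Because the command set is user-defined, the request side of a control PE amounts to a dispatch from the scalar command onto the matching command stream. The sketch below models that step in plain C++ under stated assumptions: only cmd_sample appears in the example's host code, while cmd_drop, cmd_add, the Pkt layout, the use of dIn[0].dt as the configuration word for every command, and the dispatch() function itself are hypothetical.

```cpp
#include <cassert>
#include <queue>

// Hypothetical packet layout; only the nr and dt fields appear in the host code.
struct Pkt { int nr; int dt; };

// cmd_sample is used by the example's host code; cmd_drop and cmd_add are assumed.
enum Cmd { cmd_drop, cmd_add, cmd_sample };

// Forward the host's configuration word to the stream of the PE that the
// command addresses. Returns false when the command is not recognized.
bool dispatch(int cmd, const Pkt* dIn,
              std::queue<int>& dropS, std::queue<int>& addS, std::queue<int>& reqS) {
    assert(dIn != nullptr);
    const int arg = dIn[0].dt;   // assumed: the first input word carries the argument
    switch (cmd) {
        case cmd_drop:   dropS.push(arg); return true;
        case cmd_add:    addS.push(arg);  return true;
        case cmd_sample: reqS.push(arg);  return true;
        default:         return false;    // unknown command: send nothing
    }
}
```

In this model each compute() call produces at most one command word, which matches the description of a control PE that is idle between host calls while the pipeline PEs keep running.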

A snippet of the host code is shown here; the entire example is available on GitHub.

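// -- file: host.cpp --
#include "vpp_acc_core.hpp" // required
#include "ETH.hpp"
int config_sample(int sz)
{
    printf("main: sample %d\n", sz);

    // Output buffer for sz sampled packets, input buffer for one configuration word
    Pkt* sample  = (Pkt*)ETH::alloc_buf(sz * sizeof(Pkt), vpp::output);
    Pkt* config  = (Pkt*)ETH::alloc_buf(sizeof(Pkt), vpp::input);
    config[0].dt = sz;

    // Asynchronous call: returns immediately after triggering the accelerator
    auto fut = ETH::compute_async(cmd_sample, config, sample);
    // Block until the results are available (assumes fut behaves like a std::future)
    fut.get();
    print_sample(sample, sz);
    int pkt_nr = sample[0].nr;
    return pkt_nr;
}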

The job commands issued by the VSC host code, specifically through the compute_async() API, enable the control PE to translate each command and in turn pass configuration words to the pipeline PEs through the command streams. This snippet shows a user-defined API that issues a packet sampling command. A sample buffer of the required size sz is allocated at run time, and the compute_async() call triggers the control PE to capture sz packets and return the words to the host. As its name denotes, compute_async() is an asynchronous call that triggers the accelerator and returns without waiting. Waiting on the fut object returned by compute_async() blocks the host code until the results are available. Once the sample words have been returned and processed by the host, the corresponding buffers can be freed.
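The launch-then-wait pattern described above can be tried without an accelerator by using the C++ standard library. This is a minimal sketch, assuming the object returned by compute_async() can be waited on like a std::future: mock_compute_async() and sample_packets() are hypothetical names, the Pkt layout is assumed, and a background thread stands in for the control PE that fills the sample buffer.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Hypothetical packet layout; only the nr and dt fields appear in the host code.
struct Pkt { int nr; int dt; };

// Hypothetical stand-in for an asynchronous accelerator job: a background
// thread fills `sample` with sz packets, as the control PE would.
std::future<void> mock_compute_async(int sz, Pkt* sample) {
    assert(sz >= 0 && sample != nullptr);
    return std::async(std::launch::async, [sz, sample] {
        for (int i = 0; i < sz; ++i) sample[i] = Pkt{100 + i, i};
    });
}

// Host-side pattern: launch the job, optionally do unrelated work, then
// wait on the future before reading the output buffer.
int sample_packets(int sz, std::vector<Pkt>& sample) {
    sample.assign(sz > 0 ? sz : 1, Pkt{-1, -1});
    std::future<void> fut = mock_compute_async(sz, sample.data());
    // ... the host is free to do other work here ...
    fut.get();                // blocks until the job has written the buffer
    return sample[0].nr;      // safe to read only after get() has returned
}
```

Reading the buffer before get() returns would race with the job that is still writing it, which is why the wait sits between the asynchronous call and the first access to the results.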

Because host control in this case is not a continuous pipeline of compute jobs, but only an occasional job that is not timing critical, the send_while/receive_all threads are not used to manage it. Instead, the synchronization is managed by the application using the compute_async() API defined in vpp_acc_core.hpp.

Important: This vpp_acc_core.hpp header file needs to be included only in the host code file, and before any vpp_acc.hpp is included.

With VSC, such hardware pipelines can be composed and controlled asynchronously from a host CPU. One application for such an accelerator is packet processing on a NIC. For example, the X3 Hybrid platforms provide Ethernet transmission and reception IPs that convert Ethernet packets arriving at the QSFP ports into AXI4-Stream through a MAC interface. Furthermore, a NIC interface allows the accelerator to provide data to a connected host CPU over PCIe, or Host-Memory access can be used for direct access to host CPU memory over PCIe. Using VSC, the packet processing pipeline of the accelerator can be composed on the PL and controlled asynchronously by a CPU over PCIe.