Build for Hardware Emulation and Hardware Flow - 2022.2 English

Vitis Tutorials: AI Engine Development

Document ID: XD100
Release Date: 2022-12-01
Version: 2022.2 English

In the previous step, you generated the AI Engine design graph (libadf.a) using the AI Engine compiler. Note that the graph has instantiated a PLIO (adf::output_plio in aie/graph.h), which will be connected to the PL side.

```
out = adf::output_plio::create("Dataout", adf::plio_32_bits, "data/output.txt");
```

Here, plio_32_bits indicates that the interface to the PL side is 32 bits wide. On the PL side, an HLS kernel, s2mm, will be instantiated. It receives stream data from the AI Engine graph and writes the data to global memory, where it is read by the host code running on the PS.

Note: In this section, the make commands target hw_emu mode by default, and hw_emu mode is used in the examples. To target hw mode instead, add TARGET=hw to the make commands; in the corresponding detailed commands, change the -t hw_emu option to -t hw.

To compile the HLS PL kernel, run the following make command:

```
make kernels
```

The corresponding v++ compiler command is as follows:

```
v++ -c --platform xilinx_vck190_es1_base_202220_1 -k s2mm s2mm.cpp -o s2mm.xo --verbose --save-temps
```

Switches for the v++ compiler are as follows:

  • -c: compiles the kernel source into Xilinx object (.xo) files.

  • --platform: specifies the name of a supported platform as specified by the PLATFORM_REPO_PATHS environment variable, or the full path to the platform .xpfm file.

  • -k: specifies the kernel name.

The next step is to link the AI Engine graph and the PL kernels against the target platform to generate the device binary. The make command for this is as follows:

```
make xclbin
```

This step can take 10 minutes or more to complete. The corresponding v++ linker command is as follows:

```
v++ -g -l --platform xilinx_vck190_es1_base_202220_1 pl_kernels/s2mm.xo libadf.a -t hw_emu --save-temps --verbose --config system.cfg -o vck190_aie_base_graph_hw_emu.xclbin
```

Switches for the v++ linker are as follows:

  • -l: links the PL kernels, the AI Engine graph, and the platform into an FPGA binary file (.xclbin).

  • -t: specifies the link target: hw for a hardware run, hw_emu for hardware emulation.

  • --config: specifies the configuration file. The configuration file (system.cfg) specifies the stream connections between the graph and the PL kernels, as well as other optional settings.
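
For reference, the [connectivity] section of a system.cfg for a design like this one might look as follows. This is a sketch, not the tutorial's actual file: the instance name s2mm_1 and the stream port name s are assumptions based on the "Dataout" PLIO and a single s2mm stream input:

```
[connectivity]
nk=s2mm:1:s2mm_1
stream_connect=ai_engine_0.Dataout:s2mm_1.s
```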

After generating the hardware platform, compile the host code (sw/host.cpp) using the following make command:

```
make host
```

The detailed commands for compiling the host code are as follows:

```
${CXX} -std=c++14 -I$XILINX_HLS/include/ -I$PLATFORM_REPO_PATHS/sw/versal/xilinx-versal-common-v2022.2/sysroots/aarch64-xilinx-linux//usr/include/xrt/ -O0 -g -Wall -c -fmessage-length=0 --sysroot=$PLATFORM_REPO_PATHS/sw/versal/xilinx-versal-common-v2022.2/sysroots/aarch64-xilinx-linux/ -I$XILINX_VITIS/aietools/include -I../ -I../aie -o aie_control_xrt.o aie_control_xrt.cpp
${CXX} -std=c++14 -I$XILINX_HLS/include/ -I$PLATFORM_REPO_PATHS/sw/versal/xilinx-versal-common-v2022.2/sysroots/aarch64-xilinx-linux//usr/include/xrt/ -O0 -g -Wall -c -fmessage-length=0 --sysroot=$PLATFORM_REPO_PATHS/sw/versal/xilinx-versal-common-v2022.2/sysroots/aarch64-xilinx-linux/ -I$XILINX_VITIS/aietools/include -I../ -I../aie -o host.o host.cpp
${CXX} -o ../host.exe aie_control_xrt.o host.o -ladf_api_xrt -lgcc -lc -lxrt_coreutil -lxilinxopencl -lpthread -lrt -ldl -lcrypt -lstdc++ -L$PLATFORM_REPO_PATHS/sw/versal/xilinx-versal-common-v2022.2/sysroots/aarch64-xilinx-linux//usr/lib/ --sysroot=$PLATFORM_REPO_PATHS/sw/versal/xilinx-versal-common-v2022.2/sysroots/aarch64-xilinx-linux/ -L$XILINX_VITIS/aietools/lib/aarch64.o
```

Here, the cross compiler pointed to by CXX is used to compile the Linux host code. The file aie_control_xrt.cpp is copied from the directory Work/ps/c_rts.
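
The commands above assume a cross-compilation environment roughly like the following; the compiler path is an assumption and depends on your Vitis installation, so adjust it to your setup:

```
source $XILINX_VITIS/settings64.sh
export CXX=$XILINX_VITIS/gnu/aarch64/lin/aarch64-linux/bin/aarch64-linux-gnu-g++
export PLATFORM_REPO_PATHS=<path containing the platform and common images>
```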

The host code for HW emulation and HW (sw/host.cpp) uses OpenCL APIs to control the execution of the PL kernels, and ADF APIs (*init(), update(), run(), wait()*) to control the AI Engine graph. The execution model of the PL kernel is composed of the following steps:

  1. Get the OpenCL platform and device:

    a. Prepare OpenCL context and command queue.

    b. Program xclbin.

    c. Get kernel objects from the program.

  2. Prepare the device buffers for kernels. Transfer data from the host memory to the global memory in the device.

  3. The host program sets up the kernel with its input parameters and triggers the execution of the kernel on the FPGA.

  4. Wait for kernel completion.

  5. Transfer data from the device global memory to host memory.

  6. Host code performs post-processing on the host memory.

Following is a code snippet from sw/host.cpp to illustrate these concepts:

```
#include "adf/adf_api/XRTConfig.h"
#include "experimental/xrt_kernel.h"
...
//1. Get OpenCL platform and device, prepare OpenCL context and command queue. Program xclbin, and get kernel objects from the program. adf::registerXRT() is needed for ADF API.
cl::Device device;
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
...
cl::Context context(device);
cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
...
cl::Program::Binaries bins;
cl::Program program(context, devices, bins);
cl::Kernel krnl_s2mm(program,"s2mm"); //get kernel object
...
// Create XRT device handle for ADF API
void *dh;
device.getInfo(CL_DEVICE_HANDLE, &dh);
auto dhdl = xrtDeviceOpenFromXcl(dh);
auto top = reinterpret_cast<const axlf*>(buf);
adf::registerXRT(dhdl, uuid);

//2. Prepare device buffers for kernels. Transfer data from host memory to global memory in device. 
std::complex<short> *host_out; //host buffer
cl::Buffer buffer_out(context, CL_MEM_WRITE_ONLY, output_size_in_bytes);
host_out=(std::complex<short>*)q.enqueueMapBuffer(buffer_out,true,CL_MAP_READ,0,sizeof(int)*OUTPUT_SIZE,nullptr,nullptr,nullptr);

//3. The host program sets up the kernel with its input parameters
krnl_s2mm.setArg(0,buffer_out);
krnl_s2mm.setArg(2,OUTPUT_SIZE);

//Launch the Kernel
q.enqueueTask(krnl_s2mm);

// ADF API: Initialize, run and update graph parameters (RTP)
gr.run(4);
gr.update(gr.trigger,10);
gr.update(gr.trigger,10);
gr.update(gr.trigger,100);
gr.update(gr.trigger,100);
gr.wait();

//4. Wait for kernel completion. 
q.finish();//Wait for s2mm to complete    

//5. Transfer data from global memory in device to host memory.
q.enqueueMigrateMemObjects({buffer_out},CL_MIGRATE_MEM_OBJECT_HOST);	
q.finish();//Wait for memory transfer to complete

//6. post-processing on host memory - "host_out"
```

The header files adf/adf_api/XRTConfig.h and experimental/xrt_kernel.h are required by the ADF APIs and the XRT APIs, respectively.

Note: In this example, graph execution must be started before finish() is called on the command queue. Because finish() is a blocking call, invoking it first means the graph never starts and never provides output to s2mm, so the application hangs at the blocking call.

The next step is to use v++ with -p to generate the package file. The make command is:

```
make package
```

The corresponding v++ command is:

```
v++ -p -t hw_emu -f $PLATFORM_REPO_PATHS/xilinx_vck190_es1_base_202220_1/xilinx_vck190_es1_base_202220_1.xpfm \
--package.rootfs $PLATFORM_REPO_PATHS/sw/versal/xilinx-versal-common-v2022.2/rootfs.ext4  \
--package.kernel_image $PLATFORM_REPO_PATHS/sw/versal/xilinx-versal-common-v2022.2/Image  \
--package.boot_mode=sd \
--package.image_format=ext4 \
--package.defer_aie_run \
--package.sd_dir data \
--package.sd_file host.exe vck190_aie_base_graph_hw_emu.xclbin libadf.a
```

Here, --package.defer_aie_run specifies that the Versal AI Engine cores will be enabled by the PS. When it is not specified, the tool instead generates CDO commands that enable the AI Engine cores during PDI load.

--package.sd_dir <arg> specifies a directory to package into the *sd_card* directory/image, which is helpful for including golden data in the package.

--package.sd_file <arg> specifies files to package into the *sd_card* directory/image.

For more details about v++ -p (--package) options, refer to Application Acceleration Development (UG1393).