Vitis AI Examples

Vitis AI User Guide (UG1414), Version 2.5 English, Release Date 2022-06-15
Vitis AI provides several C++ and Python examples to demonstrate the use of the unified cloud-edge runtime programming APIs.
Note: The sample code helps you get started with the new runtime (VART). It is not meant for performance benchmarking.
To familiarize yourself with the unified APIs, use the VART examples. These examples exist only to illustrate the APIs and do not deliver high performance. The APIs are compatible between edge and cloud, although cloud boards offer additional software optimizations such as batching, while edge boards require multi-threading to achieve higher performance. If you need higher performance, see the Vitis AI Library samples and demo software.

If you want to optimize for higher performance, consider the following suggestions:

  • Rearrange the thread pipeline structure so that every DPU thread has its own DPU runner object (see the sketch after this list).
  • Optimize the display thread so that it skips frames when the DPU frame rate exceeds the display rate; 200 FPS is too high for video display.
  • Pre-decode the video. The video file might be H.264 encoded; the decoder is slower than the DPU and consumes significant CPU resources, so decode the video into raw format first.
  • Batch mode on Alveo boards needs special consideration because it can cause video frame jitter. The ZCU102 does not support batch mode.
  • OpenCV cv::imshow is slow, so use libdrm.so instead. This applies only to local display, not display through an X server.
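
The first two suggestions are sketched below. This is a minimal sketch, not from UG1414: the create_runner call is the VART API shown later in this section, while the Frame and FrameQueue types and the worker functions are illustrative stand-ins for a real video pipeline (shutdown handling and pre/post-processing are omitted).

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vart/runner.hpp>
#include <xir/graph/subgraph.hpp>

struct Frame { int id = 0; /* decoded image data goes here */ };

class FrameQueue {  // simple thread-safe queue, illustration only
 public:
  void push(Frame f) {
    { std::lock_guard<std::mutex> lk(m_); q_.push_back(std::move(f)); }
    cv_.notify_one();
  }
  Frame pop() {  // blocking pop
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [this] { return !q_.empty(); });
    Frame f = std::move(q_.front());
    q_.pop_front();
    return f;
  }
  std::size_t size() {
    std::lock_guard<std::mutex> lk(m_);
    return q_.size();
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<Frame> q_;
};

void dpu_worker(const xir::Subgraph* subgraph, FrameQueue& in, FrameQueue& out) {
  // Suggestion 1: each DPU thread creates and owns its own runner object.
  auto runner = vart::Runner::create_runner(subgraph, "run");
  while (true) {
    Frame f = in.pop();
    // pre-process, execute_async(...), wait(...), post-process ...
    out.push(std::move(f));
  }
}

void display_loop(FrameQueue& out) {
  while (true) {
    Frame f = out.pop();
    // Suggestion 2: when the DPU outruns the display rate, drop the
    // backlog and show only the newest frame.
    while (out.size() > 0) f = out.pop();
    // show f, for example via libdrm for a local display ...
  }
}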

The following table describes these Vitis AI examples.

Table 1. Vitis AI Examples

ID  Example Name        Models               Framework   Notes
1   resnet50            ResNet-50            Caffe       Image classification with Vitis AI unified C++ APIs.
2   resnet50_pt         ResNet-50            PyTorch     Image classification with Vitis AI unified extension C++ APIs.
3   resnet50_ext        ResNet-50            Caffe       Image classification with Vitis AI unified extension C++ APIs.
4   resnet50_mt_py      ResNet-50            Caffe       Multi-threading image classification with Vitis AI unified Python APIs.
5   inception_v1_mt_py  Inception-v1         TensorFlow  Multi-threading image classification with Vitis AI unified Python APIs.
6   pose_detection      SSD, Pose detection  Caffe       Pose detection with Vitis AI unified C++ APIs.
7   video_analysis      SSD                  Caffe       Traffic detection with Vitis AI unified C++ APIs.
8   adas_detection      YOLOv3               Caffe       ADAS detection with Vitis AI unified C++ APIs.
9   segmentation        FPN                  Caffe       Semantic segmentation with Vitis AI unified C++ APIs.
10  squeezenet_pytorch  Squeezenet           PyTorch     Image classification with Vitis AI unified C++ APIs.

The typical code snippet to deploy models with Vitis AI unified C++ high-level APIs is as follows:

// get dpu subgraph by parsing model file
auto runner = vart::Runner::create_runner(subgraph, "run");
// get input scale and output scale,
// they are used for fixed-floating point conversion
auto outputTensors = runner->get_output_tensors();
auto inputTensors = runner->get_input_tensors();
auto input_scale = get_input_scale(inputTensors[0]);
auto output_scale = get_output_scale(outputTensors[0]);
// do the image pre-process, convert float data to fixed point data
// populate input/output tensors
auto job_id = runner->execute_async(inputsPtr, outputsPtr);
runner->wait(job_id.first, -1);
// process outputs, convert fixed point data to float data 
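
To make the scale comments above concrete, here is a minimal sketch of the fixed/floating-point conversion. The helper names are hypothetical (not part of VART), 8-bit quantization is assumed, and HALF_UP is assumed to mean rounding half toward positive infinity, matching the quantizer default noted at the end of this section.

#include <algorithm>
#include <cmath>
#include <cstdint>

// Convert one float value to int8 using the scale from get_input_scale().
std::int8_t float_to_fix(float value, float input_scale) {
  float scaled = std::floor(value * input_scale + 0.5f);  // HALF_UP rounding
  return static_cast<std::int8_t>(std::clamp(scaled, -128.0f, 127.0f));
}

// Convert one int8 value back to float using the scale from get_output_scale().
float fix_to_float(std::int8_t value, float output_scale) {
  return static_cast<float>(value) * output_scale;
}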

The typical code snippet to deploy models with Vitis AI unified extension C++ high-level APIs is as follows:

// get dpu subgraph by parsing model file
std::unique_ptr<vart::RunnerExt> runner =
          vart::RunnerExt::create_runner(subgraph, attrs.get());
// get input & output tensor buffers
auto input_tensor_buffers = runner->get_inputs();
auto output_tensor_buffers = runner->get_outputs();
// get input scale and output scale,
// they are used for fixed-floating point conversion
auto input_tensor = input_tensor_buffers[0]->get_tensor();
auto output_tensor = output_tensor_buffers[0]->get_tensor();
auto input_scale = get_input_scale(input_tensor);
auto output_scale = get_output_scale(output_tensor);
// do the image pre-process, convert float data to fixed point data
setImageBGR(images[batch_idx], (void*)data_in, input_scale);
// sync data for input ('input' is one element of input_tensor_buffers)
input->sync_for_write(0, input->get_tensor()->get_data_size() /
                         input->get_tensor()->get_shape()[0]);
// populate input/output tensors
auto v = runner->execute_async(input_tensor_buffers, output_tensor_buffers);
auto status = runner->wait((int)v.first, -1);
// sync data for output ('output' is one element of output_tensor_buffers)
output->sync_for_read(0, output->get_tensor()->get_data_size() /
                         output->get_tensor()->get_shape()[0]);
// process outputs, convert fixed point data to float data
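
For reference, a pointer such as data_in above can be obtained from a tensor buffer as sketched below. This continues the snippet above and relies on vart::TensorBuffer::data(), which returns an {address, size-in-bytes} pair for a given index; batch_idx and the variable names are assumptions.

// Get a raw int8 pointer into the input tensor buffer for one batch entry.
auto shape = input_tensor->get_shape();          // NHWC; shape[0] is the batch
auto idx = std::vector<std::int32_t>(shape.size(), 0);
idx[0] = batch_idx;                              // select one batch entry
auto [data_addr, data_size] = input_tensor_buffers[0]->data(idx);
auto data_in = reinterpret_cast<std::int8_t*>(data_addr);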

The typical code snippet to deploy models with Vitis AI unified Python high-level APIs is shown below:

# create a DPU runner from the subgraph
dpu_runner = runner.Runner(subgraph, "run")
# populate input/output tensors
jid = dpu_runner.execute_async(fpgaInput, fpgaOutput)
dpu_runner.wait(jid)
# process fpgaOutput
Note:
  • For VART, the supported input data formats are fp32 and int8.
  • The input and output of the DPU are in NHWC format.
  • For the Softmax IP on the MPSoC, the input is int8 and the output is float32.
Note: The DPU processes only fixed-point data for its input and output. For improved performance and more efficient memory usage, use int8 data as input and perform the float-to-fixed-point conversion as part of preprocessing. If the input data is float, VART converts the float data to fixed-point data, which takes additional time.
Note: Because the default rounding mode of the quantizer is "HALF_UP", use the same rounding mode when converting inputs and outputs to/from INT8. This ensures that the pre- and post-processing parts run the same on the board as they do on the server. However, to balance performance and accuracy, "cut off" (truncation) is often used to convert inputs and outputs to/from INT8.
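
As a concrete illustration of the difference between the two modes (an illustrative example, not from UG1414, again assuming HALF_UP rounds half toward positive infinity), the results diverge, for positive values, whenever the fractional part is one half or more:

#include <cmath>
#include <cstdio>

int main() {
  float scaled = 2.5f;  // a float value already multiplied by the input scale
  int half_up = static_cast<int>(std::floor(scaled + 0.5f));  // HALF_UP -> 3
  int cut_off = static_cast<int>(scaled);                     // truncation -> 2
  std::printf("HALF_UP: %d, cut off: %d\n", half_up, cut_off);
  return 0;
}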