Overview

Vitis AI User Guide (UG1414)

Document ID: UG1414
Release Date: 2023-02-24
Version: 3.0 English

The process of inference is computation intensive and requires high memory bandwidth to satisfy the low-latency and high-throughput requirements of Edge applications.

Quantization and channel pruning techniques are employed to address these issues while achieving high performance and high energy efficiency with little degradation in accuracy. Quantization makes it possible to use integer computing units and to represent weights and activations with fewer bits, while pruning reduces the overall number of required operations. The Vitis™ AI quantizer includes only the quantization tool; the pruning tool is packaged in the Vitis AI optimizer. Contact the support team for the Vitis AI development kit if you require the pruning tool.

Figure 1. Pruning and Quantization Flow

Generally, 32-bit floating-point weights and activation values are used when training neural networks. By converting the 32-bit floating-point weights and activations to 8-bit integer (INT8) format, the Vitis AI quantizer can reduce computing complexity without losing prediction accuracy. The fixed-point network model requires less memory bandwidth, thus providing faster speed and higher power efficiency than the floating-point model. The Vitis AI quantizer supports common layers in neural networks, including, but not limited to, convolution, pooling, fully connected, and batchnorm.
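
As a simplified illustration of this conversion (not the quantizer's exact rounding or scale-selection rules), the following sketch maps floating-point values to signed INT8 with a power-of-two scale and back:

    import numpy as np

    def quantize_int8(x, frac_bits):
        """Map float values to signed INT8 using a power-of-two scale (2**frac_bits)."""
        q = np.round(x * 2.0 ** frac_bits)
        return np.clip(q, -128, 127).astype(np.int8)

    def dequantize(q, frac_bits):
        """Recover approximate float values from the INT8 representation."""
        return q.astype(np.float32) * 2.0 ** (-frac_bits)

    w = np.array([0.42, -0.07, 0.99, -1.0], dtype=np.float32)
    qw = quantize_int8(w, frac_bits=7)   # 7 fractional bits cover roughly [-1.0, +1.0)
    print(qw)                            # -> 54, -9, 127, -128
    print(dequantize(qw, 7))             # -> 0.421875, -0.0703125, 0.9921875, -1.0

This power-of-two (fixed-point) scaling is the same convention exposed later as the fix_point parameter.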

The Vitis AI quantizer now supports TensorFlow (both 1.x and 2.x) and PyTorch. The quantizer names are vai_q_tensorflow and vai_q_pytorch, respectively. The quantizer for Caffe was deprecated in Vitis AI 2.5; to use the Vitis AI quantizer for Caffe, refer to Vitis AI 2.0. In Vitis AI 2.5 and earlier, the TensorFlow 1.x quantizer was based on TensorFlow 1.15 and released as part of the TensorFlow 1.15 package. Starting with Vitis AI 3.0, the Vitis AI quantizer is a standalone Python package with several quantization APIs for both TensorFlow 1.x and TensorFlow 2.x. You can import this package, and the Vitis AI quantizer works like a plug-in for TensorFlow.
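
For reference, a minimal post-training quantization call with the TensorFlow 2.x quantizer looks roughly like the sketch below. The module path and argument names follow the vai_q_tensorflow2 flow but can change between releases, so treat this as an illustrative sketch rather than a definitive recipe; the model and calibration data here are stand-ins.

    import tensorflow as tf
    from tensorflow_model_optimization.quantization.keras import vitis_quantize

    # Tiny stand-in model and random calibration images keep the sketch
    # self-contained; in practice, load your trained float model and feed a
    # few hundred real, unlabeled samples.
    float_model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, activation='relu', input_shape=(32, 32, 3)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10),
    ])
    calib_images = tf.random.uniform([100, 32, 32, 3])

    # Post-training quantization: observe activation ranges on the calibration
    # set and produce a quantized model ready for compilation.
    quantizer = vitis_quantize.VitisQuantizer(float_model)
    quantized_model = quantizer.quantize_model(calib_dataset=calib_images)
    quantized_model.save('quantized_model.h5')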

Table 1. Vitis AI Quantizer Supported Frameworks and Features
Model            Versions              PTQ   QAT   Fast Finetuning   Inspector
TensorFlow 1.x   Supports 1.15         Yes   Yes   No                No
TensorFlow 2.x   Supports 2.3 - 2.10   Yes   Yes   Yes               Yes
PyTorch          Supports 1.2 - 1.12   Yes   Yes   Yes               Yes
PTQ = post training quantization; QAT = quantization aware training; fast finetuning is also called advanced calibration.

Post training quantization (PTQ) requires only a small set of unlabeled images to analyze the distribution of activations. The running time of quantize calibration varies from a few seconds to several minutes, depending on the size of the neural network. Generally, there is some drop in accuracy after quantization. However, for some networks, such as MobileNet, the accuracy loss might be large. In this situation, quantization aware training (QAT) can be used to further improve the accuracy of the quantized models. QAT requires the original training dataset, and several epochs of finetuning are needed; the finetuning time varies from several minutes to several hours. It is recommended to use small learning rates when performing QAT.
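
A quantization aware training pass with the TensorFlow 2.x quantizer might look like the following sketch. The get_qat_model call and its arguments follow the vai_q_tensorflow2 flow but may differ between releases, and the model, data, and hyperparameters are illustrative placeholders; the small learning rate reflects the recommendation above.

    import tensorflow as tf
    from tensorflow_model_optimization.quantization.keras import vitis_quantize

    # Tiny stand-in model and random data keep the sketch self-contained;
    # in practice, load your trained float model and use the original
    # training dataset.
    float_model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, activation='relu', input_shape=(32, 32, 3)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10),
    ])
    calib_images = tf.random.uniform([64, 32, 32, 3])
    train_images = tf.random.uniform([64, 32, 32, 3])
    train_labels = tf.random.uniform([64], maxval=10, dtype=tf.int32)

    # Insert fake-quantization nodes so training sees INT8 rounding effects,
    # initialized from a quick calibration pass.
    quantizer = vitis_quantize.VitisQuantizer(float_model)
    qat_model = quantizer.get_qat_model(init_quant=True, calib_dataset=calib_images)

    # Finetune for a few epochs with a small learning rate, per the note above.
    qat_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'])
    qat_model.fit(train_images, train_labels, batch_size=8, epochs=2)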

Note: From Vitis AI 1.4 onwards, the term "quantize calibration" is replaced with "post training quantization" and "quantize finetuning" is replaced with "quantization aware training."
Note: Vitis AI only performs signed quantization. It is strongly recommended to standardize the input (that is, scale the input pixel values to have zero mean and unit variance) so that the DPU effectively sees values in the range [-1.0, +1.0); the sketch following these notes illustrates the difference. Scaled unsigned inputs, for example, dividing the raw input by 255.0 to obtain an input range of [0.0, 1.0], effectively "lose" a bit, because the sign bit must always be zero to denote a positive value. The TensorFlow 2.x and PyTorch quantizers provide configurations to perform unsigned quantization for experimental purposes, but the results are not currently deployable to the DPU.
Note: When viewing a model with a tool such as Netron, some layers show a fix_point parameter indicating the quantization parameters used for that layer. The fix_point parameter is the number of fractional bits used. For example, for 8-bit signed quantization with fix_point = 7, the Q-format representation is Q0.7, that is, 1 sign bit, 0 integer bits, and 7 fractional bits. To convert an integer value in Q-format to floating-point, multiply the integer value by 2^-fix_point.
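
The two notes above can be illustrated with a short sketch: a signed preprocessing step that maps raw pixels into [-1.0, +1.0), the unsigned alternative that wastes the sign bit, and the fix_point decoding rule. The shift and scale constants are illustrative; dataset-specific mean and standard-deviation standardization serves the same purpose.

    import numpy as np

    img = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)

    # Signed preprocessing: shift and scale raw pixels into [-1.0, +1.0) so the
    # sign bit of the INT8 representation carries information.
    x_signed = (img.astype(np.float32) - 128.0) / 128.0

    # Unsigned scaling: every value is >= 0, so the sign bit is always zero and
    # one bit of precision is effectively lost.
    x_unsigned = img.astype(np.float32) / 255.0

    # Decoding a value reported by Netron: with fix_point = 7 (Q0.7), the
    # INT8 value 64 represents 64 * 2**-7 = 0.5.
    fix_point = 7
    float_val = 64 * 2.0 ** (-fix_point)   # 0.5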

For PTQ, the cross-layer equalization [1] algorithm is implemented. Cross-layer equalization can improve calibration performance, especially for networks that include depthwise convolution.
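
The idea behind cross-layer equalization [1] is to rescale each output channel of one layer down by a factor s_i and the matching input channel of the next layer up by the same factor, which leaves the function of a ReLU-connected pair unchanged while balancing the per-channel weight ranges that quantization must cover. The sketch below illustrates the published algorithm on two fully connected layers; it is not the quantizer's internal implementation.

    import numpy as np

    def cross_layer_equalize(w1, b1, w2):
        """Equalize per-channel weight ranges of two ReLU-connected layers.

        w1: (out_ch, in_ch) weights of the first layer, b1: its bias.
        w2: (next_out_ch, out_ch) weights of the following layer.
        """
        r1 = np.max(np.abs(w1), axis=1)   # range of each output channel of layer 1
        r2 = np.max(np.abs(w2), axis=0)   # range of each input channel of layer 2
        s = np.sqrt(r1 / r2)              # optimal per-channel scale from [1]
        return w1 / s[:, None], b1 / s, w2 * s[None, :]

    rng = np.random.default_rng(0)
    w1 = rng.normal(size=(16, 8)) * rng.uniform(0.1, 10.0, size=(16, 1))  # unbalanced channels
    b1 = rng.normal(size=16)
    w2 = rng.normal(size=(4, 16))
    w1_eq, b1_eq, w2_eq = cross_layer_equalize(w1, b1, w2)

    # The network function is unchanged (ReLU commutes with positive per-channel
    # scaling), but the per-channel ranges of w1_eq now match the corresponding
    # input-channel ranges of w2_eq, making them easier to quantize together.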

With a small set of unlabeled data, the AdaQuant algorithm [2] not only calibrates the activations but also finetunes the weights. AdaQuant uses a small set of unlabeled data, as calibration does, but it changes the model, which is similar to finetuning. The Vitis AI quantizer implements this algorithm and calls it "fast finetuning" or "advanced calibration." Fast finetuning can achieve better performance than quantize calibration, but it is slightly slower. Note that, as with finetuning, each run of fast finetuning produces a different result.
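
For example, with vai_q_pytorch, fast finetuning is invoked on the quantizer during the calibration run, roughly as sketched below. The torch_quantizer and fast_finetune calls follow the vai_q_pytorch flow, but exact arguments can vary between releases; the model, data, and evaluation routine are stand-ins.

    import torch
    from torch import nn
    from pytorch_nndct.apis import torch_quantizer

    # Tiny stand-in model and random data keep the sketch self-contained;
    # substitute your trained model and a small unlabeled data loader.
    float_model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(),
                                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                nn.Linear(8, 10))
    val_loader = [(torch.randn(4, 3, 32, 32), torch.zeros(4, dtype=torch.long))
                  for _ in range(8)]

    def evaluate(model, loader):
        """Forward passes used by both fast finetuning and calibration."""
        model.eval()
        total = 0
        with torch.no_grad():
            for images, _ in loader:
                model(images)
                total += images.size(0)
        return total

    dummy_input = torch.randn(1, 3, 32, 32)
    quantizer = torch_quantizer('calib', float_model, (dummy_input,),
                                output_dir='quantize_result')
    quant_model = quantizer.quant_model

    # AdaQuant-style fast finetuning with unlabeled data, followed by the usual
    # calibration forward passes and export of the quantization configuration.
    quantizer.fast_finetune(evaluate, (quant_model, val_loader))
    evaluate(quant_model, val_loader)
    quantizer.export_quant_config()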

Note:
  1. Markus Nagel et al., Data-Free Quantization through Weight Equalization and Bias Correction, arXiv:1906.04721, 2019.
  2. Itay Hubara et al., Improving Post Training Neural Quantization: Layer-wise Calibration and Integer Programming, arXiv:2006.10518, 2020.