Overview - 3.5 English

Vitis AI User Guide (UG1414)

Document ID: UG1414
Release Date: 2023-09-28
Version: 3.5 English

The inference process is computationally intensive and requires a high memory bandwidth to satisfy the low-latency and high-throughput requirements of Edge applications.

Quantization and channel pruning techniques address these challenges while simultaneously achieving optimal performance and high energy efficiency with minimal degradation in accuracy. Through quantization, integer computing units become viable, and weights and activations can be represented with reduced precision. On the other hand, pruning reduces the overall required operations. The AMD Vitis AI quantizer includes the quantization tool, whereas the pruning tool is integrated into the Vitis AI optimizer.

Figure 1. Pruning and Quantization Flow

Neural networks are generally trained with 32-bit floating-point weights and activations. The Vitis AI quantizer reduces computational complexity with little loss of prediction accuracy by converting the 32-bit floating-point weights and activations to 8-bit integer (INT8) format. The resulting fixed-point network model requires less memory bandwidth and therefore runs faster and with higher power efficiency than the floating-point model. The Vitis AI quantizer supports common layers in neural networks, including, but not limited to, convolution, pooling, fully connected, and batch normalization.

The Vitis AI quantizer supports TensorFlow (1.x and 2.x) and PyTorch. The quantizer names are vai_q_tensorflow and vai_q_pytorch, respectively. In Vitis AI 2.5 and earlier versions, the Vitis AI quantizer for TensorFlow 1.x was based on TensorFlow 1.15 and released as part of the TensorFlow 1.15 package. However, beginning with Vitis AI 3.0, the Vitis AI quantizer is offered as a standalone Python package featuring multiple quantization APIs for both TensorFlow 1.x and TensorFlow 2.x. You can import this package, and once imported, the Vitis AI quantizer functions as a plugin for TensorFlow.

Table 1. Vitis AI Quantizer Supported Frameworks and Features
Framework	Versions	Post-Training Quantization (PTQ)	Quantization-Aware Training (QAT)	Fast Fine-Tuning (Advanced Calibration)	Inspector
TensorFlow 1.x	1.15	Yes	Yes	Yes	No
TensorFlow 2.x	2.3 - 2.12	Yes	Yes	Yes	Yes
PyTorch	1.2 - 1.13, 2.0	Yes	Yes	Yes	Yes

Post-training quantization (PTQ) requires only a small set of unlabeled images to analyze the distribution of activations. Its run time varies from a few seconds to several minutes, depending on the size of the neural network. Quantization generally causes a small, tolerable drop in accuracy. For some networks, such as MobileNet, the accuracy loss can be too large to tolerate. In such cases, quantization-aware training (QAT) can further improve the accuracy of the quantized model. QAT requires the original training dataset and several epochs of fine-tuning, which can take from several minutes to hours. Use small learning rates when performing QAT.
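The calibration idea behind PTQ can be illustrated with a minimal sketch. It assumes a simple max-based rule: pick the largest number of fractional bits such that the largest observed activation still fits in a signed 8-bit integer. The function name `choose_fractional_bits` and the sample values are hypothetical, and the rule is an illustration only, not the calibration method that the Vitis AI quantizer actually uses.

```python
import math

def choose_fractional_bits(samples, bit_width=8):
    """Pick the largest number of fractional bits for which the largest
    observed magnitude still fits in a signed integer of bit_width bits.
    Max-based rule, assumed here for illustration only."""
    qmax = (1 << (bit_width - 1)) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in samples)
    if max_abs == 0:
        return bit_width - 1                   # all zeros: any scale works
    return math.floor(math.log2(qmax / max_abs))

# Hypothetical activations collected while running a few calibration images
calibration_activations = [0.12, -0.87, 3.2, 1.5, -2.9]
bits = choose_fractional_bits(calibration_activations)
print(bits)                                    # number of fractional bits chosen
print(max(abs(v) for v in calibration_activations) * 2 ** bits)  # fits below 127
```

No labels are involved: the rule looks only at the values flowing through the network, which is why a small unlabeled image set is enough for calibration.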

Note: Starting from Vitis AI 1.4, the term "quantize calibration" is replaced with "post-training quantization," and "quantize fine-tuning" is replaced with "quantization-aware training."

Note: Vitis AI performs only signed quantization. It is highly recommended to standardize the input, that is, to scale the input pixel values to zero mean and unit variance, so that the DPU sees values within the range [-1.0, +1.0]. Scaled unsigned inputs, obtained by dividing the raw input by 255.0 to get the range [0.0, 1.0], lose dynamic range because only half of the input range is used. The TensorFlow 2.x and PyTorch quantizers provide configurations for experimenting with unsigned quantization, but the results are not currently deployable on DPUs.
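The loss of dynamic range can be seen by counting how many distinct 8-bit signed codes each input scaling produces. The sketch below assumes seven fractional bits and round-to-nearest with saturation. The `quantize` helper is hypothetical, and the linear mapping to [-1.0, 1.0] is a simple stand-in for standardization, not the preprocessing of any particular model.

```python
def quantize(x, fractional_bits=7, bit_width=8):
    """Round x * 2**fractional_bits to an integer and saturate to the
    signed range (illustrative helper, not a Vitis AI API)."""
    qmin = -(1 << (bit_width - 1))
    qmax = (1 << (bit_width - 1)) - 1
    return max(qmin, min(qmax, round(x * 2 ** fractional_bits)))

raw_pixels = range(256)

# Unsigned scaling: [0, 255] -> [0.0, 1.0]
unsigned_codes = {quantize(p / 255.0) for p in raw_pixels}

# Signed scaling (stand-in for standardization): [0, 255] -> [-1.0, 1.0]
signed_codes = {quantize(p / 127.5 - 1.0) for p in raw_pixels}

print(len(unsigned_codes), min(unsigned_codes), max(unsigned_codes))
print(len(signed_codes), min(signed_codes), max(signed_codes))
```

With unsigned scaling no input ever maps to a negative code, so at most 128 of the 256 available codes are used. The signed scaling spans the codes from -128 to 127.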

Note: When viewing a model with a tool like Netron, the fix_point parameter shown for some layers indicates the quantization parameters used for that layer. The fix_point parameter is the number of fractional bits. For example, for 8-bit signed quantization with fix_point = 7, the Q-format representation is Q0.7: one sign bit, zero integer bits, and seven fractional bits. To convert an integer value in Q-format to a floating-point value, multiply the integer value by 2^-fix_point.
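The conversion rule can be written as a short sketch. The helper names `fixed_to_float` and `float_to_fixed` are hypothetical, and the rounding in `float_to_fixed` (Python's built-in `round`, followed by saturation) is an assumption, since the rounding mode used by the quantizer is not stated here.

```python
def fixed_to_float(q, fix_point):
    """Convert a signed fixed-point integer to float: q * 2**-fix_point."""
    return q * 2.0 ** -fix_point

def float_to_fixed(x, fix_point, bit_width=8):
    """Convert a float to a signed integer with fix_point fractional bits,
    saturating at the limits of the signed range. Uses Python's round(),
    which is an assumption made for this sketch."""
    qmin = -(1 << (bit_width - 1))
    qmax = (1 << (bit_width - 1)) - 1
    return max(qmin, min(qmax, round(x * 2 ** fix_point)))

print(fixed_to_float(64, 7))     # 0.5
print(fixed_to_float(-128, 7))   # -1.0
print(fixed_to_float(127, 7))    # 0.9921875
print(float_to_fixed(1.0, 7))    # 127, because 128 saturates to the maximum
```

In Q0.7 the largest representable value is 127/128 = 0.9921875, so an input of exactly 1.0 saturates rather than being represented exactly.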

For post-training quantization, the cross-layer equalization ¹ algorithm is implemented. Cross-layer equalization can improve calibration performance, especially for networks that include depthwise convolutions.
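The core rescaling step of cross-layer equalization can be sketched for two consecutive fully connected layers with a ReLU between them. The sketch follows the scaling rule from the cited paper: each output channel of the first layer is divided by a positive factor, and the matching input channel of the second layer is multiplied by the same factor. It uses the per-channel maximum absolute weight as the range. The function names and the weights are hypothetical, and this is not the Vitis AI implementation.

```python
import math

def equalize(w1, b1, w2):
    """Rescale channel i of layer 1 (row i of w1, b1[i]) by 1/s and the
    matching input channel of layer 2 (column i of w2) by s, so both
    end up with the same weight range sqrt(r1 * r2)."""
    w1_eq = [row[:] for row in w1]
    b1_eq = b1[:]
    w2_eq = [row[:] for row in w2]
    for i in range(len(w1)):
        r1 = max(abs(v) for v in w1[i])          # range of output channel i
        r2 = max(abs(row[i]) for row in w2)      # range of input channel i
        if r1 == 0 or r2 == 0:
            continue                             # nothing to equalize
        s = math.sqrt(r1 * r2) / r2
        w1_eq[i] = [v / s for v in w1[i]]
        b1_eq[i] = b1[i] / s
        for row in w2_eq:
            row[i] *= s
    return w1_eq, b1_eq, w2_eq

def forward(x, w1, b1, w2):
    """Two fully connected layers with a ReLU in between."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

# Hypothetical weights with badly mismatched per-channel ranges
w1 = [[0.02, -0.01], [4.0, -2.0]]
b1 = [0.01, 0.5]
w2 = [[8.0, 0.05], [-6.0, 0.02]]
w1_eq, b1_eq, w2_eq = equalize(w1, b1, w2)

x = [1.0, -0.5]
print(forward(x, w1, b1, w2))
print(forward(x, w1_eq, b1_eq, w2_eq))   # same output up to rounding error
```

Because ReLU satisfies relu(a * z) = a * relu(z) for any positive a, the rescaling leaves the network output unchanged while giving every channel a similar weight range, which makes a single per-layer quantization scale fit all channels better.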

The AdaQuant algorithm ² uses a small set of unlabeled data, as post-training quantization does, but it not only calibrates the activations, it also fine-tunes the weights, so it changes the model much as fine-tuning does. The Vitis AI quantizer implements this algorithm and calls it "fast fine-tuning" or "advanced calibration." Fast fine-tuning can achieve better accuracy than post-training quantization but is slightly slower.

Note: As with fine-tuning, each run of fast fine-tuning can produce a different result.

References

1. Markus Nagel et al., Data-Free Quantization through Weight Equalization and Bias Correction, arXiv:1906.04721, 2019.
2. Itay Hubara et al., Improving Post Training Neural Quantization: Layer-wise Calibration and Integer Programming, arXiv:2006.10518, 2020.