Inference is computation-intensive and requires high memory bandwidth to meet the low-latency and high-throughput requirements of edge applications.
Quantization and channel pruning techniques are employed to address these issues while achieving high performance and high energy efficiency with little degradation in accuracy. Quantization makes it possible to use integer computing units and to represent weights and activations with fewer bits, while pruning reduces the total number of operations required. The Vitis™ AI quantizer includes only the quantization tool; the pruning tool is packaged in the Vitis AI optimizer. Contact the support team for the Vitis AI development kit if you require the pruning tool.
Generally, 32-bit floating-point weights and activation values are used when training neural networks. By converting the 32-bit floating-point weights and activations to 8-bit integer (INT8) format, the Vitis AI quantizer can reduce computing complexity with little loss of prediction accuracy. The resulting fixed-point network model requires less memory bandwidth, and therefore runs faster and with higher power efficiency than the floating-point model. The Vitis AI quantizer supports common layers in neural networks, including, but not limited to, convolution, pooling, fully connected, and batch normalization.
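As an illustration of what the conversion to INT8 involves, the following is a minimal sketch of symmetric 8-bit quantization with a single shared scale. It is a generic textbook scheme, not the Vitis AI quantizer's implementation; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale (symmetric scheme)."""
    max_abs = max(abs(v) for v in values)
    # One integer step represents `scale` units of the original float range.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


# Hypothetical float32 weights from a trained layer.
weights = [0.82, -0.31, 0.05, -1.27, 0.64]

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# The round-trip error of any in-range value is at most half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, scale, max_error)
```

Each value is stored in one byte instead of four, which is where the reduction in memory bandwidth comes from; the price is a rounding error of at most half the scale per value.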
The Vitis AI quantizer supports TensorFlow (both 1.x and 2.x), PyTorch, and Caffe. The quantizer names are vai_q_tensorflow, vai_q_pytorch, and vai_q_caffe, respectively. The Vitis AI quantizers for TensorFlow 1.x and TensorFlow 2.x are implemented differently and released separately. For TensorFlow 1.x, the Vitis AI quantizer is based on TensorFlow 1.15: quantization features are added to it, and the result is rebuilt and distributed as a standalone package. For TensorFlow 2.x, the Vitis AI quantizer is a Python package that provides several quantization APIs. You import this package, and the Vitis AI quantizer works like a plugin for TensorFlow.
| Framework | Version | Post Training Quantization (PTQ) | Quantization Aware Training (QAT) | Fast Finetuning (Advanced Calibration) |
|---|---|---|---|---|
| TensorFlow 1.x | Based on 1.15 | Yes | Yes | No |
| TensorFlow 2.x | Supports 2.3 | Yes | Yes | Yes |
| PyTorch | Supports 1.2 - 1.9 | Yes | Yes | Yes |
Post training quantization (PTQ) requires only a small set of unlabeled images to analyze the distribution of activations. The running time of quantize calibration varies from a few seconds to several minutes, depending on the size of the neural network. Generally, accuracy drops slightly after quantization. For some networks, such as MobileNet, however, the accuracy loss can be large. In this situation, quantization aware training (QAT) can be used to further improve the accuracy of the quantized model. QAT requires the original training dataset. Several epochs of finetuning are needed, and the finetuning time varies from several minutes to several hours. Use a small learning rate when performing QAT.
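To show why calibration needs no labels, here is a minimal sketch of a range observer that only looks at activation values. It assumes a simple max-abs statistic and a symmetric 8-bit scale; the class name `RangeObserver` and the sample batches are hypothetical, and the statistic the Vitis AI quantizer actually collects may differ.

```python
class RangeObserver:
    """Tracks the largest absolute activation seen across calibration batches."""

    def __init__(self):
        self.max_abs = 0.0

    def observe(self, activations):
        # Only the activation values are inspected; no labels are involved.
        self.max_abs = max(self.max_abs, max(abs(a) for a in activations))

    def scale(self, num_bits=8):
        # Largest positive value of a signed integer with num_bits bits.
        qmax = 2 ** (num_bits - 1) - 1
        return self.max_abs / qmax if self.max_abs > 0 else 1.0


# Hypothetical activations of one layer for three unlabeled calibration batches.
calibration_batches = [
    [0.1, 2.3, -0.7],
    [5.08, 0.0, -1.2],
    [3.3, -4.1, 0.9],
]

observer = RangeObserver()
for batch in calibration_batches:
    observer.observe(batch)

act_scale = observer.scale()
print(observer.max_abs, act_scale)
```

Because one pass over a few batches is enough to fix the scale, calibration is fast compared with QAT, which has to backpropagate through the whole training set for several epochs.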
For PTQ, the cross layer equalization algorithm [1] is implemented. Cross layer equalization can improve the calibration performance, especially for networks that include depth-wise convolution.
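The idea behind cross layer equalization can be sketched on two bias-free fully connected layers joined by a ReLU. Following the scaling rule in Nagel et al., each intermediate channel i is rescaled by s_i = sqrt(r1_i / r2_i), where r1_i and r2_i are the weight ranges of that channel in the first and second layer. This is a simplified illustration, not the Vitis AI implementation; the helper names and weights are hypothetical.

```python
import math


def relu(vector):
    return [max(0.0, x) for x in vector]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def equalize(w1, w2):
    """Rescale each intermediate channel so both layers share the same weight range.

    w1 has one row per intermediate channel; w2 has one column per channel.
    """
    new_w1 = [row[:] for row in w1]
    new_w2 = [row[:] for row in w2]
    for i in range(len(w1)):
        r1 = max(abs(w) for w in w1[i])        # range of channel i in layer 1
        r2 = max(abs(row[i]) for row in w2)    # range of channel i in layer 2
        if r1 == 0 or r2 == 0:
            continue                           # nothing to balance for a dead channel
        s = math.sqrt(r1 / r2)
        new_w1[i] = [w / s for w in w1[i]]     # shrink (or grow) the layer-1 row
        for row in new_w2:
            row[i] *= s                        # compensate in the layer-2 column
    return new_w1, new_w2


# Channel 0 has large layer-1 weights, channel 1 has tiny ones.
w1 = [[8.0, -4.0], [0.02, 0.01]]
w2 = [[0.5, 50.0], [-0.25, 20.0]]
x = [1.0, 0.5]

eq_w1, eq_w2 = equalize(w1, w2)
before = matvec(w2, relu(matvec(w1, x)))
after = matvec(eq_w2, relu(matvec(eq_w1, x)))
print(before, after)
```

The network output is unchanged because ReLU commutes with positive scaling, but the per-channel weight ranges within each layer become comparable, so a single per-layer scale wastes fewer quantization levels. Depth-wise convolutions tend to have widely varying per-channel ranges, which is why they benefit most.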
With a small set of unlabeled data, the AdaQuant algorithm [2] not only calibrates the activations but also finetunes the weights. Like calibration, AdaQuant uses only a small set of unlabeled data; like finetuning, it changes the model. The Vitis AI quantizer implements this algorithm and calls it "fast finetuning" or "advanced calibration." Fast finetuning can achieve better performance than quantize calibration, but it is slightly slower. Note that, as with finetuning, each run of fast finetuning produces a different result.
1. Markus Nagel et al., Data-Free Quantization through Weight Equalization and Bias Correction, arXiv:1906.04721, 2019.
2. Itay Hubara et al., Improving Post Training Neural Quantization: Layer-wise Calibration and Integer Programming, arXiv:2006.10518, 2020.