Inference is computation-intensive and requires high memory bandwidth to satisfy the low-latency and high-throughput requirements of edge applications.
Quantization and channel pruning techniques are employed to address these issues while achieving high performance and high energy efficiency with little degradation in accuracy. Quantization makes it possible to use integer computing units and to represent weights and activations with fewer bits, while pruning reduces the overall number of required operations. The Vitis™ AI quantizer includes only the quantization tool; the pruning tool is packaged in the Vitis AI optimizer. Contact the support team for the Vitis AI development kit if you require the pruning tool.
Generally, 32-bit floating-point weights and activation values are used when training neural networks. By converting the 32-bit floating-point weights and activations to 8-bit integer (INT8) format, the Vitis AI quantizer can reduce computing complexity with little loss of prediction accuracy. The fixed-point network model requires less memory bandwidth and is therefore faster and more power-efficient than the floating-point model. The Vitis AI quantizer supports common layers in neural networks, including, but not limited to, convolution, pooling, fully connected, and batchnorm.
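The float-to-INT8 conversion described above can be illustrated with a minimal sketch. It assumes a symmetric per-tensor scale derived from the largest absolute weight; that choice, the function names, and the sample weights are illustrative assumptions and do not describe the scheme the Vitis AI quantizer actually uses.

```python
def quantize_int8(values, scale):
    """Map floats to INT8 codes: round(v / scale), clamped to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map INT8 codes back to approximate float values."""
    return [c * scale for c in codes]


# Hypothetical float weights of one layer.
weights = [0.42, -1.27, 0.003, 0.9]

# Symmetric per-tensor scale (an assumption made for this sketch).
scale = max(abs(w) for w in weights) / 127

codes = quantize_int8(weights, scale)
restored = dequantize(codes, scale)

# The rounding error of any unclamped value is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four, and the error introduced per value is bounded by half of `scale`, which is why a well-chosen scale usually costs little accuracy.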
The Vitis AI quantizer now supports TensorFlow (both 1.x and 2.x) and PyTorch. The quantizer names are vai_q_tensorflow and vai_q_pytorch, respectively. The quantizer for Caffe was deprecated in Vitis AI 2.5; to use the Vitis AI quantizer for Caffe, refer to Vitis AI 2.0. In Vitis AI 2.5 and earlier versions, the Vitis AI quantizer for TensorFlow 1.x is based on TensorFlow 1.15 and released as part of the TensorFlow 1.15 package. Starting with Vitis AI 3.0, the Vitis AI quantizer is a standalone Python package with several quantization APIs for both TensorFlow 1.x and TensorFlow 2.x. You can import this package, and the Vitis AI quantizer works like a plugin for TensorFlow.
|Framework|Supported Versions|Post Training Quantization (PTQ)|Quantization Aware Training (QAT)|Fast Finetuning (Advanced Calibration)|Inspector|
|---|---|---|---|---|---|
|TensorFlow 1.x|1.15|Yes|Yes|No|No|
|TensorFlow 2.x|2.3 - 2.10|Yes|Yes|Yes|Yes|
|PyTorch|1.2 - 1.12|Yes|Yes|Yes|Yes|
Post training quantization (PTQ) requires only a small set of unlabeled images to analyze the distribution of activations. The running time of quantize calibration varies from a few seconds to several minutes, depending on the size of the neural network. Generally, there is a small drop in accuracy after quantization. However, for some networks, such as MobileNet, the accuracy loss might be large. In this situation, quantization aware training (QAT) can be used to further improve the accuracy of the quantized model. QAT requires the original training dataset and several epochs of finetuning, which take from several minutes to several hours. Use a small learning rate when performing QAT.
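To show why calibration needs only unlabeled data, the following sketch derives an activation scale from the values observed on a few calibration batches. It assumes simple max-absolute-value calibration; the function name `calibrate_scale` and the sample batches are hypothetical, and the Vitis AI quantizer may use a different calibration strategy.

```python
def calibrate_scale(batches, num_bits=8):
    """Track the largest absolute activation seen, then derive a symmetric scale."""
    max_abs = 0.0
    for batch in batches:
        for value in batch:
            max_abs = max(max_abs, abs(value))
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    return max_abs / qmax


# Hypothetical activations of one layer, recorded while running
# three unlabeled calibration batches through the float model.
calibration_batches = [
    [0.1, -0.7, 2.3],
    [1.9, -2.54, 0.4],
    [0.0, 1.2, -0.3],
]

activation_scale = calibrate_scale(calibration_batches)
```

No labels and no gradient updates are involved: the model is only run forward, which is why calibration finishes in seconds to minutes.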
For PTQ, the cross layer equalization algorithm [1] is implemented. Cross layer equalization can improve the calibration performance, especially for networks that include depth-wise convolution.
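The idea behind cross layer equalization can be sketched for two consecutive fully connected layers with a ReLU between them. Following the scaling rule in the cited paper, output channel i of the first layer is divided by s_i and input channel i of the second layer is multiplied by s_i, with s_i = sqrt(r1_i * r2_i) / r2_i, where r1_i and r2_i are the weight ranges of that channel in each layer. The helper names, the tiny weight matrices, and the assumption that every channel range is nonzero are illustrative; this is not the Vitis AI implementation.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def forward(w1, b1, w2, x):
    """Two fully connected layers with a ReLU in between."""
    hidden = relu([h + b for h, b in zip(matvec(w1, x), b1)])
    return matvec(w2, hidden)


def equalize(w1, b1, w2):
    """Rescale channel i of layer 1 (rows) and layer 2 (columns) to equal ranges."""
    w1_eq, b1_eq = [], []
    w2_eq = [list(row) for row in w2]
    for i in range(len(w1)):
        r1 = max(abs(x) for x in w1[i])      # range of output channel i, layer 1
        r2 = max(abs(row[i]) for row in w2)  # range of input channel i, layer 2
        s = math.sqrt(r1 * r2) / r2          # assumes r1 > 0 and r2 > 0
        w1_eq.append([x / s for x in w1[i]])
        b1_eq.append(b1[i] / s)
        for row in w2_eq:
            row[i] *= s
    return w1_eq, b1_eq, w2_eq


# Hypothetical weights with very different per-channel ranges.
w1 = [[8.0, -4.0], [0.5, 0.25]]
b1 = [1.0, -0.5]
w2 = [[0.5, 2.0], [-0.25, 4.0]]

w1_eq, b1_eq, w2_eq = equalize(w1, b1, w2)
```

Because ReLU satisfies relu(s * x) = s * relu(x) for positive s, the rescaled network computes the same function, while each channel now has the same weight range in both layers, so a single per-tensor scale wastes fewer quantization levels.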
With a small set of unlabeled data, the AdaQuant algorithm [2] not only calibrates the activations but also finetunes the weights. Like calibration, AdaQuant needs only a small set of unlabeled data; like finetuning, it changes the model. The Vitis AI quantizer implements this algorithm and calls it "fast finetuning" or "advanced calibration." Fast finetuning can achieve better performance than quantize calibration, but it is slightly slower. Note that, as with finetuning, each run of fast finetuning produces a different result.
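The principle of adjusting quantized weights against unlabeled data can be sketched as follows. This is a deliberately simplified stand-in, not AdaQuant itself: instead of the gradient-based layer-wise optimization in the cited paper, it greedily nudges individual INT8 codes of a single-output linear layer and keeps a change only if the layer's output moves closer to the float output on the calibration samples. All names and values are hypothetical.

```python
import random


def layer_output(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def reconstruction_error(float_w, codes, scale, samples):
    """Mean squared difference between float and quantized layer outputs."""
    quantized_w = [c * scale for c in codes]
    total = 0.0
    for x in samples:
        diff = layer_output(float_w, x) - layer_output(quantized_w, x)
        total += diff * diff
    return total / len(samples)


def tune_codes(float_w, scale, samples, passes=3, seed=None):
    """Greedy search: accept a +/-1 change to a code only if the error drops."""
    rng = random.Random(seed)
    codes = [max(-128, min(127, round(w / scale))) for w in float_w]
    best = reconstruction_error(float_w, codes, scale, samples)
    for _ in range(passes):
        order = list(range(len(codes)))
        rng.shuffle(order)  # random visiting order, so unseeded runs can differ
        for i in order:
            for step in (-1, 1):
                candidate = codes[:]
                candidate[i] = max(-128, min(127, codes[i] + step))
                err = reconstruction_error(float_w, candidate, scale, samples)
                if err < best:
                    codes, best = candidate, err
    return codes, best


# Hypothetical float weights, a fixed scale, and unlabeled calibration inputs.
float_w = [0.31, -0.74, 0.55, 0.12]
scale = 0.1
samples = [[1.0, 0.5, -0.2, 0.8], [0.3, -1.1, 0.9, 0.4], [-0.6, 0.2, 1.0, -0.5]]

baseline_codes = [round(w / scale) for w in float_w]
baseline_error = reconstruction_error(float_w, baseline_codes, scale, samples)
tuned_codes, tuned_error = tune_codes(float_w, scale, samples, seed=0)
```

The search never accepts a change that increases the error, so the tuned result is at least as good as plain rounding on the calibration samples. With `seed=None` the visiting order changes between runs, which mirrors the run-to-run variation noted above.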
1. Markus Nagel et al., Data-Free Quantization through Weight Equalization and Bias Correction, arXiv:1906.04721, 2019.
2. Itay Hubara et al., Improving Post Training Neural Quantization: Layer-wise Calibration and Integer Programming, arXiv:2006.10518, 2020.