The inference process is computationally intensive and requires high memory bandwidth to meet the low-latency and high-throughput requirements of edge applications.
Quantization and channel pruning address these challenges, delivering high performance and high energy efficiency with minimal degradation in accuracy. Quantization represents weights and activations with reduced precision, which makes integer computing units viable. Pruning, in turn, reduces the total number of operations required. The quantization tool is provided by the AMD Vitis AI quantizer, whereas the pruning tool is integrated into the Vitis AI optimizer.
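To make the effect of channel pruning on operation count concrete, the following sketch counts the multiply-accumulate (MAC) operations of a single convolution layer before and after half of its output channels are removed. The layer dimensions are hypothetical, and the calculation is a back-of-the-envelope illustration, not output from the Vitis AI optimizer.

```python
def conv_macs(out_h, out_w, in_ch, out_ch, kernel):
    """Multiply-accumulate count for one standard 2D convolution layer."""
    return out_h * out_w * in_ch * out_ch * kernel * kernel


# Hypothetical layer: 56x56 output map, 64 -> 128 channels, 3x3 kernel.
dense_macs = conv_macs(56, 56, 64, 128, 3)

# Removing half of the output channels halves this layer's MACs. The next
# layer would also see fewer input channels, which is not modeled here.
pruned_macs = conv_macs(56, 56, 64, 64, 3)

reduction = 1 - pruned_macs / dense_macs
print(f"dense: {dense_macs}, pruned: {pruned_macs}, reduction: {reduction:.0%}")
```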
Neural networks are generally trained with 32-bit floating-point weights and activations. The Vitis AI quantizer reduces computational complexity with minimal loss of prediction accuracy by converting the 32-bit floating-point weights and activations to an 8-bit integer (INT8) format. The resulting fixed-point model requires less memory bandwidth, and therefore runs faster and with higher power efficiency than the floating-point model. The Vitis AI quantizer supports common layers in neural networks, including, but not limited to, convolution, pooling, fully connected, and batch normalization.
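The float-to-INT8 conversion can be illustrated with a minimal sketch. It assumes a symmetric, per-tensor scheme with a single scale factor. This is a common textbook formulation chosen for illustration and is not necessarily the scheme the Vitis AI quantizer uses internally. The function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


weights = [0.62, -1.27, 0.05, 0.9, -0.33]  # hypothetical float32 weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# Each value is stored in 1 byte instead of 4, at the cost of a rounding
# error of at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q_weights, scale, max_error)
```

The integer values need a quarter of the storage of their 32-bit counterparts, which is where the reduction in memory bandwidth comes from.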
The Vitis AI quantizer supports TensorFlow (1.x and 2.x) and PyTorch. The quantizer names are vai_q_tensorflow and vai_q_pytorch, respectively. In Vitis AI 2.5 and earlier versions, the Vitis AI quantizer for TensorFlow 1.x was based on TensorFlow 1.15 and released as part of the TensorFlow 1.15 package. Beginning with Vitis AI 3.0, however, the Vitis AI quantizer is offered as a standalone Python package featuring multiple quantization APIs for both TensorFlow 1.x and TensorFlow 2.x. Once you import this package, the Vitis AI quantizer functions as a plugin for TensorFlow.
Model | Versions | Post-Training Quantization (PTQ) | Quantization-Aware Training (QAT) | Fast Fine-Tuning (Advanced Calibration) | Inspector |
---|---|---|---|---|---|
TensorFlow 1.x | Supports 1.15 | Yes | Yes | Yes | No |
TensorFlow 2.x | Supports 2.3 - 2.12 | Yes | Yes | Yes | Yes |
PyTorch | Supports 1.2 - 1.13, 2.0 | Yes | Yes | Yes | Yes |
Post-training quantization (PTQ) requires only a small set of unlabeled images to analyze the distribution of activations. Its run time varies from a few seconds to several minutes, depending on the size of the neural network. Quantization generally causes a small, tolerable drop in accuracy. For some networks, such as MobileNet, the accuracy loss can be too large to tolerate. In such cases, quantization-aware training (QAT) can further improve the accuracy of the quantized model. QAT requires the original training dataset and several epochs of fine-tuning, which can take from several minutes to hours. Use a small learning rate when performing QAT.
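The calibration step of post-training quantization can be sketched as follows: unlabeled batches are passed through the model, the observed activation range is recorded, and a quantization scale is derived from it. This is a simplified illustration that assumes max-abs range tracking and a symmetric INT8 scale. The class, the toy layer, and the data are hypothetical and do not reflect the Vitis AI quantizer's API or its actual calibration method.

```python
class ActivationCalibrator:
    """Tracks the largest absolute activation seen across calibration batches."""

    def __init__(self):
        self.max_abs = 0.0

    def observe(self, activations):
        self.max_abs = max(self.max_abs, max(abs(a) for a in activations))

    def scale(self):
        # Symmetric INT8 scale derived from the observed range.
        return self.max_abs / 127 if self.max_abs > 0 else 1.0


def relu_layer(inputs, weight, bias):
    """Toy element-wise layer standing in for a real network."""
    return [max(0.0, weight * x + bias) for x in inputs]


# A few unlabeled batches are enough: only the activation statistics matter.
calibration_batches = [[0.1, -0.4, 0.8], [1.5, 0.3, -0.2], [0.7, 2.0, -1.0]]

calibrator = ActivationCalibrator()
for batch in calibration_batches:
    calibrator.observe(relu_layer(batch, weight=2.0, bias=0.5))

activation_scale = calibrator.scale()
print(calibrator.max_abs, activation_scale)
```

No labels and no gradient updates are involved, which is why calibration finishes in seconds to minutes, whereas QAT needs the training dataset and several epochs.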
For post-training quantization, the cross-layer equalization algorithm [1] is implemented. Cross-layer equalization can improve calibration performance, especially for networks including depthwise convolution.
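The idea behind cross-layer equalization can be sketched for two consecutive layers separated by a ReLU. Following the formulation in Nagel et al., each channel shared by the two layers is rescaled by s_i = sqrt(r1_i / r2_i), where r1_i and r2_i are the weight ranges of that channel in the first and second layer. The sketch below uses small fully connected layers with hypothetical weights for brevity and assumes every channel has a nonzero range. It illustrates the published technique and is not the Vitis AI implementation.

```python
import math


def equalize(w1, b1, w2):
    """Rescale shared channels so both layers have matching weight ranges.

    w1: rows are output channels of layer 1.
    w2: rows are outputs of layer 2, columns are its input channels.
    Assumes every channel has a nonzero weight range in both layers.
    """
    n = len(w1)
    scales = []
    for i in range(n):
        r1 = max(abs(v) for v in w1[i])        # range of channel i in layer 1
        r2 = max(abs(row[i]) for row in w2)    # range of channel i in layer 2
        scales.append(math.sqrt(r1 / r2))
    new_w1 = [[v / scales[i] for v in w1[i]] for i in range(n)]
    new_b1 = [b1[i] / scales[i] for i in range(n)]
    new_w2 = [[row[i] * scales[i] for i in range(n)] for row in w2]
    return new_w1, new_b1, new_w2


def forward(x, w1, b1, w2):
    """Layer 1, ReLU, then layer 2 (no bias on layer 2 for brevity)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


# Hypothetical weights with badly mismatched per-channel ranges.
w1 = [[8.0, -4.0], [0.1, 0.05]]
b1 = [0.5, 0.01]
w2 = [[0.02, 3.0], [-0.01, 1.5]]

e1, eb1, e2 = equalize(w1, b1, w2)
x = [1.0, 0.5]
print(forward(x, w1, b1, w2), forward(x, e1, eb1, e2))
```

Because ReLU satisfies relu(a / s) * s = relu(a) for any positive s, the network output is unchanged, while the per-channel ranges become equal and therefore easier to cover with a single quantization scale. Depthwise convolutions tend to show large range differences between channels, which is why they benefit most.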
References
1. Markus Nagel et al., Data-Free Quantization through Weight Equalization and Bias Correction, arXiv:1906.04721, 2019.
2. Itay Hubara et al., Improving Post Training Neural Quantization: Layer-wise Calibration and Integer Programming, arXiv:2006.10518, 2020.