Quantization usually causes only a minor accuracy loss, but for some networks, such as MobileNet, the loss can be significant. To address this, fast fine-tuning uses the AdaQuant algorithm, which adjusts the weights and quantization parameters layer by layer using the unlabeled calibration dataset to improve accuracy for such models.
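The layer-by-layer idea can be illustrated with a small NumPy sketch. This is not the quantizer's implementation and is simpler than AdaQuant: it assumes a single linear layer, a fixed symmetric per-tensor scale (AdaQuant also tunes the quantization parameters), and a straight-through estimator for the rounding step. The names fake_quantize and tune_layer are hypothetical. The point it shows is that the float layer's own output serves as the target, so no labels are needed.

```python
import numpy as np


def fake_quantize(w, scale, num_bits=8):
    """Round weights to a signed integer grid, then map them back to float."""
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def tune_layer(w_float, x_calib, num_bits=8, epochs=10, lr=1e-3):
    """Adjust one linear layer so its quantized output tracks the float output.

    Illustrative only: the scale stays fixed and only the weights are tuned.
    Returns the best weights found and their mean squared output error.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(np.abs(w_float).max(), 1e-12) / qmax
    target = x_calib @ w_float  # float layer output; no labels required

    def output_error(weights):
        diff = x_calib @ fake_quantize(weights, scale, num_bits) - target
        return np.mean(diff ** 2)

    w = w_float.copy()
    best_w, best_err = w.copy(), output_error(w)
    for _ in range(epochs):
        residual = x_calib @ fake_quantize(w, scale, num_bits) - target
        # Straight-through estimator: treat rounding as identity in the gradient.
        grad = 2.0 * x_calib.T @ residual / residual.size
        w -= lr * grad
        err = output_error(w)
        if err < best_err:  # keep the best candidate seen so far
            best_w, best_err = w.copy(), err
    return best_w, best_err
```

Because the sketch keeps the best candidate it has seen, the returned error is never worse than that of plain rounding; how much it improves depends on the data, bit width, and learning rate.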
Because fast fine-tuning takes longer than normal PTQ (though still significantly less time than QAT, since calib_dataset is much smaller than a training dataset), it is turned off by default. You can enable it if you encounter accuracy issues. A recommended workflow is to try PTQ without fast fine-tuning first, then repeat quantization with fast fine-tuning if the accuracy is unsatisfactory.
While QAT is another method to improve accuracy, it requires more time and relies on the training dataset. To activate fast fine-tuning during post-training quantization, set include_fast_ft=True.
quantized_model = quantizer.quantize_model(calib_dataset=calib_dataset, calib_steps=None, calib_batch_size=None, include_fast_ft=True, fast_ft_epochs=10)
Here,
- include_fast_ft determines whether to perform fast fine-tuning.
- fast_ft_epochs indicates the number of fine-tuning epochs for each layer.
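The recommended workflow (plain PTQ first, fast fine-tuning only if accuracy is unsatisfactory) could be wrapped in a small helper like the sketch below. The helper name quantize_with_fallback, its parameters, and the accuracy threshold are hypothetical and not part of the quantizer API; the caller supplies the quantization and evaluation steps as callables.

```python
from typing import Any, Callable, Tuple


def quantize_with_fallback(
    quantize: Callable[[bool], Any],
    evaluate: Callable[[Any], float],
    min_accuracy: float,
) -> Tuple[Any, float, bool]:
    """Try plain PTQ first; rerun with fast fine-tuning only if accuracy is too low.

    quantize(include_fast_ft) returns a quantized model.
    evaluate(model) returns its accuracy on a validation set.
    Returns (model, accuracy, used_fast_ft).
    """
    model = quantize(False)  # cheaper pass: PTQ without fast fine-tuning
    accuracy = evaluate(model)
    if accuracy >= min_accuracy:
        return model, accuracy, False

    model = quantize(True)  # slower pass: PTQ with fast fine-tuning
    return model, evaluate(model), True
```

With the call shown above, quantize could be written as lambda use_ft: quantizer.quantize_model(calib_dataset=calib_dataset, calib_steps=None, calib_batch_size=None, include_fast_ft=use_ft, fast_ft_epochs=10). The helper returns the fast fine-tuned model whenever the first pass misses the threshold, without comparing the two results; if the second result could be worse, compare the accuracies before choosing.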