The overall model quantization flow is detailed in the following figure.
The Vitis AI quantizer takes a floating-point model as input, performs pre-processing (folding batch normalization layers and removing nodes not required for inference), and then quantizes the weights/biases and activations to the given bit width.
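The core quantization step can be illustrated with a minimal sketch. This is not the Vitis AI quantizer's actual implementation; it is a generic symmetric linear quantizer, assuming a per-tensor scale derived from the largest weight magnitude, shown here only to make the "float to fixed bit width" mapping concrete.

```python
import numpy as np

def quantize(x, bit_width=8):
    """Symmetric linear quantization of a float tensor to signed integers.

    Returns the integer tensor plus the scale needed to dequantize it.
    """
    qmax = 2 ** (bit_width - 1) - 1                # e.g. 127 for 8 bits
    scale = max(np.abs(x).max() / qmax, 1e-8)      # map largest magnitude to qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# Toy weight tensor; real flows quantize every layer's weights and activations.
weights = np.array([0.5, -1.27, 0.02, 1.0], dtype=np.float32)
q, scale = quantize(weights)
dequant = q.astype(np.float32) * scale             # approximate reconstruction
```

The reconstruction error per element is bounded by half the scale, which is why narrower bit widths (larger scales) cost more accuracy.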
Before quantizing the float model, there is an optional step called inspection. The inspector examines the model and outputs partition information indicating which operators will run on which device (DPU or CPU). Because the DPU is generally faster than the CPU, the goal is to run as many operators as possible on the DPU. The partition results also include messages explaining why a given operator cannot run on the DPU, which helps users understand the DPU's capabilities and adapt their models to it.
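Conceptually, the inspector's partitioning works like the sketch below. The operator names and the supported-op set are hypothetical, chosen only to show the idea of assigning each operator to a device and attaching a reason when it falls back to the CPU; the real inspector consults the capabilities of the target DPU architecture.

```python
# Hypothetical set of operator types the target DPU supports (for illustration only).
DPU_SUPPORTED = {"conv2d", "relu", "maxpool", "add"}

def partition(ops):
    """Assign each (name, op_type) pair to DPU if supported, else CPU with a reason."""
    report = []
    for name, op_type in ops:
        if op_type in DPU_SUPPORTED:
            report.append((name, "DPU", ""))
        else:
            report.append((name, "CPU", f"op type '{op_type}' is not supported by the DPU"))
    return report

# Toy graph: two DPU-friendly ops and one that falls back to the CPU.
graph = [("conv1", "conv2d"), ("act1", "relu"), ("prob", "softmax")]
for name, device, reason in partition(graph):
    print(name, device, reason)
```

A report like this tells the user that, for example, replacing an unsupported activation with a DPU-supported one would move the whole subgraph onto the DPU.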
To capture activation statistics and improve the accuracy of the quantized model, the Vitis AI quantizer must run several iterations of inference to calibrate the activations. A calibration image dataset is therefore required as input. Generally, the quantizer works well with 100–1000 calibration images. Because no backpropagation is needed, an unlabeled dataset is sufficient.
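The calibration idea can be sketched as follows: run forward passes over unlabeled batches and record the observed activation range, from which a quantization scale can later be derived. The model function and batch shapes here are invented stand-ins, not the Vitis AI API.

```python
import numpy as np

def calibrate(model_fn, batches):
    """Run inference over calibration batches, recording the activation range.

    No labels or gradients are needed; only forward passes are performed.
    """
    lo, hi = np.inf, -np.inf
    for batch in batches:
        act = model_fn(batch)                      # forward pass only
        lo = min(lo, float(act.min()))
        hi = max(hi, float(act.max()))
    return lo, hi

# Stand-in "model": a random linear layer followed by ReLU.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))
model_fn = lambda x: np.maximum(x @ w, 0.0)

batches = [rng.standard_normal((8, 4)) for _ in range(100)]  # ~100 calibration samples
act_min, act_max = calibrate(model_fn, batches)
```

With the range in hand, the activation scale for a given bit width follows the same mapping used for weights.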
After calibration, the quantized model is transformed into a DPU-deployable model (named deploy_model.pb for vai_q_tensorflow, or model_name.xmodel for vai_q_pytorch), which follows the data format of the DPU. This model can then be compiled by the Vitis AI compiler and deployed to the DPU. The quantized model cannot be loaded by the standard TensorFlow or PyTorch frameworks.