Configuration of Quantization Strategy

Vitis AI User Guide (UG1414), Version 3.5, Release Date 2023-06-29
To support multiple quantization strategies, vai_q_pytorch accepts a quantization configuration file in JSON format.

Usage

To make a customized configuration take effect, pass the configuration file to the torch_quantizer API:
config_file = "./pytorch_quantize_config.json"
quantizer = torch_quantizer(quant_mode=quant_mode, 
                            module=model, 
                            input_args=(input,), 
                            device=device, 
                            quant_config_file=config_file)
The ./example/ directory contains three example configuration files: int_config.json, bfloat16_config.json, and mix_precision_config.json. Pass one of them through the --config_file option to quantize the model:
python resnet18_quant.py --quant_mode calib --config_file int_config.json
python resnet18_quant.py --quant_mode test --config_file int_config.json
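For reference, these commands drive a flow similar to the following minimal sketch, assuming the documented pytorch_nndct.apis interface; in practice, calibration should iterate over a representative dataset rather than a single random tensor.
import torch
from torchvision.models import resnet18
from pytorch_nndct.apis import torch_quantizer

model = resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Calibration pass: collect activation statistics under int_config.json.
quantizer = torch_quantizer(quant_mode="calib",
                            module=model,
                            input_args=(dummy_input,),
                            device=torch.device("cpu"),
                            quant_config_file="int_config.json")
quant_model = quantizer.quant_model
quant_model(dummy_input)         # forward representative data through the model
quantizer.export_quant_config()  # write the calibration result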
In the example configuration file, the model-level configuration in overall_quantize_config selects the entropy calibration method and per-tensor quantization.
"overall_quantize_config": {
  ...
  "method": "entropy",
  ...
  "per_channel": false,
  ...
},
The weights entry in tensor_quantize_config selects the maxmin calibration method and per-tensor quantization, which means that weights use a different calibration method from the model-level configuration.
"tensor_quantize_config": {
  ...
  "weights": {
    ...
    "method": "maxmin",
    ...
    "per_channel": false,
    ...
    }
In addition, the layer_quantize_config list contains one layer-level quantization configuration. This configuration matches on layer_type and sets torch.nn.Conv2d layers to per-channel weight quantization (the full example appears in Layer Configurations below).
"layer_quantize_config": [
  {
    "layer_type": "torch.nn.Conv2d",
    ...
    "overall_quantize_config": {
      ...
      "per_channel": false,

Configurations

convert_relu6_to_relu
(Global quantizer setting) Whether to convert ReLU6 to ReLU. Options: true or false.
include_cle
(Global quantizer setting) Whether to use cross layer equalization. Options: true or false.
include_bias_corr
(Global quantizer setting) Whether to use bias correction. Options: true or false.
target_device
(Global quantizer setting) Device on which the quantized model is deployed. Options: DPU, CPU, GPU.
quantizable_data_type
(Global quantizer setting) Tensor types to be quantized in the model. Options: input, weights, bias, activation.
datatype
(Tensor quantization setting) Data type used in quantization. Options: int, bfloat16, float16, float32.
bit_width
(Tensor quantization setting) Bit width used in quantization. Only applies when the data type is int.
method
(Tensor quantization setting) Method used to calibrate the quantization scale. Options: maxmin, percentile, entropy, mse, diffs. Only applies when the data type is int.
round_mode
(Tensor quantization setting) Rounding method used in the quantization process. Options: half_even, half_up, half_down, std_round. Only applies when the data type is int.
symmetry
(Tensor quantization setting) Whether to use symmetric quantization. Options: true or false. Only applies when the data type is int.
per_channel
(Tensor quantization setting) Whether to use per-channel quantization. Options: true or false. Only applies when the data type is int.
signed
(Tensor quantization setting) Whether to use signed quantization. Options: true or false. Only applies when the data type is int.
narrow_range
(Tensor quantization setting) Whether to use a symmetric integer range for signed quantization. Options: true or false. Only applies when the data type is int.
scale_type
(Tensor quantization setting) Scale type used in the quantization process. Options: float, poweroftwo. Only applies when the data type is int.
calib_statistic_method
(Tensor quantization setting) Method used to select a single optimal quantization scale when multiple batches of data yield different scales. Options: modal, max, mean, median. Only applies when the data type is int.
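To see how these settings fit together, the following sketch generates a configuration file programmatically; the parameter values are illustrative choices, not recommendations.
import json

# Illustrative configuration built from the settings described above.
config = {
    "convert_relu6_to_relu": False,
    "include_cle": True,
    "include_bias_corr": True,
    "target_device": "CPU",
    "quantizable_data_type": ["input", "weights", "bias", "activation"],
    "overall_quantize_config": {
        "datatype": "int",
        "bit_width": 8,
        "method": "percentile",
        "round_mode": "half_even",
        "symmetry": True,
        "per_channel": False,
        "signed": True,
        "narrow_range": False,
        "scale_type": "float",
        "calib_statistic_method": "max",
    },
}

with open("pytorch_quantize_config.json", "w") as f:
    json.dump(config, f, indent=2)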
Hierarchical Configuration:
Quantization configurations form a hierarchical structure.
  • If no configuration file is provided to the torch_quantizer API, the default configuration is used, which targets the DPU device and uses the poweroftwo quantization method.
  • If a configuration file is provided, the model configuration, including global quantizer settings and global tensor quantization settings, is required.
  • If only the model configuration is provided in the configuration file, all tensors in the model use that same configuration.
  • Layer configurations can be used to override specific parameters for particular layers.
Default Configurations:
The details of the default configuration are shown below.
"convert_relu6_to_relu": false,
"include_cle": true,
"include_bias_corr": true,
"target_device": "DPU",
"quantizable_data_type": [
  "input", 
  "weights", 
  "bias", 
  "activation"],
"datatype": "int",
"bit_width": 8, 
"method": "diffs", 
"round_mode": "std_round", 
"symmetry": true, 
"per_channel": false, 
"signed": true, 
"narrow_range": false, 
"scale_type": "poweroftwo", 
"calib_statistic_method": "modal"
Model Configurations:
In the example configuration file int_config.json, all tensors in the model share the same int8 quantization configuration. In this case, only the global quantization parameters need to be set, and they must appear under the "overall_quantize_config" keyword, as shown below.
  "convert_relu6_to_relu": false,
  "include_cle": false,
  "keep_first_last_layer_accuracy": false,
  "keep_add_layer_accuracy": false,
  "include_bias_corr": false,
  "target_device": "CPU",
  "quantizable_data_type": [
    "input",
    "weights",
    "bias",
    "activation"],
"overall_quantize_config": {
    "datatype": "int",
    "bit_width": 8, 
    "method": "maxmin", 
    "round_mode": "half_even", 
    "symmetry": true, 
    "per_channel": false, 
    "signed": true, 
    "narrow_range": false, 
    "scale_type": "float", 
    "calib_statistic_method": "max"
}
Similar to int_config.json, bfloat16_config.json applies the same bfloat16 quantization configuration to all tensors in the model. Only the datatype is set in the global quantization parameters, as shown below:
  "convert_relu6_to_relu": false,
  "convert_silu_to_hswish": false,
  "include_cle": false,
  "keep_first_last_layer_accuracy": false,
  "keep_add_layer_accuracy": false,
  "include_bias_corr": false,
  "target_device": "CPU",
  "quantizable_data_type": [
    "input",
    "weights",
    "bias",
    "activation"
  ],
  "overall_quantize_config": {
    "datatype": "bfloat16"
  }
Optionally, the quantization configurations of different tensor types in the model can be set separately. These configurations must be placed under the tensor_quantize_config keyword. In the example configuration file mix_precision_config.json, the global quantization datatype is bfloat16, and the datatype of bias is changed to float16. All other parameters follow the global settings.
"tensor_quantize_config": {

    "bias": {
        "datatype": "float16", 
    } 
}
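Assembled into a complete file, the mix-precision setup might look like the following sketch; the global settings here mirror the bfloat16 example above.
import json

# Sketch of a complete mix-precision configuration: the global datatype
# is bfloat16, and only bias is overridden to float16.
config = {
    "target_device": "CPU",
    "quantizable_data_type": ["input", "weights", "bias", "activation"],
    "overall_quantize_config": {"datatype": "bfloat16"},
    "tensor_quantize_config": {
        "bias": {"datatype": "float16"}
    },
}

with open("mix_precision_config.json", "w") as f:
    json.dump(config, f, indent=2)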
Layer Configurations:
Layer quantization configurations must be added to the "layer_quantize_config" list. Two configuration methods are supported: by layer type and by layer name. Note the following five points when writing a layer configuration.
  • Each individual layer configuration must be in dictionary format.
  • In each layer configuration, the quantizable_data_type and overall_quantize_config parameters are required. The overall_quantize_config parameter must include all quantization parameters for the layer.
  • If the setting is based on layer type, the layer_name parameter must be null.
  • If the setting is based on layer name, first run the calibration process, then pick the required layer name from the Python file generated in the quantize_result directory. In this case, the layer_type parameter must be null.
  • As with the model configuration, the quantization configurations of different tensor types in the layer can be set separately under the tensor_quantize_config keyword.
The example configuration file contains two layer configurations: one based on layer type and one based on layer name. In the configuration based on layer type, torch.nn.Conv2d layers are given specific quantization parameters: the per_channel parameter of weights is set to true, and the method parameter of activation is set to entropy.
{
  "layer_type": "torch.nn.Conv2d",
  "layer_name": null,
  "quantizable_data_type": [
    "weights",
    "bias",
    "activation"],
  "overall_quantize_config": {
    "bit_width": 8,
    "method": "maxmin",
    "round_mode": "half_even",
    "symmetry": true,
    "per_channel": false,
    "signed": true,
    "narrow_range": false,
    "scale_type": "float",
    "calib_statistic_method": "max"
  },
  "tensor_quantize_config": {
    "weights": {
      "per_channel": true
    },
    "activation": {
      "method": "entropy"
    }
  }
}
In the layer configuration based on layer name, the layer named ResNet::ResNet/Conv2d[conv1]/input.2 is given specific quantization parameters: the round_mode of its activation is set to half_up.
{
  "layer_type": null,
  "layer_name": "ResNet::ResNet/Conv2d[conv1]/input.2",
  "quantizable_data_type": [
    "weights",
    "bias",
    "activation"],
  "overall_quantize_config": {
    "bit_width": 8,
    "method": "maxmin",
    "round_mode": "half_even",
    "symmetry": true,
    "per_channel": false,
    "signed": true,
    "narrow_range": false,
    "scale_type": "float",
    "calib_statistic_method": "max"
  },
  "tensor_quantize_config": {
    "activation": {
      "round_mode": "half_up"
    }
  }
}
The layer name ResNet::ResNet/Conv2d[conv1]/input.2 is taken from the generated file quantize_result/ResNet.py of the example code example/resnet18_quant.py.
  • Run the example code with the python resnet18_quant.py --subset_len 100 command. This generates the quantize_result/ResNet.py file.
  • In this file, the name of the first convolution layer is ResNet::ResNet/Conv2d[conv1]/input.2, as shown in the excerpt below.
  • Copy the layer name into the quantization configuration file to give this layer a specific configuration.
import torch
import pytorch_nndct as py_nndct
class ResNet(torch.nn.Module):
  def __init__(self):
    super(ResNet, self).__init__()
    self.module_0 = py_nndct.nn.Input() #ResNet::input_0
    self.module_1 = py_nndct.nn.Conv2d(in_channels=3, out_channels=64, kernel_size=[7, 7], stride=[2, 2], padding=[3, 3], dilation=[1, 1], groups=1, bias=True) #ResNet::ResNet/Conv2d[conv1]/input.2
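A quick way to list every candidate layer name is to scan the generated file for its trailing name comments. This helper is a convenience sketch, not part of the tool:
# List the layer names recorded as trailing comments in the generated
# quantize_result/ResNet.py file.
with open("quantize_result/ResNet.py") as f:
    for line in f:
        if "#ResNet::" in line:
            print(line.rsplit("#", 1)[1].strip())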

Configuration Restrictions

Due to restrictions of the DPU device design, if int quantization is used and the quantized model is to be deployed on a DPU device, the quantization configuration must meet the following restrictions:
method: diffs or maxmin
round_mode: std_round for weights, bias, and input; half_up for activation
symmetry: true
per_channel: false
signed: true
narrow_range: true
scale_type: poweroftwo
calib_statistic_method: modal
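The following sketch writes a configuration that satisfies all of these restrictions; the per-tensor round_mode split is expressed as an activation override.
import json

# DPU-compliant int8 configuration: std_round globally (weights, bias,
# and input), with activation overridden to half_up.
dpu_config = {
    "target_device": "DPU",
    "quantizable_data_type": ["input", "weights", "bias", "activation"],
    "overall_quantize_config": {
        "datatype": "int",
        "bit_width": 8,
        "method": "diffs",
        "round_mode": "std_round",
        "symmetry": True,
        "per_channel": False,
        "signed": True,
        "narrow_range": True,
        "scale_type": "poweroftwo",
        "calib_statistic_method": "modal",
    },
    "tensor_quantize_config": {
        "activation": {"round_mode": "half_up"}
    },
}

with open("dpu_config.json", "w") as f:
    json.dump(dpu_config, f, indent=2)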

For CPU and GPU devices, there are no such restrictions. However, some configuration combinations conflict. For example, if the calibration method is maxmin, percentile, mse, or entropy, the calibration statistic method modal is not supported. If asymmetric quantization is used, the calibration methods mse and entropy are not supported. The quantization tool reports an error message when configuration conflicts occur.