To support multiple quantization strategy configurations, vai_q_pytorch accepts a quantization configuration file in JSON
format.
Usage
To make a customized configuration take effect, pass the
configuration file to the torch_quantizer API; a sketch of the flow around this call follows the snippet below.
config_file = "./pytorch_quantize_config.json"
quantizer = torch_quantizer(quant_mode=quant_mode,
                            module=model,
                            input_args=(input),
                            device=device,
                            quant_config_file=config_file)
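For orientation, a minimal sketch of what typically follows the call above is shown here. It assumes quant_mode, model, and device are defined as in the resnet18_quant.py example; evaluate() and val_loader are placeholders for your own forward-pass loop and data, and the quant_model attribute, export_quant_config(), and export_xmodel() calls follow the usual vai_q_pytorch workflow.

quant_model = quantizer.quant_model        # quantized wrapper of the original model
evaluate(quant_model, val_loader)          # forward passes collect calibration statistics

if quant_mode == 'calib':
    quantizer.export_quant_config()        # save the calibration result
elif quant_mode == 'test':
    quantizer.export_xmodel()              # export the deployable model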
The ./example/ directory
contains three example configuration files: int_config.json,
bfloat16_config.json, and
mix_precision_config.json. You can quantize the model with any of these files by passing
it through the --config_file option:
python resnet18_quant.py --quant_mode calib --config_file int_config.json
python resnet18_quant.py --quant_mode test --config_file int_config.json
In the example configuration file, the model configuration in
overall_quantize_config is set to the entropy calibration method and per-tensor
quantization.
"overall_quantize_config": {
...
"method": "entropy",
...
"per_channel": false,
...
},
The configuration of weights in tensor_quantize_config uses the maxmin
calibration method and per-tensor quantization, which means the weights use a different
calibration method from the model configuration.
"tensor_quantize_config": {
...
"weights": {
...
"method": "maxmin",
...
"per_channel": false,
...
}
In addition, the layer_quantize_config list contains one layer quantization
configuration. It is based on layer_type and sets the torch.nn.Conv2d layer to
per-channel quantization; a sketch of how these fragments combine into one file follows
below.
"layer_quantize_config": [
{
"layer_type": "torch.nn.Conv2d",
...
"overall_quantize_config": {
...
"per_channel": false,
Configurations
- convert_relu6_to_relu
- (Global quantizer setting) Whether to convert ReLU6 to ReLU. Options: True or False.
- include_cle
- (Global quantizer setting) Whether to use cross layer equalization. Options: True or False.
- include_bias_corr
- (Global quantizer setting) Whether to use bias correction. Options: True or False.
- target_device
- (Global quantizer setting) Device on which the quantized model is deployed. Options: DPU, CPU, GPU.
- quantizable_data_type
- (Global quantizer setting) Tensor types to be quantized in the model.
- datatype
- (Tensor quantization setting) Data type used in quantization. Options: int, bfloat16, float16, float32.
- bit_width
- (Tensor quantization setting) Bit width used in quantization. Only applies when the data type is int.
- method
- (Tensor quantization setting) Method used to calibrate the quantization scale. Options: maxmin, percentile, entropy, mse, diffs. Only applies when the data type is int.
- round_mode
- (Tensor quantization setting) Rounding method used in the quantization process. Options: half_even, half_up, half_down, std_round. Only applies when the data type is int.
- symmetry
- (Tensor quantization setting) Whether to use symmetric quantization. Options: True or False. Only applies when the data type is int.
- per_channel
- (Tensor quantization setting) Whether to use per_channel quantization. Options: True or False. Only applies when the data type is int.
- signed
- (Tensor quantization setting) Whether to use signed quantization. Options: True or False. Only applies when the data type is int.
- narrow_range
- (Tensor quantization setting) Whether to use a symmetric integer range for signed quantization. Options: True or False. Only applies when the data type is int.
- scale_type
- (Tensor quantization setting) Scale type used in the quantization process. Options: float, poweroftwo. Only applies when the data type is int.
- calib_statistic_method
- (Tensor quantization setting) Method used to choose one optimal quantization scale when multiple batches of data produce different scales. Options: modal, max, mean, median. Only applies when the data type is int.
Hierarchical Configuration:
Quantization configuration has a hierarchical structure.
- If no configuration file is provided to the torch_quantizer API, the default configuration is used; it is adapted to the DPU device and uses the power-of-two quantization method.
- If a configuration file is provided, the model configuration, including global quantizer settings and global tensor quantization settings, is required.
- If only the model configuration is provided in the configuration file, all tensors in the model use the same configuration.
- Layer configurations can be used to set specific configuration parameters for particular layers.
Default Configurations:
Details of the default configuration are shown
below.
"convert_relu6_to_relu": false,
"include_cle": true,
"include_bias_corr": true,
"target_device": "DPU",
"quantizable_data_type": [
"input",
"weights",
"bias",
"activation"],
"datatype": "int",
"bit_width": 8,
"method": "diffs",
"round_mode": "std_round",
"symmetry": true,
"per_channel": false,
"signed": true,
"narrow_range": false,
"scale_type": "poweroftwo",
"calib_statistic_method": "modal"
Model Configurations:
In the example configuration file int_config.json, all tensors in
the model use the same int8 quantization configuration. In this case, only the global
quantization parameters are set, and they must appear under the
"overall_quantize_config" keyword, as shown
below.
"convert_relu6_to_relu": false,
"include_cle": false,
"keep_first_last_layer_accuracy": false,
"keep_add_layer_accuracy": false,
"include_bias_corr": false,
"target_device": "CPU",
"quantizable_data_type": [
"input",
"weights",
"bias",
"activation"],
"overall_quantize_config": {
"datatype": "int",
"bit_width": 8,
"method": "maxmin",
"round_mode": "half_even",
"symmetry": true,
"per_channel": false,
"signed": true,
"narrow_range": false,
"scale_type": "float",
"calib_statistic_method": "max"
}
Similar to int_config.json,
all tensors in the model use the same bfloat16 quantization configuration in
bfloat16_config.json. Only the datatype is
set in the global quantization parameters, as shown
below:
"convert_relu6_to_relu": false,
"convert_silu_to_hswish": false,
"include_cle": false,
"keep_first_last_layer_accuracy": false,
"keep_add_layer_accuracy": false,
"include_bias_corr": false,
"target_device": "CPU",
"quantizable_data_type": [
"input",
"weights",
"bias",
"activation"
],
"overall_quantize_config": {
"datatype": "bfloat16"
}
Optionally, the quantization configuration of different tensors in
the model can be set separately. These configurations must be placed under the
tensor_quantize_config keyword. In the example configuration file mix_precision_config.json, the global
quantization datatype is bfloat16, and the datatype of bias is changed to float16. The
remaining parameters are taken from the global parameters. A sketch of the assembled
file follows the fragment below.
"tensor_quantize_config": {
"bias": {
"datatype": "float16",
}
}
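For reference, the assembled mix_precision_config.json plausibly looks much like bfloat16_config.json with this one extra block. The following sketch simply combines the two fragments shown above and is not a verbatim copy of the shipped file.

{
    "convert_relu6_to_relu": false,
    "include_cle": false,
    "include_bias_corr": false,
    "target_device": "CPU",
    "quantizable_data_type": [
        "input",
        "weights",
        "bias",
        "activation"
    ],
    "overall_quantize_config": {
        "datatype": "bfloat16"
    },
    "tensor_quantize_config": {
        "bias": {
            "datatype": "float16"
        }
    }
}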
Layer Configurations:
Layer quantization configurations must be added to the
"layer_quantize_config" list. Two configuration methods are supported: by layer type
and by layer name. Note the following five points when writing a layer configuration.
- Each individual layer configuration must be in dictionary format.
- In each layer configuration, the quantizable_data_type and overall_quantize_config parameters are required. The overall_quantize_config parameter must include all quantization parameters for this layer.
- If the setting is based on layer type, the layer_name parameter should be null.
- If the setting is based on layer name, run the calibration process first, then pick the required layer name from the Python file generated in the quantize_result directory. In this case, the layer_type parameter should be null.
- As with the model configuration, the quantization configuration of different tensors in the layer can be set separately. These must be placed under the tensor_quantize_config keyword.
In the example configuration file, there are two layer
configurations: one based on layer type and the other based on layer name. In
the layer configuration based on layer type, the torch.nn.Conv2d layer is set to
specific quantization parameters: the per_channel parameter of weights is set to
true, and the method parameter of activation is set to
entropy.
{
    "layer_type": "torch.nn.Conv2d",
    "layer_name": null,
    "quantizable_data_type": [
        "weights",
        "bias",
        "activation"],
    "overall_quantize_config": {
        "bit_width": 8,
        "method": "maxmin",
        "round_mode": "half_even",
        "symmetry": true,
        "per_channel": false,
        "signed": true,
        "narrow_range": false,
        "scale_type": "float",
        "calib_statistic_method": "max"
    },
    "tensor_quantize_config": {
        "weights": {
            "per_channel": true
        },
        "activation": {
            "method": "entropy"
        }
    }
}
In the layer configuration based on layer name, the layer named
ResNet::ResNet/Conv2d[conv1]/input.2 is set to specific quantization
parameters. The round_mode of activation in this layer is set to
half_up.
{
    "layer_type": null,
    "layer_name": "ResNet::ResNet/Conv2d[conv1]/input.2",
    "quantizable_data_type": [
        "weights",
        "bias",
        "activation"],
    "overall_quantize_config": {
        "bit_width": 8,
        "method": "maxmin",
        "round_mode": "half_even",
        "symmetry": true,
        "per_channel": false,
        "signed": true,
        "narrow_range": false,
        "scale_type": "float",
        "calib_statistic_method": "max"
    },
    "tensor_quantize_config": {
        "activation": {
            "round_mode": "half_up"
        }
    }
}
The layer name ResNet::ResNet/Conv2d[conv1]/input.2 is taken from the
generated file quantize_result/ResNet.py of the
example code example/resnet18_quant.py.
- Run the example code with the python resnet18_quant.py --subset_len 100 command. The quantize_result/ResNet.py file is generated.
- In that file, the name of the first convolution layer is ResNet::ResNet/Conv2d[conv1]/input.2.
- Copy the layer name into the quantization configuration file if this layer needs a specific configuration.
import torch
import pytorch_nndct as py_nndct

class ResNet(torch.nn.Module):
    def __init__(self):
        super(ResNet, self).__init__()
        self.module_0 = py_nndct.nn.Input() #ResNet::input_0
        self.module_1 = py_nndct.nn.Conv2d(in_channels=3, out_channels=64, kernel_size=[7, 7], stride=[2, 2], padding=[3, 3], dilation=[1, 1], groups=1, bias=True) #ResNet::ResNet/Conv2d[conv1]/input.2
Configuration Restrictions
Due to DPU design restrictions, if int quantization is
used and the quantized model is deployed on a DPU device, the quantization
configuration must meet the following restrictions (an illustrative configuration
sketch follows the list):
- method: diffs or maxmin
- round_mode: std_round for weights, bias, and input; half_up for activation
- symmetry: true
- per_channel: false
- signed: true
- narrow_range: true
- scale_type: poweroftwo
- calib_statistic_method: modal
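For illustration, a model-level configuration that satisfies these DPU restrictions could look like the sketch below. It is assembled from the restriction list and the example files above rather than copied from a shipped configuration; the activation-specific round_mode is placed under tensor_quantize_config because it differs from the round_mode of the other tensors.

"target_device": "DPU",
"quantizable_data_type": [
    "input",
    "weights",
    "bias",
    "activation"],
"overall_quantize_config": {
    "datatype": "int",
    "bit_width": 8,
    "method": "diffs",
    "round_mode": "std_round",
    "symmetry": true,
    "per_channel": false,
    "signed": true,
    "narrow_range": true,
    "scale_type": "poweroftwo",
    "calib_statistic_method": "modal"
},
"tensor_quantize_config": {
    "activation": {
        "round_mode": "half_up"
    }
}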
For CPU and GPU devices, there are no such restrictions as for the DPU. However, some configuration combinations conflict with each other. For example, if the calibration method is 'maxmin', 'percentile', 'mse', or 'entropy', the calibration statistic method 'modal' is not supported. If the symmetry mode is asymmetric, the calibration methods 'mse' and 'entropy' are not supported. The quantization tool reports an error message if the configuration contains conflicts.