The following are some tips for getting better training results:
- Load the pre-trained floating-point weights as initial values when starting quantization aware training, if possible. Training from scratch with randomly initialized weights is possible, but it makes training harder and slower. A minimal loading sketch is shown below.
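  A minimal sketch of restoring float weights before handing the model to the QAT processor; the `resnet18` architecture and the `float_model.pth` checkpoint path are placeholders for your own model and weights:

  ```python
  import torch
  from torchvision.models import resnet18

  # Placeholder architecture; substitute your own float model definition.
  model = resnet18()

  # Restore the pre-trained floating-point weights (checkpoint path is hypothetical).
  state_dict = torch.load('float_model.pth', map_location='cpu')
  model.load_state_dict(state_dict)

  # This float-initialized model is then passed to the QAT processor that
  # produces the trainable quantized model used in the next example.
  ```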
- If pre-trained floating-point weights are loaded, use different initial learning rates and learning rate decrease strategies for the network parameters and the quantizer parameters. In general, the network parameters need a small learning rate, while the quantizer parameters need a larger one; the decrease strategy can also differ per group, as sketched after the example below.
  ```python
  model = qat_processor.trainable_model()

  # Separate parameter groups so the quantizer parameters train with a larger
  # learning rate than the network weights.
  param_groups = [{
      'params': model.quantizer_parameters(),
      'lr': 1e-2,
      'name': 'quantizer'
  }, {
      'params': model.non_quantizer_parameters(),
      'lr': 1e-5,
      'name': 'weight'
  }]
  optimizer = torch.optim.Adam(param_groups)
  ```
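  A minimal sketch of a per-group decrease strategy, continuing from the optimizer above; it uses torch.optim.lr_scheduler.LambdaLR, and the decay factors, epoch count, and training step are assumptions to be tuned for your task:

  ```python
  # Different decrease strategies per parameter group (factors are assumptions):
  # the 'quantizer' group halves its learning rate every 5 epochs, while the
  # 'weight' group decays gently by 10% per epoch. The lambdas follow the
  # order of param_groups above.
  lr_lambdas = [
      lambda epoch: 0.5 ** (epoch // 5),  # 'quantizer' group
      lambda epoch: 0.9 ** epoch,         # 'weight' group
  ]
  scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambdas)

  num_epochs = 20  # hypothetical epoch count
  for epoch in range(num_epochs):
      # ... run one epoch of quantization aware training here ...
      scheduler.step()
  ```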
- For the choice of optimizer, avoid torch.optim.SGD, as it can prevent the training from converging. AMD recommends torch.optim.Adam or torch.optim.RMSprop and their variants, as in the example above or the RMSprop sketch below.
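  If RMSprop is preferred, the same parameter groups can be passed unchanged; aside from the per-group learning rates set above, this sketch keeps PyTorch's default hyperparameters:

  ```python
  # Alternative to Adam: RMSprop with the same per-group learning rates.
  optimizer = torch.optim.RMSprop(param_groups)
  ```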