SSD (Single Shot Multibox Detector)

Baseline Model

SSD (https://arxiv.org/abs/1512.02325) is a deep neural network for detecting objects in images. This example uses VGG16 as the backbone of the model.

Create a Configuration File

Create a file named config.prototxt:

workspace: "examples/decent_p/ssd/"
 
model: "examples/decent_p/ssd/float.prototxt"
weights: "examples/decent_p/ssd/float.caffemodel"
solver: "examples/decent_p/ssd/solver.prototxt"
 
gpu: "0,1,2,3"
test_iter: 10
acc_name: "detection_eval" 
ssd_ap_version: "11point"
 
rate: 0.15
 
pruner {
  method: REGULAR

  exclude {
    layer_top: "conv4_3_norm_mbox_loc"
    layer_top: "conv4_3_norm_mbox_conf"
    layer_top: "fc7_mbox_loc"
    layer_top: "fc7_mbox_conf"
    layer_top: "conv6_2_mbox_loc"
    layer_top: "conv6_2_mbox_conf"
    layer_top: "conv7_2_mbox_loc"
    layer_top: "conv7_2_mbox_conf"
    layer_top: "conv8_2_mbox_loc"
    layer_top: "conv8_2_mbox_conf"
    layer_top: "conv9_2_mbox_loc"
    layer_top: "conv9_2_mbox_conf"
  }
}

Due to the nature of the SSD network, the number of filters in some convolution layers must stay fixed, so these layers must be excluded from pruning. In the sample above, the top names of the excluded layers are listed in the "exclude" section; they are the mbox_loc and mbox_conf prediction layers attached to each feature map. In general, a convolution layer cannot be pruned if its output is compared directly with the label. For example, if the output of a convolution layer is evaluated against the label to compute top-5 accuracy, that layer must be excluded: the number of classes in the label is fixed, so the output dimensions of the layer must continue to match the label.
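The exclude list above follows a regular pattern: one "_mbox_loc" and one "_mbox_conf" entry per feature-map layer. The following Python sketch generates those entries from the six feature-map names used in this example. The helper names (exclude_entries, exclude_block) are hypothetical and not part of vai_p_caffe; the script only produces text to paste into config.prototxt.

```python
# Feature-map layers that feed the SSD prediction layers in the config above.
SOURCE_LAYERS = ["conv4_3_norm", "fc7", "conv6_2", "conv7_2", "conv8_2", "conv9_2"]


def exclude_entries(source_layers):
    """Return one layer_top line per localization/confidence layer."""
    lines = []
    for name in source_layers:
        for suffix in ("mbox_loc", "mbox_conf"):
            lines.append(f'layer_top: "{name}_{suffix}"')
    return lines


def exclude_block(source_layers, indent="  "):
    """Format the entries as an exclude { ... } section nested one level deep."""
    body = "\n".join(indent * 2 + line for line in exclude_entries(source_layers))
    return f"{indent}exclude {{\n{body}\n{indent}}}"


if __name__ == "__main__":
    print(exclude_block(SOURCE_LAYERS))
```

If the network uses different feature-map layers, only SOURCE_LAYERS needs to change; the generated names should still be checked against the top names in the model prototxt.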

Perform Model Analysis

$ ./vai_p_caffe ana -config config.prototxt

Prune the Model

$ ./vai_p_caffe prune -config config.prototxt

Finetune the Pruned Model

The following solver settings can be used as initial parameters for fine-tuning:

net: "float.prototxt"
test_iter: 229
test_interval: 500
base_lr: 0.001
display: 10
max_iter: 120000
lr_policy: "multistep"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
snapshot: 500
snapshot_prefix: "SSD_"
solver_mode: GPU
device_id: 4
debug_info: false
snapshot_after_train: true
test_initialization: false
average_loss: 10
stepvalue: 80000
stepvalue: 100000
stepvalue: 120000
iter_size: 1
type: "SGD"
eval_type: "detection"
ap_version: "11point"
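
To see what the "multistep" policy does with these values, the following Python sketch computes the learning rate at a given iteration. It assumes the standard Caffe multistep rule, in which the rate is base_lr multiplied by gamma once for every stepvalue that has been reached; the function name multistep_lr is hypothetical.

```python
import bisect


def multistep_lr(iteration, base_lr=0.001, gamma=0.1,
                 stepvalues=(80000, 100000, 120000)):
    """Learning rate under a multistep policy (defaults taken from the solver above)."""
    # Count the stepvalues that are less than or equal to the current iteration.
    steps_passed = bisect.bisect_right(stepvalues, iteration)
    return base_lr * gamma ** steps_passed


if __name__ == "__main__":
    for it in (0, 79999, 80000, 100000):
        print(it, multistep_lr(it))
```

With these settings the rate stays at 0.001 until iteration 80000 and then drops by a factor of 10 at each stepvalue.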

$ ./vai_p_caffe finetune -config config.prototxt

Estimated time required: about 50 hours for 650 epochs on the Cityscapes training set (2975 images) using 4 x NVIDIA Tesla V100 GPUs.
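
The estimate can be related to the solver settings with some simple arithmetic. The Python sketch below assumes that the 650 epochs correspond to the full max_iter of 120000 and that the 50 hours cover only those iterations; neither is stated in this document, so the derived figures are rough indications only. The function names are hypothetical.

```python
def images_per_iteration(epochs, dataset_size, max_iter):
    """Implied number of training images consumed per solver iteration."""
    return epochs * dataset_size / max_iter


def seconds_per_iteration(hours, max_iter):
    """Implied wall-clock time per solver iteration."""
    return hours * 3600 / max_iter


if __name__ == "__main__":
    # 650 epochs over 2975 images, spread across 120000 iterations (assumption).
    print(images_per_iteration(650, 2975, 120000))
    # 50 hours spread across 120000 iterations (assumption).
    print(seconds_per_iteration(50, 120000))
```

Under these assumptions, each iteration processes roughly 16 images across the four GPUs and takes about 1.5 seconds.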

Get Final Output

To get the finalized model, run the following:

$ ./vai_p_caffe transform -model baseline.prototxt -weights finetuned_model.caffemodel -output final.caffemodel
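
The four steps in this example (ana, prune, finetune, transform) can be collected in one place for review before running them. The Python sketch below only builds and prints the command lines using the subcommands, options, and file names shown in this document; it does not execute anything. The function name pipeline_commands is hypothetical.

```python
def pipeline_commands(config="config.prototxt",
                      model="baseline.prototxt",
                      weights="finetuned_model.caffemodel",
                      output="final.caffemodel",
                      tool="./vai_p_caffe"):
    """Return the argument lists for the four steps, in the order used above."""
    return [
        [tool, "ana", "-config", config],
        [tool, "prune", "-config", config],
        [tool, "finetune", "-config", config],
        [tool, "transform", "-model", model, "-weights", weights, "-output", output],
    ]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for argv in pipeline_commands():
        print(" ".join(argv))
```

Keeping the commands as argument lists makes it straightforward to pass them to subprocess.run later; when doing so, check the result of each step before starting the next one.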

Pruning Results

Dataset: Cityscapes (four classes)
Input Size: 500 x 500
GPU Platform: 4 x NVIDIA Tesla V100
FLOPs: 173G
#Parameters: 24M

Table 1. Pruning Results of SSD
Round	FLOPs	Parameters	mAP
0	100%	100%	0.571
1	50%	29%	0.587
2	9.7%	9.7%	0.559
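
The percentages in Table 1 can be converted to absolute figures. The Python sketch below assumes that each percentage is relative to the baseline values listed above (173G FLOPs, 24M parameters) and reports the change in mAP against round 0. The names used are hypothetical.

```python
BASELINE_GFLOPS = 173.0   # from "FLOPs: 173G"
BASELINE_MPARAMS = 24.0   # from "#Parameters: 24M"

# (round, FLOPs fraction, parameter fraction, mAP) copied from Table 1.
ROUNDS = [
    (0, 1.00, 1.00, 0.571),
    (1, 0.50, 0.29, 0.587),
    (2, 0.097, 0.097, 0.559),
]


def absolute_costs(rounds, gflops=BASELINE_GFLOPS, mparams=BASELINE_MPARAMS):
    """Convert the table fractions to GFLOPs, millions of parameters, and mAP change."""
    base_map = rounds[0][3]
    return [
        {
            "round": rnd,
            "gflops": flops_frac * gflops,
            "mparams": param_frac * mparams,
            "map_delta": m_ap - base_map,
        }
        for rnd, flops_frac, param_frac, m_ap in rounds
    ]


if __name__ == "__main__":
    for row in absolute_costs(ROUNDS):
        print(f"round {row['round']}: {row['gflops']:.1f} GFLOPs, "
              f"{row['mparams']:.2f}M params, mAP change {row['map_delta']:+.3f}")
```

Under that assumption, round 1 corresponds to about 86.5 GFLOPs and 6.96M parameters, and round 2 to about 16.8 GFLOPs and 2.33M parameters.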
