With fine-grained pruning, weights that have minimal effect on the output are set to zero so that the corresponding computations can be skipped or removed from the inference graph. This produces sparse matrices (that is, matrices with many zero elements). Fine-grained pruning can achieve high compression rates with a modest reduction in accuracy. However, a hardware accelerator capable of exploiting fine-grained sparsity must either be a fully customized, pipelined implementation or a more general-purpose “Matrix of Processing Engines” type of accelerator augmented with specialized hardware and techniques for weight skipping and compression.
The Vitis AI sparsity pruner implements a fine-grained pruning algorithm that supports multiple N:M sparsity patterns, where N weights are pruned in each contiguous block of M weights. Pruning is applied along the input channel dimension: for each block of M weights, the pruner sets the N weights with the smallest magnitude to zero. Typical values for M are 4, 8, or 16, with N equal to half of M, which yields 50% fine-grained sparsity.
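The N:M pattern can be illustrated with a minimal NumPy sketch. The helper below is hypothetical (it is not the Vitis AI API): for each contiguous block of M weights along the last axis, it zeroes the N smallest-magnitude values.

```python
import numpy as np

def prune_n_of_m(weights, n=2, m=4):
    """Hypothetical illustration of N:M pruning: zero the n
    smallest-magnitude values in each contiguous block of m
    weights along the last (input-channel) axis."""
    w = weights.reshape(-1, m).copy()
    # indices of the n smallest-magnitude entries in each block
    idx = np.argsort(np.abs(w), axis=1)[:, :n]
    np.put_along_axis(w, idx, 0.0, axis=1)
    return w.reshape(weights.shape)

# one output channel with 8 input-channel weights, pruned 2:4
w = np.arange(1, 9, dtype=float).reshape(1, 8)
pruned = prune_n_of_m(w, n=2, m=4)
# each block of 4 keeps only its 2 largest-magnitude weights,
# i.e. 50% sparsity
```

With N = M/2, exactly half of the weights in every block are zeroed, which is where the 50% sparsity figure comes from.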
The Vitis AI sparsity pruner supports both weight and activation sparsity for convolution and fully connected layers. Activation sparsity can be 0 or 0.5. When activation sparsity is 0, weight sparsity can be 0, 0.5, or 0.75. When activation sparsity is 0.5, weight sparsity must be 0.75.
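The supported combinations above can be captured in a small lookup table. This is an illustrative check written for this document, not part of the Vitis AI API:

```python
# Supported (activation sparsity -> allowed weight sparsities),
# as listed in the text above.
SUPPORTED = {
    0.0: {0.0, 0.5, 0.75},
    0.5: {0.75},
}

def is_supported(act_sparsity, weight_sparsity):
    """Return True if the (activation, weight) sparsity pair is
    one of the supported combinations."""
    return weight_sparsity in SUPPORTED.get(act_sparsity, set())
```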
The sparsity pruning steps are as follows:
- Generate the sparse model
- Fine-tune the sparse model
- Export the sparse model
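The three steps above can be sketched in plain NumPy. All function names here are hypothetical and stand in for the corresponding Vitis AI pruner operations: generation builds and applies an N:M mask, fine-tuning takes gradient steps while re-applying the mask so pruned positions stay zero, and export saves the sparse weights.

```python
import numpy as np

def generate_sparse(weights, n=2, m=4):
    """Step 1 (hypothetical): build a binary mask keeping the
    (m - n) largest-magnitude weights per block of m, apply it."""
    flat = np.abs(weights).reshape(-1, m)
    keep = np.argsort(flat, axis=1)[:, n:]   # indices to keep
    mask = np.zeros_like(flat)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    mask = mask.reshape(weights.shape)
    return weights * mask, mask

def fine_tune_step(weights, mask, grads, lr=0.1):
    """Step 2 (hypothetical): one gradient step, then re-apply
    the mask so the sparsity pattern is preserved."""
    return (weights - lr * grads) * mask

def export_sparse(weights, path):
    """Step 3 (hypothetical): persist the sparse weights."""
    np.save(path, weights)

w = np.arange(1., 9.).reshape(2, 4)
sparse_w, mask = generate_sparse(w, n=2, m=4)      # 50% sparsity
tuned_w = fine_tune_step(sparse_w, mask, np.ones_like(w))
```

The key design point is that fine-tuning must not disturb the pruned positions; re-applying the mask after every update keeps the N:M pattern intact.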