How Can We Be So Slow? Realizing the Performance Benefits of Sparse Networks

Click on the image above to enlarge

Abstract:

Interest in sparse neural networks has never been greater. With the exponential growth in model size, sparsity represents a powerful technique to reduce both training and inference costs. Sparsity can be applied to both weights and activations, with sparsities of up to 95%+ being achievable before model accuracy degrades irreparably. Implemented correctly, the benefit of sparsity in weights and activations is multiplicative i.e., a 10X reduction in both weights and activations translates into a 100X reduction in the computational cost of a forward pass. Unfortunately, despite the clear potential for sparse models to deliver significant performance improvements, the benefits observed to date on GPUs and CPUs have been extremely limited. Many model runtimes completely fail to exploit the benefits of sparsity, and, for those that do, 2-3X improvements in inference performance are observed for most models by leveraging weight sparsity.

CPUs and GPUs are optimized for dense, regular computations, and efficiently exploiting the irregular patterns of non-zero weights and activations in sparse networks has proved challenging. In this presentation we present novel FPGA-based sparse CNN models that concurrently leverage both activation and weight sparsity to run 100X faster than their dense counterparts and outperform sparse networks on a 24-core CPU by over 12X. We present the techniques developed to achieve this speedup from sparsity and discuss how many of the learnings could be applied to develop fast sparse networks on CPUs.

You can find additional resources on this topic from SNN 2021 here: