Our Director of ML Architecture, Lawrence Spracklen, and Research Scientist Lucas Souza are speaking at AI DevWorld, taking place October 25-27, 2022, at the San Jose Convention Center. They will be talking about how Numenta’s acceleration algorithms optimize hardware efficiency, unlocking order-of-magnitude speedups and energy savings. For those who cannot attend in person, there will be a virtual talk on November 1, 2022, from 11:00 to 11:50 AM (PT).
AI DevWorld is the world’s largest artificial intelligence developer conference with tracks covering chatbots, machine learning, open source AI libraries, AI for the enterprise, and deep AI / neural networks. This conference targets software engineers and data scientists who are looking for an introduction to AI as well as AI dev professionals looking for a landscape view on the newest AI technologies. Register here.
Abstract:
Most companies with AI models in production today are grappling with stringent latency requirements and escalating energy costs. One way to reduce these burdens is by pruning such models to create sparse, lightweight networks. Pruning involves the iterative removal of weights from a pre-trained dense network to obtain a network with fewer parameters, at some cost to model accuracy. Determining which weights to remove so as to minimize the impact on the network’s accuracy is critical. For real-world networks with millions of parameters, however, determining this analytically is often computationally infeasible; heuristic techniques are a compelling alternative.
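As a rough illustration of the kind of heuristic described above, the sketch below zeroes out the weights with the smallest absolute values, on the assumption that they contribute least to the output. The function name `magnitude_prune` and the flat list of weights are hypothetical choices made for this example; they are not taken from the talk or from any particular library.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero.

    `sparsity` is the fraction of weights to remove, between 0.0 and 1.0.
    This is an illustrative heuristic, not a production implementation.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0.0 and 1.0")

    n_prune = int(len(weights) * sparsity)
    pruned = list(weights)

    # Rank weight positions by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))

    # Zero out the n_prune weights with the smallest magnitudes.
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    dense = [0.5, -0.1, 0.03, -2.0]
    print(magnitude_prune(dense, sparsity=0.5))
```

In practice the zeroed weights would typically be tracked with a mask so that they stay at zero during any further fine-tuning.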
In this presentation, we talk about how to implement commonly used heuristics such as gradual magnitude pruning (GMP) in production, along with their associated accuracy-speed trade-offs, using the BERT family of language models as an example.
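GMP raises the target sparsity gradually over the course of training rather than pruning in a single step. The sketch below shows one widely used cubic schedule for that target; the function name, the step parameters, and the default sparsity values are assumptions made for illustration, and the schedule covered in the talk may differ.

```python
def gmp_sparsity(step, start_step, end_step,
                 initial_sparsity=0.0, final_sparsity=0.9):
    """Target sparsity at a given training step under a cubic GMP schedule.

    Sparsity stays at `initial_sparsity` until `start_step`, rises quickly
    at first, then levels off as it approaches `final_sparsity` at `end_step`.
    """
    if end_step <= start_step:
        raise ValueError("end_step must be greater than start_step")
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity

    # Fraction of the pruning window that has elapsed, in (0, 1).
    progress = (step - start_step) / (end_step - start_step)

    # Cubic interpolation: most pruning happens early, when the network
    # still has many redundant weights to give up.
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, round(gmp_sparsity(step, start_step=0, end_step=1000), 4))
```

At each pruning step, the returned value would be used as the sparsity target for a magnitude-based pruning pass, with training continuing in between so the network can recover accuracy.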
Next, we cover ways of accelerating such lightweight networks to achieve peak computational efficiencies and reduce energy consumption. We walk through how our acceleration algorithms optimize hardware efficiency, unlocking order-of-magnitude speedups and energy savings.
Finally, we present best practices on how these techniques can be combined to achieve multiplicative effects in reducing energy consumption costs and runtime latencies without sacrificing model accuracy.