Optimizing Efficiency with Sparse Neural Networks

Luis Bermudez
AI at Scale, Artificial Intelligence, CNN, Convolutional Neural Networks, Hyperparameter Optimization, Intelligent Experimentation, Machine Learning, Multimetric Optimization, Neuroscience, Research

SigOpt hosted our first user conference, the SigOpt AI & HPC Summit, on Tuesday, November 16, 2021. It was virtual and free to attend, and you can access content from the event at sigopt.com/summit. More than anything else, we were excited to host this Summit to showcase the great work of SigOpt’s customers. Today, we share some lessons from Subutai Ahmad, VP of Research at Numenta, on why sparsity is important for neural networks and how SigOpt’s Intelligent Experimentation platform helps train highly sparse networks without sacrificing accuracy.

Why is Sparsity Important for Neural Networks?

Today, CPUs and GPUs are the workhorses of AI. The best hardware performs hundreds of trillions of arithmetic operations per second, but even that is not enough to keep up with the tremendous demands imposed by AI. Strikingly, training costs have recently been doubling every few months, far faster than Moore’s law. In 2018, BERT, the state-of-the-art transformer network at the time, cost about $6,000 to train. Last year, its successor GPT-3 cost over $10 million to train, and the carbon footprint is exploding along with the cost: the energy required to train just one of these networks could power a small town or village. The chart below shows that hardware innovation simply cannot keep pace. The red curve shows the growth in deep learning compute requirements over the last few years, while the black curve shows the pace of hardware performance improvements. As you can see, there’s a big and widening disconnect.

This computation gap is a massive problem with AI today.

What are the properties of Sparsity in Neuroscience?

There are two major properties of sparsity. The first is activation sparsity: from neuroscience, we know that only about 0.5%–2% of neurons are active at any point in time. That is an incredibly small fraction, far smaller than the fraction of units active in a typical deep learning system. The second is connectivity sparsity: when you look at groups of neurons that project to one another, only about 1%–5% of the possible connections actually exist.
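Both properties are easy to measure as a fraction of non-zero entries. The sketch below simulates the two kinds of sparsity with random data at roughly the rates quoted above; the sizes and rates are illustrative, not taken from any real network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated layer activations: about 2% of units active,
# roughly the rate observed for cortical neurons.
n_units = 10_000
activations = rng.random(n_units) * (rng.random(n_units) < 0.02)

# Simulated weight matrix: about 5% of possible connections exist.
weights = rng.standard_normal((256, 256)) * (rng.random((256, 256)) < 0.05)

# Sparsity = fraction of entries that are zero.
activation_sparsity = 1.0 - np.count_nonzero(activations) / activations.size
connection_sparsity = 1.0 - np.count_nonzero(weights) / weights.size

print(f"activation sparsity: {activation_sparsity:.1%}")  # ~98%
print(f"connection sparsity: {connection_sparsity:.1%}")  # ~95%
```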

With these two properties, there is a multiplicative benefit. For example, if you achieve 90% weight sparsity, only 10% of weights are non-zero, so you can skip about 90% of the computation. If you also achieve 90% activation sparsity, only 10% of activations are non-zero. A multiply-accumulate matters only where a non-zero weight meets a non-zero activation, which happens for just 10% × 10% = 1% of weight–activation pairs on average. The combined cost is therefore the product of the two: 1% of the overall computation. This is the combined power of both sparsity properties. So if we can get anywhere close to what we have in the brain – 95%–98% weight sparsity and activation sparsity – there is a massive opportunity to exploit the combination of the two.

By comparison, this high level of sparsity makes energy usage in our brain incredibly efficient. The brain uses only about 20 to 30 watts of power. That’s less than a light bulb.

How do you implement sparsity for large scale neural networks?

Now that we understand the benefits of sparsity, how can we take advantage of them? We can create deep learning systems with these sparsity properties and use them to address the scaling issues. Simply put, we create extremely sparse networks in which, just like in the brain, the activity of the neurons is sparse. Ahmad and Scheinkman describe the algorithmic details in their paper “How Can We Be So Dense? The Benefits of Using Highly Sparse Representations” (2019).
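One core mechanism in that paper for enforcing activation sparsity is a k-winners-take-all step: keep only the k largest activations in a layer and zero out the rest. Here is a minimal NumPy sketch of that idea (the paper adds boosting and sparse weights on top of it):

```python
import numpy as np

def k_winners(x, k):
    """Keep the k largest entries of x and zero out the rest."""
    out = np.zeros_like(x)
    top = np.argpartition(x, -k)[-k:]  # indices of the k largest values
    out[top] = x[top]
    return out

x = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.05, 0.4])
sparse = k_winners(x, k=2)
print(sparse)  # only the two largest activations (0.9 and 0.8) survive
```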

After the initial implementation, the challenge is scaling sparse networks while maintaining a high level of accuracy. Some of the simple techniques in the literature do not scale well to large datasets, and it turns out that the optimal hyperparameters for training sparse networks are quite different from those typically used for dense networks. This is where the SigOpt Intelligent Experimentation platform is extremely helpful.

Hyperparameters also differ between large and small networks. Finding hyperparameters for small networks can be done relatively quickly, but the same methods are extremely slow for large networks. To find the optimal hyperparameters for large networks efficiently, Subutai’s team at Numenta used SigOpt’s multimetric feature, since they needed to balance two different metrics: high accuracy and high sparsity. They set up a multimetric SigOpt experiment with a separate threshold for each of these two metrics. For reference, there’s a screenshot of the SigOpt dashboard below. You can see that they set a minimum threshold of 96.5% for accuracy and a maximum density threshold of 0.15 – that is, they wanted at least 85% sparsity.
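An experiment of this shape can be sketched as a definition for the SigOpt Python client. The parameter names and bounds below are hypothetical placeholders, not Numenta’s actual search space; only the two metric thresholds (96.5% accuracy, 0.15 density) come from the talk:

```python
# Illustrative sketch of a SigOpt multimetric experiment definition.
# Parameter names and bounds are hypothetical; thresholds match the talk.
experiment_meta = {
    "name": "sparse-cnn-accuracy-vs-density",
    "parameters": [
        {"name": "learning_rate", "type": "double",
         "bounds": {"min": 1e-4, "max": 1e-1}},
        {"name": "weight_density", "type": "double",
         "bounds": {"min": 0.05, "max": 0.5}},
        # ...the remaining hyperparameters of the 10-dimensional search
    ],
    "metrics": [
        # Keep accuracy at or above 96.5%...
        {"name": "accuracy", "objective": "maximize", "threshold": 0.965},
        # ...while pushing density at or below 0.15 (i.e., >= 85% sparsity).
        {"name": "density", "objective": "minimize", "threshold": 0.15},
    ],
    "observation_budget": 1000,
}
```

This dictionary would then be passed to an experiment-creation call in the SigOpt client; consult SigOpt’s API documentation for the exact invocation.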

Numenta was searching for an optimal combination of 10 hyperparameters, and they needed only about a thousand trials to find the best configuration.

What are the results of Numenta’s Sparsity Experiments?

The table below shows the results for a trained sparse convolutional network alongside a dense convolutional network. As you can see, Numenta was able to train sparse convolutional networks with essentially the same accuracy as the dense network, while requiring about 10 times fewer non-zero weights. So they were able to reach over 90% sparsity while maintaining accuracy.
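The roughly 10x reduction in non-zero weights can be illustrated with a toy magnitude-pruning sketch (this is not Numenta’s training procedure, just a way to see the count drop):

```python
import numpy as np

rng = np.random.default_rng(7)

dense = rng.standard_normal((512, 512))    # dense layer: all weights non-zero
k = int(0.10 * dense.size)                 # keep the largest 10% by magnitude
threshold = np.partition(np.abs(dense).ravel(), -k)[-k]
sparse = np.where(np.abs(dense) >= threshold, dense, 0.0)

print("dense non-zeros: ", np.count_nonzero(dense))   # 262144
print("sparse non-zeros:", np.count_nonzero(sparse))  # ~26214, about 10x fewer
```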

In addition to these performance benefits, sparsity opened up new applications that they couldn’t handle before, along with corresponding gains in energy efficiency. Extremely sparse networks can reduce energy usage by more than 100 times – a two-orders-of-magnitude reduction. Furthermore, this is one of those rare cases where an increase in performance does not come with an increase in energy usage: the speedup comes purely from software algorithms, and thus does not require additional energy consumption from hardware performance boosts. So Numenta was able to reap multiple benefits from the algorithmic enhancements.


To learn more about how Numenta achieved sparsity at scale with SigOpt, I encourage you to watch the full talk. To see if SigOpt can drive similar benefits for you and your team, sign up to use it for free.  

Luis Bermudez AI Developer Advocate