Our brains only use about 30-40 watts of power, yet they are more powerful than neural networks, which take enormous amounts of energy to run. What can we learn from the brain to help us build better neural networks?
In this week’s episode of Experiment Exchange, “How Numenta Builds Neural Networks Inspired by Sparsity,” join Michael McCourt as he talks with Subutai Ahmad, VP of Research at Numenta, about his latest work. Numenta works to build neural networks that leverage structures and efficiencies found in the human brain to improve the performance of neural networks. In particular, Subutai’s research has identified sparse models which are high performing, using tools such as SigOpt’s Multimetric and Multitask Optimization.
Q: Tell us a little bit about yourself and your organization.
Numenta is a small research lab. We’re based in the San Francisco Bay Area, with about 15 to 20 people. And we’re pretty unusual because we have a two-part mission as a research lab. We look at the neocortex and how the brain works, and we try to create biological theories by reading the neuroscience literature and understanding in detail how different aspects of the neocortex work.
The second part is we try to take that learning and apply that to AI and machine learning, with the goal of trying to improve machine learning algorithms towards the eventual goal of building really intelligent machines. So we have a pretty big roadmap, as you know, we’re not ambitious at all! [Laughs] We’re just trying to understand the brain and build truly intelligent systems. But, you know, I think we’ve made a lot of progress in the first part and we published a lot and now we’re sort of tackling a lot of different aspects of how the brain works and applying that to machine learning. In a broad sweep, that’s what we do.
Q: Would you describe your work as bioinspiration?
Bioinspiration is not a bad term, but it can be very broad. We’ll get into a little bit about how we deal with sparse networks, but we really look to see how the brain solves those problems, and many times they’re quite different from the way they’re solved in the traditional machine learning world. We try to keep as many of the constraints from biology as possible, because what happens is that if you don’t, if you depart from the biology too much, then when you get into really difficult scenarios or difficult problems you don’t know where to look to solve them.
But if you’re relatively close to the biology, you can go back to the neuroscience, go back to experiments, and try to at least get another source of data, maybe more constraints that you can apply. So that’s kind of how we like to think of it.
Q: What are you focused on about the brain? What element of it is driving your work right now?
There are a few different elements. One of the things that I discussed quite a bit in my SigOpt Summit presentation and elsewhere is how we deal with sparsity. Sparsity is one aspect of the brain: very few neurons are active at any one time, and the percentage of neurons that are actually connected to one another is really small. So if you think of it as a weight matrix or an activation matrix, most of those matrices are filled with zeros. There are very few non-zeros, so it’s extremely sparse. But the brain obviously is able to operate amazingly well, and deep learning systems today are not that sparse.
One of the things we know about the brain is it’s extremely power efficient. Our brain only uses up about 30 to 40 watts of power. So less than a light bulb of power, as opposed to deep learning systems which use up sometimes a small city’s worth of energy just to train one network. One branch of our research is seeing how we can exploit sparsity to create really performant, energy efficient deep learning systems. A lot of what we used SigOpt for was in that context.
A second aspect of our research is the neuron model itself. In deep learning, the neurons are very, very simple devices. They take a linear weighted sum of their inputs, followed by a nonlinearity, whereas in biology, neurons are actually much, much more sophisticated. They’re very complex, with nonlinear, temporal, and spatial properties. We think that incorporating those aspects is going to really help unsupervised learning and continual learning in deep learning systems. So that’s the second aspect of our research: expanding what we mean by a neuron to include more of what we know from biology.
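As a rough illustration of the "very simple" deep learning neuron described above, here is a minimal Python sketch. The function name `point_neuron` and the choice of ReLU as the nonlinearity are assumptions made for illustration; this is not Numenta's code, and it does not attempt to model the richer biological neuron.

```python
def point_neuron(inputs, weights, bias=0.0):
    """Standard deep learning neuron: a linear weighted sum of the inputs,
    followed by a nonlinearity (ReLU is assumed here for illustration)."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, weighted_sum)  # ReLU: pass positive sums, clamp negatives to zero


# A positive weighted sum passes through; a negative one is clamped to zero.
print(point_neuron([1.0, 2.0], [0.5, 0.25]))
print(point_neuron([1.0, 2.0], [0.5, -1.0]))
```

Everything the neuron does fits in two lines, which is the contrast being drawn with biological neurons.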
Then there’s the last piece, which is perhaps the most fundamental thing in neuroscience. It’s that every part of your brain, regardless of whether it’s the visual system or the auditory system or your language system or high-level thought, operates on the exact same microcircuit, and that’s called a cortical column. Each cortical column is about 100,000 neurons in the human brain. We have thousands of them throughout our brain, but they have the exact same architecture. So this says that evolution has figured out a very generic algorithm that can be applied to almost anything that you think of as an intelligent function.
So a third part of our research is really understanding how that circuit works, and how you can hook them up into a larger scale that works. We feel if you can do that, you can crack the code. So that’s the big idea, that there’s a common cortical algorithm and this relies on sparsity and a more sophisticated neuron model. But it’s a very big idea that’s come out from the neuroscience.
Q: You talked about the costs of this training. Isn’t this just Moore’s Law? You add more chips in, the algorithm gets better, but it also costs more to train?
That’s a good question. Moore’s Law basically says that hardware power in some measure is going to be, I forget what he said, doubling every 18 months or something. I think that is an exponential increase. But it turns out deep learning systems are on a completely separate curve. It’s like an exponential on top of that exponential.
Hardware innovations just have not caught up. The power needs of deep learning systems are doubling every three to four months. And it’s even getting shorter than that now. It’s just insane. Hardware just isn’t keeping up. The costs are increasing proportionately, and energy usage and the carbon footprint, already a pretty big issue, are increasing dramatically.
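To make the gap between the two doubling rates concrete, here is a small back-of-the-envelope calculation in Python. The three-year horizon is an arbitrary assumption chosen for illustration; the doubling periods (18 months for hardware, three to four months for deep learning demand) are the figures quoted in the conversation.

```python
def growth_factor(months, doubling_period_months):
    """How many times a quantity multiplies if it doubles every
    `doubling_period_months` months for `months` months."""
    return 2 ** (months / doubling_period_months)


horizon_months = 36  # assumed horizon, for illustration only

hardware = growth_factor(horizon_months, 18)    # Moore's Law style doubling
demand_slow = growth_factor(horizon_months, 4)  # demand doubling every 4 months
demand_fast = growth_factor(horizon_months, 3)  # demand doubling every 3 months

print(f"hardware: {hardware:.0f}x")
print(f"demand:   {demand_slow:.0f}x to {demand_fast:.0f}x")
```

Under these assumptions hardware improves by a factor of 4 over three years while demand grows by a factor of 512 to 4096.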
We feel that one of the ways to really combat this is through algorithmic approaches and to be smarter about how we create these networks. Why not look at the brain? The brain is much more powerful than any deep learning system today, yet it only uses less than a light bulb of power.
Q: It seems like some of the tooling and algorithms there that work very well for these large distributed systems may not play as well for some of these very dense systems that you might see in neural networks. Is that true?
Exactly, there’s a mismatch between hardware and algorithms today in deep learning and AI. The hardware systems are very much focused on dense operations and GPUs are great at that and CPUs are great at that. But when you have these really sparse matrices, the overhead of figuring out where the non-zeros are can swamp any benefit on today’s hardware.
In particular, these sparse matrices are somewhat dynamic because exactly where the non-zeros are can change on an input by input basis. And so being able to do that efficiently and smartly is definitely a research area. We’ve made a lot of progress on that. We’ve been working on FPGA-based systems where you can design the circuit and you don’t need to have these large scale SIMD processors that are really suited to dense systems. You can hand construct the circuit.
One of the interesting things is, let’s say you take a vector and you want to multiply it by a matrix. If you have 90% zeros on one side, you could imagine maybe a 10X improvement in your speed. But if you have 90% zeros on the other side as well, you can actually get a 100X improvement. These things are multiplicative, because on average only 1% of the time will you actually have non-zeros on both sides. So it’s a pretty remarkable property that there’s this multiplicative benefit.
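The multiplicative effect can be checked with a short simulation. The sketch below is an illustration only, not Numenta's FPGA implementation: it builds two random vectors that are each about 90% zeros and counts how many multiplications a dot product actually needs when it multiplies only where both operands are non-zero. The helper names, the vector length, and the random seed are all assumptions.

```python
import random


def random_sparse_vector(n, sparsity, rng):
    """Return {index: value} holding only the non-zero entries.
    Each position is zero with probability `sparsity`."""
    return {i: rng.uniform(0.1, 1.0) for i in range(n) if rng.random() >= sparsity}


def sparse_dot(a, b):
    """Dot product that multiplies only where both operands are non-zero.
    Returns (result, number_of_multiplications)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the operand with fewer non-zeros
    total, multiplies = 0.0, 0
    for i, x in a.items():
        y = b.get(i)
        if y is not None:
            total += x * y
            multiplies += 1
    return total, multiplies


rng = random.Random(0)
n = 100_000
weights = random_sparse_vector(n, 0.9, rng)
activations = random_sparse_vector(n, 0.9, rng)

_, multiplies = sparse_dot(weights, activations)
print(f"dense multiplies:  {n}")
print(f"sparse multiplies: {multiplies} ({multiplies / n:.2%} of dense)")
```

With 10% non-zeros on each side, roughly 10% x 10% = 1% of positions need a multiply. Note that this counts multiplications only; the bookkeeping needed to find the matching non-zeros is exactly the overhead that can swamp the benefit on conventional hardware.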
Doing these sparse/sparse computations has been challenging, and we’ve figured out a way to handle that with FPGAs, which I described in detail in the SigOpt talk. We’ve shown we can get a 100X improvement in energy usage as well as speed using FPGA systems compared to running the dense networks on the same system. We’re now starting to look at modern Intel CPUs and GPUs, because they have some really interesting instruction sets available, like AVX-512 and others. We think we can take some of those ideas and port them over to more conventional hardware systems as well. We’re super excited about that.
Q: Let’s discuss your usage of SigOpt. How do you use SigOpt in your work?
SigOpt is a general purpose hyperparameter optimization system. SigOpt doesn’t care whether you’re working with sparse networks or not. The challenge that we have is we want to create these really high dimensional, large networks that are very sparse, but they also have to be very accurate. It’s very challenging to create really sparse, really accurate networks. Then we have a third challenge where we actually have to match it to the hardware itself. The hardware imposes certain constraints on the sparsity patterns and steps. So achieving that goal requires quite a bit of hyperparameter search.
What we found is that sparse networks work with different hyperparameter regimes than dense networks. You can’t take what you’ve learned from the dense networks and just apply them to sparse networks. You really have to look at it with fresh eyes and tune the parameters specifically for sparse networks.
With SigOpt, there are two different aspects that we used that were particularly helpful. One is the Multimetric Optimization feature. With that, we were able to set up a search process where one metric was sparsity and the other was accuracy. We had minimum thresholds on both. In the experiment that I described in the talk, we ran a thousand trials and SigOpt was able to optimize both those metrics. As it turned out, out of a thousand trials, only four actually met our criteria, a tiny part of the whole search space. There were about ten different hyperparameters we optimized, so that’s a large 10-dimensional search space. SigOpt was able to find that tiny set of solutions.
In that process, we learned a lot about how these hyperparameters relate to each other through the charting tools that SigOpt provides. That was one example of where we really successfully used it, and it would have been really hard to do it in a manual way.
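The idea of a two-metric search with minimum thresholds on both metrics can be sketched in plain Python. This is not the SigOpt API, and the trial values below are made up purely for illustration; the function names `meets_thresholds` and `pareto_front` are likewise hypothetical. The sketch shows how a set of finished trials would be filtered against the thresholds, and which trials represent the best available trade-offs between sparsity and accuracy.

```python
def meets_thresholds(trials, min_sparsity, min_accuracy):
    """Keep only the trials that satisfy the minimum threshold on both metrics."""
    return [
        t for t in trials
        if t["sparsity"] >= min_sparsity and t["accuracy"] >= min_accuracy
    ]


def pareto_front(trials):
    """Keep the trials that no other trial beats on both metrics at once."""
    front = []
    for t in trials:
        dominated = any(
            o["sparsity"] >= t["sparsity"]
            and o["accuracy"] >= t["accuracy"]
            and (o["sparsity"] > t["sparsity"] or o["accuracy"] > t["accuracy"])
            for o in trials
        )
        if not dominated:
            front.append(t)
    return front


# Hypothetical trial results, invented for this example.
trials = [
    {"id": 1, "sparsity": 0.95, "accuracy": 0.60},
    {"id": 2, "sparsity": 0.90, "accuracy": 0.75},
    {"id": 3, "sparsity": 0.50, "accuracy": 0.78},
    {"id": 4, "sparsity": 0.85, "accuracy": 0.70},
]

feasible = meets_thresholds(trials, min_sparsity=0.8, min_accuracy=0.7)
print("meet both thresholds:", [t["id"] for t in feasible])
print("best trade-offs:     ", [t["id"] for t in pareto_front(trials)])
```

In this toy data only trials 2 and 4 clear both thresholds, which mirrors the situation described above where very few trials out of many end up satisfying both criteria.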
The second aspect where we used SigOpt is when we’ve taken our system ideas and tried to apply them to really large scale systems like ResNet-50 or transformer networks like BERT, which are extremely costly to train. We used SigOpt’s Multitask feature and set up several different tasks with different costs associated with them. Then SigOpt was able to optimize our cost in addition to the overall accuracy. SigOpt would suggest cheaper experiments to run in the beginning while it figured out which combinations of parameters and which regions of the parameter space were closer to the optimum, then it slowly suggested higher and higher cost experiments to run. That was a big cost savings for us. Trying to train a thousand transformers is just impossible.
Q: Glad you were able to use SigOpt’s Multitask feature!
I’ll tell you one thing that didn’t work. Sometimes people will train a small network and then try to use those hyperparameters in the context of a large network. That, we found, does not work. Instead, what we do is use the large network all the way through and train it for different numbers of iterations. So we use a small number of iterations to get a sense of what’s working, and then every once in a while we run the full number of epochs of training. That works a lot better.
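The pattern of training the same large network for a few iterations first, and spending the full budget only on promising settings, can be sketched with successive halving. Successive halving is a stand-in technique chosen here for illustration; it is not a description of how SigOpt's Multitask feature works internally. The `toy_score` function is a made-up placeholder for "train the full-size network for this many iterations and report accuracy".

```python
def successive_halving(configs, evaluate, budgets):
    """Score every config at the cheapest budget, keep the better half,
    then repeat with the next (more expensive) budget."""
    survivors = list(configs)
    for budget in budgets:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
    return survivors


def toy_score(config, iterations):
    """Hypothetical stand-in for training the full-size network for
    `iterations` steps: longer training helps, and a learning rate
    far from 0.1 hurts."""
    return iterations / (iterations + 10) - abs(config["lr"] - 0.1)


candidates = [{"lr": 0.001}, {"lr": 0.01}, {"lr": 0.1}, {"lr": 1.0}]
best = successive_halving(candidates, toy_score, budgets=[10, 100])
print(best)
```

The network size never changes between rounds; only the number of training iterations does, which is the distinction drawn above between what worked and what did not.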
Q: How did you choose your hyperparameters that you were studying in this particular problem?
It’s a combination of two things. We had some mathematical concepts that we were operating under in how sparsity works in high dimensional scenarios. We were able to use that to choose. And then there was a bunch of stuff we didn’t have theories for, really about the learning rate and initially even things like batch size. So there were some things we had a mathematical basis for that helped us narrow it down, and then some stuff we just don’t really know, we just let SigOpt figure it out.
Q: What’s next for Numenta? What should people be looking forward to from you in 2022?
There are a bunch of things we’re working on that we’re pretty excited about. One is scaling these sparse systems to train large scale transformer networks and other large scale systems. Everything we’ve done so far has been on the inference side by speeding the inference, but we think training itself can be dramatically sped up through sparsity, so we’re working on that.
The next thing is something called continual learning. The brain, as you know, is constantly learning. You and I are learning right now, whereas a deep learning system is trained and then it’s sort of frozen from that point on; it can’t react to changes. So there are aspects of biological neurons, in addition to sparsity, that we think are critical to having systems that can learn continually. We’ll be releasing some papers soon on how we can take these biological insights and apply them to continual learning.
We’ve also started implementing ideas around cortical columns. It’s hard to say how fast that research will go, but we’re pretty optimistic and we’re pretty excited about that.
Q: What does the research around cortical columns look like? Are you trying to model the process that’s in place there?
It’s definitely an art, not a science. Maybe another way to put it is, what aspects of the biology do we need to incorporate and what aspects can we ignore?
For example, if you think about airplanes, they’re not made out of feathers and they don’t flap their wings. However, when the Wright brothers came up with flying machines, they studied birds intensely and what they figured out is that the shape of the wings matters, how the wing turns, and the fact that there’s a tail. All of those are critical to controlling powered flight—those aspects from the biology they incorporated into airplanes—but feathers and flapping wings, they didn’t.
So, there’s not an easy answer to that. Part of it is we try to understand conceptually what’s going on and then see how we can translate that into code. We definitely have to make sure we don’t get too bogged down in the details because there’s so much detail in the neuroscience and most of that probably is not going to be relevant to AI. But some critical things will be very relevant. So we’re trying to build an understanding of it and then coming at it from the computer science point of view or the machine learning point of view, and seeing how to incorporate that. That’s a lot of what our research meetings and discussions are all about.
Q: Where can people go to learn more about your research?
You can go to Numenta.com, our website, which has a list of a lot of our papers and so on. We are actually very, very open with our code and with our concepts, and we publish a lot. You can go to our YouTube channel, where you can actually see some of our research meetings and talks. We have a lot of blog posts that try to explain different aspects of sparsity and continual learning, the neuron model, and so on. Our founder, Jeff Hawkins, has also recently published a book called A Thousand Brains: A New Theory of Intelligence that tries to explain this idea of cortical columns and how they might be put together into an intelligent system.
From SigOpt, Experiment Exchange is a podcast where we find out how the latest developments in AI are transforming our world. Host and SigOpt Head of Engineering Michael McCourt interviews researchers, industry leaders, and other technical experts about their work in AI, ML, and HPC — asking the hard questions we all want answers to. Subscribe through your favorite podcast platform including Spotify, Google Podcasts, Apple Podcasts, and more!