SigOpt’s mission is to empower experts with software that streamlines experimentation and makes it easier for them to realize the full potential of their research. Core to this mission is our Academic Program, which offers SigOpt for free to academia, nonprofits and others doing research for good. In this latest Academic Interview, Brady Neal from Mila (Quebec Artificial Intelligence Institute) discusses the paper “A Modern Take on the Bias-Variance Tradeoff in Neural Networks” and its implications for deep learning practitioners. In the course of conducting this research, Brady used a free version of SigOpt to tune hyperparameters. We hope you enjoy this edition of our Academic Interview series.
What is your research subject?
We study generalization in deep learning, through the lens of the bias-variance tradeoff that is commonly taught in introductory ML courses in the context of underfitting vs. overfitting. The seminal bias-variance tradeoff paper suggested that larger neural networks have reduced bias at the expense of higher variance. We demonstrate that this is not necessarily the case, as we find both bias and variance decrease with network width (size of each layer). We offer more in-depth experimental and theoretical analysis to help illustrate why.
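For readers who want the textbook statement behind this answer, the standard bias-variance decomposition for squared error can be written as follows. This is the usual form, not notation taken from the paper: it assumes labels of the form $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$ that is independent of the training set, and it writes $h_S$ for the predictor learned from a training set $S$. All of these symbols are introduced here for illustration.

```latex
\mathbb{E}_{S,\varepsilon}\!\left[\big(y - h_S(x)\big)^2\right]
  = \underbrace{\sigma^2}_{\text{noise}}
  + \underbrace{\big(\mathbb{E}_S[h_S(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}_S\!\big(h_S(x)\big)}_{\text{variance}}
```

The traditional tradeoff story says that growing the model shrinks the bias term while inflating the variance term.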
What made you interested in it?
Generalization is at the core of machine learning. Learned models are only useful if they can generalize to unseen data from a finite set of examples.
The bias-variance tradeoff suggests that large models should not be able to generalize well due to high variance (fitting the noise in the data). The naive interpretation of the bias-variance tradeoff seems to be increasingly suspect in the modern era where we use huge models. This makes it a particularly relevant subject as teams tackle increasingly complex models.
What are your main conclusions?
We have a few conclusions that I’ll quickly describe. Please refer to the paper for more context and detailed information regarding these conclusions. In short:
- It is not necessary to trade bias for variance in neural networks as you increase their capacity.
- We separate variance due to sampling of the training set (what is normally simply referred to as “variance” in simple settings) from variance due to initialization. We find that variance due to training set sampling is roughly constant with both width and, surprisingly, depth. Variance due to initialization decreases with width and increases with depth, in our setting.
- Some of this behavior can already be seen in over-parameterized linear models. We provide a more general theoretical result in a simplified setting, inspired by linear models.
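The split between the two sources of variance can be illustrated with the law of total variance: total variance equals the expected variance over initializations (for a fixed training set) plus the variance over training sets of the initialization-averaged prediction. The sketch below is a minimal illustration, not the paper's experimental code. It assumes a toy setup that does not come from the interview: a hypothetical `target` function, a one-dimensional regression task, and a random-feature model in which a frozen random first layer stands in for "initialization" and a minimum-norm least-squares fit of the output layer stands in for "training". The names `fit_predict` and `decompose` are likewise invented for this example.

```python
import numpy as np


def target(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(2 * np.pi * x)


def fit_predict(x_train, y_train, x_test, width, init_rng):
    # "Initialization": a random first layer that stays frozen.
    w = init_rng.normal(size=width)
    b = init_rng.normal(size=width)
    feats_train = np.tanh(x_train[:, None] * w[None, :] + b)
    feats_test = np.tanh(x_test[:, None] * w[None, :] + b)
    # "Training": least squares for the output layer; when width exceeds
    # the number of training points, lstsq returns the minimum-norm solution.
    coef, *_ = np.linalg.lstsq(feats_train, y_train, rcond=None)
    return feats_test @ coef


def decompose(width, n_train=20, n_sets=10, n_inits=10, noise=0.1, seed=0):
    data_rng = np.random.default_rng(seed)
    x_test = np.linspace(0.0, 1.0, 50)

    # Draw every training set once so that each one is paired with
    # every initialization (a fully crossed grid).
    train_sets = []
    for _ in range(n_sets):
        x = data_rng.uniform(0.0, 1.0, size=n_train)
        y = target(x) + noise * data_rng.normal(size=n_train)
        train_sets.append((x, y))

    preds = np.empty((n_sets, n_inits, x_test.size))
    for s, (x, y) in enumerate(train_sets):
        for i in range(n_inits):
            # Seed depends only on i, so initialization i is identical
            # across training sets.
            init_rng = np.random.default_rng(seed + 1000 + i)
            preds[s, i] = fit_predict(x, y, x_test, width, init_rng)

    mean_pred = preds.mean(axis=(0, 1))
    return {
        # Squared bias of the average predictor, averaged over test points.
        "bias_squared": float(np.mean((mean_pred - target(x_test)) ** 2)),
        # Variance over both sources of randomness.
        "total_variance": float(preds.var(axis=(0, 1)).mean()),
        # E_S[Var_I(h | S)]: variance over inits, averaged over training sets.
        "variance_init": float(preds.var(axis=1).mean()),
        # Var_S(E_I[h | S]): variance over training sets of the init-average.
        "variance_sampling": float(preds.mean(axis=1).var(axis=0).mean()),
    }


if __name__ == "__main__":
    for width in (5, 50, 500):
        print(width, decompose(width))
```

Because the grid of training sets and initializations is fully crossed and the variances are population variances, `variance_init + variance_sampling` matches `total_variance` up to floating-point error. The numbers this toy model prints should not be read as a reproduction of the paper's results; the sketch only shows how the two variance terms are computed.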
Which models did you use and how did you select them?
Basic, fully connected neural networks. We use simple models to demonstrate our results hold in as simple an experimental setting as possible.
Why did you select SigOpt to optimize these models?
SigOpt offers software that automates critical tasks for research. They combine automated state-of-the-art tuning and automated visualizations for interpreting tuning runs in an easy-to-use hosted solution. And, importantly, they offer it for free to academics as a way to support the community.
While the high performance Bayesian optimization SigOpt uses typically shines when it comes to high-dimensional problems, I found it still beat random search when optimizing just a single hyperparameter. It is exponentially (in the number of hyperparameters) more efficient than random search.
How would you characterize the benefit of using SigOpt?
SigOpt turned the time-consuming, uncertain task of hyperparameter tuning into an easy API call. Equally important, SigOpt’s solution includes a web dashboard that provides convenient graphs to monitor and diagnose potential problems with hyperparameter optimization.
For whom is your research most valuable?
It provides value to anyone who knows what the bias-variance tradeoff is (taught in introductory courses). It is most valuable to those who are studying generalization in machine learning. Although many practitioners already use large networks without fear of overfitting, this work can help them reconcile this reality with the traditional view of the bias-variance tradeoff.
How do you expect to continue evolving this research in the future?
This research has been evolving for about a year. We are just about done with it, but we hope that others begin using the bias-variance lens to analyze generalization. The bias-variance lens provides a noticeably different perspective. Also, we hope this clear demonstration of bias-variance not behaving as it is often caricatured will conceptually help those who are developing models with many parameters.
What would you most love to see as an improvement to SigOpt’s product to assist you in further research?
I don’t think I used SigOpt even close to its full capabilities, and I’m not that familiar with the features it already has (beyond the most basic ones), so I’m not really qualified to answer this.