Tools like neon, Caffe, Theano, and TensorFlow make it easier than ever to build custom neural networks and to reproduce groundbreaking research. Current advancements in cloud-based platforms like the Nervana Cloud enable practitioners to seamlessly build, train, and deploy these powerful methods. Finding the best configurations of these deep nets and efficiently tuning their parameters, however, remains one of the most limiting aspects in realizing the benefit of these techniques.
“…we still don’t really know why some configurations of deep neural networks work in some cases and not others, let alone having a more or less automatic approach to determining the architectures and the hyperparameters.”
VP Engineering at Quora (former Director of Research at Netflix)
Bayesian optimization represents an approach to solving the problem of parameter tuning in a robust way. SigOpt was designed to help solve this problem directly by providing a simple, scalable, and stable ensemble of these methods. SigOpt has been shown to outperform standard methods of model tuning like random and grid search as well as open source global optimization alternatives across a wide variety of machine learning models, including neural networks.
The Nervana Cloud is built to be an underlying deep learning infrastructure and democratizes access to the tools and infrastructure necessary to create neural nets. It removes the need to manage infrastructure, facilitates data ingestion from a variety of sources, aids data exploration, is powered by the fastest deep learning framework, and can seamlessly be deployed in production applications.
By combining the ease of building, training, and deploying neural networks on the Nervana Cloud with the efficient tuning of SigOpt, the promise of easily accessible, production-quality deep learning is now a reality for everyone. In this post we will detail how to use SigOpt with the Nervana Cloud and show how SigOpt and Nervana are able to reproduce, and beat, the state-of-the-art performance in two papers.
Figure 1: The SigOpt Optimization Loop. SigOpt bolts on top of machine learning systems like the Nervana Cloud via a simple REST API. First, SigOpt suggests parameter configurations. These configurations are evaluated on the Nervana Cloud. The observed results are then reported back to SigOpt and the process is repeated, converging on an optimal configuration.
How does Bayesian optimization work?
Bayesian optimization is fundamentally about making the best possible decision of what parameter configuration to evaluate next for a model, given an objective to maximize and previous observations about how historical configurations have performed. Inherently it is making a tradeoff between exploration and exploitation. Exploration learns more about the model, the length scales over which individual parameters vary, and how they combine to influence the overall objective. Exploitation uses the historical knowledge already gained from previous observations to suggest the best possible next configuration, maximizing the expected result. By efficiently and automatically making these tradeoffs, Bayesian optimization techniques can quickly find the global optima of difficult-to-optimize problems, often much faster than traditional methods like local search or brute-force methods like grid or random search that ignore historical information.
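To make the exploration/exploitation tradeoff concrete, here is a minimal sketch of expected improvement, one acquisition function commonly used in the Bayesian optimization literature. This post does not say which acquisition functions SigOpt's ensemble uses, so treat this as an illustrative assumption rather than a description of SigOpt. The function name and arguments are hypothetical: `mu` and `sigma` stand for a surrogate model's predicted mean and standard deviation at a candidate configuration, and `best_so_far` is the best objective value observed to date.

```python
import math


def expected_improvement(mu, sigma, best_so_far):
    """Expected improvement of a candidate over the best observed value.

    Assumes a maximization problem and a Gaussian prediction N(mu, sigma^2)
    from some surrogate model (an assumption, not SigOpt's actual method).
    """
    if sigma <= 0.0:
        # No uncertainty left: the improvement is known exactly.
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # First term rewards a high predicted mean (exploitation),
    # second term rewards high uncertainty (exploration).
    return (mu - best_so_far) * cdf + sigma * pdf


# Two hypothetical candidates with the same predicted mean:
# the more uncertain one scores higher, so it would be explored first.
confident = expected_improvement(mu=0.90, sigma=0.01, best_so_far=0.90)
uncertain = expected_improvement(mu=0.90, sigma=0.05, best_so_far=0.90)
print(confident < uncertain)
```

A candidate can score well either because its predicted value is high or because the model is unsure about it, which is the tradeoff described above expressed as a single number.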
There are many open source packages devoted to providing interfaces to this research including MOE, Spearmint, SMAC, and Hyperopt. In practice these tools usually represent a single Bayesian optimization approach and are often too infrastructurally brittle to be deployed generically in production environments, even after the substantial administration required to get them up and running. SigOpt represents an ensemble of the state-of-the-art of Bayesian Optimization research behind a parallel API, allowing users to get the best of the research, without any of the overhead. More information about the underlying optimization methods in this post can be found on the SigOpt research page.
Bayesian optimization of deep neural networks
Deep neural networks (DNNs) have many hyperparameters to tune, including learning rate, weight decay, momentum, weight initializations, number of activations, number of layers, and batch size, among many others. Even a simple neural network can have dozens of parameters (Figure 2), and often these parameters are continuous variables with only loose guidelines on numerical ranges. Tuning all of these knobs has historically been more an art than a science. Even small changes can greatly impact whether or not a model converges and produces good results. Further complicating the problem, many of these parameters have coupling effects. This leads to much wasted time on trial and error, as each configuration of parameters must be tried and iterated on based on results. Training DNNs can be extremely time-consuming given the volume of data and compute density required, which means that Bayesian optimization can be especially helpful in tuning DNNs.
Figure 2: A visualization of a simple neural network for classifying binary data in a 2D space that has 22 different tunable parameters. Finding the best configuration via an exhaustive grid search would be computationally intractable.
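A quick back-of-the-envelope calculation shows why an exhaustive grid over 22 parameters is out of reach. The sketch below assumes every parameter is discretized to the same number of values, which is a simplification chosen for illustration; the helper name is hypothetical.

```python
def grid_size(n_params, values_per_param):
    """Number of configurations in a full grid with equal resolution per parameter."""
    return values_per_param ** n_params


N_PARAMS = 22  # tunable parameters in the network of Figure 2

for values in (2, 3, 5):
    print(f"{values} values per parameter -> {grid_size(N_PARAMS, values):,} configurations")
```

Even the coarsest possible grid, two values per parameter, requires more than four million training runs, and five values per parameter pushes the count past 10^15.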
Using SigOpt and Nervana Cloud to train DNNs
SigOpt and Nervana Cloud work well together to help data scientists produce high-accuracy trained DNN solutions quickly (Figure 1). First, the data scientist selects a deep learning model architecture from Nervana’s state-of-the-art model library. The particular starting model will depend on the problem to be solved. For example, Fast Region-based Convolutional Neural Networks (Fast R-CNN) are well-suited for detecting object locations in images, while Deep Residual Nets are better at classification. Once the data scientist has selected the model architecture to start with, they customize the model script based on various factors such as the specific problem in question, the format of the input data, etc. Finally, the model script is updated to support hyperparameter suggestions coming from the SigOpt REST API rather than from manual inputs by the data scientist or default values from the model.
After each suggested parameter configuration is received from SigOpt, the model script is submitted to Nervana Cloud for training and evaluation. Nervana Cloud has been optimized down to the silicon to handle the most complex deep learning training at scale, which has enabled Nervana Cloud to achieve training speeds 10x faster than conventional GPU-based systems and frameworks, so each training run takes place at world-class speed.
Through SigOpt integration, the model script conducts many training evaluations in succession, each using a unique suggested configuration of hyperparameters. After an evaluation has completed, the observed accuracy (or any other objective metric) of the trained model is measured and reported back to SigOpt. Based on this result, SigOpt suggests a new set of hyperparameters to be evaluated next. The process continues, and ultimately, the system converges on the best combination of hyperparameters. SigOpt finds good parameter configurations 10x faster than traditional approaches like grid search, allowing experts to get to the best version of their models in less time and with less compute overhead.
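The suggest, evaluate, report loop described above can be sketched in a few lines. Everything in this example is a hypothetical stand-in so that it runs without network access or a training cluster: `suggest` plays the role of the optimizer (here plain random sampling, not SigOpt's Bayesian methods), `train_and_evaluate` plays the role of a training run on the Nervana Cloud (here a toy function), and the two-parameter search space is a subset chosen for brevity.

```python
import random

# Hypothetical search space: name -> (lower bound, upper bound).
SPACE = {
    "log_learning_rate": (-3.0, -0.3),
    "momentum_coef": (0.001, 0.999),
}


def suggest(history, rng):
    """Stand-in optimizer: uniform random sampling that ignores the history.

    A Bayesian optimizer would use `history` to pick the next configuration.
    """
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in SPACE.items()}


def train_and_evaluate(params):
    """Stand-in for a training run: a toy 'accuracy' with a single peak."""
    return (
        1.0
        - 0.1 * (params["log_learning_rate"] + 1.5) ** 2
        - (params["momentum_coef"] - 0.9) ** 2
    )


def optimize(budget, seed=0):
    """Run the loop for `budget` evaluations and return the best observation."""
    rng = random.Random(seed)
    history = []
    for _ in range(budget):
        params = suggest(history, rng)      # 1. receive a suggestion
        value = train_and_evaluate(params)  # 2. train and measure the objective
        history.append((params, value))     # 3. report the observation back
    return max(history, key=lambda observation: observation[1])


best_params, best_value = optimize(budget=20)
print(best_params, best_value)
```

In a real integration, steps 1 and 3 would be calls to the SigOpt REST API and step 2 would submit the model script for training; the shape of the loop stays the same.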
The combination of Nervana Cloud and SigOpt enables the data scientist to achieve a level of accuracy much faster and in far fewer steps than with standard methods, and without the need for manual expert tuning.
Convolutional Neural Network Error Reduction
By combining SigOpt and neon on the Nervana Cloud, we can find better parameter configurations, more efficiently, than both standard tuning methods and the expert-tuned published results for an all convolutional neural network. These architectures are growing in popularity and consist solely of convolutional layers. Notably, pooling and affine layers are supplanted with strided convolutions and one-by-one filter convolutions, respectively. All convolutional nets allow for learned spatial sampling, dense labeling, and image generation, among other uses.
Convolutional neural networks contain many tunable parameters that affect performance, including learning rates, epochs, stochastic gradient descent parameters, and more. Finding the best configuration for these parameters is extremely non-intuitive and can often be very dataset dependent. In practice the configurations published or used in production are often found via an expertly refined grid or random search. Bayesian optimization represents a way to get to better configurations in fewer evaluations, without requiring expert administration.
Neon is equipped with an extensive model zoo including the all convolutional neural network from Springenberg et al. We compared the expert fine tuning of this network on the CIFAR-10 dataset to SigOpt and random search, a common non-Bayesian hyperparameter tuning strategy. SigOpt was able to achieve better results than the standard method while also requiring fewer total iterations. In addition, it was able to improve upon the expertly tuned results from the paper, reducing the relative error rate by 1.6%.
Figure 3: The best found trace of SigOpt and random search. The line represents the accuracy on the validation set achieved by the best performing configuration from each optimization strategy after a number of total evaluations. SigOpt is usually able to find good optima after a number of evaluations equal to 10 to 20 times the number of parameters being tuned. In this case we tuned 9 hyperparameters(1).
| Method | Accuracy Achieved | Required Evaluations | Error reduction vs Expert Baseline |
Table 1: A comparison of various hyperparameter tuning methods for tuning an all convolutional neural net. SigOpt was able to get better results than the standard method and the expert, while also requiring far fewer total evaluations than the standard method, saving expert and computational time while also producing a better model.
By using the ensemble of Bayesian optimization methods provided by SigOpt we were able to find a better result faster than the standard random search approach, while also improving the expert result without any expert time required.
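The "relative error rate" reductions quoted in this post compare error rates, not accuracies. A minimal sketch of that calculation follows; the function name is hypothetical and the accuracies in the usage example are made-up round numbers chosen only to show the arithmetic, not results from these experiments.

```python
def relative_error_reduction(baseline_accuracy, tuned_accuracy):
    """Fraction by which the error rate shrinks relative to the baseline."""
    baseline_error = 1.0 - baseline_accuracy
    tuned_error = 1.0 - tuned_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline has no error left to reduce")
    return (baseline_error - tuned_error) / baseline_error


# Illustrative numbers only: going from 90.0% to 91.5% accuracy
# cuts the error rate from 10.0% to 8.5%.
reduction = relative_error_reduction(baseline_accuracy=0.900, tuned_accuracy=0.915)
print(f"{reduction:.1%}")
```

This is why a small change in absolute accuracy can correspond to a large relative error reduction when the baseline is already accurate.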
15% Deep Residual Net ImageNet Error Reduction
These results are not limited to only neural networks with specific architectures, but can be extended to any method. We show that using SigOpt and the Nervana Cloud we were able to reduce the relative error in a deep residual network (ResNet) as well. These methods currently hold the crown in the annual ImageNet competition, achieving a 3.57% top-5 error rate. These networks surpass even humans in image classification in many cases. Depth has been shown to be an extremely important attribute in the quality of neural networks. Increasing depth naively, however, leads to a host of optimization challenges. ResNets stem from the residual learning framework. The authors argue that learning perturbations from the identity function is easier than learning an unreferenced function. Their final network is composed of 152 layers and can take up to a few weeks to train.
In this example we took the deep residual network from He et al. and compared the expert fine tuning to SigOpt, using the implementation in Neon’s model zoo on the CIFAR-10 dataset. Random search was omitted from this example due to time constraints: these networks typically take a very long time to evaluate, which highlights the need for intelligent hyperparameter search methods. SigOpt was able to improve upon the expertly tuned results from the paper, reducing the relative error rate by 15%, without requiring any of the expert time usually needed for fine tuning.
| Method | Accuracy Achieved | Required Evaluations | Error reduction vs Expert Baseline |
Table 2: SigOpt was able to reduce the relative error rate by 15% compared to an expertly tuned baseline by tuning 9 hyperparameters in the model(4).
Figure 4: SigOpt was able to reduce the relative error rate by 15% compared to an expertly tuned baseline, and it did so exponentially faster (after 130 iterations) than standard methods like grid search.
Combining Nervana Cloud with SigOpt allows users to train and tune neural networks a combined 100x faster than standard methods while also producing better results.
(1): Parameters optimized in the All CNN experiment:
- epochs : number of passes over dataset during SGD – int [50, 500]
- log(learning_rate) : step size in SGD – double [-3.0, -0.3]
- log(weight_decay) : weight decay in SGD – double [-3.0, 0.0]
- gaussian_scale : standard dev of weight initialization – double [0.01, 0.5]
- momentum_coef : momentum term in SGD – double [0.001, 0.999]
- momentum_step_change : mul amount to decrease momentum – double [0.001, 0.999]
- momentum_step_schedule_start : epoch to start momentum decay – double [50, 300]
- momentum_step_schedule_step_width : # epoch between momentum decay – int [5, 100]
- momentum_step_schedule_steps : how many momentum decays to occur – int [1, 20]
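As a sketch, the nine parameters above could be written down as a list of name, type, and bounds entries like the one below. The dictionary layout is modeled on the style of parameter definitions used by optimization APIs such as SigOpt's, but the exact field names are an assumption here and should be checked against the current API documentation. The identifiers `log_learning_rate` and `log_weight_decay` are renamed from the list above to be valid names, and the `validate` helper is hypothetical.

```python
# Search space for the All CNN experiment, transcribed from the list above.
ALL_CNN_PARAMETERS = [
    {"name": "epochs", "type": "int", "bounds": {"min": 50, "max": 500}},
    {"name": "log_learning_rate", "type": "double", "bounds": {"min": -3.0, "max": -0.3}},
    {"name": "log_weight_decay", "type": "double", "bounds": {"min": -3.0, "max": 0.0}},
    {"name": "gaussian_scale", "type": "double", "bounds": {"min": 0.01, "max": 0.5}},
    {"name": "momentum_coef", "type": "double", "bounds": {"min": 0.001, "max": 0.999}},
    {"name": "momentum_step_change", "type": "double", "bounds": {"min": 0.001, "max": 0.999}},
    {"name": "momentum_step_schedule_start", "type": "double", "bounds": {"min": 50, "max": 300}},
    {"name": "momentum_step_schedule_step_width", "type": "int", "bounds": {"min": 5, "max": 100}},
    {"name": "momentum_step_schedule_steps", "type": "int", "bounds": {"min": 1, "max": 20}},
]


def validate(parameters):
    """Check that names are unique and every range is non-empty."""
    names = [p["name"] for p in parameters]
    if len(names) != len(set(names)):
        raise ValueError("duplicate parameter name")
    for p in parameters:
        if p["type"] not in ("int", "double"):
            raise ValueError(f"unknown type for {p['name']}")
        if p["bounds"]["min"] >= p["bounds"]["max"]:
            raise ValueError(f"empty range for {p['name']}")
    return len(parameters)


print(validate(ALL_CNN_PARAMETERS))
```

Searching over the logarithm of the learning rate and weight decay, as the list above does, lets the optimizer cover several orders of magnitude evenly instead of concentrating on the largest values.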
(2): Accuracy observed on the Nervana Cloud using the hyperparameters published in the paper.
(3): The accuracy reported in the paper.
(4): Parameters optimized in the DeepRes experiment:
- depth : number of layers in net – int [1, 20]
- all the same parameters and domains as All CNN(1)