$\def\a{\alpha} \def\e{\varepsilon} \def\s{\sigma} \def\RR{\mathbb{R}} \def\mC{\mathsf{C}} \def\mK{\mathsf{K}} \def\mI{\mathsf{I}} \def\ggamma{\boldsymbol{\gamma}} \def\kk{\boldsymbol{k}} \def\uu{\boldsymbol{u}} \def\vv{\boldsymbol{v}} \def\ww{\boldsymbol{w}} \def\xx{\boldsymbol{x}} \def\xopt{\xx_{\text{opt}}} \def\yy{\boldsymbol{y}} \def\zz{\boldsymbol{z}} \def\cD{\mathcal{D}} \def\cX{\mathcal{X}} \def\lmle{L_{\text{MLE}}} \def\lmple{L_{\text{MPLE}}} \def\lkv{L_{\text{KV}}} \def\lclv{L_{\text{CLV}}} \def\ppa{\frac{\partial}{\partial\a}} \DeclareMathOperator*{\argmin}{argmin}$

# TensorFlow ConvNets on a Budget with Bayesian Optimization

In this post on integrating SigOpt with machine learning frameworks, we will show you how to use SigOpt and TensorFlow to efficiently search for an optimal configuration of a convolutional neural network (CNN). There are a large number of tunable parameters associated with defining and training deep neural networks [1], and SigOpt accelerates searching through these settings to find optimal configurations. This search is typically slow and expensive, especially with standard techniques like grid or random search, because evaluating each configuration can take multiple hours. SigOpt finds good combinations far more efficiently than these standard methods by employing an ensemble of state-of-the-art Bayesian optimization techniques, allowing users to arrive at the best models faster and more cheaply.

In this example, we consider the same optical character recognition task on the SVHN dataset that was discussed in a previous post. Our goal is to build a model capable of recognizing digits (0-9) in small, real-world images of house numbers. We use SigOpt to efficiently find a good structure and training configuration for a convolutional neural net. Check out the code here if you’d like to start experimenting!

### Convolutional Neural Net Structure

The structure and topology of a deep neural network can have dramatic implications for performance on a given task [2]. Many small decisions go into the connectivity and aggregation strategies for each of the layers that make up a deep neural net, and these parameters can be non-intuitive to choose in an optimal, or even acceptable, fashion. In this experiment, we used a TensorFlow CNN example designed for the MNIST dataset as a starting point. Figure 1 represents a typical CNN structure, highlighting the parameters we chose to vary in this experiment. A more complete discussion of these architectural decisions can be found in an online course from Stanford [3]. Note that Figure 1 only approximates the architecture used in this example; the code serves as a more complete reference.

Figure 1: Representative convolutional neural net topology. Important parameters include the width and depth of the convolutional filters, as well as dropout probability [4].
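As a rough illustration of why filter width and depth are worth tuning, the sketch below counts the trainable parameters in a single convolutional layer. The function name and the example values (5x5 and 7x7 filters, 3 input channels for RGB images, 32 and 64 filters) are assumptions chosen for illustration; they are not the settings used in this experiment.

```python
def conv_layer_params(filter_width, in_channels, depth):
    """Trainable parameters in one convolutional layer with square filters.

    Each of the `depth` filters holds filter_width * filter_width * in_channels
    weights plus a single bias term.
    """
    weights_per_filter = filter_width * filter_width * in_channels
    return (weights_per_filter + 1) * depth


# Hypothetical first layer: 5x5 filters over RGB input, 32 filters.
small = conv_layer_params(filter_width=5, in_channels=3, depth=32)
# Widening the filters to 7x7 and doubling the depth grows the layer quickly.
large = conv_layer_params(filter_width=7, in_channels=3, depth=64)
print(small, large)  # 2432 9472
```

Even for one layer, modest changes in width and depth roughly quadruple the number of weights to fit, which is part of why these choices interact with training time and accuracy.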

TensorFlow has greatly simplified the effort required to build and experiment with deep neural network (DNN) designs. Tuning these networks, however, is still an incredibly important part of creating a successful model. The optimal structural parameters often highly depend on the dataset under consideration. SigOpt offers Bayesian optimization as a service to minimize the amount of trial and error required to find good structural parameters for DNNs and CNNs.

### Stochastic Gradient Descent Parameters ($$\alpha$$, $$\beta$$, $$\gamma$$)

Once the structure of the neural net has been selected, an optimization strategy based on stochastic gradient descent (SGD) is used to fit the weight parameters of the convolutional neural net. There is no shortage of SGD variations implemented in TensorFlow; several parametrizations of RMSProp, one particular SGD variation, are compared in Figure 2.

Figure 2: Progression of RMSProp gradient descent with different parametrizations. left: Various decay rates with other parameters fixed: purple = .01, black = .5, red = .93. center: Various learning rates with other parameters fixed: purple = .016, black = .1, red = .6. right: Various momentums with other parameters fixed: purple = .2, black = .6, red = .93.

Optimally configuring a particular SGD algorithm for a given model and dataset can be a counterintuitive and time-consuming task. To simplify this tedious process, we expose to SigOpt the parameters that govern the RMSProp optimization algorithm. The most important of these are the learning rate ($$\alpha$$), momentum ($$\beta$$), and decay ($$\gamma$$) terms. These parameters define the RMSProp gradient update step:

Algorithm 1: Pseudocode for RMSProp stochastic gradient descent. Stochastic gradient refers to the fact that we are estimating the loss function gradient using a subsample (batch) of the entire training data.
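To make the roles of $$\alpha$$, $$\beta$$, and $$\gamma$$ concrete, here is a minimal plain-Python sketch of one common formulation of RMSProp with momentum. It is not a reproduction of Algorithm 1: the function names, the small `eps` constant, and the toy quadratic objective are assumptions made for illustration, and a real training loop would use the framework's own optimizer on mini-batch gradients.

```python
import math


def rmsprop_step(weights, grads, mean_sq, mom, alpha, beta, gamma, eps=1e-10):
    """Apply one RMSProp-with-momentum update to a list of weights.

    alpha: learning rate, beta: momentum, gamma: decay of the running
    average of squared gradients. eps (assumed) avoids division by zero.
    """
    new_weights, new_mean_sq, new_mom = [], [], []
    for w, g, ms, m in zip(weights, grads, mean_sq, mom):
        # Running average of squared gradients, controlled by the decay term.
        ms = gamma * ms + (1.0 - gamma) * g * g
        # Momentum accumulates the gradient scaled by its recent magnitude.
        m = beta * m + alpha * g / math.sqrt(ms + eps)
        new_weights.append(w - m)
        new_mean_sq.append(ms)
        new_mom.append(m)
    return new_weights, new_mean_sq, new_mom


def minimize_toy_quadratic(alpha=0.1, beta=0.0, gamma=0.9, steps=200):
    """Minimize f(w) = sum(w_i ** 2), whose gradient is 2 * w_i."""
    weights = [3.0, -2.0]
    mean_sq = [0.0, 0.0]
    mom = [0.0, 0.0]
    for _ in range(steps):
        grads = [2.0 * w for w in weights]
        weights, mean_sq, mom = rmsprop_step(
            weights, grads, mean_sq, mom, alpha, beta, gamma
        )
    return weights


final_weights = minimize_toy_quadratic()
final_loss = sum(w * w for w in final_weights)
print(final_weights, final_loss)
```

Changing `alpha`, `beta`, or `gamma` in this toy setting changes how quickly, and how smoothly, the weights approach the minimum, which is the behavior Figure 2 illustrates.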

For this example, we used only a single epoch of the training data, where one epoch refers to a complete presentation of the entire training data (~500K images in our example). Batch size refers to the number of training examples used in the computation of each stochastic gradient (10K images in our example). One epoch is made up of several batch-sized updates, which keeps the in-memory resources required for the optimization to a minimum [5]. Using only a single epoch can be detrimental to performance, but this was done in the interest of time for this example.
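The relationship between epoch, batch size, and number of gradient updates can be sketched as follows. The helper names are hypothetical, and the round figures of 500,000 images and 10,000 images per batch are taken from the approximate numbers quoted above.

```python
import math


def updates_per_epoch(num_examples, batch_size):
    """Number of stochastic gradient updates in one pass over the data."""
    return math.ceil(num_examples / batch_size)


def batch_slices(num_examples, batch_size):
    """Yield (start, end) index pairs covering the data one batch at a time."""
    for start in range(0, num_examples, batch_size):
        yield start, min(start + batch_size, num_examples)


# Roughly 500K training images with 10K images per batch.
print(updates_per_epoch(500_000, 10_000))  # 50

# Only one batch needs to be held in memory at a time.
for start, end in batch_slices(25, 10):
    print(start, end)
```

With these round numbers, a single epoch corresponds to about 50 weight updates.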

### Classification Performance

To compare tuning the CNN's hyperparameters with random search versus SigOpt, we ran 5 experiments using each method and compared the median best-seen trace. The objective was the classification accuracy on a single 80/20 fold of the training and “extra” set of the SVHN dataset (71K + 500K images). The median best-seen trace for each optimization strategy is shown below in Figure 3.

In our experiment we allowed SigOpt and random search to perform 80 function evaluations each, with every evaluation representing a different proposed configuration of the CNN. The best-seen trace records the best objective value observed up to each evaluation. As a baseline, we include the accuracy of an untuned TensorFlow CNN using the default parameters suggested in the official TensorFlow example. We also include the performance of a random forest classifier using sklearn defaults.
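The random search baseline and the median best-seen trace can be sketched in a few lines. Everything specific here is an assumption for illustration: the parameter bounds in `SPACE`, the `toy_accuracy` stand-in (which replaces actually training a CNN), and the seeds. Only the budget of 80 evaluations and the 5 independent runs come from the experiment described above.

```python
import math
import random
import statistics

# Hypothetical search space; these bounds are illustrative only.
SPACE = {
    "log_learning_rate": (math.log(1e-4), math.log(1.0)),
    "momentum": (0.0, 1.0),
    "decay": (0.0, 1.0),
    "dropout": (0.1, 0.9),
}


def sample_config(rng):
    """Draw one configuration uniformly at random from SPACE."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in SPACE.items()}


def toy_accuracy(config):
    """Stand-in for training a CNN and returning an accuracy in [0, 1]."""
    target = {"log_learning_rate": math.log(0.01), "momentum": 0.9,
              "decay": 0.9, "dropout": 0.5}
    penalty = 0.0
    for name, (lo, hi) in SPACE.items():
        # Squared distance from an arbitrary target, scaled by the range width.
        penalty += ((config[name] - target[name]) / (hi - lo)) ** 2
    return max(0.0, 1.0 - penalty)


def random_search(evaluate, budget, rng):
    """Return the best-seen trace over `budget` random evaluations."""
    best, trace = float("-inf"), []
    for _ in range(budget):
        best = max(best, evaluate(sample_config(rng)))
        trace.append(best)
    return trace


def median_trace(traces):
    """Median best-seen value at each evaluation across independent runs."""
    return [statistics.median(values) for values in zip(*traces)]


# 5 independent runs of 80 evaluations each, as in the experiment.
traces = [random_search(toy_accuracy, 80, random.Random(seed)) for seed in range(5)]
median_best_seen = median_trace(traces)
print(median_best_seen[0], median_best_seen[-1])
```

Sampling the learning rate on a log scale is a common choice when a parameter spans several orders of magnitude; it is an assumption of this sketch rather than a documented detail of the experiment.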

Figure 3: Median best-seen trace of CV accuracy over 5 independent optimization runs using SigOpt and random search, as well as two baselines where no tuning was performed.

After hyperparameter optimization was completed for each method, we compared accuracy on a completely held-out data set (the SVHN test set, 26K images) using the best configuration found in the tuning phase. The best hyperparameter configurations for each method in each of the 5 optimization runs were used for evaluation. The mean of these accuracies is reported in the table below. We also include the same baseline models described above and report their performance on the held-out evaluation set.

Table 1: Comparison of model accuracy on the held-out dataset after different tuning strategies.

### Cost Analysis

Using SigOpt to optimize deep learning architectures instead of a standard approach like random search can translate to real savings in the total cost of tuning a model. This is especially true when expensive computational resources (for example GPU EC2 instances) are required by your modeling efforts.

We compare the cost required to reach specific performance levels on the CV accuracy objective metric in our example experiment. Quickly finding optimal configurations directly reduces the computational cost of tuning, on top of the performance benefits of having a better model. Here we assume each observation costs \$2.60, which is the cost per hour of using a single on-demand g2.8xlarge instance in EC2.
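The cost arithmetic is simple enough to sketch directly. The \$2.60 per evaluation and the 80-evaluation budget come from the text; the function names and the evaluation counts in the savings example (16 versus 80) are hypothetical and are not measured results.

```python
COST_PER_EVALUATION = 2.60  # USD per evaluation, as assumed in the text


def tuning_cost(num_evaluations, cost_per_evaluation=COST_PER_EVALUATION):
    """Total cost of running the given number of evaluations."""
    return num_evaluations * cost_per_evaluation


def relative_savings(evals_a, evals_b):
    """Fraction of cost saved when method A needs evals_a evaluations and
    method B needs evals_b to reach the same accuracy."""
    return 1.0 - tuning_cost(evals_a) / tuning_cost(evals_b)


# A full 80-evaluation tuning run at $2.60 per evaluation.
print(f"${tuning_cost(80):.2f}")  # $208.00

# Hypothetical counts: one method reaches the target accuracy in 16
# evaluations, the other needs all 80.
print(f"{relative_savings(16, 80):.0%}")  # 80%
```

Because the per-evaluation cost is the same for both methods, the savings depend only on how many evaluations each method needs to reach a given accuracy.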

Table 2: Required costs for achieving same performance when tuning with SigOpt and random search. For CNNs in production more epochs are traditionally used; for this example we assume 50 GPUs and that the results scale perfectly with the parallelism.

We observe that SigOpt offers a drastic discount in cost to achieve equivalent performance levels when compared with a standard method like random search. While this experiment required only a relatively modest amount of computational resources, more sophisticated models and larger datasets require more instances training for up to weeks at a time, as was the case for the AlphaGo DNN, which used 50 GPUs for training.  In this setting, an 80% reduction in computational costs could easily translate to tens of thousands of dollars in savings.

### Closing Remarks

Deep learning has quickly become an exciting new area in applied machine learning. Development and innovation are often slowed by the complexity and effort required to find optimal structure and training strategies for deep learning architectures. Optimal configurations for one dataset don’t necessarily translate to others, and using default parameters can often lead to suboptimal results. This inhibited R&D cycle can be frustrating for practitioners, but it also carries a very real monetary cost. SigOpt offers Bayesian optimization as a service to assist machine learning engineers and data scientists in being more cost-effective in their modeling efforts. Start building state-of-the-art machine learning models on a budget today!


## Footnotes

1. James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for Hyper-Parameter Optimization. Advances in Neural Information Processing Systems. 2011.
2. Yoshua Bengio. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning. 2009.
3. Fei-Fei Li, Andrej Karpathy, and Justin Johnson. Convolutional Neural Networks for Visual Recognition. Stanford Online Course.
4. Pierre Sermanet, Soumith Chintala, and Yann LeCun. Convolutional Neural Networks Applied to House Numbers Digit Classification. International Conference on Pattern Recognition (ICPR). 2012.
5. Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural Networks for Machine Learning. University of Toronto Course Slides.

Ian Dewancker, Guest Author