# Bayesian Optimization 101


This blog post will go over why you should use Bayesian optimization in your modeling process, the basics of Bayesian optimization, and how to effectively leverage Bayesian optimization for your modeling problems. Learn about other common hyperparameter optimization methods here.

## Why use Bayesian Optimization

Bayesian optimization makes scale, efficiency, and performance more widely accessible. Originally popularized as an alternative to grid search, it efficiently searches for the global maximum (or minimum) of a black-box function over a defined parameter space. In the context of hyperparameter optimization, this black-box function is the objective function, for example: accuracy on a validation or test set, loss on a training or validation set, entropy gained or lost, area under the ROC curve (AUC), A/B test performance, computation cost per epoch, model size, or reward in reinforcement learning.

Several attributes distinguish Bayesian optimization from other optimization methods. In particular, Bayesian optimization:

• Explores and exploits a given parameter space to find the global optimum [2, 4]
• Robustly handles noisy data
• Naturally adapts to discrete and irregular parameter domains
• Efficiently scales with the hyperparameter domain

Essentially, Bayesian optimization finds the global optimum relatively quickly, works well in noisy or irregular hyperparameter spaces, and efficiently explores large parameter domains. Because of these properties, the technique is useful for hyperparameter tuning and architecture search of machine learning models, for simulating and optimizing physical processes, and for backtesting in algorithmic trading.

## The Basics of Bayesian Optimization

Here is a quick summary of how Bayesian optimization works.

### Step 1: Sample the parameter space

Initialize the process by sampling the hyperparameter space, either randomly or with a low-discrepancy sequence, and evaluating the objective at the sampled points to get the initial observations [3].
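As an illustration of the two sampling options, here is a minimal sketch that builds an initial design with either uniform random sampling or a Halton sequence (one common low-discrepancy sequence). The function names and the two-parameter search space are hypothetical and chosen only for this example.

```python
import random


def radical_inverse(index, base):
    """Mirror the base-`base` digits of `index` around the radix point."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton_points(n_points, bounds, bases=(2, 3, 5, 7, 11, 13)):
    """Return n_points Halton samples scaled into the given (low, high) bounds."""
    if len(bounds) > len(bases):
        raise ValueError("add more prime bases for this many dimensions")
    points = []
    for i in range(1, n_points + 1):  # start at 1 so the first point is not the lower corner
        unit = [radical_inverse(i, bases[d]) for d in range(len(bounds))]
        points.append([lo + u * (hi - lo) for u, (lo, hi) in zip(unit, bounds)])
    return points


def random_points(n_points, bounds, seed=0):
    """Return n_points uniform random samples inside the bounds."""
    rng = random.Random(seed)
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_points)]


# Hypothetical search space: log10(learning rate) and dropout rate
bounds = [(-5.0, -1.0), (0.0, 0.5)]
initial_design = halton_points(8, bounds)
random_design = random_points(8, bounds)
```

Each point in `initial_design` would then be evaluated with your objective to produce the first observations. Low-discrepancy points tend to cover the space more evenly than the same number of random points, which is why they are a common choice for the initial design.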

### Step 2: Build a surrogate model

Build a probabilistic model (surrogate model) to approximate the true function based on the hyperparameter values sampled so far and their associated output values (observations). In this case, fit a Gaussian process to the observed data from Step 1. Use the mean of the Gaussian process as the best estimate of the black-box function; its variance indicates how uncertain that estimate is at each point. For more on this, read Michael McCourt’s series on the intuition behind Gaussian Processes.
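For reference, these are the standard Gaussian process posterior equations under a zero-mean prior. The notation is ours rather than the original post's: $X$ holds the sampled hyperparameter points, $y$ their observed values, $k$ is the covariance kernel, $K = k(X, X)$, and $\sigma^2$ is the assumed observation noise variance.

$$
\mu(x) = k(x, X)\,\bigl[K + \sigma^2 I\bigr]^{-1} y,
\qquad
s^2(x) = k(x, x) - k(x, X)\,\bigl[K + \sigma^2 I\bigr]^{-1} k(X, x)
$$

Here $\mu(x)$ is the surrogate's prediction at a new point $x$ and $s^2(x)$ is its predictive variance, which shrinks near points that have already been observed.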

### Step 3: Figure out where to sample next

Sample next at the point in the hyperparameter space that maximizes the acquisition function. Acquisition functions balance the trade-off between exploiting a known high-performing result and exploring uncertain locations in the hyperparameter space. Different acquisition functions take different approaches to defining exploration and exploitation.
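To make the trade-off concrete, here is a minimal sketch of expected improvement, one widely used acquisition function, for a maximization problem. It assumes the surrogate returns a Gaussian prediction (mean and standard deviation) at each candidate; the `xi` margin and the example numbers are illustrative assumptions, not values from this post.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best_so_far, xi=0.0):
    """Expected improvement over the best observed value (maximization).

    mean, std: surrogate prediction and its standard deviation at a candidate.
    best_so_far: highest objective value observed so far.
    xi: optional margin; larger values favor exploration.
    """
    improvement = mean - best_so_far - xi
    if std <= 0.0:
        # No uncertainty left: only a predicted gain counts
        return max(improvement, 0.0)
    z = improvement / std
    return improvement * normal_cdf(z) + std * normal_pdf(z)


# Hypothetical candidates, with a best observed score of 0.80
confident = expected_improvement(mean=0.80, std=0.01, best_so_far=0.80)
uncertain = expected_improvement(mean=0.75, std=0.10, best_so_far=0.80)
```

In this example `uncertain` scores higher than `confident`: the second candidate has a lower predicted value, but its larger uncertainty leaves more room for improvement, so the acquisition function prefers to explore it.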

### Step 4: Sample the parameter space at the points picked in Step 3

Evaluate the black-box function at the newly chosen hyperparameter points and add the resulting observations to the set of observed data.

This process (Steps 2-4) repeats until a maximum number of iterations is reached. By iterating through the steps described above, Bayesian optimization effectively searches the hyperparameter space while homing in on the global optimum [1, 2, 4].
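The four steps can be put together in a small, dependency-free sketch. It is a teaching example under several assumptions that are ours, not the post's: a single hyperparameter scaled to [0, 1], a Gaussian process with a fixed squared-exponential kernel and constant mean, expected improvement maximized over a grid of candidates, and a hypothetical `objective` standing in for a real training run. A production package handles kernel fitting, mixed parameter types, and noise far more carefully.

```python
import math


def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel with a fixed (assumed) length scale."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(matrix):
    """Return lower-triangular L such that L L^T equals the input matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def solve_lower(lower, rhs):
    """Forward substitution for L z = rhs."""
    z = []
    for i, row in enumerate(lower):
        z.append((rhs[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z


def solve_upper(lower, rhs):
    """Back substitution for L^T z = rhs."""
    n = len(lower)
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(lower[k][i] * z[k] for k in range(i + 1, n))
        z[i] = (rhs[i] - s) / lower[i][i]
    return z


def fit_gp(xs, ys, noise=1e-6):
    """Step 2: fit a GP whose constant mean is the average observation."""
    offset = sum(ys) / len(ys)
    gram = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]
    lower = cholesky(gram)
    alpha = solve_upper(lower, solve_lower(lower, [y - offset for y in ys]))
    return list(xs), lower, alpha, offset


def predict(model, x):
    """Posterior mean and standard deviation of the surrogate at x."""
    xs, lower, alpha, offset = model
    k_star = [rbf(x, xi) for xi in xs]
    mean = offset + sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(lower, k_star)
    variance = max(1.0 - sum(vi * vi for vi in v), 0.0)  # rbf(x, x) == 1
    return mean, math.sqrt(variance)


def expected_improvement(mean, std, best):
    if std <= 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf


def bayes_opt(objective, initial_xs=(0.1, 0.5, 0.9), n_iter=10, grid_size=101):
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    xs = list(initial_xs)                      # Step 1: initial design
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        model = fit_gp(xs, ys)                 # Step 2: surrogate model
        best = max(ys)
        candidates = [x for x in grid if x not in xs]
        x_next = max(                          # Step 3: maximize the acquisition
            candidates,
            key=lambda x: expected_improvement(*predict(model, x), best),
        )
        xs.append(x_next)                      # Step 4: observe and store
        ys.append(objective(x_next))
    i_best = max(range(len(ys)), key=ys.__getitem__)
    return xs[i_best], ys[i_best], xs, ys


def objective(x):
    """Hypothetical black-box score of one hyperparameter scaled to [0, 1]."""
    return -(x - 0.3) ** 2


best_x, best_y, xs, ys = bayes_opt(objective)
```

The loop refits the surrogate after every new observation, so each suggestion uses everything learned so far. The stopping rule here is a fixed iteration budget, matching the description above.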

## Quick Tips for Getting the Most out of Bayesian Optimization

### Choose the right metric or metrics to optimize

Choosing the right metric or metrics is an essential step, as these values will be minimized or maximized by Bayesian optimization. Doing this well can ensure that your model’s performance aligns with your end goal, facilitates fairness, or takes your data properties into consideration. Optimization algorithms will only amplify the metric you choose, so it’s important to make sure your metrics reflect your goals.

When in doubt as to which metrics best reflect your goals, try running a couple of short optimization cycles to get a better understanding of your hyperparameter space. Tracking and storing multiple metrics throughout your modeling and optimization process or running optimization cycles on multiple metrics will help you analyze and understand which metrics best relate to improved performance for your problem.
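As a small illustration of why the metric matters, the sketch below scores two hypothetical classifiers on an imbalanced validation set and records several metrics for each. The data and the two prediction vectors are made up for this example.

```python
def classification_metrics(y_true, y_pred):
    """Compute several metrics at once so they can be tracked side by side."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical validation labels: 95 negatives followed by 5 positives
y_true = [0] * 95 + [1] * 5

# Model A always predicts the majority class
always_negative = [0] * 100
# Model B raises 5 false alarms but catches 4 of the 5 positives
catches_positives = [0] * 90 + [1] * 5 + [1] * 4 + [0]

metrics_a = classification_metrics(y_true, always_negative)
metrics_b = classification_metrics(y_true, catches_positives)
```

Model A has the higher accuracy even though it never detects a positive, while model B wins on recall and F1. An optimizer told to maximize accuracy on this data would push toward model A, which is rarely what you want when the positive class is the one you care about.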

### Integrate optimization throughout your workflow

Beyond hyperparameter tuning, Bayesian optimization can help with data augmentation, feature engineering, model compression, neural architecture search, and much more. Taking optimization into account earlier in your modeling workflow can help solve some of these problems. Furthermore, considering optimization upfront will help alleviate engineering costs for parameterizing your models further down the line.

In practice, we’ve seen significant improvements in performance and modeling workflow benefits when using Bayesian optimization across a wide variety of models and problems, including:

• regression models: beat Wall Street by tuning trading models
• reinforcement learning: create a better agent for the classic cart-pole problem with Bayesian optimization
• data augmentation: use Bayesian optimization to augment your dataset
• deep learning architecture: tune model architecture and training parameters to quickly tune a CNN for sentiment analysis
• model compression: tune model distillation to achieve substantial model compression without loss in performance
• fine-tuning for image classification: identify the best-suited transfer learning technique for your problem
• unsupervised learning: tune a feedback loop to intelligently feature engineer

### Use a package that makes it easy to get up and running

When choosing the right Bayesian optimization package for you, consider the following questions:

• How much effort would it be to parameterize your existing code to integrate the package?
• Is the package kept up-to-date?
• Does the package offer features to make your optimization cycles more efficient?
• How easy is it to orchestrate and execute the package on your compute environment?
• Will you have to take care of parallelizing Bayesian optimization yourself (we do not recommend this) or is it built in?

Between open source and commercial offerings, there are plenty of Bayesian optimization packages you could use. One of the most important considerations when making a selection is not the marginal performance differences between them but how easy they are to get up and running with your project. A fully supported package should at a minimum include an easy way to integrate the optimization loop in your code, recent releases or pull requests that suggest it is maintained, automatic scheduling of next model configuration suggestions, and support for asynchronous parallelization.
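To show what asynchronous parallelization looks like from the caller's side, here is a minimal sketch of a suggest/observe loop that keeps several workers busy and reports each result as soon as it finishes. The `RandomSuggester` class is a hypothetical stand-in that proposes random points, not a Bayesian optimizer and not any particular package's API; a real package would replace it with model-based suggestions behind a similar interface.

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


class RandomSuggester:
    """Hypothetical stand-in for an optimizer with a suggest/observe interface."""

    def __init__(self, bounds, seed=0):
        self.bounds = bounds
        self.rng = random.Random(seed)
        self.observations = []

    def suggest(self):
        # A real optimizer would use its surrogate model here
        return {name: self.rng.uniform(lo, hi) for name, (lo, hi) in self.bounds.items()}

    def observe(self, params, value):
        self.observations.append((params, value))


def evaluate(params):
    """Hypothetical objective; in practice this trains and scores a model."""
    return -(params["x"] - 0.3) ** 2


def run_async(optimizer, budget, n_workers=4):
    pending = {}  # maps each running future to the parameters it is evaluating
    submitted = 0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while submitted < budget or pending:
            # Keep every worker busy while budget remains
            while submitted < budget and len(pending) < n_workers:
                params = optimizer.suggest()
                pending[pool.submit(evaluate, params)] = params
                submitted += 1
            # Report results as soon as any evaluation finishes
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                optimizer.observe(pending.pop(future), future.result())
    return max(optimizer.observations, key=lambda obs: obs[1])


optimizer = RandomSuggester({"x": (0.0, 1.0)})
best_params, best_value = run_async(optimizer, budget=12)
```

The key property is that a new suggestion is requested whenever a worker frees up, rather than waiting for a whole batch to complete. With a real Bayesian optimizer, doing this well also requires accounting for points that are still being evaluated, which is one reason built-in support is preferable to writing it yourself.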

Depending on your experimentation needs, you may also want to evaluate which features the package includes. Does it support all parameter types you need to optimize? Does it include multimetric or multiobjective optimization? Can you run multitask optimization to use partial cost tasks to reduce the cost of your tuning job? How about a way to introduce prior beliefs or establish trust regions? There are many features that make a Bayesian optimization package more or less useful for your particular modeling project that need to be considered.

## References

[1] E. Brochu, V.M. Cora, N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.

[2] P. Frazier. Bayesian Optimization. Recent Advances in Optimization and Modeling of Contemporary Problems, October 2018.

[3] M. W. Hoffman, B. Shahriari. Modular mechanisms for Bayesian optimization. In NIPS Workshop on Bayesian Optimization, 2014.

[4] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, Jan 2016.