## Grid Search vs Bayesian Optimization

A Comparison of Hyperparameter Tuning Methods

The most common hyperparameter tuning method in practice is grid search, in which a discrete set of candidate values is chosen for each hyperparameter and every combination is evaluated exhaustively. Because the number of combinations grows exponentially with the number of hyperparameters, this approach is slow, expensive, and quickly becomes intractable for complex machine learning pipelines.
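To make the exponential growth concrete, here is a minimal sketch of exhaustive grid search in Python. The hyperparameter names, candidate values, and `toy_objective` function are illustrative stand-ins, not part of any particular library; a real pipeline would train and score a model in place of the objective.

```python
import itertools

# Hypothetical hyperparameter grid for illustration: 3 hyperparameters,
# each with a handful of candidate values.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "max_depth": [3, 5, 7, 9],
    "n_estimators": [50, 100, 200],
}

def toy_objective(config):
    # Stand-in for a real validation metric; a real pipeline would
    # train and evaluate a model here.
    return -abs(config["learning_rate"] - 0.01) - abs(config["max_depth"] - 5)

# Exhaustive grid search: evaluate every combination.
keys = list(grid)
configs = [dict(zip(keys, values)) for values in itertools.product(*grid.values())]
best = max(configs, key=toy_objective)

# 3 * 4 * 3 = 36 evaluations here; adding one more 10-valued
# hyperparameter would multiply the count by 10.
print(len(configs))
print(best)
```

Each additional hyperparameter multiplies the total evaluation count by the size of its candidate list, which is exactly the exponential scaling described above.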

Bayesian Optimization is an intelligent alternative to grid search. By learning from previously evaluated configurations, it achieves superior results in a fraction of the time. SigOpt exposes an ensemble of Bayesian optimization techniques behind a simple API.

| Feature | Grid Search | Bayesian Optimization |
| --- | --- | --- |
| Required evaluations vs. dimension | Exponential | Linear^{1} |
| Speed^{2} | Slow | Fast |
| Parallel sampling | Yes | Yes |
| Computational cost^{3} | High | Low^{4} |
| Finds optimal solutions | No^{5} | Yes |

^{1}In practice, Bayesian optimization powered by SigOpt finds optimal solutions within a linear number of steps for a large number of standard use cases from the academic literature. For more information see our paper presented at ICML 2016.

^{2}Time required for optimization scales with the number of required evaluations.

^{3}Computational cost scales with the number of required evaluations.

^{4}Bayesian optimization powered by SigOpt suggests new configurations within one second, far outpacing traditional open source implementations, whose suggestion times can quickly become intractable.

^{5}In practice, grid search is coarse in each dimension, which makes it easy to skip over good values that fall between grid points.

## Other Grid Search Variants

#### Random Subset

Random Subset Grid Search involves randomly selecting elements from the exponential set of possible configurations in a traditional exhaustive grid search. This can decrease the total number of evaluations at the cost of performance.
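As a sketch, random subset grid search can be implemented by enumerating the full grid and then evaluating only a fixed budget of configurations sampled from it. The grid, budget, and `toy_objective` below are hypothetical placeholders for a real tuning problem.

```python
import itertools
import random

# Hypothetical grid: 3 * 4 * 3 = 36 possible configurations.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "max_depth": [3, 5, 7, 9],
    "n_estimators": [50, 100, 200],
}

def toy_objective(config):
    # Stand-in for a real validation metric.
    return -abs(config["learning_rate"] - 0.01) - abs(config["max_depth"] - 5)

keys = list(grid)
all_configs = [dict(zip(keys, v)) for v in itertools.product(*grid.values())]

# Evaluate only a random subset instead of all 36 combinations.
random.seed(0)  # fixed seed for reproducibility of this sketch
budget = 10
subset = random.sample(all_configs, budget)
best = max(subset, key=toy_objective)
```

The budget caps the cost, but since good configurations may fall outside the sampled subset, the best result found can be worse than the true grid optimum.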

#### 1D Slice / Parameter Sweep

1D Slice, sometimes called Parameter Sweep, involves searching along a discrete set of values one dimension at a time, holding the other dimensions fixed. This reduces the total number of evaluations to scale linearly with dimension, but it is not guaranteed to find even a local optimum.
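The sweep can be sketched as a coordinate-wise loop; again, the grid and `toy_objective` here are illustrative stand-ins for a real tuning problem.

```python
# Hypothetical grid and objective for illustration.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "max_depth": [3, 5, 7, 9],
    "n_estimators": [50, 100, 200],
}

def toy_objective(config):
    # Stand-in for a real validation metric.
    return -abs(config["learning_rate"] - 0.01) - abs(config["max_depth"] - 5)

# Start from the first value in each dimension.
current = {k: values[0] for k, values in grid.items()}
evaluations = 0

# Sweep one dimension at a time, holding the others fixed.
for key, candidates in grid.items():
    best_value, best_score = current[key], float("-inf")
    for candidate in candidates:
        trial = dict(current, **{key: candidate})
        score = toy_objective(trial)
        evaluations += 1
        if score > best_score:
            best_value, best_score = candidate, score
    current[key] = best_value

# evaluations == 3 + 4 + 3 = 10: linear in the number of dimensions,
# versus 36 for the exhaustive grid.
```

Because each dimension is optimized against a single fixed setting of the others, interactions between hyperparameters are ignored, which is why the sweep can miss even local optima.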

#### Low Discrepancy

Low Discrepancy Search, a quasi-Monte Carlo method, is designed to sample high-dimensional spaces evenly, but it does not attempt to perform optimization. Examples include Sobol and Halton sequences and Latin hypercube sampling.
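As a sketch of low-discrepancy sampling, a Halton sequence can be generated by computing the radical inverse of the sample index in a distinct prime base per dimension (bases 2, 3, 5, 7 for a 4D space); the sample count and bases below are illustrative choices.

```python
def radical_inverse(index, base):
    # Reflect the base-`base` digits of `index` about the radix point:
    # e.g. index 6 = 110 in base 2 -> 0.011 in base 2 = 0.375.
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result

def halton_point(index, bases=(2, 3, 5, 7)):
    # One 4D point in the unit hypercube; one prime base per dimension.
    return tuple(radical_inverse(index, b) for b in bases)

# 100 evenly spread samples (indices start at 1 to avoid the origin).
points = [halton_point(i) for i in range(1, 101)]
```

Rescaling each unit-interval coordinate onto a hyperparameter's range yields configurations that cover the space more evenly than independent uniform sampling, but choosing which of those configurations to evaluate next is left entirely to the practitioner.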

## Comparison of Methods

Below is the result of maximizing a selection of 4D functions from our recent paper^{1}, presented at the International Conference on Machine Learning (ICML), on evaluating hyperparameter optimization strategies. The figures below show best seen traces (described in our paper on the evaluation framework, also presented at ICML), which depict the best function value observed after a given number of function evaluations.

#### 4D Hartmann function

#### 4D Shekel function

#### 4D Michalewicz function

As you can see, SigOpt reaches better configurations in a fraction of the time compared to these other methods. Furthermore, these results are statistically significant and extend to a wide variety of other examples.

^{1}We compare the performance of a width-10 exhaustive grid search (with a total of 10,000 function evaluations) to each of the grid search variants described above with only 100 function evaluations, as well as to Bayesian Optimization using SigOpt's optimization service. We include a selection of the functions used in the SigOpt evalset GitHub repo.