Grid Search vs Bayesian Optimization

A Comparison of Hyperparameter Tuning Methods

The most common hyperparameter tuning method in practice is grid search, where an exponentially large grid of possible hyperparameter configuration combinations is exhaustively searched. This is slow, expensive, and quickly becomes intractable for complex machine learning pipelines.

Bayesian Optimization represents an intelligent alternative to grid search. By learning from previous configurations it achieves superior results in a fraction of the time. SigOpt is an ensemble of Bayesian optimization techniques behind a simple API.

Grid SearchBayesian Optimization
Required Evaluations vs DimensionExponentialLinear1
Parallel SamplingYesYes
Computational Cost3HighLow4
Finds Optimal SolutionsNo5Yes

1In practice, Bayesian optimization powered by SigOpt finds optimal solutions within a linear number of steps for a large number of standard use cases from the academic literature. For more information see our paper presented at ICML 2016.

2Time required for optimization scales with number of required evaluations.

3Computational cost scales with number of required evaluations

4Bayesian optimization powered by SigOpt suggests new configurations within 1s, far outpacing traditional open source implementations which can quickly grow intractable.

5In practice grid search is coarse in each dimension, which allows it to easily alias over good values.

Other Grid Search Variants

Random Subset

Random Subset Grid Search involves randomly selecting elements from the exponential set of possible configurations in a traditional exhaustive grid search. This can decrease the total number of evaluations at the cost of performance.

1D Slice / Parameter Sweep

1D Slice, sometimes called Parameter Sweep, involves searching along a discrete set of values 1 dimension at a time. This decreases the total number of evaluations to scale linearly, but is not guaranteed to find even a local optima.

Low Discrepancy

Low Discrepancy Search, a Quasi-Monte Carlo method, is designed to sample high dimensional spaces, but does not attempt to perform optimization. Examples include Sobol and Halton sequences or latin hypercube sampling.

Comparison of Methods

Below is the result of maximizing a selection of 4D functions from our recent paper1 presented at the International Conference on Machine Learning (ICML) on evaluating hyperparameter optimization strategies. The figures below contain the best seen traces (described in another paper about our evaluation framework presented at ICML) which depict the best function value observed after a specific number of function evaluations.

4D Hartmann function

4D Shekel function

4D Michalewicz function

As you can see, SigOpt reaches better configurations in a fraction of the time compared to these other methods. Furthermore, these results are statistically signifcant and extend to a wide variety of other examples.

1We compare the performance on a width 10 exhaustive grid search (with a total of 10,000 function evaluations) to each of the grid search variations described above with only 100 function evaluations as well as Bayesian Optimization using SigOpt’s optimization service. We include a selection of the functions used in the SigOpt evalset github repo.