Hyperparameter optimization is a common problem in machine learning. Machine learning algorithms, from logistic regression to neural nets, depend on well-tuned hyperparameters to reach maximum effectiveness. Different hyperparameter optimization strategies have varied performance and cost (in time, money, and compute cycles.) So how do you choose?
Evaluating optimization strategies is non-intuitive. Stochastic optimization strategies produce a distribution of best-found values. And how should we choose between two similarly-performing strategies? This blog post provides solutions for comparing different optimization strategies for any optimization problem, using hyperparameter tuning as the motivating example.
We can generalize this problem as given a function that accepts inputs and returns a numerical output, how can we efficiently find the inputs, or parameters, that maximize the function’s output?
In the case of hyperparameter tuning, the input to our function is the hyperparameters of the model, and the output is the result of measuring the corresponding model’s performance on an offline dataset, a common practice in machine learning.
The inputs of the function are parameters, and the parameter space represents all possible values of the parameters, usually defined as “acceptable” bounds for each parameter. The dimension of the function is the number of these parameters.
An optimization strategy defines how we will select parameters and how many function evaluations we will perform for every optimization. The final result of an optimization is the “winning” set of parameters that have produced the highest output observed for our function. Considerations when evaluating the performance of an optimization strategy include:
What is the best output value observed by this strategy? What is the expected cost of this strategy? How many times do I have to evaluate the function to achieve the best-observed value?
Example Optimization Strategies
We’ll explore three different methods of optimizing hyperparameters: grid search, random search, and Bayesian optimization. There are other potential strategies, but many of these require too many function evaluations per optimization to be feasible.
Grid search suggests parameter configurations deterministically, by laying down a grid of all possible configurations inside your parameter space. To optimize with this strategy, evaluate your function at every point on this grid. With as few as four parameters, this problem can become impractical, because the number of function evaluations required for this strategy increases exponentially with each additional parameter, due to the curse of dimensionality.
Random Search suggests configurations randomly from your parameter space. While less common in machine learning practice than grid search, random search has been shown to find equal or better values than grid search within fewer function evaluations for certain types of problems. To optimize with random search, evaluate your function at some number of random configurations in the parameter space; note that it may be unclear how to determine the number of function evaluations required for your particular problem.
Bayesian optimization is an adaptive approach to parameter optimization, trading off between exploring new areas of the parameter space, and exploiting historical information to find the parameters that maximize the function quickly. Like random search, Bayesian optimization is stochastic. In practice, tools like SigOpt perform well with a number of evaluations equal to 10 to 20 times the dimension of the optimization problem.
Optimization Strategy Performance
The best-found value is the highest output value of your function reported during an optimization. When comparing stochastic optimization strategies, we consider the distribution of the best-found values over many optimizations of a single strategy and apply a statistical test, similar to the test commonly used when comparing CTR distributions across variations in an A/B test.
Each optimization in the graph above consists of 20 function evaluations. For grid search, we used a 4 x 5 grid. Grid search, a deterministic optimization strategy, found effectively the same best value on every optimization, but the median accuracy was low, likely because the grid was too coarse. On the other hand, the best found values for random search and SigOpt have a wider range, but the median of each distribution is higher than that of grid search.
Number of Function Evaluations
Optimization strategies must be practical, and an important consideration is the number of function evaluations required by the strategy to reach its optimal value. A strategy like grid search requires an exponentially growing number of function evaluations with the number of parameters and becomes intractable in higher dimensions. Strategies like random search and Bayesian optimization attempt to reach an optimal value with function evaluations linearly proportional to the dimensions of the problem.
Best Seen Trace
To evaluate how quickly a strategy reaches an optimal value, you can examine a best seen trace, which plots the best value observed so far at function evaluation 1, 2, 3…n. Because some of our optimization methods are inherently stochastic we examine the interquartile range of the best seen trace – when aggregating 50 optimizations the interquartile range represents the span of the middle 25 results. This plot shows the evolution of the best seen trace over 40 function evaluations.When two optimizations with the same fixed number of function evaluations result in similar best found values we can break ties by determining which optimization strategy was able to observe the best result first. We use the Area Under Curve (AUC) of the best seen trace to see which strategy got to the best results in the fewest function evaluations. To convert the best seen traces into numerical measures fit for comparison, we use the following formula to find the AUC of these best seen traces:
where fLB is a lower bound for f designed to ensure that the AUC is always positive.
Comparing with Confidence
We want assurances that one strategy is better than another with statistical significance. We define “better” to mean:
The best found value from method A will be better than the best found value from method B with some level of confidenceIf A and B find the same best value, A finds this value within fewer function evaluations than B with some level of confidence as determined by the AUC of the best seen trace
To compare two optimization strategies with a specific level of confidence, we can apply the Mann-Whitney U Test with to the best found values from each strategy. We can then use the Area Under the Curve of the best seen traces to break ties. Our internal testing on functions publicly available in the EvalSet shows that, at significance .01, SigOpt outperforms grid search in 399 out of 400 statistically significant tests.
There are many theoretical and practical concerns when evaluating optimization strategies. The best strategy for your problem is the one that finds the best value the fastest–with the fewest function evaluations. The best found value and the Area Under the Curve are two measurements to compare optimization strategies. These measurements combined with the Mann-Whitney U Test allow us to make rigorous statistical statements about which optimization strategies performs better. SigOpt uses Bayesian optimization to help you find the best parameters for a given function or model the fastest, and we frequently use these methods to determine our own performance against popular optimization strategies. Sign up today!