# Cost matters: on the importance of cost-aware hyperparameter optimization

Advanced Optimization Techniques, Hyperparameter Optimization, Research

We are excited to present recent research that we published in the proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI) 2021, “A nonmyopic approach to cost-constrained Bayesian optimization.” This work was jointly performed with researchers from Facebook and Amazon: David Eriksson, Valerio Perrone, and Matthias Seeger, and in this blog post, we briefly discuss the motivating factors behind this work.

## The default state of HPO: iteration-based optimization

Nearly all practical hyperparameter optimization packages, from Optuna to HyperOpt to SigOpt itself, attempt to determine optimal hyperparameters within a certain number of iterations. Let’s see how a user might run each with a budget 100.

HyperOpt:

from hyperopt import fmin
fmin(
objective,
max_evals=100,
)

Optuna:

from optuna import create_study
study = create_study()
study.optimize(
objective,
n_trials=100
)

SKOpt:

from skopt import gp_minimize

gp_minimize(
func=objective,
n_calls=100,
)

SigOpt:

while experiment.progress.observation_count < 100:
suggestion = conn.experiments(experiment.id).suggestions().create()
value = objective(suggestion.assignments)
conn.experiments(experiment.id).observations().create(
suggestion=suggestion.id,
value=value,
)

Note that we’ve simplified the code snippets above for clarity’s sake; a user will have to write slightly more complex code by defining the search space, the objective, etc. But you can see that all these optimizers have essentially the same interface: plug in the objective you want to maximize / minimize, and then number of iterations you want to run it for. Most people take this interface for granted.

But is this really the right interface? We argue that by asking for iteration count, when we really care about something like cumulative training time, we are leaving performance on the table.

## The challenge: varying costs of hyperparameter evaluation

Measuring optimization progress with iterations is reasonable if each evaluation takes the same amount of time. In HPO problems, this is not often the case; the training time associated with one hyperparameter configuration may significantly differ from the training time of a different hyperparameter configuration.

We confirm this in an exhaustive study of hyperparameter optimization problems, published in our UAI paper. We consider five of the most commonly used models in machine learning:

• k-nearest neighbor (KNN)
• Multi-layer perceptron (MLP)
• Support vector machine (SVM)
• Decision tree (DT)
• Random forest (RF)

Taken together, these models comprise the large majority of general models that data scientists and machine learning experts use (note that deep models have been omitted, but that these results still hold for those as well).

Top: Runtime distribution, log-scaled, of 5000 randomly selected points for the k-nearest-neighbors (KNN), Multi- layer Perceptron (MLP), Support Vector Machine (SVM), Decision Tree (DT), and Random Forest (RF) hyperparameter optimization problems.

We train each of our five models on the commonly benchmarked OpenML w2a dataset using 5000 randomly chosen hyperparameter configurations drawn from a standard search space (see our paper for full search space details). We then plotted a distribution of the resulting runtimes for each model.  The x-axis of each histogram represents the runtime, and the y-axis, which has been logarithmically scaled, represents density, i.e., fraction of total evaluations.

As you can see, the training times of each model vary greatly, often by an order of magnitude or more! This is because for each model, a few hyperparameters heavily influence not only model performance, but also training time, e.g., the layer sizes in a neural network or the number of trees in a forest. In fact, we have found that, in nearly all practical applications, evaluation costs vary significantly in different regions of the search space.

Consequently, the cumulative training time spent tuning these hyperparameters is not directly proportional to the number of iterations! In fact, It is entirely possible, according to the histogram above, for one optimizer to evaluate one hyperparameter configuration, for another optimizer to evaluate one hundred points, and for both to take an equal amount of time doing so.

## The challenge: handling a heterogeneous cost landscape

We want an optimizer that accounts for the varying cost landscape of the objective function, and makes intelligent decisions about where to evaluate next. This is known as a cost-aware optimization routine. Instead of taking in an optimization budget measured in iteration count, our optimization routine takes into account an optimization budget measured by a cost metric such as time.

Colloquially, instead of asking an optimization routine to compute the optimal hyperparameters in 100 iterations, we can instead ask it to compute the optimal hyperparameters in, e.g., 100 minutes of training time. We might invoke such an optimization routine in the following generic way:

optimizer.minimize(
func=objective,
num_minutes=100,
)

This asks the optimization routine to find the best hyperparameters it can within 100 minutes, which turns out to be quite a different challenge than finding the best hyperparameters it can within 100 iterations. In the next section, we provide a brief example illustrating why.

## On the benefit of being cost-aware

In our paper, we develop a cost-aware optimization algorithm, which we’ll call cost aware Bayesian optimization (CA-BO) in this blog post. Cost-aware BO is an important research problem that the hyperparameter optimization community is actively prioritizing; in practice, the user often has a cost budget that is expressed in money, time, or some other heterogeneous metric, that constrains hyperparameter optimization. To keep things simple, let’s consider a running example comparing our CA-BO method to Optuna, a great open source tool from Preferred Networks. Both optimizers are tasked to maximize the binary classification accuracy of an XGBoost model in 100 optimization iterations.

We see that as iteration continues, Optuna gradually outperforms CA-BO. The performance gap in this case is quite convincing. What might be happening? We would hope CA-BO at least performs as well as Optuna.

This gap exists because CA-BO is cost-aware. It recognizes that the evaluation cost varies, and makes intelligent decisions accordingly. Let us examine optimization performance when we account for varying evaluation cost, by replacing the x-axis with cumulative training time instead of iterations.

Above, we have re-plotted the performance comparison between CA-BO and Optuna with this new x-axis.  Once we do this, we see that our CA-BO performs much better!  We can see that Optuna eventually does outperform CA-BO, but only after CA-BO finishes running within it’s time budget (and it takes much longer to achieve comparable accuracy at nearly all times during optimization).

The fundamental change that allows us to achieve superior performance in terms of cumulative training time is our switch to cost-efficiency from iteration-efficiency (also known as sample-efficiency, which most hyperparameter optimization routines aim for).

Placing the two performance graphs side-by-side reinforces the stark difference between the cost-aware and cost-unaware optimization routines when we measure their respective performance through two different lenses. On the left, we see that our hyperparameter optimization routine finds superior hyperparameters within 200 minutes.  On the right, we see that it finds worse hyperparameters within 100 iterations (which a cost-unaware optimization tool may take more than 500 minutes to conduct).

The picture on the left is the de-facto state of HPO at the moment. Potentially, the picture on the right is what developers trying to move models to production actually care about; a method that accounts for the metric they care about, and makes intelligent decisions accordingly.

## Conclusions

Cost-aware Bayesian optimization is a rapidly evolving class of algorithms for HPO, and something that other researchers are also tackling. For more information, please see our paper, which contains the technical details of our approach. For more information on other optimization techniques and applications check out the SigOpt research page.

Techniques like this are built into the SigOpt optimizer, one of the pillars of the SigOpt Intelligent Experimentation Platform. You can try out the optimizer and the rest of the platform for free by signing up today

Eric Lee Research Engineer
Michael McCourt Research Engineer