When it’s time to ensure that your model, whether that’s a recommendation system, a computer vision classifier, or a trading strategy, is performing as well as it possibly can, you’ll need to choose the right hyperparameter optimization strategy. Data scientists and researchers choose from a wide variety of optimization methods for a whole host of reasons, including complexity, compute demands, and dimensionality. Let’s take a look at a variety of hyperparameter optimization techniques, from simple to sophisticated, and how they may or may not meet your needs. Let’s begin with manual tuning.
How you might start with manual tuning
Manual tuning is the simplest of all approaches: on a case-by-case basis, you devise good values for your parameters based on intuition or past experience.
Data scientists and graduate students alike often make educated guesses or try out parameter sets found in a research paper (maybe from its GitHub repository). It’s certainly possible to guess that a larger batch size might accelerate training, or that a higher learning rate might (perhaps with luck) get you to a more accurate model faster. There are also business objectives to take into account, which can be encoded directly into the model, the metric, or the constraints.
However, for many models, there will be parameters for which there is little or no intuition: the “black box” aspect of machine learning means it’s often best to simply experiment. Some data scientists start by guessing and checking parameter selections, but this requires executing a training run, deciding what to change, and then starting a new run that incorporates the findings from the prior one. While some modelers may feel that intuition followed by iterative experimentation by hand is the only way forward, this becomes challenging as dimensionality increases. It may even be difficult to account for training failures or to track which parameter sets have been tried so far, let alone to make tradeoffs among multiple metrics, constraints, and evaluations in high dimensions. At this stage, an experimenter may ask, “Isn’t there a better way?”
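To make even this ad hoc process less error-prone, it helps to record every configuration you try. Here is a minimal sketch in Python; `train_and_evaluate` is a hypothetical placeholder with a toy surrogate metric, not a real training run:

```python
def train_and_evaluate(params):
    """Placeholder for a real training run; returns a validation metric."""
    # Toy surrogate so the sketch runs end to end; replace with real training.
    return 1.0 - abs(params["learning_rate"] - 0.003) - abs(params["batch_size"] - 64) / 1000

# Hand-picked configurations from intuition or a paper's GitHub repository.
trials = [
    {"learning_rate": 1e-3, "batch_size": 32},
    {"learning_rate": 3e-3, "batch_size": 64},
    {"learning_rate": 1e-2, "batch_size": 128},
]

# Record every run so you always know what has already been tried.
results = [(params, train_and_evaluate(params)) for params in trials]
for params, score in results:
    print(params, f"score={score:.4f}")

best_params, best_score = max(results, key=lambda r: r[1])
print("Best so far:", best_params, best_score)
```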
Exhausting all possibilities with grid search
Grid search is the exhaustive approach: you program an iterator to cover every combination of parameter values across as many dimensions as you need.
If your problem space is small and discrete, you gain great confidence from knowing that you’ve tried every option. For example, with 3 options for batch size and 4 options for the number of filters, you can iterate through all 12 combinations of this two-dimensional optimization problem.
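A minimal sketch of that twelve-combination grid in Python, again assuming a hypothetical `train_and_evaluate` stand-in for your real training run:

```python
import itertools

def train_and_evaluate(batch_size, num_filters):
    """Toy stand-in for a real training run; returns a validation score."""
    return -((batch_size - 64) ** 2 + (num_filters - 32) ** 2)

batch_sizes = [32, 64, 128]          # 3 options
filter_counts = [16, 32, 64, 128]    # 4 options

best_score, best_params = float("-inf"), None
# itertools.product enumerates all 3 * 4 = 12 combinations exhaustively.
for batch_size, num_filters in itertools.product(batch_sizes, filter_counts):
    score = train_and_evaluate(batch_size, num_filters)
    if score > best_score:
        best_score, best_params = score, (batch_size, num_filters)

print(f"Best: batch_size={best_params[0]}, num_filters={best_params[1]}")
```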
If you’re designing a trading strategy or an image segmentation algorithm, however, chances are you’ll have more than just two parameters to tune. Additionally, as your model becomes more complex, you’ll likely face the “curse of dimensionality”: the set of combinations grows exponentially with the number of parameters you are tuning, and all of a sudden you’ll have trillions of scenarios to evaluate rather than just 12. At this stage, you may be able to eliminate some of those scenarios on a case-by-case basis, especially as you develop domain-specific knowledge, but the problem can still seem overwhelming, especially as compute costs ramp up with more observations. Testing every possible scenario is often cost- or time-prohibitive.
Guess and check with random search
Random search is an approach that uses random sampling: you sample each dimension independently, providing a random spread of possible parameter values for your algorithm.
This method is especially useful for models with dozens of parameters, where no brute-force approach is feasible from a computational or wall-clock standpoint. For complex models, you’ll simply never finish optimizing if you test every last point in the parameter space. With time constraints in mind, you’ll only explore a portion of the possible outcomes, so you may want a broader (or “exploratory”) approach: random search. Randomly sampling points for each tunable parameter is one way to build intuition and improve your model at the same time. Random search is also easy to parallelize: because no sample depends on a prior sample, new random suggestions can be generated on demand for each machine in a cluster. Many experts argue that random search is almost always better than grid search (Bergstra et al., 2012).
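A minimal random search sketch under the same assumptions (`train_and_evaluate` is a toy placeholder); note that each suggestion is drawn independently, which is exactly what makes the method trivially parallelizable:

```python
import random

def train_and_evaluate(params):
    """Toy stand-in for a real training run; returns a validation score."""
    return -(params["learning_rate"] - 0.01) ** 2 - (params["dropout"] - 0.3) ** 2

def sample_params():
    # Each dimension is sampled independently, so suggestions can be
    # generated on demand for any number of parallel workers.
    return {
        "learning_rate": 10 ** random.uniform(-5, -1),  # log-uniform in [1e-5, 1e-1]
        "dropout": random.uniform(0.0, 0.5),
    }

budget = 50  # a fixed observation budget bounds compute and wall-clock cost
results = []
for _ in range(budget):
    params = sample_params()
    results.append((params, train_and_evaluate(params)))

best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, best_score)
```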
On the other hand, random search is less predictable than grid search: for higher-dimensional problems in particular, you may not allow yourself a sufficiently high observation budget and may miss a crucial region of the exploration space, effectively leaving valuable accuracy (or your metric of choice) on the table. Random search also takes no information from existing observations; when it finds an improvement, it continues to sample randomly rather than learning from its past successes. Although high-dimensional spaces lend themselves to random sampling, “thoroughly” sampling the parameter space can still be prohibitive in terms of compute resources. Many random suggestions have no logical grounds to expect an improved outcome, so both compute and wall-clock time are wasted hoping to get lucky, sometimes at great business expense: some organizations deploy thousands of virtual machines (often GPU-enabled) running for weeks at a time training models, sometimes adding up to monthly cloud service provider bills of $1M or more.
Hill climbing to find an optimal parameter set: convex optimization
Gradient-based methods typically start from an initial “guess” and move in the direction of steepest descent (or ascent, for maximization). When the function is convex and smooth, these methods enjoy strong convergence guarantees.
Convex optimization requires a parameter space that is continuous (and, for gradient methods, differentiable) so that gradient information exists everywhere. It also helps if there is a single optimum with no competing local optima, ensuring that the optimizer yields the best result. Stochastic gradient descent is a popular tool for dealing with noisy objectives or imprecise gradient evaluations. This strategy can yield useful results when optimizing the weights of a machine learning model, though it is less useful for architecture or hyperparameter search due to the often discontinuous, non-convex nature of those spaces.
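For illustration, a minimal gradient descent sketch on a toy convex quadratic (not any particular model’s loss):

```python
def loss(x):
    """Toy convex objective, minimized at x = 3."""
    return (x - 3.0) ** 2

def grad(x):
    """Analytic gradient of the toy objective."""
    return 2.0 * (x - 3.0)

x = 0.0             # initial guess
learning_rate = 0.1
for step in range(100):
    x -= learning_rate * grad(x)  # step in the direction of steepest descent

print(f"x = {x:.6f}, loss = {loss(x):.6e}")  # converges near x = 3
```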
However, modelers quickly learn that many models exhibit unpleasant behavior (such as non-smooth derivatives) or contain categorical or integer parameters not suited to convex optimization. Variations can be applied in mixed continuous-discrete spaces, but the theory of convex optimization no longer applies. Experts have developed quasi-Newton methods like BFGS, along with derivative-free strategies such as Nelder-Mead and Powell’s method, for situations where exact gradient information is unavailable or expensive. These tools still don’t address scenarios in which a global, rather than merely local, optimum must be found.
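As one concrete example, SciPy ships an implementation of Nelder-Mead, which uses only function values and no gradients; a short sketch on a toy objective with a non-smooth kink (assumes scipy is installed):

```python
import numpy as np
from scipy.optimize import minimize

def objective(x):
    """Toy objective with a non-smooth kink at its optimum."""
    return abs(x[0] - 1.0) + (x[1] + 2.0) ** 2

# Nelder-Mead is derivative-free: it only evaluates the objective itself.
result = minimize(objective, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
print(result.x, result.fun)  # approximately [1.0, -2.0] and 0.0
```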
Adding up a lot of small adjustments with evolutionary search
Evolutionary search is a method that makes minor adjustments (or “mutations”) to randomly selected parameters, discards the result if it doesn’t perform well, and uses it as the basis for new adjustments if it does.
One of the simplest optimization strategies that can benefit from prior success while still searching for better outcomes is to take a parameter set previously deemed somewhat successful and make minor adjustments to it (“evolve” it) along one or more of its degrees of freedom in order to improve its success metric (AUPRC or F1, for example). This process is repeated so that the best-performing (“fittest”) variations form populations of similar points converging to potentially different local optima. Tuning a model to apply it to a new domain (using ImageNet as a basis to detect signs of illness in radiology scans, for example) might only require minor adjustments to some parameters, and is potentially a great case for this type of search. Even if major adjustments along other parameters (or axes) are required to reach higher accuracy, searching with this strategy over more time or with greater compute resources may well lead to a better outcome, in terms of accuracy, than brute-force methods.
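A minimal (1+1)-style sketch of this mutate-and-keep-if-better loop, with a toy fitness function standing in for a real validation metric:

```python
import random

def fitness(params):
    """Toy stand-in for a validation metric (higher is better)."""
    return -(params["lr"] - 0.01) ** 2 - (params["momentum"] - 0.9) ** 2

current = {"lr": 0.1, "momentum": 0.5}  # seed configuration
current_fit = fitness(current)

for generation in range(200):
    # Mutate one randomly chosen parameter by a small multiplicative step.
    mutant = dict(current)
    key = random.choice(list(mutant))
    mutant[key] *= random.uniform(0.8, 1.25)
    mutant_fit = fitness(mutant)
    # Keep the mutant only if it improves fitness; otherwise discard it.
    if mutant_fit > current_fit:
        current, current_fit = mutant, mutant_fit

print(current, current_fit)
```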
However, evolutionary approaches may miss entire areas of the exploration space because they may iterate too closely to the starting (seed) parameter configurations. Evolutionary search has gained some credence through its use in Tesla’s Autopilot and in the NASNet architecture that underlies Google’s AutoML Vision, but not every data scientist has an entire datacenter at her disposal; both of these organizations have deployed massive compute resources to make evolutionary strategies work for their models, due to the large number of training runs needed to build and evolve the underlying configuration populations.
Bayesian optimization
Bayesian optimization applies Bayesian logic to search for new parameter sets in a way that leverages prior observations to achieve better performance (exploitation) and optimally learn about the space (exploration).
Although there are a wide variety of approaches to finding the optimal configuration for any given model, one approach works reasonably well for nearly every model up to a certain level of dimensionality: Bayesian optimization. Relying on Bayesian logic and the results of prior rounds of trial, assessment, and observation, Bayesian optimization lets you, the modeler, automatically explore the parameter space of your model within bounds you specify, while leveraging past successes to bias successive trials closer and closer to an optimal outcome, trading off between exploration and exploitation to find global optima. Bayesian optimization balances a thorough understanding of the parameter space with a stochastic, probabilistic way of “zooming in” on the regions around prior successful trials. That way you can frequently find both global and local optima, rather than getting stuck at a local optimum or a discontinuity.
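You rarely need to implement this loop yourself. As one open source illustration (not SigOpt’s own implementation), scikit-optimize’s gp_minimize runs this explore/exploit loop using a Gaussian process surrogate; the objective below is a toy stand-in for a real training run, and the sketch assumes scikit-optimize is installed:

```python
from skopt import gp_minimize

def objective(params):
    """gp_minimize minimizes, so return a loss (or a negated score)."""
    learning_rate, dropout = params
    # Toy stand-in for a real training run.
    return (learning_rate - 0.01) ** 2 + (dropout - 0.3) ** 2

result = gp_minimize(
    objective,
    dimensions=[(1e-5, 1e-1), (0.0, 0.5)],  # bounds you specify
    n_calls=30,                             # observation budget
    random_state=0,
)
print(result.x, result.fun)
```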
For the uninitiated, Bayesian optimization is, in practice, the hardest of these methods to implement. As a result, many researchers fall back to methods that are more time-consuming and more expensive, but ultimately simpler to implement and debug. For academics and data teams alike, it is challenging to get the implementation details right on the first try, across multiple dimensions of scale and within the optimization method itself. Many novices to Bayesian optimization discover that some of the most readily available tools operate too slowly, or simply crash, when faced with higher-dimensional problems or certain edge-case models.
Making optimization easy and accessible
Many data scientists attempt to solve business problems using models that have been proven to succeed on an industry-standard dataset. Yet, even with techniques like transfer learning, novel business problems arise that can’t be solved in ways previously seen in the wild. While the same model architecture may yield valuable insights into how to solve a novel problem, often there are too many values to learn and consciously adjust to meet business needs on deadline. There’s only a small chance you’ll have sufficient time or compute resources to throw evolutionary search or random search at a problem until you find a model or parameter set that performs well for you. In these cases, Bayesian optimization provides a methodical and tangibly more efficient process for finding one or multiple optima that meet the business needs of your model. In all cases, you can use SigOpt to operationalize these tuning methods.
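Here is a sketch of SigOpt’s suggestion/observation loop, based on its documented Core API pattern; the token and `train_and_evaluate` below are placeholders, so consult the current docs for exact details:

```python
from sigopt import Connection

def train_and_evaluate(assignments):
    """Placeholder: train your model with these assignments, return the metric."""
    return 0.0  # replace with a real validation score

conn = Connection(client_token="YOUR_API_TOKEN")  # placeholder token

experiment = conn.experiments().create(
    name="Example hyperparameter tuning",
    parameters=[
        dict(name="learning_rate", type="double", bounds=dict(min=1e-5, max=1e-1)),
        dict(name="batch_size", type="int", bounds=dict(min=16, max=256)),
    ],
    observation_budget=30,
)

# Suggestion/observation loop: SigOpt proposes assignments, you report results.
for _ in range(experiment.observation_budget):
    suggestion = conn.experiments(experiment.id).suggestions().create()
    value = train_and_evaluate(suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        value=value,
    )
```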
If you’re interested in trying out SigOpt, please fill out this form. If you are a modeler in academia, please check out our free SigOpt Academic Program. If you’d like to learn how SigOpt compares to alternative optimization techniques, check out this ICML workshop paper. Next time we’ll discuss some popular open source optimization packages and how they compare purely from a capability standpoint. Stay tuned!
Use SigOpt free. Sign up today.