BERT performs well on question-answering tasks, but as a model it is both large and resource-intensive. Meghana Ravikumar, ML Engineer here at SigOpt, recently published a technical use case that applies Multimetric Bayesian Optimization to evaluate the trade-off between the size and accuracy of BERT on SQuAD 2.0. (To learn more about her work, review her series of posts (1, 2, and 3), watch this webinar, or view this lightning talk at Ray Summit.) Her results, such as retaining accuracy with a model that is 22% smaller, were compelling. Let’s explore how she arrived at them.
In short, Meghana utilized a variety of SigOpt features that take some of the pain out of the messy and, at times, frustrating modeling process.
SigOpt Runs are an easy way to track any training run and its set of modeling attributes with just a few lines of code. As you utilize Runs, we populate a history of these runs and the metadata attached to them in our interactive dashboard. You get a history of your modeling progress every step of the way.
Meghana used Runs throughout this BERT case. Initially, she used the capability to create good initial baselines and get a quick sense of model behavior. Throughout her model training process, she used Runs to track her work so that it was available for later analysis in her SigOpt Experiment (an Experiment encapsulates the optimization, tuning, and training of a single model in our web-app dashboard).
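As a sketch of what this instrumentation captures, the snippet below builds the kind of record a Run accumulates. The hyperparameter values, metric numbers, and run name are illustrative placeholders (not figures from the BERT case), and the comments show the corresponding calls assuming the SigOpt Python client's Runs API (`sigopt.create_run` / `run.log_*`).

```python
# Illustrative record of what one training run tracks; with the real
# client, each entry maps to a call on a SigOpt Run (see comments).
run_record = {
    "name": "bert-squad2-baseline",        # sigopt.create_run(name=...)
    "parameters": {                        # run.log_parameter(name, value)
        "learning_rate": 3e-5,
        "batch_size": 16,
    },
    "metadata": {"dataset": "SQuAD 2.0"},  # run.log_metadata(key, value)
    "metrics": {},                         # run.log_metric(name, value)
}

def log_metric(record, name, value):
    # Local stand-in for run.log_metric: keep the latest value per metric.
    record["metrics"][name] = value

# After each evaluation, log the metrics you care about:
log_metric(run_record, "exact_match", 0.71)
log_metric(run_record, "f1", 0.74)
```

Because every run is logged the same way, the dashboard can later compare baselines against tuned models without any extra bookkeeping on your side.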
Learn more: Docs
Metric Strategy is a component of the SigOpt API that lets you choose, for each metric, whether to store it, optimize it, or set a Metric Threshold or Constraint. In this manner you can record metrics about your model’s performance for later use in additional experiments: this flexibility is controlled at the API level and can also be modified in the web-app dashboard.
In this use case, Meghana uses Metric Thresholds to avoid problematic model architecture configurations, and leverages Metric Storage to track and gain insights on optimized and unoptimized metrics during model training and tuning.
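A minimal sketch of what this looks like at experiment-creation time, assuming SigOpt's metric `strategy` and `threshold` fields; the metric names and the threshold value below are illustrative, not taken from the case study.

```python
# A sketch of a metric list combining Metric Strategy and a threshold.
# Names and values are illustrative placeholders.
metrics = [
    # Optimized metric with a threshold: configurations below 0.70 F1
    # are treated as unacceptable by the optimizer.
    dict(name="f1", objective="maximize",
         strategy="optimize", threshold=0.70),
    # Stored metric: tracked for later analysis, but not optimized.
    dict(name="inference_time_ms", objective="minimize",
         strategy="store"),
]

# With the Python client this list would be passed along the lines of:
#   conn.experiments().create(name="bert-tuning", metrics=metrics, ...)
```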
Mixed Parameter Spaces (Categorical + Integer)
In many scenarios, your model won’t have just one type of parameter. Many models include a learning rate (a floating-point number, or “float”), a number of layers (typically an integer), and a categorical parameter, which might indicate whether the model should apply an optimization strategy such as “Adam” or “SGD.” SigOpt handles categorical, integer, and floating-point parameters alike, so it is equally at home with models that incorporate all three parameter types.
Meghana leverages mixed parameter spaces to effectively conduct a neural architecture search to find a set of optimal architectures.
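To make the idea concrete, here is a sketch of a mixed parameter space in the form SigOpt experiment creation accepts; the bounds and categorical choices are illustrative examples, not the actual search space from the case.

```python
# A sketch of a mixed parameter space: one double, one int, one
# categorical parameter. Bounds and choices are illustrative.
parameters = [
    dict(name="learning_rate", type="double",
         bounds=dict(min=1e-5, max=1e-3)),
    dict(name="num_layers", type="int",
         bounds=dict(min=2, max=12)),
    dict(name="optimizer", type="categorical",
         categorical_values=["Adam", "SGD"]),
]

# Passed at creation time, e.g.:
#   conn.experiments().create(name="nas-example", parameters=parameters, ...)
```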
Many business problems require the nuance inherent in more than one success metric. In these scenarios, you can judge your model on up to two metrics, giving you a trade-off (Pareto) frontier. Using this frontier, you can then select the model with just the right compromise between your two optimized metrics. For example, one customer’s Multimetric experiments trade off inference time against accuracy, while another customer minimizes risk while maximizing profit.
Multimetric Optimization is the core strategy of Meghana’s use case. Using it, Meghana was able to simultaneously optimize two competing metrics and understand the trade-offs involved in optimizing her model’s architecture.
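A Multimetric experiment is declared simply by listing two metrics with opposing objectives. The sketch below shows the shape of such a definition; the experiment name, metric names, and budget are illustrative assumptions, not Meghana's actual configuration.

```python
# A sketch of a two-metric (Multimetric) experiment: maximize accuracy
# while minimizing model size, yielding a Pareto frontier of models.
# All names and numbers here are illustrative.
experiment_meta = dict(
    name="bert-size-vs-accuracy",
    metrics=[
        dict(name="f1", objective="maximize"),
        dict(name="num_parameters", objective="minimize"),
    ],
    observation_budget=200,  # total configurations to evaluate
)

# conn.experiments().create(**experiment_meta)  # requires an API token
```

Rather than returning one "best" model, the optimizer surfaces the frontier of non-dominated configurations, from which you pick the compromise that fits your deployment constraints.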
Learn more: Docs
While Metric Thresholds allow you to constrain SigOpt’s optimizer to results above or below a certain value, using this capability together with Multimetric Optimization lets you approximate optimizing three or more metrics at once: apply Multimetric Optimization to two metrics, then use a threshold to constrain the third.
Meghana uses Multimetric Thresholds to avoid suboptimal model configurations and to feed her expert knowledge back to the optimizer.
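In practice this means attaching a `threshold` to each optimized metric, which steers the optimizer away from corners of the frontier you already know are useless (very accurate but enormous, or tiny but inaccurate). The metric names and threshold values below are illustrative assumptions, not the case study's actual numbers.

```python
# A sketch of Multimetric Thresholds: both optimized metrics carry a
# threshold encoding domain knowledge. Values are illustrative.
metrics = [
    # Don't bother with models below 0.68 F1...
    dict(name="f1", objective="maximize", threshold=0.68),
    # ...or above ~110M parameters (roughly BERT-base scale).
    dict(name="num_parameters", objective="minimize", threshold=1.1e8),
]

# As before, this list goes into conn.experiments().create(metrics=metrics, ...)
```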
Learn more: Docs
Reporting observation failures lets SigOpt safely ignore erroneous results in your optimization loop, whether your model fails to converge or a machine goes down. While an observation failure may be a data point of “last resort,” it can be invaluable in helping you avoid wasted training cycles, which may be long-running or expensive, depending on your model type and use case.
The amount of GPU memory is a hard constraint in Meghana’s experiment. Instead of narrowing her architecture parameter ranges, she uses Failed Observations to mark, in real time, architectures that are too big to fit in memory.
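The pattern looks roughly like the sketch below: catch the out-of-memory error and report the observation as failed instead of inventing a metric value. The `evaluate` helper, the stub training function, and the layer cutoff are all hypothetical; the returned dicts mirror the payload of `conn.experiments(id).observations().create(...)` in SigOpt's classic API.

```python
# A sketch of reporting a Failed Observation when a suggested
# architecture exceeds GPU memory. Everything concrete is illustrative.
def evaluate(suggestion_id, assignments, train):
    try:
        f1 = train(assignments)
        return dict(suggestion=suggestion_id,
                    values=[dict(name="f1", value=f1)])
    except MemoryError:  # stand-in for a CUDA out-of-memory error
        # Report failure so the optimizer learns to avoid this region
        # without polluting the metric values it models.
        return dict(suggestion=suggestion_id, failed=True)

# Hypothetical training function that OOMs on oversized models.
def train_stub(assignments):
    if assignments["num_layers"] > 10:
        raise MemoryError("model does not fit on the GPU")
    return 0.72

ok = evaluate("s-1", {"num_layers": 6}, train_stub)
bad = evaluate("s-2", {"num_layers": 12}, train_stub)
# Each dict would then be sent via observations().create(**result).
```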
Learn more: Docs
Parallelism helps you train and tune your models faster. SigOpt serves distinct, new parameter sets across a cluster of machines used to train your model, so you can train differently configured versions of your model at the same time while leaving no workers idle. As soon as an observation comes in from a training run, SigOpt is ready with a new suggestion to keep all of your infrastructure maximally engaged.
This use case was run in parallel across 20 AWS EC2 instances. SigOpt makes it seamless to set up parallelization, effectively scheduling and scaling training runs during the automated hyperparameter tuning job, which made it easy to use parallel instances to cut the wall-clock time needed to tune this expensive model.
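Setting this up amounts to declaring the number of concurrent workers at experiment creation and running the same suggestion/observation loop on every machine. The sketch below shows that shape; the experiment name and budget are illustrative, and the commented loop uses SigOpt's classic suggestion/observation API with a hypothetical `train_and_eval` function.

```python
# A sketch of the parallel setup: declare parallel_bandwidth once,
# then run an identical worker loop on each of the 20 instances.
# Names and numbers are illustrative.
experiment_meta = dict(
    name="bert-architecture-search",
    parallel_bandwidth=20,   # up to 20 open suggestions at once
    observation_budget=200,
)

# Each EC2 instance would then loop independently, e.g.:
#   while experiment.progress.observation_count < experiment.observation_budget:
#       suggestion = conn.experiments(experiment.id).suggestions().create()
#       value = train_and_eval(suggestion.assignments)  # hypothetical
#       conn.experiments(experiment.id).observations().create(
#           suggestion=suggestion.id,
#           values=[dict(name="f1", value=value)],
#       )
#       experiment = conn.experiments(experiment.id).fetch()
```

Because each worker asks for its own suggestion and reports back independently, no coordination code is needed beyond sharing the experiment ID.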
Try These Features Yourself
If you are developing a machine learning or deep learning model, you could likely benefit from a combination of these features. They are easy to integrate—it usually takes users less than an hour to get up and running with SigOpt—so you can get started with us today.
For a limited time, we are offering access to a private beta of new SigOpt functionality. Take advantage of this offer by signing up here.