BERT performs well across a wide range of NLP tasks, but it is also very large. Its size can make it impractical for modeling teams to use in constrained environments. In response, compression is increasingly important for anyone intending to use BERT. But most compression techniques offer little practical guidance on the tradeoff between model size and model performance.
In this technical case study, SigOpt ML Engineer Meghana Ravikumar addresses this shortcoming by applying Multimetric Bayesian Optimization to distill BERT for question answering using SQuAD 2.0. In yesterday’s presentation, she shared the results of her experiment, including these primary takeaways:
- Using SigOpt Multimetric Bayesian Optimization, Meghana uncovered a configuration of BERT that was 22% smaller than the baseline model (by number of parameters) while retaining a similar level of accuracy (~67% exact match)
- Meghana tracked and visualized her runs throughout the process with SigOpt Experiment Management, which made it quicker to establish a viable baseline and easier to develop intuition about the model’s behavior
- Experiments like this have a wide variety of practical implications for using BERT in real-world modeling tasks
And here is a more specific summary of the presentation. Click through to view any segment you missed:
- Background on BERT, various distillation techniques and the two primary goals of this particular use case – understanding tradeoffs in size and performance for BERT (0:48)
- Overview of the experiment design, which applies SigOpt Multimetric Bayesian Optimization to tune a distillation of BERT for SQuAD 2.0 question answering tasks and tracks progress through training and tuning with SigOpt Experiment Management (2:08)
- Deeper explanation of distillation in the context of NLP and BERT, how it is used to train a smaller student model from a larger teacher, and the setup for the hyperparameter optimization process (3:30)
- Process for defining the student model and creating a baseline for the hyperparameter optimization experiment, with baseline values of 67.07% exact match as the measurement of accuracy and 66.3M parameters as the measurement of size (5:25)
- How SigOpt automates the hyperparameter optimization process and automates Multimetric Bayesian Optimization more specifically to evaluate these competing metrics for size and accuracy (6:49)
- Establishing a Metric Threshold to focus the optimizer on the parameter space that is above 50% accuracy (8:58)
- Overview of parameters to be optimized, including training parameters, architecture parameters and distillation parameters, and the optimization loop itself (9:55)
- Cluster orchestration setup and how it is initialized with Ray Core to facilitate at-scale distribution of the tuning job in parallel (11:31)
- Review of results, including all configurations of the model trained by SigOpt through the optimization run as it traded off exploration and exploitation (11:58)
- Analysis of specific optimal points on this Pareto Frontier of results that were displayed in the SigOpt dashboard at the end of the optimization run, including a model configuration that retains accuracy and reduces model size by 22.47% (12:42)
- Evaluation of the tuning job in the SigOpt dashboard, including comparisons of metrics, parameter importance, and parallel coordinates (13:58)
- Analysis of the model’s performance on specific Question Answering topics to more deeply understand model behavior (15:22)
- Summary of conclusions from this process, including the value of Multimetric Bayesian Optimization for evaluating these tradeoffs between metrics (18:47)
- In response to an audience question, discussion of the most interesting trend among the optimal architectures: the number of attention heads stayed constant while the number of layers varied (19:38)
- Discussion of how optimizing two metrics at once was handled automatically within SigOpt, without any additional effort from Meghana (20:19)
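The distillation setup described above (3:30) trains a smaller student to match the teacher’s softened output distribution in addition to the hard labels. As a rough illustration — not the code from the talk — here is a stdlib-Python sketch of the classic Hinton-style distillation loss; the temperature, alpha weighting, and T² scaling are the standard conventions:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft (teacher) term and a hard (label) term.

    - Soft term: cross-entropy between the teacher's and student's
      temperature-softened distributions, scaled by T^2 so its gradient
      magnitude stays comparable to the hard term.
    - Hard term: standard cross-entropy against the true label.
    """
    t_soft = softmax(teacher_logits, temperature)
    s_soft = softmax(student_logits, temperature)
    soft_loss = -sum(t * math.log(s) for t, s in zip(t_soft, s_soft))
    s_hard = softmax(student_logits)
    hard_loss = -math.log(s_hard[hard_label])
    return alpha * (temperature ** 2) * soft_loss + (1 - alpha) * hard_loss
```

In practice both terms are computed per batch over the student’s question-answering outputs; the sketch keeps the shape of the objective without the model code.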
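The metric threshold (8:58) and the optimization loop (9:55) fit together roughly as follows. This is a hedged sketch, not the talk’s code: the experiment definition mirrors the general shape of a SigOpt create-experiment payload with illustrative metric and parameter names, a random sampler stands in for SigOpt’s Bayesian optimizer, and a stub replaces the actual distillation training run:

```python
import random

# Illustrative two-metric experiment definition. The threshold on
# "exact" focuses the optimizer on configurations above 50% accuracy;
# metric and parameter names are assumptions, not the exact ones used.
experiment_meta = dict(
    name="distill BERT for SQuAD 2.0",
    metrics=[
        dict(name="exact", objective="maximize", threshold=50.0),
        dict(name="num_parameters", objective="minimize"),
    ],
    parameters=[
        dict(name="num_layers", type="int", bounds=dict(min=2, max=12)),
        dict(name="learning_rate", type="double", bounds=dict(min=1e-5, max=1e-3)),
    ],
    observation_budget=10,
)

def suggest(parameters):
    """Stand-in for the optimizer's suggestion step: sample each
    parameter uniformly. SigOpt replaces this with a model-based
    choice that trades off exploration and exploitation."""
    out = {}
    for p in parameters:
        lo, hi = p["bounds"]["min"], p["bounds"]["max"]
        out[p["name"]] = random.randint(lo, hi) if p["type"] == "int" else random.uniform(lo, hi)
    return out

def train_and_evaluate(assignments):
    """Stub for distilling a student with these hyperparameters and
    measuring the two competing metrics (placeholder values)."""
    return {"exact": 50.0 + 20.0 * random.random(),
            "num_parameters": 40e6 + 30e6 * random.random()}

# The suggest -> train -> report loop.
observations = []
for _ in range(experiment_meta["observation_budget"]):
    assignments = suggest(experiment_meta["parameters"])
    metrics = train_and_evaluate(assignments)
    observations.append((assignments, metrics))  # reported back to the optimizer
```

Each loop iteration is one training run; in the talk these runs are distributed in parallel across a Ray cluster rather than executed sequentially.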
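The results review (12:42) centers on a Pareto frontier: the set of trained configurations that no other configuration beats on both metrics at once. A short stdlib sketch of how such a frontier is computed, with illustrative (accuracy, parameter-count) points:

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    Each point is (accuracy, size): accuracy should be high, size low.
    p dominates q if p is at least as good on both metrics and
    strictly better on at least one.
    """
    def dominates(p, q):
        no_worse = p[0] >= q[0] and p[1] <= q[1]
        better = p[0] > q[0] or p[1] < q[1]
        return no_worse and better
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Illustrative points only: a smaller configuration with near-baseline
# accuracy sits on the frontier alongside the baseline itself.
candidates = [(67.07, 66.3e6), (66.9, 51.4e6), (60.0, 70.0e6), (55.0, 60.0e6)]
frontier = pareto_frontier(candidates)
```

The optimal points discussed in the talk are exactly the frontier members, which is why a 22%-smaller model with near-baseline accuracy surfaces there.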
To recreate or repurpose this work, please use this repo. Model checkpoints for the models in the results table can be found here. The AMI used to run this code can be found here. To explore the SigOpt dashboard and analyze the results for yourself, take a look at the experiment. Below is a screenshot from the SigOpt dashboard.
You can also watch the recording or share it with your colleagues. If you’re interested in learning more, follow our blog or try our product. If you found Experiment Management particularly compelling, join the private beta to get free access.