ICYMI Recap: Lessons from using SigOpt to weigh tradeoffs for BERT size and accuracy

Tiffany Huynh
Deep Learning, Experiment Management, Natural Language

BERT is high performing and has a generalizable architecture that can be used in a variety of NLP tasks. But BERT is a very large model, expensive to train and too complex for many production systems.

In this webinar, Machine Learning Engineer Meghana Ravikumar gave a quick overview of how she set up a Bayesian Optimization experiment to explore tradeoffs between size and accuracy for BERT. Building on that, she explained how she used SigOpt to organize her modeling process, walked through critical points of her modeling workflow, and showed how she leveraged SigOpt to draw specific conclusions about reducing model size while preserving model performance.

Here are a few highlights from the discussion. Navigate to watch any segment that you missed:

  • Recap of the BERT compression project (0:55)
  • Using distillation as the compression technique and performing architecture search during the optimization process to define the student and teacher models (1:47)
  • Multimetric Bayesian Optimization is used to optimize for two competing metrics to reduce the model size while increasing the accuracy (2:14)
  • Setting up a baseline to scope out an experimental design to achieve a high performing model for the given dataset (3:36)
  • Baseline #1: Training from scratch resulted in 4% accuracy, and led to the conclusion that using BERT pre-trained on SQuAD 2.0 as the teacher model was not a good idea (5:31)
  • Baseline #2: Seeding the model with pre-trained weights resulted in 67% accuracy. The hypotheses were validated, the run showed the properties of a healthy training run, and the experiment was judged feasible (6:34)
  • Running Multimetric Bayesian optimization to understand the problem space with the three types of parameters to search (9:30)
  • Experiment dashboard with hyperparameter tuning results (10:53)
  • Models of all sizes reached at least 40% accuracy, and there are preferred areas of correlation between the parameters that lead to higher model performance and lower model size (11:35)
  • Correlations in parameter space helped identify a handful of parameters that influence model performance (12:38)
  • Exploring specific parameter areas and filtering runs (13:11)
  • Providing feedback to the optimizer by setting a metric threshold of 50% accuracy, to steer it away from configurations where the model guesses randomly and performs poorly (15:58)
  • Monitoring the full experiment in real-time helped identify that something was wrong with the training cycles because none of the model configurations passed the 50% threshold (17:13)
  • How did Experiment Management help through the process? It supported model development, understanding the problem space, and monitoring long training cycles (19:38)
  • Experiment Management enables modelers to validate ideas quickly and efficiently, and to decide what to do at every step of the way (20:10)
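
The multimetric setup and the 50% accuracy threshold mentioned in the highlights can be illustrated with a small sketch. This is not SigOpt's API or Meghana's code; it is a plain-Python illustration of the underlying idea, assuming each finished run is summarized by an accuracy and a parameter count. The `Run` type, the `pareto_frontier` helper, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str         # hypothetical label for one model configuration
    accuracy: float   # fraction correct, higher is better
    num_params: int   # model size, lower is better


def pareto_frontier(runs, min_accuracy=0.5):
    """Return runs that meet the accuracy threshold and are not dominated.

    A run is dominated if another feasible run is at least as accurate and
    at least as small, and strictly better on one of the two metrics.
    """
    feasible = [r for r in runs if r.accuracy >= min_accuracy]
    frontier = []
    for r in feasible:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.num_params <= r.num_params
            and (o.accuracy > r.accuracy or o.num_params < r.num_params)
            for o in feasible
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.num_params)


# Made-up values for illustration only, not results from the webinar.
runs = [
    Run("a", 0.67, 66_000_000),
    Run("b", 0.62, 50_000_000),
    Run("c", 0.45, 30_000_000),  # filtered out: below the 50% threshold
    Run("d", 0.60, 55_000_000),  # dominated by "b" (smaller and more accurate)
    Run("e", 0.70, 80_000_000),
]

print([r.name for r in pareto_frontier(runs)])
```

The runs that remain are the ones where shrinking the model further would cost accuracy, which is the kind of tradeoff curve a multimetric experiment is meant to surface.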

If you are interested in learning about Meghana’s work with BERT, you can read more about it in this three-part series. The first post provides more context on BERT and the distillation techniques that are used. In the second post, she explains the experimental design for Multimetric Bayesian Optimization. The third post covers the results from her analysis, which include reducing BERT's size by over 20% while still retaining its accuracy.
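
For readers new to distillation, here is a minimal sketch of one common formulation of the loss: the student is trained to match the teacher's temperature-softened output distribution and, at the same time, the true label. This is an assumption about the general technique, not necessarily the exact loss used in the series; the function names and the `temperature` and `alpha` parameters are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term."""
    # Soft term: KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so its magnitude stays comparable across temperatures.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    ) * temperature ** 2

    # Hard term: cross-entropy of the student against the true label.
    hard = -math.log(softmax(student_logits)[label])

    return alpha * soft + (1 - alpha) * hard


# Toy logits for a 3-class problem, for illustration only.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.5], label=2))
```

A higher `alpha` leans more on the teacher's soft targets, while a lower one leans more on the ground-truth labels.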

Meghana mentioned throughout her presentation how Experiment Management helped her organize her modeling workflow and enabled her to make informed decisions every step of the way. Sign up here for free access to the private beta for our solution. If you want to receive the newest updates from our blog, feel free to subscribe here.

Tiffany Huynh Field Marketing