Efficient BERT with Multimetric Optimization, part 3

Meghana Ravikumar

This is the third post in this series about distilling BERT with Multimetric Bayesian Optimization. In this post, we will discuss the results from our experiment. Part 1 discusses the background for the experiment and Part 2 discusses the setup for the Bayesian Optimization. This post was originally published on the NVIDIA Developer Blog.

In our previous posts, we discussed the importance of BERT for transfer learning in NLP and established the foundations of our experiment design. In this post, we go over the model performance and model size results from our experiment.

As a recap of our experiment design, we pair distillation with Multimetric Bayesian Optimization to distill BERT for question answering and assess the trade-offs between model size and performance. What we find is that we are able to compress our baseline architecture by 22% without losing model performance, and we’re able to boost our model performance by 3.5% with minimal increase in model size. Which is pretty great!

The rest of this post will dig deeper into our results and answer the following questions:

  1. By combining distillation and Multimetric Bayesian Optimization, can we better understand the effects of compression and architecture decisions on model performance? Do these architectural decisions (including model size) or distillation properties dominate the trade-offs?
  2. Can we leverage these trade-offs to find models that lend themselves well to application-specific systems (e.g., productionization, edge computing, etc.)?


Multimetric Optimization Results

By leveraging SigOpt’s Multimetric Optimization to tune our distillation process and model parameters, we’re able to effectively understand the effects of compression and architecture changes on model performance.

Let’s take a look at our results from the Multimetric Optimization experiment.

Figure 12. The Pareto frontier resulting from the Multimetric Optimization experiment. Each dot is an optimization run (or one distillation cycle). The yellow dots each represent a Pareto optimal hyperparameter configuration relative to the others. The pink dot is the baseline described above. The pale blue dots represent student models whose exact score fell below the 50% metric threshold set for the experiment.

As we were hoping, our multimetric experiment produces 24 sets of optimal model configurations, depicted by the yellow dots in Figure 12. These sets of optimal configurations strongly fill out our Pareto frontier. Each point on the frontier provides a best possible trade-off between number of parameters and exact score. Due to the metric threshold we set, the optimizer efficiently learns to avoid configurations that produce models with less than 50% exact score and focuses on exploring configurations that perform well. In fact, 80% of the configurations suggested by SigOpt were either smaller or more accurate than the baseline. Seven model configurations were both smaller and higher performing than the baseline.

As the baseline (67.07% exact and 66.3 M parameters) is not on this frontier, the baseline configuration can still be pushed to reach an optimal trade-off point. We see that the optimizer was able to find model architectures that are significantly smaller, with 96% of the frontier smaller than the baseline. We also see that the optimizer finds a significant number of configurations that perform as well as or better than the baseline.
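To make the notion of "on the frontier" concrete, here is a minimal sketch of how a Pareto frontier over the two metrics can be computed. It is not SigOpt's implementation. The first four runs reuse the numbers reported in Table 3; the fifth run, the function name `pareto_frontier`, and the way the 50% threshold is applied as a simple filter are assumptions made for illustration.

```python
from typing import List, Tuple

# (name, parameters in millions, exact score in %)
Run = Tuple[str, float, float]


def pareto_frontier(runs: List[Run], exact_threshold: float = 50.0) -> List[Run]:
    """Return the runs that no other eligible run dominates.

    A run dominates another if it is no larger AND no less accurate,
    and strictly better on at least one of the two metrics.
    Runs below the exact-score threshold are excluded up front.
    """
    eligible = [r for r in runs if r[2] >= exact_threshold]
    frontier = []
    for name, size, exact in eligible:
        dominated = any(
            o_size <= size and o_exact >= exact and (o_size < size or o_exact > exact)
            for _, o_size, o_exact in eligible
        )
        if not dominated:
            frontier.append((name, size, exact))
    # Sort from smallest to largest model for readability.
    return sorted(frontier, key=lambda r: r[1])


runs = [
    ("baseline", 66.30, 67.07),
    ("retain size, maximize performance", 66.36, 70.55),
    ("minimize size, maximize performance", 65.18, 70.26),
    ("minimize size, retain performance", 51.40, 66.82),
    ("hypothetical weak run", 40.00, 45.00),  # made up for illustration
]

frontier = pareto_frontier(runs)
for name, size, exact in frontier:
    print(f"{name}: {size:.2f} M parameters, {exact:.2f}% exact")
```

In this sketch the baseline drops out because the 65.18 M run is both smaller and more accurate, and the hypothetical weak run is removed by the threshold even though it is the smallest model.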

| Scenario | Frontier point: size | Frontier point: exact score | Size diff from baseline | Exact score diff from baseline | Link to run |
|---|---|---|---|---|---|
| Baseline | 66.3 M | 67.07% | - | - | Baseline training curves |
| Retain model size and maximize performance | 66.36 M | 70.55% | +0.09% params | +3.45% | Model training curves |
| Minimize model size and maximize performance | 65.18 M | 70.26% | -1.69% params | +3.19% | Model training curves |
| Minimize model size and retain performance | 51.40 M | 66.82% | -22.47% params | -0.25% | Model training curves |

Table 3. Selection of points from the frontier and comparisons to the baseline (top row).

As we see from the table above, we are able to increase the performance of our baseline model by ~3.50% with only a 0.09% increase in model size. On the flip side, we are able to shrink our model’s parameter count by ~22.50% with only a 0.25% dip in model performance!

Most importantly, we see that the optimization yields a large set of strong-performing model configurations that we can choose from for our needs (instead of relying on a single architecture).

Looking at the parameter space

Now that we understand the trade-offs between model size and performance, let’s take a look at which parameters heavily influence these trade-offs for each metric that was jointly optimized.

Figure 13. Parameter importance rankings for exact score (left) and number of parameters (right).

Unsurprisingly, both exact score and number of parameters are mainly influenced by the number of layers in the network. From our architecture search, we see that the architectural parameters (num layers, num attention heads, and attention dropout) heavily influence the number of parameters, but a broader range of hyperparameters affects exact score. Specifically, apart from num layers, exact score is dominated by the learning rate and the 3 dropouts. In future work, it would be interesting to dig deeper to understand why the dropouts direct the model’s performance.

Figure 14. Parallel coordinates for points on the Pareto frontier.

Figure 15. Parallel coordinates for points on the Pareto frontier that achieve >= 66% exact score.

Taking a look at the parallel coordinates graphs and parameter importance charts, we see that the model has a strong preference for the values of the number of layers, learning rate, and the 3 dropouts. When we look at the points on the frontier that perform at least as well as the baseline, we see stronger patterns emerge. Most interestingly, we find that these high performing models weight the soft target distillation loss (alpha_ce) much more heavily (at ~0.80) than the hard target distillation loss (at ~0.20). For architecture, we see that the optimal values for all dropouts are close to 0.0, the number of layers centers around 5, the number of heads centers around 11, and specific attention heads are pruned given the size.
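As a rough illustration of what this weighting means, the sketch below combines a soft target term and a hard target term in the style of Hinton et al.'s distillation loss. It is not the training code from this experiment. The name `alpha_ce` comes from the post; `alpha_hard`, `temperature`, the default values, and the example logits are assumptions. For question answering, the logits would range over candidate start (or end) token positions.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    true_index: int,
    alpha_ce: float = 0.8,     # weight on the soft target loss (value from the post)
    alpha_hard: float = 0.2,   # hypothetical name for the hard target weight
    temperature: float = 2.0,  # assumed; the post does not state the temperature
) -> float:
    # Soft target term: KL(teacher || student) on temperature-softened
    # distributions, scaled by T^2 as in Hinton et al.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(
        pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0
    ) * temperature ** 2

    # Hard target term: cross-entropy against the labeled position at T = 1.
    hard = -math.log(softmax(student_logits)[true_index])

    return alpha_ce * soft + alpha_hard * hard


student = [2.0, 0.5, -1.0]
teacher = [3.0, -0.5, -1.5]
print(distillation_loss(student, teacher, true_index=0))
```

With `alpha_ce` near 0.8, most of the gradient signal comes from matching the teacher's full output distribution rather than from the ground-truth label alone.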

Model Architecture and Convergence

Now that we understand our hyperparameter space and parameter preferences, let’s see whether our optimized models converge and learn well. Using SigOpt’s dashboard, we will analyze the architectures and training runs of the best performing model and the smallest model (rows 2 and 4 in Table 3).

Best-performing Model

As we see below, the best performing model retains both the number of layers and number of attention heads from the baseline. Unlike the baseline, it drops the dropouts to 0 or close to it, uses a significant number of steps to warm up, and heavily weights the soft target loss.

Figure 16. Best performing model loss and accuracy curves.

| Parameter Name | Parameter Value |
|---|---|
| Weight for Soft Targets | 0.85 |
| Weight for Hard Targets | 0.15 |
| Attention Dropout | 0 |
| Learning Rate | 3.8e-5 |
| Number of MultiAttention Heads | 12 |
| Number of layers | 6 |
| Pruning Seed | 42 |
| QA Dropout | 0.04 |
| Warm up steps | 52 |

Table 4. Model configurations for the best model.

Smallest model

The smallest model uses 4 layers of Transformer blocks and prunes a single attention head from each one. Much like the best performing model, it also weights the soft target loss over the hard target loss and uses little to no dropout. Interestingly, it uses the highest possible number of warm up steps in its learning strategy.

Figure 17. Training curves for the smallest model that performs as well as the baseline.

| Parameter Name | Parameter Value |
|---|---|
| Weight for Soft Targets | 0.83 |
| Weight for Hard Targets | 0.17 |
| Attention Dropout | 0 |
| Learning Rate | 5e-5 |
| Number of MultiAttention Heads | 11 |
| Number of layers | 4 |
| Pruning Seed | 92 |
| QA Dropout | 0.06 |
| Warm up steps | 100 |

Table 5. Model configurations for the smallest model.
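To see how the number of layers and attention heads translates into model size, the following sketch counts parameters for a DistilBERT-style student with a question answering head. The dimensions used (hidden size 768, head size 64, feed-forward size 3072, vocabulary of 30,522 tokens, 512 positions) are the standard DistilBERT base values and are an assumption here, as is the premise that pruning a head removes its slice from the query, key, value, and output projections. The function name `count_parameters` is made up for illustration.

```python
from typing import List


def count_parameters(
    heads_per_layer: List[int],
    hidden: int = 768,         # assumed DistilBERT base hidden size
    head_dim: int = 64,        # assumed size of each attention head
    ffn: int = 3072,           # assumed feed-forward inner size
    vocab: int = 30522,        # assumed vocabulary size
    max_positions: int = 512,  # assumed maximum sequence length
) -> int:
    """Count weights and biases, one list entry per Transformer layer."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = vocab * hidden + max_positions * hidden + layer_norm

    total = embeddings
    for heads in heads_per_layer:
        inner = heads * head_dim
        # Query, key, and value projections shrink when heads are pruned.
        qkv = 3 * (hidden * inner + inner)
        # The output projection loses the matching input rows.
        out = inner * hidden + hidden
        feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
        total += qkv + out + feed_forward + 2 * layer_norm

    qa_head = hidden * 2 + 2  # start and end logits
    return total + qa_head


best = count_parameters([12] * 6)      # 6 layers, 12 heads each
smallest = count_parameters([11] * 4)  # 4 layers, one head pruned per layer
print(f"best performing: {best / 1e6:.2f} M")
print(f"smallest: {smallest / 1e6:.2f} M")
```

Under these assumptions the two counts round to 66.36 M and 51.40 M, which agrees with the sizes reported for the two models. Each removed layer saves roughly 7.1 M parameters, while each pruned head saves only about 0.2 M, which is consistent with the number of layers dominating model size.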

Much like the baseline, both models converge and aptly learn the distinction between answerable and unanswerable questions. Both models seem to learn similarly, as they have similar performance and loss curves. The main difference is in how well they are able to correctly classify and answer answerable questions. While both models initially struggle and dip in performance for “HasAns_exact”, the larger model is able to quickly learn and start recognizing answerable questions. The smaller model begins to climb out of the dip but quickly stagnates in its ability to recognize answerable questions. The best performing model might be able to learn these more complex patterns because it is a larger model in terms of the number of layers and attention heads.

By using SigOpt’s Multimetric Optimization to tune our distillation process, we are able to understand the trade-offs between model size and performance, get insights on our hyperparameter space, and validate our training runs. Beyond understanding our trade-offs, we are able to make informed decisions for our model architecture, hyperparameters, and distillation process that suit our own modeling workflows and production systems. At the end of the process, we were able to identify different configurations that made optimal trade-offs between size and accuracy, including 7 that were better than the baseline on both metrics.

Analyzing our Best Performing Model

Our best performing model (from the results above) achieves 70.55% on Exact score. Let’s take a closer look to see how well our model answers questions.

To evaluate the model performance, we will look at how accurately it classifies unanswerable questions, how accurately it answers answerable questions, and why it fails to do either. The model’s accurate classifications follow the Exact score guidelines, which expect the predicted string to exactly match the correct answer. The model’s misclassifications are split into 4 categories:

  • Mostly correct: answer predictions that are almost the right string, give or take a punctuation or preposition
  • Mostly wrong: predictions that are very unlike the true answer
  • Label has answer: predictions that claim questions are unanswerable when the questions are answerable
  • Label has no answer: predictions that answer questions with no answers

This stratification will help us understand where the model falters and how we can improve model performance in future work. 
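One way such a stratification could be implemented is sketched below. The post does not specify how "mostly correct" is separated from "mostly wrong", so the token-overlap F1 score and the 0.5 cutoff are assumptions, as are the function names and the convention that an empty prediction means the model declared the question unanswerable.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, truth: str) -> float:
    """Token-level F1 between a prediction and one gold answer."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def categorize(prediction: str, gold_answers: List[str], f1_cutoff: float = 0.5) -> str:
    """Bucket one prediction. An empty gold list means the question is unanswerable."""
    if not gold_answers:
        return "correct" if prediction == "" else "label has no answer"
    if prediction == "":
        return "label has answer"
    if any(prediction.strip() == gold.strip() for gold in gold_answers):
        return "correct"
    best = max(token_f1(prediction, gold) for gold in gold_answers)
    # Hypothetical rule: enough token overlap counts as "mostly correct".
    return "mostly correct" if best >= f1_cutoff else "mostly wrong"


print(categorize("in Warsaw.", ["Warsaw"]))
print(categorize("Warsaw", []))
```

Counting the labels returned by `categorize` per topic would give the kind of per-topic breakdown discussed in this section.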

Taking a look at the broader buckets of correct and incorrect predictions, the model overwhelmingly predicts the right answer for all the topics in SQuAD 2.0, with some topics such as the Yuan Dynasty, Warsaw, and Victoria, Australia having more incorrect answers than most.

Figure 18. Percent of questions accurately and inaccurately answered per topic for 10 topics in SQuAD 2.0.

When we break the incorrect answers up into the 4 categories, we see that most of the incorrect predictions come from the model predicting answers for unanswerable questions. Furthermore, the fewest incorrect answers come from the model answering questions with completely wrong information. Topics such as the Yuan Dynasty, Warsaw, and Steam engine continue to confuse the model.

Figure 19. Incorrectly answered questions by groups per topic for 10 topics.

In a future post, we will further explore the model’s behavior and delve deeper into each misclassification category and challenging topics for the model.

Why does this matter?

Building small and strong performing models for NLP is hard. Although there is great work currently being done to develop such models, it is not always easy to understand how the decisions and trade-offs were made to arrive at these models. In our experiment, we leverage distillation to effectively compress BERT, and multimetric Bayesian optimization paired with metric management to intelligently find the right architecture and hyperparameters for our compressed model.  By combining these two methods, we are able to more thoroughly understand the effects of distillation and architecture decisions on the compressed model’s performance. Furthermore, we are able to gain intuition on our hyperparameter and architectural parameter space, which will help us make informed decisions in future related work. Lastly, from our Pareto frontier, we’re able to practically assess the trade-offs between model size and performance and choose a compressed model architecture that best fits our needs. 


To recreate or repurpose this work, please use this repo. Model checkpoints for the models in the results table can be found here. The AMI used to run this code can be found here. To play around with the SigOpt dashboard and analyze results for yourself, take a look at the experiment. If you’d like to learn more about this experiment, watch the webinar, and to learn more about the importance of experiment management in NLP, read our blog.

To use SigOpt, sign up for our free beta to get started and follow our blog to stay up to date.


Thank you to Adesoji Adeshina, Austin Doupnik, Scott Clark, Nick Payton, Nicki Vance, and Michael McCourt for their thoughts and input.


