BERT is a strong and generalizable architecture that can be transferred to a variety of NLP tasks (for more on this, see our previous post or Sebastian Ruder’s excellent analysis). But it is very, very large, which can make it very, very slow. This has led to many efforts to compress the architecture, including HuggingFace’s DistilBERT, Rasa’s pruning technique for BERT, and Utterworks’ fast-bert (among many others).
But these methods are limited in two ways. First, we don’t know how well compression works on niche datasets or application-specific tasks. Second, we don’t understand the trade-offs associated with the compression process or the architecture decisions made for the compressed model. Both limitations make it hard to apply these methods to real-world modeling problems. This leads us to two questions:
- By combining distillation and Multimetric Bayesian Optimization, can we better understand the effects of compression and architecture decisions on model performance? Do these architectural decisions (including model size) or distillation properties dominate the trade-offs?
- Can we leverage these trade-offs to find models that lend themselves well to application-specific systems (e.g., productionization, edge computing)?
The Data: SQuAD 2.0
We use SQuAD 2.0 as the question answering training dataset. SQuAD 2.0 is composed of 35 topics ranging from history to chemistry, and is split 50/50 across answerable and unanswerable questions. Because of how accuracy (exact match) is calculated, a model could guess that every question is unanswerable and still be 50% accurate. This is clearly not ideal, and we will deal with it in our optimization setting down the line.
Figure 1. This is an example of a passage and its question/answer pairs. Each question will have either a set of possible answers and their respective character positions in the passage, or will be tagged as unanswerable.
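To make the 50% pitfall concrete, here is a minimal sketch of exact-match scoring on a toy dataset. The questions, answers, and the convention of representing an unanswerable question as an empty gold-answer list are all illustrative, not the official SQuAD 2.0 evaluation script.

```python
# Toy exact-match scoring; data and the empty-list convention for
# unanswerable questions are hypothetical, not the official scorer.
def exact_match(prediction, gold_answers):
    # An unanswerable question has no gold answers; predicting the
    # empty string ("no answer") counts as a match.
    if not gold_answers:
        return prediction == ""
    return prediction in gold_answers

# Toy split: half answerable, half unanswerable.
dataset = [
    ("Who wrote Hamlet?", ["Shakespeare", "William Shakespeare"]),
    ("When did the war end?", ["1945"]),
    ("What is the moon made of?", []),   # unanswerable
    ("Who lives on Mars?", []),          # unanswerable
]

# A degenerate model that declares everything unanswerable...
predictions = ["" for _ in dataset]
score = sum(
    exact_match(p, golds) for p, (_, golds) in zip(predictions, dataset)
) / len(dataset)
print(f"exact match: {score:.0%}")  # ...still scores 50% on this split
```

This is exactly the degenerate behavior the metric threshold described below is designed to rule out.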
Compression method: Distillation
We use distillation as our compression method. The main idea behind distillation is to produce a smaller model (the student) that retains the performance of a larger model (the teacher). The student model’s loss function is a weighted average of a soft target loss, dictated by the teacher’s output softmax layer, and a hard target loss, dictated by the true labels in the dataset (your typical loss function). By including the soft target loss, the student learns to generalize the same way as the teacher, and reaches higher performance than if it were trained from scratch. For more on this, see the original paper or Ujjay Upadhyay’s post.
Figure 2. High-level of our distillation process for question answering. For more on distillation, see Intel’s overview and DistilBERT’s process.
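The weighted loss described above can be sketched in a few lines. This is a generic implementation of Hinton-style distillation in NumPy; the temperature and weighting values are illustrative defaults, not the ones used in our experiment.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-softened softmax, computed stably.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target loss: cross-entropy between the teacher's and student's
    # temperature-softened distributions, scaled by T^2 as in Hinton et al.
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * T**2
    # Hard-target loss: standard cross-entropy against the true labels.
    log_p = np.log(softmax(student_logits))
    hard = -log_p[np.arange(len(labels)), labels].mean()
    # Weighted average of the two; alpha trades soft against hard targets.
    return alpha * soft + (1 - alpha) * hard
```

With `alpha=0` this reduces to ordinary supervised training; with `alpha=1` the student learns purely from the teacher’s softened outputs.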
For our distillation process, the teacher model is BERT pre-trained on the Toronto Book Corpus and English Wikipedia, and fine-tuned on SQuAD 2.0. We conduct a Neural Architecture Search (NAS) using Multimetric Bayesian Optimization to find an optimal set of architectures for the student model.
Figure 3. Our student model’s architecture is determined by SigOpt’s Multimetric Bayesian Optimization and its weights are seeded by pre-trained DistilBERT. Parameters not seen in the pre-trained model are initialized according to DistilBERT’s initialization method. 
Our baseline uses the DistilBERT architecture as the student model, with BERT pre-trained for question answering as the teacher. The baseline uses DistilBERT’s default hyperparameter settings. The student baseline model reaches 67.07% accuracy with 66.3 M parameters after training for 3 epochs.
Multimetric Bayesian Optimization
We use SigOpt’s Multimetric Bayesian Optimization to optimize our two competing metrics (accuracy and size) and understand the trade-offs in the distillation process. We concurrently optimize the student model’s architecture, SGD hyperparameters, and distillation parameters. By optimizing these sets of parameters across our metrics, we hope to find a set of Pareto efficient architectures that perform strongly on question answering and are smaller than the baseline.
Figure 4. Diagram of optimizing the distillation process and student model architecture.
Because a model can reach 50% accuracy simply by guessing every question is unanswerable, we set a metric threshold on exact score (accuracy) to deprioritize parameter configurations that fall below 50%. We also handle parameter and architecture configurations that lead to CUDA memory issues by marking them as metric failures during runtime. This lets us keep the parameter space open and rely on the optimizer to learn and avoid infeasible regions.
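An experiment of this shape might be configured roughly as follows with the SigOpt Python client. This is a hedged sketch: the parameter names, bounds, and budget are illustrative stand-ins, not our experiment’s actual search space.

```python
# Hypothetical SigOpt Multimetric experiment configuration; parameter
# names, bounds, and budget are illustrative, not the post's exact space.
experiment_meta = dict(
    name="distilbert-nas-distillation",
    parameters=[
        dict(name="n_layers", type="int", bounds=dict(min=2, max=8)),
        dict(name="n_heads", type="int", bounds=dict(min=2, max=12)),
        dict(name="learning_rate", type="double",
             bounds=dict(min=1e-5, max=1e-3)),
        dict(name="alpha_distil", type="double", bounds=dict(min=0.0, max=1.0)),
        dict(name="temperature", type="double", bounds=dict(min=1.0, max=10.0)),
    ],
    metrics=[
        # Metric threshold: deprioritize configurations below 50% exact match.
        dict(name="exact_match", objective="maximize", threshold=50.0),
        dict(name="num_parameters", objective="minimize"),
    ],
    observation_budget=100,
)

# With the real client, the optimization loop would look roughly like:
#   conn = sigopt.Connection(client_token=...)
#   experiment = conn.experiments().create(**experiment_meta)
#   suggestion = conn.experiments(experiment.id).suggestions().create()
#   try:
#       values = train_and_evaluate(suggestion.assignments)  # our training code
#       conn.experiments(experiment.id).observations().create(
#           suggestion=suggestion.id, values=values)
#   except RuntimeError:  # e.g. CUDA out of memory
#       conn.experiments(experiment.id).observations().create(
#           suggestion=suggestion.id, failed=True)  # mark as metric failure
```

Reporting `failed=True` instead of crashing is what lets the optimizer learn which regions of the space are infeasible.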
Let’s take a look at our results from the Multimetric Optimization experiment.
Figure 5. The Pareto frontier resulting from the Multimetric Optimization experiment. Each dot is an optimization run (one distillation cycle). The yellow dots each represent a Pareto optimal model configuration. The pink dot is the baseline described above. The pale blue dots represent student models whose exact score fell below the 50% metric threshold set for the experiment.
As we were hoping, our multimetric experiment produces many Pareto optimal model configurations, depicted by the yellow dots in Figure 5. As seen in the table below, we are able to increase the performance of our baseline model by ~3.45% with only a 0.09% increase in model size. On the flip side, we are able to shrink our model by ~22.5% in parameters with only a 0.25% dip in model performance!
Most importantly, the optimization yields a large set of strong-performing model configurations to choose from (instead of relying on a single architecture). This lets the modeler select the trade-offs that make the most sense for their application, without needing to know in advance what was even possible.
| Scenario | Frontier point: size | Frontier point: exact score | Size diff from baseline | Exact score diff from baseline | Link to run |
| --- | --- | --- | --- | --- | --- |
| Baseline | 66.3 M | 67.07% | - | - | Baseline training curves |
| Retain model size and maximize performance | 66.36 M | 70.55% | +0.09% params | +3.45% | Model training curves |
| Minimize model size and maximize performance | 65.18 M | 70.26% | -1.69% params | +3.19% | Model training curves |
| Minimize model size and retain performance | 51.40 M | 66.82% | -22.47% params | -0.25% | Model training curves |
Table 1. Selection of points from the frontier and comparisons to the baseline (top row).
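Extracting a frontier like this from a set of runs is a simple dominance check: a run is Pareto optimal if no other run is both at least as small and at least as accurate. The sketch below uses the frontier points from Table 1 plus two made-up dominated runs.

```python
# Minimal Pareto-frontier filter over (size_in_millions, exact_score) pairs.
def pareto_frontier(runs):
    """Keep runs not dominated by any other run, i.e. no other run has
    size <= this one's AND score >= this one's (other than itself)."""
    frontier = []
    for size, score in runs:
        dominated = any(
            s <= size and a >= score and (s, a) != (size, score)
            for s, a in runs
        )
        if not dominated:
            frontier.append((size, score))
    return sorted(frontier)

# Table 1's frontier points, plus two hypothetical dominated runs.
runs = [(66.3, 67.07), (66.36, 70.55), (65.18, 70.26),
        (51.40, 66.82), (60.0, 60.0), (70.0, 68.0)]
print(pareto_frontier(runs))
# → [(51.4, 66.82), (65.18, 70.26), (66.36, 70.55)]
```

Note that the baseline (66.3 M, 67.07%) drops out: it is dominated by the 65.18 M / 70.26% configuration, which is both smaller and more accurate.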
Why does this matter?
Building small, strong-performing models for NLP is hard. Although there is great work currently being done to develop such models, it is not always easy to understand the decisions and trade-offs made to arrive at them. In our experiment, we leverage distillation to effectively compress BERT, and Multimetric Bayesian Optimization paired with metric management to intelligently find the right architecture and hyperparameters for our compressed model. By combining these two methods, we are able to more thoroughly understand the effects of distillation and architecture decisions on the compressed model’s performance. Furthermore, we gain intuition about our hyperparameter and architectural parameter space, which will help us make informed decisions in future related work. Lastly, from our Pareto frontier, we can practically assess the trade-offs between model size and performance and choose a compressed model architecture that best fits our needs.
To recreate or repurpose this work please use this repo. Model checkpoints for the models in the results table can be found here. The AMI used to run this code can be found here. To play around with the SigOpt dashboard and analyze results for yourself, take a look at the experiment. If you’d like to learn more about this experiment, watch the webinar.
 J. Devlin, M. Chang, K. Lee, K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/pdf/1810.04805.pdf
 G. Hinton, O. Vinyals, J. Dean. Distilling the Knowledge in a Neural Network. https://arxiv.org/abs/1503.02531
 P. Rajpurkar, R. Jia, P. Liang. Know What You Don’t Know: Unanswerable Questions for SQuAD. https://arxiv.org/abs/1806.03822
 V. Sanh, L. Debut, J. Chaumond, T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. https://arxiv.org/pdf/1910.01108.pdf
 A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin. Attention is All You Need. https://arxiv.org/pdf/1706.03762.pdf