This is the first post in this series about distilling BERT with Multimetric Bayesian Optimization. Part 2 discusses the setup for the Bayesian experiment, and Part 3 discusses the results of this experiment. This post was originally published on the NVIDIA Developer Blog.
We’ve all heard of BERT: Ernie’s partner in crime. Just kidding. I mean the Natural Language Processing (NLP) architecture developed by Google in 2018 (much less exciting, I know). But much like the beloved Sesame Street characters who help children learn the alphabet, BERT helps models learn language. Based on Vaswani et al.’s Transformer architecture, BERT leverages Transformer blocks to create a malleable architecture suitable for transfer learning.
Before BERT, each core NLP task (language generation, language understanding, neural machine translation, entity recognition, etc.) had its own architecture and corpora for training a high-performing model. With the introduction of BERT, we suddenly have a strong-performing, generalizable model that can be transferred to a variety of tasks. Essentially, BERT allows a variety of problems to share off-the-shelf pretrained models, and moves NLP closer to standardization (similar to how ResNet changed Computer Vision). For more on this, see our previous post where we talk about why transformers are important, or Sebastian Ruder’s excellent analysis on the state of Transfer Learning in NLP.
But BERT is really, really large. BERT base has 110M parameters and BERT large has 340M parameters (compared to the original ELMo model, which has ~94M parameters). This makes BERT costly to train, too complex for many production systems, and too large for federated learning and edge computing.
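To make these sizes concrete, here is a rough sketch that counts the parameters of a BERT-style encoder from its architecture hyperparameters. The defaults (30,522-token vocabulary, 512 positions, 2 segment types) are assumed from the published uncased configurations, and the function name is ours, not part of any library. It counts only the encoder and pooler, with no task-specific head.

```python
def bert_encoder_param_count(hidden=768, layers=12, intermediate=3072,
                             vocab=30522, max_positions=512, type_vocab=2):
    """Count the weights and biases in a BERT-style encoder (assumed layout)."""
    layer_norm = 2 * hidden  # one scale and one bias per hidden unit
    # Token, position, and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, followed by a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (expand, then project back), followed by a LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    # Dense layer applied to the first token's final hidden state.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_encoder_param_count()
large = bert_encoder_param_count(hidden=1024, layers=24, intermediate=4096)
print(f"BERT base:  {base:,}")   # 109,482,240
print(f"BERT large: {large:,}")  # 335,141,888
```

Under these assumptions, the counts land close to the commonly quoted 110M and 340M figures. The same arithmetic shows how quickly the count shrinks when the number of layers or the hidden size is reduced, which is the lever a student architecture pulls.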
To address this challenge, many teams have compressed BERT to make its size manageable, including HuggingFace’s DistilBERT, Rasa’s pruning technique for BERT, and Utterwork’s fast-bert (and many more). These works focus on compressing BERT for language understanding while retaining model performance.
But these approaches are limited in two ways. First, they do not tell us how well compression performs on more application-focused methods, on niche datasets, or on NLP tasks other than language understanding. Second, they are designed in a way that limits our ability to gather practical insights on the overall trade-offs between model performance and model architecture decisions. We designed this experiment to begin addressing these two limitations and to give NLP researchers additional insight into how to apply BERT to meet their practical needs.
Inspired by these works, and by DistilBERT in particular, we will explore this problem by distilling BERT for question answering. Specifically, we will pair distillation with Multimetric Bayesian Optimization. By concurrently optimizing metrics like model accuracy and number of model parameters, we will be able to distill BERT and assess the trade-offs between model size and performance. This experiment is designed to address two questions through this process:
- By combining distillation and Multimetric Bayesian Optimization, can we better understand the effects of compression and architecture decisions on model performance? Do these architectural decisions (including model size) or distillation properties dominate the trade-offs?
- Can we leverage these trade-offs to find models that lend themselves well to application-specific systems (e.g., productionalization, edge computing, etc.)?
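To illustrate what "trade-offs" means when two metrics compete, the sketch below picks out the candidates that are not beaten on both accuracy and size at once (the Pareto frontier). This is only a post-hoc filter over made-up numbers, not the Bayesian optimization itself and not any particular platform's API. The `Candidate` type and all values are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    exact_match: float  # higher is better
    num_params: int     # lower is better


def dominates(a, b):
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.exact_match >= b.exact_match and a.num_params <= b.num_params
    better = a.exact_match > b.exact_match or a.num_params < b.num_params
    return no_worse and better


def pareto_frontier(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up student models for illustration only; these are not results.
candidates = [
    Candidate("A", 70.0, 66_000_000),
    Candidate("B", 65.0, 40_000_000),
    Candidate("C", 64.0, 52_000_000),  # larger and less accurate than B
    Candidate("D", 50.0, 20_000_000),
]
frontier = pareto_frontier(candidates)
print([c.name for c in frontier])  # ['A', 'B', 'D']
```

Each model on the frontier is a defensible choice for some size budget. Model C is never the right pick, because B is both smaller and more accurate.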
Brought to you by the creators of SQuAD 1.1, SQuAD 2.0 is the current benchmark dataset for question answering. This classic question answering dataset is composed of passages and their respective question/answer pairs, where each answer can be found as a sentence fragment of the larger context. By including unanswerable questions in the dataset, SQuAD 2.0 introduces an additional layer of complexity not seen in SQuAD 1.1. Think of this as your standard reading comprehension exam (without multiple choice), where you’re given a long passage and a list of questions to answer from the passage.
Figure 1. This is an example of a passage and its question/answer pairs. Each question will have either a set of possible answers and their respective character positions in the passage, or will be tagged as unanswerable.
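As a rough sketch of that structure, the snippet below builds one toy paragraph shaped like a SQuAD 2.0 entry and reads its answers back out by character position. The passage and questions are invented for illustration. The field names (`context`, `qas`, `answers`, `answer_start`, `is_impossible`) follow the public SQuAD 2.0 JSON layout as we understand it, and `describe_questions` is a hypothetical helper.

```python
CONTEXT = "Bert and Ernie live together on Sesame Street. Bert collects paper clips."

# A toy paragraph in the assumed SQuAD 2.0 layout (not taken from the dataset).
paragraph = {
    "context": CONTEXT,
    "qas": [
        {
            "id": "toy-0",
            "question": "What does Bert collect?",
            "is_impossible": False,
            "answers": [{"text": "paper clips",
                         "answer_start": CONTEXT.index("paper clips")}],
        },
        {
            "id": "toy-1",
            "question": "What color is Ernie's car?",
            "is_impossible": True,  # the passage never mentions a car
            "answers": [],
        },
    ],
}


def describe_questions(paragraph):
    """Map each question id to its gold answer strings, or None if unanswerable."""
    context = paragraph["context"]
    described = {}
    for qa in paragraph["qas"]:
        if qa["is_impossible"]:
            described[qa["id"]] = None
            continue
        spans = []
        for answer in qa["answers"]:
            start = answer["answer_start"]
            span = context[start:start + len(answer["text"])]
            # The character offset must point at the answer text itself.
            if span != answer["text"]:
                raise ValueError(f"offset mismatch for question {qa['id']}")
            spans.append(span)
        described[qa["id"]] = spans
    return described


print(describe_questions(paragraph))
```

An answerable question reduces to a start offset and a length inside the passage, while an unanswerable one carries no span at all. That difference is why the two kinds of question place different demands on a model.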
SQuAD 2.0 is split into 35 wide-ranging and unique topics, including, for example, niche physics concepts, historical analysis of Warsaw, and the chemical properties of oxygen. Its broad range of topics makes it a good benchmark dataset to assess general question answering capabilities.
Across these 35 topics, 50.07% of the questions are unanswerable and 49.93% are answerable. Answerable questions require the model to find specific strings within the context, while unanswerable questions only require the model to classify the question as unanswerable.
Figure 2. The above image shows the split between unanswerable and answerable questions for ten topics of the dataset.
Although the addition of unanswerable questions makes the dataset more realistic, it forces the dataset to be unnaturally stratified. Essentially, a model could guess that all the questions are unanswerable and be 50% accurate. This is clearly not ideal, and we will deal with this in our optimization setting down the line.
Figure 3. The above is an example of learning curves from the SigOpt experimentation platform that indicate the model believes all questions are unanswerable. This is indicated by the 100% accuracy for the metric “NoAns_exact”, the 0% accuracy for the metric “HasAns_exact”, and the constant 50% accuracy for the metric “Exact”.
We’ll be using the BERT architecture and HuggingFace’s Transformers package and model zoo for the implementation and pre-trained models. We would not be able to conduct this experiment without these resources.
Figure 4. A Transformer block from Vaswani et al. BERT uses repeated blocks of the input portion (encoder) of the Transformer network.
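The failure mode in this figure is easy to reproduce on toy data. The sketch below scores a predictor that always answers "unanswerable" on a small, evenly split set of made-up questions. It is a simplified stand-in for the official evaluation, which also normalizes text and accepts several gold answers per question. The metric names mirror the ones in the figure, and the empty string marks "no answer" by assumption.

```python
def exact_match_report(gold_answers, predictions):
    """Both arguments map question id -> answer text; '' means unanswerable."""
    def percent_exact(question_ids):
        question_ids = list(question_ids)
        if not question_ids:
            return 0.0
        hits = sum(predictions[qid] == gold_answers[qid] for qid in question_ids)
        return 100.0 * hits / len(question_ids)

    has_answer = [qid for qid, text in gold_answers.items() if text != ""]
    no_answer = [qid for qid, text in gold_answers.items() if text == ""]
    return {
        "Exact": percent_exact(gold_answers),
        "HasAns_exact": percent_exact(has_answer),
        "NoAns_exact": percent_exact(no_answer),
    }


# Made-up gold labels: half answerable, half unanswerable.
gold = {"q1": "paper clips", "q2": "", "q3": "Sesame Street", "q4": ""}
always_unanswerable = {qid: "" for qid in gold}

report = exact_match_report(gold, always_unanswerable)
print(report)  # {'Exact': 50.0, 'HasAns_exact': 0.0, 'NoAns_exact': 100.0}
```

Looking at "Exact" alone hides the problem. Splitting the score by question type is what exposes a model that has collapsed to a single answer.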
Before looking into DistilBERT, let’s take a look at how distillation generally works.
Figure 5. High-level view of a distillation process.
The main idea behind distillation is to produce a smaller model (student model) that retains the performance of a larger model trained on the dataset (teacher model). Prior to the distillation process, the student model’s architecture (a smaller version of the teacher model) is chosen (e.g., the teacher model is ResNet-150 and the student model is ResNet-13). During the distillation process, the student model is trained on the same dataset (or a subset of the dataset) as the teacher model. The student model’s loss function is a weighted average of a soft target loss, dictated by the teacher’s output softmax layer, and a hard target loss, dictated by the true labels in the dataset (your typical loss function). By including the soft target loss, the student leverages the probability distributions across classes that the teacher model has learned. The student uses this information to generalize the same way as the teacher model, and to reach higher model performance than if it were trained from scratch. For more on this, see the original paper or Ujjay Upadhyay’s post.
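A minimal sketch of that weighted loss for a single training example follows. It uses plain Python instead of a deep learning framework, so the arithmetic stays visible. The weight `alpha`, the softmax `temperature`, and the temperature-squared scaling come from our reading of the original distillation paper and are assumptions here. The values are illustrative, not the settings used in this experiment.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)


def distillation_loss(student_logits, teacher_logits, true_class,
                      alpha=0.5, temperature=2.0):
    """Weighted average of a soft target loss and a hard target loss."""
    # Soft target loss: match the teacher's softened output distribution.
    soft_targets = softmax(teacher_logits, temperature)
    soft_predictions = softmax(student_logits, temperature)
    soft_loss = cross_entropy(soft_targets, soft_predictions) * temperature ** 2
    # Hard target loss: ordinary cross-entropy against the true label.
    hard_loss = -math.log(softmax(student_logits)[true_class])
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


teacher = [4.0, 1.0, 0.5]        # made-up teacher logits for three classes
student_close = [3.5, 1.2, 0.4]  # a student that roughly agrees with the teacher
student_far = [0.5, 1.0, 4.0]    # a student that disagrees

print(distillation_loss(student_close, teacher, true_class=0))
print(distillation_loss(student_far, teacher, true_class=0))
```

The student that agrees with the teacher gets the lower loss. Setting `alpha` to 0 recovers ordinary supervised training, and setting it to 1 trains on the teacher's outputs alone.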
Now that we understand distillation, let’s take a look at how distillation works for DistilBERT.
Figure 6. A high-level view of DistilBERT’s distillation process for language modeling.
In the process depicted above, distillation compresses BERT (teacher model) to DistilBERT (a strong-performing student model). Both DistilBERT and BERT are trained on the BookCorpus and English Wikipedia (great corpora for general language understanding). As with the general distillation process, the student model’s soft target loss comes from a pre-trained BERT’s output softmax layer, and the hard target loss comes from training the student model on the dataset.
In our next post, we will set up our experiment design for searching for a student architecture and understanding the trade-offs between model size and performance during distillation. This will include how we: search for our student architecture, set up our distillation process, choose the right NVIDIA GPU, and manage our orchestration.
To recreate or repurpose this work please use this repo. Model checkpoints for the models in the results table can be found here. The AMI used to run this code can be found here. To play around with the SigOpt dashboard and analyze results for yourself, take a look at the experiment. If you’d like to learn more about this experiment, watch the webinar, and to learn more about the importance of experiment management in NLP read our blog.
 J. Devlin, M. Chang, K. Lee, K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/pdf/1810.04805.pdf
 G. Hinton, O. Vinyals, J. Dean. Distilling the Knowledge in a Neural Network. https://arxiv.org/abs/1503.02531
 P. Rajpurkar, R. Jia, P. Liang. Know What You Don’t Know: Unanswerable Questions for SQuAD. https://arxiv.org/abs/1806.03822
 V. Sanh, L. Debut, J. Chaumond, T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. https://arxiv.org/pdf/1910.01108.pdf
 A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin. Attention is All You Need. https://arxiv.org/pdf/1706.03762.pdf