This is the second post in this series about distilling BERT with Multimetric Bayesian Optimization. In this post, we discuss our experiment design. Part 1 discusses the background for the experiment and Part 3 discusses the results from our experiment. This post was originally published on the NVIDIA Developer Blog.
In our previous post, we discuss the importance of the BERT architecture in making transfer learning accessible in NLP. BERT allows a variety of problems to share off-the-shelf pretrained models and moves NLP closer to standardization (similar to how ResNet changed computer vision). But BERT is very large. This makes it costly to train, too complex for many production systems, and too large for federated learning and edge computing. To address this challenge, many teams have compressed BERT to make its size manageable.
Inspired by DistilBERT, we will explore this problem by distilling BERT for question answering. Specifically, we will pair distillation with Multimetric Bayesian Optimization. By concurrently tuning metrics like model accuracy and number of model parameters, we will be able to distill BERT and assess the trade-offs between model size and performance. This experiment is designed to address two questions through this process:
- By combining distillation and Multimetric Bayesian Optimization, can we better understand the effects of compression and architecture decisions on model performance? Do these architectural decisions (including model size) or distillation properties dominate the trade-offs?
- Can we leverage these trade-offs to find models that lend themselves well to application-specific systems (e.g., productionalization, edge computing)?
In Part 1, we walk through the concepts behind the experiment. In this post, we walk through our experiment setup.
Optimizing the Distillation process
Let’s take a look at how we modify DistilBERT’s distillation process to distill directly for question answering.
Figure 7. High-level overview of our distillation process.
Like the DistilBERT process, we use a BERT teacher model, a weighted loss function, and BERT-based student model architectures. Instead of distilling for language understanding, we will distill for question-answering.
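The weighted loss mentioned above can be sketched as follows. This is a minimal, framework-free illustration in the style of Hinton et al.'s distillation loss, not the exact code used in the experiment. The function names, the temperature handling, and the use of a single logit vector (for example, the answer-start logits) are our assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_index,
                      soft_weight=0.5, hard_weight=0.5, temperature=1.0):
    """Weighted sum of a soft-target loss and a hard-target loss.

    soft_weight, hard_weight and temperature are hypothetical argument
    names; their defaults mirror the "equal weighting" baseline only.
    """
    # Soft targets: KL divergence between teacher and student
    # distributions, both softened by the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    soft *= temperature ** 2  # rescaling suggested by Hinton et al.

    # Hard targets: cross-entropy against the ground-truth position.
    hard = -math.log(softmax(student_logits)[true_index])

    return soft_weight * soft + hard_weight * hard
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the weighted hard-target loss remains. For question answering, a loss like this would be computed for both the start and end logits and then combined.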
We use two different pre-trained models from the DistilBERT model zoo: the first serves as our teacher model, and the second seeds the weights for our student model. Our teacher model is BERT pre-trained on the Toronto Book Corpus and English Wikipedia, and fine-tuned on SQuAD 2.0.
Figure 8. How our teacher model has been pre-trained and fine-tuned.
Unlike DistilBERT and general distillation, we want to understand the effects of changing the student model’s architecture on the overall distillation process. To do so, we will explore many student model architectures that are suggested to us by our Multimetric Bayesian Optimization.
Preliminary runs and DistilBERT’s analysis both indicate that weight initialization for the student network is important. To provide a warm start for our student model, we seed model weights from pre-trained DistilBERT wherever possible and train the model from these initialized weights.
Figure 9. Our student model’s architecture is determined by SigOpt’s Multimetric Bayesian Optimization and its weights are seeded by pre-trained DistilBERT. Parameters not seen in the pre-trained model are initialized according to DistilBERT’s initialization method.
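One way to read "seed wherever possible" is to copy a pre-trained tensor into the student only when a parameter with the same name and shape exists. The sketch below illustrates that reading with plain nested lists standing in for tensors. The matching rule, the function names, and the parameter names in the usage note are our assumptions, not the experiment's actual code.

```python
import copy

def shape_of(tensor):
    """Return the shape of a nested-list 'tensor' as a tuple."""
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        tensor = tensor[0] if tensor else None
    return tuple(shape)

def seed_student_weights(student_state, pretrained_state):
    """Copy pre-trained weights where name and shape both match.

    Returns the seeded state and the names of parameters that keep
    their fresh initialization (no match in the pre-trained model).
    """
    seeded = dict(student_state)
    freshly_initialized = []
    for name, value in student_state.items():
        source = pretrained_state.get(name)
        if source is not None and shape_of(source) == shape_of(value):
            seeded[name] = copy.deepcopy(source)
        else:
            freshly_initialized.append(name)
    return seeded, freshly_initialized
```

For example, a student with one more layer than the pre-trained model would inherit weights for the shared layers, while the extra layer (say, a hypothetical `layer.6.weight`) stays at its initial values. A layer whose attention projection has been shrunk by pruning would also be left at its initialization under this simple rule.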
Setting the Baseline
Our baseline uses the DistilBERT architecture as the student model for the process depicted above. The teacher model is the BERT model fine-tuned for question answering (depicted above). The baseline uses the following default hyperparameter settings from DistilBERT, which notably include low dropout probabilities, equal weighting for soft and hard targets, and a low temperature for distillation.
| Parameter Name | Parameter Value | Parameter Type |
|---|---|---|
| Adam epsilon | 9.98e-09 | Model training |
| Weight for Soft Targets | 0.5 | Distillation |
| Weight for Hard Targets | 0.5 | Distillation |
| Beta 1 | 0.9 | Model training |
| Beta 2 | 0.999 | Model training |
| Learning Rate | 5e-5 | Model training |
| Number of Multi-Attention Heads | 12 | Architecture |
| Number of layers | 6 | Architecture |
| Eval Batch Size | 8 | Model training |
| Training Batch Size | 8 | Model training |
| Warm up steps | 0 | Model training |
| Weight Decay | 9e-06 | Model training |
Table 1. Baseline parameter values.
Because there are a lot of moving parts in the distillation process, it is important to visualize and understand the baseline student model’s training curves. The baseline loss and performance curves give us a good idea of how a healthy model performs and learns.
We keep track of several metrics that help us understand and monitor the model’s performance: “HasAns_exact”, “NoAns_exact”, “exact”, “f1”, and “loss”. While “exact” and “f1” track the model’s overall performance, “HasAns_exact” and “NoAns_exact” track the model’s performance on answerable and unanswerable questions, respectively. By looking at all of these metrics, we’re able to distinguish a well-fitting model from one that classifies all questions as unanswerable.
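To make the role of these metrics concrete, here is a simplified sketch of how “exact”, “HasAns_exact”, and “NoAns_exact” relate. It is not the official SQuAD 2.0 evaluation script: the normalization is reduced to lowercasing and whitespace cleanup, and the function names and input format are assumptions made for illustration.

```python
def normalize(text):
    """Simplified normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def percentage(hits, total):
    return 100.0 * hits / total if total else 0.0

def squad_exact_metrics(predictions, references):
    """predictions: question id -> predicted answer ('' = no answer).
    references: question id -> list of gold answers ([] = unanswerable).
    """
    has_hits = has_total = no_hits = no_total = 0
    for qid, golds in references.items():
        pred = normalize(predictions.get(qid, ""))
        if golds:  # answerable question
            has_total += 1
            has_hits += int(any(pred == normalize(g) for g in golds))
        else:      # unanswerable question: correct only if we abstain
            no_total += 1
            no_hits += int(pred == "")
    return {
        "exact": percentage(has_hits + no_hits, has_total + no_total),
        "HasAns_exact": percentage(has_hits, has_total),
        "NoAns_exact": percentage(no_hits, no_total),
    }
```

On a toy set with equal numbers of answerable and unanswerable questions, a model that always abstains scores 100 on “NoAns_exact”, 0 on “HasAns_exact”, and 50 on “exact”. The overall score alone would hide that failure mode, which is why we watch the split metrics.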
Figure 10. Loss and accuracy curves for the baseline, taken from the SigOpt experiment tracking dashboard. They give us an idea of what convergence looks like for the student model.
At the end of the distillation process, the baseline student model reaches 67.07% accuracy with 66.3M parameters after training for 3 epochs.
Multimetric Optimization Design
To assess the trade-offs between the size and accuracy of the compressed student model, we optimize properties of the model architecture, common hyperparameters, and the general distillation process. Using DistilBERT’s architecture as a baseline, we effectively conduct neural architecture search (NAS): we alter the number of Transformer blocks (layers) and prune multi-attention heads to modify model size. For more on using Bayesian Optimization for NAS, see this paper by Kirthevasan Kandasamy et al. To optimize the model’s training process, we tune the Adam optimizer’s parameters, batch sizes, dropouts, initializations, and warm-up steps. To optimize the distillation process, we tune the weighted loss function described above to understand the importance of hard vs. soft targets.
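To see how the two architecture parameters move model size, the sketch below estimates the trainable parameter count of a DistilBERT-style student from its number of layers and attention heads per layer. The dimensions (vocabulary of 30,522, 512 positions, hidden size 768, feed-forward size 3,072, 64 dimensions per head) are the standard DistilBERT-base values, which this post does not state. We also assume that pruning a head removes its slices from the query, key, value, and output projections. Treat the result as an estimate rather than the experiment's exact accounting.

```python
# Assumed DistilBERT-base dimensions (not stated in this post).
VOCAB_SIZE = 30522
MAX_POSITIONS = 512
HIDDEN = 768
FFN_HIDDEN = 3072
HEAD_DIM = 64

def linear(n_in, n_out):
    """Parameters in a dense layer: weights plus biases."""
    return n_in * n_out + n_out

def layer_norm(dim):
    """Parameters in a layer norm: scale plus shift."""
    return 2 * dim

def count_student_parameters(num_layers, heads_per_layer, include_qa_head=True):
    """Estimate trainable parameters for a pruned DistilBERT-style student."""
    embeddings = (VOCAB_SIZE * HIDDEN + MAX_POSITIONS * HIDDEN
                  + layer_norm(HIDDEN))
    inner = heads_per_layer * HEAD_DIM  # width left after head pruning
    attention = 3 * linear(HIDDEN, inner) + linear(inner, HIDDEN)
    feed_forward = linear(HIDDEN, FFN_HIDDEN) + linear(FFN_HIDDEN, HIDDEN)
    block = attention + feed_forward + 2 * layer_norm(HIDDEN)
    # Question-answering head: start and end logits per token.
    qa_head = linear(HIDDEN, 2) if include_qa_head else 0
    return embeddings + num_layers * block + qa_head
```

Under these assumptions, `count_student_parameters(6, 12)` gives roughly 66 million parameters, close to the size reported for the baseline. Each removed layer saves about 7.1 million parameters, while each pruned head saves about 0.2 million, so the layer count is the coarser of the two size controls.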
Figure 11. Diagram of optimizing the distillation process and student model architecture.
We use SigOpt’s Multimetric Bayesian Optimization, Metric Management, and parallelization to trade off our two competing metrics (accuracy and size). Following convention, we use the total number of trainable parameters to measure model size and SQuAD 2.0’s exact score to measure accuracy. By optimizing these parameters against both metrics, we hope to find a set of Pareto-efficient architectures that both perform strongly for question answering and are relatively small compared to the baseline.
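A configuration is Pareto efficient if no other configuration is at least as good on both metrics and strictly better on one. The sketch below filters a list of runs down to that set. The run records and their numbers are made-up toy values for illustration, not results from our experiment, and the field names are assumptions.

```python
def pareto_front(runs):
    """Keep runs that no other run dominates.

    Each run is a dict with 'exact' (higher is better) and
    'params' (lower is better).
    """
    front = []
    for run in runs:
        dominated = any(
            other["exact"] >= run["exact"]
            and other["params"] <= run["params"]
            and (other["exact"] > run["exact"]
                 or other["params"] < run["params"])
            for other in runs
        )
        if not dominated:
            front.append(run)
    # Sort from smallest to largest model for easier reading.
    return sorted(front, key=lambda r: r["params"])

# Toy values only: accuracy in percent, size in millions of parameters.
toy_runs = [
    {"name": "a", "exact": 70.0, "params": 60.0},
    {"name": "b", "exact": 65.0, "params": 40.0},
    {"name": "c", "exact": 64.0, "params": 45.0},
    {"name": "d", "exact": 72.0, "params": 80.0},
]
efficient = [run["name"] for run in pareto_front(toy_runs)]
```

Here run "c" drops out because "b" is both smaller and more accurate. The remaining runs each represent a different size and accuracy trade-off, which is the kind of frontier we want the optimizer to uncover.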
As shown above, we track a variety of metrics and monitor the model’s health for each training run using SigOpt’s experiment dashboard. Using Metric Strategy, we also store and track the model’s f1 score and inference time (our unoptimized performance metrics) for each optimization cycle. By combining the two tracking systems, we’re able to better understand the model’s behavior when using a black-box optimization tool, and keep an eye on metrics we care about but don’t want to optimize.
The ensemble of Bayesian and global optimization strategies backing the SigOpt optimizer is well suited for large parameter spaces like this one, which mix hyperparameters, architecture parameters, and distillation parameters. Searching a similar space with parameter sweeps like grid or random search would be intractable.
Defining the Multimetric Experiment
Our Multimetric Bayesian Optimization will search the following space:
| Parameter Name | Parameter Value | Parameter Type |
|---|---|---|
| Adam epsilon | [9.98e-09, 9.99e-06] | Model training |
| Weight for Soft Targets | [0, 1] | Distillation |
| Weight for Hard Targets | [0, 1] | Distillation |
| Beta 1 | [0.7, 0.9999] | Model training |
| Beta 2 | [0.7, 0.9999] | Model training |
| Learning Rate | [2e-6, 0.1] | Model training |
| Number of Multi-Attention Heads | [1, 12] | Architecture |
| Number of layers | [1, 20] | Architecture |
| Eval Batch Size | [4, 32] | Model training |
| Training Batch Size | [4, 32] | Model training |
| Pruning Seed | [1, 100] | Architecture |
| Warm up steps | [0, 100] | Model training |
| Weight Decay | [8.3e-7, 0.018] | Model training |
Table 2. Hyperparameter space definition that will be explored by SigOpt. Like the baseline, the student model is always trained for 3 epochs.
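The search space above, together with our optimized and stored metrics, could be written down as an experiment definition along the following lines. This is a plain Python dictionary, not a call into the SigOpt client. The field names (`parameters`, `bounds`, `metrics`, `strategy`, `observation_budget`, `parallel_bandwidth`) follow SigOpt's experiment format as we understand it and should be checked against the API reference. The experiment name and all parameter and metric names are hypothetical.

```python
def double(name, low, high):
    """Continuous parameter with inclusive bounds."""
    return {"name": name, "type": "double", "bounds": {"min": low, "max": high}}

def integer(name, low, high):
    """Integer parameter with inclusive bounds."""
    return {"name": name, "type": "int", "bounds": {"min": low, "max": high}}

experiment_definition = {
    "name": "distill-bert-squad2-multimetric",  # hypothetical name
    "parameters": [
        double("adam_epsilon", 9.98e-09, 9.99e-06),
        double("soft_target_weight", 0.0, 1.0),
        double("hard_target_weight", 0.0, 1.0),
        double("beta_1", 0.7, 0.9999),
        double("beta_2", 0.7, 0.9999),
        double("learning_rate", 2e-6, 0.1),
        integer("num_attention_heads", 1, 12),
        integer("num_layers", 1, 20),
        integer("eval_batch_size", 4, 32),
        integer("train_batch_size", 4, 32),
        integer("pruning_seed", 1, 100),
        integer("warmup_steps", 0, 100),
        double("weight_decay", 8.3e-7, 0.018),
    ],
    "metrics": [
        # The two competing metrics that are optimized.
        {"name": "exact", "objective": "maximize", "strategy": "optimize"},
        {"name": "num_parameters", "objective": "minimize", "strategy": "optimize"},
        # Tracked for analysis only, not optimized.
        {"name": "f1", "objective": "maximize", "strategy": "store"},
        {"name": "inference_time", "objective": "minimize", "strategy": "store"},
    ],
    "observation_budget": 479,  # optimization cycles in this experiment
    "parallel_bandwidth": 20,   # one suggestion per instance
}
```

Keeping the definition as data makes it easy to review the bounds against the table and to reuse the same space if the experiment is repeated.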
As we learned previously, SQuAD 2.0’s unanswerable questions can skew the model. A model can reach roughly 50% accuracy without learning to answer anything, for example by treating every question as unanswerable, so we set a metric threshold on exact score (accuracy) to deprioritize parameter configurations that score at or below 50%. This makes our search more efficient. We also avoid parameter and architecture configurations that lead to CUDA memory issues by marking them as failures at runtime. This allows us to keep the parameter space open and rely on the optimizer to learn and avoid infeasible regions.
From the optimization process detailed above, we hope to understand the trade-offs when tuning these parameters and produce sets of Pareto-optimal architectures.
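The threshold and the failure handling described above might look like the sketch below. The metric dictionary and the observation payloads (`values`, `failed`) mirror SigOpt's format as we understand it and are assumptions to verify against the API reference. `train_and_evaluate` is a hypothetical stand-in for one distillation cycle. The out-of-memory check relies on PyTorch raising a `RuntimeError` whose message contains "out of memory".

```python
# Optimized accuracy metric with a 50% threshold (assumed field names).
EXACT_METRIC = {
    "name": "exact",
    "objective": "maximize",
    "strategy": "optimize",
    "threshold": 50.0,
}

def run_cycle(train_and_evaluate, assignments):
    """Run one distillation cycle and build an observation payload.

    train_and_evaluate is a hypothetical callable that takes the
    suggested assignments and returns a dict of metric values.
    """
    try:
        metrics = train_and_evaluate(assignments)
    except RuntimeError as err:
        # Configurations that exhaust GPU memory are reported as failed
        # so the optimizer can learn to avoid that region.
        if "out of memory" in str(err).lower():
            return {"failed": True}
        raise  # anything else is a real bug and should surface
    return {"values": [{"name": k, "value": v} for k, v in metrics.items()]}
```

Only memory errors are converted into failed observations here. Re-raising other exceptions keeps genuine bugs from being silently recorded as infeasible configurations.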
Selecting the right GPU
While benchmarks for GPU speed-ups and CUDA performance are available for popular CNN architectures (see Justin Johnson’s work on benchmarks), they are still being developed for Transformers (see Hugging Face’s work on inference time benchmarks). To select the right GPU for our model training process, we start by comparing two GPU options: NVIDIA Tesla V100 (p3.2xlarge instances) and NVIDIA T4 Tensor Core (g4dn.2xlarge instances). We compare these two GPUs because both satisfy our training requirements and each has its own benefits. Both have enough GPU memory for a training cycle, and each instance has enough CPU to quickly perform non-GPU-specific computation. The NVIDIA T4 Tensor Core is our most cost-effective option and the NVIDIA Tesla V100 is our most cutting-edge option.
To select the right GPU for our experiment, we run the baseline on both GPU types. The NVIDIA T4 Tensor Core takes 2.5 hours to execute one training epoch. Unsurprisingly, the NVIDIA Tesla V100 cuts this time roughly in half, taking 1.3 hours per training epoch.
Because we want to cut down the wall-clock time of our experiment, we use the NVIDIA Tesla V100 GPUs for all tuning cycles. This lets us complete our experiment within 4 days.
Orchestration and Wall-Clock time
For our Multimetric Optimization experiment, we use 20 p3.2xlarge AWS EC2 instances, each with one NVIDIA Tesla V100 GPU. Running a single optimization cycle (one distillation cycle) on SQuAD 2.0 takes 4 hours on average. We run 479 optimization cycles asynchronously parallelized across the 20 instances, which takes about 4 days of wall-clock time to complete. To execute efficiently across 20 instances, we use Ray as our infrastructure orchestration service and SigOpt for parameter configuration scheduling and tuning parallelization.
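The wall-clock figure follows directly from the numbers above. The short calculation below assumes the cycles pack evenly across instances with no idle time or scheduling overhead, which is a simplification.

```python
cycles = 479          # optimization cycles in the experiment
hours_per_cycle = 4   # average duration of one distillation cycle
instances = 20        # p3.2xlarge instances, one GPU each

total_gpu_hours = cycles * hours_per_cycle       # 1916 GPU-hours
wall_clock_hours = total_gpu_hours / instances   # 95.8 hours
wall_clock_days = wall_clock_hours / 24          # just under 4 days
```

Run sequentially on a single GPU, the same 1,916 GPU-hours would take about 80 days, which is why parallelizing across instances matters for an experiment of this size.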
In our next post, we will walk through our results from this experiment.
To recreate or repurpose this work please use this repo. Model checkpoints for the models in the results table can be found here. The AMI used to run this code can be found here. To play around with the SigOpt dashboard and analyze results for yourself, take a look at the experiment. If you’d like to learn more about this experiment, watch the webinar, and to learn more about the importance of experiment management in NLP read our blog.
 J. Devlin, M. Chang, K. Lee, K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/pdf/1810.04805.pdf
 G. Hinton, O. Vinyals, J. Dean. Distilling the Knowledge in a Neural Network. https://arxiv.org/abs/1503.02531
 P. Rajpurkar, R. Jia, P. Liang. Know What You Don’t Know: Unanswerable Questions for SQuAD. https://arxiv.org/abs/1806.03822
 V. Sanh, L. Debut, J. Chaumond, T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. https://arxiv.org/pdf/1910.01108.pdf
 A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin. Attention is All You Need. https://arxiv.org/pdf/1706.03762.pdf