Vaswani et al. introduced the Transformer architecture in 2017, opening the door to non-sequential architectures for natural language processing (NLP). It shifted the state of the art in neural machine translation from sequential, RNN-based methods to more parallelizable units (transformers). Building on this architecture, Devlin et al. developed BERT in 2018.
BERT capitalizes on these Transformer building blocks and applies the architecture to NLP tasks beyond neural machine translation. On publication, BERT set a new state of the art in question answering, general language understanding, and commonsense inference. More importantly, BERT is transferable: it can be applied, with minimal changes, to tasks that historically required task-specific architectures (e.g., NMT and seq-to-seq models). Essentially, within a span of two years, compute parallelism became available to NLP (as we moved from sequential architectures to transformers) and a more universal architecture emerged to address various NLP tasks (which is pretty neat).
With BERT as the universal architecture, general language understanding became the base training task and other NLP tasks became downstream tasks. Before this transition, each task required its own state-of-the-art architecture, trained from scratch on a large corpus. With this transition, NLP problems can share high-performing, pre-trained models across a plethora of corpora. This allows NLP to leverage transfer learning and moves NLP closer to standardization. Without BERT and its transferability, every modeling problem would continue to require large datasets, every model would have to be trained from scratch, and knowledge sharing would be heavily restricted. For more on this, Sebastian Ruder has a great piece on transfer learning in NLP.
Although BERT unlocks transfer learning for NLP, the architecture is too complex for many production systems, too large for federated learning and edge computing, and too costly to train from scratch when necessary. There are many ongoing efforts to reduce the size of BERT while retaining model performance and transferability, from Hugging Face’s DistilBERT to Rasa’s pruning technique for BERT to fast-bert (and many more).
BERT innately facilitates the compression process. Its relatively independent and repetitive transformer unit building blocks, and their subunits of Multihead Attention and Fully Connected Layers, allow for easy manipulation of model size. These smaller, compressed, distilled or pruned architectures are just as transferable as BERT. They are parallelizable and enable transfer learning without the overhead of the full model.
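To make the link between the repeated blocks and model size concrete, here is a rough sketch that counts the parameters of a BERT-style encoder as a function of the number of transformer blocks. The function name is hypothetical, and the default dimensions (hidden size 768, feed-forward size 3072, vocabulary of 30,522 tokens, 512 positions) are assumptions taken from the published BERT-base configuration rather than from this post. It counts weights and biases only and ignores any task-specific head.

```python
def bert_encoder_params(num_layers=12, hidden=768, intermediate=3072,
                        vocab=30522, max_positions=512, type_vocab=2):
    """Approximate parameter count: embeddings + transformer blocks + pooler."""
    layer_norm = 2 * hidden  # one scale and one bias vector

    # Token, position and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm

    # Multihead Attention: query, key, value and output projections (with biases).
    # The number of heads does not change the count; heads only split `hidden`.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Fully Connected sublayer: hidden -> intermediate -> hidden (with biases).
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)

    # Pooler: a single dense layer applied to the first token's representation.
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_encoder_params(num_layers=12)
half = bert_encoder_params(num_layers=6)
print(f"12 blocks: {base:,} parameters")
print(f" 6 blocks: {half:,} parameters ({half / base:.0%} of the 12-block count)")
```

Under these assumptions the 12-block count comes out at roughly 110 million parameters. Halving the number of blocks removes less than half of them, because the embedding table is untouched; shrinking the hidden or feed-forward dimensions is the other obvious lever.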
As depicted in the image to the left, DistilBERT results from a distillation process and a manual architecture search that shrink BERT while retaining most of its performance on language understanding, question answering, and commonsense inference.
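For readers unfamiliar with distillation, the core idea from Hinton et al. is to train the small (student) model to match the temperature-softened output distribution of the large (teacher) model. The sketch below is a minimal, dependency-free illustration of that soft-target term only, written as a KL divergence (which differs from the soft-target cross-entropy by a constant that does not depend on the student). It is not DistilBERT's full training objective, which combines several loss terms. The function names, the temperature value, and the example logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student))
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


# Hypothetical logits for a three-class problem.
teacher = [4.0, 1.0, 0.5]
print(distillation_loss([4.0, 1.0, 0.5], teacher))  # student matches the teacher
print(distillation_loss([0.5, 1.0, 4.0], teacher))  # student disagrees
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. In practice this term would be computed on batches of tensors in a deep learning framework and combined with the usual supervised loss.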
Given these architectural properties, the open sourcing of BERT, and community support such as Hugging Face’s Transformers package, developing high-performing NLP models under resource constraints should be easier than before. As we saw above, many people are working on compressing BERT to a more manageable size without losing model performance.
But there are two shortcomings of this work that limit practical applications of these techniques. First, each of these examples gives us a single compressed architecture that performs well, without exposing the full set of trade-offs involved during compression. This limitation makes it challenging for modelers to apply the techniques to their own tasks. Second, these examples strive solely for high model performance on language understanding. While this work is necessary and critical, it is also important to understand how well compression performs on more application-focused methods, niche datasets, and NLP tasks other than language understanding. In short, it is tough to take these methods and understand their impact on your own modeling workflow.
So, how can we practically assess the trade-offs for reducing model size while retaining accuracy given these architecture capabilities? In a future post, inspired by DistilBERT, we will explore this problem by distilling BERT for question answering. Specifically, we will pair distillation with Multimetric Bayesian Optimization to compress BERT and assess the trade-offs between size and accuracy. By combining these methods and concurrently tuning things like model accuracy and number of parameters, we will better understand the effects of distillation on model performance and make informed decisions on their trade-offs.
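When two metrics such as parameter count and accuracy are tuned at the same time, there is usually no single best model. The useful output is the set of configurations that no other configuration beats on both metrics at once (the Pareto-efficient set). The sketch below only illustrates that filtering step; it is not Bayesian optimization itself, the function name is hypothetical, and the candidate values are made-up placeholders, not measured results.

```python
def pareto_front(candidates):
    """Keep candidates that are not dominated on (fewer params, higher accuracy)."""
    front = []
    for c in candidates:
        dominated = any(
            o["params"] <= c["params"]
            and o["accuracy"] >= c["accuracy"]
            and (o["params"] < c["params"] or o["accuracy"] > c["accuracy"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["params"])


# Made-up placeholder values (params in millions), NOT experimental results.
candidates = [
    {"name": "a", "params": 110, "accuracy": 0.88},
    {"name": "b", "params": 66, "accuracy": 0.86},
    {"name": "c", "params": 70, "accuracy": 0.84},  # larger and worse than "b"
    {"name": "d", "params": 40, "accuracy": 0.80},
    {"name": "e", "params": 45, "accuracy": 0.78},  # larger and worse than "d"
]

for model in pareto_front(candidates):
    print(model["name"], model["params"], model["accuracy"])
```

Each model left on the front represents a different point on the size-versus-accuracy curve, which is the information a single published compressed architecture does not provide.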
This post builds on a lot of great work from the community and would not be possible without it. This includes Hugging Face’s Transformers package and model zoo. Without the flexibility of transformers and the community support around pre-trained models and developer tools, we would not be able to explore these trade-offs.
 J. Devlin, M. Chang, K. Lee, K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/pdf/1810.04805.pdf
 G. Hinton, O. Vinyals, J. Dean. Distilling the Knowledge in a Neural Network. https://arxiv.org/abs/1503.02531
 V. Sanh, L. Debut, J. Chaumond, T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. https://arxiv.org/pdf/1910.01108.pdf
 A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin. Attention is All You Need. https://arxiv.org/pdf/1706.03762.pdf