Why is Experiment Management Important for NLP?

Meghana Ravikumar
I think we can all agree that modeling can often feel like a crapshoot (if you didn’t know before, surprise!). It has many moving parts that need to be tracked, from understanding the data to featurization and model selection to orchestrating infrastructure and doing it all over again. And many times, it feels like all of this is happening at once. At each stage of the process, there is a vast amount of discovery to be done and a wide variety of decisions to be made.

For me, this was most evident during my most recent project, which used Multimetric Bayesian Optimization to distill BERT (the teacher model) into a smaller, strong-performing architecture (the student model) on SQuAD 2.0 (for more on this project, read the blog or watch the webinar). This project made me feel as though I was juggling ten chainsaws while balancing on a unicycle in a circus tent that was on fire.

Or like Simon Cowell during this performance:

[GIF: Simon Cowell, fire juggling, America's Got Talent]

What I mean is, there are a lot of components that all need to function well and seamlessly together. For this project, my chainsaws/flaming torches/main components were: 

  • adapting existing packages, 
  • identifying the best optimization metrics, and 
  • effectively executing my tuning process in parallel. 

Leveraging SigOpt’s Experiment Management took away some of the juggling. I was able to quickly analyze my model runs, find bugs, and make informed decisions. Instead of relying on randomly jotted notes in notebooks and docs (I still do this out of habit, just less often), I was able to use the platform to trace back and understand my past work. More specifically, the product helped me make critical decisions during model development, understand my problem space, and monitor my full run.

Model Development

Baselining is the first step of my model development process. Many times, I find myself working with novel data and models and having no intuition on what to expect. To get past this, I train an unaltered model with the dataset to understand: what is the highest accuracy (or other metric) the model can achieve? Does this model converge? And is the loss curve “healthy”? Essentially, a baseline model should give me a rough understanding of the problem space, help me design my model, and validate whether or not the modeling problem at hand is worth solving. 

Establishing a representative baseline model for an experiment is an iterative process in itself. For this experiment, I worked from HuggingFace’s DistilBERT paper. After reproducing the paper’s results, I identified its default architecture parameters, associated hyperparameters, featurizations, weight initializations, and training method, and used these defaults for my baseline. The first baseline I tried used DistilBERT’s default parameters but was trained on SQuAD 2.0. The results were underwhelming. The resulting model performed poorly across all the tracked accuracy metrics, actually decreased in performance over time, and more or less converged to 35% accuracy. Not ideal.

Figure 1. Learning curves for baseline without pretrained weights for SQuAD 2.0 metrics.

But by training this model, I realized that the default parameters were better suited to the large Toronto Book Corpus and English Wikipedia corpora (the training data for the original paper) and to general language understanding (the training task of the original paper). This led me to ask: is this methodology suitable for compressing BERT for question answering? Essentially, should I even attempt this problem if the model performs so badly with this baseline?

As BERT and DistilBERT both show that general language understanding is a great pre-training task for downstream question answering tasks, I decided to use the pre-trained weights from the DistilBERT model as the weight initialization for my baseline architecture. By doing so, the baseline model is now initialized with a rough idea of the English language and, from that point, is further trained for question answering. For more on transfer learning, see Sebastian Ruder’s overview on the topic.

With pre-trained weights, the baseline model hit 67% accuracy, converged, and had a good-looking loss curve. Big improvement.

Figure 2. Learning curves for baseline with pretrained weight initialization for SQuAD 2.0 metrics.

By iterating through various baselines, I was able to better understand my challenging problem space, make informed decisions about my model design, and decide to run the full experiment.

In addition to the accuracy metric, I also kept track of “HasAns_exact” (exact score for answerable questions) and “NoAns_exact” (exact score for unanswerable questions). This goes back to SQuAD 2.0 being evenly split across two categories: answerable and unanswerable questions. By keeping an eye on the metrics for each category and across the whole dataset, I was able to avoid model configurations that overfit to one category and instead identify configurations that generalized across both.
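As a rough illustration of the per-category bookkeeping described above, here is a minimal Python sketch. It is not the official SQuAD evaluation script: the function names, the simplified text normalization, and the toy question IDs and answers are all assumptions made for this example.

```python
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (a simplified stand-in for SQuAD normalization)."""
    return " ".join(text.lower().split())


def percent(hits: List[bool]) -> float:
    """Share of True values as a percentage; 0.0 for an empty list."""
    return 100.0 * sum(hits) / len(hits) if hits else 0.0


def category_exact(predictions: Dict[str, str], gold: Dict[str, List[str]]) -> Dict[str, float]:
    """Return overall, answerable-only, and unanswerable-only exact-match scores.

    Convention assumed here: an empty gold list marks an unanswerable question,
    and an empty prediction string means the model predicted "no answer".
    """
    has_ans: List[bool] = []
    no_ans: List[bool] = []
    for qid, answers in gold.items():
        pred = normalize(predictions.get(qid, ""))
        if answers:
            has_ans.append(pred in {normalize(a) for a in answers})
        else:
            no_ans.append(pred == "")
    return {
        "exact": percent(has_ans + no_ans),
        "HasAns_exact": percent(has_ans),
        "NoAns_exact": percent(no_ans),
    }


# Toy data, invented for illustration: two answerable and two unanswerable questions.
gold = {"q1": ["Paris"], "q2": [], "q3": ["1999"], "q4": []}
always_no_answer = {qid: "" for qid in gold}
print(category_exact(always_no_answer, gold))
# {'exact': 50.0, 'HasAns_exact': 0.0, 'NoAns_exact': 100.0}
```

In this toy case the overall score alone hides the fact that one category is being ignored entirely, which is the reason for tracking all three numbers side by side.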

Understanding My Problem Space

After understanding what the baseline looks like, I moved on to the next step of my model development process: building an intuition for my problem space. Specifically, I wanted to see what happens when I perturb the model’s parameters. You can do this in many ways. I chose to run a small Multimetric Bayesian Optimization experiment and let the optimizer choose the parameters for me.

Figure 3. Best Metrics result for a small Multimetric Optimization experiment. Each point represents a unique set of parameter values. The orange dots are part of this experiment’s Pareto Frontier.

By running a small multimetric optimization experiment, I quickly understood the rough trade-offs associated with distillation. More interestingly, it rapidly became clear that larger models don’t perform as well as smaller models, that certain regions of the parameter space make the model’s performance jump from 20% to 40%, and that a small cluster was forming around 50% exact. Dun dun dun.
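For readers unfamiliar with the Pareto Frontier shown in Figure 3, the idea can be sketched in a few lines of Python. This is a generic illustration, not how SigOpt computes it. It assumes the two competing metrics are exact score (higher is better) and parameter count (lower is better); the `Run` type, the function names, and the numbers are made up.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    name: str
    exact: float     # percent, higher is better
    n_params: float  # assumed second metric, lower is better


def dominates(a: Run, b: Run) -> bool:
    """True if `a` is no worse than `b` on both metrics and strictly better on at least one."""
    no_worse = a.exact >= b.exact and a.n_params <= b.n_params
    strictly_better = a.exact > b.exact or a.n_params < b.n_params
    return no_worse and strictly_better


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Keep every run that no other run dominates."""
    return [r for r in runs if not any(dominates(other, r) for other in runs)]


# Toy values, invented for illustration only.
runs = [
    Run("a", exact=40.0, n_params=30e6),
    Run("b", exact=50.0, n_params=60e6),
    Run("c", exact=35.0, n_params=45e6),  # dominated by "a"
    Run("d", exact=50.0, n_params=66e6),  # dominated by "b"
]
print([r.name for r in pareto_frontier(runs)])  # ['a', 'b']
```

Each point on the frontier represents a different trade-off: no other run is both smaller and more accurate.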

By iterating over the parameters and looking for correlations, I identified a handful of parameters that heavily influence the model’s performance, including the number of layers (n_layers), the weight on the distillation loss (alpha_ce), all the dropouts for the network (qa_dropout, attention_dropout, dropout), and the distillation temperature.

Figure 4. Parallel coordinates for parameters that most affect the model’s performance. The sweet-spot ranges for each of these parameters lead to the drastic jump in model performance from 20% to 40%.

Doing a small optimization run helped me understand correlations between parameters, and bolstered my intuition for important properties of the distillation process and model architecture search. 

Next, I wanted to understand what’s going on at 50% exact.

Figure 5. Selecting a point that represents the behavior of the 50% accuracy cluster.

Figure 6. Training curves for the representative point of the 50% accuracy cluster. As indicated by the metrics, the model is unable to recognize answerable questions.

Because half of SQuAD 2.0’s questions are unanswerable, a poorly trained model can achieve 50% accuracy simply by predicting that everything is unanswerable. By looking into the training curves for these models, I saw this type of training behavior across the cluster of models around 50% accuracy. Each model scores 0% on answerable questions and 100% on unanswerable questions. This is obviously not ideal.

By looking at these training curves, I realized that the “exact” metric on its own is not the best metric for a black-box optimizer. A composite metric that combines the performance scores for answerable and unanswerable questions would be more representative.

Figure 7. Metric Threshold set at 50% accuracy.

But it is difficult to create a composite metric that accurately depicts the relationship between the two categories. Instead, I opted to set a metric threshold of 50% on the performance metric “exact” so the optimizer intelligently avoids that region. Now, the optimizer knows to focus on parameter suggestions that lead to a model performance greater than 50%.
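The effect of such a threshold can be sketched without any optimizer in the loop. The snippet below is only an illustration of the idea, not SigOpt's implementation: the function name, the strict "greater than" comparison, and the toy metric values are assumptions.

```python
from typing import Dict

EXACT_THRESHOLD = 50.0  # percent, matching the threshold described above


def meets_threshold(metrics: Dict[str, float], threshold: float = EXACT_THRESHOLD) -> bool:
    """Assumption: a run passes only if "exact" is strictly above the threshold."""
    return metrics["exact"] > threshold


# Toy values, invented for illustration only.
runs = [
    {"exact": 50.0, "HasAns_exact": 0.0, "NoAns_exact": 100.0},  # calls everything unanswerable
    {"exact": 62.0, "HasAns_exact": 58.0, "NoAns_exact": 66.0},
    {"exact": 38.0, "HasAns_exact": 41.0, "NoAns_exact": 35.0},
]
passing = [m for m in runs if meets_threshold(m)]
print(len(passing))  # 1
```

With a strict comparison, a run sitting exactly at 50% because it labels every question unanswerable does not count as a success, which is the behavior the threshold is meant to steer away from.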

Monitoring My Full Run

Apart from being a great tool for model development and for gaining intuition about my problem, Experiment Management also helps with monitoring long training and tuning cycles. Whether it’s clusters not auto-scaling properly, an uncaught bug in the code, incorrect environment parameters, or unexpected model behavior, training and tuning runs are difficult to monitor because they have so many moving parts. In this project, the tool helped me rapidly identify incorrect model parameters.

After checkpointing, testing, and orchestrating my code, I was ready to launch my full experiment, which would run continuously for four days. So, off I went. Within 24 hours, it was clear to me that there was something wrong with the tuning cycles. I inspected the analysis page for my optimization run and found that no model configurations had passed my required threshold:

Figure 8. Best Metrics results for the full tuning cycle. Oddly, none of the observed optimization points pass the 50% accuracy threshold.

And so, I inspected a few training runs:

Figure 9. Representative training curves for models from the optimization cycle.

They all looked like the curves above, where the models were learning well for answerable questions but failing to recognize unanswerable questions. This was unexpected. I’d seen the opposite happen, but this was strange and present in all the models. After looking through some logs, I went back to my training script. And, lo and behold, I had not turned on a crucial flag, and the dataset was being featurized incorrectly. Amazing.
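A pattern like this, where one category's score sits near zero across every run, is simple to check for programmatically. The following is a hypothetical helper, not a feature of any tool mentioned here; the function name and the 5% floor are arbitrary assumptions chosen for illustration.

```python
from typing import Dict, Optional


def collapsed_category(metrics: Dict[str, float], floor: float = 5.0) -> Optional[str]:
    """Return the name of a per-category metric stuck at or below `floor` percent, else None."""
    for name in ("HasAns_exact", "NoAns_exact"):
        if metrics[name] <= floor:
            return name
    return None


# Toy values, invented for illustration only.
healthy = {"HasAns_exact": 61.0, "NoAns_exact": 70.0}
broken = {"HasAns_exact": 62.0, "NoAns_exact": 1.5}
print(collapsed_category(healthy))  # None
print(collapsed_category(broken))   # NoAns_exact
```

If every run in a tuning cycle trips the same check, the cause is more likely to be in the data pipeline or configuration than in any one set of hyperparameters.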

After the featurization was fixed, I was able to successfully finish my tuning run and found many viable sets of model configurations with well-structured training curves.

Figure 10. Best Metrics graph and Pareto Frontier for the completed tuning cycle. The optimization process resulted in many viable model configurations that perform better than 50% accuracy. 

Figure 11. Training curves from the highest performing model on the Pareto Frontier (70.55% accuracy). 

Will I probably forget to turn flags on in the future? Yes, 100%. But by using a monitoring tool, I was able to spot the problem early, know where to start looking, and resolve the issue quickly.

In Conclusion

As you’ve learned through this very long blog post, modeling is messy. As modelers, we have to iterate and validate ideas quickly and efficiently while gaining an intuition for the problem we’re solving. Many times, we spin up long processes with high compute costs and long training cycles. Even if we do our best to test all components, run subsets of the data, try baselines, checkpoint everything under the sun, and understand the problem space, something can always go wrong. Having something like SigOpt’s Experiment Management serves as a home base through all the moving parts. It allows me to quickly track and analyze my models, see patterns across my multitude of training runs, and return to a single space in the chaos. We hope you join our beta and try it out yourself.

Meghana Ravikumar, AI Product Manager