Efficient BERT: How to get up and running

Meghana Ravikumar

You’ve heard us talk about Efficient BERT frequently over the last 3 weeks. If not, here’s a webinar to watch and a blog post to read. This blog post will walk you through the resources available to adapt this work to your own modeling process.

List of Resources:

  • Github repository 
  • Model checkpoints zip
  • SigOpt Experiment Dashboard to view and analyze experiment results
  • AWS AMI for the experiment environment

How do I get Efficient BERT up and running?

Use the published model checkpoints

From the Multimetric Bayesian Optimization results, we found a set of optimal architectures. The table below highlights three of these model configurations alongside the baseline.

| Scenario | Frontier point: size | Frontier point: exact score | Size diff from baseline | Exact score diff from baseline | Link to run |
| --- | --- | --- | --- | --- | --- |
| Baseline | 66.3M | 67.07% | - | - | Baseline training curves |
| Retain model size and maximize performance | 66.36M | 70.55% | +0.09% params | +3.45% | Model training curves |
| Minimize model size and maximize performance | 65.18M | 70.26% | -1.69% params | +3.19% | Model training curves |
| Minimize model size and retain performance | 51.40M | 66.82% | -22.47% params | -0.25% | Model training curves |
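To make the two "diff from baseline" columns concrete, here is a minimal sketch of how they can be reproduced. It assumes the size column is a relative change in parameter count and the exact score column is an absolute difference in percentage points; the function names are made up for this illustration, and the published figures may have been computed from unrounded values, so the last digit can differ.

```python
BASELINE_PARAMS_M = 66.3   # baseline size from the table, in millions of parameters
BASELINE_EXACT = 67.07     # baseline exact score from the table, in percent


def size_diff_pct(params_m, baseline_m=BASELINE_PARAMS_M):
    """Relative change in parameter count versus the baseline, in percent."""
    return (params_m - baseline_m) / baseline_m * 100.0


def exact_diff_points(exact, baseline=BASELINE_EXACT):
    """Absolute change in exact score versus the baseline, in percentage points."""
    return exact - baseline


# Sizes of the three highlighted configurations, taken from the table.
for scenario, params_m in [
    ("retain size, maximize performance", 66.36),
    ("minimize size, maximize performance", 65.18),
    ("minimize size, retain performance", 51.40),
]:
    print(f"{scenario}: {size_diff_pct(params_m):+.2f}% params")
```

Reading the last row this way: the smallest model drops roughly 22% of the baseline parameters while giving up a quarter of a point of exact score.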

Final model checkpoints for each of the models listed in the table above can be downloaded here.

The checkpoint directories follow the checkpoint layout used by HuggingFace’s Transformers package.

| Scenario | Frontier point: size | Frontier point: exact score | Size diff from baseline | Exact score diff from baseline | Link to run | Checkpoint ID |
| --- | --- | --- | --- | --- | --- | --- |
| Retain model size and maximize performance | 66.36M | 70.55% | +0.09% params | +3.45% | Model training curves | 29572758 |
| Minimize model size and maximize performance | 65.18M | 70.26% | -1.69% params | +3.19% | Model training curves | 29580036 |
| Minimize model size and retain performance | 51.40M | 66.82% | -22.47% params | -0.25% | Model training curves | 29567632 |

To see how to load a checkpoint directory, follow this Colab notebook.

You should now have a trained Transformer model loaded with the given configuration and pre-trained weights. From here, you can fine-tune the model on your own dataset, try different architectures, and experiment as you like.
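If you prefer a local script to the notebook, the following is a minimal sketch of the same idea, not the notebook’s actual code. It assumes the checkpoint directory contains the `config.json` and `pytorch_model.bin` files that `save_pretrained()` writes in the Transformers release pinned later in this post, and that the student is a DistilBERT question-answering model. The helper names and the example path are hypothetical.

```python
from pathlib import Path

# Files written by save_pretrained() in transformers 2.4.1 (assumed layout).
REQUIRED_FILES = ("config.json", "pytorch_model.bin")


def missing_checkpoint_files(checkpoint_dir):
    """Return the required files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def load_student(checkpoint_dir):
    """Load a DistilBERT question-answering model from a local checkpoint directory."""
    missing = missing_checkpoint_files(checkpoint_dir)
    if missing:
        raise FileNotFoundError(f"{checkpoint_dir} is missing: {', '.join(missing)}")
    # Imported lazily so the directory check works without transformers installed.
    from transformers import DistilBertForQuestionAnswering

    model = DistilBertForQuestionAnswering.from_pretrained(str(checkpoint_dir))
    model.eval()  # switch off dropout for inference
    return model
```

A call might look like `model = load_student("./checkpoints/29572758")`, where the path is a placeholder for wherever you unzipped the checkpoint you want.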

Draw insights from the SigOpt Dashboard

Explore SigOpt’s Experiment Dashboard to gather new insights and learn about the parameter space. 

Find model architectures that are viable for you.

Understand correlations between parameters.

Identify important patterns between parameter values and metrics.
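The dashboard does this analysis interactively. To carry the same reasoning into your own scripts, a small sketch like the one below can pick out the non-dominated (Pareto-optimal) runs from a list of (size, exact score) pairs. It is an illustration of the general technique, not SigOpt’s implementation; the four data points are the rows of the results table above, and the function names are made up for this example.

```python
def dominates(a, b):
    """True if run a is no worse than run b on both metrics and better on at least one.

    Runs are (size_in_millions, exact_score); smaller size and higher score are better.
    """
    size_a, score_a = a
    size_b, score_b = b
    no_worse = size_a <= size_b and score_a >= score_b
    strictly_better = size_a < size_b or score_a > score_b
    return no_worse and strictly_better


def pareto_frontier(runs):
    """Keep only the runs that no other run dominates."""
    return [p for p in runs if not any(dominates(q, p) for q in runs)]


# (size in millions of parameters, exact score in percent) from the results table.
runs = [(66.3, 67.07), (66.36, 70.55), (65.18, 70.26), (51.40, 66.82)]
frontier = pareto_frontier(runs)

# Example viability filter: frontier runs that fit a 60M-parameter budget.
viable = [run for run in frontier if run[0] <= 60.0]
print(frontier)
print(viable)
```

On these four points the baseline drops out, because the 65.18M configuration is both smaller and more accurate, while the three highlighted configurations remain on the frontier.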

Recreate the full experiment

To recreate the full experiment effectively, sign up for our free beta!

You have 3 main ways to run the repo:

  • Run distillation on SQuAD 2.0
  • Optimize distillation on SQuAD 2.0
  • Orchestrate optimization with Ray

We’ll focus on optimizing the distillation process with SigOpt and Ray (the 3rd option above). To learn how to execute the other two options, please see the repo’s readme.

[Figure: Overview of the optimization experiment design]

[Figure: Overview of orchestration with Ray]

Executing the full experiment

Create a data directory:

mkdir -p ./data

Download the SQuAD 2.0 files and the evaluation script:

wget -P ./data https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json

wget -P ./data https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

wget -P ./data https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/

Change name of evaluation script:

mv ./data/index.html ./data/evaluate-v2.0.py
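Before starting a long run, it can be worth checking that the downloads are intact. The sketch below is one way to do that, assuming the standard SQuAD 2.0 JSON layout (`data` → `paragraphs` → `qas`, with an `is_impossible` flag on each question). The function name is hypothetical and the check is not part of the repo.

```python
import json


def count_questions(squad_path):
    """Return (answerable, unanswerable) question counts for a SQuAD 2.0 JSON file."""
    with open(squad_path, encoding="utf-8") as f:
        dataset = json.load(f)

    answerable = unanswerable = 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # SQuAD 2.0 marks questions with no answer in the context as impossible.
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable
```

For example, `count_questions("./data/dev-v2.0.json")` should return two non-zero counts; a JSON parsing error or a missing key suggests a truncated or incorrect download.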

Create a new Python 3 virtual environment and install the dependencies:

virtualenv -p $(which python3) ./venv
source venv/bin/activate
pip3 install transformers==2.4.1
pip3 install scikit-learn
pip3 install boto3
pip3 install tensorboard
pip3 install torch torchvision
pip3 install logbeam
pip3 install sigopt
pip3 install 'ray[tune]'

Logging

All logs are sent to standard out and to CloudWatch. To specify your own log stream and log handler, change the cw_handler in logger.py.
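The general pattern of logging to standard out plus one extra destination can be sketched with Python’s standard `logging` module. This is not the repo’s logger.py: `build_logger` and its `extra_handler` argument are hypothetical, and a CloudWatch handler would be constructed separately, with arguments that depend on the handler library, and passed in.

```python
import logging
import sys


def build_logger(name, extra_handler=None, level=logging.INFO):
    """Create a logger that writes to standard out and, optionally, one more handler."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate lines if this is called twice

    formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")

    stdout_handler = logging.StreamHandler(sys.stdout)
    stdout_handler.setFormatter(formatter)
    logger.addHandler(stdout_handler)

    if extra_handler is not None:
        # For example a CloudWatch handler; any logging.Handler subclass works here.
        extra_handler.setFormatter(formatter)
        logger.addHandler(extra_handler)
    return logger


logger = build_logger("distillation")
logger.info("logging configured")
```

Swapping the destination then means changing only the handler you pass in, which mirrors editing cw_handler in the repo.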

Running Optimization and Distillation

The following are the CLI options for optimizing the distillation process. The process will run SigOpt’s Multimetric Bayesian Optimization with model distillation.  

Main CLI options:

* model_type: student model type

* teacher_type: teacher model type

* teacher_name_or_path: local checkpoints for teacher model or a model from [HuggingFace’s model zoo](https://huggingface.co/models)

* train_file: path to SQuAD 2.0 training file

* predict_file: path to SQuAD 2.0 dev file

* output_dir: output directory for training outputs

* num_train_epochs: number of epochs to train student model during distillation

* experiment_name: SigOpt experiment name

* api_token: SigOpt API Token

* use_hpo_default_ranges: flag to use default hpo ranges specified here

* sigopt_experiment_id: ID of an existing SigOpt experiment. If set, the existing experiment is loaded

* sigopt_observation_budget: number of optimization runs for SigOpt experiment

* store_s3: flag to store checkpoints to s3

* s3_bucket: s3 bucket name for checkpoint storing

* cache_s3_bucket: flag to download stored caches in s3

* train_cache_s3_directory: location of training data’s cached features on s3

* eval_cache_s3_directory: location of dev data’s cached features in s3

For a full list of default hyperparameter ranges, check this file.  And for a full list of defaults, check here.

Execute the following CLI:

python sigopt-examples/bert-distillation-multimetric/sigopt_optimization_cli.py \
    --model_type distilbert \
    --train_file ./data/train-v2.0.json \
    --predict_file ./data/dev-v2.0.json \
    --experiment_name test-multimetric-distillation \
    --project_name test-multimetric \
    --use_hpo_default_ranges \
    --api_token <SigOpt API Token> \
    --output_dir ./test_run_5 \
    --sigopt_observation_budget 200 \
    --teacher_type bert \
    --teacher_name_or_path twmkn9/bert-base-uncased-squad2 \
    --num_train_epochs 3 \
    --store_s3 \
    --s3_bucket s3-checkpoint-bucket \
    --cache_s3_bucket \
    --train_cache_s3_directory s3-cache-bucket/train_cache \
    --eval_cache_s3_directory s3-cache-bucket/dev_cache

The command above runs Multimetric Optimization on the distillation process using a teacher model from the HuggingFace model zoo. The optimization runs for 200 cycles, training the student model for 3 epochs in each. Checkpoints are logged and saved to S3 every 1000 steps, and the feature caches for the dataset are pulled from the specified S3 bucket.
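If you would rather launch the optimization from a script, for instance to keep the API token out of your shell history, the command can be assembled programmatically. The sketch below is one possible approach and not part of the repo: `build_command` is a hypothetical helper, `SIGOPT_API_TOKEN` is an environment variable name chosen for this example, and the S3 options are left out for brevity.

```python
import os
import shlex

SCRIPT = "sigopt-examples/bert-distillation-multimetric/sigopt_optimization_cli.py"


def build_command(options, flags, script=SCRIPT):
    """Return an argv list: options become '--name value', flags become '--name'."""
    argv = ["python", script]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    argv += [f"--{name}" for name in flags]
    return argv


options = {
    "model_type": "distilbert",
    "teacher_type": "bert",
    "teacher_name_or_path": "twmkn9/bert-base-uncased-squad2",
    "train_file": "./data/train-v2.0.json",
    "predict_file": "./data/dev-v2.0.json",
    "num_train_epochs": 3,
    "sigopt_observation_budget": 200,
    "experiment_name": "test-multimetric-distillation",
    "project_name": "test-multimetric",
    "output_dir": "./test_run_5",
    # SIGOPT_API_TOKEN is a hypothetical variable name used only in this sketch.
    "api_token": os.environ.get("SIGOPT_API_TOKEN", "<SigOpt API Token>"),
}
flags = ["use_hpo_default_ranges"]

command = build_command(options, flags)
print(" ".join(shlex.quote(part) for part in command))
```

Passing `command` to `subprocess.run(command, check=True)` would then start the run; printing it first makes it easy to confirm the flags before committing to 200 training cycles.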

Using Ray to orchestrate the optimization process

The following options will run SigOpt’s Multimetric Bayesian Optimization with distillation using Ray for orchestration. Before using this CLI, look through the Ray documentation. The following Ray orchestration will use this AMI to set up the environment for the nodes in the cluster.

To run the optimization process on a Ray cluster:

  1. Launch a Ray cluster. There is an example config in the repo.
  2. Execute the Ray CLI on the cluster launched in step 1.

Main CLI options:

* model_type: student model type

* teacher_type: teacher model type

* teacher_name_or_path: local checkpoints for teacher model or a model from HuggingFace’s model zoo

* train_file: path to SQuAD 2.0 training file

* predict_file: path to SQuAD 2.0 dev file

* output_dir: output directory for training outputs

* num_train_epochs: number of epochs to train student model during distillation

* experiment_name: SigOpt experiment name

* api_token: SigOpt API Token

* use_hpo_default_ranges: flag to use default hpo ranges specified here 

* sigopt_experiment_id: ID of an existing SigOpt experiment. If set, the existing experiment is loaded

* sigopt_observation_budget: number of optimization runs for SigOpt experiment

* store_s3: flag to store checkpoints to s3

* s3_bucket: s3 bucket name for checkpoint storing

* cache_s3_bucket: flag to download stored caches in s3

* train_cache_s3_directory: location of training data’s cached features on s3

* eval_cache_s3_directory: location of dev data’s cached features in s3

* max_concurrent: max number of concurrent workers used

* parallel: total number of parallel workers used for SigOpt

* num_cpu: number of cpus required for each run

* num_gpu: number of gpus required for each run

* ray_address: ip address of ray cluster’s head node

* clean_raytune_output: flag to clear RayTune outputs after the run

* raytune_output_directory: output directory for RayTune

Execute the following CLI:

python sigopt-examples/bert-distillation-multimetric/sigopt_ray_optimization_cli.py \
    --model_type distilbert \
    --teacher_type bert \
    --teacher_name_or_path twmkn9/bert-base-uncased-squad2 \
    --train_file /home/ubuntu/SQUAD_2/train-v2.0.json \
    --predict_file /home/ubuntu/SQUAD_2/dev-v2.0.json \
    --num_train_epochs 3 \
    --experiment_name bert-distillation-full-run \
    --project_name bert-distillation-full-run \
    --use_hpo_default_ranges \
    --api_token <SigOpt API Token> \
    --output_dir /home/ubuntu/output_dir \
    --overwrite_output_dir \
    --sigopt_observation_budget 479 \
    --parallel 20 \
    --max_concurrent 20 \
    --num_cpu 8 \
    --num_gpu 1 \
    --store_s3 \
    --s3_bucket <S3_checkpoint_bucket> \
    --ray_address <RAY_IP_ADDRESS> \
    --cache_s3_bucket <S3_cache_bucket> \
    --train_cache_s3_directory <S3_path_train_features> \
    --eval_cache_s3_directory <S3_path_dev_features>

The command above executes the optimization process across 20 workers in parallel.
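For rough capacity planning, the observation budget and the number of parallel workers give a lower bound on how many back-to-back training runs each worker slot must complete. The sketch below assumes each worker handles one run at a time and that runs take similar time; the function name is made up for this example, and actual scheduling in Ray may differ.

```python
import math


def min_sequential_rounds(observation_budget, parallel_workers):
    """Lower bound on back-to-back runs per worker slot to exhaust the budget."""
    if parallel_workers < 1:
        raise ValueError("parallel_workers must be at least 1")
    return math.ceil(observation_budget / parallel_workers)


# Values from the command above: --sigopt_observation_budget 479, --parallel 20.
rounds = min_sequential_rounds(479, 20)
print(rounds)
```

With the values from the command this comes to 24 rounds, so the wall-clock time is at least about 24 times the duration of a single 3-epoch distillation run.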

Thanks for Using SigOpt

I hope this helps you get started with Efficient BERT. Our mission at SigOpt is to accelerate and amplify the impact of modelers everywhere by building software solutions that boost the model development process. We hope that providing use cases and code examples like these makes it easier for you to get started with our products.

If you want to learn more about our product, schedule time to talk with our team, check out our docs, or sign up to receive blog post updates. I’d also love to learn more about your NLP projects, so please get in touch directly.

Meghana Ravikumar AI Product Manager