Hugging Face is on a mission to democratize state-of-the-art Machine Learning, and a critical part of that work is making these state-of-the-art models as efficient as possible, so they use less energy and memory at scale and are more affordable for companies of all sizes to run. The collaboration with SigOpt and Intel through the Hugging Face Hardware Partner Program enables Hugging Face to make advanced efficiency and optimization techniques easily available to the community, through the new Hugging Face Optimum open-source library dedicated to production performance.
Over the past few years, Hugging Face Transformers has added a tremendous number of new architectures and models to the Hugging Face Model Hub, which hosts more than 9,000 of them as of early 2021. Hugging Face Transformers provides thousands of pretrained models to perform tasks on text such as classification, information extraction, question answering, summarization, translation, text generation and more, in over 100 languages. Its aim is to make cutting-edge natural language processing (NLP) easier for everyone to use. As the NLP landscape keeps trending toward more and more BERT-like models in production, it remains challenging to efficiently deploy and run these architectures at scale. This is why Hugging Face recently introduced the Hugging Face Inference API.
Today, we’ll share a summary of how Hugging Face has used SigOpt’s hyperparameter optimization to better leverage Intel CPUs for BERT model inference. When leveraging BERT-like models from Hugging Face’s Model Hub, there are many knobs that can be tuned to make things faster. In this case, “faster” means improved throughput and latency on Intel CPUs. However, with so many knobs, tuning all of them to reach optimal performance can be cumbersome. In their experiments, the following knobs were tuned:
- The number of cores: although using as many cores as you have is often a good idea, it does not always provide the best performance, because it also means more communication between the different threads. On top of that, getting good performance with fewer cores can be very useful, as it allows multiple instances to run at the same time, resulting in both better latency and throughput.
- The memory allocator: which memory allocator out of the default malloc, Google’s tcmalloc and Facebook’s jemalloc provides the best performance?
- The parallelism library: which parallelism library out of GNU OpenMP and Intel OpenMP provides the best performance?
- Transparent Huge Pages: does enabling Transparent Huge Pages (THP) on the system provide better performance?
- The KMP block time parameter: the time, in milliseconds, that a thread should wait after completing the execution of a parallel region before sleeping.
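Most of these knobs (apart from Transparent Huge Pages, which is toggled system-wide) boil down to environment variables set before the inference process starts. The sketch below shows one way to assemble such an environment in Python; the library paths and the benchmark script name are hypothetical, not Hugging Face's actual harness.

```python
import os


def build_env(num_cores, allocator, use_intel_openmp, kmp_blocktime_ms):
    """Build the environment for one benchmark run (hypothetical harness).

    THP is not handled here: it is enabled system-wide via
    /sys/kernel/mm/transparent_hugepage/enabled.
    """
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(num_cores)        # number of threads to use
    env["KMP_BLOCKTIME"] = str(kmp_blocktime_ms)   # Intel OpenMP: ms spun before sleeping
    preload = []
    if allocator == "tcmalloc":
        preload.append("/usr/lib/x86_64-linux-gnu/libtcmalloc.so")  # path varies per distro
    elif allocator == "jemalloc":
        preload.append("/usr/lib/x86_64-linux-gnu/libjemalloc.so")
    if use_intel_openmp:
        preload.append("libiomp5.so")              # swap GNU OpenMP for Intel OpenMP
    if preload:
        env["LD_PRELOAD"] = ":".join(preload)
    return env


# Example: 16 cores, tcmalloc, Intel OpenMP, 1 ms block time, then e.g.:
#   subprocess.run(["python", "benchmark_bert.py"], env=env)  # hypothetical script
env = build_env(16, "tcmalloc", True, 1)
```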
Of course, the brute-force approach, which consists of trying out all the possibilities, will find the best knob values for optimal performance, but since the size of the search space is N x 3 x 2 x 2 x 2 = 24N, it can take a lot of time: on a machine with 80 physical cores, this means trying out at most 24 x 80 = 1,920 different setups! 😱
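To make that arithmetic concrete, the grid below enumerates the full search space, assuming two candidate values for the KMP block time (the remaining 2x factor in the formula above):

```python
from itertools import product

cores = range(1, 81)                        # 1..80 physical cores (the N factor)
allocators = ["malloc", "tcmalloc", "jemalloc"]
openmp_libs = ["gnu", "intel"]
thp = [False, True]
kmp_blocktime = [0, 1]                      # two candidate values, per the 2x factor

# Every combination a brute-force sweep would have to benchmark
grid = list(product(cores, allocators, openmp_libs, thp, kmp_blocktime))
print(len(grid))  # 80 * 3 * 2 * 2 * 2 = 1920 setups
```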
Fortunately, SigOpt’s Bayesian Optimization makes hyperparameter tuning experiments both faster and more convenient to analyze, while reaching performance close to the optimal results. Hugging Face found that brute force gives the best latency results, while SigOpt’s Bayesian Optimization approach had a maximum relative difference of only 8.6%. Hugging Face can also still leverage the brute-force approach within SigOpt’s Intelligent Experimentation platform through the SigOpt Bring Your Own Optimizer feature.
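For readers who want a feel for what such an experiment looks like, here is a minimal sketch following the suggest/observe pattern of SigOpt's Core Python client. The parameter names, bounds, and benchmark function are assumptions for illustration, not Hugging Face's exact setup, and actually running it requires a SigOpt API token and network access.

```python
# Illustrative search space; names and bounds are assumptions, not the exact experiment.
PARAMETERS = [
    dict(name="num_cores", type="int", bounds=dict(min=1, max=80)),
    dict(name="allocator", type="categorical",
         categorical_values=["malloc", "tcmalloc", "jemalloc"]),
    dict(name="openmp", type="categorical", categorical_values=["gnu", "intel"]),
    dict(name="thp", type="categorical", categorical_values=["off", "on"]),
    dict(name="kmp_blocktime", type="int", bounds=dict(min=0, max=200)),
]
OBSERVATION_BUDGET = 40  # number of trials; the analysis below suggests ~25 may suffice


def optimize(conn, run_benchmark):
    """Suggest/observe loop; `conn` is a sigopt.Connection, `run_benchmark`
    measures latency (ms) for a given assignment of knob values."""
    experiment = conn.experiments().create(
        name="BERT inference latency tuning",
        parameters=PARAMETERS,
        metrics=[dict(name="latency_ms", objective="minimize")],
        observation_budget=OBSERVATION_BUDGET,
    )
    for _ in range(OBSERVATION_BUDGET):
        suggestion = conn.experiments(experiment.id).suggestions().create()
        latency = run_benchmark(suggestion.assignments)  # user-supplied benchmark
        conn.experiments(experiment.id).observations().create(
            suggestion=suggestion.id,
            values=[dict(name="latency_ms", value=latency)],
        )
    return experiment
```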
SigOpt is also very useful for analysis: it provides many figures and a lot of valuable information. First, it gives the best value it was able to find, the corresponding knobs, and the history of trials, showing how the objective improved over them, for example with sequence length = 20:
In this specific setup, 16 cores (along with the other knobs) gave the best results. That is very important to know because, as mentioned before, it means multiple instances of the model can run in parallel while each still achieves the best latency.
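Since 16 cores gave the best per-instance latency here, an 80-core machine could host several instances pinned to disjoint core ranges. The helper below is a hypothetical sketch of that partitioning; in practice each instance would be pinned with a tool like taskset or numactl.

```python
def partition_cores(total_cores, cores_per_instance):
    """Split cores [0, total_cores) into disjoint ranges, one per model instance."""
    n_instances = total_cores // cores_per_instance
    return [
        (i * cores_per_instance, (i + 1) * cores_per_instance - 1)
        for i in range(n_instances)
    ]


ranges = partition_cores(80, 16)
print(ranges)  # [(0, 15), (16, 31), (32, 47), (48, 63), (64, 79)]
# Each instance could then be launched as, e.g.:
#   taskset -c 0-15 python serve_bert.py   (hypothetical command)
```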
It also shows that the optimization had converged after roughly 20 trials, meaning that perhaps 25 trials instead of 40 would have been enough. A wide range of other valuable information is available, such as Parameter Importance:
As expected, the number of cores is by far the most important parameter, but the others play a part too, and their impact is very experiment-dependent. For instance, for the sequence length = 512 experiment, this was the Parameter Importance:
Here, not only was the impact of GNU OpenMP vs. Intel OpenMP bigger than the impact of the allocator, but the relative importance of each knob was also more balanced than in the sequence length = 20 experiment. Many more figures, often interactive, are available on SigOpt, such as:
- 2D experiment history, which lets you compare knobs against knobs or knobs against objectives
- 3D experiment history, which does the same as the 2D experiment history with one more knob or objective.
Conclusion – Accelerating Transformers for Production
In this post, we shared how Hugging Face optimized AI workloads for the new Intel Ice Lake Xeon CPUs to run at scale, along with the software elements that can be swapped and tuned to exploit the full potential of the hardware. All these items should be considered after setting up the various lower-level knobs to maximize the usage of all cores and resources.
Read more in Hugging Face’s full-length blog post about the software and hardware optimizations Hugging Face implemented to accelerate their Transformers on Intel CPUs. If you want to learn more about how SigOpt can be used to improve different workloads, check out this blog post. If you want to get started addressing similar problems in your workflow, use SigOpt for free today by signing up at https://sigopt.com/signup.