Best-On-Intel: Intel Labs On Their Development Of AutoQ, An Efficient Framework For Quantization Using OpenVINO Toolkit And SigOpt

Tobias Andreasen, Vui Seng Chua, Chaunté W Lacewell, Nilesh Jain (Intel Labs), and Alexander Kozlov (OpenVINO)

After joining Intel, the SigOpt team has continued to work closely with groups both inside and outside of Intel in order to enable modelers everywhere to accelerate and amplify their impact with the SigOpt intelligent experimentation platform.

One of those teams has been Intel Labs, which has been developing a novel approach to quantization called AutoQ. The team has been leveraging the SigOpt Bring-Your-Own-Optimizer (BYOO) capabilities for their project in order to handle experiment management without worrying about tracking and visualization.

If you are interested in using BYOO and the other SigOpt features, sign up for free today.

In two sentences, what is AutoQ?

AutoQ is an automated solution that improves the scalability and productivity of model quantization. AutoQ employs Automated Machine Learning (AutoML) algorithms to automatically assign precision for different layers of the deep neural network (DNN) to minimize accuracy degradation, maximize performance, significantly (>100x) improve productivity and alleviate the reliance on subject matter experts.

You mentioned quantization, what is that and why is it important?

Machine learning models, especially DNNs, are ubiquitous given their predictive power and robustness for intelligent applications. However, they are over-parameterized and demand large computational and memory footprints, leading to inferior performance, low energy efficiency and, oftentimes, misalignment with requirements for deployment.

Quantization is a method for translating trained models into lower-precision arithmetic. For example, models trained in full precision (FP32) can be mapped to a lower-precision format such as FP16, 8-bit integers, or 4-bit integers. Mapping a model from FP32 to 8-bit arithmetic reduces its memory footprint by 4x, and in theory it can improve compute performance by as much as 4x. Upcoming AI hardware platforms are starting to provide efficient implementations of low-precision arithmetic.
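To make the FP32-to-8-bit mapping concrete, here is a minimal sketch of symmetric uniform quantization in plain Python. It is an illustration only, not the scheme NNCF or AutoQ uses; the function names and the sample weights are assumptions made for this example.

```python
def quantize_int8(values):
    """Symmetric uniform quantization of floats to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 127; guard against an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map 8-bit integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.003, 0.88]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The distortion per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Storage per weight drops from 32 bits to 8 bits.
memory_reduction = 32 / 8
```

Note how the small weight 0.003 rounds to zero: this is the kind of distortion that can degrade accuracy when the precision is too low for a given layer.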

But quantization introduces distortion to the model parameters and could degrade the accuracy of the already trained DNNs. In practice, the process of quantization involves tweaking the quantization parameters such as skipping sensitive layers or assigning different precision per layer in order to minimize accuracy impact while maximizing performance and power efficiency.

To whom is this research important?

This research is important to Intel’s customers who are deploying deep learning into production and running these on Intel processors. The conventional approach relies on human expertise and is largely manual. In this manual scenario, engineers first decide on a representative subset of data. Then they painstakingly calibrate the quantization parameters, layer by layer, to minimize distortion in order to retain the model accuracy and optimize for performance after deployment.

This process is iterative and labor-intensive. It can take weeks for a given model, depending on the domain experience of the engineers and the complexity of the model. Consider quantizing a prominent image classification model, like ResNet-50, with a mixture of 8-, 4-, and 2-bit precision. With three choices for each of its 50 convolution layers and 50 activation functions, there would be 3^100 possible combinations.
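The size of that search space is easy to check. The short calculation below uses only the figures quoted above (three bit widths, 50 convolutions, 50 activations); the variable names are ours.

```python
precisions = (8, 4, 2)   # candidate bit widths per quantizable element
num_convolutions = 50    # figures quoted above for ResNet-50
num_activations = 50

num_elements = num_convolutions + num_activations
search_space = len(precisions) ** num_elements  # 3 ** 100

print(f"{search_space:.3e} possible configurations")
```

The result is on the order of 5 x 10^47 configurations, far too many to enumerate or tune by hand.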

In practice, each Intel customer has many models to quantize, and a variety of different solutions may be required across Intel’s rich AI solutions spanning edge to cloud. To address this productivity challenge, a scalable and automated solution is critical for optimizing a large set of models for deployment.

What piqued your interest in this specific type of research?

Intel’s customers have brought this up as a considerable challenge in our conversations with them. Similarly, we have witnessed the ongoing trend of too much demand for AI talent without the supply to match it. If we could put our customers in a position to do these painstaking tasks more productively, they could deploy more models with fewer engineers much faster and with better performance.

How did you leverage OpenVINO to build AutoQ?

We used the Neural Network Compression Framework (NNCF), one of the components in the Intel OpenVINO toolkit. AutoQ leverages NNCF in two ways. First, NNCF is able to automatically transform models according to a prescribed layer-wise precision. During the exploration stage, AutoQ provides a mixed-precision candidate to NNCF. NNCF then handles the quantization transformation of the model, evaluates the accuracy of the quantized model, and estimates model efficiency improvements such as memory and compute compression metrics. Second, once AutoQ learns a final solution, it leverages the NNCF infrastructure for quantization-aware training to recover the accuracy of the quantized model. One thing to note is that we have used NNCF as a testbed for the algorithm, but we have also released AutoQ as part of OpenVINO/NNCF to make it available to developers.
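The exploration stage described above (propose a mixed-precision candidate, evaluate accuracy, estimate compression, keep the best) can be sketched as a simple loop. This is a toy illustration under stated assumptions: it uses random search as a stand-in for AutoQ's AutoML algorithm, the layer sizes are hypothetical, and `evaluate_accuracy` is a made-up placeholder for the evaluation that NNCF performs. None of these names are NNCF or AutoQ APIs.

```python
import random

PRECISIONS = (8, 4, 2)

# Hypothetical layers: name -> number of weights.
LAYER_SIZES = {"conv1": 9_408, "conv2": 36_864, "conv3": 147_456, "fc": 512_000}


def size_compression(assignment):
    """Ratio of FP32 weight storage to storage under the given bit widths."""
    fp32_bits = sum(32 * n for n in LAYER_SIZES.values())
    quant_bits = sum(assignment[name] * n for name, n in LAYER_SIZES.items())
    return fp32_bits / quant_bits


def evaluate_accuracy(assignment):
    """Toy placeholder: a real run would quantize and validate the model."""
    # Assumption for illustration only: fewer bits cost more accuracy.
    penalty = sum((8 - bits) * 0.004 for bits in assignment.values())
    return 0.76 - penalty


def explore(num_candidates, min_accuracy, seed=0):
    """Random search for the most compressed candidate that is accurate enough."""
    rng = random.Random(seed)
    best = None
    # Start from the uniform 8-bit baseline, then try random mixed assignments.
    candidates = [{name: 8 for name in LAYER_SIZES}]
    candidates += [
        {name: rng.choice(PRECISIONS) for name in LAYER_SIZES}
        for _ in range(num_candidates)
    ]
    for candidate in candidates:
        accuracy = evaluate_accuracy(candidate)
        compression = size_compression(candidate)
        if accuracy >= min_accuracy and (best is None or compression > best[2]):
            best = (candidate, accuracy, compression)
    return best


best = explore(num_candidates=50, min_accuracy=0.73)
```

Seeding the loop with the uniform 8-bit baseline guarantees a feasible answer whenever the baseline itself meets the accuracy constraint; any mixed-precision candidate that wins must compress at least as well.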

Along the same lines, how did you apply SigOpt?

We used the SigOpt Bring-Your-Own-Optimizer (BYOO) feature to log and visualize our model metadata during the training and optimization processes. During AutoQ’s iterative search, for each explored quantization configuration, AutoQ pushes the metadata and metrics of interest to SigOpt as a run. We then use the SigOpt Dashboard to visualize the results of all the runs in the project together without writing any plotting code.

For our use case, we are generally interested in the metrics over cycles of iteration. By using SigOpt we can easily plot accuracy, model size compression and bit complexity over Runs. We can also plot Runs against different sets of AutoQ hyperparameters and constraints.
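As a rough picture of what "one run per explored configuration" means, the sketch below keeps run records in a plain in-memory list and derives a best-so-far trace over iterations. It deliberately does not call the SigOpt client; the `Run` class, the field names, and all metric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One explored quantization configuration and its logged results."""
    iteration: int
    metrics: dict
    metadata: dict = field(default_factory=dict)


def best_so_far(runs, metric):
    """Running maximum of a metric over runs ordered by iteration."""
    trace, best = [], float("-inf")
    for run in sorted(runs, key=lambda r: r.iteration):
        best = max(best, run.metrics[metric])
        trace.append(best)
    return trace


# Hypothetical values, not results from AutoQ.
runs = [
    Run(0, {"accuracy": 0.71, "size_compression": 5.2}, {"episode": 0}),
    Run(1, {"accuracy": 0.74, "size_compression": 6.1}, {"episode": 1}),
    Run(2, {"accuracy": 0.73, "size_compression": 7.0}, {"episode": 2}),
]

accuracy_trace = best_so_far(runs, "accuracy")
```

A trace like this is the data behind a "metric over Runs" plot: each point corresponds to one pushed run, ordered by search iteration.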

We found that integrating SigOpt with AutoQ, OpenVINO and NNCF provided a better developer experience and further democratized the quantization process. As a next step we are exploring ways we can simplify the process even further.

Can you talk a little bit about the results from AutoQ?

We evaluated AutoQ on several convolutional models and computer vision tasks, such as image classification, object detection and semantic segmentation. Depending on the dataset and the model architecture, AutoQ automatically quantized the models to a size 6-8x smaller than the original FP32 models, with negligible accuracy loss.

Earlier in the year, we published a whitepaper with some specific results. Among other things, it showed that AutoQ applied to ResNet-50 was able to compress the model by more than 7x in size while achieving slightly higher accuracy (+0.12 Top-1).

In terms of productivity, a model can be quantized in a matter of a few hours to a day with AutoQ as compared to 2-4 weeks by human effort, demonstrating significant productivity gain with minimal human intervention.

When can readers start to use AutoQ?

It is available now. AutoQ was released with NNCF v1.6 earlier this year.

For anyone interested in learning more about the OpenVINO toolkit, it can be installed directly from the website and is ready to use from there. For anyone interested in going from 1 to 100 with the OpenVINO toolkit, Intel has launched both a beginner and an intermediate course on the toolkit on Coursera, which we highly recommend.

SigOpt is a little bit different in the sense that it is a software-as-a-service product, so all you really need to do is sign up for a free account. From there, the platform takes you through everything you need to understand in order to hit the ground running!

Here towards the end, what is next for AutoQ?

The next step for AutoQ is to investigate the applicability of AutoQ to more classes of models and tasks such as transformer networks for language modeling and generative models.

Finally, one thing that we have noticed is that SigOpt leverages sequential model-based optimization approaches, whereas AutoQ looks at cycles of iterations. This means that some of the visualizations that are relevant for AutoQ do not currently exist on the SigOpt dashboard. We have therefore been working with the SigOpt team to provide guidance on how such visualizations should look. Once those visualizations are ready, we hope to integrate AutoQ even more tightly with SigOpt.

Learn more about AutoQ by reading this whitepaper: Automated Mixed-Precision Quantization for Next-Generation Inference Hardware. And sign up to use SigOpt for free and get started in minutes.

Tobias Andreasen Machine Learning Specialist
Vui Seng Chua Guest Author
Chaunté W Lacewell Guest Author
Nilesh Jain (Intel Labs) Guest Author
Alexander Kozlov (OpenVINO) Guest Author