SigOpt Recap with Intel AI – Faster, Better Training for Recommendation Systems

Luis Bermudez
Deep Learning, Recommendation System

SigOpt hosted our first user conference, the SigOpt AI & HPC Summit, on Tuesday, November 16, 2021. It was virtual and free to attend, and you can access content from the event at sigopt.com/summit. For more than any other reason, we were excited to host this Summit to showcase the great work of some of SigOpt’s customers. Today, we share how Ke Ding from Intel was able to increase the performance of Recommendation Systems on large scale training clusters using software optimization techniques.

Q: What is a recommendation system?

A typical recommendation system takes two types of data as input. One is called numerical dense features. Examples include age, number of purchases per month, and active minutes per day on a certain application. The other types of inputs are categorical, sparse features. Examples include gender, product ID, and location. Sparse features are usually very large due to its nature. With both dense and sparse features as inputs, they pass through the recommendation models and it generates an output. The most common output of the recommendation system is the probability of a click, called click through rate. It indicates with all the dense and the sparse context features presented, whether or not the user will end up accepting the recommendation. 

You can imagine why it’s so important to build the best recommendation systems in order to generate the best business values. Hyperscalers and the many other enterprise customers are working very hard in this domain. As a result, there are many SOTA models for recommendation systems such as NCF, Wide & Deep (W&D), DIEN, and DLRM. Also, there are many industry benchmarks. Two popular examples are Twitter’s annual RecSys Challenge and MLconf’s MLPerf benchmark. 

Recommendation systems are so important yet there are many technical challenges to solve. Every accuracy improvement directly translates into business values. Hence, the accuracy requirement is very high. Because of this, oftentimes the recommendation models are very big and the data set is huge in order to cover more scenarios. Model retraining is another challenge in order to get the most up to date models based on user data.

Q: How does DLRM work?

DLRM is a SOTA model and part of the MLPerf training benchmark. At a high level its model topology actually is quite simple. The dense feature goes to the bottom MLP and the sparse feature goes to embedding lookup tables. After that, they interact together and then feed into top MLP and then output as the click probability. Each block has its own characteristics as color coded for memory, communication and compute bounds. The Criteo data set is used for DLRM training and it’s 1.2 terabytes. The DLRM model size is huge because of the sparse layers.

Probability of a click

In total, if running it on a single instance, the required memory will be more than 100 gigabytes. On the other hand, the 26 embedding tables are extremely imbalanced. All the biggest tables have more than 14 million entries, while the smallest one only has three or four entries. With all this, DLRM poses unique challenges on distributed training because of the need to balance, compute memory and communication.

The model is so large that we will need to use both model and data prioritization in order to speed up the training. However, when we do so, it means that we are increasing the global batch size and reducing the number of weight updates per epoch. However, this introduced another challenge for the model convergence. To tackle these challenges we implemented an efficient scale out solution using data and the model parallelization, a novel data loader and the new hybrid splitSGD + LAMB optimizers and efficient hyperparameter tuning for model convergence.

Modern Convergence

Q: How do you get DLRM to converge?

Let’s talk about DLRM model convergence. During the modeling practice, we have many design choices that need to be finalized – through both heuristic knowledge and automation tools. 

Even if you have a well-trained FP32 model, when you retrain or fine tune it for a downstream task (or perhaps optimize it for a particular hardware platform), oftentimes you cannot use the original hyperparameter set and you will need to tune a new one in order to meet the business requirement. Common examples of this include scale-out training with more hardware resources in order to speed up training. This means you now have a big batch size or you have to use non-precision formats such as BF16. Another example includes a new optimizer which needs new hyper parameters. A third example is a new dataset for a new use case that needs new hyper parameters. A fourth example is a simple desire to tune the parameters further to get a better training time. All of these examples require access to a good hyperparameter optimization tool or service.

Q: How do you optimize your DLRM implementation?

As a data scientist, I need my hyper parameter optimization tool to provide a flexible search space definition which supports different data types (e.g., transformations, constraints, or conditions). My HPO tool should also be capable of a large search space to tackle complicated problems.

I also need my hyper parameter optimization software to be built on an efficient model so that it can converge and find the global optimum more quickly. And that’s actually the unique advantage. Also, in order to speed up the execution, I need a parallel execution feature. Furthermore, it’s important that I’m allowed to activate early stopping for trials that are not yielding promising results. This is because we can already predict whether the current one is going to be much worse or much better based on early results. Also, it’s important to be able to support optimization scenarios which contain multiple objectives. For example, I may need to optimize both performance and accuracy. For this example, multi metric support is required. 

Finally, I also need my HPO tool to have a nice dashboard with insights. A dashboard will save tons of work for data scientists like myself. Fortunately, we have SigOpt as part of Intel’s machine learning toolset. SigOpt is a hosted service that can support all the requirements I just mentioned.

Q: How did SigOpt help DLRM?

The SigOpt platform helped my DLRM training convergence problem. 

HPO Tuning Setup

The above image shows the HPO tuning setup. In this diagram, the hyper parameter space for the tuning workload is defined. Then, the SigOpt Optimizer sends suggestions to the hyper parameter space, and then it runs the trial with the suggested hyper parameters. Once the trial is complete, it sends the resulting metrics back to the SigOpt optimizer to get more suggestions. This is an iterative process until the pre-defined tuning budget is reached. In this DLRM case, the hyper parameter space contains four tunable variables: learning rate, warmup, decay, and optimizer choice. It’s a big workload and I use Intel a Xeon CPU cluster for this tuning exercise. The powerful experiment features from SigOpt help speed up the tuning.

Efficiency improvements of any kind are super helpful in saving cost as well as to improve the time to trim. One of the features I used during this tuning exercise includes the learning curve regulator, which helps with anomaly detection, learning curves, and overfitting detection. Also, I used the Pruner feature to early stop unpromising experiments. I used several types of pruners in my experiments, including: medium, percentage, and ASHA. 

SigOpt provides a very nice dashboard for the summary and insights for your tuning experiments. For this DLRM tuning task to simplify the problem, we set the tuning target to optimize the AUC accuracy score only as a single metric summarization problem. So from the left side experiment improvement figure, we can clearly say SigOpt is able to find the result much quicker and better compared to other open source HPO frameworks. The DLRM AUC convergence threshold is 0.8025. SigOpt only needs a few experiments to exceed this threshold. Within the experiment budget the optimization result gets improved further, and eventually it’s able to find a very good AUC score that helps achieve a faster DLRM convergence.

Conclusion

To learn more about software and hardware optimizations for Recommendation Systems, I encourage you to watch the full the talk. To see if SigOpt can drive similar results for you and your team, sign up to use it for free.

1605041069985
Luis Bermudez AI Developer Advocate