With so much content and data available today, users can struggle to find what they're looking for, and many simply give up. At the same time, industry datasets operate at massive scale and continue to grow rapidly. Recommendation systems address both problems: they keep users engaged and make it easier for them to find what they want. Current recommendation systems include DLRM, Neural Collaborative Filtering, Variational Autoencoders for Collaborative Filtering, BERT4Rec, and more.
The Deep Learning Recommendation Model (DLRM) capitalizes on large amounts of training data and has shown advantages over traditional methods, which is why it is one of the benchmark workloads for MLPerf. Training large-scale recommendation systems like DLRM poses multiple challenges, including huge datasets, complex data preprocessing, and extensive repeated experimentation. In a recent blog post, Ke Ding from Intel explored how to overcome these challenges with a variety of techniques. In this post, I'll dive deeper into how Ke ran experimentation, and hyperparameter optimization in particular, to converge DLRM in fewer iterations than is otherwise typically possible.
When Intel first started training DLRM on the Criteo Terabyte dataset, it took over 2 hours to reach convergence with 4 sockets and a 32K global batch size on Intel Xeon Platinum 8380H. After their optimizations, DLRM converged in less than 15 minutes with 64 sockets and a 256K global batch size on Intel Xeon Cooper Lake 8376H. Intel enabled DLRM to train significantly faster with novel parallelization solutions, including vertical split embedding, LAMB optimization, and parallelizable data loaders. In the process, they:
- Reduced communication costs and memory consumption.
- Enabled large batch sizes and better scaling efficiency.
- Reduced bandwidth requirements and overhead.
The figure below shows that training time was reduced by more than 8x. Read more about their compute, memory, and bandwidth optimizations in Parallel Universe Magazine.
In addition to these hardware-focused optimizations, Intel applied hyperparameter optimization (HPO) to configure the model to reach a performance threshold of 0.8025 AUC at convergence, even at higher socket counts, in the shortest wall-clock time possible. To do so, they leveraged the SigOpt Intelligent Experimentation platform. SigOpt lets teams bring their own optimizer for hyperparameter optimization or use SigOpt's proprietary optimizer. In this case, Intel chose the proprietary SigOpt optimizer, which combines the best attributes of Bayesian and global optimization algorithms for more sample-efficient HPO.
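The overall loop behind this kind of HPO can be sketched as an ask/evaluate/report cycle. The sketch below is purely illustrative: the search-space bounds, the `suggest` and `evaluate` functions, and the synthetic AUC objective are all hypothetical stand-ins, and simple random sampling takes the place of SigOpt's proprietary Bayesian/global optimizer (which is not shown here).

```python
import random

# Hypothetical search space mirroring the kinds of hyperparameters Intel
# tuned (learning rate, warm-up, decay). Bounds are illustrative only.
SPACE = {
    "learning_rate": (1e-4, 1e-1),
    "warmup_steps": (1_000, 8_000),
    "decay_start_step": (10_000, 80_000),
}

def suggest(rng):
    """Propose one candidate configuration ("ask"). Random sampling
    stands in for SigOpt's Bayesian/global optimizer in this sketch."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in SPACE.items()}

def evaluate(params):
    """Stub for a full DLRM training run that returns validation AUC.
    This synthetic objective just peaks near a mid-range learning rate;
    a real run would train the model and compute AUC on held-out data."""
    return 0.80 - abs(params["learning_rate"] - 0.03)

def optimize(budget=20, seed=0):
    """Sequential ask/evaluate/report loop until the budget is spent."""
    rng = random.Random(seed)
    best_params, best_auc = None, float("-inf")
    for _ in range(budget):
        params = suggest(rng)   # ask for a suggestion
        auc = evaluate(params)  # train and validate with it
        # Keep the incumbent; a real optimizer would also use this
        # observation to inform its next suggestion.
        if auc > best_auc:
            best_params, best_auc = params, auc
    return best_params, best_auc
```

The key design point is that each observation feeds back into the optimizer, which is what makes a Bayesian approach more sample-efficient than the pure random search used for brevity above.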
To further reduce the total wall-clock time for hyperparameter optimization, Intel leveraged SigOpt's parallel experiment feature, which makes it easy to run HPO across as much compute as you have available. Here, it allowed them to run hyperparameter optimization on several Xeon clusters at once, automatically and intelligently tuning important hyperparameters such as the learning rate, warm-up, and decay without running jobs for days.
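Conceptually, parallel HPO means requesting several suggestions at once and evaluating each on its own worker. The sketch below is a minimal illustration of that pattern only, not the SigOpt client API: the `evaluate` stub and the single-parameter search space are assumptions, and a thread pool stands in for a fleet of Xeon clusters.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def evaluate(params):
    """Stub for one DLRM training run; returns a toy AUC proxy
    instead of actually training the model."""
    return 0.80 - abs(params["learning_rate"] - 0.03)

def parallel_round(rng, width=4):
    """Request `width` suggestions at once and evaluate them
    concurrently, one per worker. Each worker stands in for one
    cluster running a full training job."""
    batch = [{"learning_rate": rng.uniform(1e-4, 1e-1)}
             for _ in range(width)]
    with ThreadPoolExecutor(max_workers=width) as pool:
        scores = list(pool.map(evaluate, batch))
    # All observations would be reported back to the optimizer;
    # here we just keep the best configuration from this round.
    return max(zip(scores, batch), key=lambda t: t[0])

best_auc, best_params = parallel_round(random.Random(0))
```

With `width` workers, each round costs roughly the wall-clock time of one training run instead of `width` runs, which is where the days-to-hours savings comes from.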
SigOpt also automatically tracked all metadata associated with these experiments in the SigOpt dashboard. Within this dashboard, SigOpt provided Intel with visualizations of individual training runs as well as comparisons of metrics and parameters within and across experiments. The figure above shows how the experiment reached the AUC threshold in just a few runs, then continued to improve beyond it. SigOpt visualizations can also reveal which parameters matter most to the optimization goal. For instance, the figure below shows a parameter importance analysis, which indicates that the base learning rate is the most influential hyperparameter for AUC optimization.
With SigOpt-enabled HPO, Intel reached convergence in as few as four runs. From there, DLRM continued to improve past the threshold to 0.85 AUC with a 256K global batch size. This exceeded the original 0.75 AUC with a 32K global batch size, delivering better accuracy much faster than was otherwise possible with other optimization methods.
If you want to learn more about this use case, Ke Ding will present it at the SigOpt AI & HPC Summit on November 16th. The Summit is free and virtual, so register today. If you'd rather get your hands on SigOpt to see how it works on your own problems, you can sign up for free and get started in minutes.