Imagine a data scientist attempting to run hyperparameter optimization (HPO) experiments. HPO requires training a model repeatedly with different hyperparameter sets to find the best-performing configuration. This data scientist may want to evaluate several models in parallel so their experiment finishes faster. Evaluating several models in parallel may require more compute than is available on the data scientist’s laptop, so they turn to remote infrastructure, such as AWS EC2 instances.
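To make the workflow concrete, here is a minimal sketch of an HPO loop that evaluates several hyperparameter sets in parallel. It uses plain random search rather than any particular optimizer, and `train_and_evaluate`, the parameter names, and their ranges are hypothetical stand-ins for a real training run.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def train_and_evaluate(params):
    """Hypothetical stand-in for a real training run; returns a score to maximize."""
    lr, depth = params["learning_rate"], params["max_depth"]
    # Toy objective that peaks at learning_rate=0.1 and max_depth=6.
    return -((lr - 0.1) ** 2) * 100 - ((depth - 6) ** 2) * 0.01


def sample_params(rng):
    """Draw one hyperparameter set at random (assumed ranges, for illustration)."""
    return {
        "learning_rate": rng.uniform(0.001, 0.3),
        "max_depth": rng.randint(2, 12),
    }


def run_hpo(num_trials=20, parallelism=4, seed=0):
    """Evaluate num_trials random hyperparameter sets, parallelism at a time."""
    rng = random.Random(seed)
    candidates = [sample_params(rng) for _ in range(num_trials)]
    # Threads keep the sketch short; real training runs would typically use
    # separate processes or separate machines instead.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        scores = list(pool.map(train_and_evaluate, candidates))
    best_score, best_params = max(zip(scores, candidates), key=lambda pair: pair[0])
    return best_params, best_score


if __name__ == "__main__":
    best_params, best_score = run_hpo()
    print(best_params, best_score)
```

On a laptop, `parallelism` is capped by the cores available, which is exactly the limit that pushes the experiment onto remote machines.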
Whose responsibility is it to set up the HPO experiments on the remote machines? The data scientist builds models, and they use HPO as a tool to boost performance. Starting jobs on complex infrastructure sounds like the responsibility of the infrastructure or DevOps team. So, what should this data scientist do? They could consult the infrastructure team every time they want to launch an HPO experiment. Or, they could become a magical unicorn who is an expert in both model building and infrastructure management. Easy, right?
Machine learning (ML) infrastructure aims to bridge the gap between machine learning and infrastructure. Broadly, ML infrastructure products solve data science workflow challenges with software infrastructure technologies. In this post, I’ll provide an overview of ML infrastructure tools that aim to solve the problem of launching HPO experiments on clusters, discuss some of their common infrastructure technology choices, and end with some thoughts on the user experience of ML infrastructure tools.
At SigOpt we offer an API for hyperparameter optimization, and working with our enterprise customers to deliver this service at scale has given us deep experience with the challenge of launching HPO experiments onto clusters. Last year, to better support these ML infrastructure needs for our customers, we launched SigOpt Orchestrate (paper), which complements existing products like Amazon SageMaker, AI Platform from Google Cloud, Kubeflow’s Katib, NVIDIA frameworks, and the startup Polyaxon. Ours is the only solution with enterprise-level support and cloud-agnostic infrastructure management.
What are the critical capabilities of these solutions that make them effective at solving these tough engineering tasks? First, containerization and Kubernetes underpin most ML infrastructure solutions and are essential for any effective cluster management solution. Containerization helps users package code and dependencies to run on remote machines. If you’ve used containers, you’ve likely used Docker, which is today’s most popular containerization technology. Kubernetes provides “container orchestration,” allowing you to organize your instances into a Kubernetes cluster with one unified API for starting containers on your cluster. Using these technologies, an ML infrastructure tool like Orchestrate can build an image for your model’s code and dependencies, push that image to a cloud image registry, and then send one job spec file to your Kubernetes cluster with instructions for starting and running containers from the image.
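As a rough illustration of the build, push, and submit flow described above, the sketch below assembles a Kubernetes `batch/v1` Job manifest and lists the Docker and kubectl commands a tool might run on the user’s behalf. This is not how Orchestrate works internally. The helper names, the `experiment` label, the image URL, and the `train.py` entry point are all assumptions made for the example.

```python
import json


def build_job_spec(name, image, command, parallelism, completions):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict."""
    labels = {"experiment": name}  # hypothetical label used to find the pods later
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "parallelism": parallelism,   # pods running at the same time
            "completions": completions,   # total successful pods required
            "backoffLimit": 0,            # do not retry failed pods
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "trainer", "image": image, "command": command}
                    ],
                },
            },
        },
    }


def launch_commands(image, spec_path):
    """Commands to build the image, push it, and submit the job spec."""
    return [
        ["docker", "build", "-t", image, "."],
        ["docker", "push", image],
        ["kubectl", "apply", "-f", spec_path],
    ]


if __name__ == "__main__":
    image = "registry.example.com/team/hpo-model:v1"  # hypothetical registry path
    spec = build_job_spec("hpo-demo", image, ["python", "train.py"], 4, 20)
    print(json.dumps(spec, indent=2))  # kubectl accepts JSON as well as YAML
    for cmd in launch_commands(image, "job.json"):
        print(" ".join(cmd))
```

The point of the sketch is that one spec file carries everything the cluster needs: which image to run, how many copies to run at once, and how many runs make up the experiment.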
But launching HPO experiments isn’t the end of the story. The second essential capability for effective cluster management is a user experience that makes the tool easy for a data scientist to use. Debugging and monitoring tools are critical to this workflow. Yes, the data science team could simply use the kubectl tool to monitor status and view logs, but that solution requires them to learn a tool that is, in reality, strictly an infrastructure tool. And the whole point of ML infrastructure is to abstract away infrastructure tools from modelers. Data scientists and machine learning engineers deserve to harness the power of containerization and Kubernetes without needing to understand containerization and Kubernetes. The user experience on top of an ML infrastructure tool is the last mile in bridging the gap between machine learning and infrastructure. Whether it’s a website, an API, a Python package, or a CLI, an ML infrastructure tool should contain an abstraction layer between the modeler and the infrastructure tools.
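One way to picture such an abstraction layer is a small Python class that exposes experiment-level operations and keeps kubectl out of sight. This is a hypothetical sketch, not the interface of Orchestrate or any other product. The `Experiment` class, its method names, and the `experiment` pod label are assumptions; the kubectl subcommands and flags it wraps are standard ones.

```python
import subprocess


class Experiment:
    """Hypothetical modeler-facing handle that hides kubectl behind two methods."""

    def __init__(self, name, runner=None):
        self.name = name
        # runner takes a list of kubectl arguments and returns stdout as text;
        # it can be swapped out for testing without a cluster.
        self._run = runner or self._run_kubectl

    @staticmethod
    def _run_kubectl(args):
        result = subprocess.run(
            ["kubectl", *args], capture_output=True, text=True, check=True
        )
        return result.stdout

    def status(self):
        """Return {pod_name: phase} for every pod labeled with this experiment."""
        out = self._run([
            "get", "pods",
            "-l", f"experiment={self.name}",  # assumes pods carry this label
            "--no-headers",
            "-o", "custom-columns=NAME:.metadata.name,PHASE:.status.phase",
        ])
        return dict(line.split() for line in out.splitlines() if line.strip())

    def logs(self, pod_name, tail=50):
        """Return the last `tail` log lines from one pod of the experiment."""
        return self._run(["logs", pod_name, f"--tail={tail}"])
```

A modeler would then write something like `Experiment("hpo-demo").status()` and get back a dictionary of pod names and phases, without needing to know which kubectl flags produced it.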
Many data science teams choose to use hyperparameter optimization to boost the performance of their models, but this introduces challenges into their workflow. Machine learning infrastructure tools can help bridge the gap between the modeler and the cluster by inserting an abstraction layer between the model builder and any infrastructure tools used to communicate with the cluster. I hope this blog post gives you a starting point to evaluate, or build, ML infrastructure tools for your team.