How We Scaled SigOpt to Handle the Most Relentless of Workloads: Part 1

Jim Blomo, Patrick Hayes, and Barrett Williams
Applied AI Insights, Augmented ML Workflow, Training & Tuning

Editor’s note: you can find Part 2 in this series here.

Scalability is one of the main factors to consider when deciding whether an optimization solution will work for you and your organization. Modeling and simulation workloads are constantly evolving, and the quantity of resources dedicated to training, tuning, and testing ML models is continuously growing. 

If you’re evaluating the components of your modeling platform, you should ask yourself, “Will my optimization solution scale with me?” To answer this question, let’s define the different dimensions of scale:

  • Organizational scale: Developing flexible, generalized solutions that work across projects and teams with potentially different modeling workloads and needs.
  • System scale: Engineering systems that provide new suggestions within milliseconds to reduce latency in the tuning process.
  • Dimensional model scale: Crafting bespoke algorithmic strategies to tune models that require more than 25 parameters, including categorical, integer, and continuous parameter types.
  • Evaluation scale: Supporting problems that require thousands of observations to get precise, robust measurements of model performance.
  • Parallel compute scale: Asynchronously processing and suggesting parameter configurations to reduce wall-clock time and make full use of computing resources, while weighing the tradeoffs in this process.
  • Experiment throughput scale: Bursting Bayesian optimization to thousands of experiments in any given hour for a single user or set of users.

Organizational scale: 

What works for an individual using an optimization tool for a one-off project may not work for teams working on mission-critical projects over the course of several years. A data scientist with time and a specific problem can adapt a solution to their specific workflow, incorporate tweaks dependent on their problem, and handle any upgrades or hurdles along the way. These are the challenges inherent in change management for an in-house solution.

Organizations that want to scale usage of an optimization solution, however, need a tool that will work with many different languages, model types, environments, purposes, and users. Upgrades need to be non-disruptive, and algorithm changes need to avoid regression. Interfaces need to be consistent so that data scientists and other users who move from one project to another or one team to another can transfer their skills and tools to the new job, and teams can effectively communicate with each other.

System scale:

Even for a specific modeling problem, it’s challenging to share compute resources, particularly when providing robust parameter suggestions to a number of simultaneous problems in milliseconds. At a system level, SigOpt is designed to adjust its ensemble of algorithms to deliver useful suggestions to your models in mere milliseconds. We do this by learning how long your model takes to train, and then selecting methods that will generate the best suggestions within that time window. These are details that many other solutions, often built for one-off academic projects, don’t consider and can’t handle.
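To make the idea of choosing a method that fits a time window concrete, here is a minimal sketch. It is not SigOpt's implementation: the method names, latency estimates, quality ranks, and the 1% budget fraction are all hypothetical values chosen for illustration. The sketch simply picks the highest-ranked method whose estimated suggestion latency fits within a small fraction of the model's training time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    est_latency_s: float  # hypothetical time to produce one suggestion
    quality_rank: int     # hypothetical ranking; higher is assumed better


# Illustrative catalog only; these are not SigOpt's actual methods or timings.
METHODS = [
    Method("random_search", 0.001, 1),
    Method("cheap_surrogate", 0.05, 2),
    Method("full_bayesian", 2.0, 3),
]


def pick_method(train_time_s: float, budget_fraction: float = 0.01) -> Method:
    """Return the best-ranked method whose latency fits the time budget.

    The budget is an assumed fraction of the observed training time, so
    generating a suggestion stays cheap relative to evaluating it.
    """
    budget_s = train_time_s * budget_fraction
    affordable = [m for m in METHODS if m.est_latency_s <= budget_s]
    if not affordable:
        # Nothing fits the budget: fall back to the fastest method available.
        return min(METHODS, key=lambda m: m.est_latency_s)
    return max(affordable, key=lambda m: m.quality_rank)


if __name__ == "__main__":
    for seconds in (0.05, 10.0, 3600.0):
        print(seconds, pick_method(seconds).name)
```

Under these assumed numbers, a model that trains in a fraction of a second gets the cheapest method, while a model that trains for an hour can afford the most expensive one.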

Dimensional model scale:

Many optimization strategies are very compute-intensive, and thus perform well only on lower-dimensional problems. Perhaps you only want to tune batch size and learning rate: many tools can help you explore this space. However, many of our customers tune dozens or hundreds of parameters, including continuous, categorical, and discrete, sometimes with constraints, conditional dependence, or specific configurations that result in “failure” for the system. We built our solution and our ensemble to scale to handle many types of low- and high-dimensional problems, meaning you and your data science team won’t need to seek out different optimizers for models of varying complexity.
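As a rough illustration of what a mixed parameter space with a constraint looks like, the sketch below defines continuous, integer, and categorical parameters and draws feasible configurations by rejection sampling. The parameter names, bounds, and the constraint are hypothetical, and plain random sampling stands in for a real optimizer; none of this reflects the SigOpt API.

```python
import random
from typing import Any, Dict

# Hypothetical search space: (type, bounds or choices) per parameter.
SPACE = {
    "learning_rate": ("continuous", 1e-5, 1e-1),
    "num_layers": ("integer", 1, 8),
    "optimizer": ("categorical", ["sgd", "adam", "rmsprop"]),
}


def sample(space: Dict[str, tuple], rng: random.Random) -> Dict[str, Any]:
    """Draw one configuration uniformly at random from the space."""
    config: Dict[str, Any] = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "continuous":
            config[name] = rng.uniform(spec[1], spec[2])
        elif kind == "integer":
            config[name] = rng.randint(spec[1], spec[2])  # inclusive bounds
        elif kind == "categorical":
            config[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"unknown parameter type: {kind}")
    return config


def satisfies_constraints(config: Dict[str, Any]) -> bool:
    """Hypothetical constraint: very deep networks need a small learning rate."""
    return not (config["num_layers"] > 6 and config["learning_rate"] > 0.05)


def sample_feasible(space: Dict[str, tuple], rng: random.Random,
                    max_tries: int = 1000) -> Dict[str, Any]:
    """Rejection-sample until a configuration satisfies the constraint."""
    for _ in range(max_tries):
        config = sample(space, rng)
        if satisfies_constraints(config):
            return config
    raise RuntimeError("no feasible configuration found")


if __name__ == "__main__":
    print(sample_feasible(SPACE, random.Random(0)))
```

Rejection sampling is easy to reason about in three dimensions, but it wastes more and more draws as parameters and constraints are added, which is one reason simple tools struggle on higher-dimensional problems.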

Evaluation scale:

Certain high-value models take days or even weeks to optimize. SigOpt tracks and stores up to 10,000 observations per experiment, and with a 99.9% uptime SLA, we ensure that you can run your experiments for weeks or longer without a hitch. Many optimization solutions can’t handle such long-training models because their methods become prohibitively slow well before the 10,000 observation mark. These methods may also not be as fully featured in terms of metrics and parameter tracking. As customers’ experiments grow in size, they don’t have to manually switch optimization algorithms: our ensemble handles these adjustments for them, automatically.

SigOpt System Diagram

SigOpt’s underlying infrastructure is designed to ensure scalability of multiple varieties.

Parallel compute scale:

Training large models on a single server is often prohibitively slow, and organizations training image classifiers, prediction engines, or recommendation systems may spend days or weeks training a single model. In these scenarios, it’s always productive to maintain downward pressure on training time, which typically encourages data scientists and ML engineers to scale the training process across multiple workers or nodes, often GPU-accelerated, if possible. SigOpt can make up to 100 suggestions asynchronously and in parallel. We’ve adapted our Bayesian optimization algorithms to provide levels of parallelism that were previously only possible with naive methods like random search.

But many model tuning and optimization systems aren’t designed to handle multiple worker nodes that request new sets of parameters simultaneously or asynchronously. SigOpt can provide asynchronous parameter suggestions so that no worker is left idle while waiting for the next suggestion. Our infrastructure scales to meet the needs of your infrastructure as you distribute your modeling workload across more and more machines.
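The asynchronous pattern described here can be sketched with a small thread pool in which each worker asks for its next suggestion the moment it finishes an evaluation, rather than waiting for a whole batch to complete. The `Suggester` class, its `suggest` and `observe` methods, and the toy objective are hypothetical stand-ins that use random search; they are not the SigOpt client or its algorithms.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


class Suggester:
    """Hypothetical thread-safe stand-in for an optimization service."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._lock = threading.Lock()
        self.observations: List[Tuple[Dict[str, float], float]] = []

    def suggest(self) -> Dict[str, float]:
        # Random search stands in for a real Bayesian optimizer here.
        with self._lock:
            return {"x": self._rng.uniform(-5.0, 5.0)}

    def observe(self, params: Dict[str, float], value: float) -> None:
        with self._lock:
            self.observations.append((params, value))


def train_and_score(params: Dict[str, float]) -> float:
    """Toy objective in place of model training; its maximum is 0 at x = 1."""
    return -(params["x"] - 1.0) ** 2


def worker(suggester: Suggester, budget: threading.Semaphore) -> None:
    # Each worker loops independently: no barrier between evaluations,
    # so a fast worker never waits for a slow one.
    while budget.acquire(blocking=False):
        params = suggester.suggest()
        suggester.observe(params, train_and_score(params))


def run(num_workers: int = 4, total_evals: int = 20):
    suggester = Suggester()
    budget = threading.Semaphore(total_evals)  # shared evaluation budget
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(worker, suggester, budget)
                   for _ in range(num_workers)]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return suggester.observations


if __name__ == "__main__":
    results = run()
    print(len(results), max(value for _, value in results))
```

The shared semaphore caps the total number of evaluations, so adding workers shortens wall-clock time without changing the experiment's budget.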

Experiment throughput scale:

Lastly, we’re prepared to handle surges of traffic of up to many thousands of experiments at the same time. We prepare for these surges by establishing elastic capacity on the back end. This helps us scale up our servers and worker infrastructure to handle a high volume of simultaneous requests in a very short window of time. In a follow-up post, we’ll cover how we scale our infrastructure to meet demand, including using a number of novel and patented mechanisms.

Planning for the future

As enterprises consolidate large data lakes and begin to train more and more complex models, business performance starts to hinge on model (and to some extent modeler) performance. In the coming years, it is reasonable to expect that data science teams will grow, and business success will depend on more reliable forecasts, recommendation systems, predictive maintenance, and trading models. As you scope your needs for tracking and enhancing model performance, it’s important to assess whether the solution you’re building will scale to the dimensional complexity, number of simultaneous users, number of observations, and even number of metrics that you’ll require in the future. With half a decade of experience and a cohort of demanding customers, SigOpt can handle all of these challenges that businesses face as they grow their modeling competency.

In our next blog post on scaling, we cover at a technical level how we scaled experiment throughput for one particularly demanding customer. If you’re interested in learning more, check out our YouTube channel, or book a demo with our team.

Jim Blomo Head of Engineering
Patrick Hayes Co-Founder & Chief Technology Officer
Barrett Williams Product Marketing Lead
