Scalability is one of the main factors to consider when deciding whether an optimization solution will work for you and your organization. Modeling and simulation workloads are constantly evolving, and the quantity of resources dedicated to training, tuning, and testing ML models is continuously growing.
In part 1 of this blog series, we reviewed the different types of scale that our customers require of us, along with some of the considerations you might make as you build out your data science team and grow your experimentation workload.
Today, we’ll cover the specific ways in which we bolstered our infrastructure to handle large simultaneous request volume from a couple of our more demanding customers—at a level of technical detail suited for industry practitioners and technical managers.
Scaling compute infrastructure
When it comes to model optimization, no two customers make identical demands on our infrastructure. Adjusting our capacity to handle 10x the number of experiments or observations is usually trivial, but some customers come to us with a full thousand-fold difference.
This dramatic change in the request load on our APIs and workers exposes underlying assumptions and inefficiencies that do not present themselves at 10x.
The first challenge we faced while simulating a 1000x workload was the need to both orchestrate and spin up so many new compute instances at once. Autoscaling didn’t integrate smoothly with the deployment systems we had previously designed, and it proved more difficult and error-prone than our SLA budget or our general comfort levels could tolerate. Horizontal scaling seemed like a quick-and-easy solution, but spinning up more than 100 instances at a time resulted in immediate issues, such as SSH connection limits that caused deploy failures. These errors were difficult to debug and took time to recover from. We then decided to scale vertically: instead of a 4-core machine running 4 separate processes, why not have a 32-core machine running 32 processes?
The first problem we encountered while scaling vertically was third-party rate-limiting—having hundreds of new processes simultaneously register with outside applications resulted in throttling. Upon failure, our scripts would immediately restart and hit the throttled service again, creating an infinite loop (sometimes known as a “thundering herd”) that kept us locked out indefinitely. To alleviate this, we implemented an exponential backoff with jitter whenever we encountered a rate-limiting failure. While this means some processes might take a bit longer to successfully start, it spaces them out far enough to relieve the throttling.
Since there are limits and risks to strictly vertical scaling, we opted for a combination of horizontal and vertical, spinning up multiple instances with more cores and more memory on each instance.
The next problem we encountered was more insidious. Doubling instance size did not double how many requests we could successfully respond to. In fact, some of the largest instances performed worse than the same number of the smaller instances! Digging into the depths of our codebase, we determined that each of our backend processes was using as many threads as possible. Normally this makes programs faster—one process uses many threads across multiple cores while another process waits on disk reads or network calls. But multithreading is actually a detriment when all processes are doing simultaneous intensive computation. They fight over limited shared resources, context switch exponentially and have so much communication overhead that they perform worse than single-threaded versions. To ensure we were getting efficient usage of our compute capacity, we systematically evaluated the algorithmic performance across many variations of instance size, number of processes and amount of multithreading, settling on the optimal configuration for our workload.
Getting the balance just right between horizontal and vertical scaling is a challenge for computationally taxing algorithms such as those that form our ensemble behind the SigOpt API. Accordingly, testing is invaluable to benchmark the relative benefits and challenges of these two types of adjustments to infrastructure capacity.
Tuning our databases to keep up with the rest of our infrastructure
While throwing more computational resources at the problem is one way to scale, it’s important to look for bottlenecks and sources of inefficiency elsewhere—we found a few in certain common database queries. After running a few simulations, we noticed three distinct culprits that were slowing our database to a crawl: linear counting, frequent updates, and missing indexes.
Missing indexes are the most straightforward issue to solve. Monitoring slow queries will typically show patterns in which a seemingly simple query starts taking longer, the longer the simulation continues to run. In our cases, query time was increasing linearly with table size, which indicated we were walking through the entire table to find our results. A quick analysis of these types of queries showed which columns should have an index. Adding an index immediately cuts down on query time, and it grows much more slowly (logarithmically) over time, eliminating the overload.
Another, more complicated slow-down we encountered was frequent updates to the same database row. Unlike read-only queries, which can run in parallel and have special optimizations to ignore race conditions, each time one process wants to update a row, it needs a lock so that no other process gets in the way. When there are 10s or 100s of processes all attempting to update the exact same row, they have to execute sequentially, and the overhead from acquiring and releasing the lock adds up. Our solution here was to step back and look at the system as a whole. The field we were trying to update was not critical information—it was internal bookkeeping we were using to monitor the health of another internal system. Thus, we were able to refactor this out of our primary database and into a separate caching system that could much more easily handle the frequent updates, substantially reducing our database load.
A different problem with a similar solution was a slow aggregation query. The query was read-only but required multiple joins to group and filter the results. We already had indexes on the relevant database columns and could not find any clear ways to improve the efficiency of the query. Stepping back again, we asked ourselves what the result was being used for and could we calculate it another way. Our aggregation was yielding a count of customer objects, which we were using for some internal thresholding. Here was another great use case for our separate caching system—increment a counter each time a customer creates a new object, rather than joining, aggregating, and counting all customer objects each time they created an object. This slashed our query time to single-digit milliseconds, a huge improvement over the linear-scaling query that was taking multiple seconds during our simulation tests!
It’s important to not just look out for resource constraints, but also to hunt down inefficient database queries and updates. Updates that access the same field, in particular, can cause long queues and introduce exponential latency. There are multiple ways to improve these types of bottlenecks: making queries more efficient with an index, moving updates to faster, more volatile systems, and caching results to save time on subsequent updates. The optimal solution will depend on the data involved and where it gets used.
Improving practices and procedures for infrastructure reliability
Another aspect of system scaling that might otherwise go overlooked is the need to monitor, diagnose, and maintain system health. Preparing tools and procedures that site reliability engineers can use during on-call incidents and/or periods of high load is extremely important to ensure quick recovery from failure and ongoing system stability.
The first step in discovering which parts of the system need improvement in order to function at scale is to run simulations that mimic high load situations and determine which features create bottlenecks. Although this may seem simple, preparing your test environment to run these simulations requires some setup overhead. Additionally, monitoring incremental improvements made throughout the scaling process requires your testing procedures to be repeatable so that comparisons can be made across separate tests. Automating the simulation procedure, including adding and configuring required infrastructure with secure access control, updating monitoring dashboards and alerts, as well as kicking off the simulation itself are some examples of areas that we focused on in order to ensure that we could measure our progress across multiple simulations. As an added bonus, we were able to reuse some of our work to create an organized monitoring platform in our production environment, taking some of the responsibility of configuring our monitoring system off of SREs.
Another crucial system that helps you correctly diagnose errors is a reliable logging system that is easy to navigate. One difficulty that many teams face is not being able to find information needed to diagnose issues, or spending too much time trying to do so. Poorly configured logging schema and logging bloat can make it difficult to search through logs and find specific pieces of information. Additionally, logging bloat can become very expensive as systems scale, as storage costs increase along with log volume. As a part of this scaling project, we worked on reducing our log volume by establishing which information was necessary to diagnose errors, then omitting unnecessary information from our logs.
Building efficient SRE (site reliability engineer) teams require sufficient and regular training and documentation. SREs should be aware of and know how to use the tools they need to respond to oncall events. During our scaling work, we improved on one of our SRE tools that we use to dynamically reconfigure services without downtime. We added descriptions to each configuration, as well as extra configurations that would be useful during periods of higher load. In the past, it was difficult to make use of these configurations because understanding their specific use cases often required engineers to have historical context from when the configurations were put into place. This made it hard for newer engineers to join the on-call rotation, and was not a scalable way to maintain SRE procedures.
A large part of creating sophisticated distributed systems is developing the tools and procedures needed to monitor health and diagnose problems. Supporting SRE engineers with the tools they need to respond to on-call events will make for an efficient and reliable team that can maintain system health at scale!
Building out a scalable optimization API and platform, especially when you serve customers as demanding as ours, is a remarkable challenge. As any business grows, it will encounter engineering and infrastructure challenges presented by new customers, or customers who make novel demands. Today, we described some of the backend changes we had to make to serve a couple especially demanding customers. We hope you can apply some of these lessons to your business, and perhaps also relate to the idea that scaling machine learning infrastructure can prove challenging. That said, our engineers, SREs, and on-call team are up to the task, and ready for new problems to solve!