Dealing with Troublesome Metrics

Michael McCourt
All Model Types, Modeling Best Practices

At SigOpt, our goal is to help our customers build better models, simulations, and processes by maximizing key metrics for them.  These metrics can come from any number of different fields, including optical character recognitionfinancial modeling and even sports betting.  Each metric may have a different interpretation or a different significance (e.g., financial modeling practitioners may be trying to maximize portfolio value), but all of them are functions, a key mathematical concept which is the focus of this blog post.  We will introduce the concept of functions and discuss what properties make certain functions more dangerous to maximize.

Metrics as Functions

In the simplest sense, a function is a mapping between inputs and outputs: for a given input a function should return a single output.1  In a business context, these functions are a formal representation of the relationship between decisions a business can make (inputs/parameters) and key quantities which may describe the business’s success or failure (outputs/metric values).  I use the term “formal” representation because, unfortunately, these functions exist only in some cryptic (and potentially implicit) sense and the actual structure of the function is obscured by layers of complexity specific to the business operations, like a machine learning model or physical process.  As a result, while the input-output relationship may be well-established (see the figure below) basic mathematical tools are rendered useless by the lack of a clean mathematical formulation (say, e=mc2).  Such a function is often referred to as a black-box function because the inner workings of the function are unknown.

It is likely that you are surrounded by many such metrics on any given day and that you are, perhaps without even realizing it, trying to maximize them to make your life easier (few people try to find the least bang for their buck).  As suggested, this could arise in various circumstances, a number of which we discuss at the end of this post, but we will try to focus our discussion here on a topic of common knowledge: cookies.  Indeed, what factors play into the “quality” of baked cookies, the metric of primary interest here?  We define quality here to be the average results of a customer survey where customers ranked the taste of the cookies on a scale between 0 and 10.  Only four key factors are mentioned here, all of which happen to be continuous: the amount of sugar, the amount of shortening, the temperature at which they are baked and then duration for which they are baked.

Figure 1: Many factors, including sugar, shortening, temperature and time, can play a role in the quality of delicious cookies, but their interaction may be complicated and unintuitive.  We call such a metric, which is dependent on many factors in confusing ways, a black box function.

These black-box functions, which incorporate the complicated aspects of corporate and industrial operations into a single fundamental measurement of success, are the metrics of most relevance.  This leaves experts administering these operations, who have the task of maximizing these metrics, in the difficult position of having to do so without some of the powerful mathematics developed for simpler optimization problems.2  SigOpt was built to help these experts optimize these operations, guiding them to the best metric values with less trial and error.  This is especially important because each measurement of this metric is often costly, either in time (unfortunately, cookies must cool before they can be eaten) or in resources (eggs are not free, and the oven in which the cookies are baked is not cheap either).  This cost implies the need to maximize the function in as few iterations as possible.

Fortunately, strategies exist for maximizing these metrics even under such restrictive circumstances.  At SigOpt, our weapon of choice is sequential model-based optimization(SMBO) which, as the name suggests, requires the construction of a model of the function being maximized.  This model is continuously fit and updated as new outputs of the function (observations of the metric) are generated, thus the inclusion of the word sequential.  The quality of this model is paramount, since a good model will effectively represent the desired metric and suggest likely inputs for which the metric is maximized.

We use Gaussian processes to model customer metrics (other options also exist) which have natural mechanisms for developing a well-fit model, including, but not limited tomaximum likelihood estimation.  Unfortunately, a naively implemented Gaussian process model may be limited by properties of the desired function, leaving users of open source SMBO software with a complicated question: which functions can be effectively modeled, and thus effectively optimized, with this Gaussian process model-based strategy?  The inability to answer this question can stifle the progress of an SMBO algorithm, costing users time and money.

From a purely mathematical standpoint, the correct strategy for identifying functions which are well-modeled with a given Gaussian process involves the reproducing kernel Hilbert space associated with the covariance of that Gaussian process.  For the 90% of readers who do not know what that means, do not worry because the 10% of readers who know what that means know that statement has limited practical relevance.  This blog post will discuss potentially troubling functions (all in one dimension,3 to keep the topic more tangible) and strategies that SigOpt employs to help our customers efficiently maximize their important metrics.

Unscaled Functions

Quick explanation: Functions with values on different scales often may produce models that can only approximate one scale well.

As is the case with all of these properties, this is best described with a picture.  In the figure below the same function is displayed on two different domains, which present a very different perspective.

Figure 2a: A function with a distinct maximum near x=12.98.  left: The function values for 5<x<15, are overwhelmed by the much smaller values for x<5 and x>15.  right: If only the typical values on 5<x<15 are studied, the maximum is easier to visually identify.  The problem on the right is better scaled and, because it may be easier to model, it may be easier to maximize.

The idea behind scale is a measurement of what is considered a “typical,” or perhaps average, the value of the function and a well-scaled function would be a function whose values are all nearby each other.  A function with a range on the same order as its typical value could be considered reasonably scaled.

In the context of optimization, the typical value around which we are interested in creating a model is the maximum of the function.  Therefore, a poorly scaled function would be a function with values many orders of magnitude away from the optimum.

Consider a cookie example that might look like this: perhaps x could be a measurement of time spent baking, and y could be a measurement of how much customers like the resulting cookies.  As the graph implies, baking for too short or long (less than 5 minutes or more than 15 minutes) yields a swift decline in perceived quality, which anyone who has ever made cookies would know.  But for bakers who want to maximize their cookie quality, finding the optimal baking time can be the difference between opening a kiosk at the mall and competing with Mrs. Fields.

There is nothing fundamentally different between modeling either of the functions presented in the figure above, but there is a potential downside to modeling a function that varies over such a large range in the metric, as demonstrated in the figure below.

Figure 2b: Two models (in blue) fit to data (in orange) sampled from the functions in Figure 1a. left: Data sampled at 32 points4 over 5<x<15 creates a model which acts erratically within the domain of interest; only the region of 5<x<15 has been plotted. right: A model is created using the same data within 5<x<15 (without any of the values at a different scale) and more closely models the true function of interest.

Both models in the figure above pass through the same observed data, but the model on the left has the added pressure of needing to match values in the (ungraphed) regions of x<5 or x>15.  This pressure spawns the erratic behavior in the model on the left, and can cause that model to perform worse than the model on the right.

Main takeaway about reasonably scaled functions: Choosing a function whose typical value is on the same order of magnitude as the entire range can produce a better model.

How SigOpt manages this for you: At SigOpt, we understand that some of the most interesting metrics are also those which exist on multiple different scales, which is why we have multiple strategies to help facilitate efficient modeling of even the worst scaled functions.  At the simplest level we always apply some form of invertible transformation to the observed data to help guarantee that a wildly large range will not dominate the computation (including so-called warped Gaussian processes).  There are also decisions we actively make to choose an appropriate covariance kernel for these unscaled situations.  We are also developing newer strategies for modeling data that quickly shifts between scales, including the adaptation of differential equation solvers for shocks.

Well Chosen Domain

Quick explanation: Prior to optimization, a domain of interest must be specified; if that region is too large, modeling may be inefficient.

For a specific amount of shortening, certain sugar amounts would be too great to produce viable cookies – including those large values in the optimization problem would make the function harder to model.  That philosophy is what motivates this idea of a well chosen domain, and the distinction between a well-chosen domain and a poorly chosen domain is on display in the figure below.

Figure 3a: The same function plotted over two domains.  left: The function appears rather boring in most of the domain, with the region of interest occupying only a small portion.  right: By choosing a better (smaller) domain, the action of the function is more easily modeled because less energy is expended in the boring region.

Visual inspection of the different domains in the figure above would present a cognizant user with the same conclusion: there is a maximum somewhere near x=.85 cups of sugar.  But, when constructing a model on which to optimize, only limited samples of the function are available.  Thus, using some of those samples in a region nowhere near the minimum is wasteful and can slow the development of the model and, in turn, the progress of the optimization.  This is demonstrated in the figure below.

Figure 3b: The evolution of the models (in blue) as increasingly many points (in orange) are sampled.  Left: When samples from the boring region dominate the model, it takes a longer time to resolve the interesting region.  Right: When the domain is well focused on the interesting region the model requires fewer points to converge on the desired behavior.

Clearly, the model with the well-chosen domain is able to more quickly act like the true function, whereas the model with the poorly chosen domain5 is still evolving even with the same number of function evaluations.  Since function evaluations are assumed to be expensive (either in the form of time or money/resources), being able to generate a better model with fewer points is beneficial.

Main takeaway about well-chosen domains: Better identifying the location of the minimum may yield a better model with less (costly) function evaluations.

How SigOpt manages this for you: At SigOpt, we want customers to focus on developing the best models and to leave the model tuning to us.  Given the data that customers provide, we have implemented several strategies to identify appropriate sampling strategies and avoid the wasted observations shown on the left in Figure 3b.  Our optimization routines often begin with a low discrepancy phase, providing a mathematically optimal mechanism for preparing the underlying model.  We also have researched strategies for dividing up the region of interest, so as to fit different Gaussian processes in different regions: these strategies including treed Gaussian processes and local interpolation.  Furthermore, our models propagate the optimization by studying a modified form of expected improvement, which balances the need to fit an effective model (exploration) against the need to find the maximum in as few function evaluations as possible (exploitation).

Continuous Functions

Quick explanation: Functions that are disconnected can behave unpredictably and are more difficult to model.

Continuity is an important topic which can, at times, be tricky.  In the simplest sense, a continuous function is a function which is connected, and, thus, small changes in inputs yield small changes in outputs.  Many functions of interest are continuous, but some important metrics are not continuous, and a naive implementation of Gaussian processes may have difficulty in that setting.  An example of this difficulty is displayed in the figure below.

Figure 4: We attempt to model a discontinuous function. left: The function, which seems harmless and has a clear maximum at x=400 with discontinuities at x=350 and x=450. right: Our model (in blue) of this reasonable function, which predicts minima near x=350 and x=450, even as increasingly much data (in orange) is provided to the model.

In the context of baking cookies, suppose the baking time is already fixed and the optimal temperature (the x-axis in the figure above, in Fahrenheit) is under analysis.  If the temperature is too low, the cookies fail to bake and the chef disposes of the cookies.  If the temperature is too high, the cookies are burnt to a crisp and are, again, disposed of.  In both circumstances, the pastry chef perceives the customer preference as the worst value possible which could produce a discontinuity when actual customer preferences are recorded; after all, some of us really enjoy the uncooked dough as much as the baked cookies.

As is painfully obvious, this disconnect between the true customer preferences and the censoring imposed by the pastry chef may yield a model that predicts specious behavior around x=350 and x=450.  Historically, this has been referred to as the Gibbs phenomenon and can not be avoided for this particular choice of Gaussian process model.6  In a sense, the model is trying to “build up energy” to jump from one branch of the function to the other branch, which results in these undesired oscillations near the discontinuities.

Main takeaway about discontinuous functions: Creating a model for discontinuous data requires careful work to make sure that unphysical behavior does not dominate the predictions.

How SigOpt manages this for you: This situation has troubled researchers since the 1800s, and among possible resolutions, the simplest is to create a model using a more suitable covariance kernel. Unfortunately, models which work well for Figure 4 may perform worse on functions that have no discontinuities and are actually smooth, such as the function from Figure 3a.  Effectively conducting this tradeoff is a key part of the work we do at SigOpt to better model our customer’s metrics.  We also consider nontraditional Gaussian process strategies involving local interpolation (discussed here, originally, in the context of preconditioning) and the adaptation of methods primarily designed for the solution of partial differential equations (including the partition of unity methods and essentially non-oscillatory methods).

Functions Are All Around Us

Although all of the examples discussed here involve baking cookies, the mathematics applies to a broad range of situations which can be described with functions.  The efficiency of an assembly line could be a function of the number of operators, the number of supervisors, the ratio of work time to break time, the temperature, and other such factors.  The accuracy of a support vector machine may be a function of the choice of feature map, associated hyperparameters, the box constraint, and the choice of the loss function.  The profit of a financial asset trading strategy could be a function of the desired risk, perceived volatility, duration of past data influence and frequency of model refitting.  But in all of these cases, the quality of an SMBO algorithm is limited by the ability of the model to resemble the function of interest.

This blog post provides some insight into common issues within the function approximation/modeling community and advice that we would give practitioners who wanted to implement an open source or internal SMBO solution.  At SigOpt, our goal is to provide users a simple framework through which to maximize their important metrics, whether they be as industrial as the design of an airplane or as satisfying as the tastiness of cookies.  We do not expect users to conduct complicated manipulations on their metrics to avoid the problems described above; we would rather users spend their time using their expert knowledge to develop more interesting and valuable models.  Thus, our best advice is to sign up for SigOpt today to start optimizing!

Footnotes

1. There are also multivalued functions where a single input maps to multiple outputs.  They arise naturally when dealing with complex variables and may occur in some applications, but are not something we consider here. Return
2. The standard reference for numerical optimization would be the classic text by Dennis and Schnabel.  That book deals with standard mathematical strategies for numerical optimization, and not the model-based optimization strategy we employ at SigOpt, but some concepts, including those in chapter 7, are relevant in any optimization setting. Return
3. The examples presented in this post all describe a four dimensional problem (where sugar, shortening, temperature and time are turned into cookies of varying quality) in terms of a single dimension.  This analysis of a single dimension, with all others fixed, is done to make easily interpretable graphs, but is not, generally, an effective strategy for conducting optimization in high dimensional spaces. Return
4. In this particular example we have used equally spaced points to create our model, but this decision is only for demonstrative purposes.  During sequential optimization, these points would almost certainly be sampled according to some acquisition function.  Furthermore, if the goal were only to create a good model, statistics would suggest an optimal design while approximation theory might suggest using the Chebyshev-Gauss-Lobatto points. Return
5. This larger domain with a boring region serves a useful purpose in demonstrating the value of identifying a domain which tightly bounds the optimum.  The idea that there may be different regions of activity for a function (as opposed to just boring and interesting) is indicative of something larger, however.  This property might be referred to as nonstationarity in the spatial statistics or machine learning communities, and has been addressed (albeit somewhat anonymously) in the numerical analysis community with local interpolation and multilevel methods. Return
6. This model was designed to accentuate this undesired behavior, to demonstrate a severe situation that could arise when someone new to SMBO first experiments.  As suggested above, a different choice of covariance kernel (in particular, one with less smoothness) may not have such large undesired oscillations.  Of course, knowing that a priori would require extra analysis on the function of interest prior to optimization. Return
MichaelMcCourt
Michael McCourt Research Engineer