
Intuition Behind Covariance Kernels

Michael McCourt

In a previous post on SigOpt Fundamentals, we introduced the concept of Gaussian processes; these are one of the tools that SigOpt uses to help companies optimally design their products, such as airplane components, traffic routing strategies or synthetic ingredients for traditional oriental remedies. Gaussian processes are powerful because they allow you to exploit previous observations about a system to make informed, and provably optimal, predictions about unobserved behavior. They do this by defining an expected relationship between all possible situations; this relationship is called the covariance and is the topic of this post.

Before we talk about covariance, we should take a moment and think about variance, a simple word with shockingly important implications. Colloquially speaking, if we say something has variance we mean that it can vary unexpectedly; things that are invariant change only in a predictable fashion. Because I have a steady job at SigOpt, for any single month, my income has no variance.[1] My bank account, on the other hand, has variance because I may or may not splurge on a nice sweater. Things that are random have variance. The figure below depicts how the temperature at any moment is random, whereas the current day of the week is not.

Figure 1: The temperature at any moment varies unpredictably, whereas the day of the week is predictable and not random. Data courtesy of Weather Underground.

The concept of covariance is similar but pushed to the next level. If two quantities have some covariance, then a change in one tends to accompany a change in the other. It seems safe to say that more traffic on the freeway implies a longer commute time: these quantities have positive covariance. The time spent on my phone and its remaining battery life have negative covariance because the more I use my phone, the less battery remains. Two quantities, such as my phone battery life and… the average temperature on Venus, have zero covariance because knowing one does not imply anything about the other.[2]
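To make the three cases concrete, here is a minimal sketch that estimates covariance from samples. All of the data is synthetic and the numbers (commute times, battery drain, Venus temperatures) are invented purely for illustration; `sample_cov` is a hypothetical helper name. Note that the estimate for two unrelated quantities comes out near zero rather than exactly zero, because it is computed from a finite sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic, made-up data for illustration only.
traffic = rng.uniform(0.0, 1.0, size=n)                           # congestion, 0 to 1
commute = 20.0 + 30.0 * traffic + rng.normal(0.0, 2.0, size=n)    # minutes
phone_hours = rng.uniform(0.0, 8.0, size=n)                       # hours of use
battery = 100.0 - 10.0 * phone_hours + rng.normal(0.0, 5.0, size=n)  # percent left
venus = rng.normal(464.0, 1.0, size=n)                            # unrelated quantity


def sample_cov(a, b):
    """Sample covariance: average product of deviations from the means."""
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)


print("traffic vs. commute:", sample_cov(traffic, commute))      # positive
print("phone use vs. battery:", sample_cov(phone_hours, battery))  # negative
# Correlation rescales covariance to [-1, 1], which makes "near zero" easy to read.
print("battery vs. Venus (correlation):", np.corrcoef(battery, venus)[0, 1])
```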

So, the question on your mind must be: What does this have to do with Gaussian processes and SigOpt? Defining a Gaussian process also requires defining a specific mathematical function called the covariance kernel. This wonderful function, which facilitates all the great results we expect from Gaussian processes, is usually a relatively simple function that computes the covariance between values at any two locations.

The covariance kernel encapsulates how an observed value at one location can be used to predict outcomes at other locations. If the kernel says that two locations have high covariance, then a good observation at one location implies that a good observation at the other location is likely,[3] however you may define “good.” Conversely, when two locations have covariance near 0, information at one location provides no predictive insight into the other location.

Figure 2: Suppose that we have some observed results at the location of the dashed line. The covariance kernel centered at that dashed line tells us how much we can infer from that observation at other desired, but as yet unobserved, locations.

The figure above depicts how the covariance kernel helps us make good predictions given information that we have observed. Covariance kernels have a high value in the neighborhood immediately surrounding an observed location; this implies that future observations in that neighborhood should have a similar value. That concept is called continuity, and without it we can say very little about the situation we hope to study. Fortunately, many situations are continuous: if we measure the temperature at two locations a millimeter apart we would expect them to be closely related.[4]
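As a small illustration of a kernel that is high near an observed location and small far from it, here is a sketch using the Gaussian (squared exponential) kernel. The post does not say which kernel is plotted in the figure, so this choice, the function name `gaussian_kernel`, and the `length_scale` parameter are assumptions made for the example.

```python
import numpy as np


def gaussian_kernel(x, z, length_scale=1.0):
    """Covariance between the values at locations x and z.

    Equals 1 when x == z and decays toward 0 as the locations move apart.
    """
    return np.exp(-0.5 * ((x - z) / length_scale) ** 2)


observed = 0.0  # location of the observation (the dashed line)
for other in (0.0, 0.1, 1.0, 5.0):
    value = gaussian_kernel(observed, other)
    print(f"k({observed}, {other}) = {value:.6f}")
```

Locations a short distance from the observation get a covariance close to 1, while the location five length scales away gets a covariance that is practically 0.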

As we try to predict the world far from an observation, we have a harder time making good predictions. This makes sense in the physical world (try using the temperature here to predict the temperature on Venus) and is represented in the covariance kernel by very small, or zero, values far away from an observed location. This also allows observations near the desired prediction location to influence the outcome more heavily than faraway observations. Considering all our observations simultaneously, along with the proper combination of covariance kernel values, allows us to make predictions at any location, as in the figure below.[5]

Figure 3: The covariance kernel describes the way that observed outcomes can be used to best predict future outcomes.

This proper combination, which gives the best prediction at an unobserved location, is a weighted average of the covariance values between that location and each observed location. The weights are determined with the help of the observed values; the derivation is a bit involved, but can be found here or here, or in a future post of ours. It may help to think of it as putting together a puzzle where you know that certain pieces belong in certain locations (see the figure below).

Figure 4: A covariance kernel gives you a mechanism to take the red puzzle pieces that you know (previous observations, left image) and predict the missing pieces (future outcomes, right image).[6]
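The weighted combination described above can be sketched in a few lines. This is the standard zero-mean Gaussian process prediction, not necessarily what SigOpt runs in production: the weights solve a linear system built from the kernel values between observed locations, and the prediction is the weighted sum of kernel values between the new location and each observation. The Gaussian kernel, the names `gaussian_kernel_matrix` and `gp_mean`, the `jitter` term, and the sample data are all assumptions for illustration.

```python
import numpy as np


def gaussian_kernel_matrix(x, z, length_scale=1.0):
    """Matrix of kernel values k(x_i, z_j) for 1D location arrays x and z."""
    x = np.asarray(x, dtype=float)[:, None]
    z = np.asarray(z, dtype=float)[None, :]
    return np.exp(-0.5 * ((x - z) / length_scale) ** 2)


def gp_mean(x_obs, y_obs, x_new, length_scale=1.0, jitter=1e-10):
    """Predict at x_new as a weighted sum of kernel values to the observations."""
    K = gaussian_kernel_matrix(x_obs, x_obs, length_scale)
    # The weights depend on the observed values; jitter only aids numerical stability.
    weights = np.linalg.solve(K + jitter * np.eye(len(x_obs)), y_obs)
    k_new = gaussian_kernel_matrix(x_new, x_obs, length_scale)
    return k_new @ weights


x_obs = np.array([0.0, 1.0, 2.5, 4.0])   # observed locations (made up)
y_obs = np.sin(x_obs)                     # observed values (made up)
x_new = np.array([0.5, 2.0, 100.0])       # locations we want to predict

print(gp_mean(x_obs, y_obs, x_new))
```

At the observed locations the prediction reproduces the observed values, and at the location far from every observation it falls back to zero, the assumed prior mean, because every kernel value there is essentially zero.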

Of course, with great power comes great responsibility, and those who use Gaussian processes are aware of the potential for lousy results when using an inappropriate covariance kernel. If the kernel decays too quickly, you lose any ability to make effective predictions except in the immediate neighborhood of your observations. If the kernel decays too slowly, you are forced to know what is happening at locations too far away: again, imagine having to predict the weather on Venus based on the weather in Hong Kong. How then can we choose the right kernel?
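A quick numerical sketch of the two failure modes, again assuming a Gaussian kernel whose rate of decay is controlled by a `length_scale` parameter (the kernel, the helper name `kernel_at_distance`, and the specific distances are illustrative assumptions):

```python
import numpy as np


def kernel_at_distance(distance, length_scale):
    """Gaussian kernel value for two locations separated by `distance`."""
    return np.exp(-0.5 * (distance / length_scale) ** 2)


near, far = 0.5, 100.0  # a nearby location and a very distant one
for length_scale in (0.05, 1.0, 50.0):
    print(
        f"length_scale={length_scale:>5}: "
        f"k(near)={kernel_at_distance(near, length_scale):.3e}, "
        f"k(far)={kernel_at_distance(far, length_scale):.3e}"
    )
```

With the smallest length scale, even the nearby location has essentially zero covariance with the observation, so the observation tells us nothing there. With the largest, the very distant location still has noticeable covariance, so the prediction there is tied to an observation that may have little to do with it.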

Fortunately, brilliant mathematicians and statisticians have devised strategies for making a smart decision, from maximum likelihood estimation to cross-validation and other methods discussed in my recent book. Understanding these gives SigOpt the ability to leverage the existing observations to their full potential and minimize the need for additional experimentation. The theory introduced in this and subsequent blog posts, combined with our state-of-the-art computational tools, allows our customers to experiment as efficiently as possible. Don’t be left in the dark… sign up for SigOpt today to start putting your puzzle together! Also, stay tuned for a future post where we talk about the valuable properties of kernels such as those in the figure below.
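Of the strategies named above, maximum likelihood estimation is the simplest to sketch: score each candidate kernel by how probable it makes the data already observed, and keep the best one. The example below does this with a grid search over the length scale of a Gaussian kernel for a zero-mean Gaussian process. The kernel, the grid, the `jitter` value, the synthetic data, and the function names are assumptions for illustration, not a description of SigOpt's implementation.

```python
import numpy as np


def kernel_matrix(x, length_scale):
    """Gaussian kernel values between all pairs of 1D locations in x."""
    diff = x[:, None] - x[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def log_marginal_likelihood(x, y, length_scale, jitter=1e-6):
    """Log probability of y under a zero-mean Gaussian process with this kernel."""
    n = len(x)
    K = kernel_matrix(x, length_scale) + jitter * np.eye(n)
    L = np.linalg.cholesky(K)                              # K = L @ L.T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # K^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)


x = np.linspace(0.0, 5.0, 12)   # observed locations (made up)
y = np.sin(x)                   # observed values (made up)

candidates = np.logspace(-2, 1, 31)   # length scales from 0.01 to 10
scores = np.array([log_marginal_likelihood(x, y, ell) for ell in candidates])
best = candidates[np.argmax(scores)]
print("length scale with the highest likelihood:", best)
```

In practice a continuous optimizer would usually replace the grid, and cross-validation offers an alternative score, but the idea of letting the existing observations choose the kernel is the same.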

Figure 5: My favorite part of this post: fun with kernels! Future posts will discuss cool properties of kernels, such as tensor products, compact support, boundary conditions and more.

Footnotes

1. For any month, my income is constant, but my income grows as time passes. This predictable change is not random and may be called a trend. We will not discuss random processes in time here.
2. Okay... maybe one could argue that somehow the average temperature on Venus in some way reflects the density of sunspots, which in turn implies something about the efficiency of the mobile phone satellite network, which would shut down my phone and boost the battery life. Or maybe that would cause my phone to constantly search for a fleeting signal and more quickly drain the battery life. Bottom line, I’m a mathematician, not an astrophysicist or telecommunications engineer, so bear with me.
3. This statement is phrased in a probabilistic sense: locations with high covariance are expected to act similarly, locations with very negative covariance are expected to trend in opposite directions, and locations with near-zero covariance are expected to act independently of one another. Given that the whole situation is random, covariance is more of a governing force than a dictator.
4. It is possible to make predictions for data with a small number of discontinuities, though this problem is much better handled in the math world than the statistics world. It will be a while before we discuss that in this blog, so interested parties can look at, e.g., WENO methods, described here or here for differential equations but applicable to discontinuous approximation.
5. The accuracy of predictions is a complicated topic for a number of reasons that we will discuss in a future blog post. Here we mention only that predictions very far from all observations should be trusted very little.
6. This image is an altered version of an image found here.
Michael McCourt Research Engineer