Likelihood for Gaussian Processes

Michael McCourt

If you are already well-versed in the likelihood for Gaussian processes, we recommend that you read our in-depth post instead.

Companies that use SigOpt seek to optimize a variety of metrics, from the accuracy of machine learning models to the quality of physical products. One of our strategies to efficiently conduct this optimization is to approximate the behavior of the metric using reported observations from our customers. That approximation, which we often define using Gaussian processes, can be used to identify subsequent parameters (think font size or advertising shape) which are likely to provide good metric values. Those new experiments, in turn, produce more information which can be used to build a better approximation. Repeating this process (shown graphically in the figure below) will eventually expose the optimal behavior for the user.

Figure 1: The SigOpt workflow. Existing results reported by users are analyzed and approximations are generated. These approximations give rise to suggested new parameters for users to consider. The best of those suggestions is tested by the user, which produces new observed results. As the cycle is perpetuated, the optimal parameter choices and company performance are uncovered. This post focuses on the Approximate component.

At SigOpt, our ability to make these suggestions depends on our ability to use tools such as Gaussian processes to effectively approximate our users’ metrics. As mentioned in an earlier post on the approximation of data, there are infinitely many ways to use observed data to make predictions about unobserved values. Using the best approximation gives our customers the fastest path to optimal behavior, which minimizes the cost of experimentation. Of course, this leads to the question “What is the best approximation?”, which is complicated and has no perfect answer, even when considering only Gaussian processes. One useful strategy is to choose an approximation that maximizes the likelihood.

Likelihood is, of course, a very generic term that people use every day, particularly people who visit casinos. Webster’s dictionary defines likelihood as “the chance that something will happen,” which is a succinct and often appropriate definition; try to keep that definition in mind throughout this post. Our goal is to determine which approximation, among the infinitely many, is the best fit for our customers’ data, because the best approximation helps us most efficiently expose their optimal behavior. One popular mechanism for Gaussian processes involves choosing the “most likely” approximation: the approximation which is most likely to have generated the data that was observed. This process is called maximum likelihood estimation.

Before we discuss likelihoods for Gaussian processes, we should consider a simpler situation: the flipping of a coin. A coin flip could be predicted deterministically [1] with factors such as air temperature, thumb intensity, initial height and others built into a complicated physical model. Usually, however, people approximate a coin flip as a random event with only one factor: the chance of coming up heads. We can use maximum likelihood estimation to analyze data and produce our best approximation of the true coin flip.

Likelihood allows us to compare different versions of the world and determine which is more likely.

Suppose you have a coin that you flip 6 times, and over the course of those 6 flips the coin comes up heads 4 times (the next figure depicts this). You may be inclined to approximate the coin as having a 66.7% chance of coming up heads. If that was your guess, then well done; that is, indeed, the maximum likelihood estimate as the figure below suggests.

Figure 2: On the left, 6 observed coin flips, 4 of which came heads. On the right, a graph which shows the likelihood of certain coins having generated those observations.

We should spend some time analyzing this situation because these insights are valuable in a broad context. The first thing to notice in this graph is that it is impossible (zero likelihood) for the coin to always come up heads or always come up tails: both heads and tails have been observed, so neither can happen every time. Next, we see that there is a clear maximum of this graph, which corresponds to a 4/6 = 66.7% chance of heads. That value is called the maximum likelihood estimate, and it describes the coin which is most likely to have generated the observed results.
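The curve described above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the flips are independent so that the likelihood of one specific sequence with 4 heads and 2 tails is p^4 (1 - p)^2; the function name `coin_likelihood` and the grid of 601 candidate coins are illustrative choices, not something taken from the original figure.

```python
import numpy as np

def coin_likelihood(p, heads=4, tails=2):
    """Likelihood that a coin with heads-probability p produced one specific
    sequence containing `heads` heads and `tails` tails (independent flips)."""
    p = np.asarray(p, dtype=float)
    return p**heads * (1.0 - p)**tails

# Candidate coins: heads probabilities from 0 to 1 (grid size is an assumption).
candidates = np.linspace(0.0, 1.0, 601)
likelihoods = coin_likelihood(candidates)

# The candidate with the largest likelihood is the maximum likelihood estimate.
best = candidates[np.argmax(likelihoods)]

print(f"likelihood of an always-tails coin (p=0): {likelihoods[0]}")
print(f"likelihood of an always-heads coin (p=1): {likelihoods[-1]}")
print(f"most likely coin: p = {best:.3f}")
```

A grid search is used here only because it mirrors the graph; for this simple problem the maximizer can also be found by hand as heads / (heads + tails).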

The final point to observe from the figure above is that, although 66.7% is the most likely chance of heads, any value with a nonzero likelihood is possible. Thus, it is possible that the coin is a fair coin (50% chance of coming up heads) even though we did not observe 3 heads and 3 tails in our 6 flips. The likelihood values should be thought of in a relative, not absolute, context: a coin with a higher likelihood value is more likely to have generated the results than a coin with a lower one. It would take infinitely many observations to be able to make an absolute statement about the coin.
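To make the relative reading concrete, the sketch below compares a fair coin against the most likely coin. For the 6 observed flips the ratio is 729/1024, roughly 0.71, so the fair coin is far from ruled out. The larger experiments in the loop (60 and 600 flips with the same 2:1 proportion of heads) are hypothetical and are included only to show how more observations sharpen the comparison; the helper names are illustrative.

```python
def coin_likelihood(p, heads, tails):
    """Likelihood of one specific sequence with the given counts (independent flips)."""
    return p**heads * (1.0 - p)**tails

def relative_likelihood(p, heads, tails):
    """Likelihood of coin p divided by the likelihood of the most likely coin."""
    p_best = heads / (heads + tails)
    return coin_likelihood(p, heads, tails) / coin_likelihood(p_best, heads, tails)

# Same proportion of heads, increasingly many (hypothetical) flips.
for scale in (1, 10, 100):
    heads, tails = 4 * scale, 2 * scale
    ratio = relative_likelihood(0.5, heads, tails)
    print(f"{heads + tails:4d} flips, {heads:3d} heads: "
          f"fair coin has {ratio:.3g} times the likelihood of the best coin")
```

The ratio shrinks toward zero as the number of flips grows but never reaches it for any finite experiment, which matches the point that an absolute statement would require infinitely many observations.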

How is this relevant to problems involving Gaussian processes? When approximating a coin flip as a random event with some chance of coming up heads, users introduce a hyperparameter: the chance of coming up heads [2]. For the approximation of the coin to accurately represent the true coin, they must ensure that this hyperparameter is chosen correctly. Were our approximate coin to flip heads 90% of the time, it would do a very poor job of helping us predict how a fair coin (50% heads) would perform in, say, a casino simulation. The same logic applies to Gaussian processes: they can provide an outstanding [3] approximation and can make outstanding predictions, but only if they have well-chosen hyperparameter(s). This is demonstrated in the figure below.

Figure 3: In the bottom left, there is a graph depicting the likelihood associated with various hyperparameter choices. Each of the approximations is associated with a specific hyperparameter value, but there is only one representing the maximum likelihood estimate: the light blue graph.

This figure shows that the same principles that applied to the coin flipping problem (observed data suggests a “best” approximation that was most likely to have generated that data) can be used to find a good Gaussian process with which to conduct predictions. The light blue approximation has the hyperparameter that maximizes the likelihood while the green and dark blue approximations have lower likelihood. Does that mean that green and dark blue cannot be good approximations? No … it simply means that they are less likely given the data that we observed (the black circles). That is the goal of studying likelihood: choosing an approximation that is best supported by the data.
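The same grid search can be sketched for a Gaussian process. The code below is not SigOpt's implementation; it is a minimal illustration under several assumptions: a zero-mean Gaussian process, a squared-exponential (Gaussian) kernel whose single hyperparameter is a length scale, a small fixed noise term added for numerical stability, and made-up data standing in for the black circles. It uses the standard log marginal likelihood of a zero-mean Gaussian process, -0.5 y^T K^{-1} y - 0.5 log det(K) - 0.5 n log(2 pi).

```python
import numpy as np

def gaussian_kernel(x1, x2, length_scale):
    """Squared-exponential covariance between two sets of 1D points."""
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def log_likelihood(x, y, length_scale, noise=1e-6):
    """Log marginal likelihood of a zero-mean Gaussian process.
    The kernel and the noise value are assumptions made for this sketch."""
    n = len(x)
    K = gaussian_kernel(x, x, length_scale) + noise * np.eye(n)
    L = np.linalg.cholesky(K)                    # K = L L^T
    alpha = np.linalg.solve(L, y)                # y^T K^{-1} y = alpha^T alpha
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log det(K)
    return -0.5 * alpha @ alpha - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)

# Made-up observations (hypothetical data, not taken from the figure).
x = np.linspace(0.0, 1.0, 8)
y = np.sin(2.0 * np.pi * x)

# Candidate hyperparameter values; the range is an arbitrary choice.
length_scales = np.logspace(-2, 1, 61)
scores = np.array([log_likelihood(x, y, ls) for ls in length_scales])

best_length_scale = length_scales[np.argmax(scores)]
print(f"maximum likelihood length scale: {best_length_scale:.3f}")
```

Working with the logarithm of the likelihood does not change which hyperparameter wins, and the Cholesky factorization avoids explicitly inverting the covariance matrix. In practice a numerical optimizer would usually replace the grid.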


Likelihood is not the only mechanism that exists for selecting a good approximation and making predictions; another technique called cross-validation is even more popular in many disciplines. One of the main benefits of likelihood is that it is very clearly defined for Gaussian processes, allowing SigOpt to select the optimal Gaussian process and generate optimal suggestions for our users. Sign up for your free trial at SigOpt today to maximize your likelihood of success! Also, if you found this introduction to likelihood valuable, check out our blog post on some of the technical aspects of the likelihood for Gaussian processes.



1. One could argue that nothing is actually deterministic since events at the quantum level happen probabilistically. Those of you who enjoy such analysis may find this relevant.
2. The term hyperparameter is most common in machine learning and Bayesian statistical settings, where the term parameter already applies to user-defined quantities. There, hyperparameters are used to indicate that these are new parameters applied to help study existing parameters. The use of the prefix hyper is not fundamental, and some literature prefers the term nuisance parameter, although that term could be used in other circumstances as well. In a numerical analysis setting, such as Chapter 14 of my recent book, it would be common to use the word variables rather than parameters, thus freeing the word parameters for what we refer to here as hyperparameters. Basically, I am alerting you to the fact that the topic of this blog post appears in many different communities which each may use different terminology to discuss the same concept.
3. Here you can find some recent slides discussing the theoretically optimal results for approximation with Gaussian processes. Fully understanding them requires knowledge of reproducing kernel Hilbert spaces, which is why we avoid that discussion in this post.
Michael McCourt Research Engineer
