$\def\a{\alpha} \def\e{\varepsilon} \def\s{\sigma} \def\RR{\mathbb{R}} \def\mC{\mathsf{C}} \def\mK{\mathsf{K}} \def\mI{\mathsf{I}} \def\kk{\boldsymbol{k}} \def\xx{\boldsymbol{x}} \def\yy{\boldsymbol{y}} \def\zz{\boldsymbol{z}} \def\cD{\mathcal{D}} \def\cX{\mathcal{X}} \def\lmle{L_{\text{MLE}}} \def\lmple{L_{\text{MPLE}}} \def\lkv{L_{\text{KV}}} \def\lclv{L_{\text{CLV}}} \def\ppa{\frac{\partial}{\partial\a}}$

# Profile Likelihood vs. Kriging Variance


This is the first post in our SigOpt in Depth series, which deals with more technical aspects of topics that appear in the SigOpt Fundamentals series. As a result, these posts will require more technical background than those; for instance, this post assumes familiarity with Gaussian processes at the level of Rasmussen and Williams (RW) or Fasshauer and McCourt (FM). Our SigOpt Fundamentals series includes introductions to concepts such as likelihood and Gaussian processes.

## Kernel-based interpolation

At SigOpt, our customers hope to maximize some metric $$f$$: this could be the accuracy of a text classifier or the quality of a mechanical simulation. One important strategy we use internally to help our customers is to approximate the behavior of the metric $$f$$ using Gaussian processes. The goal is to find a function $$s$$ which looks like the function $$f$$ at all locations $$\xx_k$$ where we have observed $$f$$. Mathematically, the interpolation problem is stated below.

Given a set of locations $$\cX=\{\xx_1,\ldots,\xx_N\}$$ in $$\RR^d$$ and associated function values $$\yy=\begin{pmatrix}y_1\\\\\vdots\\\\y_N\end{pmatrix}$$, such that $$y_k=f(\xx_k)$$ for $$1\leq k\leq N$$, find a continuous function $$s$$ such that $$s(\xx_k)=y_k$$.

This post only considers interpolation of noise-free $$y_k$$ values to simplify the math, but we stress that approximation and interpolation are distinct problems; SigOpt handles both situations in practice.

One of the issues with trying to solve this scattered data interpolation problem is that it is ill-posed: there are infinitely many functions that interpolate the data. We have already made one significant choice [1]: by modeling the data with a Gaussian process we guarantee a kernel-based interpolant. This means that predictions from the function $$s$$ will be a linear combination of reproducing kernels $$K(\xx,\zz)$$ (also called covariance kernels, positive definite kernels, or radial basis functions). We read this notation as the kernel $$K$$ centered at $$\zz$$ and evaluated at $$\xx$$ [2]. Some intuition as to the nature and role of these covariance kernels is provided here.

These reproducing kernels $$K$$ are the basic components of reproducing kernel Hilbert spaces. They appear in both statistical contexts (RW, eq. 2.19) and numerical analysis contexts (FM, ch. 2), but both contexts agree that the correct interpolant to observed data $$y_k$$ at locations $$\xx_k$$, for $$1\leq k\leq N$$, is

\begin{align} s(\xx) = \kk(\xx)^T\mK^{-1}\yy. \end{align}

Two new terms are introduced here:

\begin{align} \kk(\xx) = \begin{pmatrix}K(\xx, \xx_1) \\\\ \vdots \\\\ K(\xx, \xx_N)\end{pmatrix}, \qquad \mK = \begin{pmatrix}K(\xx_1, \xx_1) & \cdots & K(\xx_1, \xx_N) \\\\ & \vdots & \\\\ K(\xx_N, \xx_1)& \cdots &K(\xx_N, \xx_N)\end{pmatrix}. \end{align}

The vector $$\kk(\xx)$$ consists of kernel values centered at the data locations $$\cX$$ and evaluated at the location $$\xx$$, and it defines the basis for our kernel-based interpolant. The matrix $$\mK$$ has columns $$\kk(\xx_1)$$ through $$\kk(\xx_N)$$; the term Gram matrix is often used for $$\mK$$, and in statistical regression it is analogous to the design matrix.

Even after choosing to model our user’s metric with Gaussian processes, another choice remains which is, in some ways, also the most important. As hinted at near the end of this post, there are infinitely many reproducing kernels, each with their own special attributes. Some, such as the Matérn kernel, are popular and used frequently in applications such as spatial statistics. Others, such as the Szegő kernel, exist more for theoretical purposes. Here, we simplify our discussion by only considering a radial Gaussian kernel:

\begin{align} K(\xx, \zz) = \a\exp\left(-\e^2 \|\xx - \zz\|^2\right). \end{align}

Notice that even though we have isolated a single reproducing kernel, there are two free hyperparameters [3] remaining: the process variance $$\a$$ and the shape parameter $$\e$$. These must be positive, and they must be chosen with care to produce good predictions.
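To make the pieces above concrete, here is a minimal sketch of the kernel-based interpolant $$s(\xx) = \kk(\xx)^T\mK^{-1}\yy$$ with this Gaussian kernel. The one-dimensional data, the choice $$\a=\e=1$$, and all function names are our own illustrative assumptions, not SigOpt's implementation:

```python
import numpy as np

def gaussian_kernel(X, Z, alpha=1.0, eps=1.0):
    """Radial Gaussian kernel K(x, z) = alpha * exp(-eps^2 ||x - z||^2)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return alpha * np.exp(-eps**2 * sq_dists)

# Hypothetical data: observe f(x) = sin(x) at five scattered 1-D locations.
X = np.array([[0.1], [0.9], [1.7], [2.4], [3.0]])
y = np.sin(X).ravel()

K = gaussian_kernel(X, X)          # Gram matrix K
coeffs = np.linalg.solve(K, y)     # K^{-1} y

def s(x):
    """Kernel interpolant s(x) = k(x)^T K^{-1} y."""
    k_x = gaussian_kernel(np.atleast_2d(x), X).ravel()
    return k_x @ coeffs

# s reproduces the observed data at the data locations.
print(np.allclose([s(xk) for xk in X], y))
```

Between the data locations, $$s$$ gives a smooth prediction of $$f$$; how good that prediction is depends heavily on the hyperparameters discussed next.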

How, then, shall we “correctly” choose these hyperparameters? There is, in reality, no unanimous choice in the research community; many use a strategy called cross-validation. The goal of this post is to compare and contrast two other methods: maximum profile likelihood estimation and minimization of the kriging variance.

Whereas the goal of maximum likelihood estimation is to create an approximation which is most likely to have generated the given data, minimizing the kriging variance creates an approximation for which we have the most possible confidence in our predictions.

## Derivation

Maximum likelihood estimation, as discussed in (RW, ch. 5.4), involves choosing the model that was most likely to have generated the observed data $$\cX$$ and $$\yy$$. For a zero mean Gaussian process and noiseless data, the likelihood of given hyperparameters having generated observed data is judged by the probability of observing that data given that those hyperparameters define the covariance kernel. [4] This is a normal density function

\begin{align} p(\yy|\e, \a) = ((2\pi)^N \det{\mK})^{-1/2} \exp\left(-\frac{1}{2}\yy^T\mK^{-1}\yy\right), \end{align}

but computing this can cause underflow. Instead, it is common to minimize the function

\begin{align} \lmle(\e, \a) = \log{\det{\mK}} + \yy^T\mK^{-1}\yy \end{align}

which has the same solution; this is because

$$\lmle(\e, \a) = -2\log(p(\yy|\e, \a)) - N\log{2\pi}.$$

This strategy of scaling and removing constants occurs again below, though we will skip the details.
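One way to evaluate $$\lmle$$ stably is with a log-determinant routine and a linear solve rather than forming $$\det\mK$$ or $$\mK^{-1}$$ explicitly; the sketch below uses NumPy's `slogdet` for this, and all names are our own assumptions:

```python
import numpy as np

def l_mle(y, K):
    """L_MLE = log det K + y^T K^{-1} y, without forming det K or K^{-1}."""
    _, logdet = np.linalg.slogdet(K)   # log det K, avoiding under/overflow
    quad = y @ np.linalg.solve(K, y)   # y^T K^{-1} y via a linear solve
    return logdet + quad

# Sanity check on a case we can do by hand: K = 2I with N = 3, y = (1, 1, 1),
# so L_MLE = 3 log 2 + 3/2.
y = np.ones(3)
K = 2.0 * np.eye(3)
print(np.isclose(l_mle(y, K), 3 * np.log(2.0) + 1.5))
```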

Minimizing this directly (with a black-box optimization algorithm, such as SigOpt) is a common strategy; indeed, in most circumstances, SigOpt would be the best tool available because of its ability to optimize without knowing the structure of $$\lmle$$. But luckily for us (and this blog post), this particular formula is well structured and allows us to solve for $$\a$$ explicitly via $$\ppa\lmle(\e, \a)=0$$. We can do this most easily by defining a new matrix $$\bar{\mK} = \mK/\a$$, so that $$\bar{\mK}$$ is a function of $$\e$$ only. Note that $$\mK=\a\bar{\mK}$$ and $$\ppa\mK=\bar{\mK}$$. We can rewrite $$\lmle$$ using this notation as

\begin{align} \lmle(\e, \a) &= \log{\det(\a\bar{\mK})} + \yy^T(\a\bar{\mK})^{-1}\yy \\\\ &= \log(\a^N\det\bar{\mK}) + \frac{1}{\a}\yy^T\bar{\mK}^{-1}\yy \\\\ &= N\log\a + \log\det\bar{\mK} + \frac{1}{\a}\yy^T\bar{\mK}^{-1}\yy. \end{align}

This can be easily differentiated with respect to $$\a$$ to get

\begin{align} \ppa\lmle(\e, \a) = \frac{N}{\a} - \frac{1}{\a^2}\yy^T\bar{\mK}^{-1}\yy, \end{align}

thus the solution to $$\ppa\lmle(\e, \a) = 0$$ is $$\a = (1/N)\yy^T\bar{\mK}^{-1}\yy$$. Substituting this quantity into $$\lmle$$ gives us a new quantity to be minimized,

\begin{align} \lmle\left(\e, (1/N)\yy^T\bar{\mK}^{-1}\yy\right) = N\log\left((1/N)\yy^T\bar{\mK}^{-1}\yy\right) &+ \log\det\bar{\mK} \\\\ &+ \left((1/N)\yy^T\bar{\mK}^{-1}\yy\right)^{-1}\yy^T\bar{\mK}^{-1}\yy, \end{align}

which, after some cleanup and removal of constants, can be written as

\begin{align} \lmple(\e) = N\log\left(\yy^T\bar{\mK}^{-1}\yy\right) + \log{\det{\bar{\mK}}}. \end{align}

We call this term the profile likelihood, and it allows us to rephrase the MLE problem in a lower dimension by explicitly casting the $$\a$$ term as a function of $$\e$$.
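In code, the closed-form $$\a$$ and the profile likelihood might look like the sketch below; the function names and the hand-checkable test case are our own illustrative assumptions:

```python
import numpy as np

def alpha_hat(y, K_bar):
    """Closed-form process variance: alpha = (1/N) y^T K_bar^{-1} y."""
    return (y @ np.linalg.solve(K_bar, y)) / len(y)

def l_mple(y, K_bar):
    """Profile likelihood: N log(y^T K_bar^{-1} y) + log det K_bar."""
    _, logdet = np.linalg.slogdet(K_bar)
    return len(y) * np.log(y @ np.linalg.solve(K_bar, y)) + logdet

# With K_bar = I the quadratic form is just y^T y, so both values are
# easy to verify: alpha = 5/2 and L_MPLE = 2 log 5.
y = np.array([1.0, 2.0])
print(np.isclose(alpha_hat(y, np.eye(2)), 2.5),
      np.isclose(l_mple(y, np.eye(2)), 2 * np.log(5.0)))
```

In practice one would minimize `l_mple` over $$\e$$ with a one-dimensional optimizer, then recover $$\a$$ from `alpha_hat`.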

Minimizing $$\lmle$$ or $$\lmple$$ is a common strategy for identifying viable hyperparameters of a Gaussian process approximation, but it is not the only one. Whereas maximum likelihood estimation seeks the approximation most likely to have generated the given data, minimizing the kriging variance creates an approximation in whose predictions we have the most possible confidence.

As explained in (RW, eq. 2.19) or (FM, ch. 5.3.3), Gaussian process predictions at a point $$\xx$$ are normal random variables with expectation $$s(\xx) = \kk(\xx)^T\mK^{-1}\yy$$ (these are our predictions) and variance [5]

\begin{align} V(\xx) = K(\xx, \xx) - \kk(\xx)^T\mK^{-1}\kk(\xx). \end{align}

We define new terms with $$\a$$ extracted: $$\a\bar{\kk} = \kk$$ and $$\a\bar{K} = K$$. Substituting these and canceling $$\a$$ terms allows us to write the variance as

\begin{align} V(\xx) = \a\left(\bar{K}(\xx, \xx) - \bar{\kk}(\xx)^T\bar{\mK}^{-1}\bar{\kk}(\xx)\right). \end{align}

We cannot minimize the kriging variance at all locations $$\xx$$ simultaneously, so we may define our goal to be the minimization of some norm of the kriging variance. Using the $$\a = (1/N)\yy^T\bar{\mK}^{-1}\yy$$ value from earlier, and taking logarithms and cleaning up, allows us to minimize the norm of the kriging variance by minimizing

\begin{align} \lkv(\e) = \log\left(\yy^T\bar{\mK}^{-1}\yy\right) + \log\left\|\bar{K}(\xx, \xx) - \bar{\kk}(\xx)^T\bar{\mK}^{-1}\bar{\kk}(\xx)\right\|. \end{align}
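A sketch of evaluating $$\lkv$$ follows, approximating the norm by a maximum over a discrete grid of prediction points; the grid-based max norm and all names here are our own assumptions for illustration:

```python
import numpy as np

def l_kv(y, K_bar, k_bar_grid, Kxx_bar):
    """L_KV = log(y^T K_bar^{-1} y) + log ||V_bar||, with the norm taken
    as the maximum of the scaled kriging variance over a grid of points.

    k_bar_grid : (M, N) array; row i holds k_bar(x_i)^T for grid point x_i
    Kxx_bar    : (M,) array of K_bar(x_i, x_i) values (all 1 for a Gaussian kernel)
    """
    quad = y @ np.linalg.solve(K_bar, y)
    sol = np.linalg.solve(K_bar, k_bar_grid.T)              # K_bar^{-1} k_bar(x_i)
    V_bar = Kxx_bar - np.sum(k_bar_grid.T * sol, axis=0)    # scaled kriging variance
    return np.log(quad) + np.log(np.max(V_bar))

# Hand-checkable case: K_bar = I, one grid point with k_bar = 0, so
# V_bar = 1 and L_KV = log(y^T y).
y = np.array([1.0, 1.0])
print(np.isclose(l_kv(y, np.eye(2), np.array([[0.0, 0.0]]), np.array([1.0])),
                 np.log(2.0)))
```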

## Analysis

This brings us to the crux of this blog post, which is a comparison between minimizing the profile likelihood quantity $$\lmple$$ and the kriging variance quantity $$\lkv$$:

• Structure: both have two terms (get the easy ones out of the way first)
• Scale: both exist on a logarithmic scale
• Matching terms: $$\log(\yy^T\bar{\mK}^{-1}\yy)$$ appears in both quantities. Note that the quantity $$\yy^T\bar{\mK}^{-1}\yy$$ is also called the (squared) Mahalanobis distance and is the squared Hilbert space norm of the interpolant $$s$$
• Splitting: the $$\yy$$ values appear in only one term of each quantity. As a result, both quantities have a term dependent on $$\yy$$ and a term independent of $$\yy$$.

The most relevant point is probably the final one: both quantities involve a $$\log(\yy^T\bar{\mK}^{-1}\yy)$$ term plus a term dependent only on $$\cX$$. This small difference between quantities to minimize can yield very different choices of hyperparameters, or very similar choices, as shown in the figure below.

Figure 1: These two plots represent two different sets of observed data. We see that it is possible for the kriging variance and profile likelihood to have similar, or very different, minima. The Danger! warning denotes hyperparameter values for which computation is perilously ill-conditioned; we will discuss state-of-the-art strategies for avoiding this shaky behavior in a future blog post.

Results like this, after the discussion above, lead us to conclude that the relationship between these two hyperparameter selection strategies is nontrivial: they are similar but not the same, and they may, or may not, yield very different approximations. These differences can greatly impact the quality of the suggestions an optimization strategy based on them would give. Unfortunately, these nuances are of little consolation to those who try to use Gaussian processes without the experience to know which situations call for the approximation most likely to have generated the data (maximum likelihood estimation) or the approximation whose predictions are most consistent (minimum kriging variance). Members of the SigOpt research team are constantly analyzing ways to leverage these and other selection criteria as effectively as possible, removing this onerous burden from our users and providing access to our best approximations and resulting predictions.

One recent advance has revolved around the matrix decomposition (FM, eq. 14.4)

\begin{align} \mC(\xx) = \begin{pmatrix} K(\xx, \xx) & \kk(\xx)^T \\\\ \kk(\xx) & \mK \end{pmatrix} = \begin{pmatrix} 1 & \kk(\xx)^T \\\\ 0 & \mK \end{pmatrix} \begin{pmatrix} K(\xx, \xx) - \kk(\xx)^T\mK^{-1}\kk(\xx) & 0 \\\\ \mK^{-1}\kk(\xx) & \mI \end{pmatrix}. \end{align}

This matrix appears in the derivation of Gaussian process predictions (RW, eq. 2.18) and analysis here will demonstrate its potential role in choosing this shape parameter $$\e$$. Using properties of determinants and the notation from earlier, we see that

\begin{align} \log\det\mC(\xx) &= \log\det\mK + \log V(\xx) \\\\ &= \log\det\left[\a\bar{\mK}\right] + \log\left[\a\left(\bar{K}(\xx, \xx) - \bar{\kk}(\xx)^T\bar{\mK}^{-1}\bar{\kk}(\xx)\right)\right] \\\\ &= (N+1)\log\a + \log\det\bar{\mK} + \log\left(\bar{K}(\xx, \xx) - \bar{\kk}(\xx)^T\bar{\mK}^{-1}\bar{\kk}(\xx)\right). \end{align}

Using the presumed value of $$\a$$, removing the $$-(N+1)\log{N}$$ constant, and again taking some norm (over $$\xx$$) of the determinant allows us to write this quantity as a combined likelihood and variance,

\begin{align} \lclv(\e) &= \lmple(\e) + \lkv(\e). \end{align}
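The determinant identity underlying this decomposition, $$\log\det\mC(\xx) = \log\det\mK + \log V(\xx)$$, is easy to verify numerically; below is a quick sanity check with randomly placed points, where the kernel, the data locations, and the seed are all our own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (6, 2))     # six random 2-D data locations
x = rng.uniform(0, 1, (1, 2))     # one prediction point

def gauss(A, B, eps=1.5):
    """Gaussian kernel with alpha = 1 (the identity holds for any alpha)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-eps**2 * d2)

K = gauss(X, X)
k = gauss(X, x).ravel()
Kxx = 1.0                                    # K(x, x) for this kernel

V = Kxx - k @ np.linalg.solve(K, k)          # kriging variance at x
C = np.block([[np.array([[Kxx]]), k[None, :]],
              [k[:, None], K]])              # bordered matrix C(x)

lhs = np.linalg.slogdet(C)[1]
rhs = np.linalg.slogdet(K)[1] + np.log(V)
print(np.isclose(lhs, rhs, rtol=1e-4))       # log det C = log det K + log V
```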

This result gives a reason to believe that the similarities described above regarding $$\lmple$$ and $$\lkv$$ were more substantive than just hopeful. Studying $$\mC(\xx)$$ seems to suggest that these two parameterization methods are actually just two sides of the same coin, and that minimizing the combination of them is a viable, and possibly preferable, strategy. [6]

## Conclusion

We hope this post reinforced the notion that parameterization of Gaussian processes is a tricky topic (as many who have attempted Bayesian Optimization via open source tools may be aware), but that there are many different strategies available. The focus of this blog was on the relationship between kriging variance and profile likelihood, but even this analysis has led to more complexity by suggesting a different composite strategy. SigOpt has dedicated significant resources to researching and addressing these and other important problems as accurately, efficiently and stably as possible so as to provide our customers with the speediest realization of their optimal behavior. Sign up today for your free trial to take advantage of this and other research advances at SigOpt to optimize your complicated machine learning systems.

## Footnotes

1. Some readers may think that polynomials are the preferred tool for interpolation (and indeed many great mathematicians would agree), but beyond one dimension ($$d>1$$) there is no guaranteed polynomial interpolant to scattered data. The catch-all proof of this fact is stated in the Haar-Mairhuber-Curtis theorem. For those readers who prefer tangible evidence, try finding the unique quadratic polynomial interpolant to the following data:

\begin{align} f(-0.5, -2.0) &= 1.0 & f(0.5, 2.0) &= 1.0 \\\\ f(-1.0, -1.0) &= 1.0 & f(1.0, 1.0) &= 1.0 \\\\ f(-2.0, -0.5) &= 1.0 & f(2.0, 0.5) &= 1.0 \end{align}
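A quick way to see the failure: all six points lie on the hyperbola $$xy=1$$, so the constant column and the $$xy$$ column of the quadratic design matrix coincide and the linear system for the interpolant is singular. A short check (our own sketch):

```python
import numpy as np

pts = np.array([(-0.5, -2.0), (-1.0, -1.0), (-2.0, -0.5),
                ( 0.5,  2.0), ( 1.0,  1.0), ( 2.0,  0.5)])

# Design matrix for the six bivariate quadratic monomials 1, x, y, x^2, xy, y^2.
x, y = pts[:, 0], pts[:, 1]
A = np.column_stack([np.ones(6), x, y, x**2, x*y, y**2])

# Every point satisfies x*y = 1, so the "1" and "xy" columns are identical
# and A is rank deficient: no unique quadratic interpolant exists.
print(np.linalg.matrix_rank(A))   # rank < 6
```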

2. Reproducing kernels are symmetric, so $$K(\xx,\zz) = K(\zz,\xx)$$. The choice to think of $$\zz$$ as the kernel center and $$\xx$$ as the evaluation point is just our convention.
3. In a footnote of our SigOpt Fundamentals likelihood post, we discussed the different names under which hyperparameters may be termed. Also, there are other forms of the Gaussian kernel, as discussed in (FM, ch. 3); the form chosen here expedites the demonstration of the topic.
4. For those of you wondering, there is a physical (non-statistical) intuition behind maximum likelihood estimation that makes sense in the numerical analysis setting. See (FM, ch. 14.3.3).
5. This quantity is also referred to as the power function in numerical analysis literature on reproducing kernel Hilbert spaces, and it plays a role in classical error bounds. See (FM, ch. 9.3).
6. At first blush, diligent readers may recognize that I am suggesting that a good strategy for choosing $$\e$$ is to minimize the determinant of the symmetric positive definite matrix $$\mC(\xx)$$. The infimum of that quantity is 0, which would lead to a troubling result regarding invertibility. Fortunately, the $$\a$$ term blows up as the determinant drops to 0, leading to the same balancing act that appeared in $$\lmple$$, albeit in a new setting.
Michael McCourt, Research Engineer