Here at SigOpt, Gaussian processes and reproducing kernel Hilbert spaces (RKHS) are important components of our Bayesian optimization methodology. Our research into this topic often exists at the intersection of approximation theory and spatial statistics. We recently attended the SIAM Computational Science and Engineering 2017 meeting in Atlanta to continue learning about outstanding accomplishments in these and other exciting topics in computation.
In addition to participating in the career fair, we also organized a mini-symposium to bring together experts in Gaussian processes and RKHS and discuss recent advances that are relevant to Bayesian optimization. First, we would like to extend our thanks to Jian Wu, Jie Chen and Greg Fasshauer for contributing talks to the session; their presentations provided a diverse set of insights into new strategies for improving the viability and performance of RKHS theory in Bayesian optimization. This post serves to discuss our ongoing collaboration with Dr. Fasshauer and explore the topic of his talk at a level suitable for a broad audience. For a more thorough discussion of the interplay between Green’s functions and covariance kernels, we will be releasing an In Depth blog post soon.
A fundamental component of a Gaussian process is its covariance kernel, the mechanism by which information at different locations is believed to jointly vary. This defines the degree to which knowledge at one location provides knowledge at another; it also provides the key tool by which existing function observations can be used to predict function values at unobserved locations.
As such, the choice of covariance kernel can play a significant role in the quality of the kernel-based approximation to the function and, in turn, the quality of any Bayesian optimization in which that approximation is involved. We have previously discussed the topic of optimally choosing covariance kernels from a parametrized family using maximum likelihood estimation as well as the value of alternate parametrization metrics. In this post we explore the potential for a new family of kernels, with specially designed properties, to yield benefits during Bayesian optimization.
It should be noted that the role of positive definite (covariance) kernels extends beyond our application here in spatial statistics. Their role in support vector machines, studying text data, high dimensional integration, density estimation, and other fields demonstrates their broad value. In this discussion, however, we will be focusing only on their role in scattered data approximation.
Isotropic and Anisotropic Radial Covariance Kernels
Before we talk about new kernels, it would be prudent to review kernels already common in the literature, especially as this affords us an opportunity to look at some pictures. By far, the most popular covariance kernels across disciplines are radial (thus the term radial basis function), and among those the most popular is the Gaussian kernel (also referred to, in a persistent bit of misnomer, as simply the RBF kernel). The images below show examples of these radial Gaussian (a.k.a. squared exponential) kernels with different shape parameters.
These are examples of radial kernels (also called isotropic), but not all kernels are radial. Radial kernels can be converted from their isotropic form (where all dimensions have the same bandwidth/length scale) to their anisotropic form (where dimensions can act differently, also know more generally as stationary or, in some communities, automatic relevance determination) by redefining the sense of distance. Below are two anisotropic variants on the Gaussian kernels above with a skewed sense of distance.
Nonstationary Iterated Brownian Bridge Covariance Kernels
All of the kernels described above, and actually most common covariance kernels, have the property that the distance between observations defines the covariance (thus the term radial). The reasons for their popularity are both practical (computational benefits exist for radial kernels) and historical (this article reviews much of this history regarding completely monotone functions and interpretations in the context of the radial Fourier transform).
Nonradial kernels have, in the spatial statistics literature, enjoyed less popularity. Part of this is for theoretical reasons: kernels which are a function of only a distance quantity can then be studied as functions of one variable. In the recent past, however, there has been a renewed emphasis on the potential benefits of using nonradial kernels, partly driven by the numerical/functional analysis community (this text provides some context).
The potential benefits of using nonradial kernels include the opportunity to model functions which vary in a nonstationary way; because these kernels are not translation invariant they have the ability to recognize variations in the function on different scales in different parts of the domain. See, e.g., this article for more discussion on the topic. One point of growth has been in the use of kernels defined through Green’s functions solutions to differential operators, a topic normally restricted to the solution of differential equations but which serves as a previously untapped source of positive definite kernels. Early papers on this topic include these two articles. An upcoming blog post will discuss this topic more thoroughly.
To put a tangible face to this topic, we can look at examples of one-dimensional nonradial kernels belonging to the iterated Brownian bridge (IBB) family of kernels (introduced first in this article, and later in this text).
Those readers that are familiar with the Matérn class of positive definite kernels may see some similarities to the kernels in the figure above. This is not a coincidence: the full space Matérn kernels share the same differential operator, but the kernels above are only defined on a compact domain with specific boundary conditions. Originally, in fact, these IBB kernels were given the title “Compact Matérn” for this reason. A visual comparison between the full space Matérn and IBB kernels is shown in the figure below.In the setting as I have described it so far, these kernels only exist in 1D; this can be extended to higher dimensions by considering product forms of these kernels. The figure below demonstrates this in 2 dimensions (beyond 2D it is rather difficult to visualize). In a future post, we will more thoroughly study the performance of these tensor product kernels in higher dimensional spaces, where the pain of searching around the boundary is amplified.
IBB Kernels in Bayesian Optimization
These kernels were originally invented with the goal of using them in a collocation solution of boundary value problems: by automatically enforcing boundary conditions, these kernels are an attractive basis for which to approximate a solution. Chapter 20 of this book provides an example of this strategy.
Recently, we decided that these IBB kernels could potentially be adapted for use in Bayesian optimization. In particular, one common complaint about some Bayesian optimization methods is their propensity to sample near the boundary; this problem was addressed recently in this article which describes using a quadratic mean to dissuade samples from the boundary. Our strategy here is to use these IBB kernels to enforce behavior near the boundary and thus decrease the desire to sample around the boundary until the interior has been sufficiently explored.
In the figures below, we demonstrate the impact of this choice in comparison to a standard Matérn kernel with the same smoothness and shape parameter; a deterministic linear mean is assumed in these kriging models. Eight initialization points, randomly selected from the domain, are used to seed the optimization. These examples involve fixed covariances over the course of the optimization.2
In the example above, the function actually satisfies the boundary conditions (zero values on the boundary), so one could (legitimately) argue that this superior behavior for the iterated Brownian bridge kernel is to be expected.3 Below we show a function with arbitrary boundary behavior and multiple local optima and again compare the behavior.
A common complaint of Bayesian optimization methods using Gaussian process models is their propensity to sample around the boundary and in the corners of the solution domain. Here we introduced a strategy using special positive definite kernels to sample away from the boundary by enforcing specific behavior at the boundary. In a future blog post, we will delve into the derivation of these kernels and explore more complicated optimization problems.
Special thanks, again, to Greg Fasshauer for his continued collaboration on this project. Also, thanks to both Jian Wu and Jie Chen for their valued contributions to our SIAM mini-symposium.