Bayesian optimization of some function \(f\) is powered by two key components:

- a probabilistic surrogate model which encapsulates the understanding of \(f\) given the data that has been observed, and
- an acquisition function, which assigns a sense of benefit to sampling at a future location and which is maximized to identify subsequent suggested evaluation points.

Many different surrogate models exist (including Gaussian processes, random forests, kernel density estimation, and neural networks, among others), as do many different acquisition functions (including probability of improvement, expected improvement, knowledge gradient, entropy search, and upper confidence bound, among others).

This blog post discusses recent advances in interpreting the upper confidence bound acquisition function (abbreviated below as UCB) from a geometric perspective. In particular, this post is an abbreviated version of an article I will be presenting on Tuesday, April 17 at ICASSP 2018 in Calgary. I hope that anyone attending this conference who is interested will join the Probabilistic Models session (MLSP-P2) at 1:30pm.

## Upper Confidence Bound

The Upper Confidence Bound^{1} (UCB) acquisition function balances exploration of the function \(f\) and exploitation of the current state of the surrogate model. In the simplest case, we assume that the surrogate model is a Gaussian process, meaning that the UCB can be written as

\(a(\xx) = -(\mu(\xx) - \kappa\sigma(\xx))\)

where \(\mu(\xx)\) and \(\sigma^2(\xx)\) are the mean and variance of the surrogate model. The value \(\kappa\) is a free parameter which can be chosen so as to enforce a specific balance between exploration (high \(\sigma\)) and exploitation (low \(\mu\)). Assuming the current data \(\cD_n\) is used to generate the surrogate model and, consequently, the UCB function \(a(\xx;\cD_n)\), the \((n+1)\)th point to evaluate would be determined by

\(\xx_{n+1} = \argmax_{\xx} a(\xx;\cD_n).\)
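To make this concrete, here is a minimal sketch of the suggestion step, assuming a scikit-learn `GaussianProcessRegressor` as the surrogate; the function names and the finite candidate-set maximization are illustrative choices, not the method from our article.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def ucb(surrogate, X, kappa=1.0):
    """a(x) = -(mu(x) - kappa * sigma(x)) for each row x of X."""
    mu, sigma = surrogate.predict(X, return_std=True)
    return -(mu - kappa * sigma)

def suggest_next(surrogate, candidates, kappa=1.0):
    """Pick x_{n+1} = argmax_x a(x; D_n) over a finite candidate set."""
    return candidates[np.argmax(ucb(surrogate, candidates, kappa))]
```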

At this point, it helps to consider a specific example. Suppose that we are given

\(f(\xx) = 1 - \sqrt{x_1 x_2} \sin(x_1) \sin(x_2)\)

where \(\xx = \begin{pmatrix}x_1\\x_2\end{pmatrix}\) is a two-dimensional vector. A contour plot is presented in the figure below.
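For reference, a minimal NumPy implementation of this objective might look like the following; the restriction \(x_1 x_2 \geq 0\) is an assumption needed for the square root to be real, since the post does not state the domain explicitly.

```python
import numpy as np

def f(x):
    """Example objective: f(x) = 1 - sqrt(x1 * x2) * sin(x1) * sin(x2).

    Assumes x1 * x2 >= 0 so the square root is real.
    """
    x1, x2 = x[..., 0], x[..., 1]
    return 1.0 - np.sqrt(x1 * x2) * np.sin(x1) * np.sin(x2)
```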

After 40 observed results (red dots) are present in \(\cD_{40}\), the surrogate model mean \(\mu(\xx)\), standard deviation \(\sigma(\xx)\), and UCB \(a\) are computed and displayed in the figure below. The surrogate model and the choice of \(\kappa=1\) were designed to produce easily interpreted pictures; the details of fitting the surrogate are beyond the scope of this article and typically involve likelihood computations.
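A hedged sketch of how such quantities could be produced is below, continuing from the objective `f` above; the kernel choice, the domain bounds, and the random sampling of the 40 observations are all illustrative assumptions, not details from the post.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(0)
X_obs = rng.uniform(0.0, 2.0 * np.pi, size=(40, 2))  # assumed domain
y_obs = f(X_obs)

# Fit the GP surrogate (hyperparameters tuned by maximizing the likelihood).
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_obs, y_obs)

# Evaluate mu, sigma, and the UCB on a dense 100 x 100 grid.
g = np.linspace(0.0, 2.0 * np.pi, 100)
grid = np.array(np.meshgrid(g, g)).reshape(2, -1).T  # 10000 points
mu, sigma = gp.predict(grid, return_std=True)
a = -(mu - 1.0 * sigma)  # kappa = 1, as in the figure
```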

## Maximizing the Upper Confidence Bound

As seen in the UCB graph in Figure 2, the acquisition function is multimodal, which requires some form of nonconvex optimization to maximize. Such a process is expensive and potentially complicated, often without simple guarantees of easily identifying the best UCB point (the gray star).^{2}

Of particular note in this acquisition function maximization scenario is the impact of the “initial guess”, a commonly required component of iterative algorithms.^{3} This issue has always existed, and appears here when local optimization tools are used to try to identify the maximum of \(a\). The figure below shows the solutions that two standard optimizers, Powell’s method and a modification of Newton’s method, will find as a function of the initial guess (for plotting cleanliness, solutions owned by less than 2% of the domain were not given their own regions).
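The sensitivity can be probed with a multistart experiment like the sketch below, which reuses the `gp` surrogate from the earlier sketch; SciPy’s `Powell` and `BFGS` solvers stand in here for the Powell and modified-Newton methods in the figure.

```python
from scipy.optimize import minimize

def neg_ucb(x, kappa=1.0):
    mu_x, sigma_x = gp.predict(x.reshape(1, -1), return_std=True)
    return float(mu_x[0] - kappa * sigma_x[0])  # minimize -a(x)

# Run each local solver from many random starts and record where it lands.
starts = rng.uniform(0.0, 2.0 * np.pi, size=(100, 2))
solutions = {
    method: np.array([minimize(neg_ucb, x0, method=method).x for x0 in starts])
    for method in ("Powell", "BFGS")
}
```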

For both of these initial-guess-dependent methods, there is only roughly a 20% chance that a randomly chosen initial guess in the domain will converge to the true solution (in gray). In higher dimensions, this will likely be even worse. The goal of our article is to improve the prospects of this optimization process by providing better initial guesses.

## Clustering to Guide Initial Guesses

In our article, we consider a geometric interpretation of UCB by studying the interaction of prospective points in a 2D \(\mu\)-\(\sigma\) plane. To understand the motivation for this, note that \(a(\xx) = \kappa\sigma(\xx) - \mu(\xx)\), which can be rearranged into the following form:

\(\sigma(\xx) = \frac{1}{\kappa}\mu(\xx) + \frac{a(\xx)}{\kappa}.\)

This should bring to mind the classic “slope-intercept” form of a line, \(y=mx+b\): for a fixed acquisition value \(a\), the corresponding \((\mu, \sigma)\) pairs lie on a line with slope \(1/\kappa\) and intercept \(a/\kappa\). This is apparent in the figure below, where a UCB contour plot is provided as a function of \(\mu\) and \(\sigma\).

Obviously, applying a Powell or BFGS solver on 10000 initial guesses is quite costly and possibly more than what is required (especially for the 2D problem we are studying). Our proposed strategy is to use the physical representation shown in Figure 4 to select a subset of points in the physical domain to be used as initial guesses in the optimization tool. In particular, we identify clusters of points in \(\mu\)-\(\sigma\) space to effectively sample regions of the physical domain which are diverse in \(\mu\)-\(\sigma\) space. These clusters are identified through Gaussian mixture modeling. The figure below shows the outcome from clustering on the 10000 points.
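A minimal sketch of this clustering step, reusing `grid`, `mu`, `sigma`, and `a` from the earlier sketch, is below; the number of mixture components and the choice of each cluster’s highest-UCB member as its representative are illustrative, not prescriptions from our article.

```python
from sklearn.mixture import GaussianMixture

# Cluster the (mu, sigma) pairs in the mu-sigma plane.
features = np.column_stack([mu, sigma])
labels = GaussianMixture(n_components=5, random_state=0).fit_predict(features)

# One candidate initial guess per cluster: its highest-UCB member.
initial_guesses = np.array([
    grid[labels == k][np.argmax(a[labels == k])] for k in range(5)
])
```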

As suggested by Figure 5, even if the clusters in \(\mu\)-\(\sigma\) space are contiguous, the associated \(\xx\) locations in physical space need not be contiguous. This yields a degree of freedom regarding how to use the information in the clusters to effectively initialize the optimization. In the figure below we present two possible strategies, which are discussed in greater detail in our article.

## Conclusion

We have revisited one of the traditional acquisition functions of Bayesian optimization, the upper confidence bound, and looked to a slightly different interpretation to better inform its optimization. In our article, we present a number of examples of this clustering-guided UCB method being applied to various optimization problems. If you are interested in this topic and happen to be attending ICASSP 2018, please stop by our poster session (Tuesday, April 17).