Learn why Two Sigma standardized on our optimization solution to scale their research efforts. Read More

$\def\gg{\mathbf{g}} \def\xx{\mathbf{x}} \def\yy{\mathbf{y}} \def\mD{\mathbf{D}} \def\mG{\mathbf{G}} \def\mW{\mathbf{W}} \def\mX{\mathbf{X}} \def\mY{\mathbf{Y}} \def\sgn{\mathrm{sgn}} \def\RR{\mathbb{R}} \def\cD{\mathcal{D}} \def\cF{\mathcal{F}} \def\cN{\mathcal{N}} \def\cO{\mathcal{O}} \def\cS{\mathcal{S}} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator{\me}{e}$

# Clustering Applied to Acquisition Functions

All Model Types, Modeling Best Practices

Bayesian optimization of some function $$f$$ is powered by two key components:

• a probabilistic surrogate model which encapsulates the understanding of $$f$$ given the data that has been observed, and
• an acquisition function, which assigns a sense of benefit to sampling at a future location and which is maximized to identify subsequent suggested evaluation points.

Many different surrogate models exist (including Gaussian processesrandom forests,kernel density estimationneural networks, among others) and many different acquisition functions exist (including probability of improvementexpected improvementknowledge gradiententropy searchupper confidence bound, among others).

This blog post discusses recent advances in interpreting the upper confidence bound acquisition function (abbreviated below as UCB) from a geometric perspective. In particular, this post is an abbreviated version of an article I will be presenting on Tuesday, April 17 at ICASSP 2018 in Calgary. I hope that anyone attending this conference who is interested will join the Probabilistic Models session (MLSP-P2) at 1:30pm.

## Upper Confidence Bound

The Upper Confidence Bound1 (UCB) acquisition function balances exploration of the function $$f$$ and exploitation of the current state of the surrogate model. In the simplest case, we assume that the surrogate mdoel is a Gaussian process, meaning that the UCB can be written as

$$a(\xx) = -(\mu(\xx) – \kappa\sigma(\xx))$$

where $$\mu(\xx)$$ and $$\sigma^2(\xx)$$ are the mean and variance of the surrogate model. The value $$\kappa$$ is a free parameter which can be chosen so as to enforce a specific balance between exploration (high $$\sigma$$) and exploitation (low $$\mu$$). Assuming the current data $$\cD_n$$ is used to generate the surrogate model and, consequently, the UCB function $$a(\xx;\cD_n)$$, the $$n+1$$th point to evaluate would be determined by

$$\xx_{n+1} = \argmax_{\xx} a(\xx;\cD_n).$$

At this point, it helps to consider a specific example. Suppose that we are given

$$f(\xx) = 1 – \sqrt{x_1 x_2} \sin (x_1) \sin (x_2)$$

where $$\xx = \begin{pmatrix}x_1\\x_2\end{pmatrix}$$ is two-dimensional vector. A contour plot is presented in the figure below.

Figure 1: Contour plot of a moderately interesting function.

After 40 observed results (red dots) are present in $$\cD_{40}$$, the surrogate model mean $$\mu(\xx)$$ standard deviation $$\sigma^2(\xx)$$, and UCB $$a$$ are computed and displayed in the figure below. The surrogate model and choice of $$\kappa=1$$ were designed to produce easily interpreted pictures; the details of surrogate development are beyond the scope of this article and likely involve likelihood computation.

Figure 2: Contour plots of (left) mean, (center) standard deviation, and (right) UCB. Red points are known observations. The gray star in the right panel is the max of the UCB, and thus the next point to test in the optimization.

## Maximizing the Upper Confidence Bound

As seen in the UCB graph in Figure 2, the acquisition function is multimodal which will require some form of nonconvex optimization to maximize. Such a process is expensive and potentially complicated, often without simple guarantees of being able to easily identify the best UCB point (the gray star).2

Of particular note in this acquisition function maximization scenario is the impact of the “initial guess”, a commonly required component of iterative algorithms.3 This issue has always existed, and appears here when convex optimization tools are used to try to identify the maximum of $$a$$. The figure below shows the solutions that two standard optimizers, Powell’s method and a modification of Newton’s method, will find, as a function of the initial guess (for plotting cleanliness, solutions owned by less than 2% of the domain were not provided their own regions).

Figure 3: (left) Contour plot of the acquisition function $$a$$, as appearing in Figure 2, with red stars representing solutions found by initial guesses distributed throughout the domain. (center) The 8 Powell optimizer solutions (stars) to which initial guesses in the domain converge and corresponding regions containing those initial guesses. (right) Matching results for the Newton optimizer.

For both of these methods which are dependent on an initial guess, there is only roughly a 20% chance that a randomly chosen initial guess in the domain will converge to the true solution (in gray). In higher dimensions, this will likely be even worse. The goal of our article is to attempt to improve the prospects of this optimization process by providing better initial guesses.

## Clustering to Guide Initial Guesses

In our article, we consider a geometric interpretation of UCB, by studying the interaction of prospective points in a $$\mu$$-$$\sigma$$ 2D plane. To understand the motivation for this, it can help to reorganize the UCB definition into the following form

$$\sigma(\xx) = \frac{1}{\kappa}\mu(\xx) + \frac{a(\xx)}{\kappa}.$$

This should bring to mind the classic “slope-intercept” form of a line, $$y=mx+b$$, which is apparent in the figure below where a UCB contour plot is provided as a function of $$\mu$$ and $$\sigma$$.

Figure 4: A $$\mu$$-$$\sigma$$ study of UCB. The contour plot generates lines as level sets because of the form of UCB. Additionally, the surrogate model is sampled at 10,000 uniformly distributed random points in the domain and the resulting $$\mu$$-$$\sigma$$ values are marked with a black dot to help indicate the region of feasible outcomes for the given surrogate.

Obviously, applying a Powell or BFGS solver on 10000 initial guesses is quite costly and possibly more than what is required (especially for the 2D problem we are studying). Our proposed strategy is to use the physical representation shown in Figure 4 to select a subset of points in the physical domain to be used at initial guesses in the optimization tool. In particular, we identify clusters of points in $$\mu$$-$$\sigma$$ space to effectively sample regions of the physical domain which are diverse in $$\mu$$-$$\sigma$$ space. These clusters are identified through Gaussian mixture modeling. The figure below shows the outcome from clustering on the 10000 points.

Figure 5: (left) 4 clusters have been identified from the data sampled in Figure 4. The Gaussian cluster mean is identified with a square, and the max $$a$$ value for that cluster is identified with a circle. (right) The physical domain is divided according to association with each of the 4 colors (with matching color). The input data point nearest to each of the means of the clusters is plotted with a circle of the corresponding color. The result of a BFGS optimization at each circle is represented with a star of the same color.

As suggested by Figure 5, even if the clusters in $$\mu$$-$$\sigma$$ space are contiguous, the associated $$x$$ locations in physical space need not be contiguous. This yields a degree of freedom regarding how to use the information in the clusters to effectively initialize the optimization. In the figure below we present two possible strategies which are discussed in greater detail in our article.

Figure 6: Different initial guesses derived from the clustering present in Figure 5. (left) The input data point with $$\mu$$-$$\sigma$$ value nearest to each of the means of the clusters is represented with a square. (right) The max $$a$$ point in each cluster is represented with a circle.

## Conclusion

We have revisited one of the traditional acquisition functions of Bayesian optimization, the upper confidence bound and looked to a slightly different interpretation to better inform optimization of it. In our article, we present a number of examples of this clustering-guided UCB method being applied to various optimization problems. If you are interested in this topic and happen to be attending ICASSP 2018, please stop by our poster session (Tuesday, April 17).

## Footnotes

1. The term "Upper Confidence Bound" was developed in the context of multi-armed bandits, with the goal of maximizing a given function $$f$$. In Bayesian optimization, the stated goal is often the minimization of $$f$$, and the analogous name for such an acquisition function would be "Lower Confidence Bound". For consistency, the name UCB is preserved, but the equations have been manipulated to minimize a function. Return
2. This is in contrast to convex optimization, where strong theory regarding convergence rate has existed for quite some time. The humor of requiring nonconvex optimization (to maximize $$a$$) so as to empower nonconvex optimization (to optimize $$f$$) is not lost on Bayesian optimization researchers. An important distinction is that $$a$$ is not black-box; rather, it has a known form as well as (often) known gradients and a low evaluation cost (relative to $$f$$), allowing for more powerful tools to be used to maximize $$a$$. Return
3. A great deal of theory exists on the convergence of algorithms such as stochastic gradient descent for optimizing nonconvex objectives. Such results can show asymptotic convergence (from any initial location) to the global optimum for certain classes of functions. For our demonstrative purposes regarding the complexity of optimizing multimodal functions, we only use convex optimization tools for maximizing $$a$$ in this post. Return
Jungtaek Kim Guest Author