The research team at SigOpt is fortunate to be part of an outstanding community of experts from around the world. In our *Highlight* blogs, we take the opportunity to feature friends of ours who have recently done exceptional work. This post introduces the article Bayesian Optimization of Composite Functions by Raul Astudillo and Peter Frazier, appearing in the upcoming ICML 2019 proceedings. Raul is currently pursuing his Ph. D. at Cornell University under Peter, who was also the Ph. D. advisor of SigOpt’s CEO Scott Clark.

Bayesian optimization has proven successful in optimization of expensive-to-evaluate black-box objective functions. By black-box, we mean that no special structure over the objective function, such as convexity, is assumed, and that we only have access to (a few) queries of it. While in practice many objective functions are truly black-boxes, in this post we discuss an implicit structure that arises frequently and that can be exploited to make Bayesian optimization more efficient. More concretely, we consider *composite* objective functions, i.e., of the form $f(x)=g(h(x))$, where $h$ is a black-box derivative-free expensive-to-evaluate function with vector-valued outputs, and $g$ is a cheap-to-evaluate, real-valued function.

Optimization of composite objective functions arises in a number of application settings in the literature. For instance, this article in materials design posed the objective of optimizing the combination of multiple material properties via a performance index; in this case, $h(x)$ is the set of material properties that results from a particular chemical composition, $x$, and $g$ is given by the performance index used. Evaluating the material properties, $h(x)$, that result from a materials design often requires performing expensive physical or computational experiments (as was discussed here).

Composite objective functions also arise in calibration of expensive simulators, where the goal is to *invert* the parameters, $x$, so that the output of the simulator, $h(x)$, most closely matches some observed data vector $y_{\text{obs}}$. Typically, this problem can be formulated as $\min_x \|h(x)- y_{\text{obs}}\|$, where $\|\cdot\|$ is the $L_1$ or $L_2$ norm.

In our paper, we show how to exploit the composite structure of objective functions to optimize them more efficiently. Our approach is straightforward: instead of modeling $f$ directly using a single-output Gaussian process, we model $h$ using a multi-output Gaussian process. Intuitively, this strategy is particularly beneficial when observations of $h(x)$ provide information relevant to optimization that is not available from observations of $f(x)$ alone. A simple but illustrative example is the following: suppose $x$ and $h(x)$ are real-valued and $g(y)=y^2$. If $h$ is continuous, $h(a)<0$, and $h(b)>0$, then, if we follow the approach described above, we know that there is a global minimum in the interval $(a,b)\text{,}$ whereas if we follow the standard Bayesian optimization approach, we do not.

Having specified the statistical model to be used, it remains to specify how the points to be evaluated will be chosen. As is typical in Bayesian optimization, we do this through an acquisition function whose maximization indicates the next point to evaluate. Indeed, we will use the, arguably, most popular acquisition function in Bayesian optimization: expected improvement.

The expected improvement within our setting, which we call for composite functions ($\eicf$), is defined analogously to the classical expected improvement ($\ei$), but where our posterior on $f(x)$ is given by the composition of $g$ and the normally distributed posterior distribution on $h(x)$:

\[

\eicf_n(x) = \expectation_{n}\left[\left\{g(h(x))-f_n^{*}\right\}^{+}\right],

\]

where $f_n^* = \max_{i=1,

\ldots,n}f(x_i)$ is the maximum value across the points that have been evaluated so far, $x_1,\ldots, x_n$, and $\expectation_n$ indicates the conditional expectation given the available observations at time $n$, $\{\left(x_i, h(x_i)\right)\}_{i=1}^n$.

The graphics below illustrate the $\eicf$ and classical $\ei$ acquisition functions in the formerly described setting where $h$ is scalar-valued and $f(x) = g(h(x)) = h(x)^2$. As argued before, the additional information provided by observing $h(x)$ makes our approach evaluate points that are closer to the global optima of $f$.

Figure 1: Illustrative example of the $\eicf$ and classical $\ei$ acquisition functions, in a problem where $h$ is scalar-valued and $g(h(x))=h(x)^2$.

Observations of $h(x)$ provide a substantially more accurate view of where global optima of $f$ reside as compared with observations of $f(x)$ alone, and cause $\eicf$ to evaluate at points much closer to these global optima.

We note that, unlike the classical $\ei$ acquisition function, for general functions $g$, $\eicf$ lacks a closed form. However, in our paper we show that, under mild regularity conditions, it can be efficiently maximized. Our approach to maximize $\eicf$ is based on the so called *reparametrization trick*, which allows us to efficiently compute an unbiased estimator of the gradient of $\eicf$; this unbiased estimator is then used within multi-start stochastic gradient ascent.

An example of the results of our experiments (seen in the figure below) show that our approach may dramatically outperform standard Bayesian optimization; further details, especially regarding additional test problems, are provided in our article.

Figure 2: Empirical demonstration of performance on a standard test for Bayesian calibration of expensive models. Our new CF strategies can demonstrate significantly better performance.

We encourage you to check out our paper for a detailed description of our experiments and a formal description of our approach. If you are attending ICML 2019, we also invite you to our talk (Wed Jun 12th 05:05 — 05:10 PM @ Room 101) and to stop by our poster (Poster Wed Jun 12th 06:30 — 09:00 PM @ Pacific Ballroom #237).