This is the second of three blog posts during which we explore the concept of uncertainty – or noise – and its implications for Bayesian optimization. This is part of our series of blog content on research that informs our product and methodologies.
 Uncertainty 1: Modeling with Uncertainty
 Uncertainty 2: Bayesian Optimization with Uncertainty
 Uncertainty 3: Balancing Multiple Metrics with Uncertainty
Intro
In our previous blog post, we discussed the importance of properly accounting for uncertainty when constructing machine learning models for regression/classification. Here we focus on how uncertainty plays an important role in Bayesian optimization. We will show how a naive strategy for optimization (random search) falls short in problems with uncertainty in the function values. Then, we will review the fundamentals of Bayesian optimization, highlighting acquisition functions that could be used in this setting. Finally, we illustrate the differences between these approaches, as well as the impact of noisy observations, for a simple optimization problem. This is our second blog post in the uncertainty series; our third post (released next week) discusses the interpretation of efficient configurations for multiple metrics which are noisy.
Blackbox optimization refresher
Blackbox optimization has been a critical component of many complex systems in science and engineering; one notable application is hyperparameter tuning, a key component of machine learning (ML) systems. Here, we introduce ideas fundamental to constructing a blackbox optimizer which is robust to noise.
Let us begin by describing a typical blackbox optimization setting:
 We want to sequentially select locations^{1} of the input domain, \(x \in \mX\), to find the global optimum of an objective function \(f : \mX \to \RR\), \(x^\ast = \arg\max_x f(x)\).
 We have no structural knowledge of \(f\) beyond querying its function values at a given input location \(x\).
 The objective function is expensivetocompute and we want to find \(x^\ast\) as quick as possible, i.e., few function evaluations.
When tuning hyperparameters, for example, the objective function \(f\) is typically the performance of the ML algorithm, e.g., the validation accuracy of the model. The solution \(x^\ast\) is the hyperparameter configuration that results in the best (validation) performance. For a neural network, typical hyperparameters that can be optimized using this technique include the learning rate, the SGD momentum and the number of nodes per layer.
As an illustrative example, let us consider the function depicted in Figure 1. Our goal is to maximize this function in the blackbox setting. For ease of exposition, we will say that an optimization was successful if the optimizer returns as an answer or final recommendation any location in the region of interest^{2}. As shown in the figure, the region of interest is a subset of the input domain corresponding to the highest function values.
Blackbox optimization with noisy observations
In practice, our ability to measure and/or record observations from realworld problems is limited. In ML, we typically estimate the generalization performance of a model using a development/validation set. In science, the outcome of experiments are often subject to measurement error. In everyday life we typically encounter uncertainty!
To cope with these limitations, we will assume that the observed values of \(f\) are corrupted by an unknown observation uncertainty, which we more concisely call noise. Specifically, we denote by \(y_i\)^{3} the observed value associated with the input location \(x_i\) —this is the value that was actually observed in the presence of uncertainty.
One common strategy when modeling this noisy data is to decompose the observed value as
\(y_i = f(x_i) + \varepsilon.\)
We have assumed that the observed value \(y_i\) has two additive components: one that represents the true/exact function value \(f(x_i)\), and another one that corresponds to the imprecision of the recorded observation. This \(\varepsilon\) represents a stochastic noise term — typically, centered at zero, with known noise variance \(\sigma^2\) so that \(\varepsilon \sim N(0, \sigma^2)\). It is possible to consider different assumptions about the noise but this one is certainly the most used due to implications of the central limit theorem.
Figure 2 shows an example of hundreds of observations sampled from the observation model described above. Notice that the observed values can be significantly different from the true function values. Identifying the region of interest in Figure 2, especially with limited observations, becomes very challenging. High function values can be seen in locations that are not part of the region of interest, which can be misleading for the optimizer. This is an important point to remember: we should not fully trust the observed values.
Next, we analyze potential strategies for performing blackbox optimization in the presence of noise. Our function from Figure 1 will be used in two ways:

 noiseless, all observations are exact \(y_i = f(x_i)\), as in Figure 1,
 with noise, where the noise standard deviation \(\sigma\)is fixed to 10% of the function range, as in Figure 2.
Random search
A simple optimization strategy is grid search^{4} to randomly select locations in the input space; the one associated with the highest function value is then returned as the answer. Why might this very simple strategy work? As long as the region of interest has a nonzero probability of being sampled, random search (RS) will find the right answer. The question is: how long will it take?
In Figure 3, we show the probability of RS finding the region of interest in our running example for both settings^{5}. The difference between the noiseless and the noise case is considerable: after 30 iterations, the probability of finding the region of interest values drops from 53% to 44%; to achieve a 90% success rate, we need at least 91 function evaluations for the noiseless case and more than 1000 for the noise case! Note that this is just a simple onedimensional problem, in higher dimensions RS will perform much worse in both settings.
Random search is a simple and easytoimplement method but it is a less reliable method when the function values are noisy. In the next section, we describe a more sophisticated approach.
Bayesian optimization
Bayesian optimization is an active research topic devoted to efficiently solving blackbox optimization (Jones et al., 1998; Brochu et al, 2010; Snoek et al., 2012; Shahriari et al., 2016 Frazier, 2018). The main idea is twofold: first, we create a surrogate model (e.g., Gaussian process GP) for the expensivetoevaluate objective function; then, we design an acquisition function to intelligently select where to evaluate next.
Both surrogate model and acquisition function have an important role during the optimization. Here, we focus on the latter, but an interested reader might find Rasmussen and Williams (2006) particular useful for a technical discussion about arguably the most common class of surrogate models: GPs. Chapter 2 introduces the model and explains how to do predictions using noisy observations.
Acquisition functions have been extensively studied in the research community (Močkus et al., 1978; Jones, 2001; Srinivas et al., 2010; Hennig and Schuler, 2012;HernándezLobato et al., 2014; Wang and Jegelka, 2017). Below we will briefly mention a couple of acquisition functions that could be used for the noisy setting. For a complete review and more indepth discussion about other acquisition functions, please check Snoek et al. 2012, Shahriari et al. 2016, or Frazier 2018.
Expected improvement (EI): Commonly used in Bayesian optimization, EI is literally the expectation of the improvement,
\(EI(x; \mD) = \mathbb{E} \big[ \max( f(x) – y_{max}, 0) \big]\)
where \(y_{max}\) is \(\max_i y_i\). This well known formula, however, is only valid when we have exact observations of the function values \(y_i = f(x_i)\). Thus, despite being widely used in practice, EI is not appropriate for noisy problems because it follows the same WYSIWYG logic of RS. With noise, the best observation \(y_{max}\) may not correspond to the best true function values at the observed locations \(max_i\,f(x_i)\). This may require changes to the standard EI strategy.
Expected improvement with plugin:^{6} One common approximation is to use EI with a different improvement target. Instead of (the likely noisy) \(y_{max}\), one can use the highest expected value of the observed point as the reference for the improvement. This approach works well when the noise level is small but it becomes inaccurate otherwise (the predictions after observing a noisy value might change, even for the already observed locations).
Noisy expected improvement (NEI): Recently, Letham et al. (2018) and Frazier (2018) (Section 5) describe how to properly derive and compute expected improvement for noisy observations. In the former paper the authors refer to this acquisition function as noisy expected improvement. Computing NEI is more involved and computationally more expensive than regular EI but it has the benefit of avoiding repeated sampling at the same locations. This acquisition function can then be safely used in the presence of noisy observations, and it reduces to regular EI when the observations are exact.
Knowledge Gradient (KG): As discussed in our previous blog post, Expected improvement vs knowledge gradient, KG (Frazier et al., 2006) is another popular acquisition function used in Bayesian optimization. The reason why expected EI and KG result in two different strategies is related to what we wish to return as a final solution. EI is more conservative and restricts the final solution to previously evaluated points; KG permits the final solution to be a location x that has not been observed but is believed to be the best performing. The byproduct of these assumptions is two distinct strategies for Bayesian optimization.
In contrast to standard EI, the knowledge gradient has been known to perform well in the presence of noise for several years now (Frazier et al., 2009). KG properly accounts for the model differences induced by the prospect of observing a noisy observation. It also suggests the highest expected value as the final recommendation, as opposed to the highest observed value as in the noisefree expected improvement setting.
Augmented expected improvement (AEI): We also would like to mention AEI (Huang et al., 2006) as an interesting improvementbased acquisition function. AEI is easier to implement and less computationally intensive than KG or NEI. In Pichenyet al. (2013) the authors have shown that AEI was the most competitive strategy among other heuristics for noisy bayesian optimization.
Entropy search methods: Finally, we conclude this list with a family of methods known by the name entropy search. In all these strategies, the goal is to minimize the uncertainty with respect to either the optimum location \(x^{\ast}\) or the highest function value \(f(x^{\ast})\). Entropy search (ES) (Hennig and Schuler, 2012) and predictive entropy search (PES) (HernándezLobato et al., 2014) are probably the two most famous examples. And, more recently, output predictive entropy search (OPES) (Hoffman and Ghahramani, 2015) and maxvalue entropy search (MES) (Wang and Jegelka, 2017) were also proposed.
Numerical demonstration
By revisiting our earlier example, we can consider the performance of Bayesian optimization (using augmented expected improvement^{7}) in the presence of noise. In the figure below, we show the probability of finding the region of interest as more iterations transpire. In the noiseless setting, the answer is found in less than 20 iterations. In the noise setting, there is more uncertainty about the location of the global optimum due to the noise. However, after 30 iterations, the estimated probability of finding the region of interest is 97%; after 50 it always finds the right answer.
Instead of focusing on the region of interest, we can also check the distribution of answers for both settings and algorithms. In the animation below, we show the probability mass for random search (top) and Bayesian optimization (bottom); the graphs on the left correspond to the noiseless setting; on the right, to the noise. The height of the orange bar matches the probabilities in the Figure 4. Bayesian optimization in both cases quickly finds the region of interest.
Thus far, we have fixed the noise level to 10% of the function range. Now, we analyze the impact of different levels of noise. The figure below shows the performance of both optimizers after 30 iterations for different noise levels. As expected, the performance deteriorates as we increase the noise.
Conclusion
In this blog post, we presented the importance of accounting for measurement uncertainty during optimization. We discussed how we typically model the imprecision associated with measurements of the objective function, and how those imprecisions can be deceptive for a naive blackbox optimizer. Then, we briefly described the fundamentals of using Bayesian optimization in the presence of noisy observations.
In the following blog post (released next week) we will extend this discussion to multimetric optimization, especially focusing on how can we analyze the outcome of a multimetric optimizer when there is uncertainty on the observed measurements.
Special Thanks
I would like to thank Roman Garnett for our many great conversations about Bayesian decision theory for optimization. His interpretation of the role of noise in Bayesian optimization, and enthusiastic evangelization of the topic, has helped myself and others develop better strategies for sequential decision making.