 $\def\a{\alpha} \def\e{\varepsilon} \def\s{\sigma} \def\RR{\mathbb{R}} \def\mC{\mathsf{C}} \def\mK{\mathsf{K}} \def\mI{\mathsf{I}} \def\ggamma{\boldsymbol{\gamma}} \def\kk{\boldsymbol{k}} \def\uu{\boldsymbol{u}} \def\vv{\boldsymbol{v}} \def\ww{\boldsymbol{w}} \def\xx{\boldsymbol{x}} \def\xopt{\xx_{\text{opt}}} \def\yy{\boldsymbol{y}} \def\zz{\boldsymbol{z}} \def\cD{\mathcal{D}} \def\cX{\mathcal{X}} \def\lmle{L_{\text{MLE}}} \def\lmple{L_{\text{MPLE}}} \def\lkv{L_{\text{KV}}} \def\lclv{L_{\text{CLV}}} \def\ppa{\frac{\partial}{\partial\a}} \DeclareMathOperator*{\argmin}{argmin}$

# Intro to Multicriteria Optimization

Advanced Optimization Techniques, All Model Types

In this post, we will discuss the topic of multicriteria optimization, when you need to optimize a model for more than a single metric, and how to use SigOpt to solve these problems.

Here at SigOpt, we help clients to more quickly tune financial models and design industrial processes by providing access to an ensemble of Bayesian Optimization strategies through our REST API or web interface and associated PythonJava and clients. Our fundamental goal is to find the maximum of a black box function (or metric) — one which takes as input several values (real numbers, integers or categories) and outputs 1 real number to be maximized, with no information about the underlying method for which the inputs interact to result in the output (which is why it is called a “black box”).  Tasks as diverse as classifying images and betting on sports require optimization of a suitable metric black box function (which serves as a proxy for quality) to achieve the best performance; failure to do so can yield subpar results for even powerful tools such as deep neural networks.

Fortunately, many situations neatly present such a black box function: financial trading problems may leverage backtesting, machine learning problems may use cross-validation, industrial projects may study product quality subject to tolerance.  Some situations, however, do not fit as neatly into the black box function setting described above, where a single metric is being optimized.  Consider this relatively standard situation of driving on the freeway1: an increase in speed will cause a decrease in the time until I reach my destination, but may also cause an increase in the cost by using more fuel.  How, then, can I choose a speed so as to simultaneously minimize time to the destination and minimize the cost of the trip?

Such a problem is a multicriteria optimization and does not fit into the standard framework for black-box optimization; applications including portfolio managementcancer treatment, and network theory have already benefited from the use of multicriteria optimization.  This post will discuss the complexities of this problem and some strategies for manipulating it into a single criterion problem.

## Vector-valued functions

Single criterion optimization problems are phrased in the following language:

$$\xopt = \argmin_{\xx\in\Omega} f(\xx), \qquad f:\Omega\to\RR,$$

where $$\Omega$$ is the set of possible inputs (think real numbers, integers and categories), $$\xx$$ is an object (just a vector if $$\Omega$$ consists of only real numbers) describing the inputs and $$f$$ is the function we hope to minimize.  The standard problem is phrased as minimization, though $$f$$ could instead be maximized by minimizing $$-f$$. This optimization problem can be efficiently solved using the Bayesian optimization tools in SigOpt.

The situation changes when $$f:\Omega\to\RR$$ becomes the vector-valued function $$g:\Omega\to\RR^k$$, that is, the function we hope to optimize now maps inputs $$\xx\in\Omega$$ vectors $$g(\xx)\in\RR^k$$. The reason this is a more complicated situation is that an ordering of vectors in $$\RR^k$$ does not exist. Any two numbers $$\alpha,\beta\in\RR$$ must satisfy exactly one of $$\alpha<\beta$$, $$\alpha>\beta$$ or $$\alpha=\beta$$, but such a requirement is not true for vectors $$\uu,\vv\in\RR^k$$; how would one order the vectors

$$\uu=\begin{pmatrix}1\\\\2\\\\3\end{pmatrix}, \qquad \vv=\begin{pmatrix}2\\\\1\\\\3\end{pmatrix}, \qquad \ww=\begin{pmatrix}3\\\\2\\\\1\end{pmatrix}?$$

The inability to order these vectors produces ambiguity in any attempt to minimize $$g$$: if $$\uu<\vv$$ is ill-defined, how could one ever declare that $$\uu=g(\xx)$$ is the minimum value?  To read in depth about problems of this variety and their solution, see this text or this text.

## Multicriteria Example

We can return now to the problem posed earlier about simultaneously trying to minimize transit time and cost while driving on a freeway. In that setting, there is some function $$g$$ of the form

$$g(\text{speed}) = \begin{pmatrix}\text{time to destination} \\ \text{cost of trip}\end{pmatrix} \qquad\longleftrightarrow\qquad g(\xx) = \uu = \begin{pmatrix}u_1 \\ u_2\end{pmatrix}.$$

What does $$g$$ look like? We do not know — that is the nature of a black box problem. We likely know certain attributes of $$g$$, such as a decrease in speed probably causes an increase in time to destination, but the exact relationship may be difficult to write down. Experimentation has been conducted on this problem to empirically estimate $$g$$ (see here or here).  We use the 1997 data on “Fuel Economy by Speed” hosted at Oak Ridge to generate the figure below. Figure 1: (left) Mileage data from 1997 reported in Table 4.29 of this report and a model used to make predictions. (right) Relationship between the cost and time required to make a trip as determined below. Using the data from that report, we can create an approximation with which we can predict mileage at speeds that have not yet been observed.

While looking at this red curve, it is apparent that there are some results which could not possibly be considered acceptable minima: the speed of 10 miles/hour requires 10 hours and would cost more than $$\14$$ and both of those values can be improved upon by a better choice of speed. Such points, where all components of the objective vector $$\uu$$ are greater than different observed results $$\vv$$, are called dominated.  Any results which are not dominated are considered Pareto optimal (also called Pareto efficient), and the entire set of such points form the Pareto frontier, which is displayed for this travel problem in the figure below. Figure 2: A zoomed-in version of the right half of Figure 1, where the Pareto frontier (manifold of Pareto optimal results) is colored in blue. All other possible outcomes are dominated by at least one point on the Pareto frontier. The two stars represent two speeds which have the same cost; the yellow star’s travel time is greater than the blue star, thus it is dominated.

The blue portion of the curve in this figure contains points which are all considered Pareto optimal, which implies that further decreasing one of the components of the objective would cause an increase in the other.  Because we have limited our speed to 75 miles/hour our required time cannot approach 0 (which should happen as we drive increasingly fast2 which is why the curve ends where it does.  Orange represents speeds which are inferior; the two stars have equal cost but unequal time.

It should be clear from Figure 2 why speeds that are not Pareto optimal cannot be considered solutions to the multicriteria optimization problem: there is little reason to drive 38.5 miles/hour and arrive in 2.6 hours (orange star) when one could drive 61.2 miles/hour and arrive one hour earlier at the same cost.  For this example, the interface between the orange and blue sections represents the minimum cost to reach the destination; any attempt to decrease the time of transit would result in an increase in the cost of the trip.  How then can we decide which of the results on the blue curve are “the solution”?

In truth, there is no single way to decide which of the speeds that represent the blue curve are the solution.  Driving 52.4 miles/hour represents the minimum cost, driving 75 miles/hour represents the minimum time, and any speed in between could be considered optimal depending on personal preference.  The tradeoffs between these two criteria can either be managed by some supervisory decision maker (the driver of the car in this example) or by merging the multiple criteria into some single criteria and phrasing the problem as a standard optimization problem.  Either way, once this Pareto frontier has been identified (or, more likely, approximated), no further refinement of the solution the multicriteria optimization is possible.

These types of tradeoffs appear in the real world all the time, as shown in this popular comic strip among graduate students; specifically, the image below depicts (in an entirely non-rigorous way) multiple factors competing for the lunch dollars of a graduate student.  The point below denoted “Optimal Mediocrity” is dependent on the degree to which “Gross”, “Effort” and “Cost” are valued by the decision maker.  It may, in fact, be very common for students to more greatly value “Cost” and prefer the Instant Noodles.

## Linear scalarization

Each research field has its own domain experts that can serve as decision makers, and as such it is difficult to discuss the decision maker strategy in generalities. The method of forming a scalar objective function $$f:\Omega\to\RR$$ from the vector objective function $$g:\Omega\to\RR^k$$ is more universal, so we restrict our discussion to this linear scalarization methodology.  Scalarization (discussed nicely in chapter 2 of this text) is the process by which some transition function $$T:\RR^k\to\RR$$ is applied to the multiple objectives from $$g$$ to rephrase the multicriteria problem as

$$\xopt = \argmin_{\xx\in\Omega} T(g(\xx)).$$

Unfortunately, all this does is transition the burden of decision making between the components of $$g$$ from some expert decision maker to the function $$T$$; choosing an appropriate function $$T$$ is important because a poor choice can result in solutions to this scalar optimization problem which are not Pareto optimal in the multicriteria problem.

A variety of scalarization strategies have been developed for different circumstances, but the simplest among them is a weighted-sum strategy involving a nonnegative weighted average of the components of $$g$$ for weights /(\gamma_1,\ldots,\gamma_k>0\) such that $$\gamma_1+\ldots+\gamma_k=1$$, we define

$$T_\ggamma(g(\xx)) = T_\ggamma\begin{pmatrix}g_1(\xx)\\\\\vdots\\\\g_k(\xx)\end{pmatrix} = \gamma_1 g_1(\xx) + \ldots + \gamma_k g_k(\xx).$$

One benefit of this strategy is that any solution to this problem must be Pareto optimal.3  Another benefit is its easy interpretability: larger $$\gamma$$ values correspond to components which are more important to minimize. Within the context of an industrial process, it would probably still take an expert to choose an appropriate balance between these $$\gamma$$ values, but this strategy provides a strategy to apply certain preferences and the benefit of Pareto optimality.

Referring back to the car driving problem,$$\gamma_1$$ and $$\gamma_2$$ values could be chosen to scalarize the vector-valued objective? Because $$\gamma_1+\gamma_2=1$$, we actually need only choose one parameter, which we will just call $$\gamma$$, to create the scalar objective function

$$T_\gamma(g(\text{speed})) = \gamma (\text{time to destination}) + (1-\gamma) (\text{cost of trip}).$$

So, what impact does the choice of $$\gamma$$ have on the objective function? The figure below shows the $$T_\gamma(g)$$ objective for several choices of $$\gamma$$, as well as the optimal speed as a function of $$\gamma$$. Figure 4(left) The scalarized form of the objective function for various $$\gamma$$ values.  (right) As explained by the equation above, smaller $$\gamma$$ values place more emphasis on fuel efficiency and larger values prefer a shorter duration. Because we assume no one will violate traffic laws4, choosing $$\gamma>.91$$ will always return 75 miles/hour.

Choosing $$\gamma$$ large states a preference towards less time, and choosing $$\gamma$$ small states a preference towards less cost. The actual relationship, as depicted in the right half of Figure 4, is complicated by the fact that the components of the objective $$g$$ do not have the same units and on the Pareto frontier they are not on the same scale (the topic of scale is mentioned in this post): cost varies between $$\9-\14$$ whereas time varies between 1-2 hours. From a physical sense, even adding values with different units is inconsistent, though we skirt this issue by stating the $$T_\gamma(g)$$ optimization problem in a fully mathematical (read unphysical) setting. Given that, the role that $$\gamma$$ plays is very much a function of those units, which is why there is a 10 miles/hour range for $$\gamma\in[0, .8]$$ and another 10 miles/hour range for $$\gamma\in[.8, .9]$$. Problems phrased in this way can be readily optimized by tools like SigOpt.

In a perfect world, all the components of a problem would be on exactly the same scale, allowing for an easier prediction of the role that $$\gamma$$ plays. Below is a figure which shows the same optimal solution relationship, but for slightly different versions of the $$g$$ function. We can clearly see that rescaled versions of these functions have a significant impact on the resulting optimum for a given $$\gamma$$. On the left, we see, perhaps, a more consistent transition as $$\gamma$$ increases; this is the result of rescaling the objective components to $$[0,1]$$, which is only possible for this simple problem because we know the exact values of the Pareto frontier. On the right, we see an extreme transition period, caused by converting the currency of the gasoline purchase to Chinese renminbi (multiplying $$g_2$$ by 6.67 RMB/USD). Figure 5: (left) Objective values from g are scaled to [0,1]. (right) g2 is multiplied by 6.67.

From a purely mathematical perspective, it does not matter what rescaling, if any, is done to this problem: for all the values on the Pareto frontier, there is at least one $$\gamma$$ for which it is the solution of the scalarized problem.5 From a practical standpoint, it can be significant, though, because some expert must make a decision regarding the appropriate $$\gamma$$, and that decision is much easier in the left half of Figure 5 (where a small change in $$\gamma$$ has a small change in optimal speed) than in the right half of Figure 5 (where the impact of a small change in $$\gamma$$ has an unpredictable impact on the optimal speed).

## $$\epsilon$$-constraint scalarization

In some circumstances, it may be the case that one component of $$g$$, say the $$i$$th component $$g_i$$, is more important than the rest; as long as the other $$k-1$$ components are “satisfactory” only the $$g_i$$ component needs to be minimized. Mathematically, this becomes a constrained scalar optimization problem,

\begin{align*} \xopt = \argmin_{\xx\in\cX} g_i(\xx), \quad \newline\mbox{ subject to }\quad g_1(\xx)\leq\epsilon_1,\;\ldots,\;g_{i-1}(\xx)\leq\epsilon_{i-1},\; g_{i+1}(\xx)\leq\epsilon_{i+1},\;\ldots,\;g_k(\xx)\leq\epsilon_k. \end{align*}

This notation is somewhat dense, but can be interpreted more easily in the context of the example above; suppose the driver has only $$11$$ available and must reach the destination at a cost below that. In that setting, $$\epsilon_2=11$$ and the multicriteria optimization problem becomes the scalar problem

\begin{align*} \min_{0\leq\text{speed}\leq75}\; \text{time to destination}(\text{speed}), \qquad\newline\mbox{ subject to }\qquad \text{cost of trip}(\text{speed})\leq 11. \end{align*}

Graphically this can be seen in the figure below. Problems phrased in this way can be optimized by tools like SigOpt by taking advantage of reporting “failure” observations when criteria are not met. Figure 6: (left) If we constrain the cost to less than \$11 and concern ourselves only with minimizing the time to destination, the red star represents the solution. (right) If, instead, we simply must arrive in less than 90 minutes and want to incur the minimum cost, the solution is the red star.

## Conclusion

Optimization is what we do at SigOpt, but not all optimization problems are created equally. Multicriteria optimization problems, which involve the simultaneous optimization of multiple objective functions, require additional specification to solve them uniquely. This post shows an example of such a problem and walks through a strategy for converting it into a standard problem which can be efficiently optimized using the Bayesian optimization tools within SigOpt. Stay tuned for more updates on using SigOpt to tune these problems and others as we continue to expand the capabilities of our optimization platform.

## Special thanks

I would like to extend my gratitude to Devon Sigler, a top-tier mathematician and multicriteria expert at the University of Colorado Denver, for his help refining some of the descriptions in this post and clearing up my own understanding of the implications of nonconvexity in weighted sum scalarization.

3. The requirement that the $$\gamma$$ values be strictly greater than zero is needed to guarantee Pareto optimality. If, instead, only weak Pareto optimality is required, it is possible to allow some $$\gamma$$ values to be 0. Return
5. This is only true if all objective functions are convex and the feasible region is convex. If this is not the case there can be points on the Pareto frontier that no value of gamma will find. This is one of the primary drawbacks for the weighted sum scalarization method. This is, however, not a problem for the $$\epsilon$$-constraint method --- there are $$\epsilon$$ choices that can be used to find each point on the Pareto frontier. Return 