Building a Better Mousetrap via Multicriteria Bayesian Optimization

Michael McCourt
Advanced Optimization Techniques, All Model Types

In an earlier post, we introduced the topic of multicriteria optimization and some mechanisms for solving multicriteria problems (also called multiobjective problems) using the SigOpt optimization tool; SigOpt provides a RESTful API to solve scalar black-box optimization problems for customers from fields such as industrial design and image processing.  The earlier introduction discussed a somewhat simple problem of choosing the driving speed on a trip subject to concerns about minimizing both the time in transit and the cost of fuel.  In this post, we want to analyze a more complex situation in which the parameters of a given model produce a random output, and our multicriteria problem involves maximizing the mean while minimizing the variance of that random variable.

The standard example in literature of simultaneously considering the mean behavior of a model/process and the variance is in financial portfolio management.  Generally, investors want to maximize expected gain while minimizing risk (which exists in the form of variance) which motivated the development of the Nobel prize-winning theory of mean-variance analysis. The original work and a 2014 retrospective by the same author serve as helpful introductions to the theory of defining (even implicitly) a utility function and, potentially, an associated quadratic (mean-variance) approximation.  Under some conditions (discussed in those references) an expert’s selection of a portfolio from the mean-variance Pareto frontier should approximately optimize the utility function.

Our example today is less practical and more fun1. We start with a mouse that has some ability to intelligently traverse a maze and study how long it takes the mouse to solve a randomly generated maze. Then we try to tune the parameters of the random maze generation algorithm to create mazes which are very difficult for the mouse to solve. This will be phrased as a multicriteria optimization problem, with the goal of maximizing the mean time to solve while minimizing its variance.

Variations on a theme

The motivating goal of mean-variance analysis involves both wanting good performance (high portfolio return) and wanting to avoid bad performance (high risk and potential for loss).  These are certainly related but are not necessarily the same thing.

For example, gamblers who play slot machines are told on the machine exactly how much they will win/lose on average by playing the slot machine. If the slot machine paid out $0.90 for every $1.00 put into it, with no variance whatsoever, I doubt anyone would play2. But, gamblers may be willing to accept a low mean payout in exchange for a high variance because the higher variance implies a greater chance for a million dollar payout. In this setting, the gambler embracing the high likelihood of a bad performance in exchange for a potentially exceptional performance.

Other circumstances where our goal would not be to maximize the mean and minimize variance include many industrial processes: in the formation of a tool such as a nail, the goal is not to maximize the mean length of the nail but rather to minimize deviation from the desired length. This goal was formalized with the development of the Taguchi loss function3, which served as a scalarization of the multicriteria optimization problem involving minimizing both the variance and the deviation from the desired outcome.

Maze solver description

Our maze solver strategy is the very simple wall follower strategy, which will, admittedly, only work on simply connected mazes (no loops, no levels).  Given that, it is quite easy to implement and has predictable (non-random) behavior which helps push the entirety of the randomness to the maze generation algorithm.

For those of you interested, you can see the implementation of the maze construction and solution in our examples Github repo.

The strategy involves the mouse continually following the wall on the right all the way through the maze until reaching the end.  If the mouse can turn right, it does so, otherwise, it moves forward; if both those paths are blocked it attempts to turn left.  If all three directions are blocked, the mouse has reached a dead end and reverses course, continuing to hug the right wall to guide it forward.  There are an outstanding explanation and graphic of this wall follower strategy available here which I will not try to supersede.  The figure below, instead, shows why the wall follower strategy succeeds for simply connected mazes but may fail for mazes which have their exit inside a loop.

‍Figure 1: Two mice moving through slightly different mazes (see if you can spot the difference) using the wall follower strategy. (left) This maze is solvable because all walls are attached to the boundary (right) This maze is not solvable because the solution is next to a wall which is not attached to the boundary; eventually, the mouse simply returns to the start.

Maze generator description

It is quite tempting for myself, as a research-oriented person, to see the problem of designing a maze as an opportunity to spend several days devising diabolical labyrinths under the cover of “developing content marketing”.  I restrained, however, if only because so many great resources are already available on the internet.  In particular, Wikipedia already had available a perfectly acceptable starting point involving a random depth-first search.  Furthermore, the resulting maze structure guarantees that the failure to solve observed in Figure 1 cannot be encountered.

The only shortcoming with the code already available is that there were no tunable parameters.  We needed to introduce a mechanism for making the mazes harder or easier for the mouse to solve; we did this by varying the relative probability of the depth-first search choosing each of the possible directions with which it could move, after eliminating those directions which it has already traversed.  Essentially, at each step of the maze generation, there is a relative probability of punching through a wall.  The figure below shows one such randomly generated maze, which has no preference for any direction.

Figure 2: A sample maze creation process with equal probability of moving in each direction.

By allowing the creation algorithm the possibility of different probabilities, mazes will start to gain a more identifiable structure, as demonstrated in the figure below. The probability of moving left, up, right and down are the parameters that we tune using Bayesian optimization.

‍Figure 3: In this sample maze creation, the depth-first search algorithm is set with relative probabilities of [.2, 1, .2, 1] for [left, up, right, down]. This results in a maze which prefers long vertical corridors.

Problem Structure

At this point, we have a relatively smart mouse with a fixed maze solving strategy and a random maze generation tool with 3 tunable parameters (the four construction direction relative probabilities minus the need for probabilities to sum to4. Our goal is to create a maze which maximizes the average time required to solve the maze while minimizing the standard deviation5 of that time. Basically, how can we make a maze which is consistently difficult?

A precursor to that involves studying the impact randomness has on the mouse’s solution time, and how that random variable changes as a function of the various parameters. The histograms in the figure below would not be practical in a production setting but are provided here to hint at the impact of the parameters on the mouse’s performance. Mouse maze solves times are described in terms of the proportion of total cells in the maze; the minimum possible value is 0 and the max is 26. The mouse always starts at the top left cell and the exit is always at the bottom right cell, for simplicity.

‍Figure 4: Histograms (showing the proportion, not frequency) of 5000 30×30 maze solve times for various maze construction parameterizations of the [left, up, right, down] relative probabilities. (top left) [1, 1, 1, 1]. (top right) [5, 1, 1, 1]. (bottom left) [1, 5, 1, 1]. (bottom right) [1, 1, 5, 1].

As we can see there is a significant, and seemingly unpredictable, impact from changing the parameters.  Maximizing the mean (or conceivably, median) of this distribution while minimizing the variance (or maybe interquartile range) of this mouse’s speed will be non-intuitive.  This black box style of problem is exactly what SigOpt is designed to handle … except for the fact that SigOpt does not (as of August 2016) natively support multicriteria optimization (though, thanks to the support of our investors, we hopefully will soon).

Devising a solution strategy

So, how can we use scalar Bayesian optimization to provide a solution to this problem? Depending on the situation, there may be multiple ways to interpret “solution” for a multicriteria optimization problem:

  • Some scalar function of the multicriteria components could be formed, which is then optimized to define the “solution”.  This would require only a single scalar optimization but would require prior understanding as to how the scalarization balances the mean and variance.
  • An approximation to the Pareto frontier can be formed by solving several scalarized versions of the problem.  A selection from these scalarized solutions can then be made by a domain expert to determine which is most appropriate given the level of risk aversion.  This strategy is the focus of our work today.

As a baseline, the figure below was generated using a 1000 point low discrepancy grid sampling over the 3D parameter space (varying left, up, right construction probabilities while holding down fixed): it shows the spread of mean/variance mouse solve times estimated over 100 randomly generated 30×30 mazes.

Figure 5: 1000 low discrepancy samples, logarithmically spaced in the domain [0.01, 100]3, is used for left, up, right probabilities to construct random mazes. We seek parametrizations that are high on the y-axis and low on the x-axis (compare to portfolio optimization). The red points are Pareto efficient samples.

There are a lot of things that can be said about this figure, and a lot of interesting research questions (e. g., why is there an empty region in the bottom left?), but we will restrain from a thorough investigation in the interests of following a practicable workflow; the amount of work required to create this picture is significantly more than we hope to invest to find a solution.  The remainder of this section will detail our attempts to do this efficiently.The shape of the feasible set indicates that the metrics of interest may not be convex; this presents limitations to the use of weighted sum scalarization in trying to define the Pareto frontier.  Of course, we likely would not know this at the start of the optimization process, so we will proceed as though we believe the component functions to be convex and see how the results turn out.Our first strategy is to still use the weighted sum method (even though it could be troubled) to approximate the Pareto frontier; recall that, if the component functions are convex, then a weighted sum (which sums to 1) will have an optimum on the Pareto frontier.  We will consider 15 weighted sum experiments with weights evenly spaced in [.05, .95].  Mean and variance values for suggested left, up and right turn parameters will be estimated over 100 maze constructions.  The figure below shows the result of such optimizations, as conducted using SigOpt.

‍Figure 6: (left) A sequence of weighted sum scalarizations with 300 total maze generations produces the approximate Pareto frontier in red; all inefficient points are in blue. The yellow points were the final points of a scalar optimization which did not lie on the frontier. (right) A magnified view of Figure 5 which has the randomly searched Pareto frontier in black. 1000 points were needed to produce this approximate frontier, with no consistent gain in quality over the cheaper solution on the left produced using SigOpt.

The mean and variance times computed for a given set of parameters are not unique to the weight in the scalarization; as such, we can reuse information from previous weight values as “initial data” for the next weighted sum experiment7. This allows us to use only 20 function evaluations per scalarized experiment without a substantial loss in quality. Because the mean and variance are not convex functions of the turn parameters, however, some of the optimization solutions are not present on the true Pareto frontier.

To work around any potential lack of convexity in the component functions, we can instead phrase the problem as an epsilon-constrained scalar optimization, similarly to what was discussed in our earlier multicriteria post.  Each of these scalar optimization problems can be phrased as minimizing only the standard deviation, subject to the mean being greater than a pre-chosen constraint.  The image below was created with 10 constraint-based experiments, using mean constraints evenly spaced in [.95, 1.2].

‍Figure 7: (left) Like Figure 6, this depicts the Pareto frontier (red are efficient, blue are inefficient, yellow are the scalar constraint terminations), but here it is approximated using constrained optimization in SigOpt. (right) The same random search approximated frontier from Figure 6, which required 1000 maze generations, rather than the 300 for the SigOpt generated frontier on the left.

Choosing a solution

So, after all of this, it is important to remember that simply forming the Pareto frontier does not solve the practical issue of designing the product: an expert must choose the “best” solution from the approximate frontier. Because we are not maze experts, we simply present some of the possible choices in the figure below and allow the reader to make their own judgment.  In a more practical/industrial setting, the balance imposed by the expert may take the name utility or risk aversion.

‍Figure 8: Sample mazes (starting in the top left, ending in the bottom right) representing points on the Pareto Frontier: the parameters are listed, along with the estimated mean and standard deviation. (top left) [.034, .021, 88.365], mean .068, sd .002; (top right) [51.4, 51.3, 1.385], mean 1.0, sd .009; (bottom left) [.015, 5.85, .0196], mean 1.1, sd .09; (bottom right) [.042, 61.226, .052], mean 1.21, sd .173.

Conclusion

In many settings there is a need to balance performance against robustness: Toyota may want to increase fuel efficiency but it cannot produce vehicles which gain 20% more efficiency at the cost of failing to start 1 out of every 5 times they are used.  This tradeoff between improving average behavior and introducing a greater potential for bad behavior is the crux of this post, which we have addressed in the context of making a better mouse-stumping maze.

We have used SigOpt to identify multiple approximately Pareto efficient maze configurations from which an expert (maze designer, not an expert mouse) could select the maze which best balances difficulty and consistency.  For these multicriteria problems, the need to efficiently conduct black box optimization is amplified because of the need to generate multiple solutions on the Pareto frontier.  Using Bayesian optimization has a multiplicative benefit because each point on the Pareto frontier is generated at less experimental cost than with, e. g., grid search.

Special Thanks

Again, I would like to extend my thanks to Devon Sigler, an expert in multicriteria optimization who will shortly be finishing his Ph.D.  He has helped introduce me to the wide range of literature that discusses multicriteria optimization and have provided valuable insights regarding the theoretical implications of popular solution strategies.

Use SigOpt free. Sign up today.

Footnotes

1. I have been told that my posts are too serious and informative and should be more fun. I hope you find this post to be more fun. Return
2. SigOpt does not advocate for gambling. If you choose to gamble, please gamble responsibly. Gambling at a game where you automatically lose 10% of your bet with no potential to win is an example of irresponsible gambling. Return
3. Normally, this is where I would interject with my own discussion of the competing interests of theoretical statistics/mathematics and practical applications, given the compelling history of Genichi Taguchi, but Wikipedia has already beaten me to it. Kudos, again, to the outstanding contributions on Wikipedia and those of you out there doing the contributing. Return
4. In our experiments, we allow ourselves 3 tunable parameters: the relative probabilities of moving left, up and right. We use the term “relative probability” to mean relative to the probability of moving down, which is always fixed at 1; the requirement that all possible probabilities sum to 1 is why there are only 3 tunable parameters. It is entirely possible (and more natural) to phrase this situation as having 4 free parameters, but in doing so there will be repeated behavior throughout the 4D parameter space. The random mazes generated with (1, 2, 3, 4) would look the same as (2, 4, 6, 8) or (.1, .2, .3, .4). To avoid this degeneracy (where a 4D problem actually only has 3 dimensions of complexity) we fix the final element in the vector to 1. This is not the only way to resolve this (see, e. g., mixture designs) but it is the simplest, in our opinion. Return
5. It is more common to read about minimizing the variance than the standard deviation (as the name mean-variance analysis suggests) but here we deal with the standard deviation because it has the same units as mean. This helps to put both components of the multicriteria problem on the same scale, which can be helpful when identifying a solution. Return
6. Because the mouse follows the wall follower strategy, it can never reach a location more than twice (once coming and once going if that path led to a dead end). That is why the maximum is 2. The minimum is conceivably 0 if the exit and start were the same location, but functionally a number greater than 0 because we place the exit at the other side of the maze. Return
7. Since latter experiments (in the sequence of weighted sum problems) have the benefit of earlier data, they should produce a better solution to the scalarized optimization problem. Although we did not here, it might be wise to run earlier experiments with a few extra observations to provide them with more opportunity for the optimization algorithm to converge. Return
MichaelMcCourt
Michael McCourt Research Engineer