Winning on Wall Street: Tuning Trading Models with Bayesian Optimization

Michael McCourt

SigOpt helps customers build better machine learning and financial models by giving them a path to efficiently maximize the key metrics that define their success. In this post we demonstrate the relevance of model tuning on a basic prediction strategy for investing in bond futures.

The First Piece: A Linear Model

Many factors can be used to build a financial trading model of an asset’s price/value, but there are generally three agreed-upon components of greatest significance: correlations with other assets, the trend of the asset we wish to model, and external economic factors such as government reports. Our goal in this post is to create a model of the US 10 year bond future contract price \(Y_{10}\) using only correlation data with the 2 year \(Y_2\) and 5 year \(Y_5\) bond future contract prices; these prices are recorded at the end of each trading day \(t\) and we hope to make predictions about the next trading day \(t+1\). There are a number of other factors we do not consider here that may be relevant to the contract price, but our goal initially is to create as simple a model as possible and demonstrate the benefit that can be gained by tuning it.

To that end, we start with the absolute simplest model possible: the linear model

\(\hat{Y}_{10}(t+1) = a + bY_2(t) + cY_5(t);\)

note that the use of the hat on \(\hat{Y}_{10}\) reminds us that it is a prediction and not the true value.  The numbers \(a\), \(b\), and \(c\) are constants which define the degree to which \(Y_2\) and \(Y_5\) influence our predictions. This model should be interpreted as “Tomorrow’s 10 year contract price prediction \(\hat{Y}_{10}(t+1)\) is a (weighted) summation of today’s 2 year \(Y_2(t)\) and 5 year \(Y_5(t)\) prices, plus a constant term which serves as a base”.

We can fit this model (using a least squares solver to determine the best \(a\), \(b\), and \(c\) values) to historical data from 01/05/2009 – 03/21/2016.[1] It turns out that even this simple model is somewhat capable of accurate predictions thanks to the strong correlation between contract prices. The figure below shows the outcome of this linear model.

Figure 1: (left) Predictions from the linear model for tomorrow’s 10 year bond contract price given today’s 2 and 5 year prices are represented by the transparent blue plane.  (right) Predictions for the US 10 year price given Germany’s and Britain’s. The circles represent actual data which is colored to represent distance from the plane (blue is closest, red is furthest).

The left component of this figure shows a clear relationship between the input values \(Y_2\) and \(Y_5\) and the output \(Y_{10}\), as can be recognized graphically by the proximity of the observed results (the circles) to the prediction (the blue plane). The right component presents a contrast by instead trying to predict \(Y_{10}\) using Germany’s and Britain’s 10 year bond future contract prices; as we can see, there is a much greater distance from the circles to the plane, suggesting a much lower predictive capacity. Beyond this qualitative analysis, there are rigorous statistical tests available for quantifying how well or poorly the predictions match the data.
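The least squares fit described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code behind the figures: the function name `fit_linear_model` and the synthetic data are assumptions made here, and \(R^2\) is used as one simple example of quantifying how well the predictions match the data.

```python
import numpy as np

def fit_linear_model(y2, y5, y10):
    """Least squares fit of Y10(t+1) ~ a + b*Y2(t) + c*Y5(t).

    y2, y5, y10 are equal-length 1D sequences of end-of-day values.
    Returns the coefficients (a, b, c) and the R^2 of the fit.
    """
    y2, y5, y10 = (np.asarray(v, dtype=float) for v in (y2, y5, y10))
    # Pair today's inputs (days 0..T-2) with tomorrow's target (days 1..T-1).
    X = np.column_stack([np.ones(len(y2) - 1), y2[:-1], y5[:-1]])
    target = y10[1:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residual = target - X @ coef
    total = np.sum((target - target.mean()) ** 2)
    r_squared = 1.0 - (residual @ residual) / total
    return coef, r_squared

# Synthetic (made-up) data constructed to follow the linear model exactly,
# used only to exercise the function; these are not real contract values.
rng = np.random.default_rng(0)
y2 = 100.0 + rng.normal(size=200)
y5 = 110.0 + rng.normal(size=200)
y10 = np.empty(200)
y10[0] = 120.0
y10[1:] = 5.0 + 0.3 * y2[:-1] + 0.7 * y5[:-1]

coef, r_squared = fit_linear_model(y2, y5, y10)
```

On real data the fit will not be exact, and an \(R^2\) well below 1 for a pair of inputs would be one sign of the weaker relationship seen in the right panel.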

The Structure of the Trading Model

Given that we have identified the 2-year and 5-year contract prices as having a strong correlation to the 10-year price, we want to build a model that can use the knowledge of this correlation to profitably buy or sell 10-year contracts. Our strategy for doing so centers around a standard concept: our model should change over time.[2] We choose to fit the model using only the most recent \(m\) days, using that information to determine the \(a\), \(b\), and \(c\) values. We then consider all of the most recent \(n\) days of contract prices (rather than only today) to make a prediction for tomorrow. We also introduce a decay factor, \(\lambda\), to increase the impact of more recent data. This changes our model in two ways:

  • the values \(a\), \(b\), and \(c\) which represent the correlations are replaced with \(a(t;m)\), \(b(t;m)\) and \(c(t;m)\) because they can change over time, and
  • the values \(Y_2(t)\) and \(Y_5(t)\) are replaced with \(\bar{Y}_2(t;\lambda,n)\) and \(\bar{Y}_5(t;\lambda,n)\) to denote the inclusion of data from previous days.

This changes the model from its simple form above to the similar but more complicated form:

\(\hat{Y}_{10}(t+1) = a(t;m) +b(t;m)\bar{Y}_2(t;\lambda,n) + c(t;m)\bar{Y}_5(t;\lambda,n).\)

\(\bar{Y}_2(t;\lambda,n)\) and \(\bar{Y}_5(t;\lambda,n)\) will determine how the information from earlier days plays a role in forming the prediction \(\hat{Y}_{10}(t+1)\). Earlier data already plays an indirect role in establishing the coefficients \(a(t;m)\), \(b(t;m)\), and \(c(t;m)\) but we are trying to also provide a direct path to asserting influence.

There are many ways to construct \(\bar{Y}_2(t;\lambda,n)\) and \(\bar{Y}_5(t;\lambda,n)\); for this post we use the simple strategy of a weighted average of earlier values with an exponentially decaying weight as we walk further back in time:

\(\bar{Y}_2(t;\lambda,n)=\frac{\sum_{i=0}^n Y_2(t-i)e^{-\lambda i}}{\sum_{i=0}^n e^{-\lambda i}}.\)

As the decay factor \(\lambda\geq 0\) grows, the recent history is much more heavily weighted. If \(\lambda=0\) we recover an unweighted average with all of the most recent \(n\) values treated as equally important. The coefficients \(a(t;m)\), \(b(t;m)\), and \(c(t;m)\) which measure the correlations from the previous \(m\) days are still determined using a least squares fit.
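The decayed average can be written directly from its definition, with weight \(e^{-\lambda i}\) on the value \(i\) days back. The sketch below is illustrative only; the function name `decayed_average` and the sample values are assumptions made for this example.

```python
import math

def decayed_average(values, t, lam, n):
    """Exponentially weighted average of values[t], values[t-1], ..., values[t-n].

    The value i days back receives weight exp(-lam * i), so lam = 0 gives
    a plain unweighted average and larger lam favors recent days.
    """
    if t - n < 0:
        raise ValueError("need at least n days of history before day t")
    weights = [math.exp(-lam * i) for i in range(n + 1)]
    weighted_sum = sum(w * values[t - i] for i, w in enumerate(weights))
    return weighted_sum / sum(weights)

# Made-up values, with the most recent day last.
prices = [1.0, 2.0, 3.0, 4.0]
flat = decayed_average(prices, t=3, lam=0.0, n=3)     # unweighted average
recent = decayed_average(prices, t=3, lam=50.0, n=3)  # dominated by today's value
```

With \(\lambda=0\) the result is the ordinary mean of the window, while a very large \(\lambda\) effectively returns today's value alone.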

Trading Decisions Utilizing this Trading Model

The \(\hat{Y}_{10}(t+1)\) predictions use the knowledge of the past to make predictions about the future. For the same reason that we limit the influence of the past to the most recent \(m\) days, we limit the viability of our coefficients \(a(t;m)\), \(b(t;m)\), and \(c(t;m)\) to only the nearest \(p\) days in the future. This prevents us from trying to make predictions using stale data, but also requires the recomputation of the coefficients every \(p\) days. The figure below shows the role of these \(n\), \(m\) and \(p\) values.

Figure 2: This timeline shows how the different regions of time are defined.
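One way to picture the interplay of \(m\) and \(p\) is as a refitting schedule over day indices. The sketch below is one possible reading of the timeline, not a description of the original implementation: the function name `refit_schedule` is hypothetical, and it assumes that coefficients fit at the end of day \(r\) use days \(r-m+1,\ldots,r\) and are then used for the predictions made on days \(r,\ldots,r+p-1\).

```python
def refit_schedule(first_day, last_day, m, p):
    """List the refits needed to make predictions on days first_day..last_day.

    Each entry is (fit_days, prediction_days): coefficients fit on the m days
    ending at day r are reused for the predictions made on the next p days,
    starting with day r itself (an assumption of this sketch).
    """
    schedule = []
    for r in range(first_day, last_day + 1, p):
        fit_days = range(r - m + 1, r + 1)  # the m most recent days, including r
        prediction_days = range(r, min(r + p, last_day + 1))
        schedule.append((fit_days, prediction_days))
    return schedule

# Example with small made-up values: refits happen on days 10, 14 and 18.
schedule = refit_schedule(first_day=10, last_day=20, m=5, p=4)
```

A smaller \(p\) means fresher coefficients at the cost of more frequent least squares fits.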

At the end of day \(t\), we know the true value of \(Y_{10}(t)\) and can use the model to predict the value \(\hat{Y}_{10}(t+1)\). If \(\hat{Y}_{10}(t+1)>Y_{10}(t)\), we should consider a long position, and if the predicted value is less than today’s value we prefer a short position; we assume all positions are closed at the end of the trading day. We introduce a trading threshold \(\sigma\geq0\) so that no position is taken on day \(t+1\) unless \(|\hat{Y}_{10}(t+1) -Y_{10}(t)|/ \hat{Y}_{10}(t+1)>\sigma\). This prevents our trading strategy from reacting when \(\hat{Y}_{10}(t+1)\approx Y_{10}(t)\), that is, from making trades based on predictions of incredibly small change (small enough that no change at all may fall within the predictive interval).
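The trading rule above amounts to a three-way decision. As a minimal sketch, with the function name `trading_signal` and the encoding of positions as +1, -1 and 0 chosen here for illustration:

```python
def trading_signal(predicted_tomorrow, actual_today, sigma):
    """Return +1 (long), -1 (short) or 0 (no position) for day t+1.

    No position is taken unless the predicted relative change exceeds
    the trading threshold sigma.
    """
    relative_change = abs(predicted_tomorrow - actual_today) / predicted_tomorrow
    if relative_change <= sigma:
        return 0
    return 1 if predicted_tomorrow > actual_today else -1

# Made-up values: a predicted move of about 1% clears a 0.5% threshold,
# while a predicted move of about 0.1% does not.
go_long = trading_signal(101.0, 100.0, sigma=0.005)
go_short = trading_signal(99.0, 100.0, sigma=0.005)
stay_out = trading_signal(100.1, 100.0, sigma=0.005)
```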

Tuning the Trading Model

While the design of this trading model is logically consistent (tomorrow should look more like today than 67 days ago) it introduces new complexities.  In particular, it leaves open the question of how far in the past the contract values should be studied to determine the model coefficients, and how far in the future we can make predictions without resetting those values.  These choices are not fixed; we are free to use the last 30, 35, 77, or 1254 days to determine the correlations and we are free to make predictions for the next 1, 5, 18, or 543 days.  Moreover, these choices will impact our ability to make effective predictions: looking too far in the past includes outdated data, but failing to look far enough in the past can yield bad correlation estimates.  As such, these free parameters in our model must be chosen in some intelligent way.

To optimally choose these parameters we must define a key metric which serves as a proxy for the quality of the model.  In this setting, we use data from January 2009 to April 2014 as a sort of validation data set, whereby the actual profit per trade is determined using known historical data.  By choosing the free parameters

  • \(n\) – number of past days over which we form \(\bar{Y}_2\) and \(\bar{Y}_5\),
  • \(\lambda\) – decay rate determining the relative impact between subsequent days in forming \(\bar{Y}_2\) and \(\bar{Y}_5\),
  • \(m\) – number of past days used to determine \(a\), \(b\), and \(c\),
  • \(p\) – number of future days for which the \(a\), \(b\), and \(c\) values are considered valid before requiring a refitting, and
  • \(\sigma\) – trading threshold for which we choose not to trade if the predicted difference is too small

to maximize this profit per trade, we can create a trading strategy which, in some way, is designed to maximize the profit.  Of course, the question then becomes “How can we maximize the profit per trade?”
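The metric itself is simple to compute once the daily positions are known. The sketch below makes assumptions that the post does not spell out: a single contract is entered at the end-of-day value \(Y_{10}(t)\) and closed at \(Y_{10}(t+1)\), transaction costs are ignored, and the function name `profit_per_trade` is hypothetical.

```python
def profit_per_trade(signals, y10):
    """Average profit over the days on which a position was taken.

    signals[t] is the position (+1 long, -1 short, 0 none) chosen at the
    end of day t and held over day t+1; y10[t] is the contract value at
    the end of day t. Assumes entry at y10[t] and exit at y10[t+1].
    """
    profits = [
        s * (y10[t + 1] - y10[t])
        for t, s in enumerate(signals[: len(y10) - 1])
        if s != 0
    ]
    return sum(profits) / len(profits) if profits else 0.0

# Made-up contract values: a winning long, a winning short, and a skipped day.
y10 = [100.0, 102.0, 101.0, 101.5]
signals = [1, -1, 0]
metric = profit_per_trade(signals, y10)
```

Days with no position are excluded from the average, so the threshold \(\sigma\) changes both the numerator and the number of trades in the denominator.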

SigOpt provides an efficient optimization tool for exactly these kinds of black box problems where the interaction of \(n\), \(\lambda\), \(m\), \(p\), \(\sigma\) and profit is complicated and unintuitive. Furthermore, the cost of finding the coefficients \(a(t;m)\), \(b(t;m)\), and \(c(t;m)\), while relatively cheap here, could be much more expensive in less simple settings. The ensemble of Bayesian optimization strategies behind SigOpt can efficiently conduct the optimization in these contexts.

The Impact of Tuning on Profitability

Even with a relatively simple design, models must be appropriately tuned in order to behave well in practice.  To demonstrate this point, we analyze the profit per trade for different optimization mechanisms and compare them to the defaults suggested by some experts.  For simplicity, we only allow the purchase or sale of one bond contract, although one could easily imagine a more complicated strategy with larger positions based on the model’s predictions.

One important component when tuning parameters is the domain which they span. Setting these bounds inappropriately can significantly cripple optimization methods like exhaustive search by forcing them to search large regions of parameter space with little benefit. We use the following domains for the parameters in this example: \(n\in[200,400]\), \(\lambda\in[0,4]\), \(m\in[1,20]\), \(p\in[4,20]\), \(\sigma\in[0,.0078]\). The value \(p\) was set to be at least 4 to require some amount of prediction into the future (beyond simply tomorrow), and \(\sigma\) was capped at .0078 to compel some number of trades (.0078 was the 95th percentile of magnitude change over the data set).

The model tuning progress, as a function of the number of parameter configurations tested on the model, is shown below for SigOpt as well as two common, simpler alternatives: random search, and a partial grid search we refer to as a parameter sweep. In a parameter sweep, each parameter is optimized sequentially over a 1D grid while the other parameters are held fixed, cumulatively using the best values found in the previous 1D grid searches.
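The two simpler baselines can be sketched in plain Python. This is an illustrative reading of the descriptions above rather than the code used for the experiments: all names are hypothetical, `toy_objective` is a made-up stand-in for the profit per trade, \(n\), \(m\) and \(p\) are treated as integers, and the sweep is assumed to start from a random configuration. With 10 grid points for each of the 5 parameters, the sweep uses the same budget of 50 evaluations as the random search.

```python
import random

# Search domains from the post; n, m, p are day counts, so treated as integers.
DOMAINS = {"n": (200, 400), "lam": (0.0, 4.0), "m": (1, 20),
           "p": (4, 20), "sigma": (0.0, 0.0078)}
INTEGER_PARAMS = {"n", "m", "p"}

def sample(name, rng):
    """Draw one value uniformly from the domain of the named parameter."""
    lo, hi = DOMAINS[name]
    return rng.randint(lo, hi) if name in INTEGER_PARAMS else rng.uniform(lo, hi)

def grid(name, points):
    """Evenly spaced 1D grid over the domain (rounded for integer parameters)."""
    lo, hi = DOMAINS[name]
    values = [lo + (hi - lo) * k / (points - 1) for k in range(points)]
    return [round(v) for v in values] if name in INTEGER_PARAMS else values

def random_search(objective, budget, rng):
    """Evaluate `budget` uniformly random configurations and keep the best."""
    best_params, best_value = None, float("-inf")
    for _ in range(budget):
        params = {name: sample(name, rng) for name in DOMAINS}
        value = objective(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value

def parameter_sweep(objective, points_per_param, rng):
    """Optimize one parameter at a time over a 1D grid, in random order,
    carrying forward the best configuration observed so far."""
    current = {name: sample(name, rng) for name in DOMAINS}  # assumed random start
    best_value = float("-inf")
    order = list(DOMAINS)
    rng.shuffle(order)
    for name in order:
        for candidate in grid(name, points_per_param):
            trial = dict(current, **{name: candidate})
            value = objective(trial)
            if value > best_value:
                current, best_value = trial, value
    return current, best_value

def toy_objective(params):
    """Made-up stand-in for profit per trade, peaking at an arbitrary point."""
    peak = {"n": 300, "lam": 1.0, "m": 10, "p": 8, "sigma": 0.002}
    return -sum(((params[k] - peak[k]) / (DOMAINS[k][1] - DOMAINS[k][0])) ** 2
                for k in DOMAINS)

rng = random.Random(0)
best_random, value_random = random_search(toy_objective, 50, rng)
best_sweep, value_sweep = parameter_sweep(toy_objective, 10, rng)
```

The toy objective is separable across parameters, which flatters the parameter sweep; a real backtest objective generally is not, and neither baseline uses past evaluations to model where good configurations are likely to lie.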

Figure 3: These curves show the progress of the optimization, with the median and interquartile ranges presented to show a range of possible outcomes[3] over 30 independent optimizations of the same model. SigOpt is more efficient and effective than a random search strategy or a parameter sweep, getting to better parameter values, faster. See this earlier post for a detailed analysis on how to robustly compare optimization strategies.

We find that the Bayesian optimization strategies we employ often tune models effectively within a number of evaluations equal to 10 times the number of parameters, so we use 50 model evaluations to tune these 5 parameters. Using the best parameters observed over the 50 model evaluations, we can backtest our model on the contract prices from April 2014 to April 2016, the data omitted from the model tuning as a “holdout” dataset. The total profit accumulated over that time is on display in the figure below for each of the optimization strategies described above.

Figure 4: These histograms show the distribution of future profits for the models tuned by SigOpt (median=$9523), random search (median=$7781) and a parameter sweep (median=$7977).  The median values are depicted with a black line.


This post shows that even a simple financial model will likely have free parameters and that tuning those free parameters can be a difficult task with serious implications regarding the viability of the model. SigOpt can provide benefits to practitioners who want to see better performance from their models. This also allows for more models to be developed and tested because SigOpt accelerates the necessary tuning process.

There are significant extensions that can be made to the underlying financial model we present here. First, no trend/autoregressive term is present which would likely improve its predictive capacity; this choice was intentional, to limit the model’s complexity, but given the baseline in this post it would be interesting to see the benefit of such a component.  What interests us the most is the introduction of an “Economic News” component which allows the model to account for erratic behavior on days with important reports such as GDP or unemployment rate.  Stay tuned for a follow-up post looking at that problem.



1. The data used in this analysis was extracted from CME Group. The quantities under analysis represent the dollar value of a bond futures contract: this is equivalent to the price multiplied by $1000 for the 5 year and 10 year bonds and $2000 for the 2 year bond. This might also be referred to as the contract value.
2. In futures markets we do not explicitly model the time value of money under the assumption that the clearing house clears trades every day and thus the necessary transfer of margin (from the short position to long or vice versa) accounts for the change in price.
3. An optimization using any of these three methods will be random because certain decisions must be made randomly: the initialization phase for SigOpt, the order of one dimensional optimizations for parameter sweep, or everything for random search. The medians and interquartile ranges provide some insight as to the average behavior.
Michael McCourt Research Engineer
