$\def\xx{\mathbf{x}} \def\yy{\mathbf{y}} \def\zz{\mathbf{z}} \def\mD{\mathbf{D}} \def\mG{\mathbf{G}} \def\mW{\mathbf{W}} \def\mX{\mathbf{X}} \def\mY{\mathbf{Y}} \def\RR{\mathbb{R}} \def\cF{\mathcal{F}} \def\cN{\mathcal{N}} \def\cO{\mathcal{O}} \def\cS{\mathcal{S}} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\me}{e}$

# Highlight: A Nonstationary Designer Space-Time Kernel

Our research team at SigOpt has been very fortunate to collaborate with outstanding researchers around the world, including through our academic and internship programs.  In our Highlight blogs, we take the opportunity to feature our work with these collaborators.  This post introduces the article *A Nonstationary Designer Space-Time Kernel* by Michael McCourt, Gregory Fasshauer, and David Kozak, which will appear at the upcoming NeurIPS 2018 spatiotemporal modeling workshop.

Kernel methods are popular in spatial statistics, where Gaussian processes/random fields have been a standard modeling tool for many years.  Many widely used machine learning tools also fit under the umbrella of kernel methods: Gaussian processes, support vector machines, regularization networks, and kriging all use a positive definite kernel as the underlying machinery powering the model (more information can be found in this book, this book, or this book, as a last resort).

In our workshop paper, we focus on better modeling the time component of space-time phenomena.  The goal is to design a model which incorporates data differently depending on when that data was observed; as a result, the model can behave differently at the initial time than later in the analysis.  We do this by developing a new covariance kernel: it better leverages data from today to predict outcomes tomorrow, and it has a number of free parameters well suited to modeling in this time setting.

We enforce an orientation of the kernel, so that it has a definite starting time but no definite end, by making the kernel nonstationary.  A stationary kernel (of which the radial kernels are a special case) is one whose covariance depends only on the separation between locations; for such a kernel, data at day 1 must imply the same thing about day 2 as data at day 101 implies about day 102.  In contrast, we want the covariance near the starting time to differ from the covariance at the later times at which we make predictions. The figure below shows how this can happen.

Figure 1: Kernels centered at locations spread in time between $$t=0$$ and $$t=25$$.  Left: two radial Gaussian kernels with different length scales.  Right: two examples of our new kernel with different parametrizations.

Our new kernel is defined using Mercer’s theorem, which states that (under mild regularity conditions) a covariance kernel can be written in the form

$$K(\mathbf{x},\mathbf{z}) = \sum_{i=1}^\infty \lambda_i \varphi_i(\mathbf{x})\varphi_i(\mathbf{z}),$$

for eigenvalues $$\lambda_i$$ and eigenfunctions $$\varphi_i$$.  These play a role for functions similar to that of the eigenvalues and eigenvectors of linear algebra.  By choosing the eigenfunctions effectively, we can create the kernel in the figure above. Designing these eigenfunctions to suit a specific purpose, such as modeling in time, is why we use the term designer kernel.
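To make the Mercer construction concrete, here is a minimal numerical sketch (not the construction from our paper) using a decomposition whose closed form is known: the Brownian bridge kernel on $$[0,1]$$, with eigenvalues $$\lambda_i = 1/(i\pi)^2$$ and eigenfunctions $$\varphi_i(x) = \sqrt{2}\sin(i\pi x)$$.  Truncating the infinite sum recovers the closed form $$\min(x,z) - xz$$:

```python
import numpy as np

def mercer_kernel(x, z, n_terms=2000):
    """Truncated Mercer sum for the Brownian bridge kernel on [0, 1].

    Uses eigenvalues 1/(i*pi)^2 and eigenfunctions sqrt(2)*sin(i*pi*x);
    the infinite sum converges to min(x, z) - x*z.
    """
    i = np.arange(1, n_terms + 1)
    lam = 1.0 / (i * np.pi) ** 2
    return np.sum(lam * (np.sqrt(2) * np.sin(i * np.pi * x))
                      * (np.sqrt(2) * np.sin(i * np.pi * z)))
```

For example, `mercer_kernel(0.3, 0.7)` is close to $$\min(0.3, 0.7) - 0.3\cdot 0.7 = 0.09$$, with the truncation error shrinking as more terms are added.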

To save the reader a great deal of tedium, the details of the actual construction are relegated to the paper (or, the more aesthetically pleasing poster).  The resulting kernel is defined as

$$K(t, s) = \frac{\Gamma(\alpha+1)}{(1-2\delta)^{\alpha+1}} \left(ts\omega\right)^{-\alpha/2} \me^{-(t+s)\left(\delta+\frac{\omega}{1-\omega}\right)} I_{\alpha }\left({\frac {2{\sqrt {ts\omega}}}{1-\omega}}\right),$$

where $$\Gamma$$ is the gamma function and $$I_\alpha$$ is the modified Bessel function of the first kind.  The values $$\alpha$$, $$\delta$$ and $$\omega$$ are free parameters (also called hyperparameters) of this new kernel.  Each parameter has its own admissible range, and the complicated interactions between them produce the wide range of kernels indicated in the figure above.
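For the curious, the closed form above is straightforward to evaluate numerically.  The sketch below is our own illustrative implementation, not code from the paper; we assume hyperparameter ranges $$\alpha > -1$$, $$0 \le \delta < 1/2$$, and $$0 < \omega < 1$$ so that each factor is well defined:

```python
import numpy as np
from scipy.special import gamma, iv  # gamma function; modified Bessel I_alpha

def designer_kernel(t, s, alpha=0.5, delta=0.1, omega=0.5):
    """Evaluate K(t, s) for t, s > 0, following the closed form above.

    Assumes alpha > -1, 0 <= delta < 1/2, and 0 < omega < 1.
    """
    prefactor = gamma(alpha + 1) / (1 - 2 * delta) ** (alpha + 1)
    return (prefactor
            * (t * s * omega) ** (-alpha / 2)
            * np.exp(-(t + s) * (delta + omega / (1 - omega)))
            * iv(alpha, 2 * np.sqrt(t * s * omega) / (1 - omega)))
```

Note that the kernel is symmetric in its arguments, as any covariance kernel must be, and evaluates to a positive number for $$t, s > 0$$.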

The freedom and nonstationarity afforded by this kernel can have an impact on modeling.  The figure below shows a simple example of how a stationary kernel lacks an understanding of the initial time, and how the new kernel can account for this.

Figure 2: Data is provided near the starting time, and predictions are made at future times. Left: the predictions from a standard stationary Gaussian kernel stagnate far from the observed data, which is why the behavior is essentially constant far from the start.  Right: the new kernel's behavior continues to evolve away from the origin, even very far from any observed data.

The figure below shows an example situation for which this kernel was designed: the modeling of temperature data.  Here, we used a product kernel, with stationary behavior in space and nonstationary behavior in time.  We modeled the GEFS Reforecast dataset; the training data consisted of a week of daily surface temperature measurements across the western United States at 1°×1° resolution.  We then made predictions at those same locations on the eighth day.
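A minimal sketch of such a product kernel (our own illustration, not the experiment code): a stationary Gaussian kernel on the spatial coordinates multiplied by the nonstationary kernel above in time, again assuming $$\alpha > -1$$, $$0 \le \delta < 1/2$$, and $$0 < \omega < 1$$.  The length scale `ell` and the hyperparameter defaults are placeholders, not values from the paper:

```python
import numpy as np
from scipy.special import gamma, iv

def time_kernel(t, s, alpha=0.5, delta=0.1, omega=0.5):
    """Nonstationary kernel in time, per the closed form above."""
    prefactor = gamma(alpha + 1) / (1 - 2 * delta) ** (alpha + 1)
    return (prefactor
            * (t * s * omega) ** (-alpha / 2)
            * np.exp(-(t + s) * (delta + omega / (1 - omega)))
            * iv(alpha, 2 * np.sqrt(t * s * omega) / (1 - omega)))

def space_kernel(x1, x2, ell=1.0):
    """Stationary Gaussian (radial) kernel on spatial coordinates."""
    d2 = np.sum((np.asarray(x1, float) - np.asarray(x2, float)) ** 2)
    return np.exp(-d2 / (2 * ell ** 2))

def spacetime_kernel(x1, t1, x2, t2):
    """Product kernel: stationary in space, nonstationary in time."""
    return space_kernel(x1, x2) * time_kernel(t1, t2)
```

The product of two positive definite kernels is again positive definite, so `spacetime_kernel` is a valid covariance kernel for Gaussian process modeling of the space-time data.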

Figure 3: An example of temperature predictions over the Western US, using the new kernel to define covariance in time.

We hope that this light introduction to the topic has encouraged you to read our workshop paper and, if you are in Montreal, to visit the NeurIPS 2018 spatiotemporal modeling workshop to see our poster and chat with us!