 $\def\kk{\boldsymbol{k}} \def\uu{\boldsymbol{u}} \def\vv{\boldsymbol{v}} \def\ww{\boldsymbol{w}} \def\xx{\boldsymbol{x}} \def\xopt{\xx_{\text{opt}}} \def\yy{\boldsymbol{y}} \def\zz{\boldsymbol{z}} \def\cD{\mathcal{D}} \def\cX{\mathcal{X}} \def\lmle{L_{\text{MLE}}} \def\lmple{L_{\text{MPLE}}} \def\lkv{L_{\text{KV}}} \def\lclv{L_{\text{CLV}}} \def\ppa{\frac{\partial}{\partial\a}} \DeclareMathOperator*{\argmin}{argmin} \def\gg{\mathbf{g}} \def\xx{\mathbf{x}} \def\yy{\mathbf{y}} \def\mD{\mathbf{D}} \def\mG{\mathbf{G}} \def\mW{\mathbf{W}} \def\mX{\mathbf{X}} \def\mY{\mathbf{Y}} \def\sgn{\mathrm{sgn}} \def\RR{\mathbb{R}} \def\cF{\mathcal{F}} \def\cN{\mathcal{N}} \def\cO{\mathcal{O}} \def\cS{\mathcal{S}} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\me}{e}$

# Unsupervised Learning with Even Less Supervision Using Bayesian Optimization

In this post on integrating SigOpt with machine learning frameworks, we will show you how to use SigOpt and XGBoost to efficiently optimize an unsupervised learning algorithm’s hyperparameters to increase performance on a classification task.

As we previously discussed, fully supervised learning algorithms require each data point to have an associated class or output. In practice, however, it is often the case that relatively few labels are available during training time and labels are costly or time-consuming to acquire. For example, it might be a very slow and expensive process for a group of experts to manually investigate and classify thousands of credit card transaction records as fraudulent or legitimate. A better strategy might be to study the large collection of transaction data without labels, building a representation that better captures the variations in the transaction data automatically.

## Unsupervised Learning

Unsupervised learning algorithms are designed with the hope of capturing some useful latent structure in data. These techniques can often enable dramatic gains in performance on subsequent supervised learning tasks, without requiring more labels from experts. In this post, we will use an unsupervised method on an image recognition task posed by researchers at Stanford1 where we try to recognize house numbers from images collected using Google street view (SVHN). This is a more challenging problem than MNIST (another popular digit recognition dataset) as the appearance of each house number varies quite a bit and the images are often cluttered with neighboring digits: Figure 1: 32×32 cropped samples from the classification task of the SVHN dataset. Each sample is assigned only a single digit label (0 to 9) corresponding to the center digit.2

In this example, we assume access to a large collection of unlabelled images $$X_u$$ where the correct answer is not known and a relatively small amount of labeled data $$(X_s, y)$$ for which the true digit in each image is known (often requiring a non-trivial amount of time and money to collect). Our hope is to find a suitable unsupervised model, built using our large collection of unlabelled images, that transforms images into a more useful representation for our classification task.

Unsupervised and supervised learning algorithms are typically governed by small sets of hyperparameters $$(\lambda_u, \lambda_s)$$ that control algorithm behavior. In our example pipeline below, $$X_u$$ is used to build the unsupervised model $$f_u$$ which is then used to transform the labeled data $$(X_s, y)$$ before the supervised model $$f_s$$ is trained. Our task is to efficiently search for good hyperparameter configurations $$(\lambda_u, \lambda_s)$$ for both the unsupervised and supervised algorithms. SigOpt minimizes the classification error $$E(\lambda_u, \lambda_s)$$ by sequentially generating suggestions for the hyperparameters of the model $$(\lambda_u, \lambda_s)$$. For each suggested hyperparameter configuration a new unsupervised data representation is formed and fed into the supervised model. The observed classification error is reported and the process repeats, converging on the set of hyperparameters that minimizes the classification error. Figure 2: Process for coupled unsupervised and supervised model tuning

Data scientists use their domain knowledge to select appropriate unsupervised and supervised models, but the task of selecting the best parameters for these models is often daunting and non-intuitive. Even simple unsupervised models—such as the whitening strategy we discuss below—often introduce tunable parameters, leaving potentially good models on the table simply because a winning configuration was never found.

SigOpt offers Bayesian optimization as a service, capable of efficiently searching through the joint variations $$(\lambda_u, \lambda_s)$$ of both the supervised and unsupervised aspects of machine learning systems (Figure 2.) This allows experts to unlock the power of unsupervised strategies with the assurance that each model is reaching its full potential automatically. The gains achieved by methods like SigOpt are additive with feature engineering, allowing for better results and faster iteration with less trial and error.

## Show Me The Code

The source code for this experiment and setup script are available here. SigOpt can quickly identify the optimal configuration of these complicated models, much faster than traditional methods such as grid search and random search, especially when more than 2 or 3 hyperparameters are at play. Finding optima more quickly can also drastically save on computational resource costs (e.g. AWS instances) while still achieving comparable or better model performance.

## Unsupervised Model

We start with the initial features describing the data: raw pixel intensities for each image. The goal of the unsupervised model is to transform the data from its original representation to a new (more useful) learned representation without using labeled data. Specifically, you can think of this unsupervised model as a function $$f : \mathbb{R}^N \rightarrow \mathbb{R}^J$$. Where $$N$$ is the number of features in our original representation and $$J$$ is the number of features in the learned representation. In practice, expanded representations (sometimes referred to as a feature map) where $$J$$ is much larger than $$N$$ often work well for improving performance on classification tasks.3

### Image Transform Parameters ($$s$$, $$w$$, $$K$$)

A simple but surprisingly effective transformation for small images was proposed in a paper by Coates where image patches are transformed into distances to (\K\)  learned centroids (average patches) using the k-means algorithm, and then pooled together to form a final feature representation as outlined in the figure below: Figure 3: Feature extraction using a w-by-w receptive field and stride s. w-by-w patches separated by s pixels each, then map them to K-dimensional feature vectors to form a new image representation. These vectors are then pooled over 4 quadrants of the image to form classifier feature vector.

In this example we are working with the 32×32 (n=32) converted gray-scale (d=1) images of the SVHN dataset. We allow SigOpt to vary the stride length ( $$s$$ ) and patch width ( $$w$$ ) parameters. The figure above illustrates a pooling strategy that considers quadrants in the 2×2 grid of the transformed image representation, summing them to get the final transformed vector. We used the suggested resolution in and kept pool_res fixed at 2. $$f(x)$$ represents a $$K$$ dimensional vector that encodes the distances to the $$K$$ learned centroids, and $$f_i(x)$$ refers to the distance of instance $$x$$ to centroid $$i$$. In this experiment, $$K$$ is also a tunable parameter. The final feature representation of each image will have $$J = K * \text{pool_res}^2$$ features.

### Whitening Transform Parameter ( $$\epsilon_{\text{zca}}$$ )

Before generating the image patch centroids and any subsequent patch comparisons to these centroids, we apply a whitening transform to each patch. When dealing with image data, whitening is a common preprocessing transform which removes the correlation between all pairs of individual pixels4. Intuitively, it can be thought of as a transformation that highlights contrast in images. It has been shown to be helpful in image recognition tasks, and may also be useful for other feature data. The figure below shows several example image patches before and after the whitening transform is applied. Figure 4: Comparison of image patches before and after whitening5

The whitening transformation we use is known as ZCA whitening6. This transform is achieved by cleverly applying the eigendecomposition of the covariance matrix estimate to a mean adjusted version of the data matrix, so that the expected covariance of the data matrix becomes the identity. A regularization term $$\epsilon_{\text{zca}}$$  is added to the diagonal eigenvalue matrix, and  $$\epsilon_{\text{zca}}$$ is exposed as a tunable parameter to SigOpt. # fit cov estimate to data matrix X (n x m, n samples, m feats)
cov = LedoitWolf().fit(X)
D, U = numpy.linalg.eigh(cov.covariance_)
V = numpy.sqrt(numpy.linalg.inv(numpy.diag(D+eps_zca)))
Wh = numpy.dot(numpy.dot(U,V),U.T) # ZCA whitening transform matrix
mu = numpy.mean(X, axis=0)
X_zca = numpy.dot(X-mu, Wh)


### Centroid Distance Sparsity Parameter ( $$\text{sparse_p}$$ )

Each whitened patch in the image is transformed by considering the distances to the learned $$K$$ centroids. To control this sparsity of the representation we report only distances that are below a certain percentile, $$\text{sparse_p}$$, when considering the pairwise distances between the current patch and the centroids. Intuitively this acts as a threshold which allows for only the “close” centroids to be active in our representation.

# compute distances between patches and all centroids
Z = k_means.transform(img_ptchs)
tau = numpy.percentile(Z, sparse_p, axis=1, keepdims=True)
Z = numpy.maximum(0, tau - Z)

The figure below illustrates the idea with a simplified example. A whitened image patch (in the upper right) is compared against the 4 learned centroids after k-means clustering. Here, let’s imagine we have set the percentile threshold to 50, so only the distances in the lower half of all centroid distances persist in the final representation, the others are zeroed out Figure 5: Sparsity transform; distances from a patch to centroids above 50th percentile are set to 0

While the convolutional aspects of this unsupervised model are tailored to image data, the general approach of transforming feature data into a representation that reflects distances to learned archetypes seems suitable for other datasets and feature spaces7.

## Supervised Model

With the learned representation of our data, we now seek to maximize performance on our classification task using a smaller labeled dataset. While random forests are an excellent, and simple, classification tool, better performance can typically be achieved by using carefully tuned ensembles of boosted classification trees.

### Gradient Boosting Parameters ( $$\gamma, \theta, M$$ )

We consider the popular library XGBoost as our gradient boosting implementation. Gradient boosting is a generic boosting algorithm that incrementally builds an additive model of base learners, which are themselves simpler classification or regression models. Gradient boosting works by building a new model at each iteration that best reconstructs the gradient of the loss function with respect to the previous ensemble model. In this way, it can be seen as a sort of functional gradient descent and is outlined in more detail below. In the pseudocode below we outline building an ensemble of regression trees, but the same method can be used with a classification loss function $$L$$ Algorithm 1: Pseudocode for supervised gradient boosting using regression trees as base learners

Important parameters governing the gradient boosting algorithm include $$M$$, the number of base learner models in the ensemble, and $$\gamma$$ the learning rate, which controls the relative contribution of each new base learner in the final additive model. Each base learner is also governed by it’s own set of parameters $$\theta$$. Here we consider classification trees as our base learners, governed by a familiar set of parameters managing tree growth and regularization (e.g., max depth, sub_sample). We expose these parameters to SigOpt to optimize simultaneously with the parameters governing the unsupervised transformation discussed previously.

## Classification Performance

To compare model performance, we use accuracy, which can be understood as a measure of the probability of correctly classifying for each image in the test set. For example, a model that correctly recognizes the house numbers for 91% of the images would result in an accuracy score of 0.91.

We compare the ability of SigOpt to find the best hyperparameter configuration to the industry standard methods of random search, which usually outperforms grid search and manual search8 and a baseline of using an untuned model.

Because the underlying methods used are inherently stochastic we performed 10 independent hyperparameter optimizations using both SigOpt and random search for both the purely supervised and combined models. Hyperparameter optimization was performed on the accuracy estimate from an 80/20 cross-validation fold of the training data (73k examples). The ‘extra’ set associated with the SVHN dataset (530K examples) was used to simulate the unlabelled data $$X_u$$ in the unsupervised parts of this example.

For the unsupervised model 90 sequential configuration evaluations (~50 CPU hrs) were used for both SigOpt and random search. For the purely supervised model 40 sequential configuration evaluations (~8 CPU hrs) were used for both SigOpt and random search. In practice, SigOpt is usually able to find good hyperparameter configurations with a number of evaluations equal to 10 times the number of parameters being tuned (9 for the combined model, 4 for the purely supervised model). The same parameters and domains were used for XGBoost in both the unsupervised and purely supervised settings. As a baseline, the holdout accuracy of an untuned scikit-learn random forest using the raw pixel intensity features.

After hyperparameter optimization was completed for each method we compared accuracy using a completely held out dataset (SHVN test set, 26k examples) using the best configuration found in the tuning phase. The holdout dataset was run 10 times for each best hyperparameter configuration for each method, the mean of these runs is reported in the table below. SigOpt outperforms random search with a p-value of 0.0008 using the unpaired Mann-Whitney U test. Table 1: Comparison of model accuracy on held out (test) dataset after different tuning strategies

The chart below tracks the optimization path of SigOpt vs random search optimization strategies when tuning the unsupervised model (Unsup Feats) and only the supervised model (Raw Feats). We plot the interquartile range of the best-seen cross-validated accuracy score on the training set at each objective evaluation during the optimization. As mentioned above, 90 evaluations were used in the optimization of the unsupervised model and 40 in the supervised setting. SigOpt outperforms random search in both settings on this training data (p-value 0.005 using the same Mann-Whitney U test as before). Figure 6: Optimization traces of CV accuracy using SigOpt and random search

## Closing Remarks

Unsupervised learning algorithms can be a powerful tool for boosting the performance of your supervised models when labeling is an expensive or slow process. Tuning automatically brings each model to its full potential. SigOpt was built to help with this non-intuitive task. As this example demonstrates, careful parameter tuning can enable engineering and data science teams to better leverage their unlabelled data and build more predictive data products in less time and lower cost. Sign up for a free evaluation today and get the most from your models!

SigOpt effectively optimizes machine learning models across a variety of datasets and algorithms. See our other examples: