
# Automatically Tuning Text Classifiers


In this first post on integrating SigOpt with machine learning frameworks, we’ll show you how to use SigOpt and scikit-learn to train and tune a model for text sentiment classification in under 50 lines of Python. We show that SigOpt outperforms both strategies scikit-learn recommends for hyperparameter optimization on this task. We’ll also walk through the two fundamental aspects of this (and any) SigOpt experiment: the objective metric and the set of tunable parameters.

Text classification problems appear quite often in modern information systems, and you might imagine building a small document/tweet/blogpost classifier for any number of purposes. In this example, the classification task is to label Amazon product reviews[1] as either favorable or not. The objective is to find a classifier that is accurate in its predictions, but also one that gives us confidence it will generalize to data it hasn’t been trained on. We employ the Swiss army knife of machine learning, logistic regression (LR), as our model in this experiment. While the LR model might be conceptually simple and implemented in many statistics and machine learning software packages, valuable engineering time and resources are often wasted experimenting with feature representation and parameter tuning via trial and error. SigOpt can automatically and intelligently optimize your objective, letting you get back to working on other tasks.

## Show Me The Code

If you’d like to try running this example, sign up for a free SigOpt trial and find your newly created credentials on your profile page. To quickly set up an environment to run this example, we recommend using a Mac OS X machine (with brew) or an Ubuntu machine with at least 4GB of RAM. Once you have a suitable machine (t2.mediums in EC2 are cheap and work great), this script should be all you need to get started:

```sh
# on Mac OS X or Ubuntu machine
git clone https://github.com/sigopt/sigopt-examples.git
cd sigopt-examples/text-classifier/
sudo ./setup_env.sh
# insert your client_token into sigopt_creds.py
# if not you'll see "This endpoint requires an authenticated user" errors
nohup python sentiment_classifier.py &
```

Without further ado, here is the short Python snippet to train and tune the LR sentiment classifier (you can also find it here):

```python
import json, math, numpy
import sigopt.interface
from sigopt_creds import client_token
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn import cross_validation

# POSITIVE_TEXT and NEGATIVE_TEXT (lists of review strings) must be loaded
# before the experiment loop runs; this excerpt does not show that step.

# optimization metric : see blogpost https://blog.sigopt.com/post/133089144983/sigopt-for-ml-automatically-tuning-text
def sentiment_metric(POS_TEXT, NEG_TEXT, params):
    # n-gram range and document-frequency range for the vocabulary
    min_ngram = params['min_n_gram']
    max_ngram = min_ngram + params['n_gram_offset']
    min_doc_freq = math.exp(params['log_min_df'])
    max_doc_freq = min_doc_freq + params['df_offset']
    vectorizer = CountVectorizer(min_df=min_doc_freq, max_df=max_doc_freq,
                                 ngram_range=(min_ngram, max_ngram))
    X = vectorizer.fit_transform(POS_TEXT+NEG_TEXT)
    y = [1]*len(POS_TEXT) + [-1]*len(NEG_TEXT)
    # logistic regression with an elastic net penalty, trained by SGD
    clf = SGDClassifier(loss='log', penalty='elasticnet',
                        alpha=math.exp(params['log_reg_coef']),
                        l1_ratio=params['l1_coef'])
    # 5 random 70/30 train/validation splits
    cv = cross_validation.ShuffleSplit(X.shape[0], n_iter=5, test_size=0.3,
                                       random_state=0)
    cv_scores = cross_validation.cross_val_score(clf, X, y, cv=cv)
    return numpy.mean(cv_scores)

conn = sigopt.interface.Connection(client_token=client_token)
experiment = conn.experiments().create(
    name='Sentiment LR Classifier',
    parameters=[
        { 'name':'l1_coef', 'type': 'double',
          'bounds': { 'min': 0, 'max': 1.0 }},
        { 'name':'log_reg_coef', 'type': 'double',
          'bounds': { 'min': math.log(0.000001), 'max': math.log(100.0) }},
        { 'name':'min_n_gram', 'type': 'int',
          'bounds': { 'min': 1, 'max': 2 }},
        { 'name':'n_gram_offset','type': 'int',
          'bounds': { 'min': 0, 'max': 2 }},
        { 'name':'log_min_df', 'type': 'double',
          'bounds': { 'min': math.log(0.00000001), 'max': math.log(0.1) }},
        { 'name':'df_offset', 'type': 'double',
          'bounds': { 'min': 0.01, 'max': 0.25 }}
    ],
)
# run experimentation loop
for _ in range(60):
    suggestion = conn.experiments(experiment.id).suggestions().create()
    opt_metric = sentiment_metric(POSITIVE_TEXT, NEGATIVE_TEXT, suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        value=opt_metric,
    )
# track progress on your experiment : https://sigopt.com/experiment/list
```

Behold the power of abstraction! It’s pretty amazing that we can build and tune a model in so few lines of code. Note, however, that this experiment takes about 40 min to complete (and costs about \$0.052 if you use a t2.medium), so running it in a background session on a remote machine lets you safely start it and get back to working on other things. You can periodically check in on the progress using the SigOpt experiment dashboard.

## Objective Metric: $$f(\lambda)$$

SigOpt finds parameter configurations that maximize any metric, so we need to pick one that is appropriate for this classification task. We’ll use $$f(\lambda)$$ to denote our objective metric function and $$\lambda$$ to represent the set of tunable parameters, which we discuss in the following section. In designing our objective metric, accuracy (the fraction of reviews classified correctly) is obviously important, but we also want assurance that our model generalizes and performs well on data on which it was not trained. This is where the idea of cross-validation comes into play.

Cross-validation requires us to split our entire labeled dataset $$\mathcal{D}$$ into two distinct sets: one to train on, $$\mathcal{D}_{\text{train}}$$, and one to validate our trained classifier on, $$\mathcal{D}_{\text{valid}}$$. We then consider metrics like accuracy on only the validation set. Taking this further and considering not one but many possible splits of the labeled data is the idea of k-fold cross-validation, where multiple training and validation sets are generated and the validation metrics are aggregated (e.g., mean, min, max) to give a single estimate of performance.

In this case, we’ll use the mean of the k-fold cross-validation accuracies[2]. Here $$k = 5$$ folds are used (n_iter=5 in the code above), and the train and validation sets are split randomly using 70% and 30% of the entire dataset, respectively.

$$\mathcal{L}(\pmb{\lambda}, \mathcal{D}_{\text{t}}, \mathcal{D}_{\text{v}}) = \text{acc. of LR}(\pmb{\lambda}, \mathcal{D}_{\text{t}}) \text{ on } \mathcal{D}_{\text{v}}$$

$$f(\pmb{\lambda} ) = \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}(\pmb{\lambda}, \mathcal{D}^{(i)}_{\text{train}}, \mathcal{D}^{(i)}_{\text{valid}})$$

This objective metric $$f(\pmb{\lambda})$$ takes on values in the range [0, 1.0], where 0 represents a misclassification of every example in all validation folds and 1.0 represents perfect classification on all validation folds. The higher the cross-validation metric, the better our classifier is doing.
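To make the definition of the objective concrete, here is a minimal pure-Python sketch of the same idea: several random 70/30 splits, one accuracy per split, and the mean of those accuracies as the score. The helper names (`shuffle_split`, `objective`, `fit_majority`) and the toy data are assumptions made for illustration; the experiment itself relies on scikit-learn's cross-validation utilities and a logistic regression model.

```python
import random

def shuffle_split(n_samples, n_splits=5, test_size=0.3, seed=0):
    """Yield (train_idx, valid_idx) pairs from independent random shuffles."""
    rng = random.Random(seed)
    n_valid = int(round(test_size * n_samples))
    for _ in range(n_splits):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield idx[n_valid:], idx[:n_valid]

def accuracy(predict, X, y, idx):
    """Fraction of the examples in idx that predict() labels correctly."""
    return sum(predict(X[i]) == y[i] for i in idx) / len(idx)

def objective(fit, X, y, n_splits=5):
    """Mean validation accuracy over the random splits."""
    scores = []
    for train_idx, valid_idx in shuffle_split(len(X), n_splits):
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(accuracy(predict, X, y, valid_idx))
    return sum(scores) / len(scores)

# Hypothetical stand-in for the LR model: always predict the majority label.
def fit_majority(X_train, y_train):
    majority = 1 if sum(y_train) >= 0 else -1
    return lambda x: majority

X_toy = list(range(10))   # placeholder inputs, ignored by the toy model
y_toy = [1] * 10          # every label is +1, so the majority rule is always right
score = objective(fit_majority, X_toy, y_toy)
```

Swapping `fit_majority` for a function that trains a real classifier on the training indices would give the kind of score the experiment reports to SigOpt.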

Using many folds might not be practical if training takes a very long time (you might have to settle for only 1 or 2 folds); indeed, SigOpt is the perfect tool for tuning these time-consuming and expensive-to-evaluate functions!

## Tunable Parameters: λ

The objective metric is controlled by a set of parameters that potentially influence its performance. Parameters can be defined on integer, continuous, or categorical domains. The parameters used in this experiment can be split into two groups: those governing the feature representation of the review text (min_n_gram, n_gram_offset, log_min_df, df_offset) and those governing the cost function of logistic regression (log_reg_coef, l1_coef). We explain these parameters in more detail below.

## Feature Representation Parameters

The CountVectorizer class in scikit-learn is a convenient mechanism for transforming a corpus of text documents into vectors using a bag-of-words (BOW) representation. Scikit-learn offers quite a bit of control in determining which n-grams make up the vocabulary for your BOW vectors. As a quick refresher, n-grams are sequences of n consecutive text tokens: the phrase “hi diddly ho”, for example, contains the 1-grams “hi”, “diddly”, and “ho”, the 2-grams “hi_diddly” and “diddly_ho”, and the single 3-gram “hi_diddly_ho”.
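As a rough sketch of the n-gram idea, the snippet below extracts 1-, 2-, and 3-grams from a whitespace-tokenized sentence. The `ngrams` helper and the sample sentence are illustrative assumptions; CountVectorizer applies its own tokenization rules, so its output can differ in detail.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this product works great".split()
unigrams = ngrams(tokens, 1)   # ['this', 'product', 'works', 'great']
bigrams = ngrams(tokens, 2)    # ['this_product', 'product_works', 'works_great']
trigrams = ngrams(tokens, 3)   # ['this_product_works', 'product_works_great']
```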

The number of times each n-gram appears in a given piece of text is then encoded in the BOW vector describing that text. CountVectorizer allows you to control the range of n-grams that are included in the vocabulary (min_n_gram and n_gram_offset in our experiment), as well as filtering n-grams outside a specified document-frequency range (log_min_df and df_offset in our experiment). For example, if a rare 3-gram like “hi_diddly_ho” doesn’t appear with at least min_df frequency in the corpus, it is not included in the vocabulary. Similarly, n-grams that occur in nearly every document (1-grams like “the”, “a”, etc.) can also be filtered using the max_df parameter. Often when the range of the parameter is very large or very small, it makes sense to look at the parameter on the log scale, as we do with the log_min_df parameter.
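The following pure-Python sketch mimics the vocabulary filtering described above: it maps the four feature parameters to an n-gram range and a document-frequency range (as proportions of documents), keeps only the n-grams inside that range, and counts them in a BOW vector. The function names and the four-sentence corpus are made up for illustration; this is a simplification of what CountVectorizer does, not its implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def doc_ngrams(doc, ngram_range):
    """All n-grams of a document for every n in the inclusive range."""
    tokens = doc.lower().split()
    return [g for n in range(ngram_range[0], ngram_range[1] + 1)
            for g in ngrams(tokens, n)]

def build_vocabulary(docs, ngram_range, min_df, max_df):
    """Keep n-grams whose document frequency (a proportion) is in [min_df, max_df]."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc_ngrams(doc, ngram_range)))  # once per document
    return sorted(g for g, c in doc_freq.items()
                  if min_df <= c / len(docs) <= max_df)

def bow_vector(doc, vocabulary, ngram_range):
    """Count how often each vocabulary n-gram occurs in the document."""
    counts = Counter(doc_ngrams(doc, ngram_range))
    return [counts[g] for g in vocabulary]

# Example assignments, chosen inside the bounds of the experiment above.
params = {"min_n_gram": 1, "n_gram_offset": 1,
          "log_min_df": math.log(0.1), "df_offset": 0.25}
ngram_range = (params["min_n_gram"],
               params["min_n_gram"] + params["n_gram_offset"])
min_df = math.exp(params["log_min_df"])   # back from the log scale, about 0.1
max_df = min_df + params["df_offset"]     # about 0.35

docs = ["the battery is great", "the screen is bad",
        "the case is great", "the price is fine"]
vocabulary = build_vocabulary(docs, ngram_range, min_df, max_df)
vector = bow_vector(docs[0], vocabulary, ngram_range)
```

With these settings, "the" (present in every document) and "great" (present in half of them) fall above max_df and are dropped, while n-grams that occur in a single document are kept.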

Which n-gram vocabulary performs best on this classification problem?  Including these vocabulary parameters in our experiment gives SigOpt the power to explore this important question automatically.

## Logistic Regression Error Cost Parameters

Using the SGDClassifier class in scikit-learn, we can succinctly formulate and solve the logistic regression learning problem. The error function for two-class logistic regression is defined in the following way:

$$E(\pmb{\theta}) = \frac{1}{M}\sum_{i=1}^{M}\log\left( 1.0 + e^{-y_i(\pmb{\theta}^T\mathbf{x_i}) }\right ) + \alpha\left( \frac{1-\rho}{2}\|\pmb{\theta}\|_2^{2} + \rho\|\pmb{\theta}\|_1\right)$$
where

M – number of training examples
θ – vector of weights the algorithm will learn for each n-gram in the vocabulary
$$y_i$$ – training data label: {-1, 1} for our two-class problem
$$\mathbf{x_i}$$ – training data input vector: the BOW vectors described in the previous section
α – the weight of the regularization term (log_reg_coef in our experiment, passed to the classifier as math.exp(log_reg_coef))
ρ – the weight of the l1 norm term (l1_coef in our experiment)

The first term of the cost function penalizes weights that do not fit the training data, while the second term penalizes model complexity (how far the feature weights are from zero). scikit-learn performs stochastic gradient descent on this error function with respect to the weights in an attempt to find those that minimize it. Check out another one of our blog posts where we discuss in more detail the tradeoff between fidelity and regularization.
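The error function above can be evaluated directly for a given weight vector. The sketch below is a plain-Python transcription of the formula (mean logistic loss plus the elastic net penalty); the function name and the toy inputs are assumptions for illustration, and it only evaluates the error rather than minimizing it as SGDClassifier does.

```python
import math

def elastic_net_logistic_error(theta, X, y, alpha, rho):
    """E(theta): mean logistic loss plus the elastic net penalty."""
    M = len(X)
    data_term = sum(
        math.log(1.0 + math.exp(-y_i * sum(t * x for t, x in zip(theta, x_i))))
        for x_i, y_i in zip(X, y)
    ) / M
    l2_squared = sum(t * t for t in theta)   # squared l2 norm of the weights
    l1 = sum(abs(t) for t in theta)          # l1 norm of the weights
    return data_term + alpha * ((1.0 - rho) / 2.0 * l2_squared + rho * l1)

# Toy BOW-style inputs with labels in {-1, 1}.
X_toy = [[1, 0], [0, 1]]
y_toy = [1, -1]

# With all weights at zero, every example contributes log(2) and the penalty vanishes.
error_at_zero = elastic_net_logistic_error([0.0, 0.0], X_toy, y_toy, alpha=0.1, rho=0.5)

# Weights that fit the labels lower the data term but pay a regularization cost.
error_fitted = elastic_net_logistic_error([2.0, -2.0], X_toy, y_toy, alpha=0.1, rho=0.5)
```

Setting rho to 0 leaves only the l2 term and setting it to 1 leaves only the l1 term, which is the mixture that l1_coef controls in the experiment.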

Should we use l1 or l2 regularization, or perhaps a weighted mixture? How much should the entire regularization term be weighted? With this error formulation, and the α and ρ parameters exposed to our experiment, SigOpt can quickly find the answers to these important questions.

## Optimization Performance

SigOpt offers one solution to the hyperparameter optimization problem; however, there are other existing techniques. In particular, random search and grid search are two commonly employed strategies. Random search, as you might guess, simply selects parameter configurations at random, while grid search sweeps through a selected subset of the parameter space.
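For comparison, random search over the same six-parameter domain can be sketched in a few lines. The bounds below are copied from the experiment definition earlier in the post; the `random_search` helper and the toy objective are assumptions for illustration, standing in for the much more expensive cross-validated accuracy.

```python
import math
import random

# (type, min, max) for each parameter, copied from the experiment definition.
BOUNDS = {
    "l1_coef": ("double", 0.0, 1.0),
    "log_reg_coef": ("double", math.log(0.000001), math.log(100.0)),
    "min_n_gram": ("int", 1, 2),
    "n_gram_offset": ("int", 0, 2),
    "log_min_df": ("double", math.log(0.00000001), math.log(0.1)),
    "df_offset": ("double", 0.01, 0.25),
}

def random_configuration(rng):
    """Draw every parameter independently and uniformly from its bounds."""
    return {
        name: rng.randint(lo, hi) if kind == "int" else rng.uniform(lo, hi)
        for name, (kind, lo, hi) in BOUNDS.items()
    }

def random_search(objective, n_evals=60, seed=0):
    """Evaluate n_evals random configurations and keep the best one."""
    rng = random.Random(seed)
    best_value, best_config = -math.inf, None
    for _ in range(n_evals):
        config = random_configuration(rng)
        value = objective(config)
        if value > best_value:
            best_value, best_config = value, config
    return best_value, best_config

# Hypothetical cheap objective used only to exercise the loop.
def toy_objective(config):
    return -abs(config["l1_coef"] - 0.5)

best_value, best_config = random_search(toy_objective)
```

Unlike SigOpt's suggestions, each draw here ignores the outcomes of earlier evaluations.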

How should we evaluate the performance of these alternative optimization strategies?  One criterion that makes sense is to consider the best found (max) value of the objective metric.  Better performing strategies will find better configurations over the duration of their search.  Due to the stochastic nature of these systems, however, we must consider the variation in our best-found measurements over several runs to make fair comparisons.
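The "best found" criterion is simply a running maximum over the sequence of observed metric values. A small sketch, using arbitrary placeholder numbers rather than results from this experiment:

```python
from itertools import accumulate

def best_seen(values):
    """Running maximum: the best objective value found after each evaluation."""
    return list(accumulate(values, max))

# Placeholder values for illustration only, not measurements.
trace = best_seen([0.2, 0.5, 0.4, 0.9])   # [0.2, 0.5, 0.5, 0.9]
```

Comparing the last entry of such traces across repeated runs of each strategy is one way to account for run-to-run variation.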

To ground our discussion, we also report the performance when no hyperparameter optimization is performed, and we simply take the default values for CountVectorizer and SGDClassifier as provided by scikit-learn. For grid search, we consider 64 evenly spaced parameter configurations (order shuffled randomly) across our domain and analyze the best seen after 60 evaluations to be consistent with our limit on the total number of evaluations for this experiment. An exhaustive grid search is usually prohibitive because the number of possible configurations grows exponentially. For example, if we considered 10 configurations for each parameter in a problem like ours with 6 parameters, we would have 1 million ($$10^6$$) total joint configurations to evaluate.
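One way to arrive at 64 evenly spaced configurations is to take two values per parameter, since $$2^6 = 64$$; this is an assumption about how the grid could be built, not a description of the exact grid used for the reported results. The sketch below constructs such a grid from the experiment's bounds, shuffles it, keeps the first 60 configurations, and also computes the size of the exhaustive 10-values-per-parameter grid.

```python
import itertools
import math
import random

# (type, min, max) for each parameter, copied from the experiment definition.
BOUNDS = {
    "l1_coef": ("double", 0.0, 1.0),
    "log_reg_coef": ("double", math.log(0.000001), math.log(100.0)),
    "min_n_gram": ("int", 1, 2),
    "n_gram_offset": ("int", 0, 2),
    "log_min_df": ("double", math.log(0.00000001), math.log(0.1)),
    "df_offset": ("double", 0.01, 0.25),
}

def evenly_spaced(lo, hi, n, as_int=False):
    """n evenly spaced values from lo to hi inclusive (requires n >= 2)."""
    values = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    return [int(round(v)) for v in values] if as_int else values

def build_grid(points_per_param):
    """Cartesian product of evenly spaced values along every parameter axis."""
    axes = [evenly_spaced(lo, hi, points_per_param, kind == "int")
            for kind, lo, hi in BOUNDS.values()]
    return [dict(zip(BOUNDS, combo)) for combo in itertools.product(*axes)]

grid = build_grid(2)               # 2 ** 6 = 64 configurations
random.Random(0).shuffle(grid)     # randomize the evaluation order
first_60 = grid[:60]               # stay within the 60-evaluation budget

exhaustive_size = 10 ** len(BOUNDS)   # 10 values per parameter: 1,000,000
```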

Results averaged over 20 runs, each run consisting of 60 function evaluations.

These are promising results! SigOpt finds the best configuration with statistical significance over the other two approaches (p = 0.0001, using the unpaired Mann-Whitney U test) and improves the performance as compared to the baseline by 5.72%.

## Closing Remarks

This short example scratches the surface of the types of ML-related experiments one could conduct using SigOpt. For example, SGDClassifier has lots of variations from which to select; another experiment might be to treat the loss function as a categorical variable. What sort of models are you building that could benefit from better experimentation or optimization? Stay tuned for more posts in our series on integrating SigOpt with various ML frameworks to solve real problems more efficiently!
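As a hedged sketch of that follow-up experiment, the loss function could be added as one more entry in the parameters list. The `categorical_values` field follows our understanding of SigOpt's parameter format and should be checked against the current API documentation. The loss names are SGDClassifier options from the scikit-learn version used in this post (newer releases rename `log` to `log_loss`), and `classifier_kwargs` is a hypothetical helper.

```python
import math

# Assumed shape of a categorical parameter definition; verify against the SigOpt docs.
loss_parameter = {
    "name": "loss",
    "type": "categorical",
    "categorical_values": [
        {"name": "log"},
        {"name": "hinge"},
        {"name": "modified_huber"},
    ],
}

def classifier_kwargs(assignments):
    """Translate suggested assignments into SGDClassifier keyword arguments."""
    return {
        "loss": assignments["loss"],
        "penalty": "elasticnet",
        "alpha": math.exp(assignments["log_reg_coef"]),
        "l1_ratio": assignments["l1_coef"],
    }

# Example assignments such as a suggestion might contain (values are made up).
example_kwargs = classifier_kwargs(
    {"loss": "hinge", "log_reg_coef": 0.0, "l1_coef": 0.5}
)
```

The resulting dictionary could then be unpacked into the classifier constructor inside the metric function, in place of the hard-coded loss.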