Using Prior Knowledge to Enhance Scikit-Learn Models

Trevor Skelton

Use your modeling knowledge to help the SigOpt optimizer provide you with quality parameters faster than ever. 

Why use Prior Knowledge? 

SigOpt is a black-box optimization tool: it has no access to the data, the underlying model, or the objective function it is optimizing. Users simply tell the optimizer which hyperparameters to tune and the valid bounds for each, then report the metric score to inform the Bayesian optimization. This lends itself to use cases where anonymity is important. But what if modelers want to provide additional information to guide the optimizer toward better parameters faster?

Defining a Prior Distribution is a unique feature that lets you, the modeler, use your prior knowledge to directly influence SigOpt’s optimization process. Normally, you can inject your modeling knowledge into SigOpt only in a limited fashion, by specifying which parameters to tune and the bounds to search within. However, what if you have prior knowledge or an educated guess that a certain value or range of values may perform well? What if you have determined that there is a high likelihood the model will perform poorly with a relatively high parameter value, but not enough of a likelihood for you to completely rule it out (using Parameter Constraints or the max bound to exclude the range altogether)? These cases are perfect for using prior distributions.

What is a Prior Distribution? 

When setting up a SigOpt experiment, you can provide a distribution (currently either normal or beta) over a parameter’s values that the optimizer will treat as describing where good values are likely to be. The prior is scaled, such that a value where the prior density function is twice as high receives twice the weight in the optimizer’s initial belief about where a well-performing value lies. This helps the SigOpt optimizer spend more time on the region of the hyperparameter space most likely to produce good results, and less time ruling out regions unlikely to produce a well-performing model.
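To make the scaling concrete, the sketch below evaluates a normal prior density at two candidate values and compares them. The mean, scale, and candidate values are illustrative assumptions rather than anything taken from SigOpt, and the code only shows the density ratio described above, not how the optimizer consumes it internally.

```python
from statistics import NormalDist

# Hypothetical prior over a parameter: normal with mean 0.5 and scale 0.2.
prior = NormalDist(mu=0.5, sigma=0.2)

def relative_weight(x_a, x_b, dist=prior):
    """Ratio of the prior densities at two candidate values."""
    return dist.pdf(x_a) / dist.pdf(x_b)

# Compare a value at the mean with a value one scale unit away from it.
ratio = relative_weight(0.5, 0.7)
print(f"density ratio: {ratio:.3f}")
```

A ratio above 1 means the first value starts out with proportionally more weight than the second under this prior.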

You must exercise caution when using prior distributions. While good prior distributions can accelerate the SigOpt optimizer’s performance, inaccurate prior distributions can hinder it. Eventually, the optimizer will learn to overcome poor prior distributions based on the feedback it receives from the metric, but this may take time. Note that we also must remove any transformations when using prior distributions.  

Using SigOpt with Scikit-Learn 

Let’s take a look at an example SigOpt experiment using scikit-learn and explore the impact of prior knowledge. The following experiments will utilize the simple California housing dataset available via scikit-learn. This dataset contains several features useful for training a model to predict the median house value for various California districts. These features include median income, median house age, average number of rooms, population, latitude and longitude.

When using scikit-learn on Intel hardware, enable the Intel Extension for Scikit-learn. It accelerates your scikit-learn applications while keeping full conformance with all scikit-learn APIs and algorithms. This free software AI accelerator brings 10-100X acceleration across a variety of applications on Intel hardware.

from sklearnex import patch_sklearn
patch_sklearn()  # apply the patch before importing scikit-learn estimators

First, we’ll create a function to instantiate a model, fit it on the training data, and evaluate it using cross-validation. This function will take as arguments our features, our target (median housing price), and hyperparameters we want to vary using SigOpt’s experimentation. In the body of the function, we utilize the function arguments to instantiate a model and evaluate the model using scikit-learn’s cross_val_score function to score with 5-fold cross-validation. In this example, we’ll train a simple neural network with two hidden layers and return the mean RMSE score over all cross-validation folds. 

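import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

def evaluate_model(X, y, folds=5, alpha=.0001, learning_rate_init=.001, beta_1=.9, beta_2=.999):
    # random_no is the random seed, assumed to be defined elsewhere
    model = MLPRegressor(hidden_layer_sizes=(100, 10), activation='relu', solver='adam', alpha=alpha, batch_size=64,
                         learning_rate_init=learning_rate_init, max_iter=1000, random_state=random_no, beta_1=beta_1, beta_2=beta_2)

    # fit and score with k-fold cross-validation; scikit-learn returns negated RMSE values
    cv_rmse = cross_val_score(model, X, y, scoring='neg_root_mean_squared_error', cv=folds, n_jobs=-1)

    return -np.mean(cv_rmse)  # flip the sign to report a positive mean RMSE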

The second function will execute each run of the SigOpt experiment, calling the evaluate_model function we just created with the suggested hyperparameters from SigOpt. This function takes one argument, a SigOpt RunContext object. From this object, we access the hyperparameter suggestion from the SigOpt optimizer with run.params.[hyperparameter]. We provide these values as arguments for our evaluate_model function, receive a cross-validation RMSE score, and return that score.

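def run_sigopt(run):
    # y_train is the training target, assumed to be defined alongside X_train
    args = dict(X=X_train,
                y=y_train,
                alpha=run.params.alpha,
                learning_rate_init=run.params.learning_rate_init,
                beta_1=run.params.beta_1,
                beta_2=run.params.beta_2)

    rmse = evaluate_model(**args)

    return rmse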
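Before spending any of the experiment budget, it can help to check the attribute-style access to run.params without contacting SigOpt. The sketch below is a dry run under stated assumptions: fake_run is a hypothetical stand-in that mimics only the params attribute of a RunContext, and evaluate_model_stub returns a made-up score instead of training a network.

```python
from types import SimpleNamespace

def evaluate_model_stub(X, y, alpha, learning_rate_init, beta_1, beta_2):
    """Stand-in for evaluate_model: returns a made-up score, no training."""
    return 100 * learning_rate_init + alpha

def run_sigopt_dry(run, X_train=None, y_train=None):
    # Same access pattern as in run_sigopt: run.params.<hyperparameter>
    args = dict(X=X_train, y=y_train,
                alpha=run.params.alpha,
                learning_rate_init=run.params.learning_rate_init,
                beta_1=run.params.beta_1,
                beta_2=run.params.beta_2)
    return evaluate_model_stub(**args)

# Hypothetical stand-in for a SigOpt RunContext; only .params is mimicked.
fake_run = SimpleNamespace(
    params=SimpleNamespace(alpha=1e-4, learning_rate_init=1e-3,
                           beta_1=0.9, beta_2=0.999))

score = run_sigopt_dry(fake_run)
print(score)
```

A misspelled hyperparameter name raises an AttributeError here immediately, rather than partway through a real experiment.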

Next, we need to define our SigOpt experiment, including the optimization metric, parameters to optimize, and number of runs to execute, calling sigopt.create_experiment(). Each parameter to be optimized needs a dictionary with the name of this argument (e.g., learning_rate_init), the type of value, the bounds as a dictionary with a minimum and maximum value, and optionally a transformation to apply. Our budget specifies the number of runs, or models and scores the SigOpt optimizer will iterate through. More information about configuring a SigOpt experiment can be found in the docs here. 

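import sigopt

# parameter names match the keyword arguments of evaluate_model
no_pk_experiment = sigopt.create_experiment(
    name='No Prior Knowledge',
    metrics=[
        dict(name='rmse', strategy='optimize', objective='minimize')
    ],
    parameters=[
        dict(name='alpha', type='double',
             bounds=dict(min=.00001, max=1)),
        dict(name='learning_rate_init', type='double',
             bounds=dict(min=.00001, max=1)),
        dict(name='beta_1', type='double',
             bounds=dict(min=.0001, max=.9999)),
        dict(name='beta_2', type='double',
             bounds=dict(min=.0001, max=.9999)),
    ],
    budget=100)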

Finally, we are ready to run our experimentation loop. Putting it all together: 

for run in no_pk_experiment.loop(): 
    with run: 
        rmse = run_sigopt(run) 
        run.log_metric(name='rmse', value=rmse)

We call run_sigopt with each RunContext object as the argument and report the resulting cross-validation score back to SigOpt using run.log_metric(), so that the optimizer receives feedback to inform future runs. After the experiment loop is complete, we can review the results on the SigOpt experiment dashboard, or use the following code to directly access the parameters used for the best performing SigOpt run:

for run in no_pk_experiment.get_best_runs(): 
    args = dict(run.assignments) #obtain best SigOpt run's parameter values  

Adding in Prior Knowledge 

How might we ascertain prior knowledge in the first place? In our initial SigOpt exploration experiment, we just used SigOpt to tune our neural network model over 100 runs. Since we’re new to the data and model type, we’ve run this experiment without any prior knowledge and with a large scope of parameter bounds. From our SigOpt dashboard, we can access the following plot with RMSE scores for various learning rate initialization values: 

In this experiment we’ve tested learning_rate_init values between .00001 and 1 inclusive, and received varying values of our chosen metric, RMSE, after cross-validation. We can use the results from this SigOpt experiment to form our prior knowledge for future runs and experiments. Based on this plot, we are fairly confident that a smaller learning_rate_init will result in a lower error and a better model, while values between 0.1 and 1 very often produce models that diverge and score poorly. This diverging behavior matches what we would expect from a neural network trained by gradient descent with too high a learning rate.
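One possible way to turn earlier results into numbers, offered as a heuristic sketch and not as the method behind the values in this post, is to fit a mean and scale to the parameter values of the best-scoring runs. The (learning_rate_init, rmse) pairs below are invented for illustration, and normal_prior_from_runs is a hypothetical helper.

```python
from statistics import mean, stdev

# Hypothetical (learning_rate_init, rmse) pairs standing in for earlier runs.
runs = [
    (0.00002, 0.61), (0.00008, 0.58), (0.0005, 0.60), (0.002, 0.66),
    (0.01, 0.95), (0.12, 3.4), (0.45, 9.8), (0.9, 27.0),
]

def normal_prior_from_runs(runs, keep=4):
    """Fit mean/scale to the parameter values of the `keep` best runs."""
    best = sorted(runs, key=lambda r: r[1])[:keep]  # lowest RMSE first
    values = [lr for lr, _ in best]
    return dict(name='normal', mean=mean(values), scale=stdev(values))

prior = normal_prior_from_runs(runs)
print(prior)
```

The keep cutoff is a judgment call: keeping too few runs gives an overconfident prior, while keeping too many pulls the mean toward the diverging region.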

If we were to transform this result into a prior distribution for use in future experiments, we would do so as follows in the Python API when defining our parameters when creating our SigOpt experiment: 

parameters = [
    dict(name='learning_rate_init', type='double',
         bounds=dict(min=.000001, max=1),
         prior=dict(name='normal', mean=.00004, scale=.001)),
    # remaining parameters are defined the same way
]

We provide an extra prior argument as a dictionary. We define a name (must be either normal or beta) for the type of distribution, and then define either the mean and scale for the normal distribution, or the shape_a and shape_b for the beta distribution. We can use an online normal distribution plotter tool to help us visualize the scale of our prior density function.   
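As an alternative to an online plotter, the prior can be inspected numerically. The sketch below uses the mean, scale, and bounds from the parameter definition above and reports how much of the prior mass falls below a few thresholds. It assumes the prior is simply truncated to the bounds, which is an assumption of this sketch and not a statement about SigOpt’s internals.

```python
from statistics import NormalDist

# Prior and bounds from the learning_rate_init parameter definition.
prior = NormalDist(mu=0.00004, sigma=0.001)
lower, upper = 0.000001, 1.0

def mass_between(a, b, dist=prior):
    """Prior probability mass on the interval [a, b]."""
    return dist.cdf(b) - dist.cdf(a)

# Mass that falls inside the bounds (the rest lies below the minimum).
in_bounds = mass_between(lower, upper)

# Share of the in-bounds mass below each threshold (truncation assumed).
for threshold in (0.001, 0.003, 0.01, 0.1):
    share = mass_between(lower, threshold) / in_bounds
    print(f"learning_rate_init <= {threshold:<5}: {share:.3f} of prior mass")

share_below_003 = mass_between(lower, 0.003) / in_bounds
```

Because the mean sits so close to the lower bound, roughly half of the untruncated normal lies below the minimum, and nearly all of the remaining mass is concentrated at very small learning rates.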

With this prior distribution specified, the SigOpt optimizer will test parameter values primarily in accordance with this distribution, which peaks at the lower end of the bounds. As a result, learning_rate_init will be biased much more strongly toward the smaller values where we have already seen some success.

We can quickly evaluate some summary statistics to compare the quality of the models trained before using prior knowledge with those trained after adding these prior distributions:

Before Prior Knowledge: 

Poor models are defined as models with a cross-validation RMSE score worse than if the model simply provided the mean y value from the training data (RMSE ~= 2 for this problem), as the simplest possible baseline. With some models diverging, the mean RMSE score of these 100 runs is quite high, while the median is more reasonable but still not great. 50% of the models we trained used hyperparameter values that led to quite poor results. 
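The baseline and summary statistics described above can be sketched as follows. The target values and run scores are invented for illustration and are not the results of the experiments in this post; baseline_rmse and summarize are hypothetical helper names.

```python
from statistics import mean, median

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

def baseline_rmse(y_train):
    """RMSE of always predicting the mean of the training target."""
    y_bar = mean(y_train)
    return rmse(y_train, [y_bar] * len(y_train))

def summarize(scores, baseline):
    """Mean, median, and fraction of runs scoring worse than the baseline."""
    poor = [s for s in scores if s > baseline]
    return dict(mean=mean(scores), median=median(scores),
                poor_fraction=len(poor) / len(scores))

# Hypothetical target values and run scores, for illustration only.
y_train = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [0.6, 0.7, 0.9, 3.5, 40.0, 1.2]

baseline = baseline_rmse(y_train)
summary = summarize(scores, baseline)
print(baseline, summary)
```

A single diverging run (the 40.0 here) drags the mean far above the median, which is why both statistics are worth reporting.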

With Informed Prior Knowledge: 

The effect of introducing prior knowledge for our hyperparameters is dramatic: only 4% of models scored poorly instead of 50%, and the mean and median error scores of our models are both significantly improved. This means the SigOpt optimizer is indeed exploring parameter spaces much more likely to find a better performing solution. We’re getting more out of our limited SigOpt budget by informing the optimizer with our prior experimentation results through prior distributions of parameters. 

With informed prior distributions in place, we can use SigOpt’s optimizer to find an equivalently performant model with fewer runs. In practice, prior knowledge can also provide a speedup in experimentation if models with hyperparameters resulting in poor models take longer to train than hyperparameters of better performing models, as is the case in neural networks where poor hyperparameter values can cause gradient descent to diverge. Diverging models will always use the maximum iterations set and never reach an error tolerance needed to converge earlier than the maximum iterations.  

Wrapping Up  

Ready to dive into intelligent experimentation with SigOpt using your Prior Knowledge? Sign up for a free SigOpt account today. 

Both SigOpt experiments run for this blog can be viewed in a guest session here: No Prior Knowledge and with Prior Knowledge.  

Want to learn more about Prior Knowledge? Check out: 

Have feedback or additional questions about Prior Knowledge that weren’t answered here? Join the discussion on the SigOpt Community page. 

Trevor Skelton Machine Learning Engineer Intern