Simple Neural Architecture Search with SigOpt

Barrett Williams and Ellick Chan (Intel)

Neural Architecture Search (NAS) is a modern technique to find optimized neural networks (NN) for a given task, such as image classification, through structural iteration and permutation. Network parameters like the depth of the network, number of convolutional filters, pooling, epochs and learning rate can substantially impact a network’s accuracy, inference throughput, and latency for a given dataset.

The search space for these parameters is large, so NAS can consume many compute-hours of training. In this article, we show how you can use smarter search algorithms provided by SigOpt paired with raw cluster computing power provided by Ray Tune to accelerate this process. We use a simple example so that practitioners can apply this technique to their own workflows.

To illustrate the core concept of NAS, consider the original network in Figure 1a. This reference network consists of a single input layer followed by one or more copies of Block 1. Block 1 is based on a convolution-pooling motif consisting of 3×3 convolutions with 32 filters, optionally followed by a pooling operation. This pattern continues with one or more copies of Block 2, similarly composed of 32 3×3 convolutional filters. These convolutional blocks are then flattened to a vector, processed through a fully connected layer, and topped off with a softmax function for final classification.

NAS helps the data scientist test a variety of permutations of a reference architecture. Figure 1b shows one option called “depth scaling,” in which Block 2 is repeated to increase the effective depth of the network. For good measure, we also optionally add another fully connected layer of 1024 neurons. In this tutorial, the two fully connected layers are the same size, but they can be different sizes in your application.

Figure 1c shows “width scaling,” in which the depth of the network remains constant but the parameters of the operators are varied. In this case, we reduce the number of convolutional filters in L1 (layer 1) from 32 to 16 and increase the number in L2 from 32 to 64. We also make the fully connected layer wider, going from 1024 to 2048 neurons. Note that NAS doesn’t have to search for all parameters at once. It’s possible to optimize one parameter at a time while fixing the others. With an intelligent optimizer such as SigOpt, it’s both possible and more efficient to update multiple parameters strategically at once and find the best network architecture more quickly.

Figure 1d explores one more dimension by challenging our assumption of using 3×3 filters. Instead, we substitute the filters in Block 1 with 5×5 filters and Block 2 with 7×7 filters. This can help the performance of certain models and datasets, depending on data characteristics and input image resolution.
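
To get a feel for how these architectural choices change model size, the following sketch counts the trainable parameters of a single convolutional layer using the standard formula (kernel height × kernel width × input channels × filters, plus one bias per filter). The helper name `conv2d_params` is hypothetical and introduced only for illustration; the filter counts and kernel sizes follow the examples in Figure 1.

```python
def conv2d_params(kernel, in_channels, filters):
    """Trainable parameters of a standard 2D convolution with one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

# Input layer on a 3-channel (RGB) image
print(conv2d_params(3, 3, 32))    # 3x3 kernels, 32 filters -> 896
print(conv2d_params(5, 3, 32))    # 5x5 kernels, 32 filters -> 2432

# A later layer that receives 32 input channels
print(conv2d_params(3, 32, 32))   # reference block          -> 9248
print(conv2d_params(3, 32, 64))   # width scaling to 64      -> 18496
print(conv2d_params(7, 32, 32))   # 7x7 kernels              -> 50208
```

Wider layers and larger kernels grow the parameter count quickly, so each of these choices trades model capacity against training and inference cost.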

By now, it’s fairly clear that even with a simple example, there is a combinatorially large number of NN configurations to customize and explore. In the rest of this article, we will show you how to use SigOpt and Ray Tune to search the space of simple NNs for classifying images in the classic CIFAR10 dataset.

Figure 1. Block diagrams representing multiple model architectures that fall within the search space.

Overall Workflow

  1. Define a NN training task: choose a dataset and a model template (e.g., CIFAR10; convolutional neural net (CNN)) and define the parameters to tune (e.g., number of layers and/or filters).
  2. Apply Ray Tune to search for a preliminary set of model parameters.
  3. Adapt the search algorithm to SigOpt to get better parameters more efficiently.

Parameterizing the Model

For the purposes of this article, we define a NN training task as a convolutional network with one or more convolutional blocks. We’ll use the CIFAR10 dataset and the Keras API from TensorFlow.

# Install prerequisites
!pip install -qqq ray[tune] pandas
 
# Download dataset
from tensorflow.keras.datasets import cifar10
d = cifar10.load_data()
 
# Ray Tune calls this function many times in parallel with different hyperparameters
def train(config):
    import os
    import psutil
    from ray import tune  # used by tune.report at the end of this function
    os.environ['OMP_NUM_THREADS'] = str(int(psutil.cpu_count()/2))
    from tensorflow import keras
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
 
    # Load the dataset
    (X_train, Y_train), (X_test, Y_test) = keras.datasets.cifar10.load_data()
 
    # Create the model
    model = Sequential()
 
    # Build first convolutional block motif
    model.add(Conv2D(config['nconv0'], kernel_size=(3, 3),
                     activation='relu', input_shape=(32, 32, 3)))
    for i in range(config['nblocks1']):  # Repeat this block nblocks1 times
        if len(model.layers) > config['layers']:  # Limit depth to "layers"
            break
        # Add nconv1 3x3 conv kernels with relu activation
        model.add(Conv2D(config['nconv1'], kernel_size=(3, 3), activation='relu'))
        # Add pooling
        if config["pooling"] == "True":
            model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Dropout(0.25))  # Add dropout

    # Build second convolutional block motif
    for i in range(config['nblocks2']):  # Repeat this block nblocks2 times
        if len(model.layers) > config['layers']:
            break
        model.add(Conv2D(config['nconv2'], kernel_size=(3, 3), activation='relu'))
        if config["pooling"] == "True":
            model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Dropout(0.25))

    model.add(Flatten())  # Flatten output tensor to a vector
    for i in range(config['nfcll']):  # Number of fully connected last layers
        if len(model.layers) > config['layers']:
            break
        model.add(Dense(1024, activation='relu'))

    model.add(Dropout(0.5))
    model.add(Dense(10, activation='softmax'))  # Apply softmax classification
    print("Layers: %d max layers: %d" % (len(model.layers), config['layers']))
 
    # Compile and set up training
    from tensorflow.keras.utils import to_categorical
    # Adam adapts its step sizes during training, so we do not tune the
    # learning rate explicitly here
    model.compile(loss='categorical_crossentropy',
                  optimizer=keras.optimizers.Adam(learning_rate=0.0001, decay=1e-6),
                  metrics=['accuracy'])

    # Train the model
    model.fit(X_train / 255.0, to_categorical(Y_train),
              batch_size=128, shuffle=True, verbose=0,
              epochs=config["epochs"])
   
    # Evaluate the model
    scores = model.evaluate(X_test / 255.0, to_categorical(Y_test))
    tune.report(Accuracy=scores[1], Loss=scores[0])
 
def train_wrapper(config):
    try:
        train(config)
    except ValueError:
        # Applying pooling too many times can shrink the tensors so much
        # that the dimensions go negative
        print("Invalid neural net sampled")
        tune.report(Accuracy=0, Loss=999)
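
As a rough illustration of why some sampled networks are invalid, the sketch below tracks the side length of the feature map through the convolutional blocks. It assumes what the training function above uses: a 32×32 input, 3×3 convolutions without padding (each shortens the side by 2), and 2×2 max pooling (each halves the side). The helper `spatial_size` is hypothetical, and for simplicity it ignores the `layers` depth cap.

```python
def spatial_size(nblocks1, nblocks2, pooling, size=32, kernel=3):
    """Side length of the feature map after the conv blocks, or None if a layer does not fit."""
    def conv(side):
        # An unpadded k x k convolution needs at least k pixels per side
        return side - kernel + 1 if side >= kernel else None

    size = conv(size)                      # input convolution
    for _ in range(nblocks1 + nblocks2):   # block 1 copies, then block 2 copies
        size = conv(size)
        if size is None:
            return None
        if pooling:
            if size < 2:
                return None
            size //= 2                     # 2x2 max pooling halves the side
    return size

print(spatial_size(2, 2, pooling=False))  # 22
print(spatial_size(2, 1, pooling=True))   # 2
print(spatial_size(2, 2, pooling=True))   # None: the last 3x3 conv no longer fits
```

Under these assumptions, four pooled blocks shrink the feature map below the kernel size, which is the kind of configuration the wrapper reports with an accuracy of 0.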

To parameterize the model, we define the following:

  • epochs: Number of epochs to train a model
  • layers: Maximum number of layers of the desired model (subsequent layers are pruned)
  • nconv0: Number of 3×3 convolution filters for the input layer
  • nfcll: Number of fully connected last layers, with 1024 neurons each
  • pooling: Global setting to enable/disable pooling in convolution blocks #1 and #2
  • nblocks1: Number of copies of convolution block #1
  • nconv1: Number of 3×3 convolution filters for block #1
  • nblocks2: Number of copies of convolution block #2
  • nconv2: Number of 3×3 convolution filters for block #2
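
To make the effect of these parameters concrete, here is a small sketch that mirrors the layer-building logic of the training function without importing TensorFlow. It returns the sequence of layers a given configuration would produce. The function name `layer_plan` and the example values are hypothetical. Note that, as in the training code, the depth cap is only checked before each block or dense layer, so the flatten, final dropout, and classifier layers are always added.

```python
def layer_plan(config):
    """List the layers that the training function would build for this config."""
    plan = ["Conv2D(%d)" % config["nconv0"]]            # input layer
    for nblocks, nconv in (("nblocks1", "nconv1"), ("nblocks2", "nconv2")):
        for _ in range(config[nblocks]):
            if len(plan) > config["layers"]:            # depth cap reached
                break
            plan.append("Conv2D(%d)" % config[nconv])
            if config["pooling"] == "True":
                plan.append("MaxPooling2D")
            plan.append("Dropout(0.25)")
    plan.append("Flatten")
    for _ in range(config["nfcll"]):
        if len(plan) > config["layers"]:
            break
        plan.append("Dense(1024)")
    plan += ["Dropout(0.5)", "Dense(10, softmax)"]
    return plan

# Example values chosen for illustration only
example = {"epochs": 20, "layers": 3, "nconv0": 32, "nfcll": 1, "pooling": "True",
           "nblocks1": 2, "nconv1": 32, "nblocks2": 2, "nconv2": 64}
print(layer_plan(example))                          # pruned after one copy of block 1
print(len(layer_plan(dict(example, layers=19))))    # 17 layers when nothing is pruned
```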

To stay consistent with the cluster deployment covered in Part 2 (to be published later), we’ll start Ray from the command line. If you’re running on a single node, the following commands aren’t necessary:

!ray stop
!sleep 3
!nohup ray start --head --num-cpus 1
!sleep 3
import ray
ray.init('localhost:6379')

If you’re running on a cluster such as Intel® DevCloud that uses a job scheduler (e.g., the Portable Batch System), the following commands start worker processes on multiple nodes:

!which qsub && echo ray start --address `hostname`:6379 --block --num-cpus 1 | qsub
!which qsub && echo ray start --address `hostname`:6379 --block --num-cpus 1 | qsub
!which qsub && echo ray start --address `hostname`:6379 --block --num-cpus 1 | qsub
!which qsub && sleep 20

Finally, set the parameters:

from ray import tune
config = {
    # tune.randint samples integers from [lower, upper), so upper is excluded
    "epochs":   tune.randint(20, 30),
    "layers":   tune.randint(1, 20),   # maximum number of layers
    "nconv0":   tune.randint(16, 64),  # input layer
    "nfcll":    tune.randint(0, 2),    # fully connected last layer
    "nblocks1": tune.randint(1, 3),    # conv block 1
    "nconv1":   tune.randint(16, 64),
    "nblocks2": tune.randint(1, 3),    # conv block 2
    "nconv2":   tune.randint(16, 64),
    "pooling":  tune.choice(['True', 'False'])
}

Apply Ray Tune

Ray Tune is a Python library that facilitates scaled experimentation, as well as hyperparameter optimization via SigOpt, allowing multiple worker nodes to explore the search space in parallel. A naïve grid search of our defined parameter space would explore nearly 1.2 billion possible configurations. In this article, we show how to use random search to speed up this process, then follow up with a smarter guided search using SigOpt, effectively comparing the performance and output of the two approaches.
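
The size of the grid can be checked with a few lines of arithmetic. The sketch below multiplies the number of values each parameter can take. The exact total depends on whether each range's upper bound is counted (`tune.randint` excludes it), so both conventions are shown. The names `int_ranges` and `grid_size` are hypothetical.

```python
from math import prod

# (low, high) integer ranges copied from the search space definition
int_ranges = {
    "epochs": (20, 30), "layers": (1, 20), "nconv0": (16, 64),
    "nfcll": (0, 2), "nblocks1": (1, 3), "nconv1": (16, 64),
    "nblocks2": (1, 3), "nconv2": (16, 64),
}
n_pooling = 2  # "pooling" is a categorical with two choices

def grid_size(upper_inclusive):
    """Number of distinct configurations in an exhaustive grid."""
    extra = 1 if upper_inclusive else 0
    return prod(hi - lo + extra for lo, hi in int_ranges.values()) * n_pooling

print(f"{grid_size(upper_inclusive=False):,}")  # 336,199,680
print(f"{grid_size(upper_inclusive=True):,}")   # 1,397,670,120
```

The two conventions bracket the figure quoted above. Either way, the grid is far too large to train exhaustively.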

For Ray Tune, the most important inputs are the function to optimize (train) and the search space for the parameters (config). We defined both of these earlier and provide the corresponding code below. Other options include a choice of search algorithm and scheduler for more guided searches.
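
Conceptually, random search needs only those two inputs: a function to score a configuration and a space to sample from. The following is a minimal pure-Python sketch of the idea, not Ray Tune's implementation. The names `SPACE`, `sample_config`, `random_search`, and `toy_objective` are hypothetical, and the objective is a placeholder that does not train anything.

```python
import random

# Mirror of the article's search space: (low, high) tuples exclude the upper
# bound, matching tune.randint, and lists are categorical choices.
SPACE = {
    "epochs": (20, 30), "layers": (1, 20), "nconv0": (16, 64),
    "nfcll": (0, 2), "nblocks1": (1, 3), "nconv1": (16, 64),
    "nblocks2": (1, 3), "nconv2": (16, 64),
    "pooling": ["True", "False"],
}

def sample_config(space, rng):
    """Draw one configuration uniformly at random from the space."""
    return {name: rng.randrange(*dom) if isinstance(dom, tuple) else rng.choice(dom)
            for name, dom in space.items()}

def random_search(objective, space, num_samples, seed=0):
    """Score num_samples random configurations and return the best (score, config)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_samples):
        cfg = sample_config(space, rng)
        trials.append((objective(cfg), cfg))
    return max(trials, key=lambda t: t[0])

def toy_objective(cfg):
    # Placeholder score (NOT a real training run): simply rewards more epochs
    return cfg["epochs"] / 30

best_score, best_cfg = random_search(toy_objective, SPACE, num_samples=5)
print(best_score, best_cfg)
```

Each sample is drawn independently of earlier results. A guided optimizer differs in exactly this respect: it uses the scores seen so far to choose the next configuration.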

# Common options for Ray Tune
tune_opts = {
    # Number of sample points to try, increase for better results
    'num_samples': 5,
    # Some net configs are invalid (pooling too many times creates negative dim)
    'raise_on_failed_trial': False
}
 
import subprocess, psutil
 
# Enable accelerator, if present
try:
  if subprocess.run('nvidia-smi').returncode == 0:
    tune_opts['resources_per_trial'] = {'cpu': 1, 'gpu': 1}
except FileNotFoundError: pass
 
analysis = tune.run(
    train_wrapper,
    config=config,
    verbose=1,
    **tune_opts)
 
# To see the full optimization results, inspect the results dataframe
analysis.results_df
 
# Visualize Ray Tune results
d = analysis.results_df
d.plot.scatter('timestamp', 'Accuracy')

Integrating with SigOpt

To sign up for free access to SigOpt, please use this sign-up form. You’ll then be able to create an account, which will give you access to an API key that you can use in your Google Colab notebook or DevCloud Jupyter notebook.

The sign-up form for free access to SigOpt

Where to find your API token, once you’ve signed up

# Fill in your SigOpt key here
SIGOPT_TOKEN   = "YOUR_SIGOPT_API_KEY_HERE"
SIGOPT_PROJECT = "raytune-simplenas"
!pip install -qqq sigopt
 
# Convert Ray Tune parameter space to SigOpt format
def convert_space_to_sigopt(config):
    c = []
    for k, v in config.items():
        print(k, v)
        if isinstance(v, ray.tune.sample.Float):
            c.append({'name': k, 'type': 'double',
                      'bounds': {'min': v.lower, 'max': v.upper}})
        elif isinstance(v, ray.tune.sample.Integer):
            c.append({'name': k, 'type': 'int',
                      'bounds': {'min': int(v.lower), 'max': int(v.upper)}})
        elif isinstance(v, ray.tune.sample.Categorical):
            vals = [{'enum_index': i + 1, 'name': str(z),
                     'object': 'categorical_value'}
                    for i, z in enumerate(v.categories)]
            c.append({'name': k, 'type': 'categorical', 'categorical_values': vals})
        else:
            # Only float, integer, and categorical domains are handled here
            raise ValueError('Unknown type for %s: %s' % (k, type(v)))
    print('config:', c)
    return c
 
import ray, os
from ray.tune.suggest.sigopt import SigOptSearch
from ray.tune.schedulers import FIFOScheduler
sigopt_connection = ray.tune.suggest.sigopt.Connection(client_token=SIGOPT_TOKEN)
algo = SigOptSearch(
    convert_space_to_sigopt(config),
    name=SIGOPT_PROJECT,
    connection=sigopt_connection,
    project=SIGOPT_PROJECT,
    max_concurrent=3,
    metric="Accuracy",
    mode="max")
 
analysis_sigopt = tune.run(
    train_wrapper,
    search_alg=algo,
    verbose=1,
    **tune_opts)
 
analysis_sigopt.results_df
 
# Visualize Ray Tune results
# Copy the columns so the in-place timestamp shift does not modify a view
a = analysis.results_df[['timestamp', 'Accuracy']].copy()
s = analysis_sigopt.results_df[['timestamp', 'Accuracy']].copy()
# Measure time relative to the start of each experiment
a['timestamp'] -= a['timestamp'].min()
s['timestamp'] -= s['timestamp'].min()
# Accuracy is reported as a fraction, so scale by 100 for the percent labels
ax = a.plot.scatter('timestamp', 'Accuracy', c='b',
                    label='Random Search - Acc %0.1f%%' % (100 * a['Accuracy'].max()))
s.plot.scatter('timestamp', 'Accuracy', c='r', ax=ax,
               label='SigOpt - Acc %0.1f%%' % (100 * s['Accuracy'].max()),
               title='Accuracy vs time')
ax.legend(loc='lower right')

Interpreting SigOpt Results

In the example above, we used a limited number of sample points to allow this experiment to complete quickly. If more data points are sampled, you might see a figure like the one shown below. This illustrates that SigOpt’s directed search helps find better solutions more efficiently than random search.
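
One common way to compare two searches is to track the best accuracy found so far after each completed trial. The sketch below computes that running maximum. The helper name `best_so_far` is hypothetical, and the accuracy values are made-up placeholders for illustration, not results from this experiment.

```python
from itertools import accumulate

def best_so_far(accuracies):
    """Running maximum of trial accuracies, given in completion order."""
    return list(accumulate(accuracies, max))

# Placeholder values only, NOT measured results
example_accuracies = [0.50, 0.62, 0.58, 0.70, 0.66]
print(best_so_far(example_accuracies))  # [0.5, 0.62, 0.62, 0.7, 0.7]
```

With the results dataframes from the earlier cells, the input would be the Accuracy column sorted by timestamp. A search whose curve rises earlier and higher is finding good configurations more efficiently.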

Plot showing accuracy of SigOpt versus random search. SigOpt delivers more accurate model configurations faster and more consistently.

SigOpt helps data scientists understand which parameters matter most for their NAS. Below, we see that the number of epochs has the biggest influence on the accuracy of the network, followed by pooling layers, and then the number of convolutional filters. SigOpt searches this space intelligently to find the best values efficiently.

SigOpt visualizes the relative parameter importance with respect to the points sampled. Note that this sample is somewhat biased, because the optimizer chooses the points intelligently rather than at random.

The parameter importance plot shows which parameters in your model search most impact accuracy.

Given the relative importance of the parameters, we examine the relationship between convolutional filter parameters nconv0 and nconv1 and find that this particular problem prefers around 50 filters for nconv0 and a small number of filters for nconv1. Any pair of variables can be visualized in this plot.

Three-dimensional plot of the experiment history, showing two parameter axes and the accuracy values they correspond to.

A parallel coordinates plot shows the trajectory of the parameter search. In this case, the highest scores are obtained with a larger number of epochs, pooling, and different combinations of layer parameters. This plot shows what this particular problem prefers. As the dataset or objective is changed, the preferred parameters may differ.

Understanding the relationships between the parameters can help a data scientist better optimize parameter values for the problem and better manage tradeoffs.

Parallel coordinates on the SigOpt web dashboard show you which pairings and ranges of model parameter values result in higher accuracy.

Feel free to try this notebook out for yourself on Google Colab or Intel’s own DevCloud. Be sure to sign up for free access to SigOpt, and start optimizing, tracking, and systematizing today.

Editor’s note:

This blog post is also available on Intel’s tech.decoded channel.

Barrett Williams Product Marketing Lead
Ellick Chan (Intel) Guest Author