Imagine walking down the street and seeing the greatest pair of shoes. But alas! The person is getting further and further away from you and you’ll never see those loafers again. Fear not, for you have PURCHASE-NOW, an app where you can take a photo of the shoes and it’ll order the exact pair for you. They’ll be on your doorstep when you come home that evening.
Setting aside whether it is uncouth to take photos of strangers on the street, this is closer to fact than fiction, and closer than you probably think. Over the last decade, image processing, and image classification in particular, have become key tools in diagnosing diabetic retinopathy, designing your abode, monitoring the flow of refugees, and preventing self-driving cars from hitting you when you walk down the street.
Advances in image classification, spurred by decreasing compute costs, the acceleration of GPUs, and the inception of the ImageNet challenge, have led to rapid improvements in neural network architecture and performance. In 2015, the introduction of ResNet by He et al. brought a major architectural change for image classification that led to deeper, better-performing convolutional neural networks (CNNs) [1]. Paired with the rightful glory of ImageNet, ResNet model weights pre-trained on ImageNet-1000 have become a standard baseline for many image classification problems.
Transfer learning is a methodology that applies a model previously trained (pre-trained) on a monumental dataset (typically ImageNet for image classification tasks) to a more novel and niche dataset. These datasets could include satellite images, lung cancer x-rays, road signs, photos of landmarks, and even subsets of ImageNet itself. The promise of transfer learning is to avoid the expensive (in time and resources) process of fully training deep architectures on these new datasets, while still realizing good – if not great – model performance on the classification task without overfitting.
In this transfer learning process, a pre-trained model can be used in multiple ways. We will focus on using a pre-trained model as a feature extractor or as a starting point for further model training (fine-tuning). When the pre-trained model is used as a feature extractor, only the final fully connected layer is trained, with the rest of the network digesting images into their main features. With fine-tuning, the convolutional weights are initialized from the pre-trained network and trained further, along with the final fully connected layer, to fit the dataset at hand.
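As a rough illustration of the difference, the sketch below lists which top-level blocks receive gradient updates in each mode and counts the weights in a replacement classification head. The block names follow the torchvision ResNet layout, and the 196-class head matches the Cars dataset used in this post; both are assumptions made for illustration, not the training code behind these experiments.

```python
# Top-level blocks of a torchvision-style ResNet, ordered input to output.
# (Assumed layout for illustration; "fc" is the final fully connected layer.)
RESNET_BLOCKS = ["conv1", "bn1", "layer1", "layer2", "layer3", "layer4", "fc"]


def trainable_blocks(mode):
    """Return the blocks whose weights are updated for a transfer learning mode."""
    if mode == "feature_extractor":
        # Pre-trained convolutional weights stay frozen; only the new head learns.
        return ["fc"]
    if mode == "fine_tune":
        # Convolutional blocks start from pre-trained weights and keep training
        # together with the new head.
        return list(RESNET_BLOCKS)
    raise ValueError(f"unknown mode: {mode!r}")


def head_parameter_count(in_features, num_classes):
    """Weights plus biases of a fully connected layer of the given shape."""
    return in_features * num_classes + num_classes


# Feature width entering the final layer: 512 for ResNet 18, 2048 for ResNet 50.
print(trainable_blocks("feature_extractor"))
print(trainable_blocks("fine_tune"))
print(head_parameter_count(2048, 196))  # ResNet 50 head for 196 classes
print(head_parameter_count(512, 196))   # ResNet 18 head for 196 classes
```

In feature-extractor mode the head is the only thing that learns, so the number of trainable weights is set entirely by the feature width and the number of classes.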
While using the CNN as a feature extractor is a powerful tool, fine-tuning the network is recommended when the training data is dissimilar to the data used to train the pre-trained network. Through a use case, the rest of this post will explore the trade-offs of these two approaches to transfer learning in relation to hyperparameter optimization to answer the question: how can we get the most out of these pre-trained models, and how do we optimize them efficiently?
Brought to you by the genius behind ImageNet, the Cars dataset was initially introduced in 2013 to compare the effects of using many 2D photos versus a smaller corpus of images and their 3D representations as training data on image classification algorithms. The dataset itself is composed of 16,185 high-resolution photos of cars spanning 196 granular labels distinguished by Make, Model, and Year, with each class making up approximately 0.5% of the whole. Its complexity arises from the specificity of the dataset, the amount of data available per class, and the amount of data available as a whole. The images themselves also vary in RGB channels, illumination, and background clutter, and show high intra-class variance.
Prior to residual connections, deeper networks were performing more poorly than shallow networks (the degradation problem). As this degradation is not explained by vanishing gradients or over-fitting, He et al. introduced residual connections (ResNet) to address it, hypothesizing and showing empirically that CNNs are able to fit a residual mapping more easily than the original network function. As a result of this advance, the ResNet architecture won the ILSVRC and COCO 2015 challenges.
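In the notation of He et al., if H(x) is the mapping a stack of layers should ideally learn, the block is instead asked to learn the residual F(x), and the input is added back through a shortcut connection. A compact statement of that idea, with symbols as in the paper and reproduced here only as a reading aid:

```latex
\mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x},
\qquad
\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}
```

If the identity mapping is already close to optimal, the layers only need to push F toward zero, which is easier than fitting an identity through a stack of nonlinear layers.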
In order to explore the trade-offs between the two main transfer learning methods on pre-trained ResNet models, we will set up our experiment as follows:
Our transfer learning training runs will use the following baseline hyperparameters from He et al.:
Our Bayesian Optimization will search the following hyperparameter space:
*Note: Batch size grows in powers of 2 (e.g., 16, 32, 64).
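One common way to get a batch size that grows in powers of 2 is to search over an integer exponent and convert it inside the training script. The sketch below assumes a hypothetical exponent range of 4 to 8; the actual bounds used in this experiment are not stated here.

```python
def batch_size_from_exponent(exponent):
    """Map an integer search parameter to a power-of-2 batch size."""
    return 2 ** exponent


# Hypothetical exponent bounds, chosen only for illustration.
MIN_EXPONENT, MAX_EXPONENT = 4, 8

candidate_batch_sizes = [
    batch_size_from_exponent(e) for e in range(MIN_EXPONENT, MAX_EXPONENT + 1)
]
print(candidate_batch_sizes)  # [16, 32, 64, 128, 256]
```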
All training runs use the following common techniques:
*Note: The validation split is 20% of each label.
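A split that takes 20% of each label is a stratified split. The following is a minimal standard-library sketch of one way to build it; the function name, the seed, and the toy labels are illustrative assumptions, not the code used for these experiments.

```python
import random
from collections import defaultdict


def stratified_split(labels, validation_fraction=0.2, seed=0):
    """Split example indices so every label gives the same share to validation."""
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    rng = random.Random(seed)
    train_indices, validation_indices = [], []
    for label in sorted(indices_by_label):
        indices = indices_by_label[label]
        rng.shuffle(indices)
        # Round to the nearest whole image for this label.
        n_validation = round(len(indices) * validation_fraction)
        validation_indices.extend(indices[:n_validation])
        train_indices.extend(indices[n_validation:])
    return train_indices, validation_indices


# Toy labels: 3 hypothetical classes with 10 images each.
toy_labels = [name for name in ("class_a", "class_b", "class_c") for _ in range(10)]
train_idx, val_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx))  # 24 6
```

Splitting per label matters on a dataset like this one, where each class holds only a small share of the images and a purely random split could leave some classes thin in validation.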
We will use SigOpt as our Bayesian Optimization service and SigOpt’s implementation of Multitask Optimization to reduce hyperparameter optimization time. Multitask Optimization allows users to specify tasks at various fidelities and intelligently learns the space from lower-fidelity runs. Let’s take a look at an example from Klein et al.
In the above image, Klein et al. show the progress of learning the hyperparameter space for an SVM model trained on MNIST. They establish optimization runs with varying fidelities (s = 1/128, s = 1/4, etc.), with each value representing the fraction of the data used. So, for s = 1/4, the hyperparameter optimization will use a fourth of the MNIST data. We see that low-fidelity optimization runs (s = 1/128 and s = 1/16) provide us with knowledge about the hyperparameter space that we can use for high-fidelity (s = 1) runs. By running more low-fidelity optimization runs, we can learn more about the space at a fraction of the cost (compute time, wall-clock time, resource cost, etc.) of a high-fidelity run.
Along with leveraging Multitask Optimization, we will also use Orchestrate to parallelize our hyperparameter optimization. Orchestrate is a layer on top of Kubernetes that leverages AWS’s ECR and EKS systems to help you run hyperparameter tuning across multiple AWS EC2 instances, intelligently scheduling your jobs and resources. It manages your resources and coordinates with SigOpt to track your experiment. SigOpt Orchestrate is a command-line tool that makes it easy to manage training clusters and run optimization experiments.
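To make the notion of fidelity concrete, here is a minimal sketch of training-set subsampling in the spirit of that example, where a fidelity s keeps a fraction s of the data. The function name is hypothetical, and the 60,000-example size (the standard MNIST training set) stands in for real data; this is not code from Klein et al. or from SigOpt.

```python
import random
from fractions import Fraction


def subsample_for_fidelity(examples, fidelity, seed=0):
    """Return a random subset holding roughly `fidelity` of the examples."""
    if not 0 < fidelity <= 1:
        raise ValueError("fidelity must be in (0, 1]")
    subset_size = max(1, int(len(examples) * fidelity))
    return random.Random(seed).sample(examples, subset_size)


# Stand-in for a training set: indices of 60,000 examples.
dataset = list(range(60_000))
for s in (Fraction(1, 128), Fraction(1, 16), Fraction(1, 4), Fraction(1)):
    print(s, len(subsample_for_fidelity(dataset, s)))
```

A run at s = 1/128 touches fewer than 500 examples, which is why many such runs can be afforded for the price of one full-data run.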
For our Multitask Optimization runs, we will use 20 p2.xlarge AWS EC2 instances that each use 1 NVIDIA K80 GPU. Tuning ResNet 50 as a feature extractor and fine-tuning ResNet 18 for 1 cycle on CPU each take approximately 8 hours. Running hyperparameter optimization for each of these techniques on CPU would rack up 1,760 compute hours, or 73 days. When using NVIDIA GPUs paired with cuDNN, we experience a 4x speed-up and a more reasonable compute time of approximately 900 hours per technique. Hence, no experiments were run on CPU. For more information on GPU speed-ups and CUDA performance, see Justin Johnson’s work on benchmarks for popular CNN architectures using CPU and different GPUs, with and without cuDNN.
The first question we set out to understand is the impact of different transfer learning techniques on model accuracy. In the context of this dataset and classification problem, we see that using a deep ResNet architecture (ResNet 50) as a feature extractor underperforms compared with fine-tuning a shallow ResNet (ResNet 18). This is surprising, as we expect deep feature extractors to perform better than fine-tuned shallow networks. Not only do we observe that a deeper architecture is not required to perform well on this task, we also see that the Cars dataset is dissimilar enough from ImageNet that fine-tuning is necessary. This suggests that it is worthwhile for teams to attempt fine-tuning and iterate on varying architecture depths, even with datasets that seem similar on the surface.
Our second question is the effect of hyperparameter optimization for gradient descent on the two transfer learning methods. In both cases, there was a significant lift in the performance of the model when SigOpt was used to optimize the hyperparameters compared to the baseline. Using ResNet 50 as a feature extractor paired with optimization resulted in a 1.58 percentage point boost over the non-optimized ResNet 50, but it remains an underperforming model (47.99%). On the other hand, fine-tuning ResNet 18 and optimizing the hyperparameters with SigOpt produces the highest-performing model (87.33% accurate), which is 3.92 percentage points more accurate than the next-best version of the model (ResNet 18 without optimization). The optimized and fine-tuned ResNet 18 represents a 40.92 percentage point reduction in error compared to the non-optimized ResNet 50. With fine-tuning, the model is able to better fit its data and, bolstered by hyperparameter optimization, becomes our best-performing model.
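The accuracy comparisons in this paragraph are differences in percentage points, which can be checked with a few lines of arithmetic. The sketch below uses only accuracies quoted or implied in this post (46.41% for the non-optimized ResNet 50 is 47.99% minus the 1.58 point boost). The relative error reduction it prints is a derived figure, not one reported from the experiments.

```python
# Top-1 accuracies (%) quoted or implied in this post.
ACCURACY = {
    "resnet50_feature_extractor": 46.41,            # 47.99 minus the 1.58 point boost
    "resnet50_feature_extractor_optimized": 47.99,
    "resnet18_fine_tuned": 83.41,                   # 87.33 minus the 3.92 point gap
    "resnet18_fine_tuned_optimized": 87.33,
}


def point_gain(model, baseline):
    """Absolute accuracy difference in percentage points."""
    return ACCURACY[model] - ACCURACY[baseline]


def relative_error_reduction(model, baseline):
    """Fraction of the baseline's error rate that the model removes."""
    baseline_error = 100.0 - ACCURACY[baseline]
    model_error = 100.0 - ACCURACY[model]
    return (baseline_error - model_error) / baseline_error


best, base = "resnet18_fine_tuned_optimized", "resnet50_feature_extractor"
print(f"{point_gain(best, base):.2f} points")  # 40.92 points
print(f"{relative_error_reduction(best, base):.1%} of the baseline error removed")
```

Keeping the absolute and relative views separate avoids reading a 40.92 point gap as a 40.92% relative change.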
Let’s take a closer look at how Multitask Optimization supports the performance of fine-tuning ResNet 18. As we saw from Klein et al., Multitask Optimization learns the hyperparameter space from low-fidelity models and applies that knowledge to better inform high-fidelity models. In our scenario, fidelity relates to the number of epochs for which a training run is executed. Like Klein et al., we expect our low-fidelity models to inform higher-fidelity models, because we expect the results of models trained for fewer epochs to be correlated with those of models trained for more epochs. Here is a breakdown of the fidelities.
As in the SVM example from Klein et al., we see low-fidelity optimization runs informing high-fidelity optimization cycles. In particular, in the case of the learning rate, we see Multitask Optimization using low-fidelity tasks to explore the hyperparameter space. Once it has an idea of a region to further exploit, it starts using medium-fidelity tasks in combination with low-fidelity tasks to further explore and exploit the search space. Further along the optimization process, it introduces high-fidelity optimization runs that home in on the most beneficial portion of the hyperparameter space.
The third question relates to resource constraints. Teams most often use feature extraction techniques to save wall-clock time or minimize computing costs. But in this case, we show that fine-tuning a shallower network can require nearly the same resources as feature extraction from a deeper network, while also producing much better performance. To optimize the 7 selected hyperparameters, this model requires a minimum of 220 training runs (if using efficient approaches like Multitask Optimization). A training run takes 4.2 hours for fine-tuning ResNet 18 and 4.08 hours for training ResNet 50 as a feature extractor. This means ResNet 18 uses 924 compute hours (220 * 4.20 hrs) for optimization, while the optimization of ResNet 50 for feature extraction takes 898 hours (220 * 4.08 hrs). 26 hours is a significant difference, but it is less than 3% of the entire time to optimize these models. And when placed in the context of the nearly 40 percentage point performance gain of ResNet 18 (87.33%) over ResNet 50 (47.99%), this 3% of additional compute time seems well worth it.
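The compute-hour figures in this paragraph follow directly from the run count and per-run training times. A short sketch of that arithmetic, using only numbers quoted in the paragraph (the variable names are mine):

```python
TRAINING_RUNS = 220  # minimum runs quoted for the 7 hyperparameters

# Hours per training run on GPU, as quoted in this post.
HOURS_PER_RUN = {
    "resnet18_fine_tuning": 4.20,
    "resnet50_feature_extraction": 4.08,
}

compute_hours = {
    name: TRAINING_RUNS * hours for name, hours in HOURS_PER_RUN.items()
}
extra_hours = (
    compute_hours["resnet18_fine_tuning"]
    - compute_hours["resnet50_feature_extraction"]
)
extra_share = extra_hours / compute_hours["resnet18_fine_tuning"]

for name, hours in compute_hours.items():
    print(f"{name}: {hours:.0f} compute hours")
print(f"extra for fine-tuning: {extra_hours:.1f} hours ({extra_share:.1%})")
```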
Our fourth question, however, addresses the concern that 924 hours is still significant wall-clock time. Multitask Optimization is between 2x and 20x more efficient than other optimization methods like Random Search. But even so, 924 hours (39 days) to optimize a model is too slow for most teams. One common method for reducing wall-clock time is running the optimization process in parallel, but this has historically been challenging to do efficiently for Bayesian optimization. To solve this challenge, the SigOpt team developed Orchestrate, a cluster management solution that makes it easy to parallelize up to 100 Bayesian optimization processes. This solution made it seamless to run this optimization process across 20 machines without compromising performance. As reported in the table below, this makes fine-tuning much more efficient from a wall-clock time perspective when put in the context of % improvement in accuracy.
Note: To calculate cost ($) per % improvement, Feature Extractor ResNet 50 without optimization is used as the baseline (46.41%).
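Cost per point of improvement is simple arithmetic once the compute hours and accuracy gains are known. The sketch below works in compute hours instead of dollars, because the hourly instance price is not given in this post. Its wall-clock estimate assumes the 20 machines split the runs perfectly evenly, which is an idealization.

```python
BASELINE_ACCURACY = 46.41  # feature extractor ResNet 50 without optimization (%)
MACHINES = 20

# (compute hours for optimization, optimized accuracy in %) quoted in this post.
OPTIMIZED = {
    "resnet50_feature_extraction": (898, 47.99),
    "resnet18_fine_tuning": (924, 87.33),
}


def hours_per_point(compute_hours, accuracy):
    """Compute hours spent per percentage point gained over the baseline."""
    return compute_hours / (accuracy - BASELINE_ACCURACY)


def ideal_wall_clock_hours(compute_hours, machines=MACHINES):
    """Wall-clock time if the work divides evenly across machines (assumption)."""
    return compute_hours / machines


for name, (hours, accuracy) in OPTIMIZED.items():
    print(
        f"{name}: {hours_per_point(hours, accuracy):.1f} h per point, "
        f"{ideal_wall_clock_hours(hours):.1f} h wall-clock on {MACHINES} machines"
    )
```

Multiplying the hours-per-point figure by an hourly instance price gives the cost in dollars per point.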
Finally, our fifth question: what is the real-world impact of misclassification? This difference in accuracy has a material impact on the performance of the model, which can be seen rather clearly when observing misclassified images. Common misclassifications for image classification fall in the areas of semantic gap, viewpoint variation, background clutter, illumination, deformation, occlusion, and intraclass variation. As an example, take a look at how the fine-tuned ResNet 18 model compares to the optimized and fine-tuned ResNet 18 model. Below are examples of the most common errors for both models. In the table below, we see that the most common errors are due to mix-ups in Model or Year. We also see that the optimized model’s predictions are closer to the actual label than the non-optimized model’s. These errors speak to the difficulty and granularity of the dataset.
While a majority of the errors we see are related to mix-ups in Model and Year, the non-optimized ResNet 18 model falters when it comes across background clutter and intraclass variance. Furthermore, as seen with the Land Rover example below, the model is unable to read the logos and discern the Make, Model, and Year from them. This suggests that the model relies on physical properties of the car, such as shape, color, and size, to classify the image.
While the non-optimized model struggles with intraclass variance, the optimized ResNet 18 model handles intraclass variance much better but still has problems when it comes to background clutter and deformation.
As we see from the images above, a majority of the errors are related to Model and Year mix-ups that even mere humans such as myself would not be able to distinguish. While the non-optimized ResNet 18 is unable to classify correctly when it comes to intraclass variance and background clutter, the optimized ResNet 18 handles intraclass variance well but continues to have problems with background clutter and introduces classification problems related to deformation.
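One way to quantify how many errors are Model or Year mix-ups is to compare the fields of the true and predicted labels. The sketch below assumes each label has already been parsed into a (make, model, year) tuple. The example pairs are made up for illustration and are not results from these models.

```python
from collections import Counter

FIELDS = ("make", "model", "year")


def differing_fields(true_label, predicted_label):
    """Return the names of the label fields on which the prediction is wrong."""
    return tuple(
        field
        for field, truth, guess in zip(FIELDS, true_label, predicted_label)
        if truth != guess
    )


def tally_mix_ups(pairs):
    """Count misclassifications by which fields were confused."""
    tally = Counter()
    for true_label, predicted_label in pairs:
        wrong = differing_fields(true_label, predicted_label)
        if wrong:  # skip correct predictions
            tally[wrong] += 1
    return tally


# Hypothetical (true, predicted) pairs, made up for illustration only.
example_pairs = [
    (("Audi", "S5", 2012), ("Audi", "S4", 2012)),
    (("Audi", "S4", 2012), ("Audi", "S4", 2007)),
    (("BMW", "M3", 2012), ("BMW", "M3", 2012)),
    (("Dodge", "Ram", 2010), ("Ford", "F-150", 2012)),
]
for fields, count in tally_mix_ups(example_pairs).most_common():
    print(fields, count)
```

Errors that differ only in model or year are near misses, while a wrong make usually points to a visual failure such as clutter or deformation.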
Why does this matter?
Model training is expensive, and while hyperparameter optimization can make for more impactful models, it can increase modeling-related costs. In this post, we have walked through two different transfer learning methods in conjunction with Multitask Optimization that make model training and optimization more accessible to all. Surprisingly, or maybe not so surprisingly to some, we have seen that a shallower model tuned to its fullest extent (fine-tuned and hyperparameter optimized) substantially outperforms its deeper version (ResNet 50) used as a feature extractor. Not only does this shallow network result in better performance, it is also more cost- and time-effective per point of accuracy gained, even when paired with optimization. We see that Multitask Optimization’s ability to frugally learn the hyperparameter space, paired with the right orchestration, drives enough efficiency in wall-clock time and compute cost to make optimization feasible for most teams and most models.
Here, we have explored two techniques used for transfer learning, but there are many other ways to leverage a pre-trained network. For future work, we are most interested in exploring more transfer learning methodologies, such as freezing specific layers in the network and tuning others, to understand how best to build on a pre-trained model. This process could also be optimized to automatically learn which layers to freeze or unfreeze. Along with exploring transfer learning techniques, we are curious to understand how optimizing the pre-processing workflow can lead to gains in model performance. Specifically, as we have seen, the small number of images per label in the Cars dataset makes classification quite difficult. There are two potential strategies we would pursue in future analysis to address this challenge. First, we could bolster the dataset by hyperparameter tuning an image augmentation process. Second, we could use style transfer and optimization to augment the dataset further with artistic representations of the images and analyze which styles are most difficult to classify (my money is on Cubism).
With advances in data collection and innovation in architecture, computer vision has greatly advanced within the past decade. These advances in deep learning architecture have allowed more complex problems to be tackled, such as model interpretability, model resiliency, high-performing recommendation systems, and AI-based cybersecurity. Practical model development techniques, such as transfer learning and multitask learning, promise to give any team the power to bring these breakthroughs to bear on their own modeling use cases. In this post, we show that combining an architectural advance (ResNet), a practical modeling technique (transfer learning) and a novel approach to hyperparameter optimization (multitask) is capable of solving a difficult modeling problem (classifying a niche and minimally populated dataset). The result was a high-performing model that is achievable for any team, regardless of their resource constraints.
If you’re interested in more things computer vision, ICML and CVPR are coming up this summer (we’ll be there, so come say hi). Here’s the repo to play around with the code. To use SigOpt, please contact us to get started. Happy Modeling!
Thank you to Nick Payton, Michael McCourt, and Scott Clark for their thoughts and inputs.
[1] K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. In Proc. of CVPR 2016.
[2] A. Klein, S. Falkner, S. Bartels, P. Hennig, F. Hutter. Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets. In Proc. of AISTATS 2017.
[3] J. Krause, M. Stark, J. Deng, L. Fei-Fei. 3D Object Representations for Fine-Grained Categorization. In Proc. of ICCV 2013.