Black-Box Image Augmentation for Better Classification

Meghana Ravikumar

This blog was originally published on MLconf.

Data augmentation is a classic technique to introduce healthy noise to your dataset, generate more data, and just spice up your modeling workflow. The main idea behind data augmentation is that models learn more generalizable patterns the more data they see. For example, say you are training an image classifier with the following images of cats.

Not only will the model believe that all cats in the world are black, it will also assume that every cat sits all the time. In other words, the classifier is unable to form a holistic representation of a cat. So what would happen if you showed it the following images of cats?

The classifier wouldn’t be able to recognize that the objects in the images are in fact cats (I’m not convinced that the far right image is in fact a cat). So for small datasets, data augmentation is used as a way to introduce variance and provide your classifier with a better sense of the world.

Specifically, image augmentation is commonly used to supplement niche and small datasets for computer vision. Some of the most rudimentary data augmentation techniques entail simple transformations to an image. These transformations could be random crops, flips along the vertical or horizontal axis, whitening, and changes in color effects. More advanced techniques include applying style transfer to create novel representations of the images.

In his most recent work, Quoc Le explores using deep reinforcement learning to dynamically realize the best image augmentation techniques for a classic benchmarking dataset (ex: ImageNet) and a CNN architecture (ex: ResNet50). The introduction of these transformed images in the dataset allows the downstream model to learn more generalizable and robust features about the objects it is trying to classify. These generalizable features boost the model’s performance during training and when classifying unseen data. In this post, we will focus on automatically finding optimal augmentation parameters for simple image transformation techniques to help an image classifier learn more generalizable features and boost its classification accuracy.

We will add image augmentation to the image classifier from a previous post, Insights for Building High-Performing Image Classification Models. As we saw there, transfer learning and Multitask Optimization result in a high-performing image classification model for the Stanford Cars dataset. Specifically, fine-tuning ResNet 18 outperforms using ResNet 50 as a feature extractor. Paired with Multitask Optimization, a more efficient form of Bayesian Optimization, fine-tuning ResNet 18 results in our best performing model, achieving 87.33% accuracy. This model will be used as the baseline for the rest of the post. Image augmentation will be introduced into the training process of this model (fine-tuned ResNet 18) to build a new classifier and to explore the effects of image augmentation, both in terms of the augmented images produced and the model’s classification accuracy.

Can we improve the baseline image classifier’s performance?

Despite relatively high performance, an image-by-image review of errors suggests that the baseline model still faces significant challenges. From our previous post, we see that most of the challenges occur when the model classifies images taken at non-conventional angles (ex: angles with only a portion of the car in the image) and images that are very similar to one another in the label space (ex: an Audi S5 Coupe 2012 being misclassified as an Audi S6 Sedan 2011).

Some of the main difficulties our baseline model faces stem from the characteristics of the Stanford Cars dataset. The dataset consists of 16,185 high resolution photos of cars spanning 196 granular labels distinguished by Car, Make, Year, with each class making up approximately 0.5% of the whole. Its complexity arises from the specificity of the dataset, the small amount of data available per label, and the amount of data available as a whole. Essentially, the dataset is small compared to the number of labels, which means that our model may not be learning robust, generalizable features for each label. We will focus on mitigating the effects of this low ratio of data to labels.
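To make that ratio concrete, the figures quoted above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the counts from this section; the variable names are ours, and it assumes perfectly balanced classes, which the real dataset only approximates.

```python
# Counts quoted above for the Stanford Cars dataset.
num_images = 16_185
num_labels = 196

# Average number of images available per label.
images_per_label = num_images / num_labels

# Share of the dataset held by one label if classes were perfectly balanced.
class_share_pct = 100 / num_labels

print(f"{images_per_label:.1f} images per label on average")
print(f"{class_share_pct:.2f}% of the dataset per label")
```

Roughly eighty images per label, before any validation split, is not much for a model to learn fine-grained differences from.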

Our Approach

We will explore the hypothesis that by augmenting our dataset, we allow our model to learn more robust and generalizable features for each label and as a result will produce more accurate classifications than our previous model.

After visually inspecting different image augmentation options (courtesy of PyTorch and PIL), we decided to use horizontal flips and color jitter (saturation, contrast, brightness, and hue) as the main transformations to augment our dataset. These two transformations allow us to supplement our data and add healthy variance without distorting the images.
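To give a feel for what these two transformations do, here is a minimal, dependency-free sketch. It is not the PyTorch or PIL implementation used in the experiment: the function names, the simplified per-pixel definitions of brightness, contrast, saturation, and hue, and the parameter values are all assumptions made for illustration, and the output will not match a real library exactly.

```python
import colorsys


def horizontal_flip(image):
    """Mirror an image (a list of rows of RGB tuples) left to right."""
    return [list(reversed(row)) for row in image]


def _clamp(value):
    """Keep a channel value inside the valid [0, 1] range."""
    return max(0.0, min(1.0, value))


def jitter_pixel(rgb, brightness, contrast, saturation, hue_shift):
    """Apply simplified color adjustments to one RGB pixel with channels in [0, 1]."""
    # Brightness: scale every channel by the same factor.
    r, g, b = (_clamp(c * brightness) for c in rgb)
    # Contrast: push channels away from mid-gray (libraries typically use
    # the image mean instead of a fixed 0.5; this is a simplification).
    r, g, b = (_clamp(0.5 + (c - 0.5) * contrast) for c in (r, g, b))
    # Saturation and hue are adjusted in HSV space.
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    s = _clamp(s * saturation)
    h = (h + hue_shift) % 1.0
    return colorsys.hsv_to_rgb(h, s, v)


def augment(image, params):
    """Flip the image horizontally, then jitter the color of every pixel."""
    flipped = horizontal_flip(image)
    return [[jitter_pixel(pixel, **params) for pixel in row] for row in flipped]


# A tiny 1x2 "image": one red pixel and one blue pixel.
example = [[(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]]
# Illustrative values only; these are not the tuned parameters from the experiment.
params = {"brightness": 1.2, "contrast": 1.5, "saturation": 1.3, "hue_shift": 0.05}
augmented = augment(example, params)
```

Each of the four color parameters is a continuous knob, which is what makes the search described next expensive to do by hand.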

We will use our previous best-performing model as our baseline and re-train with image augmentation. The model training process for this experiment will include image augmentation for pre-processing and fine-tuning for model training. Optimal image augmentation techniques and parameters repeatedly turn out to differ from dataset to dataset [1]. As each image augmentation parameter has a large range of possible values, it is not feasible to try each value and each combination manually. Since each training cycle takes 4.9 hours, we will automate the search with Multitask Optimization (a type of Bayesian Optimization) to efficiently explore our image augmentation parameter space. Furthermore, we create a feedback loop to ensure that the parameter decisions are most beneficial for the model.

Let’s take a look at the feedback loop.

In the feedback loop above, SigOpt (our Multitask Optimization implementation) provides hyperparameter suggestions for image augmentation and model training. These hyperparameter suggestions are based on the downstream classification performance of the model. As we explore the parameter space for image augmentation and training together, we are able to home in on the optimal set of parameter values that allow the model to learn the most from the augmented and original images. Effectively, this allows the model to choose the image augmentation features that result in higher model performance.
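The shape of this loop (suggest, train, report, repeat) can be sketched in a few lines. To keep the sketch runnable without any external service, a simple explore/exploit heuristic stands in for the optimizer; it is not Bayesian or Multitask Optimization and it is not the SigOpt API. The parameter names, the bounds, and the scoring function are all hypothetical placeholders.

```python
import random

# Hypothetical search bounds, chosen only for illustration.
BOUNDS = {"saturation": (0.0, 2.0), "contrast": (0.0, 2.0), "log_lr": (-4.0, -1.0)}


def suggest(history, rng, explore=0.3):
    """Propose the next parameter set using earlier (params, score) observations."""
    if not history or rng.random() < explore:
        # Explore: sample uniformly from the bounds.
        return {name: rng.uniform(lo, hi) for name, (lo, hi) in BOUNDS.items()}
    # Exploit: perturb the best parameters observed so far.
    best_params, _ = max(history, key=lambda obs: obs[1])
    proposal = {}
    for name, (lo, hi) in BOUNDS.items():
        step = rng.gauss(0.0, 0.1 * (hi - lo))
        proposal[name] = min(hi, max(lo, best_params[name] + step))
    return proposal


def train_and_validate(params):
    """Placeholder for augmenting the data, fine-tuning, and scoring on validation.

    Returns a made-up score so the loop can run; it is not a real accuracy.
    """
    return -(
        (params["saturation"] - 1.5) ** 2
        + (params["contrast"] - 1.2) ** 2
        + (params["log_lr"] + 2.5) ** 2
    )


def run_feedback_loop(budget, seed=0):
    """Alternate between suggesting parameters and reporting the resulting score."""
    rng = random.Random(seed)
    history = []
    for _ in range(budget):
        params = suggest(history, rng)
        score = train_and_validate(params)
        history.append((params, score))  # report the observation back
    return max(history, key=lambda obs: obs[1])


best_params, best_score = run_feedback_loop(budget=50)
```

The key point is that augmentation parameters and training parameters sit in the same suggestion, so the only signal steering the augmentation is the downstream validation score.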

Through this experiment, we will explore the following questions: In terms of accuracy, how well does our model perform with image augmentation? What do the features selected through black-box image augmentation tell us about the model? Do the models learn from generalizable features that we as humans can identify? And, finally, what do the misclassifications look like?


We will be using the Stanford Cars dataset. Below are a couple of examples of the images.

For further information, please see the Data section of our previous post.


We will be using the best performing model from our previous post as a baseline. We introduce image augmentation for data preprocessing to the baseline and continue to use Multitask Optimization as our hyperparameter optimization method.

Experiment Design – The Feedback Loop

Here we will build on our previous post to out-perform our baseline by introducing image augmentation. Instead of manually tuning the parameters of the image transformation methods, this experiment includes these parameters in a greater hyperparameter tuning feedback loop where the parameters for preprocessing are tuned concurrently with model training parameters. Although image augmentation techniques have standardized over time, they often do not apply across datasets and might not explore all possible transformations [1]. Even with simple transformations such as saturation, hue, and contrast, trying all possible combinations is infeasible manually and expensive with grid and random search. With this hyperparameter feedback loop, we allow for exploration of the whole parameter space to find the truly optimal configurations for the dataset. Not only do we allow for a true search of the parameter space, we also use the model’s classification accuracy from training on the augmented images to inform the selection of augmentation parameters. In other words, by tuning the model’s hyperparameters and the augmentation parameters together, we identify augmentation parameters that directly benefit model training. This allows us to directly boost our model’s performance and gives us an understanding of which features in the images are important for the model.

More Experiment Design

Each image is augmented once, resulting in a total of 32,370 images: 16,185 augmented images and 16,185 original images. The validation set is a 20% split for each label and consists of only the original images, as model performance should be measured solely on the original Stanford Cars dataset. As in our previous experiment, the hyperparameter tuning for this experiment will be run across 20 parallel AWS EC2 instances.
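A per-label 20% split like the one described above could be sketched as follows. The helper name and the toy labels are ours. The post does not say whether augmented copies of validation images are used for training; this sketch assumes they are not, so that validation images stay unseen in any form.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split original-image indices into train and validation, label by label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, val = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return train, val


# Toy labels: 3 classes with 10 original images each.
labels = [name for name in ("a", "b", "c") for _ in range(10)]
train_idx, val_idx = stratified_split(labels)

# Assumption: training sees each original plus one augmented copy,
# while validation contains originals only.
train_items = [(i, "original") for i in train_idx] + [(i, "augmented") for i in train_idx]
val_items = [(i, "original") for i in val_idx]
```

Splitting label by label keeps every one of the 196 classes represented in validation, which matters when each class has so few images.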

Multitask Optimization will search the following hyperparameter space:

*Note: Batch Size grows in factors of 2 ex: 16, 32, 64, etc.
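One common way to express a constraint like this is to let the optimizer search over an integer exponent and convert it to a batch size inside the training code. Whether the original experiment encoded it this way is an assumption, and the exponent range below is hypothetical.

```python
def batch_size_from_exponent(exponent):
    """Map an integer search parameter to a power-of-two batch size."""
    return 2 ** exponent


# Hypothetical exponent range; the actual bounds are not given here.
candidate_batch_sizes = [batch_size_from_exponent(e) for e in range(4, 7)]
print(candidate_batch_sizes)  # [16, 32, 64]
```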

All trainings use the following common techniques:

*Note: All images that are augments of the original will go through a horizontal flip process.

**Note: The validation split is 20% of each label and contains only the original images (no augmented images).

The results will focus on model performance, and most interestingly, the features of the original images the augmentation amplifies.


The first question we set out to answer is whether image augmentation leads to improved classification accuracy. In the context of the Stanford Cars dataset and model architecture at hand, we find that image augmentation paired with Multitask Optimization gives us a 6.65% boost in accuracy!

Although each image is augmented once, it leads to a sizeable jump in accuracy. The image augmentation parameters are chosen in a black box fashion and are solely informed by the model’s validation performance at the end of 35 epochs.

This leads us to ask: does the black box selection highlight any human-identifiable features in the images? To answer this, let’s take a look at a handful of original images and their augments.

The black box augmentation transforms the original image to a hyper-vibrant image. The transformation applies heavy contrast to heighten the boundaries between the objects in the image. Similarly, a high level of saturation is applied to accentuate these boundaries and differentiate the objects. The high degree of saturation also shifts the color spectrum to its extremes, allowing for easier separation and better distinction of the objects. The power of these transformations is obvious in row 2 of the table above. Here the car is occluded by a lack of illumination and the banner text, but through these black-box transformations, is accentuated and highly visible. The transformation also makes the images brighter, negating white backgrounds and creating more obvious reflective patterns on the vehicles themselves. In general, the transformations accentuate boundaries between objects and make the color schemes less complicated than the ones found in the original image. This may allow for the model to learn the edge patterns of cars, and deal with background clutter, illumination, and intraclass variance better.
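To see why heavy contrast heightens boundaries, consider a single row of grayscale pixels crossing a soft edge. The sketch below uses one common definition of contrast adjustment (scaling values away from their mean); the function name, the pixel values, and the factor are made up for illustration and are not the tuned values from the experiment.

```python
def adjust_contrast(pixels, factor):
    """Scale grayscale values (0-255) away from their mean by `factor`, then clip."""
    mean = sum(pixels) / len(pixels)
    return [min(255.0, max(0.0, mean + factor * (p - mean))) for p in pixels]


# One row of pixels crossing a soft edge, e.g. a dark car body against a lighter background.
row = [90, 95, 100, 140, 145, 150]
boosted = adjust_contrast(row, factor=2.0)

# Size of the jump across the edge before and after the adjustment.
edge_before = row[3] - row[2]
edge_after = boosted[3] - boosted[2]
print(edge_before, edge_after)  # 40 80.0
```

The step across the edge doubles while the ordering of the pixels is preserved, which is consistent with the idea that the model benefits from sharper object boundaries.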

Finally, our fourth question: what is the real-world impact of misclassifications? As we see here, the ResNet 18 + HPO model’s performance falters in the face of too much background clutter or images taken at different angles. The augmented images from our black box augmentation-optimization process should address these issues. Let’s take a look at common misclassifications for our best performing augmented model.

Ninety percent of the misclassifications by the augmented model are similar to the misclassification represented in the first row of the above table. These misclassifications are related to slight errors in the car make but are otherwise consistent with the actual Car, Make, and Year. The other ten percent of misclassifications are more dissimilar in classification labels as seen in the rest of the rows in the above table. Despite some of these images being challenging to classify, the model makes an educated guess and chooses a Car, Make, and Year that most resembles the body of the car in the image. We see that the model is able to group and recognize large cargo vans, small coupe/convertible models, and even make a solid judgement call on the shape of the Acura Integra Type R 2001 (row 2) to predict a label despite minimal information. This could indicate that the augmented and hyperparameter optimized model is able to learn more robust features for the labels.

Why does this matter?

Transfer learning has become a critical training methodology to curtail training expenses and leverage large benchmarking datasets (such as ImageNet and COCO). In our previous post, we explored using Multitask Optimization in conjunction with two different transfer learning methods to make model training and optimization more accessible to all. In this post, we apply image augmentation in the context of transfer learning and Multitask Optimization, resulting in a higher performing model. By introducing hyperparameter optimization as a holistic approach to the modeling workflow, we are able to successfully leverage our model training to inform our augmentation techniques. We see not only that black-box optimization techniques for image augmentation are performant, but also that they help us learn about our models.

In this experiment, through a feedback loop, we allow the model to choose the features of the images that work best for it to classify images. The resulting augmented images are somewhat perplexing. As computer vision problem areas are focused around too much background clutter, intraclass variance, occlusion, and illumination, we would expect these causes of concern to be directly addressed and mitigated by the model. Instead, we see the background highlighted even more, illumination heightened, types of occlusion welcomed, contrast accentuated, hues changed, and saturation maximized. At least to me, these transformations are non-intuitive. So, why are these transformations important? What features is the model learning from these images? What problems do these transformations solve? How would this model perform with more test data? Would these transformations hold for a deeper ResNet model or a different architecture? As we hypothesized before, it could be that the transformations provide better boundary lines between objects in the image. Maybe the transformations are related to the model learning robust versus non-robust features, as discussed by Madry [2]. Or maybe these transformations add healthy noise that is better for generalization.

The above are examples of augmented images. In both images, the augmentation has accentuated the difference between the car and background. On the left, we see that some occlusion also exists with the yellow road overpowering the lower third of the car.

For future work, it would be fascinating to answer these questions related to robustness and model interpretability to understand how these image augmentation transformations affect features learnt by the model. Secondly, it would also be interesting to see how image augmentation transformations learnt for this data-model pair work for different datasets, models, and data-model pairs. This would give us a better understanding of specific dataset and model properties to provide a framework as to how to leverage other models and datasets as baselines.

As computer vision rapidly advances, we see more work in model robustness and interpretability. We see a growing understanding of how assumptions made for a single dataset and model architecture pair falter in the face of other dataset-model pairs [1], leading to active research in establishing effective transfer methods and novel baselines, and in standardizing historic knowledge. In our previous post, we learned how to make the most out of a pre-trained model. Here, we leverage black-box optimization not only to out-perform our previous best, but also to learn more about our model. By using a simple feedback loop and Multitask Optimization, we find parameters best suited for our data-model relationship and can use this as a baseline to further explore policies for similar problems.

Next Steps

We’ll also be at NeurIPS in December. Come stop by and say hello! Here’s the repo and a live link to the finished SigOpt experiment. To try out SigOpt for yourself, please contact us. Happy Modeling!


Thank you to Nick Payton and Tobias Andreasen for their thoughts and inputs.


[1] E.D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, Q.V. Le. AutoAugment: Learning Augmentation Strategies from Data. In Proc. of CVPR 2019.

[2] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, A. Madry. Adversarial Examples Are Not Bugs, They Are Features.


Meghana Ravikumar AI Product Manager