Parametrizing Data Augmentation in COVID-Net

Olivia Kim, Michael McCourt, and Linda Wang
Tags: Advanced Optimization Techniques, Convolutional Neural Networks, Data Augmentation, Deep Learning, Healthcare, Multimetric Optimization

At SigOpt, we are thrilled to collaborate with the outstanding community of experts from around the world. In this post, we discuss a recent collaboration with Linda Wang of the Vision and Image Processing Lab (VIP Lab) at the University of Waterloo. In recent months, Linda has been working with DarwinAI Corp. to develop COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest X-Ray Images; a preprint of the resulting article is available here.

Creating COVID-Net, as with any neural network, required many decisions by Linda and the VIP Lab and DarwinAI teams, including decisions about the training strategy, loss function, and data augmentation. SigOpt is a tool for efficiently and adaptively testing such decisions to identify high performing outcomes. In support of these efforts, we worked with Linda to run an experiment on one of the COVID-Net models that she and the VIP Lab and DarwinAI teams had already created, exploring the significance of certain parameters related to training, the loss function, and data augmentation using SigOpt's multimetric optimization feature. This experiment, and the tools used, are part of SigOpt's academic program: researchers at universities and non-profits can use SigOpt at no cost. This blog post presents the results of this experimentation and some insights regarding the significance of certain parameters.

COVID-Net project

The VIP Lab team and the DarwinAI team developed the COVID-Net project with the goal of openly developing neural networks to detect COVID-19 cases using chest X-ray images, giving access to models and datasets upon which the global community can build and experiment. They are in regular consultation with physicians to determine the diagnostic needs of medical professionals. X-rays were labeled as either being COVID-19, pneumonia or healthy. As of April 26, 2020, the dataset compiled by the VIP Lab and DarwinAI teams is described in the table below.

             Healthy   Pneumonia   COVID-19
Training       7966       5451        152
Validation      100        100         31

Table 1: Dataset distribution; the COVID-19 class is significantly underrepresented.

The dataset continues to grow, so please visit the GitHub repo for the most up-to-date information.

As is always the case when building classification tools, there is a need to balance the potential for false positives against the potential for false negatives. With SigOpt's multimetric optimization tool, we investigate the impact of free parameters on performance and identify how the parameters can be set to reach suitable performance on both factors. In the following sections, we

  • Define some of the free parameters considered in the training and data augmentation process used in COVID-Net,
  • Review the performance of COVID-Net for parameters tested during the optimization process, focusing on those best performing under practical circumstances, and
  • Explore the impact of the parameters on reaching high performing neural networks.

Before continuing, we want to explain the aspects of development that were not explored during this experiment. Key among these is the neural network architecture, for which we use the COVIDNet-Large architecture provided in the GitHub repo. Additionally, we make no changes to the datasets which have been defined for training and validation purposes. We also leverage the data augmentation strategy that the VIP Lab and DarwinAI teams provided; this is explained below. We fixed all training runs at 10 epochs.

COVID-Net tuning using SigOpt

As can be observed above, the dataset is extremely imbalanced, with only 1.1% of the training cases labeled as COVID-19. To account for this, the loss function (which is the same loss function originally used to train COVID-Net) weights the COVID-19 cases more heavily during neural network training; this weight is one of the free parameters that we analyzed in the tuning process.
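To make the weighting concrete, a minimal sketch of a class-weighted cross-entropy loss is shown below. This is our own illustration of the idea, not the actual COVID-Net loss code; the function name and the three-class label convention (0 = healthy, 1 = pneumonia, 2 = COVID-19) are assumptions for this example.

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """Mean cross-entropy where each example's loss is scaled by its class weight.

    probs: (n, 3) array of predicted class probabilities
    labels: (n,) array of integer labels (0=healthy, 1=pneumonia, 2=COVID-19)
    class_weights: length-3 array; the COVID-19 entry corresponds to the
    tuned weight in [5, 60] discussed in this post.
    """
    n = len(labels)
    # negative log-likelihood of the correct class for each example
    nll = -np.log(probs[np.arange(n), labels])
    # scale each example's loss by the weight of its true class, then average
    return float(np.mean(np.asarray(class_weights)[labels] * nll))
```

With a large COVID-19 weight, a misclassified COVID-19 example contributes far more to the loss than a misclassified healthy or pneumonia example, nudging the optimizer to attend to the rare class.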

Additionally, the VIP Lab and DarwinAI teams used data augmentation to counter the low number of COVID-19 labels; each image, during training, was randomly perturbed. The augmentation strategies they used involved increasing the brightness, rotating and translating images, performing horizontal flipping on the images, and zooming in to the center. The images were also preprocessed to crop some hospital-specific figures from the top 1/6th of each image.  For each batch consumed during training, exactly 25% of the images were COVID-19 samples (chosen randomly and modified randomly).
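The fixed-fraction batch construction can be sketched as follows. This is a simplified illustration of the idea, assuming index arrays for the COVID-19 and non-COVID-19 training images; the function name and signature are our own, not code from the COVID-Net repo.

```python
import numpy as np

def balanced_batch_indices(covid_idx, other_idx, batch_size,
                           covid_fraction=0.25, rng=None):
    """Sample one training batch so that a fixed fraction is COVID-19 cases.

    covid_idx / other_idx: arrays of dataset indices for COVID-19 and
    non-COVID-19 training images, respectively.
    """
    if rng is None:
        rng = np.random.default_rng()
    n_covid = int(round(batch_size * covid_fraction))
    # oversample the rare class with replacement; sample the rest without
    covid = rng.choice(covid_idx, size=n_covid, replace=True)
    other = rng.choice(other_idx, size=batch_size - n_covid, replace=False)
    batch = np.concatenate([covid, other])
    rng.shuffle(batch)
    return batch
```

Each sampled image would then be passed through the random augmentation pipeline described above, so repeated COVID-19 samples still differ from batch to batch.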

The two competing metrics to be maximized during this parameter tuning are

  • COVID-19 PPV (fraction of positive predictions which were actually positive), and
  • COVID-19 sensitivity (fraction of positive cases which were correctly predicted).

We set minimum thresholds of 0.75 on both of these metrics to define the practical expectations (that no classifier will be helpful if it simply always predicts positive COVID-19 cases); this serves to help SigOpt better focus its efforts on the most practically beneficial outcomes.
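Both metrics come directly from confusion-matrix counts on the COVID-19 class; the helper below is our own illustration, not code from the COVID-Net repo.

```python
def ppv_and_sensitivity(true_pos, false_pos, false_neg):
    """PPV = TP / (TP + FP); sensitivity = TP / (TP + FN)."""
    ppv = true_pos / (true_pos + false_pos)
    sensitivity = true_pos / (true_pos + false_neg)
    return ppv, sensitivity

# e.g., catching 29 of the 31 COVID-19 validation cases with 1 false alarm
ppv, sens = ppv_and_sensitivity(true_pos=29, false_pos=1, false_neg=2)
```

Note that with only 31 positive validation cases, sensitivity can only take values of the form k/31, so the observed metrics cluster at a handful of discrete levels.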

Our multimetric experimentation seeks to adaptively explore the impact of these parameters to understand the tradeoff between improving PPV and sensitivity of COVID-Net. Here, we summarize the seven parameters we tuned during this experimentation, which are grouped into three categories.

  • Adam parameters
    • Learning rate – [10⁻¹⁰, 10⁻¹]
    • Batch size – {4, 8, 16, 32}
  • Loss function parameters
    • COVID-19 class weighting – [5, 60]
  • Data augmentation parameters
    • Brightness – [0, 0.3]
    • Rotation – [0, 15]
    • Translation – [0, 25]
    • Zoom – [0, 0.2]

These parameter ranges were chosen after some minor searching, with a bias towards larger regions to promote exploration. The augmentation parameters correspond to arguments of Keras' ImageDataGenerator tool, as shown in the code below; the number 224 appears because the images are rescaled to dimensions 224×224.

augmentation_operation = ImageDataGenerator(
    featurewise_center=False,
    featurewise_std_normalization=False,
    rotation_range=augmentation_rotation,               # degrees
    width_shift_range=augmentation_translation / 224,   # pixels -> fraction of width
    height_shift_range=augmentation_translation / 224,  # pixels -> fraction of height
    horizontal_flip=True,
    brightness_range=(1 - augmentation_brightness, 1 + augmentation_brightness),
    zoom_range=(1 - augmentation_zoom, 1 + augmentation_zoom),
    fill_mode='constant',   # fill exposed regions with a constant value
    cval=0.,                # that constant value is black
)
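For reference, the multimetric search itself could be specified along the following lines with SigOpt's API. The parameter and metric names here, and the use of a log transformation for the learning rate, are our own illustration rather than the exact experiment configuration.

```python
# Illustrative SigOpt experiment definition; names are our own, not the
# exact configuration used in this post.
experiment_meta = dict(
    name='COVID-Net multimetric tuning',
    parameters=[
        dict(name='learning_rate', type='double',
             bounds=dict(min=1e-10, max=1e-1), transformation='log'),
        dict(name='batch_size', type='categorical',
             categorical_values=['4', '8', '16', '32']),
        dict(name='covid_class_weight', type='double', bounds=dict(min=5, max=60)),
        dict(name='augmentation_brightness', type='double', bounds=dict(min=0, max=0.3)),
        dict(name='augmentation_rotation', type='double', bounds=dict(min=0, max=15)),
        dict(name='augmentation_translation', type='double', bounds=dict(min=0, max=25)),
        dict(name='augmentation_zoom', type='double', bounds=dict(min=0, max=0.2)),
    ],
    # both metrics are maximized, with the 0.75 practicality thresholds
    metrics=[
        dict(name='covid_ppv', objective='maximize', threshold=0.75),
        dict(name='covid_sensitivity', objective='maximize', threshold=0.75),
    ],
    observation_budget=75,
)
# With the sigopt Python client, this would be created via something like
#   conn.experiments().create(**experiment_meta)
```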

Tuning results

We used a p3.2xlarge machine on AWS to run these experiments; for reference, a single 10-epoch training run took roughly an hour.

In the figure below, we present the observed feasible outcomes from the search; we evaluated 75 suggested parameter configurations, with the first 22 manually designed based on earlier iterations of COVID-Net.

[Figure: numerical results of the multimetric experiment]

Figure 1: Feasible metric values identified during the tuning process; the right panel provides a zoomed in view.  The minimum thresholds of 0.75 were used to inform the optimizer about the most desired outcomes.  The metrics take starkly discrete values because of the low number (31) of positive COVID-19 cases in the validation set.

The left panel of Figure 1 demonstrates the strongly competitive nature of these metrics (achieving the highest sensitivity value requires accepting the lowest PPV value).  However, within the thresholds set earlier, the metrics are only slightly competitive, though the results are less dense than we might desire because of the discrete nature of these metrics.  The discrete metric values also hide the fact that 75 parametrizations were considered from the 7-dimensional parameter space, since several ended with matching metric values.

The right panel zooms into the region of interest, depicting two Pareto efficient results of sensitivity/PPV 0.94/0.97 and 0.87/1.00.  All results, along with the associated parametrizations, can be found at the experiment link.

In discussions with physicians, the VIP Lab and DarwinAI teams identified that, given the current diagnostic situation, higher sensitivity is more valuable than higher PPV; this amounts to pushing the number of false negatives as low as possible.  The table below shows a selection of the high performing metric values, including the sensitivity/PPV of the healthy and pneumonia X-ray classes.

                 COVID-19             Healthy              Pneumonia
Observation ID   Sensitivity   PPV    Sensitivity   PPV    Sensitivity   PPV
14527723         0.94          0.97   0.95          0.91   0.91          0.94
14522424         0.90          0.97   0.95          0.91   0.91          0.93
14517347         0.90          0.97   0.94          0.92   0.93          0.93
14521182         0.90          0.93   0.96          0.91   0.91          0.95
14527981         0.87          1.00   0.90          0.94   0.96          0.89

Table 2: A selection of the highest COVID-19 sensitivity results.  While these PPV and sensitivity metrics are, indeed, competitive, there were several results which saw both sensitivity and PPV above 0.9 for all three classes.

Parameter analysis

As part of this tuning process, we wanted to identify the significance of different parameters in impacting the final performance.  The figure below provides a visualization of how high performing results tended to be clustered with certain parameter values.

[Figure: parallel axes graph demonstrating the significance of certain parameters to high performing outcomes]

Figure 2: Parallel axes plot of high performing results, as first defined in the right panel of Figure 1.  Certain parameters, such as the learning rate and the batch size, have a significant impact on the outcome.  The class weighting, surprisingly, does not seem to have as much impact.

The most obvious insight is that the batch size has a great impact on the outcome, with all of the high performing results occurring at a batch size of 8.  This is, at least in part, a consequence of the VIP Lab and DarwinAI teams' choice to make every training batch contain a fixed fraction of COVID-19 examples; we did not experiment with other strategies for balancing the undersampled class, which might have yielded different insights.  Something similar could be said about the learning rate, but we suspect its structure corresponds to the standard behavior of learning rates having a single high performing region (see, e.g., page 424).

The data augmentation process seems very relevant in improving the performance of the classifier.  Of particular note is that high translation and rotation values seem to be very beneficial.  We do not speculate here as to why this may be the case, though it certainly merits further investigation.

In contrast, the classifier seems to benefit from low zoom, and it appears largely insensitive to changes in brightness values.  Perhaps there is some affinity for values in the middle of our domain, but not to the exclusion of other regions.  This may be a byproduct of the choice to crop the top 1/6th of the images before processing and the fact that, while the images may suffer from translation and rotation complications, brightness was already consistent across the images.

One surprise in this situation is that the class weighting, by which the loss associated with COVID-19 training examples was multiplied, seemed to have no significant impact on the outcome.  So long as the value was acceptably high (perhaps 15 or 20), there was no obvious benefit to increasing it further.  We speculate that the inclusion of 25% COVID-19 samples in each batch may have nullified the need for this parameter.  Allowing higher values might be worth exploring, but given that the classifier performance is already near its peak, there may be little to gain from this parameter.

Call to action

Our experimental process has provided some interesting insights regarding the value of data augmentation in this classification process.  In particular, carefully tuned data augmentation has yielded improved sensitivity and PPV in COVID-Net beyond the previous baselines.  We hope that some of these augmentation results, in particular, can yield similar improvements in future research.  If any of the readers of this post are working on academic research that would benefit from efficient parameter optimization, SigOpt’s academic program is provided at no cost.

We want to thank the healthcare workers and first responders around the world who are making incredible contributions to society at the most difficult and pressing time. We also want to explicitly thank the physicians that have consulted with the VIP Lab and DarwinAI teams on the COVID-Net project.

To readers of this blog post engaged in diagnosing COVID-19 cases, additional research would benefit from more positive COVID-19 X-rays to help improve the performance of these diagnostic tools. If you have labeled X-ray data available, please contact [email protected] or visit https://figure1.typeform.com/to/lLrHwv.  To anyone working on COVID-19 research, if you think SigOpt would be beneficial, please contact us for complimentary access.

Olivia Kim, Software Engineer
Michael McCourt, Research Engineer
Linda Wang, Guest Author