In this blog post, we demonstrate the use of Metric Constraints in an application. We take inspiration from MicronNet: A Highly Compact Deep Convolutional Neural Network Architecture for Real-time Embedded Traffic Sign Classification by Wong et al. and analyze the German Traffic Sign Recognition Benchmark (GTSRB) with the intent of optimizing top-1 accuracy under a constraint on the model size (measured as the total number of parameters). We compare this feature with Multimetric optimization on the same dataset to illustrate how the two serve different purposes and needs.
Data Preprocessing
We use a cropped and resized version of the dataset instead of the original .ppm images. The training, validation, and test splits are prepared as follows:
Number of training samples: 34,799
Number of validation samples: 4,410
Number of test samples: 12,630
Image data shape: (32, 32, 3)
Number of classes: 43
Due to class imbalance in the original GTSRB, we perform several data augmentation steps, such as flipping images horizontally/vertically and applying projective transformations. The class distributions before and after data augmentation are shown below; the augmented training set contains 146,874 samples. We do not augment the validation and test sets. You can find the preprocessing/augmentation pipeline in this Colab notebook.
Figure 1. Class distribution of the raw training data.
Figure 2. Class distribution of the augmented training data.
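For illustration, here is a minimal sketch of the kind of augmentation we apply, using OpenCV. The helper names and the exact transform parameters are our own illustrative choices, not the precise pipeline from the notebook.

```python
import cv2
import numpy as np

def random_projective_transform(image, max_shift=4, rng=None):
    """Warp an image with a mild random perspective (projective) transform."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    src = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    # Jitter each corner by a few pixels to simulate a change in viewpoint.
    dst = (src + rng.uniform(-max_shift, max_shift, size=src.shape)).astype(np.float32)
    matrix = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, matrix, (w, h), borderMode=cv2.BORDER_REFLECT)

def augment(image, flip_horizontal=False, flip_vertical=False):
    """Optionally flip, then apply a random projective transform.

    Flips are only applied to sign classes that remain valid after flipping
    (or that map onto another class), so the flags are set per class.
    """
    if flip_horizontal:
        image = cv2.flip(image, 1)
    if flip_vertical:
        image = cv2.flip(image, 0)
    return random_projective_transform(image)
```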
Setting up the SigOpt Experiment
We follow the CNN architecture proposed in the MicronNet paper with a few modifications[1]. The model architecture and the associated hyperparameters are presented in Table 1.
Table 1. The model architecture and the search range for each architecture hyperparameter.
| Layer / Strides / Paddings | Filter Shape |
| --- | --- |
| Conv2D / S1 / P0 | 1 × 1 × 1 |
| Conv2D / S1 / P0 | kernel_size_1 × kernel_size_1 × num_filters_1 |
| MaxPooling2D / S2 / P0 | 3 × 3 |
| Conv2D / S1 / same | kernel_size_2 × kernel_size_2 × num_filters_2 |
| MaxPooling2D / S2 / P0 | 3 × 3 |
| Conv2D / S1 / same | kernel_size_3 × kernel_size_3 × num_filters_3 |
| MaxPooling2D / S2 / P0 | 3 × 3 |
| Dense / S1 | 1 × fc_1 |
| Dense / S1 | 1 × fc_2 |
| Softmax / S1 | 1 × 43 |
| Hyperparameter | Range |
| --- | --- |
| kernel_size_1 | [2, 7] |
| num_filters_1 | [10, 50] |
| kernel_size_2 | [2, 7] |
| num_filters_2 | [30, 70] |
| kernel_size_3 | [2, 7] |
| num_filters_3 | [40, 160] |
| fc_1 | [10, 1000] |
| fc_2 | [10, 1000] |
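To make Table 1 concrete, below is a minimal Keras sketch of how such a network could be assembled from a dictionary of hyperparameter assignments. Activations and other details (e.g., batch normalization) follow the linked notebook, so treat this as an illustrative approximation rather than the exact model we trained.

```python
from tensorflow.keras import layers, models

def build_model(params, num_classes=43, input_shape=(32, 32, 3)):
    """Assemble the Table 1 architecture from a dict of hyperparameter assignments."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(1, kernel_size=1, activation="relu"),
        layers.Conv2D(params["num_filters_1"], params["kernel_size_1"], activation="relu"),
        layers.MaxPooling2D(pool_size=3, strides=2),
        layers.Conv2D(params["num_filters_2"], params["kernel_size_2"],
                      padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=3, strides=2),
        layers.Conv2D(params["num_filters_3"], params["kernel_size_3"],
                      padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=3, strides=2),
        layers.Flatten(),
        layers.Dense(params["fc_1"], activation="relu"),
        layers.Dense(params["fc_2"], activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
```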
We use stochastic gradient descent (SGD) to train the network with the following fixed settings: a base learning rate of 0.01, a learning rate decay of 1e-4, a momentum of 0.9, and a batch size of 32. We train each hyperparameter configuration for 10 epochs. While the optimizer hyperparameters could also be included in this efficient search (see, e.g., this post), we fix them in this experiment to focus entirely on the architecture hyperparameters. You can find the code for this experiment in this Colab notebook.
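A training sketch that matches these settings might look roughly like the following. Here `x_train`/`y_train` and `x_val`/`y_val` are assumed to be the augmented training set and the untouched validation set, the loss assumes integer class labels, and the inverse-time learning rate schedule is our approximation of the classic per-step decay of 1e-4.

```python
import tensorflow as tf

def train_and_evaluate(params, x_train, y_train, x_val, y_val):
    """Train one configuration for 10 epochs and return (val accuracy, parameter count)."""
    model = build_model(params)  # see the architecture sketch above

    # Inverse-time decay approximates a per-step learning rate decay of 1e-4.
    schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
        initial_learning_rate=0.01, decay_steps=1, decay_rate=1e-4)
    optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)

    model.compile(optimizer=optimizer,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        batch_size=32, epochs=10)

    validation_accuracy = history.history["val_accuracy"][-1]
    model_size = model.count_params()  # total number of parameters
    return validation_accuracy, model_size
```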
We optimize the top-1 validation accuracy of the network under a model size constraint. We set the threshold on the model size to 0.15M parameters, meaning that as long as the network has fewer than 0.15M parameters, we are indifferent to its size and only want to maximize the validation accuracy. The SigOpt experiment can be found here. We have updated the web experience for the Metric Constraints feature, as displayed in Figures 3 and 4 below; a sketch of the corresponding experiment setup follows the figures.
Figure 3. Inspecting the Experiment History plot for each metric. Note that observations that do not satisfy the thresholds are grayed out.
Figure 4. Updated History page displays the thresholds set for constraint metrics.
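For reference, the experiment above can be created and driven with the SigOpt Python client roughly as follows. The parameter ranges match Table 1 and the constraint metric carries the 0.15M-parameter threshold; the experiment name, the observation budget, and the `train_and_evaluate` helper (from the training sketch above) are illustrative placeholders rather than a verbatim copy of our experiment code.

```python
from sigopt import Connection

conn = Connection(client_token="YOUR_SIGOPT_API_TOKEN")

experiment = conn.experiments().create(
    name="GTSRB CNN with a model size constraint",  # placeholder name
    parameters=[
        dict(name="kernel_size_1", type="int", bounds=dict(min=2, max=7)),
        dict(name="num_filters_1", type="int", bounds=dict(min=10, max=50)),
        dict(name="kernel_size_2", type="int", bounds=dict(min=2, max=7)),
        dict(name="num_filters_2", type="int", bounds=dict(min=30, max=70)),
        dict(name="kernel_size_3", type="int", bounds=dict(min=2, max=7)),
        dict(name="num_filters_3", type="int", bounds=dict(min=40, max=160)),
        dict(name="fc_1", type="int", bounds=dict(min=10, max=1000)),
        dict(name="fc_2", type="int", bounds=dict(min=10, max=1000)),
    ],
    metrics=[
        # Metric to optimize.
        dict(name="validation_accuracy", objective="maximize", strategy="optimize"),
        # Constraint metric: model size in millions of parameters must stay below 0.15.
        dict(name="model_size", objective="minimize", strategy="constraint", threshold=0.15),
    ],
    observation_budget=200,  # placeholder budget
)

while experiment.progress.observation_count < experiment.observation_budget:
    suggestion = conn.experiments(experiment.id).suggestions().create()
    # train_and_evaluate wraps the model-building and training sketches above.
    accuracy, num_params = train_and_evaluate(
        suggestion.assignments, x_train, y_train, x_val, y_val)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        values=[
            dict(name="validation_accuracy", value=accuracy),
            dict(name="model_size", value=num_params / 1e6),
        ],
    )
    experiment = conn.experiments(experiment.id).fetch()
```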
Comparison to Multimetric Experiment
For a side-by-side comparison, we also ran a Multimetric experiment with the same setup, in which we jointly maximize the accuracy and minimize the model size. The goal of Multimetric experiments is to find the Pareto-optimal solutions, which are useful for understanding the tradeoff between the metrics.
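In terms of setup, the only difference from the constrained experiment is the metrics definition: both metrics are optimized and neither carries a threshold. Roughly:

```python
# Multimetric setup: both metrics are jointly optimized, with no threshold,
# so SigOpt explores the full accuracy-vs-size Pareto frontier.
multimetric_metrics = [
    dict(name="validation_accuracy", objective="maximize", strategy="optimize"),
    dict(name="model_size", objective="minimize", strategy="optimize"),
]
```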
For Multimetric experiments, there is no sense of ordering among the Pareto-optimal solutions (they are all equally good); therefore, SigOpt continuously searches for additional network configurations along the Pareto frontier, sacrificing accuracy for smaller size and vice versa. This is in contrast to Metric Constraints, where a smaller network (once below the threshold) has no additional benefit. We illustrate this difference between the two optimization strategies in Figure 5.
Figure 5. Comparing the resulting metric values of Metric Constraints and Multimetric experiments. Right panel shows the zoomed-in plot of all models with validation accuracy ≥ 0.9 and size ≤ 0.15.
In this example, the Metric Constraints feature finds more high-performing networks that satisfy the size constraint. In comparison, the Multimetric feature also searches for smaller networks at the expense of validation accuracy. If understanding the trade-off or mapping the entire Pareto frontier is not a high priority, we recommend Metric Constraints, as it achieves high validation accuracy with fewer observations.
Combining Metric Constraints and Multimetric
We can also use the Metric Constraints feature in conjunction with Multimetric experiments. Suppose we are interested in understanding the tradeoff between validation accuracy and multiply-accumulate operations (MACs) while keeping the same constraint on the size of the network[2]. You can find this SigOpt experiment here.
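Configuration-wise, this again only changes the metrics list: two optimized metrics plus one constraint metric, roughly as sketched below (how MACs are counted for a given architecture is omitted here).

```python
# Two optimized metrics (accuracy and MACs) plus one constraint metric (model size).
combined_metrics = [
    dict(name="validation_accuracy", objective="maximize", strategy="optimize"),
    dict(name="macs", objective="minimize", strategy="optimize"),
    dict(name="model_size", objective="minimize", strategy="constraint", threshold=0.15),
]
```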
Figure 6. Validation accuracy vs. MACs as displayed on the SigOpt Experiment Analysis page. The Pareto-optimal solutions are highlighted in orange. The gray points are network configurations that do not satisfy the size constraint of 0.15M parameters.
As shown in Figure 6, we can observe the Pareto frontier of validation accuracy and MACs subject to the network size constraint. Note that several configurations that would otherwise have been on the frontier are grayed out because they do not meet the size constraint. Additionally, we can further inspect how the MACs and size metrics, or the size and validation accuracy metrics, correlate with each other on the Experiment Analysis page.
Figure 7. Inspecting the Experiment History plots for different metrics.
Remarks
Metric Constraints allow SigOpt to factor additional metrics into the optimization process. However, this additional capability comes at a cost. First, SigOpt needs to maintain a separate model for each metric, increasing the computational burden of maintaining and updating these models and of generating suggestions from them. Second, with each additional metric constraint, the feasible parameter space shrinks; simply locating the feasible region (before doing any optimization at all) therefore becomes increasingly hard.
When the observations are noisy, interpreting results is also difficult. We have briefly discussed how to interpret uncertainty for multimetric problems, in particular the idea of a probabilistic Pareto frontier, in a previous blog post. Similarly, for Metric Constraints experiments the feasibility of each observation is not deterministic; we would therefore first need a method to compute the probabilistic feasibility of each observation, and then understand how to weigh the optimized metric value (which is also noisy) against that probabilistic feasibility. All of these remain open research questions.
Here at SigOpt, we are continuously improving and expanding our product to empower our users. We look forward to seeing how our users leverage the Metric Constraints feature to tackle new problems. If you have comments or questions, please reach out to our customer success team.
Use SigOpt free. Sign up today.