This is a summary of the paper Fast AutoAugment.

Written by Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, Sungwoong Kim.



Introduction and problem statement

Data augmentation has become an important technique for improving the generalization of trained models. It also acts as a natural regularizer, preventing complex model architectures from overfitting.

As machine learning moves toward automating itself, replacing dataset-specific expert knowledge with learned parameters, data augmentation is also being automated. AutoAugment is one such work; it uses reinforcement learning to find an optimal data augmentation policy [2]. The method is very effective and outperformed the state-of-the-art methods of its time, but it is rather slow, as shown in Table 1. In Fast AutoAugment, the authors tackle this problem and propose a more efficient data augmentation algorithm based on Bayesian optimization for faster policy search. Rather than training a separate model for each policy update, Fast AutoAugment samples new candidate policies via Bayesian optimization and then evaluates them through test-time augmentation on an already trained model to select the best-performing policies. In this sense, the method tries to match the augmented data distribution to the original distribution as closely as possible through density matching.


Table 1. Comparison of error rate (%) to the best results obtained at the time of publishing the paper.

(Top-5 for ImageNet and Top-1 for others). GPU-hours estimated on NVIDIA Tesla P100.

Methodology

Search space

The Fast AutoAugment method learns an augmentation policy: a sequence of augmentation operations, each with several parameters, that can be applied to the training set. Applying a policy to the training set yields an augmented training set. Training the model on this augmented dataset makes it more robust and invariant to variations in the input data.

Each policy consists of many sub-policies, and each sub-policy consists of two operations applied one after another: the first operation is applied to the image with some probability, and the second operation, again with its own probability, is then applied to the (possibly transformed) result.

Each operation has three parameters. These three parameters span the search space:

  • type of operation (rotation, flip, etc)
  • magnitude (the strength with which the operation is applied)
  • probability of applying operation
\bar{\mathcal{O}}(x; p, \lambda) := \begin{cases} \mathcal{O}(x; \lambda) & \text{with probability } p,\\ x & \text{with probability } 1 - p. \end{cases}
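The probabilistic operation and the two-step sub-policy above can be sketched in a few lines of Python. This is an illustrative stand-in, not the paper's implementation: the helper names (`apply_op`, `apply_subpolicy`) and the toy numeric "images" are assumptions made for the example.

```python
import random

def apply_op(x, op, p, magnitude):
    """Apply op(x, magnitude) with probability p, otherwise return x unchanged."""
    if random.random() < p:
        return op(x, magnitude)
    return x

def apply_subpolicy(x, subpolicy):
    """A sub-policy is two operations applied one after another."""
    for op, p, magnitude in subpolicy:
        x = apply_op(x, op, p, magnitude)
    return x

# Toy example with numbers standing in for images:
brighten = lambda x, m: x + m  # hypothetical "brightness" op
scale = lambda x, m: x * m     # hypothetical "scaling" op
subpolicy = [(brighten, 1.0, 2), (scale, 0.5, 3)]
result = apply_subpolicy(0, subpolicy)  # 2, then 6 with probability 0.5
```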


Search strategy

The search algorithm looks for the optimal policy, i.e. for a sequence of operations with specific types, magnitudes, and probabilities.

The overall method works as follows: the training set is split into K folds by stratified shuffling, so each fold contains an equal percentage of every class [4]. Each fold is then split into two parts, D_A and D_M, the augmentation and model parts of the data. The augmentation part is used by the Bayesian optimization algorithm to search for the best policy, while the model part is used for training the model. A major difference from AutoAugment is that the model is trained only once per fold and never retrained. In AutoAugment, this very step is a huge bottleneck, because a child model must be retrained for every candidate policy to obtain the accuracy that drives the search algorithm.
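The paper relies on scikit-learn's StratifiedShuffleSplit for this step [4]; the pure-Python stand-in below only sketches the idea of splitting each fold into D_M and D_A while preserving class proportions. The function name and arguments are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, frac_model=0.5, seed=0):
    """Split sample indices into D_M (model training) and D_A (policy search)
    so that every class is represented in both parts in equal proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    d_model, d_aug = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(len(idxs) * frac_model)
        d_model.extend(idxs[:cut])  # D_M: the model is trained once on this part
        d_aug.extend(idxs[cut:])    # D_A: candidate policies are evaluated here
    return d_model, d_aug
```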

Bayesian optimization

Evaluating the model performance for every candidate policy is computationally expensive, so the authors use Bayesian hyperparameter optimization for policy exploration and exploitation. It is applied on the augmentation part of the data: the Bayesian optimization method samples a set of candidate policies, and each candidate policy receives an accuracy score obtained through test-time augmentation on the already trained model.

Bayesian optimization is a hyperparameter search method which, in contrast to traditional random search and grid search, performs an informed search: it leverages information from previous runs. Since evaluating the true objective function is very expensive, a surrogate model is introduced that represents the objective function well yet remains cheap to evaluate. The idea is to spend more time choosing "promising" hyperparameters rather than evaluating the objective on every possible policy, which would take far longer [3]. For more details on Bayesian optimization, please refer to [3] and [5].
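The explore/exploit loop can be sketched as follows. Note this is a deliberately simplified toy: the actual paper uses the Tree-structured Parzen Estimator (TPE) [5], whereas here a crude nearest-neighbor surrogate merely illustrates how past evaluations steer where the next expensive evaluation is spent.

```python
import random

def model_based_search(objective, n_iters=30, n_candidates=20, seed=0):
    """Informed search: propose many cheap candidates, rank them with a
    surrogate built from past evaluations, and spend the expensive
    objective call only on the most promising one."""
    rng = random.Random(seed)
    history = []  # (x, score) pairs from previous expensive evaluations
    for _ in range(n_iters):
        if not history:
            x = rng.uniform(0.0, 1.0)
        else:
            def surrogate(c):
                # Cheap stand-in model: score of the nearest evaluated point.
                return max(history, key=lambda h: -abs(h[0] - c))[1]
            candidates = [rng.uniform(0.0, 1.0) for _ in range(n_candidates)]
            x = max(candidates, key=surrogate)
        history.append((x, objective(x)))  # the only expensive call per iteration
    return max(history, key=lambda h: h[1])

# Toy objective peaked at x = 0.7, standing in for model accuracy under a policy:
best_x, best_score = model_based_search(lambda x: -(x - 0.7) ** 2)
```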

The overall goal of the Bayesian optimization step is thus to find the policy that maximizes the accuracy of the model:

\tau_* = \operatorname*{argmax}_{\tau} \ \mathcal{R}(\theta^* \mid \tau(D_A))


Based on the test-time augmentation score, the top-N policies are selected from the candidate policy set for each data fold. The overall policy is the concatenation of the top policies from all data folds. Formally, it can be defined as:

\mathcal{T}(D) = \bigcup\limits_{\tau \in \mathcal{T}} \{ (\tau(x), y) : (x, y) \in D \}
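The top-N selection and concatenation step can be sketched as below. The policy names, accuracy numbers, and the `select_top_policies` helper are hypothetical, chosen only to illustrate the merge across folds.

```python
def select_top_policies(scored_policies, n=2):
    """scored_policies: list of (policy, score) pairs; return the n best policies."""
    ranked = sorted(scored_policies, key=lambda ps: ps[1], reverse=True)
    return [policy for policy, _ in ranked[:n]]

# Per-fold candidates as hypothetical (policy, accuracy) pairs:
fold_results = [
    [("rotate+invert", 0.91), ("shear+color", 0.88), ("flip+cutout", 0.85)],
    [("contrast+flip", 0.90), ("rotate+cutout", 0.87), ("solarize+shear", 0.84)],
]

final_policy = []
for fold in fold_results:
    final_policy.extend(select_top_policies(fold, n=2))
# final_policy is the concatenation of the top-2 policies from each fold.
```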


                Figure 1. Policy search procedure of Fast AutoAugment [1]


Experiments

Experimental results are obtained on common classification datasets (CIFAR-10, CIFAR-100, SVHN, and ImageNet). Tables 2 and 3 show that Fast AutoAugment achieves results comparable to AutoAugment while significantly reducing the policy search time. The authors emphasize that the transfer policy was searched on a reduced CIFAR-10 dataset of only 4,000 samples. For the most complex case (PyramidNet + ShakeDrop), the search time is reduced from 5,000 GPU-hours for AutoAugment to 780 GPU-hours for Fast AutoAugment.


Table 2. Comparison of the test error (%) for CIFAR-10


For CIFAR-100, gaps between the AutoAugment and Fast AutoAugment performance can be observed. As the paper's authors note, the reason may be insufficient policy search during exploration or the overtraining of some model parameters.

Table 3. Comparison of the test error (%) for CIFAR-100


For the ImageNet experiment, the authors also use a reduced dataset, consisting of 6,000 samples from 120 different classes. Table 4 shows that the Fast AutoAugment results are comparable to those of AutoAugment, while for the more complex architecture (ResNet-200) they show a remarkable improvement over AutoAugment. The authors conclude that, given the generalization property of data augmentation, the results could be improved further with more training epochs. Notably, Fast AutoAugment is 33 times faster than AutoAugment in policy search time.

Table 4. Validation set Top-1/Top-5 error (%) for ImageNet


Results on SVHN in Table 5 likewise show that AutoAugment and Fast AutoAugment perform comparably; both are state-of-the-art for this dataset.

Table 5. Comparison of the test error (%) for SVHN

Discussion and conclusions

Impact of a number of sub-policies on a performance

The authors assumed that an increasing number of sub-policies would improve model performance. Their experiment shows that for CIFAR-10 and CIFAR-100 the performance keeps improving up to roughly 100-125 sub-policies; Figure 2 illustrates this. Tables 2 and 3 also show a gap between the results of directly searched and transferred policies, especially for more complex architectures. This is because the transferred policies were searched on less complex architectures and were simply too primitive for the more complex ones.


Figure 2. Validation error (%) for CIFAR-10 and CIFAR-100 with an increasing number of sub-policies [1]


Augmentation policies per class

The authors also investigated learning policies per class on CIFAR-10 with the Wide-ResNet-40-2 architecture. They obtained a slightly improved error rate, but the difference was not significant. They note that this idea should help more on datasets with major geometric differences among their classes.

It was also mentioned that the authors experimented with other hyperparameters, especially those of the Bayesian optimization, but this did not have a major impact on model performance.


In conclusion, this paper can be described in one sentence as:

a method for learning augmentation policies that solves the overly long and expensive policy search of AutoAugment by leveraging Bayesian hyperparameter optimization, and that achieves comparable or, in some cases (Table 4), better results than AutoAugment.

Own considerations

From my point of view, this paper is very interesting because it makes the AutoAugment approach usable for many more tasks while requiring much less time. The method can be applied directly to a dataset without relying on an existing policy, which makes it very practical. Fast AutoAugment removes the bottleneck of AutoAugment while showing comparable or sometimes better results. The paper is well structured and provides publicly available code that is optimized for efficient use.

At the same time, the paper has several drawbacks. The comparison of policy search time in Table 1 is not entirely fair, because AutoAugment and Fast AutoAugment were trained on different GPUs. Moreover, the performance comparison is sometimes based on reduced and sometimes on full datasets, which could be stated explicitly in the comparison tables. There is also no discussion of why Fast AutoAugment significantly outperforms AutoAugment for the ResNet-200 architecture on ImageNet (Table 4). Finally, the core concept of the paper, Bayesian hyperparameter optimization, is not explained clearly, so the reader has to consult other resources to understand it in detail.

A major drawback of both AutoAugment and Fast AutoAugment is that they operate on a proxy task, which is decoupled from the model complexity and the dataset size. This additional task produces computational overhead and yields only sub-optimal results. A solution to this problem was proposed in [6].

Another interesting topic for discussion is where and how this method can be used in other applications. First of all, the method is applicable to any dataset; depending on the dataset, minor changes to the operation set or the magnitude range can further improve model performance. In particular, it can be useful for small datasets, which are very common in medical imaging.

Beyond Fast AutoAugment

Interestingly, Fast AutoAugment makes use of a proxy optimization task for finding candidate policies, which is decoupled from the dataset and from the real objective function. Several papers go beyond Fast AutoAugment and try to move the optimization into the model training itself.

One of these papers is RandAugment [6], a method proposed by the authors of AutoAugment. It eliminates the need for a proxy task and moves the whole augmentation search into the model's hyperparameter search. The number of parameters is also reduced from roughly 30 to only 2. The paper shows that fixing the magnitude to a single value has no significant impact on performance, so it can be removed from the learned parameters. The per-operation application probability is replaced by a uniform probability of \frac{1}{K}.

With K operations and N consecutive transformations per image, there are K^N candidate policies. Table 6 shows that the search space for RandAugment is only 10^2, compared to 10^{32} for Fast AutoAugment and AutoAugment. Therefore, RandAugment can easily be moved from a separate proxy task into the hyperparameter search of the current model. Table 6 also shows that RandAugment achieves state-of-the-art results, especially for more complex models.
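The sampling rule above can be sketched in a few lines. The operation names and the `randaugment_sample` helper are illustrative assumptions; the point is only that each of the N slots is filled uniformly from the K operations, with one shared magnitude.

```python
import random

OPS = ["rotate", "shear_x", "shear_y", "translate", "color", "contrast"]  # the K ops

def randaugment_sample(n=2, ops=OPS, seed=None):
    """Draw N operations uniformly (each with probability 1/K, with replacement);
    all of them are later applied at the single shared magnitude M."""
    rng = random.Random(seed)
    return [rng.choice(ops) for _ in range(n)]

policy = randaugment_sample(n=2, seed=0)  # two ops to apply at magnitude M
```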

Table 6. Comparison of search space and test accuracy (%) for RandAugment with other methods.

References

1) Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, Sungwoong Kim. Fast AutoAugment. arXiv preprint arXiv:1905.00397, 2019.

2) E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2019.

3)  https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f. Accessed at 25.06.2020.

4) https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html. Accessed at 25.06.2020.

5)  J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems , pages 2546–2554, 2011.

6) E. D. Cubuk, B. Zoph, J. Shlens, Q. V. Le. RandAugment: Practical automated data augmentation with a reduced search space. arXiv preprint arXiv:1909.13719, 2019.






