This is the blogpost for the paper 'Continual Adaptation of Visual Representations via Domain Randomization and Meta-learning' written by Riccardo Volpi, Diane Larlus and Grégory Rogez.
Introduction
Problem Statement and Motivation
While modern Computer Vision methods have achieved high performance on specific tasks, this commonly comes at the expense of adaptability to new visual domains; if a model is sequentially trained on new domains after deployment, it is bound to gradually forget the domains it was initially trained on. This effect is called catastrophic forgetting [1], and it becomes more severe the larger the dissimilarity between the already learned domains and the new domain.
Catastrophic forgetting is particularly problematic in applications where the model is expected to perform under conditions that will repeatedly change. Although retraining the model with samples from all already encountered domains is technically feasible [2], the required information may no longer be available because of data privacy or storage constraints. In these scenarios, the need for models inherently robust against catastrophic forgetting arises.
Related Work and Contribution
Four approaches set the basis for this study: (i) continual learning, which aims at creating models able to learn new patterns without forgetting old ones [2], (ii) domain randomization, an effective data augmentation strategy to improve performance on domains different from those encountered during training [3], (iii) domain adaptation, a family of methods focused on achieving robustness against domain changes [4], and (iv) meta-learning, which can be used for continual learning by transferring models to new tasks [5].
This paper presents "continual supervised domain adaptation", a continual learning approach focused on Computer Vision applications where the task to be executed remains the same, and the visual domain changes. The differences between previous approaches and the one proposed are depicted in figure 1. Two main contributions are delivered by this new method:
- Models trained with continual supervised domain adaptation are significantly more robust against catastrophic forgetting, without the need for data storage or model expansion.
- The study also presents a meta-learning algorithm, based on the new concept of "auxiliary meta-domains", and a regularization strategy that allows the model to (i) learn the task to be performed, (ii) resist the alteration of parameters associated with previously learned domains to prevent catastrophic forgetting, and (iii) facilitate the learning of the new domain.
Figure 1. Comparison between standard domain adaptation, domain generalization and domain randomization (top),
and the new proposed approach: continual domain adaptation (bottom).
Methodology
Notation
- M_θ : model to train
- \{D_i\}^N_{i=1} : sequence of N different domains
- P_i : distribution of domain D_i
- S_i \~ P_i : set of samples from distribution P_i
- S_i = \{(x_k;y_k)\}^m_{k=1} : set of m training pairs, with datapoint x_k and label y_k
- Ψ = \{T_q\}^M_{q=1} : image transformation set with M elements
- T \~ Ψ : specific transformation sampled from the set
- n : number of basic transformations to be combined
- S_j = \{(T(x_k);y_k)\}^m_{k=1} : new set of training pairs, after applying transformation T
- L_{task} (θ) : loss function
- θ^*_{D_i} : model trained on domain D_i
- θ^*_{D_i → D_{i+1}} : model trained on domain D_i and then fine-tuned on domain D_{i+1}
- H : number of gradient descent steps for training
Problem formulation
This study aims to tackle the issue of catastrophic forgetting in a model that is sequentially fine-tuned on different domains. This issue can be formalized as:
L_{task} (θ^*_{D_i → D_{i+1}}) > L_{task} (θ^*_{D_i})
Objective:
- Devise a model able to learn to perform the given task when confronted with new visual domains, without degrading its performance on previously encountered domains. In terms of training, this means that the loss must be minimized each time with respect to: (i) the task to be performed, (ii) the already learned domains, and (iii) the new domain yet to be learned.
Assumptions:
- Locally i.i.d. data distribution
- The set of samples S_i is no longer accessible when the domain D_{i+1} with samples S_{i+1} is encountered.
Assessment:
- The performance of the model is evaluated for every domain D_i after all N domains have been encountered.
Image transformation sets
To reach this objective, the proposed method includes (i) a domain randomization process for data augmentation, which facilitates the adaptation of the model to new visual domains, and (ii) the generation of auxiliary meta-domains for the meta-learning algorithm, which prevents the forgetting of previously seen domains. Both (i) and (ii) depend on the definition of adequate image transformation sets, making their design a critical part of the method.
To construct the sets, the authors first consider several basic transformations (color, geometric, noise injection, etc.), each with a specific magnitude, expressed as a percentage. Each element T of the transformation set Ψ is obtained by combining n of these basic transformations, as proposed in a previous study [3].
- The data augmentation is realized by sampling a transformation T from the set Ψ and applying it to an original sample (x;y). The model will be trained with both the original training pairs and the transformed pairs (T(x);y).
- The auxiliary meta-domains are generated with a similar procedure: the original set of samples S_i = \{(x_k;y_k)\}^m_{k=1} from the specific domain D_i is altered using a transformation T from the set Ψ. The resulting auxiliary meta-domain S_j = \{(T(x_k);y_k)\}^m_{k=1} is afterwards employed for the meta-learning algorithm.
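The two procedures above can be sketched in a few lines. This is a toy illustration, not the authors' code: the "images" are flat lists of intensities in [0, 1], and `color_shift` and `add_noise` are simplistic stand-ins for the paper's color and noise perturbations; all names here are hypothetical.

```python
import random

def color_shift(img, magnitude=0.1):
    """Shift all intensities, a stand-in for a color perturbation."""
    return [min(1.0, max(0.0, p + magnitude)) for p in img]

def add_noise(img, magnitude=0.05, seed=0):
    """Inject uniform noise, a stand-in for a noise perturbation."""
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p + rng.uniform(-magnitude, magnitude))) for p in img]

def compose(transforms):
    """Chain n basic transformations into a single element T of the set Psi."""
    def T(img):
        for t in transforms:
            img = t(img)
        return img
    return T

def make_transformation_set(basic, n, size, seed=42):
    """Build Psi with `size` elements, each combining n basic transformations."""
    rng = random.Random(seed)
    return [compose([rng.choice(basic) for _ in range(n)]) for _ in range(size)]

def auxiliary_meta_domain(samples, T):
    """Apply one sampled T to every datapoint, keeping labels unchanged."""
    return [(T(x), y) for x, y in samples]

psi = make_transformation_set([color_shift, add_noise], n=2, size=4)
S_i = [([0.2, 0.5, 0.8], 0), ([0.1, 0.9, 0.4], 1)]
T = random.Random(0).choice(psi)       # T ~ Psi
S_j = auxiliary_meta_domain(S_i, T)    # auxiliary meta-domain
```

Note that the same sampled T is applied to the whole set of pairs, so the auxiliary meta-domain is internally consistent, mimicking a genuine domain shift.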
Meta-learning algorithm
The Meta-DR algorithm receives as input (i) the sample set of the current domain S_i, (ii) the image transformation set Ψ, (iii) the weights of the model before training θ^{i-1}, and (iv) the hyperparameters for optimization. After execution, it outputs the weights θ^{i} of the model after training for the current domain.
The algorithm performs H learning steps. For each step during training:
- A batch (x';y') is uniformly sampled from S_i and is then transformed into an auxiliary meta-domain (T(x');y'), using a uniformly sampled transformation T.
- The auxiliary meta-domain compensates for the unavailability of sample sets from previously encountered domains, and is used to perform the meta-update of the weights θ'^{t}_T, where t is the training step, and with a meta-learning rate of α.
- Next, a second batch (x;y) is uniformly sampled from S_i, and is augmented to (T(x);y) via domain randomization, using the transformation T.
- Finally, the learning step for the current domain is performed, consisting of (i) the learning of the current task, based on the batch (x;y), the weights θ^{t} and a learning rate of η; (ii) the backward transfer, to minimize the forgetting of the previous domains, using the batch (x;y), the meta-learning weights θ'^{t}_T, and a scaling factor β; and (iii) the forward transfer, to ease the adaptation of the model to new visual domains, based on the augmented data (T(x);y), the meta-learning weights θ'^{t}_T, and a scaling factor γ.
After performing all learning steps, the resulting weights θ^{H+1} are the weights θ^{i} of the model after training for the current domain. The algorithm is summarized in figure 2 and the overall training and testing phases are presented in figure 3.
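The learning step described above can be sketched with a single scalar parameter and a squared-error loss with hand-derived gradients. This is a hedged first-order approximation (it does not differentiate through the meta-update) on toy data, not the authors' implementation; only the hyperparameter names α, η, β, γ follow the text.

```python
def loss_and_grad(theta, x, y):
    """Loss 0.5 * (theta * x - y)^2 and its gradient w.r.t. theta."""
    residual = theta * x - y
    return 0.5 * residual ** 2, residual * x

def meta_dr_step(theta, batch, T, alpha=0.1, eta=0.1, beta=0.5, gamma=0.5):
    x, y = batch
    # Meta-update on the auxiliary meta-domain (T(x'), y').
    _, g_meta = loss_and_grad(theta, T(x), y)
    theta_T = theta - alpha * g_meta
    # (i) Task gradient on the current batch (x, y).
    _, g_task = loss_and_grad(theta, x, y)
    # (ii) Backward transfer ("recall"): meta-weights evaluated on (x, y).
    _, g_recall = loss_and_grad(theta_T, x, y)
    # (iii) Forward transfer ("adapt"): meta-weights evaluated on (T(x), y).
    _, g_adapt = loss_and_grad(theta_T, T(x), y)
    # Combined update with learning rate eta and scaling factors beta, gamma.
    return theta - eta * (g_task + beta * g_recall + gamma * g_adapt)

theta = 0.0
for _ in range(50):  # H learning steps on a single toy batch
    theta = meta_dr_step(theta, (1.0, 1.0), T=lambda v: 1.2 * v)
```

The combined gradient balances fitting the current batch against keeping the meta-weights (which have "seen" the auxiliary meta-domain) performant on both the original and the transformed data.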
Figure 2. Pseudocode for the Meta-DR algorithm
Figure 3. Training with the proposed meta-learning algorithm being applied to the primary (original) domain
and to the generated auxiliary meta-domains throughout the domain sequence, and subsequent testing on all domains.
Experiment and Results
Experimental protocols
The performance of the proposed method is tested on three different tasks, namely digit recognition, image classification, and semantic scene segmentation. For each task, accuracy is evaluated on every visual domain at the end of training, when all domains have been encountered. Each experiment is run multiple times, and the results are reported as means with standard deviations. The sequence of domains, the network architecture and other relevant details for each task are listed below:
- Digit recognition
The sequence of domains is constructed with the following digits datasets in two different orders: MNIST→MNIST-M→SYN→SVHN (protocol P1) and SVHN→SYN→MNIST-M→MNIST (protocol P2).
Image size: 32x32 pixels
Training dataset: 10,000 samples for each domain
Network: ResNet-18
Transformations: Ψ_1 = color perturbation, Ψ_2 = color perturbation + image rotation, Ψ_3 = color perturbation + image rotation + noise perturbation
Each protocol is executed 3 times.
- Image classification
The domain sequence is constructed with the PACS dataset, increasing the level of realism: Sketches→Cartoons→Paintings→Photos.
Image size: 224x224 pixels
Network: ResNet-18 pretrained on ImageNet
Transformations: Ψ_4 = color perturbation
Each protocol is executed 5 times.
- Semantic scene segmentation
The domain sequence is constructed with the KITTI and KITTI2 datasets. The protocols introduce different levels of dissimilarity between the encountered domains: Clean→Foggy→Cloudy (protocol P1), Clean→Rainy→Foggy (protocol P2), and Clean→Sunset→Morning (protocol P3).
Training dataset: 75% of original dataset
Network: U-Net architecture with ResNet-34 backbone pretrained on ImageNet
Each protocol is executed 10 times with random seeds.
Training methods
For each task, the performance of Meta-DR is compared against other methods that have no access to data from previously encountered domains:
- Naive: an approach in which the trained model is sequentially fine-tuned on the new domains as they are encountered. This is the experimental lower bound.
- Naive+DR: the sequential fine-tuning is supported by domain randomization (DR).
- L2+DR and EWC+DR: L2 regularization and Elastic Weight Consolidation [6], methods for continual learning, both supported by DR.
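The two regularization baselines above can be sketched as follows: both penalize drift from the weights θ* learned on earlier domains, and EWC [6] additionally weights each parameter by a (diagonal) Fisher-information importance estimate. Plain Python lists stand in for parameter vectors; this is an illustrative sketch, not the paper's code.

```python
def l2_penalty(theta, theta_star, lam):
    """Uniform quadratic penalty on parameter drift from theta_star."""
    return 0.5 * lam * sum((t - s) ** 2 for t, s in zip(theta, theta_star))

def ewc_penalty(theta, theta_star, fisher, lam):
    """Fisher-weighted quadratic penalty: important parameters move less."""
    return 0.5 * lam * sum(f * (t - s) ** 2
                           for t, s, f in zip(theta, theta_star, fisher))
```

The total training objective is then L_task(θ) plus the penalty; note that with all Fisher values equal to 1, EWC reduces to plain L2 regularization.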
Additionally, the method is tested in scenarios where an episodic memory (in different sizes) is available:
- ER, ER+DR and ER+Meta-DR: Experience Replay [7], tested (i) standalone, (ii) supported by DR, and (iii) combined with the new Meta-DR algorithm.
- GEM: Gradient Episodic Memory [8], tested with different memory sizes (i) standalone and (ii) supported by DR.
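The episodic memory underlying the replay-based baselines is commonly implemented with reservoir sampling, so that every sample seen so far has equal probability of being in the fixed-size buffer. The class below is a generic sketch under that assumption, not the code used in [7] or [8].

```python
import random

class EpisodicMemory:
    """Fixed-capacity buffer filled via reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        """Insert with reservoir sampling so the buffer stays uniform."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = sample

    def sample_batch(self, k):
        """Draw a replay batch to mix into training on the current domain."""
        return self.rng.sample(self.buffer, min(k, len(self.buffer)))
```

During training, each gradient step would interleave a batch from the current domain with a batch drawn from this memory, which is what distinguishes these baselines from the memory-free setting above.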
Finally, the comparison also includes methods that are not exposed to catastrophic forgetting. However, they do not necessarily represent the experimental upper bound:
- Oracles: complete sample sets of all domains are permanently available, either from the beginning of the sequence (all) or over iterations (cumulative).
(The randomization strategy for DR is the same across all training methods.)
Results
- Classification
The results for the Digits and PACS datasets are shown in tables 1 and 2, respectively. The proposed Meta-DR algorithm performs comparably to or better than all other training methods without access to data from previous domains, across the different transformation sets tested. The SVHN dataset in protocol P2 is an exception: probably because of its complexity, the use of auxiliary meta-domains does not translate into better performance. The availability of an episodic memory (methods complemented with ER or GEM) boosts the performance of all methods in both classification tasks.
Table 1. Results of the experiments on digit recognition. The size of the episodic memory for ER is 100 samples per domain.
Transformation Ψ_3 = color perturbation + image rotation + noise perturbation
Table 2. Results of the experiment on PACS dataset
Transformation Ψ_4 = color perturbation
- Ablation study
To evaluate the impact that the "recall" and "adapt" components of the loss function (Meta-DR, step 10) have on the performance of the meta-learning algorithm, the results of an ablation study for digit classification (P1) are presented in table 3. Performance on early encountered domains improves when the "recall" component is active (MNIST and MNIST-M, second versus first line). Similarly, the "adapt" component enables the model to keep learning during its lifespan and improves performance on later encountered domains (SYN and SVHN, third versus first line). The results of this experiment show that both components fulfill their objective of preventing forgetting and learning the new domain.
Table 3. Ablation study
- Semantic scene segmentation
For this task, Meta-DR outperforms the Naive approach and matches or outperforms the Naive+DR method on early encountered domains, especially when the domain shift is large, as in protocols P1 and P2. For protocol P3, where the dissimilarity between visual domains is smaller, the DR and Meta-DR approaches do not provide a performance boost. Results are summarized in table 4.
Table 4. Results of semantic scene segmentation: P1 (top), P2 (center), P3 (bottom)
Transformation Ψ_4 = color perturbation
Conclusion
This study provides theoretical background and experimental evidence on the performance of continual domain adaptation built on two pillars: domain randomization as a data augmentation strategy, and a meta-learning algorithm based on the new concept of "auxiliary meta-domains". Together, these components enable the learning of representations that are resilient to catastrophic forgetting. Further exploration of data augmentation strategies is needed to endow the model with greater adaptability.
Own Review
This study addresses a challenge that is current and relevant for a wide variety of applications. The need for research in this area is well grounded, and the methods it builds on [2, 3, 4, 5] are explained in a pertinent manner throughout the paper. The methodology section, with some exceptions discussed below, and the results section are described in detail and accompanied by the values used for each parameter, which allows future researchers to reproduce the study and its assessment scenarios.
The main weakness of the methodology section is the following: the selection of the image transformation sets is described as "a core part of the study", yet the underlying assumption or justification for selecting transformation Ψ_3 (which combines color, geometry and noise) for digit recognition, and transformation Ψ_4 (which only includes color) for the PACS dataset and the semantic segmentation experiment, is not explained. Furthermore, this is contradicted in the appendix, where it is stated that the same transformation was used for both classification tasks. The selection of n = 2 for the number of combined basic transformations is not covered in sufficient detail either. A final remark concerns the number of auxiliary meta-domains used in the algorithm: the choice of a single meta-domain is not justified.
In light of these gaps, further research on the image transformation sets to be used with this methodology is necessary, as the authors themselves acknowledge. Another interesting direction would be to study the effect that a higher number of meta-domains has on the performance of the proposed method.
References
[1] Michael Mccloskey and Neil J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 24:104–169, 1989.
[2] German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual Lifelong Learning with Neural Networks: A Review. Neural Networks, 113:54–71, 2019.
[3] Riccardo Volpi and Vittorio Murino. Model vulnerability to distributional shifts over image transformation sets. In Proceedings of the International Conference on Computer Vision (ICCV), 2019.
[4] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In Proceedings of the European Conference on Computer Vision (ECCV), 2010.
[5] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2017.
[6] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. PNAS, 2017.
[7] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, and Marc’Aurelio Ranzato. On Tiny Episodic Memories in Continual Learning. arXiv:1902.10486 [cs.LG], 2019.
[8] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient Episodic Memory for Continual Learning. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2017.


