1. Introduction
1.1 Concept of Incremental Learning (IL)
Human learning differs from typical machine learning algorithms. For instance, we integrate new visual information while retaining prior knowledge, such as not forgetting a pet's name while recognizing new animals at a zoo. In contrast, most machine learning models are trained in a batch setting. Class-incremental learning strives to achieve dynamic learning that mimics real-world situations, in which a model continuously learns from sequential data streams, as depicted in Figure 1. In addition, the model should be able to perform multi-class classification over all previously observed classes without suffering from catastrophic forgetting [1].
Figure 1. Class-incremental learning [1]
1.2 Class-incremental learning in medical image analysis
In general, training a deep learning model requires exhaustive computational power on large-scale data. Without IL, this process must be repeated every time a new disease or class is discovered. Moreover, fully annotating both old and new classes takes a lot of work, and in reality new classes of interest keep appearing over time. This is especially true for segmentation, where every pixel must be annotated: re-annotating data for both old and new classes is not only very time-consuming but also prone to mislabeling. Lastly, medical data is highly private and sensitive, so large amounts of data may not be available at once; collecting them takes time and effort.
2. Broad categorization of incremental learning methods
There are mainly three families of methods in IL: memory-based methods, regularization methods, and knowledge distillation methods. This section briefly introduces each of them.
2.1. Memory-based method
This method saves a small number of examples that effectively represent the whole dataset and uses them to mitigate catastrophic forgetting through replay. In general, a protocol or algorithm decides what to store in memory, often called an exemplar set. To use an analogy from everyday life, it is similar to attending a lecture and writing down only the important information in a notebook, and likewise scheduling when to review it. A well-known paper in this field is iCaRL [1]: it computes the mean of the feature representations of the current class's image set and then stores only the exemplars closest to that mean, up to a fixed budget of K images.
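A minimal sketch of this nearest-to-mean (herding-style) exemplar selection is shown below, assuming the image features have already been extracted and L2-normalised; the function name and tensor shapes are illustrative, not taken from the iCaRL code.

```python
import torch

def select_exemplars(features: torch.Tensor, K: int) -> list:
    """Greedy herding-style selection: repeatedly pick the sample whose
    inclusion keeps the exemplar mean closest to the class mean.
    `features` is an (N, D) tensor of L2-normalised embeddings, K <= N."""
    class_mean = features.mean(dim=0)
    selected, running_sum = [], torch.zeros_like(class_mean)
    for k in range(1, K + 1):
        # mean of the exemplar set if each remaining sample were added next
        candidate_means = (running_sum + features) / k
        dists = (candidate_means - class_mean).norm(dim=1)
        if selected:
            dists[selected] = float("inf")   # never pick the same sample twice
        idx = int(dists.argmin())
        selected.append(idx)
        running_sum += features[idx]
    return selected
```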
2.2. Regularization method
Regularization methods encourage or penalize the updates of individual parameters depending on their importance, so that the model's performance on prior tasks does not degrade. For example, Elastic Weight Consolidation (EWC) draws on neurobiological theories of synaptic consolidation [2]. It uses the Fisher information matrix to determine how rigid or flexible each parameter should be and applies this information when updating the model for related tasks. Figure 2 shows an overview of the EWC concept: suppose the model has already learned task A; EWC then learns task B as a new task by adding a regularization term so that task A is not forgotten.
Figure 2. EWC concept [2]
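As a rough illustration of this idea, the sketch below adds a quadratic penalty weighted by a diagonal Fisher estimate; the function and dictionary names are made up for this example, and `fisher` and `old_params` are assumed to have been captured after training on task A.

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=1000.0):
    """Quadratic EWC term: parameters that were important for task A
    (large Fisher value) are pulled back towards their old values while
    the model is being trained on task B."""
    penalty = 0.0
    for name, param in model.named_parameters():
        penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# during task B training (sketch):
# loss = task_b_loss + ewc_penalty(model, fisher_A, params_A)
```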
2.3. Knowledge Distillation method
Knowledge Distillation (KD) is the process of transferring knowledge from a large network to a smaller network; it was proposed in [3]. KD has also had a significant impact in IL. One of the most famous examples is Learning without Forgetting (LwF), proposed in [4]. At each stage, the model records the responses of the previous task. As shown in Figure 3 below, when the model learns a new task, training optimizes the new-task head Θn to predict the correct labels of the new-task images, while the output layer of the original task, Θo, is encouraged to keep producing the responses recorded in the previous stage. This strategy is effective and has the added benefit of acting as a regularizer, which improves performance on new tasks. To this end, LwF introduces a KD loss, which measures the difference between the outputs of the original network and those of the network after it has learned the new task.
Figure 3. Illustration of LwF model [4]
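A minimal sketch of such a distillation term is given below, assuming the previous model's logits on the old-task head were recorded before training on the new task; the function name and temperature value are illustrative, and the exact LwF loss is a temperature-modified cross-entropy of this kind.

```python
import torch.nn.functional as F

def lwf_distillation_loss(old_logits, new_logits, T=2.0):
    """Push the current network's old-task outputs towards the responses
    recorded from the previous model, using temperature-softened targets."""
    old_prob = F.softmax(old_logits / T, dim=1)           # recorded soft targets
    new_log_prob = F.log_softmax(new_logits / T, dim=1)   # current predictions
    return -(old_prob * new_log_prob).sum(dim=1).mean()

# total loss (sketch): cross_entropy(new_task_logits, labels) + lambda_o * lwf_distillation_loss(...)
```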
3. Cutting-edge research
This chapter describes cutting-edge research in IL that builds on the major methods presented in the previous chapter.
3.1. Adversarial Continual Learning (ACL)
ACL brings adversarial learning with a disjoint latent space representation into the IL scenario [5]. The model consists of three main components. First, a task-specific (private) latent space learns the features specific to each task. Second, a shared feature space is learned across all tasks; by storing general features, it improves knowledge transfer and the ability to recall previous tasks, and it is crucial for mitigating forgetting. The task-specific and shared representations are factorized, which yields representations capturing the semantics of the different modalities. Lastly, a discriminator predicts the task label from the shared features. The shared module is therefore trained to generate features that adversarially fool the discriminator, while the discriminator tries to recognize task labels correctly. Figure 4 below depicts the ACL architecture: P denotes the task-specific (private) modules, S the shared module, and p the factorized representation combining P and the shared module. Notably, at test time only the relevant task-specific module is used to predict the input class.
Figure 4. Overview of the ACL architecture [5]
3.1.1. Loss functions
In general, the design of the loss function can have a significant impact on model performance in IL. ACL introduces three losses and combines them into its final objective.
First, as in ordinary classification models, the authors use a customized cross-entropy loss, shown below.
where X^k is the set of n sample input tuples of task k, Y^k the corresponding output labels, and f^k_{\theta} the classifier built from the factorized private module and the shared module. \mathcal{M}^k denotes the exemplar storage holding a fixed number of images, and \sigma is the softmax function. Next, since the paper adopts adversarial learning, an adversarial loss is employed to set up the min-max game between the adversarial discriminator D and the shared module S: the discriminator attempts to classify the task label k of the input, while the shared module tries to generate features that fool the discriminator. The proposed adversarial loss is formulated below.
Note that training of S and D is considered complete when S generates features that D can no longer use to predict the task label, resulting in a task-invariant representation for S. In the end, this adversarial loss pushes the shared module to become as task-invariant as possible.
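The sketch below illustrates one common way to realise such a min-max game with alternating updates, assuming the discriminator D outputs task-label logits; the helper names are illustrative, and the exact GAN-style formulation in the ACL paper differs in its details.

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, shared_feats, task_labels, opt_D):
    """D tries to recognise which task each shared feature came from."""
    opt_D.zero_grad()
    loss_D = F.cross_entropy(D(shared_feats.detach()), task_labels)
    loss_D.backward()
    opt_D.step()
    return loss_D.item()

def shared_adversarial_loss(D, shared_feats, num_tasks):
    """S is rewarded when D's prediction carries no task information:
    here D's output is pushed towards a uniform distribution over tasks."""
    log_probs = F.log_softmax(D(shared_feats), dim=1)
    uniform = torch.full_like(log_probs, 1.0 / num_tasks)
    return F.kl_div(log_probs, uniform, reduction="batchmean")
```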
Moreover, an orthogonality constraint [6], also known as a "difference" loss [7], is used to separate the shared features, learned across all tasks, from the privately encoded features. The purpose of this loss is for the private module to learn only the features that are unique to the current task and not already captured by the shared module. This ensures that Z_{S} and Z_{P} are factorized and that the private features are specific to the task at hand. The constraint can be formulated as below.
where F indicates the Frobenius norm. Finally, the complete loss function is given below.
where the \lambda terms are regularization weights that control the impact of each component.
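Below is a minimal sketch of the "difference" term and of how the three losses might be combined, assuming Z_S and Z_P are (batch, dim) feature matrices; the lambda_* weights are illustrative hyperparameters.

```python
import torch

def difference_loss(z_shared, z_private):
    """Squared Frobenius norm of Z_S^T Z_P: small when the shared and
    private features occupy (approximately) orthogonal subspaces."""
    return (z_shared.t() @ z_private).pow(2).sum()

# combined ACL-style objective (sketch):
# loss = lambda_cls * task_loss \
#      + lambda_adv * shared_adversarial_loss(D, z_shared, num_tasks) \
#      + lambda_diff * difference_loss(z_shared, z_private)
```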
The animation below illustrates the ACL training process, involving the components and loss functions introduced above.
Figure 5. Overview of the ACL architecture
3.1.2. Evaluation and Experiments
To evaluate a trained model, the authors report the accuracy averaged across all tasks. Backward transfer (BWT) is used to measure catastrophic forgetting: a negative BWT indicates that catastrophic forgetting has occurred. The employed metrics are shown below.
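A small sketch of how these two metrics are typically computed from a results matrix is given below; the matrix layout follows the usual continual-learning convention, and the function name is my own.

```python
import numpy as np

def acc_and_bwt(R):
    """R[i, j] = test accuracy on task j after training up to task i
    (a T x T results matrix). ACC averages the final row; BWT compares the
    final accuracy on each earlier task with the accuracy right after that
    task was learned (negative values indicate forgetting)."""
    T = R.shape[0]
    acc = R[-1].mean()
    bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])
    return acc, bwt
```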
Regarding the experimental settings, the proposed model was evaluated on the miniImageNet dataset split into 20 stages of incremental learning. Seven baseline models were compared, including Hard Attention to the Task (HAT) [8], Progressive Neural Networks (PNN) [9], Continual Learning with Tiny Episodic Memories (ER-RES) [10], and Efficient Lifelong Learning with A-GEM (A-GEM) [11]. ORD-FT refers to the ordinary network naively fine-tuned without addressing catastrophic forgetting, and ORD-JT to the ordinary network under joint training, where all training data seen so far is fed in at each stage. ACL-JT is the ACL architecture with access to all training data at each incremental stage; note that joint training is not true incremental learning, since the entire training set is available at every stage. ER-RES and A-GEM store 13 images per class in their replay buffers, while ACL keeps only one image per class. The ACL network used a customized AlexNet for both the private and shared modules.
Under this setting, Table 1 compares ACL with the baselines. As expected, the joint-training models achieve higher accuracy than ACL; however, their memory footprint grows substantially because the entire dataset must be available at every stage in the joint-training setting. Among the non-joint-training baselines, ACL achieves the best accuracy as well as the best BWT, indicating that it is capable of avoiding catastrophic forgetting. Moreover, ACL stores only one image per class, making it the most memory-efficient compared to ER-RES and A-GEM.
Table 1. Experimental result [5]
3.2. Class-Incremental Learning by Knowledge Distillation with Adaptive Feature Consolidation
Class-Incremental Learning by Knowledge Distillation with Adaptive Feature Consolidation (AFC) combines regularization and knowledge distillation for IL [12]. It minimizes the increase in loss caused by model updates through adaptive weighting of feature maps: important features are protected for robustness, while less crucial features are left free to adapt for flexibility. The main focus of this approach is therefore to estimate the importance of each feature map for knowledge distillation, and it lays a theoretical basis for such adaptive weighting, in line with previous work on regularization and knowledge distillation. These importance estimates help mitigate catastrophic forgetting. Figure 6 depicts the overview of the AFC architecture.
Figure 6. Overview of the AFC architecture [12]
3.2.1. Loss functions
In the loss formulation of AFC, two losses are proposed. First, the classification loss ℒ_{cls}^t is shown below.
where 𝜂 is a learnable scaling parameter, 𝛿 is a constant margin enforcing stronger class separation, and B is the mini-batch size. For the prediction \hat{y}, instead of a dot-product classifier, AFC uses the Local Similarity Classifier (LSC), which computes the cosine similarity between the classification layer's weights and the final embedding.
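As a rough illustration of a cosine-similarity classifier of this kind, the sketch below uses a learnable scale 𝜂 and subtracts a margin 𝛿 from the target class during training; it omits the multi-proxy details of the actual LSC, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Logits are scaled cosine similarities between the L2-normalised
    embedding and L2-normalised class weights; during training a fixed
    margin is subtracted from the target-class similarity."""
    def __init__(self, embed_dim, num_classes, delta=0.6):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.eta = nn.Parameter(torch.tensor(10.0))   # learnable scale
        self.delta = delta                            # fixed margin

    def forward(self, features, labels=None):
        cos = F.linear(F.normalize(features, dim=1),
                       F.normalize(self.weight, dim=1))   # (B, num_classes)
        if labels is not None:                            # training: apply margin
            margin = torch.zeros_like(cos)
            margin.scatter_(1, labels.unsqueeze(1), self.delta)
            cos = cos - margin
        return self.eta * cos

# usage (sketch): loss_cls = F.cross_entropy(classifier(embeddings, labels), labels)
```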
Of course, AFC involves the concept of the Knowledge Distillation method, their loss includes the discrepancy loss ℒ_{disc}^t written below.
where 𝐿 is the number of building blocks, 𝐶 is the number of channels, and 𝑍_{ℓ,𝑐} denotes the 𝑐^{𝑡ℎ} channel of the \ell^{𝑡ℎ} layer or block. Notably, this loss measures the discrepancy between the previous and current networks at the layer level. \widetilde{\it I}_{\ell,c}^{t} is the importance normalized across layers; it can be interpreted as a weighting factor, since channels with larger importance values incur bigger changes in the loss. In the end, the total loss is formed below.
where \lambda_{disc} is a hyperparameter and \lambda^t is set to \sqrt{\frac{n^{t}}{n^{t}-n^{t-1}}}, with n denoting the size of the dataset.
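Below is a rough sketch of such an importance-weighted feature discrepancy and of how the total objective might be assembled; the tensor layouts, pooling, and normalisation are simplified assumptions and the names are illustrative, not the exact formulation of AFC.

```python
import torch

def discrepancy_loss(old_feats, new_feats, importance):
    """Importance-weighted feature-map discrepancy between the frozen
    previous model and the current model. `old_feats`/`new_feats` are
    lists of (B, C, H, W) activations, one per building block;
    `importance` is a list of per-channel weights estimated beforehand."""
    loss = 0.0
    for z_old, z_new, imp in zip(old_feats, new_feats, importance):
        diff = (z_new - z_old).pow(2).mean(dim=(2, 3))     # (B, C) channel drift
        loss = loss + (imp.unsqueeze(0) * diff).mean()     # weight by importance
    return loss

# total objective (sketch): L_total = L_cls + lambda_disc * lambda_t * discrepancy_loss(...)
```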
3.2.2. Evaluation and Experiments
In their experiments, the researchers compared AFC with baseline models using the average incremental accuracy, i.e., the accuracy over the classes seen so far, averaged over all stages. The dataset used was CIFAR-100. The incremental training process involved training the models on half of the classes first and then dividing the remaining classes among the stages; for example, with two classes learned per stage, there were 25 incremental stages. The classification protocol and image processing followed PODNet [13], and a 32-layer ResNet was used as the backbone, trained for 160 epochs with a batch size of 128.
Table 2 shows that AFC outperformed the other incremental learning methods. In the 25-stage setting, AFC showed the largest performance gap over the other baseline models, demonstrating its robustness in incremental learning scenarios. NME refers to the nearest-mean-of-exemplars classifier from [1] and CNN to the classifier from [14]; both are exemplar-management protocols.
Table 2. Experimental result in AFC [12]
4. State-of-the-art use case in the medical domain
4.1. Continual Class Incremental Learning for CT Thoracic Segmentation
Incremental learning research is still active and has been applied to medical domains. One example is Continual Class Incremental Learning for CT Thoracic Segmentation, called ACLSeg [20]. This paper applies the main idea of ACL to multi-organ CT scan segmentation in an adversarial training setting. Several changes were made to the original ACL architecture to adapt it to segmentation tasks. First, the shared module contains Atrous Spatial Pyramid Pooling (ASPP) [21], which helps capture multi-scale context at reduced computational cost; this is desirable given the size of medical data and the need to segment small anatomical structures. Furthermore, instead of factorizing the private and shared modules, ACLSeg multiplies them element-wise, adds them element-wise, and concatenates the two results to form the input feature representation of the segmentation module P'. In addition, PixelShuffle is used in the segmentation module P' for upsampling. Figure 7 below shows the overview of the ACLSeg architecture.
Figure 7. Overview of the ACLSeg architecture [20]
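A minimal sketch of this feature fusion and of a PixelShuffle upsampling step is given below; the channel counts and module names are illustrative assumptions, not the exact ACLSeg configuration.

```python
import torch
import torch.nn as nn

def fuse_features(z_private, z_shared):
    """Combine private and shared feature maps as described above: an
    element-wise product and an element-wise sum are concatenated along
    the channel axis before being fed to the segmentation module P'."""
    return torch.cat([z_private * z_shared, z_private + z_shared], dim=1)

# illustrative upsampling step inside the segmentation module using PixelShuffle:
upsample = nn.Sequential(
    nn.Conv2d(in_channels=256, out_channels=64 * 4, kernel_size=3, padding=1),
    nn.PixelShuffle(upscale_factor=2),   # (B, 256, H, W) -> (B, 64, 2H, 2W)
)
```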
4.1.1. Evaluation and Experiments
ACLSeg employs the same loss functions as those proposed in ACL; however, to adapt to the segmentation task, segmentation quality is evaluated with the Dice Coefficient (DC). The authors use the metrics proposed in [22] to evaluate knowledge retention: \Omega_{base} assesses the model's retention of the first learned class after subsequent learning, \Omega_{new} evaluates its ability to learn new classes, and \Omega_{all} measures its overall ability to both retain prior knowledge and acquire new information.
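For concreteness, a small sketch of the Dice coefficient and of Ω-style retention scores in the spirit of [22] is shown below; the per-stage scores and the "ideal" offline reference score are assumed to be available, and the exact normalisation used in the paper may differ.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-6):
    """Dice coefficient between two binary masks: 2 * overlap / total area."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def omega_scores(base_per_stage, new_per_stage, all_per_stage, ideal_score):
    """Omega_base / Omega_new / Omega_all: per-stage scores averaged over
    the incremental stages, with base/all normalised by an offline model."""
    omega_base = np.mean(np.asarray(base_per_stage) / ideal_score)
    omega_new = np.mean(new_per_stage)
    omega_all = np.mean(np.asarray(all_per_stage) / ideal_score)
    return omega_base, omega_new, omega_all
```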
The experiment was conducted using the AAPM dataset, which comprises five organs: the spinal cord, right lung, left lung, heart, and oesophagus. ACLSeg was compared to both a fine-tuning model and the Learning without Forgetting approach over five stages of incremental learning, with one class learned at each stage. The results of this experiment are presented in Table 3. Although LwF performed better than ACLSeg at segmenting the first learned class, it struggled to learn new classes. ACLSeg, on the other hand, demonstrated its ability to effectively retain previously learned classes while achieving better segmentation quality on newly learned tasks.
Table 3. Ω scores (std. dev. over 3 runs) and overall Dice score of the final model for class-incremental learning on 5 classes [20]
Furthermore, Figure 8 visualizes the segmentation quality at each stage for LwF and ACLSeg. The comparison shows that the improvement in segmentation quality is apparent not only in the Dice scores but also visually.
Figure 8. Ground truth and segmentation results for a given input slice using LwF and ACLSeg after learning each task [20]
5. Review & Discussion
5.1. Comparison of papers on strengths and weaknesses
Three incremental learning models have been introduced in this blog post. ACL and AFC are both classification models, while ACLSeg is a segmentation model and an example of a medical application. The following summarizes the strengths and weaknesses of the models in comparison.
- ACL
- Strength: The use of private and shared modules clarifies the roles in incremental learning and can help prevent catastrophic forgetting. The private module retains task-specific features learned so far, while the shared module saves more general information for all tasks.
- Weakness: Because the model introduces an adversarial learning protocol, it is more complicated than other methods. In addition, all private modules learned so far must be stored, which requires extra memory.
- AFC
- Strength: Compared to ACL, AFC is simpler, relying on knowledge distillation and regularization. The importance map applied to each layer acts like a per-channel weighting of the network that prevents forgetting of learned tasks, and the model outperforms ACL in terms of memory usage.
- Weakness: The model is still somewhat complicated because of the importance map in the loss function. Regarding performance evaluation in class-incremental learning, the authors only report accuracy; AFC should also have been evaluated with a metric that measures forgetting of previously learned tasks, such as BWT.
- ACLSeg
- Strength: ACLSeg segments organs clearly in incremental learning settings compared to LwF. Although there has not been much research on incremental learning in medical domains, ACLSeg is a successful incremental learning method for medical imaging.
- Weakness: As noted above for ACL, it is more complex than models that don't use adversarial learning, such as LwF. Furthermore, memory efficiency remains challenging compared to knowledge distillation-based methods.
This comparison between the three introduced models is summarized in the table below.
5.2. Takeaways
- The incremental learning approach can have a positive impact on classification as well as segmentation. When the volume of data is very large, incremental learning proves to be a valuable solution in terms of time and resource efficiency.
- Especially for medical situations, it is rare for all necessary data to be available at once due to privacy issues and other factors. Incremental learning allows us to train a model dynamically as existing models can be enhanced for performance using only new data as it becomes available.
- In a segmentation scenario, the use of incremental learning can alleviate the complexity of annotation. Annotating new classes on previously annotated images can be challenging. However, the incremental learning method enables the training of the existing model using only images annotated with the new class, reducing the annotation burden.
6. References
[1] Rebuffi, S.-A., Kolesnikov, A., Sperl, G., Lampert, C.H.: iCaRL: Incremental classifier and representation learning. In: CVPR (2017)
[2] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., & Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America, 114(13), 3521–3526.
[3] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. 1–9. http://arxiv.org/abs/1503.02531
[4] Li, Z., & Hoiem, D. (2018). Learning without Forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12), 2935–2947. https://doi.org/10.1109/TPAMI.2017.2773081
[5] Ebrahimi, S., Meier, F., Calandra, R., Darrell, T., Rohrbach, M. (2020). Adversarial Continual Learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12356. Springer, Cham. https://doi.org/10.1007/978-3-030-58621-8_23
[6] Salzmann, M., Ek, C.H., Urtasun, R., Darrell, T.: Factorized orthogonal latent spaces. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. pp. 701–708 (2010)
[7] Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., Erhan, D.: Domain separation networks. In: Advances in neural information processing systems. pp. 343–351 (2016)
[8] Serra, J., Suris, D., Miron, M., Karatzoglou, A.: Overcoming catastrophic forgetting with hard attention to the task. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 4548–4557. PMLR (2018)
[9] Rusu, A.A., Rabinowitz, N.C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., Hadsell, R.: Progressive neural networks. arXiv preprint arXiv:1606.04671 (2016)
[10] Chaudhry, A., Rohrbach, M., Elhoseiny, M., Ajanthan, T., Dokania, P.K., Torr, P.H., Ranzato, M.: Continual learning with tiny episodic memories. arXiv preprint arXiv:1902.10486 (2019)
[11] Chaudhry, A., Ranzato, M., Rohrbach, M., Elhoseiny, M.: Efficient lifelong learning with A-GEM. In: International Conference on Learning Representations (2019)
[12] Kang, M., Park, J., & Han, B. (2022). Class-Incremental Learning by Knowledge Distillation with Adaptive Feature Consolidation. 16050–16059. https://doi.org/10.1109/cvpr52688.2022.01560
[13] Arthur Douillard, Matthieu Cord, Charles Ollion, and Thomas Robert. PODNet: Pooled Outputs Distillation for Small-Tasks Incremental Learning. In ECCV, 2020.
[14] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Learning a Unified Classifier Incrementally via Rebalancing. In CVPR, 2019.
[15] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large Scale Incremental Learning. In CVPR, 2019.
[16] Yaoyao Liu, Yuting Su, An-An Liu, Bernt Schiele, and Qianru Sun. Mnemonics Training: Multi-Class Incremental Learning without Forgetting. In CVPR, 2020.
[17] Ameya Prabhu, Philip HS Torr, and Puneet K Dokania. GDumb: A Simple Approach that Questions Our Progress in Continual Learning. In ECCV, 2020.
[18] Christian Simon, Piotr Koniusz, and Mehrtash Harandi. On Learning the Geodesic Path for Incremental Learning. In CVPR, 2021.
[19] Hongjoon Ahn, Jihwan Kwak, Subin Lim, Hyeonsu Bang, Hyojun Kim, and Taesup Moon. SS-IL: Separated Softmax for Incremental Learning. In ICCV, 2021.
[20] Elskhawy, A., Lisowska, A., Keicher, M., Henry, J., Thomson, P., Navab, N. (2020). Continual Class Incremental Learning for CT Thoracic Segmentation. In: , et al. Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning. DART DCL 2020 2020. Lecture Notes in Computer Science(), vol 12444. Springer, Cham.
[21] Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818 (2018)
[22] Kemker, R., McClure, M., Abitino, A., Hayes, T. L., & Kanan, C. (2018). Measuring catastrophic forgetting in neural networks. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, 3390–3398. https://doi.org/10.1609/aaai.v32i1.11651