1. Introduction

1.1 Semi-Supervised Learning

Traditionally, machine learning methods have been divided into two branches: supervised learning, which uses only labelled data, and unsupervised learning, which tries to infer the underlying structure of unlabelled input data. Semi-supervised learning (SSL) aims to combine these branches to address their key challenges and improve performance: it uses a large amount of unlabelled data together with only a small number of labelled samples. Unlike unsupervised learning, it can be applied to a wide variety of prediction problems. [1]

1.1.1 Motivation

A frequent issue arising when applying ML methods is the lack of labelled data, caused by the difficulty of obtaining labels. Manual labelling can be a very costly and time-consuming process; in many cases, including the medical domain, annotating a dataset requires expert knowledge and experience. At the same time, in a wide range of ML applications a large amount of unlabelled data is easily available. Examples include computer-aided medical diagnosis and speech recognition. [2]

1.1.2 Assumptions

An important prerequisite must be satisfied for SSL to help: the knowledge about the data distribution p(x) over the input space, which can be gained from the unlabelled data, has to carry information relevant to the inference of the posterior distribution p(y|x). Otherwise, accuracy might even degrade, because the unlabelled samples misguide the inference. This condition appears to be fulfilled in most real-world learning problems. [4]

However, the relationship to the underlying distribution of the data can differ. SSL algorithms rely on one or more of the following assumptions: [1]

  • Smoothness assumption
    points that are close in the input space should have close corresponding outputs; this can be applied transitively to predict labels for the unlabelled samples
  • Low-density assumption
    the decision boundary should pass through low-density regions of the input space
  • Manifold assumption
    the data points lie approximately on lower-dimensional manifolds


1.1.3 Taxonomy

Semi-supervised methods can be categorised based on the assumptions they make, the way they incorporate unlabelled data, and how they relate to supervised methods.

The main distinction is between inductive and transductive methods. The aim of the former is to create a model that can predict labels for unseen data, while the latter provide predictions only for objects encountered during training. Here the focus is placed on inductive methods, which are further separated according to the way they incorporate unlabelled data: via a pseudo-labelling step (wrapper methods), in a pre-processing step (unsupervised pre-processing), or directly inside the objective function (intrinsically semi-supervised). Perturbation-based models, which are usually implemented using neural networks, are among the most popular approaches. [1]

1.2 Semi-Supervised Federated Learning

Federated learning is a machine learning technique that enables training a shared model on isolated data, distributed across different devices, by aggregating locally computed updates. Since the datasets themselves are never transmitted, it offers privacy advantages. What distinguishes it from traditional decentralised methods is that no identical-distribution (iid) assumption is made and the data can be unbalanced. [5]

Semi-supervised federated learning (SSFL) is an approach in which semi-supervised methods are used in a federated setting to make use of unlabelled data. This scenario includes situations where the distributed clients have access to both labelled and unlabelled samples, as well as those where only one node or a few nodes, for example the server, store labelled datasets. The key challenges of SSFL are dealing with non-iid data distributions and communication constraints. [6]

1.2.1 Motivation

Machine learning approaches could benefit from the data and computing power available locally on mobile devices. However, privately held data is usually unlabelled, and manual annotation can be difficult, considering both the size of the data and the fact that domain knowledge is sometimes required. For instance, smart devices generate huge quantities of data, such as text inputs or physiological indicators. [6] Moreover, large unlabelled datasets can help alleviate the frequent real-world problem of non-iid data by providing a better understanding of its distribution. [7] Some fields could especially benefit from utilising decentralised and unlabelled data in a privacy-preserving way. One example is the medical domain, in which collaboration across medical institutions can help with the scarcity and distribution bias of data. [8]

2. Methods and Results

2.1 Semi-Supervised Learning

Common types of semi-supervised methods include: [9]

  • Consistency Regularisation
    The idea is that the output of the model should remain unchanged under realistic perturbations. Methods learn by encouraging the consistency of predictions on the unlabelled data, for instance across different augmentations (including adversarial ones) or across time. To do that, a regularisation term specifying the constraints assumed a priori is included in the loss function. This approach relies on the manifold or the smoothness assumption. The most common method structure is teacher-student, where the student model learns as usual and the teacher model generates the targets.
    Examples:
    • Mean Teacher [10]
    • CPS [11]
  • Pseudo-Labelling
    The idea is to use the model being trained to generate predictions for unlabelled data and subsequently add samples with pseudo-labels to the training set, provided that the confidence of the prediction exceeds some specified threshold.
  • Pre-training
    These methods rely on unsupervised, task-agnostic pretraining and then supervised fine-tuning of the model, using the small, labelled dataset.
    Example:
    • SimCLRv2 [12]
  • Hybrid
    Methods combining different ideas, usually consistency regularisation and pseudo-labelling (a minimal sketch combining both is given after this list).
    Examples:
    • FixMatch [13]
    • FlexMatch [14]
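
To make these ideas more concrete, below is a minimal PyTorch-style sketch (my own illustration, not taken from any of the cited papers) of a loss combining confidence-thresholded pseudo-labelling with consistency between weakly- and strongly-augmented views; the model, the weak_aug/strong_aug functions and the threshold value are assumed placeholders.

import torch
import torch.nn.functional as F

def ssl_loss(model, x_lab, y_lab, x_unlab, weak_aug, strong_aug,
             threshold=0.95, lambda_u=1.0):
    # Supervised part: standard cross-entropy on the labelled batch.
    sup_loss = F.cross_entropy(model(weak_aug(x_lab)), y_lab)

    # Pseudo-labelling: predict on a weakly-augmented view and keep only confident predictions.
    with torch.no_grad():
        probs = torch.softmax(model(weak_aug(x_unlab)), dim=-1)
        max_probs, pseudo_labels = probs.max(dim=-1)
        mask = (max_probs >= threshold).float()

    # Consistency: the strongly-augmented view should match the pseudo-label.
    unsup_loss = (F.cross_entropy(model(strong_aug(x_unlab)), pseudo_labels,
                                  reduction="none") * mask).mean()

    return sup_loss + lambda_u * unsup_loss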

2.1.1 Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision - CPS [11]

This is an example of applying consistency regularisation to the task of semantic segmentation. The paper proposes a cross pseudo supervision approach consisting of two parallel segmentation networks, P1 = f(X; θ1) and P2 = f(X; θ2), which share the same structure but are initialised with different weights (θ1, θ2). The input X - labelled and unlabelled images - is fed into both networks. The networks use a softmax activation and output segmentation confidence maps (P1, P2), from which pseudo segmentation maps (Y1, Y2) - predicted one-hot label maps - are computed.

With labelled samples the learning remains straightforward - the networks are supervised separately with the ground-truth. However, in the case of unlabelled images, the pseudo segmentation map generated by one network is used to supervise the confidence map of the other, and vice versa. This enforces consistency between predictions of the networks.


The training objective combines two losses: the supervised loss (Ls) and the cross pseudo supervision loss (Lcps), where lce denotes the cross-entropy loss.
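In a simplified reconstruction following this notation (y*i - the ground-truth label at pixel i, pki and yki - the confidence-map output and the pseudo label of network k at pixel i, W and H - the image width and height, D - all images, Dl - the labelled subset, λ - a trade-off weight):

Ls   = (1/|Dl|) Σ_{X∈Dl} (1/(W·H)) Σ_i [ lce(p1i, y*i) + lce(p2i, y*i) ]
Lcps = (1/|D|)  Σ_{X∈D}  (1/(W·H)) Σ_i [ lce(p1i, y2i) + lce(p2i, y1i) ]
L    = Ls + λ · Lcps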

          

Results

The evaluation was performed using ResNet models. The conducted experiments showed improvements of CPS over fully-supervised baselines. Ablation studies considering the choice of loss functions and data augmentation were performed. The combination of CPS with self-training was also studied and showed improvements over both self-training and the CPS method alone.
CPS was compared to state-of-the-art SSL segmentation methods under different partition protocols on the PASCAL VOC 2012 and Cityscapes datasets. It outperformed all of them, and adding CutMix augmentation [21] further improved the results.


2.1.2 Big Self-Supervised Models are Strong Semi-Supervised Learners - SimCLRv2 [12]

This is an example of using the pre-training SSL approach to solve the task of image classification. The algorithm consists of three steps:

  1. Unsupervised pretraining
    Unlabelled data is utilised in a task-agnostic way in order to learn general representations via unsupervised pretraining. Using a big neural network is an important aspect, as it proves especially useful when few labelled samples are available. An adapted and improved version of SimCLR [15] - an approach based on contrastive learning - was used for pretraining. SimCLRv2 uses a deeper non-linear transformation (projection head) after the convolutional layers and incorporates a memory mechanism inspired by MoCo [16].
    The contrastive loss is sketched, together with the distillation loss, after this list.
  2. Supervised fine-tuning
    The pretrained network is adapted to a specific task using the labelled data. A part of the projection head is incorporated into the fine-tuning process.

  3. Distillation of task predictions
    Unlabelled data is leveraged again, but this time in a task-specific way, which results in the pretrained network being improved and distilled into a smaller one with little loss in accuracy. A teacher-student setup is used in this step, with the fine-tuned network acting as the teacher.
    The distillation loss is sketched after this list.
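
In simplified notation (approximately, following [15] and [12]), the contrastive loss for a positive pair (zi, zj) of projected representations of two augmented views of the same image, and the distillation loss with temperature τ, take the form:

l(i, j)  = -log [ exp(sim(zi, zj)/τ) / Σ_{k≠i} exp(sim(zi, zk)/τ) ]
Ldistill = - Σ_x Σ_y P_teacher(y|x; τ) · log P_student(y|x; τ)

where sim denotes the cosine similarity and the sum in the contrastive loss runs over all other samples in the augmented batch.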




Results

The evaluation was performed using ResNet models of different sizes on the ImageNet dataset. The conducted studies showed that bigger models are more label-efficient, that using a deeper projection head during pretraining and fine-tuning from a middle layer of the head results in better accuracy, and that the distillation step improves performance.


The best models were compared with previous SOTA SSL methods; SimCLRv2 outperformed the others in the evaluated settings.

2.1.3 Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling - FlexMatch [14]

This is an example of applying the hybrid approach to the task of image classification. FlexMatch applies the Curriculum Pseudo Labelling strategy to FixMatch [17] - a method leveraging consistency regularisation and pseudo-labelling.

FixMatch utilises two kinds of augmentations - weak and strong. A weakly-augmented sample is fed into the model, which produces an artificial label for it. This pseudo-label is then used as the target when the strongly-augmented version of the same sample is given as input to the model.


The unsupervised loss:
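In approximate form, following the notation of [17] (α - weak augmentation, A - strong augmentation, τ - the fixed confidence threshold, B - the labelled batch size, μ - the ratio of unlabelled to labelled samples, H - cross-entropy, pm - the model's predicted distribution):

qb = pm(y | α(ub)),   q̂b = argmax(qb)
Lu = (1/(μB)) Σ_b 1[max(qb) ≥ τ] · H(q̂b, pm(y | A(ub)))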

FixMatch, like other popular semi-supervised algorithms, relies on a fixed threshold when computing the unsupervised loss. This causes many unlabelled samples to be ignored and does not account for the different learning difficulties of individual classes. FlexMatch uses an approach called Curriculum Pseudo Labelling, which dynamically estimates the learning status. It assumes that the learning effect of a class is indicated by the number of samples that are assigned to this class with predictions above the threshold. The estimated learning effect is normalised and used to scale the fixed threshold - the harder a class is to learn, the lower its threshold.

The estimated learning effect of class c at time step t is denoted σt(c), and the new class-dependent threshold Tt(c) is obtained from it.
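In a simplified form (following [14]), with un denoting the unlabelled samples and τ the fixed base threshold:

σt(c) = Σ_n 1[max(pm,t(y|un)) > τ] · 1[argmax(pm,t(y|un)) = c]
βt(c) = σt(c) / max_c' σt(c')
Tt(c) = βt(c) · τ

Optionally, a non-linear mapping function can be applied to βt(c) before scaling the threshold.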

 

The losses (Ls - supervised, Lu,t - unsupervised):
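In approximate form, these mirror the FixMatch losses, with the fixed threshold τ replaced by the flexible threshold Tt(c) (λ - the weight of the unsupervised term):

Ls   = (1/B) Σ_b H(yb, pm(y | α(xb)))
Lu,t = (1/(μB)) Σ_b 1[max(qb) ≥ Tt(argmax(qb))] · H(q̂b, pm(y | A(ub)))
Lt   = Ls + λ · Lu,t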
               

Additionally, a warm-up process is used to deal with the unreliability of the estimated learning status at the beginning of training.


Results

FlexMatch was evaluated on CIFAR10/100, SVHN, STL-10 and ImageNet datasets. Ablation studies considering the upper limit of thresholds, mapping functions and threshold warm-up were conducted.


The proposed approach was compared to methods using a fixed threshold. FlexMatch achieved the best performance on all the datasets, except for SVHN.

2.2 Semi-Supervised Federated Learning

2.2.1 Benchmarking Semi-supervised Federated Learning [6]

This paper presents an example of an SSFL setting. The setup consists of K clients with unlabelled datasets and a server with a limited labelled dataset. Class distributions can differ among the users and no data is exchanged. There are two main components: the local training process and the combination of all the local models - the server computes an averaged model using the weights received from the users together with the weights of its own model. Subsequently, the server broadcasts the averaged weights to the users for the next round of training.
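
As an illustration of the aggregation step, a minimal sketch assuming simple weighted averaging in the style of FedAvg [18] (the function and variable names are placeholders, not taken from the paper):

import copy

def federated_average(states, sizes):
    # states: list of model state_dicts (server model first, then the clients)
    # sizes: matching list of local dataset sizes, used as averaging weights
    total = float(sum(sizes))
    weights = [s / total for s in sizes]
    averaged = copy.deepcopy(states[0])
    for key in averaged:
        averaged[key] = sum(w * state[key] for w, state in zip(weights, states))
    return averaged  # broadcast back to the clients for the next round

The grouping-based variant described below differs only in that the averaging is performed per group of users rather than over all of them at once.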

The server minimises a standard supervised loss on its labelled dataset, while the clients train on their unlabelled data with an unsupervised objective. During local training, consistency regularisation and pseudo-labelling approaches inspired by FixMatch are used.

Different variants of the averaging and normalisation methods were considered:

Averaging:

  • naive (using FedAvg [18])

  • grouping-based - users are divided into groups and the average is computed group-wise

Normalisation:

  • batch normalisation - normalises layer inputs by the mean and variance of a batch
  • group normalisation - organises channels into groups and computes the mean and variance within each group


Results

The evaluation was performed using ResNet18 on the CIFAR-10 and SVHN datasets and a CNN model on EMNIST. In particular, the impact of non-iid data, the communication period, the number of labelled samples, the number of communicating users and the model-averaging method was studied. The results showed that grouping-based averaging and group normalisation improve the performance.

The solutions proposed in the paper were compared to supervised federated methods - Supervised FedAvg and DataSharing - and two supervised communication-efficient algorithms - EASGD and OverlapSGD. The number of labelled samples used for all the methods was 1000. In this setup the proposed SSFL approach outperformed the rest of the methods.


3. Medical Applications

3.1 Tripled-Uncertainty Guided Mean Teacher Model for Semi-supervised Medical Image Segmentation [19]

Tripled-Uncertainty Guided Mean Teacher is a model created for the task of medical image segmentation. It utilises consistency regularisation and multi-task learning. Apart from segmentation, two additional tasks are used: foreground and background reconstruction - for acquiring semantic information, and signed distance field prediction - for enforcing shape constraints.

The architecture follows the mean teacher idea. It consists of two models which share the same encoder-decoder structure - a common encoder for all the tasks and one decoder per task. The student model is trained directly, and at each step the teacher's weights are updated as the exponential moving average of the student's weights, whereas the predictions of the teacher are used to supervise the student. Additionally, a tripled-uncertainty guided framework is introduced to help with generating more reliable pseudo-labels - uncertainty estimation is imposed on all three tasks. Consistency is enforced between the two models as well as between the results of the three tasks.
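
The teacher update follows the usual mean-teacher exponential moving average (with a smoothing coefficient α, typically close to 1):

θ't = α · θ'(t-1) + (1 - α) · θt

where θt are the student's and θ't the teacher's weights at training step t.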


The objective function combines the supervised losses of the three tasks (Ls) with the uncertainty-guided consistency losses between the student and the teacher predictions (Lcons).


Results

The method was evaluated on two datasets: the 2017 ACDC challenge dataset for cardiac segmentation and PROMISE12 for prostate segmentation.

Contributions of different components were investigated in an ablation study (Seg - segmentation task, SDF - SDF prediction task, Rec - reconstruction task, Unc - uncertainty estimation).


It was compared to U-Net (supervised) and different semi-supervised methods. The proposed approach outperformed all of them.

3.2 Semi-supervised Peer Learning for Skin Lesion Classification - FedPerl [20]

In this paper, a SSFL framework for the classification of skin lesions in dermoscopic images is proposed. The key idea is that involving similar clients can be beneficial for the learning process, especially since data and class distributions might differ between users. There are three main components:

  • Building communities
    Similar clients are clustered together based on a similarity matrix, defined via the cosine similarity between the first two statistical moments of the clients' model weights (a minimal sketch of this step is given after this list).

  • Peer learning
    The most similar peers share their knowledge (learned model parameters) to help a given client with pseudo-labelling. Dynamic policies are used to ensure that only peers who improve the performance of the client are chosen.
    Pseudo-labels for the unlabelled samples are then assigned based on the predictions of the similar peers (ft).
  • Peer anonymisation
    An anonymised peer aggregates the knowledge of the top peers so that their identities are not exposed; additionally, this reduces the communication cost.
    The anonymised peer (fa) is utilised in pseudo-labelling, and consistency between the local and the peer knowledge is imposed through an additional loss term (LCON).
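
As an illustration of the community-building step, a minimal sketch assuming that each client is represented by the mean and standard deviation of its flattened model weights (the helper names are placeholders, not taken from the paper):

import numpy as np

def client_embedding(weights):
    # weights: dict mapping parameter names to numpy arrays for one client
    flat = np.concatenate([w.ravel() for w in weights.values()])
    return np.array([flat.mean(), flat.std()])  # first two statistical moments

def similarity_matrix(client_weights):
    # cosine similarity between the clients' weight statistics
    emb = np.stack([client_embedding(w) for w in client_weights])
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return norm @ norm.T  # entry (i, j): similarity between clients i and j

def top_peers(sim, client_idx, k=3):
    # indices of the k most similar peers of a client, excluding the client itself
    order = np.argsort(-sim[client_idx])
    return [j for j in order if j != client_idx][:k]

Communities can then be formed by clustering the rows of the similarity matrix.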
         

The objective function for the client combines a FixMatch-inspired semi-supervised loss (LSSL) with the consistency term LCON.


Results

The method was evaluated on images collected from publicly available datasets: ISIC19, HAM10000,  Derm7pt, PAD-UFES and ISIC20 (used for final tests).

A variety of experiments, considering the influence of peer anonymisation, peer learning, dynamic learning policies, the importance of the similarity matrix, and different SSFL settings, were performed. The approach was validated in a heterogeneous setting suffering from class imbalance and its generalisation to unseen clients was investigated. It achieved the best results in most cases.

The method was compared to SOTA in a scenario with few labelled clients. FedPerl outperforms FedAvg [18] and achieves comparable results to FedIRM [8].


4. My review

I believe that semi-supervised learning and semi-supervised federated learning are very promising approaches, which are proving beneficial in different areas, one example being the medical domain. However, leveraging unlabelled data, especially in federated settings, is a challenging task and there is still room for improvement.

I found the approaches in the reviewed papers very interesting. The advantage of the CPS, SimCLRv2 and FlexMatch methods is their simplicity - they achieve state-of-the-art results without introducing complicated architectures. On the other hand, the tripled-uncertainty guided mean teacher is a more complex model, but including the additional components proves useful in the task of medical image segmentation. During my research, I noticed that the prevailing tendency is to use intrinsically semi-supervised methods based on consistency regularisation, but taking inspiration from different machine learning problems, like natural language processing in the case of SimCLRv2, results in significant improvements. Both papers concerning semi-supervised federated learning cover many important aspects of the problem and leverage different kinds of user groupings during training, which turns out to be an advantageous idea. All the papers were very thorough and contained ablation studies as well as many experiments, with the paper introducing FedPerl having the most extensive evaluation section. I reckon that the main weakness of the papers on semi-supervised learning is that they were evaluated only on curated datasets with close to uniform class distributions and no novel classes in the unlabelled data, which can differ greatly from real-life scenarios.

Since the papers I reviewed covered two different tasks (segmentation and classification) and used different datasets, different proportions of labelled to unlabelled samples, and different metrics or settings, they are not easy to compare. There is one exception: SimCLRv2 outperforms FlexMatch on the ImageNet dataset. However, it was not tested on the CIFAR datasets, where FlexMatch achieves the best results.

5. References

[1] van Engelen, J.E., Hoos, H.H. A survey on semi-supervised learning, Machine Learning 109, 2020

[2] A. Chebli, A. Djebbar and H. F. Marouani, Semi-Supervised Learning for Medical Application: A Survey, 2018 International Conference on Applied Smart Systems, 2018

[3] Cheplygina V, de Bruijne M, Pluim JPW. Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Med Image Anal. 2019

[4] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning (Adaptive Computation and Machine Learning). The MIT Press. 2006

[5] McMahan, H. B., Eider Moore, Daniel Ramage, Seth Hampson and Blaise Agüera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. AISTATS, 2017

[6] Zhang, Zhengming, Zhewei Yao, Yaoqing Yang, Yujun Yan, Joseph Gonzalez and Michael W. Mahoney. Benchmarking Semi-supervised Federated Learning., 2020

[7] Jin, Yilun, Xiguang Wei, Yang Liu and Qiang Yang. Towards Utilizing Unlabeled Data in Federated Learning: A Survey and Prospective., 2020

[8] Liu, Quande, Hongzhen Yang, Qi Dou and Pheng-Ann Heng. Federated Semi-supervised Medical Image Classification via Inter-client Relation Matching. MICCAI, 2021

[9] Yang, Xiangli, Zixing Song, Irwin King and Zenglin Xu. A Survey on Deep Semi-supervised Learning., 2021

[10] A. Tarvainen and H. Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, in NIPS, 2017

[11] Chen, Xiaokang, Yuhui Yuan, Gang Zeng and Jingdong Wang. Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

[12] Chen, Ting, Simon Kornblith, Kevin Swersky, Mohammad Norouzi and Geoffrey E. Hinton. Big Self-Supervised Models are Strong Semi-Supervised Learners., 2020

[13] Sohn, Kihyuk, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin Dogus Cubuk, Alexey Kurakin, Han Zhang and Colin Raffel. FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence., 2020

[14] Zhang, Bowen, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura and Takahiro Shinozaki. FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling. NeurIPS, 2021

[15] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, 2020

[16] K. He, H. Fan, Y. Wu, S. Xie and R. Girshick, "Momentum Contrast for Unsupervised Visual Representation Learning," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[17] Sohn, Kihyuk, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin Dogus Cubuk, Alexey Kurakin, Han Zhang and Colin Raffel. FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence., 2020

[18] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al., Communication-efficient learning of deep networks from decentralized data, 2016

[19] Wang, Kaiping, Bo Zhan, Chen Zu, Xi Wu, Jiliu Zhou, Luping Zhou and Yan Wang. Tripled-Uncertainty Guided Mean Teacher Model for Semi-supervised Medical Image Segmentation. MICCAI, 2021

[20] Bdair, T., Navab, N., Albarqouni, S. (2021). FedPerl: Semi-supervised Peer Learning for Skin Lesion Classification. In: , et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. MICCAI 2021

[21] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. ICCV, 2019







