This blog post summarizes the paper "Model-Contrastive Federated Learning" by Qinbin Li, Bingsheng He and Dawn Song, published in 2021 [6].
1. Motivation
In several fields of research, one of the main problems is that the data is distributed across many different servers. Mostly due to privacy concerns, the data cannot be collected on a shared device. Previously published methods addressing this challenge struggled to achieve high accuracy on image data. One of the recent federated learning frameworks is MOON (model-contrastive learning) [6], which focuses on increasing model performance on non-i.i.d. (not independent and identically distributed) datasets.
1.1. Related works
MOON is based on FedAvg [8], which uses iterative model averaging (Figure 1): the central server aggregates the local models, each updated with SGD on the client side, to update the global model. Recent works focus on improving the 2nd step (the local training, as MOON does) and the 4th step (the aggregation).
Figure 1. Framework of FedAvg [8]. (source [6])
Li et al. [5] introduced in FedProx a proximal term with L2-norm to control the local updates. Karimireddy et al. [4] proposed SCAFFOLD to reduce the gradient dissimilarity with variance reduction in the update of local models. Others reduced the impact of the unbalanced classes during local training with Ratio Loss [13]. Li et al. [7] and Acar et al. [1] addressed the challenge of distributed data using batch normalization (FedBN [7]) and dynamic regularization (FedDyn [1]), respectively. More recently, FedProc [9] uses prototypes as global knowledge to correct the local training of the clients (parties).
2. Methodology
MOON [6] focuses on the feature representations learned by the models. If a class is represented locally by only a few samples, the local model drifts away from the global model during local training. MOON controls this drift by measuring the similarity between the model instances and using it to correct the local training. The global objective of MOON is to solve

$$\arg\min_{\omega} \mathcal{L}(\omega) = \sum_{i=1}^{N} \frac{|D^i|}{|D|} L_i(\omega),$$

where N denotes the number of parties ($P_1, \dots, P_N$). The local dataset of a party $P_i$ is denoted $D^i$, and $D$ denotes the whole dataset, i.e. the union of all the $D^i$ datasets. $\omega$ denotes the model parameters. In the above equation, $L_i$ is the empirical loss of $P_i$, which can be calculated as

$$L_i(\omega) = \mathbb{E}_{(x,y)\sim D^i}\big[\ell_i(\omega; (x, y))\big].$$
2.1. Network architecture
The network includes three main components (a code sketch follows this list):
- The base encoder extracts the representation vector from the input;
- The projection head maps this representation into a space of fixed dimension (experiments showed that this component improves the accuracy by ~2% [6]);
- The output layer produces the predicted scores for each class.
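To make the three-component structure concrete, here is a minimal PyTorch-style sketch. The backbone, layer sizes and the 256-dimensional projection space are illustrative assumptions of this post, not the exact architecture used in the paper.

```python
import torch.nn as nn

class MoonStyleNet(nn.Module):
    """Illustrative three-part network: base encoder -> projection head -> output layer."""

    def __init__(self, encoder_dim=512, proj_dim=256, num_classes=10):
        super().__init__()
        # Base encoder: extracts a representation vector from the input
        # (any backbone would do; a tiny CNN is used here purely as a placeholder).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.LazyLinear(encoder_dim), nn.ReLU(),
        )
        # Projection head: maps the representation into a fixed-dimension space
        # in which the model-contrastive loss is computed.
        self.projection = nn.Sequential(
            nn.Linear(encoder_dim, proj_dim), nn.ReLU(),
            nn.Linear(proj_dim, proj_dim),
        )
        # Output layer: produces the class scores for the supervised loss.
        self.classifier = nn.Linear(proj_dim, num_classes)

    def forward(self, x):
        h = self.encoder(x)
        z = self.projection(h)       # representation used by the contrastive term
        logits = self.classifier(z)  # class scores used by the cross-entropy term
        return z, logits
```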
During the local training, the monitored loss value (referred to as the local loss) is a weighted sum of two terms: the cross-entropy loss $\ell_{sup}$ and the model-contrastive loss $\ell_{con}$ (Figure 2). The local loss is calculated as

$$\ell = \ell_{sup}\big(\omega_i^t; (x, y)\big) + \mu\, \ell_{con}\big(\omega_i^t; \omega_i^{t-1}; \omega^t; x\big),$$

where the parameter μ controls the weight of the model-contrastive term. $\ell_{sup}$ is computed from the output layer of the local model at round t. The model-contrastive loss is defined as

$$\ell_{con} = -\log \frac{\exp\big(\mathrm{sim}(z, z_{glob}) / \tau\big)}{\exp\big(\mathrm{sim}(z, z_{glob}) / \tau\big) + \exp\big(\mathrm{sim}(z, z_{prev}) / \tau\big)},$$

where τ is the temperature parameter that controls the concentration level of the distribution and sim(·,·) denotes cosine similarity. $\ell_{con}$ is based on the contrastive loss of [2]. At the start of local training, the global model is copied to the client and then trained on the client's dataset; the resulting model is the local model $\omega_i^t$. For every input x, the representation vectors produced by the projection heads of the global model ($z_{glob}$), the current local model ($z$) and the local model from the previous round ($z_{prev}$) are compared. With $\ell_{con}$, the method aims to decrease the distance between $z$ and $z_{glob}$ while increasing the distance between $z$ and $z_{prev}$, since the global model should represent the whole dataset better.
Figure 2. Framework of MOON with the calculation of the loss term [6]
The local objective is, therefore, to minimize

$$\min_{\omega_i^t} \; \mathbb{E}_{(x, y)\sim D^i}\Big[\ell_{sup}\big(\omega_i^t; (x, y)\big) + \mu\, \ell_{con}\big(\omega_i^t; \omega_i^{t-1}; \omega^t; x\big)\Big].$$
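The two loss terms translate almost directly into code. Below is a minimal PyTorch-style sketch of the local loss described above; the function names and batch handling are my own, with cosine similarity and the temperature τ used as in the formula.

```python
import torch
import torch.nn.functional as F

def model_contrastive_loss(z, z_glob, z_prev, tau=0.5):
    """Sketch of the model-contrastive term described above.

    z      : projection-head output of the current local model for input x
    z_glob : projection-head output of the global model for the same input
    z_prev : projection-head output of the previous-round local model
    """
    sim_glob = F.cosine_similarity(z, z_glob, dim=-1) / tau  # pull towards the global model
    sim_prev = F.cosine_similarity(z, z_prev, dim=-1) / tau  # push away from the old local model
    # -log( exp(sim_glob) / (exp(sim_glob) + exp(sim_prev)) ), averaged over the batch
    return -torch.log(
        torch.exp(sim_glob) / (torch.exp(sim_glob) + torch.exp(sim_prev))
    ).mean()

def local_loss(logits, targets, z, z_glob, z_prev, mu=1.0, tau=0.5):
    """Weighted sum of the cross-entropy and model-contrastive terms."""
    return F.cross_entropy(logits, targets) + mu * model_contrastive_loss(z, z_glob, z_prev, tau)
```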
In a round of updating the local and global models (Algorithm 1), only a subset of the parties is updated. Every round consists of three steps (see the sketch after Listing 1):
- The server sends the current global model to each selected party;
- Each party updates the model on its local data and sends the resulting local model back to the server;
- The server updates the global model by weighted averaging of the collected local models.
Listing 1. Algorithm of MOON. The difference between the algorithm of FedAvg [8] and MOON [6] is in the PartyLocalTraining function.
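Complementing Listing 1, below is a minimal sketch of one communication round, assuming each party object exposes a hypothetical `train_locally` routine and a `dataset` attribute; both names are placeholders of this post, not the authors' API. The weighted average uses the local dataset sizes, as in FedAvg.

```python
import copy
import random

def communication_round(global_model, parties, sample_fraction=1.0):
    """Illustrative round: distribute the global model, train locally, aggregate."""
    selected = random.sample(parties, max(1, int(sample_fraction * len(parties))))

    local_states, weights = [], []
    for party in selected:
        # 1) Send (copy) the current global model to the party.
        local_model = copy.deepcopy(global_model)
        # 2) The party trains the model on its own data; in MOON this is where
        #    the model-contrastive loss enters (hypothetical helper).
        party.train_locally(local_model)
        local_states.append(local_model.state_dict())
        weights.append(len(party.dataset))

    # 3) Weighted averaging of the collected local models
    #    (assumes floating-point parameters, which is enough for this sketch).
    total = sum(weights)
    averaged = {
        key: sum((w / total) * state[key] for w, state in zip(weights, local_states))
        for key in local_states[0]
    }
    global_model.load_state_dict(averaged)
    return global_model
```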
3. Experiments
MOON is compared with earlier (and, via [9], later) methods on three datasets, with a different base encoder depending on the dataset:
CIFAR-10: a simple CNN (two 5×5 convolutional layers with 6 and 16 output channels, each followed by 2×2 max pooling, and two fully connected layers with 120 and 84 units and ReLU activations);
CIFAR-100, Tiny-ImageNet: ResNet-50.
Parameter settings (collected into a configuration sketch after this list):
SGD optimizer: learning rate=0.01, weight decay=0.00001, momentum=0.9,
Batch size: 64,
Number of local epochs: SOLO: 300, FL: 10,
Number of communication rounds: CIFAR-10/100: 100, Tiny ImageNet: 20,
Temperature parameter: 0.5,
Number of parties: 10,
μ tuned separately for MOON (from {0.1, 1, 5, 10}) and FedProx; the best values (MOON & FedProx) were:
CIFAR-10: 5 & 0.01,
CIFAR-100: 1 & 0.001,
Tiny ImageNet: 1 & 0.001.
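For convenience, the settings above could be gathered into a single configuration object, for example as below; the key names are illustrative and not taken from the authors' code.

```python
# Experimental settings listed above, collected in one place.
# (Key names are illustrative, not the authors' configuration format.)
EXPERIMENT_CONFIG = {
    "optimizer": {"name": "SGD", "lr": 0.01, "weight_decay": 1e-5, "momentum": 0.9},
    "batch_size": 64,
    "local_epochs": {"SOLO": 300, "federated": 10},
    "communication_rounds": {"CIFAR-10": 100, "CIFAR-100": 100, "Tiny-ImageNet": 20},
    "temperature": 0.5,
    "num_parties": 10,
    # Best mu per dataset (MOON / FedProx).
    "mu": {
        "CIFAR-10": {"MOON": 5, "FedProx": 0.01},
        "CIFAR-100": {"MOON": 1, "FedProx": 0.001},
        "Tiny-ImageNet": {"MOON": 1, "FedProx": 0.001},
    },
}
```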
Comparing the overall accuracy of the methods (Table 1), MOON achieves the highest accuracy among the approaches evaluated in the original paper; only the later FedProc [9] surpasses it. The benefit of the model-contrastive loss itself can be seen in Table 2, where it outperforms both the plain FedAvg objective and an L2-norm second term. The additional term did not slow down the convergence of MOON.
Method | CIFAR-10 | CIFAR-100 | Tiny-ImageNet |
---|---|---|---|
FedProc [9] | 70.7%±0.3% | 74.6%±0.1% | 35.4%±0.1% |
MOON [6] | 69.1%±0.4% | 67.5%±0.4% | 25.1%±0.1% |
FedAvg [8] | 66.3%±0.5% | 64.5%±0.4% | 23.0%±0.1% |
FedProx [5] | 66.9%±0.2% | 64.6%±0.2% | 23.2%±0.2% |
SCAFFOLD [4] | 66.6%±0.2% | 52.5%±0.3% | 16.0%±0.2% |
SOLO [6] | 46.3%±5.1% | 22.3%±1.0% | 8.6%±0.4% |
Table 1. Top-1 accuracy of the studied methods [6] [9]
second term | CIFAR-10 | CIFAR-100 | Tiny-ImageNet |
---|---|---|---|
none (FedAvg) | 66.3% | 64.5% | 23.0% |
L2-norm | 65.8% | 66.9% | 24.0% |
MOON | 69.1% | 67.5% | 25.1% |
Table 2. Effect of the loss terms on the accuracy [6]
MOON needs fewer communication rounds than FedAvg [8] to reach the same accuracy (Table 3; a short calculation after the table shows how the speedup column is obtained), although its average training time per round is higher because of the additional loss term. Weighing these two aspects against each other, MOON converges in the fewest rounds among the compared methods, but it is also the most computationally expensive per round.
Method | CIFAR-10 #rounds | CIFAR-10 speedup | CIFAR-100 #rounds | CIFAR-100 speedup | Tiny-ImageNet #rounds | Tiny-ImageNet speedup |
---|---|---|---|---|---|---|
FedAvg [8] | 100 | 1× | 100 | 1× | 20 | 1× |
FedProx [5] | 52 | 1.9× | 75 | 1.3× | 17 | 1.2× |
SCAFFOLD [4] | 80 | 1.3× | - | <1× | - | <1× |
MOON [6] | 27 | 3.7× | 43 | 2.3× | 11 | 1.8× |
Table 3. Number of rounds each method needs to reach the same accuracy as FedAvg [8] achieves at the end of its training (100 rounds for CIFAR-10/100, 20 rounds for Tiny-ImageNet) [6]
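The speedup column corresponds to the ratio of FedAvg's round count to the method's round count for the same target accuracy; a quick check against the CIFAR-10 column of Table 3:

```python
# Speedup relative to FedAvg: rounds_fedavg / rounds_method
rounds_fedavg = 100   # rounds FedAvg needs on CIFAR-10
rounds_moon = 27      # rounds MOON needs to reach the same accuracy
print(f"MOON speedup on CIFAR-10: {rounds_fedavg / rounds_moon:.1f}x")  # -> 3.7x
```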
With a larger number of parties, MOON remains the most accurate of the baselines compared in [6], and increasing μ to 10 (giving $\ell_{con}$ more influence) improves its accuracy further as the number of communication rounds grows, at the cost of more training time (Table 4). MOON therefore scales well with the number of parties.
Method | 50 parties (all participate): 100 rounds | 50 parties: 200 rounds | 100 parties (20 per round): 250 rounds | 100 parties: 500 rounds |
---|---|---|---|---|
FedProc [9] | 63.6% | 72.5% | 68.9% | 70.6% |
MOON (μ=1) [6] | 54.7% | 58.8% | 54.5% | 58.2% |
MOON (μ=10) [6] | 58.2% | 63.2% | 56.9% | 61.8% |
FedAvg [8] | 51.9% | 56.4% | 51.0% | 55.0% |
FedProx [5] | 52.7% | 56.6% | 51.3% | 54.6% |
SCAFFOLD [4] | 35.8% | 44.9% | 37.4% | 44.5% |
SOLO [6] | 10%±0.9% | 10%±0.9% | 7.3%±0.6% | 7.3%±0.6% |
Table 4. Effect of an increased number of parties on the top-1 accuracy. SOLO involves no communication, so its accuracy does not depend on the number of rounds. [6] [9]
The heterogeneity of the data also has an impact on the accuracy (Table 5): the local datasets are drawn with a Dirichlet distribution, where a smaller concentration parameter β yields more unbalanced (more non-i.i.d.) partitions. MOON keeps a clear lead over the baselines of [6] even at β=0.1, which indicates that it is robust to data heterogeneity (the later FedProc [9] again reaches higher accuracy).
Method | β=0.1 | β=0.5 | β=5 |
---|---|---|---|
FedProc [9] | 68.9% | 74.6% | 75.5% |
MOON [6] | 64.0% | 67.5% | 68.0% |
FedAvg [8] | 62.5% | 64.5% | 65.7% |
FedProx [5] | 62.9% | 64.6% | 64.9% |
SCAFFOLD [4] | 47.3% | 52.5% | 55.0% |
SOLO [6] | 15.9%±1.5% | 22.3%±1.0% | 26.6%±1.4% |
Table 5. Test accuracy with different concentration parameters. [6] [9]
Looking at the influence of the number of local epochs on the accuracy (Figure 3), the federated methods achieve similar performance when the local models are trained for only one epoch per round. As the number of local epochs increases, the accuracy slowly decreases, since each local optimum drifts further away from the global one.
Figure 3. Influence of the number of local epochs on the test accuracy. [6]
4. Conclusion
MOON achieves high performance, although the more recently published FedProc [9] reaches even higher accuracy. The success of MOON comes from its modification of the local training through the introduced loss function (a weighted sum of the cross-entropy and model-contrastive losses). The experiments showed that MOON is a stable, robust and accurate algorithm that could be applied in a wide range of fields, such as healthcare or autonomous driving.
5. Student’s review
5.1. Strengths and weaknesses
MOON outperforms previous works, making it an accurate, effective and robust model. The code is available on GitHub, so new hyperparameter settings, models and datasets can easily be tested. The authors focused on improving the local training; nevertheless, by also improving the aggregation phase (combining MOON with FedAvgM [3]), they achieved a further improvement of 2-3% over the presented approach.
The paper mentions three other promising approaches that were not compared against MOON. Two of them follow the same idea as MOON and concentrate on improving the local models before averaging them: FedBN [7] applies local batch normalization (the batch-normalization layers are not averaged) and was able to outperform both FedAvg [8] and FedProx [5], while [13] introduced the Ratio Loss to reduce the impact of unbalanced classes at each party. It would be interesting to compare the performance of these approaches with MOON. In addition, the idea behind the model-contrastive loss is very similar to the triplet loss introduced in [10], which works with embedding vectors and learns a representation by pulling an anchor sample towards positive samples and pushing it away from negative ones.
5.2. Further improvement
The efficiency of the model-contrastive loss could be further improved with data augmentation, as suggested in [2]. FedProc [9] is a published improvement of MOON that also targets the local training phase: it introduces an improved loss term (a global prototypical contrastive loss) as well as a modified network architecture, which could further increase the performance.
6. References
- [1] Acar, D. A. E., Zhao, Y., Navarro, R. M., Mattina, M., Whatmough, P. N., & Saligrama, V. (2021). Federated learning based on dynamic regularization. arXiv preprint arXiv:2111.04263.
- [2] Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (pp. 1597-1607). PMLR.
- [3] Hsu, T. M. H., Qi, H., & Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335.
- [4] Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., & Suresh, A. T. (2020). SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning (pp. 5132-5143). PMLR.
- [5] Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., & Smith, V. (2020). Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2, 429-450.
- [6] Li, Q., He, B., & Song, D. (2021). Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10713-10722).
- [7] Li, X., Jiang, M., Zhang, X., Kamp, M., & Dou, Q. (2021). FedBN: Federated learning on non-IID features via local batch normalization. arXiv preprint arXiv:2102.07623.
- [8] McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017). Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics (pp. 1273-1282). PMLR.
- [9] Mu, X., Shen, Y., Cheng, K., Geng, X., Fu, J., Zhang, T., & Zhang, Z. (2021). FedProc: Prototypical contrastive federated learning on non-IID data. arXiv preprint arXiv:2109.12273.
- [10] Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 815-823).
- [11] Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., & Khazaeni, Y. (2020). Federated learning with matched averaging. arXiv preprint arXiv:2002.06440.
- [12] Wang, J., Liu, Q., Liang, H., Joshi, G., & Poor, H. V. (2020). Tackling the objective inconsistency problem in heterogeneous federated optimization. arXiv preprint arXiv:2007.07481.
- [13] Wang, L., Xu, S., Wang, X., & Zhu, Q. (2020). Addressing class imbalance in federated learning. arXiv preprint arXiv:2008.06217.