This blog post summarizes the paper "Model-Contrastive Federated Learning" by Qinbin Li, Bingsheng He and Dawn Song, published in 2021 [6].
1. Motivation
In several fields of research, one of the main problems is that the data is distributed across many different servers. Mostly due to privacy concerns, the data cannot be collected on a shared device. Previously published methods addressing this challenge struggled to achieve high accuracy on image data. One of the recent federated learning frameworks is MOON (model-contrastive learning) [6], which focuses on increasing model performance on non-i.i.d. (not independent and identically distributed) datasets.
1.1. Related works
MOON is based on FedAvg [8], which uses iterative model averaging (Figure 1): the central server aggregates the local models, each updated with SGD on the client side, to update the global model. Recent works focus on improving the 2nd step (the local training, as MOON does) and the 4th step (the aggregation).
Figure 1. Framework of FedAvg [8]. (source [6])
Li et al. [5] introduced in FedProx a proximal term with L2-norm to control the local updates. Karimireddy et al. [4] proposed SCAFFOLD to reduce the gradient dissimilarity with variance reduction in the update of local models. Others reduced the impact of the unbalanced classes during local training with Ratio Loss [13]. Li et al. [7] and Acar et al. [1] addressed the challenge of distributed data using batch normalization (FedBN [7]) and dynamic regularization (FedDyn [1]), respectively. More recently, FedProc [9] uses prototypes as global knowledge to correct the local training of the clients (parties).
2. Methodology
MOON [6] focuses on the feature representations learned by the models. If a class is represented locally by only a few samples, the local model drifts away from the global model during local training. MOON controls this drift by measuring the similarity between the model instances and using it to correct the local training. The global objective of MOON is to solve

$$\arg\min_{\omega} \mathcal{L}(\omega) = \sum_{i=1}^{N} \frac{|D^i|}{|D|} L_i(\omega),$$

where N denotes the number of parties ($P_1, \dots, P_N$). The local dataset of a party $P_i$ is denoted $D^i$, and $D$ denotes the whole dataset, i.e. the union of all the $D^i$ datasets. $\omega$ denotes the model parameters. In the above equation, $L_i$ is the empirical loss of $P_i$, which can be calculated as

$$L_i(\omega) = \mathbb{E}_{(x,y)\sim D^i}\big[\ell_i(\omega; (x, y))\big].$$
2.1. Network architecture
The network includes three main components (a code sketch follows this list):
- The base encoder extracts the representation vector from the input;
- The projection head maps this representation into a space of fixed dimension (experiments showed that this component improves the accuracy by ~2% [6]);
- The output layer produces the predicted scores for each class.
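To make the three-component structure concrete, here is a minimal PyTorch-style sketch. The backbone, layer sizes and the 256-dimensional projection space are illustrative assumptions of this post, not the exact architecture used in the paper.

```python
import torch.nn as nn

class MoonStyleNet(nn.Module):
    """Illustrative three-part network: base encoder -> projection head -> output layer."""

    def __init__(self, encoder_dim=512, proj_dim=256, num_classes=10):
        super().__init__()
        # Base encoder: extracts a representation vector from the input
        # (any backbone would do; a tiny CNN is used here purely as a placeholder).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.LazyLinear(encoder_dim), nn.ReLU(),
        )
        # Projection head: maps the representation into a fixed-dimension space
        # in which the model-contrastive loss is computed.
        self.projection = nn.Sequential(
            nn.Linear(encoder_dim, proj_dim), nn.ReLU(),
            nn.Linear(proj_dim, proj_dim),
        )
        # Output layer: produces the class scores for the supervised loss.
        self.classifier = nn.Linear(proj_dim, num_classes)

    def forward(self, x):
        h = self.encoder(x)
        z = self.projection(h)       # representation used by the contrastive term
        logits = self.classifier(z)  # class scores used by the cross-entropy term
        return z, logits
```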
During the local training, the monitored loss value (referred to as the local loss) is a weighted sum of two terms: the cross-entropy loss $\ell_{sup}$ and the model-contrastive loss $\ell_{con}$ (Figure 2). The local loss is calculated as

$$\ell = \ell_{sup}\big(\omega_i^t; (x, y)\big) + \mu\, \ell_{con}\big(\omega_i^t; \omega_i^{t-1}; \omega^t; x\big),$$

where the parameter μ controls the weight of the model-contrastive term. $\ell_{sup}$ is computed from the output layer of the local model at round t. The model-contrastive loss is defined as

$$\ell_{con} = -\log \frac{\exp\big(\mathrm{sim}(z, z_{glob}) / \tau\big)}{\exp\big(\mathrm{sim}(z, z_{glob}) / \tau\big) + \exp\big(\mathrm{sim}(z, z_{prev}) / \tau\big)},$$

where τ is the temperature parameter that controls the concentration level of the distribution and sim(·,·) denotes cosine similarity. $\ell_{con}$ is based on the contrastive loss of [2]. At the start of local training, the global model is copied to the client and then trained on the client's dataset; the resulting model is the local model $\omega_i^t$. For every input x, the representation vectors produced by the projection heads of the global model ($z_{glob}$), the current local model ($z$) and the local model from the previous round ($z_{prev}$) are compared. With $\ell_{con}$, the method aims to decrease the distance between $z$ and $z_{glob}$ while increasing the distance between $z$ and $z_{prev}$, since the global model should represent the whole dataset better.
Figure 2. Framework of MOON with the calculation of the loss term [6]
The local objective is, therefore, to minimize

$$\min_{\omega_i^t} \; \mathbb{E}_{(x, y)\sim D^i}\Big[\ell_{sup}\big(\omega_i^t; (x, y)\big) + \mu\, \ell_{con}\big(\omega_i^t; \omega_i^{t-1}; \omega^t; x\big)\Big].$$
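The two loss terms translate almost directly into code. Below is a minimal PyTorch-style sketch of the local loss described above; the function names and batch handling are my own, with cosine similarity and the temperature τ used as in the formula.

```python
import torch
import torch.nn.functional as F

def model_contrastive_loss(z, z_glob, z_prev, tau=0.5):
    """Sketch of the model-contrastive term described above.

    z      : projection-head output of the current local model for input x
    z_glob : projection-head output of the global model for the same input
    z_prev : projection-head output of the previous-round local model
    """
    sim_glob = F.cosine_similarity(z, z_glob, dim=-1) / tau  # pull towards the global model
    sim_prev = F.cosine_similarity(z, z_prev, dim=-1) / tau  # push away from the old local model
    # -log( exp(sim_glob) / (exp(sim_glob) + exp(sim_prev)) ), averaged over the batch
    return -torch.log(
        torch.exp(sim_glob) / (torch.exp(sim_glob) + torch.exp(sim_prev))
    ).mean()

def local_loss(logits, targets, z, z_glob, z_prev, mu=1.0, tau=0.5):
    """Weighted sum of the cross-entropy and model-contrastive terms."""
    return F.cross_entropy(logits, targets) + mu * model_contrastive_loss(z, z_glob, z_prev, tau)
```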
In a round of updating the local and global models (Algorithm 1), only a subset of the parties is updated. Every round consists of three steps (see the sketch after Listing 1):
- The server sends the current global model to each selected party;
- Each party updates the model on its local data and sends the resulting local model back to the server;
- The server updates the global model by weighted averaging of the collected local models.
Listing 1. Algorithm of MOON. The difference between the algorithm of FedAvg [8] and MOON [6] is in the PartyLocalTraining function.
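Complementing Listing 1, below is a minimal sketch of one communication round, assuming each party object exposes a hypothetical `train_locally` routine and a `dataset` attribute; both names are placeholders of this post, not the authors' API. The weighted average uses the local dataset sizes, as in FedAvg.

```python
import copy
import random

def communication_round(global_model, parties, sample_fraction=1.0):
    """Illustrative round: distribute the global model, train locally, aggregate."""
    selected = random.sample(parties, max(1, int(sample_fraction * len(parties))))

    local_states, weights = [], []
    for party in selected:
        # 1) Send (copy) the current global model to the party.
        local_model = copy.deepcopy(global_model)
        # 2) The party trains the model on its own data; in MOON this is where
        #    the model-contrastive loss enters (hypothetical helper).
        party.train_locally(local_model)
        local_states.append(local_model.state_dict())
        weights.append(len(party.dataset))

    # 3) Weighted averaging of the collected local models
    #    (assumes floating-point parameters, which is enough for this sketch).
    total = sum(weights)
    averaged = {
        key: sum((w / total) * state[key] for w, state in zip(weights, local_states))
        for key in local_states[0]
    }
    global_model.load_state_dict(averaged)
    return global_model
```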
3. Experiments
MOON is compared with earlier (and, via [9], later) methods on three datasets, with a different base encoder depending on the dataset:
CIFAR-10: a simple CNN (two 5×5 convolutional layers with 6 and 16 output channels, each followed by 2×2 max pooling, and two fully connected layers with 120 and 84 units and ReLU activations);
CIFAR-100, Tiny-ImageNet: ResNet-50.
Parameter settings (collected into a configuration sketch after this list):
SGD optimizer: learning rate=0.01, weight decay=0.00001, momentum=0.9,
Batch size: 64,
Number of local epochs: SOLO: 300, FL: 10,
Number of communication rounds: CIFAR-10/100: 100, Tiny ImageNet: 20,
Temperature parameter: 0.5,
Number of parties: 10,
μ tuned separately for MOON (from {0.1, 1, 5, 10}) and FedProx; the best values (MOON & FedProx) were:
CIFAR-10: 5 & 0.01,
CIFAR-100: 1 & 0.001,
Tiny ImageNet: 1 & 0.001.
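For convenience, the settings above could be gathered into a single configuration object, for example as below; the key names are illustrative and not taken from the authors' code.

```python
# Experimental settings listed above, collected in one place.
# (Key names are illustrative, not the authors' configuration format.)
EXPERIMENT_CONFIG = {
    "optimizer": {"name": "SGD", "lr": 0.01, "weight_decay": 1e-5, "momentum": 0.9},
    "batch_size": 64,
    "local_epochs": {"SOLO": 300, "federated": 10},
    "communication_rounds": {"CIFAR-10": 100, "CIFAR-100": 100, "Tiny-ImageNet": 20},
    "temperature": 0.5,
    "num_parties": 10,
    # Best mu per dataset (MOON / FedProx).
    "mu": {
        "CIFAR-10": {"MOON": 5, "FedProx": 0.01},
        "CIFAR-100": {"MOON": 1, "FedProx": 0.001},
        "Tiny-ImageNet": {"MOON": 1, "FedProx": 0.001},
    },
}
```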
Comparing the overall accuracy of the methods (Table 1), MOON achieves the highest accuracy among the approaches evaluated in the original paper; only the later FedProc [9] surpasses it. The benefit of the model-contrastive loss itself can be seen in Table 2, where it outperforms both the plain FedAvg objective and an L2-norm second term. The additional term did not slow down the convergence of MOON.
Method | CIFAR-10 | CIFAR-100 | Tiny-ImageNet |
---|---|---|---|
FedProc [9] | 70.7%±0.3% | 74.6%±0.1% | 35.4%±0.1% |
MOON [6] | 69.1%±0.4% | 67.5%±0.4% | 25.1%±0.1% |
FedAvg [8] | 66.3%±0.5% | 64.5%±0.4% | 23.0%±0.1% |
FedProx [5] | 66.9%±0.2% | 64.6%±0.2% | 23.2%±0.2% |
SCAFFOLD [4] | 66.6%±0.2% | 52.5%±0.3% | 16.0%±0.2% |
SOLO [6] | 46.3%±5.1% | 22.3%±1.0% | 8.6%±0.4% |
Table 1. Top-1 accuracy of the studied methods [6] [9]
second term | CIFAR-10 | CIFAR-100 | Tiny-ImageNet |
---|---|---|---|
none (FedAvg) | 66.3% | 64.5% | 23.0% |
L2-norm | 65.8% | 66.9% | 24.0% |
MOON | 69.1% | 67.5% | 25.1% |
Table 2. Effect of the loss terms on the accuracy [6]
MOON needs fewer communication rounds than FedAvg [8] to reach the same accuracy (Table 3; a short calculation after the table shows how the speedup column is obtained), although its average training time per round is higher because of the additional loss term. Weighing these two aspects against each other, MOON converges in the fewest rounds among the compared methods, but it is also the most computationally expensive per round.
Method | CIFAR-10 #rounds | CIFAR-10 speedup | CIFAR-100 #rounds | CIFAR-100 speedup | Tiny-ImageNet #rounds | Tiny-ImageNet speedup |
---|---|---|---|---|---|---|
FedAvg [8] | 100 | 1× | 100 | 1× | 20 | 1× |
FedProx [5] | 52 | 1.9× | 75 | 1.3× | 17 | 1.2× |
SCAFFOLD [4] | 80 | 1.3× | - | <1× | - | <1× |
MOON [6] | 27 | 3.7× | 43 | 2.3× | 11 | 1.8× |
Table 3. Number of rounds each method needs to reach the same accuracy as FedAvg [8] achieves at the end of its training (100 rounds for CIFAR-10/100, 20 rounds for Tiny-ImageNet) [6]
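The speedup column corresponds to the ratio of FedAvg's round count to the method's round count for the same target accuracy; a quick check against the CIFAR-10 column of Table 3:

```python
# Speedup relative to FedAvg: rounds_fedavg / rounds_method
rounds_fedavg = 100   # rounds FedAvg needs on CIFAR-10
rounds_moon = 27      # rounds MOON needs to reach the same accuracy
print(f"MOON speedup on CIFAR-10: {rounds_fedavg / rounds_moon:.1f}x")  # -> 3.7x
```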
With a larger number of parties, MOON remains the most accurate of the baselines compared in [6], and increasing μ to 10 (giving $\ell_{con}$ more influence) improves its accuracy further as the number of communication rounds grows, at the cost of more training time (Table 4). MOON therefore scales well with the number of parties.
Method | 50 parties (all participate): 100 rounds | 50 parties: 200 rounds | 100 parties (20 per round): 250 rounds | 100 parties: 500 rounds |
---|---|---|---|---|
FedProc [9] | 63.6% | 72.5% | 68.9% | 70.6% |
MOON (μ=1) [6] | 54.7% | 58.8% | 54.5% | 58.2% |
MOON (μ=10) [6] | 58.2% | 63.2% | 56.9% | 61.8% |
FedAvg [8] | 51.9% | 56.4% | 51.0% | 55.0% |
FedProx [5] | 52.7% | 56.6% | 51.3% | 54.6% |
SCAFFOLD [4] | 35.8% | 44.9% | 37.4% | 44.5% |
SOLO [6] | 10%±0.9% | 10%±0.9% | 7.3%±0.6% | 7.3%±0.6% |
Table 4. Effect of an increased number of parties on the top-1 accuracy. SOLO involves no communication, so its accuracy does not depend on the number of rounds. [6] [9]
The heterogeneity of the data also has an impact on the accuracy (Table 5): the local datasets are drawn with a Dirichlet distribution, where a smaller concentration parameter β yields more unbalanced (more non-i.i.d.) partitions. MOON keeps a clear lead over the baselines of [6] even at β=0.1, which indicates that it is robust to data heterogeneity (the later FedProc [9] again reaches higher accuracy).
Method | β=0.1 | β=0.5 | β=5 |
---|---|---|---|
FedProc [9] | 68.9% | 74.6% | 75.5% |
MOON [6] | 64.0% | 67.5% | 68.0% |
FedAvg [8] | 62.5% | 64.5% | 65.7% |
FedProx [5] | 62.9% | 64.6% | 64.9% |
SCAFFOLD [4] | 47.3% | 52.5% | 55.0% |
SOLO [6] | 15.9%±1.5% | 22.3%±1.0% | 26.6%±1.4% |
Table 5. Test accuracy with different concentration parameters. [6] [9]
Looking at the influence of the number of local epochs on the accuracy (Figure 3), the federated methods achieve similar performance when the local models are trained for only one epoch per round. As the number of local epochs increases, the accuracy slowly decreases, since each local optimum drifts further away from the global one.
Figure 3. Influence of the number of local epochs on the test accuracy. [6]
4. Conclusion
MOON achieves high performance, although the more recently published FedProc [9] reaches even higher accuracy. The success of MOON comes from its modification of the local training through the introduced loss function (a weighted sum of the cross-entropy and model-contrastive losses). The experiments showed that MOON is a stable, robust and accurate algorithm that could be applied in a wide range of fields, such as healthcare or autonomous driving.
5. Student’s review
5.1. Strengths and weaknesses
MOON outperforms previous works, making it an accurate, effective and robust model. The code is available on GitHub, so new hyperparameter settings, models and datasets can easily be tested. The authors focused on improving the local training; nevertheless, by also improving the aggregation phase (combining MOON with FedAvgM [3]), they achieved a further improvement of 2-3% over the presented approach.
The paper mentions three other promising approaches that were not compared against MOON. Two of them follow the same idea as MOON and concentrate on improving the local models before averaging them: FedBN [7] applies local batch normalization (the batch-normalization layers are not averaged) and was able to outperform both FedAvg [8] and FedProx [5], while [13] introduced the Ratio Loss to reduce the impact of unbalanced classes at each party. It would be interesting to compare the performance of these approaches with MOON. In addition, the idea behind the model-contrastive loss is very similar to the triplet loss introduced in [10], which works with embedding vectors and learns a representation by pulling an anchor sample towards positive samples and pushing it away from negative ones.
5.2. Further improvement
The efficiency of the model-contrastive loss could be further improved with data augmentation, as suggested in [2]. FedProc [9] is a published improvement of MOON that also targets the local training phase: it introduces an improved loss term (a global prototypical contrastive loss) as well as a modified network architecture, which could further increase the performance.
6. References
- [1] Acar, D. A. E., Zhao, Y., Navarro, R. M., Mattina, M., Whatmough, P. N., & Saligrama, V. (2021). Federated learning based on dynamic regularization. arXiv preprint arXiv:2111.04263.
- [2] Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (pp. 1597-1607). PMLR.
- [3] Hsu, T. M. H., Qi, H., & Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335.
- [4] Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., & Suresh, A. T. (2020). SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning (pp. 5132-5143). PMLR.
- [5] Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., & Smith, V. (2020). Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2, 429-450.
- [6] Li, Q., He, B., & Song, D. (2021). Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10713-10722).
- [7] Li, X., Jiang, M., Zhang, X., Kamp, M., & Dou, Q. (2021). FedBN: Federated learning on non-IID features via local batch normalization. arXiv preprint arXiv:2102.07623.
- [8] McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017). Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics (pp. 1273-1282). PMLR.
- [9] Mu, X., Shen, Y., Cheng, K., Geng, X., Fu, J., Zhang, T., & Zhang, Z. (2021). FedProc: Prototypical contrastive federated learning on non-IID data. arXiv preprint arXiv:2109.12273.
- [10] Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 815-823).
- [11] Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., & Khazaeni, Y. (2020). Federated learning with matched averaging. arXiv preprint arXiv:2002.06440.
- [12] Wang, J., Liu, Q., Liang, H., Joshi, G., & Poor, H. V. (2020). Tackling the objective inconsistency problem in heterogeneous federated optimization. arXiv preprint arXiv:2007.07481.
- [13] Wang, L., Xu, S., Wang, X., & Zhu, Q. (2020). Addressing class imbalance in federated learning. arXiv preprint arXiv:2008.06217.