This is a blog post written by Tzu Chien Chang about the paper "Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization" by Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H. Vincent Poor.

Introduction


Federated learning [1] is an emerging machine learning technique where data collection and model training are done on edge devices. Generally, in federated learning there is a central server that hosts the trained model. After the model is sent to each client, training can happen at any time, depending on the nature of the devices. The central server then collects and aggregates the model updates from the selected clients and renews the global model. This lets us train a model without ever seeing the data, which is especially useful when dealing with privacy-sensitive data [2]. Figure 1 illustrates federated learning with an easy-to-follow animation.

Figure 1. Simple animation to depict how federated learning works [3].

Or, to express the federated learning problem as a formula:

\min_{x} F(x) = \sum_{i=1}^{m} p_i F_i(x)

where m is the number of clients participating in the training, p_i = \frac{n_i}{n} is the relative sample size of client i, and F_i is the local objective function at client i.
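To make the objective concrete, here is a minimal Python sketch; the client sample counts and the quadratic local objectives are made-up toy values, not anything from the paper:

```python
import numpy as np

# Toy illustration of the global objective; client counts and e_i are made up.
n_i = np.array([100, 300, 600])   # samples at each of m = 3 clients
p_i = n_i / n_i.sum()             # relative sample sizes p_i = n_i / n

e_i = np.array([0.0, 1.0, 2.0])   # each client's local optimum, assuming quadratics

def local_objective(x, e):
    # F_i(x) = 0.5 * (x - e)^2, a simple scalar quadratic
    return 0.5 * (x - e) ** 2

def global_objective(x):
    # F(x) = sum_i p_i * F_i(x)
    return sum(p * local_objective(x, e) for p, e in zip(p_i, e_i))

# For these quadratics the global minimizer is the p_i-weighted mean of e_i.
x_star = float(np.dot(p_i, e_i))
```

Note how the client with 600 samples pulls the global minimizer toward its own optimum.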

Although it seems promising, there are four major challenges that hinder the development of federated learning [4].

  1. Expensive Communication : Communication over the network is much slower than local computation. We can reduce it by either sending smaller messages or sending messages less frequently.
  2. Systems Heterogeneity : The amount of data, computational power, network stability, local solvers, learning rates, etc. differ from client to client.
  3. Statistical Heterogeneity : Each client can have a drastically different data distribution, i.e. non-IID data.
  4. Privacy Concerns : Sensitive information can still be revealed to a third party or the central server during communication.

There is a lot of prior work proposing solutions to some of the challenges mentioned above. See table 1 for some of the most popular methods.

  • FedSGD [1] : Essentially the well-known stochastic gradient descent, but in a federated setting. In each round a random subset of clients is picked, and each selected client computes a gradient over all of its local data. The server then makes a gradient descent step based on the aggregated gradients it receives.
  • FedAvg [1] : Allows clients to train for more than one batch or one epoch. After the clients send their final updated weights back to the central server, the server averages the weights proportionally to the number of samples at each client.
  • FedProx [5] : A more general version of FedAvg. It adds a proximal term with a tunable parameter to FedAvg, allowing the algorithm to reach better accuracy at the cost of more communication rounds.

Table 1. Comparison between FedSGD, FedAvg and FedProx.
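To make the FedAvg row concrete, here is a minimal sketch of sample-weighted server aggregation; the function name and toy numbers are illustrative, not from any library:

```python
import numpy as np

# Minimal FedAvg-style server aggregation sketch (over model weights).
# Assumes each client returns its locally trained weight vector.
def fedavg_aggregate(client_weights, client_num_samples):
    """Average client weight vectors proportionally to their sample counts."""
    n = sum(client_num_samples)
    agg = np.zeros_like(client_weights[0], dtype=float)
    for w, n_i in zip(client_weights, client_num_samples):
        agg += (n_i / n) * w   # weight by p_i = n_i / n
    return agg

# Two clients: the one with 3x the data pulls the average 3x harder.
w_global = fedavg_aggregate([np.array([0.0, 0.0]), np.array([4.0, 4.0])], [1, 3])
```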

However, these prior federated optimization algorithms all assume that the number of local SGD iterations \tau_i is the same at all clients (homogeneous) [6,7,8], which is far from practical, since the amount of data, computing speed, network stability, etc. at each client can be drastically different. The authors show that when the numbers of updates across clients differ (heterogeneous), the stationary point these methods reach can be arbitrarily far from that of the original objective function, depending on \tau_i. Figure 2 gives good intuition for how the heterogeneous setting can make the new parameter set x^{(t+1,0)} deviate from the true global objective.

Figure 2. Comparison between homogeneous and heterogeneous updates. Green and blue points represent global and local objectives, respectively.

One naive solution to this problem would be to set a predefined number of updates for all clients. Clients that reach this number first would have to stay idle until all the others finish their training. This solves the objective inconsistency issue, but at the cost of wasting computational resources.

So far, no complete analysis of the objective inconsistency phenomenon has been proposed. It is also unclear how best to train a model in the heterogeneous setting, or what convergence speed to expect.

With this in mind, the paper makes two major contributions.

  • A novel theoretical framework that subsumes the previous popular federated optimization algorithms and allows heterogeneous updates, different local solvers, and non-IID datasets.
  • Based on this framework, FedNova is proposed to properly weight the local updates during averaging; according to their experiments, its results are consistently better than those of previous works.

Methodology



  • E : number of local epochs
  • B : mini-batch size
  • n_i : number of samples at client i
  • n : total number of samples
  • \tau_i = \frac{E n_i}{B} : number of local SGD iterations at client i
  • m : total number of clients
  • p_i = \frac{n_i}{n} : relative sample size of client i
  • \eta : local learning rate
  • x_i^{(t,k)} : model of client i at the k-th local update
  • \Delta_i^{(t)} : cumulative local progress at client i

Table 2. All the notations that are used in this article.

Before we dive into the equations, please take a look at table 2 for the meaning of each notation.

First, a small case study is conducted to showcase the influence of objective inconsistency. The authors chose a simple local objective function F_i(x) = \frac{1}{2}\left\|x - e_i\right\|^2, where e_i \in \Re^d, for the simulation. From the results shown in figure 3, we can clearly see that the choice of algorithm greatly influences the performance of the model. Especially when the setting is heterogeneous, FedNova surpasses the others significantly.

More importantly, while FedAvg aims at optimizing F(x) = \Sigma_{i=1}^{m} F_i(x), it is actually optimizing the weighted surrogate objective \tilde{F}(x) = \frac{\Sigma_{i=1}^{m} \tau_i F_i(x)}{\Sigma_{i=1}^{m} \tau_i}, which matches F(x) only in the homogeneous case. This surrogate objective can be arbitrarily far from the true global objective, depending on \tau_i.

Figure 3. Comparison between FedAvg, FedProx, VRLSGD and FedNova under three different conditions. Left: \tau_i is fixed to 30. Middle: \tau_i varies from 1 to 96 with mean 30. Right: \tau_i is an IID Gaussian process with mean 30.
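The surrogate-objective claim can be checked numerically for the quadratic case study: the minimizer of \tilde{F} is the \tau_i-weighted mean of the e_i, which drifts away from the minimizer of the equally weighted true objective as soon as the \tau_i differ. A small sketch with illustrative numbers (not from the paper):

```python
import numpy as np

# Quadratic case study F_i(x) = 0.5 * (x - e_i)^2; all numbers illustrative.
e = np.array([0.0, 1.0, 2.0])    # per-client optima
tau = np.array([10, 10, 100])    # heterogeneous local step counts

x_true = e.mean()                           # minimizer of the equally weighted objective
x_surrogate = (tau * e).sum() / tau.sum()   # minimizer of FedAvg's surrogate tilde{F}(x)

# With homogeneous tau, the surrogate minimizer coincides with the true one.
tau_homog = np.full(3, 30)
x_homog = (tau_homog * e).sum() / tau_homog.sum()
```

Here the client running 100 local steps drags the surrogate minimizer far toward its own optimum, while in the homogeneous case the two minimizers agree exactly.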

Novel Theoretical Framework For Heterogeneous Optimization

The framework expresses the server update in one communication round as

x^{(t+1,0)} - x^{(t,0)} = - \tau_{eff} \Sigma_{i=1}^{m} w_i \cdot \eta d_{i}^{(t)}

which optimizes the surrogate objective \tilde{F}(x) = \Sigma_{i=1}^{m} w_i F_i(x).

There are three key elements that can take a variety of forms across different algorithms.

  • Locally averaged gradient d_{i}^{(t)} : d_{i}^{(t)} = \frac{G_{i}^{(t)} a_i}{\left\| a_i \right\|_{1}}, where G_{i}^{(t)} = [g_i(x_{i}^{(t,0)}), g_i(x_{i}^{(t,1)}), ..., g_i(x_{i}^{(t,\tau_i - 1)})] \in \Re ^ {d \times \tau_i} stacks all the SGD updates together, and a_i \in \Re^{\tau_i} is a vector of length \tau_i that determines how the gradients are locally aggregated. For example, in the case of FedAvg, a_i = [1,1,..., 1] and \left\| a_i \right\|_{1} = \tau_i, so d_{i}^{(t)} is just the average of all updates.
  • Aggregation weights w_i : Determine how the locally averaged gradients are accumulated. Note that \Sigma_{i=1}^{m} w_i = 1 and that w_i affects the surrogate objective \tilde{F}(x).
  • Effective number of steps \tau_{eff} : You can think of this parameter as a global learning rate [9,10] that the server can decide. The average number of local updates is \bar{\tau} = \Sigma_{i=1}^{m} \frac{\tau_i}{m}, so the server can use it as a baseline to speed up or slow down the effect of each round.
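A compact sketch of how these three quantities fall out of the a_i vectors; the function name is mine and the toy gradients are illustrative:

```python
import numpy as np

# Compute tau_eff, w_i and d_i from per-client a_i vectors and stacked gradients.
# G_i stacks the tau_i stochastic gradients column-wise (d x tau_i matrices).
def framework_quantities(G_list, a_list, p_list):
    norms = [np.abs(a).sum() for a in a_list]                # ||a_i||_1
    tau_eff = sum(p * nrm for p, nrm in zip(p_list, norms))  # effective local steps
    w = [p * nrm / tau_eff for p, nrm in zip(p_list, norms)] # aggregation weights
    d = [G @ a / nrm for G, a, nrm in zip(G_list, a_list, norms)]  # normalized gradients
    return tau_eff, w, d

# FedAvg example: a_i all-ones, so d_i is the plain average of the local gradients.
G = [np.array([[1.0, 3.0]]), np.array([[2.0, 2.0, 2.0]])]   # d = 1, tau = 2 and 3
a = [np.ones(2), np.ones(3)]
tau_eff, w, d = framework_quantities(G, a, [0.5, 0.5])
```

Note that the client taking more local steps receives the larger weight w_i, which is exactly the source of FedAvg's objective inconsistency.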

We can restructure the formula above as the following:

x^{(t+1,0)} - x^{(t,0)} = \Sigma_{i=1}^{m} p_i \Delta_i^{(t)} = - \Sigma_{i=1}^{m} p_i \left\| a_i \right\|_{1} \cdot \frac{\eta G_{i}^{(t)} a_i}{\left\| a_i \right\|_{1}} = - \underbrace{(\Sigma_{i=1}^{m} p_i \left\| a_i \right\|_{1})}_{\tau_{eff}: \, effective \, local \, steps} \Sigma_{i=1}^{m} \eta \underbrace{\left( \frac{p_i \left\| a_i \right\|_{1}}{\Sigma_{j=1}^{m} p_j \left\| a_j \right\|_{1}} \right)}_{w_i: \, weight} \underbrace{\left( \frac{G_{i}^{(t)} a_i}{\left\| a_i \right\|_{1}} \right)}_{d_i^{(t)}: \, normalized \, gradient}

Note that once we decide the a_i vector (the local solver), the choices of \tau_{eff} and w_i are implicitly fixed. The formula above can accommodate some of the most popular algorithms proposed before.

  • FedAvg: Since a_i = [1,1,..., 1], \left\| a_i \right\|_{1} = \tau_i and \tau_{eff} = \Sigma_{i=1}^{m} p_i \tau_i. More importantly, w_i = \frac{p_i \tau_i}{\Sigma_{j=1}^{m} p_j \tau_j} means that clients with more local steps are implicitly assigned higher weights, causing objective inconsistency.
  • FedProx: a_i = [(1-\alpha)^{\tau_i -1}, (1-\alpha)^{\tau_i -2},..., (1-\alpha), 1] \in \Re^{\tau_i}, where \alpha = \eta \mu and \mu is a tunable parameter that strongly affects the behaviour of the algorithm. When \mu is larger, w_i is closer to p_i, which reduces the inconsistency at the cost of slower convergence.
  • SGD with decayed learning rate: If a client's learning rate is exponentially decayed, then a_i = [1, \gamma_i,..., \gamma_i^{\tau_i - 1}], where \gamma_i can differ across clients.
  • SGD with momentum: a_i = [1- \rho^{\tau_i}, 1- \rho^{\tau_i -1}, ..., 1- \rho] \, / \, (1- \rho), where \rho is the momentum factor.
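The a_i vectors above are easy to write down explicitly; a small sketch where the function names are mine and the parameter values illustrative:

```python
import numpy as np

# The a_i vectors for the local solvers listed above.
def a_fedavg(tau):
    # Plain SGD: every local gradient weighted equally.
    return np.ones(tau)

def a_fedprox(tau, alpha):
    # [(1-alpha)^(tau-1), ..., (1-alpha), 1] with alpha = eta * mu
    return (1 - alpha) ** np.arange(tau - 1, -1, -1)

def a_decayed_lr(tau, gamma):
    # [1, gamma, ..., gamma^(tau-1)] for exponentially decayed learning rates
    return gamma ** np.arange(tau)

def a_momentum(tau, rho):
    # [1 - rho^tau, 1 - rho^(tau-1), ..., 1 - rho] / (1 - rho)
    return (1 - rho ** np.arange(tau, 0, -1)) / (1 - rho)
```
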
Convergence Analysis
  • Theorem 1: The first theorem guarantees that any algorithm following this framework converges to a stationary point (the error is bounded), provided the learning rate is sufficiently small and the number of communication rounds is pre-defined.
  • Theorem 2: This theorem bounds the error between the surrogate objective and the true global objective:
\min_{t \in [T]} \left\| \nabla F(x^{(t,0)}) \right\|^{2} \leq \underbrace{ 2 [ \chi_{p \| w}^{2} (\beta^2 -1 ) + 1 ] \epsilon_{opt} }_{vanishing \, error \, term} + \underbrace{2 \chi_{p \| w}^{2} \kappa^2}_{non-vanishing \, error \, term \, due \, to \, objective \, inconsistency}


FedNova : Federated Normalized Averaging Algorithm

When we set w_i = p_i, the non-vanishing term in the bound of Theorem 2 is eliminated. This simple trick gives us FedNova:

x^{(t+1,0)} - x^{(t,0)} = - \tau_{eff}^{(t)} \Sigma_{i=1}^{m} p_i \cdot \eta d_{i}^{(t)}

where d_{i}^{(t)} = \frac{G_{i}^{(t)} a_i^{(t)}}{\left\| a_i^{(t)} \right\|_{1}}

Furthermore, if we set \tau_{eff} = \Sigma_{i=1}^{m} p_i \tau_i (same as FedAvg), FedNova becomes the following:

x^{(t+1,0)} - x^{(t,0)} = (\Sigma_{i=1}^{m} p_i \tau_i^{(t)}) \Sigma_{i=1}^{m} p_i \frac{\Delta_i^{(t)}}{\tau_i^{(t)}}

Compared to FedAvg, where x^{(t+1,0)} - x^{(t,0)} = \Sigma_{i=1}^{m} p_i \Delta_i^{(t)}, each client's local update is essentially re-scaled by \Sigma_{i=1}^{m} p_i \tau_i^{(t)} \, / \, \tau_i^{(t)}. After staring at the equations, let's look at schematic representations of both methods.

Looking at figure 4, you can see the difference between FedAvg and FedNova and how FedNova corrects the updates to eliminate the inconsistency. It is interesting to point out that w_i controls the direction of the arrow, while \tau_{eff} manipulates how far the arrow travels.

Figure 4. Schematic representation of the comparison between FedAvg and FedNova. Black arrow: Local updates at clients. Green: Updates made by FedNova. Blue: Updates made by FedAvg.
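The re-scaling can also be illustrated with a toy one-dimensional round: when the per-step progress differs across clients, FedAvg's update is dominated by the client taking more steps, while FedNova first normalizes each client's progress by its step count. All numbers below are illustrative, not from the paper:

```python
import numpy as np

# One-round comparison of the two aggregation rules.
def fedavg_update(deltas, p):
    # x^{(t+1,0)} - x^{(t,0)} = sum_i p_i * Delta_i
    return sum(pi * di for pi, di in zip(p, deltas))

def fednova_update(deltas, p, tau):
    # x^{(t+1,0)} - x^{(t,0)} = (sum_i p_i tau_i) * sum_i p_i * Delta_i / tau_i
    tau_eff = sum(pi * ti for pi, ti in zip(p, tau))
    return tau_eff * sum(pi * di / ti for pi, di, ti in zip(p, deltas, tau))

p = [0.5, 0.5]
tau = [2, 10]                                   # heterogeneous local steps
deltas = [np.array([-0.2]), np.array([-2.0])]   # cumulative local progress

upd_avg = fedavg_update(deltas, p)        # pulled toward the 10-step client
upd_nova = fednova_update(deltas, p, tau) # balances per-step directions
```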

Experiments and Results


Two datasets are used to evaluate the performance of FedNova as well as the previous works.

IID dataset

For the IID case, they simply run logistic regression on a synthetic federated dataset [5]. From figure 5, we can see that FedNova consistently achieves lower loss than FedAvg and FedProx at the same communication round, even though the only difference between FedAvg and FedNova is the choice of w_i when aggregating the normalized gradients.

Figure 5. Comparison between different federated methods trained on the IID dataset. Left: All clients perform 5 local epochs. Middle: 30% (C=0.3) of the clients are randomly picked to participate in training for 5 epochs. Right: 30% of the clients are randomly picked to train for U(1, 5) epochs.

Non-IID dataset

They trained a VGG-11 [11] network on the CIFAR-10 [12] dataset for 100 communication rounds with 16 clients. The dataset is partitioned using a Dirichlet distribution Dir_{16}(0.1) to make it non-IID [13]. Overall, the accuracy obtained by FedNova is 5 to 10% higher than that of FedAvg. When the number of local epochs is fixed to 2 (local updates at clients vary from \tau_i = 8 to \tau_i = 204 per epoch), both methods are generally worse than when the number of epochs is U(2, 5). For both methods, simply adding a momentum term boosts accuracy by 5-7%. Note that FedNova can further combine server momentum [10,14] and client momentum to attain an accuracy of 81.15 \pm 0.38%.

Table 3. Results of FedAvg and FedNova with several local solvers, trained on the non-IID dataset. FedAvg combined with the proximal term [5] is FedProx. VR [15] is a cross-client variance reduction technique.

Conclusion


In federated learning, the clients, such as mobile phones and sensors, are generally highly heterogeneous. In other words, it is common for them to have drastically different numbers of local updates. However, previous popular federated algorithms such as FedAvg or FedProx assume that all clients perform identical numbers of local updates, which is the major cause of objective inconsistency in heterogeneous cases. In this paper, the authors proposed a generalized theoretical framework which subsumes previous works and naturally allows clients to have different local updates, local solvers, learning rates, etc. Based on this framework, FedNova was proposed to accommodate different local updates, hyper-parameters and local solvers, and it can even be combined with acceleration techniques [15,16,17]. Lastly, both IID and non-IID datasets were employed to examine the performance of several federated methods. The experiments show that FedNova is a promising choice, as the results it obtains are consistently better than those of previous algorithms.

Review and Discussion


Federated learning is a new term in my dictionary, but the concept and idea are straightforward. It is exciting to see what is being done every day and how it can be employed to make our lives better. In my opinion, despite the poor article structure, the idea of a grand framework subsuming all previous popular methods is intriguing, and it can help future researchers better understand and analyze the problems at hand. Moreover, FedNova is a simple but innovative algorithm that outperforms other methods by 5-10%. In the history of machine learning, it is sometimes the final few miles of progress that push a concept into production, so I admire a small tweak of method that improves overall performance.

However, it is questionable whether FedNova would still outperform the others on other datasets, since only two experiments were performed in this paper and the datasets faced by federated optimization typically vary highly from one to another. There are also more federated algorithms [18,19,20,21] than those mentioned in this paper; it would be more convincing to include more methods in the comparison. Finally, the paper does not mention whether this framework can accommodate adaptive methods, which is critical since these optimizers are in constant use.

References


[1] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication- efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[2] N. Carlini, C. Liu, J. Kos, Ú. Erlingsson, and D. Song. The secret sharer: Measuring unintended neural network memorization & extracting secrets. arXiv preprint arXiv:1802.08232, 2018.

[3] http://vision.cloudera.com/wp-content/uploads/2018/11/2018-10-31-181344-federated_learning_animated_labeled.gif

[4] Li, T., Sahu, A.K., Talwalkar, A., Smith, V.: Federated learning: Challenges, methods, and future directions. arXiv preprint arXiv:1908.07873 (2019)

[5] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Conference on Machine Learning and Systems, 2020.

[6] Xinwei Zhang, Mingyi Hong, Sairaj Dhople, Wotao Yin, and Yang Liu. FedPD: A feder- ated learning framework with optimal rates and adaptivity to non-IID data. arXiv preprint arXiv:2005.11418, 2020.

[7] Aymeric Dieuleveut and Kumar Kshitij Patel. Communication trade-offs for local-sgd with large step size. In Advances in Neural Information Processing Systems, pages 13579–13590, 2019.

[8] Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.

[9] Jianyu Wang, Vinayak Tantia, Nicolas Ballas, and Michael Rabbat. SlowMo: Improving communication-efficient distributed SGD with slow momentum. In International Conference on Learning Representations, 2020.

[10] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konecˇny`, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.

[11] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[12] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[13] Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khaz- aeni. Federated learning with matched averaging. In International Conference on Learning Representations (ICLR), 2020.

[14] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.

[15] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for on-device federated learning. arXiv preprint arXiv:1910.06378, 2019.

[16] Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, and Yifei Cheng. Variance reduced local SGD with lower communication complexity. arXiv preprint arXiv:1912.12844, 2019.

[17] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smithy. Feddane: A federated newton-type method. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pages 1227–1231. IEEE, 2019.

[18] D. Rothchild, A. Panda, E. Ullah, N. Ivkin, I. Stoica, V. Braverman, J. Gonzalez, and R. Arora, “Fetchsgd: Communication-efficient federated learning with sketching,” arXiv preprint arXiv:2007.07682, 2020.

[19] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization,” arXiv preprint arXiv: 1909.13014, 2019.

[20] H. Yuan and T. Ma. Federated accelerated stochastic gradient descent. arXiv preprint arXiv:2006.08950, 2020.

[21] Mohammad Mohammadi Amiri, Deniz Gunduz, Sanjeev R Kulkarni, and H Vincent Poor. Federated learning with quantized global model updates. arXiv preprint arXiv:2006.10672, 2020.



