Introduction
Before we dive into Simplifying Graph Convolutional Networks by Wu et al. [1] - which will be the topic of this post - let's first take a brief look at how Artificial Neural Networks ((A)NNs) came about, in comparison to Graph Neural Networks (GNNs). Neural networks were developed over time, driven by the need for more expressive models: they can be seen as an extension of linear models, which were not powerful enough to model more complex functions (e.g. XOR). The complexity of neural networks therefore increased over time, while linear models remained the baselines against which their performance is compared. Figure 1 shows an overview of the history of artificial neural networks.
Fig. 1: History of artificial neural networks [12]
The development of Graph Neural Networks - which allow training directly on data in non-Euclidean spaces (e.g. graphs) [3] - took a different path. Graph Neural Networks were introduced fairly recently, when artificial neural networks were already well established. Consequently, they do not have a linear counterpart from which they were derived. This is the reason why Wu et al. [1] introduced Simple Graph Convolution (SGC), a simplified linear version of Graph Convolutional Networks (GCNs) [2]. Simple Graph Convolution is to Graph Convolutional Networks essentially what linear models (e.g. linear regression) are to neural networks.
In the following post, we will look at the structure of Simple Graph Convolution, how it compares to Graph Convolutional Networks [2], and how it performs on standard benchmark datasets. The source code of SGC can be found here.
Notations and Definitions
- An undirected graph G can be defined as G = (V,A), where V represents the set of nodes \{v_1, ..., v_n\} and A \in \mathbb{R}^{n \times n} is the symmetric adjacency matrix specifying the edge weights between the nodes. If a_{ij}=0, there is no edge between the nodes v_i and v_j.
- The degree matrix D = diag(d_1, ..., d_n) is a diagonal matrix with d_i = \sum_j {a_{ij}} .
- For each node v_i there is a corresponding feature vector x_i \in \mathbb{R}^d, and all feature vectors are collected in the feature matrix X \in \mathbb{R}^{n \times d}. A tiny example of these objects is sketched in the code below.
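To make the notation concrete, here is a minimal numpy sketch (the graph and all feature values are made up purely for illustration) that builds the adjacency matrix A, the degree matrix D, and the feature matrix X for a small undirected graph:

```python
import numpy as np

# A tiny undirected graph with n = 4 nodes and d = 2 features per node
# (graph structure and feature values are made up for illustration).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # symmetric adjacency matrix A
D = np.diag(A.sum(axis=1))                  # degree matrix D = diag(d_1, ..., d_n)
X = np.array([[0.1, 0.9],
              [0.2, 0.8],
              [0.7, 0.3],
              [0.9, 0.1]])                  # feature matrix X (one row per node)

print(np.diag(D))                           # node degrees: [2. 2. 3. 1.]
```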
Methodology
The structure of Simple Graph Convolution is closely related to the structure of Graph Convolutional Networks (GCNs). That's why we will first have a look at GCNs and then compare their structure to Simple Graph Convolution. Figure 2 gives an overview of the two graph models and highlights their difference.
Fig. 2: Overview of Simple Graph Convolution (SGC) in comparison with Graph Convolutional Networks (GCNs); figure from the original paper [1]
Graph Convolutional Networks
In 2017, Kipf and Welling [2] introduced Graph Convolutional Networks (GCNs). Similar to, for example, Convolutional Neural Networks (CNNs), GCNs learn new feature representations for the input features X. In the following, we will look at the node classification problem, where the goal is to predict one label for each node in the graph. In the semi-supervised setting, the labels are only known for a subset of the nodes and have to be predicted for the rest.
A GCN is defined by its K layers and the aggregation function, which averages the hidden representation of each node with those of its neighbors. A K-layer GCN is equivalent to a K-layer MLP (Multi-Layer Perceptron), except for this aggregation over node neighbors at the beginning of each layer. A graph convolutional layer can be split into three steps: (1) feature propagation, (2) linear transformation, and (3) point-wise non-linear activation.
- Feature propagation is what distinguishes a GCN from a basic MLP. At the beginning of each layer, the feature vector of each node is averaged with the feature vectors of its neighbors. This can be described by the following matrix multiplication: \tilde{H}^{(k)} ← SH^{(k-1)}, where S is the normalized adjacency matrix including self-loops: S = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}, \tilde{A} = A + I, and \tilde{D} is the degree matrix of \tilde{A}. We can think of this step as a smoothing of the hidden representations along the edges of the graph, which encourages similar predictions among connected nodes.
- Linear transformation is the same step as in an MLP. Each layer has a learnable weight matrix \Theta^{(k)}, which transforms the smoothed hidden representations linearly: \hat{H}^{(k)} ← \tilde{H}^{(k)} \Theta^{(k)}
- Non-linear activation subsequently adds a non-linear component, again as in an MLP, and we obtain the new (and final) hidden feature representation of the layer: H^{(k)} ← ReLU(\tilde{H}^{(k)} \Theta^{(k)}). A minimal code sketch of one such layer follows below.
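As an illustrative sketch (not the authors' implementation), the three steps of a single GCN layer can be written in a few lines of numpy; A, H, and Theta are placeholder inputs, and Theta would normally be learned by backpropagation:

```python
import numpy as np

def gcn_layer(A, H, Theta):
    """One GCN layer: (1) feature propagation, (2) linear transformation, (3) ReLU.

    A: adjacency matrix (n x n), H: hidden representations (n x d_in),
    Theta: weight matrix (d_in x d_out), here simply given instead of learned.
    """
    A_tilde = A + np.eye(A.shape[0])                   # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    S = D_inv_sqrt @ A_tilde @ D_inv_sqrt              # normalized adjacency S

    H_smoothed = S @ H                                 # (1) feature propagation
    H_transformed = H_smoothed @ Theta                 # (2) linear transformation
    return np.maximum(H_transformed, 0.0)              # (3) point-wise ReLU

# Example call with random placeholder data:
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
H1 = gcn_layer(A, rng.normal(size=(3, 4)), rng.normal(size=(4, 2)))
```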
For more detailed information about GCNs, check out, for example, the blog posts of the authors.
Simple Graph Convolution
If we now compare the steps of a GCN to the steps of Simple Graph Convolution, we can see that we only need to remove the non-linear activation in step (3) between the layers to obtain an SGC.
Simple Graph Convolution can be summarized as a fixed, parameter-free feature extraction step \tilde{X} = S^K X (which can also be described as a smoothing operator) followed by a linear logistic regression classifier \hat{Y} = softmax(\tilde{X} \Theta).
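A minimal sketch of this two-step pipeline in Python could look as follows. This is not the authors' reference implementation; `train_idx` and `y` are hypothetical placeholders for the indices and labels of the labeled nodes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sgc_features(A, X, K=2):
    """Fixed, parameter-free feature extraction of SGC: X_tilde = S^K X."""
    A_tilde = A + np.eye(A.shape[0])                       # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    S = D_inv_sqrt @ A_tilde @ D_inv_sqrt                  # normalized adjacency S
    X_tilde = X
    for _ in range(K):                                     # apply S a total of K times
        X_tilde = S @ X_tilde
    return X_tilde

# Hypothetical usage: A and X are the adjacency and feature matrices,
# train_idx / y are placeholders for the labeled nodes and their labels.
# X_tilde = sgc_features(A, X, K=2)
# clf = LogisticRegression(max_iter=1000).fit(X_tilde[train_idx], y[train_idx])
# Y_hat = clf.predict(X_tilde)                             # predictions for all nodes
```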
Given that the feature extraction step does not require any learnable parameters, the whole learning process of SGC reduces to a logistic regression on \tilde{X}. That's cool, right? Because we know that training a logistic regression scales well to large data and is drastically faster than training complicated non-linear models. As a result, SGC significantly outperforms GCNs and other GNNs in terms of time efficiency (see the Experiments section below). A detailed comparison between Graph Convolutional Networks and Simple Graph Convolution is given in Figure 2 and Table 1.
| | Graph Convolutional Network | Simple Graph Convolution |
|---|---|---|
| Initial node representation | H^{(0)} = X | H^{(0)} = X |
| Hidden feature representations (steps (1), (2), and (3)) | \tilde{H}^{(k)} ← SH^{(k-1)}, H^{(k)} ← ReLU(\tilde{H}^{(k)} \Theta^{(k)}) | removing non-linearities: H^{(k)} ← SH^{(k-1)} \Theta^{(k)} |
| Output layer / classifier | \hat{Y}_{GCN} = softmax(SH^{(K-1)} \Theta^{(K)}) | \hat{Y}_{SGC} = softmax(S \cdots S X \Theta^{(1)} \Theta^{(2)} \cdots \Theta^{(K)}) = softmax(S^K X \Theta) |
Table 1: Comparison between Graph Convolutional Networks and Simple Graph Convolution
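To make the collapse of the weight matrices in Table 1 concrete, here is a small numpy check (with made-up random A, X, and \Theta matrices) that a K-layer GCN without the ReLU non-linearities produces exactly S^K X \Theta with \Theta = \Theta^{(1)} \Theta^{(2)} \cdots \Theta^{(K)}:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, K = 5, 3, 3

# Made-up random graph, features, and per-layer weight matrices for the check.
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.triu(A, 1)
A = A + A.T
X = rng.normal(size=(n, d))
Thetas = [rng.normal(size=(d, d)) for _ in range(K)]

A_tilde = A + np.eye(n)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
S = D_inv_sqrt @ A_tilde @ D_inv_sqrt

# "Linear GCN": apply S and one weight matrix per layer, no ReLU in between.
H = X
for Theta in Thetas:
    H = S @ H @ Theta

# SGC: smooth K times first, then apply one collapsed weight matrix.
H_sgc = np.linalg.matrix_power(S, K) @ X @ (Thetas[0] @ Thetas[1] @ Thetas[2])

print(np.allclose(H, H_sgc))   # True: the two formulations coincide
```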
Spectral Analysis
If we look at Simple Graph Convolution from a spectral perspective, we can see that it corresponds to a fixed filter on the graph spectral domain.
Wu et al. [1] also showed that adding self-loops to the original graph shrinks the underlying graph spectrum. SGC, therefore, acts as a low-pass filter on the scaled graph domain.
General Graph Convolutions
Let's first have a look at graph convolutions in general. Graph Fourier analysis relies on the spectral decomposition of the graph Laplacian. The normalized graph Laplacian is defined as \Delta_{sym} = I_n - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}, where D \in \mathbb{R}^{n \times n} is the diagonal degree matrix with D_{ii} = \sum_j A_{ij} and I_n is the identity matrix. Its eigendecomposition is \Delta_{sym} = U \Lambda U^{T}, where U \in \mathbb{R}^{n \times n} contains the eigenvectors and \Lambda is the diagonal matrix of eigenvalues. In the graph domain, the eigenvectors play the role of Fourier modes and the eigenvalues the role of frequencies of the graph.
Let x \in \mathbb{R}^n be a signal defined on the graph vertices. The graph Fourier transform of x is defined as \hat{x} = U^T x and the inverse operation is x = U \hat{x}. The graph convolution operation between a filter g and a signal x is therefore:
g * x = U((U^T g) \odot (U^T x)) = U \hat{G} U^T x,
where \hat{G} = diag(\hat{g}_1, ..., \hat{g}_n) is the diagonal matrix with spectral filter coefficients \hat{g}_i.
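As a small illustration (with a made-up toy graph and a made-up low-pass filter, purely for demonstration), the graph Fourier transform and a spectral filtering step can be computed directly from the eigendecomposition of the normalized Laplacian:

```python
import numpy as np

# Toy undirected graph (illustrative only).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt      # normalized graph Laplacian

# Eigendecomposition: columns of U are the Fourier modes, lam the frequencies.
lam, U = np.linalg.eigh(L)

x = np.array([1.0, 0.0, 2.0, 0.0])               # a signal on the graph vertices
x_hat = U.T @ x                                  # graph Fourier transform
g_hat = np.exp(-lam)                             # an example low-pass filter
x_filtered = U @ (g_hat * x_hat)                 # convolution in the spectral domain
```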
Graph convolutions can be approximated by polynomials of the Laplacian. Graph Convolutional Networks [2] use an affine (first-order) approximation and arrive at the following basic GCN convolution operation:
g * x = \theta ( I + D^{-\frac{1}{2}} A D^{-\frac{1}{2}})\ x
Spectral Analysis of SGC
Kipf & Welling [2] introduced the renormalization trick, which replaces the normalized adjacency matrix D^{-\frac{1}{2}} A D^{-\frac{1}{2}} with the augmented normalized adjacency matrix \tilde{S}_{adj} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}, where \tilde{A} = A+I and \tilde{D} = D+I. The corresponding augmented normalized Laplacian is \tilde{\Delta}_{sym} = I_n - \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}. The spectral filter associated with \tilde{S}_{adj} can then be described by a polynomial of the eigenvalues of this Laplacian: \hat{g}(\tilde{\lambda}_i) = (1 - \tilde{\lambda}_i)^K, where \tilde{\lambda}_i are the eigenvalues of \tilde{\Delta}_{sym}.
They also showed that adding self-loops leads to a smaller largest eigenvalue of the normalized graph Laplacian. For the Cora dataset, for example, the largest eigenvalue shrinks from 2 to approximately 1.5 after adding self-loops, which eliminates the effect of negative coefficients. On this scaled spectrum, \tilde{S}_{adj} (which includes the self-loops) acts as a low-pass-type filter. Low-pass-type filters capture low-frequency signals; in the graph setting, this corresponds to a smoothing operation and therefore results in similar feature representations among connected nodes and similar predictions within neighborhoods (which matches the underlying idea that similar nodes are connected in a graph).
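The shrinkage of the largest eigenvalue can be checked numerically. The following sketch uses a small random graph (not the Cora graph from the paper) and compares the largest eigenvalue of the normalized Laplacian with and without self-loops:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random small undirected graph (illustrative only, not the Cora graph).
n = 50
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.triu(A, 1)
A = A + A.T

def max_laplacian_eigenvalue(adj):
    """Largest eigenvalue of the normalized Laplacian I - D^{-1/2} A D^{-1/2}."""
    d = adj.sum(axis=1)
    d[d == 0] = 1.0                                 # guard against isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(adj.shape[0]) - D_inv_sqrt @ adj @ D_inv_sqrt
    return np.linalg.eigvalsh(L).max()

print(max_laplacian_eigenvalue(A))               # without self-loops
print(max_laplacian_eigenvalue(A + np.eye(n)))   # with self-loops: a smaller value
```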
For more information on the spectral analysis, please refer to e.g. [2], [1], and [8].
Related Work
The first spectral graph-based convolutional neural network was introduced by Bruna et al. in 2014 [9]. ChebyNets [10], introduced in 2016, define graph convolutions via Chebyshev polynomials and thereby avoid the computationally expensive eigendecomposition. Thekumparampil et al. [11] also introduced a linear graph learning model, which removes the intermediate fully-connected layers from Graph Neural Networks and replaces them with attention mechanisms that respect the underlying graph structure. However, adding attention mechanisms also comes with an inevitable increase in model complexity. Their attention-based graph neural network achieves results very similar to SGC on the same benchmark datasets.
Cai and Wang [13] introduced Local Degree Profile (LDP), a simple graph representation for non-attributed graphs, based on local information. They also report competitive performance on graph classification datasets. LDP takes linear time to compute and can also be used as a baseline algorithm for graph classification.
Experiments
Wu et al. evaluated SGC on several different datasets. For example, they performed semi-supervised node classification on the citation network datasets Cora, Citeseer, and Pubmed [4] and compared the performance of SGC to other state-of-the-art graph neural networks like GCN [2], GAT [5], FastGCN [6], and GIN [7]. For this blog post, a subset of the whole evaluation in their paper was selected.
Network | Cora | Citeseer | Pubmed |
---|---|---|---|
GCN | 81.4 ± 0.4 | 70.9 ± 0.5 | 79.0 ± 0.4 |
GAT | 83.3 ± 0.7 | 72.6 ± 0.6 | 78.5 ± 0.3 |
FastGCN | 79.8 ± 0.3 | 68.8 ± 0.6 | 77.4 ± 0.3 |
GIN | 77.6 ± 1.1 | 66.1 ± 0.9 | 77.0 ± 1.2 |
SGC | 81.0 ± 0.0 | 71.9 ± 0.1 | 78.9 ± 0.0 |
Table 2: Test Accuracy (%) averaged over 10 runs on citation network datasets, executed by Wu et al. [1]
We can see that the performance of SGC is competitive with other comparable state-of-the-art graph neural networks. On the Citeseer dataset, SGC even outperforms GCN, which might be due to SGC having fewer parameters and therefore suffering less from over-fitting.
In addition to the competitive performance of SGC, Table 3 shows that it is much faster to train. Especially the Graph Attention Network (GAT) [5] takes many times longer to train than SGC. This is due to the small number of learnable parameters in SGC. Figure 3 also shows how the training time relates to the accuracy of a model. We can see that SGC achieves competitive performance while requiring the lowest training time of all compared models.
Network | Cora | Citeseer | Pubmed |
---|---|---|---|
GCN | 0.49 | 0.59 | 8.31 |
GAT | 63.10 | 118.10 | 121.74 |
FastGCN | 2.47 | 3.96 | 1.77 |
GIN | 2.09 | 4.47 | 26.15 |
SGC | 0.13 | 0.14 | 0.29 |
Table 3: Training time in seconds of different graph models on the citation datasets, averaged over 10 runs, as reported in the original paper [1].
Fig. 3: Performance and training time comparison of different graph models on the two datasets "Pubmed" and "Reddit"; figure from the original paper [1]. The training times of GaAN and DGI were not evaluated because the source code was not released.
In some evaluation cases, an MLP was applied to the data before performing the graph learning with SGC to achieve better performance. This way the original features can be mapped into a different space before applying the simple graph convolution.
Discussion
With Simple Graph Convolutions, Wu et al. [1] explored the simplest possible way to formulate a graph convolutional model. SGC is the combination of a graph-based pre-processing step, followed by a standard multi-class logistic regression. But even though the model is kept as simple as possible, it reaches competitive performance on several benchmark datasets and sometimes even outperforms more complex state-of-the-art graph neural networks. Due to the simple linear model, training time is reduced significantly and SGC scales easily to large datasets. Wu et al. argue that learning graph convolutional filters is superfluous.
Furthermore, SGC is interpretable in itself based on the spectral analysis of the model: it can be interpreted as a low-pass-type filter in the spectral domain, which captures low-frequency signals and corresponds to a smoothing of features in this setting. The authors also showed how the renormalization trick shrinks the spectral domain of the graph.
The fixed feature extraction \tilde{X} = S^K X can be pre-computed, since it does not involve any learnable parameters. This also contributes to the fast training.
Student Review
In my opinion, SGC is a valuable addition to the selection of graph models in the literature. Having a linear model to which more complex models can be compared provides an important baseline. Such linear counterparts exist for artificial neural networks, and it is beneficial to also have a comparable linear model for graph models. Given the strong performance of SGC, I believe it is a good idea to try solving a graph learning problem with SGC first and only then move on to more complex models. Also, when developing new graph learning models, it can be useful to compare their performance against SGC first.
The paper does not discuss any regression tasks for SGC but only focuses on classification tasks - both node classification and graph classification. It would have been nice to also include a regression task in the evaluation of SGC.
The interpretability of SGC is only mentioned briefly. Given the low-pass-type filter interpretation of SGC, it would have been interesting to see a visualization of those filters and thereby get more insight into the model.
Since several linear graph models exist, it would have been nice if the paper had compared SGC in more detail with other models like Attention-based Graph Neural Networks [11] or LDP [13]. It would also have been interesting to get a clearer distinction between the available linear graph models, as well as a comparison of their training times and performances.
Overall, I think that SGC performs extremely well and can serve as a good baseline model. There might even be untapped potential in how Graph Neural Networks exploit non-linearities, since SGC shows that the non-linearities might not be as relevant as the feature propagation along the graph structure.
References
[1] Wu, Felix, et al. "Simplifying graph convolutional networks." International conference on machine learning. PMLR, 2019.
[2] Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).
[3] Bronstein, Michael M., et al. "Geometric deep learning: going beyond euclidean data." IEEE Signal Processing Magazine 34.4 (2017): 18-42.
[4] Sen, Prithviraj, et al. "Collective classification in network data." AI magazine 29.3 (2008): 93-93.
[5] Veličković, Petar, et al. "Graph attention networks." arXiv preprint arXiv:1710.10903 (2017).
[6] Chen, Jie, Tengfei Ma, and Cao Xiao. "Fastgcn: fast learning with graph convolutional networks via importance sampling." arXiv preprint arXiv:1801.10247 (2018).
[7] Xu, Keyulu, et al. "How powerful are graph neural networks?." arXiv preprint arXiv:1810.00826 (2018).
[8] Wu, Felix, et al. "Simplifying graph convolutional networks (Supplementary Material)." URL: http://proceedings.mlr.press/v97/wu19e/wu19e-supp.pdf
[9] Bruna, Joan, et al. "Spectral networks and locally connected networks on graphs." International Conference on Learning Representations (ICLR), 2014.
[10] Defferrard, Michaël, Xavier Bresson, and Pierre Vandergheynst. "Convolutional neural networks on graphs with fast localized spectral filtering." arXiv preprint arXiv:1606.09375 (2016).
[11] Thekumparampil, Kiran K., et al. "Attention-based graph neural network for semi-supervised learning." arXiv preprint arXiv:1803.03735 (2018).
[12] https://www.slideshare.net/deview/251-implementing-deep-learning-using-cu-dnn/4
[13] Cai, Chen, and Yusu Wang. "A simple yet effective baseline for non-attributed graph classification." arXiv preprint arXiv:1811.03508 (2018).