A Blogpost Discussing the Paper 'Exploiting Edge Features in Graph Neural Networks' by Liyu Gong and Qiang Cheng, presented at CVPR 2019.

Introduction

Graphs are an often overlooked data structure when we think of learning from data, largely because their complex structure makes them difficult to learn from. However, recent scientific publications have made significant inroads into this field.

Graphs

Graphs are data structures that consist of vertices (nodes) joined by edges. The edges can be directed or undirected, weighted or unweighted. Both the edges and the nodes can have features associated with them.

Fig 1: An Example Graph

Fig 2: Visual Representation of the Cora Dataset as a Graph [3]

Graph Learning

Graph Learning is important because many systems of interactions in the natural world, such as molecular structures and biological protein networks, can be modeled naturally as graphs. Graphs also have an advantage over the conventional tabular (Euclidean) representation of data: the equivalent table would be large and sparse, which makes learning from it cumbersome, whereas the graph representation is compact.

Fig 3: Example Graph Problem of Edge Classification

Medical Applications

  • Disease Prediction [5] [7]: Patients are represented as nodes whose features are derived from the patients' medical images. The network learns the probability of a certain disease appearing in an individual.

  • Link Prediction [6]: The network learns the characteristic features of relevant edges in drug-disease prediction, protein-protein interaction and drug-drug interaction.

Graph Neural Networks

These are networks that operate directly on the graph structure. Graph Neural Networks are typically used for node classification, edge classification or classification of the graph as a whole. A typical Graph Neural Network learns from two matrices that are constructed from the underlying graph.

  1. Feature Description Matrix (H): Aggregate of the features of all nodes of the network.


  2. Adjacency Matrix (A): Encodes the Structure of Edges. 

A Typical layer of a Graph Neural Network can be compactly represented by the following mathematical expression:

H^{(l+1)} = f(H^{(l)},A)
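As a concrete illustration of this update rule, below is a minimal numpy sketch of one possible choice of f: aggregate neighbour features through A, apply a learned weight matrix and a nonlinearity. The function and variable names, and the ReLU choice, are illustrative assumptions rather than a particular published architecture.

import numpy as np

def gnn_layer(H, A, W):
    # One possible instantiation of H^(l+1) = f(H^(l), A): aggregate neighbour
    # features through A, transform with W, then apply a ReLU nonlinearity.
    return np.maximum(A @ H @ W, 0.0)

# Toy usage: 5 nodes, 4 input features, 8 hidden features
rng = np.random.default_rng(0)
A = np.eye(5)                     # placeholder adjacency (self-loops only)
H = rng.normal(size=(5, 4))
W = rng.normal(size=(4, 8))
print(gnn_layer(H, A, W).shape)   # (5, 8)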

Adjacency Matrix

This matrix compactly represents the structure of the graph, i.e. it denotes which nodes are connected and which are not. An example is presented below.

Fig 4: A Graph Structure with five nodes with non-directional, non-weighted edges

A = \begin{bmatrix} 1 & 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 & 1 \\ \end{bmatrix}

The adjacency matrix is binary, with 1 denoting the presence of an edge between the nodes given by the respective row and column indices and 0 denoting the absence of an edge. Each node is connected to itself. The matrix is binary because the graph under consideration is undirected and unweighted.
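For reference, the matrix above can be built from an edge list in a few lines of numpy. The edge list below is simply read off the matrix (0-based node indices); this is an illustrative sketch, not code from the paper.

import numpy as np

# Edge list read off the adjacency matrix above (0-based node indices); the graph
# is undirected and unweighted, and every node is connected to itself.
edges = [(0, 1), (0, 4), (1, 2), (1, 3), (3, 4)]

A = np.eye(5, dtype=int)          # self-loops on the diagonal
for i, j in edges:
    A[i, j] = A[j, i] = 1         # symmetric entries for an undirected edge

print(A)                          # reproduces the matrix shown above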

Related Work

In order to fully understand the contributions of the paper under discussion, we have to take a look back at pre-existing methods.

Graph Convolutional Network

This network is directly inspired by the Convolutional Neural Network (CNN). The idea is similar in that we transform the relevant information from the spatial domain to the spectral domain, convolve it with the filter, and then transform the result back to the spatial domain. The calculation begins with the computation of the graph Laplacian.

L = I_n - D^{-1/2}WD^{-1/2}

W is the weighted adjacency matrix and D is the diagonal degree matrix with D_{ii} = \sum_j W_{ij}.

This Laplacian matrix is diagonalizable by the Fourier basis, L = U\Lambda U^T, where \Lambda and U are the eigenvalues and orthonormal eigenvectors of L. This gives the graph Fourier transform of x as \hat{x} = U^T x and the inverse as x = U \hat{x}.

This gives us the following representation for a node signal x being filtered by a filter g_{\theta}.

y = g_{\theta} (L) x = g_{\theta} (U \Lambda U^T)x = U g_{\theta}(\Lambda) U^T x
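The following numpy sketch walks through this spectral filtering pipeline: build the normalized Laplacian, take its eigendecomposition, filter in the spectral domain, and transform back. The example filter (an exponential low-pass) and the small example graph are illustrative assumptions.

import numpy as np

def spectral_filter(W, x, g_theta):
    # y = U g_theta(Lambda) U^T x, with L = I - D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt
    lam, U = np.linalg.eigh(L)          # eigenvalues and orthonormal eigenvectors of L
    x_hat = U.T @ x                     # graph Fourier transform
    return U @ (g_theta(lam) * x_hat)   # filter spectrally, then transform back

# Example: a small 3-node graph and a low-pass filter that damps high graph frequencies
W = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
x = np.array([1.0, -1.0, 2.0])
y = spectral_filter(W, x, lambda lam: np.exp(-lam))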

Drawbacks:

  • GCN requires knowledge of the complete graph to perform the convolution operation, because all the eigenvalues (λ) and eigenvectors of the graph Laplacian have to be calculated, which is computationally very expensive.
  • New additions to the graph require recalculating the entire set of architecture parameters, since the eigenvalues (λ) change.
  • Knowledge cannot be transferred from one graph to another (the learning is transductive).

Graph Attention Network

In order to overcome the drawbacks of the GCN, the GAT was proposed. It is a non-spectral approach in which the convolutions are performed directly on spatially close neighbours. Attention is a mechanism by which the network determines how influential the features of the neighbouring nodes are in learning the features of the node under consideration. The basic building block of the GAT is the Graph Attention Layer. The following sections explain how the GAT works.

Fig 5: An Example Graph fed into a Graph Attention Layer

Here h_i is a feature vector of length F.

Step 1: Linear Transformation

The first step performed by the Graph Attention Layer is to apply a linear transformation, parameterized by a weight matrix W, to the feature vectors of the nodes.

Step 2: Computation of Attention Coefficients

Attention coefficients determine the relative importance of neighbouring features with respect to each other. They are calculated using the formula e_{ij} = a(W\vec{h_i},W\vec{h_j}), where a is a function that we choose subject to the restriction a: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \rightarrow \mathbb{R}, and i and j are neighbouring nodes.


Fig 6: Individual Steps in calculation of attention coefficients for each pair of Neighbouring nodes

As we observe in the series of images above, we first calculate the self-attention coefficient and then compute attention coefficients with all of the neighbours.

Step 3: Normalization of Attention Coefficients

Due to the varied structure of graphs, nodes can have different numbers of neighbours. In order to have a common scaling across all neighbourhoods, the attention coefficients are normalized.

\alpha_{ij} = \dfrac{\exp (LeakyReLU(e_{ij}))}{\sum_{k \in N_i} \exp (LeakyReLU(e_{ik}))}

Here N_i is the neighbourhood of node i.

Step 4: Computation of Final Output Features

Now we compute the learned features of the nodes, where \sigma is a non-linear transformation.

Fig 7: Example Network Architecture

\vec{h_i}' = \sigma \left( \sum_{j \in N_i} \alpha_{ij} W \vec{h_j} \right)

Step 5: Computation of Multiple Attention Mechanisms

In order to improve the stability of the learning process, multi-head attention is employed: we compute multiple independent attention maps and finally aggregate all the learned representations.

\vec{h_i}' = \sigma \left( \dfrac{1}{K} \sum_{k=1}^K \sum_{j \in N_i} \alpha_{ij}^k W^k \vec{h_j} \right)

K denotes the number of independent attention heads used.
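The five steps above can be condensed into a short numpy sketch of a single GAT layer. This is an illustrative re-implementation, not the authors' code: the adjacency matrix is assumed to be binary with self-loops, the output nonlinearity (tanh) is an arbitrary choice, and the nonlinearity is applied after averaging the heads so that the code matches the multi-head formula above.

import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def gat_head(H, A, W, a):
    # One attention head covering steps 1-4.
    # H: (N, F) node features; A: (N, N) binary adjacency with self-loops;
    # W: (F, Fp) weight matrix; a: (2*Fp,) attention vector.
    Wh = H @ W                                                      # step 1: linear transformation
    Fp = Wh.shape[1]
    # step 2: e_ij = LeakyReLU(a^T [W h_i || W h_j]) for all pairs (i, j)
    e = leaky_relu((Wh @ a[:Fp])[:, None] + (Wh @ a[Fp:])[None, :])
    # step 3: restrict to the neighbourhood of i and softmax-normalize
    e = np.where(A > 0, e, -np.inf)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ Wh                                               # step 4: weighted aggregation

def gat_layer(H, A, weights, attn_vectors):
    # Step 5: average K independent heads (all heads must share the same output
    # dimensionality here), then apply the nonlinearity.
    heads = [gat_head(H, A, W, a) for W, a in zip(weights, attn_vectors)]
    return np.tanh(np.mean(heads, axis=0))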

Fig 8: Overall look at all operations involved in the update of learned features of nodes [3]

Methodology

Graph Structure 

This paper presents a network architecture that can handle a much more complex graph, like the one shown below, with all features being real-valued.


Fig 9: Representation of Graph Structure to be learned

We see that the graph now has edges with multi-dimensional features, which are able to be real-valued.

Proposed Network Architecture

The paper proposes a novel architecture that learns from edge features in addition to node features. The proposed network is named the Edge Enhanced Graph Neural Network (EGNN).

Fig 10: Comparison of Network Architectures. The left one is the traditional GAT and the right one is the proposed EGNN [1]

Doubly Stochastic Normalization of Edges

The edge feature matrix has to be normalized after each layer of the graph network, as it is used as the filter in the convolution operation with the node feature matrix. The authors propose doubly stochastic normalization of the edges instead of the conventional row normalization of the GAT and the symmetric normalization of the GCN.

\tilde{E}_{ijp} = \dfrac{\hat{E}_{ijp}}{\sum_{k=1}^N \hat{E}_{ikp}}
E_{ijp} = \sum_{k=1}^N \dfrac{\tilde{E}_{ikp} \tilde{E}_{jkp}}{\sum_{v=1}^N \tilde{E}_{vkp}}

Here each edge has P features, \hat{E}_{ijp} denotes the raw p^{th} feature of the edge connecting nodes i and j, and E_{ijp} denotes the normalized feature.
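A small numpy sketch of this normalization, applied channel by channel to an edge-feature tensor of shape (N, N, P), might look as follows (the names and the eps term for numerical safety are illustrative). After the second step, each channel is doubly stochastic, i.e. its rows and columns both sum to one.

import numpy as np

def doubly_stochastic_normalization(E_raw, eps=1e-12):
    # Normalize raw edge features E_raw of shape (N, N, P) channel by channel
    # using the two equations above: row-normalize, then the symmetric product step.
    N, _, P = E_raw.shape
    E_out = np.zeros((N, N, P))
    for p in range(P):
        E_hat = E_raw[:, :, p]
        E_tilde = E_hat / (E_hat.sum(axis=1, keepdims=True) + eps)   # tilde{E}_ijp
        col = E_tilde.sum(axis=0) + eps                              # sum_v tilde{E}_vkp
        E_out[:, :, p] = (E_tilde / col[None, :]) @ E_tilde.T        # E_ijp
    return E_out

# Quick sanity check on a random non-negative edge tensor
rng = np.random.default_rng(0)
E = doubly_stochastic_normalization(rng.random((4, 4, 2)))
print(E[:, :, 0].sum(axis=0), E[:, :, 0].sum(axis=1))   # both close to 1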

Network Layers

In contrast to the GAT discussed above, which handles only one-dimensional binary edge features, the Edge Enhanced Graph Neural Network can handle multi-dimensional real-valued edge features. The EGNN can be constructed with two main types of layers: the attention based layer or the convolution based layer.

Attention based EGNN layer - EGNN(A)

This layer is designed as an extension of the GAT layer that can handle multi-dimensional real-valued edge features. The layer is described by the following mathematical expression.

X^l = \sigma [\prod_{p=1}^P (\alpha^l_{..p}(X^{l-1},E^{l-1}_{..p})g^l(X^{l-1}))]

P is the total number of features per edge, and \alpha is the calculated attention coefficient.

In this work, g^l(X^{l-1}) = W^l X^{l-1}, where W is a parameter matrix. Notice the similarity with the GAT; the main difference is that the edge features are also used in the calculation of the attention coefficients.

Attention Coefficient Calculation


f^l(X_i^{l-1}, X_j^{l-1}) = \exp \{L(a^T[WX_i^{l-1}||WX_j^{l-1}])\}
\hat{\alpha}_{ijp}^l = f^l(X_i^{l-1}, X_j^{l-1}) E_{ijp}^{l-1}
\alpha_{..p}^l = DS(\hat{\alpha}_{..p}^l)


L - LeakyReLU ; W - Parameter Matrix ; || - Concatenation Operation ; DS - Doubly Stochastic Normalization Operator

Update of Edge Features


E^l = \alpha^l
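Putting the attention calculation, the doubly stochastic normalization and the edge update together, a single EGNN(A) layer can be sketched as below. This is a hypothetical re-implementation under a few assumptions: the \prod in the layer equation is read as concatenation over the P edge channels, one parameter matrix W is shared across channels with g(X) = WX as stated above, the output nonlinearity is tanh, and ds_normalize is a compact version of the doubly stochastic operator sketched earlier.

import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def ds_normalize(E, eps=1e-12):
    # Doubly stochastic normalization applied channel-wise to E of shape (N, N, P).
    out = np.zeros_like(E, dtype=float)
    for p in range(E.shape[2]):
        Et = E[:, :, p] / (E[:, :, p].sum(axis=1, keepdims=True) + eps)
        out[:, :, p] = (Et / (Et.sum(axis=0) + eps)[None, :]) @ Et.T
    return out

def egnn_attention_layer(X, E, W, a):
    # X: (N, F) node features; E: (N, N, P) edge features (zero where no edge);
    # W: (F, Fp) parameter matrix; a: (2*Fp,) attention vector.
    WX = X @ W
    Fp = WX.shape[1]
    # f^l(X_i, X_j) = exp{ LeakyReLU( a^T [W X_i || W X_j] ) }
    f = np.exp(leaky_relu((WX @ a[:Fp])[:, None] + (WX @ a[Fp:])[None, :]))
    alpha = ds_normalize(f[:, :, None] * E)      # hat_alpha_ijp = f_ij * E_ijp, then DS
    P = E.shape[2]
    # Concatenate the per-channel aggregations over p, then apply the nonlinearity
    X_new = np.tanh(np.concatenate([alpha[:, :, p] @ WX for p in range(P)], axis=1))
    return X_new, alpha                          # edge features are updated to alpha^l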


Convolution based EGNN layer - EGNN(C)

The graph convolution operation is a special case of the graph attention operation. It directly utilises the edge feature matrix.

X^l = \sigma [\prod_{p=1}^P (E_{..p}X^{l-1}W^l)]
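A corresponding sketch of the convolution based layer, again reading \prod as concatenation over the P edge channels and using an illustrative tanh nonlinearity:

import numpy as np

def egnn_conv_layer(X, E, W):
    # For each edge channel p, filter the node features with E_{..p}, apply the
    # parameter matrix W, and concatenate the results over p.
    # X: (N, F) node features; E: (N, N, P) normalized edge features; W: (F, Fp).
    P = E.shape[2]
    return np.tanh(np.concatenate([E[:, :, p] @ X @ W for p in range(P)], axis=1))

# Toy usage
rng = np.random.default_rng(0)
X, E, W = rng.random((5, 4)), rng.random((5, 5, 3)), rng.random((4, 8))
print(egnn_conv_layer(X, E, W).shape)   # (5, 24) = (N, P * Fp)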

Edge Features for Directed Graph

The proposed network architecture also takes the directionality of the graph into account by expanding the number of edge channels to three: each E_{ijp} is expanded to [E_{ijp} \ \ E_{jip} \ \ E_{ijp}+E_{jip}]. The three channels correspond to the forward, backward and undirected neighbourhoods.
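Concretely, this channel expansion is a simple tensor operation; the sketch below assumes the edge features are stored as an (N, N, P) array.

import numpy as np

def expand_directed_channels(E):
    # Expand each directed edge channel E_{..p} (shape (N, N, P)) into the three
    # channels [E_ijp, E_jip, E_ijp + E_jip], giving a tensor of shape (N, N, 3P).
    forward = E
    backward = np.transpose(E, (1, 0, 2))      # swap i and j: E_jip
    return np.concatenate([forward, backward, forward + backward], axis=2)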

Experiment

Dataset

Citation Networks

The citation network datasets are all directed graphs, with edge directions representing the directions of citations. The nodes, i.e. papers, are classified into one of several classes, so the problem being solved is multi-class node classification. The datasets used are Cora, Citeseer and Pubmed. The Cora dataset has very high class imbalance, which necessitates the use of weighted loss functions.


                  Cora    Citeseer    Pubmed
#Nodes            2708    3327        19717
#Edges            5429    4732        44338
#Node Features    1433    1433        3703
#Classes          7       6           3

Fig 11: Class Distribution of Citation Network Datasets [1]

Molecular Analysis

A molecule can be represented as a graph with the nodes being atoms and the edges representing chemical bonds. The datasets used are Tox21, Lipophilicity and Freesolv. The Tox21 dataset is associated with the problem of whole-graph classification, while the other two are associated with whole-graph regression.


                  Tox21    Lipophilicity    Freesolv
#Compounds        7831     4200             642
#Node Features    25       25               25
#Edge Features    42       21               25

Experimental Setup

  • Each dataset is split into three subsets – training, validation and testing.
  • The authors used two splits of the data for the citation networks (see the sketch after this list).
    • Sparse: 5%, 15% and 80% for training, validation and test
    • Dense: 60%, 20% and 20% as above

  • Data split for molecular analysis: training (80%), validation (10%) and testing (10%).
  • 2 EGNN layers are used for all experiments.
  • The non-linear activation employed is the Exponential Linear Unit (ELU).
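A minimal sketch of such an index split is shown below. This is a plain random split over node indices; the exact splitting procedure used by the authors may differ.

import numpy as np

def split_indices(num_items, fractions, seed=0):
    # Split indices into train/validation/test according to the given fractions,
    # e.g. (0.05, 0.15, 0.80) for the sparse split or (0.60, 0.20, 0.20) for the dense split.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_items)
    n_train = int(fractions[0] * num_items)
    n_val = int(fractions[1] * num_items)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(2708, (0.05, 0.15, 0.80))  # sparse split of Cora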

                  Tox21                    Lipophilicity        Freesolv
Loss Function     Sigmoid Cross-Entropy    Mean-Squared Loss    Mean-Squared Loss

Results

Citation Networks

List of Network Architectures used for Testing

  • GCN: baseline
  • GAT: baseline
  • EGNN(C): Full EGNN(C) with doubly stochastic normalization and multi-dimensional edge features.
  • EGNN(A): Full EGNN(A) with all functionalities.
  • Architectures marked with * use a weighted loss function
  • -D: No doubly stochastic normalization
  • -M: Ignores multi-dimensional edge Features
  • -A: No adaptiveness across layers

Fig 12: Table of Results obtained from Citation Networks Experiment [1]

Fig 1: Charts of Results for Citation Networks

Observations:

  • Almost all EGNN variants perform better than the baselines, which indicates that the proposed ideas have merit.
  • The weighted loss function improves the baseline GAT and GCN performance on the Cora dataset considerably, for both the sparse and dense splits.
  • The non-weighted loss function variants of EGNN perform as well as the weighted loss function variants on the Cora dataset. This indicates that the proposed additions inherently balance out some of the class imbalance.
  • Adding weighted loss functions to the more balanced datasets does not lead to an improvement in performance.
  • Performance on the dense splits is much better than on the sparse splits. This is to be expected, as the network has more data to learn from in the dense split.
  • The variant with all of the proposed ideas enabled does not perform best in most cases. This indicates that certain ideas do not contribute much when learning certain datasets.

Molecular Analysis

Fig 13: Table of Results obtained from Molecular Analysis Experiment* [1]

Fig 14: Charts of Results for Molecular Analysis

The baseline models used were Random Forest and Weave (a specialized GCN). EGNN(A) and EGNN(C) perform similarly on all three datasets. The greatest improvement over the baselines is observed in the regression tasks.

* AUC stands for Area Under Curve (higher indicates better classification). RMSE stands for Root Mean Squared Error (lower is better).

Conclusion

The authors are able to learn from complex graphs without discarding information, unlike previous approaches which could not handle the full complexity of such graphs. The proposals of multi-dimensional real-valued edge features, adapting edge features across layers, doubly stochastic normalization and the handling of directed graphs will help in learning more complex graph structures.

Limitation

  • Difficult to Perform ‘Deep Learning’

Due to the nature of graphs and the proposed methodology, each new layer added to the network increases the number of calculations needed roughly exponentially, as we need to keep aggregating over neighbours of neighbours.

Fig 15: Limitation of Graph Learning

This is not an issue if the hardware setup has enough memory to fit the entirety of the graph in memory.
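The growth of the receptive field can be made concrete with a small sketch that counts how many nodes influence a single node after k rounds of neighbourhood aggregation. This is illustrative only; A is assumed to be a binary adjacency matrix with self-loops.

import numpy as np

def receptive_field_growth(A, node, num_layers):
    # Number of nodes whose features influence `node` after k layers of
    # neighbourhood aggregation; illustrates how the per-node computation grows with depth.
    reachable = np.zeros(A.shape[0], dtype=bool)
    reachable[node] = True
    sizes = []
    for _ in range(num_layers):
        reachable = reachable | (A[reachable].sum(axis=0) > 0)
        sizes.append(int(reachable.sum()))
    return sizes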

References

[1] Liyu Gong, Qiang Cheng. Exploiting Edge Features in Graph Neural Networks. In Computer Vision and Pattern Recognition, 2019. 

[2] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 3844–3852. Curran Associates, Inc., 2016.

[3] P. Velickovic, G. Cucurull, A. Casanova, and A. Romero. Graph Attention Networks. In International Conference on Learning Representations, 2018. 

[4] T. N. Kipf and M. Welling. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations, 2017.

[5] Sarah Parisot, Sofia Ira Ktena, Enzo Ferrante, Matthew Lee, Ricardo Guerrero, Ben Glocker, Daniel Rueckert. Disease Prediction using Graph Convolutional Networks: Application to Autism Spectrum Disorder and Alzheimer's Disease. In Medical Image Analysis, 2018. 

[6] Xiang Yue, Zhen Wang, Jingong Huang, Srinivasan Parthasarathy, Soheil Moosavinasab, Yungui Huang, Simon M Lin, Wen Zhang, Ping Zhang, Huan Sun. Graph embedding on biomedical networks: methods, applications and evaluations, CoRR 2019. 

[7] Xi Zhang, Lifang He, Kun Chen, Yuan Luo, Jiayu Zhou, and Fei Wang. Multi-View Graph Convolutional Network and Its Applications on Neuroimage Analysis for Parkinson's Disease. In AMIA Annual Symposium Proceedings, 2018.
