OUTLINE

1. Introduction
2. Methodology of state-of-the-art graph diffusion models
3. Results and discussion
- 3.1 Experimental results
- 3.2 Discussion
4. References

1. Introduction

1.1. Diffusion probabilistic models

Diffusion probabilistic models also known, as diffusion models, are becoming more and more popular in the world of Deep Learning. Introduced in 2015 [1], these models were initially inspired by statistical physics and gave poor results in most of the benchmarks from that year. In a nutshell, diffusion models are trained to corrupt some initial data with tiny gaussian perturbations and then to denoise the result of this corruption process: the final goal is to get back to the initial data. However, as the years go by, more and more variants of diffusion models were released and used in different scenarios. From 2020, [2] found out that the architecture in question is really promising in the field of deep generations. In fact, diffusion models achieved and still achieve nowadays state-of-the-art results in many generative tasks, with an emphasis on pictures and videos [3].

1.2. Stakes of graph manipulation

On the other hand, graph manipulation is gaining more and more interests of industry from a wide range of fields. Indeed, to present some examples, molecules and polymers are usually represented as graphs in drugs and material discovery industry [4], while social medias also represent our social networks as graph [5]. In the medicine field, in addition of molecules, polymers and drugs discovery, we can find graphs in many other different contexts: for instance, we can model the link between diseases, Its causes and the drugs to cure It with graphs [15], or we can model a hospital logistics network to have a better management of its ressources [16]. As a result, graph manipulation has become increasingly popular, with the publication of numerous papers using different approaches with more or less success. To generate graph, Variational Auto Encoders (VAEs) [6], Generative Adversarial Networks (GANs) [7], normalizing flows [8] or even energy-based models (EBMs) [9] were already used with convincing results. However, using these generated graph to augment a dataset for some downstream tasks such as node attribute prediction does not give satisfactory results.

1.3. Challenges of using diffusion models for graph generation

Here comes the idea of using diffusion models for graph generation to improve the current results. Nevertheless, most first attempts were disappointing [10]. In fact, the different trials reveal the same challenges to overcome when using diffusion models with graph. First, diffusion models initially embeds graph into a continuous space and then add gaussian noises: this destroys the distribution of the original discrete graph structures and so make some structural information such as connectivity being undefined. Secondly, diffusion models usually do not let us introducing easily constraints during the generation process without retraining [11]. Finally, diffusion models makes the data cross an extensive amount of noise and denoising layers, making the training and inference slow. We are going to take our interest in three different state of the arts papers which tries to overcome these challenges: Data-Centric Learning from Unlabeled Graphs with Diffusion Model (Liu et al., 2023), Autoregressive Diffusion Model for Graph Generation (Kong et al., 2023) and DiGress: Discrete Denoising diffusion for graph generation (Vignac et al., 2023).

2. Methodology of state-of-the-art graph diffusion models

2.1. Data-Centric Learning from Unlabeled Graphs with Diffusion Model (Liu et al., 2023)

2.1.1. Motivation and overview of the proposed framework

In this paper, Liu et al. are trying to address the problems faced by current self-supervised learning methods on graph. The main idea was to train a diffusion model to generate new graphs out of an unlabeled graph dataset - data that we have in much greater quantity - in order to later augment any labeled dataset by feeding the diffusion model in question with the labeled graphs. This augmented labeled dataset is then fed to a prediction model.

To do so, they present a new model called Data-Centric Transfer framework (DCT). The key point to respect during the research was to keep minimal sufficient knowledge from the unlabeled dataset. The knowledge should be minimal to guarantee the relevance of the generated graphs: they should be related to the downstream task and keep the same label more than being part of the distribution of the unlabaled dataset. The knowledge should also be sufficient so that the model actually learns dataset invariant knowledge: for example, when generating molecules, a model should learn that an oxygen usually only has two covariant bonds.

Figure 1: Comparison of the existing approaches and DCT

As just stated, the final goal of the diffusion model that we are studying is to augment a labeled dataset based on graphs from this same labeled dataset. This is one big novelty compared to the existing approaches. In fact, the existing approaches augment the labeled dataset either by training a graph property predictor to select and label relevant unlabeled graphs, either by generating graphs directly from the standard diffusion model, and then use a trained predictor to assign them labels a posteriori. A comparison between the proposed framework and the old ones is presented on Figure 1.

2.1.2. Methodology

The training process of DCT can be spliited into two phases. The first one is the self-supervised training on the big unlabeled dataset and the second one is the augmentation of the small task specific labeled dataset.

2.1.2.1. Self-supervised learning on unlabelled dataset

The first part of DCT training is relatively classic. Each unlabelled graph goes through a diffusion process of T steps parametrized by the stochastic differential equation (SDE) shown in Equation 1.

Equation 1: G^(t) is the diffused Graph G at time step t, f is the drift coefficient, g is the diffusion coefficient and w is the standard Wiener process.

Then, the result of this diffusion process gors through T steps of denoising parametrised by the reverse-time SDE in Equation 2.

Equation 2: p_t(G^(t)) is the marginal distribution at time t in forward diffusion and w is the reverse standard Wiener process.

2.1.2.2. Task specific learning on labelled dataset

Figure 2: Diffusion model in DCT

In the second part of DCT training, Liu et al. focused on transferring minimal sufficient information from the model they obtained from the self-supervised learning phase. The final goal is to be able to generate for each graph of the labelled dataset a new graph which is as different as possible from the input graph, but that we can be label with the same label as the input graph. That way the augmented labelled dataset will still be task related while having more diversity than the non augmented labelled dataset. In other words, in this part, DCT is trained to minimise the quantity shown in Equation 3. To simplify, the first term represents the mutual information between the input of DCT and the output of DCT and has to be minimised since we want them to be as different as possible to add diversity to the labelled dataset ; and the second term represents the mutual information between the output of DCT and the label of the input and is a quantity that we want to maximise as we want G' to have the same label as G.

Equation 3: G' is the graph generated by DCT from G.

While I₁ can be optimised easily by incorporeating a distance measure in the graph space, It may be more difficult to see how to estimate I₂. In fact, to do this, the prediction model is initially trained on the non augmented labelled dataset for some epochs f_θ and then freezed to be used in order to minimise p(y|f_θ(G')). This lead to the definition of the loss presented in Equation 4 that we want to minimise.

Equation 4: I_bound is a distance measure in the graph space

Finally, after having introduced these notations, we can present the SDE used to train DCT in this phase during the denoising part. This SDE is shown in Equation 5. In fact, the diffusion process is reused without modification from the first training phase.

Equation 5: α is a normalising term.

2.2. Autoregressive Diffusion Model for Graph Generation (Kong et al., 2023)

2.2.1. Motivation and overview of the proposed framework

The paper by Kong et al. are focusing more on the abilities of diffusion models to deal with graphs. In fact, actually, diffusion models struggle to have the same gain of performance with graphs in comparison with images or videos. The main problem with the previous diffusion models that the author try to solve is that these models do not suit the graph which are discrete data. Adding Gaussian noise for many steps can change the distribution of the original discrete graph structures, making the data-label association less meaningful.

Thus, they propose the model GraphArm which is the first autoregressive diffusion-based graph generation model. As with all diffusion models, GraphArm takes an input (here a graph) and diffuse It to a noise space and then denoise It to get a new graph. However, each phase has some specific characteristics. During the diffusion, instead of traditionally gradually adding gaussian noise to the adjacency matrix, noise is added in the form of one-by-one removal of a node and its adjacent edge in a specific order. During the denoising process, the graph structure is being recovered node by node in the reversed order used to diffuse the input graph with a label assigned to each of them at the same time.

2.2.2. Methodology

Let's split the main findings in two parts: the ones applied during the diffusion process and the ones applied during the generation process.

2.2.2.1. Diffusion process

Figure 3: Diffusion process in GraphArm

With the existing models, the diffusion process is destroying the original distribution of the graph dataset. To circumvent this problem, the idea of changing the way to add noise to the input came: this way should above all be compatible with discrete data. In the end, Kong et al. decided to apply the absorbing diffusion process introduced by [12].

In absorbing diffusion, for each step, one specific node and its neighbouring edges are removed from the input graph, and this until all the nodes are removed. To do so, an absorbing node state is defined, such that when a node has to be removed, it enters into this state: the node is masked and is linked to all the other nodes by masked edges, even the ones which were not linked to this node to mask. Note that masking a node (respectively an edge) means transforming the vector representing It to a vector specific to all masked nodes (respectively all masked edges).

Now comes the question of how to choose the node to absorb. Since all the nodes have to be removed, an idea can be to just select It randomly. However, intuitively, It can be easy to understand that depending on the graph, having a specific order can help the generation process. The paper gives the example of community-structured graphs which usually consist in overlapping subgraphs: It is more natural to generate one subgraph and then generate and connect the others one by one. In the end, the idea which was kept is to train diffusion ordering network q_Φ which decides at each time step t which node v_σ(t) to mask to get G_t.

The entire diffusion process is represented on Figure 3.

2.2.2.2. Generation process

Figure 4: Generation process in GraphArm

To generate a graph from the output of the diffusion process, GraphArm has to adapt to the way the diffusion process is designed.

That way, for each step t, GraphArm creates the graph G_t by first selecting one node to unmask. Then, It creates a graph G'_t+1 consisting of the graph G_t to which we add the node to unmask and a masked edge between It and all the nodes of G_t+1. After creating G'_t+1, the model find the edge to unmask and finally predicts the types of the new node and edges. To select the node to unmask, the reversed order defined to mask the nodes during the diffusion process is used. To select the edges to unmask and predicting the node and edges type, a Graph Attention Network (GAT) is used [13].

The entire generation process is shown on Figure 4.

2.3. DiGress - Discrete Denoising diffusion for graph generation (Vignac et al., 2023)

2.3.1. Motivation and overview of the proposed framework

Figure 5: Overview of DiGress

Like Kong et al., Vignac et al. try to solve the problem of bad performances when using diffusion models with graph data. Thus, they developped DiGress, a Discrete Graph Denoising Diffusion model. To adapt a diffusion model to discrete data, first, in the diffusion process, instead of traditionally adding a gaussian noise to the adjency matrix, DiGress adds or removes edges and edits the category of edges and nodes. For the denoising process, a transformer based network is used.

An overview of the proposed framework is shown in Figure 5.

2.3.2. Methodology

Let's consider separately the diffusion process and the generation process.

2.3.2.1. Diffusion process

It is in the diffusion process that Vignac et al. try to adapt the current diffusion models to discrete data such as graph. To do so, for each time step t, the designed model is adding and removing edges as well as changing category of edges and nodes.

Equation 6: G^t is the diffused graph at time step t, X^t is the matrix of the one hot encodings of the nodes types at time step t stacked horizontally, Q_X^t is the transition probability matrice of the nodes from time step t-1 to time step t such that an element at position (i,j) represents the probability that a node had type i in step t-1 and has type j at time step t, E^t is the matrix of the one hot encodings of the edges types at time step t, Q_E^t is the transition probability matrice of the edges from time step t-1 to time step t such that an element at position (i,j) represents the probability that an edge had type i in step t-1 and has type j at time step t

Mathematically, passing through the step t is equivalent to Equation 6. Multiplying X^t and E^t by their respective transition probability matrices represents the switch between one category to an other from time step t-1 to time step t. To represent the removal or reinsertion of one edge, we simply define one category of edge representing the removal of It: that way, multiplying E^t with Q_E^t can make an edge entering in this category, which means that this edge is removed from the graph.

Now comes the question of the model to choose to define the transition probability matrices. A common approach is to use a uniform transition of Q^t over the classes such that q_X and q_E becomes uniform over the categories. Since the graphs are usually not uniform over the categories, this method may not be the one to give the best performances. Vignac et al. decided to use a transition of Q^t to an optimal prior distribution which is close to the data distribution. Figure 6 shows the difference between the result of a uniform transition and the transition developped in the paper, and shows how the generation process takes advantage of this finding.

Figure 6: Reverse diffusion chains generated from a model trained on uniform transition noise (top) and marginal noise (bottom). When noisy graphs have the right marginals of node and edge types, they are closer to realistic graphs, which makes training easier.

2.3.2.2. Generation process

For the generation process, Vignac et al. reused the Graph Transformer Network developped by [14] with some additions. Figure 7 depicts the final architecture of the model used for generation process.

Figure 7: Architecture of the denoising network

The main addition to the reused model is the incorporation of structural and spectral features. That way, the denoising network is able to catch more information such as the presence of substructures. Besides, the model is also designed to adapt easily to conditional generation. While most of the current methods require a retrain of the entire model, DiGress incorporates a regressor which is trained to predict target properties of a clean graph G from a noisy version of It: the final goal is to use this regressor during the generation process in order to guide It. To adapt DiGress to new conditions of generation, one only need to train this regressor instead of the whole network.

3. Results and discussion

3.1. Experimental results

Since GraphARM and DiGress are more focused on pure use of diffusion models for graph generation while DCT is more focused on pertinence of the generated graphs to augment a dataset for downstream tasks such as classification and regression, the results must be splitted in two sections. The first one will compare the generation performance of GraphARM and DiGress. The second one will present the performance improvements done by DCT in prediction tasks.

3.1.1. Graph generation results

Table 1 shows the performance of GraphARM and DiGress on six generic graph datasets. It also compares these two models to the some existing approaches. Four different measures are used. Deg. is the abbreviation of Degree and measures the difference between the degrees distribution of the generated graphs compared to the original ones (lower is better). Clus. is the abbreviation of Clustering coefficient and measures the difference between the clustering coefficients of the generated graphs and the ones of the original graphs (lower is better). Orbit measures the difference in the frequency of structural motifs or subgraphs between the generated graphs and the original ones (lower is better). Time/s is the required time in seconds to run the benchmark.

Table 1: First results are bold, second ones are underlined.

3.1.2. Prediction tasks results

Table 2 showsthe performance of DCT on classification and regression tasks. GDA stands for Graph Data Augmentation.

3.2. Discussion

Considering Table 1 and Table 2, we can directly say that DCT, GraphArm and DiGress are all powerful frameworks that are generally beating the existing models in prediction tasks and graph generation tasks.

3.2.1. Graph generation results analysis

Let's focus first on the graph generation tasks and compare GraphArm and DiGress to the existing models. In all the three different metrics (Deg., Clus., Orbit) for the six different datasets, GraphArm or DiGress perform state of the art results 12 times out of the 18 measures. If we separate both models, DiGress achieves top 2 performance in 8 measures out of the 18, while GraphArm achieves top 2 performance in 9 measures out of the 18: both models leads to great performance improvements comparing It to existing models. Considering time now, GraphArm is one of the most time efficient models compared to the 12 others of Table 1. DiGress, on the other end, is at least in the first half of the most time efficient models in five out of the six benchmarks.

Let's now compare GraphArm and DiGress alone. DiGress outperforms GraphArm in 10 out of the 18 measures, revealling the fact that DiGress may have slightly better performances than GraphArm. If we consider the metrics separately, DiGress is better in 4 benchmarks out of the 6 for the degrees measure, while both outperforms 3 times the other in the two other measures. As a consequence, we can say that DiGress has slightly better performances than GraphArm thanks to a better learning of the degrees distribution. However, if we consider the speed of both models, GraphArm is always more time efficient than DiGress, having a time inference which can be up to 40 times quicker than DiGress without fall of performance (for Breast benchmark). This may be due to the fact that DiGress is using a transformer based architecture for the denoising model while GraphArm is using a GAT architecture. Knowing this latter information, we may also say without any numerical results from both papers that GraphArm is more computer efficient than DiGress.

With the small performance improvements of DiGress compared to GraphArm, we can conclude that It may be better to consider using GraphArm for inference as of time of writing this blog post. However, further research can be done on DiGress, especially by replacing the transformers architecture of the denoising process by an other lighter architecture.

3.2.2. Prediction tasks results analysis

In all measures of Table 2, DCT is objectively achieving state of the arts results in all measures without any exception. These results are all the best that DCT improves the former results significantly. For example, for obgb-FreeSolv, DCT outperforms by +45.8% the best self-supervised model. One data that we miss for our analysis is nevertheless the time and computer efficiency of DCT. But considering the outstanding performance compared to the current models, we can conclude that data-centric augmentation made by DCT is very efficient today and the framework used can be further studied for improvements. For instance, one can try to incorporate the graph generation framework of GraphArm into DCT in order to better represent the sparcity of graph data. Actually, some more recent work have already been done on trying to create a model which would have good performances in both graph generation and prediction tasks [17] and are worth the consideration in order to improve the three models we have presented in this blog post.

4. References

Jascha Sohl-Dickstein and Eric A. Weiss and Niru Maheswaranathan and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015)
Jonathan Ho and Ajay Jain and Pieter Abbeel. Denoising Diffusion Probabilistic Models (2020)
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems (2021)
Hans-Joachim Böhm, Alexander Flohr, and Martin Stahl. Scaffold hopping (2004)
Grover, A., Zweig, A., and Ermon, S. Graphite: Iterative generative modeling of graphs (2019)
Simonovsky, M. and Komodakis, N. Graphvae: Towards generation of small graphs using variational autoencoders (2018)
Maziarka, Ł., Pocha, A., Kaczmarczyk, J., Rataj, K., Danel, T., and Warchoł, M. Mol-cyclegan: a generative model for molecular optimization (2020)
Madhawa, K., Ishiguro, K., Nakago, K., and Abe, M. Graph-nvp: An invertible flow model for generating molecular graphs (2019)
Liu, M., Yan, K., Oztekin, B., and Ji, S. Graphebm: Molecular graph generation with energy-based models (2021)
Niu, C., Song, Y., Song, J., Zhao, S., Grover, A., and Ermon, S. Permutation invariant graph generation via score-based generative modeling (2020)
Hoogeboom, E., Gritsenko, A. A., Bastings, J., Poole, B., van den Berg, R., and Salimans, T. Autoregressive diffusion models (2021)
Jacob Austin, Daniel Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces (2021)
Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks (2021)
Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs (2021)
Payal Chandak, Kexin Huang and Marinka Zitnik, Building a knowledge graph to enable precision medicine (2023)
Yu, H., Sun, X., Solvang, W. D., and Zhao, X., Reverse logistics network design for effective management of medical waste in epidemic outbreaks: Insights from the coronavirus disease 2019 (covid-19) outbreak in wuhan (china)
Cai Zhou, Xiyuan Wang, Muhan Zhang, Latent Graph Diffusion: A Unified Framework for Generation and Prediction on Graphs (2024)

Seitenhierarchie

Graph Diffusion Models