Author: Bigom, Andreas. Tutor: Farshad, Azade.
In this blog post, the topic of Scene Graph Representation Learning will be introduced and discussed. First, an introduction to and motivation of the subject will be presented. The fundamental theory of Graph Neural Networks will then be introduced. Hereafter, the three chosen theoretical papers will be discussed and their individual solutions to feature extraction will be compared. Moreover, a subjective review of the papers and of Scene Graph Representation Learning will be given, and lastly the diverse applications of Scene Graph Neural Networks in medicine will be discussed.
Introduction
In this section, a brief introduction to the field of Graph Representation Learning, and more specifically Scene Graph Representation Learning, will be presented. Moreover, use cases of the method will be introduced alongside a motivation of the topic.
Graphs and Scene Graphs
A great number of modern-day problems can be formulated using graph structures, even though it might not be obvious.
In medical research, molecules can be thought of as graphs, with atoms as nodes and atomic bonds as edges. When designing a digital navigation system, maps can be constructed as graphs, with intersections as nodes and roads as edges. When designing the next social media platform, graphs can be built from social networks, with people as nodes and relationships as edges. The scene of an image, video or fictional setting can be described using a graph structure, with objects as nodes and predicates as edges. These examples, as illustrated in figure 1 below, are just a subset of the plentiful use cases of the graph data structure in modern-day problem solving.
Figure 1: Examples of the introduction of graph data structures in problem solving.
However, these utilizations of the graph data structure are not novel and were proposed long before the acceleration of present-day Machine Learning techniques. Nevertheless, scene graphs have recently seen an immense increase in attention as a partial solution to computer vision. Scene Graphs might not be the most prominent application of graphs, but like most graph structures they are usually not complicated to grasp.
Figure 2: Example of a Scene Graph construction from an image (blue and red ovals represent nodes and edges respectively) [4].
The utilization of Scene Graphs usually stems from a problem in which one wishes to model the scene of an image, video or another physical setting. In figure 2, an example of a Scene Graph constructed from an image is illustrated, in which the blue and red ovals represent the nodes and edges of the graph respectively. As evident from the figure, the objects of the input image are represented by nodes and the relations between the objects are denoted by edges. By constructing such a Scene Graph, the scene of the input can be converted to a manageable data structure in which neither the features nor the semantics are lost in the process. These graph structures are a central part of computer vision and allow Machine Learning algorithms to achieve an analytical understanding of the scene.
Solving Problems using Graphs
Once a given problem setting has been modeled using graphs, a main objective could be to predict problem-specific features of the graphs, or to perform classification tasks in order to build an understanding of the current data or to predict future data.
Figure 3: Main objective of Graph Neural Networks.
However, in order to extract relevant features from these graphs, a model which is capable of learning the complex relationships between nodes and edges has to be proposed. Conventional Neural Networks are not suited to this type of data structure, which primarily leads to sub-optimal learning.
In contrast, Graph Neural Networks are designed with the graph data structure in mind and utilize the relational connections to achieve superior results. Graph Neural Networks are often able to extract relevant features in order to perform feature prediction or graph classification tasks. These tasks include recent research within medicine development, navigational logistics and much more.
Methods & Theory
In this section, the fundamental methods and theories of Scene Graphs and Graph Neural Networks will be presented. These include the representation of graphs in Machine Learning, the setup and structure of the Graph Neural Network and, lastly, a short introduction to the backbone of Graph Neural Networks, namely update rules.
Graphs, Nodes & Edges
Figure 4: The representations of graphs, nodes and edges in machine learning.
As mentioned in the previous section, graphs are built up of nodes and edges. The nodes and edges which compose the graph structure usually carry important features relevant to the problem setting. If the graph is constructed from a molecule, these features could include information about the atoms and chemical bonds. In order to allow various Machine Learning techniques to be performed on the graphs, these features have to be represented in such a way that calculations can be applied to them. To solve this, the features of the nodes and edges are represented by vectors.
Furthermore, in some problem settings, the weights and directions of the edges in a graph might be crucial. In these cases, this information is inscribed in the adjacency matrix, which expresses the connections between the nodes. A minimal sketch of these representations is shown below.
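As a toy illustration (with made-up feature values), the node features of a small graph can be stacked into a matrix, and the edge structure can be stored in a weighted, directed adjacency matrix:

```python
import numpy as np

# A toy graph with 3 nodes, each carrying a 4-dimensional feature vector
# (e.g. atom properties if the graph models a molecule).
node_features = np.array([
    [1.0, 0.0, 0.2, 0.5],   # node 0
    [0.0, 1.0, 0.7, 0.1],   # node 1
    [0.0, 0.0, 0.4, 0.9],   # node 2
])

# Weighted, directed adjacency matrix: entry (i, j) holds the weight of
# the edge from node i to node j, and 0.0 means there is no edge.
adjacency = np.array([
    [0.0, 0.8, 0.0],
    [0.0, 0.0, 0.3],
    [0.5, 0.0, 0.0],
])
```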
The Graph Neural Network
Once the graphs relevant to the problem setting have been sampled and constructed, the next task is to implement the Graph Neural Network responsible for analyzing the data. The Graph Neural Network can in many ways be compared to a conventional Neural Network. A Graph Neural Network is built up of multiple layers, whose purpose is to receive some input data, process the data and return a relevant output, as illustrated in figure 5, much like a conventional Neural Network.
Figure 5: Overall setup of the Graph Neural Network model.
Through its layers, the Graph Neural Network will process the input graph in such a way that the features relevant to the problem setting become more prominent. This allows simple classification or regression tasks to be performed on the feature-highlighted output. Much like a conventional Neural Network, the Graph Neural Network can be built up of multiple types of layers, through which the graph is iteratively processed and updated, as illustrated in figure 6.
Figure 6: Simplified overview of the Graph Neural Network Model.
In Graph Neural Networks, however, these processing steps are called update rules rather than layers. Many different update rules have been proposed, each possessing different abilities to capture and extract features from the graph. Extensive explanations of these update rules and the theory behind them are unfortunately out of scope for this blog post. However, the most widely adopted and commonly used update rules can be seen in figure 7.
Figure 7: A subset of the most adopted and commonly used update rules for Graph Neural Networks [4].
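To make the idea concrete, the following is a minimal sketch of one commonly used update rule, a GCN-style graph convolution in which each node aggregates the (normalized) features of itself and its neighbours before a learned linear map is applied. The other rules in figure 7 differ mainly in how this aggregation is done:

```python
import numpy as np

def gcn_update(H, A, W):
    """One simplified GCN-style update rule: each node averages the feature
    vectors of itself and its neighbours (with symmetric normalization),
    then applies the learned weight matrix W and a ReLU non-linearity."""
    A_hat = A + np.eye(A.shape[0])                        # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt              # normalize neighbourhoods
    return np.maximum(0.0, A_norm @ H @ W)                # ReLU(A_norm · H · W)

# Applied to the toy graph from above (W chosen at random for illustration):
# H_next = gcn_update(node_features, adjacency, np.random.randn(4, 8))
```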
Papers
In this section, the three main papers chosen as the focal points of this blog post will be presented through an introduction to the methods, theories and problem settings each paper addresses. Furthermore, a comparison of their methodology and results will be conducted.
The Feature Extraction Problem Setting Space
As can be argued for most Machine Learning fields, the problem setting space of Scene Graph Representation Learning is relatively large. Many different problems can be solved using Scene Graph Representation Learning, each with a distinct problem formulation and thus a different learning setup.
In order to understand the problem settings addressed in the three papers, their respective learning setup solutions and how they generally relate to each other, an illustration of a subset of this problem setting space is shown in figure 8.
Figure 8: Feature extraction problem setting space for Graph Neural Networks.
From this illustration it is evident that the three papers chosen for this blog post address problems in different branches of the space. Both LaGraph [10] and ITSGG [2] work with predictive learning, meaning they aim to learn feature prediction for a given graph. In contrast, VarScene [9] works with generative learning, as its objective is to learn novel and realistic graph generation.
To achieve their objectives, the papers also work within both supervised and self-supervised learning and utilize different learning setups: contrastive, classification-based and distributional. The relevant learning setup for each paper will be explained in depth within its respective section.
Paper: LaGraph
Main Idea
The first paper which will be presented in this blog post is LaGraph [10]. From figure 8, it can be seen that LaGraph is a predictive model which utilizes self-supervised contrastive learning to perform node- and graph-level representation learning. The main idea behind LaGraph is to leverage the abundant amount of unlabeled data and learn feature extraction through a reconstructive contrastive setup.
Figure 9: Illustration of the Contrastive Learning setup.
In figure 9, a visualization of the contrastive learning setup is given. In contrastive learning, the objective is to minimize the distance between similar datapoints and maximize the distance between dissimilar datapoints. Contrastive learning achieves this by performing randomized data augmentation on the datapoints, after which the model is incentivized to minimize the distance between augmented views originating from the same datapoint and maximize the distance between views originating from different datapoints.
The advantage of this learning setup is that we are able to leverage the abundance of unlabeled data. However, this comes at the cost of not being in control of which features the model will learn from the graphs.
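As a concrete illustration, the following is a minimal sketch of one common contrastive objective, the NT-Xent loss, in which matching augmented views form the positives and all other pairs in the batch act as negatives. This is an example of the general setup in figure 9, not the exact loss used by LaGraph:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1[i] and z2[i] are embeddings of two random augmentations of the
    same graph; every other pairing in the batch serves as a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    similarities = z1 @ z2.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(z1.size(0))         # positives lie on the diagonal
    return F.cross_entropy(similarities, targets)
```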
Methodology
The way the LaGraph model specifically enables this contrastive learning is through an autoencoder setup, as visualized in figure 10.
Figure 10: Overview of the LaGraph model setup [10].
First, a randomized data augmentation is performed on the graph in the form of random node masking. Both the original and the masked graph are then fed to an encoder network to retrieve low-dimensional embeddings. Hereafter, the model is trained with two main objective functions. The first objective function ensures that the encoder can recover the masked information, by calculating a loss as the mean squared error between the embeddings of the original and the masked graph. The second objective function ensures that the encoder actually learns to extract relevant and important features and that the graph can be reconstructed from the embedding. It does this by passing the embedding of the original graph through a decoder and calculating a loss as the mean squared error between this output and the original graph.
As seen in figure 10, the LaGraph model adopts this learning setup for both node- and graph-level feature learning and prediction.
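A minimal sketch of these two objectives, assuming generic `encoder` and `decoder` modules (a simplification, not the paper's exact formulation), could look as follows:

```python
import torch
import torch.nn.functional as F

def lagraph_style_loss(encoder, decoder, X, A, mask):
    """Sketch of the two objectives described above. X holds the node
    features, A the adjacency matrix and `mask` the randomly masked nodes."""
    X_masked = X.clone()
    X_masked[mask] = 0.0                        # random node-feature masking

    Z = encoder(X, A)                           # embedding of original graph
    Z_masked = encoder(X_masked, A)             # embedding of masked graph

    # Objective 1: the embedding of the masked graph should match the
    # embedding of the original graph on the masked nodes.
    invariance_loss = F.mse_loss(Z_masked[mask], Z[mask])

    # Objective 2: the original graph should be reconstructable from its
    # embedding, forcing the encoder to retain the relevant features.
    reconstruction_loss = F.mse_loss(decoder(Z, A), X)

    return reconstruction_loss + invariance_loss
```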
Paper: VarScene
Main Idea
As seen in figure 8, VarScene [9] is a generative model which utilizes self-supervised distributional learning in order to learn novel and realistic graph synthesis. The main idea of VarScene is, like that of LaGraph [10], to utilize the abundance of unlabeled data, here to perform representation learning through a minimal distributional discrepancy objective.
Figure 11: Illustration of the minimal distributional discrepancy objective.
In figure 11, the minimal distributional discrepancy objective is illustrated. As evident from the figure, the objective is to minimize the discrepancy between the distribution of a sampled set of graphs and that of the synthetically generated graphs.
The advantage of this learning strategy is, much like for LaGraph, that the model is able to utilize the abundance of unlabeled data. However, the model is not able to generate feature-specific graphs, as the synthetically generated graphs are simply drawn from the distribution of the sampled graphs.
Methodology
The VarScene [9] model is constructed through an autoencoder setup, as illustrated in figure 12.
Figure 12: Overview of VarScene model setup [9].
The encoder first decomposes the input graph into stars, where a star is defined as an individual node together with all its outgoing edges. Thereafter, the model creates embeddings and random codes for each of the sampled stars. Lastly, the encoder aggregates these stars based on their distance to a randomly chosen star within the graph, called the pivot star. The decoder then samples the random codes for each star with regard to its distance to the pivot star.
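As a toy illustration of the star decomposition (using a hypothetical scene, not data from the paper):

```python
import networkx as nx

def decompose_into_stars(G):
    """Decompose a directed graph into stars: one star per node,
    consisting of the node itself and all of its outgoing edges."""
    return {node: list(G.out_edges(node)) for node in G.nodes}

scene = nx.DiGraph([("man", "horse"), ("man", "hat"), ("horse", "field")])
print(decompose_into_stars(scene))
# {'man': [('man', 'horse'), ('man', 'hat')],
#  'horse': [('horse', 'field')], 'hat': [], 'field': []}
```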
The objective function of the model is calculated as the cosine similarity between the distributions of stars, nodes and edges (and a variety of kernels thereof) of the sampled graphs and the synthetically generated graphs.
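A hedged sketch of such a distributional comparison, here between two toy frequency histograms over the same vocabulary of star types (the paper's kernels add further structure on top of this):

```python
import numpy as np

def distribution_similarity(counts_a, counts_b):
    """Cosine similarity between two empirical distributions, e.g. the
    star frequency histograms of sampled vs. generated graphs."""
    a = np.asarray(counts_a, dtype=float)
    b = np.asarray(counts_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(distribution_similarity([4, 1, 0, 2], [3, 2, 1, 2]))  # ≈ 0.93
```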
Figure 13: Example of an input graph (G) and synthetically generated graphs of VarScene [9].
In figure 13, an example of an input graph and the corresponding synthetically generated graphs is illustrated. From this illustration it can be seen how the four generated graphs are similar in distribution to the input graph while still being novel, which is the main objective of VarScene.
Paper: Iterative Scene Graph Generation
Main Idea
As seen in figure 8, Iterative Scene Graph Generation [2] is a predictive model which utilizes supervised classification-based learning in order to learn node- and edge-level graph classification. The main idea of Iterative Scene Graph Generation is to learn supervised graph classification through a dynamic conditioning learning setup.
Figure 14: Illustration of the dynamic conditioning learning setup.
In figure 14, an illustration of the dynamic conditioning setup is shown. In this setup, three individual decoders (a subject-, an object- and a predicate-decoder) are utilized. These decoders are iteratively called to decode the embeddings of the relationship triplets (one subject, one object and one predicate). In each decoding iteration, each decoder's output is conditioned on the outputs of all decoders in the previous iteration. An advantage of this learning setup is that the decoding, and thereby the classification, of each subject, object and predicate is conditioned on the surrounding information: when classifying an object, the current information about the corresponding subject and predicate is utilized. However, one major drawback of this learning structure is the limitation caused by the scarce amount of labeled data within the field. A simplified sketch of the iterative conditioning loop is given below.
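The following is a minimal sketch of this iterative conditioning, assuming simplified linear decoders (a hypothetical module layout, not the authors' exact architecture):

```python
import torch
import torch.nn as nn

class IterativeTripletDecoder(nn.Module):
    """Sketch of the dynamic conditioning idea with simplified linear
    decoders (hypothetical layout, not the authors' exact architecture)."""
    def __init__(self, dim, num_iterations=6):
        super().__init__()
        self.subject_dec = nn.Linear(3 * dim, dim)    # subject decoder
        self.object_dec = nn.Linear(3 * dim, dim)     # object decoder
        self.predicate_dec = nn.Linear(3 * dim, dim)  # predicate decoder
        self.num_iterations = num_iterations

    def forward(self, s, o, p):
        # Each iteration conditions every decoder on the previous
        # iteration's outputs of all three decoders.
        for _ in range(self.num_iterations):
            context = torch.cat([s, o, p], dim=-1)
            s = torch.relu(self.subject_dec(context))
            o = torch.relu(self.object_dec(context))
            p = torch.relu(self.predicate_dec(context))
        return s, o, p
```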
Methodology
The Iterative Scene Graph Generation model is structured as illustrated in figure 15.
Figure 15: Overview of Iterative Scene Graph Generation model setup [2].
As evident from the figure, this model is structured as an autoencoder with an encoder-decoder setup. The initial information retrieval of this model is done by a trainable Convolutional Neural Network backbone, which extracts the initial features (objects, subjects and predicates) from the input image. Hereafter, the initial features are passed through an encoder, which encodes them into embeddings. These embeddings are then passed through the main pillars of the model, namely the decoders. These decoders utilize dynamic conditioning, much like a transformer network, in order to infer the true features of the input image.
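To connect the pieces, here is a hedged sketch of the overall flow (CNN backbone → encoder → iterative decoders), reusing the `IterativeTripletDecoder` from the sketch above; the layer sizes and the pooling into triplet queries are hypothetical simplifications:

```python
import torch
import torch.nn as nn
import torchvision

class SceneGraphPipeline(nn.Module):
    """Hedged sketch of the flow in figure 15: CNN backbone -> encoder ->
    iterative decoders. Sizes and the pooling into triplet queries are
    hypothetical simplifications, not the paper's architecture."""
    def __init__(self, dim=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Conv2d(2048, dim, kernel_size=1)
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8)
        self.decoder = IterativeTripletDecoder(dim)  # from the sketch above

    def forward(self, image):
        feats = self.proj(self.backbone(image))      # initial CNN features
        tokens = feats.flatten(2).permute(2, 0, 1)   # (HW, batch, dim)
        embeddings = self.encoder(tokens)            # encoded embeddings
        query = embeddings.mean(dim=0)               # toy triplet query
        return self.decoder(query, query, query)     # iterative decoding
```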
Figure 16: Example of input image and classified Scene Graphs for iteration one, three and six of the decoders [2].
Figure 16 shows an example of the output of the first, third and sixth iteration of the decoder triplets. As evident from the figure, the model accumulates more accurate information about the Scene Graph with each iteration, as information flows through the decoder network. This information accumulation is the main idea behind the iterative decoding process of the model.
Paper Comparison
In figure 17, a table comparing the three different methods is shown.
Figure 17: Table over metrics of comparison for each method/paper.
From this table, it is evident that the three papers differ in many areas. The methods span both the predictive and the generative model field. They use both self-supervised and supervised learning. They extract node-level, edge-level and graph-level features. Lastly, the papers use three very different learning methods, namely contrastive learning, distributional learning and dynamic conditioning, in order to extract these features.
These differences in model and learning setup are caused by the differences in problem settings, and they show how critical it is to understand the given problem setting in order to choose a fitting learning setup. While LaGraph aims to solve a generalized feature extraction task with unlabeled data and finds contrastive learning a fitting solution, VarScene uses the same kind of unlabeled data to create synthetic graphs through distributional learning.
Evaluation & Results
In figure 18, a table displaying the evaluation and results of the three main papers is shown.
Figure 18: Table over evaluation and results of each model.
Unfortunately, due to the diversity in problem settings and methods used in the three papers, they are not directly comparable. The papers train on different datasets, evaluate using different tasks and metrics, and compare their work to works in different fields. However, looking at the results the papers achieve individually, it is evident that each method is state-of-the-art within the problem setting it aims to solve.
Subjective Review
LaGraph
The LaGraph model is very interesting as it utilizes a very simple yet effective learning strategy. Moreover, the strength of having close to unlimited data, by using a self-supervised learning setup, enables stable training. However, due to the inability to control which features are extracted during the learning process, this method might be most suitable for pre-training. In a case where problem-specific data is limited, LaGraph could be used to pre-train the initial feature extraction, whereafter the model could be fine-tuned on the scarce amount of data.
VarScene
The VarScene model is a novel method with promising results. The authors are able to show the effectiveness and efficiency of the model compared to similar current state-of-the-art methods. However, the model is limited by the distribution of the sampled graphs, as it is only able to create realistic and novel graphs within that exact distribution. Moreover, the model also lacks the ability to create feature-specific graphs, which would drastically increase the number of use cases for which VarScene could be groundbreaking.
Iterative Scene Graph Generation
The last model, Iterative Scene Graph Generation, shows great results and demonstrates the usability of dynamic conditioning. Furthermore, with its transformer-based structure, the paper illustrates the huge potential of transformers and their vast range of applications across Machine Learning fields.
Future Research
Unfortunately, the papers weren't directly comparable. However, this diversity only illustrates the number of use cases for Scene Graph Neural Networks in problem solving. Even so, an interesting idea for future research could be to combine the models. A fascinating combination would be LaGraph and Iterative Scene Graph Generation: while Iterative Scene Graph Generation is limited by the scarce amount of labeled data, LaGraph is able to utilize the abundance of unlabeled data. This opens the possibility of using LaGraph as a pre-training setup for Iterative Scene Graph Generation. First, LaGraph could pre-train an autoencoder setup on unlabeled data to establish generalized feature extraction abilities. Hereafter, Iterative Scene Graph Generation could be used to fine-tune the model on labeled data for the problem-setting-specific tasks.
Relevant Papers
In this blog post, the focus is mainly on the three presented papers and the specific problem settings they work within. However, there are countless works of relevance for this topic which could have been beneficial in creating an overview of Scene Graph Representation Learning methods. One example is the first proposal of the Graph Neural Network model, namely "The graph neural network model" [6], in which the model is introduced along with theoretical analysis to support it. Another noteworthy work is BGNN [3], which proposes a confidence-aware bipartite Graph Neural Network to address the long-tailed class distribution of objects, subjects and predicates in Scene Graphs. A third work, closely related to VarScene [9] but focused on real-world application of the synthetically generated graphs, is SceneGen [8]. SceneGen is, like VarScene, a generative model, which aims to generate novel and realistic traffic scenes for training self-driving vehicles. A fourth notable paper in the literature is BGRL [7]. BGRL is, much like LaGraph, a self-supervised learning setup, which aims to perform graph representation learning through an augmentation-bootstrapping setup. This setup allows the model to learn feature extraction by predicting alternative augmentations of the input graphs. These are just a subset of the relevant and notable works within the literature.
The Future of Scene Graph Representation Learning
The field of Graph Neural Networks and Scene Graph Representation Learning is currently on the rise.
Figure 19: Google searches for Graph Neural Networks (from first proposal in 2008 until present day).
Looking at the Google searches for "Graph Neural Network" over the last 14 years, from the first proposal of Graph Neural Networks until the present day, it is evident that there is a growing focus on the topic. Furthermore, by investigating the amount of research done within the field on Google Scholar, it can be seen that Graph Neural Networks are arguably under-researched compared to many other Machine Learning areas.
| Search | Results on Google Scholar |
|---|---|
| Neural Network | 2,280,000 |
| Convolutional Neural Network | 557,000 |
| Natural Language Processing | 1,280,000 |
| Graph Neural Network | 39,200 |
Table 1: Searches and their corresponding number of results on Google Scholar.
From table 1 it is evident that Graph Neural Networks appear in drastically less literature compared to other significant areas of Machine Learning. This relative lack of research could be a sign that Graph Neural Networks may see groundbreaking achievements within the next couple of years (as seen with NNs, CNNs, NLP, RL and more), as research within the field potentially undergoes an expansion.
Applications in Medicine
In this section, the applications of Scene Graph Representation Learning in the medical field will be discussed. Following this, a specific model currently in development for medical use, comparable to the previously presented paper Iterative Scene Graph Generation, will be presented and discussed.
The applications of Scene Graphs and Graph Neural Networks in medicine are plentiful, with huge potential for automating repetitive tasks, enhancing diagnosis possibilities and enabling easier access to medical help in areas where access is limited. These are just some of the potentials of Machine Learning, and especially Graph Neural Networks, in medicine. The applications of Scene Graph Representation Learning range from autonomous operations and Medical Image Diagnosis to Surgical Report Generation and many more.
One paper which focuses on the last use case, namely Surgical Report Generation, is SGT [5].
Figure 20: Overview of the structural setup of the SGT model [5].
The SGT model aims to automate the surgical report generation process, which surgeons have to undergo every time they perform surgery. To eliminate this repetitive and time-consuming task, the authors of the paper propose a model as illustrated in figure 20. This model is very comparable to the Iterative Scene Graph Generation model presented earlier in the blog post. Both models use a CNN backbone to extract initial features, use transformer-based models to understand interactive relations and utilize attention-based conditioning. The final output of the SGT model is a fully written surgical report, which has the potential of saving the precious time of skilled surgeons.
Conclusion
In conclusion, this blog post illustrates the importance of being able to model and understand the graph data structure. Scene Graphs especially are of great importance, as they are a key component in enabling computer vision, which is a central pillar in many Machine Learning problem settings.
Furthermore, the blog post illustrates the diversity in approaches to solving the feature extraction problem. This is shown through a presentation of three papers, each of which utilizes a completely different setup in order to learn feature extraction for a specific problem setting: self-supervised contrastive learning, self-supervised distributional learning and supervised classification-based learning, for both predictive and generative models. This diversity only underlines the importance of understanding one's specific problem setting in order to build the structure of a model to solve the task.
Moreover, the field of Graph Neural Networks and Scene Graph Representation Learning has seen a huge spike in focus and research within the last couple of years and could potentially see groundbreaking achievements within the next ones.
Lastly, the applications of this technology are endless, especially in medicine, where many models have been proposed to solve critical tasks, e.g. the SGT model for automatic surgical report generation.
References
1. Awan, A. (2021). A Comprehensive Introduction to Graph Neural Networks (GNNs). DataCamp. https://www.datacamp.com/tutorial/comprehensive-introduction-graph-neural-networks-gnns-tutorial
2. Khandelwal, S., & Sigal, L. (2022). Iterative Scene Graph Generation. arXiv preprint arXiv:2207.13440.
3. Li, R., Zhang, S., Wan, B., & He, X. (2021). Bipartite Graph Network with Adaptive Message Passing for Unbiased Scene Graph Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11109-11119).
4. Li, T. (2022). Scene Graph Generation, Compression, and Classification on Action Genome Dataset. Medium. https://medium.com/stanford-cs224w/scene-graph-generation-compression-and-classification-on-action-genome-dataset-9f692a1d5394
5. Lin, C., Zheng, S., Liu, Z., Li, Y., Zhu, Z., & Zhao, Y. (2022). SGT: Scene Graph-Guided Transformer for Surgical Report Generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 507-518). Springer, Cham.
6. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., & Monfardini, G. (2008). The Graph Neural Network Model. IEEE Transactions on Neural Networks, 20(1), 61-80.
7. Thakoor, S., Tallec, C., Azar, M. G., Azabou, M., Dyer, E. L., Munos, R., ... & Valko, M. (2021). Large-Scale Representation Learning on Graphs via Bootstrapping. arXiv preprint arXiv:2102.06514.
8. Tan, S., Wong, K., Wang, S., Manivasagam, S., Ren, M., & Urtasun, R. (2021). SceneGen: Learning to Generate Realistic Traffic Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 892-901).
9. Verma, T., De, A., Agrawal, Y., Vinay, V., & Chakrabarti, S. (2022). VarScene: A Deep Generative Model for Realistic Scene Graph Synthesis. In International Conference on Machine Learning (pp. 22168-22183). PMLR.
10. Xie, Y., Xu, Z., & Ji, S. (2022). Self-Supervised Representation Learning via Latent Graph Prediction. arXiv preprint arXiv:2202.08333.