1. Introduction

1.1. What is representation learning?

Representation learning is a branch of machine learning. Its main idea is to automatically learn good features to represent the data. A representation is a summary of the data that omits unnecessary details and preserves the important content by transforming raw data into a simpler form. In contrast, traditional machine learning methods, such as regression, decision trees, and support vector machines, require domain knowledge for manual feature engineering.
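As a toy illustration of this idea (not taken from any of the papers below), an autoencoder learns a compact representation of raw input without hand-crafted features; the layer sizes here are arbitrary assumptions:

```python
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Toy example of representation learning: the encoder compresses raw input
    into a low-dimensional code (the learned representation); no manual features."""
    def __init__(self, input_dim: int = 784, code_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))

    def forward(self, x):
        code = self.encoder(x)            # learned representation of the input
        return self.decoder(code), code   # reconstruction is only used for training
```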


1.2. What is modeling interaction?

Modeling interaction means creating a model that can understand the interactions between various elements. This concept can be applied to many domains, such as:

  • Images: Understanding the interactions between pixels within an image.
  • Text: Analyzing the interactions between words within a sentence.
  • Proteins: Investigating the interactions between individual atoms within a protein structure.

1.3. Why is representation learning important for modeling interactions?

Representation learning is crucial for modeling interactions because it enables the transformation of raw, complex data into a more manageable and meaningful format. This process helps in uncovering the underlying patterns and features that are essential for understanding interactions.

2. Related Papers

Below are three papers that model different kinds of interactions. Two use a data-driven approach, and the other uses a model-based approach.

2.1. PIGNet: a physics-informed deep learning model toward generalized drug-target interaction predictions 

2.1.1. Introduction

The objective of drug-target interaction analysis is to identify drugs (ligands) that effectively bind to specific proteins (targets) in the body. Rather than relying on conventional, time-consuming, and costly experimental methods, a data-driven approach uses existing data on known drug-target interactions to predict potential new interactions, which can then be explored in drug development. The paper “PIGNet: a physics-informed deep learning model toward generalized drug-target interaction predictions” addresses this problem with a physics-informed graph neural network and data augmentation, and was published in Chemical Science in 2022.

2.1.2. Challenges

The primary challenge is the scarcity of 3D structural data. Previous techniques, such as 3D convolutional neural networks (3D CNNs) and graph neural networks (GNNs), tend to overfit the training data, which leads to poor test performance and low generalization ability.

2.1.3. Method

The goal is to improve the model's generalization ability, since previous methods learned data-intrinsic biases instead of the underlying physics of protein-ligand interactions. The paper proposes two methods to address the problem.

  • Physics-informed graph neural network (PIGNet)

This model takes a protein-ligand complex as input and predicts its binding affinity (energy) as a sum of atom-atom pairwise interactions. The input has two parts: a molecular graph G(H, A), where H represents the node features and A the adjacency matrix, and the distances D_ij between atom pairs. A gated graph attention network (gated GAT) updates the adjacency matrix of covalent bonds, and an interaction network updates the adjacency matrix of intermolecular interactions. The final node features are then fed into physics-informed parameterized equations whose variables are trained. The output is the total protein-ligand binding affinity, summed over all atom-atom pairwise affinities (a minimal sketch of this pairwise-sum idea appears after this list).

  • Data augmentation

To deal with data deficiency, they propose two data augmentation methods to improve the model's generalization ability: docking augmentation teaches the model to differentiate between different binding poses, and screening augmentation teaches it to differentiate non-binding complexes.
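To make the pairwise-sum prediction of PIGNet concrete, here is a minimal, illustrative sketch. It is not the paper's implementation: the node features are assumed to already come out of the graph layers (gated GAT plus interaction network), and the distance-dependent term is a simple stand-in for the physics-informed parameterized equations.

```python
import torch
import torch.nn as nn

class PairwiseAffinityHead(nn.Module):
    """Sketch: predict total binding affinity as a sum of atom-atom pairwise terms."""
    def __init__(self, node_dim: int, hidden_dim: int = 64):
        super().__init__()
        # maps a concatenated (ligand atom, protein atom) feature pair to a scalar
        self.pair_mlp = nn.Sequential(
            nn.Linear(2 * node_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h_ligand, h_protein, dist):
        # h_ligand: (L, F) ligand node features, h_protein: (P, F) protein node features
        # dist: (L, P) atom-atom distances D_ij
        L, P = dist.shape
        pairs = torch.cat(
            [h_ligand.unsqueeze(1).expand(L, P, -1),
             h_protein.unsqueeze(0).expand(L, P, -1)], dim=-1)
        # learned pairwise term, damped by a simple distance-dependent factor
        # (stand-in for the physics-informed parameterized equations in the paper)
        pair_energy = self.pair_mlp(pairs).squeeze(-1) * torch.exp(-dist)
        return pair_energy.sum()  # total predicted protein-ligand binding affinity
```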

2.1.4. Results

They use two datasets, CASF-2016 and CSAR NRC-HiQ, and four evaluation metrics: docking power (finding the native binding pose), screening power (identifying the specific binding ligand), scoring power (correlation between predicted and experimental results), and ranking power (ordering drugs by their binding ability to targets). The results show that PIGNet performs better in docking, screening, and ranking. The AK-Score model achieves a better scoring result, but it performs poorly in docking, meaning it cannot differentiate between binding poses and therefore lacks good generalization ability.

2.2. What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions

2.2.1. Introduction

Human-object interaction (HOI) detection is essential for understanding a scene and recognizing the action between a human and an object in an image. The objective of HOI is to localize humans and objects and predict their interactions from images. The result of HOI is a triple of the form <Human, Object, Action>. The paper “What to Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-object Interactions” addresses this problem by adding two new modules to the transformer architecture and was accepted at CVPR 2022.

2.2.2. Challenges

The previous methods are mainly divided into two approaches.

  • The two-stage approach first localizes the instances and then predicts their interactions. The main challenge is that off-the-shelf object detectors do not identify the interactions between these instances, and it is time-consuming to go over all person-object pairs.
  • The one-stage approach uses a transformer architecture to detect all the components of an HOI triple directly. However, not all object-action pairs are meaningful, and it is challenging for a simple query to decode all the rich elements of an HOI triple.


2.2.3. Method

This paper proposes a one-stage Semantic and Spatial Refined Transformer (SSRT) to solve the problem. They add two modules, a Support Feature Generator and a Query Refiner, to select the most relevant object-action (OA) pairs within an image and to refine the queries' representations with semantic and spatial features.


In the Support Feature Generator, the OA candidate sampler first takes the encoder's output as input and predicts and selects the top-K (object, action) candidates without localization. These OA candidates are then used to generate semantic and spatial features. In the semantic feature generator, a CLIP text encoder computes the semantic representation of each pre-selected OA candidate, and these features are projected into the image feature space. In the spatial feature generator, they first create a spatial map (a 2×B×B binary map for the human and object bounding boxes) for each OA label, pass it through two convolution layers, and project it into the image feature space. Finally, the semantic and spatial features are aggregated with elementwise multiplication, as sketched below.
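The following is a minimal sketch of this fusion for a single OA candidate. The map size, channel counts, and projection dimensions are my own illustrative assumptions; the paper's point is simply that both features are projected into the image feature space and combined elementwise.

```python
import torch
import torch.nn as nn

class SupportFeatureFusion(nn.Module):
    """Sketch: fuse spatial and semantic features of one (object, action) candidate."""
    def __init__(self, map_size: int = 64, feat_dim: int = 256, clip_dim: int = 512):
        super().__init__()
        # two conv layers over the 2-channel (human box, object box) binary spatial map
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )
        self.spatial_proj = nn.Linear(64 * (map_size // 4) ** 2, feat_dim)
        # project the CLIP text embedding of the OA label into the same feature space
        self.semantic_proj = nn.Linear(clip_dim, feat_dim)

    def forward(self, spatial_map, clip_text_embedding):
        # spatial_map: (2, B, B) binary masks; clip_text_embedding: (clip_dim,)
        s = self.spatial_conv(spatial_map.unsqueeze(0)).flatten(1)
        spatial_feat = self.spatial_proj(s).squeeze(0)
        semantic_feat = self.semantic_proj(clip_text_embedding)
        return spatial_feat * semantic_feat  # elementwise aggregation
```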

In the Query Refiner (QR), the aggregated features are sent to the QR. Queries are first randomly initialized and attend to themselves via self-attention. These queries then attend to the support features, which serve as keys and values, through cross-attention.
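A minimal sketch of this refinement step, under my own assumptions about the number of queries, feature dimension, and attention heads:

```python
import torch
import torch.nn as nn

class QueryRefiner(nn.Module):
    """Sketch: learned queries self-attend, then cross-attend to the support features."""
    def __init__(self, num_queries: int = 100, dim: int = 256, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # randomly initialized queries
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, support_features):
        # support_features: (batch, num_candidates, dim) aggregated semantic + spatial features
        q = self.queries.unsqueeze(0).expand(support_features.size(0), -1, -1)
        q = q + self.self_attn(q, q, q)[0]  # queries attend to themselves
        q = q + self.cross_attn(q, support_features, support_features)[0]  # keys/values = support features
        return q  # refined queries passed on to the transformer decoder
```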

2.2.4. Results

The authors compare their method to both one-stage and two-stage approaches using mean average precision (mAP) on the V-COCO and HICO-DET datasets and achieve better results.

2.3. Physical Interaction: Reconstructing Hand-object Interactions with Physics

2.3.1. Introduction

Hand-object interaction is common in daily life and can be applied in gaming, virtual reality, and robotics. The goal is to reconstruct hand-object interaction from a single view in real time. The paper "Physical Interaction: Reconstructing Hand-object Interactions with Physics" proposes a physics-based method to better resolve the ambiguities in the reconstruction and was accepted at SIGGRAPH 2022.

2.3.2. Challenges

The challenge lies in reconstructing hand-object interactions with physically plausible results. With only a single view, severe missing observations caused by occlusions may lead to physically implausible results. Among previous methods, optimization-based approaches need object templates for motion estimation, learning-based approaches generalize poorly to novel objects, and physics-based interaction modeling applies only to simple object shapes because it relies on strong object-shape assumptions.

2.3.3. Method

They propose a physics-based optimization method with two components: contact status optimization (predicting contact forces) and contact movement modeling.

The input is a sequence of depth images, and kinematic motion tracking is used to obtain the object shape and motion and the kinematic hand pose. On top of this, they use physical rules such as forces and moments to refine the fingertip positions and force predictions; the idea is that an object's motion is driven by the forces exerted at the contact points. They then use confidence-based slide prevention to refine the hand pose and final fingertip positions. The confidence of each fingertip is computed from its depth observations (the more depth pixels covering the tip, the higher the confidence). If the confidence is high, the result follows the kinematic motion tracking; if it is low, the result follows the physics-based prediction, as sketched below.
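A minimal sketch of this confidence-based blending idea; the function name, the linear blending, and the pixel-count threshold are my own assumptions rather than the paper's exact formulation:

```python
import numpy as np

def refine_tip_position(kinematic_tip, physics_tip, depth_pixel_count, threshold=50):
    """Blend kinematic and physics-based fingertip positions by observation confidence.

    The more depth pixels observed around a fingertip, the more we trust the
    kinematic tracking result; with few observations we fall back to the
    physics-based prediction.
    """
    confidence = min(depth_pixel_count / threshold, 1.0)
    return confidence * np.asarray(kinematic_tip) + (1.0 - confidence) * np.asarray(physics_tip)
```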

2.3.4. Result

The average pixel error of cso+fri+conf (the paper's method) is lower than that of bl (pure kinematic motion tracking).

Qualitative results show that their method has better physically plausible results.

3. Review

3.1. Comparison

In my opinion, a data-driven approach is appropriate for drug-target and human-object interaction since these domains involve complex and diverse data, making it challenging to manually extract relevant features. In the case of drug-target interaction, incorporating prior knowledge, such as the binding affinity of atom-atom pairs, can enhance the generalization capabilities of the model, allowing it to make more accurate predictions. For human-object interaction, reducing the number of object-action candidates and incorporating semantic and spatial features are essential strategies for improving the model's performance.

In contrast, hand-object interaction uses a model-based approach since it typically occurs in simpler, controlled environments, which leads to clearer and more predictable features. The task is to reconstruct the hand and the object, and since these are the only moving things in the image, the features are clearer. Incorporating physics knowledge into the model allows for better optimization.

3.2. Possible applications in the medical domain

This post gives a brief overview of modeling interactions with data-driven and model-based approaches. The presented methods could be applied in many medical domains.

  • Drug-target interaction could revolutionize drug discovery. By accurately predicting drug-target interactions, the model can help identify potential new drugs and understand their interactions with various targets in the body. This can accelerate the drug development process, reduce costs, and enable more personalized treatment options.
  • Human-object interaction can be applied to enhance patient monitoring systems, especially in elder care or for patients whose conditions require constant observation. By accurately detecting and understanding human-object interactions, such systems can alert caregivers to unusual activities.
  • Hand-object interaction could be applied in rehabilitation technologies and surgical training simulations. For instance, accurate modeling of hand-object interactions can improve the design of prosthetic limbs, making them more intuitive and easier to control. It can also be used in virtual reality simulations for training surgeons, providing a more realistic experience.

4. References

  • [1] Seokhyun Moon, Wonho Zhung, Soojung Yang, Jaechang Lim, and Woo Youn Kim. 2022. PIGNet: a physics-informed deep learning model toward generalized drug–target interaction predictions. Chemical Science.
  • [2] A S M Iftekhar, Hao Chen, Kaustav Kundu, Xinyu Li, Joseph Tighe, and Davide Modolo. 2022. What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions. CVPR.
  • [3] Haoyu Hu, Xinyu Yi, Hao Zhang, Jun-Hai Yong, and Feng Xu. 2022. Physical Interaction: Reconstructing Hand-object Interactions with Physics. SIGGRAPH.










