Author: Borbala Fazakas
Supervisors: Azade Farshad, Yeganeh, Y. M.
Machine learning-based solutions have proven highly effective, with successful applications across engineering, healthcare, finance, and other fields. However, despite the need to integrate information from multiple modalities to model real-world scenarios, multimodal models have achieved only limited success. Since the relationships between different modalities are often governed by physical laws, this work explores how incorporating physics-based information into machine learning models can enhance multimodal data modeling. We will review state-of-the-art approaches that leverage physics information and discuss the key challenges that remain in this area.
Motivation
Multimodality
Machine learning research has always aimed to build machine agents that can perceive, interact with, and reason about the world, similarly to how humans do. This requires agents that can process different types of sensory data, e.g., visual, audio, or even tactile information, as well as human language.
Formally, by modality, we refer to how a natural phenomenon is perceived or represented [Liang et al., 2024]. Data of different modalities representing the same scene may contain redundant information (the same information can be deduced from several modalities individually) as well as complementary information (different information being present in different modalities). This setup motivates exploiting information from multiple modalities in order to resolve ambiguities and extend the available knowledge for enhanced reasoning capabilities in machine learning models.
Physics-informed machine learning
Physics-informed machine learning is a pioneering field of machine learning aimed at overcoming the cost inefficiency, robustness, and interpretability issues of traditional machine learning methods by integrating mathematical physics into the models [Karniadakis et al., 2021]. Thus, the vast amount of prior knowledge about the laws governing the physical world does not need to be “reinvented” by the machine learning models based on a limited number of data samples, and the model’s search space can be significantly reduced.
The advantages of physics-informed machine learning are multi-fold, as highlighted by [Karniadakis et al., 2021] or [Banerjee et al., 2024]:
- Better generalization in a small data regime: introducing physics laws often acts as regularization, effectively constraining the models to a lower-dimensional search space, thus enabling effective training with a smaller amount of data, with a lower risk of overfitting.
- Generating physically plausible predictions: physics-informed models produce a more physically realistic output by considering physics laws that a purely data-driven model could not have deduced based on data.
- Better handling of noisy data: due to the strong prior introduced through the physical laws, these models are more robust to noisy data.
- Interpretability: instead of treating the model as a fully “black box,” the physics-informed approach ensures that the model’s behavior can be traced back to known laws and principles, and some guarantees can be given about the consistency of the model outputs with the physical world. As Doumèche et al. highlighted, physics-informed models combine the power of machine learning with the interpretability of traditional physical modeling [Doumèche et al., 2023].
- Accelerated training: some works also found evidence that physics information accelerates model training, in that the models converge faster to an optimal solution [Banerjee et al., 2024, Kashinath et al., 2021].
Physics Information for Multimodal Models
Knowing that the data of different modalities of the physical world are correlated, and their connection is governed by the laws of physics, the question naturally arises: can physics information facilitate the modeling of multimodal data?
While physics information has already proven helpful for several unimodal use cases, such a prior is expected to be especially helpful for understanding the correlations between different modalities. Events in the real world, e.g., a car accident, manifest in multiple forms: through sound - a boom, through heat being released, through the deformation of the cars, which is visible to the human eye, etc. All of these manifestations depend on the car dynamics - speed, collision angle, etc. - the specifics of the colliding materials, and many more factors. Thus, understanding the underlying physics laws is expected to be informative for modeling the distribution of multimodal data.
Problem Statement
First, let’s define a multimodal dataset: we assume that we have an input dataset D with M modalities of different dimensionalities. Each data point x^i is then a tuple (x_1^i, x_2^i, ..., x_M^i), with x_m^i \in \mathcal{R}^{d_m}.
A multimodal model needs to be able to handle such a heterogeneous input dataset. While the set of potential applications is very wide, ranging from classification tasks through dimensionality reduction tasks to generation tasks, in general, we expect a multimodal machine learning model to model the distribution of multimodal data, i.e., define:
p(x) = p(x_1, x_2, ..., x_M)
for any point x \in \mathcal{R}^{d_1} \times \mathcal{R}^{d_2} \times ... \times \mathcal{R}^{d_M}.
For physics-informed multimodal models in particular, we expect the same modeling task to be completed, but with additional prior information about the underlying physics laws.
Challenges
The primary challenge of multimodal machine learning is 1. representation learning [Baltrušaitis et al., 2018, Morency et al., 2022]: learning good representations is generally critical to the success of a machine learning-based solution [Bengio et al., 2013], but it is a non-trivial task in the case of heterogeneous data. This leads us to another critical and strongly related challenge, 2. alignment: based on the feature representations, a model needs to identify the connections between elements of different modalities - e.g., similarity measures may need to be defined, and semantic relationships need to be recognized, all while still keeping the complementary information of the various modalities.
Having informative representations which support alignment would then enable solutions to the following secondary challenges [Baltrušaitis et al., 2018, Morency et al., 2022]:
- multimodal reasoning (or fusion): fusing data from multiple modalities enables better-informed predictions. Ambiguities in individual modalities can be resolved, and complementary information can be exploited.
- generation: translating one modality into another to generate new content, or summarizing multimodal data for conciseness.
- transference (or co-learning): transferring knowledge between representations, i.e., exploiting training data in one modality to gain knowledge and later making inferences on another (potentially low-resource) modality.
On top of these challenges, the natural question is: how can we combine physics information, which generally comes in the form of exact mathematical equations, with a data-driven model? Data-driven models are statistics-based and use fundamentally different representations of information than exact mathematical physics models; the two approaches essentially "speak different languages".
Background and Related Work
Multimodal Networks
Representing Multimodal Data
There are three main approaches to representing multimodal data, as highlighted by [Liang et al., 2024] (see figure 1); a minimal code sketch contrasting the first two approaches follows the list below.
- Multimodal data can be fused together into one single representation or at least fewer representations than the number of modalities. Fusion can happen early, directly on raw data, or later, after some uni-modal encoders have been applied to each modality.
- A second approach is building coordinated representations, with exactly one representation per modality, with the general goal of bringing semantically relevant data closer together in a coordinated space.
- Last but not least, another approach is representation fission, aimed at creating decoupled representations, a higher number of them than the number of input modalities, while separately representing shared, redundant information and modality-specific information.
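To make the distinction between the first two approaches concrete, here is a minimal PyTorch sketch contrasting late fusion with coordinated representations; all module names, dimensions, and the choice of simple linear encoders are illustrative assumptions, not taken from any cited work.

```python
import torch
import torch.nn as nn

class FusionModel(nn.Module):
    """Late fusion: encode each modality separately, then merge into one joint representation."""
    def __init__(self, d_img=512, d_txt=256, d_fused=128):
        super().__init__()
        self.img_enc = nn.Linear(d_img, 128)
        self.txt_enc = nn.Linear(d_txt, 128)
        self.fuse = nn.Linear(256, d_fused)  # a single fused representation

    def forward(self, img, txt):
        z = torch.cat([self.img_enc(img), self.txt_enc(txt)], dim=-1)
        return self.fuse(z)

class CoordinatedModel(nn.Module):
    """Coordination: keep one representation per modality, mapped into a shared space."""
    def __init__(self, d_img=512, d_txt=256, d_shared=128):
        super().__init__()
        self.img_enc = nn.Linear(d_img, d_shared)
        self.txt_enc = nn.Linear(d_txt, d_shared)

    def forward(self, img, txt):
        # Representations stay separate; a contrastive or similarity-based loss later
        # pulls semantically matching pairs closer together in the shared space.
        return self.img_enc(img), self.txt_enc(txt)
```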
Notable Multimodal Models
One popular model and training technique nowadays is OpenAI’s CLIP [Radford et al., 2021] (see figure 2). The key technique here is contrastive learning: given a set of images and the corresponding textual labels, a vision and a text encoder are jointly trained to achieve coordinated representations, with the embeddings of matching image-text pairs lying closer to each other than those of non-matching pairs. CLIP not only achieves good zero-shot performance on the ImageNet benchmark but also performs much better than competitor models on the extended ImageNet test sets (e.g., adversarial examples, drawings, sketches, etc.), proving that knowledge gained from textual data can be effectively transferred to a vision task and leads to a more robust model that generalizes well.
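The coordination objective of CLIP can be illustrated with a short sketch of the symmetric contrastive loss over a batch of matching image-text pairs. This is a simplified, assumption-laden version (random embeddings, a fixed temperature, no learned temperature parameter or projection heads), not OpenAI's implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss: the i-th image and i-th text form the positive pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) cosine-similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # match each image to its text
    loss_t2i = F.cross_entropy(logits.t(), targets)       # match each text to its image
    return 0.5 * (loss_i2t + loss_t2i)

loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```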
Another notable example is Google's Flamingo model [Alayrac et al., 2022] (see figure 3), which uses a fused data representation instead of coordination. Specifically, fusion happens via the cross-attention mechanism of a transformer model [Vaswani, 2017], which allows building a fused representation for any interleaved text and image sequence given as input, thus enabling effective few-shot learning on images. Flamingo is at the basis of today's multimodal Gemini models [Team et al., 2023].
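As an illustration of fusion via cross-attention, the following sketch lets text tokens attend to visual tokens using PyTorch's generic `nn.MultiheadAttention`; this shows the mechanism only and is not Flamingo's actual gated cross-attention architecture (all dimensions are placeholders).

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Text tokens act as queries; visual tokens provide keys and values."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (B, T_text, d_model), visual_tokens: (B, T_visual, d_model)
        fused, _ = self.attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        return self.norm(text_tokens + fused)  # residual connection keeps the language features

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 10, 256), torch.randn(2, 49, 256))  # fused text representation: (2, 10, 256)
```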
Physics-Informed Neural Networks
Introducing Physics Information
Introducing physics information into neural networks to guide them towards physically consistent solutions can happen in multiple different ways. First, considering the type of bias that is introduced, we can have [Karniadakis et al., 2021]:
- Observational Biases, introduced by data samples representing underlying physics laws, e.g., through additional precomputed physical variables and properties.
- Inductive Biases, introduced as “hard constraints” about the data model, often by modifying the neural network architecture. For example, graph neural networks used for protein prediction directly encode constraints and properties of the protein structure in the network architecture [Jha et al., 2022]. Another notable example, specific to physics information, is the Hamiltonian Neural Network [Greydanus et al., 2019] used for modeling Hamiltonian mechanics.
- Learning Biases, introduced into the network generally as “soft constraints” by augmenting the loss function with additional terms that represent the laws of the underlying physical process. This usually leads to a multi-task-like training procedure. A notable example is the Physics-Informed Neural Network (PINN) [Raissi et al., 2019].
Considering the source of the physics information, we can differentiate physics-informed machine learning models as follows [Banerjee et al., 2024]: some introduce the governing equations and constraints directly, in the form of differential equations or conservation laws added as a loss component (see [Raissi et al., 2019]). Others integrate information from physics models and simulators (see [Yuan et al., 2023, Shrestha et al., 2025]). Another widely used approach is data augmentation with physical variables and properties, which offers better guidance to the model than raw data (e.g., [Lütjens et al., 2020]).
Classical Works
Neural ODEs, introduced by [Chen et al., 2018] (see figure 4), represent one of the first attempts to model physical systems in a neural network. Chen et al. describe the advantages of going from discrete-time processes, which update the internal state step-by-step (see recurrent or residual neural networks, for example), to continuous-time processes, which are commonly used in physics. Their model f directly learns the continuous dynamics of a system from data, predicting \frac{dh(t)}{dt} = f(h(t), t, \theta). For training, an ODE solver is applied, but instead of memory-inefficient backpropagation through the ODE solver, the adjoint sensitivity method is used.
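The core idea can be sketched in a few lines: a function f parameterizes the continuous dynamics, and a generic ODE solver integrates the hidden state forward in time. The sketch below uses a fixed linear map in place of a trained network and SciPy's `solve_ivp` instead of the adjoint method, purely for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Stand-in for a learned network f(h, t; theta): here a fixed linear map.
W = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

def dynamics(t, h):
    """dh(t)/dt = f(h(t), t, theta); a real Neural ODE would evaluate a trained MLP here."""
    return W @ h

h0 = np.array([1.0, 0.0])  # initial hidden state h(t_0)
sol = solve_ivp(dynamics, t_span=(0.0, 5.0), y0=h0, t_eval=np.linspace(0.0, 5.0, 50))
trajectory = sol.y          # hidden-state trajectory of shape (2, 50)
```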
Physics-Informed Neural Networks (PINNs) were introduced by [Raissi et al., 2019] (see figure 5) as one of the first examples of combining data-driven and physics-based models. PINNs generally model a physical quantity u(t, x), which must fulfill the differential equation describing the underlying dynamical system: \frac{\partial u(t, x)}{\partial t} + \mathcal{N}[u] = 0, t \in [0, T], x \in \Omega, with some initial conditions (IC) and boundary conditions (BC), where \mathcal{N} is a nonlinear operator. The model produces a physics-consistent u(t, x) by learning via a multi-task approach: the loss's first component is a mean-squared error for u(t, x) on the initial and boundary data points, whereas the second component enforces the differential equation by minimizing the squared residual f(t, x) = \frac{\partial u(t, x)}{\partial t} + \mathcal{N}[u] at collocation points. The authors effectively use this approach to solve Schrödinger's equation in quantum mechanics and Burgers' equation in fluid mechanics.
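The multi-task loss can be illustrated with a minimal PyTorch sketch for the 1D viscous Burgers' equation u_t + u u_x - \nu u_{xx} = 0, one of the benchmarks used by Raissi et al. The network size, the sampling of points, the equal loss weighting, and the omission of boundary points are simplifying assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Small MLP approximating u(t, x); the architecture is chosen arbitrarily for this sketch.
net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
nu = 0.01 / torch.pi  # viscosity of the classic Burgers' benchmark

def pde_residual(t, x):
    """f(t, x) = u_t + u * u_x - nu * u_xx, computed via automatic differentiation."""
    t = t.clone().requires_grad_(True)
    x = x.clone().requires_grad_(True)
    u = net(torch.cat([t, x], dim=-1))
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t + u * u_x - nu * u_xx

# Data term on initial-condition points + physics term on collocation points.
t0, x0 = torch.zeros(64, 1), torch.rand(64, 1) * 2 - 1   # points at t = 0, x in [-1, 1]
u0 = -torch.sin(torch.pi * x0)                            # initial condition u(0, x) = -sin(pi x)
tc, xc = torch.rand(256, 1), torch.rand(256, 1) * 2 - 1   # collocation points in [0, 1] x [-1, 1]
loss_data = ((net(torch.cat([t0, x0], dim=-1)) - u0) ** 2).mean()
loss_physics = (pde_residual(tc, xc) ** 2).mean()
(loss_data + loss_physics).backward()
```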
SOTA Methods
For now, only a few multimodal models use physics information to facilitate multimodal fusion. In this section, we review some of the notable works and analyze them in the framework introduced earlier.
Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos [Su et al., 2023]
Su et al. demonstrate how a physics prior can be exploited to generate high-fidelity impact sound for silent videos [Su et al., 2023]. This task is particularly challenging, given how sensitive the impact sound is to the underlying physics: the object geometries, the materials, the impact locations, and so on. This explains why most existing works rely on physics simulation and require many of these parameters as input, which means they do not scale to most real-world applications. Existing deep learning approaches, on the other hand, seem promising, but they often generate unfaithful sounds because they oversimplify the problem and do not capture the underlying physics.
The authors combine the best of both worlds by including physics priors in a deep-learning-based diffusion model. First, they extract physics priors from the audio samples of the training dataset, as shown in figure 6. They apply two different modules for this purpose. The first module is responsible for capturing the impact sound generated by the direct object interactions, and it estimates physics parameters through modal analysis: the audio sample is modeled as a set of damped sinusoidal signals, with frequency, power, and decay rate parameters for each component. The second module is responsible for capturing the background noise and the sound reflection, all dependent on the environment. The output of this module is the residual parameters, which are learned by modeling the sound environment as an exponentially decaying filtered noise. This second module relies on neural networks and is trained based on a reconstruction loss, where the reconstructed sound depends on the parameters of both modules.
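The damped-sinusoid model assumed by the first module can be sketched directly: an impact sound is approximated as a sum of modes, each with a frequency, power, and decay rate. The parameter values below are invented for illustration and are not estimated from the Greatest Hits data.

```python
import numpy as np

def modal_impact_sound(freqs_hz, powers, decay_rates, sr=16000, duration=1.0):
    """Synthesize a sum of exponentially damped sinusoids, one per estimated mode."""
    t = np.arange(int(sr * duration)) / sr
    sound = np.zeros_like(t)
    for f, p, d in zip(freqs_hz, powers, decay_rates):
        sound += p * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    return sound

# Three illustrative modes (frequency in Hz, relative power, decay rate in 1/s).
audio = modal_impact_sound(freqs_hz=[220.0, 520.0, 1310.0],
                           powers=[1.0, 0.6, 0.3],
                           decay_rates=[8.0, 15.0, 30.0])
```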
During training (see figure 7), a dataset of video-sound pairs is used. For each sample, the physics prior extractor described above builds a physics latent, whereas a ResNet-50 with TSM [Lin et al., 2019] encodes the video information into a visual latent. The main component of the model is then a U-Net-based diffusion model, which is conditioned not only on the visual latent but also on the physics latent.
At inference time, of course, there is no audio signal from which the physics latent could be extracted. Thus, the authors use the physics latent of the training sample whose visual latent is closest to the test sample's visual latent (based on Euclidean distance).
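This retrieval step amounts to a nearest-neighbor lookup in the visual latent space; a minimal sketch (with illustrative variable names and shapes) could look as follows.

```python
import torch

def retrieve_physics_latent(test_visual, train_visual, train_physics):
    """Return the physics latent of the training sample with the closest visual latent.

    test_visual:   (D,)   visual latent of the silent test video
    train_visual:  (N, D) visual latents of the training set
    train_physics: (N, P) physics latents extracted from the training audio
    """
    dists = torch.cdist(test_visual.unsqueeze(0), train_visual)  # (1, N) Euclidean distances
    nearest = dists.argmin(dim=1)
    return train_physics[nearest].squeeze(0)

physics_latent = retrieve_physics_latent(torch.randn(128),
                                          torch.randn(1000, 128),
                                          torch.randn(1000, 64))
```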
The authors evaluate the final, physics-informed cross-modal generator on the Greatest Hits dataset [Owens et al., 2016]. The model is compared with several baselines (albeit all constructed by the same authors), including a ConvNet-based model, a conditional Transformer-based model, and different setups of diffusion models: one conditioned on video only, one conditioned on video and class labels, and one conditioned on video and other audio features. Four of the five metrics considered compare the distributions of the generated and ground-truth spectrograms, and the fifth measures recognition accuracy (i.e., whether a classifier can tell generated sound samples apart from real ones). As figure 8 shows, the physics prior-conditioned diffusion model clearly outperforms all other baselines on all metrics. Moreover, the authors also evaluate the human perception of the generated sound through Amazon Mechanical Turk. The responses reinforce that the physics-informed model produces sound that is considered better in quality and that matches the ground-truth sound more closely than the ConvNet- or Transformer-based models.
Generating Physically Realistic and Directable Human Motions from Multi-modal Inputs [Shrestha et al., 2025]
Shrestha et al. demonstrate another use case of physics information for multimodal data: physically realistic human motion generation [Shrestha et al., 2025]. Their proposed solution, the Masked Humanoid Controller (MHC), can take as input motion directives of several different modalities, including a VR controller, a joystick controller, a video, or a text description (see figure 9).
Moreover, the MHC can "catch up" if the directive and the initial pose conflict, it can "combine" complementary (e.g., upper-body and lower-body) motions, and it can "complete" sparse or underspecified input (for example, when the directive specifies only the upper-body movement and the lower-body movement must be generated entirely by the model) (see figure 10).
To achieve these capabilities, the first challenge handled by the authors is the multimodal data representation. In general, motions for a humanoid with J joints are represented as a sequence of poses q_{1:H}, where each pose is a tuple $(q^r, q^\omega, q^l, q^g)$: $q^r$ denotes the root joint state (including its position, orientation, linear and angular velocity), $q^\omega \in \mathcal{R}^{J \times 6}$ represents the 3D joint orientations, $q^l \in \mathcal{R}^{J \times 3}$ denotes the positions of the joints relative to the root joint, and $q^g \in \mathcal{R}^{J \times 3}$ marks the global joint positions. Motion directives of the different modalities clearly do not all offer all of this information; for example, a VR controller generally provides only the local joint positions and the root position. To represent all motion directives of different modalities in a unified framework, the authors choose a masked motion representation (visualized in figure 11): each directive is represented as a sequence $d = (\hat{q}_{1:H}, \hat{i}_{1:H})$, which, besides the expected motion sequence, also includes a mask $\hat{i}_{1:H}$ that marks which dimensions of the poses are present and used as motion directives.
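A minimal sketch of such a masked directive is shown below, assuming each pose is flattened into a single feature vector and the mask simply marks (and zeroes out) unobserved dimensions; the sizes, including the 13-dimensional root state, are illustrative assumptions rather than the paper's exact layout.

```python
import numpy as np

J, H = 24, 60                        # number of joints and directive horizon (illustrative)
pose_dim = 13 + J * (6 + 3 + 3)      # root state + joint orientations + local + global joint positions

def make_directive(q_hat, present_dims):
    """Directive d = (q_hat, i_hat): target poses plus a boolean mask of observed dimensions."""
    i_hat = np.zeros((H, pose_dim), dtype=bool)
    i_hat[:, present_dims] = True    # e.g., a VR controller would only populate a few dimensions
    return q_hat * i_hat, i_hat

q_hat = np.random.randn(H, pose_dim)                               # stand-in for a target motion
directive, mask = make_directive(q_hat, present_dims=range(13))    # here: only the root state is specified
```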
Another challenge is integrating physics knowledge into the training process (see figure 12). This is done via reinforcement learning, in order to learn a generalizable policy that also supports zero-shot motion generation. The output of the policy model is a set of actuator set points, which are then used as input for a PD controller to achieve the desired motions. The training dataset contains a set of reference motions, and in each training episode, an initial pose and some motion directives are randomly chosen. The physics information is integrated at the reward level. The reward consists of three components. The first is the tracking reward: the PD controller's output is fed into a physics simulator, and the alignment of the simulated character with the original motion directives is measured. The second relies on targeted discriminators and measures the "naturalness" of the generated motions. The third introduces physics information in the form of an energy cost, which explicitly penalizes large changes between consecutive actions, as these generally lead to physically implausible motions.
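How such a composite reward might be combined per time step is sketched below; the weights, the placeholder tracking and naturalness scores, and the simple quadratic energy term are assumptions made for illustration, since the actual terms come from the physics simulator and the trained discriminators.

```python
def composite_reward(tracking_term, naturalness_term, action, prev_action, w=(0.6, 0.3, 0.1)):
    """Weighted sum of tracking, naturalness, and an energy penalty on abrupt action changes.

    tracking_term:       alignment of the simulated character with the motion directive
    naturalness_term:    score from the motion discriminators
    action, prev_action: consecutive actuator set points (sequences of equal length)
    """
    energy_penalty = sum((a - b) ** 2 for a, b in zip(action, prev_action))
    w_track, w_nat, w_energy = w
    return w_track * tracking_term + w_nat * naturalness_term - w_energy * energy_penalty

r = composite_reward(0.8, 0.7, action=[0.2, -0.1, 0.4], prev_action=[0.1, 0.0, 0.5])
```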
After training the motion generator described above on the Reallusion MoCap dataset [Peng et al., 2022], the authors evaluate it on a test dataset generated by an ASE controller [Peng et al., 2022]. The same ASE controller is used as a baseline on the Reallusion training dataset. The main metric is E_{MPJPE}, the mean per-joint position error, i.e., the root-relative position error averaged over all humanoid joints. The results shown in figure 13 indicate that the physics-informed policy model outperforms the baseline in all measured scenarios: imitation, catch-up, and combine. The multimodal directives are only evaluated qualitatively, and no data is given for reproducibility.
Unsupervised Physics-Informed Disentanglement of Multimodal Data [Walker et al., 2024]
Walker et al. make another attempt at using physics information for a better understanding of multimodal data, specifically for disentanglement of multimodal data [Walker et al., 2024]. Their goal is to build low-dimensional fingerprints for multimodal data, providing an interpretable, independent representation of the underlying factors determining the multimodal observations. They argue that these fingerprints facilitate scientific discovery in that they allow working with a reduced data quantity, and they can act as surrogates instead of measurements of costly experiments.
Their proposed solution is the Physics-Informed Multimodal Autoencoder (PIMA), illustrated in figure 14. The encoder part of the network has independent encoders for each modality, and these unimodal encoders produce the parameters of a Gaussian distribution. Then, a multimodal embedding is obtained via Product-of-Experts fusion, which means that the joint probability distribution is given by the normalized product of the unimodal Gaussian distributions. Once the fused representations are available, a Gaussian Mixture Model assumption is applied: the authors attempt to disentangle the data by assuming a clustering of the samples into N classes.
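The Product-of-Experts fusion of diagonal Gaussians has a closed form: the joint distribution is again Gaussian, with precisions (inverse variances) summing across modalities. The sketch below shows this standard formula; whether PIMA additionally includes a prior expert is not modeled here, and the numeric values are illustrative.

```python
import numpy as np

def product_of_experts(means, variances):
    """Fuse per-modality diagonal Gaussians N(mu_m, sigma_m^2) into one joint Gaussian."""
    precisions = [1.0 / v for v in variances]
    joint_var = 1.0 / np.sum(precisions, axis=0)            # joint precision = sum of precisions
    joint_mean = joint_var * np.sum([p * m for p, m in zip(precisions, means)], axis=0)
    return joint_mean, joint_var

# Two modalities, two latent dimensions.
mu, var = product_of_experts(means=[np.array([0.0, 1.0]), np.array([2.0, 1.0])],
                             variances=[np.array([1.0, 0.5]), np.array([1.0, 2.0])])
```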
On the decoder side, the authors experiment with two types of separate modality- and class-specific decoders: a general black-box model and a physics-informed one, the latter being tightly coupled to the specific problem. The authors demonstrate the advantages of a physics-informed decoder on two particular use cases; here, we only present the physics-informed solution for the MNIST-based dataset, consisting of the original MNIST images and the measurements of a simulated dynamical system. For this problem, the authors generate multimodal sample pairs for a specific digit c in MNIST such that the governing differential equations of the dynamical system (used for the second modality) are parameterized by c. The expert model for the second modality's decoder is a Spectral Neural ODE (NODE), which has information about the form of the governing differential equations but needs to learn the c-dependent parameters from the data samples. The authors show that a variational autoencoder with these two modalities and this NODE expert model achieves 90% average classification accuracy on MNIST in an unsupervised manner, outperforming purely data-driven approaches by over 7%. Detailed results are visible in figure 15.
Summary and Analysis
In the following, we'll classify the SOTA approaches presented above based on the framework introduced in the Background section.
| Paper | Multimodal Data Representation | Introducing Physics Information |
|---|---|---|
| Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos [Su et al., 2023] | Independent: focus on including complementary information from visual data vs. the physics prior. | Observational bias, physics property: estimate of sound properties based on training data. |
| Generating Physically Realistic and Directable Human Motions from Multi-modal Inputs [Shrestha et al., 2025] | Independent but unified framework: each modality is mapped to the same space through independent, pre-trained encoders. | Learning bias, governing equation: energy cost; learning bias, physics simulator. |
| Unsupervised Physics-Informed Disentanglement of Multimodal Data [Walker et al., 2024] | Fusion: each modality is mapped to the same space through independent encoders, then Product-of-Experts fusion is applied. | Inductive bias, governing equation: the neural network setup relies on assumptions about the physical system. |
Conclusion
Motivation and Problem Statement
While physics-informed multimodal networks are not widely used yet, we have found several works demonstrating their advantages for data-efficient and physically plausible models. In particular, [Su et al., 2023] demonstrate that physics priors reduce the need for a detailed mathematical description of the incoming data, thus scaling well for raw video datasets. [Shrestha et al., 2025] show how a physics simulator can be effectively integrated into a deep-learning model through reinforcement learning, with the intention of keeping the predictions physically plausible. They also exemplify how physics constraints can be introduced as an additional component in a loss function, emulating multi-task learning. Finally, we have seen how [Walker et al., 2024] introduce an inductive bias in a variational autoencoder's expert Neural ODE decoder to facilitate the disentanglement of multimodal data.
Remaining Challenges and Limitations
Despite the promising results, we should not forget that all previously presented models lack a conceptual understanding of physics, which results in several limitations. For example, the physics information is often introduced as a soft constraint, so the model offers no hard guarantees of correctness. Complex systems cannot be modeled easily: only a limited number of equations can be introduced in a composite loss function, as multi-task learning does not automatically scale to many tasks. Physics simulators, on the other hand, can model complex systems, but training through them is expensive. Scaling is problematic not only at the system-complexity level but also at the network-size level, as PINN-like networks suffer from vanishing gradient problems [Gnanasambandam et al., 2022]. Data scarcity is another challenge: especially in the medical domain, collecting multimodal datasets for training and evaluation is difficult due to data protection laws. Last but not least, evaluating physics consistency is challenging in itself and often cannot be done without human feedback.
Future Works and Open Questions
We can certainly say that exploiting physics information specifically for multimodal modeling is underexplored, and we expect more work to appear in the near future. Whether the existing approaches will dominate is not certain, though: this problem may instead be solved by scaling up models and training data, effectively building physics-aware foundation models, as happened for language and vision-language models (see [Herde et al., 2024] for an example of building foundation models for partial differential equations).
References
[Alayrac et al., 2022] Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. (2022). Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
[Baltrušaitis et al., 2018] Baltrušaitis, T., Ahuja, C., and Morency, L.-P. (2018). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443.
[Banerjee et al., 2024] Banerjee, C., Nguyen, K., Fookes, C., and George, K. (2024). Physics-informed computer vision: A review and perspectives. ACM Computing Surveys, 57(1):1–38.
[Bengio et al., 2013] Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.
[Chen et al., 2018] Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31.
[Doumèche et al., 2023] Doumèche, N., Biau, G., and Boyer, C. (2023). Convergence and error analysis of PINNs. arXiv preprint arXiv:2305.01240.
[Gnanasambandam et al., 2022] Gnanasambandam, R., Shen, B., Chung, J., Yue, X., et al. (2022). Self-scalable tanh (STAN): Faster convergence and better generalization in physics-informed neural networks. arXiv preprint arXiv:2204.12589.
[Greydanus et al., 2019] Greydanus, S., Dzamba, M., and Yosinski, J. (2019). Hamiltonian neural networks. CoRR, abs/1906.01563.
[Jha et al., 2022] Jha, K., Saha, S., and Singh, H. (2022). Prediction of protein–protein interaction using graph neural networks. Scientific Reports, 12(1):8360.
[Karniadakis et al., 2021] Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S., and Yang, L. (2021). Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440.
[Kashinath et al., 2021] Kashinath, K., Mustafa, M., Albert, A., Wu, J., Jiang, C., Esmaeilzadeh, S., Azizzadenesheli, K., Wang, R., Chattopadhyay, A., Singh, A., et al. (2021). Physics-informed machine learning: Case studies for weather and climate modelling. Philosophical Transactions of the Royal Society A, 379(2194):20200093.
[Liang et al., 2024] Liang, P. P., Zadeh, A., and Morency, L.-P. (2024). Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10):1–42.
[Lin et al., 2019] Lin, J., Gan, C., and Han, S. (2019). Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7083–7093.
[Lütjens et al., 2020] Lütjens, B., Leshchinskiy, B., Requena-Mesa, C., Chishtie, F., Díaz-Rodriguez, N., Boulais, O., Piña, A., Newman, D., Lavin, A., Gal, Y., et al. (2020). Physics-informed GANs for coastal flood visualization. arXiv preprint arXiv:2010.08103.
[Morency et al., 2022] Morency, L.-P., Liang, P. P., and Zadeh, A. (2022). Tutorial on multimodal machine learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorial Abstracts, pages 33–38.
[Owens et al., 2016] Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., and Freeman, W. T. (2016). Visually indicated sounds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2405–2413.
[Peng et al., 2022] Peng, X. B., Guo, Y., Halper, L., Levine, S., and Fidler, S. (2022). ASE: Large-scale reusable adversarial skill embeddings for physically simulated characters. ACM Transactions on Graphics (TOG), 41(4):1–17.
[Radford et al., 2021] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
[Raissi et al., 2019] Raissi, M., Perdikaris, P., and Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707.
[Shrestha et al., 2025] Shrestha, A., Liu, P., Ros, G., Yuan, K., and Fern, A. (2025). Generating physically realistic and directable human motions from multi-modal inputs. In European Conference on Computer Vision, pages 1–17. Springer.
[Su et al., 2023] Su, K., Qian, K., Shlizerman, E., Torralba, A., and Gan, C. (2023). Physics-driven diffusion models for impact sound synthesis from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9749–9759.
[Team et al., 2023] Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. (2023). Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
[Vadyala and Betgeri, 2023] Vadyala, S. R., and Betgeri, S. N. (2023). General implementation of quantum physics-informed neural networks. Array, 18:100287.
[Vaswani, 2017] Vaswani, A. (2017). Attention is all you need. Advances in Neural Information Processing Systems.
[Walker et al., 2024] Walker, E., Trask, N., Martinez, C., Lee, K., Actor, J. A., Saha, S., Shilt, T., Vizoso, D., Dingreville, R., and Boyce, B. L. (2024). Unsupervised physics-informed disentanglement of multimodal data. Foundations of Data Science.
[Yuan et al., 2023] Yuan, Y., Song, J., Iqbal, U., Vahdat, A., and Kautz, J. (2023). Physdiff: Physics-guided human motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16010–16021.
[Herde et al., 2024] Herde, M., Raonić, B., Rohner, T., Käppeli, R., Molinaro, R., de Bézenac, E., and Mishra, S. (2024). Poseidon: Efficient foundation models for PDEs. arXiv preprint arXiv:2405.19101.