This blog post reviews the topic of cross-modal learning based on three recently published papers.
1. Introduction
Firstly, I will introduce cross-modal learning: what it is, the main issues, and its motivation. Then you will find out about related work. After that, I will go in depth into three papers, Learning Explicit and Implicit Dual Common Subspaces for Audio-Visual Cross-Modal Retrieval, VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency, and Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations, by presenting their methodology, experimental setup, and results. Finally, I will give my own review and a summary.
1.1 What is cross-modal learning?
You may have heard the term multimodal but not cross-modal. The two terms are often confused with each other, but they do not mean the same thing:
- Multimodal learning means learning from multiple data modalities. For example, a human can use both sight and hearing to identify a person or object, and multimodal learning is concerned with developing similar abilities for computers.
- Cross-modal learning is an approach to multimodal learning where information from one modality is used to improve performance in another. For example, if we hear airplane noise, we can better categorize the object in the air without clearly seeing it[1].
1.2 Problem
In this blog post, we will examine cross-modal learning for audio-visual data. Inconsistent distributions and the heterogeneous nature of their representations make audio and visual modalities impossible to compare directly. A frequently used approach to bridge this modality gap is to project audio and visual features into a common subspace that captures the commonalities and characteristics of the modalities so that they can be measured against each other.
1.3 Motivation
The main motivation of cross-modal learning is to use one modality to improve performance on a target modality. Learning modality-common features without losing modality-specific features is the motivation of the first paper we will analyze today. The motivation of the other two papers is to make use of additional features of a modality, such as face attributes or motion besides lip motion, to improve audio separation.
1.4 Related Works
Representation learning is a set of techniques used to learn representations of data that make it easier to extract useful information when building machine learning models. Canonical correlation analysis (CCA) is a statistical technique for deriving the relationship between two sets of variables[2]. CCA-variant methods have been used to learn an explicit common subspace by transforming the features of each modality. Kernel canonical correlation analysis (KCCA)[3] uses a kernel "trick", projecting the data into a high-dimensional space to generate non-linear representations. Deep canonical correlation analysis (DCCA)[4] combines CCA with deep neural networks to learn complex nonlinear transformations of cross-modal data so that the generated representations are highly linearly correlated. Cluster-CCA[5] learns linear correlations of the transformed data by using class information to establish the possible correspondences within a cluster. CCA can be seen as a conventional representation learning method. Recently, deep-learning methods have also been used for cross-modal learning with faces and voices, for example for face reconstruction from audio[6], emotion recognition[7], and talking face generation[8].
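As a concrete illustration of the CCA idea, the sketch below projects two sets of audio and visual features into a shared, linearly correlated subspace using scikit-learn's CCA implementation. The features, sample counts, and dimensions are made up for demonstration and are not taken from any of the papers.

```python
# Minimal illustration of projecting two modalities into a correlated
# common subspace with linear CCA (scikit-learn). All values are toy data.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_pairs = 500                                    # paired audio-visual samples
audio_feats = rng.normal(size=(n_pairs, 128))    # e.g. audio embeddings
visual_feats = rng.normal(size=(n_pairs, 1024))  # e.g. visual embeddings

cca = CCA(n_components=10)                       # dimensionality of the common subspace
cca.fit(audio_feats, visual_feats)
audio_c, visual_c = cca.transform(audio_feats, visual_feats)

# In the common subspace, paired projections are maximally linearly
# correlated, so cross-modal retrieval can use simple distance measures.
print(audio_c.shape, visual_c.shape)             # (500, 10) (500, 10)
```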
2. Learning Explicit and Implicit Dual Common Subspaces for Audio-Visual Cross-Modal Retrieval
This paper aims to improve CCA performance by using dual common subspaces.
2.1 Implicit and Explicit Common Subspace
Figure 1: Implicit and explicit subspaces[9]
The proposed model projects the audio and visual representations into dual common subspaces: an explicit common subspace and an implicit common subspace. The implicit common subspace[9] captures the unique characteristics of each modality by minimizing the difference between labels and features. The explicit common subspace learns the commonalities and correlations for each audio-visual pair extracted from the same video.
2.2 Model
The proposed model consists of three parts: 1) feature extraction; 2) modality representations; 3) feature fusion.
Figure 2: Model Framework[9]
- In the feature extraction part, the global audio and visual representations are extracted by the VGGish and Inception models. VGGish is a pretrained convolutional neural network for audio, and Inception v3 is an image recognition CNN.
- In the modality representations part, the extracted features are projected into the explicit and implicit common subspaces by learning two distinct kinds of correlation.
- The correlation loss is applied in the explicit common subspace to directly bridge the modality gap based on pairwise mutual information.
- The discriminative loss ensures the generated representations have rich semantic information in the implicit common subspace.
- The constraint loss maintains the balance between the two types of neural networks for each modality, because end-to-end training may otherwise make the two NN structures redundant. The constraint loss enforces the two NN structures to learn in different directions from the same input (a hedged sketch combining the three losses follows this list).
- The feature fusion part fuses the output representations from the two subspaces into a joint embedding using a simple fusion mechanism: a linear CCA layer takes the concatenation of the representations from the two subspaces of each modality as input and fuses them.
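To make the interplay of the three losses more tangible, here is a hedged PyTorch-style sketch of one way they could be combined into a single training objective. The concrete formulations (MSE for correlation, cross-entropy for discrimination, a cosine-orthogonality penalty as the constraint) and the weights are illustrative stand-ins, not the authors' implementation.

```python
# Hedged sketch of combining the three loss terms described above.
import torch
import torch.nn.functional as F

def total_loss(a_exp, v_exp, a_imp, v_imp, logits_a, logits_v, labels,
               w_corr=1.0, w_disc=1.0, w_cons=0.1):
    # Correlation loss (explicit subspace): pull paired audio/visual
    # projections of the same video towards each other.
    corr_loss = F.mse_loss(a_exp, v_exp)

    # Discriminative loss (implicit subspace): keep label/semantic
    # information in each modality's projection.
    disc_loss = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_v, labels)

    # Constraint loss: push the two branches of each modality to learn
    # different directions from the same input (here: penalize similarity).
    cons_loss = (F.cosine_similarity(a_exp, a_imp, dim=1).abs().mean()
                 + F.cosine_similarity(v_exp, v_imp, dim=1).abs().mean())

    return w_corr * corr_loss + w_disc * disc_loss + w_cons * cons_loss
```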
2.3 Datasets
They used two datasets, AVE and VEGAS. The Audio-Visual Event (AVE) dataset contains more than 4,000 videos covering 28 event categories; each video is annotated with audio-visual event boundaries. The VEGAS dataset is a subset of the Google AudioSet derived from YouTube videos; it contains 10 categories such as human voice, vehicle sound, and nature sound.
Figure 3: AVE dataset[9]
2.4 Evaluation baselines
To evaluate their proposed model, they use Mean Average Precision (MAP) as the main metric for audio-visual cross-modal retrieval on both the VEGAS and AVE datasets. MAP is the mean of the average precision over all classes. They use precision-scope curves as an additional metric to examine local precision, because the MAP metric focuses on global precision and overlooks local precision.
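For reference, here is a small numpy sketch of how MAP can be computed from retrieval rankings; the labels and rankings are toy values, not results from the paper.

```python
# Mean average precision (MAP) for retrieval: for each query, rank the
# gallery items of the other modality and average the precision at every
# correctly retrieved position, then average over queries.
import numpy as np

def mean_average_precision(query_labels, ranked_gallery_labels):
    """ranked_gallery_labels[i] is the gallery label sequence ranked by
    similarity to query i (most similar first)."""
    aps = []
    for q_label, ranking in zip(query_labels, ranked_gallery_labels):
        hits, precisions = 0, []
        for rank, g_label in enumerate(ranking, start=1):
            if g_label == q_label:
                hits += 1
                precisions.append(hits / rank)
        aps.append(np.mean(precisions) if precisions else 0.0)
    return float(np.mean(aps))

# Toy example: two queries of class 0 and 1 against small ranked galleries.
print(mean_average_precision([0, 1], [[0, 1, 0], [0, 1, 1]]))
```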
2.5 Pretrained Models
They use two pretrained networks for feature extraction. For the audio part, the VGGish model takes a mel spectrogram as input, which is produced by taking the short-time Fourier transform of the audio data. For the visual part, the Inception model takes one random frame per second of video, and the output dimension of the Inception model is reduced by principal component analysis.
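A hedged sketch of these two preprocessing steps, using librosa for the log-mel spectrogram and scikit-learn for PCA; the parameter values and the placeholder file name are illustrative, not the exact settings used in the paper.

```python
# Audio: waveform -> STFT -> mel spectrogram -> log scale (VGGish-style input).
# Visual: reduce Inception-style frame features with PCA.
import numpy as np
import librosa
from sklearn.decomposition import PCA

waveform, sr = librosa.load("example_clip.wav", sr=16000)   # hypothetical file
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel)            # log-mel patches for the audio branch

frame_features = np.random.rand(1000, 2048)   # placeholder Inception features
pca = PCA(n_components=128)
reduced = pca.fit_transform(frame_features)   # (1000, 128)
```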
Figure 4: Precision Scope Curves[9]
2.6 Results
The model improves performance on both datasets compared to the state-of-the-art model TNN-C-CCA (Triplet Neural Networks with Cluster-CCA). As you can see, it achieves a 4.6 percent improvement for retrieving visual data from audio and 4.1 percent for visual-to-audio retrieval on the VEGAS dataset. On the AVE dataset, it improves by 8.4 percent for retrieving visual data and by 2.1 percent for retrieving audio from visual data over TNN-C-CCA.
Table 1: Model's performance compared to other state-of-the-art models[9]
2.7 Analysis
These results show that modality-specific characteristics can be used as additional features to prevent information loss while bridging the modality gap. This also emphasizes the importance of multiple common subspaces.
3. VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency
This paper proposes a multi-task learning framework that jointly learns audio-visual speech separation and cross-modal face-voice embeddings. The approach leverages the complementary cues between lip movements and cross-modal speaker embeddings for speech separation. The need for a facial speaker embedding arises because lip movements can easily become unreliable, for example when the speaker turns their head away or when the mouth region of interest is occluded by a microphone. Facial attributes such as gender, age, nationality, and body weight can intuitively give prior information about the sound[10]. So this paper aims to "visualize" a person's voice based on how they look in order to better separate that person's speech. Thanks to pretrained networks, no labels are needed for the facial attributes.
Figure 5: AudioVisual Speech Separator Network[10]
3.1 Visual Network
The visual stream of the network has two parts: a lip motion analysis network and a facial attributes analysis network.
3.1.1 Lip Motion Analysis Network
The lip motion analysis network takes mouth regions of interest (ROIs) as input; it consists of a 3D convolutional layer followed by a ShuffleNet network to extract a time-indexed sequence of feature vectors. These are then processed by a temporal convolutional network (TCN) to extract the final lip motion feature map of dimension Vl*N. A temporal convolutional network can be thought of as a causal convolutional counterpart to an RNN.
3.1.2 Facial Attributes Analysis Network
For the facial attributes analysis part, a ResNet network takes a single face image randomly sampled from the face track and extracts a face embedding of dimension Vf that encodes the facial attributes of the speaker. The facial attributes feature is replicated along the time dimension and concatenated with the lip motion feature map to obtain a final visual feature of dimension V*N, where V = Vl + Vf.
3.2 Audio Network
For the audio network, a U-Net is tailored to audio speech separation. The encoder takes as input a complex spectrogram of the mixture signal of dimension 2*F*T, where F and T are the frequency and time dimensions of the spectrogram. The input passes through a series of convolutional layers and frequency pooling layers that reduce the frequency dimension while preserving the time dimension, so the encoder ends up with an audio feature map of dimension D*1*N, where D is the channel dimension. The visual and audio features are concatenated along the channel dimension to generate an audio-visual feature map of dimension (V + D)*1*N. The decoder takes the concatenated audio-visual feature as input; it mirrors the encoder, with the convolutional and frequency pooling layers replaced by up-convolutional and frequency upsampling layers. Finally, a tanh layer with a scaling operation predicts a complex mask of the same dimension as the input spectrogram for each speaker.
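To make the tensor shapes above concrete, here is a hedged PyTorch sketch of how the lip-motion feature map, the replicated face embedding, and the audio bottleneck could be fused and decoded into a complex mask. The dimensions and the single-layer "decoder head" are placeholders; the actual ShuffleNet, TCN, ResNet, and U-Net backbones are not reproduced here.

```python
# Hedged sketch of the feature fusion and mask prediction described above.
import torch
import torch.nn as nn

N, Vl, Vf, D, F_bins = 64, 512, 128, 256, 257

lip_feats = torch.randn(1, Vl, N)          # lip motion feature map (Vl x N)
face_emb = torch.randn(1, Vf)              # single face embedding (Vf)
audio_feats = torch.randn(1, D, 1, N)      # U-Net bottleneck (D x 1 x N)

# Replicate the face embedding along time and build the visual feature (V x N).
face_rep = face_emb.unsqueeze(-1).expand(-1, -1, N)
visual_feats = torch.cat([lip_feats, face_rep], dim=1)                  # (1, Vl+Vf, N)

# Concatenate visual and audio features along the channel dimension.
av_feats = torch.cat([visual_feats.unsqueeze(2), audio_feats], dim=1)   # (1, V+D, 1, N)

# Stand-in decoder head: project back to a 2-channel (real/imag) mask and
# squash with tanh, then scale (the scale factor here is a placeholder).
decoder_head = nn.Sequential(
    nn.ConvTranspose2d(Vl + Vf + D, 2, kernel_size=(F_bins, 1)),
    nn.Tanh(),
)
complex_mask = 10.0 * decoder_head(av_feats)   # (1, 2, F_bins, N), i.e. 2 x F x T
print(complex_mask.shape)
```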
3.3 Multi Task Framework
Figure 6: Multitask Framework[10]
This figure shows all parts of the proposed system; I will go through them step by step, starting from where we left off. Having explained the audio-visual speech separator network above, we can now continue with the mask prediction loss.
3.3.1 Mask Prediction Loss
By penalizing the difference between the predicted and ground truth complex masks, the mask prediction loss provides the main supervision that enforces the separation of clean speech. The ground truth complex mask is the ratio of the clean speech spectrogram to the mixed speech spectrogram.
Figure 7: Mask prediction loss[10]
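A minimal sketch of the ground-truth complex mask and a mask prediction loss, assuming an STFT-based spectrogram and an L2 penalty on the real and imaginary parts; the paper's exact loss formulation may differ.

```python
# Ground-truth complex mask = clean spectrogram / mixture spectrogram.
import torch

def complex_ratio_mask(clean_wav, mix_wav, n_fft=400, hop=160, eps=1e-8):
    win = torch.hann_window(n_fft)
    clean = torch.stft(clean_wav, n_fft, hop, window=win, return_complex=True)
    mix = torch.stft(mix_wav, n_fft, hop, window=win, return_complex=True)
    return clean / (mix + eps)          # complex-valued ground-truth mask

def mask_prediction_loss(pred_mask, gt_mask):
    # Compare real and imaginary parts of predicted vs ground-truth masks.
    return (torch.view_as_real(pred_mask) - torch.view_as_real(gt_mask)).pow(2).mean()
```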
3.3.2 Vocal Attributes Network
The vocal attributes network uses a ResNet, similar to the facial attributes analysis network.
3.3.3 Cross Modal Matching Loss
The cross-modal matching loss forces the network to learn cross-modal face-voice embeddings such that the distance between the embedding of the separated speech and the face embedding of the corresponding speaker is smaller than the distance between the separated speech embedding and the face embedding of the other speaker. This helps link voices to faces better, and thus provides better facial attributes to guide the separation.
Figure 8: Cross modal matching loss[10]
3.3.4 Speaker Consistency Loss
The speaker consistency loss uses the similarity of voice characteristics to match audio segments that come from the same speaker.
Figure 9: Consistency Loss[10]
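Both embedding losses can be written as generic triplet objectives. The sketch below is a hedged approximation of the two losses described above; the margins and the exact formulations used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def cross_modal_matching_loss(speech_emb_A, face_emb_A, face_emb_B, margin=0.5):
    # The separated speech of speaker A should be closer to A's face
    # embedding than to the other speaker's face embedding.
    return F.triplet_margin_loss(speech_emb_A, face_emb_A, face_emb_B, margin=margin)

def speaker_consistency_loss(speech_emb_A1, speech_emb_A2, speech_emb_B, margin=0.5):
    # Two separated speech segments from the same speaker should sound more
    # alike than segments from different speakers.
    return F.triplet_margin_loss(speech_emb_A1, speech_emb_A2, speech_emb_B, margin=margin)
```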
3.4 Datasets
They use six datasets to validate the model in terms of audio-visual speech separation, speech enhancement, and cross-modal speaker verification.
VoxCeleb2: This dataset contains over 1 million utterances with the associated face tracks extracted from YouTube videos.
Mandarin, TCD-TIMIT, CUAVE, and LRS2 are used to compare the proposed model with a series of state-of-the-art audio-visual speech separation and enhancement methods.
VoxCeleb1: This dataset contains over 100,000 utterances for 1,251 celebrities extracted from YouTube videos; it is used to evaluate cross-modal speaker verification.
3.5 Evaluation
For evaluation, Signal-to-Distortion Ratio (SDR), Signal-to-Interference Ratio (SIR), Signal-to-Artifacts Ratio (SAR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI) are used as metrics. The training objective is to minimize the total loss. The model can be trained entirely with unlabeled video.
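These metrics can be computed with common third-party packages such as mir_eval, pesq, and pystoi; the sketch below uses placeholder signals and is one possible way to compute them, not the authors' evaluation code.

```python
import numpy as np
from mir_eval.separation import bss_eval_sources
from pesq import pesq
from pystoi import stoi

fs = 16000
reference = np.random.randn(2, fs * 3)                            # 2 placeholder clean sources, 3 s
estimated = reference + 0.1 * np.random.randn(*reference.shape)   # fake separated estimates

sdr, sir, sar, _ = bss_eval_sources(reference, estimated)          # separation metrics
pesq_score = pesq(fs, reference[0], estimated[0], 'wb')            # wideband PESQ
stoi_score = stoi(reference[0], estimated[0], fs, extended=False)  # intelligibility
print(sdr, sir, sar, pesq_score, stoi_score)
```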
Table 2: Results on the VoxCeleb2 dataset without background noise[10]
3.6 Results
Table 3: Results on the VoxCeleb2 dataset with background noise[10]
Table 4: Results on different datasets[10]
3.7 Analysis
The VisualVoice approach provides a speech separation solution that is less vulnerable to unreliable lip motion. Compared to other state-of-the-art models, it gives the largest performance gain when the speakers look very different and the lip motion is unreliable, and the smallest gain in the exact opposite case.
4. Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations
As the last paper, we will look into Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations. This paper introduces a two-stage visual sound source separation architecture called the Appearance and Motion network (AMnet), where the stages specialize in appearance and motion cues, respectively. It also introduces the Audio-Motion Embedding (AME) framework to learn the motions of sounds in a self-supervised manner[11].
Figure 10: AMnet Framework[11]
4.1 Audio-Appearance Sound Separation Stage
The appearance network feeds a random single frame into a dilated Res18-2D network to get a semantic representation of the object type.
In the sound spectrogram network, the mixed audio is converted to a mel spectrogram by means of a short-time Fourier transform. A MobileNetV2 architecture then takes the mel spectrogram as input and produces a feature map of size Cs*Hs*Ws.
A binary mask is formed by applying a sigmoid and a threshold to the channel-wise multiplication of the visual feature vector and the audio feature map.
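Here is a hedged PyTorch sketch of this masking step with illustrative shapes: the appearance vector weights the audio channels, the weighted map is summed over channels, and a sigmoid plus threshold yields the binary mask. The exact fusion used in the paper may differ.

```python
import torch

Cs, Hs, Ws = 32, 64, 64
appearance_vec = torch.randn(1, Cs)        # semantic appearance feature
audio_map = torch.randn(1, Cs, Hs, Ws)     # MobileNetV2 spectrogram features

fused = (appearance_vec.view(1, Cs, 1, 1) * audio_map).sum(dim=1)  # (1, Hs, Ws)
binary_mask = (torch.sigmoid(fused) > 0.5).float()                 # thresholded mask
```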
4.2 Audio-Motion Sound Separation Stage
The motion network uses 3D and 1D Res18 modules; this block is explained in the AME subsection below.
The sound spectrogram refinement (SSR) network is an encoder-decoder architecture with 7 down- and up-convolutional layers.
The audio-motion transformer module leverages the motions of sound learned by the audio-motion embedding.
Residual fusion enables the model to reallocate outputs of the audio-appearance stage that were wrongly assigned as component-source pairs.
4.3 Audio-Motion Embedding(AME)
Figure 11: Audio-Motion Embedding[11]
The AME maps motions and audio into a common embedding space. A video clip is paired with aligned audio and with misaligned audio, which is formed by shifting the waveform in the time domain. The embedding networks learn this mapping by computing the distances between the video embedding and the two audio embeddings and formulating a cost function with a triplet loss, which is minimized over a large set of videos.
Figure 12: Triplet Loss[11]
The triplet loss enables the model to focus on sound-related motions.
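A minimal sketch of this triplet objective, assuming the video (motion) embedding as anchor, the aligned audio as positive, and a time-shifted version of the same audio as negative; the encoders, shift amount, and margin are placeholders.

```python
import torch
import torch.nn.functional as F

def ame_triplet_loss(video_emb, audio_encoder, waveform, shift, margin=1.0):
    aligned_emb = audio_encoder(waveform)                                  # positive
    misaligned_emb = audio_encoder(torch.roll(waveform, shift, dims=-1))   # negative
    return F.triplet_margin_loss(video_emb, aligned_emb, misaligned_emb,
                                 margin=margin)
```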
4.4 Datasets
Figure 13: Sound spectrogram evaluation on AVE and MUSIC21 Datasets[11]
Two datasets are used for this model. I have already talked about the AVE dataset. The MUSIC-21 dataset is new here; it contains over 1,000 videos from 21 instrument categories.
Sound separation performance is measured by means of the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR) metrics.
Figure 14: Training Objective[11]
AMnet is trained by minimizing the binary cross-entropy loss between the estimated and ground truth binary masks.
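A minimal sketch of this objective, with placeholder mask tensors:

```python
import torch
import torch.nn.functional as F

estimated_mask = torch.rand(1, 1, 256, 256)                     # network output in [0, 1]
ground_truth_mask = (torch.rand(1, 1, 256, 256) > 0.5).float()  # placeholder GT binary mask
loss = F.binary_cross_entropy(estimated_mask, ground_truth_mask)
```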
4.5 Results
Table 5: AMnet performance on the MUSIC-21 dataset[11]
Table 6: AMnet performance on the AVE dataset[11]
Table 7: Performance of AME[11]
As you can see from the tables, AMnet performs considerably better than the other state-of-the-art models.
The AM embedding also performs well according to the consensus Intersection over Union (cIoU) and Area Under Curve (AUC) metrics.
4.6 Analysis
AMnet is trained in a self-supervised manner, so it has no limitation on source types.
5. Review
Figure 15: Correlation between human senses and modalities[12]
In recent years, there has been a growing interest in the field of sound source separation and localization. The ability to separate and locate individual sound sources in a complex audio environment can have a wide range of applications, from speech recognition and audio-based navigation to music production and sound design. In this blog post, we have discussed three recent papers that explore different approaches to sound source separation and localization by using cross modal learning.
Our body uses different senses to improve the one we need for a better experience, even if we don't recognize it. Early applications of cross-modal learning may well have been inspired by how the human body uses its senses. Just like in the real world, in the digital world we can use different modalities to improve the modality we need.
Regarding strengths and weaknesses: the first model decreases information loss and, by learning better, it also decreases time cost. The second model gives better accuracy in the case of unreliable lip motion, but its effect is limited when the faces are similar. The third model achieves improved sound localization results, but it can fail in the case of similar sound sources.
Lessons learned: other modalities can be used to increase the performance of one modality. While using other modalities, we can lose some modality-specific features, which causes models to perform worse than they could. Besides that, we can profit from modalities by using their different embedded features. For example, in the last two papers we discussed, facial attributes, lip motion, and general motion were extracted from the visual data to supplement the audio data.
6. References
[1] Jina AI. (n.d.). What is Multimodal Deep Learning and what are the applications? Jina AI. Retrieved January 21, 2023, from https://jina.ai/news/what-is-multimodal-deep-learning-and-what-are-the-applications/
[2] Naghshin, V. (2020, August 29). What is Canonical Correlation Analysis? Medium. Retrieved January 29, 2023, from https://medium.com/analytics-vidhya/what-is-canonical-correlation-analysis-58ef4349c0b0
[3] Pei Ling Lai and Colin Fyfe. 2000. Kernel and Nonlinear Canonical Correlation Analysis. Int. J. Neural Syst. 10, 5 (2000), 365–377. https://doi.org/10.1142/S012906570000034X
[4] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. 2013. Deep Canonical Correlation Analysis. In Proceedings of the 30th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 28). PMLR, Atlanta, Georgia, USA, 1247–1255.
[5] Nikhil Rasiwasia, Dhruv Mahajan, Vijay Mahadevan, and Gaurav Aggarwal. 2014. Cluster Canonical Correlation Analysis. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics. JMLR.org, Reykjavik, Iceland, 823–831. https://doi.org/10.1201/b18358-8
[6] T.-H. Oh, T. Dekel, C. Kim, I. Mosseri, W. T. Freeman, M. Rubinstein, and W. Matusik. Speech2face: Learning the face behind a voice. In CVPR, 2019
[7] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman. Emotion recognition in speech using cross-modal transfer in the wild. In ACMMM, 2018
[8] H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang. Talking face generation by adversarially disentangled audio-visual representation. In AAAI, 2019
[9] Donghuo Zeng, Jianming Wu, Gen Hattori, Rong Xu, and Yi Yu. 2022. Learning Explicit and Implicit Dual Common Subspaces for Audio-Visual Cross-Modal Retrieval. ACM Trans. Multimedia Comput. Commun. Appl. Just Accepted (September 2022). https://doi.org/10.1145/3564608
[10] Gao, R., & Grauman, K. (2021). VisualVoice: Audio-visual speech separation with cross-modal consistency. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr46437.2021.01524
[11] Zhu, L., & Rahtu, E. (2022). Visually guided sound source separation and localization using self-supervised motion representations. 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). https://doi.org/10.1109/wacv51458.2022.00223
[12] Tan, Hongwei, Zhou, Yifan, Tao, Quanzheng, Rosén, Johanna, & van Dijken, Sebastiaan. (2021). Bioinspired multisensory neural network with crossmodal integration and recognition. Nature Communications, 12. https://doi.org/10.1038/s41467-021-21404-z