Abstract
In this blog post, we discuss sound and music generative models and compare three modern approaches to generating realistic sound and music signals.
Introduction
Sound and Music Generative Models
Sound and music generative models are emerging as a promising technology with the potential to transform industries ranging from entertainment to healthcare. In the film and gaming industries, they can be used to enhance background sound effects and reduce the cost of adding background music [1]. They can also enrich the real-time user experience in Virtual and Augmented Reality applications [2].
Sound Generative Models in Medicine
Sound and sound generative models have also been applied in diverse medical settings. GANs have been used to generate synthetic ultrasound images for data augmentation [5] and to synthesize normal and abnormal heartbeat sounds for training better anomaly detection models [6]. Sonification models have also been explored for building surgical simulation training environments [4]. Past research therefore shows promising potential for sound models in medical environments.
Challenges of Sound Generative Models
The development of sound generative models currently lags behind its counterparts in visual and text generation [7]. This can be attributed to the nature of audio signals: compared to image data, for example, audio signals are only one-dimensional, which provides less flexibility for differentiating between sources and objects in the signal [7]. The task becomes even more challenging when reverberation from the environment is present, as in real-life audio applications [7].
Overview of Different Generative Approaches
In this blog post, we discuss three different approaches to sound and music generation to understand the current state of sound generative modeling and its future potential.
Those approaches include:
- Autoregressive approach for sound generation conditioned on text input
- Diffusion-based non-autoregressive approach for sound generation conditioned on text input
- Diffusion-based approach for music generation conditioned on text input
Methodology and Evaluation
Autoregressive model approach
Methodology
The authors of the paper AudioGen: Textually Guided Audio Generation [7] present the AudioGen framework, which consists of two models that, working together, generate audio signals from a textual input prompt. The first model learns to recover the audio input from a lower-dimensional representation. The second model trains a Transformer language model over the low-dimensional codes produced by the first model's encoder, learning to generate raw audio conditioned on the text input.
Figure 1: AudioGen framework (left: audio representation model, right: audio language model)
Stage 1: Raw audio encoding, audio representation model (Figure 1 left)
The audio compression model aims to create high-fidelity audio samples from lower-dimensional representations. It is trained end-to-end to reconstruct the original input audio from a compressed representation and has three essential components: the Encoder Network, the Quantization Layer, and the Decoder Network.
The Encoder Network processes an input audio segment. The signal is passed through convolutional blocks designed to extract descriptive features from the audio data, and the resulting feature sequence is then fed into an LSTM to model its sequential structure. The output of this process is a latent representation.
Following this, the Quantization Layer operates on the latent representation from the encoder. Using a vector quantization technique [11], this layer converts the latent representation into a compressed, discrete representation while preserving the critical elements of the data and considerably reducing its size.
Lastly, the Decoder Network reconstructs the time-domain signal from the compressed representation. The whole system is trained to minimize the reconstruction loss between the original input and the corresponding output sample.
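To make the structure of this stage concrete, the following is a minimal PyTorch sketch of the encoder-quantizer-decoder pipeline. The layer sizes, the single-codebook nearest-neighbour quantizer, and the L1 reconstruction loss are illustrative assumptions, not the exact configuration used in AudioGen.

```python
import torch
import torch.nn as nn

class AudioRepresentationModel(nn.Module):
    """Illustrative encoder -> quantizer -> decoder pipeline (all sizes are assumptions)."""
    def __init__(self, channels=64, codebook_size=1024):
        super().__init__()
        # Encoder: strided 1-D convolutions extract features and downsample the waveform.
        self.encoder_conv = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=7, stride=2, padding=3), nn.ELU(),
            nn.Conv1d(channels, channels, kernel_size=7, stride=2, padding=3), nn.ELU(),
        )
        # LSTM models the sequential structure of the latent frames.
        self.encoder_lstm = nn.LSTM(channels, channels, batch_first=True)
        # Codebook used to quantize the latent frames into discrete codes.
        self.codebook = nn.Embedding(codebook_size, channels)
        # Decoder: transposed convolutions reconstruct the time-domain signal.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(channels, channels, kernel_size=8, stride=2, padding=3), nn.ELU(),
            nn.ConvTranspose1d(channels, 1, kernel_size=8, stride=2, padding=3),
        )

    def forward(self, waveform):                      # waveform: (batch, 1, time)
        z = self.encoder_conv(waveform)               # (batch, channels, frames)
        z, _ = self.encoder_lstm(z.transpose(1, 2))   # (batch, frames, channels)
        # Quantize: replace each latent frame by its nearest codebook entry.
        book = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        codes = torch.cdist(z, book).argmin(dim=-1)   # discrete codes, (batch, frames)
        z_q = self.codebook(codes)                    # (batch, frames, channels)
        recon = self.decoder(z_q.transpose(1, 2))     # (batch, 1, ~time)
        return recon, codes

model = AudioRepresentationModel()
audio = torch.randn(2, 1, 16000)                      # two 1-second clips at 16 kHz
recon, codes = model(audio)
loss = nn.functional.l1_loss(recon, audio[..., :recon.shape[-1]])  # reconstruction loss
```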
Stage 2: Audio Language modeling (Figure 1 right)
In the audio language modeling stage, the authors train a text-conditioned Transformer language model over the codes produced by the encoder of the audio representation model trained in the first stage.
The text representation is obtained using a pre-trained T5 text encoder. Cross-attention between audio and text is added to each attention block of the Transformer.
The model is trained with a cross-entropy loss. The authors also apply the Classifier-Free Guidance (CFG) technique to control the trade-off between sample quality and diversity. During training, the model is optimized both conditionally and unconditionally: the text conditioning is omitted for a fraction of the training samples (the authors used a ratio of 10%).
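At sampling time, CFG combines the conditional and unconditional predictions with a guidance scale. The sketch below is a generic illustration of this formula rather than the authors' implementation; `model`, its `cond` argument, and the guidance scale are assumed names.

```python
import torch

def cfg_logits(model, tokens, text_emb, guidance_scale=3.0):
    """Classifier-free guidance at sampling time (generic sketch).

    `model(tokens, cond=...)` is assumed to return next-token logits; passing
    cond=None corresponds to the unconditional branch that was trained by
    dropping the text conditioning for a fraction of the training samples.
    """
    cond_logits = model(tokens, cond=text_emb)    # text-conditioned prediction
    uncond_logits = model(tokens, cond=None)      # unconditional prediction
    # Push the prediction further in the direction of the text conditioning.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)
```

A guidance scale of 1.0 recovers the purely conditional model; larger values trade diversity for closer adherence to the text prompt.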
Datasets
The datasets used for training and evaluation of the generated results include:
- AudioSet [12]
- BBC Sound Effects
- AudioCaps [13]
- VGG-Sound [14]
- FSD50K [15]
- Free To Use Sounds
- Sonniss Game Effects
- WeSoundEffects
- Paramount Motion - Odeon Cinematic Sound Effects
All audio files were sampled at 16 kHz. For textual descriptions, the authors used two types of annotation. The first is multi-label annotations, available for AudioSet, VGG-Sound, FSD50K, Sonniss Game Effects, WeSoundEffects, and Paramount Motion - Odeon Cinematic Sound Effects. The second is natural-language captions, available for AudioCaps, Clotho v2, Free To Use Sounds, and BBC Sound Effects.
Evaluation Results
Table 1: AudioGen evaluation results
The evaluation of the models was done using both objective and subjective metrics. For the objective evaluation, the authors used the Fréchet Audio Distance (FAD) [10], computed over real and generated samples.
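FAD fits a multivariate Gaussian to embeddings of the real clips and another to embeddings of the generated clips (the metric's reference implementation uses a pre-trained VGGish classifier to produce the embeddings) and reports the Fréchet distance between the two. A minimal numerical sketch, assuming the embeddings have already been extracted:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(real_emb, gen_emb):
    """Fréchet distance between Gaussians fit to real and generated embeddings.

    real_emb, gen_emb: arrays of shape (num_clips, embedding_dim),
    e.g. VGGish embeddings of the evaluation clips.
    """
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.cov(real_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```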
Diffusion-based approach for sound generation
Methodology
In the paper Diffsound: Discrete Diffusion Model for Text-to-Sound Generation [8], the authors propose a non-autoregressive, diffusion-based model that generates audio samples conditioned on textual input. They focus on a potential flaw of autoregressive models - the sequential nature of generation, where tokens are predicted one after another - and contrast it with a diffusion-based approach, in which all tokens can be predicted in parallel at each step, removing the dependence on previously generated tokens.
Figure 2: Diffsound framework architecture
The model consists of four crucial components: a Text Encoder, a Token Decoder, a Spectrogram Decoder, and a Vocoder.
The Text Encoder encodes the input text prompt into a vector representation that captures the contextual and semantic features of the text, providing a numerical representation for the subsequent stages of the model. For the text encoder, the authors evaluated the pre-trained BERT and CLIP (Contrastive Language-Image Pre-training) models. After empirically analyzing the outcomes, they concluded that CLIP was best at extracting descriptive sound features from the prompt.
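As an illustration, text features can be extracted with a publicly available pre-trained CLIP text encoder via the Hugging Face Transformers library; the specific checkpoint below is an assumption and not necessarily the one used by the authors.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Assumed checkpoint for illustration; Diffsound's exact CLIP variant may differ.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a dog barks while birds sing in the background"
tokens = tokenizer(prompt, padding=True, return_tensors="pt")
with torch.no_grad():
    # Per-token text features that downstream stages can attend to.
    text_features = text_encoder(**tokens).last_hidden_state  # (1, seq_len, 512)
```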
Figure 3: Variational Autoencoder with Vector Quantization (VQ-VAE)
To better understand the next two components, the Token Decoder and the Spectrogram Decoder, we first need to look at the VQ-VAE (Variational Autoencoder with Vector Quantization) [11] that the authors trained to compress audio spectrograms into a lower-dimensional, tokenized representation. Training this model yields an encoder that converts spectrograms into low-dimensional, discrete spectrogram tokens, and a decoder that converts spectrogram tokens back into audio spectrograms. The VQ-VAE differs from a traditional autoencoder in that the encoder outputs a discrete rather than continuous representation, and the prior is learnt rather than static. The discrete codes index into a learnt codebook that captures important features of the data. The pre-trained encoder and decoder of this model are later used separately in the Token Decoder and Spectrogram Decoder components.
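To make the quantization step concrete, here is a minimal sketch of the VQ-VAE bottleneck with the standard straight-through gradient estimator; the codebook size and loss weighting are generic defaults, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.25):
    """Map encoder outputs z_e (batch, n, dim) to their nearest codebook entries.

    Returns the quantized latents (with a straight-through gradient) and the
    standard VQ-VAE codebook + commitment losses.
    """
    # Nearest codebook entry for every latent vector.
    dists = torch.cdist(z_e, codebook.unsqueeze(0).expand(z_e.size(0), -1, -1))
    codes = dists.argmin(dim=-1)                     # discrete spectrogram tokens
    z_q = codebook[codes]                            # (batch, n, dim)
    # Codebook loss pulls embeddings toward encoder outputs; commitment loss
    # keeps the encoder close to the codebook.
    loss = F.mse_loss(z_q, z_e.detach()) + beta * F.mse_loss(z_e, z_q.detach())
    # Straight-through estimator: gradients flow to the encoder as if z_q == z_e.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, codes, loss
```

At inference time, only the discrete `codes` matter: they are the spectrogram tokens that the Token Decoder learns to predict and that the Spectrogram Decoder maps back to spectrograms.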
Figure 4: Diffusion-based Token Decoder model
The Token Decoder takes the vector representation from the Text Encoder and learns to translate it into a sequence of spectrogram tokens. To connect the text features with mel-spectrograms, the authors use the mel-spectrogram tokens produced by the pre-trained encoder of the VQ-VAE described above. The token decoder itself is a discrete diffusion model with a forward (corruption) process and a reverse (denoising) process.
Forward Process (Corruption Process): This process gradually corrupts the spectrogram tokens. In a series of steps, each discrete token is either kept, replaced by a random token, or replaced by a special [MASK] token according to a predefined schedule, until the original token sequence is destroyed. Because the tokens are discrete, the corruption is defined by transition probabilities rather than by adding Gaussian noise.
Reverse Process (Denoising Process): The reverse process aims to recover the original spectrogram tokens from the corrupted sequence. Given the corrupted tokens, the corruption schedule, and the text features, the model is trained to invert the corruption and reconstruct the original data. This is done iteratively, with the model denoising the token sequence step by step while predicting all tokens in parallel at each step.
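A single forward (corruption) step of such a discrete diffusion can be sketched as follows; the mask and replace probabilities below are placeholders rather than the schedule used in Diffsound.

```python
import torch

def corrupt_tokens(tokens, mask_prob, replace_prob, vocab_size, mask_id):
    """One forward (corruption) step of a discrete diffusion over spectrogram tokens.

    With probability `mask_prob` a token becomes the special [MASK] token, with
    probability `replace_prob` it is swapped for a random token, and otherwise
    it is kept unchanged.
    """
    u = torch.rand(tokens.shape)
    out = tokens.clone()
    replace = (u >= mask_prob) & (u < mask_prob + replace_prob)
    out[replace] = torch.randint(vocab_size, tokens.shape)[replace]
    out[u < mask_prob] = mask_id
    return out

# Example: corrupt a sequence of 8 spectrogram tokens drawn from a 1024-entry codebook.
tokens = torch.randint(1024, (1, 8))
noisy = corrupt_tokens(tokens, mask_prob=0.3, replace_prob=0.1, vocab_size=1024, mask_id=1024)
```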
The Spectrogram Decoder converts the token sequence back into actual spectrograms. For this, the authors use the decoder of the VQ-VAE trained earlier.
The final component of the model is the Vocoder, which converts the spectrogram into a raw audio waveform. As a basis for the Vocoder, the authors adopted the MelGAN architecture. Because the publicly available MelGAN model is pre-trained on speech data, the authors found it unsuitable for general audio signals and therefore retrained it on a large-scale audio dataset (AudioSet).
Through these stages of transformation — text encoding, token decoding, spectrogram decoding, and vocoding — the model effectively captures the nuances of the input text and tries to reproduce them in the synthesized audio output.
Datasets
The datasets used for training and evaluation of the generated results include:
- AudioSet [12]
- AudioCaps [13]
All audio clips in the two datasets are sampled at 22.05 kHz and padded to 10 seconds. Log mel-spectrograms are extracted using a 1024-point Hanning window with a 256-point hop size.
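The preprocessing described above can be reproduced with a standard audio library such as librosa; the number of mel bins below is an assumption, since it is not restated here, and `example.wav` is a placeholder file.

```python
import librosa
import numpy as np

# Load a clip at 22.05 kHz and pad/trim it to 10 seconds, as in the paper's setup.
audio, sr = librosa.load("example.wav", sr=22050)   # "example.wav" is a placeholder path
audio = librosa.util.fix_length(audio, size=10 * sr)

# Log mel-spectrogram with a 1024-point Hanning window and a 256-point hop.
mel = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_fft=1024, hop_length=256, window="hann",
    n_mels=80,                                       # number of mel bins is an assumption
)
log_mel = np.log(mel + 1e-6)                         # small offset avoids log(0)
```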
Evaluation Results
Table 2: Diffsound evaluation results
Diffusion-based approach for music generation
Methodology
The authors of the paper Noise2Music: Text conditioned music generation with diffusion models [9], introduce a novel approach to generating high-quality 30-second music clips from text prompts. This approach involves training two models in succession: the Generator model and the Cascader model.
Figure 5: Noise2Music architecture
The Generator model is responsible for creating an intermediate representation conditioned on the text prompt: given a text prompt, it generates a compressed representation of a 30-second waveform. The authors explored two options for this intermediate representation: a log-mel spectrogram or a low-fidelity 3.2 kHz waveform.
The Cascader model, on the other hand, is tasked with generating high-fidelity audio. It does this by conditioning on the intermediate representation produced by the Generator model and, optionally, the text prompt. In the final cascade step, the 16 kHz waveform is upsampled to 24 kHz audio.
The architecture of both models is based on a 1-D Efficient U-Net, which consists of a series of down-sampling and up-sampling blocks connected by residual (skip) connections. The diffusion models are conditioned on user prompts in the form of free-form text, which is encoded by a pre-trained language model and ingested by the 1-D U-Net layers via cross-attention.
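To illustrate how the text conditioning enters the network, below is a simplified sketch of one 1-D down-sampling block with cross-attention over the text embeddings. The channel sizes, attention configuration, and block layout are illustrative and do not reproduce the Efficient U-Net hyperparameters from the paper.

```python
import torch
import torch.nn as nn

class DownBlock1D(nn.Module):
    """Simplified 1-D U-Net down-sampling block with text cross-attention."""
    def __init__(self, channels=128, text_dim=512, heads=4):
        super().__init__()
        self.res_conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.SiLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )
        # Cross-attention: audio frames attend to the encoded text prompt.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=channels, num_heads=heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )
        self.downsample = nn.Conv1d(channels, channels, kernel_size=4, stride=2, padding=1)

    def forward(self, x, text_emb):          # x: (batch, channels, time), text_emb: (batch, tokens, text_dim)
        x = x + self.res_conv(x)             # residual convolutional sub-block
        h = x.transpose(1, 2)                # (batch, time, channels) for attention
        attn, _ = self.cross_attn(query=h, key=text_emb, value=text_emb)
        x = (h + attn).transpose(1, 2)       # inject text information, back to conv layout
        return self.downsample(x)            # halve the temporal resolution

block = DownBlock1D()
audio_latent = torch.randn(2, 128, 1024)     # noisy audio representation
text_emb = torch.randn(2, 16, 512)           # pre-trained language-model embeddings of the prompt
out = block(audio_latent, text_emb)          # (2, 128, 512)
```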
The input to the model is stacked and contains both the audio and the conditioning signals. As noted above, the authors tried two approaches for the intermediate representation: a mel-spectrogram and a low-fidelity waveform.
Similar to AudioGen, the authors also used the Classifier-Free Guidance (CFG) method described earlier to control the trade-off between the quality and diversity of the generated samples.
Datasets
The authors used different datasets for training and evaluation. For training, they collected 6.8 million music audio sources. From each audio source, they took 30-second clips, resulting in a total of roughly 340k hours of music for training. The audio was sampled at 16 kHz for the Generator model and 24 kHz for the Cascader model. To label the music samples, they took song attributes such as title, genre, artist name, and instrument, and converted them into pseudo sentences, for example, "music for highway driving." These sentences were then used as training labels.
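As a toy illustration of how metadata fields can be turned into such pseudo-sentence labels (the templates below are invented for illustration; the paper's actual labeling pipeline is more involved):

```python
def pseudo_sentence(metadata):
    """Turn song metadata into a short pseudo-sentence label (illustrative templates only)."""
    parts = []
    if metadata.get("genre"):
        parts.append(f"{metadata['genre']} music")
    if metadata.get("instrument"):
        parts.append(f"featuring {metadata['instrument']}")
    if metadata.get("mood"):
        parts.append(f"for {metadata['mood']}")
    return " ".join(parts) or "music"

print(pseudo_sentence({"genre": "lo-fi", "instrument": "electric piano", "mood": "highway driving"}))
# -> "lo-fi music featuring electric piano for highway driving"
```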
For the evaluation, they used the MusicCaps dataset [16], which consists of ten-second music clips paired with manually written captions.
Evaluation Results
Table 3: Noise2Music evaluation results
Comparison
| Model | Strengths | Weaknesses | Link to cherry-picked examples |
|---|---|---|---|
| AudioGen | Direct, intuitive autoregressive text-to-audio generation | Sequential, token-by-token generation | https://felixkreuk.github.io/audiogen/ (includes sound samples generated by AudioGen) |
| Diffsound | Non-autoregressive; predicts all spectrogram tokens in parallel | More complex multi-stage pipeline (VQ-VAE, token decoder, vocoder) | https://felixkreuk.github.io/audiogen/ (includes sound samples generated by Diffsound) |
| Noise2Music | Generates high-quality 30-second music clips from text | Requires two cascaded diffusion models trained on very large amounts of data | https://google-research.github.io/noise2music (includes music samples generated by Noise2Music) |
Review
All three models demonstrate significant advances in sound and music generative modeling, each offering unique advantages. While autoregressive models offer a direct and intuitive approach to generating audio from text, they may face limitations due to their sequential nature. Diffusion-based models tackle this limitation by predicting tokens in parallel, yet they require more complex pipelines. The choice between these models largely depends on the specific requirements of the task at hand, including the desired quality of the generated audio, the resources available, and the need for real-time applicability. Nevertheless, all three models objectively advance the state of sound generative modeling, and at least two offer subjectively promising generated results.
References
- Zhou, Y., Wang, Z., Fang, C., Bui, T., & Berg, T. L. (2018). Visual to sound: Generating natural sound for videos in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3550-3558).
- Sterling, A., Rewkowski, N., Klatzky, R. L., & Lin, M. C. (2019). Audio-material reconstruction for virtualized reality using a probabilistic damping model. IEEE transactions on visualization and computer graphics, 25(5), 1855-1864.
- Liu, X., Iqbal, T., Zhao, J., Huang, Q., Plumbley, M. D., & Wang, W. (2021, October). Conditional sound generation using neural discrete time-frequency representation learning. In 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP) (pp. 1-6). IEEE.
- Matinfar, S., et al. (2017). Surgical soundtracks: Towards automatic musical augmentation of surgical procedures. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2017, 20th International Conference, Quebec City, QC, Canada, September 11-13, 2017, Proceedings, Part II. Springer International Publishing.
- Maack, L., Holstein, L., & Schlaefer, A. (2022). GANs for generation of synthetic ultrasound images from small datasets. Current Directions in Biomedical Engineering, 8(1), 17-20.
- Narváez, P., & Percybrooks, W. S. (2020). Synthesis of normal heart sounds using generative adversarial networks and empirical wavelet transform. Applied Sciences, 10(19), 7003.
- Kreuk, F., Synnaeve, G., Polyak, A., Singer, U., Défossez, A., Copet, J., ... & Adi, Y. (2022). Audiogen: Textually guided audio generation. arXiv preprint arXiv:2209.15352
- Yang, D., Yu, J., Wang, H., Wang, W., Weng, C., Zou, Y., & Yu, D. (2023). Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
- Huang, Q., Park, D. S., Wang, T., Denk, T. I., Ly, A., Chen, N., ... & Han, W. (2023). Noise2music: Text-conditioned music generation with diffusion models. arXiv preprint arXiv:2302.03917.
- Kilgour, K., Zuluaga, M., Roblek, D., & Sharifi, M. (2019, August). Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms. In INTERSPEECH (pp. 2350-2354).
- Van Den Oord, A., & Vinyals, O. (2017). Neural discrete representation learning. Advances in neural information processing systems, 30.
- Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., ... & Ritter, M. (2017, March). Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 776-780). IEEE.
- Kim, C. D., Kim, B., Lee, H., & Kim, G. (2019, June). Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 119-132).
- Chen, H., Xie, W., Afouras, T., Nagrani, A., Vedaldi, A., & Zisserman, A. (2021). Localizing visual sounds the hard way. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16867-16876).
- Fonseca, E., Favory, X., Pons, J., Font, F., & Serra, X. (2021). Fsd50k: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 829-852.
- Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., ... & Frank, C. (2023). Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325.