Author: Gökçe Şengün
Supervisor: Sasan Matinfar
This blog post presents deep learning (DL) techniques applied to the tasks of audio segmentation and sound event detection. The modern medical landscape is replete with sounds, from the steady rhythm of a heartbeat to the critical alerts of machines, and interpreting this soundscape is pivotal for patient monitoring and diagnostics. Our latest research dives into the role of DL in dissecting and understanding this auditory information through advanced audio segmentation and sound event detection techniques. We have developed innovative algorithms that not only distinguish between different acoustic events but also accurately pinpoint their occurrence over time. This breakthrough paves the way for real-time monitoring systems and automated diagnostic tools, potentially transforming patient care through enhanced detection of anomalies and rapid response to medical emergencies.
- Introduction
- Motivation
- A review of deep learning techniques in audio event recognition (AER) applications
- General Overview
- Architecture
- Dataset
- Evaluation
- You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection
- General Overview
- Architecture
- Dataset
- Evaluation
- The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks
- General Overview
- Architecture
- Dataset
- Evaluation
- Comparison and Discussion
Introduction
What is Audio Segmentation?
Audio segmentation is the process of dividing an audio signal into meaningful segments based on various acoustic characteristics. These characteristics can include pitch, energy, and rhythm, among others. The purpose of audio segmentation is to identify boundaries between different sounds or events within an audio stream and separate them into distinct segments.
| Figure 1: Audio Segmentation [1]. |
What is Sound Event Detection?
Sound Event Detection (SED) is a technology used to identify and classify specific sounds within an audio recording. It involves detecting the presence of certain sounds, determining their duration, and categorizing them into predefined classes. The process typically includes signal processing, feature extraction, and applying machine learning.
| Figure 2: A schematic illustration of sound event detection in real-life audio [2]. |
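To make these definitions concrete, here is a minimal, illustrative sketch of energy-based segmentation: frames whose RMS energy exceeds a threshold are merged into "sound" segments with start and end times. The file name and threshold are placeholders, and real systems replace this simple heuristic with the learned models discussed in the rest of this post.

```python
import numpy as np
import librosa

def energy_segments(path, sr=16000, threshold_db=-35.0):
    """Mark each frame as sound/silence by RMS energy and merge runs into segments."""
    y, sr = librosa.load(path, sr=sr)
    hop = 512
    rms_db = librosa.amplitude_to_db(librosa.feature.rms(y=y, hop_length=hop)[0])
    active = rms_db > threshold_db

    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                                  # segment opens
        elif not is_active and start is not None:
            segments.append((start * hop / sr, i * hop / sr, "sound"))
            start = None                               # segment closes
    if start is not None:                              # audio ends mid-segment
        segments.append((start * hop / sr, len(active) * hop / sr, "sound"))
    return segments

# Hypothetical recording; prints a list like [(0.4, 1.9, 'sound'), ...]
print(energy_segments("monitor_audio.wav"))
```

A sound event detection system additionally assigns each segment a class label (e.g., "alarm" or "cough") instead of the generic "sound" used here.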
Motivation
The core motivation for this study is to harness the often-overlooked potential of auditory information. By analyzing sounds like heart murmurs, lung wheezes, or even changes in voice, we can unearth vital clues about a patient's health, clues that are sometimes not visible. Employing deep learning techniques in audio segmentation and sound event detection enriches our understanding, enabling earlier detection of conditions and more effective treatment plans. This innovative approach is about elevating patient care by blending the visual narrative with the revealing story told by sounds, creating a fuller, more accurate picture of health.
Importance in medical applications
In medical applications, integrating audio data analysis with visual assessments offers a more complete understanding of patient health. Audio cues, like heartbeats or respiratory sounds, can reveal early signs of health issues, often detectable before visual symptoms emerge. Leveraging deep learning for audio segmentation and sound event detection in medical settings enhances early diagnosis, allowing for timely interventions. This approach promises improved patient outcomes by ensuring comprehensive monitoring and accurate detection of subtle health changes.
Current Challenges
- Complex Acoustic Environments: Medical environments like hospitals and clinics are characterized by diverse and unpredictable acoustic settings. These environments often feature a mix of overlapping sounds, including equipment beeps, conversations, and patient-specific noises like coughs or heartbeats. The primary challenge lies in accurately distinguishing and segmenting medically relevant sounds amidst this complex background noise.
- Real-Time Computational Demands: Audio segmentation and sound event detection in medical contexts often necessitate real-time processing of extensive audio data. This is crucial in situations such as emergency rooms or intensive care units, where prompt and accurate analysis can be vital. The challenge is to analyze lengthy audio streams with high precision and minimal latency.
- Ethical Concerns in Voice Monitoring: Implementing voice and audio monitoring in healthcare raises various ethical issues, particularly regarding privacy and informed consent. Establishing transparent policies for the collection, storage, and usage of audio data is essential to maintain patient trust and adhere to legal standards, including HIPAA (Health Insurance Portability and Accountability Act) regulations.
A review of deep learning techniques in audio event recognition (AER) applications
General Overview
The convergence of audio signal processing and deep learning has advanced the field of machine perception, particularly in Audio Event Recognition (AER). AER is pivotal in interpreting complex environments for applications ranging from human-machine interaction to surveillance. It faces challenges like background noise and overlapping sound events, demanding robust algorithms. Deep learning techniques, such as CNNs and LSTMs, have proven effective, offering resilience against such noise disturbances, thus enhancing the accuracy and reliability of AER systems.
Methodology
| Figure 3: Methodology of an Audio Event Recognition (AER) system [3]. |
The methodology in the diagram for Audio Event Recognition (AER) involves several stages: Audio data is fed into the system where it undergoes feature extraction. Key features such as Mel Frequency Cepstral Coefficients (MFCCs), Zero Crossing Rate (ZCR), spectral features, constant-Q cepstral coefficients (CQCC), and semantic features are identified. These features are then input into a classification system that utilizes deep learning models such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and advanced networks like VGGNet, ResNet, and MobileNet to accurately predict the class labels of audio events.
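As a rough illustration of the feature-extraction stage, the following Python sketch (using librosa, with a hypothetical file name) computes MFCCs, the zero crossing rate, and the spectral centroid, and stacks them into a frame-level feature matrix that a downstream classifier could consume. The feature set and frame parameters are illustrative choices, not a prescription from the reviewed paper.

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=13):
    """Build a frame-level feature matrix from MFCCs, ZCR, and spectral centroid."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)       # (13, n_frames)
    zcr = librosa.feature.zero_crossing_rate(y)                  # (1, n_frames)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)     # (1, n_frames)
    # Stack per-frame features; a CNN/RNN classifier consumes this matrix.
    return np.vstack([mfcc, zcr, centroid]).T                    # (n_frames, 15)

features = extract_features("ward_recording.wav")  # hypothetical file
print(features.shape)
```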
Architecture
Audio Feature Extraction
Feature extraction is a cornerstone of Audio Event Recognition (AER), with techniques tailored to specific applications. Mel Frequency Cepstral Coefficients (MFCCs) play a prominent role, simulating the human auditory response to sound, and are crucial in speech recognition. They are derived via a Fourier transform, mel-scale application, and Discrete Cosine Transform to condense the audio signal into representative data. CQCCs complement MFCCs by offering detailed low-frequency information. High-level semantic features aim to contextualize sounds, crucial for discerning complex events. Combined, these multifaceted features form an elaborate framework for AER, enhancing the accuracy and applicability of audio analysis.
| Figure 4: The Mel Frequency Cepstral Coefficients from an audio signal [4]. |
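The MFCC derivation described above can also be sketched step by step. The following simplified version follows the standard recipe (short-time Fourier transform, mel filterbank, log compression, Discrete Cosine Transform); the window sizes, number of mel bands, and number of coefficients are arbitrary illustrative choices, and the input file is hypothetical.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

y, sr = librosa.load("heartbeat.wav", sr=16000)   # hypothetical mono recording

# 1. Short-time Fourier transform -> power spectrogram.
power = np.abs(librosa.stft(y, n_fft=512, hop_length=256)) ** 2

# 2. Mel filterbank warps the linear frequency axis onto the mel scale.
mel_fb = librosa.filters.mel(sr=sr, n_fft=512, n_mels=40)
mel_power = mel_fb @ power

# 3. Log compression followed by a DCT condenses each frame into a
#    compact set of cepstral coefficients.
log_mel = np.log(mel_power + 1e-10)
mfcc = dct(log_mel, type=2, axis=0, norm="ortho")[:13]   # keep 13 coefficients

print(mfcc.shape)  # (13, n_frames)
```

In practice, `librosa.feature.mfcc` wraps these steps into a single call; the explicit version is shown only to mirror the derivation in the text.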
Classifiers
- Traditional Models: Hidden Markov Models (HMMs), Gaussian Mixture Models (GMMs), Support Vector Machines (SVMs), and basic Artificial Neural Networks (ANNs)
- Popular Deep Architectures: Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), along with their variations like Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs)
- Optimization and Noise Resilience: Particle Swarm Optimization (PSO) for parameter tuning
- Overcoming Gradient Issues: Residual Neural Networks (ResNets)
Classifiers in AER have evolved, with deep learning models surpassing traditional ones like HMMs and SVMs. CNNs, RNNs, LSTMs, and GRUs are preferred for their sequential data processing prowess. PSO optimizes these models, while ResNets address vanishing gradients via skip connections. Knowledge distillation enhances model performance by condensing complex model insights into simpler networks. Despite advances, developing a unified AER framework that handles diverse audio inputs and noisy conditions remains a challenge.
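To ground these ideas, here is a minimal PyTorch sketch of a CNN classifier over log-mel spectrogram patches with a single ResNet-style skip connection. The layer sizes and class count are illustrative and not taken from any of the reviewed systems.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv layers with a skip connection, as in ResNet-style classifiers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection mitigates vanishing gradients

class AudioEventCNN(nn.Module):
    """Toy CNN: log-mel spectrogram patch -> event-class logits."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU())
        self.res = ResidualBlock(32)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(32, n_classes)

    def forward(self, x):               # x: (batch, 1, n_mels, n_frames)
        h = self.pool(self.res(self.stem(x))).flatten(1)
        return self.fc(h)

logits = AudioEventCNN()(torch.randn(4, 1, 64, 128))
print(logits.shape)  # torch.Size([4, 10])
```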
Datasets
The review covers eight datasets used for AER tasks:
- Chernobyl Dataset (2019): Contains 6620 audio files from various locations in Chernobyl, capturing different radiation levels and natural soundscapes.
- Warblr Dataset (2019): A UK-wide, crowd-sourced dataset with 10,000 audio files for automatic bird species classification, featuring recordings from mobile devices with varying noise levels.
- PolandNFC Dataset (2019): Recordings of autumn nocturnal bird migration collected by one of the authors; the full collection exceeds 3,200 hours, but only a small subset of 22 audio clips is manually annotated for research.
- Sound Events for Surveillance Applications (SESA) Dataset (2020): Comprises 480 training and 105 testing files from Freesound, with four classes including Casual, Gunshot, Explosion, and Siren, for recognizing suspicious activities.
- MIVIA Road Audio Events Dataset (2020): Holds 400 events for road surveillance applications, such as tire skidding and car accidents.
- GTZAN Dataset (2021): Describes various classes used in audio fingerprinting, including Music, Genres, and Speech.
- DCASE-2021: Automated Audio Captioning Dataset (2017): Contains sounds from crowdsourcing environments, including classes like muffled sounds, big car noises, people talking in a small room, and the ringing of a clock.
- FSC22 Dataset (2023): A forest sound dataset whose main classes (mechanical sounds, forest threats, environmental sounds, human sounds, animal sounds, and vehicle sounds) are further divided into 34 sub-classes.
Applications of audio event recognition task
Audio Surveillance
Audio surveillance is emerging as an essential complement to video surveillance, especially in scenarios where visual detection is obstructed or insufficient. This method proves more cost-effective and efficient, requiring lower bandwidth and computational resources. Notable advancements include machine learning models for industrial leak detection, deep learning algorithms for urban noise classification, and CNN-based systems for recognizing critical sounds like gunshots or screams, even in noisy environments. These developments underscore the value of audio analysis in enhancing detection capabilities and overcoming the limitations of visual-only surveillance systems.
Audio Fingerprinting
Audio fingerprinting involves creating a unique signature for audio streams, allowing efficient and accurate identification of sounds. It requires robustness to noise, unique fingerprints for each sound, fast database querying, versatility in recognition, and sensitivity to audio alterations. A system proposed by Altalbe and Ali uses a Least Mean Square filter and double-threshold segmentation for preprocessing, followed by a short-time Fourier transform for unique spectral coding and an LSTM network for classification, achieving an accuracy of 98.56%. This research paves the way for advanced audio monitoring in tasks like fingerprinting and source localization, highlighting the potential of representation learning methods in enhancing audio analysis.
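The exact pipeline of Altalbe and Ali is not reproduced here, but a sketch in the same spirit conveys the general shape: magnitude STFT frames serve as a spectral code, and an LSTM classifies the resulting frame sequence. The file name, class count, and layer sizes below are placeholders, and the preprocessing (LMS filtering, double-threshold segmentation) is omitted.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

# Spectral "code": a sequence of magnitude STFT frames for one clip.
y, sr = librosa.load("alarm_clip.wav", sr=16000)              # hypothetical clip
spec = np.abs(librosa.stft(y, n_fft=512, hop_length=256)).T   # (frames, 257)

class FingerprintLSTM(nn.Module):
    """LSTM over spectral frames -> one label per recording (toy stand-in)."""
    def __init__(self, n_bins=257, hidden=128, n_classes=5):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, frames, n_bins)
        _, (h, _) = self.lstm(x)
        return self.fc(h[-1])              # classify from the final hidden state

model = FingerprintLSTM()
logits = model(torch.from_numpy(spec).float().unsqueeze(0))
print(logits.shape)  # torch.Size([1, 5])
```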
Audio Spoofing
Audio spoofing, a technique for deceiving speaker verification systems by manipulating audio signals, poses a significant challenge. Alzantot et al. developed a recognition system using Deep Residual Neural Networks, creating models like MFCC-ResNet and Spec-ResNet to distinguish between authentic and spoofed audio. These models were trained on diverse datasets and showed varying effectiveness depending on the proximity of the speaker and the quality of playback devices, with Spec-ResNet generally outperforming others. A major finding is the difficulty in developing a unified system capable of efficiently differentiating between synthetic and genuine speech across various environmental sounds, indicating a need for more adaptable and comprehensive audio analysis frameworks.
Evaluation
In summary, the paper highlights advancements in Audio Event Recognition with a shift towards deep learning, surpassing traditional machine learning methods. Challenges include the need for extensive training data and a unified feature extraction framework. The focus on developing comprehensive datasets for complex, real-world scenarios like surveillance and forensics paves the way for applying these advances in medical settings. Enhanced AER systems could significantly improve patient monitoring and diagnostics by accurately processing complex audio cues in healthcare environments, leading to better patient care.
You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection
General Overview
The paper "You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection" introduces the YOHO algorithm, an innovative approach inspired by the YOLO algorithm from computer vision. YOHO transforms sound event detection into a regression problem, rather than the traditional frame-based classification. This method significantly enhances the speed and accuracy of detecting acoustic classes and their boundaries, making it suitable for real-time applications. The paper also discusses potential improvements and diverse applications for YOHO, suggesting its adaptability for various scenarios, including medical emergencies and environmental monitoring. This advancement in audio event detection promises more efficient and versatile sound analysis tools, improving response times and accuracy in critical applications.
Architecture
The paper showcases a comparative analysis between traditional segmentation-by-classification and the advanced YOHO method for audio event detection.
Segmentation-by-classification processes audio spectrograms frame by frame, assigning each frame a classification of music, speech, or silence, which then requires additional post-processing to delineate event boundaries. YOHO, on the other hand, employs a regression approach to directly determine the presence and boundaries of acoustic events within larger 0.307-second blocks of the spectrogram, significantly reducing the number of predictive outputs from 801 to 26 and enhancing processing speed and generalization capabilities.
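The regression formulation can be illustrated with a small target-encoding sketch: each (label, onset, offset) annotation is mapped onto 0.307-second blocks as a presence flag plus normalized start and end offsets, yielding the 26x6 target for music-speech detection. The exact ordering of the output channels is an assumption made for illustration; the paper gives the authoritative definition.

```python
import numpy as np

BLOCK = 0.307    # seconds per output block, as in the paper
N_BLOCKS = 26    # output time steps for an ~8 s example
CLASSES = ["speech", "music"]

def encode_targets(events):
    """Map (label, onset, offset) events to a (26, 3 * n_classes) YOHO-style target:
    per block and class -> [presence, normalized start, normalized end]."""
    target = np.zeros((N_BLOCKS, 3 * len(CLASSES)), dtype=np.float32)
    for label, onset, offset in events:
        c = CLASSES.index(label)
        for b in range(N_BLOCKS):
            b_start, b_end = b * BLOCK, (b + 1) * BLOCK
            s, e = max(onset, b_start), min(offset, b_end)
            if e > s:                                         # event overlaps this block
                target[b, 3 * c] = 1.0                        # presence flag
                target[b, 3 * c + 1] = (s - b_start) / BLOCK  # start within block
                target[b, 3 * c + 2] = (e - b_start) / BLOCK  # end within block
    return target

y = encode_targets([("speech", 0.2, 3.1), ("music", 2.5, 7.9)])
print(y.shape)  # (26, 6)
```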
Adopting the MobileNet architecture, YOHO functions as a convolutional neural network, utilizing log-mel spectrograms for input and incorporating original and YOHO-specific layers. It begins with a 2D convolution that halves the time and frequency dimensions and progresses through various layers that reduce the filter count, leading to a 26x6 output layer. This structure allows YOHO to predict acoustic events more efficiently, facilitating its application in scenarios where rapid and accurate audio detection is crucial, such as medical diagnostics and real-time monitoring systems.
| Figure 5: A comparison of segmentation-by-classification and YOHO [5]. |
The YOHO algorithm's output layer uses binary classification to detect the presence of sound classes and regression to determine their boundaries in time steps. The uniform network structure applies sigmoid functions to keep outputs between 0 and 1, ensuring consistency across different inputs. The loss function combines classification and regression errors for each sound class, measuring the algorithm's accuracy in identifying the start and stop points of audio events.
| Figure 6: The output layer of the YOHO algorithm [6]. |
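A loss in this spirit can be sketched as a squared-error term on the presence outputs plus a squared-error term on the start/end outputs that only counts where the class is actually present. This is a simplified stand-in for the paper's formulation, assuming sigmoid outputs in [0, 1] and the channel layout from the encoding sketch above.

```python
import torch

def yoho_like_loss(pred, target, n_classes=2):
    """Presence (classification) error plus presence-masked boundary (regression) error.
    pred/target: (batch, 26, 3 * n_classes), all values in [0, 1]."""
    loss = torch.zeros((), dtype=pred.dtype)
    for c in range(n_classes):
        presence_t = target[..., 3 * c]
        presence_p = pred[..., 3 * c]
        loss = loss + ((presence_p - presence_t) ** 2).mean()      # classification term
        boundaries_t = target[..., 3 * c + 1:3 * c + 3]
        boundaries_p = pred[..., 3 * c + 1:3 * c + 3]
        mask = presence_t.unsqueeze(-1)                            # only where class is present
        loss = loss + (mask * (boundaries_p - boundaries_t) ** 2).mean()
    return loss

pred = torch.sigmoid(torch.randn(4, 26, 6))   # dummy network output
target = torch.rand(4, 26, 6).round()         # dummy target for a smoke test
print(yoho_like_loss(pred, target))
```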
Datasets
The paper evaluates the robustness of the YOHO algorithm on multiple datasets:
- Music-Speech Detection: Contains audio tracks with annotated sections of music and speech to test the algorithm's ability to differentiate between the two.
- TUT Sound Event Detection: Features urban environmental sounds with precise annotations, used to assess detection of common city noises.
- Urban-SED: A synthetic dataset with urban sounds, designed to simulate city noise environments for testing sound event detection.
Results
The YOHO algorithm demonstrates superior performance in music-speech detection, achieving state-of-the-art results on the MIREX dataset and surpassing various CNN and CRNN models. Despite the challenging TUT sound event detection task, YOHO still outperformed competitors, although not achieving the top result compared to methods using CapsNet. In the Urban-SED dataset, YOHO again excelled, achieving the highest overall F-measure and leading in several sound classes. Future improvements could include adopting techniques from the best-performing models, such as envelope estimation and weakly supervised learning, to further enhance YOHO's performance.
| Figure 7: Efficiency comparison of YOHO, CNN, and CRNN models [7]. |
The graph presents a comparison of inference times for the YOHO, CNN, and CRNN models when applied to music-speech detection tasks, with the testing conducted on Google Colab. YOHO stands out as the fastest model on both GPU and CPU, significantly reducing the inference time per hour of audio compared to the CNN and CRNN models. Notably, YOHO's efficiency is enhanced by its regression-based approach, which directly predicts acoustic boundaries, leading to much faster post-processing and smoothing times. These results emphasize YOHO's potential for applications that require rapid audio event detection and processing.
Evaluation
In the context of medical applications, the YOHO algorithm has demonstrated significant promise by efficiently identifying and segmenting crucial audio events like speech and music, indicating potential for discerning medically relevant sounds amidst diverse acoustics. Although challenges persist in analyzing complex environmental sounds, YOHO’s shift from traditional classification to a regression-based approach offers a faster and potentially more accurate alternative for real-time medical audio analysis. Future enhancements may include integrating advanced neural network structures to refine its applicability to tasks like monitoring vital signs or detecting anomalies in patient sounds, aiming to support swift medical interventions.
The study shows YOHO's superiority over the CRNN model in music-speech detection and environmental sound event detection. YOHO excelled in music-speech detection with a large and diverse training set, achieving state-of-the-art performance on the MIREX 2018 dataset. Environmental sound detection proved more challenging due to a greater number of acoustic classes and smaller, lower-quality training sets, yet YOHO still outperformed CRNN and CNN.
YOHO's approach shifts from frame-based classification to regression, which is a novel paradigm in this field, and while it hasn't reached state-of-the-art in all datasets, the potential for optimization is significant. Attempts to integrate B-GRU blocks into the regression-based CRNN did not improve results, suggesting other structures like CNN-transformers could be more effective.
YOHO is much faster than CNN and CRNN, with fewer outputs to predict and simpler post-processing, making it suitable for real-time applications. The paper concludes by suggesting future improvements to YOHO, such as incorporating ResNets or Inception blocks, and extending the approach to other tasks like singing voice detection and exploring combinations with source separation and semi-supervised learning.
The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks
General Overview
The cocktail party problem aims to isolate any source of interest within a complex acoustic scene. The human auditory system has the extraordinary ability to do this almost effortlessly, for example, during cocktail parties.
The cocktail fork problem tackles a related audio separation challenge: distinguishing speech, music, and sound effects within complex soundtracks. This separation is crucial for clear audio in professional and medical settings, such as enhancing speech clarity in noisy environments or transcribing medical consultations. The "Divide and Remaster" (DnR) dataset is introduced, providing a foundation for developing and testing source separation algorithms. While initial results are promising based on synthetic data, the technology's application in medical tools for analyzing patient sounds presents an exciting future direction.
| Figure 8: The Cocktail Fork Problem [8]. |
Architecture
CrossNet unmix (XUMX)
The paper benchmarks source separation models, highlighting the effectiveness of the CrossNet unmix (XUMX) architecture and introducing the Multi-Resolution CrossNet (MRX) extension.
The CrossNet unmix (XUMX) model is a deep learning-based architecture designed for audio source separation, particularly in music. It excels at isolating individual sound sources like musical instruments or vocals from complex audio mixtures. XUMX leverages neural networks to analyze and differentiate overlapping sounds, making it a valuable tool in music analysis and production for its ability to extract distinct audio elements from dense soundtracks.
| Figure 9: CrossNet unmix (XUMX) architecture and the Multi-Resolution CrossNet (MRX) extension, highlighted in red [9]. |
Multi-Resolution CrossNet (MRX)
The MRX architecture adds a direct connection from the averaged inputs to the averaged outputs, a shortcut not present in XUMX. This shortcut eases training by preventing the optimization from getting stuck and by keeping the original sound details intact, which makes the model better at picking out different sounds from a mix and speeds up learning.
In summary, the CrossNet unmix (XUMX) architecture serves as a foundational model for proficient audio source separation, with the Multi-Resolution CrossNet (MRX) advancing this further by enhancing learning efficacy and preserving the original audio's integrity, vital for high-fidelity audio processing applications. The XUMX architecture demonstrates versatility in the Divide and Remaster (DnR) dataset, while the MRX stands out for its consistently superior performance in benchmarks.
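The multi-resolution idea can be sketched as computing magnitude spectrograms at several STFT window lengths, each of which would feed its own encoder branch before the source estimates are averaged. The window lengths below are illustrative rather than the paper's exact settings, and the dummy waveform stands in for a real soundtrack.

```python
import torch

def multi_resolution_specs(waveform, sr=44100, win_ms=(32, 64, 256)):
    """Magnitude spectrograms at several STFT resolutions (window lengths illustrative).
    An MRX-style model encodes each resolution separately and averages the estimates."""
    specs = []
    for ms in win_ms:
        n_fft = int(sr * ms / 1000)
        window = torch.hann_window(n_fft)
        spec = torch.stft(waveform, n_fft=n_fft, hop_length=n_fft // 4,
                          window=window, return_complex=True)
        specs.append(spec.abs())   # (freq_bins, frames); bin count differs per resolution
    return specs

mix = torch.randn(44100 * 3)       # 3 s of dummy audio standing in for a soundtrack
for s in multi_resolution_specs(mix):
    print(s.shape)
```

Short windows resolve fast transients in sound effects, while long windows resolve the fine pitch structure of music, which is the intuition behind combining several resolutions.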
Datasets
The Divide and Remaster (DnR) dataset incorporates three distinct, established audio datasets:
- LibriSpeech, utilized for speech recognition, consists of readings from public domain audiobooks.
- Free Music Archive (FMA), used for music, features a collection of over 100,000 stereo tracks spanning 161 genres, all sampled at 44.1 kHz.
- Freesound Dataset 50k (FSD50K), selected for sound effects, categorizes sounds into three main types: prominent foreground sounds like dog barks, ambient background noises such as traffic, and specific categories including musical instruments and human speech.
Evaluation
The evaluation of the Multi-Resolution CrossNet (MRX) model demonstrated its effectiveness in audio source separation, outperforming other models across different resolutions and source types. The MRX model’s ability to handle high-resolution audio inputs is particularly advantageous for applications such as speech transcription and sound classification, which are critical components in medical settings. For instance, accurate transcription and sound classification can aid in developing assistive technologies for patients with hearing impairments or in monitoring environments where precision in audio analysis can contribute to patient care and diagnosis. Future directions include integrating these audio separation models with automatic captioning systems, which could be instrumental in creating detailed medical reports or transcribing patient-physician interactions for record-keeping and analysis.
Comparison and Discussion
A Review of Deep Learning Techniques in Audio Event Recognition (AER) Applications
Strengths:
- Comprehensive Overview: Provides a thorough review of the shift from traditional methods to deep learning in AER, which is critical for understanding current and future directions in medical acoustic analysis.
- Focus on Advanced Models: Highlights the effectiveness of CNNs and RNNs, which are crucial for interpreting complex audio data in medical settings, such as analyzing patient speech or monitoring equipment sounds.
Weaknesses:
- Limited Practical Examples: Lacks specific examples of how these techniques can be applied in medical scenarios, which is essential for practical implementation.
- Implementation Challenges: Does not address the potential challenges in applying these models in real-world medical settings, such as integration with existing healthcare systems and user-friendliness for medical professionals.
You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection
Strengths:
- Innovative Approach: Introduces YOHO, a novel algorithm that could enhance speed and accuracy in detecting critical sounds in medical environments, such as rapid response scenarios.
- Real-Time Processing Capability: Its potential for real-time monitoring makes it valuable in urgent medical settings like emergency rooms or intensive care units.
Weaknesses:
- Unverified Performance in Noisy Settings: The paper might not fully explore the algorithm’s effectiveness in the highly variable and noisy environments typical of medical settings.
- Computational Demands: Lack of discussion on the computational requirements for real-time application in medical contexts could be a significant oversight.
The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks
Strengths:
- Addressing Complex Acoustic Environments: Tackles the intricate task of separating overlapping audio sources, which is highly relevant in noisy medical settings like hospitals.
- Potential for Enhanced Diagnostics: The ability to isolate specific sounds (like a patient’s voice or a machine alarm) could significantly aid in clearer medical assessments and diagnoses.
Weaknesses:
- Implementation Hurdles: The complexity of the proposed audio separation technique could pose challenges in integrating it with standard medical equipment.
- Integration with Other Medical Tools: The paper might not fully discuss how this approach could be seamlessly incorporated into existing medical diagnostic systems.
References
[1] YAI Global. (2020). Figure 1: Audio Segmentation. Retrieved from https://yaiglobal.com/index.php/component/k2/item/5-audio-segmentation.
[2] Mesaros, A., Heittola, T., & Diment, A. (2017, November). DCASE 2017 Challenge setup: Tasks, datasets and baseline system. In Detection and Classification of Acoustic Scenes and Events 2017. Retrieved from https://www.researchgate.net/publication/319842878_DCASE_2017_CHALLENGE_SETUP_TASKS_DATASETS_AND_BASELINE_SYSTEM
[3] Prashanth, A., Jayalakshmi, S. L., & Rajamani, V. (2023). A review of deep learning techniques in audio event recognition (AER) applications. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-023-15891-z
[4] Mahanta, S. K., Khilji, A. F. U. R., & Pakray, P. Deep neural network for musical instrument recognition using MFCCs. Computación y Sistemas, 25(2). https://doi.org/10.13053/cys-25-2-3946
[5] Venkatesh, S., Moffat, D., & Miranda, E. R. (2022). "You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection." Applied Sciences, 12(3293). Figure 1, A comparison of segmentation-by-classification and YOHO. https://doi.org/10.3390/app12073293
[6] Venkatesh, S., Moffat, D., & Miranda, E. R. (2022). "You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection." Applied Sciences, 12(3293). Figure 2. An illustration of the output layer of the YOHO algorithm. This network is for music-speech detection. To increase the number of audio classes, we add neurons along the horizontal axis. https://doi.org/10.3390/app12073293
[7] Venkatesh, S., Moffat, D., & Miranda, E. R. (2022). "You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection." Applied Sciences, 12(3293). Figure 4. Average time taken to make predictions on 1 h of audio for music-speech detection. ‘Prediction’ refers to the time taken by the network to make predictions. ‘Smoothing’ is the post-processing step to parse the output of the network. The GPU used for inference was the Tesla P100. https://doi.org/10.3390/app12073293
[8] "The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks." YouTube, uploaded by Mitsubishi Electric Research Labs (MERL), 09.11.2021, accessed 20.12.2023. https://www.youtube.com/watch?v=1BR4SAKDhMk&t=0s
[9] Petermann, D., Wichern, G., Wang, Z.-Q., & Le Roux, J. (2021). The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks. arXiv:2110.09958 [eess.AS]. https://doi.org/10.48550/arXiv.2110.09958