Object detection and classification in images is done very effectively with convolutional neural networks. This type of network can also be used for audio object classification. To create a structure that is similar to an image, the audio data is transformed into a two-dimensional representation. This data can then be used to train deep convolutional neural networks, which operate on two-dimensional matrices.
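As an illustration of the transformation into a two-dimensional representation, the following is a minimal sketch that turns a one-dimensional signal into a log-magnitude spectrogram using a short-time Fourier transform. The text does not state which transformation was used in the project, so the choice of a spectrogram, the function name `spectrogram`, the frame length, the hop size and the synthetic test tone are all assumptions made for this example.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Return a (frequency bins x time frames) log-magnitude matrix.

    frame_len and hop are illustrative defaults, not values from the project.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping, windowed frames (one frame per row).
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # The magnitude of the real FFT of each frame gives one spectrum per frame.
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    # Log compression reduces the dynamic range; transpose so rows are frequencies.
    return np.log1p(magnitudes).T


# Synthetic input (assumption): one second of a 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

image = spectrogram(tone)
print(image.shape)
```

The resulting matrix has one row per frequency bin and one column per time frame, so it can be passed to a network in the same way as a single-channel image.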
For this project, a classifier was trained on a dataset of 19366 examples to predict 32 different classes of sounds, using a deep neural network that was originally built for image recognition. Its performance is then measured with metrics such as precision, recall and the F1 score.
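The metrics named above can be sketched for a multi-class classifier as follows. This is a minimal example, assuming that the metrics are computed per class from the true and predicted labels. The function name `per_class_metrics` and the three labels are hypothetical and chosen only for illustration; they are not taken from the project's 32 classes.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, predicted_label in zip(y_true, y_pred):
        if true_label == predicted_label:
            tp[true_label] += 1
        else:
            fp[predicted_label] += 1  # predicted class receives a false positive
            fn[true_label] += 1       # true class receives a false negative

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical labels and predictions, for illustration only.
y_true = ["dog", "dog", "siren", "siren", "rain", "rain"]
y_pred = ["dog", "siren", "siren", "siren", "rain", "dog"]

results = per_class_metrics(y_true, y_pred)
for label, (precision, recall, f1) in results.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision is the share of predictions for a class that are correct, recall is the share of true examples of a class that are found, and the F1 score is the harmonic mean of the two.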
The detection of audio objects is an important step towards a deeper understanding of how humans perceive the environment. Just like image and video data, sounds can easily be analysed by algorithms, and the resulting predictions can later be used to augment advanced video analysis problems in which audio and video data are analysed separately.

