Robust Sound Event Detection in Real World Environment
Gohar, Ali (2020)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2020-12-09
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202011298298
Abstract
Sounds in our daily lives carry useful information and can help solve a range of problems. Sound event detection (SED) in real-life settings can be used to build products for security, safety and human convenience. Deep neural networks are widely used for such tasks and perform very well, and a substantial body of work already applies deep and recurrent neural networks to related problems.
Considering which events are informative for nurses in home care centers and nursing homes, we selected four target classes (door, water tap, yelling, human fall) as the focus of this thesis. These sound events are also relevant elsewhere, e.g., in home security and public places. Because they occur frequently in everyday environments, we took speech, music and coughing as non-target classes. All sounds other than the four target classes are treated as background noise and ignored. This thesis aims to synthesize a dataset containing all these sound events and to train various models to achieve high accuracy on the target classes.
Target and non-target class data, along with ambient sounds, were collected from the internet to create realistic soundscapes. The soundscapes were created in four folds so that cross-validation could be used when training the models. To protect speaker privacy, we extracted audio embeddings from the soundscapes and used them in all further processing. Three types of audio embeddings (openl3, kumar2018, zhao2020) were extracted and compared to find the best one for this task. Supervised acoustic detection models were trained and validated on the synthesized data to find the best model for these events.
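The four-fold arrangement can be illustrated with a minimal sketch. It is not the thesis's own tooling: the file names, the fixed seed and the helper functions `make_folds` and `cross_validation_splits` are hypothetical, and the sketch assumes that each soundscape is assigned to exactly one fold and that each fold is held out once for validation.

```python
import random


def make_folds(items, n_folds=4, seed=0):
    """Shuffle the items and deal them round-robin into n_folds disjoint folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    return [shuffled[i::n_folds] for i in range(n_folds)]


def cross_validation_splits(folds):
    """Yield (train, validation) pairs, holding out each fold exactly once."""
    for held_out in range(len(folds)):
        train = [
            item
            for index, fold in enumerate(folds)
            if index != held_out
            for item in fold
        ]
        yield train, folds[held_out]


# Hypothetical soundscape file names, used only for illustration.
soundscapes = [f"soundscape_{i:03d}.wav" for i in range(12)]
folds = make_folds(soundscapes)
splits = list(cross_validation_splits(folds))

for number, (train, validation) in enumerate(splits, start=1):
    print(f"fold {number}: {len(train)} training, {len(validation)} validation")
```

Under this scheme a model is trained four times, and every soundscape is used for validation exactly once.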
We found that zhao2020 embeddings, when fed to various classification models, outperformed the other two embedding types on our synthetic dataset. We achieved the best results by combining a recurrent neural network with max pooling.
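The abstract does not say how max pooling was combined with the recurrent network. One common arrangement, assumed here purely for illustration, is to let the recurrent layer emit per-frame class probabilities and then take the per-class maximum over time to obtain a clip-level decision. The probabilities, the 0.5 threshold and the function `max_pool_over_time` below are hypothetical.

```python
def max_pool_over_time(frame_probs):
    """Reduce per-frame class probabilities to one clip-level vector.

    frame_probs holds one list of class probabilities per time frame.
    The result holds, for each class, the maximum over all frames.
    """
    if not frame_probs:
        raise ValueError("need at least one frame")
    return [max(column) for column in zip(*frame_probs)]


# The four target classes named in the abstract.
CLASSES = ["door", "water tap", "yelling", "human fall"]

# Made-up per-frame outputs standing in for a recurrent network's predictions.
frame_probs = [
    [0.10, 0.05, 0.02, 0.01],
    [0.80, 0.10, 0.03, 0.02],
    [0.20, 0.70, 0.04, 0.01],
]

clip_probs = max_pool_over_time(frame_probs)

# Assumed decision threshold of 0.5; a real system would tune this value.
detected = [name for name, p in zip(CLASSES, clip_probs) if p >= 0.5]
print(clip_probs, detected)
```

Taking the maximum means that a class is reported for the clip if it is strongly predicted in at least one frame, which suits short events such as a door closing.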