Visually attended sound source semantic segmentation : Forming a semantic mask using visual and audio cues
Nuotiomaa, Petteri (2024)
Bachelor's Programme in Computing and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2024-03-14
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-202403082762
Abstract
Sound source separation from a single-input audio mixture is a problem with many proposed solutions. Most modern approaches use machine learning, and especially deep learning, to achieve good separation results. The goal of this thesis is to develop a solution to the sound source semantic segmentation problem, evaluate the performance of the proposed method, and compare it with similar solutions.
The thesis covers the theory behind the problem and reviews some of the previous solutions applied to it. During the background research, it became clear that few existing solutions address the exact problem of sound source semantic segmentation. For this reason, the thesis also examines solutions to related problems that bear on sound source semantic segmentation.
This thesis proposes a new solution for sound source semantic segmentation, building on some of the groundbreaking research done in the sound source separation field. The proposed deep learning system has two networks: one for audio and one for visual information. The audio network is a U-Net-type network, and the visual network is based on the ResNet family of models. Visual information is incorporated into the final separation by weighting the audio instrument masks with the predicted instrument probabilities. The network uses hyperparameters common to other works in the field. The loss function differs slightly from previous work in that it uses probabilities instead of hard labels as ground-truth values.
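The mask-weighting step and the soft-label loss described above can be sketched roughly as follows. This is a minimal NumPy illustration with invented shapes and values; the abstract does not specify the actual implementation, so all function names and dimensions here are assumptions for illustration only.

```python
import numpy as np

def weight_masks(instrument_masks, instrument_probs):
    """Scale each per-instrument spectrogram mask by the visual network's
    probability that the instrument appears in the frame.

    instrument_masks: (K, F, T) masks in [0, 1], one per instrument class
    instrument_probs: (K,) predicted instrument probabilities
    """
    return instrument_masks * instrument_probs[:, None, None]

def soft_label_bce(pred, target, eps=1e-7):
    """Binary cross-entropy where the target is a probability
    (soft label) rather than a hard 0/1 label."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1 - target) * np.log(1 - pred)))

# Toy example: 2 instrument classes, 3x4 time-frequency masks
masks = np.ones((2, 3, 4)) * np.array([0.8, 0.2])[:, None, None]
probs = np.array([0.9, 0.1])   # visual network: class 0 likely present
weighted = weight_masks(masks, probs)
```

With soft targets the loss never reaches zero even for a perfect prediction (its minimum is the target's entropy), but it still penalizes predictions that drift away from the target probabilities.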
The results indicate that the proposed system can separate simple audio sources from a mixture. The solution did not achieve the desired results in all test cases; in particular, the method had trouble separating the instruments in duet recordings. The system showed some promise in the multimodal approach, but the better results for that approach could also be explained by the different initial randomization of the network. The proposed solution achieved results better than the baseline and the simplest approaches to the same problem, but failed to meet the separation quality that can be expected from a modern deep learning approach.
Collections
- Bachelor's theses [8421]