Acoustic Scene Classification With L3 Embeddings: Transfer learning experiment
Seppälä, Joni (2020)
Tieto- ja sähkötekniikan kandidaattiohjelma - Degree Programme in Computing and Electrical Engineering, BSc (Tech)
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Acceptance date
2020-05-21
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202005195463
Abstract
Vast amounts of audio data are recorded daily in different environments. Being able to recognize the context of audio automatically would be beneficial in many context-aware systems, such as hearing aids and smartphones. Although audio data is abundant, labelled audio data can be scarce in some domains. The objective of this thesis is to achieve the highest possible accuracy in the acoustic scene classification task using machine learning (ML), focusing primarily on a transfer learning approach with L3-embeddings - an approach that is robust to limited training data.
The thesis explores how well the L3-embeddings presented in the study Look, Listen and Learn can be applied to the TAU Urban Acoustic Scenes 2019 acoustic scene classification challenge, and how the choice of the downstream classifier might affect the performance. The thesis presents a review of essential theories related to the considered approach, outlines the system that was implemented and compares the obtained results to the baseline, state-of-the-art and human performance.
The implemented system for the task involves training a model with either a k-nearest neighbors (k-NN) or a feed-forward neural network (FNN) classifier. Audio files are given as input to the OpenL3 library, which generates compressed features, called embeddings, from them. These embeddings are in turn given as input to the system, which uses them for training and testing the model.
The obtained results show that the chosen method works well. Although the hyperparameters of the model were not optimized, the FNN classifier achieved an average accuracy of 81 %, close to the state-of-the-art accuracy of 85 %, with a much simpler model and only a small fraction of the parameters. The results also indicate that the choice of classifier significantly affects the obtained accuracy: the average accuracy of the k-NN classifier was 76 %, notably lower than that achieved by the FNN.
Collections
- Bachelor's theses [7052]