Acoustic Scene Classification With L3 Embeddings: Transfer learning experiment
Seppälä, Joni (2020)
Tieto- ja sähkötekniikan kandidaattiohjelma - Degree Programme in Computing and Electrical Engineering, BSc (Tech)
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Acceptance date
2020-05-21
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202005195463
Abstract
Vast amounts of audio data are recorded daily in different environments. Being able to recognize the context of audio automatically would be beneficial in many context-aware systems, such as hearing aids and smartphones. Although audio data is abundant, labelled audio data can be scarce in some domains. The objective of this thesis is to achieve the highest possible accuracy in the acoustic scene classification task using machine learning (ML), focusing primarily on a transfer learning approach with L3-embeddings - an approach that is robust to limited training data.
The thesis explores how well the L3-embeddings presented in the study Look, Listen and Learn can be applied to the TAU Urban Acoustic Scenes 2019 acoustic scene classification challenge, and how the choice of the downstream classifier might affect the performance. The thesis presents a review of essential theories related to the considered approach, outlines the system that was implemented and compares the obtained results to the baseline, state-of-the-art and human performance.
The implemented system for the task involves training a model with either a k-nearest neighbors (k-NN) or a feed-forward neural network (FNN) classifier. Audio files are given as input to the OpenL3 library, which generates compressed features, called embeddings, from them. These embeddings are in turn given as input to the system, which uses them for training and testing the model.
The obtained results show that the chosen method works well. Although the hyperparameters of the model were not optimized, the FNN classifier achieved an average accuracy of 81 %, close to the state-of-the-art accuracy of 85 %, with a much simpler model and only a small fraction of the parameters. The results also indicate that the choice of classifier significantly affects the obtained accuracy: the average accuracy of the k-NN classifier was 76 %, notably lower than that achieved by the FNN.
Collections
- Bachelor's theses [7052]