Unsupervised Domain Adaptation for Audio Classification
Gharib, Shayan (2020)
Degree Programme in Information Technology, MSc (Tech)
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
Acceptance date
2020-05-20
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202005135245
Abstract
Machine learning algorithms have achieved state-of-the-art results across different tasks in recent years by utilizing deep neural networks (DNNs). However, the performance of DNNs suffers under mismatched conditions between training and test datasets. This is a general machine learning problem that arises in applications such as machine vision, natural language processing, and audio processing. For instance, the use of different recording devices and the presence of ambient noise are among the factors that cause mismatched conditions between training and test datasets in audio classification tasks. Due to such mismatched conditions, a DNN that performs well during the training phase suffers a decrease in performance when evaluated on unseen data. To compensate for this reduction in performance, domain adaptation methods have been employed to adapt DNNs to the conditions of the test dataset.
The objective of this thesis is to study unsupervised domain adaptation using adversarial training for acoustic scene classification. We first pre-train a model using data from one set of conditions. Subsequently, we retrain the model using data from another set of conditions, adapting it so that its outputs are condition-invariant representations of the inputs. More specifically, we place a discriminator against the model. The aim of the discriminator is to distinguish between data coming from different conditions, while the goal of the model is to confuse the discriminator so that it cannot differentiate between data from different conditions. The data that we use to optimize our models, i.e. the training data, have been recorded with different recording devices than the ones used for the dataset on which the model is evaluated, i.e. the test dataset. The training data are recorded with a high-quality recording device, while the audio recordings in the test set have been collected with two handheld consumer devices of mediocre quality. This introduces a mismatched condition that negatively affects the performance of the optimized DNN. In this thesis, we simulate a scenario in which we do not have access to the annotations of the dataset to which we try to adapt our model. Therefore, our method can be used as an unsupervised domain adaptation technique. In addition, we present our results using two different DNN architectures, namely the Kaggle and DCASE models, to show that our method is model-agnostic and works regardless of the model used. The results show a significant improvement for the adapted Kaggle and DCASE models compared to the non-adapted ones, with approximate increases of 11% and 6%, respectively, in performance on unseen test data, while maintaining the same performance on unseen samples from the training data.
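To make the adversarial adaptation step above concrete, the following is a minimal PyTorch sketch of one training iteration, not code from the thesis itself: the discriminator learns to separate source-device from target-device features, while the feature extractor is updated to confuse it. All module names, layer sizes, learning rates, and the input shape (batches of single-channel spectrograms) are illustrative assumptions.

import torch
import torch.nn as nn

# Hypothetical stand-ins for the pre-trained feature extractor and the
# domain discriminator; the sizes are illustrative, not from the thesis.
feature_extractor = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                      # -> 16-dimensional feature vector
)
discriminator = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 1),                  # logit: source vs. target domain
)

opt_f = torch.optim.Adam(feature_extractor.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def adaptation_step(x_source, x_target):
    """One adversarial update on spectrogram batches of shape (B, 1, H, W).
    No target labels are used, matching the unsupervised setting."""
    # Discriminator update: learn to tell the two domains apart.
    # Features are detached so only the discriminator is updated here.
    f_s = feature_extractor(x_source).detach()
    f_t = feature_extractor(x_target).detach()
    d_loss = (bce(discriminator(f_s), torch.ones(f_s.size(0), 1))
              + bce(discriminator(f_t), torch.zeros(f_t.size(0), 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Feature-extractor update: make target features look like source
    # features, i.e. fool the discriminator into predicting "source".
    f_t = feature_extractor(x_target)
    g_loss = bce(discriminator(f_t), torch.ones(f_t.size(0), 1))
    opt_f.zero_grad()
    g_loss.backward()
    opt_f.step()
    return d_loss.item(), g_loss.item()

This sketch shows only the adversarial term; in the setting the abstract describes, a classification loss on the labeled source data would typically be kept in the feature-extractor update as well, which is what allows performance on unseen source-domain samples to be maintained.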