Zero-Shot Learning for Audio Classification
Xie, Huang (2020)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. It is intended only for your own personal use; commercial use is prohibited.
Date of acceptance
2020-11-11
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202010097304
Abstract
Zero-shot learning has received increasing attention in recent years. In this thesis, we study zero-shot learning for general audio classification, using class labels and sentence descriptions as semantic side information about acoustic classes. The goal is to obtain audio classifiers capable of recognizing audio instances of unseen acoustic classes, for which no training data are available, only semantic side information.
In this thesis, we use a bilinear model to learn an acoustic-semantic mapping between intermediate-level representations of audio instances and acoustic classes from training data, and then generalize it to unseen acoustic classes. Audio instances are represented by deep acoustic embeddings generated with VGGish models. To respect the zero-shot setting, we train the VGGish models from scratch, excluding the audio data of unseen classes. For acoustic classes, semantic representations are learned from their semantic side information with language models. In this work, three pre-trained language models (Word2Vec, GloVe, BERT) are used to extract either label embeddings from textual class labels or sentence embeddings from sentence descriptions of acoustic classes. The bilinear model is then trained to measure the compatibility of pairs of acoustic and semantic embeddings. An audio instance is classified into the acoustic class whose semantic embedding is most compatible with its acoustic embedding.
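To make the decision rule above concrete, the following is a minimal NumPy sketch of zero-shot classification with a bilinear compatibility function F(x, y) = x^T W y. The embedding dimensionalities (128-dimensional VGGish-style acoustic embeddings, 300-dimensional label embeddings) and the random placeholder values are illustrative assumptions, not values from the thesis.

import numpy as np

# Bilinear compatibility: F(x, y) = x^T W y, where x is an acoustic
# embedding, y a semantic embedding, and W the learned weight matrix.
# Dimensions are illustrative assumptions: VGGish embeddings are
# commonly 128-d; Word2Vec/GloVe label embeddings 300-d.
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 300))  # in practice, learned on seen classes

def compatibility(acoustic_emb, semantic_emb, W):
    """Score one (audio, class) pair with the bilinear form x^T W y."""
    return acoustic_emb @ W @ semantic_emb

def zero_shot_classify(acoustic_emb, class_embs, W):
    """Assign the class whose semantic embedding is most compatible."""
    scores = [compatibility(acoustic_emb, y, W) for y in class_embs]
    return int(np.argmax(scores))

# For unseen classes, only their semantic embeddings are available.
unseen_class_embs = rng.standard_normal((5, 300))  # e.g. 5 unseen labels
audio_emb = rng.standard_normal(128)               # VGGish-style embedding
print(zero_shot_classify(audio_emb, unseen_class_embs, W))

In practice, W is learned from seen-class training pairs; at test time, only the semantic embeddings of the unseen classes are required to classify new audio.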
Evaluation experiments are performed on two audio datasets: ESC-50 and AudioSet. First, prototyping experiments are conducted on ESC-50, a small, balanced environmental audio dataset. These prototyping experiments demonstrate that the proposed method can recognize audio instances from unseen classes based on class label embeddings. Experimental results show that classification performance improves significantly when sound classes that are semantically close to the unseen test classes are included in training the bilinear model. In-depth evaluation experiments are carried out on AudioSet, a large-scale, unbalanced general audio dataset. We study the effectiveness of semantic side information about acoustic classes by investigating different semantic embeddings generated by language models. Three types of semantic representations of acoustic classes are investigated: label embeddings, sentence embeddings, and their concatenation. Experimental results show that both textual labels and sentence descriptions carry meaningful semantic information about acoustic classes for audio classification. Concatenating label embeddings with sentence embeddings significantly improves the classification capability of the proposed method.
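As a brief illustration of the concatenated semantic representation mentioned above, the sketch below joins a label embedding and a sentence embedding into a single class embedding. The dimensionalities (300 for a Word2Vec/GloVe-style label embedding, 768 for a BERT-style sentence embedding) are typical defaults assumed for illustration.

import numpy as np

rng = np.random.default_rng(1)
label_emb = rng.standard_normal(300)     # e.g. Word2Vec/GloVe label embedding
sentence_emb = rng.standard_normal(768)  # e.g. BERT sentence embedding

# Concatenated semantic representation of one acoustic class; the
# bilinear model's W must then match this semantic dimensionality.
semantic_emb = np.concatenate([label_emb, sentence_emb])  # shape (1068,)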