Dataset And Deep Neural Network Based Approach To Audio Question Answering
Sudarsanam, Parthasaarathy (2023)
Master's Programme in Computing Sciences
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Julkaisun pysyvä osoite on
Audio question answering (AQA) is a multimodal task in which a system analyzes an audio signal and a question in natural language, to produce a desirable answer in natural language. In this thesis, a new dataset for audio question answering, Clotho-AQA, consisting of 1991 audio files each between 15 to 30 seconds in duration is presented. For each audio file in the dataset, six different questions and their corresponding answers were crowdsourced using Amazon Mechanical Turk (AMT). The questions and their corresponding answers were created by different annotators. Out of the six questions for each audio, two questions each were designed to have ‘yes’ and ‘no’ as answers respectively, while the remaining two questions have other single-word answers. For every question, answers from three independent annotators were collected. In this thesis, two baseline experiments are presented to portray the usage of the Clotho-AQA dataset - a multimodal binary classifier for ‘yes’ or ‘no’ answers and a multimodal multi-class classifier for single-word answers both based on long short-term memory (LSTM) layers. The binary classifier achieved an accuracy of 62.7% and the multi-class classifier achieved a top-1 accuracy of 54.2% and a top-5 accuracy of 93.7%. Further, an attention-based model was proposed, which increased the binary classifier accuracy to 66.2% and the top-1 and top-5 multiclass classifier accuracy to 57.5% and 99.8% respectively. Some drawbacks of the Clotho-AQA dataset such as the presence of the same answer words in different tenses, singular-plural forms, etc., that are considered as different classes for the classification problem were addressed and a refined version called Clotho-AQA_v2 is also presented. The multimodal baseline model achieved a top-1 and top-5 accuracy of 59.8% and 96.6% respectively while the attention-based model achieved a top-1 and top-5 accuracy of 61.3% and 99.6% respectively on this refined dataset.