Sequence Temporal Sub-Sampling for Automated Audio Captioning
Nguyen, Khoa (2020)
Bachelor's Programme in Science and Engineering
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Julkaisun pysyvä osoite on
Audio captioning is a novel task in machine learning which involves the generation of textual description for an audio signal. For example, a method for audio captioning must be able to generate descriptions like “two people talking about football”, or “college clock striking” from the corresponding audio signals. Audio captioning is one of the tasks in the Detection and Classification of Acoustic Scenes and Events 2020 (DCASE2020). Most audio captioning methods use the encoder-decoder deep neural networks architecture as a function to map the extracted features from input audio sequence to the output captions. However, the length of an output caption is considerably less than the length of an input audio signal, for example, 10 words versus 2000 audio feature vectors. This thesis work reports an attempt to take advantage of this difference in length by employing temporal sub-sampling in the encoder-decoder neural networks. The method is evaluated using the Clotho audio captioning dataset and the DCASE2020 evaluation metrics. Experimental results show that temporal sequence sub-sampling is able to improve all considered metrics, as well as memory and time complexity while training and calculating predicted output.
- Kandidaatintutkielmat