Audio Captioning with Keyword Guidance
Afolaranmi, James (2025)
Master's Programme in Computing Sciences and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Acceptance date
2025-05-17
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202504294374
Abstract
The field of Automatic Audio Captioning (AAC) is gaining traction, and it has led to the development of several systems capable of generating textual descriptions of sound events and their interrelationships in an acoustic environment.
However, many of these systems tend to produce repetitive and highly generic captions. The aim of this thesis is to propose an AAC system that generates captions targeted towards different sound events through keyword guidance. This thesis investigates the performance of this system on two datasets, Clotho and MACS, with various keyword setups in the training and testing phases.
The system encodes audio files with the pre-trained HTS-AT (Hierarchical Token-Semantic Audio Transformer), encodes keywords with word2vec, and generates captions with a transformer decoder built from scratch.
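The combination of the two encodings can be pictured as follows. This is only a minimal sketch of one plausible way to fuse them, not the thesis's actual implementation: the dimensions, the linear projection, and the strategy of prepending keyword vectors to the audio frames as the decoder's cross-attention memory are all assumptions for illustration.

```python
import numpy as np

# Assumed dimensions for illustration only: HTS-AT embedding size and the
# standard 300-dimensional word2vec vectors.
AUDIO_DIM = 768
WORD_DIM = 300

def build_decoder_memory(audio_emb, keyword_embs, proj):
    """Project word2vec keyword vectors into the audio embedding space and
    prepend them to the audio frame embeddings, forming the memory sequence
    a transformer decoder would cross-attend to when generating a caption."""
    projected = keyword_embs @ proj                      # (n_keywords, AUDIO_DIM)
    return np.concatenate([projected, audio_emb], axis=0)

rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(32, AUDIO_DIM))    # 32 audio frames from the encoder
keyword_embs = rng.normal(size=(3, WORD_DIM))   # word2vec vectors for 3 keywords
proj = rng.normal(size=(WORD_DIM, AUDIO_DIM))   # learned projection (here random)

memory = build_decoder_memory(audio_emb, keyword_embs, proj)
print(memory.shape)  # (35, 768)
```

Exposing the keywords as extra memory positions lets the decoder attend to them at every generation step, which is one common way such guidance steers captions toward specific sound events.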
The proposed method with keywords achieved a satisfactory result, producing over 2000 unique captions when tested on the Clotho dataset. This is a significant improvement over the baseline model (the proposed system without keywords), which produced about 500 unique captions.