Byte-Pair Encoding in the Task of Automated Audio Captioning
Lipping, Samuel (2022)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2022-05-19
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202204203338
Abstract
Automated audio captioning (AAC) methods largely use deep learning models in which the captions are encoded with one-hot word encoding. One-hot word encoding treats every word as equally similar to every other word. It also cannot encode or predict out-of-vocabulary (OOV) words. Other approaches, such as Word2Vec and FastText, account for word similarities, and FastText even allows OOV words to be encoded. In this thesis, we explore the use of Byte-Pair Encoding (BPE) for caption encoding in AAC. We conduct experiments with two datasets, Clotho and AudioCaps, and the large text corpus Newscrawl. We find that BPE improves the performance of our recurrent model when it is trained on AudioCaps, while our transformer model benefits from BPE only when it is trained on Clotho and evaluated on AudioCaps. Further, pretraining the BPE model on Newscrawl leads to performance similar to that of word embeddings on our transformer model, while providing a dataset-independent vocabulary.
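To illustrate the OOV limitation described in the abstract, the following is a minimal sketch of BPE in plain Python. It is not the implementation used in the thesis. The toy captions, the number of merges, the `</w>` end-of-word marker, and the function names `learn_bpe`, `merge_pair`, and `encode_word` are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus, num_merges):
    """Learn an ordered list of merge rules from a list of caption strings."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    word_freqs = Counter(
        tuple(word) + ("</w>",)
        for line in corpus
        for word in line.lower().split()
    )
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the chosen merge to every word before the next iteration.
        updated = Counter()
        for symbols, freq in word_freqs.items():
            updated[merge_pair(symbols, best)] += freq
        word_freqs = updated
    return merges


def encode_word(word, merges):
    """Split a word into subword symbols by applying the merges in learned order."""
    symbols = tuple(word.lower()) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy captions, not taken from Clotho or AudioCaps.
captions = ["a dog barks", "birds singing", "rain falling on a roof"]
merges = learn_bpe(captions, num_merges=20)

# A word-level (one-hot) vocabulary has no entry for an unseen word.
word_vocab = {word for line in captions for word in line.lower().split()}
print("barking" in word_vocab)

# BPE can still represent the unseen word as a sequence of subword symbols.
print(encode_word("barking", merges))
```

In this sketch a word whose characters all occur in the training text can always be encoded, falling back to single characters where no merge applies. Characters never seen during training would still lie outside the subword vocabulary.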