Audio Captioning in Finnish and English with Task-Dependent Output
Martin Morato, Irene; Harju, Manu; Mesaros, Annamaria (2024)
URI
https://dcase.community/documents/workshop2024/proceedings/DCASE2024Workshop_Martin-Morato_27.pdf
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202501311839
Description
Peer reviewed
Abstract
Describing audio content is a complex task for an annotator; the resulting caption depends on the annotator’s language, culture and expertise. In addition, physiological factors like vision impairment may affect how the sound is perceived and interpreted. In this work, we explore bilingual audio captioning in Finnish and English. In connection with this study, we release the SiVi-CAFE dataset, a small-size dataset of Sighted and Visually-impaired Captions for Audio in Finnish and English, with a collection of parallel annotations for the same clips. We briefly analyze the differences between captions produced by sighted and visually-impaired annotators, and train a system to produce captions in both languages that also mimics the style of different annotator groups. The system obtains CIDEr scores of 34.75% and 28.75% on the English and Finnish datasets, respectively. Furthermore, the system is able to perform a tagging task, obtaining an F-score of 79.73%.
Collections
- TUNICRIS publications [20711]