A summarization approach to evaluating audio captioning
Martin Morato, Irene; Harju, Manu; Mesaros, Annamaria (2022-11-03)
URI
https://dcase.community/documents/workshop2022/proceedings/DCASE2022Workshop_Martin-Morato_35.pdf
Martin Morato, Irene
Harju, Manu
Mesaros, Annamaria
Editor(s)
Lagrange, Mathieu
Mesaros, Annamaria
Pellegrini, Thomas
Richard, Gaël
Serizel, Romain
Stowell, Dan
DCASE
03.11.2022
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-202302021989
Description
Peer reviewed
Abstract
Audio captioning is currently evaluated with metrics originating from machine translation and image captioning, but their suitability for audio has recently been questioned. This work proposes content-based scoring of audio captions, an approach that considers the specific sound event content of the captions. Inspired by text summarization, the proposed measure assigns relevance scores to the sound events present in the references, and scores candidates based on the relevance of the sounds they retrieve. In this work we use a simple, consensus-based definition of relevance, but different weighting schemes can easily be incorporated to change the importance of terms. Our experiments use two datasets and three different audio captioning systems and show that the proposed measure behaves consistently with the data: captions that correctly capture the most relevant sounds obtain a score of 1, while those containing less relevant sounds score lower. While the proposed content-based score is not concerned with the fluency or semantic content of the captions, it can be incorporated into a compound metric, in the same way that SPIDEr is a linear combination of a semantic score and a syntactic fluency score.
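The scoring idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not give the exact formula. The sketch assumes that sound events have already been extracted from each caption as a list of labels, that the consensus relevance of an event is the fraction of reference captions mentioning it, and that a candidate is scored against the best relevance total achievable with the same number of retrieved events. The function names `relevance_weights` and `content_score` are hypothetical.

```python
from collections import Counter


def relevance_weights(reference_events):
    """Assumed consensus relevance: share of references that mention each event."""
    counts = Counter(e for events in reference_events for e in set(events))
    n_refs = len(reference_events)
    return {event: count / n_refs for event, count in counts.items()}


def content_score(candidate_events, reference_events):
    """Score a candidate by the relevance of the reference events it retrieves.

    Assumption: events absent from every reference are ignored, not penalized.
    """
    weights = relevance_weights(reference_events)
    retrieved = set(candidate_events) & set(weights)
    if not retrieved:
        return 0.0
    achieved = sum(weights[e] for e in retrieved)
    # Best total obtainable with the same number of retrieved events.
    best = sum(sorted(weights.values(), reverse=True)[:len(retrieved)])
    return achieved / best


# Toy data: three reference captions reduced to sound event labels.
references = [["dog", "rain"], ["dog"], ["dog", "car"]]
print(content_score(["dog"], references))   # most relevant event retrieved
print(content_score(["rain"], references))  # less relevant event retrieved
```

Under these assumptions, a candidate mentioning only "dog" (present in all references) reaches the maximum score of 1, while one mentioning only "rain" scores lower, which matches the behaviour reported in the abstract. A different weighting scheme would only require replacing `relevance_weights`.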
Collections
- TUNICRIS publications [15271]