Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models
Khorrami, Khazar; Räsänen, Okko (2021)
International Speech Communication Association (ISCA)
This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202112038868
Description
Peer reviewed
Abstract
Systems that can find correspondences between multiple modalities, such as between speech and images, have great potential to solve different recognition and data analysis tasks in an unsupervised manner. This work studies multimodal learning in the context of visually grounded speech (VGS) models, and focuses on their recently demonstrated capability to extract spatiotemporal alignments between spoken words and the corresponding visual objects without ever being explicitly trained for object localization or word recognition. As the main contributions, we formalize the alignment problem in terms of an audiovisual alignment tensor that is based on earlier VGS work, introduce systematic metrics for evaluating model performance in aligning visual objects and spoken words, and propose a new VGS model variant for the alignment task utilizing a cross-modal attention layer. We test our model and a previously proposed model in the alignment task using SPEECH-COCO captions coupled with MSCOCO images. We compare the alignment performance using our proposed evaluation metrics to the semantic retrieval task commonly used to evaluate VGS models. We show that the cross-modal attention layer not only helps the model to achieve higher semantic cross-modal retrieval performance, but also leads to substantial improvements in the alignment performance between image objects and spoken words.
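As a rough illustration of the two components named in the abstract, the sketch below shows a minimal cross-modal attention layer over speech and image embeddings, and a matchmap-style audiovisual alignment tensor in the spirit of the earlier VGS work the paper builds on. All class names, shapes, and layer choices here are illustrative assumptions, not the authors' actual architecture or tensor definition.

```python
# Minimal sketch, assuming speech is encoded as (B, T, D) frame embeddings
# and the image as (B, H*W, D) spatial embeddings. Hypothetical design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Speech frames attend over image spatial positions (assumed design)."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # queries from speech frames
        self.key = nn.Linear(dim, dim)    # keys from image positions
        self.value = nn.Linear(dim, dim)  # values from image positions

    def forward(self, speech: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        q, k, v = self.query(speech), self.key(image), self.value(image)
        # Scaled dot-product attention: (B, T, HW) weights over image positions.
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v  # (B, T, D) visually grounded speech features

def alignment_tensor(speech: torch.Tensor, image: torch.Tensor,
                     h: int, w: int) -> torch.Tensor:
    # Cosine similarity between every speech frame and every image position,
    # reshaped into a (B, T, H, W) spatiotemporal alignment tensor.
    s = F.normalize(speech, dim=-1)   # (B, T, D)
    i = F.normalize(image, dim=-1)    # (B, HW, D)
    sim = s @ i.transpose(1, 2)       # (B, T, HW)
    return sim.view(sim.shape[0], sim.shape[1], h, w)

# Example with assumed dimensions: 2 caption-image pairs, 50 speech frames,
# an 8x8 grid of image features, embedding size 64.
B, T, H, W, D = 2, 50, 8, 8, 64
speech = torch.randn(B, T, D)
image = torch.randn(B, H * W, D)
grounded = CrossModalAttention(D)(speech, image)  # (2, 50, 64)
align = alignment_tensor(speech, image, H, W)     # (2, 50, 8, 8)
```

Under these assumptions, peaks in the (T, H, W) slices of the alignment tensor would indicate which image region a given stretch of speech aligns to, which is the quantity the paper's evaluation metrics score.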
Collections
- TUNICRIS-julkaisut