Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models

Khorrami, Khazar; Räsänen, Okko (2021)

File: khorrami21_interspeech.pdf (1.275 MB)
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
doi:10.21437/Interspeech.2021-496
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202112038868

Description

Peer reviewed
Abstract
Systems that can find correspondences between multiple modalities, such as between speech and images, have great potential to solve different recognition and data analysis tasks in an unsupervised manner. This work studies multimodal learning in the context of visually grounded speech (VGS) models, and focuses on their recently demonstrated capability to extract spatiotemporal alignments between spoken words and the corresponding visual objects without ever having been explicitly trained for object localization or word recognition. As the main contributions, we formalize the alignment problem in terms of an audiovisual alignment tensor that is based on earlier VGS work, introduce systematic metrics for evaluating model performance in aligning visual objects and spoken words, and propose a new VGS model variant for the alignment task that uses a cross-modal attention layer. We test our model and a previously proposed model in the alignment task using SPEECH-COCO captions coupled with MSCOCO images. We compare the alignment performance measured with our proposed evaluation metrics to performance in the semantic retrieval task commonly used to evaluate VGS models. We show that the cross-modal attention layer not only helps the model achieve higher semantic cross-modal retrieval performance, but also leads to substantial improvements in the alignment performance between image objects and spoken words.
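As a rough illustration of what an audiovisual alignment tensor could look like, the sketch below scores every image location against every speech frame with a dot product, giving a tensor indexed by height, width, and time. This is an assumption made for illustration only: the abstract does not state how the tensor is computed, and the function names, the dot-product similarity, and the array shapes here are hypothetical and not taken from the paper.

```python
import numpy as np


def alignment_tensor(image_feats, audio_feats):
    """Score each image location against each speech frame.

    image_feats: array of shape (H, W, D), one D-dim embedding per location.
    audio_feats: array of shape (T, D), one D-dim embedding per speech frame.
    Returns an array of shape (H, W, T) of dot-product similarities.
    The dot-product choice is an assumption of this sketch.
    """
    return np.einsum("hwd,td->hwt", image_feats, audio_feats)


def spatial_map_for_segment(tensor, start, end):
    """Average the tensor over frames [start, end), e.g. one spoken word.

    Returns an (H, W) map showing where in the image that segment scores high.
    """
    return tensor[:, :, start:end].mean(axis=2)


def temporal_curve_for_region(tensor, mask):
    """Average the tensor over image locations selected by a boolean mask.

    mask: boolean array of shape (H, W) with at least one True entry,
    e.g. the pixels of one object. Returns a length-T curve showing when
    in the utterance that region scores high.
    """
    return tensor[mask].mean(axis=0)
```

Under these assumptions, a word's time span yields a spatial map that could be compared with an object's location, and an object's mask yields a temporal curve that could be compared with a word's time span, which is the kind of two-directional comparison the abstract describes between visual objects and spoken words.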
Collections
  • TUNICRIS publications [24712]
Kalevantie 5
P.O. Box 617
33014 Tampere University
oa[@]tuni.fi | Data protection | Accessibility statement