Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities
Sudarsanam, Parthasaarathy; Martín-Morató, Irene; Virtanen, Tuomas (2025)
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-202601261867
Description
Peer reviewed
Abstract
This paper proposes a single-stage training approach that semantically aligns three modalities (audio, visual, and text) using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing large-scale unlabeled data to learn shared representations. The existing deep learning approach for trimodal alignment involves two stages that separately align the visual-text and audio-text modalities. This approach suffers from mismatched data distributions, resulting in suboptimal alignment. Leveraging the AVCaps dataset, which provides audio, visual, and audio-visual captions for video clips, our method jointly optimizes the representations of all the modalities using contrastive training. Our results demonstrate that the single-stage approach outperforms the two-stage method, achieving a two-fold improvement in audio-based visual retrieval, highlighting the advantages of unified multimodal representation learning.
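
The abstract does not give the loss function, so the following is only a minimal sketch of what joint, single-stage contrastive alignment of three modalities could look like. It assumes a symmetric InfoNCE loss summed over the audio-text, visual-text, and audio-visual pairs, one text embedding per clip, and a temperature of 0.07. These choices and the function names (`info_nce`, `trimodal_loss`) are illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch; a[i] and b[i] are a matched pair.

    Assumption: this is a generic contrastive loss, not necessarily the
    one used in the paper.
    """
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    n = len(a)
    # Cosine-similarity logits scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row i is index i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]
    # Average the a-to-b and b-to-a retrieval directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))


def trimodal_loss(audio, visual, text, temperature=0.07):
    """Hypothetical single-stage objective: all three pairs optimized jointly."""
    return (
        info_nce(audio, text, temperature)
        + info_nce(visual, text, temperature)
        + info_nce(audio, visual, temperature)
    )
```

In this sketch all three pairwise terms are computed on the same batch of clips, which is the property the single-stage approach relies on; a two-stage pipeline would instead minimize the visual-text and audio-text terms separately, possibly on different datasets.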
Collections
- TUNICRIS publications [24610]
