Integrating Audio and Natural Language through Cross-Modal Learning

Xie, Huang (2025)

 
Open file:
978-952-03-4284-5.pdf (4.703 MB)

Tampere University
2025

Doctoral Programme in Computing and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of defence: 2025-12-08
The permanent address of the publication is
https://urn.fi/URN:ISBN:978-952-03-4284-5
Abstract
As audio data becomes increasingly central in applications such as automated audio captioning (AAC) and language-based audio retrieval, there is a growing need for systems that can link acoustic signals with the textual descriptions that express their meaning. This thesis investigates cross-modal learning between audio and text, defined as the process of learning shared representations that enable models to relate information across these modalities within a common meaning-based (semantic) space. The overarching goal is to develop methods that capture meaningful correspondences between acoustic events and linguistic expressions. Audio–text cross-modal retrieval is used as a proxy task to evaluate the proposed methods.
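
To make this retrieval setup concrete, the sketch below shows how a shared semantic space is typically used at evaluation time: a caption query ranks all audio clips by cosine similarity between embeddings. The encoders producing the embeddings are left abstract here; nothing in this sketch is specific to the thesis's models.

    import torch
    import torch.nn.functional as F

    def retrieve_audio(text_emb: torch.Tensor,   # (n_queries, d) caption embeddings
                       audio_emb: torch.Tensor,  # (n_clips, d) audio clip embeddings
                       k: int = 5) -> torch.Tensor:
        # Cosine similarity in the shared semantic space: normalize,
        # then take inner products between every query and every clip.
        t = F.normalize(text_emb, dim=-1)
        a = F.normalize(audio_emb, dim=-1)
        similarity = t @ a.T                       # (n_queries, n_clips)
        return similarity.topk(k, dim=-1).indices  # top-k clip indices per query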

While recent advances in deep learning have improved the integration of auditory and linguistic information, several challenges remain. These include modeling fine-grained correspondences between temporally localized sound events and the words that describe them, understanding the effect of negative sampling in contrastive learning of audio–text representations, and accounting for the inherently graded and subjective nature of semantic correspondence (i.e., relevance) between audio and text.
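
For context, the contrastive objective in question is commonly a symmetric InfoNCE loss over a batch of matched audio–text pairs. The generic sketch below (the temperature value and symmetric form are common defaults, not necessarily the thesis's exact formulation) makes the negative-sampling question visible: every non-matching item in the batch acts as a negative, so how the batch is assembled is the sampling strategy.

    import torch
    import torch.nn.functional as F

    def infonce_loss(audio_emb: torch.Tensor,   # (B, d) audio embeddings
                     text_emb: torch.Tensor,    # (B, d) matching caption embeddings
                     temperature: float = 0.07) -> torch.Tensor:
        # Row i of each tensor is a matched (positive) pair; the other
        # B - 1 rows in the batch act as its negatives in both directions.
        a = F.normalize(audio_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = a @ t.T / temperature             # (B, B) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.T, targets))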

To address these challenges, the thesis introduces a set of methods grounded in unsupervised learning, contrastive learning, and human-centered evaluation. First, it presents an unsupervised framework that aligns sound events with semantically relevant textual phrases in audio–text pairs without requiring annotated or time-aligned data. Second, it systematically evaluates eight negative sampling strategies in contrastive learning, showing that the sampling choice substantially affects representation quality and downstream retrieval performance. Third, it introduces a crowdsourced dataset of graded audio–text relevance ratings to capture human perceptual judgements, and proposes a dual-objective training approach that combines these graded ratings with binary relevance labels. Finally, it develops a regression-based method that estimates continuous relevance scores from semantic similarity between captions, providing a scalable, automated way to produce nuanced relevance annotations.
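
One way to picture the dual-objective training described above: score each audio–text pair once, then supervise that score with both the graded ratings (a regression target) and the binary labels (a classification target). The combination below is illustrative only; the sigmoid mapping, loss choices, and weighting alpha are assumptions rather than the thesis's exact recipe.

    import torch
    import torch.nn.functional as F

    def dual_objective_loss(scores: torch.Tensor,  # (B,) raw audio-text scores
                            graded: torch.Tensor,  # (B,) human ratings in [0, 1]
                            binary: torch.Tensor,  # (B,) float 0./1. relevance labels
                            alpha: float = 0.5) -> torch.Tensor:
        # Regression term: push scores toward graded human relevance.
        graded_term = F.mse_loss(torch.sigmoid(scores), graded)
        # Classification term: separate relevant from irrelevant pairs.
        binary_term = F.binary_cross_entropy_with_logits(scores, binary)
        return alpha * graded_term + (1.0 - alpha) * binary_term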

Experimental results on standard audio–text retrieval benchmarks demonstrate the effectiveness of the proposed methods. The findings advance understanding of how meaning is shared between audio and text and provide practical techniques for improving machine perception of sound through natural language.
Collections
  • Doctoral dissertations [5236]