Integrating Audio and Natural Language through Cross-Modal Learning
Xie, Huang (2025)
Tampere University
Doctoral Programme in Computing and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of defence
2025-12-08
The permanent address of the publication is
https://urn.fi/URN:ISBN:978-952-03-4284-5
Abstract
As audio data becomes increasingly central in applications such as automated audio captioning (AAC) and language-based audio retrieval, there is a growing need for systems that can link acoustic signals with the textual descriptions that express their meaning. This thesis investigates cross-modal learning between audio and text, defined as the process of learning shared representations that enable models to relate information across these modalities within a common meaning-based (semantic) space. The overarching goal is to develop methods that capture meaningful correspondences between acoustic events and linguistic expressions. Audio–text cross-modal retrieval is used as a proxy task to evaluate the proposed methods.
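The proxy task named above, audio–text cross-modal retrieval, amounts to ranking candidate captions by their similarity to an audio clip in the shared embedding space. A minimal sketch of that scoring step, with hypothetical function names and toy embeddings not taken from the thesis:

```python
import numpy as np

def retrieve(audio_emb, text_embs):
    """Rank candidate text embeddings against one audio embedding
    by cosine similarity in a shared semantic space."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = t @ a               # cosine similarity per caption
    return np.argsort(-scores)   # indices of captions, best match first

# Toy 4-dim embeddings: caption 1 points in the audio's direction.
audio = np.array([1.0, 0.0, 1.0, 0.0])
captions = np.array([
    [0.0, 1.0, 0.0, 1.0],   # unrelated caption
    [1.0, 0.1, 0.9, 0.0],   # relevant caption
    [0.5, 0.5, 0.5, 0.5],   # partially relevant caption
])
ranking = retrieve(audio, captions)   # [1, 2, 0]
```

In practice the embeddings would come from trained audio and text encoders; the point of the sketch is only that retrieval quality is decided entirely by how well the shared space places related audio and text close together.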
While recent advances in deep learning have improved the integration of auditory and linguistic information, several challenges remain. These include modeling fine-grained correspondences between temporally localized sound events and the words that describe them, understanding the effect of negative sampling in contrastive learning of audio–text representations, and accounting for the inherently graded and subjective nature of semantic correspondence (i.e., relevance) between audio and text.
To address these challenges, the thesis introduces a set of methods grounded in unsupervised learning, contrastive learning, and human-centered evaluation. First, it presents an unsupervised framework that aligns sound events with semantically relevant textual phrases in audio–text pairs without requiring annotated or time-aligned data. Second, it systematically evaluates eight negative sampling strategies in contrastive learning, showing that the sampling choice substantially affects representation quality and downstream retrieval performance. Third, it introduces a crowdsourced dataset of graded audio–text relevance ratings to capture human perceptual judgements, and proposes a dual-objective training approach that combines these graded ratings with binary relevance labels. Finally, it develops a regression-based method that estimates continuous relevance scores from semantic similarity between captions, providing a scalable, automated way to produce nuanced relevance annotations.
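The contrastive-learning setting in which the negative sampling strategies are compared can be illustrated with an InfoNCE-style loss, where each audio clip is pulled toward its paired caption and pushed away from sampled negative captions. This is a generic sketch of that loss family, not the thesis's exact formulation; the function name and toy vectors are illustrative:

```python
import numpy as np

def info_nce(audio, pos_text, neg_texts, temperature=0.07):
    """Contrastive loss for one audio embedding: maximize similarity to
    the paired caption relative to the sampled negative captions."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    # Positive similarity at index 0, negatives after it.
    logits = np.array([cos(audio, pos_text)] +
                      [cos(audio, n) for n in neg_texts]) / temperature
    logits -= logits.max()  # numerical stability
    # Softmax cross-entropy with the positive as the target class.
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

a = np.array([1.0, 0.0])
loss_good = info_nce(a, np.array([1.0, 0.0]),
                     [np.array([0.0, 1.0]), np.array([-1.0, 0.0])])
loss_bad = info_nce(a, np.array([0.0, 1.0]),
                    [np.array([1.0, 0.0]), np.array([-1.0, 0.0])])
```

Which captions populate `neg_texts` is exactly the negative-sampling choice the thesis evaluates: with the same loss, different sampling strategies yield different gradients and, as the results show, substantially different representation quality.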
Experimental results on standard audio–text retrieval benchmarks demonstrate the effectiveness of the proposed methods. The findings advance understanding of how meaning is shared between audio and text and provide practical techniques for improving machine perception of sound through natural language.
