
Self-supervised Representation Learning on Audio-Video Data

Wang, Shanshan (2025)

 
File: 978-952-03-3955-5.pdf (7.810 MB)



Wang, Shanshan
Tampere University
2025

Doctoral Programme in Computing and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of defence
2025-06-09
The permanent address of the publication is
https://urn.fi/URN:ISBN:978-952-03-3955-5
Abstract
Feature representation learning using audio-video data has gained significant attention due to its ability to leverage complementary information from both modalities. Audio and visual signals provide distinct yet correlated perspectives of the same scene, making their joint modeling beneficial for applications such as video understanding, environmental sound analysis, and autonomous perception. By integrating both modalities, models can learn richer and more discriminative feature representations, improving performance in tasks like audio-visual scene classification and cross-modal retrieval.

Recently, self-supervised learning (SSL) has emerged as a powerful alternative to supervised approaches, primarily due to its ability to learn meaningful representations without labeled data. SSL exploits intrinsic structures within the data to generate pseudo-labels, enabling models to extract robust and generalizable features from large-scale, unlabeled datasets. This is particularly advantageous in multi-modal learning, where obtaining labeled data is often labor-intensive and costly.
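As an illustration of how pseudo-labels can come from the data itself, the sketch below builds audio-video correspondence pairs: audio and video taken from the same clip form a positive pair, and audio paired with video from a different clip forms a negative pair. This is one common pretext task and is shown only as an assumption; the abstract does not state which pretext task the thesis uses. The function name `make_correspondence_pairs` and the clip identifiers are hypothetical.

```python
import random


def make_correspondence_pairs(clip_ids, num_negatives=1, seed=0):
    """Build (audio_clip, video_clip, pseudo_label) triples without human labels.

    Label 1: audio and video come from the same clip (they co-occur).
    Label 0: audio and video come from different clips.
    """
    if len(clip_ids) < 2:
        raise ValueError("need at least two clips to draw negatives")
    rng = random.Random(seed)
    pairs = []
    for clip in clip_ids:
        # Positive pair: the clip's own audio and video tracks.
        pairs.append((clip, clip, 1))
        # Negative pairs: this clip's audio with video from another clip.
        others = [c for c in clip_ids if c != clip]
        for _ in range(num_negatives):
            pairs.append((clip, rng.choice(others), 0))
    return pairs


# Hypothetical clip identifiers, used only for illustration.
pairs = make_correspondence_pairs(["street_01", "park_07", "tram_03"])
```

No annotation is needed here: the label follows entirely from whether the two streams were recorded together.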

This thesis addresses two goals: learning directly from audio-visual data for scene classification, and learning feature representations for other audio classification tasks. The first part presents a new audio-video dataset of urban scenes that we curated and published, together with a case study on audio-visual scene analysis. Our findings show that combining the audio and visual modalities yields significantly better performance than using either modality alone.

The second part of the thesis explores self-supervised feature representation learning, focusing on enhancing contrastive learning techniques for multimodal learning. The first direction investigated was spatial alignment between the audio and visual modalities, using spatial correspondence as a supervisory signal to improve cross-modal representation learning. Our experiments demonstrate that replacing standard log-mel spectrogram features with the first-order Ambisonics intensity vector significantly improves performance on the audio-visual spatial alignment task. The second direction was an improved contrastive loss function that strengthens the discriminative power of feature embeddings by introducing an angular margin between positive and negative pairs. Results indicate that applying this loss in both supervised and self-supervised learning settings leads to substantial performance improvements. The third direction concerned sampling techniques in self-supervised learning. We introduced a soft-positive sampling strategy that refines contrastive learning by selecting informative positive pairs while reducing the impact of noisy samples. Experimental results suggest that in self-supervised learning setups with small datasets, features learned through soft-positive sampling outperform those obtained with traditional sampling approaches.
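To make the idea of an angular margin concrete, the following is a minimal sketch of a contrastive (InfoNCE-style) loss in which the angle between the anchor and its positive is enlarged by a fixed margin before the softmax. The abstract does not give the thesis's formulation, so the additive margin, the temperature, and the function names are assumptions chosen for illustration, not the author's loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angular_margin_contrastive_loss(anchor, positive, negatives,
                                    margin=0.2, temperature=0.1):
    """Cross-entropy over one positive and several negatives.

    Assumption: the margin is added to the anchor-positive angle, so the
    positive must be closer in angle than without a margin to reach the
    same loss. The case theta + margin > pi is not handled in this sketch.
    """
    cos_pos = max(-1.0, min(1.0, cosine(anchor, positive)))
    theta = math.acos(cos_pos)
    pos_logit = math.cos(theta + margin) / temperature
    neg_logits = [cosine(anchor, n) / temperature for n in negatives]
    logits = [pos_logit] + neg_logits
    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - pos_logit


# Toy 2-D embeddings, used only for illustration.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_plain = angular_margin_contrastive_loss(anchor, positive, negatives, margin=0.0)
loss_margin = angular_margin_contrastive_loss(anchor, positive, negatives, margin=0.2)
```

With the margin set to zero the function reduces to a standard temperature-scaled contrastive loss; a positive margin penalizes the same embeddings more, which pushes training toward tighter positive pairs.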
Collections
  • Doctoral dissertations [5027]
Kalevantie 5
P.O. Box 617
33014 Tampere University
oa[@]tuni.fi | Privacy | Accessibility statement
 

 
