Trepo

Caption guided sound event tagging

Piilola, Jarkko (2025)

 
File: PiilolaJarkko.pdf (1.321 MB)




Bachelor's Programme in Computing and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2025-06-04
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202506046681
Abstract
Sound event tagging (SET) has progressed through deep learning models that achieve high accuracy. However, these models can face limitations, especially when classification is ambiguous. It can therefore be beneficial to assist SET models with multimodal input that supplies more contextual information. This thesis presents a system that uses captions associated with audio clips, together with natural language processing (NLP), to enhance the confidence scores produced by a baseline SET model. Both the SET tags and the caption tags are pre-processed, extracted, and embedded into a shared vector space using a Sentence-BERT (SBERT) model. Cosine similarity is then calculated between the tags to assess their contextual correspondence. The adjusted SET confidences are derived through both positive and negative boosting, using a formula that takes into account the caption tag matches for each SET tag and the number of captions, with a variable caption weight. The goal is to produce more accurate and semantically coherent classification confidences. Experimental results show a gradually decreasing trend in F1 scores as the caption tag weight increases. However, in some cases F1 scores improve when caption tags are given a low weight and the acceptance and semantic similarity thresholds are moderate. The highest F1 score achieved was 0.696, with weight alpha = 0.1, confidence threshold = 0.3, and similarity threshold = 0.5. These findings suggest that multimodality through caption tagging can complement but not replace sound event tagging, especially when boosting is based on the SET model's confidences.
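
The pipeline outlined in the abstract (embed tags, compare them by cosine similarity, then boost or reduce the SET confidence) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual boosting formula, so the blending rule in `adjust_confidence` is a hypothetical stand-in, and the function names `caption_support` and `adjust_confidence` are assumptions introduced only for illustration. The 2-D vectors are toy values standing in for SBERT embeddings, which in practice would come from a sentence embedding model. Only the parameter values alpha = 0.1, confidence threshold = 0.3, and similarity threshold = 0.5 are taken from the abstract.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def caption_support(set_tag_vec, captions, sim_threshold=0.5):
    """Fraction of captions with at least one tag matching the SET tag.

    `captions` is a list of captions, each a list of caption-tag vectors.
    Returns None when there are no captions to consult.
    """
    if not captions:
        return None
    hits = sum(
        1
        for caption_tags in captions
        if any(cosine_similarity(set_tag_vec, v) >= sim_threshold for v in caption_tags)
    )
    return hits / len(captions)


def adjust_confidence(conf, support, alpha=0.1):
    """Hypothetical rule (not the thesis formula): blend confidence and support.

    Support above the current confidence raises it (positive boost);
    support below it lowers it (negative boost). alpha is the caption weight.
    """
    if support is None:
        return conf
    return (1.0 - alpha) * conf + alpha * support


# Toy 2-D vectors standing in for SBERT embeddings (assumed values).
set_tag = [1.0, 0.0]
captions = [
    [[0.8, 0.6]],  # caption 1: one tag, similar to the SET tag
    [[0.0, 1.0]],  # caption 2: one tag, unrelated to the SET tag
]

support = caption_support(set_tag, captions, sim_threshold=0.5)
adjusted = adjust_confidence(0.25, support, alpha=0.1)
accepted = adjusted >= 0.3  # confidence threshold value quoted in the abstract
print(support, adjusted, accepted)
```

With a small alpha the caption evidence only nudges the baseline score, which is in line with the abstract's observation that low caption weights performed best.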
Collections
  • Bachelor's theses [9897]
Kalevantie 5
P.O. Box 617
33014 Tampere University
oa[@]tuni.fi | Privacy | Accessibility statement
 

 
