Audio-Visual Scene Classification with Different Late Fusion Combiners

Olán, Toni (2021)

 

Bachelor's Programme in Computing and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of acceptance
2021-07-27
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202106246070
Abstract
Humans use multiple senses when performing a task. Senses are analogous to modalities, which represent the ways in which something is experienced or perceived. With the increasing availability of multimodal data, modeling human behavior multimodally has gained popularity in various machine learning tasks. In scene classification, however, the most popular classifiers still rely on either audio or images alone.

Fusing correlated modalities right after feature extraction has been shown to achieve higher accuracy than fusing the predictions of unimodal classifiers. However, the latter approach, known as late fusion, is more flexible: the system can still produce reasonable output even if one modality is lost. The predictions of classifiers based on different modalities can be combined in various ways, and the selection of a combination method remains an open issue.
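As a minimal sketch of why late fusion degrades gracefully, the fused prediction can be computed from whichever unimodal class-probability vectors are available. The function name, class count, and probability values below are illustrative assumptions, not taken from the thesis:

```python
import numpy as np

def late_fuse(predictions):
    """Average the class-probability vectors of the unimodal
    classifiers that produced an output; None marks a lost modality."""
    available = [p for p in predictions if p is not None]
    return np.mean(available, axis=0)

audio = np.array([0.7, 0.2, 0.1])   # hypothetical audio-branch output
visual = None                       # e.g. the camera feed is lost
print(late_fuse([audio, visual]))   # falls back to the audio prediction alone
```

Because each modality is classified independently, losing one input only removes a term from the average instead of breaking the whole pipeline.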

This thesis studies the accuracies of scene classifiers based on the audio and visual modalities. The performance of a multimodal ensemble classifier is also compared to that of the unimodal classifiers. The ensemble model consists of all the studied unimodal classifiers and a combiner, which can merge the predictions of the base classifiers in different ways.

Two classifiers are built for each of the audio and visual modalities. These unimodal classifiers rely on pre-trained models for feature extraction and on small trainable neural networks for classification. The pre-trained models apply transfer learning, as they were originally trained to recognize objects rather than scenes. The selected audio models are VGGish and L3, whereas the visual ones are ResNet and L3.
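As a rough illustration of this architecture, the visual branch could pair a frozen pre-trained backbone with a small trainable head, as in the PyTorch sketch below. The ResNet variant, layer sizes, and ten-class output are assumptions for illustration; the abstract does not specify them:

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen pre-trained ResNet as the feature extractor (transfer
# learning: the backbone was originally trained for object recognition).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()          # expose the 2048-d feature vector
for p in backbone.parameters():
    p.requires_grad = False          # only the head below is trained

# Small trainable classifier head for the scene classes.
head = nn.Sequential(
    nn.Linear(2048, 128),
    nn.ReLU(),
    nn.Linear(128, 10),              # assumed: 10 scene classes
)

x = torch.randn(4, 3, 224, 224)      # a dummy batch of images
with torch.no_grad():
    features = backbone(x)           # feature extraction, no gradients
logits = head(features)              # shape: (4, 10)
```

An audio branch would follow the same pattern, with VGGish or L3 embeddings in place of the ResNet features.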

The experiments show that classifiers using the same modality perform similarly, obtaining their highest and lowest scores on the same scenes, with some exceptions. Both audio classifiers perform better than the visual classifiers on certain scenes and vice versa, although a classifier based on the visual modality works better on average. The chosen pre-trained model also has a clear impact: in both modalities, one of the two classifiers reaches lower accuracies. The best-performing combiners in the experiments are a simple mean combiner and its class-weighted variant. Compared to the base classifiers, the best ensemble classifier obtains a higher average accuracy, and only a single scene out of ten receives a lower classification score.
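A minimal sketch of these two combiners follows. The class-weighted variant here derives its weights from assumed per-class validation scores of each base classifier, which may differ from the exact weighting used in the thesis:

```python
import numpy as np

def mean_combiner(P):
    """P: (n_classifiers, n_classes) class probabilities for one
    sample; the fused prediction is the plain column-wise mean."""
    return P.mean(axis=0)

def class_weighted_mean(P, W):
    """W: (n_classifiers, n_classes) per-class weights, e.g. each
    base classifier's per-class validation accuracy. Weights are
    normalized per class so they sum to one over the classifiers."""
    W = W / W.sum(axis=0, keepdims=True)
    return (W * P).sum(axis=0)

# Four base classifiers (e.g. VGGish, audio L3, ResNet, visual L3)
# over three scene classes; all numbers are made up for illustration.
P = np.array([[0.6, 0.3, 0.1],
              [0.5, 0.4, 0.1],
              [0.2, 0.7, 0.1],
              [0.3, 0.5, 0.2]])
W = np.array([[0.8, 0.6, 0.7],
              [0.7, 0.6, 0.6],
              [0.9, 0.8, 0.7],
              [0.8, 0.7, 0.6]])
print(np.argmax(mean_combiner(P)))           # fused prediction, equal weights
print(np.argmax(class_weighted_mean(P, W)))  # per-class weighted prediction
```

The class-weighted variant lets a classifier that is reliable only on some scenes contribute more on exactly those scenes.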