Audiovisual Sensing and Context Recognition for Mobile Devices
Roininen, Mikko Joonas (2010)
Degree Programme in Information Technology
Faculty of Computing and Electrical Engineering
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Acceptance date
2010-11-03
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tty-201011121363
Abstract
Effective automatic analysis methods are required for the thorough utilization of extensive video collections. Content-based video analysis provides means for automated content interpretation. A popular approach is to treat content-based video analysis as a machine learning problem and apply supervised classification methods to low-level features extracted from the video key frames. Video recordings are usually accompanied by audio streams carrying complementary information, which can be a valuable asset in aiding the semantic analysis process. The fusion of information from the visual and audio modalities has thus been studied in the literature. Multimodal fusion is a nontrivial task due to issues related to processing timing, synchronization, and variance in modality applicability. The recognition of the surrounding context from video recordings offers interesting possibilities for improved context awareness of mobile devices. For instance, the user profile of the mobile device can be adjusted and tailored services provided according to the context.
In this thesis, a content-based video context recognition framework is presented. The framework utilizes the visual and audio modalities. Support vector machine (SVM) classifiers trained on six visual low-level features are used for recognition in the visual modality. Acoustic context recognition is provided by a recently introduced audio-based context recognition system. An SVM-based fuser along with five rule-based fusion methods is integrated into the framework. In rule-based fusion, the separate classifiers are weighted with genetic-algorithm-optimized weights using the presented weighting functions. The framework and the fusion approaches are evaluated on real-world video data recorded in 21 daily-life contexts. Multimodal recognition is shown to outperform both unimodal approaches. The highest correct classification rate is achieved with SVM-based fusion.
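The weighted rule-based fusion described above can be illustrated as a weighted sum over per-modality classifier scores. The sketch below is an illustrative assumption, not the thesis implementation: the function name, the toy score vectors, and the fixed weights are hypothetical (in the thesis the weights are optimized with a genetic algorithm and several combination rules are compared).

```python
import numpy as np

def weighted_sum_fusion(scores, weights):
    """Combine per-classifier score vectors with per-classifier weights.

    scores  : list of 1-D score vectors, one per classifier/modality,
              each of length n_contexts
    weights : list of scalar weights, one per classifier/modality
    Returns the index of the winning context class.
    """
    fused = sum(w * np.asarray(s, dtype=float)
                for w, s in zip(weights, scores))
    return int(np.argmax(fused))

# Toy example: visual and audio posteriors over three contexts.
visual = [0.2, 0.5, 0.3]
audio = [0.6, 0.3, 0.1]
# Fused scores: [0.44, 0.38, 0.18] -> context 0 wins.
print(weighted_sum_fusion([visual, audio], [0.4, 0.6]))  # prints 0
```

In a sum-rule fuser of this kind, the weights let a more reliable modality (here audio) dominate contexts where the other modality is weak, which is the role the genetic-algorithm optimization plays in the framework.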