Scene Perception through Audio-Visual Learning
Zhu, Lingyu
Tampere University
2024
Doctoral Programme in Computing and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Defense date
2024-11-15
The permanent address of the publication is
https://urn.fi/URN:ISBN:978-952-03-3525-0
Abstract
Perceiving events and scenes is an experience that intrinsically involves the co-occurrence of multiple sensations, particularly looking and listening. However, most advances in perception learning have been made by modeling looking or listening in isolation. In practice, the visual signal, such as object appearance and motion, can provide valuable information for audio-related tasks. Likewise, the sound accompanying vision often supplies informative cues about object properties, e.g., location. In this dissertation, I show that the visual or auditory signal of a scene can serve as a useful cue for enhancing and complementing the other modality through learned audio-visual models. In particular, given a large set of multi-modal observations, we propose computational models that leverage appearance and motion from video, together with temporal information within audio, to identify audio components for visually guided sound source separation. Moreover, we extract and apply structural information from vision (e.g., RGB or depth images) and audio for 3D environment perception. We assess the proposed methods against recent baselines on multiple challenging datasets, where they show promising performance on visually guided sound source separation and 3D perception tasks. The contributions are summarized below.
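The separation models in this dissertation follow the broad audio-visual paradigm of predicting time-frequency masks over a mixture spectrogram, conditioned on visual features of the target source. Below is a minimal sketch of that paradigm, assuming an illustrative FiLM-style conditioning; the module names, layer sizes, and fusion scheme are assumptions for exposition, not the dissertation's exact architecture.

```python
import torch
import torch.nn as nn

class VisuallyGuidedSeparator(nn.Module):
    """Predicts a spectrogram mask for one source, conditioned on
    visual features of that source (illustrative architecture)."""

    def __init__(self, vis_dim=512, audio_channels=32):
        super().__init__()
        # Audio encoder: mixture magnitude spectrogram -> feature map.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, audio_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(audio_channels, audio_channels, 3, padding=1), nn.ReLU(),
        )
        # Project the visual feature to modulate audio features.
        self.vis_proj = nn.Linear(vis_dim, audio_channels)
        # Mask head: per time-frequency-bin ratio mask in [0, 1].
        self.mask_head = nn.Sequential(
            nn.Conv2d(audio_channels, 1, 1), nn.Sigmoid(),
        )

    def forward(self, mix_spec, vis_feat):
        # mix_spec: (B, 1, F, T) magnitude spectrogram of the mixture
        # vis_feat: (B, vis_dim) appearance feature of the target source
        a = self.audio_enc(mix_spec)
        g = self.vis_proj(vis_feat)[:, :, None, None]   # (B, C, 1, 1)
        mask = self.mask_head(a * g)                    # (B, 1, F, T)
        return mask * mix_spec                          # separated spectrogram
```

Under the common mix-and-separate training scheme used in this line of work, audio tracks from two videos are summed into a synthetic mixture, and the model is trained to recover each track given its own video's visual feature.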
Visually guided sound source separation: We develop methods that learn, by looking at and listening to unlabeled videos, how objects with different visual attributes sound. First, we propose a new approach that recursively relocates sound components between sources using visual cues from all sound sources. Next, we present a simple yet efficient framework that optimally utilizes categorical information in source separation problems. Afterward, we study how explicit motion cues that coincide with the audio help isolate sound sources, especially when the sources are of a similar type. While visual information contains notable features that correlate with audio, audio carries inherent characteristics that are distinctive and instructive of visual events. To this end, we introduce a three-stream framework that explores slow and fast spectrogram streams together with visual information to learn efficient source separation solutions.
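To make the slow/fast audio idea concrete, the sketch below computes two spectrogram views of the same waveform: one with long analysis windows (fine frequency resolution, coarse time resolution) and one with short windows (coarse frequency, fine time). The window and hop lengths are assumptions for illustration, not the dissertation's settings.

```python
import torch

def slow_fast_spectrograms(wav):
    """Two complementary STFT views of the same waveform.
    'Slow': long windows, fine frequency / coarse time.
    'Fast': short windows, coarse frequency / fine time."""
    slow = torch.stft(wav, n_fft=1024, hop_length=256,
                      window=torch.hann_window(1024), return_complex=True)
    fast = torch.stft(wav, n_fft=256, hop_length=64,
                      window=torch.hann_window(256), return_complex=True)
    return slow.abs(), fast.abs()

wav = torch.randn(1, 16000 * 4)          # 4 s of audio at 16 kHz
slow, fast = slow_fast_spectrograms(wav)
print(slow.shape, fast.shape)            # (1, 513, 251) and (1, 129, 1001)
```

In a three-stream model, the two audio views and the visual features would each be encoded separately and then fused before mask prediction.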
Perceiving the 3D environment with echoes and vision: The geometric information of a 3D environment is typically preserved in multisensory signals, in this dissertation particularly echo and visual observations. While visual observations, such as RGB or depth images, provide detailed information, their field of view is often very limited. Binaural echoes, in contrast, naturally preserve a holistic picture of the 3D environment. To exploit this, I propose fusing RGB images with echoes to estimate depth beyond the visual field of view. In addition, I combine echoes with vision to perceive 3D information for embodied agent navigation.
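A minimal sketch of this kind of RGB-echo fusion follows, assuming an illustrative encoder-decoder in which a global echo embedding is broadcast over the image feature map before depth decoding; the layer sizes and the fusion scheme are assumptions, not the dissertation's exact model.

```python
import torch
import torch.nn as nn

class EchoRGBDepth(nn.Module):
    """Fuses an RGB image with binaural echo spectrograms to predict
    a depth map (illustrative encoder-decoder)."""

    def __init__(self):
        super().__init__()
        # RGB encoder: 3-channel image -> spatial feature map (1/4 scale).
        self.rgb_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Echo encoder: 2-channel (left/right) spectrograms -> global vector.
        self.echo_enc = nn.Sequential(
            nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Decoder: fused features -> single-channel depth map.
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, rgb, echo_spec):
        f_rgb = self.rgb_enc(rgb)                      # (B, 64, H/4, W/4)
        f_echo = self.echo_enc(echo_spec)              # (B, 64, 1, 1)
        f_echo = f_echo.expand(-1, -1, f_rgb.shape[2], f_rgb.shape[3])
        fused = torch.cat([f_rgb, f_echo], dim=1)      # (B, 128, H/4, W/4)
        return self.dec(fused)                         # (B, 1, H, W) depth
```

Because the echo embedding is global, it can inform depth predictions in regions the camera sees poorly or not at all, which is the intuition behind going beyond the visual field of view.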