On Monocular Depth Estimation for Scene Understanding in Robotics Applications
Mansour, Mostafa (2025)
Tampere University
2025
Doctoral Programme in Engineering Sciences
Faculty of Engineering and Natural Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of public defence
2025-08-19
Permanent address of the publication:
https://urn.fi/URN:ISBN:978-952-03-4001-8
Abstract
Depth estimation is a fundamental problem in computer vision and robotics, enabling machines to perceive and interact with their surroundings in three dimensions. Accurate depth perception is critical for applications such as autonomous navigation, robotic manipulation, augmented reality, and 3D scene reconstruction. Traditional methods rely on stereo vision or active sensors such as LiDAR, which provide precise depth measurements but come with increased hardware costs, computational requirements, and physical operational constraints. In contrast, monocular depth estimation, which extracts depth information from a single camera, offers a cost-effective and flexible alternative. However, monocular approaches face significant challenges due to the inherently ill-posed nature of depth estimation from single images.
This thesis explores and advances monocular depth estimation techniques, with a focus on integrating ego-motion, semantic priors, and deep learning-based methods to improve accuracy and robustness. The study systematically evaluates the comparative advantages of monocular and stereo depth estimation, analyzing their error characteristics and identifying their optimal use cases. A key contribution of this research is the development of a framework that fuses monocular image sequences with motion parameters to estimate depth more reliably, mitigating the scale ambiguity problem common in monocular approaches. Additionally, the research introduces an enhanced method that integrates semantic scene understanding with Bayesian inference, allowing for improved 3D position and velocity estimation of scene objects in dynamic environments.
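The scale recovery idea mentioned above can be illustrated with a toy sketch (the function and variable names here are hypothetical, not the thesis implementation): monocular reconstruction determines depth only up to an unknown global scale, but if the camera translation between two frames is also measured in metric units by ego-motion sensors, the scale factor can be recovered and applied to every depth.

```python
import numpy as np

def recover_metric_scale(mono_translation, odom_translation, relative_depths):
    """Rescale up-to-scale monocular depths using known ego-motion.

    mono_translation: camera translation estimated from monocular images
        (known only up to an arbitrary scale).
    odom_translation: the same translation measured in metric units,
        e.g. from wheel odometry or an IMU.
    relative_depths: depths from the monocular pipeline, in the same
        arbitrary scale as mono_translation.
    """
    scale = np.linalg.norm(odom_translation) / np.linalg.norm(mono_translation)
    return scale * relative_depths

# Toy example: the monocular pipeline reports a unit-length baseline,
# while odometry says the camera actually moved 0.5 m.
depths = recover_metric_scale(np.array([1.0, 0.0, 0.0]),
                              np.array([0.5, 0.0, 0.0]),
                              np.array([2.0, 4.0]))
# depths is now [1.0, 2.0], in metres
```

This only illustrates the geometry of the ambiguity; the thesis framework fuses full motion parameters with image sequences rather than a single ratio.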
Beyond depth estimation, the thesis also investigates the role of monocular depth information in photorealistic scene reconstruction. With the emergence of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) as state-of-the-art methods for scene representation, this work explores the integration of monocular depth estimation within the 3DGS framework. The proposed approach enables high-fidelity, photorealistic 3D scene reconstruction from monocular camera inputs, offering a viable alternative to depth sensor-based mapping techniques.
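One common way a monocular depth prior enters such a reconstruction pipeline, sketched here under the assumption of a simple additive training objective (all names are illustrative, not the thesis API): the photometric rendering loss is augmented with a term penalising disagreement between the depth rendered from the Gaussian splats and the monocular depth prediction.

```python
import numpy as np

def depth_regularised_loss(rendered_rgb, target_rgb,
                           rendered_depth, mono_depth, lam=0.1):
    """Illustrative combined objective for depth-guided reconstruction.

    Photometric MSE between the rendered and target images, plus an
    L1 penalty tying the rendered depth map to a monocular depth prior,
    weighted by lam. This is a sketch of the general idea only.
    """
    photometric = np.mean((rendered_rgb - target_rgb) ** 2)
    depth_prior = np.mean(np.abs(rendered_depth - mono_depth))
    return photometric + lam * depth_prior
```

In practice the depth prior is typically aligned in scale and shift before comparison, since monocular predictions are not metric by default.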
Experimental validation is conducted using real-world datasets, including benchmark vision and robotics datasets. The results demonstrate that the proposed methods significantly enhance monocular depth estimation accuracy, improve object localization in dynamic scenes, and achieve high-quality 3D reconstructions. The findings contribute to making monocular vision-based depth perception more accessible and practical for real-world applications in robotics, autonomous systems, and immersive virtual environments.
