Real time deep learning based stereo reconstruction
Chaplygin, Denis (2022)
Master's Programme in Computing Sciences
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of acceptance
2022-06-21
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202206175721
Abstract
3D computer vision plays a principal role in various domains, such as robotics, autonomous navigation and localization, virtual and augmented reality, 3D scene reconstruction and understanding, and tracking and surveillance. In addition, ongoing research continues to advance current 3D technology and integrate it into marketable products.
However, the current state of multi-view systems forces a choice of two out of three properties: quality of the result, computing performance, or cost of the solution. Algorithmic stereo disparity estimation is fast and cheap, but its overall quality is low. LiDAR-based systems are fast and accurate, but expensive. Finally, deep-learning-based methods are inexpensive and provide relatively good quality, but their inference time is too high for practical use.
Therefore, the main objective of this thesis is to verify whether a reasonable inference time of one second per frame or less can be achieved when using a neural network for stereo disparity calculation, while keeping quality significantly better than that of the algorithmic approach. The hypothesis is that, with dedicated modern tensor processing accelerators, even portable, low-power embedded systems will be able to reconstruct depth information from a high-resolution stereo camera setup within the real-time constraint.
To verify the hypothesis described above, a high-level survey of the current landscape of stereo neural networks was conducted, together with tests of whether those networks can be executed on a set of tensor processing units and of how those units perform.
The performance of the selected networks was estimated on the Nvidia Xavier platform and the Intel NCS2 VPU platform, also known as the Intel Movidius platform. Both platforms take a similar approach to high-performance neural network execution: to achieve minimum inference time, the neural network graph and trained weights are compiled into an application-specific, platform-bound, highly optimized inference kernel.
After applying the optimizations mentioned above, it proved possible to achieve an inference time below one second for a single frame, at the cost of lower input image resolution and reduced calculation precision. Nevertheless, quantitative analysis of stereo disparity estimation quality shows that the estimation precision achieved with the accelerated neural network is more than two times better than that of the algorithmic approach.
This confirms the hypothesis, but raises new questions about the possibility of improving resolution and depth prediction precision.
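For readers unfamiliar with how depth information is recovered from stereo disparity, the standard relation for a rectified camera pair is depth = focal length × baseline / disparity. The following is a minimal sketch of that relation only, not the thesis implementation; the function name `disparity_to_depth`, its parameters, and the sample values are assumptions made for illustration.

```python
import math


def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Convert a disparity map to a depth map for a rectified stereo pair.

    disparity_px    -- list of rows, each a list of disparities in pixels
    focal_length_px -- focal length in pixels (hypothetical calibration value)
    baseline_m      -- distance between the camera centres in metres
    Non-positive disparities carry no depth information and map to infinity.
    """
    scale = focal_length_px * baseline_m
    return [
        [scale / d if d > 0 else math.inf for d in row]
        for row in disparity_px
    ]


# Illustrative values only: 800 px focal length, 0.5 m baseline.
sample_disparity = [[100.0, 50.0], [0.0, 200.0]]
sample_depth = disparity_to_depth(sample_disparity, 800.0, 0.5)
```

Because depth is inversely proportional to disparity, a fixed disparity error produces a larger depth error for distant objects, which is one reason disparity estimation quality matters for reconstruction.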