Accuracy of ARKit for Creating 6D Object Pose Estimation Datasets
Lauronen, Henrik (2024)
Master's Programme in Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
Date of acceptance
2024-07-30
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202407027484
Abstract
In robotics, autonomous vehicles, and augmented reality, there is often a need to estimate the position and orientation of an object. This problem is known as 6D object pose estimation, and many state-of-the-art solutions use deep learning methods, which require large amounts of training and testing data. Creating datasets for pose estimation algorithms is difficult, as it requires high-quality 3D reconstructions of the desired objects, manual annotation, and offline processing.
Augmented reality is a technique for displaying synthetic content in real time, overlaid on a video so that it blends into the scene. The content often reacts to changes in the scene and to user input. Augmented reality could therefore provide a way to create training datasets with minimal offline processing. A key component of augmented reality is visual-inertial odometry (VIO), which tracks the 6D pose of the camera. The pose is estimated from sensor data with the help of image processing techniques. VIO algorithms often require a trade-off between accuracy and computational efficiency. In real-time applications especially, the VIO algorithm must be fast enough to produce estimates at a reasonable frame rate. This often requires optimizations in the algorithm that reduce tracking accuracy.
ARKit is an augmented reality framework by Apple. For this thesis, we create an iPhone application that uses ARKit to generate 6D object pose estimation datasets. In the application, the user can place a bounding box around an object and record a video together with the camera poses and camera intrinsics. We also evaluate the tracking accuracy of ARKit's VIO algorithm. Measurements are conducted in a motion capture studio, where the iPhone is attached to a camera rig tracked by a motion capture setup. The tracking accuracy is evaluated by comparing the data recorded by the iPhone with the data recorded by the motion capture setup.
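To illustrate how a recorded camera pose and camera intrinsics can turn a user-placed bounding box into per-frame image annotations, the following is a minimal sketch and not the application's actual code. It assumes a plain pinhole model, a 4x4 camera-to-world transform, and a camera frame whose z axis points forward into the scene; ARKit's own axis conventions may differ and would need a conversion first. The function names and all numeric values are hypothetical.

```python
def invert_rigid(transform):
    """Invert a 4x4 rigid transform (rotation plus translation) given as nested lists."""
    rotation = [row[:3] for row in transform[:3]]
    translation = [row[3] for row in transform[:3]]
    # The inverse of a rotation matrix is its transpose.
    rotation_t = [[rotation[j][i] for j in range(3)] for i in range(3)]
    translation_inv = [
        -sum(rotation_t[i][k] * translation[k] for k in range(3)) for i in range(3)
    ]
    inverse = [rotation_t[i] + [translation_inv[i]] for i in range(3)]
    inverse.append([0.0, 0.0, 0.0, 1.0])
    return inverse


def box_corners(center, size):
    """Return the 8 corners of an axis-aligned box in world coordinates."""
    cx, cy, cz = center
    hx, hy, hz = (s / 2.0 for s in size)
    return [
        (cx + sx * hx, cy + sy * hy, cz + sz * hz)
        for sx in (-1, 1)
        for sy in (-1, 1)
        for sz in (-1, 1)
    ]


def project_points(points_world, cam_to_world, fx, fy, cx, cy):
    """Project world points to pixel coordinates; points behind the camera give None."""
    world_to_cam = invert_rigid(cam_to_world)
    pixels = []
    for point in points_world:
        # Transform the point from world coordinates into the camera frame.
        cam = [
            sum(world_to_cam[i][k] * point[k] for k in range(3)) + world_to_cam[i][3]
            for i in range(3)
        ]
        x, y, z = cam
        if z <= 0:
            pixels.append(None)  # Assumed convention: the camera looks along +z.
            continue
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


# Hypothetical example: camera at the world origin, box 2 m in front of it.
identity_pose = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
corners = box_corners(center=(0.0, 0.0, 2.0), size=(0.2, 0.2, 0.2))
corner_pixels = project_points(corners, identity_pose, 1000.0, 1000.0, 320.0, 240.0)
```

Because the projection depends directly on the recorded pose, any error in the tracked camera pose shifts the projected corners, which is why the tracking accuracy matters for the quality of the dataset.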
A total of four sequences were recorded for the experiments. The results show that the tracking accuracy of ARKit in the motion capture studio was poor: the trajectories obtained by ARKit differed from the ground truth by 20-40 centimeters. A possible cause of the poor tracking is the textureless surfaces of the motion capture studio.
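One common way to quantify such a trajectory difference is the per-frame Euclidean distance between estimated and ground truth camera positions, summarized as a root mean square error. The sketch below shows that calculation; the abstract does not state which metric the thesis uses, so this is an illustrative assumption. It further assumes that both trajectories are already time-synchronized and expressed in the same coordinate frame, and the sample positions are hypothetical, not measurements from the thesis.

```python
import math


def position_errors(estimated, reference):
    """Per-frame Euclidean distance (in meters) between two position sequences."""
    if len(estimated) != len(reference):
        raise ValueError("trajectories must have the same number of frames")
    return [math.dist(e, r) for e, r in zip(estimated, reference)]


def rmse(errors):
    """Root mean square of a non-empty list of errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical positions in meters, for illustration only.
estimated_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference_positions = [(0.0, 0.0, 0.3), (1.0, 0.4, 0.0)]

errors = position_errors(estimated_positions, reference_positions)
trajectory_rmse = rmse(errors)
```

In practice, the two recordings would first have to be aligned in time and space, since the iPhone and the motion capture setup each report poses in their own coordinate frame.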