Hybrid ViViT–TimeSformer Transformer Architecture for Multimodal Slip Detection in Robotic Manipulation
Shovon, Tanvir Hasan (2025)
Master's Programme in Computing Sciences and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2025-12-23
Permanent address of the publication:
https://urn.fi/URN:NBN:fi:tuni-2025122312103
Abstract
Detecting slip during robotic grasping is essential for safe manipulation, as missing early slip events can lead to dropped items, damaged goods, or excessive grip forces. However, vision-based systems struggle when the gripper occludes the contact area, while tactile sensors provide local information but lack the spatial context needed to interpret object motion. Traditional approaches rely on manually tuned force thresholds adjusted for each object, and recent deep learning methods process visual and tactile information separately, making it difficult to capture how slip signatures evolve jointly across both modalities. This thesis proposes a hybrid transformer architecture using a ViViT encoder for RGB video and a TimeSformer encoder for tactile sequences, integrated through cross-modal self-attention fusion that learns to emphasize whichever modality provides more reliable evidence. The approach is evaluated on a public visuo-tactile dataset using two protocols: one tests generalization to unseen grasp configurations, while the other uses leave-one-out cross-validation to assess performance on completely novel objects. Results show the hybrid model achieves 96% accuracy on new grasps and 94% on unfamiliar objects, outperforming single-modality baselines. Ablation experiments show that the proposed attention-based fusion consistently outperforms simpler strategies such as weighted averaging and gated fusion. In addition, qualitative visualizations reveal that the model attends to physically meaningful regions, including contact deformation and shear patterns. Together, these observations suggest that the hybrid transformer learns slip-related dynamics that generalize beyond individual objects, making attention-based fusion a practical choice for robust robotic perception.
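To make the fusion idea concrete, the sketch below illustrates the general mechanism the abstract describes: tokens from the two modality encoders are concatenated and passed through one self-attention step, so every token can weight evidence from either modality. This is a minimal NumPy illustration, not the thesis's actual implementation; the token counts, embedding size, and identity projections (standing in for learned query/key/value weights) are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(vis_tokens, tac_tokens):
    """Concatenate visual and tactile tokens, then apply one
    self-attention pass so each token attends across both modalities.
    Identity projections stand in for learned W_q, W_k, W_v."""
    tokens = np.concatenate([vis_tokens, tac_tokens], axis=0)  # (Nv+Nt, d)
    d = tokens.shape[-1]
    q, k, v = tokens, tokens, tokens
    # Attention weights span both modalities, so a visual token can
    # up-weight tactile evidence (and vice versa) when it is more reliable.
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (Nv+Nt, Nv+Nt)
    return attn @ v

rng = np.random.default_rng(0)
vis = rng.standard_normal((8, 16))  # hypothetical tokens from the video encoder
tac = rng.standard_normal((4, 16))  # hypothetical tokens from the tactile encoder
fused = cross_modal_fusion(vis, tac)
print(fused.shape)  # (12, 16)
```

In the actual architecture these projections would be learned and the fused tokens would feed a classification head, but the core property is visible here: the attention matrix mixes the two token streams in a single joint space rather than processing each modality separately.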
