Reliable and Configurable Human–Robot Collaboration Framework Based on Hand Gesture Recognition Using Emoji-Guided Vision-Language Models
Haikonen, Aaro (2026)
Master's Programme in Automation Engineering
Faculty of Engineering and Natural Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2026-02-16
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202602162523
Abstract
The Industry 5.0 paradigm emphasizes human-centric, sustainable, and resilient industrial systems that address the shortcomings of current industrial paradigms. Human–Robot Collaboration (HRC), in which a human and a robot operate in tandem, is a key enabler of Industry 5.0. While collaborative robots are increasingly deployed in manufacturing, current HRC solutions often lack adaptability and require operators to conform to rigid interaction models. Recent advances in artificial intelligence, particularly Vision-Language Models (VLMs), offer new possibilities for more intuitive and human-centric collaboration by enabling robots to perceive, interpret, and respond to human actions in a flexible manner. This allows the robot to adapt to the behavior and preferences of the operator instead of vice versa.
This thesis investigates the integration of a Vision-Language Model into a human–robot collaboration system to enhance a collaborative manufacturing task. A modular and configurable framework is proposed in which hand gesture recognition, emoji-based reasoning, and robot control are combined into a unified system architecture. The framework is implemented using a locally executed VLM and integrated with a collaborative robot to support a shared assembly task. The system enables multimodal interaction through hand signal input, allowing the robot to adapt its behavior to the operator's actions and preferences.
The proposed system is evaluated through an experiment-based user study focusing on efficiency, reliability, and user experience. The results indicate that integrating a Vision-Language Model can improve task understanding and flexibility in HRC while maintaining safe operation. Furthermore, user feedback suggests increased performance and improved perceived collaboration compared to conventional interaction methods. The findings demonstrate the potential of Vision-Language Models to support more adaptive and human-centric human–robot collaboration systems in industrial environments.
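
As a rough illustration of the pipeline the abstract describes, the following Python sketch maps a recognized gesture label to an emoji token, embeds the emoji in a prompt for the VLM, and constrains the model's free-text reply to a small robot command set. Every name here (GESTURE_TO_EMOJI, RobotCommand, build_prompt, the command vocabulary) is a hypothetical assumption for illustration, not the thesis implementation.

from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical mapping from recognized gesture labels to the emoji tokens
# the VLM reasons over (an assumed detail of the "emoji-guided" design).
GESTURE_TO_EMOJI = {
    "thumbs_up": "👍",  # confirm / proceed
    "open_palm": "✋",  # stop / pause
    "pointing": "👉",   # indicate the next part
}

@dataclass
class RobotCommand:
    action: str            # e.g. "pick", "pause", "handover"
    target: Optional[str]  # optional part identifier

def build_prompt(emoji: str, task_state: str) -> str:
    # Ground the emoji in the current task so the VLM can reason about it.
    return (f"The operator signaled {emoji} during step '{task_state}'. "
            "Reply with exactly one of: pick, pause, handover.")

def parse_reply(reply: str) -> RobotCommand:
    # Constrain the VLM's free-text reply to a small, safe command set.
    for action in ("pick", "handover", "pause"):
        if action in reply.lower():
            return RobotCommand(action, None)
    return RobotCommand("pause", None)  # unrecognized reply -> safe default

def step(gesture: str, task_state: str,
         query_vlm: Callable[[str], str]) -> RobotCommand:
    # One control cycle: gesture -> emoji -> VLM -> robot command.
    emoji = GESTURE_TO_EMOJI.get(gesture, "✋")  # unknown gesture -> stop
    return parse_reply(query_vlm(build_prompt(emoji, task_state)))

if __name__ == "__main__":
    # Stub VLM for demonstration; the thesis uses a locally executed model.
    print(step("thumbs_up", "base_assembly", lambda prompt: "pick"))

Defaulting both unknown gestures and unparsable replies to a pause action mirrors the abstract's emphasis on maintaining safe operation while the VLM handles flexible interpretation.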
