Thermal-RGB Dataset Creation and Annotation Using ImageBind and Pseudo-Ground Truth
Wu, Annan (2025)
Master's Programme in Computing Sciences and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Approval date
2025-12-31
The persistent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-2025123112297
Abstract
The integration of thermal and RGB imagery is increasingly important in computer vision, particularly for scene understanding under challenging lighting and visibility conditions. Thermal data improves robustness under low-light conditions; however, generating precise and coherent language descriptions from integrated RGB–thermal inputs remains a significant challenge. This thesis introduces a web-based multimodal image captioning system that integrates RGB and thermal imagery and uses large language models to refine descriptive captions.
The proposed system processes RGB and thermal images through various object detection and instance segmentation models. The outputs are integrated to produce an initial, template-based caption that includes object locations, segmentation coverage, and thermal temperature data, serving as pseudo-ground truth. The initial caption, along with the combined RGB–thermal image, is used to generate refined captions using two large language models under various input configurations.
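The fusion step described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the input format (a list of per-object dictionaries with `label`, `box`, `mask_area`, and `mean_temp` fields) and the caption wording are assumptions made for the example.

```python
def region_of(cx, width):
    """Map a box centre x-coordinate to a coarse location word."""
    if cx < width / 3:
        return "left"
    if cx < 2 * width / 3:
        return "center"
    return "right"


def build_template_caption(detections, width, height):
    """Fuse detector, segmenter, and thermal outputs into one
    template-based caption that can serve as pseudo-ground truth.

    detections: list of dicts with hypothetical keys:
        'label'     - object class from the RGB detector
        'box'       - (x1, y1, x2, y2) bounding box in pixels
        'mask_area' - segmented pixel count for the instance
        'mean_temp' - mean thermal reading (deg C) inside the mask
    """
    parts = []
    for det in detections:
        x1, _, x2, _ = det["box"]
        coverage = 100.0 * det["mask_area"] / (width * height)
        parts.append(
            f"a {det['label']} on the {region_of((x1 + x2) / 2, width)} "
            f"covering {coverage:.1f}% of the frame "
            f"at about {det['mean_temp']:.0f} degC"
        )
    return "The scene contains " + "; ".join(parts) + "."
```

A caption produced this way encodes object location, segmentation coverage, and thermal statistics in a fixed sentence pattern; the large language models then rewrite it into fluent prose.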
Caption quality is assessed through both automated and human-centered approaches. Automated evaluation uses ImageBind to assess semantic alignment between images and generated captions in a shared embedding space. A user study is conducted to evaluate caption quality with respect to accuracy, detail, fluency, and thermal relevance. The evaluation uses a dataset of 280 RGB–thermal image pairs drawn from two datasets.
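The alignment scoring in the shared embedding space amounts to a cosine similarity between paired embeddings. In the pipeline above these embeddings would come from ImageBind's vision and text encoders; in this sketch they are plain arrays, so only the assumed scoring step is shown, not the encoder calls.

```python
import numpy as np


def alignment_scores(image_embs, text_embs):
    """Cosine similarity between paired image and caption embeddings.

    image_embs, text_embs: (N, D) arrays, where row i holds the
    embedding of image i and of its generated caption. In the thesis
    these would be ImageBind outputs; here they are arbitrary vectors.
    Returns an (N,) array of per-pair similarity scores in [-1, 1].
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return np.sum(img * txt, axis=1)
```

A higher score indicates closer semantic alignment between a caption and its image, which is the signal the automated evaluation compares against the human judgments.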
The findings demonstrate that large language models can generate coherent and informative captions from RGB–thermal inputs. Nonetheless, the use of template-based initial captions does not reliably improve caption quality and may reduce clarity. Moreover, ImageBind results do not consistently align with human assessments, highlighting the necessity for additional user-based evaluation. This study underscores the advantages and constraints of LLM-based refinement in RGB–thermal image captioning.
