Investigating T-BERT for Automated Issue–Commit Link Recovery
Parveen, Risha (2025)
Master's Programme in Computing Sciences and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2025-06-02
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202505306425
Abstract
Traceability links between software artifacts such as issues, commits, and pull requests are essential for maintaining and evolving modern software systems. These links help teams understand the rationale behind code changes, manage maintenance activities, and ensure compliance with standards. In real-world repositories, however, such links are often missing, inconsistent, or incomplete because maintaining them requires manual effort.
This thesis investigates the use of a transformer-based language model, T-BERT, specifically adapted for traceability link recovery, to automatically identify issue–commit links in software projects. The study evaluates T-BERT’s performance across a diverse set of Python and JavaScript repositories, covering student-led, research, and real-world open-source projects. A language-specific fine-tuning strategy is applied, and the model is assessed using empirical experiments, semantic analysis, quantitative performance metrics, and qualitative developer validation.
The results demonstrate that T-BERT effectively recovers both documented and undocumented links, particularly in structured and well-documented environments, offering meaningful improvements over manual or heuristic approaches. However, challenges arise when dealing with limited documentation, noisy data, and semantic ambiguity, which can reduce the model’s reliability in more complex or less disciplined projects. Developer feedback highlights that while AI-assisted traceability can significantly reduce manual workload, its practical adoption depends on the explainability of predictions and alignment with developer workflows.
This thesis provides empirical evidence of the strengths and limitations of AI-supported traceability, offers practical insights for integrating such tools into software development pipelines, and outlines directions for future research to improve automation, explainability, robustness, and developer trust.