Large Language Models In High-Level Software Testing: A Literature Review
Grönberg, Toni (2026)
Master's Programme in Computer Science
Faculty of Information Technology and Communication Sciences
Date of approval
2026-02-11
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202602102428
Abstract
Large language models (LLMs) have rapidly become influential tools in software engineering, raising new possibilities for supporting software testing across multiple levels of abstraction. Early applications of LLMs have focused strongly on low-level, code-centric tasks such as unit test generation, code repair, and test oracle inference. In contrast, high-level testing activities—including system, integration, acceptance, security, and GUI testing—have received considerably less attention despite their central role in validating the behaviour, reliability, and quality of complex software systems. This thesis investigates how LLMs are currently utilised in high-level software testing and examines the extent to which emerging research addresses the challenges inherent to these tasks.
The thesis begins by introducing the foundations of software testing, defining test levels relevant to high-level dynamic testing, and outlining the essential properties of modern LLMs that enable their use in testing workflows. It then examines the traditional use of LLMs in testing, highlighting the limitations of early prompt-based approaches that relied on narrow context windows, limited autonomy, and weak integration with execution environments. Building on this, the thesis explores the rise of LLM-based testing agents capable of iterative reasoning, tool use, and adaptive test generation within feedback-driven loops. These agentic approaches represent a major shift from static prompt usage toward more autonomous testing systems that can interact with complex software artefacts.
The analysis of high-level application areas demonstrates that LLM-supported methods are emerging in security testing, system testing, GUI exploration, acceptance testing, requirements analysis, and migration testing. Across these domains, LLMs contribute by interpreting natural-language specifications, generating structured scenarios, analysing outputs, and navigating interfaces in ways that traditional scripted automation struggles to achieve. At the same time, recurring limitations, such as hallucination, reproducibility issues, integration challenges, and inconsistent evaluation practices, highlight that LLM-based testing remains an evolving area with significant room for maturation.
The thesis concludes that while high-level software testing has only recently begun to benefit from LLM-supported techniques, the field is progressing rapidly through the development of agent-based systems and hybrid architectures. LLMs are not yet reliable enough to replace human expertise in high-level testing, but they increasingly serve as capable collaborators that can accelerate test design, extend automation into previously inaccessible tasks, and support exploratory analysis. These findings suggest that continued research, better evaluation standards, and domain-informed model grounding are essential for advancing the practical adoption of LLMs in high-level software testing.
