Comparative Analysis of LLM System Architectures for a Teaching Assistant Tool
Rahman, Sayem (2025)
Tietojenkäsittelyopin maisteriohjelma - Master's Programme in Computer Science
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
Date of approval
2025-12-21
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-2025121911968
Abstract
While Large Language Models automate educational content generation, empirical
guidance on architectural selection remains limited. This thesis compares five
architectural patterns: Monolithic, RAG, Fine-Tuned, Multi-Agent, and Hybrid, across
ninety test cases spanning summarization, quiz generation, and QA tasks.
Evaluation utilized LLM-as-judge scoring, latency measurements, and automated
similarity metrics.
The study indicates that the Multi-Agent architecture delivers the highest quality
but at significant latency cost, whereas the Hybrid architecture offers the optimal
balance of quality and speed. Surprisingly, the simple Monolithic baseline
outperformed complex RAG and Fine-Tuned models. Crucially, negative correlations
between judge scores and automated metrics revealed that high-quality pedagogical
content naturally diverges from reference answers, rendering standard similarity
metrics less reliable. Finally, performance proved highly task-dependent; no single
architecture excelled universally, suggesting that educational AI systems require
task-aware routing rather than uniform solutions. This research contributes a
validated framework and critical evidence regarding architectural trade-offs in
education.
