Trepo
Development and evaluation of Retrieval-Augmented Generation methods for document search and question-answering

Korkee, Roope (2025)

 

Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
Date of approval
2025-05-17
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202505125344
Abstract
This study investigates the development and evaluation of Retrieval-Augmented Generation (RAG) systems for document-based question answering. The aim is to build a modular base system and analyze its main components. This approach guides the design of a tool with which users can retrieve information from document collections and interact with them through natural language queries. As large language models (LLMs) continue to develop, it is increasingly important to understand how to integrate the retrieval and generation components effectively in order to build accurate, responsive, and multilingual information systems.
The system is built around two main pipelines: document processing and question answering. Documents are cleaned by keeping only their textual content, then split into chunks, embedded, and stored in a vector database. The retrieval component handles user queries: it identifies the relevant stored text chunks, combines them with the query, and forwards the result to the language model, which generates the answer. The system is built with open-source tools and evaluated on multilingual datasets to assess its performance under realistic conditions.
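The two pipelines described above can be sketched roughly as follows. This is a minimal illustration only, not the implementation from the thesis: the word-count `embed` function stands in for a real embedding model, the in-memory `VectorStore` class stands in for a vector database, and all function names, the fixed-size splitting, and the prompt wording are assumptions made for this example. The final call to a language model is left out; the sketch stops at the prompt that would be forwarded to it.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts instead of a real model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_text(text: str, max_words: int = 40) -> list[str]:
    """Split cleaned text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


class VectorStore:
    """In-memory stand-in (assumption) for a vector database."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, chunks: list[str]) -> None:
        # Document pipeline: embed each chunk and store it with its vector.
        for chunk in chunks:
            self.items.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 2) -> list[str]:
        # Question pipeline: rank stored chunks by similarity to the query.
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Combine the retrieved chunks with the query for the language model."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical example data, already cleaned and split into chunks.
store = VectorStore()
store.add([
    "The thesis defence is held in Tampere.",
    "Embeddings are stored in a vector database.",
    "Bananas are yellow.",
])
query = "Where are embeddings stored?"
top_chunks = store.search(query, k=1)
prompt = build_prompt(query, top_chunks)
print(prompt)
```

In a real system, `embed` would call an embedding model and `VectorStore` would be replaced by an actual vector database; the structure of the two pipelines stays the same.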
The evaluation compared different document splitting methods, embedding models, and language models. The results show that retrieval quality is a key factor in overall system performance, because the retrieved text chunks directly influence the answer generation phase. Semantic chunking combined with sentence-level document splitting balanced retrieval accuracy and processing efficiency, preserving contextual content while keeping chunk sizes suitable for embedding. A trade-off was observed between accuracy, storage requirements, and multilingual performance: some embedding methods required significant storage but offered higher accuracy, while others prioritized speed or performed better across multiple languages at the cost of retrieval accuracy. Response quality varied between language models, and medium-sized models offered the best trade-off between efficiency and reliability.
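One way to picture semantic chunking on top of sentence-level splitting is the greedy sketch below. It is an assumption-laden illustration, not the method evaluated in the thesis: the abstract does not specify the algorithm, so the regex sentence splitter, the word-count `embed` function (a stand-in for a real embedding model), and the `threshold` and `max_words` parameters are all hypothetical choices. Consecutive sentences are merged into one chunk while they stay similar to the chunk so far and the chunk remains under the size limit.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts instead of a real model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def semantic_chunks(text: str, threshold: float = 0.2, max_words: int = 60) -> list[str]:
    """Greedily merge adjacent sentences that are similar and fit the size limit."""
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        if current:
            joined = " ".join(current)
            similar = cosine(embed(joined), embed(sentence)) >= threshold
            fits = len(joined.split()) + len(sentence.split()) <= max_words
            if similar and fits:
                current.append(sentence)
                continue
            # Topic shift or size limit reached: close the current chunk.
            chunks.append(joined)
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hypothetical example text with one topic shift.
text = "Cats chase mice. Cats also chase birds. Stock markets fell today."
for chunk in semantic_chunks(text):
    print(chunk)
```

The size limit reflects the point made above that chunks must remain suitable for embedding, while the similarity check is what keeps related sentences, and therefore context, together.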
Although the tool performed well overall, challenges remained in keeping search relevance consistent, in the language models' sensitivity to noise, and in their adherence to the given instructions. The results highlight the importance of component-level optimization when designing RAG systems and provide insight into how different configurations affect overall performance. The developed architecture provides a flexible basis for future development, including multilingual and interactive document retrieval features.
Collections
  • Theses - Master's degree [42012]
Kalevantie 5
P.O. Box 617
33014 Tampere University
oa[@]tuni.fi
 

 
