Development and evaluation of Retrieval-Augmented Generation methods for document search and question-answering
Korkee, Roope (2025)
Tietotekniikan DI-ohjelma - Master's Programme in Information Technology
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
Date of approval
2025-05-17
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202505125344
Abstract
This study investigates the development and evaluation of document-based question-answering systems built on Retrieval-Augmented Generation (RAG). The aim is to create a modular base system and analyze its main components. This approach guides the tool's design so that users can retrieve information from and interact with document collections using natural-language queries. As large language models (LLMs) continue to develop, it is increasingly important to understand how to integrate the retrieval and generation components effectively in order to build accurate, responsive, and multilingual information systems.
The system is built around two main pipelines: document processing and question answering. Documents are cleaned by retaining only their textual content, then split, embedded, and stored in a vector database. The retrieval component handles user queries: it identifies the relevant stored text chunks, combines them with the query, and forwards the result to the language model for answering. The system is built with open-source tools and evaluated on multilingual datasets to assess its performance under realistic conditions.
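The two pipelines described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the thesis implementation: a toy bag-of-words embedding stands in for the actual embedding model, and a formatted prompt stands in for the language-model call. All function names here are illustrative.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real system would use a sentence-embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def index_documents(docs, chunk_size=2):
    """Document pipeline: split each document into sentence chunks, embed, and store."""
    store = []
    for doc in docs:
        sentences = [s.strip() for s in doc.split(".") if s.strip()]
        for i in range(0, len(sentences), chunk_size):
            chunk = ". ".join(sentences[i:i + chunk_size])
            store.append((chunk, embed(chunk)))
    return store

def answer(query, store, top_k=1):
    """Question pipeline: retrieve the most similar chunks and build the prompt."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    context = "\n".join(chunk for chunk, _ in ranked[:top_k])
    # In the full system this prompt would be forwarded to a language model.
    return f"Context:\n{context}\n\nQuestion: {query}"

store = index_documents(["RAG combines retrieval with generation. It grounds answers in documents."])
print(answer("What does RAG combine?", store))
```

In a production pipeline, the list `store` would be replaced by a persistent vector database and `embed` by a multilingual embedding model, but the control flow stays the same.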
The evaluation compared different document-splitting methods, embedding models, and language models. The results show that retrieval quality is a key factor in overall system performance, as the retrieved text chunks directly influence the answer generation phase. Semantic chunking combined with sentence-level document splitting balanced retrieval accuracy and processing efficiency, preserving contextual content while keeping chunk sizes suitable for embedding. A trade-off was observed between accuracy, storage requirements, and multilingual performance: some embedding models required significant storage but offered higher accuracy, while others prioritized speed or performed better across languages at the cost of retrieval accuracy. Response quality varied between language models; medium-sized models offered the best trade-off between efficiency and reliability.
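The idea behind semantic chunking with sentence-level splitting can be sketched as follows. This is a simplified stand-in, not the method evaluated in the thesis: word-set overlap replaces the embedding similarity a real semantic chunker would compute, and the threshold value is arbitrary.

```python
def similarity(a, b):
    """Jaccard overlap of word sets — a crude stand-in for embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_chunks(text, threshold=0.1):
    """Split text into sentences, then merge adjacent sentences that are
    semantically similar, so each chunk stays topically coherent."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    chunks = [sentences[0]] if sentences else []
    for s in sentences[1:]:
        if similarity(chunks[-1], s) >= threshold:
            chunks[-1] = chunks[-1] + ". " + s   # same topic: extend current chunk
        else:
            chunks.append(s)                      # topic shift: start a new chunk
    return chunks

print(semantic_chunks("Cats sleep a lot. Cats like warm places. Stock markets fell today."))
```

Sentence-level splitting gives the chunker fine-grained units to merge, which is how the combination preserves context without letting chunks grow past what the embedding model handles well.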
Although the tool performed well overall, challenges remained in ensuring consistently relevant retrieval, handling the language models' sensitivity to noise, and enforcing adherence to the given instructions. The results highlight the importance of component-level optimization when designing RAG systems and provide insight into how different configurations affect overall performance. The developed architecture offers a flexible basis for future development of multilingual and interactive document retrieval features.
