Evaluation of Question Answering Systems: Complexity of Judging a Natural Language
Farea, Amer; Yang, Zhen; Duong, Kien; Perera, Nadeesha; Emmert-Streib, Frank (2025)
ACM Computing Surveys
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-2025121811909
Description
Peer reviewed
Abstract
Question answering (QA) systems form a leading and rapidly advancing area of natural language processing (NLP) research. One of their key advantages is that they enable more natural interactions between humans and machines, for instance in virtual assistants or search engines. Over the past few decades, many QA systems have been developed to handle diverse QA tasks. However, evaluating these systems is intricate: many of the available evaluation scores are not task-agnostic, and translating human judgment into measurable metrics remains an open issue. This survey provides a systematic overview of evaluation scores and introduces a taxonomy with two main branches: Human-Centric Evaluation Scores (HCES) and Automatic Evaluation Scores (AES). Since many of these scores were originally designed for specific tasks but have since been applied more broadly, we also cover the basics of QA frameworks and core paradigms to provide a deeper understanding of their capabilities and limitations. Lastly, we discuss benchmark datasets that are critical for conducting systematic evaluations across various QA tasks.
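
To make the HCES/AES distinction concrete, the sketch below illustrates two commonly used automatic evaluation scores for extractive QA: exact match and token-level F1, following the normalization conventions popularized by SQuAD-style evaluation. This is an illustrative example only, not code from the surveyed paper; the function names and the exact normalization rules (lowercasing, punctuation and article removal) are assumptions.

    import re
    import string
    from collections import Counter


    def normalize(text: str) -> str:
        """Lowercase, remove punctuation and English articles, collapse whitespace (assumed SQuAD-style rules)."""
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())


    def exact_match(prediction: str, reference: str) -> float:
        """Return 1.0 if the normalized strings are identical, else 0.0."""
        return float(normalize(prediction) == normalize(reference))


    def token_f1(prediction: str, reference: str) -> float:
        """Harmonic mean of token-level precision and recall between prediction and reference."""
        pred_tokens = normalize(prediction).split()
        ref_tokens = normalize(reference).split()
        common = Counter(pred_tokens) & Counter(ref_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)


    if __name__ == "__main__":
        pred = "the Eiffel Tower"
        ref = "Eiffel Tower, Paris"
        print(exact_match(pred, ref))           # 0.0 -- answers differ after normalization
        print(round(token_f1(pred, ref), 3))    # 0.8 -- partial token overlap is rewarded

Such scores are automatic and reproducible, but, as the abstract notes, they only approximate human judgment and are not task-agnostic: the same answer can score very differently under exact match than under F1, and neither applies directly to free-form or multi-hop QA.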
Collections
- TUNICRIS publications [23862]
