Experimental Design of Extractive Question-Answering Systems: Influence of Error Scores and Answer Length

Farea, Amer; Emmert-Streib, Frank (2024)

 

Journal of Artificial Intelligence Research
doi:10.1613/jair.1.15642
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202407247721

Description

Peer reviewed
Abstract
Question-answering (QA) systems are becoming increasingly important because they enable human-computer communication in natural language. In recent years, significant progress has been made with transformer-based models that combine deep learning with large amounts of text data. However, QA systems remain challenging because of their complexity, which is rooted in the ambiguity and flexibility of natural language. This makes even their evaluation a formidable task. For this reason, this study focuses on the evaluation of extractive question-answering (EQA) systems by conducting a large-scale analysis of DistilBERT using benchmark data from the Stanford Question Answering Dataset (SQuAD). The paper has four main objectives. First, we study the influence of answer length on performance and demonstrate that the two are inversely correlated. Second, we study differences between exact match (EM) measures, because several different definitions are in common use in the literature. We find that, although all of these measures are named "exact match", they actually differ from one another. Third, we study the practical relevance of these different definitions: because the meaning of "exact match" in the literature is ambiguous, it is often unclear whether a reported improvement is genuine or only due to a change in the exact match measure. Importantly, our results show that the differences between differently defined EM measures are of the same order of magnitude as the differences reported in the literature. This raises concerns about the robustness of reported results. Fourth, we provide guidelines to improve the experimental design of general EQA studies, aiming to enhance performance evaluation and minimize the potential for spurious results.
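To make the second and third points concrete, the following minimal Python sketch shows how three plausible readings of "exact match" can score the same predictions differently. It is an illustration only, not the authors' code: the three variants (strict string equality, equality after SQuAD-style normalization, and normalized equality against any of several reference answers) are assumptions about what "different definitions" might look like, and the function names and toy examples are hypothetical.

```python
import re
import string

# Hypothetical toy data: (prediction, list of reference answers).
EXAMPLES = [
    ("The Eiffel Tower", ["Eiffel Tower"]),
    ("1889", ["1889"]),
    ("Paris, France", ["Paris", "Paris France"]),
    ("Gustave", ["Gustave Eiffel"]),
]


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation and
    English articles, then collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def em_strict(prediction: str, reference: str) -> int:
    """Variant 1 (assumed): raw string equality, no normalization."""
    return int(prediction == reference)


def em_normalized(prediction: str, reference: str) -> int:
    """Variant 2 (assumed): equality after normalization."""
    return int(normalize_answer(prediction) == normalize_answer(reference))


def em_any_reference(prediction: str, references: list[str]) -> int:
    """Variant 3 (assumed): normalized match against any reference."""
    return max(em_normalized(prediction, ref) for ref in references)


def compare_em(examples) -> dict[str, float]:
    """Average each EM variant over the examples. Variants 1 and 2
    only look at the first reference answer."""
    n = len(examples)
    return {
        "strict_first": sum(em_strict(p, refs[0]) for p, refs in examples) / n,
        "normalized_first": sum(em_normalized(p, refs[0]) for p, refs in examples) / n,
        "normalized_any": sum(em_any_reference(p, refs) for p, refs in examples) / n,
    }


if __name__ == "__main__":
    for name, score in compare_em(EXAMPLES).items():
        print(f"{name}: {score:.2f}")
```

On this toy data the three variants give different scores for identical predictions, which is the kind of gap that can be mistaken for a model improvement when the EM definition is left unstated.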
Collections
  • TUNICRIS publications [23862]
Kalevantie 5
P.O. Box 617
33014 Tampere University
oa[@]tuni.fi | Data protection | Accessibility statement
 

 
