A Re-examination of chatbot evaluation metrics
Duong, Kien (2022)
Master's Programme in Computing Sciences
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
Date of acceptance
2022-05-24
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202205064457
Abstract
One of the most important and challenging parts of developing a chatbot is its evaluation, since judging a conversation depends on a number of complex factors. The objective of this thesis is to understand the characteristics of two types of automated metrics, trained and untrained, and to identify the most suitable metrics for dialogue evaluation. In addition, experiments were conducted to study the weaknesses of word-overlap metrics in morphologically rich languages and possible solutions to that problem.
In particular, six evaluation metrics were used in the experiments: Kullback–Leibler divergence, Coherence, BLEU, Embedding, Entropy, and MaUdE. In addition, three datasets in two languages (English and Finnish) were collected to study whether the language influences the quality of the metrics. Each metric was tasked with discriminating between qualified and unqualified answers, where the unqualified answers were generated by randomly sampling sentences from the database that are irrelevant to the context. The results indicate that 1-gram BLEU and greedy matching are the two most suitable options for chatbot evaluation. One solution to the morphologically rich language problem was also found: the performance of BLEU on Finnish can be improved by segmenting words into subwords or morphemes.
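The discrimination setup described above can be illustrated with a short sketch. The snippet below is a minimal illustration, not the thesis's actual implementation: it computes 1-gram BLEU with NLTK and a greedy-matching embedding score, and it uses toy random vectors in place of real pretrained word embeddings (an assumption; the thesis would use trained embeddings such as word2vec or GloVe). A qualified answer is compared against a randomly sampled negative, mirroring the discrimination task.

```python
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def bleu_1(reference_tokens, candidate_tokens):
    """1-gram BLEU: unigram precision with a brevity penalty."""
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference_tokens], candidate_tokens,
                         weights=(1.0, 0.0, 0.0, 0.0),
                         smoothing_function=smooth)


def greedy_matching(reference_tokens, candidate_tokens, embed):
    """Greedy matching: each token is matched to its most similar token
    on the other side by cosine similarity; the per-token maxima are
    averaged and the measure is symmetrised over both directions."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def one_direction(src, tgt):
        return np.mean([max(cosine(embed(s), embed(t)) for t in tgt)
                        for s in src])

    return 0.5 * (one_direction(candidate_tokens, reference_tokens)
                  + one_direction(reference_tokens, candidate_tokens))


# Toy embeddings stand in for pretrained vectors (an assumption):
# each distinct token gets a fixed random 50-dimensional vector.
rng = np.random.default_rng(0)
_vocab = {}
def embed(token):
    if token not in _vocab:
        _vocab[token] = rng.normal(size=50)
    return _vocab[token]


reference = "the weather is nice today".split()
qualified = "the weather looks nice today".split()
negative = "please renew my library card".split()  # random, off-context

for name, cand in [("qualified", qualified), ("random negative", negative)]:
    print(name,
          "BLEU-1:", round(bleu_1(reference, cand), 3),
          "greedy:", round(greedy_matching(reference, cand, embed), 3))
```

Both scores should come out higher for the qualified answer than for the off-context negative, which is exactly the separation the metrics were evaluated on.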
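The subword remedy for Finnish can likewise be sketched. The snippet below assumes a SentencePiece model trained on Finnish text is available at the hypothetical path fi_subword.model; SentencePiece is one common subword segmenter, and the exact segmentation tool used in the thesis is not specified in this abstract. Word-level BLEU-1 treats two inflected forms of the same stem as entirely different tokens, while subword-level BLEU-1 credits the morphemes they share.

```python
import sentencepiece as spm
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical path: a SentencePiece model trained on a Finnish corpus.
sp = spm.SentencePieceProcessor(model_file="fi_subword.model")

reference = "taloissammekin"   # Finnish: "also in our houses"
candidate = "taloissamme"      # Finnish: "in our houses"

smooth = SmoothingFunction().method1

# Word-level BLEU-1: the two surface forms do not match at all.
word_score = sentence_bleu([reference.split()], candidate.split(),
                           weights=(1.0, 0.0, 0.0, 0.0),
                           smoothing_function=smooth)

# Subword-level BLEU-1: segmentation exposes the shared morphemes.
ref_pieces = sp.encode(reference, out_type=str)
cand_pieces = sp.encode(candidate, out_type=str)
subword_score = sentence_bleu([ref_pieces], cand_pieces,
                              weights=(1.0, 0.0, 0.0, 0.0),
                              smoothing_function=smooth)

print("word-level BLEU-1:   ", round(word_score, 3))
print("subword-level BLEU-1:", round(subword_score, 3))
```

The exact scores depend on the segmentation model, but the word-level score stays near zero while the subword score rises with every shared piece, illustrating why segmentation into subwords or morphemes boosts BLEU on a morphologically rich language like Finnish.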