Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation
Kuparinen, Olli; Miletić, Aleksandra; Scherrer, Yves (2023-12-01)
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-2023121810954
Description
Peer reviewed
Abstract
Text normalization methods have been commonly applied to historical language or user-generated content, but less often to dialectal transcriptions. In this paper, we introduce dialect-to-standard normalization (i.e., mapping phonetic transcriptions from different dialects to the orthographic norm of the standard variety) as a distinct sentence-level character transduction task and provide a large-scale analysis of dialect-to-standard normalization methods. To this end, we compile a multilingual dataset covering four languages: Finnish, Norwegian, Swiss German and Slovene. For the two biggest corpora, we provide three different data splits corresponding to different use cases for automatic normalization. We evaluate the most successful sequence-to-sequence model architectures proposed for text normalization tasks using different tokenization approaches and context sizes. We find that a character-level Transformer trained on sliding windows of three words works best for Finnish, Swiss German and Slovene, whereas the pre-trained ByT5 model using full sentences obtains the best results for Norwegian. Finally, we perform an error analysis to evaluate the effect of different data splits on model performance.
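To illustrate the "sliding windows of three words" and character-level tokenization mentioned in the abstract, here is a minimal Python sketch. It is not the authors' preprocessing code: the function names, the stride of one word, the handling of sentences shorter than the window, and the `_` word-boundary symbol are all assumptions made for illustration.

```python
from typing import List


def sliding_windows(words: List[str], size: int = 3, stride: int = 1) -> List[List[str]]:
    """Split a tokenized sentence into overlapping windows of `size` words.

    Assumption: sentences with at most `size` words yield a single window,
    and an empty sentence yields no windows.
    """
    if size < 1 or stride < 1:
        raise ValueError("size and stride must be positive")
    if not words:
        return []
    if len(words) <= size:
        return [list(words)]
    return [words[i:i + size] for i in range(0, len(words) - size + 1, stride)]


def char_tokens(window: List[str], boundary: str = "_") -> List[str]:
    """Turn a window of words into character tokens.

    Assumption: word boundaries are marked with a dedicated symbol
    (`boundary`) so the model can tell words apart.
    """
    tokens: List[str] = []
    for index, word in enumerate(window):
        if index > 0:
            tokens.append(boundary)
        tokens.extend(word)  # a string extends the list character by character
    return tokens


if __name__ == "__main__":
    # Toy input made up for this sketch, not taken from the paper's dataset.
    sentence = "aa bb cc dd".split()
    for window in sliding_windows(sentence):
        print(" ".join(char_tokens(window)))
```

Each window would be paired with the corresponding standard-language words during training, which presupposes word-aligned dialect and standard text; the abstract does not state how the alignment is obtained.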
Collections
- TUNICRIS publications [20263]