Evaluating prosodic features of synthetic speech
Raatikainen, Arttu (2024)
Bachelor's Programme in Computing and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
Date of approval
2024-05-21
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202405145881
Abstract
Speech synthesis has developed rapidly thanks to advances in artificial intelligence. Modern neural text-to-speech (TTS) systems can generate intelligible and natural-sounding speech. The naturalness and intelligibility of synthetic speech can be characterized through its prosodic properties. State-of-the-art TTS models are capable of explicitly learning prosodic features. However, generating prosodically rich speech remains a challenging task.
The goal of this thesis is to evaluate how well synthetic speech inherits the prosodic features of the natural speech it is trained on. The evaluation was conducted by comparing the prosodic features of real and synthetic speech. The compared features include fundamental frequency (F0), speech rate, energy, and spectral tilt.
The data used in the experiments is based mainly on the LJSpeech dataset, which consists of audiobook recordings from the LibriVox audiobook domain. Natural speech recordings that are not included in the LJSpeech dataset were collected from LibriVox. Synthetic speech was generated with the FastPitch TTS model and the HiFi-GAN vocoder, both trained on LJSpeech. FastPitch uses pitch and duration predictions to generate mel-spectrograms.
Overall, there were six audiobook chapters with a total length of 2.5 hours. Features were extracted using Python libraries. The results show that the prosodic distributions of real and synthetic speech differ. The TTS system was able to produce high-quality synthetic speech. Large statistical differences were found for F0 and speech rate: synthetic speech had a significantly higher speech rate and less variation in its pitch contour. The features with the smallest statistical differences were energy and spectral tilt.
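The abstract does not name the Python libraries or the statistical measures used, so the following is only a minimal sketch of how frame-level energy could be extracted and how two feature distributions (real versus synthetic) could be compared. The function names, the frame and hop sizes, and the choice of Cohen's d and the two-sample Kolmogorov-Smirnov statistic are all assumptions made for illustration, not the thesis's method.

```python
import math
import statistics


def frame_rms_db(samples, frame_len=400, hop=200):
    """Return RMS energy in dB for each full frame of a mono signal.

    frame_len and hop are illustrative defaults (assumed, not from the thesis).
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Floor the value to avoid log10(0) on silent frames.
        energies.append(20 * math.log10(max(rms, 1e-10)))
    return energies


def cohens_d(a, b):
    """Effect size between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d
```

Applied to per-frame or per-utterance values of one feature from the real and the synthetic recordings, a KS statistic near 0 suggests similar distributions and a value near 1 suggests almost no overlap, while Cohen's d summarizes the shift in means. In practice a library such as SciPy would normally be used for the test itself; the pure-Python version here only keeps the sketch self-contained.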
Collections
- Bachelor's theses [8430]