Generating speech in different speaking styles using WaveNet
Javanmardi, Farhad (2020)
Degree Programme in Information Technology, MSc (Tech)
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2020-05-20
Permanent address of the publication:
https://urn.fi/URN:NBN:fi:tuni-202005064996
Abstract
Generating speech in different styles from any given style is a challenging research problem in speech technology. This topic has many applications, for example, in assistive devices and in human-computer speech interaction. With recent developments in neural networks, speech generation has achieved a high level of naturalness and flexibility. The WaveNet model, one of the main drivers of recent progress in text-to-speech synthesis, is an advanced neural network model that can be used in different speech generation systems. WaveNet uses a sequential generation process in which a new sample predicted by the model is fed back into the network as input to predict the next sample, until the entire waveform is generated.
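The sequential generation process described above can be sketched as a simple autoregressive loop. The toy predictor below is a placeholder (a first-order autoregression with noise), not the thesis model; only the feedback structure is the point:

```python
import random

def toy_model(history):
    # Placeholder for a trained WaveNet predictor: a first-order
    # autoregression with small noise, purely for illustration.
    return 0.9 * history[-1] + random.gauss(0, 0.01)

def generate_waveform(model, num_samples, seed=(0.0,)):
    # Autoregressive loop: each predicted sample is appended to the
    # history and fed back as input for the next prediction.
    history = list(seed)
    for _ in range(num_samples):
        history.append(model(history))
    return history

waveform = generate_waveform(toy_model, 16000)  # one second at 16 kHz
```

In a real WaveNet, `toy_model` would be a deep dilated-convolution network sampling from a predicted distribution over quantized amplitude values, but the sample-by-sample feedback is the same.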
This thesis studies training of the WaveNet model with speech spoken in a particular source style and generating speech waveforms in a given target style. The source style studied in the thesis is normal speech, and the target style is Lombard speech. The latter corresponds to the speaking style elicited by the Lombard effect, that is, the phenomenon in human speech communication in which speakers change their speaking style in noisy environments in order to raise loudness and make the spoken message more intelligible. The training of WaveNet was done by conditioning the model on acoustic mel-spectrogram features of the input speech. Four different databases were used for training the model. Two of these databases (Nick 1, Nick 2) were originally collected at the University of Edinburgh in the UK, and the other two (CMU Arctic 1, CMU Arctic 2) at Carnegie Mellon University in the US. The databases consisted of different mixtures of speaking styles and varied in the number of unique speakers.
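Mel-spectrogram conditioning features are computed at a coarser frame rate than the waveform, so in practice they are upsampled to the sample level before being fed to the network. A minimal sketch using nearest-neighbour upsampling; the frame count and hop size below are illustrative assumptions, not values from the thesis:

```python
def upsample_conditioning(mel_frames, hop_length):
    # Repeat each mel frame hop_length times so one conditioning
    # vector aligns with each waveform sample. Real WaveNet systems
    # often use learned (transposed-convolution) upsampling instead.
    cond = []
    for frame in mel_frames:
        cond.extend([frame] * hop_length)
    return cond

# Example: 4 mel frames with an 80-sample hop cover 320 waveform samples.
mel = [[0.1] * 80, [0.2] * 80, [0.3] * 80, [0.4] * 80]
cond = upsample_conditioning(mel, hop_length=80)
```

The generation loop then looks up `cond[t]` at each time step, so the predicted sample depends both on the previous samples and on the local acoustic features.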
Two subjective listening tests (a speaking style similarity test and a MOS test on speech quality and naturalness) were conducted to assess the performance of WaveNet for each database. In the former test, the WaveNet-generated speech waveforms and the natural Lombard reference were compared in terms of their style similarity. In the latter test, the quality and naturalness of the WaveNet-generated speech signals were evaluated. In the speaking style similarity test, training with Nick 2 yielded slightly better performance than the other three databases. In the quality and naturalness test, we found that when the training was done using CMU Arctic 2, the quality of the generated Lombard speech signals was better than when using the other three databases. Overall, the study shows that a WaveNet model trained on speech of the source speaking style (normal) is not capable of generating speech waveforms of the target style (Lombard) unless some speech signals of the target style are included in the training data (i.e., Nick 2 in this study).