Techniques for Spectral Voice Conversion
Popa, Victor (2012)
Tampere University of Technology
Faculty of Computing and Electrical Engineering
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:ISBN:978-952-15-2878-1
Abstract
Voice conversion is a speech technology encompassing transformations applied to the speech signal with the purpose of changing the perceived speaker identity from a source voice to a desired target voice. The principal use of voice conversion is to enable synthesis systems to generate speech with customized voices without the need for exhaustive recordings and processing. Voice conversion can be realized as a stand-alone task or, alternatively, using adaptation techniques with HMM-based synthesis. Although there are many speaker-dependent voice characteristics, voice conversion deals mainly with those that are acoustic in nature, such as the spectral characteristics and the fundamental frequency. In spite of the remarkable results achieved by state-of-the-art techniques, challenges remain in all sub-areas of voice conversion before both excellent quality and highly successful identity conversion can be provided. The objective of this thesis is to develop a stand-alone voice conversion system for coded speech and to propose solutions leading to better quality, versatility or efficiency compared to current techniques. The thesis focuses on the conversion of spectral envelopes, but other sub-areas of voice conversion, such as speech parameterization and alignment, are also treated.
The analysis-modification-synthesis system adopted in this thesis is based on a parametric speech model that is used in a real speech codec and also for the internal speech representation of a concatenative TTS system. This allows easy integration of voice conversion with communications-related and embedded applications developed for coded speech. The validity of this parameterization has been confirmed by using it in an actual voice conversion scheme. In addition, a voicing level control scheme, a speech enhancement technique and a method for automatic speech data collection have been proposed, all of them taking advantage of this parametric framework.
The versatility of voice conversion systems depends on the characteristics imposed on the training data in order to properly estimate a conversion function. Depending on these characteristics, different alignment strategies can be adopted. Although dynamic time warping (DTW) has become almost a standard for the alignment of parallel training data, a new soft alignment technique is proposed for the same purpose. This concept allows probabilistic one-to-many frame mapping and is validated in an experiment with an artificial example. For practical reasons, the non-parallel case has been attracting increased interest lately. In this thesis, two techniques for text-independent alignment are proposed. The first one is based on phonetic segmentation and temporal decomposition and was successfully used in a voice conversion application. The second method uses a TTS system to break the non-parallel problem into two parallel conversions which, when concatenated, realize the desired voice conversion.
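As an illustration of the conventional hard alignment that the proposed soft alignment is compared against, the following is a minimal sketch of textbook DTW on two short feature sequences. It is not the implementation used in the thesis; the function name `dtw_align`, the Euclidean frame distance and the three-way step pattern are assumptions made for this example.

```python
from math import inf


def dtw_align(source, target):
    """Align two sequences of feature vectors with basic DTW.

    Returns (total_cost, path), where path is a list of
    (source_index, target_index) pairs. Hypothetical helper for illustration.
    """
    def dist(a, b):
        # Euclidean distance between two frames (assumed local cost).
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n, m = len(source), len(target)
    # cost[i][j] = best accumulated cost aligning source[:i] with target[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat target frame
                                 cost[i][j - 1],      # repeat source frame
                                 cost[i - 1][j - 1])  # advance both

    # Backtrack from the end to recover the frame pairing.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


src = [[0.0], [1.0], [2.0]]
tgt = [[0.0], [1.0], [1.0], [2.0]]
print(dtw_align(src, tgt))  # (0.0, [(0, 0), (1, 1), (1, 2), (2, 3)])
```

Here source frame 1 is paired with two target frames, but each pairing is a hard decision; a soft alignment would instead attach probabilities to such one-to-many pairings.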
GMM-based techniques have been the most popular approach for the conversion of spectral envelopes in spite of their problems related to over-smoothing, over-fitting and the lack of temporal modeling. In order to address these issues and improve GMM-based voice conversion, this thesis introduces several techniques. First, a new measure of the conversion accuracy is proposed, which can be easily computed from the GMM parameters and is found to be in line with perceptual and objective metrics. The next technique combines a clustering and mode selection scheme with cluster-wise GMM modeling and is based on the observation that the mapping accuracy improves as the variance of the target data is reduced. The proposed technique is shown to outperform a comparable approach that uses voicing-based clustering. A third technique can be used to adapt an existing well-trained conversion model to a new target speaker using a small amount of data. In a practical evaluation, the adapted model outperformed an equivalent model trained exclusively on the reduced data. The continuity issue is addressed in a final idea proposed for future research.
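For readers unfamiliar with GMM-based conversion, the sketch below shows the widely used joint-density form of the conversion function for scalar features: the converted value is a posterior-weighted sum of per-component linear regressions. This is the standard formulation from the literature, not necessarily the exact variant used in the thesis; the function names and the dictionary layout of the components are assumptions for this example, and the parameters are made up rather than trained.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return exp(-0.5 * (x - mean) ** 2 / var) / sqrt(2.0 * pi * var)


def gmm_convert(x, components):
    """Convert a scalar source feature x with a joint-density GMM.

    Each component is a dict with keys: weight, mean_x, mean_y,
    var_x, cov_yx (hypothetical layout chosen for this sketch).
    """
    # Weighted likelihood of x under each component's source marginal.
    likelihoods = [c["weight"] * gaussian_pdf(x, c["mean_x"], c["var_x"])
                   for c in components]
    total = sum(likelihoods)
    y = 0.0
    for lik, c in zip(likelihoods, components):
        posterior = lik / total
        # Per-component linear regression from source to target.
        local = c["mean_y"] + c["cov_yx"] / c["var_x"] * (x - c["mean_x"])
        y += posterior * local
    return y


toy_model = [
    {"weight": 0.5, "mean_x": -1.0, "mean_y": -5.0, "var_x": 1.0, "cov_yx": 0.0},
    {"weight": 0.5, "mean_x": 1.0, "mean_y": 5.0, "var_x": 1.0, "cov_yx": 0.0},
]
print(gmm_convert(0.0, toy_model))  # 0.0
```

In the toy model, an input halfway between the two components is mapped to the average of the two target means, a value neither component would produce on its own. This averaging is one simple way to see where the over-smoothing mentioned above comes from.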
In general, the spectral conversion techniques proposed in the literature have been subject to a trade-off between speech quality and identity conversion, and the challenge remains to provide optimal results for both criteria simultaneously. A new spectral conversion technique introduced in this thesis uses bilinear models to decompose the line spectral frequency (LSF) representation of the spectral envelope into two factors corresponding to speaker identity and phonetic content, respectively. In an extensive evaluation on different types and sizes of training data, this approach was found to perform similarly to a GMM-based conversion even though, in contrast to the GMM, it does not require any tuning. The concepts of contextual and local modeling are also introduced, arguing that a more accurate mapping can be achieved if multiple models are fit on relatively small subsets of the full training data rather than using one global model. The validity of these concepts has been verified in practical experiments. Furthermore, a local linear transformation technique is shown to effectively reduce over-smoothing relative to a globally trained GMM-based conversion function.
For spectral conversion, several of the proposed techniques can be considered extensions of a baseline vector quantization framework, opening the way for further developments in this direction. In addition to the local linear transformation method, which can be easily integrated into this framework, a memory-efficient conversion scheme based on multi-stage vector quantization (MSVQ) is also proposed. The experiments indicated substantial memory savings and even an accuracy improvement compared to some conventional codebook conversions. A dynamic programming approach to optimizing the frame-to-frame continuity in a vector quantization framework is also proposed as a future direction.
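The memory argument behind multi-stage vector quantization can be seen in a small sketch: each stage quantizes the residual left by the previous one, so the number of representable vectors is the product of the stage sizes while the storage is only their sum. The code below illustrates generic MSVQ encoding and decoding only, not the conversion scheme of the thesis; the function names and the toy codebooks are assumptions for this example.

```python
def nearest(vector, codebook):
    """Index of the codevector closest to `vector` (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda k: sum((v - c) ** 2
                                 for v, c in zip(vector, codebook[k])))


def msvq_encode(vector, stages):
    """Quantize `vector` stage by stage; each stage codes the residual."""
    residual = list(vector)
    indices = []
    for codebook in stages:
        k = nearest(residual, codebook)
        indices.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices


def msvq_decode(indices, stages):
    """Reconstruct a vector by summing the selected codevectors."""
    out = [0.0] * len(stages[0][0])
    for k, codebook in zip(indices, stages):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Toy codebooks (made up): a coarse stage and a refinement stage.
stages = [
    [[0.0, 0.0], [4.0, 4.0]],
    [[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]],
]
codes = msvq_encode([4.9, 4.1], stages)
print(codes, msvq_decode(codes, stages))  # [1, 1] [5.0, 4.0]
```

With these toy codebooks, 2 + 4 = 6 stored codevectors can represent 2 x 4 = 8 distinct outputs; the gap widens quickly as the stages grow.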
A final contribution is made to the existing hybrid technique that combines GMM and frequency warping. The approach is adapted to work with the proposed speech representation, and a procedure for automatic formant alignment and warping function calculation is presented.
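To give a rough idea of what a warping function derived from aligned formants can look like, the sketch below builds a piecewise-linear frequency mapping that passes through given source/target formant pairs and fixes 0 Hz and the Nyquist frequency. This is a common simple construction and only an assumption here; the abstract does not specify the form of the warping function used in the thesis. The name `make_warp` and the formant values are hypothetical.

```python
def make_warp(source_formants, target_formants, nyquist):
    """Build a piecewise-linear frequency warping function.

    The formant lists must have equal length and be strictly increasing,
    with all values strictly between 0 and `nyquist` (assumed, not checked).
    """
    # Anchor points: the band edges map to themselves.
    xs = [0.0] + list(source_formants) + [nyquist]
    ys = [0.0] + list(target_formants) + [nyquist]

    def warp(f):
        # Find the segment containing f and interpolate linearly.
        for k in range(len(xs) - 1):
            x0, x1 = xs[k], xs[k + 1]
            if x0 <= f <= x1:
                y0, y1 = ys[k], ys[k + 1]
                return y0 + (y1 - y0) * (f - x0) / (x1 - x0)
        raise ValueError("frequency outside [0, nyquist]")

    return warp


# Made-up formant pairs in Hz for a 4 kHz band.
warp = make_warp([500.0, 1500.0], [600.0, 1800.0], 4000.0)
print(warp(500.0), warp(1000.0), warp(4000.0))  # 600.0 1200.0 4000.0
```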