Design and Implementation of a Speaker-Independent Voice Dialing System: A Multi-lingual Approach
Iso-Sipilä, Juha (2008)
Tampere University of Technology
2008
Faculty of Computing and Electrical Engineering
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tty-200902201010
Abstract
Speech is the most natural means of communication for human beings. It should therefore be straightforward to apply the same modality to commonly used communication devices, such as mobile phones. Current state-of-the-art speech processing technology enables a wide range of applications using both speech input and output. Research in speech recognition and speech synthesis has been carried out for several decades, but real breakthroughs are yet to be seen. In addition to mastering these technologies, a computer should be able to understand speech as humans do; only then can natural spoken communication between a human and a machine be accomplished in the same manner as human-to-human communication. Currently, in many cases, the users' expectations are not matched by the capabilities of the technology. Despite these problems, the evolution of voice interfaces for mobile phones is progressing. During the 1990s, the first devices were equipped with so-called speaker-dependent voice dialing, or voice tagging, and most mobile device manufacturers had included this feature in their devices before the year 2000. This can be considered the first speech application with hundreds of millions of installations and a multi-lingual user base.
In an application based on speaker-independent speech recognition, the user does not need to perform any training, and the device is immediately ready for use. This is an advantage to the user but, at the same time, presents a number of challenges for the design of the system. The biggest challenge is that a speaker-independent system is language-dependent, so it requires additional development effort for every supported language. Speaker-independent systems are also inherently more complex than speaker-dependent ones, since the speech recognizer has to be trained with a large group of speakers for each language. In addition, a speech synthesizer is needed for spoken feedback.
This thesis considers a multi-lingual approach to a voice dialing application running on a mobile device. The proposed techniques and their applications form the core of a multi-lingual, speaker-independent voice dialing system. The system combines multi-lingual speech recognition, speech synthesis, and language processing technologies into a compact and complete system that can easily be extended and configured for existing and new languages. The system is split into subsystems that are designed and developed independently. The subsystems cover research topics such as multi-lingual phonetics, multi-lingual acoustic modeling, pronunciation modeling, speech recognition, and speech synthesis; they communicate within a common framework and share common resources. The basis of the overall system is the definition of a multi-lingual phoneme set that represents the phonology of any supported language. This harmonized phonetic representation is further utilized in pronunciation modeling, speech recognition, and speech synthesis. Additionally, the speech recognition engine is developed in a manner that makes it language-independent and uses a very compact representation of the acoustic models. The speech synthesis engine, based on a well-known formant synthesizer, is both compact in size and easily configurable at runtime with a description language. Finally, the pronunciation modeling engine supports several types of pronunciation modeling methods without dependencies on individual languages. The functionality is implemented so that a collection of runtime parameters defines the operations and languages supported by any instance of the system. The contribution of this thesis comprises the design and development of these subsystems and their integration into a real application for mobile devices.
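To make the idea of a shared phoneme inventory and runtime language parameters more concrete, the following Python sketch illustrates one possible arrangement. The names, mappings, and data structures are hypothetical and greatly simplified; they are not the implementation described in the thesis, only an illustration of pronunciations for several languages being generated against a single harmonized phoneme set.

```python
# Hypothetical sketch (not the thesis code): a shared multi-lingual phoneme
# inventory plus per-language pronunciation rules selected at runtime.

# Harmonized phoneme set covering all supported languages (tiny subset shown).
MULTILINGUAL_PHONEMES = {"a", "e", "i", "o", "u", "j", "k", "n", "s", "t", "v"}

# Per-language grapheme-to-phoneme rules; a real system would use
# pronunciation dictionaries or trained models rather than character maps.
G2P_RULES = {
    "fi": {"a": "a", "e": "e", "i": "i", "j": "j", "k": "k",
           "n": "n", "o": "o", "s": "s", "t": "t", "u": "u", "v": "v"},
    "en": {"a": "e", "e": "i", "i": "i", "j": "j", "k": "k",
           "n": "n", "o": "o", "s": "s", "t": "t", "u": "u", "v": "v"},
}

def pronounce(word: str, languages: list[str]) -> dict[str, list[str]]:
    """Return one phoneme sequence per enabled language.

    `languages` plays the role of the runtime parameters that decide which
    languages a given instance of the system supports.
    """
    results = {}
    for lang in languages:
        rules = G2P_RULES[lang]
        phonemes = [rules.get(ch, ch) for ch in word.lower()]
        # Every output symbol must belong to the shared phoneme set so that
        # recognition and synthesis can reuse the same acoustic units.
        assert all(p in MULTILINGUAL_PHONEMES for p in phonemes), phonemes
        results[lang] = phonemes
    return results

if __name__ == "__main__":
    # Generate pronunciation variants for a contact name in both languages.
    print(pronounce("nokia", ["fi", "en"]))
```

In this sketch the list of enabled languages simply selects which rule sets are applied; in the system described above, analogous runtime parameters configure recognition, synthesis, and pronunciation modeling for any combination of supported languages.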
Collections
- Doctoral dissertations [4862]