Implementing a speech-to-text module for a conversational avatar
Pöysti, Noora (2021)
Tieto- ja sähkötekniikan kandidaattiohjelma - Bachelor's Programme in Computing and Electrical Engineering
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Acceptance date
2021-06-02
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202105215301
Abstract
Machine learning, and more specifically deep learning, has gained increasing popularity in complex computer science applications. One of these is speech recognition, where a computer translates speech into text. The current trend in speech recognition is to use end-to-end models, in which a single deep learning model produces the desired outputs from the given inputs without the need for separate pre- or post-processing.
The purpose of this thesis was to explore different speech recognition models and find one suitable for the Deep Speaking Avatar project’s speech-to-text module. The project aims to implement a virtual avatar that tracks users with face detection and responds to audio input. The speech-to-text module is responsible for translating audio input into text files, which are then passed on to the other parts of the project. In this use case, a suitable speech recognition model is as accurate as possible while still leaving enough memory and computational capacity for the other modules.
Three speech recognition models were explored in detail: DeepSpeech, Jasper 10x5 DR, and QuartzNet 15x5, with DeepSpeech serving as the baseline against which the other models were compared. All three are open-source end-to-end models, and they were evaluated on four different LibriSpeech datasets.
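As an illustration of how such a pretrained end-to-end model can be run on a single audio clip, the sketch below assumes the NVIDIA NeMo toolkit (1.x) and its published English QuartzNet 15x5 checkpoint; the audio file name is a placeholder, and this is not necessarily the exact setup used in the thesis.

import nemo.collections.asr as nemo_asr

# Download and load the pretrained English QuartzNet 15x5 checkpoint.
model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

# Transcribe a 16 kHz mono WAV file; transcribe() returns a list of text strings.
transcripts = model.transcribe(paths2audio_files=["librispeech_sample.wav"])
print(transcripts[0])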
The results show that Jasper 10x5 DR and QuartzNet 15x5 both perform significantly better than the baseline model in terms of word error rate and average execution time. In terms of model size and memory requirements, QuartzNet 15x5 is significantly smaller than the baseline, while Jasper 10x5 DR is larger. Both models achieve similar word error rates on all evaluation datasets, but QuartzNet 15x5 is more efficient than Jasper 10x5 DR in execution time and memory usage. Because of this, QuartzNet 15x5 is better suited to the speech-to-text task in the Deep Speaking Avatar project.
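Word error rate here refers to the standard metric: the word-level edit distance (substitutions, deletions, and insertions) between a model's transcript and the reference, divided by the number of reference words. A minimal, self-contained sketch of the metric, for illustration only and not the evaluation code used in the thesis:

def word_error_rate(reference: str, hypothesis: str) -> float:
    # Word-level Levenshtein distance via dynamic programming.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Example: one missing word out of six reference words gives a WER of about 0.17.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))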
Collections
- Bachelor's theses [8996]