Spoken wake-word detection for conversational avatar
Savukoski, Simon (2023)
Bachelor's Programme in Computing and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Acceptance date
2023-01-25
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-202301221606
Abstract
Speech recognition is part of daily life through smartphones and other intelligent devices. Search engines can be operated by voice, and one can turn off the lights of a smart home with a few words. For these machines to understand whether a question is being asked, speech needs to be segmented into sentences. When intelligent devices constantly listen to users’ speech, power consumption and user privacy become concerns. Keyword and wake-word detection are used in speech recognition applications to address these problems.
This thesis implemented wake-word detection for a speech recognition project called Deep Speaking Avatar, a project made by students in Tampere University’s department of signal processing. The implementation records audio samples in real time and transforms them into image-like Mel-frequency spectrograms. These spectrograms are fed as input to a one-dimensional convolutional neural network model. Based on the input, the model predicts whether the audio sample is the wake-word. If the wake-word is detected, the system starts listening to the user’s speech.
The wake-word detection model was tested with two test subjects, who were asked to provide positive and negative audio examples for the model to classify. One female and one male subject were chosen to test the model’s response to differing voices. The results show that the model classifies the female test set with significantly better accuracy than the male test set. Further investigation revealed that the male test set contained many wake-words spoken more rapidly or with changing tone. The model classified all but two negative samples correctly. The conclusion is that, for clearly pronounced speech, the implementation performs well enough for the intended use case.
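The control flow described in the abstract (buffer incoming audio, convert the buffer to features, classify it, and start listening once the wake-word is found) could be sketched roughly as follows. This is only an illustration, not the thesis implementation: `detect_wake_word`, `energy`, and `toy_model` are hypothetical names, the mean-amplitude feature is a stand-in for the Mel-frequency spectrogram, the threshold rule is a stand-in for the convolutional network, and the sample values are made up.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, Sequence


def detect_wake_word(
    chunks: Iterable[Sequence[float]],
    to_features: Callable[[List[float]], float],
    predict: Callable[[float], float],
    window_size: int,
    threshold: float = 0.5,
) -> Iterator[bool]:
    """Yield one decision per incoming chunk: True if the current
    sliding window of samples is classified as the wake-word."""
    window: deque = deque(maxlen=window_size)  # keeps only the newest samples
    for chunk in chunks:
        window.extend(chunk)
        if len(window) < window_size:
            # Not enough audio buffered yet to classify.
            yield False
            continue
        score = predict(to_features(list(window)))
        yield score >= threshold


def energy(samples: List[float]) -> float:
    """Stand-in feature (assumption): mean absolute amplitude.
    The thesis uses Mel-frequency spectrograms instead."""
    return sum(abs(s) for s in samples) / len(samples)


def toy_model(feature: float) -> float:
    """Stand-in classifier (assumption): the thesis uses a 1-D CNN."""
    return 1.0 if feature > 0.5 else 0.0


# Made-up stream of two-sample chunks: silence, a loud burst, silence.
stream = [[0.0, 0.0], [0.0, 0.0], [0.9, 0.9], [0.9, 0.9], [0.0, 0.0]]
decisions = list(detect_wake_word(stream, energy, toy_model, window_size=4))

# Index of the first chunk at which the system would start listening.
listening_from = decisions.index(True)
```

In a real system, `to_features` and `predict` would be replaced by the spectrogram computation and the trained model, and `chunks` would come from a live microphone stream rather than a list.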