Multi-Label Speaker Recognition using Recurrent Neural Networks
Niskanen, Lauri (2018)
Information Technology
Faculty of Computing and Electrical Engineering
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2018-12-05
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tty-201811212656
Abstract
Speech recognition is a popular research topic that analyzes human speech. In addition to understanding the spoken message, it is beneficial to know who is speaking. This thesis studies speaker recognition and presents a machine-learning-based system for identifying speakers in audio streams. Our implementation is based on Mel-frequency cepstral coefficients (MFCC) and recurrent neural networks.
The system is developed and evaluated on the AMI Meeting Corpus. The dataset contains annotated meeting recordings, typically with four participants each. Our system processes the audio of the recordings in 20-millisecond slices and produces a list of active speakers at each time step.
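The per-slice output described above can be pictured as a multi-hot label matrix with one row per 20 ms slice and one column per participant. The following is a minimal sketch, assuming the annotations are available as (speaker, start, end) tuples in milliseconds. That tuple format, the speaker names, and the function names are hypothetical and do not reflect the actual AMI annotation schema or the code used in the thesis.

```python
from typing import List, Tuple

FRAME_MS = 20  # slice length stated in the abstract

# Hypothetical annotation format: (speaker, start_ms, end_ms)
Segment = Tuple[str, int, int]


def frame_labels(segments: List[Segment], speakers: List[str],
                 total_ms: int) -> List[List[int]]:
    """Build one multi-hot row per 20 ms slice (1 = speaker is active)."""
    n_frames = total_ms // FRAME_MS
    column = {name: i for i, name in enumerate(speakers)}
    labels = [[0] * len(speakers) for _ in range(n_frames)]
    for name, start_ms, end_ms in segments:
        if end_ms <= start_ms:
            continue  # skip empty or malformed segments
        first = max(start_ms // FRAME_MS, 0)
        # Last slice that the segment overlaps, clamped to the recording.
        last = min((end_ms - 1) // FRAME_MS, n_frames - 1)
        for frame in range(first, last + 1):
            labels[frame][column[name]] = 1
    return labels


def active_speakers(labels: List[List[int]],
                    speakers: List[str]) -> List[List[str]]:
    """Turn each multi-hot row back into a list of active speaker names."""
    return [[s for s, flag in zip(speakers, row) if flag] for row in labels]


# Illustrative data: two overlapping turns in a 100 ms excerpt.
speakers = ["A", "B", "C", "D"]
segments = [("A", 0, 60), ("B", 40, 100)]
labels = frame_labels(segments, speakers, total_ms=100)
print(active_speakers(labels, speakers))
```

Because more than one column can be set in the same row, overlapping speech is represented directly, which is what makes the task multi-label rather than single-label classification.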
We measure the performance of our system using various metrics. The results indicate that our system is capable of identifying the speakers with decent accuracy. The best classifier model that we examined is a 1-layer long short-term memory (LSTM) neural network with layer size 256. More complex neural networks do not seem to improve the classification results, and they take longer to train. We also suggest alternative classification methods for future research.
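To give a rough sense of why larger networks cost more to train, the sketch below counts the trainable weights of a stacked LSTM using the standard formulation with four gates and one bias vector per gate. The input size of 20 features per slice is an assumption made purely for illustration, since the abstract does not state how many MFCCs are used, and some libraries keep two bias vectors per gate, which gives a slightly higher count.

```python
def lstm_param_count(input_size: int, hidden_size: int,
                     num_layers: int = 1) -> int:
    """Count weights of a stacked LSTM (4 gates, one bias vector per gate)."""
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        # Each gate has an input matrix, a recurrent matrix and a bias.
        per_gate = hidden_size * (layer_input + hidden_size) + hidden_size
        total += 4 * per_gate
        layer_input = hidden_size  # deeper layers read the previous layer
    return total


N_FEATURES = 20  # assumed MFCC count per slice, not taken from the thesis

print(lstm_param_count(N_FEATURES, 256))                # 283648
print(lstm_param_count(N_FEATURES, 256, num_layers=2))  # 808960
print(lstm_param_count(N_FEATURES, 512))                # 1091584
```

Under these assumptions, adding a second layer or doubling the layer size roughly triples or quadruples the number of weights, which is consistent with the longer training times reported for the more complex models.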