Comparison of different features for neural network-based models in speaker identification
Hyyryläinen, Eetu (2023)
Tieto- ja sähkötekniikan kandidaattiohjelma - Bachelor's Programme in Computing and Electrical Engineering
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2023-01-10
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202301091204
Abstract
The general understanding in speech and audio processing has been that, with larger datasets, task-specific features can be learned directly from the raw audio waveform, whereas with smaller datasets, traditional signal features such as log-mel features are used. The purpose of this work is to compare how different speech features perform in neural network-based speaker identification, the case example of the present study, with datasets of different sizes. Speaker identification is the task of determining which voice from a group of voices best matches a given speaker.
This work consists of two parts. The literature review briefly describes how a speaker identification system works and how speaker identification models have typically been created. In the experimental part, raw audio and log-mel features were compared with each other under multiple experimental conditions, which included different numbers of training samples and various noise levels.
The results of the present study show that, with no added noise, both features perform very reliably in the task of speaker identification. Surprisingly, there was little disparity between the speaker classification accuracies of the raw audio-based and log-mel spectrum-based models. Only when the training sample size was small did log-mel features clearly beat raw audio. When the training sample size was large, log-mel features beat raw audio only barely. However, when noise was introduced to the data, raw audio started to outperform log-mel features in classification accuracy as the noise level increased. At the highest noise level used in the experiments, raw audio was clearly better than log-mel features in classification accuracy, especially with smaller training sample sizes.
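As an illustration of what a "noise level" condition can look like in practice, the following is a minimal sketch of mixing noise into a clean signal at a chosen signal-to-noise ratio. The abstract does not state how noise was generated or which levels were used, so the white Gaussian noise, the SNR-in-decibels parameterisation, and the function names `add_noise_at_snr` and `measured_snr_db` are all assumptions made for this example, not the method of the thesis.

```python
import math
import random


def mean_power(samples):
    """Average power (mean of squared amplitudes) of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return the signal plus white Gaussian noise scaled to a target SNR in dB.

    Assumption: noise is white Gaussian; the thesis may have used another type.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for the noise power.
    target_noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB, computed from the clean signal and its noisy version."""
    residual = [y - x for x, y in zip(clean, noisy)]
    return 10.0 * math.log10(mean_power(clean) / mean_power(residual))


# Hypothetical usage: a one-second 220 Hz tone at 16 kHz, mixed at 10 dB SNR.
sample_rate = 16000
clean = [math.sin(2.0 * math.pi * 220.0 * n / sample_rate) for n in range(sample_rate)]
noisy = add_noise_at_snr(clean, snr_db=10.0)
```

Lower values of `snr_db` correspond to heavier noise. In a setup like the one the abstract describes, the same noisy waveform would then be fed either directly to a raw-audio model or through log-mel feature extraction first.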
Collections
- Bachelor's theses [8709]