Comparison of features for neural network-based speaker identification in the presence of realistic noise
Kangasmäki, Roope (2024)
Tieto- ja sähkötekniikan kandidaattiohjelma - Bachelor's Programme in Computing and Electrical Engineering
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2024-05-17
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202405155917
Abstract
Speaker identification (SI) is the task of identifying speakers from an audio signal. The robustness of SI systems has been a focus of previous research, and key challenges in building robust systems have been small dataset sizes, the presence of background noise, and generalization to new domains. One essential part of building robust SI systems is feature extraction: a speech signal carries a large amount of information, only some of which is relevant to SI, so it is important to extract features that retain the relevant information while containing as little redundant information as possible. The goal of this study is to find out which features should be used with different dataset sizes and different noise types.
In the experimental part, data augmentation is used to add additive white Gaussian noise, real additive noise, and room impulse responses to speech signals. We train neural network-based models for SI with varying dataset sizes and different noise variants using three different features: raw audio waveform, log-mel spectrogram, and mel-frequency cepstral coefficients.
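The two augmentations described above are standard operations: additive noise is scaled to a target signal-to-noise ratio before being mixed in, and reverberation is produced by convolving the speech with a room impulse response. A minimal NumPy sketch (function names and the SNR parameterization are illustrative, not taken from the thesis):

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise so that 10*log10(P_signal / P_noise) == snr_db."""
    rng = np.random.default_rng() if rng is None else rng
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))  # noise power for the target SNR
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise

def add_reverb(signal, rir):
    """Convolve speech with a room impulse response, truncated to the input length."""
    return np.convolve(signal, rir)[: len(signal)]
```

The same `add_awgn` pattern applies to real recorded noise: replace the Gaussian samples with a noise clip scaled to the same target power.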
Our results show that, generally, when there is a low level of noise or no noise, the log-mel spectrogram is the best feature for SI, but when there is a medium or high level of noise, neural networks trained with raw audio waveforms achieve the best overall performance. The only exception is when a speech signal is convolved with a room impulse response to create reverberation; in that case, the raw audio waveform performs the worst. In our experiments, mel-frequency cepstral coefficients were never the best-performing feature type when compared with raw audio waveforms and log-mel spectrograms. We also found that the size of the dataset does not usually affect the ranking of the features from best to worst, but increasing dataset size does improve SI performance.
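The three feature types compared in the thesis form a pipeline: the raw waveform is the unprocessed input, the log-mel spectrogram is its mel-filtered short-time power spectrum, and MFCCs are the DCT-II of the log-mel energies. A self-contained NumPy sketch of the latter two (frame length, hop size, and filter counts are illustrative defaults, not the thesis configuration):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale from 0 Hz to Nyquist.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(x, sr, n_fft=512, hop=160, n_mels=40):
    # Window each frame, take the magnitude-squared FFT, apply the mel filters.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    return np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)

def mfcc(x, sr, n_mfcc=13, **kwargs):
    # MFCCs: DCT-II of the log-mel energies, keeping the first n_mfcc coefficients.
    logmel = log_mel_spectrogram(x, sr, **kwargs)
    n_mels = logmel.shape[1]
    k = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (k[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    return logmel @ dct.T
```

Truncating to the first MFCC coefficients discards fine spectral detail, which is one plausible reason MFCCs underperformed the fuller log-mel representation in these experiments.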
Collections
- Bachelor's theses [8430]