Convolutional neural networks for acoustic scene classification
Valenti, Michele (2016)
Tieto- ja sähkötekniikan tiedekunta - Faculty of Computing and Electrical Engineering
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2016-11-09
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tty-201610144598
Abstract
In this thesis we investigate the use of deep neural networks in computational audio scene analysis, in particular for acoustic scene classification. This task concerns the recognition of an acoustic scene, such as a park or a home, by an artificial system. We examine deep models with the aim of contributing to one of their use cases that is, in our opinion, among the least explored.
The neural architecture we propose is a convolutional neural network specifically designed to operate on a time-frequency audio representation known as the log-mel spectrogram. The network output is an array of prediction scores, each associated with one of 15 predefined classes. In addition, the architecture features batch normalization, a recently proposed regularization technique used to improve network performance and to speed up training.
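As a rough illustration of the batch normalization step mentioned above, the following minimal sketch normalizes a single feature across a mini-batch and then applies a scale and a shift. It follows the standard textbook formulation and is not taken from the thesis; the function name `batch_norm` and the parameters `gamma`, `beta` and `eps` are illustrative assumptions.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift.

    Assumption: this is the generic training-time formulation, not the
    thesis implementation. `gamma` and `beta` stand in for the learned
    scale and shift; `eps` avoids division by zero.
    """
    n = len(batch)
    mean = sum(batch) / n
    # Biased (population) variance over the mini-batch.
    var = sum((x - mean) ** 2 for x in batch) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gamma * (x - mean) * inv_std + beta for x in batch]
```

With the default `gamma` and `beta`, the output has approximately zero mean and unit variance over the mini-batch, which is the property that helps stabilize and speed up training.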
We also investigate the use of different audio sequence lengths as the classification unit for our network. These experiments show that, for our artificial system, long sequences are not easier to recognize than medium-length sequences, which is a counterintuitive behaviour. Moreover, we introduce a training procedure that aims to make the best of small datasets by using all the available labeled data for network training. This procedure, possible under particular circumstances, constitutes a trade-off between an accurate training stop and an increased amount of data available to the network. Finally, we compare our model to other systems, showing that its recognition ability can outperform both other neural architectures and other state-of-the-art statistical classifiers, such as support vector machines and Gaussian mixture models.
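To make the idea of a sequence as the classification unit more concrete, the sketch below splits a recording (a list of spectrogram frames) into fixed-length sequences and combines per-sequence prediction scores into one decision. The abstract does not state how sequences are cut or how scores are combined, so the non-overlapping split, the dropping of trailing frames and the score averaging are all assumptions, and the function names are hypothetical.

```python
def split_into_sequences(frames, seq_len):
    """Split frames into non-overlapping sequences of `seq_len` frames.

    Assumption: trailing frames that do not fill a whole sequence are dropped.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    n_full = len(frames) // seq_len
    return [frames[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def average_scores(per_sequence_scores):
    """Average the class scores over all sequences (one score per class)."""
    n = len(per_sequence_scores)
    n_classes = len(per_sequence_scores[0])
    return [sum(s[c] for s in per_sequence_scores) / n for c in range(n_classes)]


def argmax(scores):
    """Index of the highest score, i.e. the predicted class."""
    return max(range(len(scores)), key=lambda i: scores[i])
```

Under these assumptions, a recording is classified by scoring each sequence with the network, averaging the resulting score arrays, and taking the class with the highest average; changing `seq_len` changes the classification unit.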
The proposed system reaches good accuracy scores on two different databases collected in 2013 and 2016. The best accuracy scores, obtained according to two cross-validation setups, are 77% and 79% respectively. These scores constitute accuracy increments of 22% and 6.1% with respect to the corresponding baselines published together with the datasets.