Autoregressive model based on a deep convolutional neural network for audio generation
Cabello Piqueras, Laura (2017)
Information Technology
Tieto- ja sähkötekniikan tiedekunta - Faculty of Computing and Electrical Engineering
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Acceptance date
2017-04-05
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tty-201703221212
Abstract
The main objective of this work is to investigate how a deep convolutional neural network (CNN) performs in audio generation tasks. The final architecture we study is an autoregressive model built on a deep CNN that operates directly at the waveform level.
First, we study different options for tackling the task of audio generation. The best approach we identify frames generation as a classification task over one-hot encoded data. Generation is based on sequential predictions: after the next sample of an input sequence is predicted, it is fed back into the network to predict the following sample.
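The sequential prediction loop can be sketched as follows. This is only an illustration under stated assumptions: the number of quantization levels (`NUM_CLASSES`), the linear quantizer, and `toy_predictor` (a stand-in for the trained network that simply repeats the oldest sample in its window) are all hypothetical and are not taken from the thesis.

```python
import math

NUM_CLASSES = 16  # assumed number of amplitude classes, not from the thesis


def quantize(x, num_classes=NUM_CLASSES):
    """Map an amplitude in [-1, 1] to an integer class (assumed linear quantizer)."""
    x = max(-1.0, min(1.0, x))
    return min(num_classes - 1, int((x + 1.0) / 2.0 * num_classes))


def one_hot(index, num_classes=NUM_CLASSES):
    """Encode a class index as a one-hot vector."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


def toy_predictor(window):
    """Hypothetical stand-in for the network: returns a distribution over
    classes that puts all its mass on the oldest sample in the window."""
    return list(window[0])


def generate(seed_classes, predictor, num_new):
    """Autoregressive generation: predict one sample, feed it back, repeat."""
    window = list(seed_classes)
    generated = []
    for _ in range(num_new):
        probs = predictor([one_hot(c) for c in window])
        # Pick the most likely class; sampling from probs is another option.
        next_class = max(range(len(probs)), key=probs.__getitem__)
        generated.append(next_class)
        # Slide the window: drop the oldest sample, append the prediction.
        window = window[1:] + [next_class]
    return generated


# Seed with one period of a quantized sine wave (8 samples per period).
seed = [quantize(math.sin(2 * math.pi * n / 8)) for n in range(8)]
continuation = generate(seed, toy_predictor, 16)
```

With this toy predictor the continuation just repeats the seed period; a trained network would replace `toy_predictor` and return a learned distribution over the classes.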
We present the basics of the preferred architecture for generation, adapted from the WaveNet model proposed by DeepMind. It is based on dilated causal convolutions, which allow the receptive field size to grow exponentially with the depth of the network. Larger receptive fields are desirable when dealing with temporal sequences, since they increase the model's capacity to capture temporal correlations at longer timescales.
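The growth of the receptive field can be checked with a short sketch. The kernel size of 2, the doubling dilation schedule, and the unit filter weights are assumptions chosen for illustration. They are not the settings reported in the thesis.

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field (in samples) of a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], treating x as zero for t < 0.
    The output at time t never depends on future inputs (causality)."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


doubling = [1, 2, 4, 8]   # dilation doubles with each layer
constant = [1, 1, 1, 1]   # same depth, no dilation

rf_doubling = receptive_field(doubling)   # 16 samples
rf_constant = receptive_field(constant)   # 5 samples

# Empirical check: push an impulse through the dilated stack and see how far
# its influence reaches in time.
signal = [1.0] + [0.0] * 19
for d in doubling:
    signal = causal_dilated_conv(signal, [1.0, 1.0], d)
reach = sum(1 for v in signal if v != 0.0)  # equals rf_doubling
```

With four layers, the doubling schedule covers 16 samples while the undilated stack covers only 5. In general, for kernel size 2 and dilations 1, 2, ..., 2^(L-1), the receptive field is 2^L samples, compared with L + 1 without dilation.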
Due to the lack of an objective method to assess the quality of newly synthesized signals, we first test a wide range of network settings with pure tones, so that we can check whether the network is able to predict the same sequences. To overcome the difficulties of training a deep network and to speed up the research within our computational resources, we constrain the input database to mixtures of two sinusoids within the audible range of frequencies. In the generation phase, we observe the key role of training a network with a large receptive field and long input sequences. Likewise, the number of examples fed to the network in every training epoch exerts a decisive influence in every approach studied.
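A training set of this kind could be built roughly as sketched below. The sampling rate, the two frequencies, the equal amplitudes, and the window length are hypothetical values picked for illustration. The thesis abstract does not specify them.

```python
import math


def two_tone(f1, f2, sample_rate, num_samples):
    """Equal-amplitude mixture of two sinusoids, bounded to [-1, 1]."""
    return [
        0.5 * math.sin(2 * math.pi * f1 * n / sample_rate)
        + 0.5 * math.sin(2 * math.pi * f2 * n / sample_rate)
        for n in range(num_samples)
    ]


def training_pairs(signal, window):
    """Slice a signal into (input window, next sample) pairs for
    next-sample prediction."""
    return [
        (signal[i:i + window], signal[i + window])
        for i in range(len(signal) - window)
    ]


# Assumed settings: 440 Hz and 880 Hz tones sampled at 8 kHz.
mixture = two_tone(440.0, 880.0, 8000, 100)
pairs = training_pairs(mixture, window=16)
```

Each pair holds `window` past samples as input and the sample that follows them as the target, so longer windows give the network more context at the cost of fewer pairs per signal.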