Online Speaker Separation Using Deep Clustering
Wang, Shanshan (2019)
Degree Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Acceptance date
2019-11-21
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-201911045727
Abstract
In this thesis, a low-latency variant of the speaker-independent deep clustering method
is proposed for speaker separation. Compared to the offline deep clustering separation
system, the bidirectional long short-term memory networks (BLSTMs) are replaced with
unidirectional long short-term memory networks (LSTMs). The reason is that a BLSTM
must process the data in both the forward and backward directions, and its final
outputs depend on both directions, which makes online processing impossible.
In addition, the 32 ms synthesis window is replaced with an 8 ms window to suit
low-latency applications such as hearing aids, since the algorithmic latency depends
on the length of the synthesis window. Furthermore, the beginning of the audio mixture,
referred to here as the buffer, is used to estimate the cluster centers of the
constituent speakers in the mixture as an initialization step. These centers are then
used to assign clusters for the rest of the mixture, achieving speaker separation with
a latency of 8 ms. The algorithm is evaluated on the Wall Street Journal corpus (WSJ0).
Changing the networks from BLSTM to LSTM while keeping the same window
length degrades the separation performance, measured by signal-to-distortion ratio
(SDR), by 1.0 dB, which implies that future information is important for the
separation. To investigate the effect of window length, the network structure (LSTM)
is kept fixed and the window length is changed from 32 ms to 8 ms, which causes a
further 1.1 dB drop in SDR. For the low-latency deep clustering speaker separation
system, different buffer durations are studied. The separation performance initially
increases as the buffer length increases. However, from a buffer length of 0.3 s
onward, the separation performance remains steady even when the buffer is lengthened
further. Compared to the offline deep clustering separation system, a degradation of
2.8 dB in SDR is observed for the online system.
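
The buffer-based initialization described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`kmeans`, `online_masks`), the embedding layout `(frames, frequency bins, embedding dimension)`, the use of plain k-means with a farthest-point initialization, and the choice of binary masks are all assumptions made for illustration. The embeddings are assumed to come from an already trained LSTM, which is outside the scope of the sketch.

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain k-means on (N, D) points; returns (k, D) centers.
    Farthest-point initialization is an illustrative choice."""
    centers = [points[0]]
    for _ in range(1, k):
        # squared distance of every point to its nearest chosen center
        d = np.min([((points - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(0)
    return centers

def online_masks(embeddings, n_buffer_frames, n_speakers=2):
    """embeddings: (T, F, D) array of per-bin embeddings (assumed layout).
    Cluster centers are estimated once from the first n_buffer_frames
    frames and then kept fixed for all later frames."""
    T, F, D = embeddings.shape
    buffer_points = embeddings[:n_buffer_frames].reshape(-1, D)
    centers = kmeans(buffer_points, n_speakers)
    masks = np.zeros((T, F, n_speakers), dtype=bool)
    # Frames are processed one at a time, as they would arrive in a stream.
    # For simplicity the buffer frames are assigned with the same centers.
    for t in range(T):
        d = ((embeddings[t][:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        masks[t, np.arange(F), d.argmin(1)] = True
    return masks, centers
```

Each time-frequency bin receives exactly one speaker label, so `masks[..., k]` can be applied to the mixture spectrogram to obtain speaker `k`. In a real system, the number of buffer frames would follow from the buffer duration and the frame hop, neither of which is fixed by this sketch.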