Methods for Analyzing Attention Heads in Transformer-Based Predictive Models of Speech Representation Learning
Dogan, Duygu (2021)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of acceptance
2021-11-03
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202110137569
Abstract
The Transformer has recently become one of the most popular deep learning models for processing sequential data. At its core lies a mechanism called attention, which learns the dependencies between different items in a sequence. The Transformer applies several attention operations in parallel, known as multi-head attention, so that different types of dependencies in the data can be captured. As the Transformer has become widespread, a considerable body of literature has investigated attention heads in natural language processing tasks; how they behave in speech processing tasks, however, remains under-explored. The present study examines several methodologies for analyzing attention heads in speech representation learning. For this purpose, two Transformer-based predictive coding models with different learning strategies, Autoregressive Predictive Coding and Contrastive Predictive Coding, are used. The attention heads are grouped into explainable categories using temporal analysis together with correlation and linear regression analyses against known characteristics of speech. In addition, the contributions of individual heads to performance on phoneme classification tasks are evaluated and analyzed by their assigned categories using the aforementioned methods. The results of the correlation and linear regression analyses show that individual attention heads serve different functions and learn from different speech features. Combined with the temporal analysis, the findings further indicate that heads learning from phonetic features tend to concentrate their attention on neighboring frames more consistently, whereas heads learning from other acoustic features spread their attention over the more distant past. The analyses also indicate that, instead of using all heads, a subset of heads can be used without seriously degrading phoneme classification performance. Although the choice of this subset is mainly related to the temporal behavior of the heads, it varies with the learning strategy of the model: as the model predicts further into the future, the best subset consists of the heads that concentrate their attention on the more recent past.
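To make the object of analysis concrete, the sketch below shows a minimal multi-head scaled dot-product attention layer in NumPy, with an optional causal mask of the kind used by autoregressive predictive models. It is an illustrative sketch, not the thesis code: the function names, the random stand-in weights, and the frame and feature dimensions are assumptions, and the per-head attention maps it returns are the kind of quantities the temporal, correlation, and regression analyses described above would operate on.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the chosen axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, causal=True, seed=0):
    """Minimal multi-head scaled dot-product attention (illustrative only).

    x: array of shape (seq_len, d_model), e.g. a sequence of frame vectors.
    Returns the attended output (seq_len, d_model) and the per-head
    attention maps (num_heads, seq_len, seq_len).
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(seed)

    # Random projections stand in for the learned Q/K/V/output weights.
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        for _ in range(4)
    )

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Each head computes its own attention distribution over the frames.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, T, T)
    if causal:
        # Autoregressive predictive models attend only to current and past frames.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    attn = softmax(scores, axis=-1)                        # each row sums to 1
    context = attn @ v                                     # (heads, T, d_head)

    # Concatenate the heads and project back to the model dimension.
    out = context.transpose(1, 0, 2).reshape(seq_len, d_model) @ w_o
    return out, attn

# Example: 50 frames of 128-dimensional features, 8 attention heads.
frames = np.random.default_rng(1).standard_normal((50, 128))
out, attn = multi_head_attention(frames, num_heads=8)
print(out.shape, attn.shape)   # (50, 128) (8, 50, 50)
```

Given such per-head attention maps, one natural statistic in this setting would be, for example, how far back in time each head places most of its attention mass; heads whose mass stays on neighboring frames versus heads attending to the more distant past would fall into the different categories discussed in the abstract.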