Cross-modal semantic retrieval in neural models of visually grounded speech
Pesä, Paavo (2023)
Bachelor's Programme in Computing and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Acceptance date
2023-03-01
Permanent address of the publication:
https://urn.fi/URN:NBN:fi:tuni-202302172479
Abstract
Machine learning, and its subset deep learning, have been prominent topics in technology for several years. Visually grounded speech (VGS) models belong to a family of weakly supervised deep learning methods. The objective of VGS models is to learn to recognize semantic similarities between the image and speech modalities without ever being trained with strong, ground-truth labels for spoken words and visual objects. Instead, training relies only on the coarse-level correspondence between whole images and spoken utterances. This form of weakly supervised learning resembles what infants experience while learning their native language.
The first part of this thesis discusses visually grounded speech models, their typical structure, and the methods used in constructing them. The latter part focuses on convolutional neural network (CNN) based VGS models and explores how the structure of the speech encoder block affects the performance of these models on semantic retrieval tasks. More specifically, I design and test three variations of the speech encoder block and compare their performance to the literature. My implementations differ from the literature by: (i) smaller receptive fields, (ii) an increased number of filter channels, and (iii) larger receptive fields combined with an increased number of filter channels.
The results of this work show that modifying the hyperparameters of CNN audio encoders affects VGS model performance. While decreasing the temporal receptive fields of the convolutional layers has a negative effect on cross-modal retrieval scores, increasing the receptive fields yields better results, at the cost of higher computational complexity and longer training time than reported in the literature.
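The temporal receptive field referred to above grows with the kernel sizes and strides of the stacked convolutional layers in the speech encoder. As a minimal sketch of this relationship (the layer configurations below are hypothetical illustrations, not the thesis's actual encoder settings), the standard receptive-field arithmetic for a 1-D convolution stack can be computed as:

```python
def receptive_field(layers):
    """Temporal receptive field (in input frames) of a stack of
    1-D convolutional layers, each given as (kernel_size, stride).

    Each layer widens the field by (kernel - 1) input frames,
    scaled by the cumulative stride (jump) of the layers before it.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Hypothetical four-layer encoder variants, mirroring the three
# modifications described in the abstract:
baseline = [(6, 2)] * 4  # reference configuration
narrow   = [(3, 2)] * 4  # (i) smaller kernels -> smaller receptive field
wide     = [(9, 2)] * 4  # (iii) larger kernels -> larger receptive field

for name, cfg in [("baseline", baseline), ("narrow", narrow), ("wide", wide)]:
    print(name, receptive_field(cfg))
```

Changing the number of filter channels, by contrast, leaves the receptive field unchanged and only alters the representational capacity (and compute cost) of each layer, which is why variations (ii) and (iii) in the abstract are treated separately.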
Collections
- Bachelor's theses [8798]