Computational Modeling of Early Language Acquisition with Multimodal Neural Networks
Khorrami, Khazar (2024)
Tampere University
Doctoral Programme in Computing and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
Date of defence
2024-10-18
The permanent address of the publication is
https://urn.fi/URN:ISBN:978-952-03-3616-5
Abstract
During the first year of life, infants reach significant milestones in language development, demonstrating remarkable perceptual abilities shortly after birth and acquiring early word recognition and comprehension skills as early as 6 months of age. How infants acquire language has long interested researchers and has prompted a range of theoretical accounts. Statistical learning, in which infants leverage the inherent statistical properties of speech input to infer its structure, and in particular cross-situational learning, in which children learn word meanings by observing co-occurrences of words and their referents in the external world, have been proposed as potential mechanisms for early language acquisition.
This thesis uses computational models to investigate the feasibility of general statistical learning as a means to bootstrap early language acquisition. Motivated by the interplay between auditory and visual stimuli in early language learning, the thesis employs self-supervised deep neural network models for speech representation learning in both unimodal and multimodal (audiovisual) learning frameworks, and conducts a series of computational studies to explore the development of early language perception skills.
The exploration begins with the formulation of a Latent Language Hypothesis (LLH), which challenges traditional notions of language acquisition by proposing that linguistic knowledge may emerge as latent representations that support predictive processing within and across sensory modalities. Through extensive experiments with artificial neural network models for visually grounded speech processing, the research examines whether linguistic patterns, including phonemes, syllables, and words, can emerge from cross-modal mapping between the speech and visual domains.
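To make the visually grounded setting concrete, the following is a minimal sketch of how such a model can be trained, assuming a simple speech encoder, an image encoder operating on precomputed visual features, and an InfoNCE-style contrastive objective. The layer sizes, feature dimensions, and loss formulation are illustrative assumptions, not the exact architectures studied in the thesis.

```python
# Minimal sketch of a visually grounded speech (VGS) model: a speech encoder
# and an image encoder are trained so that matched speech-image pairs score
# higher than mismatched pairs. All sizes and the InfoNCE loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    def __init__(self, n_mels=40, dim=512):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, dim, kernel_size=5, padding=2)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, mels):                       # mels: (B, T, n_mels)
        x = self.conv(mels.transpose(1, 2)).transpose(1, 2)
        h, _ = self.rnn(x)                         # frame-level features (B, T, dim)
        return F.normalize(h.mean(dim=1), dim=-1)  # pooled utterance embedding

class ImageEncoder(nn.Module):
    def __init__(self, n_feats=2048, dim=512):
        super().__init__()
        self.proj = nn.Linear(n_feats, dim)        # e.g. on pooled CNN features

    def forward(self, feats):                      # feats: (B, n_feats)
        return F.normalize(self.proj(feats), dim=-1)

def contrastive_loss(a, v, temperature=0.07):
    """InfoNCE over a batch: the matched (speech, image) pair is the positive;
    all other pairings in the batch serve as negatives."""
    logits = a @ v.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Under an objective of this kind, the model is never told which sounds are words; any phonemic, syllabic, or lexical structure that emerges does so only insofar as it helps associate utterances with their visual referents.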
Moreover, the thesis investigates the segmentation and alignment of individual visual objects and spoken words within the visually grounded speech processing models. It introduces a set of novel standardized evaluation metrics for this problem, together with new model variants that explore the role of a cross-modal attention layer in enhancing cross-modal object-word alignments.
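As a rough illustration of what such a cross-modal attention layer could look like, the sketch below lets each visual object region attend over frame-level speech features and returns the resulting alignment weights; the single-head scaled dot-product formulation and the tensor shapes are assumptions for illustration, not the thesis models themselves.

```python
# Hedged sketch of a cross-modal attention layer: each visual object (region)
# embedding attends over frame-level speech features, producing an alignment
# matrix that object-word alignment metrics could score. The single-head
# scaled dot-product form is an illustrative assumption.
import torch
import torch.nn.functional as F

def cross_modal_attention(objects, speech, dim=512):
    """objects: (B, N, dim) region embeddings; speech: (B, T, dim) frame
    embeddings. Returns per-object attended speech features and the
    (B, N, T) object-to-frame alignment weights."""
    scores = torch.einsum('bnd,btd->bnt', objects, speech) / dim ** 0.5
    align = F.softmax(scores, dim=-1)  # each object distributes attention over time
    attended = torch.einsum('bnt,btd->bnd', align, speech)
    return attended, align
```

Reading out the alignment weights, for example by thresholding each object's attention distribution over time, gives a candidate segmentation of the spoken word that refers to that object, which is the kind of output such evaluation metrics would score.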
By leveraging state-of-the-art techniques in self-supervised learning and visually grounded speech processing, the thesis also investigates the joint optimization of auditory and visual learning mechanisms and their role in shaping early linguistic skills. Furthermore, it explores the temporal dynamics of learning, examining how the relative contributions of auditory and audiovisual learning influence the order and pace at which linguistic representations are acquired.
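One simple way to combine the two learning mechanisms, shown here purely as a hedged sketch, is to mix a unimodal self-supervised speech objective (here a toy masked-frame reconstruction loss) with the audiovisual contrastive loss through a weighting coefficient. `FramePredictor`, `masked_prediction_loss`, and `alpha` are hypothetical names introduced for this example, not components of the thesis models.

```python
# Illustrative sketch of a joint objective: a toy masked-frame reconstruction
# loss stands in for the unimodal self-supervised speech objective, and alpha
# balances it against the audiovisual contrastive loss. All names here are
# hypothetical, introduced only for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FramePredictor(nn.Module):
    """Toy contextual encoder that reconstructs masked input frames."""
    def __init__(self, n_mels=40, dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dim, n_mels)

    def forward(self, mels):                     # (B, T, n_mels)
        h, _ = self.rnn(mels)
        return self.out(h)                       # reconstruction (B, T, n_mels)

def masked_prediction_loss(model, mels, mask_ratio=0.15):
    """Mask random frames and regress them from the contextual output."""
    mask = torch.rand(mels.shape[:2], device=mels.device) < mask_ratio
    pred = model(mels.masked_fill(mask.unsqueeze(-1), 0.0))
    return F.mse_loss(pred[mask], mels[mask])

def joint_loss(ssl_loss, av_loss, alpha=0.5):
    # alpha sets the relative contribution of auditory (unimodal) versus
    # audiovisual (cross-modal) learning in the combined objective
    return alpha * ssl_loss + (1.0 - alpha) * av_loss
```

Varying such a coefficient over the course of training would be one way to probe how the balance of auditory and audiovisual learning shapes the temporal acquisition pattern of linguistic representations.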
Finally, the thesis investigates early lexical bootstrapping in a joint model of self-supervised speech and speech-vision learning, using realistic-scale speech input and a realistic number of audiovisual naming events for the first 12 months of an infant's life. Through this investigation, the study evaluates whether realistic-scale sensory input, when paired with statistical learning algorithms, is sufficient for bootstrapping early lexical learning, thereby providing insight into potential mechanisms of early language acquisition in infants.
The findings underscore the importance of both auditory and visual modalities in shaping early linguistic representations and emphasize the significance of unimodal and cross-modal statistical learning in language acquisition. These insights deepen our understanding of infant language acquisition and hold implications for the design of artificial intelligence systems capable of learning from multimodal sensory inputs.