Machine Learning Methods for Heterogeneous Data: Multimodal deep learning and dimensionality reduction
Chumachenko, Kateryna (2024)
Tampere University
Doctoral Programme in Computing and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
Date of defence
2024-10-04
The permanent address of the publication is
https://urn.fi/URN:ISBN:978-952-03-3601-1
Abstract
Data-driven methods have transformed the modern world by enabling the development of innovative applications across various fields. One of the main driving forces behind such methods is the availability of large amounts of data that can be used to train predictive models. The data around us exists in different formats, structures, and representations, making it possible to solve increasingly complex tasks. Still, task-specific concepts often lack precise definitions and can be described in multiple ways, leading to discrepancies in the underlying class and feature space distributions. This phenomenon is commonly referred to as data heterogeneity.
In certain scenarios, data heterogeneity can be leveraged to enhance model capabilities by incorporating multiple data representations to solve a specific task. However, it also challenges the robustness of machine learning models. Addressing differences in data representations has therefore become an essential and actively evolving subfield of machine learning.
This dissertation aims to contribute to the machine learning field by developing methods that are either robust to such data heterogeneity or exploit it to the models' benefit. Specifically, we focus on two main subcategories of the area. First, we consider data distributions that are heterogeneous with respect to class labels, i.e., distributions where the samples of one class cannot be described by a homogeneous unimodal distribution. In particular, we focus on dimensionality reduction methods capable of operating on such data and propose several approaches that address the existing limitations of these methods. We first target training efficiency and propose a speed-up approach that reaches the solution faster than baseline methods. Next, we increase robustness to data with multiple representations and dimensionalities within a dataset by proposing a multi-view extension of the original formulation; the proposed method obtains improved accuracy on several multi-view datasets compared to competing methods. Further, we develop an incremental solution that updates the projection matrix with new data without recalculating it from scratch, achieving further gains in training efficiency. Next, we propose a method that improves robustness to distribution imbalance within a single class, both in terms of sample size and in terms of sample relevance as quantified by sample distances. The experimental results indicate the method's effectiveness on several datasets from different domains. Finally, we draw a connection between the fields of subspace learning and deep learning by proposing a data-driven weight initialization approach for feedforward neural networks, which improves both convergence speed and overall performance.
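As a rough illustration of the last idea, the sketch below seeds the first layer of a feedforward network with directions from a discriminant-style subspace computed on the training data. It is a minimal sketch of the general concept only; the function name, the scatter-matrix formulation, and the regularization constant are illustrative assumptions, not the dissertation's actual algorithm.

```python
# Hypothetical sketch: data-driven weight initialization via a
# discriminant-style subspace. Illustrates the general idea only.
import numpy as np
from scipy.linalg import eigh

def discriminant_init(X, y, n_units):
    """Return an (n_units, n_features) weight matrix whose rows are the
    leading generalized eigenvectors of the between-class scatter Sb
    with respect to the within-class scatter Sw."""
    n_features = X.shape[1]
    mean_total = X.mean(axis=0)
    Sw = np.zeros((n_features, n_features))
    Sb = np.zeros((n_features, n_features))
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_total)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Regularize Sw so the generalized eigenproblem is well posed.
    Sw += 1e-4 * np.eye(n_features)
    # Solve Sb v = lambda Sw v; eigh returns eigenvalues in ascending order.
    vals, vecs = eigh(Sb, Sw)
    return vecs[:, ::-1][:, :n_units].T  # top n_units directions as rows

# Usage: seed the first layer of a feedforward net, then train as usual.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = rng.integers(0, 3, size=200)
W0 = discriminant_init(X, y, n_units=8)
print(W0.shape)  # (8, 16)
```

The remaining layers would be initialized and trained as usual; the intuition is that discriminant directions give the network a better starting point than purely random weights.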
In the second part of the dissertation, we develop novel methods that leverage data heterogeneity, in the form of multiple data representations within one dataset, to build stronger predictive models. Specifically, we propose multimodal deep learning models and training approaches for different tasks. We consider several common obstacles to the practical adoption of multimodal models, namely robustness to missing data, unimodal-only inference, and the adaptation of disjoint unimodal models to multimodal tasks, and propose solutions to these challenges. First, we propose a novel model for audiovisual emotion recognition together with a training regime that improves its robustness to data missing from one of the modalities at inference time. Experimental results show the model's effectiveness compared to competing fusion methods, as well as the benefits of the training approach for robustness to missing data. Further, we develop a generalized framework for improving unimodal inference by leveraging multimodal training and show its effectiveness on a set of tasks and model architectures. Finally, we develop a method for adapting disjoint self-supervised unimodal models to downstream audiovisual dynamic facial expression recognition, obtaining state-of-the-art results on two benchmarks.
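One generic way to train for robustness to a missing modality is modality dropout: randomly hiding one input stream during training so the fusion model learns to predict from whichever modalities remain. The PyTorch sketch below illustrates this common technique with assumed toy encoders and averaging fusion; it is not the specific model or training regime proposed in the dissertation.

```python
# Generic sketch of modality-dropout training for audiovisual fusion.
# Architecture, dimensions, and dropout schedule are illustrative assumptions.
import torch
import torch.nn as nn

class AVFusionClassifier(nn.Module):
    def __init__(self, audio_dim, video_dim, hidden=128, n_classes=7):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, audio=None, video=None):
        feats = []
        if audio is not None:
            feats.append(self.audio_enc(audio))
        if video is not None:
            feats.append(self.video_enc(video))
        # Average the available modality embeddings so the classifier head
        # sees a fixed-size input whether one or both modalities are present.
        fused = torch.stack(feats).mean(dim=0)
        return self.head(fused)

def training_step(model, batch, p_drop=0.3):
    audio, video, labels = batch
    # Randomly hide one modality (keeping at least one) so the model
    # learns to cope with missing inputs at inference time.
    drop_audio = torch.rand(1).item() < p_drop
    drop_video = (not drop_audio) and torch.rand(1).item() < p_drop
    logits = model(audio=None if drop_audio else audio,
                   video=None if drop_video else video)
    return nn.functional.cross_entropy(logits, labels)
```

At inference, the same forward pass accepts a single available modality without any architectural change, which is the property such training aims to strengthen.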
The methods developed in this dissertation fill gaps in the field of machine learning associated with heterogeneous data: they address limitations of existing approaches and introduce novel models and training methods that leverage data heterogeneity, in the form of multimodality, to improve the performance of deep learning models on a number of tasks.