A Scalable Method for Nonlinear Dimensionality Reduction with Applications to Single-Cell Data
Hietanen, Aleksi (2022)
Master's Programme in Science and Engineering
Faculty of Engineering and Natural Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2022-05-31
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202204274026
Abstract
Current-generation biological measurement technologies make it possible to quantify cellular characteristics and processes at a genome-wide scale and single-cell resolution, producing invaluable data for research on complex phenomena such as cancer. Because such high-dimensional data is difficult to process and reason about directly, dimensionality reduction is commonly employed in the analysis of molecular data, either as a preprocessing step for downstream analysis or for exploratory data analysis, where the data is visualized in lower dimensions to gain further insights. However, these vast amounts of high-dimensional data pose several challenges to the dimensionality reduction methods in common use today. Linear methods such as PCA cannot capture the nonlinear, heteroscedastic nature of the data being transformed, whereas the computational complexity of nonlinear methods such as t-SNE becomes an obstacle with large data sets. In addition, the prevailing nonlinear methods are based on distance metrics, which makes them prone to the curse of dimensionality.
In this work, a method for nonlinear dimensionality reduction is proposed that aims to address these issues. By exploiting the properties of two separate neural network architectures, namely a stochastic variant of autoencoders and a parametric variant of t-SNE, we are demonstrably able to mitigate the outlined issues.
The proposed method is compared to existing methods, with excellent results in terms of scalability to large data sets, robustness to sparse and corrupt data, and ability to combat the curse of dimensionality. Additionally, the practical application of the proposed method to single-cell data sets obtained from cancer patients' tissue samples is demonstrated. We believe that such methodological developments enable more efficient use of emerging single-cell data, which could in turn translate to biologically testable hypotheses and benefits in patient care.