A data-driven transcriptomics-based approach for cell line prioritization to model lung tissue
Annala, Maria (2024)
Master's Programme in Biotechnology and Biomedical Engineering
Faculty of Medicine and Health Technology
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2024-05-02
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202403213023
Abstract
The selection of a suitable cell line for an in vitro test system is fundamental for establishing a fit-for-purpose model representing the mechanistic, molecular, and transcriptomic profile of the studied disease and tissue. While tools for defining tissue specificity exist, they were developed neither for comparing the similarities between cell lines and tissues nor for evaluating the suitability of a cell line for toxicological research.
In this thesis, a computational model was developed to provide a data-driven tool for assessing the suitability of a cell line for modelling lung tissue and studying pulmonary fibrosis. The transcriptomes of cell lines retrieved from ENCODE, Human Protein Atlas and EMTAB were compared to the transcriptome of lung tissue retrieved from the GTEx database using three computational approaches: similarity of the whole transcriptome based on ranking genes by their expression, embedded features extracted from link prediction between nodes in a protein-protein interaction network, and similarity based on fuzzy labelling of genes according to their expression level. Additionally, the datasets were filtered for cell lines with high similarity to whole blood samples from GTEx. This was done to penalize cell types naturally present in lung tissue, such as blood and vascular cell types, since their high rank reflects not similarity to the lung tissue itself but a general immune profile or vascular structures present in multiple tissues.
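The first approach, ranking genes by expression and comparing whole transcriptomes, can be illustrated with a rank correlation. The sketch below assumes Spearman's rank correlation as the similarity measure; the abstract does not name the exact statistic used in the thesis, and the expression values and cell line names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (lowest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical expression values for the same five genes in each sample
lung = [120.0, 5.0, 60.0, 0.5, 33.0]
cell_lines = {
    "line_A": [100.0, 8.0, 70.0, 1.0, 20.0],  # same gene ordering as lung
    "line_B": [1.0, 90.0, 2.0, 75.0, 10.0],   # roughly reversed ordering
}

similarity = {name: spearman(lung, expr) for name, expr in cell_lines.items()}
ranked_lines = sorted(similarity, key=similarity.get, reverse=True)
```

Because only the gene order matters, a measure of this kind is insensitive to differences in absolute expression scale between datasets, which is one plausible reason to rank genes before comparing sources such as ENCODE and GTEx.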
A consensus rank was calculated with the Borda method from the three approaches for each cell line that ranked in the top 20 % in all three approaches. This resulted in a final list of 16 candidate cell lines from all datasets, mainly consisting of fibroblasts and epithelial cell types derived from lung, colon and kidney adenocarcinoma, glioblastoma, embryonal stem cells, and skin. Gene Ontology (GO) analysis was conducted to evaluate the performance of the model and to identify which molecular functions and processes explain the similarities between the cell lines and lung tissue.
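The consensus step can be sketched as follows. This is a minimal illustration of a Borda count restricted to candidates that appear in the top 20 % of every ranking; the scoring convention (the best position earns as many points as there are items), the tie-breaking by name, and the cell line names are assumptions, not details taken from the thesis.

```python
def top_fraction(ranking, fraction=0.2):
    """Return the items in the top `fraction` of a best-first ranking."""
    k = max(1, int(len(ranking) * fraction))
    return set(ranking[:k])


def borda_consensus(rankings, fraction=0.2):
    """Borda consensus over candidates in the top fraction of every ranking."""
    # Keep only candidates that every approach places in its top fraction
    shared = set.intersection(*(top_fraction(r, fraction) for r in rankings))
    scores = {candidate: 0 for candidate in shared}
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            if candidate in shared:
                scores[candidate] += n - position  # best position earns n points
    # Highest total first; ties broken alphabetically for reproducibility
    return sorted(shared, key=lambda c: (-scores[c], c))


# Three hypothetical best-first rankings of the same ten cell lines
rank_based = ["line_A", "line_B", "line_C", "line_D", "line_E",
              "line_F", "line_G", "line_H", "line_I", "line_J"]
network_based = ["line_B", "line_A", "line_D", "line_C", "line_E",
                 "line_G", "line_F", "line_H", "line_J", "line_I"]
fuzzy_based = ["line_B", "line_A", "line_C", "line_E", "line_D",
               "line_F", "line_H", "line_G", "line_I", "line_J"]

consensus = borda_consensus([rank_based, network_based, fuzzy_based])
```

Here only `line_A` and `line_B` are in the top 20 % of all three rankings, so the consensus list contains just those two, ordered by their summed Borda points.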
The candidate cell lines and the lung tissue shared expression signatures related to epithelial barrier function, filtration of blood or inhaled air, immune signalling and regulation, and structural components such as the extracellular matrix (ECM) and cell membrane. Among the most enriched biological processes, molecular functions and cellular components were main components of the ECM network, which affect immune system regulation. Processes related to tissue repair or perturbations in the cellular environment were also enriched, further indicating that the gene expression of the candidate cell lines resembles functions and signalling processes present in the dynamic and complex environment of the lung tissue.
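Enrichment of a GO term is commonly assessed with an over-representation test. The sketch below assumes a one-sided hypergeometric test; the abstract does not state which enrichment method or tool the thesis used, and the gene counts in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(background, annotated, selected, overlap):
    """Upper-tail hypergeometric p-value for over-representation of a GO term.

    background: number of genes in the background set
    annotated:  background genes annotated with the GO term
    selected:   genes in the list of interest
    overlap:    selected genes annotated with the GO term
    """
    total = comb(background, selected)
    upper = min(annotated, selected)
    # Probability of seeing `overlap` or more annotated genes by chance
    tail = sum(
        comb(annotated, i) * comb(background - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 background genes, 5 annotated with the term,
# 2 genes selected, both of them annotated
p_value = enrichment_p_value(background=10, annotated=5, selected=2, overlap=2)
```

In practice, p-values of this kind are computed for many GO terms at once and corrected for multiple testing before terms are reported as enriched.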