Blood Cancer Lineage Identification: A Machine Learning Approach
Liuksiala, Thomas Edward (2015)
Degree Programme in Automation Engineering
Faculty of Computing and Electrical Engineering
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2015-06-03
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tty-201505181291
Abstract
Cancer, one of the most common killers of modern humans, is caused by malfunctioning hereditary material manifesting as uncontrolled, life-threatening growth of a tumor. A change in the hereditary material within a cell may cause a re-programming of its gene-regulatory system. If the altered system leads to a gene expression state enabling limitless growth and replication, the cell has become cancerous. The state of a given tissue can be determined by gene expression array experiments. A massive number of such measurements have been performed and made publicly accessible by the research community. Analyzing this type of high-dimensional data, spanning countless instances of separate phenotypes, however, poses computational and algorithmic challenges. Machine learning algorithms, in particular, have proven valuable in mining for pieces of knowledge hidden in biological big data.
This thesis presents a machine learning approach to estimate the nearest healthy gene expression state of a tumor and to quantify the regulatory divergence of the tumor from the normal state. The method was applied to hematological malignancies, or cancers of blood and lymph nodes. First, a hematological gene expression data set was integrated from 9,544 tumor and normal tissue samples available in a public data repository. Second, quality control, normalization and bias correction steps were performed to enable collective analysis of this data produced by hundreds of laboratories worldwide.
Principal component analysis at different scales, cluster analysis and supervised classification verified that the data set indeed allows for expression studies involving measurements from multiple laboratories. The characterization of hematological malignancies as gene-regulatory deviations from normal tissues uncovers the developmental lineage of the cancers and places them regulatory-wise between stem cells and mature blood cells. The results open up new biological hypotheses and a new approach to curing cancer, and they suggest that similar analyses in the context of other malignancies could be equally fruitful.
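As a rough illustration of what "nearest healthy state" and "regulatory divergence" could mean computationally, the following minimal sketch assigns a tumor expression profile to the closest healthy reference profile and reports the distance to it. This is not the method of the thesis, which the abstract does not specify: the use of per-cell-type centroids, the Euclidean distance, the function name `nearest_healthy_state`, the labels and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def nearest_healthy_state(tumor, healthy_centroids):
    """Return (label, distance) of the healthy centroid closest to a tumor profile.

    Hypothetical helper: Euclidean distance stands in for whatever
    divergence measure the thesis actually uses.
    """
    labels = list(healthy_centroids)
    centroids = np.stack([healthy_centroids[k] for k in labels])
    # One distance per healthy reference profile.
    distances = np.linalg.norm(centroids - tumor, axis=1)
    best = int(np.argmin(distances))
    return labels[best], float(distances[best])


# Synthetic toy data (not real expression measurements).
rng = np.random.default_rng(0)
n_genes = 50
stem = rng.normal(0.0, 1.0, n_genes)
mature = stem + 3.0  # shifted profile standing in for a mature cell type
healthy_centroids = {"stem_cell": stem, "mature_b_cell": mature}

# A "tumor" simulated as a small perturbation of the mature profile.
tumor = mature + rng.normal(0.0, 0.2, n_genes)

label, divergence = nearest_healthy_state(tumor, healthy_centroids)
print(label, round(divergence, 3))
```

In this toy setup the returned label plays the role of the estimated lineage and the distance plays the role of the divergence from the normal state. A real analysis would first need the quality control, normalization and bias correction steps described above.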