Subcellular Protein Localization in Fluorescence Images Using Convolutional Neural Networks
Huttunen, Riku (2018)
Information Technology
Faculty of Computing and Electrical Engineering
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2018-12-05
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tty-201811192622
Abstract
Gene expression is manifested through the synthesis of proteins within the cell. The Cell Atlas, part of the Human Protein Atlas project, provides immunofluorescence microscopy images of cell structures aligned with images showing the stained protein of interest. The images reveal the localization patterns of the majority of proteins found in human cells. These patterns can in turn be used to study the cellular functions related to gene expression and mutations. With the advent of deep learning methods for computer vision, machine learning algorithms can be used to categorize the localization patterns by subcellular structure.
In this thesis, two types of neural network were applied to the classification of Cell Atlas samples from the dataset used in the imaging challenge of the 33rd Congress of the International Society for Advancement of Cytometry, where the task was multi-class, multi-label classification of the images into 13 subcellular structures. The algorithms tested were Convolutional Neural Networks (CNN) and Fully Convolutional Networks (FCN). Model performance was evaluated with the class-wise F1 score. The results were promising, with the CNN and FCN implementations yielding weighted averages of class-wise F1 scores of 0.822 and 0.810, respectively. Another notable observation is that the FCN, which outputs probability maps showing where in the image a given class is present rather than a single probability for the whole image, learns significantly faster than the CNN, suggesting that it uses the spatial information in the training samples efficiently. The FCN's outputs are also more informative than those of the CNN, which discards spatial information.
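The evaluation metric described above can be illustrated with a minimal sketch. It assumes that labels and predictions are given as 0/1 indicator rows (one column per class) and that the weights in the weighted average are the per-class support, i.e. the number of true samples per class; the abstract does not state the weighting scheme, so this is an assumption. The function names `classwise_f1` and `weighted_f1` and the toy data are hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence


def classwise_f1(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> List[float]:
    """Return one F1 score per class for multi-label 0/1 indicator rows."""
    n_classes = len(y_true[0])
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 0)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def weighted_f1(y_true: Sequence[Sequence[int]],
                y_pred: Sequence[Sequence[int]]) -> float:
    """Average the class-wise F1 scores, weighted by class support (assumed)."""
    scores = classwise_f1(y_true, y_pred)
    support = [sum(row[c] for row in y_true) for c in range(len(scores))]
    total = sum(support)
    return sum(s * n for s, n in zip(scores, support)) / total if total else 0.0


# Toy example with 4 samples and 3 classes (the challenge itself has 13).
y_true = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
print(classwise_f1(y_true, y_pred))
print(weighted_f1(y_true, y_pred))
```

Weighting by support means that frequent classes dominate the average, which matters for a heavily imbalanced dataset: a rare class that is never predicted correctly lowers the weighted score only slightly.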
Considering the relatively small size of the dataset (20 000 samples, divided into 16 000 training samples and 4 000 testing samples) and the fact that the data are heavily imbalanced, the results are promising. More complex deep learning architectures can take advantage of millions of images, so future research should increase the size of the labeled dataset. The quality of the labels can also be questioned, as they are derived from a consensus among individuals without professional training.