Novellette: an RNA-sequencing data analysis pipeline for detecting novel transcripts
Seppälä, Janne (2013)
Degree Programme in Biotechnology
Faculty of Computing and Electrical Engineering; Faculty of Natural Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2013-12-04
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tty-201312191514
Abstract
Proteins are key components of every living organism and contribute to almost every biological process and structure in a cell. They are formed through protein synthesis, in which the genetic code of the DNA in a gene is first transcribed into RNA and then translated into a protein. Each gene is a short part of a longer DNA molecule, and each gene encodes the synthesis of a certain protein or proteins. The current state-of-the-art method for evaluating which genes are expressed in a given biological sample is RNA-sequencing, which captures the RNA content of the cells at a specific time point. Although RNA-sequencing is often used to measure the expression of known genes in the genome, it can also be used to search for new genes, or novel transcripts.
To date, no standard tool has been published that effectively and thoroughly identifies novel transcripts from RNA-sequencing data. In this work, a tool that aims to solve this issue, denoted Novellette, is presented. Novellette attempts to identify differentially expressed regions in the genome that do not overlap with any known genes, and then performs a full gene structure analysis of these regions. In this process, information from both the processed RNA-sequencing data and the known DNA sequence of the studied organism is used to search the novel transcript candidates for features that are common to protein-coding genes. The features are then scored, and the final novel transcript candidates are ranked by their score values. In addition to the development of the tool, the basics of statistical testing and other mathematical methods related to RNA-sequencing data analysis are introduced, and the normality of count-based RNA-sequencing data is assessed with publicly available data.
The results of analyses performed with various input data show that Novellette is able to reliably detect novel transcripts and, with the proposed scoring approach, to distinguish protein-coding regions from non-coding regions in the genome. In addition, count-based RNA-sequencing data are shown to follow the normal distribution very poorly, which underscores the importance of statistical hypothesis testing methods that do not assume normality. In conclusion, a functional and useful bioinformatics tool has been developed in this work that has the potential to become a standard method for novel transcript identification.
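The abstract outlines two steps that can be illustrated in code: discarding candidate regions that overlap known genes, and ranking the remaining candidates by score. The following is a minimal sketch of that idea only, not Novellette's actual implementation. The `Region` type, the function names, the half-open coordinate convention, and the example coordinates and scores are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A genomic interval; hypothetical type, half-open [start, end)."""
    chrom: str
    start: int
    end: int
    score: float = 0.0  # assumed per-candidate feature score


def overlaps(a: Region, b: Region) -> bool:
    """True if the two regions share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def rank_novel_candidates(candidates, known_genes):
    """Keep candidates that overlap no known gene, highest score first."""
    novel = [
        c for c in candidates
        if not any(overlaps(c, g) for g in known_genes)
    ]
    return sorted(novel, key=lambda r: r.score, reverse=True)


# Illustrative, made-up inputs (not data from the thesis).
known_genes = [Region("chr1", 100, 500)]
candidates = [
    Region("chr1", 450, 700, score=9.0),   # overlaps the known gene
    Region("chr1", 800, 1200, score=3.5),
    Region("chr2", 100, 500, score=7.2),   # same coordinates, other chromosome
]
ranked = rank_novel_candidates(candidates, known_genes)
```

The pairwise overlap check is quadratic in the number of regions; for genome-scale annotation a sorted sweep or an interval tree would be the more usual choice.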