Dual codon usage in germline variants of prostate cancer genomes
Kuznetsova, Irina (2015)
Master's Degree Programme in Bioinformatics
BioMediTech
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2015-05-26
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:uta-201507102077
Abstract
This thesis builds on the recent discovery by Stergachis and his colleagues, which describes how the genome writes genetic code. This work extends the earlier view in genome research, which held that codons and regulatory elements work independently. A codon is a triplet of nucleotides that encodes an amino acid; regulatory elements are responsible for regulating gene expression. However, as Stergachis and his colleagues discovered in 2013, around 15% of codons within 85% of human genes are occupied by transcription factor binding sites (TFBSs) (see Stergachis et al., 2013). Consequently, these codons encode two types of information. They were labelled ‘duons’ and described as highly conserved entities with low levels of genetic variation. In short, regulatory proteins bind to the same stretches of As, Cs, Ts and Gs that specify the amino acids of the resulting protein, and thereby influence gene expression. This thesis applies Stergachis' findings on ‘duons’ to the analysis of variant data.
An interesting aspect of Stergachis' work is that a mutation may occur without affecting the protein. This is possible because some amino acids are encoded by more than one combination of nucleotides (codon). If an alteration produces a codon that still encodes the same amino acid, the function of the resulting protein remains the same. In this case, however, transcription factors (TFs) bind to an altered (mutated) region, and their activity may change because the sequence pattern has been modified. As a result, wrong instructions are given for the expression of the gene, as Stergachis and his colleagues discovered (Stergachis et al., 2013). Their discoveries led to the finding that 13% of the deoxyribonucleic acid (DNA) mutations that lead to disease development are located in ‘duons’. It is therefore important to investigate disease-associated variants within ‘duons’, which increase the risk of disrupting both regulatory and protein-structural function.
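The idea of a mutation that changes the codon but not the amino acid can be illustrated with a minimal Python sketch. It builds the standard genetic code table and checks whether two codons are synonymous. The function name `is_synonymous` and the example codons are hypothetical and are not taken from the thesis.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_synonymous(ref_codon: str, alt_codon: str) -> bool:
    """Return True if both codons encode the same amino acid (or both are stops)."""
    return CODON_TABLE[ref_codon.upper()] == CODON_TABLE[alt_codon.upper()]


# Illustrative codons only (assumption, not data from the thesis).
print(is_synonymous("GCT", "GCC"))  # both encode alanine
print(is_synonymous("GAA", "GTA"))  # glutamate versus valine
```

A synonymous change of this kind leaves the protein unchanged, but if the codon lies within a TFBS, the altered nucleotide can still change how well a transcription factor binds.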
In 2014, Kircher introduced a method for interpreting the pathogenicity of human genetic variants, called the combined annotation dependent depletion (CADD) tool. It uses a single C score to annotate how likely a variant is to be pathogenic. In contrast to other methods, CADD takes regulatory elements into consideration, and it was therefore selected for this project.
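Filtering variants on a C score could look roughly like the following Python sketch. The record layout, the field name `c_score`, the example values, and the cut-off of 15 are all assumptions made for illustration; the abstract does not state which threshold the thesis used.

```python
# Hypothetical variant records; field names and values are illustrative only.
variants = [
    {"chrom": "1", "pos": 1000, "ref": "A", "alt": "G", "c_score": 22.4},
    {"chrom": "1", "pos": 2000, "ref": "C", "alt": "T", "c_score": 3.1},
    {"chrom": "2", "pos": 3000, "ref": "G", "alt": "A", "c_score": 15.0},
]

# Assumed cut-off, not taken from the thesis.
ASSUMED_THRESHOLD = 15.0


def filter_by_c_score(records, threshold):
    """Keep records whose C score is at or above the threshold."""
    return [record for record in records if record["c_score"] >= threshold]


likely_pathogenic = filter_by_c_score(variants, ASSUMED_THRESHOLD)
for record in likely_pathogenic:
    print(record["chrom"], record["pos"], record["ref"], record["alt"], record["c_score"])
```

Passing the threshold as a parameter keeps the cut-off explicit, so it can be adjusted to whatever value a given analysis justifies.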
These two research findings are used in this thesis. The aim of the study was to apply the position weight matrix (PWM) framework, the C score, and p value estimation to the provided data in order to extract and record those variants that potentially lie in ‘duons’ and could therefore be putative causes of disease development. First, the provided data was filtered on the C score to identify pathogenic variants. Next, the PWM framework was used to compute TFBS scores for the original reference and the mutated nucleotide sequences; the maximum and minimum differences between these scores were found and used as the criterion for computing a p value. Finally, the resulting set of genes was submitted to the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database and analysed for the correlation of mutations with the type of disease.
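The comparison of PWM scores between reference and mutated sequences can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure used in the thesis: the count matrix is invented, the log-odds conversion uses a uniform background with a pseudocount, and the empirical p value is one common way to compare an observed difference with a set of null differences. All function names are hypothetical.

```python
import math

# Hypothetical 4-position count matrix (one dict per motif position).
COUNTS = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 1, "T": 8},
]


def to_log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2 odds against a uniform background (assumption)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds for a window as long as the motif."""
    return sum(column[base] for column, base in zip(pwm, window))


def score_differences(pwm, ref_seq, alt_seq, var_index):
    """Alt-minus-ref score for every window that overlaps the variant."""
    width = len(pwm)
    first = max(0, var_index - width + 1)
    last = min(var_index, len(ref_seq) - width)
    diffs = []
    for start in range(first, last + 1):
        ref_score = score_window(pwm, ref_seq[start:start + width])
        alt_score = score_window(pwm, alt_seq[start:start + width])
        diffs.append(alt_score - ref_score)
    return diffs


def empirical_p_value(observed, null_values):
    """Share of null differences at least as extreme as the observed one,
    with an add-one correction so the estimate is never exactly zero."""
    extreme = sum(1 for value in null_values if abs(value) >= abs(observed))
    return (extreme + 1) / (len(null_values) + 1)


PWM = to_log_odds(COUNTS)
ref_seq = "GGACGTGG"   # illustrative reference sequence
alt_seq = "GGACATGG"   # same sequence with G>A at index 4
diffs = score_differences(PWM, ref_seq, alt_seq, var_index=4)
print("max difference:", max(diffs))
print("min difference:", min(diffs))
```

The maximum and minimum of `diffs` correspond to the largest gain and the largest loss in predicted binding caused by the variant; either could then be passed to `empirical_p_value` together with differences obtained from a suitable null set of variants.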
The KEGG database analysis showed that the resulting genes are mainly involved in metabolic, cancer, and neuroactive ligand-receptor interaction pathways.