Analysis of nucleotide repeats at variation sites
OGUNBAYO, ABIDEMI (2012)
Bioinformatics
Institute of Biomedical Technology
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2012-06-25
The permanent address of this publication is
https://urn.fi/urn:nbn:fi:uta-1-22786
Abstract
Background and aims: Nucleotide repeats are sequences that occur two or more times in the genome. They are generally classified into two broad groups, tandem repeats and dispersed repeats, and are found in both prokaryotic and eukaryotic organisms. Previous studies have described the distribution of repeats in the genome, and others have focused on developing repeat analysis tools. Further studies have pointed out the biological and medical importance of nucleotide repeats, particularly microsatellite repeats. The present study aimed to identify the patterns of repeats present at selected pathogenic and neutral variation sites of the human genome. It also aimed to report the abundance of these repeat patterns and to investigate whether and where differences occur between and within the repeat patterns found in the pathogenic and neutral datasets.
Methods: Datasets of neutral and pathogenic single nucleotide polymorphisms in the human genome were downloaded from VariBench. The genomic.accession.version identifiers given in both datasets were then used to retrieve genomic sequences from the NCBI FTP site. The resulting sequences were preprocessed with Python scripts to obtain a 21 bp sequence for each sample in each dataset, with the variation site located at the center, for repeat analysis. REPuter, a repeat analysis tool, was used to analyze the sequences. The results of the repeat analysis were further processed with Python scripts. Microsoft Excel and R scripts were used for descriptive statistics to obtain the distribution of the different repeat patterns and their nucleotide counts. Inferential statistics were computed with ANOVA and Tukey's HSD test to determine whether significant differences occur between the repeat patterns in the two datasets and where these differences occur.
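The 21 bp preprocessing step described above can be sketched as follows. This is only an illustration, not the thesis's actual script: the function name `extract_window`, the 0-based `site_index`, and the choice to reject sites too close to a sequence boundary are all assumptions made for the example.

```python
def extract_window(sequence: str, site_index: int, flank: int = 10) -> str:
    """Return the (2 * flank + 1) bases centered on site_index (0-based).

    With the default flank of 10 this yields a 21 bp window whose middle
    base (position 10 of the window) is the variation site.
    Hypothetical helper: names and boundary handling are assumptions.
    """
    start = site_index - flank
    end = site_index + flank + 1
    # Refuse to return a truncated window near either end of the sequence.
    if start < 0 or end > len(sequence):
        raise ValueError("variation site too close to sequence boundary")
    return sequence[start:end]


if __name__ == "__main__":
    # Toy sequence, not real data: the variation site is the single 'G'.
    toy = "A" * 10 + "G" + "C" * 10 + "TTT"
    window = extract_window(toy, site_index=10)
    print(len(window), window[10])
```

Raising an error for incomplete windows keeps every retained sample at exactly 21 bp; whether the original scripts discarded or padded such cases is not stated in the abstract.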
Results: Forward, complemented, reversed and palindromic repeats were found in both datasets. Descriptive statistics showed a similar distribution of repeat patterns in both datasets. However, ANOVA showed that significant differences occur between the repeat patterns in the two datasets; this led to further analysis with Tukey's HSD test, which identified the exact pairs with significant differences.
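To make the four repeat patterns named above concrete, the sketch below shows, for a given repeat unit, the sequence its second copy would have under each pattern. It assumes the commonly used definitions (forward: identical copy; reversed: reversed copy; complemented: base-wise complement; palindromic: reverse complement); the abstract itself does not define them, and the function name `repeat_partners` is hypothetical. It does not reproduce REPuter's search algorithm.

```python
# Translation table for the base-wise DNA complement (A<->T, C<->G).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def repeat_partners(unit: str) -> dict:
    """Map each repeat pattern to the partner sequence of `unit`.

    Assumed definitions (not taken from the source text):
      forward      -> the same sequence
      reversed     -> the sequence read backwards
      complemented -> each base replaced by its complement
      palindromic  -> the reverse complement
    """
    unit = unit.upper()
    return {
        "forward": unit,
        "reversed": unit[::-1],
        "complemented": unit.translate(_COMPLEMENT),
        "palindromic": unit[::-1].translate(_COMPLEMENT),
    }


if __name__ == "__main__":
    for pattern, partner in repeat_partners("AAG").items():
        print(pattern, partner)
```

Under these assumptions, a 21 bp window contains a repeat of a given pattern when both the unit and its partner for that pattern occur in the window.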
Conclusions: Significant differences occur between pairs of different repeat patterns, both within and between the datasets, but not in pairs of the same repeat pattern. Based on this, further work could examine patterns of specific unit lengths of repeated nucleotide bases in close proximity to variant sites in the pathogenic dataset, to assess their correlation with disease development.
Keywords: Nucleotide repeats