Development of CAbase and an Exon Analysis Pipeline for Visual Assessment of Predicted Genes for the Carbonic Anhydrases
Isokangas, Lydia (2016)
Master's Degree Programme in Bioinformatics
BioMediTech
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2016-06-02
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:uta-201606101867
Abstract
Background and Aims
Humans can quickly recognize and evaluate visual patterns. This thesis aims to apply that ability to the analysis of conservation in carbonic anhydrase proteins. This was facilitated through the creation of two pipelines:
* One to create CAbase, a publicly available specialized database serving the CA research community, and
* One, named Exon_Analysis, to create a visual display of the aligned exons of the cDNA transcripts contained within CAbase, with indicators showing the positions of the start and stop codons and the locations of the predicted signal and mitochondrial targeting peptides.
Carbonic anhydrases (CAs) are ubiquitous proteins that catalyse the reversible conversion of carbon dioxide into carbonic acid. As a result of gene duplication events, the CAs exist in at least 16, and potentially 17, different isoforms.
Methods
The pipelines were created using freely available tools, including Python, MySQL and various bioinformatics tools such as Clustal Omega, PRANK, BLAST and PAL2NAL.
The data for CAbase were extracted from Ensembl, NCBI, UniProt, RCSB PDB, UniGene and FlyBase. Data calculated with SignalP and TargetP were also included. CAbase is hosted on Amazon Web Services and can be accessed from any computer that has Internet access and MySQL installed.
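To illustrate the kind of SQL access described above, here is a minimal sketch of an isoform-specific query. It is not the CAbase schema: the table `transcript` and its columns `transcript_id`, `species` and `isoform` are hypothetical, the rows are made-up toy values, and an in-memory SQLite database stands in for the hosted MySQL server so that the example is self-contained.

```python
import sqlite3


def demo_connection():
    """Build an in-memory stand-in database (hypothetical schema, toy rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transcript ("
        "transcript_id TEXT PRIMARY KEY, species TEXT, isoform TEXT)"
    )
    conn.executemany(
        "INSERT INTO transcript VALUES (?, ?, ?)",
        [
            ("toy_tx_1", "species_a", "CA6"),
            ("toy_tx_2", "species_b", "CA6"),
            ("toy_tx_3", "species_a", "CA9"),
        ],
    )
    return conn


def isoform_transcripts(conn, isoform):
    """Return (species, transcript_id) rows for one isoform, sorted by species."""
    query = (
        "SELECT species, transcript_id FROM transcript "
        "WHERE isoform = ? ORDER BY species"
    )
    return conn.execute(query, (isoform,)).fetchall()


if __name__ == "__main__":
    for row in isoform_transcripts(demo_connection(), "CA6"):
        print(row)
```

Against a real MySQL server the same pattern would go through a MySQL driver, and the parameter placeholder would typically be `%s` rather than SQLite's `?`.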
Exon_Analysis draws a scaled exon schematic based on a PRANK multiple sequence alignment (MSA) of the cDNA transcripts for a CA isoform. The exons and the other indicators (the start and stop codons and the signal and targeting peptides) are drawn in different colours at their scaled locations. This makes it possible to see, for each CA isoform, how well the exons within the coding regions are conserved and how the start codons, stop codons and peptides align.
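The coordinate bookkeeping behind such a scaled schematic can be sketched as follows. This is an illustrative assumption about one way to do it, not the Exon_Analysis code itself: the function names, the `-` gap character and the pixel width are all hypothetical. The idea is to map each exon, given as a length in ungapped nucleotides, onto the columns of the gapped alignment row and then scale those columns to drawing units.

```python
def column_map(aligned_seq, gap="-"):
    """Map each ungapped residue index to its column in the alignment."""
    return [col for col, ch in enumerate(aligned_seq) if ch != gap]


def exon_spans_in_alignment(aligned_seq, exon_lengths, gap="-"):
    """Convert consecutive exon lengths (ungapped nucleotides) into
    (start_col, end_col) alignment spans, with end_col exclusive."""
    cols = column_map(aligned_seq, gap)
    if any(length <= 0 for length in exon_lengths):
        raise ValueError("exon lengths must be positive")
    if sum(exon_lengths) != len(cols):
        raise ValueError("exon lengths do not match the ungapped sequence length")
    spans = []
    pos = 0
    for length in exon_lengths:
        first = cols[pos]               # column of the exon's first nucleotide
        last = cols[pos + length - 1]   # column of the exon's last nucleotide
        spans.append((first, last + 1))
        pos += length
    return spans


def scale(spans, alignment_length, width_px):
    """Scale alignment-column spans to horizontal drawing coordinates."""
    factor = width_px / alignment_length
    return [(round(start * factor), round(end * factor)) for start, end in spans]


if __name__ == "__main__":
    row = "ATG--CCGTA-A"  # toy aligned transcript, 12 columns, 9 nucleotides
    spans = exon_spans_in_alignment(row, [3, 6])
    print(spans)                        # alignment-column spans per exon
    print(scale(spans, len(row), 120))  # the same spans in drawing units
```

Because every transcript in the alignment shares the same column axis, spans computed this way for different rows can be drawn one under another and compared directly. The same mapping would apply to start and stop codon positions or peptide boundaries.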
Results
CAbase is now publicly available for anyone to use. However, it is still somewhat user-unfriendly because it requires the user to be familiar with SQL. CAbase facilitated the use of Exon_Analysis. This pipeline has enabled quality control of the predicted genes one exon at a time. Evolutionary features such as the conservation of short exons within CAs VI, IX, XII, XIV, X and XI have been shown in the exon MSA schematics. Both tools could be adapted for use with other proteins.
Conclusion
CAbase is a specialized database for CA researchers that can continue to be adapted to their needs. It allows researchers to create specific queries without having to filter for unwanted proteins. Exon_Analysis creates an exon MSA schematic to visualise the relationship between the exons and other features of interest. This facilitates the quality control of predicted genes one exon at a time.