Sequence Conservation Metrics: Implementation and Comparative Analysis
OLATUBOSUN, AYODEJI (2009)
Bioinformatics
Faculty of Medicine
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2009-09-02
The permanent address of the publication is
https://urn.fi/urn:nbn:fi:uta-1-20018
Abstract
Background and Aims: Multiple sequence alignments (MSAs) help identify regions of similarity and dissimilarity among their constituent sequences, which is essential for understanding the structural, functional and evolutionary relationships of DNA or proteins. Authors from different fields have developed new quantitative methods, or adapted metrics from other fields, for scoring conservation at individual positions in MSAs, and have implemented them in different programming languages. This work implements many of these methods on a single platform familiar to most bioinformaticians, allowing direct comparison of the methods and making them more accessible to the research community.
Methods: Seven conservation methods were implemented in object-oriented Perl: Entropy, Variance, Sum of Pairs, Relative Entropy, Jensen-Shannon Divergence, and the deviation from maximum frequency method and its log form. Two weighting schemes and two background frequency schemes were also implemented, leading to 27 variants of the methods. These were used to compute conservation scores for 502 alignments. Correlation analysis and Receiver Operating Characteristic (ROC) analysis were conducted on the conservation data.
Results: The correlation results agree closely with those of a previous study. The independent counts scheme performed best among the weighting options. Relative Entropy, Variance and Jensen-Shannon Divergence were the three best methods, as measured by their performance on the catalytic site dataset.
Conclusion: The conservation metrics were successfully implemented and a comparative evaluation was carried out. However, a correct specification of the true positive and true negative classes is needed before generalised statements can be made about the comparative performance of the methods tested.
Keywords: Sequence Conservation, Multiple Sequence Alignment, Comparative Analysis, Positional Conservation Measures
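To illustrate what a positional conservation score looks like in practice, the following is a minimal sketch of two of the methods named in the abstract, Entropy and Jensen-Shannon Divergence, applied to a single alignment column. It is written in Python rather than the object-oriented Perl used in the thesis, and it is not the author's implementation. The function names, the choice to ignore gap characters, the use of base-2 logarithms, and the absence of any sequence weighting are all assumptions made for this example.

```python
import math
from collections import Counter

GAP = "-"  # assumed gap symbol; gaps are simply ignored in this sketch


def column_frequencies(column):
    """Unweighted relative frequency of each residue in one alignment column."""
    residues = [c for c in column if c != GAP]
    total = len(residues)
    if total == 0:
        return {}
    return {aa: n / total for aa, n in Counter(residues).items()}


def shannon_entropy(column):
    """Shannon entropy in bits: 0 for a fully conserved column, higher when variable."""
    freqs = column_frequencies(column)
    return sum(-p * math.log2(p) for p in freqs.values())


def kl_divergence(p, q, symbols):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p(s) = 0 contribute 0."""
    return sum(
        p.get(s, 0.0) * math.log2(p.get(s, 0.0) / q[s])
        for s in symbols
        if p.get(s, 0.0) > 0.0
    )


def jensen_shannon_divergence(column, background):
    """Jensen-Shannon divergence between column frequencies and a background distribution."""
    p = column_frequencies(column)
    symbols = set(p) | set(background)
    # Mixture distribution m = (p + q) / 2, which is nonzero wherever p or q is nonzero.
    m = {s: 0.5 * p.get(s, 0.0) + 0.5 * background.get(s, 0.0) for s in symbols}
    return 0.5 * kl_divergence(p, m, symbols) + 0.5 * kl_divergence(background, m, symbols)


if __name__ == "__main__":
    # Toy two-letter background, invented for illustration only.
    toy_background = {"A": 0.5, "C": 0.5}
    for col in ("AAAA", "AACC", "A-CC"):
        print(col, shannon_entropy(col), jensen_shannon_divergence(col, toy_background))
```

Note that the two scores point in opposite directions: low entropy indicates a conserved column, whereas a high Jensen-Shannon Divergence indicates a column whose composition departs strongly from the background. Any comparison across methods therefore has to put the scores on a common orientation first.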