Enabling and Performance Benchmarking of a Next-generation Sequencing Data Analysis Pipeline
Luo, Shuang (2019)
Luo, Shuang
2019
Bioengineering
Lääketieteen ja terveysteknologian tiedekunta - Faculty of Medicine and Health Technology
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Hyväksymispäivämäärä
2019-05-31
Julkaisun pysyvä osoite on
https://urn.fi/URN:NBN:fi:tty-201905311836
https://urn.fi/URN:NBN:fi:tty-201905311836
Tiivistelmä
The development of Next Generation Sequencing (NGS) technology resulted the rapid accumulation of a large amount of sequencing data demanding data mining. Various of variant calling softwares and pipelines came into being. Genome Analysis Toolkit (GATK) and its Best Practices quickly became the industrial gold-standard for variant calling because of its speediness, high accuracy and throughput. GATK has been updated all the time. The latest and strongest version is GATK4 which enabled parallelization and cloud infrastructure optimization via Apache spark. Currently, Broad Institute has cooperated with many cloud providers to deploy GATK Best Practices on cloud platform. However, there is no benchmarking data released for GATK4 and no cooperation with CSC (CSC – IT Center of Science Ltd) cPouta IaaS (Infrastructure as a Service) cloud.
We optimized WDL (workflow description language) script of germline SNPs and Indels short variants discovery workflow from Best Practices and ran it by Cromwell execution engine on a virtual machine of cPouta cloud which featured a 24 cores Intel(R) Xeon(R) CPU E5-2680 v3 with hyper-threading. In addition, we benchmarked pipeline execution time(s) for five seperated pipelines of this workflow with three 30X WGS (Whole Genome Sequencing) datasets: NA12878, NA12891 and NA12892 and explored optimized runtime parameters for GATK4 tools, PairHMM thread scalability in HaplotypeCaller, GATK4 thread scalability for PGC in MarkDuplicates and execution times comparison for GATK4 SortSam vs SortSamSpark and MarkDuplicates vs MarkDuplicatesSpark.
We found the real execution time for similar WGS datasets with different size and features showed consistency and execution time and dataset size were roughly positive correlated. The optimal threads number is 12 for GATK4 HaplotypeCaller in ERC mode, giving rise to 12.4% speed-up. The optimal PGC threads number is 2 for GATK4 MarkDuplicates. And, multi-threading with Spark local runner highly speeded up GATK4 tool execution. SortSamSpark enabled 16 local cores gave rise to a speed-up of 83.6%. MarkDuplicatesSpark enabled 16 local cores gave rise to a speed-up of 22.2% and 37.3%, seperately with and without writing metrics file.
With detailed virtural machine setting up, optimized parameters and GATK4 performance benchmarking data, this thesis is a guide for implementation of GATK4 Best Practices germline SNPs and Indels short variants discovery workflow on CSC cPouta cloud platform.
We optimized WDL (workflow description language) script of germline SNPs and Indels short variants discovery workflow from Best Practices and ran it by Cromwell execution engine on a virtual machine of cPouta cloud which featured a 24 cores Intel(R) Xeon(R) CPU E5-2680 v3 with hyper-threading. In addition, we benchmarked pipeline execution time(s) for five seperated pipelines of this workflow with three 30X WGS (Whole Genome Sequencing) datasets: NA12878, NA12891 and NA12892 and explored optimized runtime parameters for GATK4 tools, PairHMM thread scalability in HaplotypeCaller, GATK4 thread scalability for PGC in MarkDuplicates and execution times comparison for GATK4 SortSam vs SortSamSpark and MarkDuplicates vs MarkDuplicatesSpark.
We found the real execution time for similar WGS datasets with different size and features showed consistency and execution time and dataset size were roughly positive correlated. The optimal threads number is 12 for GATK4 HaplotypeCaller in ERC mode, giving rise to 12.4% speed-up. The optimal PGC threads number is 2 for GATK4 MarkDuplicates. And, multi-threading with Spark local runner highly speeded up GATK4 tool execution. SortSamSpark enabled 16 local cores gave rise to a speed-up of 83.6%. MarkDuplicatesSpark enabled 16 local cores gave rise to a speed-up of 22.2% and 37.3%, seperately with and without writing metrics file.
With detailed virtural machine setting up, optimized parameters and GATK4 performance benchmarking data, this thesis is a guide for implementation of GATK4 Best Practices germline SNPs and Indels short variants discovery workflow on CSC cPouta cloud platform.