Launching genomics into the cloud: deployment of Mercury, a next generation sequence analysis pipeline

BMC Bioinformatics

Table 1 Mercury computational resource requirements

	Data generation	Alignment		BAM finishing						Variants		Annotation
	BCL to FastQ	BWA align	BWA sample	Mates, Duplicates, Stats	Capture & Coverage Metrics	GATK indel targets	GATK indel realign	GATK recal	BAM validation	Atlas SNP	Atlas Indel	Cassandra
Nodes	1	1	0.333	0.5	0.125	1	0.333	1	0.125	0.167	0.167	0.167
RAM (GB)	48	48	15	28	3	48	14	32	4	7	7	8
Hours	3.62	1.84	1.38	3.39	1.30	0.28	2.25	3.04	0.75	9.00	7.51	1.71
Node*hrs	3.62	1.84	0.46	1.70	0.16	0.28	0.75	3.04	0.09	1.50	1.25	0.29

All estimates are approximate and apply to whole-exome and light-skim whole-genome samples (~10-20 Gbp of data) sequenced on an Illumina HiSeq and processed with the most recent versions of RTA and CASAVA. Nodes are 8-core machines with 48 GB of RAM, ~3 GHz Intel CPUs, and ~1 TB of local scratch disk. The steps cover the whole pipeline: building reads and qualities (FastQ) from raw data (BCL files); alignment and BAM generation with the BWA aligner; BAM finishing, which comprises GATK post-processing, duplicate marking, capture and coverage metric calculation, and BAM file validation; variant calling with the Atlas2 suite; and annotation with our annotator, Cassandra.
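
The Node*hrs row appears to be the product of the Nodes and Hours rows, so the per-sample cost can be checked with simple arithmetic. The following is a minimal sketch under that assumption; it also assumes that the Nodes row is the fraction of one node a step occupies. The names `STEPS`, `node_hours`, and `total_node_hours` are illustrative and are not part of Mercury.

```python
# Values transcribed from Table 1: (step, node fraction, wall-clock hours).
# Assumption: "Nodes" is the fraction of one 8-core node used by the step.
STEPS = [
    ("BCL to FastQ", 1, 3.62),
    ("BWA align", 1, 1.84),
    ("BWA sample", 0.333, 1.38),
    ("Mates, Dupe, Stats", 0.5, 3.39),
    ("Cap & Cvrg Metrics", 0.125, 1.30),
    ("GATK indel targets", 1, 0.28),
    ("GATK indel realign", 0.333, 2.25),
    ("GATK recal", 1, 3.04),
    ("BAM valid", 0.125, 0.75),
    ("Atlas SNP", 0.167, 9.00),
    ("Atlas Indel", 0.167, 7.51),
    ("Cassandra", 0.167, 1.71),
]


def node_hours(steps):
    """Return a dict mapping each step to node fraction times hours."""
    return {name: nodes * hours for name, nodes, hours in steps}


def total_node_hours(steps):
    """Sum node-hours over all steps for one sample."""
    return sum(nodes * hours for _, nodes, hours in steps)


if __name__ == "__main__":
    for name, value in node_hours(STEPS).items():
        print(f"{name:20s} {value:5.2f}")
    print(f"{'Total':20s} {total_node_hours(STEPS):5.2f}")
```

Under these assumptions the per-step products agree with the published Node*hrs row to within rounding, and the total comes to roughly 15 node-hours per sample.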

ISSN: 1471-2105