BiSpark: a Spark-based highly scalable aligner for bisulfite sequencing data

BMC Bioinformatics

Table 1 Experimental data for performance evaluation

Data set	Tailored data size	# of reads	Description
Simulation data	122MB	1,000,000	Simulation set with 0% error
	122MB	1,000,000	Simulation set with 1% error
	122MB	1,000,000	Simulation set with 2% error
GEO WGBS data (GSE80911)	1.6GB	10,000,000	10 million reads real data set
	7.9GB	50,000,000	50 million reads real data set
	16GB	100,000,000	100 million reads real data set
	32GB	200,000,000	200 million reads real data set
Reference genome	Build 37, hg19

Simulation data sets were generated with Sherman [26] at error rates of 0%, 1% and 2%, where the error rate is the mean error rate per bp and the error curve follows an exponential decay model. Each real test data set was tailored from the original WGBS data according to the number of reads.
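As an illustration of tailoring a data set by number of reads, the following is a minimal sketch that copies the first N records from a FASTQ stream. It assumes uncompressed four-line FASTQ records and assumes the subset is simply the leading N reads; the source does not state how the reads were selected, and the function name `take_reads` is hypothetical.

```python
import io
from itertools import islice


def take_reads(src, dst, n_reads):
    """Copy the first n_reads FASTQ records (4 lines each) from src to dst.

    Assumption: records are plain four-line FASTQ entries. Returns the
    number of complete records actually copied.
    """
    copied = 0
    while copied < n_reads:
        record = list(islice(src, 4))  # header, sequence, '+', qualities
        if len(record) < 4:            # end of input or truncated record
            break
        dst.writelines(record)
        copied += 1
    return copied


if __name__ == "__main__":
    # Tiny in-memory example with three made-up reads.
    demo = io.StringIO(
        "@r1\nACGT\n+\nIIII\n"
        "@r2\nTTGA\n+\nIIII\n"
        "@r3\nCCGT\n+\nIIII\n"
    )
    out = io.StringIO()
    print(take_reads(demo, out, 2))
```

For real files, the same function can be called with handles returned by `open(path)` for the input and `open(path, "w")` for the output, with `n_reads` set to the desired read count (for example 10,000,000 for the smallest real data set in Table 1).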

ISSN: 1471-2105