K-mer clustering algorithm using a MapReduce framework: application to the parallelization of the Inchworm module of Trinity

Table 1 RNA-Seq datasets and the computing resources used for each dataset

Organism	# of reads	# of unique k-mers	Computing resource (MR-Inchworm)	Computing resource (original Inchworm)	Data source
mouse	105,290,476	746,811,557	iDataplex-nextscale	iDataplex-nextscale:single node (64GB mem)	[22]
sugarbeet	129,832,549	2,213,519,875	iDataplex-nextscale	iDataplex:single node (256GB mem)	unpublished data
wheat	1,468,701,119	5,775,799,648	iDataplex-nextscale	iDataplex:single vSMP node (4 TB mem) created by ScaleMP	unpublished data

All datasets are paired-end; only the mouse dataset is strand-specific. The iDataplex-nextscale cluster, known as BlueWonder-NextScale, consists of 360 nodes, each with 2 × 12-core Intel Xeon processors (E5-2697v2, 2.7 GHz) and 64 GB RAM, for 8640 cores in total. The iDataplex cluster, known as “BlueWonder”, consists of 512 nodes, each with 2 × 8-core Intel SandyBridge processors (2.6 GHz), for 8192 cores in total. Original Inchworm was run on the sugarbeet dataset using a single iDataplex node with 256 GB memory, and on the wheat dataset using a single vSMP node with 4 TB memory created by ScaleMP software (http://www.scalemp.com) on iDataplex. ScaleMP creates a virtual symmetric multiprocessing (vSMP) node for shared memory by aggregating multiple compute nodes.
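The core totals quoted in the note follow directly from the node and processor counts. As a minimal sketch, the arithmetic can be checked as follows; the dictionary layout and the function names `total_cores` and `total_ram_gb` are illustrative assumptions and not part of the original work, while the numbers are copied from the note. The aggregate RAM is computed only for the iDataplex-nextscale cluster, because the note gives a per-node memory size only for that cluster.

```python
# Figures copied from the Table 1 note. The data layout and function
# names are hypothetical, chosen only for this illustration.
CLUSTERS = {
    "iDataplex-nextscale": {
        "nodes": 360,
        "sockets": 2,
        "cores_per_socket": 12,
        "ram_gb_per_node": 64,
    },
    "iDataplex": {
        "nodes": 512,
        "sockets": 2,
        "cores_per_socket": 8,
    },
}


def total_cores(name):
    """Return nodes x sockets per node x cores per socket."""
    c = CLUSTERS[name]
    return c["nodes"] * c["sockets"] * c["cores_per_socket"]


def total_ram_gb(name):
    """Return aggregate RAM in GB across all nodes of a cluster."""
    c = CLUSTERS[name]
    return c["nodes"] * c["ram_gb_per_node"]


if __name__ == "__main__":
    for name in CLUSTERS:
        print(name, "cores:", total_cores(name))
    print("iDataplex-nextscale RAM (GB):", total_ram_gb("iDataplex-nextscale"))
```

Under these figures, the distributed cluster offers about 23 TB of aggregate memory in 64 GB slices, whereas the original Inchworm runs on sugarbeet and wheat required 256 GB and 4 TB of memory, respectively, on a single (physical or virtual) node.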

ISSN: 1471-2105