Removing duplicate reads using graphics processing units

BMC Bioinformatics

Table 4 Performance comparison on the SRR921897 library among GPU-DupRemoval, Fastx Toolkit Collapser, CD-HIT-DUP, and Fulcrum

Tool	Prefix length (bp)	Mismatches	Removed reads	Time	Peak memory
GPU-DupRemoval ¹	100	0	7.4 %	4 m	13.1 GB
	25	1	9.2 %	18 m	16.6 GB
		3	12.2 %	17 m	16.6 GB
	35	1	8.9 %	12 m	17.2 GB
		3	11.5 %	11 m	17.2 GB
	45	1	8.7 %	7 m	16.5 GB
		3	10.8 %	7 m	16.5 GB
	55	1	8.4 %	6 m	17.5 GB
		3	10.0 %	5 m	17.5 GB
GPU-DupRemoval ²	25	0	7.4 %	22 m	16.6 GB
		1	9.0 %	18 m	16.6 GB
		3	12.0 %	15 m	16.6 GB
CD-HIT-DUP	N/A	0	7.4 %	17 m	33.3 GB
		1	8.0 %	15 m	49.2 GB
		3	9.8 %	28 m	53.5 GB
Fulcrum	100	0	7.3 %	53 m	1.8 GB
	25	1	9.8 %	47 m	1.8 GB
		3	13.1 %	57 m	1.8 GB
	35	1	9.6 %	36 m	2.3 GB
		3	12.3 %	37 m	1.6 GB
	45	1	9.4 %	28 m	1.9 GB
		3	11.7 %	29 m	1.7 GB
	55	1	9.2 %	25 m	2.1 GB
		3	10.9 %	26 m	2.3 GB
Fastx Toolkit Collapser	N/A	0	7.4 %	12 m	10.2 GB

For GPU-DupRemoval, the table reports results for both the current implementation (GPU-DupRemoval ¹) and the first implementation (GPU-DupRemoval ²) of the algorithm. The library consists of 49,999,923 single-end reads of 100 bp generated with the Illumina platform. The first column reports the name of the tool. The second column reports the prefix length used by GPU-DupRemoval and Fulcrum to cluster the reads. The third column reports the maximum number of mismatches allowed. The fourth column reports the percentage of reads removed. The fifth and sixth columns report the computing time and the peak memory required to perform the experiment. Tool settings: i) GPU-DupRemoval ¹: -g 0 -D 0 (for identical duplicates) and -g 0 -p <prefix_length> -D <nb_mismatches> (for nearly-identical duplicates); ii) GPU-DupRemoval ²: -g 0 -p 25 -D <nb_mismatches>; iii) CD-HIT-DUP: -u 0 -c <nb_of_mismatches>; iv) Fulcrum: -b <prefix_length> -s -t s (for clustering) and -q 0 -n 12 -s -t s -c <nb_mismatches>. <prefix_length> was set to 100 for identical duplicates and to 25/35/45/55 for nearly-identical duplicates. No parameter is required for Fastx Toolkit Collapser.
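To make the roles of the prefix length and the mismatch constraint concrete, the following is a minimal CPU-only sketch in Python. It is not the GPU-DupRemoval, Fulcrum, or CD-HIT-DUP implementation. It assumes that reads sharing the same prefix are placed in one cluster and that, within a cluster, a read counts as a duplicate when it differs from an already kept read in at most the allowed number of positions. The function names `hamming` and `remove_duplicates`, the toy reads, and the choice to keep the first read seen as the representative are all illustrative assumptions.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def remove_duplicates(reads, prefix_length, max_mismatches):
    """Return the reads kept after prefix clustering and mismatch filtering.

    Assumption: the first read seen in a cluster is kept as representative;
    the real tools may apply a different rule (for example, read quality).
    """
    # Step 1: cluster reads by their first `prefix_length` bases.
    clusters = defaultdict(list)
    for read in reads:
        clusters[read[:prefix_length]].append(read)

    # Step 2: inside each cluster, drop reads within the mismatch budget
    # of a read that has already been kept.
    kept = []
    for members in clusters.values():
        representatives = []
        for read in members:
            is_duplicate = any(
                len(read) == len(rep) and hamming(read, rep) <= max_mismatches
                for rep in representatives
            )
            if not is_duplicate:
                representatives.append(read)
        kept.extend(representatives)
    return kept


# Toy library of 8 bp reads (hypothetical data, not from SRR921897).
reads = ["ACGTACGT", "ACGTACGA", "ACGTTTTT", "TTTTACGT", "ACGTACGT"]

# Nearly-identical duplicates: short prefix, one mismatch allowed.
kept_near = remove_duplicates(reads, prefix_length=4, max_mismatches=1)
removed_near_pct = 100 * (len(reads) - len(kept_near)) / len(reads)

# Identical duplicates: prefix spans the whole read, no mismatches allowed.
kept_exact = remove_duplicates(reads, prefix_length=8, max_mismatches=0)
removed_exact_pct = 100 * (len(reads) - len(kept_exact)) / len(reads)

print(kept_near, removed_near_pct)
print(kept_exact, removed_exact_pct)
```

In this toy setting, setting the prefix to the full read length with zero mismatches mirrors the identical-duplicate configuration in the table (prefix length 100, 0 mismatches), whereas a shorter prefix with a non-zero mismatch budget mirrors the nearly-identical configurations. A shorter prefix produces larger clusters, so more read pairs are compared within each cluster.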

ISSN: 1471-2105