Artificial and natural duplicates in pyrosequencing reads of metagenomic data

BMC Bioinformatics

Table 4 Metagenomic datasets used in this study

					% of natural duplicates under hypothetical sample types
					High-complexity ^b			Moderate-complexity ^c
Project/Sample ^a	Environment	Platform	Number of reads	% of total duplicates	3 Mb	100 kb	10 kb	3 Mb	100 kb	10 kb
16339/SRR000905	Marine	GS_20	208633	5.74	0.01	0.52	4.98	0.10	3.22	24.88
28969/SRR000674	Coastal water	GS_FLX	201671	17.65	0.02	0.51	4.87	0.10	3.13	24.27
29421/SRR001308	Waste water	GS_FLX	378601	12.39	0.03	0.93	8.94	0.20	5.65	37.09
30445/SRR001663	Marine	GS_FLX	369811	15.39	0.03	0.93	8.68	0.19	5.49	36.53
30563/SRR001669	Human gut	GS_20	41649	7.26	0.00	0.11	1.00	0.03	0.65	6.16
33243/SRR006907	Freshwater	GS_FLX	255722	20.57	0.02	0.61	6.07	0.13	3.88	28.71
38721/SRR023845	Phyllosphere	GS_FLX	543285	11.17	0.05	1.33	12.41	0.29	7.93	45.07
Western channel/Apr_Day_gDNA	Saline water	Titanium	421004	23.38	0.04	1.04	9.80	0.20	6.23	39.42
Ocean viruses/Arctic_Shotgun	Ocean viruses	GS_20	688590	7.14	0.05	1.67	15.46	0.36	9.86	50.15
North Atlantic/BATS-174-2	Ocean gyre	GS_20	288735	17.56	0.02	0.73	6.92	0.16	4.43	31.24

^aDatasets are either from the NCBI Short Read Archive (identified by project ID and run accession number; http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) or from CAMERA (identified by project and sample name; http://camera.calit2.net).
^bHigh-complexity and ^cmoderate-complexity microbial (or viral) environments, each considered with average genome lengths of 3 Mb, 100 kb, and 10 kb.
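
The percentages in the table can be turned back into approximate read counts by multiplying by the number of reads. The following is a minimal sketch of that arithmetic, using two rows copied from the table. The function name `duplicate_read_count` and the rounding to whole reads are assumptions made for illustration, not part of the original study.

```python
# Two rows copied from Table 4:
# (project/sample, number of reads, % of total duplicates).
rows = [
    ("16339/SRR000905", 208633, 5.74),
    ("30563/SRR001669", 41649, 7.26),
]


def duplicate_read_count(num_reads: int, percent: float) -> int:
    """Convert a percentage from the table into an approximate read count.

    Hypothetical helper: rounds to the nearest whole read, so the result
    is only as precise as the two-decimal percentages in the table.
    """
    return round(num_reads * percent / 100)


for sample, num_reads, pct_total in rows:
    print(sample, duplicate_read_count(num_reads, pct_total))
```

The same helper applies to any of the natural-duplicate columns, since they are also expressed as percentages of the reads in each dataset.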

ISSN: 1471-2105