Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework

BMC Bioinformatics

Table 1 Comparison of search times for standard X!Tandem and Hydra

Mode	Scans	Nodes (Cores)	DB Name	Proteins (K)	Peptides (M)	Dot product (M)	Time (min)
Hadoop	16000	43 (344)	ecoli	5.4	1.3	164	9.8
Hadoop	256000	43 (344)	ecoli	5.4	1.3	23395	338
Tandem	4663	1 (4)	human	222	168	477	29
Hadoop	4663	43 (344)	human	222	168	477	4.7
Tandem	184880	1 (4)	nr	4370	692	3291	2280
Hadoop	184880	43 (344)	nr	4370	692	3291	15.4
Tandem	184880	1 (4)	nr	16392	1248	13167	8410
Hadoop	184880	43 (344)	nr	16392	1248	13167	52.7

Comparison of run times for searches of different complexity using the standard X!Tandem implementation (Tandem) and Hydra (Hadoop). The Scans column gives the number of spectra searched; the Nodes (Cores) column gives the resources used (the first number is the number of machines, the second is the total number of cores); the DB Name column gives the species database used; the Proteins column gives the number of proteins in the database; and the Dot product column gives the number of dot-product calculations actually performed. The times show that Hydra, unlike X!Tandem, is able to scale nearly linearly with the size of the problem. However, because of its startup costs, Hydra is not suited to small searches. The PRIDE accession numbers for the spectra used were 10295 and 7962.
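The paired Tandem and Hadoop rows of Table 1 can be turned into speedup figures with a few lines of Python. The sketch below is illustrative only and is not part of Hydra or X!Tandem: the names `PAIRED_RUNS`, `speedup` and `summarize` are hypothetical, the times and core counts are copied from the table, and dividing the speedup by the ratio of core counts (344 to 4) is just one assumed way of normalizing it.

```python
# Paired rows copied from Table 1:
# (database, proteins in thousands, Tandem time in min, Hydra time in min)
PAIRED_RUNS = [
    ("human", 222, 29.0, 4.7),
    ("nr", 4370, 2280.0, 15.4),
    ("nr", 16392, 8410.0, 52.7),
]

TANDEM_CORES = 4    # 1 node, 4 cores (from the table)
HYDRA_CORES = 344   # 43 nodes, 344 cores (from the table)


def speedup(tandem_min, hydra_min):
    """Wall-clock speedup of Hydra over X!Tandem for the same search."""
    return tandem_min / hydra_min


def summarize(runs):
    """Return (database, proteins_k, speedup, speedup per unit of core ratio)."""
    core_ratio = HYDRA_CORES / TANDEM_CORES
    summary = []
    for db_name, proteins_k, tandem_min, hydra_min in runs:
        s = speedup(tandem_min, hydra_min)
        # Assumed normalization: speedup divided by the increase in cores.
        summary.append((db_name, proteins_k, s, s / core_ratio))
    return summary


if __name__ == "__main__":
    for db_name, proteins_k, s, normalized in summarize(PAIRED_RUNS):
        print(f"{db_name} ({proteins_k} K proteins): "
              f"speedup {s:.1f}x, per core ratio {normalized:.2f}")
```

Run on these rows, the human search shows a much smaller speedup than either nr search, which is consistent with the caption's remark that startup costs make Hydra a poor fit for small searches.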

ISSN: 1471-2105