The identification of peptide sequences from spectra is a computationally expensive and rate-limiting task. This expense arises because the identification process typically matches each measured spectrum against theoretical (computationally generated) spectra for every possible peptide ion within a range of mass-to-charge ratios. The problem is compounded because the number of candidate peptides is enormous, and the search space grows geometrically with the number of amino acid modifications considered. This paper introduces a highly scalable strategy for overcoming these limitations, purpose-built to take full advantage of a highly distributed computation framework.
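To make this growth concrete, the following minimal sketch (illustrative only; the residue counts and modification states are hypothetical and not drawn from Hydra) shows how the number of candidate forms of a single peptide multiplies with each variable modification considered:

```java
// Simplified illustration: the number of candidate forms of one peptide grows
// multiplicatively with the variable modifications allowed on its residues.
public class ModificationGrowth {
    // Each entry gives the number of states (unmodified + modified variants) for
    // one residue that can carry a variable modification; values are hypothetical.
    static long candidateForms(int[] statesPerModifiableResidue) {
        long forms = 1;
        for (int states : statesPerModifiableResidue) {
            forms *= states;           // multiplicative, hence geometric growth
        }
        return forms;
    }

    public static void main(String[] args) {
        // A peptide with 8 residues that may each be modified or not (2 states)
        // already yields 2^8 = 256 candidate forms to score against a spectrum.
        int[] twoStateResidues = new int[8];
        java.util.Arrays.fill(twoStateResidues, 2);
        System.out.println(candidateForms(twoStateResidues)); // prints 256
    }
}
```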
The shotgun proteomics workflow has become the most widely used technique for identifying and quantifying the proteins present in a biological sample in a high-throughput manner. Although many variations exist, the basic approach begins with extracting proteins from a sample and digesting them into peptides with a proteolytic enzyme such as trypsin. The peptides are then separated by liquid chromatography and analyzed by mass spectrometry (MS). The mass-to-charge (m/z) ratios of the peptide precursor ions are measured, and the precursors are fragmented into a series of ions that are measured in tandem mass spectrometry (MS/MS) mode. The resulting MS/MS spectra of these peptide ions, together with the precursor m/z, are searched against a database of possible peptides to determine the best match.
Many strategies have been proposed to accelerate such searches. Raw, embarrassingly parallel approaches are the most widely used (e.g. for X!Tandem). To increase the speed and scalability of searching, advances in high-performance computing hardware have also been used to offload the searching from general-purpose to dedicated hardware (including GPUs and FPGAs), and improvements to the search algorithms themselves have been implemented [7, 8]. Some algorithms have been ported to parallel computing architectures such as MPI. Previous parallel implementations generally scale with the number of spectra; however, they do not address the scalability issues relating to the size of the database being searched.
In this paper we introduce Hydra, which is designed to scale both with the number of spectra and with the size of the search database. Hydra makes use of Hadoop, a common and well-supported framework for distributed computation. The Hadoop framework manages a cluster of generic machines: it takes a pair of task definitions, map and reduce (see Methods for details), and creates the task instances required to process a given data set. Tasks are dispatched by the framework to specific machines, and their output is collected by the framework and passed to the next task. The framework handles details such as failure and retry, provision of a distributed file system, and storage of and access to intermediate results.
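As a minimal sketch of this contract (illustrative only; these are not Hydra's actual classes, and the class names and binning logic are hypothetical), a mapper emits key/value pairs, the framework groups the values by key, and a reducer processes each group. Here, input records (say, spectra) are keyed by a coarse bin so that all records sharing a bin reach the same reducer:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (binKey, record) so the framework routes records sharing a bin
// to the same reducer instance.
public class MassBinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String binKey = computeBin(record.toString());
        context.write(new Text(binKey), record);
    }

    // Stand-in for binning by precursor mass; real logic would parse the record.
    private String computeBin(String record) {
        return Integer.toString(Math.floorMod(record.hashCode(), 100));
    }
}

// Reducer: receives all records for one bin and emits a per-bin result
// (here simply a count of the records in the bin).
class MassBinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text binKey, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (Text ignored : records) {
            count++;
        }
        context.write(binKey, new Text(Integer.toString(count)));
    }
}
```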
Using Hadoop for mass spectrometry based searching has been suggested before, where an existing MPI search engine (X!!Tandem) was wrapped to work with Hadoop. However, such MPI-based code does not use critical features of the MapReduce model (e.g. the ability to sort data in ways that guarantee that searches are performed optimally). The algorithm introduced in this paper is a complete bottom-up rewrite of the X!Tandem code, designed to take full advantage of MapReduce. Within Hydra, we implement the K-score, one of the pluggable scoring algorithms packaged with the Trans-Proteomic Pipeline [15, 16] distribution of X!Tandem. The K-score is an implementation of the COMET scoring function, which produces a dot product of matched peak intensities after spectrum normalization. The K-score normalization and scoring routines account for noise and unmatched peaks within the input spectra.
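A highly simplified sketch of such a dot-product score is shown below (illustrative only; this is not the actual K-score implementation and omits its noise handling and normalization details). Peaks are binned by m/z, intensities are normalized to unit length, and the score is the sum of intensity products over bins shared by the measured and theoretical spectra:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified dot-product style spectral score (not the real K-score).
public class DotProductScore {
    // Bin peaks to integer m/z bins and normalize intensities to unit length.
    static Map<Integer, Double> binAndNormalize(double[] mz, double[] intensity) {
        Map<Integer, Double> bins = new HashMap<>();
        for (int i = 0; i < mz.length; i++) {
            bins.merge((int) Math.round(mz[i]), intensity[i], Double::sum);
        }
        double norm = Math.sqrt(bins.values().stream().mapToDouble(v -> v * v).sum());
        if (norm > 0) {
            bins.replaceAll((k, v) -> v / norm);
        }
        return bins;
    }

    // Dot product over bins present in both the measured and theoretical spectrum;
    // unmatched bins contribute nothing to the score.
    static double score(Map<Integer, Double> measured, Map<Integer, Double> theoretical) {
        double s = 0.0;
        for (Map.Entry<Integer, Double> e : measured.entrySet()) {
            Double t = theoretical.get(e.getKey());
            if (t != null) {
                s += e.getValue() * t;
            }
        }
        return s;
    }
}
```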
The advantages of the approach introduced in this paper are threefold. Scalability: Hadoop has been shown to handle massive data sets in the petabyte range, and has been used for complex search and analysis operations on data of this size; as shown in this paper, such a framework is therefore well suited to the growing data sets that are increasingly being searched in MS/MS analyses. Flexibility: the system can run on a variety of commodity hardware configurations, including heterogeneous machines, where performance scales with additional hardware, and within cloud environments that support Hadoop (e.g. Amazon EC2). Reliability: Hadoop provides the libraries needed to handle the critical issues of task distribution, monitoring and failure recovery; as the number of tasks, nodes and data sets increases, so does the probability of task failures, an important concern in high-throughput experiments.
In this paper we present a design for decomposing mass spectrometry search into a collection of map/reduce tasks, and describe an implementation of our proteomic search engine Hydra. Hydra is designed from the ground up to work within the MapReduce model. Through comparison with X!Tandem we show where and how the system scales, and discuss the advantages of our approach. The software is made publicly available under an open source license.