Volume 9 Supplement 4
A Semantic Web for Bioinformatics: Goals, Tools, Systems, Applications
PARPST: a PARallel algorithm to find peptide sequence tags
 Sara Brunetti^{1},
 Elena Lodi^{1},
 Elisa Mori^{1} and
 Maria Stella^{1, 2}
DOI: 10.1186/1471-2105-9-S4-S11
© Brunetti et al.; licensee BioMed Central Ltd. 2008
Published: 25 April 2008
Abstract
Background
Protein identification is one of the most challenging problems in proteomics. Tandem mass spectrometry provides an important tool to handle the protein identification problem.
Results
We developed a work-efficient parallel algorithm for the peptide sequence tag problem. The algorithm runs on the concurrent-read, exclusive-write PRAM in O(n) time using log n processors, where n is the number of mass peaks in the spectrum. The algorithm finds either all the sequence tags with a score greater than a given parameter or all the sequence tags of maximum length. Our tests on 1507 spectra in the Open Proteomics Database showed that the algorithm is efficient and effective, since it achieves results comparable to those of other methods.
Conclusions
The proposed algorithm can be used to speed up database searching or to identify post-translational modifications by comparing the homology of the sequence tags found with the sequences in the biological database.
Background
Methods
Let an experimental spectrum of an unknown peptide P of mass m_{ P } be given. A peptide sequence tag is a short string of amino acid mass differences deduced from the fragment spectrum. Let a scoring function and a number δ be given. Our task is to determine all the sequence tags scoring at least δ. If the score simply counts the length of the string, the output consists of the sequence tags of length at least δ. We refer to this problem as the peptide sequence tag problem.
Although a spectrum may contain several different types of ions, for simplicity we consider b-ions and y-ions only. We therefore assume that M = {m_{1}, m_{2},…,m_{n}} represents a spectrum, where the real numbers m_{i} correspond to the m/z ratios of the peaks in the spectrum, augmented with the numbers 1, 19, m_{ P }−17, and m_{ P }+1, which represent the “empty” b-ion, the “empty” y-ion, the heaviest b-ion, and the heaviest y-ion, respectively. Let A denote the set of the masses of the twenty amino acids. The peptide sequence tag problem can be reformulated in terms of paths in a graph. We build a labelled directed acyclic graph G = (V,E) such that
■ every node ν_{ i } is associated with a mass m_{ i } ∈ M (1 ≤ i ≤ n);
■ (ν_{ i }, ν_{ j }) ∈ E if and only if (m_{ j } − m_{ i }) ∈ A (1 ≤ i < j ≤ n).
The peptide sequence tag problem consists in determining every path between two nodes of the graph G with score greater than δ. Introducing a scoring function corresponds to assigning weights to the edges of the graph: the score of a path is the sum of the scores of the edges on the path.
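The graph formulation above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names, the matching tolerance `tol`, and the four-entry subset of residue masses are assumptions made for the example (the set A in the text has twenty entries, and real peaks never match a mass difference exactly).

```python
from bisect import bisect_left

# Illustrative subset of monoisotopic residue masses (Da); assumption: the
# full set A would list all twenty amino acids.
RESIDUE_MASSES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841}

def augment_spectrum(peaks, parent_mass):
    """Return the sorted mass list M: the peaks plus the four boundary masses."""
    extra = [1.0, 19.0, parent_mass - 17.0, parent_mass + 1.0]
    return sorted(set(peaks) | set(extra))

def build_graph(masses, tol=0.01):
    """Edges (i, j, label), i < j, where masses[j] - masses[i] matches a residue mass."""
    edges = []
    for i, m in enumerate(masses):
        for label, a in RESIDUE_MASSES.items():
            # Binary search for the first mass inside the tolerance window.
            j = bisect_left(masses, m + a - tol)
            while j < len(masses) and masses[j] <= m + a + tol:
                edges.append((i, j, label))
                j += 1
    return edges

def longest_path_length(n, edges):
    """Longest path (counted in edges) when every edge has score one."""
    best = [0] * n
    # Every edge goes from a lower to a higher index, so sorting by source
    # index processes the nodes in topological order.
    for i, j, _ in sorted(edges):
        best[j] = max(best[j], best[i] + 1)
    return max(best, default=0)

masses = augment_spectrum([58.02146, 129.05857], parent_mass=500.0)
edges = build_graph(masses)
```

With unit edge scores, the score of a path equals its length, so `longest_path_length` corresponds to the longest sequence tag in this toy spectrum.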
The algorithm
The elements of A and M are stored in two sorted arrays in the shared global memory of the PRAM. We divide M into groups of log n consecutive elements, and we assign a “responsible” processor to each mass in each group so that the i-th processor is responsible for the i-th mass inside the group. We divide the algorithm into three procedures:
■ precomputation procedure
■ propagation procedure
■ determination procedure.
We repeat each procedure for every group of log n elements, from the first one to the last one, and then in reverse order. The procedures presented in this section are CREW, since they require concurrent read access to A and M but only exclusive write access to the global memory. To simplify the description that follows, we assign weight one to each edge. This assumption corresponds to determining the longest feasible path for the de novo peptide sequencing problem, or the feasible paths longer than a given value as solutions of the peptide sequence tag problem. In the following we describe the three procedures.
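As a rough illustration of the grouping, the sketch below splits the node indices into blocks of about log n elements and lists the order in which the blocks would be visited. It assumes a base-2 logarithm and reads "from the first one to the last one, and in reverse order" as a forward sweep followed by a backward sweep; both choices, and the function names, are assumptions made for the example.

```python
import math

def split_into_groups(masses):
    """Partition node indices into consecutive groups of about log2(n) elements."""
    n = len(masses)
    size = max(1, math.ceil(math.log2(n))) if n > 1 else 1
    return [list(range(s, min(s + size, n))) for s in range(0, n, size)]

def processing_order(groups):
    """Group indices for a forward sweep followed by a backward sweep."""
    forward = list(range(len(groups)))
    return forward + forward[::-1]

groups = split_into_groups(list(range(16)))  # 16 masses -> 4 groups of 4
order = processing_order(groups)
```

Within one group, the i-th entry of the block is the mass handled by the i-th processor.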
Precomputation procedure
Note that any node can have at most twenty predecessors and at most twenty successors, and possibly none. Since M is a sorted array, the predecessors and successors can be determined by binary search, so the precomputation takes O(log n) time for each group and hence, over the n/log n groups, O(n) time in total.
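The binary-search lookup can be sketched for a single node as follows. This is an illustrative sequential version under stated assumptions: the function name, the tolerance `tol`, and the choice to keep only the first match per residue mass are not taken from the paper.

```python
from bisect import bisect_left

def neighbours(masses, residue_masses, i, tol=0.01):
    """Predecessors and successors of node i in the sorted mass array.

    One binary search per residue mass and direction, so a node gets at most
    len(residue_masses) predecessors and as many successors.
    """
    preds, succs = [], []
    for a in residue_masses:
        for target, out in ((masses[i] - a, preds), (masses[i] + a, succs)):
            j = bisect_left(masses, target - tol)
            if j < len(masses) and abs(masses[j] - target) <= tol:
                out.append(j)
    return sorted(preds), sorted(succs)

masses = [1.0, 58.02, 129.06, 200.0]
residues = [57.02, 71.04]
```

Each search costs O(log n), and the number of searches per node is bounded by a constant (twice the size of A), which matches the O(log n) bound per group when every processor handles one node.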
Propagation procedure
The second procedure of the algorithm computes, for each node, the maximum-length path passing through it. This goal is reached by iterating the search for the predecessor (successor) of every node using the pointer jumping technique [16] in every group. To handle the propagation, processor i stores and updates the current predecessor in the start_path pointer:
start_path[i] = h ∈ {1, …, n} ⇔ at least one path from ν_{ h } to ν_{ i } exists.
b) otherwise, if i has all the predecessors with start_path pointers pointing to themselves or predecessors with start_path pointing to the same node j, then we assign start_path[i] = j. d_{ s }[i] becomes the maximum distance d_{ s } of predecessors pointing to j, plus 1 (Figure 4b; Figure 3, stat. 18–22);
c) otherwise, if all the predecessors of i have the start_path pointers cycling on themselves or start_path pointing to a node without predecessors, we consider the predecessor j having the maximum d_{ s } distance and we assign start_path[i] = j and d_{ s }[i] = 1 + d_{ s }[j] (Figure 4c; Figure 3, stat. 24–29);
d) otherwise, the node waits for some changes in the start_path pointers of its predecessors.
At the end of the propagation procedure, each node i knows the maximum length d_{ s }[i] + d_{ e }[i] of a path passing through it, together with the starting node start_path[i] and the ending node end_path[i] of this path. The computational complexity of this procedure is O(log n) time for each group. Indeed, in the worst case only one node at a time is unlocked and can update its start_path. At the beginning, we have a set of pointer trees, and we are interested in the sum of their heights. This sum is less than log n, since each group contains log n nodes. Pointer jumping and merging operations decrease the total height, since a tree of height h is transformed into a star by applying pointer jumping in O(log(h)) steps, and the root of a star “hooks” to the parent of any of the root's predecessors in G. Therefore, in the worst case, if h_{ max } is the maximum height of the initial set of, say, k pointer trees, all these trees degenerate into stars in O(log(h_{ max })) time; finally they are merged by a list ranking of the roots, and the algorithm stops in O(k) time. Since h_{ max }, k ≤ log n in every group, and we apply Algorithm “Propagation” to all the n/log n groups, the time complexity is O(n).
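The claim that a pointer tree of height h collapses into a star in O(log(h)) steps can be checked with a small sequential simulation of synchronous pointer jumping. The sketch below shows only the generic technique on a single tree; it does not reproduce the hooking rules of the propagation procedure, and the function name is an assumption.

```python
def pointer_jump(parent):
    """Simulate synchronous pointer jumping on a forest given as a parent array.

    A root points to itself. Returns the final pointers, the distance of each
    node from its root, and the number of parallel rounds performed.
    """
    parent = list(parent)
    n = len(parent)
    dist = [0 if parent[i] == i else 1 for i in range(n)]
    rounds = 0
    # Stop when every node points directly to a root (the tree is a star).
    while any(parent[parent[i]] != parent[i] for i in range(n)):
        # Both lists are rebuilt from the old values, as in one PRAM step.
        dist = [dist[i] + dist[parent[i]] for i in range(n)]
        parent = [parent[parent[i]] for i in range(n)]
        rounds += 1
    return parent, dist, rounds

# A chain of 8 nodes (height 7) rooted at node 0.
final_parent, final_dist, rounds = pointer_jump([0, 0, 1, 2, 3, 4, 5, 6])
```

For this chain the pointer distance to the root goes 7, 4, 2, 1, so three rounds suffice. With h_{ max }, k ≤ log n, one group costs O(log log n) + O(log n) = O(log n), and n/log n groups give the O(n) total stated above.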
Determination procedure
This procedure retrieves the solutions of the peptide sequence tag problem. With minor changes, it computes either all the feasible paths of maximum length or all the feasible paths of length greater than δ. We describe the latter case. At the end of the propagation procedure, each node i having d_{ s }[i] + d_{ e }[i] > δ belongs to a solution of the peptide sequence tag problem. Moreover, the nodes i such that start_path[i] = i and d_{ e }[i] > δ are the starting nodes of the solutions. To print all the solutions we can use a sequential procedure taking at most O(ns) time, where s is the number of possible sequence tags. Indeed, beginning from each starting node i, we print all the possible solutions by visiting only the successors j such that d_{ s }[i]+1+d_{ e }[j] > δ, and so forth for the successors of these nodes.
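A sequential enumeration in the spirit of this procedure might look as follows. It is a sketch under assumptions: the graph is given as a successor dictionary with unit edge weights, d_{ e } is recomputed by a simple recursion rather than taken from the propagation step, and the number of edges already on the current path stands in for d_{ s }[i] in the pruning test. All names are hypothetical.

```python
def longest_from(succ):
    """d_e[i]: number of edges on the longest path starting at node i."""
    d_e = {}
    def visit(i):
        if i not in d_e:
            d_e[i] = max((1 + visit(j) for j in succ[i]), default=0)
        return d_e[i]
    for i in succ:
        visit(i)
    return d_e

def enumerate_tags(succ, delta):
    """All source-to-sink paths with more than delta edges."""
    d_e = longest_from(succ)
    has_pred = {j for targets in succ.values() for j in targets}
    starts = [i for i in succ if i not in has_pred and d_e[i] > delta]
    found = []
    def extend(path):
        node = path[-1]
        if not succ[node]:            # reached a sink: the path is complete
            found.append(path)
            return
        for j in succ[node]:
            # len(path) = edges so far + 1 for the edge to j; prune successors
            # that cannot lead to a path longer than delta.
            if len(path) + d_e[j] > delta:
                extend(path + [j])
    for s in starts:
        extend([s])
    return found

succ = {0: [1, 4], 1: [2], 2: [3], 3: [], 4: [], 5: []}
```

In this toy graph, `enumerate_tags(succ, 2)` keeps only the path 0, 1, 2, 3, while lowering δ to 0 also admits the short path 0, 4; the isolated node 5 never starts a solution.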
Results
To evaluate the performance of our algorithm and to compare it with other existing software, we simulated the processors using multithreading in Java, handling synchronization by means of barriers. We tested our program on a machine with four 2 GHz dual-core Intel processors and 8 GB of RAM.
Our first dataset consists of 1363 annotated Escherichia coli ion trap tandem mass spectra from the Open Proteomics Database (OPD) [17] with different Xcorr (97 spectra with Xcorr ≥ 2.5, 246 spectra with Xcorr ≥ 2.0 and 1363 spectra with Xcorr ≥ 1.5), and our second dataset consists of the 280 spectra of [13]. We tested the program on all these spectra after a data preprocessing step that removes tiny noise peaks, as in Mascot (personal communication).
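The idea of simulating PRAM processors with threads and barriers can be illustrated as below. The authors' simulation was written in Java; this sketch uses Python's `threading.Barrier` purely as an analogue, and the function and variable names are assumptions. Each thread plays one processor, and two barriers per round separate the read phase from the write phase so that the step behaves synchronously.

```python
import threading

def pram_pointer_jump(parent, rounds):
    """Run `rounds` synchronous pointer-jumping steps, one thread per node."""
    n = len(parent)
    barrier = threading.Barrier(n)

    def worker(i):
        for _ in range(rounds):
            nxt = parent[parent[i]]   # read phase: concurrent reads allowed
            barrier.wait()            # wait until every processor has read
            parent[i] = nxt           # write phase: each thread writes its own cell
            barrier.wait()            # wait until every processor has written

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return parent

result = pram_pointer_jump([0, 0, 1, 2, 3, 4, 5, 6], rounds=3)
```

Because every thread writes only its own cell and all writes happen between the two barriers, the outcome does not depend on thread scheduling, which mirrors the concurrent-read, exclusive-write discipline.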
For the first dataset, the algorithm looks for peptide sequence tags of maximum length. We evaluated the percentage of cases in which the algorithm finds at least one correct sequence tag for different lengths k. We obtained the following percentages:
■ 99.6%, for k = 3;
■ 96.1%, for k = 4;
■ 96.1%, for k = 5;
■ 59.5%, for k = 6.
Experimental results.
Comparison of five tag generating methods on 280 spectra: for each tag length, algorithm and number of solution tags, the table displays the proportion of test spectra with at least one correct tag.
Tag length  Algorithm  1 solution  3 solutions
3  Local Tag  0.529  0.764
3  PepNovo Tag  0.804  0.925
3  Local Tag +  0.725  0.855
3  Guten Tag  0.493  0.732
3  PARPST  0.761  0.839
4  Local Tag  0.464  0.714
4  PepNovo Tag  0.732  0.850
4  Local Tag +  0.700  0.811
4  Guten Tag  0.418  0.614
4  PARPST  0.468  0.597
5  Local Tag  0.410  0.593
5  PepNovo Tag  0.664  0.764
5  Local Tag +  0.571  0.696
5  Guten Tag  0.318  0.464
5  PARPST  0.236  0.407
6  Local Tag  0.332  0.489
6  PepNovo Tag  0.579  0.632
6  Local Tag +  0.527  0.546
6  PARPST  0.079  0.125
The average running time required to generate the sequence tags is 0.11 seconds.
Conclusions
The problem of identifying modified or variant peptide sequences is a challenging one, especially when the spectrum of the unmodified sequence is not available as a standard for comparison. By combining the best partial sequences of the de novo interpretation with database search algorithms, sequence tags can increase the speed and effectiveness of identification, and support the discovery of unknown modifications, sequence variations and possibly alternate splice sites in proteins. Here, we have proposed a new work-efficient parallel algorithm to find peptide sequence tags. Our tests showed that the algorithm is efficient and accurate, since it achieves results comparable to those of other methods. Therefore, at least in theory, the proposed algorithm could be used to identify post-translational modifications by comparing the homology of the sequence tags found with the sequences in the biological database.
List of abbreviations used
CREW: concurrent read, exclusive write
MS/MS: tandem mass (or mass/mass) spectrometry
OPD: Open Proteomics Database
PRAM: Parallel Random Access Machine
RAM: Random Access Memory
Declarations
Acknowledgements
We wish to thank Sonia Campa for help and useful discussions on the implementation of the algorithm.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 4, 2008: A Semantic Web for Bioinformatics: Goals, Tools, Systems, Applications. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S4.
Authors’ Affiliations
References
 Eng J, McCormack A, Yates J: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of American Society of Mass Spectrometry 1994, 5: 976–989.
 Perkins D, Pappin D, Creasy D, Cottrell J: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20: 3551–3567.
 Bafna V, Edwards N: On de novo interpretation of tandem mass spectra for peptide identification. In Proceedings of the seventh annual international conference on Computational molecular biology: 10–30 April 2003; Berlin. Edited by: Vingron M, Istrail S, Pevzner P, Waterman M. ACM Press; 2003:9–18.
 Brunetti S, Dutta D, Liberatori L, Mori E, Varrazzo D: An efficient algorithm for de novo peptide sequencing. In Proceeding of the seventh international conference on Adaptive and Natural Computing Algorithms: 21–23 March 2005; Coimbra. Edited by: Ribeiro B, Albrecht RF, Dobnikar A, Pearson DW, Steele NC. Springer Verlag; 2005:327–342.
 Brunetti S, Lodi E, Mori E: De novo peptide sequencing: a work-efficient parallel algorithm. In Proceeding of the First International Conference on Bioinformatics Research and Development: 12–14 March 2007; Berlin. Edited by: Hochreiter S, Küng J, Palkoska J, Wagner R. OCG; 2007:66–67.
 Chen T, Kao MK, Tepel M, Rush J, Church GM: A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. In Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms: 09–11 January 2000; San Francisco. Edited by: Shmoys D. Society for Industrial and Applied Mathematics; 2000:389–398.
 Dančík V, Addona TA, Clauser KR, Vath JE, Pevzner PA: De novo peptide sequencing via tandem mass spectrometry. Journal of Computational Biology 1999, 6: 327–342.
 Frank A, Pevzner PA: Pepnovo: de novo peptide sequencing via probabilistic network modeling. Anal Chem 2005, 77: 964–973.
 Lu B, Chen T: A suboptimal algorithm for de novo peptide sequencing via tandem mass spectrometry. Journal of Computational Biology 2003, 10: 1–12.
 Ma B, Zhang K, Hendrie C, Liang C, Li M, Doherty-Kirby A, Lajoie G: PEAKS: Powerful Software for Peptide De Novo Sequencing by MS/MS. Rapid Communications in Mass Spectrometry 2003, 20: 2337–2342.
 Mann M, Wilm M: Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Analytical Chemistry 1994, 66(24):4390–4399.
 Han Y, Ma B, Zhang K: Spider: software for protein identification from sequence tags with de novo sequencing error. Journal of Bioinform. Comput. Biol. 2005, 3: 697–716.
 Frank A, Tanner S, Bafna V, Pevzner P: Peptide sequence tags for fast database search in mass-spectrometry. Journal of Proteome Research 2005, 4: 1287–1295.
 Tabb DL, Saraf A, Yates JR: GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. Analytical Chemistry 2003, 75(23):6415–6421.
 Liu C, Yan B, Song Y, Xu Y, Cai L: Peptide sequence tag-based blind identification of post-translational modifications with point process model. Bioinformatics 2006, 22: e307–e313.
 Jájá J: An introduction to parallel algorithms. Addison Wesley Longman Publishing Co.: Redwood City; 1992.
 Prince JT, Carlson MW, Wang R, Lu P, Marcotte EM: The need for a public proteomics repository. Nat Biotechnol 2004, 22(4):471–472.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.