PARPST: a PARallel algorithm to find peptide sequence tags

Background Protein identification is one of the most challenging problems in proteomics. Tandem mass spectrometry provides an important tool to handle the protein identification problem. Results We developed a work-efficient parallel algorithm for the peptide sequence tag problem. The algorithm runs on the concurrent-read, exclusive-write PRAM in O(n) time using log n processors, where n is the number of mass peaks in the spectrum. The algorithm is able to find all the sequence tags having score greater than a parameter or all the sequence tags of maximum length. Our tests on 1507 spectra in the Open Proteomics Database shown that our algorithm is efficient and effective since achieves comparable results to other methods. Conclusions The proposed algorithm can be used to speed up the database searching or to identify post-translational modifications, comparing the homology of the sequence tags found with the sequences in the biological database.


Background
Protein identification is one of the most important goals of drug discovery research and in proteomics. Nowadays, the leading technique to identify a protein or a peptide is the tandem mass spectrometry (MS/MS). The basic idea of the identification of a peptide using tandem mass spectrometry is simple: a peptide is ionized and broken, at the peptide bond, in charge fragments (ions). The mass/ charge ratio of the resulted fragments are visualized in a graphic called tandem mass spectrum or ms/ms spectrum ( Figure 1). A ms/ms spectrum contains: the mass of the whole peptide, and a pattern of fragments that can be associated to a given sequence. Each fragment is characterized by mass/charge ratio and intensity: we refer to this pair as peak. A good quality spectrum should contain the complete series of y-ions (the fragment ions containing the carboxyl terminal), and the complete series of the bions (the fragment ions containing the amino terminus) ( Figure 2). In this case it is very easy to reconstruct the amino acidic sequence for the peptide, since it is sufficient to compute the mass differences between two adjacent peaks in each of the two series. Unfortunately, in practice, many factors contribute to complicate the problem. Indeed, many b-ion or y-ion peaks might be absent from the spectrum, or it might be an imperfect fragmentation that causes a different types of ions, or the sample might be contaminated, or it might be present post-translational modifications or many other type of peaks might unexpected appear in the spectrum. There are three different computational approaches motivated by peptide sequencing: the peptide identification searching in a database, the de novo peptide sequencing, and the peptide sequence tag. The first method finds the best matching peptide from a sequence database using a scoring function based on the likelihood that an identified peptide is actually the peptide of the spectrum [1,2]. This method is the mostly used but it is able only to identify peptide stored in a database. On the contrary the de novo method allows to identify a peptide using only the spectrum without any other previous knowledge and hence even if the peptide is not in a database. Many algorithms using this approach have been developed [3][4][5][6][7][8][9][10] having different time complexity, but the accuracy of them depends on the quality of the input spectrum: usually these algorithms cannot find the complete sequence due to missing peaks and hence the applicability of them is limited in practice. The peptide sequence tag approach combines the two previous methods: first the de novo method is applied to find a partial solution, so called sequence tag, and then a database search is applied to identify the complete sequence. The idea of using peptide sequence tags is not novel [11,12] and it is recently re-proposed to increase the speed of the database searching [13,14] or to find post-translation modifications [15]. Most of these algorithms are using a previous developed de novo approach and their time complexity to find the optimal solution according to any scoring function is at least O(n 2 ), where n is the number of the mass spectrum peaks. A simple scoring function can be defined as the sum of the correspondent mass peak abundances found in the spectrum [3] or it can be based on the ion-type, favouring the b and y ions over the other types [7,10]. In this paper we propose a parallel algorithm to determine all the peptide sequence tags longer than an input number of amino acids or all those scoring more than an input number, according to any scoring function. The parallel approach is motivated by demand of efficiency, since the interpretation of mass spectra is a high throughput process. The algorithm is work-efficient running in O(n) time on a concurrent-read, exclusive-write (CREW) PRAM [16] with log n processors, and it is a variation of the algorithm proposed in [5] to find peptide sequence tags. We simulate the parallelism by an implementation in Java on threads using barriers for synchronization. Our tests on 1507 spectra in the Open Proteomics Database shown that our algorithm is efficient and effective since achieves comparable results to other approaches.

Methods
Let an experimental spectrum be given related to an unknown peptide P of mass m P . A peptide sequence tag is a short string of amino acid mass differences deduced from the fragment spectrum. Let any scoring function and any number δ be given. Our task is to determine all the sequence tags scoring at least δ. If the score reduces to count the length of the string, the output consists in the sequence tags of lengths at least δ. We refer to this problem as the peptide sequence tag problem.
Although a spectrum may contain a few different types of ions, for simplicity, we consider b-ions and y-ions only. Therefore we assume M = {m 1 , m 2 ,…,m n } to represent a spectrum where the real numbers m i correspond to the m/ z ratios of the peaks in the spectrum augmented with the numbers 1, 19, m P −17, and m P +1 that represent the "empty" b-ion, the "empty" y-ion, the weightiest b-ion, and the weightiest y-ion, respectively. Let us denote the set of the masses of the twenty amino acids by A. The peptide sequence tag problem can be reformulated in terms of paths in a graph. We build a labelled directed acyclic The peptide sequence tag problem consists in determining any path between two nodes in the graph G with score greater than δ. The introduction of any scoring function corresponds to assign weights to the edges of the graph: the score of a path is the sum of the scores of the edges on the path.

The algorithm
The elements of A and M are stored in two sorted arrays in the shared global memory of the PRAM. We divide M into groups of log n consecutive elements, and we assign a "responsible" processor to each mass in each group so that the ith processor is responsible for the ith mass inside the group. We divide the algorithm in three procedures: We repeat each procedure for every group of log n elements, from the first one to the last one, and in reverse order. The procedures presented in this section are CREW since they require concurrent access to A and M in reading, but only exclusive access to the global memory in writing. In order to simplify the description that follows we give value one to the weight of each edge. This assumption corresponds to determine the longest feasible path for the de novo peptide sequencing problem or feasible paths longer than any given value as solutions for the peptide sequencing tag problem. In the next paragraph we describe the three procedures.

Pre-computation procedure
The first procedure consists in building the graph. Considering each group of log n masses, we associate a node to each mass, and so the ith processor is responsible for the ith node ν i in the group. Processor i, for each element in A, checks if there exists a node ν j in M that differs from it to the mass of the element in A. In this case we put an edge between ν i and ν j in the graph. We store this edge in two different adjacency lists, the so called predecessor (pred) and successor (succ) list: Note that any node can have at most twenty predecessors or successors, or none. Since M is a sorted array, using a binary search algorithm to determine the predecessors and successors, the pre-computation takes O(log n) time for each group and hence O(n) totally.

Propagation procedure
The second procedure of the algorithm permits, for each node, to compute the maximum length path passing through it. This goal is reached by iterating the search of the predecessor (successor) of every node using the pointer jumping technique [16] in every group. In order to handle the propagation, processor i stores and updates the current predecessor in the start_path pointer: Note that all the predecessors of ν i belong to the ν i 's group or to any group that precedes it, being M sorted. Hence all the start_path pointers of the predecessors of ν i are known, when ν i 's group is processed, due to the order in which groups are handled. The same is done to update the current successor in end_path. We also calculate the distances d s and d e from any node ν i to the ν h node pointed by start_path [i] and end_path[i] pointers. At the end of the propagation procedure these two pointers, related to each node ν i , will point to the termini nodes of the longer path A ms/ms spectrum   Indeed, in the worst condition only one node at time is unlocked and it can upload the start_path. At the beginning, we have a set of pointer trees. We are interested in the sum of their heights. This sum is obviously less than log n. Pointer jumping and merging operations decrease the total height since a tree of height h is transformed into a star by applying pointer jumping in O(log(h)) steps, and the root of a star "hooks" to the parent of any of the root's predecessor in G. Therefore, in the worst case, if h max is the A peptide fragmentation Figure 2 A peptide fragmentation. In a mass spectrometer a whole peptide (below) is broken into two fragments, one of them charged. Usually the breaking point is the peptide bond: in this case we could obtain a b-ion that maintains the amino termini or a y-ion that maintains the carboxyl termini.
maximum height of the initial set of, say k, pointer trees, all these trees degenerate into stars in O(log(h max )) time, and finally they are merged in a list ranking of roots, and the algorithm stops in O(k) time. Since h max , k ≤ log n in every group, and we apply Algorithm "Propagation" to all the n/log n groups, the time complexity is O(n).

Determination procedure
This procedure allows to retrieve the solutions of the peptide sequence tag problem. Some change to the procedure permits to compute all the feasible paths of maximum length or all the feasible paths with length more than δ.
We describe the latter case. At the end of the previous section, each node i having d s [i] + d e [i] > δ belongs to a solution of the peptide sequence tag problem. Moreover the Algorithm of the Propagation procedure Figure 3 Algorithm of the Propagation procedure. Propagation procedure permits, for each node, to compute the maximum length path passing through it. Pseudo-code is presented for the computation of start_path and d s for any group, while d e and end_path can be obtained by replacing pred and start_path with succ and end_path, respectively.

Results
In order to understand the performance of our algorithm and to compare it with other existing software, we simulated the processes by using the multithreading in Java, addressing the synchronization by means of barriers. We tested our program on a four 2 GHz dual-core Intel processors 8GB RAM machine.
Our first dataset consists in 1363 annotated Escherichia Coli ion trap tandem mass spectra from the Open Proteomics Database (OPD) [17] with different Xcorr (97 spectra with Xcorr ≥ 2.5, 246 spectra with Xcorr ≥ 2.0 and 1363 spectra Xcorr ≥ 1.5), and our second dataset consists of the 280 spectra of [13]. We tested the program over all these spectra after running a data pre-processing to remove tiny noise peaks as in Mascot (personal communication).
For the first dataset, the algorithm looks for peptide sequence tags of maximum length. We evaluated the percentage of cases when the algorithm finds at least one correct sequence tag at different lengths k. We obtained the following percentage: ■ 99.6%, for k = 3; ■ 96.1%, for k = 4; ■ 96.1%, for k = 3; ■ 59.5%, for k = 3.
The average running time required to generate the sequence tags is 0.15 seconds. We compared our program with the public available program PepNovo on the same dataset of 280 spectra as in [13]. We evaluated the occurrence of at least one correct sequence tag in the generated sequence tag of maximum length found by our algorithm. Since in general the generated sequence tag is not unique, we used the scoring function defined in [7] to assign a score to the sequences. We evaluated the percentage of cases where any correct tag is contained in the highest scoring solution at different lengths. Additionally, we reported on the occurrence of any correct tag in the set of  Table 1. We note that, since the percentages grow substantially if we consider the occurrence of correct sequence tags in the generated maximum length sequence tags, selectivity of the scoring function is low. Therefore better results could be obtained by using a different scoring function.
The average running time required to generate the sequence tags is 0.11 seconds.

Conclusions
The problem of identifying modified or variant peptide sequences is a challenging one, especially when the spectrum for unmodified sequence is not present as a standard for comparison. By joining the best partial sequences of the de novo interpretation and the database search algorithms, sequence tag can increase the speed and the effectiveness of the identification, and the discovery of unknown modifications, sequence variations and possibly alternate splice sites in proteins. Here, we have proposed a new work-efficient parallel algorithm to find peptide sequence tags. Our tests shown that our algorithm is efficient and accurate since achieves comparable results to other methods. Therefore, at least in theory, the proposed algorithm could be used to identify post-translational modifications, comparing the homology of the sequence tags found with the sequences in the biological database.