A parallel method for enumerating amino acid compositions and masses of all theoretical peptides

Background Enumeration of all theoretically possible amino acid compositions is an important problem in several proteomics workflows, including peptide mass fingerprinting, mass defect labeling, mass defect filtering, and de novo peptide sequencing. Because of the high computational complexity of this task, reported methods for peptide enumeration were restricted to cover limited mass ranges (below 2 kDa). In addition, implementation details of these methods as well as their computational performance have not been provided. The increasing availability of parallel (multi-core) computers in all fields of research makes the development of parallel methods for peptide enumeration a timely topic. Results We describe a parallel method for enumerating all amino acid compositions up to a given length. We present recursive procedures which are at the core of the method, and show that a single task of enumeration of all peptide compositions can be divided into smaller subtasks that can be executed in parallel. The computational complexity of the subtasks is compared with the computational complexity of the whole task. Pseudocodes of processes (a master and workers) that are used to execute the enumerating procedure in parallel are given. We present computational times for our method executed on a computer cluster with 12 Intel Xeon X5650 CPUs (72 cores) running Windows HPC Server. Our method has been implemented as a 32- and 64-bit Windows application using Microsoft Visual C++ and the Message Passing Interface. It is available for download at https://ispace.utmb.edu/users/rgsadygo/Proteomics/ParallelMethod. Conclusion We describe implementation of a parallel method for generating mass distributions of all theoretically possible amino acid compositions.


Background
Mass spectrometry (MS) plays a crucial role in modern proteomics as a key method for protein identification and quantification. MS provides accurate mass and abundance measurements of intact and fragmented peptide ions, which are then processed by specialized algorithms and transformed into peptide and protein identities. Thus, efficiency of many MS-based proteomics workflows depends on how well we understandand can utilizethe properties of peptide masses and peptide mass distribution.
It has been observed that peptide masses have a nonuniform, clustered distribution, which is explained by the fact that peptides are made of twenty amino acids with specific masses. This distribution consists of repeating peaks separated by approximately 1 Da, which become taller and wider as the mass increases. Consecutive peaks are separated by low populated regions (quiet zones) and gaps (forbidden zones)-that is, the mass ranges for which there exist no possible sequences of amino acids. Nonuniformity (peaks, gaps) and discrete nature of the mass distribution of peptides are important for two major problems in MS-based proteomics: peptide identification and de novo sequencing.
The knowledge of the mass distribution of a particular type of peptide (for example, non-modified tryptic peptides) can be used to facilitate peptide identification in a number of ways. Forbidden zones allow us to filter out MS signals corresponding to non-target species (nonpeptide contaminants or modified peptides) early on, before doing any complicated processing of MS data. Dodds and coworkers [1] showed that this results in exponential improvements in statistical significance and discrimination of protein identification based on peptide mass fingerprinting on the Mascot platform. Nonoverlapping or partially overlapping peaks in the mass distributions of different types of peptides allow recognition of these types based solely on precursor masses. For example, Spengler and Hester [2] showed that accurate masses (with accuracy of 0.1 or even 1 ppm) allow phosphorylated and nonmodified peptides to be distinguished. Lehmann and coworkers [3] and Jones and coworkers [4] showed that this is possible for glycopeptides and lipids. In addition, there have been many suggestions for label tags shifting the mass of labeled peptides to quiet or forbidden zones in order to allow easier identification and quantification of these peptides [5].
The major drawback of peptide identification algorithms based on database search is their inability to identify peptides that are not present in the reference database. De novo sequencing algorithms are designed to restore peptide compositions from MS data without the use of peptide databases. These algorithms employ several strategies for MS data analysis s [6], one of which is based on the fact that for a given mass there exist only a finite (though sometimes very large) number of amino acid sequences (or amino acid compositions) that can assume that mass, and that these sequences (compositions) can be explicitly enumerated. The use of the masses of fragment ions can further reduce the number of admissible compositions. Several reports have shown the feasibility of this strategy, especially for high accuracy data provided by modern Fourier transform mass spectrometers [7][8][9].
Proteomics applications mentioned above rely on specific properties of the peptide mass distributions that can only be obtained by enumerating all theoretically possible peptides. Moreover, in many circumstances it is impossible to generate these distributions once and for all, as many parameters can vary from experiment to experiment (peptide modifications, enzymatic specificity, number of missed cleavages, etc.) Thus, it is desirable to be able to generate peptide mass distributions (or some parts of these distributions) "to order" and, therefore, to be able to generate them fast.
Several works focusing on different MS-based proteomics applications employed enumeration of all theoretically possible peptides [8,[10][11][12][13]. Because of the high computational complexity of the task, enumeration of peptides was done for the mass range below 2 kDa, which limited applicability of the obtained results. Also, even for this mass range long computational times and extensive computational capabilities were often required. Olson and others [8] mentioned the use of a parallel method for peptide enumeration, but details of its implementation as well as its computational performance were not reported.
In a recent paper [14] we described the mass distribution of all theoretically possibly tryptic peptides made of 20 amino acids, up to the mass of 3 kDa. The paper provided detailed characterization of forbidden zones and amino acid compositions of peptides from the quiet zones. We showed how forbidden zones shrink over the mass range, where they completely disappear and how they depend on the measured mass accuracy. We found that peptide sequence compositions in the quiet zones are less diverse than those in the peaks of the distribution, and that forbidden zones may be extended by eliminating certain types of unrealistic compositions. We also characterized symmetry of mass peaks and the accuracy of the Mann's equations [13] for the mass peak position and width. Our study was made possible by advancing computational techniques for the enumeration of amino acid compositions.
In this paper, we describe in detail a parallel method for enumerating all amino acid compositions up to a given length. First, we present a pseudocode for recursive procedures which are the core of this method. We then show how a single task of enumerating all peptide compositions can be divided into smaller subtasks that can be executed in parallel. We also show how the computational complexity of these subtasks compares with the computational complexity of the primary task. Finally, we provide pseudocode of processes (a master and workers) that are used to execute the enumerating procedure in parallel. To the best of our knowledge this is the first description of a computational method for a complete and unbiased enumeration of all theoretically possible peptides. We present computational times for our method, implemented by using Microsoft Visual C+ + and the Message Passing Interface (MPI), and executed on a computer cluster with 12 Intel Xeon X5650 CPUs running Windows HPC Server 2008. The mass and length limits are input parameters of the program.

Peptide compositions
Any peptide composition is represented by a numerical vector (n 1 , n 2 ,..., n 20 ), whose i-th component is equal to the number of times the i-th amino acid occurs in the peptide. For example, sequence a 1 a 20 a 1 a 1 has composition (3, 0,..., 0, 1). In some cases, it is convenient to consider peptides as sequences composed of less or more than 20 letters (tryptic peptides without missed cleavages, post-translationaly modified peptides, etc.). For this reason, let us adopt a more general notation: assume we have an alphabet of N characters and composition vectors (n 1 , n 2 ,..., n N ). The length of a composition is defined as L = n 1 + n 2 +... + n N . If m i is the monoisotopic mass associated with the i-th letter, then the monoisotopic mass of a composition is defined as m = n 1 m 1 + n 2 m 2 +... + n N m N (the monoisotopic mass of H 2 O and a proton may be added to this mass if necessary.) For a single sequence of letters we have a uniquely defined composition, while for a single composition of length L we have corresponding sequences, given by the multinomial coefficient. Note that all these sequences will have the same mass, which explains the convenience of enumerating peptide compositions instead of peptide sequences in order to obtain all theoretically possible peptide masses.
The number of compositions of length L is equal to the number of ways to choose L elements from a set of N elements if repetitions are allowed: The number of compositions of all lengths not greater than L (including one composition of length 0) is equal to which follows from the equation The latter is based on the recurrence relation and can be found in the book by Graham and others [15]. The number of sequences of all lengths not greater than L is equal to Table 1 shows the number of compositions and sequences for peptides comprised of 20 amino acids. Note how the number of sequences exceeds the number of compositions as the length of peptides grows.
Enumerating peptide compositions Figure 1 shows the pseudocode of a basic recursive procedure for enumerating all compositions of length not greater than L for an alphabet of N letters. Array c holds current composition (n 1 , n 2 ,..., n N ) and is indexed from one. Procedure Mass returns the mass of the input composition c. For a given length L, procedure GenBasic should be called with parameters (L, start = 1). Note that the depth of recursion for this procedure is equal to N -1. The number of compositions enumerated by this procedure is given by equation (3).
Several changes to procedure GenBasic will make it faster. First, if L is equal to zero on line 3 then there is no need to make assignment on line 4 and call GenBasic on line 5, since it is already known that the rest of the composition will contain zeros only. Second, we can calculate the mass of a composition as soon as its component n i becomes known, and then pass this mass to the next call of the generating procedure. By doing this, we avoid the need to recalculate the mass of the part of the composition that has not been changed. Figure 2 shows the pseudocode of procedure Gen which is a faster version of procedure GenBasic. It generates a histogram of peptide compositions' masses, instead of printing them, which is more suitable for its further use. The histogram, stored in global array massHist, contains the number of compositions falling into the mass bins of width 0.001 Da. Procedure Round returns the rounded integer value of its argument. Note that since procedure Gen calculates the mass of compositions "on the fly", we do not need to store compositions in array c, so lines 10 and 16 may be removed. We assume that array aam of size N stores masses m 1 , m 2 ,..., m N . Procedure Gen should be called with parameters (L, start = 1, m 0 = 0).

Enumerating peptide compositions in parallel
The task of enumerating all compositions (n 1 , n 2 ,..., n N ) can be split into smaller independent subtasks or jobs that can be executed in parallel. Indeed, a single call to procedure Gen with parameters (L, 1, 0) is equivalent to   Nefedov and Sadygov BMC Bioinformatics 2011, 12:432 http://www.biomedcentral.com/1471-2105/12/432 correspondingly ( Figure 3). As before, we assume that array aam stores masses m 1 , m 2 ,..., m N of the used amino acids. Certainly, we will have to combine mass histograms produced by each call of procedure Gen, which can be done knowing parameters of each job, described by a triplet (L, start, m 0 ). To illustrate this idea, consider again our example with N = 3 and L = 2. The primary task is to enumerate the following compositions: (0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0), (0, 1, 1), (0, 2, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0 How can we create a list or table of jobs given the initial job described by parameters (L, 1, 0)? First, job (L, 1, 0) is replaced by L+1 jobs (L, 2, 0), (L -1, 2, aam [1]),..., (0, 2, aam[1]*L) (Figure 3). If, for a given L, job (L, 2, 0) is executed in acceptable time, we do not need to do anything else, and the table of jobs has been initialized. Otherwise, we can split job (L, 2, 0) into L+1 jobs with start = 3, and similarly split other jobs with start = 2. Thus, for all jobs with start = 2 there is certain L max,2 such that if the first parameter of the job is larger than L max, 2 then this job should be split into jobs with start = 3. When this is done, we move to the jobs with start = 3 and process them in a similar manner: all jobs that have first parameter larger than L max,3 should be split into jobs with start = 4. We continue this until each job in the job table can be executed in acceptable time (see additional notes on this in the Discussion section).
When the table of jobs has been initialized, the jobs from the table can be assigned to computation processes. In this context, it is convenient to think about a master process, which does these assignments ( Figure  4), and worker processes ( Figure 5), which execute the assigned jobs and return results back to the master. The master then combines partial mass histograms computed by workers into a single final mass histogram. There may be different strategies utilized in assigning the jobs. For example, larger jobs (with larger L) may be assigned prior to smaller jobs (with smaller L). In our experiments, which are presented in the Discussion section, there were no particular strategy in job assignments (jobs were assigned in the order in which they had been inserted into the job table).
The data exchange between the master and workers ( Figure 4, lines 12, 16; Figure 5, lines 2, 5, 6) can be organized by using functions MPI_Send and MPI_Receive from any library implementing MPI [16]. In our implementation, we used Microsoft Visual C++ and MPI library from Microsoft HPC SDK Pack.

Results and Discussion
It is worthwhile to make several additional comments on procedure Gen, presented on Figure 2. Various practical considerations may suggest using an upper limit on the mass of peptide compositions that one wants to enumerate. In this case, a significant improvement in computation speed may be achieved by canceling the enumeration of compositions whose mass exceeds a given limit. If array aam contains mass values in ascending order, we can return from function Gen as soon as the current mass (m 0 in line 5, m in lines 11 and 17) exceeds the threshold. To illustrate a possible gain in speed that may be achieved by using a maximum mass limit, consider enumeration of compositions corresponding to all tryptic peptides up to the length of 30.   (Tables 2 and 3). There may be other modifications to this procedure, depending on the intended use of the generated mass distribution. For example, the maximum number of occurrences of each amino acid in a peptide may be made limited by a threshold based on the amino acid and the length and/or mass of the peptide. This would make the generated mass distribution more realistic and may increase the lengths of forbidden zones [14]. Instead of counting the number of peptide compositions, one can count the number of peptide sequences using equation (1). In this case, efficient computation of factorials "on the fly" can be implemented similar to the computation of peptide masses. If we are interested in enzyme-specific peptides, the procedure can be modified to allow a given number of missed cleavages. The   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21 procedure CreateMassHistMaster(L) create numOfWorkers work processes jobs ← CreateJobs(L) numOfJobs ← number of created jobs numOfBusyWorkers ← 0 while numOfJobs > 0 or numOfBusyWorkers > 0 comment: find free process and assign next job p ← 1 while p < numOfProcs and numOfJobs > 0 if process p is free then assign next unassigned job from jobs to process p numOfJobs ← numOfJobs -1 numOfBusyWorkers ← numOfBusyWorkers + 1 p ← p + 1 wait for massHist from any worker update massHistGlobal using massHist numOfBusyWorkers ← numOfBusyWorkers -1 tell work processes to terminate return massHistGlobal  An important question is how the job (L, start + 1, 0) compares with the job (L, start, 0) in terms of computational complexity. Let us denote the number of compositions enumerated by the first procedure by C(L, start + 1), and the number of compositions enumerated by the second procedure by C(L, start). Using equation (3) we have: Hence, C(L, start) C(L, start + 1) = 1 + L N − start + 1 .
Initialization of a job table requires the maximum value of parameter start, as well as parameters L max,2 , L max,3 , etc., to be specified. These can be determined empirically based on the available computational resources and the number of processes that can be executed in parallel. For example, we found that for enumerating tryptic peptide compositions of masses up to 3 kDa by using 72 processes running on 12 Intel Xeon X5650 CPUs the following parameters would give good performance: start ≤ 7, L max,2 = 20, L max,3 = 24, L max,4 = 28, L max,5 = 34, L max,6 = 40. The tuning of these parameters is important to ensure good performance, as they directly affect the computation time ( Table 2).
It should be noted that a job table may have jobs with the same parameters L and start, differing only in M. For example, consider the case illustrated in Figure 3. Splitting job (L, 2, 0) into L+1 jobs with start = 3 will give us, among others, job (L-1, 3, aam [2]). On the other hand, splitting job (L-1, 2, aam [1]) into L jobs with start = 3 gives us job (L-1, 3, aam [1]). It is clear that execution of these two jobs can be done in one call to function Gen, which should be modified to be able to accept two input masses m 1 0 , m 2 0 instead of m 0 , and to work with two variables m 1 , m 2 instead of m. In a similar manner, execution of more than two jobs may be done in one call to function Gen. This approach will  lead to a significant speed-up in computations (it has not been implemented in our code). In fact, a job table may have jobs with all three parameters L, start and M being equal. Consider, for example, a primary job with L = 40, start = 0, and m 0 = 0. Assume that array aam holds amino acid masses in ascending order. Then the first five masses stored in this array will correspond to glycine (G), alanine (A), serine (S), proline (P) and valine (V), and we can denote the first five elements of a composition by n G , n A , n S , n P , n V . Assume that the job splitting algorithm (see subsection 2.3) yields the following two jobs: n G = 2, n A = 0, n S = 0, n P = 0, n V = 1, start = 6, L = 37, n G = 0, n A = 3, n S = 0, n P = 0, n V = 0, start = 6, L = 37.
Then these two jobs will have the same m 0 = 213.111 Da, since tripeptides GGV and AAA are isomeric. If a job table is generated using parameters start ≤ 7, L max,2 = 20, L max,3 = 24, L max,4 = 28, L max,5 = 34, L max,6 = 40, then for L = 40 about 2% of all jobs will be duplicates; for L = 50about 29%, and for L = 60about 47%. In the case when we are only interested in the mass distribution of peptide compositions, there is no need to execute duplicate jobs. If certain job occurs k times, it is enough to execute it once and then multiply the resulting histogram by k before adding it to the final histogram. However, if we would like to get every peptide composition, then we cannot remove duplicate jobs.
In the end of this section, we present Table 3 which shows computation times for enumeration of tryptic compositions for a range of lengths between 25 and 55, with and without the use of a maximum mass limit. The numbers in the second column may seem counterintuitive at first, since, for example, it takes 19 min to generate the distribution for L = 25 and 11 min for L = 35. The explanation, however, lies in using the maximum mass limit of 3 kDa. The longest job for the task with L = 25 was L = 24, start = 2, m 0 = 0, and it executed for 19 min. The longest job for the task with L = 30 was L = 24, start = 2, m 0 = 285, and it executed for 8 min. The difference in 11 min comes from the fact that more compositions were canceled out in the second case because of the mass limit that was used.
We would like to note, in addition to Table 3, that enumeration of all tryptic peptides having the mass no greater than 3 kDa (the length of these peptides does not exceed 51) took 32 minutes.

Conclusions
In this paper, we presented a detailed description of a parallel method for enumerating all theoretically possible amino acid compositions and discussed different aspects of its implementation. Enumeration of all amino acid compositions is important in several proteomics workflows, including peptide mass fingerprinting, mass defect labeling, mass defect filtering, and de novo peptide sequencing. Given the fact that multi-core computers and computer clusters are becoming increasingly available, it is natural to address this computationally expensive task using a parallelization approach.
We believe that by reducing computational times from hours to minutes, the applicability of the enumeration of all amino acid compositions in various proteomics studies may be significantly improved and extended. We have used the method described in this work to characterize forbidden and quiet zones in the mass distribution of tryptic peptides [14]. In the next step, we plan to apply this method to enhance the accuracy of protein identification in real mass spectrometry data. Our method has been implemented as a 32-and 64-bit Windows application using Microsoft Visual C++ and MPI. It is freely available for download at https://ispace.utmb. edu/users/rgsadygo/Proteomics/ParallelMethod.

Availability and Requirements
•