A parallel method for enumerating amino acid compositions and masses of all theoretical peptides
- Alexey V Nefedov^{1} and
- Rovshan G Sadygov^{1}
DOI: 10.1186/1471-2105-12-432
© Nefedov and Sadygov; licensee BioMed Central Ltd. 2011
Received: 27 July 2011
Accepted: 7 November 2011
Published: 7 November 2011
Abstract
Background
Enumeration of all theoretically possible amino acid compositions is an important problem in several proteomics workflows, including peptide mass fingerprinting, mass defect labeling, mass defect filtering, and de novo peptide sequencing. Because of the high computational complexity of this task, reported methods for peptide enumeration were restricted to cover limited mass ranges (below 2 kDa). In addition, implementation details of these methods as well as their computational performance have not been provided. The increasing availability of parallel (multi-core) computers in all fields of research makes the development of parallel methods for peptide enumeration a timely topic.
Results
We describe a parallel method for enumerating all amino acid compositions up to a given length. We present recursive procedures which are at the core of the method, and show that a single task of enumeration of all peptide compositions can be divided into smaller subtasks that can be executed in parallel. The computational complexity of the subtasks is compared with the computational complexity of the whole task. Pseudocode for the processes (a master and workers) that execute the enumerating procedure in parallel is given. We present computational times for our method executed on a computer cluster with 12 Intel Xeon X5650 CPUs (72 cores) running Windows HPC Server. Our method has been implemented as a 32- and 64-bit Windows application using Microsoft Visual C++ and the Message Passing Interface. It is available for download at https://ispace.utmb.edu/users/rgsadygo/Proteomics/ParallelMethod.
Conclusion
We describe implementation of a parallel method for generating mass distributions of all theoretically possible amino acid compositions.
Background
Mass spectrometry (MS) plays a crucial role in modern proteomics as a key method for protein identification and quantification. MS provides accurate mass and abundance measurements of intact and fragmented peptide ions, which are then processed by specialized algorithms and transformed into peptide and protein identities. Thus, the efficiency of many MS-based proteomics workflows depends on how well we understand, and can utilize, the properties of peptide masses and the peptide mass distribution.
It has been observed that peptide masses have a nonuniform, clustered distribution, which is explained by the fact that peptides are made of twenty amino acids with specific masses. This distribution consists of repeating peaks separated by approximately 1 Da, which become taller and wider as the mass increases. Consecutive peaks are separated by sparsely populated regions (quiet zones) and by gaps (forbidden zones), that is, mass ranges for which there exist no possible sequences of amino acids. The nonuniformity (peaks, gaps) and discrete nature of the mass distribution of peptides are important for two major problems in MS-based proteomics: peptide identification and de novo sequencing.
The knowledge of the mass distribution of a particular type of peptide (for example, non-modified tryptic peptides) can be used to facilitate peptide identification in a number of ways. Forbidden zones allow us to filter out MS signals corresponding to non-target species (nonpeptide contaminants or modified peptides) early on, before doing any complicated processing of MS data. Dodds and coworkers [1] showed that this results in exponential improvements in statistical significance and discrimination of protein identification based on peptide mass fingerprinting on the Mascot platform. Nonoverlapping or partially overlapping peaks in the mass distributions of different types of peptides allow recognition of these types based solely on precursor masses. For example, Spengler and Hester [2] showed that accurate masses (with accuracy of 0.1 or even 1 ppm) allow phosphorylated and nonmodified peptides to be distinguished. Lehmann and coworkers [3] and Jones and coworkers [4] showed that this is possible for glycopeptides and lipids. In addition, there have been many suggestions for label tags shifting the mass of labeled peptides to quiet or forbidden zones in order to allow easier identification and quantification of these peptides [5].
The major drawback of peptide identification algorithms based on database search is their inability to identify peptides that are not present in the reference database. De novo sequencing algorithms are designed to restore peptide compositions from MS data without the use of peptide databases. These algorithms employ several strategies for MS data analysis [6], one of which is based on the fact that for a given mass there exist only a finite (though sometimes very large) number of amino acid sequences (or amino acid compositions) that can assume that mass, and that these sequences (compositions) can be explicitly enumerated. The use of the masses of fragment ions can further reduce the number of admissible compositions. Several reports have shown the feasibility of this strategy, especially for high accuracy data provided by modern Fourier transform mass spectrometers [7–9].
The proteomics applications mentioned above rely on specific properties of the peptide mass distributions that can only be obtained by enumerating all theoretically possible peptides. Moreover, in many circumstances it is impossible to generate these distributions once and for all, as many parameters can vary from experiment to experiment (peptide modifications, enzymatic specificity, number of missed cleavages, etc.). Thus, it is desirable to be able to generate peptide mass distributions (or some parts of these distributions) "to order" and, therefore, to be able to generate them fast.
Several works focusing on different MS-based proteomics applications employed enumeration of all theoretically possible peptides [8, 10–13]. Because of the high computational complexity of the task, enumeration of peptides was done for the mass range below 2 kDa, which limited applicability of the obtained results. Also, even for this mass range long computational times and extensive computational capabilities were often required. Olson and others [8] mentioned the use of a parallel method for peptide enumeration, but details of its implementation as well as its computational performance were not reported.
In a recent paper [14] we described the mass distribution of all theoretically possible tryptic peptides made of 20 amino acids, up to the mass of 3 kDa. The paper provided detailed characterization of forbidden zones and amino acid compositions of peptides from the quiet zones. We showed how forbidden zones shrink over the mass range, where they completely disappear and how they depend on the measured mass accuracy. We found that peptide sequence compositions in the quiet zones are less diverse than those in the peaks of the distribution, and that forbidden zones may be extended by eliminating certain types of unrealistic compositions. We also characterized the symmetry of mass peaks and the accuracy of Mann's equations [13] for the mass peak position and width. Our study was made possible by advancing computational techniques for the enumeration of amino acid compositions.
In this paper, we describe in detail a parallel method for enumerating all amino acid compositions up to a given length. First, we present pseudocode for the recursive procedures which are the core of this method. We then show how a single task of enumerating all peptide compositions can be divided into smaller subtasks that can be executed in parallel. We also show how the computational complexity of these subtasks compares with the computational complexity of the primary task. Finally, we provide pseudocode for the processes (a master and workers) that are used to execute the enumerating procedure in parallel. To the best of our knowledge this is the first description of a computational method for a complete and unbiased enumeration of all theoretically possible peptides. We present computational times for our method, implemented by using Microsoft Visual C++ and the Message Passing Interface (MPI), and executed on a computer cluster with 12 Intel Xeon X5650 CPUs running Windows HPC Server 2008. The mass and length limits are input parameters of the program.
Implementation
Peptide compositions
Any peptide composition is represented by a numerical vector (n_{1}, n_{2},..., n_{20}), whose i-th component is equal to the number of times the i-th amino acid occurs in the peptide. For example, sequence a_{1}a_{20}a_{1}a_{1} has composition (3, 0,..., 0, 1). In some cases, it is convenient to consider peptides as sequences composed of fewer or more than 20 letters (tryptic peptides without missed cleavages, post-translationally modified peptides, etc.). For this reason, let us adopt a more general notation: assume we have an alphabet of N characters and composition vectors (n_{1}, n_{2},..., n_{ N } ). The length of a composition is defined as L = n_{1} + n_{2} +... + n_{ N } . If m_{ i } is the monoisotopic mass associated with the i-th letter, then the monoisotopic mass of a composition is defined as m = n_{1}m_{1} + n_{2}m_{2} +... + n_{ N }m_{ N } (the monoisotopic mass of H_{2}O and a proton may be added to this mass if necessary).
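The definitions above can be illustrated with a minimal sketch. The three-letter alphabet (G, A, V) and the function names are assumptions chosen to keep the example short; the residue masses are standard monoisotopic values rounded to five decimals.

```python
from collections import Counter

# Monoisotopic residue masses (Da). The three-letter alphabet is an
# assumption made for brevity; a real run would use N = 20 or more letters.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "V": 99.06841}
ALPHABET = sorted(RESIDUE_MASS)  # fixed letter order: ['A', 'G', 'V']


def composition(sequence):
    """Return the composition vector (n_1, ..., n_N) of a sequence."""
    counts = Counter(sequence)
    return tuple(counts[letter] for letter in ALPHABET)


def length(comp):
    """L = n_1 + n_2 + ... + n_N."""
    return sum(comp)


def mono_mass(comp):
    """m = n_1*m_1 + ... + n_N*m_N (no water or proton added)."""
    return sum(n * RESIDUE_MASS[a] for n, a in zip(comp, ALPHABET))


comp = composition("GAGG")
print(comp, length(comp), round(mono_mass(comp), 5))
```

Sequences that are permutations of each other, such as GAGG and GGAG, map to the same vector and therefore to the same mass.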
Each composition (n_{1}, n_{2},..., n_{ N }) of length L has

$$\frac{L!}{n_{1}!\,n_{2}!\cdots n_{N}!} \qquad (1)$$

corresponding sequences, given by the multinomial coefficient. Note that all these sequences will have the same mass, which explains the convenience of enumerating peptide compositions instead of peptide sequences in order to obtain all theoretically possible peptide masses.
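As a small check of the multinomial coefficient mentioned above, the following sketch counts the sequences that share one composition. The function name is an assumption introduced for illustration.

```python
from math import factorial


def sequences_per_composition(comp):
    """Multinomial coefficient L! / (n_1! * n_2! * ... * n_N!)."""
    result = factorial(sum(comp))
    for n in comp:
        result //= factorial(n)  # each intermediate quotient is an integer
    return result


# Composition (3, 0, ..., 0, 1) of the example sequence a1 a20 a1 a1.
example = (3,) + (0,) * 18 + (1,)
print(sequences_per_composition(example))
```

For this composition the single a_{20} can occupy any of the four positions, so the result is 4.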
Table 1. Number of compositions and sequences composed of 20 letters, of length not greater than L, for L ranging from 3 to 10, and their ratios (rounded)
Length of Peptides (L) | Number of Compositions (A) | Number of Sequences (B) | Ratio B/A |
---|---|---|---|
3 | 1,770 | 8,420 | 5 |
4 | 10,625 | 168,420 | 16 |
5 | 53,129 | 3,368,420 | 63 |
6 | 230,229 | 67,368,420 | 293 |
7 | 888,029 | 1,347,368,420 | 1,517 |
8 | 3,108,104 | 26,947,368,420 | 8,670 |
9 | 10,015,004 | 538,947,368,420 | 53,814 |
10 | 30,045,014 | 10,778,947,368,420 | 358,760 |
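The values in the table are consistent with two closed forms, which the sketch below evaluates. It assumes that the empty composition and the empty sequence are excluded from the counts; the stars-and-bars formula and the function names are not taken from the text.

```python
from math import comb


def n_compositions(n_letters, max_len):
    """Non-empty compositions of length 1..max_len over n_letters letters."""
    # Stars and bars: C(N + L, L) vectors with sum <= L, minus the zero vector.
    return comb(n_letters + max_len, max_len) - 1


def n_sequences(n_letters, max_len):
    """Sequences of length 1..max_len over n_letters letters."""
    return sum(n_letters ** k for k in range(1, max_len + 1))


for max_len in range(3, 11):
    a = n_compositions(20, max_len)
    b = n_sequences(20, max_len)
    print(f"{max_len:2d} {a:>12,} {b:>22,} {round(b / a):>9,}")
```

The rapidly growing ratio B/A is the reason for enumerating compositions rather than sequences when only masses are needed.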
Enumerating peptide compositions
Procedure GenBasic begins enumeration with composition (0, 0,..., 0) and first generates all compositions with n_{ N } ranging from 0 to L. It then sets n_{N-1} to 1, and generates all compositions with n_{ N } ranging from 0 to L - 1, and so on. The last composition in this generation process is (L, 0,..., 0). Essentially, the compositions are generated like N-digit numbers, in ascending order, with the requirement that the sum of the "digits" must not be greater than L. For instance, for N = 3 and L = 2 the procedure generates all compositions up to length 2 in the following order: (0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0), (0, 1, 1), (0, 2, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0), (2, 0, 0).
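The generation order described above can be reproduced with a short recursive sketch. This is a reconstruction from the prose, not the authors' pseudocode; the function and variable names are assumptions.

```python
def gen_basic(n_letters, max_len):
    """Yield all composition vectors of length <= max_len in ascending order."""
    comp = [0] * n_letters

    def fill(i, remaining):
        # Try every admissible count for letter i (0-based index here).
        for n in range(remaining + 1):
            comp[i] = n
            if i == n_letters - 1:
                yield tuple(comp)  # last component set: vector is complete
            else:
                yield from fill(i + 1, remaining - n)

    yield from fill(0, max_len)


print(list(gen_basic(3, 2)))
```

For N = 3 and L = 2 the generator yields the ten compositions in the order listed in the text, from (0, 0, 0) to (2, 0, 0).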
Several changes to procedure GenBasic will make it faster. First, if L is equal to zero on line 3 then there is no need to make assignment on line 4 and call GenBasic on line 5, since it is already known that the rest of the composition will contain zeros only. Second, we can calculate the mass of a composition as soon as its component n_{ i } becomes known, and then pass this mass to the next call of the generating procedure. By doing this, we avoid the need to recalculate the mass of the part of the composition that has not been changed.
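A sketch of the two speed-ups might look as follows. The parameter names (`start`, `m0`) follow the text, with `start` being 1-based; the `prefix` argument and the optional `mass_max` cut-off are assumptions added for illustration, the latter motivated by the mass limit used in the timing tables.

```python
def gen(masses, max_len, start=1, m0=0.0, prefix=(), mass_max=None):
    """Yield (composition, mass) pairs for letters start..N (1-based).

    max_len is the remaining length budget and m0 the mass of the
    already fixed part of the composition.
    """
    n_letters = len(masses)
    if start > n_letters:
        yield prefix, m0
        return
    if max_len == 0:
        # Nothing left to distribute: the rest of the vector is all zeros.
        yield prefix + (0,) * (n_letters - start + 1), m0
        return
    for n in range(max_len + 1):
        # The running mass is extended, never recomputed from scratch.
        m = m0 + n * masses[start - 1]
        if mass_max is not None and m > mass_max:
            break  # masses are positive, so larger n cannot fit either
        yield from gen(masses, max_len - n, start + 1, m,
                       prefix + (n,), mass_max)


masses = [57.02146, 71.03711, 99.06841]  # G, A, V (illustrative alphabet)
for comp, mass in gen(masses, 2):
    print(comp, round(mass, 5))
```

With `mass_max` set, whole branches are skipped as soon as the partial mass exceeds the limit, which is one plausible way a mass limit can shorten the enumeration.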
Enumerating peptide compositions in parallel
To illustrate this idea, consider again our example with N = 3 and L = 2. The primary task is to enumerate the following compositions: (0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0), (0, 1, 1), (0, 2, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0), (2, 0, 0). This can be accomplished by independent enumeration of three subsets of compositions: (i) (0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0), (0, 1, 1), (0, 2, 0); (ii) (1, 0, 0), (1, 0, 1), (1, 1, 0); and (iii) (2, 0, 0). Compositions (i) can be enumerated by setting n_{1} = 0 and calling Gen with parameters (L = 2, start = 2, m_{0} = 0); compositions (ii) can be enumerated by setting n_{1} = 1 and calling Gen with parameters (L = 1, start = 2, m_{0} = m_{1}); and single composition (iii) is enumerated by setting n_{1} = 2 and calling Gen with parameters (L = 0, start = 2, m_{0} = 2m_{1}).
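The decomposition in this example can be checked with a small sketch. The generator below is a simplified stand-in for Gen (an assumption, since the authors' pseudocode is not reproduced here); it takes the remaining length, the 1-based `start` index, the prefix mass `m0`, and the fixed prefix.

```python
def gen(masses, max_len, start, m0, prefix):
    """Yield (composition, mass) for letters start..N with a fixed prefix."""
    if start > len(masses):
        yield prefix, m0
        return
    for n in range(max_len + 1):
        yield from gen(masses, max_len - n, start + 1,
                       m0 + n * masses[start - 1], prefix + (n,))


def enumerate_split(masses, max_len):
    """Enumerate all compositions as max_len + 1 independent subtasks."""
    subtasks = []
    for n1 in range(max_len + 1):
        # Subtask parameters as in the text: (L - n1, start = 2, m0 = n1 * m_1).
        subtasks.append(list(gen(masses, max_len - n1, 2,
                                 n1 * masses[0], (n1,))))
    return subtasks


masses = [57.02146, 71.03711, 99.06841]  # illustrative three-letter alphabet
for i, subset in enumerate(enumerate_split(masses, 2)):
    print(i, [comp for comp, _ in subset])
```

The three subtasks contain 6, 3, and 1 compositions, and their concatenation equals the output of the primary task, so they can be executed independently.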
How can we create a list or table of jobs given the initial job described by parameters (L, 1, 0)? First, job (L, 1, 0) is replaced by L+1 jobs (L, 2, 0), (L - 1, 2, aam[1]),..., (0, 2, aam[1]*L) (Figure 3). If, for a given L, job (L, 2, 0) is executed in acceptable time, we do not need to do anything else, and the table of jobs has been initialized. Otherwise, we can split job (L, 2, 0) into L+1 jobs with start = 3, and similarly split other jobs with start = 2. Thus, for all jobs with start = 2 there is a certain L_{max,2} such that if the first parameter of the job is larger than L_{max,2} then this job should be split into jobs with start = 3. When this is done, we move to the jobs with start = 3 and process them in a similar manner: all jobs that have first parameter larger than L_{max,3} should be split into jobs with start = 4. We continue this until each job in the job table can be executed in acceptable time (see additional notes on this in the Discussion section).
The data exchange between the master and workers (Figure 4, lines 12, 16; Figure 5, lines 2, 5, 6) can be organized by using functions MPI_Send and MPI_Recv from any library implementing MPI [16]. In our implementation, we used Microsoft Visual C++ and the MPI library from Microsoft HPC SDK Pack.
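The splitting rule can be sketched as follows. Jobs are tuples (L, start, m0) as in the text and `aam` is 1-based; the function names, the dictionary `l_max` mapping start to L_{max,start}, and the helper `job_size` are assumptions introduced for illustration.

```python
from math import comb


def split_job(job, aam):
    """Replace job (L, start, m0) by L + 1 jobs with start + 1."""
    max_len, start, m0 = job
    return [(max_len - n, start + 1, m0 + n * aam[start])
            for n in range(max_len + 1)]


def build_job_table(max_len, aam, l_max, max_start):
    """Split jobs whose first parameter exceeds l_max[start], level by level."""
    jobs = split_job((max_len, 1, 0.0), aam)  # the initial job is always split
    for start in range(2, max_start):
        next_jobs = []
        for job in jobs:
            if job[1] == start and job[0] > l_max[start]:
                next_jobs.extend(split_job(job, aam))
            else:
                next_jobs.append(job)
        jobs = next_jobs
    return jobs


def job_size(job, n_letters):
    """Number of compositions covered by a job (length <= L, letters start..N)."""
    max_len, start, _ = job
    k = n_letters - start + 1
    return comb(max_len + k, k)


aam = [None, 57.02146, 71.03711, 99.06841]  # aam[0] unused (1-based indexing)
table = build_job_table(2, aam, {2: 1}, 3)
print(table, sum(job_size(j, 3) for j in table))
```

In this toy case job (2, 2, 0) exceeds L_{max,2} = 1 and is split, giving five jobs whose sizes add up to the ten compositions of the primary task.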
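The master and worker roles can be illustrated without MPI. The sketch below uses Python threads and queues as a stand-in for the send and receive calls, so it shows only the message pattern (jobs out, histograms back) and gives no real parallel speed-up; all names are assumptions, and it is not the authors' pseudocode from Figures 4 and 5.

```python
import queue
import threading
from collections import Counter

BIN = 0.001  # histogram resolution in Da, as mentioned in the text


def job_histogram(masses, max_len, start, m0):
    """Mass histogram of all compositions covered by job (max_len, start, m0)."""
    hist = Counter()

    def gen(remaining, s, m):
        if s > len(masses):
            hist[round(m / BIN)] += 1
            return
        for n in range(remaining + 1):
            gen(remaining - n, s + 1, m + n * masses[s - 1])

    gen(max_len, start, m0)
    return hist


def worker(jobs, results, masses):
    """Worker: take jobs until a None sentinel arrives, send back histograms."""
    while True:
        job = jobs.get()
        if job is None:
            break
        results.put(job_histogram(masses, *job))


def master(masses, job_table, n_workers):
    """Master: hand out jobs, collect one histogram per job, merge them."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results, masses))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for job in job_table:
        jobs.put(job)
    for _ in threads:
        jobs.put(None)  # one stop signal per worker
    total = Counter()
    for _ in job_table:
        total.update(results.get())  # Counter.update adds the counts
    for t in threads:
        t.join()
    return total


masses = [57.02146, 71.03711, 99.06841]
table = [(2, 2, 0.0), (1, 2, masses[0]), (0, 2, 2 * masses[0])]
print(sum(master(masses, table, 2).values()))
```

Because workers pull jobs as they become free, long and short jobs are balanced dynamically, which is the property the job table is designed to exploit.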
Results and Discussion
Table 2. Computation times for enumerating all tryptic compositions up to the length of 30, for different sets of jobs and numbers of worker processes, with and without the maximum mass limit
Task | Number of Workers | Number of Jobs | start | L_{max,2} | L_{max,3} | L_{max,4} | Computation Time (massMax = 3 kDa) | Computation Time (no massMax) |
---|---|---|---|---|---|---|---|---|
L = 30 | 1 | 1 | 1 | - | - | - | 6 h 03 min | 35 h 11 min |
L = 30 | 5 | 30 | 2 | - | - | - | 2 h 12 min | 14 h 52 min |
L = 30 | 30 | 30 | 2 | - | - | - | 1 h 39 min | 13 h 32 min |
L = 30 | 30 | 255 | ≤ 3 | 20 | - | - | 28 min | 5 h 02 min |
L = 30 | 71 | 255 | ≤ 3 | 20 | - | - | 27 min | 4 h 57 min |
L = 30 | 71 | 679 | ≤ 5 | 20 | 24 | 28 | 11 min | 1 h 20 min |
Table 3. Computation times for enumerating all tryptic compositions with different maximum lengths, with and without the maximum mass limit
L | Computation Time (massMax = 3 kDa) | Computation Time (no mass limit) |
---|---|---|
25 | 19 min | 29 min |
30 | 11 min | 1 h 20 min |
35 | 8 min | 5 h 38 min |
40 | 8 min | 38 h 28 min |
45 | 14 min | > 96 h |
50 | 29 min | - |
There may be other modifications to this procedure, depending on the intended use of the generated mass distribution. For example, the maximum number of occurrences of each amino acid in a peptide may be limited by a threshold based on the amino acid and the length and/or mass of the peptide. This would make the generated mass distribution more realistic and may increase the lengths of forbidden zones [14]. Instead of counting the number of peptide compositions, one can count the number of peptide sequences using equation (1). In this case, efficient computation of factorials "on the fly" can be implemented similarly to the computation of peptide masses. If we are interested in enzyme-specific peptides, the procedure can be modified to allow a given number of missed cleavages. The number of amino acids (N) and their monoisotopic masses may vary depending on specific proteases used in sample preparation, possible post-translational or chemical modifications, and other factors. The resolution of the mass histogram (0.001 Da) may be changed as well, without significantly impairing computational speed.
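One possible way to count sequences "on the fly" is to carry a running multinomial weight through the recursion, in the same way the running mass is carried. The sketch below is an assumption about how this could be done, not the authors' implementation; it uses the identity that adding n copies of a new letter to `used` placed letters multiplies the number of arrangements by C(used + n, n).

```python
from math import comb


def count_sequences(n_letters, max_len):
    """Sum of multinomial coefficients over all compositions of length <= max_len."""

    def walk(i, used, weight):
        # weight = number of distinct arrangements of the letters chosen so far
        if i == n_letters:
            return weight
        total = 0
        for n in range(max_len - used + 1):
            # Interleaving n copies of letter i with the `used` placed letters
            # multiplies the arrangement count by C(used + n, n).
            total += walk(i + 1, used + n, weight * comb(used + n, n))
        return total

    return walk(0, 0, 1)


print(count_sequences(3, 2))
```

The total includes the empty sequence, so for N letters and maximum length L it equals 1 + N + N^2 + ... + N^L.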
Thus, if N = 20, L = 40, and start = 2, then C(40, 2)/C(39, 2) ≈ 1.5, which means that Gen(39, 2, 0) will be about 1.5 times faster than Gen(40, 2, 0).
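The quoted ratio can be verified by a short calculation. It assumes that C(L, start) denotes the number of compositions enumerated by Gen(L, start, ·), that is, compositions of length at most L over the K = N - start + 1 letters with indices start,..., N; the symbol K is introduced here for convenience.

```latex
C(L, \mathit{start}) = \binom{L + K}{K}, \qquad K = N - \mathit{start} + 1,

\frac{C(L, \mathit{start})}{C(L - 1, \mathit{start})}
  = \frac{(L + K)!}{K!\,L!} \cdot \frac{K!\,(L - 1)!}{(L + K - 1)!}
  = \frac{L + K}{L},

N = 20,\ \mathit{start} = 2,\ L = 40:\qquad K = 19, \qquad
\frac{C(40, 2)}{C(39, 2)} = \frac{59}{40} = 1.475 \approx 1.5 .
```

Under this assumption the ratio approaches 1 as L grows relative to K, so neighbouring jobs with large L differ less in running time than jobs with small L.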
Initialization of a job table requires the maximum value of parameter start, as well as parameters L_{max,2} , L_{max,3} , etc., to be specified. These can be determined empirically based on the available computational resources and the number of processes that can be executed in parallel. For example, we found that for enumerating tryptic peptide compositions of masses up to 3 kDa by using 72 processes running on 12 Intel Xeon X5650 CPUs the following parameters would give good performance: start ≤ 7, L_{max,2} = 20, L_{max,3} = 24, L_{max,4} = 28, L_{max,5} = 34, L_{max,6} = 40. The tuning of these parameters is important to ensure good performance, as they directly affect the computation time (Table 2).
It should be noted that a job table may have jobs with the same parameters L and start, differing only in m_{0}. For example, consider the case illustrated in Figure 3. Splitting job (L, 2, 0) into L+1 jobs with start = 3 will give us, among others, job (L - 1, 3, aam[2]). On the other hand, splitting job (L - 1, 2, aam[1]) into L jobs with start = 3 gives us job (L - 1, 3, aam[1]). It is clear that execution of these two jobs can be done in one call to function Gen, which should be modified to be able to accept two input masses ${m}_{0}^{1}$, ${m}_{0}^{2}$ instead of m_{0}, and to work with two variables m^{1}, m^{2} instead of m. In a similar manner, execution of more than two jobs may be done in one call to function Gen. This approach should lead to a significant speed-up in computations, although it has not been implemented in our code.
Then these two jobs will have the same m_{0} = 213.111 Da, since tripeptides GGV and AAA are isomeric. If a job table is generated using parameters start ≤ 7, L_{max,2} = 20, L_{max,3} = 24, L_{max,4} = 28, L_{max,5} = 34, L_{max,6} = 40, then about 2% of all jobs will be duplicates for L = 40, about 29% for L = 50, and about 47% for L = 60. When we are only interested in the mass distribution of peptide compositions, there is no need to execute duplicate jobs. If a certain job occurs k times, it is enough to execute it once and then multiply the resulting histogram by k before adding it to the final histogram. However, if we would like to obtain every peptide composition, then we cannot remove duplicate jobs.
At the end of this section, we present Table 3, which shows computation times for enumeration of tryptic compositions for a range of lengths between 25 and 50, with and without the use of a maximum mass limit. The numbers in the second column may seem counterintuitive at first, since, for example, it takes 19 min to generate the distribution for L = 25 and 11 min for L = 30. The explanation, however, lies in the maximum mass limit of 3 kDa. The longest job for the task with L = 25 was L = 24, start = 2, m_{0} = 0, and it executed for 19 min. The longest job for the task with L = 30 was L = 24, start = 2, m_{0} = 285, and it executed for 8 min. The difference of 11 min comes from the fact that more compositions were eliminated in the second case because of the mass limit that was used.
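The handling of duplicate jobs can be sketched as follows. The grouping key rounds m_{0} to three decimals only to absorb floating-point noise, since isomeric prefixes have identical exact masses; this rounding, the function names, and the callable `run_job` (which is expected to return a histogram for one job) are assumptions introduced for illustration.

```python
from collections import Counter


def deduplicate(job_table, digits=3):
    """Group jobs by (L, start, rounded m0); return {key: [first_job, k]}."""
    groups = {}
    for job in job_table:
        max_len, start, m0 = job
        key = (max_len, start, round(m0, digits))
        if key in groups:
            groups[key][1] += 1
        else:
            groups[key] = [job, 1]
    return groups


def merged_histogram(job_table, run_job):
    """Run each distinct job once and weight its histogram by its multiplicity."""
    total = Counter()
    for job, k in deduplicate(job_table).values():
        for mass_bin, count in run_job(*job).items():
            total[mass_bin] += k * count
    return total


G, A, V = 57.02146, 71.03711, 99.06841
jobs = [(5, 4, 2 * G + V), (5, 4, 3 * A), (4, 4, 0.0)]  # GGV and AAA prefixes
print({key: k for key, (_, k) in deduplicate(jobs).items()})
```

This shortcut applies only when the mass histogram is the goal; listing every composition still requires executing each job separately.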
We would like to note, in addition to Table 3, that enumeration of all tryptic peptides having the mass no greater than 3 kDa (the length of these peptides does not exceed 51) took 32 minutes.
Conclusions
In this paper, we presented a detailed description of a parallel method for enumerating all theoretically possible amino acid compositions and discussed different aspects of its implementation. Enumeration of all amino acid compositions is important in several proteomics workflows, including peptide mass fingerprinting, mass defect labeling, mass defect filtering, and de novo peptide sequencing. Given the fact that multi-core computers and computer clusters are becoming increasingly available, it is natural to address this computationally expensive task using a parallelization approach.
We believe that by reducing computational times from hours to minutes, the applicability of the enumeration of all amino acid compositions in various proteomics studies may be significantly improved and extended. We have used the method described in this work to characterize forbidden and quiet zones in the mass distribution of tryptic peptides [14]. In the next step, we plan to apply this method to enhance the accuracy of protein identification in real mass spectrometry data. Our method has been implemented as a 32- and 64-bit Windows application using Microsoft Visual C++ and MPI. It is freely available for download at https://ispace.utmb.edu/users/rgsadygo/Proteomics/ParallelMethod.
Availability and Requirements
• Project Name: PepComp
• Project home page: https://ispace.utmb.edu/users/rgsadygo/Proteomics/ParallelMethod
• Operating System: MS Windows
• Other Requirements: Message Passing Interface, multi-core CPU
• Programming Language: Visual Studio C++
• License: No license needed
Declarations
Acknowledgements
This work was supported in part by HHSN272200800048C NIAID Clinical Proteomics Center (Allan R. Brasier, UTMB) and NIH-NHLBI HHSN268201000037C NHLBI Proteomics Center for Airway Inflammation (Alex Kurosky, UTMB).
Authors’ Affiliations
References
- Dodds ED, An HJ, Hagerman PJ, Lebrilla CB: Enhanced peptide mass fingerprinting through high mass accuracy: Exclusion of non-peptide signals based on residual mass. J Proteome Res 2006, 5: 1195–1203. 10.1021/pr050486o
- Spengler B, Hester A: Mass-Based Classification (MBC) of Peptides: Highly Accurate Precursor Ion Mass Values Can Be Used to Directly Recognize Peptide Phosphorylation. Journal of the American Society for Mass Spectrometry 2008, 19: 1808–1812. 10.1016/j.jasms.2008.08.005
- Lehmann WD, Bohne A, von der Lieth CW: The information encrypted in accurate peptide masses-improved protein identification and assistance in glycopeptide identification and characterization. Journal of Mass Spectrometry 2000, 35: 1335–1341. 10.1002/1096-9888(200011)35:11<1335::AID-JMS70>3.0.CO;2-0
- Jones JJ, Stump MJ, Fleming RC, Lay JO, Wilkins CL: Strategies and data analysis techniques for lipid and phospholipid chemistry elucidation by intact cell MALDI-FTMS. Journal of the American Society for Mass Spectrometry 2004, 15: 1665–1674. 10.1016/j.jasms.2004.08.007
- Hall MP, Ashrafi S, Obegi I, Petesch R, Peterson JN, Schneider LV: 'Mass defect' tags for biomolecular mass spectrometry. Journal of Mass Spectrometry 2003, 38: 809–816. 10.1002/jms.493
- Lu B, Chen T: Algorithms for de novo peptide sequencing using tandem mass spectrometry. Drug Discovery Today: BIOSILICO 2004, 2: 85–90. 10.1016/S1741-8364(04)02387-X
- Spengler B: De novo sequencing, peptide composition analysis, and composition-based sequencing: A new strategy employing accurate mass determination by Fourier transform ion cyclotron resonance mass spectrometry. Journal of the American Society for Mass Spectrometry 2004, 15: 703–714. 10.1016/j.jasms.2004.01.007
- Olson MT, Epstein JA, Yergey AL: De novo peptide sequencing using exhaustive enumeration of peptide composition. J Am Soc Mass Spectrom 2006, 17: 1041–1049. 10.1016/j.jasms.2006.03.007
- Spengler B: Accurate mass as a bioinformatic parameter in data-to-knowledge conversion: Fourier transform ion cyclotron resonance mass spectrometry for peptide de novo sequencing. Eur J Mass Spectrom (Chichester, Eng) 2007, 13: 83–87. 10.1255/ejms.840
- Zubarev RA, Hakansson P, Sundqvist B: Accuracy requirements for peptide characterization by monoisotopic molecular mass measurements. Analytical Chemistry 1996, 68: 4060–4063. 10.1021/ac9604651
- Demirev PA, Zubarev RA: Probing combinatorial library diversity by mass spectrometry. Analytical Chemistry 1997, 69: 2893–2900. 10.1021/ac970049w
- Fenyo D, Qin J, Chait BT: Protein identification using mass spectrometric information. Electrophoresis 1998, 19: 998–1005. 10.1002/elps.1150190615
- Mann M: Useful Tables of Possible and Probable Peptide Masses. Atlanta, GA; 1995.
- Nefedov AV, Mitra I, Brasier AR, Sadygov RG: Examining troughs in the mass distribution of all theoretically possible tryptic peptides. J Proteome Res 2011, 10: 4150–4157. 10.1021/pr2003177
- Graham RL, Knuth DE, Patashnik O: Concrete mathematics: a foundation for computer science. 2nd edition. Reading, Mass: Addison-Wesley; 1994.
- Pacheco P: Parallel Programming with MPI. 1st edition. San Francisco: Morgan Kaufmann; 1996.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.