ProbPS: A new model for peak selection based on quantifying the dependence of the existence of derivative peaks on primary ion intensity
© Zhang et al; licensee BioMed Central Ltd. 2011
Received: 18 April 2011
Accepted: 17 August 2011
Published: 17 August 2011
The analysis of mass spectra suggests that the existence of derivative peaks is strongly dependent on the intensity of the primary peaks. Peak selection from tandem mass spectra is used to filter out noise and contaminant peaks. It is widely accepted that a valid primary peak tends to have high intensity and is accompanied by derivative peaks, including isotopic peaks, neutral loss peaks, and complementary peaks. Existing models for peak selection ignore the dependence between the existence of the derivative peaks and the intensity of the primary peaks: they assume that these two attributes are independent, an assumption that is contrary to real data and prone to error.
In this paper, we present a statistical model, named ProbPS, that quantitatively measures the dependence of a derivative peak's existence on the primary peak's intensity, and we describe how this model is used for peak selection. Our results show that this quantitative understanding can successfully guide the peak selection process. By comparing ProbPS with AuDeNS, we demonstrate the advantages of our method both in filtering out noise peaks and in improving de novo identification. In addition, we present a tag identification approach based on our peak selection method. Our results on a test data set suggest that our tag identification method (876 correct tags in 1000 spectra) outperforms PepNovoTag (790 correct tags in 1000 spectra).
We have shown that ProbPS improves the accuracy of peak selection, which further enhances the performance of de novo sequencing and tag identification. Thus, our model saves valuable computation time and improves the accuracy of the results.
Mass spectrometry is a popular method for protein identification [1–6]. In a typical protein identification experiment using mass spectrometry, proteins are first digested into peptides by an enzyme such as trypsin. Tandem mass spectra of the peptides are then generated using a tandem mass spectrometer (MS/MS). Traditionally, two approaches for peptide identification from MS/MS spectra have been used: database searches [3–8] and de novo sequencing [9–31].
Typical database searches first identify a set of candidate peptides from a protein sequence database, and then construct a theoretical spectrum for each peptide. Finally, the similarity between the theoretical spectrum and the MS/MS experimental spectrum is calculated and the most similar peptides are reported as predictions. Several popular tandem mass spectrometry data analysis programs are of this type, for example SEQUEST, Mascot, X!Tandem, SCOPE, and ProbID. Before comparing a theoretical spectrum against an experimental spectrum, noise peaks in the experimental spectrum should be filtered out. Noise peaks in the spectrum can cause significant differences between the experimental and theoretical spectra and, as a result, correct solutions may be missed.
De novo sequencing, on the other hand, is database-independent because it exclusively uses the information contained in the MS/MS spectrum. Thus, the de novo technique has the potential to identify peptides that are not included in protein sequence databases. Widely used de novo packages include PEAKS [9, 10], PepNovo [11, 12], and others [13–31]. Recently, variants of de novo sequencing, the tag-based methods [32–38], have been developed to identify a segment of a peptide rather than a full-length peptide. After inferring the tags from an MS/MS spectrum, the candidate peptides that do not match any of the tags are filtered out. Therefore, an effective tag identification method can improve identification accuracy and reduce the running time for database searches by reducing the number of candidate peptides. Both de novo methods and tag-based methods usually require high-quality spectra, and do not perform well on spectra with noise peaks. Thus, peak selection is important for the effective use of de novo methods.
Generally speaking, there are three types of peaks in a tandem mass spectrum: i) primary peaks, which are highly likely to be accompanied by a set of derivative peaks caused by the loss of ammonia, the loss of water, or isotopic shift; ii) noise peaks, which arise from instrument signals and other unknown sources; and iii) peaks generated from contaminants. Although isotopic shifts and neutral losses are often observed for peaks generated from contaminants, complementary peaks are seldom observed. This provides a way to distinguish valid peaks from noise and contaminant peaks. In this study, the latter two types are collectively called noise peaks.
Before attempting to identify a peptide from an MS/MS spectrum, it is useful to perform a pre-processing step (called peak selection) to filter out noise and contaminant peaks. A widely accepted peak selection rule utilizes two peak attributes: peak intensity and the existence of derivative peaks. Briefly, a peak accompanied by derivative peaks and an associated complementary peak is likely to be valid; peaks without these features are likely to be noise. Our observations suggest that the existence of derivative peaks and complementary peaks depends strongly on the primary peak intensity. Existing methods for peak selection adopt simple models that assume that these two attributes are independent. This assumption contradicts real data and is error prone. In this study we propose a statistical model, named ProbPS, to capture the interdependence of peak intensity and the existence of derivative peaks in a quantitative manner. Our experimental results demonstrate that our model can improve both peak selection and tag identification.
For a peak p in a tandem mass spectrum,
V = 1 if the peak is a valid primary peak; otherwise V = 0.
I is the peak intensity;
ISO indicates the existence of isotopic shift;
NH3 indicates the existence of a peak that corresponds to the neutral loss of an ammonia;
H2O indicates the existence of a peak that corresponds to the neutral loss of a water;
COMP indicates the existence of a peak that corresponds to a complementary ion;
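The four indicators above can be computed directly from a peak list. The following is a minimal sketch rather than the authors' implementation: the function names, the 0.3 Da tolerance, and the restriction to singly charged fragments are assumptions made for illustration. For singly charged b and y ions, the complementary m/z is taken as [M+H]+ plus one proton mass minus the peak's m/z.

```python
from bisect import bisect_left

# Monoisotopic mass constants in daltons.
NH3 = 17.02655
H2O = 18.01056
ISOTOPE_SPACING = 1.00335  # 13C - 12C mass difference
PROTON = 1.00728


def has_peak(sorted_mz, target, tol):
    """Return True if some m/z in the sorted list lies within tol of target."""
    i = bisect_left(sorted_mz, target - tol)
    return i < len(sorted_mz) and sorted_mz[i] <= target + tol


def derivative_indicators(sorted_mz, mz, precursor_mh, tol=0.3):
    """Compute ISO, NH3, H2O and COMP for one singly charged peak.

    sorted_mz    -- all peak m/z values of the spectrum, sorted ascending
    mz           -- m/z of the candidate primary peak
    precursor_mh -- singly protonated precursor mass [M+H]+ (assumed known)
    tol          -- matching tolerance in Da (hypothetical default)
    """
    return {
        "ISO": has_peak(sorted_mz, mz + ISOTOPE_SPACING, tol),
        "NH3": has_peak(sorted_mz, mz - NH3, tol),
        "H2O": has_peak(sorted_mz, mz - H2O, tol),
        # For singly charged b/y pairs: mz_b + mz_y = [M+H]+ + proton.
        "COMP": has_peak(sorted_mz, precursor_mh + PROTON - mz, tol),
    }
```

A tolerance well below 0.5 Da keeps the ammonia-loss and water-loss windows from overlapping, since the two losses differ by roughly 0.98 Da.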
2.2 The model for peak selection
2.2.1 Quantifying the dependency of derivative ions on primary peak intensity
To investigate the dependency of derivative ions on primary peak intensity we used spectra from the Swed-CAD database, a collection of high quality MS/MS spectra of tryptic peptides. Using SEQUEST, we identified 15,897 unique, annotated peptide-spectrum matches (PSM) to use as a training set.
In Figure 1, an evident nonlinear relationship between primary peak intensity and the existence of isotopic peaks can be observed. The nonlinear relationship can be explained by supposing that, for a single primary ion, its isotopic derivative is observed with probability p. Then, for a total of I primary ions, an isotopic derivative would be observed with probability 1 - (1 - p)^I. Because (1 - p)^I decays exponentially in I, it is reasonable to approximate this relationship using an exponential function. Like P(ISO|I, V = 1), P(ISO|I, V = 0) also approaches 1 as the peak intensity goes to infinity. The curves in Figures 1 and 2 differ only slightly because a contaminant ion might generate an isotopic shift similar to the shift generated by a primary ion. In contrast, a significantly different pattern between P(COMP|I, V = 1) and P(COMP|I, V = 0) is observed (Figures 3 and 4) because, for contaminant ions, complementary peaks are seldom generated.
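The saturation argument can be checked numerically. The sketch below assumes, as in the supposition above, that each of I ions independently produces an observable derivative with probability p, so that at least one derivative appears with probability 1 - (1 - p)^I. It also shows that this is exactly an exponential saturation curve with rate -ln(1 - p). The function names are illustrative and not taken from the paper.

```python
import math


def p_at_least_one(p, n_ions):
    """Probability that at least one of n_ions independent ions
    yields an observable derivative peak."""
    return 1.0 - (1.0 - p) ** n_ions


def exponential_saturation(rate, intensity):
    """Exponential form 1 - exp(-rate * intensity)."""
    return 1.0 - math.exp(-rate * intensity)


def rate_from_p(p):
    """Rate for which the exponential form equals 1 - (1 - p)^I."""
    return -math.log(1.0 - p)


if __name__ == "__main__":
    p = 0.01
    for intensity in (1, 10, 100, 1000):
        exact = p_at_least_one(p, intensity)
        expo = exponential_saturation(rate_from_p(p), intensity)
        print(intensity, round(exact, 4), round(expo, 4))
```

Both columns agree and approach 1 as the intensity grows, which matches the qualitative shape described for P(ISO|I, V = 1).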
2.2.2 Bayesian framework for peak selection
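By Bayes' rule, the posterior probability that a peak is valid, given its intensity I and its derivative-peak attributes D, is P(V = 1|I, D) = p(V = 1) / (p(V = 1) + p(V = 0)), where p(V = 1) = P(I, D|V = 1)P(V = 1) and p(V = 0) = P(I, D|V = 0)P(V = 0).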
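As a small worked illustration, the two joint terms p(V = 1) and p(V = 0) defined above can be normalized into a posterior probability of validity. This sketch assumes that the selection score is that posterior and that the class likelihoods P(I, D|V) have already been estimated elsewhere; the function names and the 0.5 threshold are hypothetical.

```python
def posterior_valid(lik_valid, lik_noise, prior_valid):
    """Posterior P(V = 1 | I, D) from the class likelihoods
    P(I, D | V = 1), P(I, D | V = 0) and the prior P(V = 1)."""
    p1 = lik_valid * prior_valid          # p(V = 1)
    p0 = lik_noise * (1.0 - prior_valid)  # p(V = 0)
    total = p1 + p0
    if total == 0.0:
        raise ValueError("both joint probabilities are zero")
    return p1 / total


def select_peaks(likelihood_pairs, prior_valid, threshold=0.5):
    """Return indices of peaks whose posterior reaches the threshold.

    likelihood_pairs -- one (lik_valid, lik_noise) tuple per peak
    """
    return [
        i
        for i, (lv, ln) in enumerate(likelihood_pairs)
        if posterior_valid(lv, ln, prior_valid) >= threshold
    ]
```

For example, with equal priors a peak whose attributes are four times more likely under V = 1 than under V = 0 receives a posterior of 0.8.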
3.1 Peak selection based on probPS
We also compared probPS against the relevance value used in AuDeNS. AuDeNS uses a framework for de novo sequencing of peptides. It first cleans the input spectrum with a number of data cleaning algorithms ("grass mowers"), followed by a sequencing algorithm. It applies the mowers to the input data, assigning to each input peak i a relevance value r(i), with the default being r(i) = 1. Each mower M uses a relevance factor Rel_M (which can be set as a parameter of AuDeNS), and the relevance value of peak i is then derived from these factors and from M(i), the value assigned to peak i by mower M. The relevance of a solution is then the sum of the relevances of the peaks matched by this solution. AuDeNS finally produces a ranked list of sequence suggestions for a spectrum.
3.2 Improving de novo identification using probPS
De novo peptide identification results after peak selection based on probPS and relevance.
Cross-validation of the performance of probPS and AuDeNS in improving de novo peptide identification (reported measure: number of correctly identified peptides).
3.3 Identifying tags based on probPS
Ordinary tagging methods directly identify tags on a given mass spectrum. For example, PepNovoTag extracts all substrings of the desired length from the PepNovo reconstruction process, and uses a logistic regression model to evaluate these tags. This strategy suffers from noise peaks in the spectrum. Our method uses only the peaks with high probPS values to generate tags. Specifically, our tag identification method (called probTag) starts with the top peaks with high probPS along with their complementary peaks to find the most reliable neighbor peaks.
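To make the idea of "neighbor peaks" concrete, the sketch below links selected peaks whose m/z difference matches one amino acid residue mass and reads off short tags along those links. It is a generic illustration of tag extraction on an already filtered peak list, not the probTag algorithm itself; the function names, the tolerance, and the assumption of singly charged ions of a single series are all hypothetical. Leucine and isoleucine share one mass and are both reported as "L".

```python
# Monoisotopic residue masses in daltons (I is reported as L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
MAX_RESIDUE = max(RESIDUE_MASS.values())


def find_edges(sorted_mz, tol):
    """Map each peak index to (next_index, residue) pairs whose
    m/z gap matches a residue mass within tol."""
    edges = {i: [] for i in range(len(sorted_mz))}
    for i, low in enumerate(sorted_mz):
        for j in range(i + 1, len(sorted_mz)):
            gap = sorted_mz[j] - low
            if gap > MAX_RESIDUE + tol:
                break  # later peaks are even farther away
            for residue, mass in RESIDUE_MASS.items():
                if abs(gap - mass) <= tol:
                    edges[i].append((j, residue))
    return edges


def tags_of_length(selected_mz, length=3, tol=0.02):
    """Return all distinct tags of the given length, sorted."""
    sorted_mz = sorted(selected_mz)
    edges = find_edges(sorted_mz, tol)
    tags = set()

    def walk(node, letters):
        if len(letters) == length:
            tags.add("".join(letters))
            return
        for nxt, residue in edges[node]:
            walk(nxt, letters + [residue])

    for start in range(len(sorted_mz)):
        walk(start, [])
    return sorted(tags)
```

Because only pre-selected peaks enter the graph, spurious links through noise peaks are less likely, which is the motivation given above for tagging after peak selection. Tags read from a y-ion ladder would come out reversed relative to the peptide sequence.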
Comparison of probTag and PepNovoTag (version 3
Tag Identification Performance
It should be noted that both PepNovoTag and probTag combine peak selection with tagging techniques. The comparison is therefore only implicit, indirect evidence of peak selection performance.
4 Conclusion and discussion
In this study, we described the dependence between derivative peaks and primary ion intensity in a quantitative manner. The experimental results demonstrate that this quantitative description can help improve the accuracy of peak selection, which further improves the performance of de novo sequencing and tag identification.
In addition to the peak attributes used in this study, other attributes, such as consecutive ions, may further improve peak selection. In general, valid peaks are more likely to have a consecutive ion than invalid peaks. In future work, we aim to incorporate this attribute into our peak selection method.
This study was funded by the Beijing Municipal Natural Science Foundation (grant 5102029) and the National Natural Science Foundation of China (grant 30800189). We thank Prof. Jonas Grossmann for providing the LCQ spectra data. We also thank our reviewers for their constructive comments and suggestions.
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422(6928):198–207. 10.1038/nature01511
- Baldwin MA: Protein identification by mass spectrometry: issues to be considered. Mol Cell Proteomics 2004, 3: 1–9.
- Yates JR, Eng JK, McCormack AL, Schieltz D: Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Anal Chem 1995, 67(8):1426–36. 10.1021/ac00104a020
- Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20(18):3551–67. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2
- Craig R, Beavis RC: TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20(9):1466–7. 10.1093/bioinformatics/bth092
- Bafna V, Edwards N: SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database. Bioinformatics 2001, 17(Suppl 1):S13–21. 10.1093/bioinformatics/17.suppl_1.S13
- Zhang N, Aebersold R, Schwikowski B: ProbID: a probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data. Proteomics 2002, 2(10):1406–12. 10.1002/1615-9861(200210)2:10<1406::AID-PROT1406>3.0.CO;2-9
- Paizs B, Suhai S: Fragmentation pathways of protonated peptides. Mass Spectrom Rev 2005, 24(4):508–48. 10.1002/mas.20024
- Ma B, Zhang K, Hendrie C, Liang C, Li M, Doherty-Kirby A, Lajoie G: PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom 2003, 17(20):2337–42. 10.1002/rcm.1196
- Ma B, Zhang KZ, Liang CZ: An effective algorithm for peptide de novo sequencing from MS/MS spectra. Journal of Computer and System Sciences 2005, 70(3):418–430. 10.1016/j.jcss.2004.12.001
- Frank A, Pevzner P: PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal Chem 2005, 77(4):964–73. 10.1021/ac048788h
- Frank AM, Savitski MM, Nielsen ML, Zubarev RA, Pevzner PA: De novo peptide sequencing and identification with precision mass spectrometry. J Proteome Res 2007, 6: 114–23. 10.1021/pr060271u
- Taylor JA, Johnson RS: Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom 1997, 11(9):1067–75. 10.1002/(SICI)1097-0231(19970615)11:9<1067::AID-RCM953>3.0.CO;2-L
- Taylor JA, Johnson RS: Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. Anal Chem 2001, 73(11):2594–604. 10.1021/ac001196o
- Johnson RS, Taylor JA: Searching sequence databases via de novo peptide sequencing by tandem mass spectrometry. Mol Biotechnol 2002, 22(3):301–15. 10.1385/MB:22:3:301
- Dancik V, Addona TA, Clauser KR, Vath JE, Pevzner PA: De novo peptide sequencing via tandem mass spectrometry. J Comput Biol 1999, 6(3–4):327–42. 10.1089/106652799318300
- Alves G, Yu YK: Robust accurate identification of peptides (RAId): deciphering MS2 data using a structured library search with de novo based statistics. Bioinformatics 2005, 21(19):3726–32. 10.1093/bioinformatics/bti620
- Fischer B, Roth V, Roos F, Grossmann J, Baginsky S, Widmayer P, Gruissem W, Buhmann JM: NovoHMM: a hidden Markov model for de novo peptide sequencing. Anal Chem 2005, 77(22):7265–73. 10.1021/ac0508853
- Fernandez-de Cossio J, Gonzalez J, Besada V: A computer program to aid the sequencing of peptides in collision-activated decomposition experiments. Comput Appl Biosci 1995, 11(4):427–34.
- Fernandez-de Cossio J, Gonzalez J, Betancourt L, Besada V, Padron G, Shimonishi Y, Takao T: Automated interpretation of high-energy collision-induced dissociation spectra of singly protonated peptides by 'SeqMS', a software aid for de novo sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom 1998, 12(23):1867–78. 10.1002/(SICI)1097-0231(19981215)12:23<1867::AID-RCM407>3.0.CO;2-S
- DiMaggio JPA, Floudas CA: De novo peptide identification via tandem mass spectrometry and integer linear optimization. Anal Chem 2007, 79(4):1433–46. 10.1021/ac0618425
- Yan B, Pan C, Olman VN, Hettich RL, Xu Y: A graph-theoretic approach for the separation of b and y ions in tandem mass spectra. Bioinformatics 2005, 21(5):563–74. 10.1093/bioinformatics/bti044
- Lu B, Chen T: A suboptimal algorithm for de novo peptide sequencing via tandem mass spectrometry. J Comput Biol 2003, 10: 1–12. 10.1089/106652703763255633
- Yan B, Qu YX, Mao FL, Olman VN, Xu Y: PRIME: A mass spectrum data mining tool for de novo sequencing and PTMs identification. Journal of Computer Science and Technology 2005, 20(4):483–490. 10.1007/s11390-005-0483-5
- Zhang Z: De novo peptide sequencing based on a divide-and-conquer algorithm and peptide tandem spectrum simulation. Anal Chem 2004, 76(21):6374–83. 10.1021/ac0491206
- Mo L, Dutta D, Wan Y, Chen T: MSNovo: a dynamic programming algorithm for de novo peptide sequencing via tandem mass spectrometry. Anal Chem 2007, 79(13):4870–8. 10.1021/ac070039n
- Bern M, Goldberg D: De novo analysis of peptide tandem mass spectra by spectral graph partitioning. J Comput Biol 2006, 13(2):364–78. 10.1089/cmb.2006.13.364
- Demine R, Walden P: Sequit: software for de novo peptide sequencing by matrix-assisted laser desorption/ionization post-source decay mass spectrometry. Rapid Commun Mass Spectrom 2004, 18(8):907–13. 10.1002/rcm.1420
- Chi H, Sun RX, Yang B, Song CQ, Wang LH, Liu C, Fu Y, Yuan ZF, Wang HP, He SM, Dong MQ: pNovo: de novo peptide sequencing and identification using HCD spectra. J Proteome Res 2010, 9(5):2713–24. 10.1021/pr100182k
- Bartels C: Fast Algorithm for Peptide Sequencing by Mass-Spectroscopy. Biomedical and Environmental Mass Spectrometry 1990, 19(6):363–368. 10.1002/bms.1200190607
- Chen T, Kao MY, Tepel M, Rush J, Church GM: A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. Journal of Computational Biology 2001, 8(3):325–337. 10.1089/10665270152530872
- Mann M, Wilm M: Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem 1994, 66(24):4390–9. 10.1021/ac00096a002
- Sunyaev S, Liska AJ, Golod A, Shevchenko A: MultiTag: multiple error-tolerant sequence tag search for the sequence-similarity identification of proteins by mass spectrometry. Anal Chem 2003, 75(6):1307–15. 10.1021/ac026199a
- Tabb DL, Saraf A, Yates JR: GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. Anal Chem 2003, 75(23):6415–21. 10.1021/ac0347462
- Day RM, Borziak A, Gorin A: PPM-chain - De novo peptide identification program comparable in performance to sequest. 2004 IEEE Computational Systems Bioinformatics Conference, Proceedings 2004, 505–508.
- Frank A, Tanner S, Bafna V, Pevzner P: Peptide sequence tags for fast database search in mass-spectrometry. Journal of Proteome Research 2005, 4(4):1287–1295. 10.1021/pr050011x
- Shen Y, Tolic N, Hixson KK, Purvine SO, Anderson GA, Smith RD: De novo sequencing of unique sequence tags for discovery of post-translational modifications of proteins. Anal Chem 2008, 80(20):7742–54. 10.1021/ac801123p
- Tabb DL, Ma ZQ, Martin DB, Ham AJ, Chambers MC: DirecTag: accurate sequence tags from peptide MS/MS through statistical scoring. J Proteome Res 2008, 7(9):3838–46. 10.1021/pr800154p
- Falth M, Savitski MM, Nielsen ML, Kjeldsen F, Andren PE, Zubarev RA: SwedCAD, a database of annotated high-mass accuracy MS/MS spectra of tryptic peptides. Journal of Proteome Research 2007, 6(10):4063–4067. 10.1021/pr070345h
- Sun SW, Qiao YT, Zhang H, Bu DB: PI: An open-source software package for validation of the SEQUEST result and visualization of mass spectrum. BMC Bioinformatics 2011, 12.
- Sun S, Yu C, Qiao Y, Lin Y, Dong G, Liu C, Zhang J, Zhang Z, Cai J, Zhang H, Bu D: Deriving the probabilities of water loss and ammonia loss for amino acids from tandem mass spectra. J Proteome Res 2008, 7: 202–8. 10.1021/pr070479v
- Keller A, Purvine S, Nesvizhskii AI, Stolyar S, Goodlett DR, Kolker E: Experimental protein mixture for validating tandem mass spectral analysis. OMICS 2002, 6(2):207–12. 10.1089/153623102760092805
- Grossmann J, Roos FF, Cieliebak M, Liptak Z, Mathis LK, Muller M, Gruissem W, Baginsky S: AUDENS: A tool for automated peptide de novo sequencing. Journal of Proteome Research 2005, 4(5):1768–1774. 10.1021/pr050070a
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.