Better score function for peptide identification with ETD MS/MS spectra
© Liu et al; licensee BioMed Central Ltd. 2010
Published: 18 January 2010
Tandem mass spectrometry (MS/MS) has become the primary way for protein identification in proteomics. A good score function for measuring the match quality between a peptide and an MS/MS spectrum is instrumental for the protein identification. Traditionally the to-be-measured peptides are fragmented with the collision induced dissociation (CID) method. More recently, the electron transfer dissociation (ETD) method was introduced and has proven to produce better fragment ion ladders for larger and more basic peptides. However, the existing software programs that analyze ETD MS/MS data are not as advanced as they are for CID.
To take full advantage of ETD data, in this paper we develop a new score function to evaluate the match between a peptide and an ETD MS/MS spectrum. Experiments on real data demonstrated that this newly developed score function significantly improved the de novo sequencing accuracy of the PEAKS software on ETD data.
A new and better score function for ETD MS/MS peptide identification was developed. The method used to develop our ETD score function can be easily reused to train new score functions for other types of MS/MS data.
In recent years, tandem mass spectrometry (MS/MS) has become a popular technique in proteomics for protein identification. In a typical "bottom-up" approach, proteins are enzymatically digested into short peptides, each of them is measured with MS/MS and produces an MS/MS spectrum. Analytical software is used to identify a peptide for each qualified spectrum. Then the peptides are used for protein identification and characterization. Many software programs (e.g. Mascot, Sequest and PEAKS) have been developed for MS/MS data analysis [1–6].
Database search and de novo sequencing are two main approaches for peptide identification from MS/MS data. Database search requires a protein sequence database, and tries to find a peptide from the database that best explains the MS/MS spectrum. De novo sequencing does not require a sequence database, and instead constructs a peptide sequence from scratch to best explain the MS/MS spectrum. Algorithms and software developments using these two approaches have been one of the research focuses in bioinformatics community recently. Some exemplary work for database search approach can be found in [1, 2, 7–10], and for de novo sequencing approach can be found in [3, 11–18].
The collision induced dissociation (CID) is among the earliest ways developed for peptide fragmentation in a tandem mass spectrometer, and is also the most widely studied in bioinformatics. It is known to produce more y and b-ions than the other ion types. Recently, more fragmentation mechanisms such as infrared multiphoton dissociation (IRMPD) , electron capture dissociation (ECD) , and electron transfer dissociation (ETD)  have been developed. Particularly, ETD is regarded as a promising method complementary to CID because ETD is better suited for sequencing larger and more basic peptides.
Due to the importance of the matching score function, many scoring methods have been proposed. One straightforward approach is based on the match between the real spectrum and the theoretical spectrum generated from a candidate peptide [2, 22]. Another approach is based on the probabilistic model of fragmentation process [9, 12, 23, 24]. These probabilistic models study and employ the statistical differences between peaks produced with different ion types. Frank et al. further exploited these properties by a probabilistic network for collision induced dissociation (CID) data to improve the accuracy of de novo sequencing .
Most of the existing score functions are designed and trained specifically for CID data. ETD spectra differ from CID spectra in the magnitude and the complexity of fragment ion signals. Although a few software tools such as PEAKS and Mascot have been adjusted to work on ETD data, they are better suited for CID data and give poorer results with ETD data. To better utilize the advantage of ETD spectra on longer and more basic peptides, in this paper, we propose a score function for ETD data based on a probabilistic model. The experiments on real data showed that the score function significantly improved the de novo sequencing performance when used in the PEAKS software . Our method for developing this score function is general enough to be used to develop new score functions for CID and other types of MS/MS data.
We used an ETD dataset of 9885 spectra from a complex C. elegans protein mixture digested with trypsin followed by alkylation with iodoacetamide. The spectra were acquired on an LTQ Orbitrap XL ETD. The peptide mixtures were separated with Surveyor LC equipped with MicroAS autosampler using a reversed phase peptide trap and a reversed phase analytical column at a flow rate of 250 nl/min. A gradient of 5 ~ 30% acetonitrile in 90 minutes was employed.
The database search modules in PEAKS 5.0  and Mascot 2.2  were used for peptide identification. The tandem mass spectra were matched against NCBI C. elegans protein sequence database. For both programs, the error tolerances for parent mass and fragment mass were set as 10 ppm and 0.8 Da, respectively. Each program reported no peptide or one best peptide with a confidence score for each spectrum. (The programs may report no peptide for low quality spectra.) PEAKS 5.0 reported 1496 peptides and Mascot 2.2 reported 2950 peptides. We applied several filters to remove low quality spectra. A spectrum is kept only if it satisfies the following three conditions: (1) the PEAKS confidence score is no less than 50%; (2) the Mascot confidence score is no less than 30%; and (3) the two peptides reported by PEAKS and Mascot are the same. Finally, we selected 259 spectrum-peptide pairs from the dataset. Both programs identified the peptides consistently with high confidence scores. Thus, the peptides are very accurate and can be used as positive controls in our experiments. Among them, 130 spectrum-peptide pairs are used for training and the remaining 129 pairs are used for testing.
An ETD spectrum often contains several peaks generated from its precursor ions. These peaks often have high intensities, and may affect the performance of the matching score function. Thus, we removed these precursor ion peaks from the spectra in data preparation.
Most de novo sequencing software developed so far is for CID data and does not work for ETD data. The only software available to us for performance comparison using ETD data is the PEAKS software . Recent versions of PEAKS have a built-in set of parameters for ETD, which was adjusted from its CID parameters. We compared our score function with the function used in PEAKS 5.1. To make a fair comparison, the result of our new score function is obtained by simply replacing the score function of PEAKS 5.1 without making any change to the algorithm. Thus, the improvements reported in the following are purely caused by the difference in the score functions.
Comparison of the score function in PEAKS and the new score function.
Predictions with correct subsequences of length at least x
x = 3
In addition, we search the maximum correct consecutive subsequences in the predicted peptides. For each predicted peptide P, the length of the maximum correct consecutive subsequence is denoted by L max (P). We count the number of predicted peptides with L max (P) ≥ l, for l = 3, 4, ..., 10. The error tolerance for the position of a predicted amino acid subsequence is 0.6 Da, and the amino acids leucine(L) and isoleucine(I) are considered as the same since their mass values are the same.
The testing data contain 129 peptides and 1859 amino acids. Using the function in PEAKS 5.1 and our score function, we acquired two sets of predicted peptides, each contains 129 de novo peptides. The accuracy and the maximum correct consecutive subsequences of the resulting peptides are reported in Table 1. The results show that the new score function significantly improves the accuracy of de novo sequencing from 32.5% to 55.3% for Accuracy I, and from 34.2% to 55.9% for Accuracy II. In PEAKS 5.1, about 56% of predicted peptides contain a corrected subsequence with length at least 3. For the new score function, more than half of the predicated peptides contain a correct subsequence with length at least 6. From the experiments, we conclude that the new score function has better performance in de novo sequencing with ETD data than the score function in PEAKS 5.1.
We note that the independence assumption in our score function development in Section Methods is not very realistic because different types of ions generated by the fragmentation between the same pair of adjacent residues of the peptide are often correlated to each other. To more accurately model this, a probabilistic network (or Bayesian network) can be used similarly to . We also tried this approach on our data and found no apparent improvement over the model used in this paper. One possible reason is that our training data is of limited size, and not sufficient to train the many parameters of the probabilistic network accurately. Another possible reason is the following. The probabilistic network is advantageous over our simple model only in dealing with the dependence between different types of fragment ions at the same location (between the same pair of adjacent residues). However, when a de novo sequencing method makes an error at a location, normally there is either no fragment ion or only one low fragment ion at the location. Under this condition, the dependence between different types of fragment ions is weak. As a result, a model with the independence assumption would work as well as a more sophisticated one without this assumption.
In this paper, we proposed a significance level measurement to distinguish signal peaks from noise. Based on this, a score function for de novo sequencing with ETD data was defined. Experiments on real data showed that our score function greatly improved the score function used in the latest PEAKS 5.1 software. The method used to develop our ETD score function can be easily reused to train new score functions for other types of MS/MS data.
We first introduce a novel significance level measurement to evaluate the likelihood that a peak is signal. The significance level has a better ability to distinguish signal and noise peaks than the original relative intensities given by the spectrum. Then, we study the distributions of the significance levels of different types of fragment ions, and define a likelihood ratio score for each fragment ion. Finally, we combine all the fragment ion scores of a peptide together to provide a peptide score function.
The peak significance level
The peptide fragmentation mechanism shown in Figure 1(a) and 1(b) is overly simplified. In a real mass spectrum (Figure 2), there are also many noise peaks that cannot be explained by the six types of fragment ions. Meanwhile, not all of these fragment ions form strong peaks in the spectrum. Some are either absent or too low to be distinguished from the noise. Different spectra, and even different regions of the same spectrum, may have different noise and signal levels. Consequently, the intensity or the relative intensity given by the original spectrum do not accurately reflect the likelihood for a peak being signal. Thus, it is necessary to develop a better measure to reflect this likelihood. In this section we focus on this task and develop a measure called significance level.
According to our experience, four features of a peak may be associated with the significance of the peak: (1) global rank, (2) local rank, (3) global intensity ratio, and (4) local intensity ratio. These four features are explained in the following, respectively.
The global rank of a peak is the number of peaks (in the same spectrum) that are higher than or equal to the current peak. A smaller global rank means a more significant peak. However, a spectrum of a longer peptide normally has more signal peaks than a spectrum of a shorter peptide. With this observation, we further adjust the global rank by dividing the original ranking number by the size of the peptide, which is estimated by dividing the precursor ion mass by the average mass of an amino acid residue (approximately 112 Da).
The local rank and the local intensity ratio are defined similarly as their global versions, except that only the peaks within ±57 Da (the mass of the smallest amino acid residue) difference to the current peak are examined. More specifically, the local rank is the number of peaks which are within ±57 Da difference, and are higher than or equal to the current peak. Suppose the intensity of the highest peak within ±57 Da is h1, the global reference intensity is h2, the intensity of the current peak is h, then the local intensity ratio of the current peak is defined as max (1, min(h1, h2)/h).
where c rg , c rl , c tg and c tl are the coefficients for the features. It should be noted that under this definition, a smaller value for significance level means a stronger peak. This is somewhat counterintuitive but chosen deliberately in order to assign the same value 0 to the strongest peak in every spectrum.
Distribution of the peak significance levels for different ion types
It is well known that different types of mass spectrometers (especially if the peptide fragmentation methods are different) tend to produce different ion types. For example, CID produces more b and y-ions than other ion types, while ETD produces more c and z'-ions. Thus, when using a peptide to explain a spectrum, matching a high-intensity peak with a z'-ion or a y-ion will have different contributions towards the likelihood that the peptide is correct. For ETD data, because we expect high z'-ions, the matching with a z'-ion will contribute more than the matching with a y-ion. Thus it is necessary to study the distribution of the significance level for different ion types. In this section we study this with ETD data. The distribution will help us to develop our final score function.
The probabilistic model and score function
The significance of this match, denoted as λ (x1,1, x1,2, ..., xk, n-1), is therefore evaluated by the ratio between these two likelihoods. In this section we make the assumption that the events that t i , jhas intensity x i , jare independent to each other. The independence assumption makes our score function easier to calculate while not sacrificing much performance on analysis accuracy. A discussion on the model where the events are dependent can be found in Section Discussion.
Here fi, jis one of the f functions trained for the j-th ion of the ion type t i . In practice, for each ion type of interest, we learn five f functions, for the first and second ions near the N-terminus, the last and second last ions at the C-terminus, and the middle of the peptide, respectively. The log λ defined in (1) is our score function.
The frequencies that different fragment ion types have observed peaks in the training ETD data.
The work was supported in part by China High-tech R&D Program 2008AA02Z313, NSERC (Canada), and a start up grant at University of Waterloo.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S1.
- Perkins DN, Pappin DJC, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20: 3551–3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2View ArticlePubMedGoogle Scholar
- Eng JK, McCormack AL, John R Yates I: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry 1994, 5(11):976–989. 10.1016/1044-0305(94)80016-2View ArticlePubMedGoogle Scholar
- Ma B, Zhang K, Hendrie C, Liang C, Li M, Doherty-Kirby A, Lajoie G: PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrum. Rapid Commun Mass Spectrom 2003, 17: 2337–2342. 10.1002/rcm.1196View ArticlePubMedGoogle Scholar
- Craig R, Beavis RC: TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20(9):1466–1467. 10.1093/bioinformatics/bth092View ArticlePubMedGoogle Scholar
- Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, Yang X, Shi W, Bryant SH: Open mass spectrometry search algorithm. J Proteome Res 2004, 3(5):958–964. 10.1021/pr0499491View ArticlePubMedGoogle Scholar
- Tanner S, Shu H, Frank A, Wang LC, Zandi E, Mumby M, Pevzner PA, Bafna V: InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Analytical Chemistry 2005, 77(14):4626–4639. 10.1021/ac050102dView ArticlePubMedGoogle Scholar
- Mann M, Wilm M: Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Analytical Chemistry 1994, 66(24):4390–4399. 10.1021/ac00096a002View ArticlePubMedGoogle Scholar
- Clauser KR, Baker P, Burlingame AL: Role of accurate mass measurement (+/- 10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Analytical Chemistry 1999, 71(14):2871–2882. 10.1021/ac9810516View ArticlePubMedGoogle Scholar
- Bafna V, Edwards N: SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database. Bioinformatics 2001, 17(Suppl 1):S13-S21.View ArticlePubMedGoogle Scholar
- Field HI, Fenyö D, Beavis RC: RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimises protein identification, and archives data in a relational database. Proteomics 2002, 2: 36–47. 10.1002/1615-9861(200201)2:1<36::AID-PROT36>3.0.CO;2-WView ArticlePubMedGoogle Scholar
- Chen T, Kao MY, Tepel M, Rush J, Church GM: A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. Journal of Computational Biology 2001, 8(3):325–337. 10.1089/10665270152530872View ArticlePubMedGoogle Scholar
- Dančík V, Addona TA, Clauser KR, Vath JE, Pevzner PA: De novo peptide sequencing via tandem mass spectrometry. Journal of Computational Biology 1999.Google Scholar
- Frank A, Pevzner P: PepNovo: de novo peptide sequencing via probabilistic network modeling. Analytical Chemistry 2005, 77: 964–973. 10.1021/ac048788hView ArticlePubMedGoogle Scholar
- Hines WM, Falick AM, Burlingame AL, Gibson BW: Pattern-based algorithm for peptide sequencing from tandem high energy collision-induced dissociation mass spectra. Journal of the American Society for Mass Spectrometry 1992, 3: 326–336. 10.1016/1044-0305(92)87060-CView ArticlePubMedGoogle Scholar
- Lu B, Chen T: A suboptimal algorithm for de novo peptide sequencing via tandem mass spectrometry. Journal of Computational Biology 2003, 10: 1–12. 10.1089/106652703763255633View ArticlePubMedGoogle Scholar
- Ma B, Zhang K, Liang C: An effective algorithm for the peptide de novo sequencing from MS/MS spectrum. Journal of Computer and System Science 2005, 70: 418–430. 10.1016/j.jcss.2004.12.001View ArticleGoogle Scholar
- Taylor JA, Johnson RS: Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom 1997, 11(9):1067–1075. 10.1002/(SICI)1097-0231(19970615)11:9<1067::AID-RCM953>3.0.CO;2-LView ArticlePubMedGoogle Scholar
- Taylor JA, Johnson RS: Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. Analytical Chemistry 2001, 73(11):2594–2604. 10.1021/ac001196oView ArticlePubMedGoogle Scholar
- Little DP, Speir JP, Senko MW, O'Connor PB, McLafferty FW: Infrared multiphoton dissociation of large multiply charged ions for biomolecule sequencing. Analytical Chemistry 1994, 66(18):2809–2815. 10.1021/ac00090a004View ArticlePubMedGoogle Scholar
- McLafferty FW, Horn DM, Breuker K, Ge Y, Lewis MA, Cerda B, Zubarev RA, Carpenter BK: Electron capture dissociation of gaseous multiply charged ions by Fourier-transform ion cyclotron resonance. Journal of the American Society for Mass Spectrometry 2001, 12(3):245–249. 10.1016/S1044-0305(00)00223-3View ArticlePubMedGoogle Scholar
- Syka JEP, Coon JJ, Schroeder MJ, Shabanowitz J, Hunt DF: Peptide and protein sequence analysis by electron transfer dissociation mass spectrometry. PNAS USA 2004, 101(26):9528–9533. 10.1073/pnas.0402700101PubMed CentralView ArticlePubMedGoogle Scholar
- Fu Y, Yang Q, Sun R, Li D, Zeng R, Ling CX, Gao W: Exploiting the kernel trick to correlate fragment ions for peptide identification via tandem mass spectrometry. Bioinformatics 2004, 20(12):1948–1954. 10.1093/bioinformatics/bth186View ArticlePubMedGoogle Scholar
- Havilio M, Haddad Y, Smilansky Z: Intensity-based statistical scorer for tandem mass spectrometry. Analytical Chemistry 2003, 75(3):435–444. 10.1021/ac0258913View ArticlePubMedGoogle Scholar
- Elias JE, Gibbons FD, King OD, Roth FP, Gygi SP: Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nature Biotechnology 2004, 22(2):214–219. 10.1038/nbt930View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.