Better score function for peptide identification with ETD MS/MS spectra
© Liu et al. 2010
Published: 18 January 2010
Skip to main content
© Liu et al. 2010
Published: 18 January 2010
Tandem mass spectrometry (MS/MS) has become the primary way for protein identification in proteomics. A good score function for measuring the match quality between a peptide and an MS/MS spectrum is instrumental for the protein identification. Traditionally the to-be-measured peptides are fragmented with the collision induced dissociation (CID) method. More recently, the electron transfer dissociation (ETD) method was introduced and has proven to produce better fragment ion ladders for larger and more basic peptides. However, the existing software programs that analyze ETD MS/MS data are not as advanced as they are for CID.
To take full advantage of ETD data, in this paper we develop a new score function to evaluate the match between a peptide and an ETD MS/MS spectrum. Experiments on real data demonstrated that this newly developed score function significantly improved the de novo sequencing accuracy of the PEAKS software on ETD data.
A new and better score function for ETD MS/MS peptide identification was developed. The method used to develop our ETD score function can be easily reused to train new score functions for other types of MS/MS data.
In recent years, tandem mass spectrometry (MS/MS) has become a popular technique in proteomics for protein identification. In a typical "bottom-up" approach, proteins are enzymatically digested into short peptides, each of them is measured with MS/MS and produces an MS/MS spectrum. Analytical software is used to identify a peptide for each qualified spectrum. Then the peptides are used for protein identification and characterization. Many software programs (e.g. Mascot, Sequest and PEAKS) have been developed for MS/MS data analysis [1–6].
Database search and de novo sequencing are two main approaches for peptide identification from MS/MS data. Database search requires a protein sequence database, and tries to find a peptide from the database that best explains the MS/MS spectrum. De novo sequencing does not require a sequence database, and instead constructs a peptide sequence from scratch to best explain the MS/MS spectrum. Algorithms and software developments using these two approaches have been one of the research focuses in bioinformatics community recently. Some exemplary work for database search approach can be found in [1, 2, 7–10], and for de novo sequencing approach can be found in [3, 11–18].
The collision induced dissociation (CID) is among the earliest ways developed for peptide fragmentation in a tandem mass spectrometer, and is also the most widely studied in bioinformatics. It is known to produce more y and b-ions than the other ion types. Recently, more fragmentation mechanisms such as infrared multiphoton dissociation (IRMPD) , electron capture dissociation (ECD) , and electron transfer dissociation (ETD)  have been developed. Particularly, ETD is regarded as a promising method complementary to CID because ETD is better suited for sequencing larger and more basic peptides.
Due to the importance of the matching score function, many scoring methods have been proposed. One straightforward approach is based on the match between the real spectrum and the theoretical spectrum generated from a candidate peptide [2, 22]. Another approach is based on the probabilistic model of fragmentation process [9, 12, 23, 24]. These probabilistic models study and employ the statistical differences between peaks produced with different ion types. Frank et al. further exploited these properties by a probabilistic network for collision induced dissociation (CID) data to improve the accuracy of de novo sequencing .
Most of the existing score functions are designed and trained specifically for CID data. ETD spectra differ from CID spectra in the magnitude and the complexity of fragment ion signals. Although a few software tools such as PEAKS and Mascot have been adjusted to work on ETD data, they are better suited for CID data and give poorer results with ETD data. To better utilize the advantage of ETD spectra on longer and more basic peptides, in this paper, we propose a score function for ETD data based on a probabilistic model. The experiments on real data showed that the score function significantly improved the de novo sequencing performance when used in the PEAKS software . Our method for developing this score function is general enough to be used to develop new score functions for CID and other types of MS/MS data.
We used an ETD dataset of 9885 spectra from a complex C. elegans protein mixture digested with trypsin followed by alkylation with iodoacetamide. The spectra were acquired on an LTQ Orbitrap XL ETD. The peptide mixtures were separated with Surveyor LC equipped with MicroAS autosampler using a reversed phase peptide trap and a reversed phase analytical column at a flow rate of 250 nl/min. A gradient of 5 ~ 30% acetonitrile in 90 minutes was employed.
The database search modules in PEAKS 5.0  and Mascot 2.2  were used for peptide identification. The tandem mass spectra were matched against NCBI C. elegans protein sequence database. For both programs, the error tolerances for parent mass and fragment mass were set as 10 ppm and 0.8 Da, respectively. Each program reported no peptide or one best peptide with a confidence score for each spectrum. (The programs may report no peptide for low quality spectra.) PEAKS 5.0 reported 1496 peptides and Mascot 2.2 reported 2950 peptides. We applied several filters to remove low quality spectra. A spectrum is kept only if it satisfies the following three conditions: (1) the PEAKS confidence score is no less than 50%; (2) the Mascot confidence score is no less than 30%; and (3) the two peptides reported by PEAKS and Mascot are the same. Finally, we selected 259 spectrum-peptide pairs from the dataset. Both programs identified the peptides consistently with high confidence scores. Thus, the peptides are very accurate and can be used as positive controls in our experiments. Among them, 130 spectrum-peptide pairs are used for training and the remaining 129 pairs are used for testing.
An ETD spectrum often contains several peaks generated from its precursor ions. These peaks often have high intensities, and may affect the performance of the matching score function. Thus, we removed these precursor ion peaks from the spectra in data preparation.
Most de novo sequencing software developed so far is for CID data and does not work for ETD data. The only software available to us for performance comparison using ETD data is the PEAKS software . Recent versions of PEAKS have a built-in set of parameters for ETD, which was adjusted from its CID parameters. We compared our score function with the function used in PEAKS 5.1. To make a fair comparison, the result of our new score function is obtained by simply replacing the score function of PEAKS 5.1 without making any change to the algorithm. Thus, the improvements reported in the following are purely caused by the difference in the score functions.
Comparison of the score function in PEAKS and the new score function.
Predictions with correct subsequences of length at least x
x = 3
In addition, we search the maximum correct consecutive subsequences in the predicted peptides. For each predicted peptide P, the length of the maximum correct consecutive subsequence is denoted by L max (P). We count the number of predicted peptides with L max (P) ≥ l, for l = 3, 4, ..., 10. The error tolerance for the position of a predicted amino acid subsequence is 0.6 Da, and the amino acids leucine(L) and isoleucine(I) are considered as the same since their mass values are the same.
The testing data contain 129 peptides and 1859 amino acids. Using the function in PEAKS 5.1 and our score function, we acquired two sets of predicted peptides, each contains 129 de novo peptides. The accuracy and the maximum correct consecutive subsequences of the resulting peptides are reported in Table 1. The results show that the new score function significantly improves the accuracy of de novo sequencing from 32.5% to 55.3% for Accuracy I, and from 34.2% to 55.9% for Accuracy II. In PEAKS 5.1, about 56% of predicted peptides contain a corrected subsequence with length at least 3. For the new score function, more than half of the predicated peptides contain a correct subsequence with length at least 6. From the experiments, we conclude that the new score function has better performance in de novo sequencing with ETD data than the score function in PEAKS 5.1.
We note that the independence assumption in our score function development in Section Methods is not very realistic because different types of ions generated by the fragmentation between the same pair of adjacent residues of the peptide are often correlated to each other. To more accurately model this, a probabilistic network (or Bayesian network) can be used similarly to . We also tried this approach on our data and found no apparent improvement over the model used in this paper. One possible reason is that our training data is of limited size, and not sufficient to train the many parameters of the probabilistic network accurately. Another possible reason is the following. The probabilistic network is advantageous over our simple model only in dealing with the dependence between different types of fragment ions at the same location (between the same pair of adjacent residues). However, when a de novo sequencing method makes an error at a location, normally there is either no fragment ion or only one low fragment ion at the location. Under this condition, the dependence between different types of fragment ions is weak. As a result, a model with the independence assumption would work as well as a more sophisticated one without this assumption.
In this paper, we proposed a significance level measurement to distinguish signal peaks from noise. Based on this, a score function for de novo sequencing with ETD data was defined. Experiments on real data showed that our score function greatly improved the score function used in the latest PEAKS 5.1 software. The method used to develop our ETD score function can be easily reused to train new score functions for other types of MS/MS data.
We first introduce a novel significance level measurement to evaluate the likelihood that a peak is signal. The significance level has a better ability to distinguish signal and noise peaks than the original relative intensities given by the spectrum. Then, we study the distributions of the significance levels of different types of fragment ions, and define a likelihood ratio score for each fragment ion. Finally, we combine all the fragment ion scores of a peptide together to provide a peptide score function.
The peptide fragmentation mechanism shown in Figure 1(a) and 1(b) is overly simplified. In a real mass spectrum (Figure 2), there are also many noise peaks that cannot be explained by the six types of fragment ions. Meanwhile, not all of these fragment ions form strong peaks in the spectrum. Some are either absent or too low to be distinguished from the noise. Different spectra, and even different regions of the same spectrum, may have different noise and signal levels. Consequently, the intensity or the relative intensity given by the original spectrum do not accurately reflect the likelihood for a peak being signal. Thus, it is necessary to develop a better measure to reflect this likelihood. In this section we focus on this task and develop a measure called significance level.
According to our experience, four features of a peak may be associated with the significance of the peak: (1) global rank, (2) local rank, (3) global intensity ratio, and (4) local intensity ratio. These four features are explained in the following, respectively.
The global rank of a peak is the number of peaks (in the same spectrum) that are higher than or equal to the current peak. A smaller global rank means a more significant peak. However, a spectrum of a longer peptide normally has more signal peaks than a spectrum of a shorter peptide. With this observation, we further adjust the global rank by dividing the original ranking number by the size of the peptide, which is estimated by dividing the precursor ion mass by the average mass of an amino acid residue (approximately 112 Da).
The local rank and the local intensity ratio are defined similarly as their global versions, except that only the peaks within ±57 Da (the mass of the smallest amino acid residue) difference to the current peak are examined. More specifically, the local rank is the number of peaks which are within ±57 Da difference, and are higher than or equal to the current peak. Suppose the intensity of the highest peak within ±57 Da is h 1, the global reference intensity is h 2, the intensity of the current peak is h, then the local intensity ratio of the current peak is defined as max (1, min(h 1, h 2)/h).
where c rg , c rl , c tg and c tl are the coefficients for the features. It should be noted that under this definition, a smaller value for significance level means a stronger peak. This is somewhat counterintuitive but chosen deliberately in order to assign the same value 0 to the strongest peak in every spectrum.
It is well known that different types of mass spectrometers (especially if the peptide fragmentation methods are different) tend to produce different ion types. For example, CID produces more b and y-ions than other ion types, while ETD produces more c and z'-ions. Thus, when using a peptide to explain a spectrum, matching a high-intensity peak with a z'-ion or a y-ion will have different contributions towards the likelihood that the peptide is correct. For ETD data, because we expect high z'-ions, the matching with a z'-ion will contribute more than the matching with a y-ion. Thus it is necessary to study the distribution of the significance level for different ion types. In this section we study this with ETD data. The distribution will help us to develop our final score function.
The significance of this match, denoted as λ (x 1,1, x 1,2, ..., x k, n-1), is therefore evaluated by the ratio between these two likelihoods. In this section we make the assumption that the events that t i , j has intensity x i , j are independent to each other. The independence assumption makes our score function easier to calculate while not sacrificing much performance on analysis accuracy. A discussion on the model where the events are dependent can be found in Section Discussion.
Here f i, j is one of the f functions trained for the j-th ion of the ion type t i . In practice, for each ion type of interest, we learn five f functions, for the first and second ions near the N-terminus, the last and second last ions at the C-terminus, and the middle of the peptide, respectively. The log λ defined in (1) is our score function.
The frequencies that different fragment ion types have observed peaks in the training ETD data.
The work was supported in part by China High-tech R&D Program 2008AA02Z313, NSERC (Canada), and a start up grant at University of Waterloo.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.