A statistical model-building perspective to identification of MS/MS spectra with PeptideProphet
© Ma et al.; licensee BioMed Central Ltd. 2012
Published: 5 November 2012
Skip to main content
© Ma et al.; licensee BioMed Central Ltd. 2012
Published: 5 November 2012
PeptideProphet is a post-processing algorithm designed to evaluate the confidence in identifications of MS/MS spectra returned by a database search. In this manuscript we describe the "what and how" of PeptideProphet in a manner aimed at statisticians and life scientists who would like to gain a more in-depth understanding of the underlying statistical modeling. The theory and rationale behind the mixture-modeling approach taken by PeptideProphet is discussed from a statistical model-building perspective followed by a description of how a model can be used to express confidence in the identification of individual peptides or sets of peptides. We also demonstrate how to evaluate the quality of model fit and select an appropriate model from several available alternatives. We illustrate the use of PeptideProphet in association with the Trans-Proteomic Pipeline, a free suite of software used for protein identification.
In mass-spectrometry shotgun proteomics, the first phase of analysis is the identification of peptides in complex biological mixtures digested by enzymes such as trypsin. Dependent on the peptides in the biological mixture, an experiment will produce a certain number of spectra (call it N). MS/MS spectra are individually matched to peptides by searching through a database of peptides predicted from the genome of the organism. The way the searches are performed can be constrained using different search parameters, such as the number of tryptic termini (NTT), number of missed cleavages (NMC) or the mass difference of the observed precursor ion mass and the weight of the theoretical peptide (ΔM).
We will discuss PeptideProphet in the context of two database search algorithms: SEQUEST  and Tandem with the k-score plugin [2, 3]. SEQUEST attempts to determine a direct correlation between an observed spectrum and sequences of amino acids in a protein sequence database. Typical quantities associated with SEQUEST include: XCorr, ΔCn, SpRank. Typical quantities associated with Tandem with the k-score plugin include: logDot (logarithm of dot product between observed and theoretical spectrum) and ΔDot. PeptideProphet can be used with any database search algorithm that returns a quantitative score.
Given a database search algorithm, every spectrum that is observed will be scored against the peptides in the database. For each spectrum, the highest scoring peptide (depending on the scoring criterion) is typically chosen as the best match. The best match is the potential peptide sequence that generated its corresponding observed spectrum. Thus, we have N spectra that have been matched to a peptide and we will refer to these spectra as identified spectra.
The necessity of PeptideProphet arises because the spectra are subject to noise making it difficult to determine if the peptide that it is matched to is correct. The spectrum itself is generated from a peptide sequence and peaks can be missing or reduced in intensity. Because the spectrum that is being generated is subject to noise the database-based criterion will vary when comparing theoretical spectra to observed spectra. Additionally, when searching the database, the correct peptide sequence may be absent. Because of this noise, how do we determine confidence in an identified spectrum? Traditional standards (such as just accepting all above XCorr > 2.5) does not reflect the quality of the identification. Such a rule may accept too many incorrectly identified spectra. Thus, statistical inference is needed to model the presence of noise.
PeptideProphet  is a post-processing and rescoring algorithm for determining confidence in identified spectra found using a database search. PeptideProphet is one of the first methods for the assessment of confidence. It is based on a probability model and an Empirical Bayesian approach to model fitting. It is now not a single model, but a family of models .
Rescoring: produce a score which reflects the quality of an identified spectrum, while summarizing multiple quantities, such as XCorr and ΔCn or logDot and ΔDot. The rescoring separates incorrectly and correctly identified spectra scores as much as possible.
Modeling: produce a probability-based model for the distribution of correctly and incorrectly identified spectra. The model must be then fit to the scores of all identified spectra.
Evaluation of the Quality of Fit: determine how well the scores fit the probability-based model.
Evaluation of confidence in individual identified spectra using the posterior probability.
Evaluation of confidence in sets of identified spectra: produce a cutoff on the scores to determine a set of correctly identified spectra while controlling the False-Discovery Rate, defined as the expected proportion of false positives.
We will first discuss the basic version of PeptideProphet and then discuss the three extensions.
This dataset uses the first LC-MS/MS replicate file from the Western Consortium of the National Cancer Institute's Mouse Models of Human Cancer . The data was obtained using the Multiple Affinity Removal System and was matched using a semitryptic SEQUEST search against an IPI human protein database allowing a 3 Dalton mass tolerance and 0-1 missed cleavage sites. More details on the spectra can be seen in .
This dataset uses spectra generated from a linear ion trap Fourier transform instrument that was published in . In particular the spectra from Mixture 3 was used, where 16 known trypsin-digested proteins from different mammals were analyzed. Spectra were also matched using a semitryptic SEQUEST search against a database file with the 16 known proteins concatenated with human influenza proteins allowing a 3 Dalton mass tolerance and 0-2 missed cleavage sites. Matches to human influenza proteins are known to be incorrect. More details on the dataset can be seen in .
PeptideProphet works with the observed spectra as the experimental unit where we have N observed spectra with N being generally large (in the thousands or more). Since the number of spectra N is typically very large, the identified spectra can be viewed as the underlying population.
An observed score is interpreted as a test statistic. In statistics the summarized score S is called a test statistic because it is the function of the observed experimental unit that is being used to answer our hypotheses.
PeptideProphet assumes that the test statistic comes from a mixture of two distributions: one from the distribution of correct identifications, and the other from the distribution of the incorrect identifications. The distributions may be characterized by a few parameters (parametric) or many parameters (semi or non-parametric).
Inference: confidence is determined for individual spectra or sets of spectra.
If the researcher is interested in a set of spectrum identifications, the False Discovery Rate should be controlled.
Table of multiple hypothesis testing quantities
# Not Rejected
# True Nulls
# True Alternatives
N - N 0
N - R
An alternative confidence rate that is rarely used is the False Positive Rate (FPR). The False Positive Rate, given a cutoff δ is the expected proportion of all truly incorrectly identified spectra that are considered to be correctly identified. From the terms in Table 1 it is represented by
Many users prefer the q-value which is the minimum False Discovery Rate required for a score s i to be considered significant. It is represented by , where Γ represents the set of all possible cutoff scores . This confidence measure is used to describe a score s i at a single point but examines the False Discovery Rate of all possible scores. Unlike the False Discovery Rate, the q-value is a monotonic quantity with respect to the score cutoff.
If the researcher is interested in specific spectrum identifications the posterior error probability is most commonly used as it quantifies the confidence of a single identified spectrum.
The posterior error probability represents which we also denote as PEP. In other words using a probability model for S i , we can find the probability of an identified spectrum being incorrect given its test statistic. Note that we can also calculate which is the probability of an identified spectrum being correct given its test statistic. The posterior error probability is also called the local false discovery rate (locfdr) [10, 11].
Alternatively the p-value can be used. If s i is the ith observed score then the p-value represents , or the probability of observing a score equal to or greater than s i assuming that the ith identified spectrum was incorrectly identified. The p-value is similar to the FPR in that the p-value is the probability of observing a score equal to or greater than s i assuming that it is one of the N 0 truly null hypotheses.
such that S > 0 for correctly identified spectra and S < 0 for incorrectly identified spectra.
In the basic version of PeptideProphet the β's are estimated empirically from a controlled mixture and are dependent on the precursor ion charge (i.e. a separate discriminant function was trained for 1+, 2+, 3+ precursor ion charges).
The last equality is due to the fact that all scores are independent and identically distributed (iid). Due to different discriminant functions being used for each charge, a different sampling distribution and set of parameters are produced for each precursor ion charge (we will refer to this simply as the charge).
Note that the density functions of f T =0, NTT , f T =0, NMC , f T =0, ΔM , f T =1, NTT , f T =1, NMC , and f T =1, ΔM are discrete. It is assumed, conditional on the identified spectrum being incorrect or correct, that the members of (S i , NTT i , NMC i , δM i ) are independent, as shown above.
PeptideProphet is considered an Empirical Bayesian approach because it uses each identified spectrum twice: once to estimate via the Expectation-Maximimzation  algorithm the parameters of the sampling distribution (π0, μ, σ, α, β, and γ) and second to estimate the confidence in the correctness of an identified spectrum. The EM-algorithm iterates between two steps, called the E-step and the M-step in order to estimate the value of model parameters. With a large enough set of identified spectra (say 100), the EM-algorithm will always converge . The algorithm starts with initial values of model parameters π0, μ, σ, α, β, and γ.
In the E-step, given the estimated values of the model parameters, the probability of each score being correct (or incorrect) is calculated. Given a single observed score s i and its correctness status T i , usage of Bayes Theorem yields which corresponds to the ratio of the Gamma density scaled by π 0 over the sum of the Gamma and Normal densities scaled by π 0 and 1 - π 0 at score s i .
Deviations of the assumptions, or a low number of identified spectra can lead to an inadequate or unstable model fit and incorrect conclusions. This can be diagnosed by visual inspection, and also by the bootstrap. We recommend using visual inspection over goodness of fit tests as tests do not explore the specific fitting issues that may influence subsequent inference of the identified spectra. In fact goodness of fit tests simply attempt to summarize the goodness of fit into one summary statistic whereas we are typically interested in the fit at certain locations of the mixture distribution. There are several visual attributes of the mixture distribution that researchers should be aware of and some remedies for them.
Do the empirical scores follow the fitted curves well? Particular attention needs to be paid to the tails of the distributions, especially the right tail of the distribution of scores of incorrect identifications (red) and the left tail of the distribution of scores of correct identifications (blue). This is often of most interest to researchers as the identified spectra in these regions are considered to be borderline correct or incorrect. In the case of Figure 2b the curves fit the histogram well but in Figure 2a there are many mismatches in the bars and the fitted curves. The culprit of these mismatches is likely due to the small number of spectra. The right portion of the Normal distribution is fit with approximately only 30 spectra. If the data is comprised of a large number of spectra but is deviating from the fitted curves, robust procedures can also be considered and will be discussed later.
Do the curves highly overlap? Although high overlap does not necessarily indicate a poor fit it will lead to smaller sets of confidently identified spectra. Overlaps that occur in situations of highly constrained searches can be remedied with techniques in later sections. Overlap in the case of a small number of spectra (Figure 2a) may be remedied by artificially adding observations using decoys which will also be subsequently demonstrated.
An issue that is not commonly addressed however is the number of identified spectra available to fit the mixture model. The number of identified spectra required to fit a reliable model depends highly on the separation and the form of the observed scores. A statistical approach to examine the stability of the fitted model can be done via the bootstrap.
Bootstrapping can be performed by sampling with replacement B samples (spectra) where each is of size N from the original dataset. At least 100 to 500 bootstrapped samples are recommended. For each bootstrapped sample b, we can refit the PeptideProphet model to receive bootstrapped estimates of , , , , , and . The bias, variance, and mean squared error (MSE) of the procedure used to estimate a parameter can be found using the bootstrapped estimates. In the case of μ, the bootstrap bias estimate is . Large biases imply that the estimation procedure is systematically over or underestimating the true value of a parameter. Note that as B increases the bias does not move towards 0. The bootstrap variance estimate is defined as . Smaller variability is desired. The bias and variability of an estimation procedure is often summarized using the mean squared error, which is .
In order to determine the correctness of the spectrum identifications, a decision rule is defined where any spectrum identification with a score above δ is concluded to be correct. In many experiments we are interested in the statistical properties of the list of spectrum identifications with scores above δ.
The False Positive Rate for a cutoff t can also be estimated using the area under the fitted frequency curve of the distribution of scores for incorrect identifications as seen in Figure 5. Mathematically this is equivalent to the p-value, or since each incorrect score follows the same distribution. Note that the False Positive Rate ignores the distribution of scores for correct identifications.
The estimation of the q-value at a specific point ρ requires the estimation of the False Discovery Rate at every point s i from i = 1, 2, ..., N. The q-value for a point ρ is the minimum False Discovery Rate among all points s i such that s i ≤ ρ. The estimated False Discovery Rate can be found using the model-based estimates or by interpreting each posterior error probability as a local false discovery rate. The q-value is often useful if a monotonically increasing error rate is desired for decreasing cutoff values. For example, in the case of Figure 5 suppose the experimenter was only interested in scores around 4. Using model-based estimates, the estimated False Discovery Rate with a cutoff at 4 is 0.01503874 but the estimate of the False Discovery Rate with a cutoff at 3.8 is 0.01489971 suggesting that the error rate is lower for a lower cutoff value. To avoid this issue, the q-value can be used as it finds the minimum False Discovery Rate at each cutoff value. The q-value at 4 is 0.01489939 (found using increments of 0.01 searching all FDR values from -4 to 4).
We now discuss the estimation of the posterior error probability and the p-value. These measures are properties of a single spectrum and are synonymous to performing a single hypothesis test. In Figure 5 the posterior error probability and p-value only apply to spectra at a single point.
According to Bayes Theorem the posterior probability of T i = 0 (our hypotheses of interest) given its test statistic is . Following the Empirical Bayesian step where parameters are estimated we have that . Because the posterior error probability is equivalent to the local false discovery rate we also have that .
The p-value is estimated as which is the right tail-end of the Gamma density past s i .
The posterior error probability may be preferred over the p-value because it also yields an estimate for the probability of an identified spectrum to being correct (1 - PEP). The advantage of the p-value is that it only requires the use of the distribution of scores for incorrect identifications as it ignores the distribution of scores for correct identifications. Notice that in Figure 5 the p-value at a score of 1 is a low value of 0.004 but that the Posterior Error Probability at 1 is a much higher value at 0.156.
When there is significant overlap between the two density functions or a low number of identified spectra it is difficult for the EM-algorithm to estimate π 0 and the parameters of the Gamma and Normal distributions. In this case PeptideProphet employs the Target-Decoy approach to better estimate the Gamma distribution. We first describe the two forms of Target-Decoy: the concatenated strategy and the separate strategy [16, 17]. The objective of both strategies is to introduce decoys in order to estimate the error rate since decoys are known to be incorrectly identified spectra. Reversed sequences (decoy sequences) are commonly generated by taking the target database and reversing each target sequence. Alternative methods are to use randomized sequences where amino acid sequences are generated using a pre-specified probability distribution .
In the concatenated Target-Decoy strategy each spectrum is searched in a single database that is composed of both target and decoy sequences. This involves competition between the best correct peptide sequence, the best incorrect forward peptide sequence, and the best (incorrect) decoy peptide sequence. Hits where the best incorrect decoy peptide sequence is found to be the match are used to estimate the FDR.
In the separate Target-Decoy strategy each spectrum is searched once in the forward database and searched again independently in the decoy database. The distribution of scores from the peptides identified via the decoy database is used to estimate the form of the distribution of incorrectly identified spectra. This approach ignores competition between forward and decoy sequences.
The quality of fit of the Gamma and Normal distributions may rely on how the database is searched (constrained versus unconstrained search) or the search algorithm that is used . As is the case in many statistical modelings, there is no guarantee that the scores of the identified spectra necessarily follow the Gamma and Normal distributions. Previously, decoys were used to estimate parameters of pre-specified distributions. Now we will use decoys for data-dependent estimation of the distributions themselves.
An example of this approach can be seen in Figure 7. The parametric fit of the distribution of scores for correct identifications clearly deviates from the Normal curve as the mode of the correct hits is shifted to the right. The semiparametric approach produces a curve that more robustly fits the left-skewed distribution of scores for correct identifications.
To avoid overfitting, this approach should only be used in the cases of strong deviations between the fitted distributions and the observed scores, such as the parametric fit (dashed-lines) in Figure 7. Overfitting typically occurs in experiments with a small number of spectra, such as in Figure 2a. Overfitting can be checked via bootstrapping by seeing if bootstrapped samples do not reflect the same need for a semiparametric fit at certain score values. This can be done via quantile to quantile plots or by checking mean squared errors. If users anticipate good separation, parametric PeptideProphet is often sufficient for practical purposes.
Overlap in the distributions of scores of correct and incorrect identifications can be due to a suboptimal scoring function, which does not discriminate well between the properties of correct and incorrect identifications. This often occurs in cases of constrained searches where the database that is searched is much smaller than the unconstrained search space that was used to find the coefficients in the fixed discriminant function. For additional information on constrained versus unconstrained searches, see . A solution to this is to adapt the discriminant function to each experiment or search approach which can improve the separation between the distribution of scores for incorrect and correct identifications .
The improvement of the adaptive discriminant function over the fixed discriminant function for the Controlled Mixture dataset in a constrained search space is displayed in Figure 9. Only tryptic peptides with a narrow mass tolerance were searched.
This approach is also useful for incorporating lower ranked peptide matches (i.e. for a given spectrum, instead of only considering the best peptide sequence match, use the new discriminant function to also rescore peptide sequence matches that ranked close to the best peptide sequence match). Every time a new discriminant function is estimated (when the I are averaged) a new summarized score is calculated for the top 5 (can be changed of course) Peptide Matches for every spectrum. The highest scoring peptide-spectrum-match is used in the training of the next discriminant function.
The Trans-Proteomic Pipieline (TPP) is an open source program developed at the Institute for Systems Biology designed for complete proteomic analysis starting from spectrum identification to protein identification and quantification and can be downloaded from http://sourceforge.net/projects/sashimi/. In this section we assume that search results have already been converted to pepXML files, which is the standard input for PeptideProphet. A discussion of this can be found at (http://tools.proteomecenter.org/wiki/index.php?title=TPP_Tutorial).
We present an example using the Human Plasma dataset where the spectra are searched through Tandem with the k-score plugin with TPP version 4.4. PeptideProphet automatically models all precursor ion charges and outputs the probability of correct identification. A mixture model using the Normal for the distribution of correct scores and a Gumbel distribution for the distribution of incorrect scores.
1. False Discovery Rate: estimates of the False Discovery Rate can be obtained three ways. In the upper-right hand corner of Figure 12 estimated False Discovery Rates under the "Error" column is given for 1 - PEP values under the "Min Prob" column. In other words, "Min Prob" represents the minimum posterior probability of being correct in order to conclude that an identified spectrum is correct. For example, a "Min Prob" of 0.95 implies that only identified spectra with PEP's lesser than 0.05 are considered correct or that (1 - PEP) must be greater than 0.95 to be considered correct.
The calculation takes into account that among the correctly identified spectra, it is estimated that a majority of the identified spectra have 0 missed cleavages.
A third approach of estimating the False Discovery Rate is to download all posterior probabilities, convert them to posterior error probabilities (local false discovery rates) by taking the complement, define a cutoff point t and then to calculate .
2. False Positive Rate or p-value: using the Gumbel's estimated parameters, the false positive rate can be found by looking at the tail area.
3. q-value: the q-value at a specific point δ can be calculated by estimating the False Discovery Rate at the score value of every identified spectra and then by finding the minimum False Discovery Rate among all scores .
4. Posterior Error Probability and Local False Discovery Rate: these are most easily found by finding the complement of the values in the first column of Figure 11 or by looking at the complement of "prob" at the bottom center of Figure 12. Note that these probabilities automatically incorporate NTT, NMC, and ΔM. If the experimenter was interested in the posterior error probability of a score independent of NTT, NMC, and ΔM, this can still be calculated using the estimated model parameters.
All inference for semisupervised and semiparametric PeptideProphet cases are identical. Inference would be identical for the adaptive version of PeptideProphet but it is not implemented in TPP at this time but is available from the authors upon request.
Following the execution of PeptideProphet the next step in analysis is often the identification of proteins present in the sample. In this different analysis, the experimental unit changes from being a spectrum to a peptide. TPP can be used to run ProteinProphet, a computational algorithm that can utilize PeptideProphet's estimated probabilities to determine the probability for the presence of proteins in two steps . In the first step the posterior probability of a peptide being correctly identified from PeptideProphet is decreased for peptides that are the only peptide linked to a protein and increased for peptides that are linked to proteins explained by many peptides. In the second step the probability of a protein being in the sample is calculated as the probability that at least one of its associated peptides were identified in the sample. This is if is the adjusted probability of a peptide being in the sample where i is indexed from 1 to the number of peptides linked to the protein in question.
PeptideProphet is available for use on the Trans-Proteomic Pipeline with many other database search tools (X!Tandem, MASCOT, OMSSA, Phenyx, ProbID, InsPecT, MyriMatch). The statistical approach of PeptideProphet is generalizable to any database search algorithm that returns a quantitative score for each identified spectrum.
Although we used the Gamma and Normal distributions to model the components of the PeptideProphet model, there are no limitation to the choice of parametric distribution for describing the distributions of scores for incorrect and correct identifications in PeptideProphet. The Gumbel distribution, with parameters μ and β is another common distribution used for the distribution of scores of incorrect identifications. A generalization of the Gumbel distribution is the Extreme Value Distribution. Additional information, such as the NTT, may be incorporated into the summarized score by using a different machine learning approach instead of a discriminant function. Quantities like the NTT were left out of the summarized discriminant score due to its discrete nature. For example, the logistic regression function would allow discrete and continuous covariates to be transformed into a single summarized score while separating identified spectra with T = 0 from identified spectra with T = 1.
The Target-Decoy approach used in this manuscript is an approach that pioneered the use of decoys for the estimation of the False Discovery Rate and its results are often compared to other techniques . For the estimation of the False Discovery Rate PeptideProphet and Target-Decoy methods in our experience produce similar results especially when the semisupervised version of PeptideProphet is used as its search approach is similar to the concatenated version of Target-Decoy. In fact, PeptideProphet can be considered as an extension of the concatenated version of Target-Prophet because of its additional modeling objectives. PeptideProphet simply has distributional assumptions and can be used to estimate confidence of individual spectrum identifications or sets of spectrum identifications (local and global FDR estimates) whereas target-decoy is limited to sets (global FDR estimate only). Also, if there is heavy overlap Target-Decoy will outperform basic PeptideProphet but Semisupervised PeptideProphet and Target-Decoy should be similar.
An alternative approach which relaxes the parametric assumptions is the variable component approach which uses an unknown mixture of Gaussians to represent the incorrect and correct distributions of scores . The correct distribution is represented by a mixture distribution of k 0 normal distributions (that may have different means and variances) and the incorrect distribution is represented by a separate mixture distribution of k 1 normal distributions. Parameters k 0 and k 1 are unknown. Each score s i is a member of either the overall correct or incorrect distributions, but are then further assigned as a member to one of the sub-components of the mixture representing the correct or incorrect distribution. Gibbs sampling is used to estimate the forms of the sub-components (which also suggests the complexity of this approach). Although the variable component and kernel methods perform similarly there are minor computational and modeling issues to consider . The advantages to the variable component method are that: (1) The model is still parametric, which may help reduce the chance of overfitting, (2) Kernel estimation may over fit, especially if the bandwidth is too low, and (3) It does not completely rely on decoys for the negative whereas kernel density estimation uses decoys only for estimating the negative distribution. The advantages of the kernel approach are that: (1) The variable component method is much more computationally intensive and more complicated (and thus the Kernel Estimation is less intensive), (2) The variable component method requires the specification of priors, and (3) Kernel estimation is very well known and commonly used.
The authors would like to thank Hyungwon Choi for providing R-code for the PeptideProphet model fits. The work was supported in part by the NSF CAREER award DBI-1054826 to OV, and by NIH grants R01-GM-094231 and R01-CA-126239 to AN.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 16, 2012: Statistical mass spectrometry-based proteomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S16.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.