A mass accuracy sensitive probability based scoring algorithm for database searching of tandem mass spectrometry data

Xu, Hua; Freitas, Michael A

doi:10.1186/1471-2105-8-133

Research article
Open access
Published: 20 April 2007


BMC Bioinformatics volume 8, Article number: 133 (2007)


Abstract

Background

Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has become one of the most used tools in mass spectrometry based proteomics. Various algorithms have since been developed to automate the process for modern high-throughput LC-MS/MS experiments.

Results

A probability based statistical scoring model for assessing peptide and protein matches in tandem MS database search was derived. The statistical scores in the model represent the probability that a peptide match is a random occurrence based on the number or the total abundance of matched product ions in the experimental spectrum. The model also calculates probability based scores to assess protein matches. Thus the protein scores in the model reflect the significance of protein matches and can be used to differentiate true from random protein matches.

Conclusion

The model is sensitive to high mass accuracy and implicitly takes mass accuracy into account during scoring. High mass accuracy not only reduces false positives, but also improves the scores of true positive matches. The algorithm is incorporated in an automated database search program, MassMatrix.

Background

Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has become one of the most used tools in mass spectrometry based proteomics [1]. In shotgun proteomics, peptides are separated using liquid chromatography and introduced into a mass spectrometer via an ionization interface. In tandem mass spectrometry, the peptide precursor ions are isolated and fragmented via collision-induced dissociation (CID) [2] with inert gas, electron capture dissociation (ECD) [3], surface induced dissociation (SID) [4] and/or electron transfer dissociation (ETD) [5]. The resulting tandem MS spectra contain product ion signatures that relate back to the identity of the peptide precursor ions [2, 6, 7].

Various algorithms have since been developed to automate the process for modern high-throughput LC-MS/MS experiments. These algorithms fall into two categories: de novo sequence inference and database searching [8]. The first approach identifies peptide sequences directly from the tandem MS data [9, 10]. This type of algorithm is usually computationally expensive and limited by the mass accuracy of the tandem MS data [8]. Database searching algorithms identify peptides by comparison with a protein sequence database [11]. In this approach, all potential peptides are created from the sequence database via in silico digestion with proteases. Theoretical spectra containing product ion series appropriate for the given fragmentation technique are created for the peptides. All tandem MS spectra in the data set are then compared with the theoretical spectra [1]. Because of their relatively low computational expense and higher compatibility with low mass accuracy spectra, database searching programs are more commonly used at this time [12].

There are also various probability based post-search methods used to statistically curate search results from database search algorithms [13, 14]. These methods estimate the accuracy of protein/peptide identifications and compare search results from different algorithms based on a common standard. However, many of these models involve empirical parameters, such as scores from correlative scoring algorithms. Therefore they may possess biases as a result of parameter optimization or model training.

The key difference between algorithms lies in how each approach scores a potential match between experimental and theoretical spectra [11, 15–25]. We recently developed a database searching program, MassMatrix, that uses a mass accuracy sensitive statistical model for scoring. This approach is separate and distinct from algorithms that filter matches based on mass accuracy. In the latter, high mass accuracy is used to restrict the search to tandem mass spectra whose precursor ion falls within the stated mass tolerance, and filtering product ions by high mass accuracy can further reduce the likelihood of a random match [26, 27]. In contrast, a mass accuracy sensitive scoring model implicitly takes mass accuracy into account during scoring. The model is rigorously derived and sensitive to the search tolerance, which is determined by the accuracy of the mass spectrometer. High accuracy improves the sensitivity and selectivity of searches. The statistical scores represent the probability that a match is a random occurrence. In addition, a novel statistically derived algorithm that rigorously calculates protein scores from the probability based peptide scores has been developed. Thus the protein scores reflect the significance of protein hits and can be used to differentiate true protein hits from random ones. Herein we describe the statistical models.

Results

Multiple scoring algorithms

The peptide matching algorithm contains two independent scoring models: a descriptive model and a statistical model. These models are used to calculate three distinct scores for a peptide match. Each of the scores may be used independently to ascertain the quality of the match. Because each score is distinct, the combination of scores is useful for validating each peptide match. The two models and their application to calculating peptide match scores are described in detail below.

Descriptive peptide scoring model

Descriptive scores do not strictly convey any statistical relevance and may be prone to bias due to the scoring parameters. However, they have proven to be useful and generally augment probability based scores [13]. The descriptive model used herein to calculate peptide match scores (S) is shown in eqn. 1.

S = 100 \frac{\sum_{i = 1}^{n_{match}} I_{i} r_{match}^{2} \max (0, \frac{n_{match} - 3}{n_{match}})}{\sqrt{L_{pep}}}

(1)

I_i is defined as the standardized abundance of the i^th product ion in the experimental spectrum (calculated by dividing the abundance of the i^th product ion by the maximum abundance in the spectrum), $\sum_{i = 1}^{n_{match}} I_{i}$ is the total standardized abundance of matched product ions, n_match is the number of matched product ions, r_match is the ratio of the standardized abundance of matched product ions to the total standardized abundance of the experimental spectrum, and L_pep is the length of the peptide in amino acids. Each of these factors contributes to the overall score as follows: $\sum_{i = 1}^{n_{match}} I_{i}$ evaluates the quality of the match, $r_{match}^{2}$ introduces a penalty for unmatched product ions, $\max (0, \frac{n_{match} - 3}{n_{match}})$ is an arbitrary penalty for matches with poor fragmentation, $\sqrt{L_{pep}}$ is an additional penalty for peptides with long sequences, and the constant 100 is used arbitrarily to scale the scores.

By default, scores for a spectrum with three or fewer matched product ions will be 0 due to the arbitrary penalty, because the factor $\frac{n_{match} - 3}{n_{match}}$ is zero or negative in that case. However, the minimum number of matched ions may be changed to any value. Reducing this number is especially valuable for the analysis of singly charged peptides that have characteristic C-terminal aspartic acid fragmentation [28]. The penalty for peptide length is included to normalize the scores. Peptides with longer sequences have more fragment ions and higher empirical scores than peptides with shorter sequences. The penalty results in long and short sequences having similar scores for matches of similar quality. The choices of the square for the term r_match and the square root for the term L_pep were empirically determined from the evaluation of tandem MS data sets collected from LCQ and LTQ-Orbitrap mass spectrometers.
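The terms of eqn. 1 can be sketched in a few lines of Python. This is only an illustration of the formula as written, not the MassMatrix implementation; the function name, argument names and the `min_matched` parameter (standing in for the adjustable constant 3) are assumptions made for this example.

```python
import math

def descriptive_score(matched_abundances, total_abundance, peptide_length,
                      min_matched=3):
    """Descriptive peptide score S following eqn. 1.

    matched_abundances: standardized abundances I_i of the matched product ions
    total_abundance:    total standardized abundance of the experimental spectrum
    peptide_length:     L_pep, number of amino acids
    min_matched:        constant in the fragmentation penalty (3 in eqn. 1);
                        exposing it as a parameter is an assumption of this sketch
    """
    n_match = len(matched_abundances)
    if n_match == 0 or total_abundance <= 0 or peptide_length <= 0:
        return 0.0
    matched_sum = sum(matched_abundances)
    r_match = matched_sum / total_abundance            # fraction of abundance explained
    penalty = max(0.0, (n_match - min_matched) / n_match)
    return 100.0 * matched_sum * r_match ** 2 * penalty / math.sqrt(peptide_length)

# Hypothetical spectrum: four matched ions carrying half of the total abundance.
print(descriptive_score([1.0, 0.5, 0.5, 0.5], 5.0, 16))
```

With the default constant, the penalty term evaluates to zero whenever n_match is 3 or less, so such matches score 0 regardless of abundance.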

Descriptive protein score

For "true" matches, we assume that the scores are normally distributed with a mean of 20 and a variance of 25. This arbitrary distribution estimates the distributions observed from analysis of several datasets. The expected contribution of each match to the protein score will be $S \times \int_{0}^{S} \frac{e^{- {(x - 20)}^{2} / 50}}{5 \sqrt{2 π}} d x$ . Thus, the protein score from the descriptively scored matches is calculated from eqn. 2.

protein score = \sum_{i} S_{i} \times \int_{0}^{S_{i}} \frac{e^{- {(x - 20)}^{2} / 50}}{5 \sqrt{2 π}} d x

(2)
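As a rough sketch of eqn. 2, the integral of the assumed normal density (mean 20, variance 25) from 0 to S can be evaluated with the error function. The function names below are hypothetical, and the code only mirrors the formula as stated.

```python
import math

def normal_cdf(x, mean=20.0, sd=5.0):
    """Cumulative distribution of the assumed N(20, 25) score distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def descriptive_protein_score(peptide_scores):
    """Eqn. 2: sum of S_i weighted by the integral of the density from 0 to S_i."""
    total = 0.0
    for s in peptide_scores:
        weight = normal_cdf(s) - normal_cdf(0.0)   # integral from 0 to S_i
        total += s * weight
    return total

# Hypothetical peptide scores for one protein.
print(descriptive_protein_score([35.0, 22.0, 8.0]))
```

Under this weighting a peptide score well above 20 contributes almost its full value, while a score well below 20 contributes very little.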

Probability based peptide scoring model

In addition to the empirical score, a mass accuracy sensitive probability based scoring model was derived to evaluate peptide matches. The model determines the likelihood that a match between an experimental spectrum and a theoretical spectrum is a random occurrence. Consider a pair of spectra: one experimental and one theoretical. W_e and W_t denote their precursor masses, respectively. In addition, the experimental data contain information regarding the abundance of product ions, I_i, for each precursor, W_e. The model ultimately tests the following two hypotheses: the null hypothesis, H₀, states that a match is random, i.e. the theoretical spectrum is independent of the experimental one; and the alternative hypothesis, H_A, states that the match is not random, i.e. the theoretical spectrum is related to the experimental one.

Scoring the match is performed in two stages: 1) match W_e against W_t within the specified precursor ion mass accuracy and 2) match all product ions in the experimental spectrum against the theoretical spectrum within the specified mass accuracy. Both stages rely on calculating the probability that the occurrence of an ion within a fixed mass window could be a random occurrence ( $p = \frac{mass window}{mass range}$ ).

To match the experimental precursor with that of theoretical peptide we first define the variable q:

q = \begin{cases} 1 & W_{e} \text{ and } W_{t} \text{ match} \\ 0 & \text{otherwise} \end{cases}

(3)

Under H₀, the probability that any precursor ion match (q = 1) could be random is given in eqn. 4.

p_{1} = \frac{2 τ_{pep}}{Π}

(4)

In the above equation, τ_pep is the mass accuracy of the precursor ion and Π is the detection range for the precursor ion. For each precursor ion the mass window is defined as ± τ_pep, i.e. a total width of 2 × τ_pep. Thus q has a Bernoulli (p₁) distribution under H₀. If the precursor ion masses of the pair of spectra do not match (q = 0), the second stage is skipped. If q = 1, we proceed to stage 2, where we test the match of the experimental product ion spectrum against the theoretical spectrum.

The variable b_i is defined for each product ion, i, in the experimental spectrum as follows:

b_{i} = \begin{cases} 1 & \text{peak } i \text{ matches} \\ 0 & \text{otherwise} \end{cases}

(5)

Under H₀, all matched product ions are random and independent occurrences. The probability that a product ion randomly matches any of the product ions in the theoretical spectrum is:

p_{2} = \frac{Π_{theo}}{Π}

(6)

where Π_theo is the total coverage of the detection range for all product ions in the theoretical spectrum and Π is the MS/MS detection range. It is assumed that Π is the same as the precursor ion mass range. However, for instruments that have a dynamic detection range, assuming a fixed value of Π will result in more conservative scores. For each product ion in the theoretical spectrum, the mass window is ± τ_msms, i.e. a total width of 2 × τ_msms. If we assume there is no overlap in the product ion mass windows, then Π_theo is calculated using the following equation

Π_theo = 2 m × τ_msms.

(7)

The probability that any single matched product ion (b_i = 1) could be random can be calculated using eqn. 8

p_{2} = \frac{2 m \times τ_{msms}}{Π}

(8)

where τ_msms is the product ion mass accuracy and m is the number of product ions within the detection range in the theoretical spectrum. Because the theoretical spectrum is independent of the experimental spectrum under H₀, all b_i (i = 1, 2, ..., n) are assumed to be independent and identically distributed Bernoulli (p₂) variables under H₀. The model is then used to perform two distinct tests. Each uses a different measure to evaluate the quality of a match: the number of matched product ions, x, and the total abundance of matched product ions, Y.
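To make the dependence of p₁ and p₂ on mass accuracy concrete, eqns. 4 and 8 can be evaluated for a few tolerances. The sketch below uses hypothetical values (a 2000 Da detection range and 20 theoretical product ions), and the cap of p₂ at 1 is an assumption added here for very wide tolerances where the non-overlap assumption fails; it is not part of the derivation above.

```python
def precursor_random_match_probability(tau_pep, detection_range):
    """Eqn. 4: p1 = 2 * tau_pep / Pi (tolerance and range in the same mass units)."""
    return 2.0 * tau_pep / detection_range

def product_random_match_probability(m, tau_msms, detection_range):
    """Eqn. 8: p2 = 2 * m * tau_msms / Pi, assuming non-overlapping mass windows."""
    p2 = 2.0 * m * tau_msms / detection_range
    return min(p2, 1.0)  # cap is an assumption of this sketch, not from the paper

# Hypothetical instrument settings: 2000 Da range, 20 theoretical product ions.
for tolerance in (1.0, 0.1, 0.01):
    p1 = precursor_random_match_probability(tolerance, 2000.0)
    p2 = product_random_match_probability(20, tolerance, 2000.0)
    print(f"tolerance={tolerance} Da  p1={p1:.2e}  p2={p2:.2e}")
```

Both probabilities shrink in proportion to the tolerance, which is how mass accuracy enters the scores implicitly.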

The pp score

The model is used to evaluate whether the number of matched product ions in an experimental spectrum could be a random occurrence. For all spectra whose precursor ion masses match, i.e. q = 1, the variable x is defined as the number of product ions in the experimental spectrum that match the theoretical spectrum (eqn. 9), where b_i (i = 1, 2, ..., n) is defined in eqn. 5 and n is the number of product ions in the experimental spectrum.

x = \sum_{i = 1}^{n} b_{i}

(9)

Under H₀, all b_i are independent and identically distributed Bernoulli (p₂) variables. Therefore, x will have a binomial (n, p₂) distribution. Consequently the probability mass function for x is:

p (x) = \frac{n!}{x! (n - x)!} p_{2}^{x} {(1 - p_{2})}^{n - x}

(10)

where p₂ is calculated from eqn. 6. The p-value, α, is defined as the probability that the quality of a random match between a pair of spectra is greater than or equal to a match observed under H₀. The pp value, β, is defined as the negative common logarithm of the p-value:

β = -log(α)

(11)

We use x to evaluate the quality of a match, such that the p-value is the probability that x for a random match between the pair of theoretical and experimental spectra is greater than or equal to that of the actual match, x = n_match, under H₀. The p-value is:

α = \sum_{x = n_{match}}^{n} p (x) = \sum_{x = n_{match}}^{n} \frac{n!}{x! (n - x)!} p_{2}^{x} {(1 - p_{2})}^{n - x}

(12)

and the pp value is

β = - \log (α) = - \log (\sum_{x = n_{match}}^{n} \frac{n!}{x! (n - x)!} p_{2}^{x} {(1 - p_{2})}^{n - x})

(13)
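Eqns. 12 and 13 amount to a binomial upper-tail probability followed by a negative base-10 logarithm. The following is a minimal sketch with a hypothetical function name; it sums the tail directly, which is adequate for spectra with a moderate number of peaks but is not tuned for very large n.

```python
import math

def pp_value(n, n_match, p2):
    """pp value (eqn. 13): -log10 of the binomial tail P(x >= n_match) (eqn. 12).

    n:       number of product ions in the experimental spectrum
    n_match: number of matched product ions
    p2:      probability that a single product ion matches at random (eqn. 8)
    """
    alpha = sum(
        math.comb(n, x) * p2 ** x * (1.0 - p2) ** (n - x)
        for x in range(n_match, n + 1)
    )
    alpha = min(alpha, 1.0)          # guard against rounding slightly above 1
    if alpha <= 0.0:                 # tail underflowed to zero
        return math.inf
    return -math.log10(alpha)

# Same hypothetical match (10 of 50 peaks) scored at two values of p2.
print(pp_value(50, 10, 0.1), pp_value(50, 10, 0.01))
```

For a fixed number of matched ions, a smaller p₂ (tighter product ion tolerance) gives a larger pp value.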

The pp2 score

The second approach evaluates whether the total abundance of matched product ions in the experimental spectrum could be a random occurrence. Y is defined as the total abundance of experimental product ions that match product ions in a given theoretical spectrum:

Y = \sum_{i = 1}^{n} I_{i} b_{i}

(14)

where I_i is the standardized abundance of the i^th product ion in the experimental spectrum and b_i is defined in eqn. 5. For clarity we define y_i = I_i b_i to give eqn. 15.

Y = \sum_{i = 1}^{n} y_{i}

(15)

However, to complete the test we must know the inherent distribution of Y. This distribution is unknown and thus pp2 values can not be precisely calculated as were the pp values based on the total number of matched product ions. In order to estimate the pp2 value, three assumptions are needed:

1. I_i is identically and independently distributed across product ions in the experimental spectrum,

2. b_i is uncorrelated with I_i in the experimental spectrum,

3. the number of product ions, n, in the experimental spectrum is large (n > 30).

Under assumption 1, the mean μ_I and variance σ_I² for the distribution of I_i are estimated by:

\left\{ \begin{array}{l} {\hat{μ}}_{I} = \frac{1}{n} \sum_{i = 1}^{n} I_{i} \\ {\hat{σ}}_{I}^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} {(I_{i} - {\hat{μ}}_{I})}^{2} \end{array} \right.

(16)

Since y_i = I_i b_i, assumption 2 yields eqn. 17 under H₀,

\left\{ \begin{array}{l} μ_{y} = E_{y} (y_{i}) = E_{I} (E_{y | I} (y_{i} | I_{i})) = E_{I} (p_{2} I_{i}) = p_{2} μ_{I} \\ σ_{y}^{2} = E_{y} (y_{i}^{2}) - E_{y} {(y_{i})}^{2} = E_{I} (E_{y | I} (y_{i}^{2} | I_{i})) - {(p_{2} μ_{I})}^{2} = p_{2} (1 - p_{2}) μ_{I}^{2} + p_{2} σ_{I}^{2} \end{array} \right.

(17)

Thus, μ_y and σ_y² can be estimated as:

\left\{ \begin{array}{l} {\hat{μ}}_{y} = p_{2} {\hat{μ}}_{I} \\ {\hat{σ}}_{y}^{2} = p_{2} (1 - p_{2}) {\hat{μ}}_{I}^{2} + p_{2} {\hat{σ}}_{I}^{2} \end{array} \right.

(18)

According to the central limit theorem, Y is approximately normally distributed with the following parameters under assumption 3, i.e. when n is large (n > 30)

\left\{ \begin{array}{l} μ_{Y} = n μ_{y} \\ σ_{Y}^{2} = n σ_{y}^{2} \end{array} \right.

(19)

The resulting probability density function is given in eqn. 20.

f_{Y} (Y) = \frac{e^{- {(Y - μ_{Y})}^{2} / (2 σ_{Y}^{2})}}{\sqrt{2 π} σ_{Y}}

(20)

μ_Y and σ_Y² are estimated by eqn. 21.

\left\{ \begin{array}{l} {\hat{μ}}_{Y} = n {\hat{μ}}_{y} = n p_{2} {\hat{μ}}_{I} \\ {\hat{σ}}_{Y}^{2} = n {\hat{σ}}_{y}^{2} = n [p_{2} (1 - p_{2}) {\hat{μ}}_{I}^{2} + p_{2} {\hat{σ}}_{I}^{2}] = n p_{2} (1 - p_{2}) {\hat{μ}}_{I}^{2} + n p_{2} {\hat{σ}}_{I}^{2} \end{array} \right.

(21)

The p-value, α, is the probability that Y for a random match is greater than or equal to that of the actual match, I_match, under H₀. The p-value becomes:

α = \int_{I_{match}}^{+ \infty} f_{Y} (x) d x = \int_{I_{match}}^{+ \infty} \frac{e^{- {(x - μ_{Y})}^{2} / (2 σ_{Y}^{2})}}{\sqrt{2 π} σ_{Y}} d x

(22)

and is estimated by:

α = \int_{I_{match}}^{+ \infty} \frac{e^{- {(x - {\hat{μ}}_{Y})}^{2} / (2 {\hat{σ}}_{Y}^{2})}}{\sqrt{2 π} {\hat{σ}}_{Y}} d x

(23)

resulting in the pp2 value, β, as follows:

β = - \log (α) = - \log (\int_{I_{match}}^{+ \infty} \frac{e^{- {(x - {\hat{μ}}_{Y})}^{2} / (2 {\hat{σ}}_{Y}^{2})}}{\sqrt{2 π} {\hat{σ}}_{Y}} d x)

(24)

The pp2 value can be estimated very efficiently by eqn. 24. However, the real distribution of Y has a heavier tail toward larger values than the normal distribution. Therefore, pp2 values are overestimated when they are large.
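The chain from eqn. 16 to eqn. 24 can be followed step by step in code. The sketch below is an illustration under the three stated assumptions, with hypothetical names; the normal tail in eqn. 23 is evaluated with the complementary error function.

```python
import math

def pp2_value(abundances, matched_flags, p2):
    """pp2 value (eqn. 24) from the normal approximation of Y.

    abundances:    standardized abundances I_i of all n experimental product ions
    matched_flags: b_i for each ion (True if the ion matches the theoretical spectrum)
    p2:            random match probability for a single product ion (eqn. 8)
    """
    n = len(abundances)
    if n < 2 or not 0.0 < p2 < 1.0:
        raise ValueError("need at least two ions and 0 < p2 < 1")
    mu_i = sum(abundances) / n                                   # eqn. 16
    var_i = sum((a - mu_i) ** 2 for a in abundances) / (n - 1)   # eqn. 16
    mu_y = n * p2 * mu_i                                         # eqn. 21
    var_y = n * p2 * (1.0 - p2) * mu_i ** 2 + n * p2 * var_i     # eqn. 21
    i_match = sum(a for a, b in zip(abundances, matched_flags) if b)
    z = (i_match - mu_y) / math.sqrt(var_y)
    alpha = 0.5 * math.erfc(z / math.sqrt(2.0))                  # eqn. 23
    return math.inf if alpha <= 0.0 else -math.log10(alpha)

# Hypothetical spectrum of 40 equal peaks, half of them matched, p2 = 0.5.
print(pp2_value([1.0] * 40, [True] * 20 + [False] * 20, 0.5))
```

In this hypothetical case the matched abundance equals its expected value under H₀, so α is 0.5 and the pp2 value is log(2), about 0.30.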

Distribution of pp value for random matches

When q = 0, the algorithm always assigns a pp value of β = 0 because the experimental and theoretical precursor ions do not match. All of the probability is therefore concentrated at β = 0, and the cumulative distribution function for the pp value when q = 0 is shown in eqn. 25.

F_{β | q = 0} (β) = 1 \quad (β \geq 0)

(25)

In statistical hypothesis testing, the p-value is uniformly distributed on the interval [0, 1] when the null hypothesis H₀ is true. Therefore, when q = 1, the cumulative distribution function for the p-value of a random match is

F_{α | q = 1} (α) = α \quad (0 \leq α \leq 1)

(26)

According to the definition of the pp value (eqn. 11), the cumulative distribution function for the pp value when q = 1 is

F_{β | q = 1} (β) = 1 - 10^{- β} \quad (β \geq 0)

(27)

and the probability density function is

\begin{matrix} f_{β | q = 1} (β) = \frac{d}{d β} F_{β | q = 1} (β) = \frac{d}{d β} (1 - 10^{- β}) = \ln (10) 10^{- β} & (β \geq 0) \end{matrix} .

(28)

Matches with pp or pp2 values under a critical value β_c > 0 are discarded, i.e. their pp values are assigned 0. Thus the distribution of the pp value for random matches returned by the algorithm is

F_{β | q = 1}^{*} (0) = \int_{0}^{β_{c}} f_{β | q = 1} (x) d x = \int_{0}^{β_{c}} \ln (10) 10^{- x} d x = 1 - 10^{- β_{c}}

(29)

and for β ≥ β_c > 0,

F_{β | q = 1}^{*} (β) = F_{β | q = 1} (β) = 1 - 10^{- β}

(30)

Thus when q = 1

F_{β | q = 1}^{*} (β) = \begin{cases} 1 - 10^{- β_{c}} & β = 0 \\ 1 - 10^{- β} & β \geq β_{c} \end{cases}

(31)

Likewise we can specify the unconditional distribution of pp values for random matches as follows. Since q has a bernoulli (p₁) distribution, we have

P (q) = \begin{cases} 1 - p_{1} & q = 0 \\ p_{1} & q = 1 \end{cases}

(32)

For β = 0 the cumulative distribution function becomes,

\begin{matrix} F_{β} (0) = F_{β | q = 0} (0) \times P_{q} (0) + F_{β | q = 1}^{*} (0) \times P_{q} (1) \\ = 1 \times (1 - p_{1}) + (1 - 10^{- β_{c}}) \times p_{1} = 1 - p_{1} 10^{- β_{c}} \end{matrix}

(33)

and for β ≥ β_c > 0, where the q = 0 term contributes its full probability mass because that mass lies at β = 0, it becomes,

\begin{matrix} F_{β} (β) = F_{β | q = 0} (β) \times P_{q} (0) + F_{β | q = 1}^{*} (β) \times P_{q} (1) \\ = 1 \times (1 - p_{1}) + (1 - 10^{- β}) \times p_{1} = 1 - p_{1} 10^{- β} \end{matrix}

(34)

The combined cumulative distribution function is thus,

F_{β} (β) = \begin{cases} 1 - p_{1} 10^{- β_{c}} & β = 0 \\ 1 - p_{1} 10^{- β} & β \geq β_{c} \end{cases}

(35)

When β ≥ β_c, F_β(β) is continuous and the probability density function of pp value for random matches is

\begin{matrix} f_{β} (β) = \frac{d}{d β} F_{β} (β) = \frac{d}{d β} (1 - p_{1} 10^{- β}) = p_{1} \ln (10) 10^{- β} & (β \geq β_{c}) \end{matrix}

(36)
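The combined distribution in eqn. 35 can be written as a small function, which makes the effect of the threshold β_c easy to inspect. The name is hypothetical, and the value returned for 0 < β < β_c (the same as at β = 0, since no reported pp value falls in that interval) is an inference from the text rather than something stated explicitly.

```python
def random_match_pp_cdf(beta, p1, beta_c):
    """Cumulative distribution of reported pp values for random matches (eqn. 35).

    beta:   pp value at which to evaluate the distribution
    p1:     precursor random match probability (eqn. 4)
    beta_c: pp value threshold below which matches are reported as 0
    """
    if beta < 0.0:
        return 0.0
    # No probability mass lies strictly between 0 and beta_c, so the
    # function is flat there (an inference, see the note above).
    effective_beta = max(beta, beta_c)
    return 1.0 - p1 * 10.0 ** (-effective_beta)

# Hypothetical values: p1 = 0.01 and a threshold of beta_c = 1.
print(random_match_pp_cdf(0.0, 0.01, 1.0), random_match_pp_cdf(2.0, 0.01, 1.0))
```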

Confidence level for pp and pp2 values

The confidence level can also be determined for both pp and pp2. Suppose there are r theoretical spectra within the protein sequence database. If we assume that all theoretical spectra are uncorrelated, eqn. 37 gives φ, the number of random matches that have a pp value greater than or equal to β under H₀ for any given experimental spectrum.

φ = r \int_{β}^{+ \infty} f_{β} (x) d x = r \int_{β}^{+ \infty} p_{1} \ln (10) 10^{- x} d x = r p_{1} 10^{- β},

(37)

The confidence level, ψ, is defined as

ψ = - \log (φ) = - \log (r p_{1} 10^{- β}) = β - \log (r) - \log (p_{1})

(38)

where β is either the pp or pp2 value, r is the number of theoretical spectra within the protein sequence database, and p₁ is given in eqn. 4. Confidence levels calculated from the pp value and the pp2 value are referred to as confidence level and confidence level2, respectively.

The confidence level is the negative common logarithm of the expected number of random matches with a pp value greater than or equal to the one observed for the corresponding experimental spectrum. Therefore, if the confidence level is below 0, more than one random match for the spectrum is expected and the corresponding match is highly suspect. From eqn. 38, the pp value is directly related to the confidence level. The confidence level depends on the size of the database and degrades as the number of peptides created from the database increases.
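The relationship between the pp value, the database size and the confidence level can be checked numerically. This is a short sketch with hypothetical function names; the database size is the search space quoted for Table 1, while the pp value and p₁ are made-up inputs.

```python
import math

def expected_random_matches(beta, r, p1):
    """Eqn. 37: expected number of random matches with pp value >= beta."""
    return r * p1 * 10.0 ** (-beta)

def confidence_level(beta, r, p1):
    """Confidence level: psi = beta - log10(r) - log10(p1)."""
    return beta - math.log10(r) - math.log10(p1)

# Hypothetical match: pp value of 8 against 2726345 theoretical peptides, p1 = 1e-5.
phi = expected_random_matches(8.0, 2726345, 1e-5)
psi = confidence_level(8.0, 2726345, 1e-5)
print(phi, psi)
```

Increasing r by a factor of ten lowers the confidence level by exactly one unit for the same pp value.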

The protein pp score

The pp model is also used to calculate pp values for protein matches. Let r_protein denote the total number of theoretical spectra created from a protein sequence and n_spectra denote the total number of experimental spectra in the data set. The cross match of all experimental spectra with the theoretical peptides for the protein sequence generates n_{match_protein} = r_protein × n_spectra potential matches. The sum of the reported pp values of all matches for the protein is calculated from eqn. 39.

B = \sum_{i = 1}^{n_{match_protein}} β_{i}

(39)

The statistical model is used to test the following hypotheses: H₀, all peptide matches for a given protein are random; and H_A, at least one peptide match for a given protein is not random. We assume that the r_protein theoretical spectra created from the protein sequence are uncorrelated with each other and that the n_spectra experimental spectra from the data set are uncorrelated with each other. Since n_{match_protein} is normally very large, B is approximately normally distributed with a mean of μ_B = n_{match_protein} × μ_β and a variance of σ_B² = n_{match_protein} × σ_β² according to the central limit theorem.

According to the distribution of the pp value for random matches described above, the mean and variances of a random match are given by the following equations:

μ_{β} = \int_{0}^{+ \infty} x f_{β} (x) d x = \int_{β_{c}}^{+ \infty} x p_{1} \ln (10) 10^{- x} d x = (\frac{p_{1}}{\ln (10)} + p_{1} β_{c}) 10^{- β_{c}}

(40)

and

\begin{matrix} σ_{β}^{2} = E (β^{2}) - μ_{β}^{2} = \int_{0}^{+ \infty} x^{2} f_{β} (x) d x - μ_{β}^{2} = \int_{β_{c}}^{+ \infty} x^{2} p_{1} \ln (10) 10^{- x} d x - μ_{β}^{2} \\ = (\frac{2 p_{1}}{{[\ln (10)]}^{2}} + \frac{2 p_{1} β_{c}}{\ln (10)} + p_{1} β_{c}^{2}) 10^{- β_{c}} - {[(\frac{p_{1}}{\ln (10)} + p_{1} β_{c}) 10^{- β_{c}}]}^{2} \end{matrix}

(41)

where p₁ is given in eqn. 4 and β_c is the pp value threshold. Likewise for the sum of pp values for the protein, B, the mean and variance for the distribution under H₀ are given in eqn. 42:

\left\{ \begin{array}{l} μ_{B} = n_{match_protein} (\frac{p_{1}}{\ln (10)} + p_{1} β_{c}) 10^{- β_{c}} \\ σ_{B}^{2} = n_{match_protein} \left\{ [\frac{2 p_{1}}{{[\ln (10)]}^{2}} + \frac{2 p_{1} β_{c}}{\ln (10)} + p_{1} β_{c}^{2}] 10^{- β_{c}} - {[(\frac{p_{1}}{\ln (10)} + p_{1} β_{c}) 10^{- β_{c}}]}^{2} \right\} \end{array} \right.

(42)

The p-value for a protein, α_protein, is defined as the probability that, under H₀, the sum of pp values from all peptide matches of a protein hit is greater than or equal to the observed sum B. Thus α_protein is given by

α_{protein} = \int_{B}^{+ \infty} f_{B} (x) d x = \int_{B}^{+ \infty} \frac{e^{- {(x - μ_{B})}^{2} / (2 σ_{B}^{2})}}{\sqrt{2 π} σ_{B}} d x .

(43)

and the protein pp value becomes

protein pp value = - \log (α_{protein}) = - \log (\int_{B}^{+ \infty} \frac{e^{- {(x - μ_{B})}^{2} / (2 σ_{B}^{2})}}{\sqrt{2 π} σ_{B}} d x)

(44)
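Eqns. 40 to 44 can be assembled into a short calculation. The sketch below uses hypothetical function names and inputs and follows the normal approximation exactly as written above; it is not the MassMatrix code.

```python
import math

def random_pp_moments(p1, beta_c):
    """Mean and variance of the reported pp value for one random match (eqns. 40, 41)."""
    ln10 = math.log(10.0)
    tail = 10.0 ** (-beta_c)
    mean = p1 * (1.0 / ln10 + beta_c) * tail
    second_moment = p1 * (2.0 / ln10 ** 2 + 2.0 * beta_c / ln10 + beta_c ** 2) * tail
    return mean, second_moment - mean ** 2

def protein_pp_value(peptide_pp_values, r_protein, n_spectra, p1, beta_c):
    """Protein pp value (eqn. 44) from the sum B of reported peptide pp values."""
    n_match_protein = r_protein * n_spectra
    mu_beta, var_beta = random_pp_moments(p1, beta_c)
    mu_b = n_match_protein * mu_beta                  # eqn. 42
    sigma_b = math.sqrt(n_match_protein * var_beta)   # eqn. 42
    b_sum = sum(peptide_pp_values)                    # eqn. 39
    z = (b_sum - mu_b) / sigma_b
    alpha = 0.5 * math.erfc(z / math.sqrt(2.0))       # eqn. 43
    return math.inf if alpha <= 0.0 else -math.log10(alpha)

# Hypothetical protein: 50 theoretical spectra, 1000 experimental spectra,
# p1 = 1e-5, threshold beta_c = 2, one reported peptide match with pp value 3.
print(protein_pp_value([3.0], 50, 1000, 1e-5, 2.0))
```

Because the normal tail decays very quickly, α_protein can underflow double precision for proteins with several strong peptide matches; the sketch returns infinity in that case.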

Discussion

Effect of various spectral characteristics on scoring

Five example spectra, shown in Figure 1, are used to illustrate the effect of various spectral characteristics on scoring. All spectra were collected on an LTQ-Orbitrap mass spectrometer (ThermoElectron Finnigan, San Jose, CA, USA) [29]. Precursor and product ions were mass analyzed by the Orbitrap to achieve a mass accuracy of < 5.0 ppm. The pp and pp2 values at different mass accuracies (0.01 Da, 0.1 Da and 1.0 Da) were calculated and are listed in Table 1. Mass accuracy tolerances were specified as either relative or absolute for precursor ions but only as absolute tolerances for product ions. Absolute mass accuracy tolerances for product ions are computationally cheaper and yield a reasonable compromise between computational expense and accuracy. Good quality spectra (Figure 1a, 1b) yielded high empirical and statistical scores, as expected. The sequences in Figure 1a and 1b illustrate that peptide length has little effect on the scoring. This observation is consistent with the statistical model's lack of peptide sequence bias and with the peptide length penalty included in the empirical score (eqn. 1). Low quality data (i.e. low signal to noise ratio) can still yield good scores if the most abundant ions are dominated by signal (Figure 1c). The most challenging spectra are those with few dominant signal peaks. Examples are shown in Figure 1d and 1e. These spectra have a single dominant ion, due to N-terminal fragmentation at an internal Pro residue and to the neutral loss of H₂O at an N-terminal Glu residue [30]. The empirical scores were poorer for these cases because a single ion contributes most of the score (eqn. 1). However, the pp and pp2 values were not as severely affected and were able to accurately discriminate these matches from false positives.

Table 1 Empirical and statistical scores along with associated parameters (p₁ and p₂) for each spectrum shown in Figure 1. The data were obtained for mass accuracies of 1.0 Da, 0.1 Da and 0.01 Da. Confidence levels were calculated based on a search space of 2726345 theoretical peptides. The confidence levels for the pp and pp2 scores are denoted confidence level and confidence level2, respectively.

Comparison between pp and pp2 values

The pp value is the primary discriminator for the quality of matches. The pp2 value can provide a complementary assessment of quality when pp values are suspect. Although the pp and pp2 values have the same statistical basis, there are several differences between them. The pp value is based on the number of matched product ions, whereas the pp2 value is based on the total abundance of matched product ions. The pp value can be underestimated when noise is present in the experimental spectrum, especially at low mass accuracy. The pp2 value, on the other hand, is generally unaffected because noise normally has lower abundance than product ions. As shown in Table 1, the pp value for the spectrum with a low signal to noise ratio and a majority of noise peaks (Figure 1c) was negatively affected by the noise peaks and was relatively low compared with those for normal spectra at a mass accuracy of 1.0 Da. However, the pp2 value was not affected by the noise peaks.

While the pp value can be precisely calculated, three assumptions are needed to estimate the pp2 value. Assumptions 1 and 3 for the pp2 test are not plausible when the number of product ions in the experimental spectrum, n, is small. Therefore, the pp2 value estimated by the central limit theorem cannot evaluate the quality of matches with a small number of product ions. Furthermore, because the normal distribution under the central limit theorem is less tailed than the true distribution of Y, the pp2 value is normally overestimated when it is large (> 16), as shown in Table 1.

From the above discussion, the pp value is more reliable and accurate than the pp2 value under most circumstances, but it can be affected by noise. Under these circumstances, the number of product ions in the spectrum is normally large and pp2 value can be well estimated and complementary to pp value. Thus the combination of the two scores provides an excellent means to ascertain the quality of matches under conditions where one might fail.

Effect of mass accuracy on pp values

In the pp model, the two most important parameters (p₁, the probability that a theoretical precursor randomly matches the experimental precursor, and p₂, the probability that an experimental product ion randomly matches any product ion in the theoretical spectrum) are set in accordance with the predetermined mass accuracy of the mass spectrometer. The values of these parameters decrease as mass accuracy increases. This effect is shown in Table 1. A more thorough list of all parameters used in calculating the empirical and statistical scores is provided as supplementary material [see Additional file 1]. The statistical model specifically takes each parameter into account when calculating the statistical scores. Therefore, these two parameters have a substantial effect on the pp values for both random matches and true matches. Consequently, pp and pp2 values are very sensitive to the accuracy of the mass spectrometer.

As shown in Figure 2, the probability of random matches having high pp values is substantially reduced as mass accuracy increases. Increasing mass accuracy resulted in a shift of the pp value distribution for random matches to lower values. At the same time, the pp value distribution for true matches moved to higher pp values. This effect is evident from the pp values in Table 1. As mass accuracy improved, the pp values improved for all peptide matches in Figure 1. Thus higher mass accuracy improves the sensitivity and selectivity of a search and helps discriminate true matches from random matches.

Conclusion

A new statistically derived scoring algorithm was developed for characterization of peptides, proteins and their posttranslational modifications from tandem MS data. The probability based algorithm implicitly incorporates mass accuracy into scoring the potential peptide and protein matches. This approach is separate and distinct from algorithms that filter precursor and product ion matches based on mass accuracy. The statistical model involves no empirical parameters and its scores correlate to the probability that a match is a random occurrence. A novel statistically derived algorithm to rigorously calculate protein scores from the probability based peptide scores was also developed. Thus the protein scores reflect the significance of protein matches and can be used to differentiate true protein matches from random matches. The algorithm is incorporated in an automated database search program MassMatrix.

References

Sadygov RG, Cociorva DC, Yates JR: Large-scale database searching using tandem mass spectra: Looking up the answer in the back of the book. Nature Methods. 2004, 1 (3): 195-202. 10.1038/nmeth725.
Article CAS PubMed Google Scholar
Hunt DF, Yates JR, Shabanowitz J, Winston S, Hauer CR: Protein sequencing by tandem mass-spectrometry. Proc Natl Acad Sci U S A. 1986, 57: 6233-6237. 10.1073/pnas.83.17.6233.
Article Google Scholar
Bakhtiar R, Guan ZQ: Electron capture dissociation mass spectrometry in characterization of peptides and proteins. Biotechnol Lett. 2006, 28 (14): 1047-1059. 10.1007/s10529-006-9065-z.
Article CAS PubMed Google Scholar
Nikolaev EN, Somogyi A, Smith DL, Gu CG, Wysocki VH, Martin CD, Samuelson GL: Implementation of low-energy surface-induced dissociation (eV SID) and high-energy collision-induced dissociation (keV CID) in a linear sector-TOF hybrid tandem mass spectrometer. Int J Mass Spectrom. 2001, 212: 535-551. 10.1016/S1387-3806(01)00462-6.
Syka JEP, Coon JJ, Schroeder MJ, Shabanowitz J, Hunt DF: Peptide and protein sequence analysis by electron transfer dissociation mass spectrometry. Proc Natl Acad Sci U S A. 2004, 101: 9528-9533. 10.1073/pnas.0402700101.
Hunt DF, Buko AM, Ballard JM, Shabanowitz J, Giordani AB: Sequence-analysis of polypeptides by collision activated dissociation on a triple quadrupole mass-spectrometer. Biomed Mass Spectrom. 1981, 8 (9): 397-408. 10.1002/bms.1200080909.
Biemann K: Contributions of mass-spectrometry to peptide and protein-structure. Biomed Environ Mass Spectrom. 1988, 16: 99-111. 10.1002/bms.1200160119.
Nesvizhskii AI, Aebersold R: Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS. Drug Discov Today. 2004, 9 (4): 173-181. 10.1016/S1359-6446(03)02978-7.
Dancík V, Addona TA, Clauser KR, Vath JE, Pevzner PA: De novo peptide sequencing via tandem mass spectrometry. J Comput Biol. 1999, 6: 327-342. 10.1089/106652799318300.
Standing KG: Peptide and protein de novo sequencing by mass spectrometry. Curr Opin Struc Biol. 2003, 13 (5): 595-601. 10.1016/j.sbi.2003.09.005.
Eng JK, McCormack AL, Yates JR: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom. 1994, 5: 976-989. 10.1016/1044-0305(94)80016-2.
Kapp EA, Schütz F, Connolly LM, Chakel JA, Meza JE, Miller CA, Fenyo D, Eng JK, Adkins JN, Omenn GS, Simpson RJ: An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: sensitivity and specificity analysis. Proteomics. 2005, 5 (13): 3475-3490. 10.1002/pmic.200500126.
Keller A, Nesvizhskii A, Kolker E, Aebersold R: Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002, 74: 5383-5392. 10.1021/ac025747h.
Qian W, Liu T, Monroe ME, Strittmatter EF, Jacobs JM, Kangas LJ, Petritis K, Camp DG, Smith RD: Probability-Based Evaluation of Peptide and Protein Identifications from Tandem Mass Spectrometry and SEQUEST Analysis: The Human Proteome. J Proteome Res. 2005, 4: 53-62. 10.1021/pr0498638.
Zhang N, Aebersold R, Schwikowski B: ProbID: a probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data. Proteomics. 2002, 2: 1406-1412. 10.1002/1615-9861(200210)2:10<1406::AID-PROT1406>3.0.CO;2-9.
Perkins DN, Pappin DJC, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999, 20: 3551-3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.
Hansen BT, Jones JA, Mason DE, Liebler DC: SALSA: a pattern recognition algorithm to detect electrophile-adducted peptides by automated evaluation of CID spectra in LC-MS-MS analyses. Anal Chem. 2001, 73: 1676-1683. 10.1021/ac001172h.
Mann M, Wilm M: Error tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem. 1994, 66: 4390-4399. 10.1021/ac00096a002.
Bafna V, Edwards N: SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database. Bioinformatics. 2001, 17: S13-S21. 10.1093/bioinformatics/17.1.13.
Colinge J, Masselot A, Giron M, Dessingy T, Magnin J: OLAV: towards high-throughput tandem mass spectrometry data identification. Proteomics. 2003, 3: 1454-1463. 10.1002/pmic.200300485.
MacCoss MJ, Wu CC, Yates JR: Probability-based validation of protein identifications using a modified SEQUEST algorithm. Anal Chem. 2002, 74: 5593-5599. 10.1021/ac025826t.
Havilio M, Haddad Y, Smilansky Z: Intensity-based statistical scorer for tandem mass spectrometry. Anal Chem. 2003, 75: 435-444. 10.1021/ac0258913.
Sadygov RG, Yates JR: A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. Anal Chem. 2003, 75: 3792-3798. 10.1021/ac034157w.
Sadygov RG, Liu H, Yates JR: Statistical models for protein validation using tandem mass spectral data and protein amino acid sequence databases. Anal Chem. 2004, 76: 1664-1671. 10.1021/ac035112y.
Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, Yang X, Shi W, Bryant SH: Open mass spectrometry search algorithm. J Proteome Res. 2004, 3: 958-964. 10.1021/pr0499491.
Olsen JV, de Godoy LMF, Li G, Macek B, Mortensen P, Pesch R, Makarov A, Lange O, Horning S, Mann M: Parts per million mass accuracy on an Orbitrap mass spectrometer via lock mass injection into a C-trap. Mol Cell Proteomics. 2005, 4 (12): 2010-2021. 10.1074/mcp.T500030-MCP200.
Clauser KR, Baker P, Burlingame AL: Role of accurate mass measurement (±10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Anal Chem. 1999, 71: 2871-2882. 10.1021/ac9810516.
Sullivan AG, Brancia FL, Tyldesley R, Bateman R, Sidhu K, Hubbard SJ, Oliver SG, Gaskell SJ: The exploitation of selective cleavage of singly protonated peptide ions adjacent to aspartic acid residues using a quadrupole orthogonal time-of-flight mass spectrometer equipped with a matrix-assisted laser desorption/ionization source. Int J Mass Spectrom. 2001, 210/211: 665-676. 10.1016/S1387-3806(01)00430-4.
Xu H, Freitas MA: MassMatrix: A Database Searching Program for Rapid Characterization of Proteins and Peptides from Tandem Mass Spectrometry Data. To be submitted.
Paizs B, Suhai S: Fragmentation pathways of protonated peptides. Mass Spectrom Rev. 2005, 24: 508-548. 10.1002/mas.20024.

Acknowledgements

The study was funded by the Ohio State University, the National Institutes of Health CA107106, the V Foundation/American Association for Cancer Research Translational Cancer Research Grant and the Leukemia & Lymphoma Society.

Author information

Authors and Affiliations

Department of Chemistry, the Ohio State University, Columbus, 43210, OH, USA
Hua Xu
Department of Molecular Immunology Virology and Medical Genetics, the Ohio State University, Columbus, 43210, OH, USA
Michael A Freitas

Corresponding author

Correspondence to Michael A Freitas.

Additional information

Authors' contributions

HX designed and mathematically proved the statistical model and drafted the manuscript. MAF was the principal investigator, provided overall guidance of the project, and critically revised the manuscript. Both authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Xu, H., Freitas, M.A. A mass accuracy sensitive probability based scoring algorithm for database searching of tandem mass spectrometry data. BMC Bioinformatics 8, 133 (2007). https://doi.org/10.1186/1471-2105-8-133

Received: 07 December 2006
Accepted: 20 April 2007
Published: 20 April 2007
DOI: https://doi.org/10.1186/1471-2105-8-133
