### Multiple scoring algorithms

The peptide matching algorithm contains two independent scoring models: a descriptive model and a statistical model. These models are used to calculate three distinct scores for a peptide match. Each of the scores may be used independently to ascertain the quality of the match. Because each score is distinct, the combination of scores is useful for validating each peptide match. The two models and their application to calculating peptide match scores are described in detail below.

### Descriptive peptide scoring model

Descriptive scores do not strictly convey any statistical relevance and may be prone to bias due to the scoring parameters. However, they have proven to be useful and generally augment probability based scores [13]. The descriptive model used herein to calculate peptide match scores (S) is shown in eqn 1.

*I*_{i} is defined as the standardized abundance of the *i*^{th} product ion in the experimental spectrum (calculated by dividing the abundance of the *i*^{th} product ion by the maximum abundance in the spectrum), the sum of *I*_{i} over the matched product ions is the total standardized abundance of matched product ions, *n*_{match} is the number of matched product ions, *r*_{match} is the ratio of the standardized abundance of matched product ions to the total standardized abundance of the experimental spectrum, and *L*_{pep} is the length of the peptide in number of amino acids. Each of these factors contributes to the overall score as follows: the total matched abundance evaluates the quality of the match, the *r*_{match} term introduces a penalty for unmatched product ions, the *n*_{match} term is an arbitrary penalty for matches with poor fragmentation, and the *L*_{pep} term is an additional penalty for peptides with long sequences. The constant 100 is used arbitrarily to scale the scores.

By default, scores for a spectrum with fewer than three matched product ions will be 0 due to the arbitrary penalty. However, the minimum number of matched ions may be changed to any value. Reducing this number is especially valuable for the analysis of singly charged peptides that have characteristic C-terminal aspartic acid fragmentation [28]. The penalty for peptide length is included to normalize the scores: peptides with longer sequences have more fragment ions and therefore higher empirical scores than shorter sequences, and the penalty gives long and short sequences similar scores for matches of similar quality. The choices of squared and square root forms for the terms *n*_{match} and *L*_{pep} were determined empirically from the evaluation of tandem MS data sets collected on LCQ and LTQ-Orbitrap mass spectrometers.
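The quantities that feed the descriptive score can be computed directly from their definitions. The following is a minimal sketch, not the authors' implementation: the function name `match_statistics` and its list-based inputs are hypothetical, and the sketch stops at the inputs because the scoring formula itself (eqn 1) is not reproduced here.

```python
def match_statistics(abundances, matched):
    """Return (standardized abundances, total matched abundance, n_match, r_match).

    abundances: raw product ion abundances of the experimental spectrum.
    matched:    booleans, True where the ion matched a theoretical product ion.
    Both arguments are illustrative inputs, not part of the published algorithm.
    """
    if len(abundances) != len(matched):
        raise ValueError("abundances and matched must have the same length")
    max_abundance = max(abundances)
    # I_i: each abundance divided by the maximum abundance in the spectrum.
    standardized = [a / max_abundance for a in abundances]
    # Total standardized abundance of the matched product ions.
    matched_total = sum(i for i, hit in zip(standardized, matched) if hit)
    # n_match: number of matched product ions.
    n_match = sum(1 for hit in matched if hit)
    # r_match: matched standardized abundance over total standardized abundance.
    r_match = matched_total / sum(standardized)
    return standardized, matched_total, n_match, r_match


# Hypothetical four-ion spectrum in which the first two ions are matched.
std, total, n_match, r_match = match_statistics(
    [50.0, 200.0, 100.0, 50.0], [True, True, False, False]
)
print(std, total, n_match, r_match)
```

With the default minimum of three matched ions, a spectrum like this one (two matched ions) would receive a descriptive score of 0.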

### Descriptive protein score

For "true" matches, we assume that the scores are normally distributed with a mean of 20 and a variance of 25. This arbitrary distribution estimates the distributions observed from analysis of several datasets. The expected contribution of each match to the protein score will be

. Thus, the protein score from the descriptively scored matches is calculated from eqn 2.

### Probability based peptide scoring model

In addition to the empirical score, a mass accuracy sensitive probability based scoring model was derived to evaluate peptide matches. The model determines the likelihood that a match between an experimental spectrum and a theoretical spectrum is a random occurrence. Consider a pair of spectra: one experimental and one theoretical. *W*_{e} and *W*_{t} denote their precursor masses respectively. In addition, the experimental data contains information regarding the abundance of product ions *I*_{i} for each precursor, *W*_{e}. The model ultimately tests the following two hypotheses: the null hypothesis, H_{0}, states that a match is random, i.e. the theoretical spectrum is independent of the experimental one; and the alternative hypothesis, H_{A}, states that the match is not random, i.e. the theoretical spectrum is related to the experimental one.

Scoring the match is performed in two stages: 1) match *W*_{e} against *W*_{t} within the specified precursor ion mass accuracy and 2) match all product ions in the experimental spectrum against the theoretical spectrum within the specified mass accuracy. Both stages rely on calculating the probability that the occurrence of an ion within a fixed mass window could be a random occurrence.

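To match the experimental precursor with that of the theoretical peptide we first define the variable *q*:

$$q = \begin{cases} 1, & |W_e - W_t| \le \tau_{pep} \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

Under H_{0}, the probability that any precursor ion match (*q* = 1) could be random is given in eqn 4.

$$p_1 = P(q = 1 \mid H_0) = \frac{2\,\tau_{pep}}{\Pi} \qquad (4)$$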

In the above equation, *τ*_{pep} is the mass accuracy of the precursor ion and *Π* is the detection range for the precursor ion. For each precursor ion the mass window is ± *τ*_{pep}, i.e. a total width of 2 × *τ*_{pep}. Thus *q* has a Bernoulli (*p*_{1}) distribution under H_{0}. If the precursor ion masses of the pair of spectra do not match (*q* = 0) then the second stage is skipped. If *q* = 1 we proceed to stage 2, where we test the match of the experimental product ion spectrum against the theoretical spectrum.
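Stage 1 can be sketched in a few lines. This is an illustration under stated assumptions rather than the published code: the function names are hypothetical, and *p*_{1} is taken to be the window width 2 × *τ*_{pep} divided by the detection range *Π*, which is how the definitions above read.

```python
def precursor_match(w_e, w_t, tau_pep):
    """Return q: 1 if the precursor masses agree within +/- tau_pep, else 0."""
    return 1 if abs(w_e - w_t) <= tau_pep else 0


def precursor_random_match_probability(tau_pep, detection_range):
    """Return p_1, the chance that a random precursor falls in a 2 * tau_pep window.

    Assumes precursor masses are spread uniformly over detection_range (Pi).
    """
    if tau_pep <= 0 or detection_range <= 0:
        raise ValueError("tau_pep and detection_range must be positive")
    return 2.0 * tau_pep / detection_range


# Hypothetical values: 2 Da precursor accuracy and a 4000 Da detection range.
q = precursor_match(w_e=1500.8, w_t=1501.3, tau_pep=2.0)
p1 = precursor_random_match_probability(tau_pep=2.0, detection_range=4000.0)
print(q, p1)
```

Only pairs with `q == 1` would go on to the product ion comparison in stage 2.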

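The variable *b*_{i} is defined for each product ion, *i*, in the experimental spectrum as follows:

$$b_i = \begin{cases} 1, & \text{the } i^{th} \text{ product ion matches a product ion in the theoretical spectrum within } \pm\tau_{msms} \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

Under H_{0}, all matched product ions are random and independent occurrences. The probability that a product ion randomly matches any of the product ions in the theoretical spectrum is:

$$p_2 = \frac{\Pi_{theo}}{\Pi} \qquad (6)$$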

where *Π*_{theo} is the total coverage of the detection range for all product ions in the theoretical spectrum and *Π* is the MS/MS detection range. It is assumed that *Π* is the same as the precursor ion mass range. However, for instruments that have a dynamic detection range, assuming a fixed value of *Π* will result in more conservative scores. For each product ion in the theoretical spectrum, the mass window is ± *τ*_{msms}, i.e. a total width of 2 × *τ*_{msms}. If we assume there is no overlap in the product ion mass windows, then *Π*_{theo} is calculated using the following equation

*Π*_{theo} = 2 *m* × *τ*_{msms}. (7)

The probability that any single matched product ion (*b*_{i} = 1) could be random can be calculated using eqn 8

$$p_2 = P(b_i = 1 \mid H_0) = \frac{2\,m\,\tau_{msms}}{\Pi} \qquad (8)$$

where *τ*_{msms} is the product ion mass accuracy and *m* is the number of product ions within the detection range in the theoretical spectrum. Because the theoretical spectrum is independent of the experimental spectrum under H_{0}, all *b*_{i} (*i* = 1, 2, ..., *n*) are assumed to have an identical and independent Bernoulli (*p*_{2}) distribution under H_{0}. The model is then used to perform two distinct tests. Each uses a different measure to evaluate the quality of a match: the number of matched product ions, *x*, and the total abundance of matched product ions, *Y*.
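Stage 2 can be sketched the same way. The function names and the plain lists of m/z values are hypothetical; *p*_{2} is computed as 2 × *m* × *τ*_{msms} / *Π*, which follows from eqn 7 under the stated no-overlap assumption.

```python
def product_ion_matches(experimental_mz, theoretical_mz, tau_msms):
    """Return b_i for each experimental ion: 1 if any theoretical ion is within +/- tau_msms."""
    return [
        1 if any(abs(mz - t) <= tau_msms for t in theoretical_mz) else 0
        for mz in experimental_mz
    ]


def product_random_match_probability(m, tau_msms, detection_range):
    """Return p_2 = 2 * m * tau_msms / Pi, assuming non-overlapping mass windows."""
    if m < 0 or tau_msms <= 0 or detection_range <= 0:
        raise ValueError("m must be non-negative; tau_msms and detection_range positive")
    return 2.0 * m * tau_msms / detection_range


# Hypothetical spectra: three experimental ions against three theoretical ions.
b = product_ion_matches([175.1, 300.0, 457.3], [175.12, 457.25, 600.3], tau_msms=0.1)
p2 = product_random_match_probability(m=3, tau_msms=0.1, detection_range=2000.0)
print(b, p2)
```

If theoretical mass windows do overlap, 2 × *m* × *τ*_{msms} overstates the covered range, so this *p*_{2} errs on the conservative side.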

### The pp score

The model is used to evaluate whether the number of matched product ions in an experimental spectrum could be a random occurrence. For all spectra whose precursor ion masses match, i.e. *q* = 1, the variable *x* is defined as the number of product ions in the experimental spectrum that match the theoretical spectrum (eqn 9), where *b*_{i} (*i* = 1, 2, ..., *n*) is defined in eqn 5 and *n* is the number of product ions in the experimental spectrum.

$$x = \sum_{i=1}^{n} b_i \qquad (9)$$

Under H_{0}, all *b*_{i} have an identical and independent Bernoulli (*p*_{2}) distribution. Therefore, *x* will have a binomial (*n*, *p*_{2}) distribution. Consequently the probability mass function for *x* is:

$$P(x = k) = \binom{n}{k}\,p_2^{\,k}\,(1 - p_2)^{\,n-k} \quad (k = 0, 1, \ldots, n) \qquad (10)$$

where *p*_{2} is calculated from eqn 6. The p-value, *α*, is defined as the probability, under H_{0}, that the quality of a random match between a pair of spectra is greater than or equal to that of the observed match. The pp value, *β*, is defined as the negative common logarithm of the p-value:

*β* = -log(*α*) (11)

We use *x* to evaluate the quality of a match, such that the p-value is the probability that *x* for a random match between the pair of theoretical and experimental spectra is greater than or equal to that of the actual match, *x* = *n*_{match}, under H_{0}. The p-value is:

$$\alpha = P(x \ge n_{match} \mid H_0) = \sum_{k = n_{match}}^{n} \binom{n}{k}\,p_2^{\,k}\,(1 - p_2)^{\,n-k} \qquad (12)$$
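Because *x* is binomial (*n*, *p*_{2}) under H_{0}, the pp value is the negative common logarithm of a binomial upper tail. A minimal sketch follows; the function names are hypothetical and the direct summation is chosen for clarity, not speed.

```python
import math


def binomial_upper_tail(n, p, k):
    """Return P(x >= k) for x ~ Binomial(n, p) by direct summation."""
    return sum(
        math.comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k, n + 1)
    )


def pp_value(n, n_match, p2):
    """Return the pp value, -log10 of the chance of n_match or more random matches."""
    if not 0 <= n_match <= n:
        raise ValueError("n_match must lie between 0 and n")
    if n_match == 0:
        return 0.0  # P(x >= 0) is exactly 1
    alpha = binomial_upper_tail(n, p2, n_match)
    # alpha can underflow to 0.0 for extreme matches; report infinity in that case.
    return math.inf if alpha == 0.0 else -math.log10(alpha)


# Hypothetical spectrum: 50 product ions, 10 of them matched, p_2 = 0.05.
print(pp_value(n=50, n_match=10, p2=0.05))
```

For very large *n* a log-space or library implementation of the binomial tail would be numerically safer than this summation.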

### The pp2 score

The second approach evaluates whether the total abundance of matched product ions in the experimental spectrum could be a random occurrence. *Y* is defined as the total abundance of experimental product ions that match product ions in a given theoretical spectrum:

$$Y = \sum_{i=1}^{n} I_i\,b_i \qquad (14)$$

where *I*_{i} is the standardized abundance of the *i*^{th} product ion in the experimental spectrum and *b*_{i} is defined in eqn 5. For clarity we define *y*_{i} = *I*_{i} *b*_{i} to give eqn 15.

$$Y = \sum_{i=1}^{n} y_i \qquad (15)$$

However, to complete the test we must know the distribution of *Y*. This distribution is unknown, and thus pp2 values cannot be calculated exactly in the way that pp values are calculated from the number of matched product ions. In order to estimate the pp2 value, three assumptions are needed:

1. *I*_{i} is identically and independently distributed across product ions in the experimental spectrum,

2. *b*_{i} is uncorrelated with *I*_{i} in the experimental spectrum,

3. the number of product ions, *n*, in the experimental spectrum is large (*n* > 30).

Under assumption 1, the mean *μ*_{I} and variance *σ*_{I}^{2} for the distribution of *I*_{i} are estimated by the sample mean and sample variance of the *n* standardized abundances in the experimental spectrum.

Since *y*_{i} = *I*_{i} *b*_{i}, assumption 2 yields eqn 17 under H_{0},

$$E(y_i) = E(I_i)\,E(b_i), \qquad E(y_i^2) = E(I_i^2)\,E(b_i^2) \qquad (17)$$

Thus, *μ*_{y} and *σ*_{y}^{2} can be estimated as:

$$\mu_y = p_2\,\mu_I, \qquad \sigma_y^2 = p_2\,(\sigma_I^2 + \mu_I^2) - p_2^2\,\mu_I^2 \qquad (18)$$

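According to the central limit theorem, *Y* is approximately normally distributed with the following parameters under assumption 3, i.e. when *n* is large (*n* > 30):

$$Y \sim N(\mu_Y, \sigma_Y^2), \qquad \mu_Y = n\,\mu_y, \qquad \sigma_Y^2 = n\,\sigma_y^2 \qquad (19)$$

The resulting probability density function is given in eqn 20.

$$f(Y) = \frac{1}{\sqrt{2\pi}\,\sigma_Y} \exp\!\left(-\frac{(Y - \mu_Y)^2}{2\,\sigma_Y^2}\right) \qquad (20)$$

And *μ*_{Y} and *σ*_{Y}^{2} are estimated by eqn 21.

$$\mu_Y = n\,p_2\,\mu_I, \qquad \sigma_Y^2 = n\,p_2\left[\sigma_I^2 + (1 - p_2)\,\mu_I^2\right] \qquad (21)$$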

The p-value, *α*, is the probability that *Y* for a random match is greater than or equal to that of the actual match, *I*_{match}, under H_{0}. The p-value becomes:

$$\alpha = P(Y \ge I_{match} \mid H_0) = \int_{I_{match}}^{\infty} f(Y)\,dY$$

resulting in the pp2 value, *β*, as follows:

$$\beta = -\log\!\left(\int_{I_{match}}^{\infty} f(Y)\,dY\right)$$

The pp2 value can be estimated very efficiently from this normal approximation. However, the real distribution of *Y* has a heavier tail toward larger values than the normal distribution. Therefore, pp2 values are overestimated when they are large.
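The normal approximation for pp2 can be sketched as follows. Several details are assumptions made for illustration: the function name `pp2_value` is hypothetical, the variance of *I*_{i} is estimated with an (*n* − 1) denominator, the moments of *y*_{i} are obtained by treating *b*_{i} and *I*_{i} as independent, and the normal upper tail is evaluated with the complementary error function.

```python
import math


def pp2_value(standardized, matched, p2):
    """Estimate the pp2 value from a normal approximation to Y = sum(I_i * b_i).

    standardized: standardized abundances I_i of the experimental product ions.
    matched:      booleans (b_i) marking ions that match the theoretical spectrum.
    p2:           probability that a single product ion matches at random.
    """
    n = len(standardized)
    if n < 2 or len(matched) != n or not 0.0 < p2 < 1.0:
        raise ValueError("need n >= 2, equal-length inputs and 0 < p2 < 1")
    mu_i = sum(standardized) / n
    # Sample variance of I_i; the (n - 1) denominator is an assumption of this sketch.
    var_i = sum((i - mu_i) ** 2 for i in standardized) / (n - 1)
    # Moments of y_i = I_i * b_i under H0, with b_i ~ Bernoulli(p2).
    mu_y = p2 * mu_i
    var_y = p2 * var_i + p2 * (1.0 - p2) * mu_i**2
    # Central limit theorem: Y is approximately N(n * mu_y, n * var_y).
    mu_total, var_total = n * mu_y, n * var_y
    i_match = sum(i for i, hit in zip(standardized, matched) if hit)
    z = (i_match - mu_total) / math.sqrt(var_total)
    alpha = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Y >= i_match)
    # alpha can underflow to 0.0 for extreme matches; report infinity in that case.
    return math.inf if alpha == 0.0 else -math.log10(alpha)


# Hypothetical spectrum: 40 ions with alternating abundances, the first 12 matched.
spectrum = [1.0, 0.5] * 20
hits = [True] * 12 + [False] * 28
print(pp2_value(spectrum, hits, p2=0.1))
```

As the text notes, the true distribution of *Y* has a heavier right tail than the normal curve, so large values returned by a sketch like this should be read as optimistic.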

### Distribution of pp value for random matches

When *q* = 0, the algorithm always assigns a pp value *β* = 0 because the experimental and theoretical precursor ions do not match. The cumulative distribution function for the pp value when *q* = 0 is shown in eqn 25.

$$F_{\beta \mid q = 0}(\beta) = 1 \quad (\beta \ge 0) \qquad (25)$$

In statistical hypothesis testing, the p-value of a continuous test statistic is uniformly distributed on the interval [0, 1] when the null hypothesis H_{0} is true. Therefore, the cumulative distribution function for the p-value of a random match is

*F*_{α|q = 1}(*α*) = *α* (0 ≤ *α*≤ 1) (26)

when *q* = 1. According to the definition of the pp value (eqn 11), the cumulative distribution function for the pp value when *q* = 1 is

*F*_{β|q = 1}(*β*) = 1 - 10^{-β} (*β* ≥ 0) (27)

and the probability density function is

$$f_{\beta \mid q = 1}(\beta) = \ln(10)\,10^{-\beta} \quad (\beta \ge 0) \qquad (28)$$

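Matches with pp or pp2 values under a critical value *β*_{c} > 0 are discarded, i.e. their pp values are assigned 0. Thus the distribution of the pp value for random matches returned by the algorithm is

$$F_{\beta \mid q = 1}(\beta) = \begin{cases} 1 - 10^{-\beta_c}, & 0 \le \beta < \beta_c \\ 1 - 10^{-\beta}, & \beta \ge \beta_c \end{cases} \qquad (29)$$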

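Likewise we can specify the unconditional distribution of pp values for random matches as follows. Since *q* has a Bernoulli (*p*_{1}) distribution, we have

$$F_\beta(\beta) = (1 - p_1)\,F_{\beta \mid q = 0}(\beta) + p_1\,F_{\beta \mid q = 1}(\beta)$$

For *β* = 0 the cumulative distribution function becomes,

$$F_\beta(0) = (1 - p_1) + p_1\,(1 - 10^{-\beta_c}) = 1 - p_1\,10^{-\beta_c}$$

and for *β* ≥ *β*_{c} > 0, it becomes,

$$F_\beta(\beta) = (1 - p_1) + p_1\,(1 - 10^{-\beta}) = 1 - p_1\,10^{-\beta}$$

The combined cumulative distribution function is thus,

$$F_\beta(\beta) = \begin{cases} 1 - p_1\,10^{-\beta_c}, & 0 \le \beta < \beta_c \\ 1 - p_1\,10^{-\beta}, & \beta \ge \beta_c \end{cases}$$

When *β* ≥ *β*_{c}, *F*_{β}(*β*) is continuous and the probability density function of the pp value for random matches is

$$f_\beta(\beta) = p_1\,\ln(10)\,10^{-\beta} \quad (\beta \ge \beta_c)$$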
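The piecewise distribution described above can be checked with a small simulation. The sketch below is illustrative only: the function names are hypothetical, the values *p*_{1} = 0.5 and *β*_{c} = 1 are deliberately unrealistic so that few draws are needed, and the closed-form CDF is the one implied by a uniform p-value, the Bernoulli (*p*_{1}) precursor match and the cutoff rule.

```python
import math
import random


def random_pp_cdf(beta, p1, beta_c):
    """CDF of the reported pp value of a random match (point mass at 0, tail above beta_c)."""
    if beta < 0.0:
        return 0.0
    if beta < beta_c:
        return 1.0 - p1 * 10.0 ** (-beta_c)
    return 1.0 - p1 * 10.0 ** (-beta)


def simulate_reported_pp(p1, beta_c, rng):
    """Draw one reported pp value for a random match."""
    if rng.random() >= p1:  # q = 0: precursor masses do not match
        return 0.0
    alpha = 1.0 - rng.random()  # uniform p-value on (0, 1]
    beta = -math.log10(alpha)
    return beta if beta >= beta_c else 0.0  # values under the cutoff are reported as 0


rng = random.Random(0)  # fixed seed so the run is repeatable
p1, beta_c = 0.5, 1.0  # illustrative values only
draws = [simulate_reported_pp(p1, beta_c, rng) for _ in range(200_000)]
zero_fraction = sum(1 for d in draws if d == 0.0) / len(draws)
print(zero_fraction, random_pp_cdf(0.0, p1, beta_c))
```

The simulated fraction of zero scores should sit close to 1 − *p*_{1} × 10^{-β_c}, the probability mass the distribution places at *β* = 0.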

### Confidence level for pp and pp2 values

The confidence level can also be determined for both pp and pp2. Suppose there are *r* theoretical spectra within the protein sequence database. If we assume that all theoretical spectra are uncorrelated, eqn 37 gives *φ*, the expected number of random matches that have a pp value greater than or equal to *β* under H_{0} for any given experimental spectrum.

$$\varphi = r\,\left[1 - F_\beta(\beta)\right] = r\,p_1\,10^{-\beta} \qquad (37)$$

The confidence level, *ψ*, is defined as

*ψ* = -log(*φ*) = -log(*r p*_{1} 10^{-β}) = *β* - log(*r*) - log(*p*_{1}) (38)

where *β* is either the pp or pp2 value, *r* is the number of theoretical spectra within the protein sequence database, and *p*_{1} is given in eqn 4. Confidence levels calculated from the pp value and the pp2 value are referred to as confidence level and confidence level2 respectively.

The confidence level is the negative common logarithm of the expected number of random matches with a pp value greater than or equal to the one observed for the corresponding experimental spectrum. Therefore, if the confidence level is below 0, more than one random match for the spectrum is expected and the corresponding match is highly suspect. From eqn 38, the confidence level increases linearly with the pp value. The confidence level also depends on the size of the database and degrades as the number of peptides created from the database increases.
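Eqn 38 translates directly into code. The sketch below uses hypothetical function names and made-up inputs to show how the database size *r* and the precursor match probability *p*_{1} shift a pp value into a confidence level.

```python
import math


def expected_random_matches(beta, r, p1):
    """Return phi = r * p1 * 10**(-beta), the expected number of random matches."""
    return r * p1 * 10.0 ** (-beta)


def confidence_level(beta, r, p1):
    """Return psi = beta - log10(r) - log10(p1), as in eqn 38."""
    if r <= 0 or not 0.0 < p1 <= 1.0:
        raise ValueError("r must be positive and p1 must lie in (0, 1]")
    return beta - math.log10(r) - math.log10(p1)


# Hypothetical search: pp value of 8, one million theoretical spectra, p_1 = 0.001.
print(confidence_level(beta=8.0, r=1_000_000, p1=1e-3))
print(expected_random_matches(beta=8.0, r=1_000_000, p1=1e-3))
```

A tenfold larger database lowers the confidence level by one unit for the same pp value, which is the degradation described above.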

### The protein pp score

The pp model is also used to calculate pp values for protein matches. Let *r*_{protein} denote the total number of theoretical spectra created from a protein sequence and *n*_{spectra} denote the total number of experimental spectra in the data set. The cross match of all experimental spectra with theoretical peptides for the protein sequence generates *n*_{match_protein} = *r*_{protein} × *n*_{spectra} potential matches. The sum of reported pp values of all matches for the protein is calculated from eqn 39.

$$B = \sum_{j=1}^{n_{match\_protein}} \beta_j \qquad (39)$$

where *β*_{j} is the reported pp value of the *j*^{th} potential match.

The statistical model is used to test the following hypotheses: H_{0}: all peptide matches for a given protein are random; and H_{A}: at least one peptide match for a given protein is not random. We assume that the *r*_{protein} theoretical spectra created from the protein sequence are uncorrelated with each other and that the *n*_{spectra} experimental spectra from the data set are uncorrelated with each other. Since *n*_{match_protein} is normally very large, *B* is approximately normally distributed with a mean of *μ*_{B} = *n*_{match_protein} × *μ*_{β} and a variance of *σ*_{B}^{2} = *n*_{match_protein} × *σ*_{β}^{2} according to the central limit theorem.

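According to the distribution of the pp value for random matches described above, the mean and variance of the pp value of a random match are given by the following equations:

$$\mu_\beta = p_1\,10^{-\beta_c}\left(\beta_c + \frac{1}{\ln 10}\right) \qquad (40)$$

$$\sigma_\beta^2 = p_1\,10^{-\beta_c}\left(\beta_c^2 + \frac{2\,\beta_c}{\ln 10} + \frac{2}{(\ln 10)^2}\right) - \mu_\beta^2 \qquad (41)$$

where *p*_{1} is given in eqn 4 and *β*_{c} is the pp value threshold. Likewise for the sum of pp values for the protein, *B*, the mean and variance for the distribution under H_{0} are given in eqn 42:

$$\mu_B = n_{match\_protein}\,\mu_\beta, \qquad \sigma_B^2 = n_{match\_protein}\,\sigma_\beta^2 \qquad (42)$$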

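The p-value for a protein, *α*_{protein}, is defined to be the probability that the protein hit can have a sum of pp values from all its peptide matches greater than or equal to *B* under H_{0}. Thus *α*_{protein} is given by

$$\alpha_{protein} = \int_{B}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_B} \exp\!\left(-\frac{(t - \mu_B)^2}{2\,\sigma_B^2}\right) dt$$

and the protein pp value becomes

$$\text{protein pp value} = -\log(\alpha_{protein})$$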
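The protein-level calculation can be sketched end to end. This is a reading of the text, not the published implementation: the function names are hypothetical, the mean and variance of a random pp value are derived here by integrating the random-match density *p*_{1} ln(10) 10^{-β} over *β* ≥ *β*_{c}, and the normal upper tail is evaluated with the complementary error function.

```python
import math


def random_pp_moments(p1, beta_c):
    """Return (mean, variance) of the reported pp value of one random match.

    Derived for this sketch from the density p1 * ln(10) * 10**(-beta) on
    beta >= beta_c, with all remaining probability mass at beta = 0.
    """
    ln10 = math.log(10.0)
    tail = p1 * 10.0 ** (-beta_c)
    mean = tail * (beta_c + 1.0 / ln10)
    second_moment = tail * (beta_c**2 + 2.0 * beta_c / ln10 + 2.0 / ln10**2)
    return mean, second_moment - mean**2


def protein_pp_value(b_sum, r_protein, n_spectra, p1, beta_c):
    """Return -log10 of P(sum of random pp values >= b_sum) under the normal approximation."""
    n_match_protein = r_protein * n_spectra
    mu_beta, var_beta = random_pp_moments(p1, beta_c)
    mu_b = n_match_protein * mu_beta
    sigma_b = math.sqrt(n_match_protein * var_beta)
    z = (b_sum - mu_b) / sigma_b
    alpha = 0.5 * math.erfc(z / math.sqrt(2.0))
    # alpha can underflow to 0.0 for very strong proteins; report infinity in that case.
    return math.inf if alpha == 0.0 else -math.log10(alpha)


# Hypothetical protein: 40 theoretical spectra, 5000 experimental spectra,
# p_1 = 0.001, pp threshold 2, and a summed pp value of 25.
print(protein_pp_value(b_sum=25.0, r_protein=40, n_spectra=5000, p1=1e-3, beta_c=2.0))
```

Because almost all random matches are reported as 0, the sum *B* is dominated by a few non-zero terms, so the normal approximation is only as good as the assumption that *n*_{match_protein} is very large.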