Algorithm development
Experimental datasets
Two groups of experimental MS/MS datasets were used to train and evaluate the spectral quality evaluation algorithm.
-
1)
The UCD dataset was generated in-house and consisted of 142,582 MS/MS spectra from 22 LCn/MS/MS runs, including commercial standard proteins, cultured cell extracts, and human vascular proteins, were acquired from different samples over a 6 month period. Three different database search strategies were used to annotate the dataset.
Briefly, SEQUEST [4]/PeptideProphet [15]/ProteinProphet [16] was used to identify spectra whose annotated amino acid sequences were presented in the UniProtKB/Swiss-Prot database (release 6.0). InsPecT [17] was used to identify peptides based on sequence tags and pepNovo [18]/SPIDER [19] used a combined de novo/tag approach (see experimental methods for details). The three strategies represent distinct approaches to identifying peptides from experimental tandem mass spectra (typically only the first method is used in most labs).
If a spectrum is matched by any of the three search methods described, these were regarded as annotated. Of the 142,582 spectra, 16,999 were annotated and therefore classed as identifiable (see Additional file 1). The presence of mislabeled spectra in the training data could degrade the accuracy of prediction of the resulting classifier [20]. To minimize this, we applied a 9-fold cross-validation/k-nearest neighbor strategy (see Additional file 2: Appendix 1).
The 22 runs were randomly divided into training and test datasets such that the training dataset consisted of 12 runs with 9,044 identified and 69,584 unidentified, while the test dataset consisted of 10 runs with 7,955 identified and 55,999 unidentified.
-
2)
ISB dataset. The second dataset was the publicly available ISB dataset consisting of 22 LC/MS/MS runs of artificially generated protein mixture digests [21]. We used this dataset to compare identification probabilities predicted using our program to the identification frequencies observed by Tsur and coworkers who used a "blind" approach resulting in the identification of many modified proteins [18]. This dataset consists of 39,408 MS/MS spectra. Initial efforts using SEQUEST by ISB annotated 4,310 spectra, while Tsur and coworkers [18] were able to annotate an additional 1,176 spectra. Note that the number of spectra in the dataset appear to differ from those available from the website. This is because the DTA files generated by Keller and coworkers [21] clustered adjacent MS/MS scans of similar parent mass. Because our algorithm evaluates scans individually without clustering, the spectra annotations from ISB and Tsur and coworkers [18] were "declustered" to generate our dataset.
Spectral features
A set of spectral features, similar to those used by previous workers, was used to measure the quality of MS/MS spectra. We applied a unique normalization procedure in an attempt to capture specific information from b or y ions that may be present. The intensity of peaks was initially normalized by assigning to each peak its intensity rank within a local segment, i, of 56 m/z for each spectrum, this value being chosen because there normally will be no more than one y and b ion within such a segment unless the charge state of the precursor ion is 3 or greater. This improves the signal-to-noise ratio for the whole spectrum by taking advantage of the fact that b or y ions are typically the most intense ions within any local region of a peptide tandem mass spectrum [22]. We tested all features in combination, details are reported in Appendix 2.
In order to eliminate potential feature bias between spectra generated from singly and multiply charged precursors, these were distinguished using a similar algorithm similar to that described by Hansen and co-workers [23], and the maximum m/z value, was adjusted to the mass of the precursor ion for spectra derived from singly charged ions. We found no significant performance differences when we streamed our data into +1, +2, and +3 sets, in agreement with previous reports [9, 11]. The definition of each spectral feature can be found in the experimental methods section.
Combining spectral features by logistic regression
The next step is to combine spectral features in order to make predictions on their general quality. Numerous classification algorithms may be used. In general, non-linear classifiers such as quadratic discriminant analysis (QDA) or multi-layered artificial neural networks tend to provide better classification than linear classifiers such as linear discriminant analysis (LDA) if the dataset is not linearly separable. To test whether non-linear classifiers are likely to be advantageous in this case, the training dataset was used to train a quadratic discriminant function and a linear discriminant function. The functions were then evaluated using our UCD test dataset. In this case, no significant difference was observed (see Additional file 2: Appendix 2) and as result a linear classifier was chosen for its simplicity in application and interpretation.
Logistic regression is a method that allows the discrimination between two or more groups of samples based on a vector of given variables for each sample through a logistic function. The basis of the method is similar to that of linear discriminant analysis (LDA) in combining spectral features by a weighted linear function, such that,
where x
i
are the spectral features described previously and c
i
the corresponding coefficients. However, unlike LDA, logistic regression express D as a probability of being "true" through the use of a logistic function, such that,
where θ is a value between 0 and 1, indicating the probability of a spectrum with value D being "true" or in our case, being an identifiable spectrum. The principle of maximum likelihood is applied by iteratively finding the best estimates for coefficients c
i
such that the training data best fit equation 2. Unlike LDA, the discriminant function D is maximized by maximum likelihood, relaxing the assumptions required to construct the model. For example, the features do not need to be normally distributed and the number of identifiable and unidentifiable training samples need not be similar. Further, because the maximization of the logistic function is probability based, the significance of features can be easily evaluated through analysis of the t-statistic, given the standard errors of the estimated coefficients.
We selected spectral features for the final model in an iterative process, removing features that did not contribute significantly to the final discriminant model. A logistic regression model was computed using all features available, using the computed coefficient and t-statistic for each variable to interpret their contribution to the model. It was found that IntnRatio20% and H2ORatio did not contribute significantly based on our training set (see Additional file 2: Appendix 3). These features were then removed and the discriminant model recalculated.
Statistical modeling of the identifiable and unidentifiable spectra distributions
Once the features of a spectrum are combined into a discriminant score, a statistical model can be used to assess the likelihood of that spectrum being identifiable or unidentifiable based on the spectra distributions of the complete dataset. The first step is to build models for the identifiable and the unidentifiable spectra.
Figure 1 shows the distribution for the identified and unidentified UCD test spectra tallied within bins of width 0.25 according to the discriminant score, D. It is evident that identifiable spectra approximately follow a Gaussian distribution (dotted line), while the distribution of the unidentifiable spectra has a slight positive skew (longer right-tail), probably resulting from high quality but unidentifiable spectra (i.e. misclassified) (see e.g. [11, 12]). Thus, it would be reasonable to disregard the skewness and also model the unidentifiable spectra using a Gaussian distribution. Based on Gaussian distributions, the probability that a spectrum has a discriminant score, D, given that it is identifiable can be computed as,
where + corresponds to "identifiable spectrum", μ+ and σ+ are the mean and standard deviation of the distribution respectively. The conditional probability p(D|-), where – corresponds to "unidentifiable spectrum" is similarly computed using the mean and standard deviation of the unidentifiable spectra distribution.
Since the estimated distributions will not match the observed distributions for each new dataset, the problem may now be treated as learning a mixture of two Gaussian distributions. An efficient algorithm that is commonly applied to perform unsupervised learning of mixture models is the expectation-maximization (EM) algorithm [24, 25]. This algorithm calculates the maximum likelihood estimation for fitting a given model. The PeptideProphet [15] software for predicting likelihoods for a correctly annotated MS/MS spectrum uses a similar approach. In this case, the EM algorithm is initialized using the prior probabilities and parameters of the distributions estimated from our combined test dataset (Figure 1). The EM algorithm optimizes these by iteratively calculating the expected probability assignment, p(+|D) for each spectrum, using equation 3 and Bayes' Law
In turn the EM algorithm uses the expected probability estimated to optimize the prior probability, p(+) and parameters σ and μ for the + and – distributions, such that
where n is the number of spectra in the dataset.
The prior probability for the unidentifiable spectra is (1-p(+)). The distribution parameters are calculated similarly but use (1-p(+|D
i
)) as the likelihood estimate for a spectrum being unidentifiable.
The EM algorithm is allowed to run until there are no significant changes to the estimated parameters between iterations. Supplementary Figure 2 (see Additional file 3) show examples of the algorithm used to fit datasets with different spectra distributions. In general the predicted identifiable and unidentifiable spectra distributions match the observed well, especially for the UCD dataset. For the ISB example, the unidentified spectra distribution is less well modeled; this may be a consequence of the smaller number of spectra available, or may reflect the need to include an additional distribution for effective modeling. Nevertheless, the predicted model still provides a reasonable estimate and importantly, the identifiable distribution is modeled well which is of principal importance for finding high quality spectra.
Illustrative examples
Removal of low quality unidentifiable spectra
We demonstrate the use of the assigned probabilities as a method for removing spectra that are unlikely to be identifiable from the test datasets. Runs of the UCD test dataset and ISB dataset were each analyzed and modeled separately using msmsEval. To demonstrate the accuracy of the algorithm, the estimated fraction of identifiable spectra removed at various points (the estimated fraction is calculated from the corresponding percentile values from the identifiable spectra Gaussian plot) is plotted against the actual observed fractions (Figure 2A&B).
For the UCD test dataset (Figure 2A), it can be seen that observed and predicted fractions for the identified spectra show reasonable agreement. The error bars indicate that there is some variance in the prediction between the datasets, this is likely due to the diversity of runs within the test dataset. As expected, the variance between runs from the ISB dataset is lower than for the UCD dataset (Figure 2B). This is because runs within the ISB dataset contain a low number of proteins, are of similar constituents and was presumably acquired during a single study. In the UCD case the data comprised of diverse real world examples from an active proteomics lab. It is also observed that the predicted fraction slightly underestimates the observed fraction, probably because spectra annotated by SEQUEST were not filtered by PeptideProphet (which was used to annotate the ISB dataset) and as a result there are a higher number of low quality identified spectra. Nevertheless, in both cases, the fraction of unidentified spectra removed is significantly higher than identified for fractions less than one.
Figure 2C shows the average fraction of unidentifiable spectra removed in relation to identified. For both datasets the shape of the receiver-operator-curve are similar. In general, the greater the number of unidentified spectra removed, greater the number of identifiable spectra also removed. For practical purposes in a proteomics laboratory, a user will want to maximize the removal of unidentifiable spectra without significant lost of identifiable spectra. From Figure 2C, it can be observed that the removal of 50% of unidentifiable spectra, resulting in a two-fold decrease in computer search time, will remove only 1–2% of identifiable spectra. This is consistent with other reports.
Guide for finding modified spectra
To demonstrate that the model is reliable in predicting the probability that a spectrum is identifiable, p(+|D), the estimated probability is plotted against the observed probability for the ISB dataset (Figure 3A). Spectra were sorted based on the predicted p(+|D), and for bins of 100 spectra, the average p(+|D) was calculated and plotted against the fraction of observed identifiable spectra in those same bin. Figure 3A shows that when p(+|D) is plotted against spectra that were only annotated using SEQUEST (crosses), the observed probabilities underestimate the predicted probabilities (e.g. only 60% of spectra with predicted p(+|D) = 0.9 are being identified when one would expect 90%). However with the addition of the spectra representing mostly modified peptide annotated by Tsur and coworkers (2005) [18], it can be seen that the observed probability becomes significantly closer to the predicted probability (diamonds).
Despite the improvement, the estimated probability still appears to over-estimate the observed, particularly for higher p(+|D). To investigate whether this discrepancy all remaining unidentified spectra with p(+|D) > 0.9 (673 spectra) were further analysed as described in the experimental section. Of these spectra, 212 were deemed to be correctly annotated confirming that there were indeed unannotated high quality spectra that were still unidentified. The newly annotated spectra belonged to one of the following three categories (a full list with annotation is available at the msmsEval website):
-
1)
Two new proteins. A number of spectra were annotated as peptides of bovine alpha-S1 and alpha-S2-casein which are proteins not from the original set of 18 standard proteins. 68 and 61 spectra were assigned to unique tryptic fragments of one of these proteins by SEQUEST and InsPecT respectively against a UniProt database (see experimental section for search description) giving very strong evidence that the annotations are correct. The bovine beta-casein preparation from Sigma C6905 is only > 90% pure, while alpha casein is also present in bovine milk.
-
2)
Further spectra of unmodified peptides of one of the 18 standard proteins were annotated. These were found by increasing the tolerance for mass errors allowed in our searches.
-
3)
Spectra of modified peptides of one of the 18 standard proteins. It is perhaps surprising that previous efforts by Tsur and coworkers [18] did not identify these spectra, however this could be due to a number of factors including the use of different parameters (e.g. maximum number of modifications allowed) or a different version of InsPecT. The majority of modifications observed were multiple methionine oxidations. However, we also observed amino acid polymorphisms. For example the mutation of Gln-1871 into Pro in rabbit myosin heavy chain. There is sufficient evidence that this annotation is correct as it is observed 10 times in the InsPecT blind search, in all cases with a p-value below 0.05. In addition, the mutation of glutamine into proline is reasonable given that it only requires the mutation of a single DNA base.
Notably, the inclusion of the newly annotated spectra into Figure 3A increases the correspondence between the predicted p(+|D) and the observed probability (circles). In fact, 83.7% of all spectra in the ISB dataset with a predicted p(+|D) of greater than 0.9 have now been annotated (Figure 3B).