A cross-validation scheme for machine learning algorithms in shotgun proteomics
© Granholm et al.; licensee BioMed Central Ltd. 2012
Published: 5 November 2012
Skip to main content
© Granholm et al.; licensee BioMed Central Ltd. 2012
Published: 5 November 2012
Peptides are routinely identified from mass spectrometry-based proteomics experiments by matching observed spectra to peptides derived from protein databases. The error rates of these identifications can be estimated by target-decoy analysis, which involves matching spectra to shuffled or reversed peptides. Besides estimating error rates, decoy searches can be used by semi-supervised machine learning algorithms to increase the number of confidently identified peptides. As for all machine learning algorithms, however, the results must be validated to avoid issues such as overfitting or biased learning, which would produce unreliable peptide identifications. Here, we discuss how the target-decoy method is employed in machine learning for shotgun proteomics, focusing on how the results can be validated by cross-validation, a frequently used validation scheme in machine learning. We also use simulated data to demonstrate the proposed cross-validation scheme's ability to detect overfitting.
Shotgun proteomics relies on liquid chromatography and tandem mass spectrometry to identify proteins in complex biological mixtures. A central step in the procedure is the inference of peptides from observed fragmentation spectra. This inference is frequently achieved by evaluating the resemblance between the experimental spectrum and a set of theoretical spectra constructed from a database of known protein sequences of the organism under consideration. If a peptide present in the database was analyzed in the mass spectrometer and triggered a fragmentation event, then the peptide can be identified by comparing the observed and theoretical fragmentation spectra. This matching procedure is carried out by search engines, such as Sequest , Mascot , X! Tandem  and Crux . Each generated match is referred to as a peptide-spectrum match (PSM) and is given a score, indicating the degree of similarity between the observed and the theoretical fragmentation spectrum. The best scoring peptide of a spectrum is referred to as the spectrum's top-scoring PSM, and normally only this PSM is kept for further analysis. Here, we refer to the search engine scores as raw scores, because they have not been calibrated and generally lack a direct statistical interpretation. Ideally, the best raw score is assigned to the PSM of the peptide that originally produced the spectrum. Subsequently, from the set of PSMs, the proteins in the sample can be inferred [5–8].
A large proportion of the fragmentation spectra in shotgun proteomics experiments are matched to peptides that were not present in the fragmentation cell when the spectrum was collected. We say that the top-scoring PSMs for these spectra are incorrect. In practice, the researcher generally chooses a score threshold above which PSMs are deemed significant and considered to be correct matches. However, because there are many error sources associated both with the mass spectrometer and the matching procedures, correct and incorrect PSMs cannot be completely discriminated using raw scores. For this reason, an important step in the analysis is to estimate the error rate associated with a given score threshold. These error rates, quantified using statistical confidence measures, are usually expressed in terms of the false discovery rate (FDR) [9–11], the expected fraction of false positives among the PSMs that are deemed significant. The closely related q value  is defined as the minimum FDR required to deem a PSM as correct. Thus, the q value provides a useful statistical quantity that can be readily assigned to each PSM individually.
Target-decoy analysis is arguably the most common approach for estimating error rates in shotgun proteomics. As described later, this approach uses searches against a shuffled decoy database to model incorrect matches. Besides error rate estimation, the target-decoy approach has been used to increase the score discrimination between correct and incorrect PSMs using semi-supervised machine learning [12–17]. This increased discrimination is highly valuable, because it typically results in a considerably higher number of confident peptide identifications. However, as we demonstrate below, improperly implemented machine learning approaches risk seriously damaging the quality of the results and the reliability of the corresponding estimated error rates. Without proper validation protocols, strong biases, such as overfitting, that undermine the basic assumptions of the target-decoy approach, will remain undiscovered.
Here, we describe the cross-validation procedure used by Percolator , a semi-supervised machine learning algorithm for post-processing of shotgun proteomics experiments. The procedure accurately validates the results by keeping training and validation sets separate throughout the scoring procedure. We begin by introducing the idea of target-decoy analysis. Subsequently, we focus on how ranking of PSMs can be improved by using machine learning algorithms. Finally, we will discuss how to validate the results from machine learning algorithms to ensure reliable results. The effect of the validation is demonstrated using an example based on simulated data.
Frequently, results from shotgun proteomics experiment are validated using the target-decoy analysis. The procedure provides a mean to empirically estimate the error rates by additionally matching the spectra against a decoy database. The decoy database consists of shuffled or reversed versions of the target database, which includes the protein sequences of the organism under consideration. As a consequence, the decoy database is assumed to make up a list of biologically infeasible protein sequences that are not found in nature. A spectrum matched against one of these sequences is termed a decoy PSM, as opposed to a standard target PSM, and is assumed to be incorrectly matched. The idea is that the decoy PSMs make a good model of the incorrect target matches, so that the error rates can be estimated . In this article we assume that the target and the decoy databases are searched separately. The other main strategy, which is not discussed here, is target-decoy competition, in which a single search is made through a combined target and decoy database .
To estimate the FDR corresponding to a certain score threshold with separate target-decoy searches, one first sorts all PSMs according to their score. Second, one takes all PSMs with scores equal to, or above, the threshold, and divide the number of decoy PSMs by the number of target PSMs. Third, this fraction is multiplied by the expected proportion of incorrect PSMs among all target PSMs, which can be estimated from the distribution of low-scoring matches [11, 20, 21]. To estimate q values, each PSM is assigned the lowest estimated FDR of all thresholds that includes it. With this approach, the researcher finds a score threshold that corresponds to a suitable q value, often 0.01 or 0.05, and uses this threshold to define the significant PSMs.
Let us now turn our attention on how we may improve the separation between correct and incorrect PSMs than by ranking PSMs by the search engine's raw scores alone. Correct and incorrect PSMs may have different distributions of other features than just the search engine's raw scores. We can hence design scoring functions that combine such features and obtain better separation between correct and incorrect PSMs. The features that we want to include in such a combined scoring function can be selected from a wide set of properties of the PSMs. The features might describe the PSM itself, such as the fraction of explained b- and y-ions; the PSM's peptide, such as the peptide's length; or the PSM's spectrum, such as the spectrum's charge state.
We can use machine learning techniques, such as support vector machines (SVMs) , artificial neural networks, or random forests to obtain an, by some criterion, optimal separation between labeled examples of correct and incorrect PSMs. The method that we will discuss here, Percolator, uses a semi-supervised machine learning technique, self-training  linear SVM , to increase the separation between correct and incorrect PSMs.  Semi-supervised machine learning algorithms can use decoy PSMs and a subset of the target PSMs as examples to combine multiple features of PSMs into scores that identify more PSMs than the original raw scores.
The target-decoy analysis relies on the assumption that the decoy PSMs are good models of the incorrect target PSMs. To extend the target-decoy analysis to include the scenario where we have combined different PSM features into one scoring function, we have to assure that the used PSM features for decoy PSMs are good models of the ones of incorrect target PSMs. For many features, this assumption requires that the target and decoy databases are as similar as possible. To assure the same amino acid composition, and size, the decoy is made from the target database by shuffling , using Markov  or bag-of-word models  or reversing [18, 19] it. Only reversing, however, promises the same level of sequence homogeneity between the two databases, as shuffling would lead to larger variation among decoy peptides than target peptides. Furthermore, to conserve the same peptide mass distribution between the two databases, the peptides are often pseudo-reversed . In that case, each amino acid sequence between two enzymatic cleavage sites is reversed, while the cleavage sites themselves remain intact.
In all mass spectrometry-based proteomics experiments random variation will make full separation between correct and incorrect PSMs very hard, if not impossible, to achieve. Such variation can be introduced during the experimental procedures, but also during the subsequent bioinformatics processes. Sample concentration, instrument type and sequence database composition  are just a few of many elements potentially hampering the search engine's separation performance.
Removing, or decreasing, the influence of confounding variables can improve the discrimination between correct and incorrect PSMs considerably. Machine learning approaches such as PeptideProphet , Percolator  or q-ranker  find the most discriminating features in each particular dataset, and combine these to improve the separation. On top of rendering results with additional information from the different features taken into account, the outputted score is less influenced by confounding variables, and has better discriminative performance. As an example, the effects of using Percolator scores instead of Sequest's XCorr are shown in Figure 1B.
Regardless of whether one uses an SVM, such as Percolator, or any other machine learning approach, it is necessary to validate the performance of the algorithm. As with the common raw scores, the target-decoy approach can be applied on the scores stemming from the trained learner, to estimate the new error rates of the identifications. However, the example data used for training the algorithm is not suitable for estimating the error rates, as the training examples are likely to be, at least somewhat, overfitted.
Overfitting is a common pitfall in statistics and machine learning, in which the classifier learns from random variations in the training data. [33, 34] Such learning is undesired, as it does not arise from overall trends and patterns that are generalizable to new data points. For this reason, all sound machine learning approaches keep an independent validation set separate from the training set. First, the classifier learns from the training set, to find the best scoring function. Second, the learned scoring function is applied on the validation set. This procedure helps avoid overfitting, and gives a better estimate of the performance. 
In shotgun proteomics, a naїve straightforward separation of the PSMs into a training set and a validation set would decrease the number of PSMs that can be outputted in the final results, as we cannot apply the learned SVM score on the set used for training. To avoid this, previous versions of Percolator employed duplicate decoy databases, one of which was used to drive the learning, and the second to apply the learned classifier on. The scores given to the PSMs by the second decoy database was used for estimating the error rates of the target PSMs. With this approach, however, the target PSMs are still used both for learning and validation, and the approach was thus removed from Percolator.
After the normalization, the three subsets of PSMs are merged, and the overall error rates are estimated by target-decoy analysis on all PSMs. The final result is a list of PSMs and accurate error rates, where correct and incorrect matches have been highly discriminated.
In the previous sections, we described a cross-validation procedure that assures that the machine learning algorithm only considers general patterns in the data, and not random variations within a finite dataset. However, the fundamental assumption that decoy PSMs are good models of incorrect target PSMs hasstill not been validated. This assumption can be validated by analyzing mixtures of known protein content, in which incorrect target PSMs are readily identified. Such validation experiments enable direct comparisons of these incorrect matches and the decoy PSMs. For machine learning algorithms, it is important to validate that each one of the features considered by the learner are indeed very similar between decoy and incorrect target PSMs. Else, the classifier would easily detect these features, and produce biased results. An example of such a feature is the number of PSMs matching to the same peptide sequence, which differs slightly between decoy and incorrect target PSMs. 
We evaluated the ability of our cross-validation strategy to avoid overfitting by letting it train on a series of simulated datasets. Each dataset consisted of 2500 target and 2500 decoy synthetic PSMs, described by 50 randomly generated features. All random features followed a normal distribution with mean of 0.0 and standard deviation of 1.0. To 1000 of the target synthetic PSMs, we added an off set of 10.0 to the first feature, to simulate correctly matched PSMs. With this procedure, 100 datasets were created, and the performance of Percolator was tested on each one of them. To demonstrate the effects of Percolator's cross-validation scheme, we also ran Percolator with the cross-validation protocol disabled.
Here, we discussed the cross-validation implementation used by Percolator to ensure the reliability of the machine learning output. With a three-fold cross-validation procedure, no data points are lost, while still keeping separate training and validation sets. The PSMs from the three resulting classifiers are merged after first normalizing the three scores, and the error rates are estimated using straightforward target-decoy analysis based on the normalized scores.
Although cross-validation is used by machine learning algorithms in all fields, merging the validated data afterwards is less common. In shotgun proteomics, normalizing scores and merging the data is a necessity, for instance to allow analyzing unique peptides, where multiple PSMs map to the same peptide sequence. Thus, a normalization procedure is a natural second step after the cross-validation. As there is no established general-purpose method to normalize the scores in the different cross-validation sets, we had to design our own heuristic procedure. As we have described here, we chose to linearly rescale the scores before merging the datasets. This procedure lacks support in the literature but seems to work well in practice.
This work was supported by grants from the Swedish Research Council, The Swedish Foundation for Strategic Research, the Lawski Foundation, and NIH awards R01 GM096306 and P41 GM103533.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 16, 2012: Statistical mass spectrometry-based proteomics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/13/S16.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.