Estimating statistical confidence using the target-decoy analysis
Frequently, results from shotgun proteomics experiments are validated using target-decoy analysis. The procedure provides a means to empirically estimate error rates by additionally matching the spectra against a decoy database. The decoy database consists of shuffled or reversed versions of the target database, which contains the protein sequences of the organism under consideration. The decoy database is consequently assumed to make up a list of biologically infeasible protein sequences that are not found in nature. A spectrum matched against one of these sequences is termed a decoy PSM, as opposed to a standard target PSM, and is assumed to be incorrectly matched. The idea is that the decoy PSMs make a good model of the incorrect target matches, so that the error rates can be estimated [18]. In this article we assume that the target and the decoy databases are searched separately. The other main strategy, which is not discussed here, is target-decoy competition, in which a single search is made through a combined target and decoy database [19].
To estimate the FDR corresponding to a certain score threshold with separate target-decoy searches, one first sorts all PSMs according to their score. Second, one takes all PSMs with scores greater than or equal to the threshold and divides the number of decoy PSMs by the number of target PSMs. Third, this fraction is multiplied by the expected proportion of incorrect PSMs among all target PSMs, which can be estimated from the distribution of low-scoring matches [11, 20, 21]. To estimate q values, each PSM is assigned the lowest estimated FDR of all thresholds that include it. With this approach, the researcher finds a score threshold that corresponds to a suitable q value, often 0.01 or 0.05, and uses this threshold to define the significant PSMs.
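As a concrete illustration, the following Python sketch (names and conventions are ours, assuming higher scores are better) estimates q values from separate target and decoy searches following the recipe above; the expected proportion of incorrect target PSMs, pi0, is taken as a given input.

```python
import numpy as np

def estimate_q_values(target_scores, decoy_scores, pi0=1.0):
    """Estimate q values for target PSMs from separate target and decoy searches.

    target_scores, decoy_scores: arrays of PSM scores (higher is better).
    pi0: expected proportion of incorrect PSMs among all target PSMs,
         in practice estimated from the distribution of low-scoring matches.
    Returns q values aligned with the target scores sorted in descending order.
    """
    target_scores = np.sort(np.asarray(target_scores))[::-1]  # descending
    decoy_scores = np.sort(np.asarray(decoy_scores))[::-1]

    # Taking each target PSM's score as a threshold, count the target and
    # decoy PSMs with scores greater than or equal to that threshold.
    n_targets = np.arange(1, len(target_scores) + 1)
    n_decoys = np.searchsorted(-decoy_scores, -target_scores, side="right")

    # Estimated FDR at each threshold: pi0 * (#decoys >= t) / (#targets >= t).
    fdr = pi0 * n_decoys / n_targets

    # q value: the lowest estimated FDR over all thresholds that include the PSM.
    return np.minimum(np.minimum.accumulate(fdr[::-1])[::-1], 1.0)
```

The resulting q values can then be thresholded at, e.g., 0.01 to define the significant PSMs.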
Target-decoy approach to machine learning
Let us now turn our attention to how we may improve the separation between correct and incorrect PSMs compared to ranking PSMs by the search engine's raw scores alone. Correct and incorrect PSMs may differ in the distributions of features other than the search engine's raw score. We can hence design scoring functions that combine such features and obtain better separation between correct and incorrect PSMs. The features included in such a combined scoring function can be selected from a wide set of PSM properties: they might describe the PSM itself, such as the fraction of explained b- and y-ions; the PSM's peptide, such as the peptide's length; or the PSM's spectrum, such as the spectrum's charge state.
We can use machine learning techniques, such as support vector machines (SVMs) [22], artificial neural networks, or random forests, to obtain an optimal separation, by some criterion, between labeled examples of correct and incorrect PSMs. The method that we will discuss here, Percolator [13], uses a semi-supervised machine learning technique, a self-training [23] linear SVM [24], to increase the separation between correct and incorrect PSMs. Semi-supervised machine learning algorithms can use decoy PSMs and a subset of the target PSMs as examples to combine multiple features of PSMs into scores that identify more PSMs than the original raw scores.
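As a rough sketch of how such a self-training loop can work (an illustration, not Percolator's actual implementation; the helper names and the use of scikit-learn's LinearSVC are our own, and estimate_q_values is the sketch from above):

```python
import numpy as np
from sklearn.svm import LinearSVC

def self_train(X_target, X_decoy, scores_t, scores_d, n_iter=10, train_fdr=0.01):
    """Iteratively refine a combined PSM score with a linear SVM.

    X_target, X_decoy: (n_psms, n_features) feature matrices.
    scores_t, scores_d: initial scores, e.g. the search engine's raw scores.
    """
    order = np.argsort(scores_t)[::-1]              # target PSMs, best first
    for _ in range(n_iter):
        # Positive examples: target PSMs confidently identified at the current
        # scores (q <= train_fdr); negative examples: all decoy PSMs.
        q = estimate_q_values(scores_t[order], scores_d)
        positives = X_target[order][q <= train_fdr]
        X = np.vstack([positives, X_decoy])
        y = np.concatenate([np.ones(len(positives)), -np.ones(len(X_decoy))])
        svm = LinearSVC().fit(X, y)  # hyperparameters set by grid search in practice
        # Rescore every PSM with the learned feature weights.
        scores_t = svm.decision_function(X_target)
        scores_d = svm.decision_function(X_decoy)
        order = np.argsort(scores_t)[::-1]
    return scores_t, scores_d
```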
The target-decoy analysis relies on the assumption that the decoy PSMs are good models of the incorrect target PSMs. To extend the target-decoy analysis to the scenario where we have combined different PSM features into one scoring function, we have to assure that the features of decoy PSMs are good models of those of incorrect target PSMs. For many features, this assumption requires that the target and decoy databases are as similar as possible. To assure the same amino acid composition and size, the decoy is made from the target database by shuffling [25], by using Markov [26] or bag-of-words models [27], or by reversing [18, 19] it. Only reversing, however, promises the same level of sequence homology between the two databases, as shuffling would lead to larger variation among decoy peptides than among target peptides. Furthermore, to conserve the peptide mass distribution between the two databases, the peptides are often pseudo-reversed [28]. In that case, each amino acid sequence between two enzymatic cleavage sites is reversed, while the cleavage sites themselves remain intact.
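A minimal sketch of pseudo-reversal for a tryptic digest, assuming cleavage after every K and R and ignoring details such as proline rules that a production tool would handle:

```python
def pseudo_reverse(protein, cleavage_sites="KR"):
    """Pseudo-reverse a protein for a tryptic decoy database: each stretch of
    residues between cleavage sites is reversed, while the cleavage sites
    themselves (here K and R) stay in place, so that the decoy peptides keep
    the masses and termini of their target counterparts."""
    decoy, segment = [], []
    for residue in protein:
        if residue in cleavage_sites:
            decoy.extend(reversed(segment))  # reverse the stretch before the site
            decoy.append(residue)            # keep the cleavage site in place
            segment = []
        else:
            segment.append(residue)
    decoy.extend(reversed(segment))          # the segment after the last site
    return "".join(decoy)

print(pseudo_reverse("MAGICKPEPTIDER"))  # -> CIGAMKEDITPEPR
```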
Confounding variables
In all mass spectrometry-based proteomics experiments, random variation makes full separation between correct and incorrect PSMs very hard, if not impossible, to achieve. Such variation can be introduced during the experimental procedures, but also during the subsequent bioinformatics processing. Sample concentration, instrument type and sequence database composition [29] are just a few of the many elements potentially hampering the search engine's separation performance.
Just as in many other measurement problems, it turns out that confounding variables have a considerable detrimental effect on the discriminative power of a search engine [30]. Confounding variables are variables that inadvertently correlate both with a property of the PSM's spectrum or peptide and with the search engine score. Thus, the score assigned to a PSM by the search engine does not exclusively indicate the quality of the match between peptide and spectrum, but also reflects the influence of confounding variables. A typical confounding variable for, e.g., Sequest's XCorr is the precursor ion's charge state: singly charged precursor spectra are known to have significantly lower XCorr values than multiply charged spectra [31]. Hence, the precursor charge state is a variable of the spectrum that also correlates with the search engine score. Figure 1A shows Sequest's XCorr for each spectrum, influenced by covarying properties such as charge state. The detrimental effect of this correlation between target and decoy scores becomes apparent when studying the figure: some spectra obtain high or low scores both against the target and the decoy database, regardless of whether their PSMs are correct or incorrect. Any score threshold will therefore inadvertently admit some incorrect PSMs from high-scoring spectra into the list of accepted target PSMs, while excluding some correct PSMs from low-scoring spectra.
Removing, or decreasing, the influence of confounding variables can improve the discrimination between correct and incorrect PSMs considerably. Machine learning approaches such as PeptideProphet [32], Percolator [13] or q-ranker [16] find the most discriminating features in each particular dataset and combine them to improve the separation. Besides incorporating the additional information of the different features, the resulting score is less influenced by confounding variables and has better discriminative performance. As an example, the effects of using Percolator scores instead of Sequest's XCorr are shown in Figure 1B.
Cross-validation
Regardless of whether one uses an SVM, as Percolator does, or any other machine learning approach, it is necessary to validate the performance of the algorithm. As with common raw scores, the target-decoy approach can be applied to the scores produced by the trained learner to estimate the new error rates of the identifications. However, the data used for training the algorithm are not suitable for estimating the error rates, as the classifier is likely to be, at least somewhat, overfitted to the training examples.
Overfitting is a common pitfall in statistics and machine learning, in which the classifier learns from random variations in the training data [33, 34]. Such learning is undesired, as it does not arise from overall trends and patterns that are generalizable to new data points. For this reason, all sound machine learning approaches keep an independent validation set separate from the training set. First, the classifier learns from the training set to find the best scoring function. Second, the learned scoring function is applied to the validation set. This procedure helps avoid overfitting and gives a better estimate of the performance [35].
In shotgun proteomics, a naïve separation of the PSMs into a training set and a validation set would decrease the number of PSMs that can be reported in the final results, as we cannot apply the learned SVM score to the set used for training. To avoid this, previous versions of Percolator employed duplicate decoy databases, one of which was used to drive the learning, and the other to apply the learned classifier to. The scores given to the PSMs by the second decoy database were used for estimating the error rates of the target PSMs. With this approach, however, the target PSMs are still used both for learning and validation, and the approach was thus removed from Percolator.
As opposed to using duplicate decoy databases, current versions of Percolator employ cross-validation, a common method to deal with small training sets in machine learning [33, 35–37]. Cross-validation means randomly dividing the input examples into a number of equally sized subsets and training the classifier multiple times, each time on all but one of the subsets. After each training procedure, the excluded subset is used for validation. The number of subsets can be varied, and is commonly denoted k. Consequently, in a k-fold cross-validation procedure, k - 1 subsets are used for training and one subset for validation. This is repeated k times, so that all possible combinations of training and validation sets are used. With this approach, all data points can be classified and validated, while the training set is still kept separate. To reliably score all PSMs, Percolator thus employs a three-fold cross-validation procedure, dividing the spectra into three equally sized subsets. The target and decoy PSMs from two of the subsets are used for training, and the PSMs of the spectra in the third subset for validation. The three-fold cross-validation procedure in Percolator is illustrated in Figure 2 and outlined in pseudo-code in Figure 3.
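In outline, the procedure can be sketched as follows, with train_fn and score_fn as hypothetical stand-ins for Percolator's SVM training and scoring steps:

```python
import numpy as np

def three_fold_scores(spectrum_ids, train_fn, score_fn, seed=0):
    """Randomly divide the spectra into three equal subsets; for each fold,
    train on the PSMs of two subsets and score the PSMs of the held-out third."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(spectrum_ids)), 3)
    scores = {}
    for k in range(3):
        held_out = folds[k]
        training = np.concatenate([folds[j] for j in range(3) if j != k])
        classifier = train_fn(training)          # learn on two of the subsets
        for i in held_out:                       # score only the unseen subset
            scores[spectrum_ids[i]] = score_fn(classifier, spectrum_ids[i])
    return scores
```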
An SVM can learn from training data using different settings, or hyperparameters [38]. The best set of hyperparameters for the dataset at hand is usually approximated by a so-called grid search. This search is performed by training and validating the classifier multiple times, each time with a different combination of hyperparameters. The hyperparameters with the best validation results are then used for the actual training. Percolator performs this grid search in a nested three-fold cross-validation step within each training set: the two training subsets are divided once again into three parts, of which two at a time serve as training data and the third as validation data. The nested cross-validation is performed for each combination of hyperparameters, so that the best combination can be chosen for training the classifier on the two top-level training sets. The nested cross-validation scheme used in Percolator is illustrated in Figure 4.
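A sketch of such a nested grid search over one top-level training set, again with hypothetical helper functions:

```python
import numpy as np
from itertools import product

def grid_search(train_idx, param_grid, train_fn, validate_fn):
    """Nested three-fold grid search over one top-level training set.

    param_grid: dict mapping hyperparameter names to lists of candidate values.
    train_fn/validate_fn: hypothetical training and validation routines; the
    validation result could, e.g., be the number of PSMs accepted at q <= 0.01."""
    inner = np.array_split(np.asarray(train_idx), 3)
    best_params, best_score = None, -np.inf
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid, values))
        score = 0.0
        for k in range(3):                       # inner three-fold loop
            held_out = inner[k]
            training = np.concatenate([inner[j] for j in range(3) if j != k])
            model = train_fn(training, **params)
            score += validate_fn(model, held_out)
        if score > best_score:                   # keep the best combination
            best_params, best_score = params, score
    return best_params
```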
Merging separated datasets
The cross-validation is necessary to prevent overfitting, but has the drawback that the three subsets of PSMs are scored by three different classifiers. These subsets cannot be directly compared, as each classifier produces a unique score learned from the features of its respective input examples. For the researcher, however, the three subsets have no experimental meaning, and they must be merged into a single list of PSMs. To merge data points from multiple classifiers, they are given a normalized score, called the SVM score, based on the separation between target and decoy PSMs. In Percolator, the normalization is performed after an internal target-decoy analysis within each of the three classified subsets. The subset score corresponding to a q value of 0.01 is fixed to an SVM score of 0, and the median of the decoy PSM scores in the subset is set to -1. Both the target and the decoy PSM scores within the subset are normalized with the same linear transformation, using the above constraints. Figure 5 outlines in pseudo-code how the normalization and merging are done in practice.
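A sketch of this normalization for a single subset, reusing estimate_q_values from above and assuming the subset contains PSMs at q ≤ 0.01:

```python
import numpy as np

def normalize_subset(target_scores, decoy_scores):
    """Linearly rescale one subset's scores so that the score at q = 0.01 maps
    to 0 and the median decoy score maps to -1 (same transformation for
    targets and decoys). Targets are returned in descending score order."""
    target_scores = np.sort(np.asarray(target_scores))[::-1]
    decoy_scores = np.asarray(decoy_scores)
    q = estimate_q_values(target_scores, decoy_scores)
    score_at_q01 = target_scores[q <= 0.01].min()  # lowest score accepted at q <= 0.01
    decoy_median = np.median(decoy_scores)
    scale = score_at_q01 - decoy_median            # this distance becomes 1
    return ((target_scores - score_at_q01) / scale,
            (decoy_scores - score_at_q01) / scale)
```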
After the normalization, the three subsets of PSMs are merged, and the overall error rates are estimated by target-decoy analysis on all PSMs. The final result is a single list of PSMs with accurate error rate estimates, in which correct and incorrect matches are well separated.
Other issues with validation
In the previous sections, we described a cross-validation procedure that assures that the machine learning algorithm only considers general patterns in the data, and not random variations within a finite dataset. However, the fundamental assumption that decoy PSMs are good models of incorrect target PSMs has still not been validated. This assumption can be validated by analyzing mixtures of known protein content, in which incorrect target PSMs are readily identified. Such validation experiments enable direct comparisons between these incorrect matches and the decoy PSMs. For machine learning algorithms, it is important to validate that each one of the features considered by the learner is indeed very similar between decoy and incorrect target PSMs. Otherwise, the classifier could pick up on these features and produce biased results. An example of such a feature is the number of PSMs matching the same peptide sequence, which differs slightly between decoy and incorrect target PSMs [39].
Simulated example
We evaluated the ability of our cross-validation strategy to avoid overfitting by letting it train on a series of simulated datasets. Each dataset consisted of 2500 target and 2500 decoy synthetic PSMs, described by 50 randomly generated features. All random features followed a normal distribution with a mean of 0.0 and a standard deviation of 1.0. To 1000 of the target synthetic PSMs, we added an offset of 10.0 to the first feature, to simulate correctly matched PSMs. With this procedure, 100 datasets were created, and the performance of Percolator was tested on each one of them. To demonstrate the effects of Percolator's cross-validation scheme, we also ran Percolator with the cross-validation protocol disabled.
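A dataset of this kind is straightforward to generate; the following sketch mirrors the description above (function and variable names are ours):

```python
import numpy as np

def simulated_dataset(rng, n_psms=2500, n_features=50, n_correct=1000, offset=10.0):
    """One synthetic dataset: 2500 target and 2500 decoy PSMs with 50
    standard-normal features; the first feature of 1000 target PSMs is
    offset by 10.0 to simulate correct matches."""
    targets = rng.normal(0.0, 1.0, size=(n_psms, n_features))
    decoys = rng.normal(0.0, 1.0, size=(n_psms, n_features))
    targets[:n_correct, 0] += offset   # the only informative signal
    return targets, decoys

rng = np.random.default_rng(0)
datasets = [simulated_dataset(rng) for _ in range(100)]  # 100 replicate datasets
```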
Given that q represents the q value, the ideal identification rate in the above experiment is 1000/(1 - q). In other words, we expect to find 1000 PSMs at a q value of 0, and more as we increase the q value and start to accept incorrect PSMs among the reported PSMs. As seen in Figure 6A, without cross-validation, Percolator overestimates the number of significant synthetic PSMs. With cross-validation, on the other hand, Percolator outputs results close to the ideal identification rate. Additionally, as seen in Figure 6B, cross-validation ensures that the identified synthetic PSMs are the correct ones; without it, the estimated error rates (q values) are not accurate.