Transfer posterior error probability estimation for peptide identification

Background In shotgun proteomics, database searching of tandem mass spectra results in a great number of peptide-spectrum matches (PSMs), many of which are false positives. Quality control of PSMs is a multiple hypothesis testing problem, and the false discovery rate (FDR) or the posterior error probability (PEP) is the commonly used statistical confidence measure. PEP, also called local FDR, can evaluate the confidence of individual PSMs and thus is more desirable than FDR, which evaluates the global confidence of a collection of PSMs. Estimation of PEP can be achieved by decomposing the null and alternative distributions of PSM scores as long as the given data is sufficient. However, in many proteomic studies, only a group (subset) of PSMs, e.g. those with specific post-translational modifications, are of interest. The group can be very small, making the direct PEP estimation by the group data inaccurate, especially for the high-score area where the score threshold is taken. Using the whole set of PSMs to estimate the group PEP is inappropriate either, because the null and/or alternative distributions of the group can be very different from those of combined scores. Results The transfer PEP algorithm is proposed to more accurately estimate the PEPs of peptide identifications in small groups. Transfer PEP derives the group null distribution through its empirical relationship with the combined null distribution, and estimates the group alternative distribution, as well as the null proportion, using an iterative semi-parametric method. Validated on both simulated data and real proteomic data, transfer PEP showed remarkably higher accuracy than the direct combined and separate PEP estimation methods. Conclusions We presented a novel approach to group PEP estimation for small groups and implemented it for the peptide identification problem in proteomics. The methodology of the approach is in principle applicable to the small-group PEP estimation problems in other fields.


Background
Identification of the proteins expressed in cells or tissues plays an essential role in proteomics. In shotgun proteomics, proteins are first digested into peptide mixture that is then analyzed via high-throughput tandem mass spectrometry (MS/MS), resulting in thousands to millions of MS/MS spectra in a typical experiment. Analysis of these spectra leads to a great number of candidate identifications of peptides. Protein sequences are inferred from reliably identified peptides, followed by qualitative or quantitative analysis. The peptide identification based on MS/MS has become one of the key problems in proteomics [1,2].
To identify the peptides, the MS/MS spectra are commonly searched against a protein sequence database. For each spectrum, candidate peptides from the database are scored according to the quality of their matches to the spectrum. The top scored peptidespectrum match (PSM) is taken as a candidate peptide identification. However, for many reasons, e.g. the incompleteness of the protein database or the imperfectness of the scoring function, the top-scored PSMs are not always correct identifications. Thus, filtering and quality control of PSMs after database search is necessary [3].
The scores of correct PSMs are usually higher in trend than those of incorrect PSMs, but they always have an overlap, resulting the difficulty in recognizing the correct PSMs. In early years, a simple way was to specify an empirical threshold and consider the PSMs with scores higher than the threshold as the correct ones. However, such threshold may not be appropriate, resulting in reduced accuracy or sensitivity of peptide identification. Thus, a quality control method that not only ensures the identification accuracy, but also does not sacrifice the identification sensitivity is needed. Quality control of PSMs can be dealt with as a multiple hypothesis testing problem [4,5]. Each PSM corresponds to a hypothesis test. The null hypothesis (H 0 ) is that the peptide is incorrectly identified, and the corresponding alternative hypothesis (H 1 ) is that the peptide is correctly identified. The most commonly used statistical confidence measure in multiple hypothesis testing is the false discovery rate (FDR) proposed by Benjamini and Hochberg [6]. FDR is defined as the expected proportion of incorrect ones among all rejections of null hypotheses.
At present, the common way to estimate the FDR of PSMs in proteomics is the targetdecoy database search approach [7]. The principle of the target-decoy approach is simple: the experimental MS/MS spectra are searched against a database which not only consists of the target protein sequences but also the same size of decoy protein sequences (typically the reverse sequences of the target proteins). Because an incorrect identification has an equal chance of being a match to the target sequences or to the decoy sequences, the number of decoy PSMs can be used as an estimate of the number of false target PSMs and the FDR of target PSMs can be estimated by the ratio of decoy PSMs to the target PSMs above the score threshold.
FDR measures the global confidence of a collection of PSMs with different scores, whereas one may be interested in the confidence of PSM(s) with a specific score. The posterior error probability (PEP, also known as local false discovery rate) is defined as the probability of a hypothesis being null given the test statistic, and consequently it can measure the confidence of individual tests [8]. In our case, the PEP of a PSM is the probability that this PSM is incorrect given its score. Let f (x) = π 0 f 0 (x) + π 1 f 1 (x) denote the probability density function (pdf ) of the scores of a collection of PSMs, with f 0 (x) being the pdf of the scores of incorrect PSMs, f 1 (x) the pdf of scores of correct PSMs, π 0 the proportion of incorrect PSMs, and π 1 = 1 − π 0 . Bayes' rule gives, FDR can be derived from PEP using a simple relationship between them, i.e., FDR(x) = E f {PEP(s)|s ≥ x}. Therefore, whenever possible, estimation of PEP is always more desirable than FDR.
PEP estimation relies on decomposing the mixture distribution of f (x). There are three approaches to achieve this aim in proteomics: parametric, semi-parametric, and non-parametric approaches. The early PeptideProphet [9] algorithm was a parametric approach, in which f 0 (x) and f 1 (x) are assumed to be specific types of distributions and their parameters are estimated from the observed scores using the EM (Expecting Maximization) algorithm. However, the parametric approach could be problematic if the assumption on the distribution types is inappropriate [2]. In addition, PeptideProphet did not make use of any decoy information to estimate f 0 (x). In the improved version of PeptideProphet [10,11], f 0 (x) is first derived directly from the scores of decoy PSMs using kernel density estimation, and then f 1 (x) and π 0 are estimated using a semi-parametric method [12]. This semi-parametric and semi-supervised approach is more flexible and stable. Different from Peptide-Prophet, which estimates f 0 (x) and f 1 (x) explicitly, the method proposed by Käll et al. [13] estimates f (x) directly with a non-parametric approach and estimates π 0 by bootstrap.
In proteomics, it is often the case that only a group (subset) of peptide identifications, e.g. those with specific post-translational modifications (PTMs) or from specific proteins, are focused on [14][15][16][17]. Thus, group FDR estimation is necessary. The most straightforward way to estimate the FDR of the group is to simply use the combined FDR estimated on all PSMs as the FDR for the PSM group of interest. However, due to the difference between the score distributions of the group and the whole set of PSMs, the combined FDR may be greatly different from the real group FDR at the same score threshold, leading to unreliable or failed quality control of peptide identifications in the group [14,18,19]. Estimating the group FDR separately on the group PSMs is certainly a better choice, which we name the separate FDR estimation method. However, for small groups, the number of PSMs in the group may not be sufficient for reliable estimation of the separate FDR, leading to overly conservative or liberal FDR estimation, especially for higher-score interval where observed decoy PSMs are even fewer [20][21][22].
Fu et al. [21] proposed the transfer FDR method for quality control of small groups of peptide identifications. Transfer FDR derives the group FDR from the combined FDR based on the relationship between them. A key component of transfer FDR is to fit the proportion of decoy PSMs belonging to the group as a function of PSM score, and extrapolate it to the high-score interval for group FDR estimation. Zhang et al. [23] and Li et al. [24] developed methods of similar rationales but less rigors in estimating the proportion of group decoy PSMs.
It is also desirable to evaluate the PEPs of individual PSMs in the group of interest. Similar to the case of FDR, two direct methods can be used to estimate the group PEP, i.e., the combined PEP (estimate the group PEP using the whole set of PSMs) and the separate PEP (estimate the group PEP solely using the PSMs in the group). However, these two methods have the same problems faced by combined FDR and separate FDR as mentioned above. Especially, when the group is very small, separate PEP estimation is even infeasible.
As far as we know, there are currently no group PEP estimation methods for small groups in proteomics and there are few in statistics. Efron [18] discussed the necessity of group PEP estimation and proposed a general approach, named class-wise fdr, based on the relationship between the group PEP and the combined PEP in the Bayesian framework. In order to calculate the relationship, class-wise fdr supposes the cases in the group under H 0 come from a normal distribution, which, however, may not hold in some applications, e.g. peptide identification.
Here, we present a group PEP estimation method, named transfer PEP, for quality control of small groups of peptide identifications. Inspired by the transfer learning technology [25], which transfers the knowledge from one domain to another domain for better learning with insufficient training data, transfer PEP builds on the empirical relationship between the group distribution and the combined distribution of PSM scores. When the group null distribution is different from the combined counterpart, transfer PEP derives it from the fitted proportion of group decoy PSMs among all decoy PSMs. When the group alternative distribution is different from the combined counterpart, transfer PEP estimates it, as well as π 0 , using a semi-parametric method. The accuracy and power of transfer PEP were validated on simulated data and real MS/MS data of peptides.

Algorithm
The aim is to estimate PEP G (x), the PEP of PSMs in a group G at arbitrary score x: where f G0 (x) and f G1 (x) are the pdfs of null and alternative distributions of group G, i.e. the pdfs of the scores of incorrect and correct PSMs in the group, respectively, π G0 is the proportion of incorrect PSMs in the group, and π G1 = 1 − π G0 . We deal with the situation in which the group G is so small that f G0 (x), f G1 (x) and π G0 cannot be estimated directly. We assume that the whole set of PSMs is always large enough such that f 0 (x), f 1 (x) and π 0 can be accurately estimated out, e.g., using the same algorithm as in PeptideProhpet. The rationale of our algorithm, transfer PEP, is to make use of the relationship between the group and combined score distributions to help estimate PEP G (x).

Estimation of π G0 f G0 (x)
When f G0 = f 0 , f 0 is directly used as f G0 . When f G0 = f 0 , we establish a relationship between them as follows. Define γ G (x) = Prob(G|H 0 , s ≥ x), where s is the PSM score. As we previously showed, γ G (x) can be readily fitted as a linear function of x using decoy PSMs, the given incorrect PSMs [21]. Let F 0 (x) and F G0 (x) denote the cumulative distribution functions (cdfs) of f 0 (x) and f G0 (x), respectively. Bayes' rule gives, Thus, Taking the derivatives of both sides of Eq. (4), we have where π G is the ratio of group PSMs to all PSMs, which can be directly calculated.

Estimation of f G1 (x) and π G0
When there is no established relationship available between them, and we estimate f G1 (x) and π G0 using a semi-parametric approach [10,12]. In this approach, f G1 (x) and π G0 are updated iteratively with an EMlike procedure. When f G0 = f 0 and f G1 = f 1 , π G0 is the only parameter that needs to be estimated. In this case, we estimate it using the same iterative procedure, which reduces to a standard EM algorithm in the simplest form. Algorithm 1 outlines the main steps of our transfer PEP algorithm. In the algorithm, the probability for each of the n group PSMs being correct is stored in a n-dimensional vector, θ G . In each iteration, π G1 is estimated by the average of θ G . f G1 (x) is estimated by Gaussian kernels, K(·), with θ G used as weights. Then, θ G is updated using the current π G1 , f G1 (x), and π G0 f G0 (x). The above procedure is repeated until θ G becomes stable.

Equality judgement
In order to use the algorithm, we need to judge whether f G0 = f 0 and f G1 = f 1 in practice. Define λ G (x) = Prob(G|H 1 , s ≥ x). Then, we have the following two conclusions: Whether γ G (x) is a constant can be judged by examining whether the fitted γ G (x) is a horizontal line. Similar to γ G (x), λ G (x) can be estimated by the proportion of correct matches belonging to the group: where FDR G (x) is the group FDR at score threshold x, N Gt (x) is the number of target PSMs in the group with scores > x, N Gd (x) is the number of decoy PSMs in the group with scores > x, N t (x) is the number of target PSMs with scores > x, and N d (x) is the number of decoy PSMs with scores > x. At varying x, we calculate the estimated value of λ G (x), and examine whether or not these values approximate some constant.

Algorithm 1 Transfer PEP
π 0 ,f 0 ,f 1 were estimated on the whole set of PSM scores.

Results
In order to validate the performance of the transfer PEP algorithm, we must be able to know the theoretical distribution of data so as to compare the estimated PEP to the theoretical PEP. However, the theoretical distribution is in general absent in the problem of peptide identification. Therefore, we prepared three different types of data to evaluate the accuracy and power of transfer PEP: (i) theoretical simulated data, (ii) simulated MS/MS data of peptides, and (iii) real MS/MS data of peptides.
Three methods for estimating the group PEP of peptide identifications were compared: combined PEP, separate PEP and transfer PEP. Combined PEP and separate PEP were estimated on the whole set of PSMs and on the PSMs in the group only, respectively, using the semi-parametric method as used in the PeptideProphet algorithm [10]. Transfer PEP was estimated using Algorithm 1 as described in the previous section.
Two criteria were used for evaluation: the consistency between the estimated PEP and the theoretical PEP, and the consistency between the estimated FDR and the real FDR. The estimated FDR was obtained by integration of the estimated PEP, and was used for evaluation on MS/MS data because the theoretical PEP was not available for them. The integrals of combined PEP, separate PEP and transfer PEP are denoted as iCombined FDR, iSeparate FDR and iTransfer FDR, respectively. Note that iTransfer FDR is not the transfer FDR which we proposed previously [21].

Theoretical simulated data
To evaluate the consistency between the estimated PEP and the theoretical PEP, we simulated sets of scores for the case f G0 = f 0 and f G1 = f 1 under the condition that γ G (x) = ax + b, in which a = 0 and b = 0. All the scores were divided into two complementary groups: G and Q. Assume all the scores are greater than or equal to 0. From Eq. (4) we have π G0 = bπ 0 π G . Bringing it into Eq. (5) yields According to the definition of γ G (x), we have Prob Thus if γ G (x) = ax + b and f 0 (x) are given, both f G0 (x) and f Q0 (x) are given as well.
γ G (x) was fitted as a linear function using the observed proportions of decoy scores belonging to group G above threshold x, as shown in Fig. 1. Notice that big deviation was observed at critical regions, i.e. large scores, which correspond to small FDRs and we care the most. This deviation was caused by the random fluctuation of the proportion calculated from very limited number of scores. The similar phenomenon was observed on MS/MS data (Figs. 3, 5 and 8). The proportions for large scores should be extrapolated from the fitted function. This is the very rational of transfer PEP. Figure 2 shows the results of the three PEP estimation methods in one simulation in which the number of scores from f G1 (x) is n = 10. As shown in Fig. 2a, both π G0 f G0 (x) Fig. 1 The linear fitting result of γ G (x) on the simulated data. The x-axis represents the score threshold and y-axis represents the proportion of decoy scores belonging to group G above the threshold and π G1 f G1 (x) estimated by combined PEP seriously deviated from the theoretical distributions. The result of separate PEP was much better, but still had significant deviations at some regions due to the insufficient sample size. Benefiting from the estimation of γ G (x), transfer PEP gave remarkably accurate estimates of both π G0 f G0 (x) and π G1 f G1 (x). The group PEP curve estimated by the transfer PEP was also the most accurate among the three methods, as shown in Fig. 2b.
To evaluate the average performance of each estimation method in the S simulations, we calculated the mean and standard deviation (SD) of mean squared error (MSE) between the estimates,PEP G , and the theoretical values, PEP G , for top scores (Ratio = 1%, 5%, 10%, 20%, 100%). The MSE in the j th simulation for the given values of Ratio and n (the number of correct scores generated from f G1 (x)) is calculated as: where N j denotes the number of top Ratio scores in the j th simulation, andPEP G,i,j and PEP G,i,j denote the estimated and theoretical PEPs of the i th score in the j th simulation, respectively. Then, we compute the mean and SD of MSEs over the S simulations as: The quality of the estimates provided by the three estimation methods in the configuration (n, Ratio) is measured by both Mean(n, Ratio) and SD(n, Ratio).  Table 1 shows the results. When the number of scores from f G1 (x) was small (n = 1, 10, 20, 50), both the mean and SD of MSE were very large for the combined PEP, especially for the high-score regions. The separate PEP was much better, but still deviated from the theoretical PEP G when the number of scores from f G1 (x) was too small (n = 1, 10, 20), especially for the high-score regions. For all the configurations of Ratio and n, the transfer PEP estimated the PEP G accurately. With increasing n and Ratio, the performances of both the combined PEP and the separate PEP gradually approached the performance of transfer PEP.

Simulated MS/MS data
We designed a simulation experiment for identification of variant peptides, i.e. peptides containing single amino acid variations. The simulated MS/MS spectra used here were part of the data used in [19]. Notes: Mean and SD, the mean and standard deviation of mean squared errors (MSEs) of estimates; n, the number of correct scores in group G; Ratio, the percentage of top scores whose MSEs were evaluated A total of 1,038,743 random tryptic peptide sequences were first generated. These peptides served as the non-variant peptides in the database to be searched. Then for each of these peptides, a variant peptide was generated by mutating one randomly selected amino acid of the peptide. Amino acids Isoleucine and Leucine were not allowed to be mutated between each other, and the peptide C-terminals were not allowed for mutation. The combination of these non-variant and variant peptides constituted the target database that was searched.
The simulated MS/MS spectra were composed of three parts: 20,000 variant spectra, 20,000 non-variant spectra and 80,000 noise spectra. The variant and non-variant spectra were theoretically generated from variant and non-variant peptides, respectively, which were randomly selected from the target database. The noise spectra were generated from additional sequences that were out of the target database.
In spectrum simulation, the mass-to-charge ratio (m/z) values of singly charged fragment ions of b and y types were predicted. The intensities of the fragment ions are randomly sampled from the uniform distribution. A number of noise peaks were generated and combined with fragment ions to form the MS/MS spectrum of the peptide. More details about the method to generate the simulated spectra can be found in Ref [14].
In each experiment, a dataset was constructed by including n(=1, 5, 10, 20, 50, 100) randomly selected variant spectra and 15000 randomly selected non-variant and noise spectra. The experiment was repeated 1000 times.
Mascot(v2.5.1) [27] was used as the search engine. Trypsin was specified as the proteolytic enzyme and no missed cleavage was allowed. The precursor and fragment mass matching tolerances were both 0.01 Da. No fixed or variable modifications were set for search. The database was searched in the target-decoy strategy by combining the target sequences with their reversed versions. Figure 3 gives examples of the linear fitting results of γ G (x) and λ G (x). As shown,γ G (x) is closely around the constant 0.5, andλ G (x) is closely around the constant 0.1. Thus, it was assumed that f G0 = f 0 and f G1 = f 1 held. Fig. 3 Examples of linear fitting results of a γ G (x) and b λ G (x) in the variant-peptide identification simulation study. The x-axis represents the score threshold, and the y-axis represents a the proportion of decoy variant PSMs among all decoy PSMs, or b the estimated proportion of correct variant PSMs among all variant PSMs above the score threshold x Figure 4 plots the estimated iCombined FDR, iSeparateFDR and iTransfer FDR against the real FDR at different FDR control levels (1-10%) for different group size n of variant spectra. As shown, iTransfer FDR was closest to the real FDR among the three estimates. Both iCombined FDR and iSeparate FDR remarkably deviated from the real FDR when the number of variant spectra was small. iSeparate FDR gradually approached to iTransfer FDR with increasing n, but iCombined FDR didn't. Table 2 compares the three FDR estimation methods, in terms of the mean and SD of the estimation errors as well as the average numbers of all and false variant PSMs obtained at 1% FDR control level. As shown, iCombined FDR dramatically deviated from the real FDR. iSeparate FDR was much better, but still deviated from the real FDR when the number of variant spectra was small (n = 1, 5,10,20). The results of iCombined and iSeparate FDR gradually approached to those of iTransfer FDR with increasing n. iTransfer FDR was the best among these three methods for all the numbers of variant spectra.
As the size of the group increases, the advantage of transfer PEP over other methods decreases. When the group size is large enough, the advantage vanishes. However, it is hard to say there is a fixed threshold at which the advantage disappears. It depends on the problem addressed, the dataset analyzed and other experimental conditions. According to our results, our method seems to be most effective when the group size is <50, and become comparable with other methods when the group size is >100.

Real MS/MS data
In this section, we compared the three PEP estimation methods on a real MS/MS dataset. The objective judgement of the identification correctness is absent, so we used the transfer FDR [21] as the comparative reference. Two datasets were used for identification of methylated peptides and variant peptides, respectively.

Methylated peptide identification
The MS/MS spectra in this dataset were from the draft map of human proteome described in Kim et al. [28], and were downloaded from the PRIDE data repository (https://www.ebi.ac.uk/pride/, dataset identifier PXD000561). Briefly, this draft map was from protein samples of 30 human tissues which were analyzed on high-resolution Fourier-transform mass spectrometers using HCD fragmentation. In this paper, only the spectra of brain tissue were analyzed which included 24 RAW files. Mascot(v2.5.1) [27] was used to identify the spectra. The protein sequence database searched was UniProt human protein database (v201506). All cysteines were assumed to be carbamidomethylated, and methionines were allowed to be oxidized. N-termini of peptides starting with glutamine residues were allowed to be pyroglutamined. N-termini of proteins were allowed to be acetylated. Both lysines and arginines were allowed to be methylated. Precursor and fragment mass matching tolerances were set as 10 ppm and 0.05 Da, respectively. Trypsin was specified as the proteolytic enzyme and up to two missed cleavages were permitted.
The linear fitting results of γ G (x) and λ G (x) are shown in Fig. 5. We can see thatγ G (x) varies in the interval [0,0.5], andλ G (x) is almost constant at 0.0028. Thus, we assumed f G0 = f 0 and f G1 = f 1 . Figure 6 shows the numbers of identified methylated PSMs after filteration by the three FDR methods (iCombined FDR, iSeparate FDR and iTransfer FDR) at different FDR conrol levels (1-10%). Figure 7a shows the consistency of the three methods with transfer FDR. It is clear that iTransfer FDR was the most conservative and consistent with transfer FDR, iSeparative FDR was comparable but a little liberal, and iCombined FDR seriously underestimated the FDR.

Variant peptide identification
The data used for identification of variant peptides, i.e. peptides containing single amino acid variations, was part of a colorectal cell line dataset, which has been described in detail in Li et al. [24]. Proteins were digested by trypsin and analyzed on an LTQ-Orbitrap mass spectrometer. Only the spectra of SW480 sample were analyzed in this paper.
Mascot(v2.5.1) [27] was used to identify the spectra. The protein sequence database searched was MS-CanProVar(v1.0) [24], which can be downloaded from http://canprovar. zhang-lab.org/. All cysteines were assumed to be carbamidomethylated, and methionines were allowed to be oxidized. N-termini of peptides starting with glutamine residues were allowed to be pyroglutamined. Precursor and fragment mass matching tolerances were set as 10 ppm and 0.5 Da, respectively. Trypsin was specified as the proteolytic enzyme and two missed cleavages were permitted.
The linear fitting results of γ G (x) and λ G (x) are shown in Fig. 8. Accordingly, we assumed f G0 = f 0 and f G1 = f 1 for this dataset. With the FDR control level set at 1%, 42, 36 and 32 variant PSMs were obtained by iCombined, iSeparate, and iTransfer The linear fitting results of a γ G (x) and b λ G (x) on the human proteome draft dataset. The x-axis represents the score threshold, and the y-axis represents a the proportion decoy methylated PSMs among all decoy PSMs, or b the estimated proportion of correct methylated PSMs among all methylated PSMs above the score threshold x Fig. 6 The numbers of methylated PSMs obtained with the three FDR methods at 0-10% FDR control level   Figure 7b shows that, similar to the result of methylated peptide identification, iTransfer FDR was the most consistent with transfer FDR.

Conclusions
In this paper, we have presented transfer PEP, the first solution to the problem of PEP estimation for small groups of peptide identifications in proteomics. By using the empirical relationship between the combined null distribution and the group null distribution of identification scores, transfer PEP makes possible accurate PEP estimation for data of very limited sample size. The small groups are not uncommon in proteomics. For example, when one focuses on identifying amino acid mutations [19] or open searching of PTMs [22,29], the concerned group is often very small, typically <50. Given the group null distribution, transfer PEP uses an iterative semi-parametric method to estimate the group alternative distribution and the null proportion. Because kernel density estimation is used, transfer PEP does not require the distribution forms to be known and thus is applicable to different scoring functions. The performance of transfer PEP was validated on both the simulated data and the real mass spectra datasets. Compared with the combined and separate PEPs, transfer PEP showed much more accuracy in estimating the PEP of small groups without loss of power. Estimation of PEP enables evaluation of the confidence of individual peptide identifications, which is desirable in many circumstances, e.g. protein inference [30]. Finally, it is worthwhile to note that transfer PEP is in principle adaptable to the small-group PEP estimation problems in other fields, as long as γ G (x) can be estimated, which is not limited to the linear form.