Transfer posterior error probability estimation for peptide identification

Yi, Xinpei; Gong, Fuzhou; Fu, Yan

doi:10.1186/s12859-020-3485-y

Methodology Article
Open access
Published: 04 May 2020

Xinpei Yi^1,2,
Fuzhou Gong^1,2 &
Yan Fu ORCID: orcid.org/0000-0001-6896-5931^1,2

BMC Bioinformatics volume 21, Article number: 173 (2020)

Abstract

Background

In shotgun proteomics, database searching of tandem mass spectra results in a great number of peptide-spectrum matches (PSMs), many of which are false positives. Quality control of PSMs is a multiple hypothesis testing problem, and the false discovery rate (FDR) or the posterior error probability (PEP) is the commonly used statistical confidence measure. PEP, also called local FDR, can evaluate the confidence of individual PSMs and thus is more desirable than FDR, which evaluates the global confidence of a collection of PSMs. Estimation of PEP can be achieved by decomposing the null and alternative distributions of PSM scores as long as the given data is sufficient. However, in many proteomic studies, only a group (subset) of PSMs, e.g. those with specific post-translational modifications, are of interest. The group can be very small, making the direct PEP estimation by the group data inaccurate, especially in the high-score region where the score threshold is typically set. Using the whole set of PSMs to estimate the group PEP is also inappropriate, because the null and/or alternative distributions of the group can be very different from those of the combined scores.

Results

The transfer PEP algorithm is proposed to more accurately estimate the PEPs of peptide identifications in small groups. Transfer PEP derives the group null distribution through its empirical relationship with the combined null distribution, and estimates the group alternative distribution, as well as the null proportion, using an iterative semi-parametric method. Validated on both simulated data and real proteomic data, transfer PEP showed remarkably higher accuracy than the direct combined and separate PEP estimation methods.

Conclusions

We presented a novel approach to group PEP estimation for small groups and implemented it for the peptide identification problem in proteomics. The methodology of the approach is in principle applicable to the small-group PEP estimation problems in other fields.

Background

Identification of the proteins expressed in cells or tissues plays an essential role in proteomics. In shotgun proteomics, proteins are first digested into a peptide mixture that is then analyzed via high-throughput tandem mass spectrometry (MS/MS), resulting in thousands to millions of MS/MS spectra in a typical experiment. Analysis of these spectra leads to a great number of candidate identifications of peptides. Protein sequences are inferred from reliably identified peptides, followed by qualitative or quantitative analysis. Peptide identification based on MS/MS has become one of the key problems in proteomics [1, 2].

To identify the peptides, the MS/MS spectra are commonly searched against a protein sequence database. For each spectrum, candidate peptides from the database are scored according to the quality of their matches to the spectrum. The top scored peptide-spectrum match (PSM) is taken as a candidate peptide identification. However, for many reasons, e.g. the incompleteness of the protein database or the imperfectness of the scoring function, the top-scored PSMs are not always correct identifications. Thus, filtering and quality control of PSMs after database search is necessary [3].

The scores of correct PSMs tend to be higher than those of incorrect PSMs, but the two score distributions always overlap, which makes it difficult to recognize the correct PSMs. In early years, a simple way was to specify an empirical threshold and consider the PSMs with scores higher than the threshold as the correct ones. However, such a threshold may not be appropriate, resulting in reduced accuracy or sensitivity of peptide identification. Thus, a quality control method that ensures the identification accuracy without sacrificing the identification sensitivity is needed. Quality control of PSMs can be dealt with as a multiple hypothesis testing problem [4, 5]. Each PSM corresponds to a hypothesis test. The null hypothesis (H₀) is that the peptide is incorrectly identified, and the corresponding alternative hypothesis (H₁) is that the peptide is correctly identified. The most commonly used statistical confidence measure in multiple hypothesis testing is the false discovery rate (FDR) proposed by Benjamini and Hochberg [6]. FDR is defined as the expected proportion of incorrect ones among all rejections of null hypotheses.

At present, the common way to estimate the FDR of PSMs in proteomics is the target-decoy database search approach [7]. The principle of the target-decoy approach is simple: the experimental MS/MS spectra are searched against a database that consists of not only the target protein sequences but also an equally sized set of decoy protein sequences (typically the reversed sequences of the target proteins). Because an incorrect identification has an equal chance of being a match to the target sequences or to the decoy sequences, the number of decoy PSMs can be used as an estimate of the number of false target PSMs, and the FDR of target PSMs can be estimated by the ratio of the number of decoy PSMs to the number of target PSMs above the score threshold.
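
As a minimal illustration of the ratio described above, the following Python sketch counts target and decoy PSMs at or above a threshold. The function name, the toy scores, the choice of `>=` for the comparison, and the cap at 1 are assumptions made for this example and are not taken from the paper.

```python
def target_decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR of target PSMs scoring at or above `threshold`."""
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0  # nothing accepted; reporting 0 is a convention chosen here
    # Decoy count estimates the number of false target PSMs.
    return min(1.0, n_decoy / n_target)


# Hypothetical toy scores, for illustration only.
targets = [12.1, 9.8, 8.5, 7.7, 6.9, 5.2, 4.8, 3.1]
decoys = [7.0, 4.9, 3.5, 2.2]
fdr_at_6 = target_decoy_fdr(targets, decoys, 6.0)  # 1 decoy / 5 targets
print(fdr_at_6)
```

With so few decoys above a high threshold, the ratio changes in large steps, which is the small-sample problem the rest of the paper addresses.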

FDR measures the global confidence of a collection of PSMs with different scores, whereas one may be interested in the confidence of PSM(s) with a specific score. The posterior error probability (PEP, also known as local false discovery rate) is defined as the probability of a hypothesis being null given the test statistic, and consequently it can measure the confidence of individual tests [8]. In our case, the PEP of a PSM is the probability that this PSM is incorrect given its score. Let f(x)=π₀f₀(x)+π₁f₁(x) denote the probability density function (pdf) of the scores of a collection of PSMs, with f₀(x) being the pdf of the scores of incorrect PSMs, f₁(x) the pdf of scores of correct PSMs, π₀ the proportion of incorrect PSMs, and π₁=1−π₀. Bayes’ rule gives,

$$\begin{array}{@{}rcl@{}} \text{PEP}(x) &=& \text{Prob}(H_{0}|x)=\frac{\pi_{0}f_{0}(x)}{f(x)} \end{array} $$

(1)

FDR can be derived from PEP using a simple relationship between them, i.e., FDR(x)=E_f{PEP(s)|s≥x}. Therefore, whenever possible, estimation of PEP is always more desirable than FDR.
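
The relationship between Eq. (1) and the FDR can be sketched in Python as follows. The two normal components, their parameters, and the function names are hypothetical choices for illustration; the FDR is approximated here by the average PEP of the accepted scores, which is an empirical stand-in for the conditional expectation above.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def pep(x, pi0, f0, f1):
    """Eq. (1): posterior probability that a PSM with score x is incorrect."""
    null_part = pi0 * f0(x)
    alt_part = (1.0 - pi0) * f1(x)
    return null_part / (null_part + alt_part)


def fdr_from_pep(scores, peps, threshold):
    """Average PEP over the scores at or above the threshold."""
    accepted = [p for s, p in zip(scores, peps) if s >= threshold]
    return sum(accepted) / len(accepted) if accepted else 0.0


# Hypothetical mixture: incorrect scores ~ N(2, 1), correct scores ~ N(6, 1).
f0 = lambda x: normal_pdf(x, 2.0, 1.0)
f1 = lambda x: normal_pdf(x, 6.0, 1.0)
scores = [3.0, 4.0, 5.0, 6.0]
peps = [pep(s, 0.5, f0, f1) for s in scores]
print(peps, fdr_from_pep(scores, peps, 5.0))
```

Because the PEP decreases with the score in this example, the FDR at a threshold is never larger than the PEP at that threshold.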

PEP estimation relies on decomposing the mixture distribution of f(x). There are three approaches to achieve this aim in proteomics: parametric, semi-parametric, and non-parametric approaches. The early PeptideProphet [9] algorithm was a parametric approach, in which f₀(x) and f₁(x) are assumed to be specific types of distributions and their parameters are estimated from the observed scores using the EM (Expectation Maximization) algorithm. However, the parametric approach could be problematic if the assumption on the distribution types is inappropriate [2]. In addition, PeptideProphet did not make use of any decoy information to estimate f₀(x). In the improved version of PeptideProphet [10, 11], f₀(x) is first derived directly from the scores of decoy PSMs using kernel density estimation, and then f₁(x) and π₀ are estimated using a semi-parametric method [12]. This semi-parametric and semi-supervised approach is more flexible and stable. Different from PeptideProphet, which estimates f₀(x) and f₁(x) explicitly, the method proposed by Käll et al. [13] estimates $\frac {f_{0}(x)}{f(x)}$ directly with a non-parametric approach and estimates π₀ by bootstrap.

In proteomics, it is often the case that only a group (subset) of peptide identifications, e.g. those with specific post-translational modifications (PTMs) or from specific proteins, is of interest [14–17]. Thus, group FDR estimation is necessary. The most straightforward way to estimate the FDR of the group is to simply use the combined FDR estimated on all PSMs as the FDR for the PSM group of interest. However, due to the difference between the score distributions of the group and the whole set of PSMs, the combined FDR may be greatly different from the real group FDR at the same score threshold, leading to unreliable or failed quality control of peptide identifications in the group [14, 18, 19]. Estimating the group FDR separately on the group PSMs is certainly a better choice, which we name the separate FDR estimation method. However, for small groups, the number of PSMs in the group may not be sufficient for reliable estimation of the separate FDR, leading to overly conservative or liberal FDR estimation, especially in the high-score interval where observed decoy PSMs are even fewer [20–22].

Fu et al. [21] proposed the transfer FDR method for quality control of small groups of peptide identifications. Transfer FDR derives the group FDR from the combined FDR based on the relationship between them. A key component of transfer FDR is to fit the proportion of decoy PSMs belonging to the group as a function of PSM score, and extrapolate it to the high-score interval for group FDR estimation. Zhang et al. [23] and Li et al. [24] developed methods with similar rationales but less rigor in estimating the proportion of group decoy PSMs.

It is also desirable to evaluate the PEPs of individual PSMs in the group of interest. Similar to the case of FDR, two direct methods can be used to estimate the group PEP, i.e., the combined PEP (estimate the group PEP using the whole set of PSMs) and the separate PEP (estimate the group PEP solely using the PSMs in the group). However, these two methods have the same problems faced by combined FDR and separate FDR as mentioned above. In particular, when the group is very small, separate PEP estimation may even be infeasible.

As far as we know, there are currently no group PEP estimation methods for small groups in proteomics and there are few in statistics. Efron [18] discussed the necessity of group PEP estimation and proposed a general approach, named class-wise fdr, based on the relationship between the group PEP and the combined PEP in the Bayesian framework. In order to calculate the relationship, class-wise fdr supposes the cases in the group under H₀ come from a normal distribution, which, however, may not hold in some applications, e.g. peptide identification.

Here, we present a group PEP estimation method, named transfer PEP, for quality control of small groups of peptide identifications. Inspired by the transfer learning technology [25], which transfers the knowledge from one domain to another domain for better learning with insufficient training data, transfer PEP builds on the empirical relationship between the group distribution and the combined distribution of PSM scores. When the group null distribution is different from the combined counterpart, transfer PEP derives it from the fitted proportion of group decoy PSMs among all decoy PSMs. When the group alternative distribution is different from the combined counterpart, transfer PEP estimates it, as well as π₀, using a semi-parametric method. The accuracy and power of transfer PEP were validated on simulated data and real MS/MS data of peptides.

Algorithm

The aim is to estimate PEP_G(x), the PEP of PSMs in a group G at arbitrary score x:

$$\begin{array}{@{}rcl@{}} \text{PEP}_{G}(x) &=& \text{Prob}(H_{0}|x,G)=\frac{\pi_{G0}f_{G0}(x)}{\pi_{G0}f_{G0}(x)+\pi_{G1}f_{G1}(x)} \end{array} $$

(2)

where f_G0(x) and f_G1(x) are the pdfs of null and alternative distributions of group G, i.e. the pdfs of the scores of incorrect and correct PSMs in the group, respectively, π_G0 is the proportion of incorrect PSMs in the group, and π_G1=1−π_G0.

We deal with the situation in which the group G is so small that f_G0(x), f_G1(x) and π_G0 cannot be estimated directly. We assume that the whole set of PSMs is always large enough such that f₀(x), f₁(x) and π₀ can be accurately estimated, e.g., using the same algorithm as in PeptideProphet. The rationale of our algorithm, transfer PEP, is to make use of the relationship between the group and combined score distributions to help estimate PEP_G(x).

Estimation of π_G0 f_G0(x)

When f_G0=f₀, f₀ is directly used as f_G0. When f_G0≠f₀, we establish a relationship between them as follows. Define γ_G(x)=Prob(G|H₀,s≥x), where s is the PSM score. As we previously showed, γ_G(x) can be readily fitted as a linear function of x using decoy PSMs, which are known to be incorrect [21]. Let F₀(x) and F_G0(x) denote the cumulative distribution functions (cdfs) of f₀(x) and f_G0(x), respectively. Bayes’ rule gives,

$$\begin{array}{@{}rcl@{}} \gamma_{G}(x) &=& \text{Prob}(G|H_{0},s\geq{x})\\ &=& \frac{\text{Prob}(G,H_{0})\text{Prob}(s\geq{x}|G,H_{0})}{\text{Prob}(H_{0})\text{Prob}(s\geq{x}|H_{0})}\\ &=& \frac{\text{Prob}(G,H_{0})(1-{F}_{G0}(x))}{\text{Prob}(H_{0})(1-{F}_{0}(x))}\\ &=& \frac{\pi_{G}\pi_{G0}(1-{F}_{G0}(x))}{\pi_{0}(1-{F}_{0}(x))} \end{array} $$

(3)

Thus,

$$\begin{array}{@{}rcl@{}} \pi_{G0}(1-F_{G0}(x)) &=& \frac{\pi_{0}(1-F_{0}(x))\gamma_{G}(x)}{\pi_{G}} \end{array} $$

(4)

Taking the derivatives of both sides of Eq. (4), we have

$$\begin{array}{@{}rcl@{}} \pi_{G0}f_{G0}(x) &=& {\frac{-\pi_{0}{(\gamma_{G}{(x)})}'(1-F_{0}(x))+\pi_{0}{\gamma_{G}{(x)}}f_{0}{(x)}}{\pi_{G}}} \end{array} $$

(5)

where π_G is the ratio of group PSMs to all PSMs, which can be directly calculated.
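
Eq. (5) can be evaluated directly once γ_G(x), f₀(x) and F₀(x) are available. The Python sketch below assumes a linear γ_G(x) (so its derivative is a constant slope) and uses an exponential null density purely as a stand-in; the function name and the value π_G=0.3 are hypothetical and chosen only for illustration.

```python
import math


def group_null_part(x, pi0, pi_g, f0, F0, gamma, gamma_slope):
    """Eq. (5): pi_G0 * f_G0(x) for a linear gamma_G with constant slope."""
    tail = 1.0 - F0(x)
    return (pi0 * gamma(x) * f0(x) - pi0 * gamma_slope * tail) / pi_g


# Stand-in null distribution (assumption): exponential with rate 1.
f0 = lambda x: math.exp(-x)
F0 = lambda x: 1.0 - math.exp(-x)

# Linear gamma_G(x) = a*x + b; the numbers are illustrative.
a, b = -0.01, 0.4
gamma = lambda x: a * x + b

pi0, pi_g = 0.65, 0.3  # pi_g is a hypothetical group proportion
value_at_0 = group_null_part(0.0, pi0, pi_g, f0, F0, gamma, a)
value_at_1 = group_null_part(1.0, pi0, pi_g, f0, F0, gamma, a)
print(value_at_0, value_at_1)
```

Note that the result is the product π_G0 f_G0(x), not the normalized density f_G0(x) itself, which is exactly the quantity needed in the numerator of Eq. (2).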

Estimation of f_G1(x) and π_G0

When f_G1=f₁, f₁ is directly used as f_G1. When f_G1≠f₁, there is no established relationship available between them, and we estimate f_G1(x) and π_G0 using a semi-parametric approach [10, 12]. In this approach, f_G1(x) and π_G0 are updated iteratively with an EM-like procedure. When f_G0=f₀ and f_G1=f₁, π_G0 is the only parameter that needs to be estimated. In this case, we estimate it using the same iterative procedure, which reduces to a standard EM algorithm in the simplest form.

Algorithm 1 outlines the main steps of our transfer PEP algorithm. In the algorithm, the probability for each of the n group PSMs being correct is stored in an n-dimensional vector, θ_G. In each iteration, π_G1 is estimated by the average of θ_G. f_G1(x) is estimated by Gaussian kernels, K(·), with θ_G used as weights. Then, θ_G is updated using the current π_G1, f_G1(x), and π_G0f_G0(x). The above procedure is repeated until θ_G becomes stable.
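
The iterative procedure described above can be sketched in Python as follows. This is not the authors' Matlab implementation: the initial value of θ_G, the fixed kernel bandwidth, the stopping tolerance, the function names and the toy data are all assumptions made for illustration. The sketch covers the case in which π_G0f_G0(x) is supplied as a known function, e.g. from Eq. (5).

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def weighted_kde(x, points, weights, bandwidth):
    """Gaussian kernel density estimate at x with per-point weights."""
    total = sum(weights)
    if total <= 0.0:
        return 0.0
    acc = sum(w * normal_pdf(x, p, bandwidth) for p, w in zip(points, weights))
    return acc / total


def transfer_pep_sketch(scores, null_part, bandwidth=1.0, tol=1e-6, max_iter=500):
    """Iterate theta_G until stable; null_part(x) returns pi_G0 * f_G0(x)."""
    n = len(scores)
    theta = [0.5] * n  # assumed starting value
    for _ in range(max_iter):
        pi_g1 = sum(theta) / n  # average of theta_G
        # pi_G1 * f_G1(x_i), with f_G1 estimated by theta-weighted kernels
        alt = [pi_g1 * weighted_kde(x, scores, theta, bandwidth) for x in scores]
        new_theta = []
        for x, alt_i in zip(scores, alt):
            denom = null_part(x) + alt_i
            new_theta.append(alt_i / denom if denom > 0.0 else 0.0)
        change = max(abs(u - v) for u, v in zip(theta, new_theta))
        theta = new_theta
        if change < tol:
            break
    return [1.0 - t for t in theta]  # PEP = 1 - probability of being correct


# Hypothetical group: seven low scores and three high scores.
scores = [1.0, 1.5, 2.0, 2.2, 2.5, 3.0, 1.8, 9.5, 10.0, 10.5]
null_part = lambda x: 0.7 * normal_pdf(x, 2.0, 1.0)  # assumed pi_G0 * f_G0
peps = transfer_pep_sketch(scores, null_part)
print([round(p, 3) for p in peps])
```

In this toy example the three high scores receive PEPs close to 0 and the low scores receive PEPs well above 0.5, as expected when the supplied null component already explains the low-score cluster.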

Equality judgement

In order to use the algorithm, we need to judge whether f_G0=f₀ and f_G1=f₁ in practice. Define λ_G(x)=Prob(G|H₁,s≥x). Then, we have the following two conclusions: (1) f_G0=f₀ if and only if γ_G(x) is a constant, and (2) f_G1=f₁ if and only if λ_G(x) is a constant. Take γ_G(x) as an example. If γ_G(x) is a constant γ, then by using Eq. (5), we have $f_{G0}(x)=\frac {{\pi _{0}}\gamma {f_{0}(x)}}{\pi _{G}\pi _{G0}}=Cf_{0}(x)$, in which C is a constant. Because F_G0(∞)=CF₀(∞)=1, C=1. Thus, f_G0=f₀. On the other hand, when f_G0=f₀, $\gamma _{G}(x)=\frac {\pi _{G}\pi _{G0}}{\pi _{0}}$, which is a constant.

Whether γ_G(x) is a constant can be judged by examining whether the fitted γ_G(x) is a horizontal line. Similar to γ_G(x), λ_G(x) can be estimated by the proportion of correct matches belonging to the group:

$$\begin{array}{@{}rcl@{}} \hat{\lambda}_{G}(x) &=& \frac{N_{Gt}(x)(1-\text{FDR}_{G}(x))}{N_{t}(x)(1-\text{FDR}(x))}\\ &=& \frac{N_{Gt}(x)-N_{Gd}(x)}{N_{t}(x)-N_{d}(x)} \end{array} $$

(6)

where FDR_G(x) is the group FDR at score threshold x, N_Gt(x) is the number of target PSMs in the group with scores >x, N_Gd(x) is the number of decoy PSMs in the group with scores >x, N_t(x) is the number of target PSMs with scores >x, and N_d(x) is the number of decoy PSMs with scores >x. At varying x, we calculate the estimated value of λ_G(x), and examine whether or not these values approximate some constant.
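
Eq. (6) reduces to counting target and decoy PSMs above the threshold, as the following Python sketch shows. The function names and the toy scores are hypothetical; returning `None` when the denominator is not positive is a choice made here, not something specified in the paper.

```python
def count_above(scores, x):
    return sum(1 for s in scores if s > x)


def lambda_hat(group_targets, group_decoys, all_targets, all_decoys, x):
    """Eq. (6): estimated proportion of correct matches that belong to the group."""
    numerator = count_above(group_targets, x) - count_above(group_decoys, x)
    denominator = count_above(all_targets, x) - count_above(all_decoys, x)
    if denominator <= 0:
        return None  # estimate undefined at this threshold
    return numerator / denominator


# Hypothetical toy scores; the group PSMs are a subset of all PSMs.
all_targets = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0]
group_targets = [9.0, 6.0, 4.0]
all_decoys = [5.5, 3.5, 2.0]
group_decoys = [3.5]
estimate = lambda_hat(group_targets, group_decoys, all_targets, all_decoys, 4.5)
print(estimate)  # (2 - 0) / (6 - 1)
```

Evaluating the estimate over a grid of thresholds and inspecting whether the values stay near one level is the check described above.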

Results

In order to validate the performance of the transfer PEP algorithm, we must be able to know the theoretical distribution of data so as to compare the estimated PEP to the theoretical PEP. However, the theoretical distribution is in general absent in the problem of peptide identification. Therefore, we prepared three different types of data to evaluate the accuracy and power of transfer PEP: (i) theoretical simulated data, (ii) simulated MS/MS data of peptides, and (iii) real MS/MS data of peptides.

Three methods for estimating the group PEP of peptide identifications were compared: combined PEP, separate PEP and transfer PEP. Combined PEP and separate PEP were estimated on the whole set of PSMs and on the PSMs in the group only, respectively, using the semi-parametric method as used in the PeptideProphet algorithm [10]. Transfer PEP was estimated using Algorithm 1 as described in the previous section.

Two criteria were used for evaluation: the consistency between the estimated PEP and the theoretical PEP, and the consistency between the estimated FDR and the real FDR. The estimated FDR was obtained by integration of the estimated PEP, and was used for evaluation on MS/MS data because the theoretical PEP was not available for them. The integrals of combined PEP, separate PEP and transfer PEP are denoted as iCombined FDR, iSeparate FDR and iTransfer FDR, respectively. Note that iTransfer FDR is not the transfer FDR which we proposed previously [21].

Theoretical simulated data

To evaluate the consistency between the estimated PEP and the theoretical PEP, we simulated sets of scores for the case f_G0≠f₀ and f_G1≠f₁ under the condition that γ_G(x)=ax+b, in which a≠0 and b≠0. All the scores were divided into two complementary groups: G and Q. Assume all the scores are greater than or equal to 0. From Eq. (4) we have $\pi _{G0}=\frac {b\pi _{0}}{\pi _{G}}$. Substituting it into Eq. (5) yields

$$\begin{array}{@{}rcl@{}} f_{G0}(x) &=& \frac{-a(1-{F}_{0}(x))+(ax+b)f_{0}(x)}{b} \end{array} $$

(7)

According to the definition of γ_G(x), we have Prob(G|H₀)=γ_G(0)=b, and Prob(Q|H₀)=1−b. Because f₀(x)=Prob(G|H₀)f_G0(x)+Prob(Q|H₀)f_Q0(x), we have

$$\begin{array}{@{}rcl@{}} f_{Q0}(x) &=& \frac{f_{0}(x)-bf_{G0}(x)}{1-b} \end{array} $$

(8)

Thus if γ_G(x)=ax+b and f₀(x) are given, both f_G0(x) and f_Q0(x) are given as well.
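
A quick numerical check of Eqs. (7) and (8) is sketched below in Python. An exponential density stands in for f₀(x) as an assumption made to keep the example self-contained; the values of a and b, the integration range and the trapezoidal rule are likewise illustrative choices.

```python
import math

A, B = -0.01, 0.4  # gamma_G(x) = A*x + B, illustrative values


def f0(x):  # stand-in null density (assumption): exponential with rate 1
    return math.exp(-x)


def F0(x):
    return 1.0 - math.exp(-x)


def f_g0(x):  # Eq. (7)
    return (-A * (1.0 - F0(x)) + (A * x + B) * f0(x)) / B


def f_q0(x):  # Eq. (8)
    return (f0(x) - B * f_g0(x)) / (1.0 - B)


def integrate(f, lo, hi, steps=20000):
    """Composite trapezoidal rule."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, steps))
    return total * h


area_g0 = integrate(f_g0, 0.0, 30.0)
area_q0 = integrate(f_q0, 0.0, 30.0)
print(area_g0, area_q0)
```

Both derived functions integrate to approximately one over the range where γ_G(x) stays between 0 and 1, and mixing them with weights b and 1−b recovers f₀(x).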

In the simulation, we set γ_G(x)=−0.01x+0.4 and f₀(x)=Gamma(x,0.96,1.5), and derived f_G0(x) and f_Q0(x) using Eq. (7) and Eq. (8), respectively. The total number of scores was N=15000. The proportion of incorrect scores (from null distribution f₀) was π₀=0.65. Among the N₀ incorrect scores, N_G0 scores were generated from f_G0(x) with probability Prob(G|H₀)=b=0.4, and N_Q0=N₀−N_G0 scores were generated from f_Q0(x) with probability Prob(Q|H₀)=1−b=0.6. Among the N₁=N−N₀ correct scores (from alternative distribution f₁), n (=1, 10, 20, 50, 100) scores were generated from f_G1(x)=N(9,6) and N₁−n scores were generated from f_Q1(x)=N(10,6). Gamma and normal distributions were chosen to generate the incorrect and correct scores because they resemble the real distributions [10, 26]. To mimic the target-decoy strategy, N₀ decoy scores were generated. Among them, N_G0 scores were from f_G0(x) and N_Q0 scores were from f_Q0(x). This simulation was repeated S=1000 times.

γ_G(x) was fitted as a linear function using the observed proportions of decoy scores belonging to group G above threshold x, as shown in Fig. 1. Notice that large deviations were observed in the critical region, i.e. at large scores, which correspond to small FDRs and are what we care about the most. This deviation was caused by the random fluctuation of the proportion calculated from a very limited number of scores. A similar phenomenon was observed on MS/MS data (Figs. 3, 5 and 8). The proportions for large scores should be extrapolated from the fitted function. This is the very rationale of transfer PEP.
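
The fitting and extrapolation step can be sketched in Python as follows. Ordinary least squares is used here as an assumption; the paper does not state in this section how the line was fitted, and the function names and numbers are hypothetical.

```python
def gamma_observed(group_decoys, all_decoys, x):
    """Observed proportion of decoys with score >= x that belong to the group."""
    n_all = sum(1 for s in all_decoys if s >= x)
    n_group = sum(1 for s in group_decoys if s >= x)
    return n_group / n_all if n_all > 0 else None


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


# Hypothetical proportions at low thresholds, where many decoys remain.
thresholds = [0.0, 1.0, 2.0, 3.0]
proportions = [0.40, 0.39, 0.38, 0.37]
a, b = fit_line(thresholds, proportions)

# Extrapolate to a high score where almost no decoys are observed.
gamma_at_20 = a * 20.0 + b
print(a, b, gamma_at_20)
```

Fitting on thresholds where decoys are plentiful and extrapolating to the sparse high-score region avoids relying on proportions computed from only a handful of decoy scores.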

Figure 2 shows the results of the three PEP estimation methods in one simulation in which the number of scores from f_G1(x) is n=10. As shown in Fig. 2a, both π_G0f_G0(x) and π_G1f_G1(x) estimated by combined PEP seriously deviated from the theoretical distributions. The result of separate PEP was much better, but still had significant deviations at some regions due to the insufficient sample size. Benefiting from the estimation of γ_G(x), transfer PEP gave remarkably accurate estimates of both π_G0f_G0(x) and π_G1f_G1(x). The group PEP curve estimated by the transfer PEP was also the most accurate among the three methods, as shown in Fig. 2b.

To evaluate the average performance of each estimation method in the S simulations, we calculated the mean and standard deviation (SD) of mean squared error (MSE) between the estimates, $\hat {\text {PEP}}_{G}$, and the theoretical values, PEP_G, for top scores (Ratio=1%,5%,10%,20%,100%). The MSE in the j^th simulation for the given values of Ratio and n (the number of correct scores generated from f_G1(x)) is calculated as:

$$\text{MSE}_{j}(n,{Ratio})=\frac{1}{N_{j}}\sum\limits_{i=1}^{N_{j}}\left(\hat{\text{PEP}}_{G,i,j}-\text{PEP}_{G,i,j}\right)^{2} $$

where N_j denotes the number of top Ratio scores in the j^th simulation, and $\hat {\text {PEP}}_{G,i,j}$ and PEP_G,i,j denote the estimated and theoretical PEPs of the i^th score in the j^th simulation, respectively. Then, we compute the mean and SD of MSEs over the S simulations as:

$$\text{Mean}(n,{Ratio})=\frac{1}{S}\sum\limits_{j=1}^{S}\text{MSE}_{j}(n,{Ratio}) $$

$$\text{SD}(n,{Ratio})=\sqrt{\frac{1}{S}\sum\limits_{j=1}^{S}\left(\text{MSE}_{j}(n,{Ratio})-\text{Mean}(n,{Ratio})\right)^{2}} $$

The quality of the estimates provided by the three estimation methods in the configuration (n,Ratio) is measured by both Mean(n,Ratio) and SD(n,Ratio).
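
The three formulas above translate directly into code. In the Python sketch below, the function names and the toy numbers are hypothetical; the SD divides by S rather than S−1, following the formula as written.

```python
import math


def mse(estimated, theoretical):
    """Mean squared error between estimated and theoretical PEPs."""
    n = len(estimated)
    return sum((e - t) ** 2 for e, t in zip(estimated, theoretical)) / n


def mean_and_sd(values):
    """Mean and SD over simulations, with the SD normalized by S."""
    s = len(values)
    mean = sum(values) / s
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / s)
    return mean, sd


# Hypothetical estimated and theoretical PEPs for two top scores.
example_mse = mse([0.1, 0.2], [0.0, 0.4])
# Hypothetical MSE values from three simulations.
example_mean, example_sd = mean_and_sd([1.0, 2.0, 3.0])
print(example_mse, example_mean, example_sd)
```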

Table 1 shows the results. When the number of scores from f_G1(x) was small (n=1,10,20,50), both the mean and SD of MSE were very large for the combined PEP, especially for the high-score regions. The separate PEP was much better, but still deviated from the theoretical PEP_G when the number of scores from f_G1(x) was too small (n=1,10,20), especially for the high-score regions. For all the configurations of Ratio and n, the transfer PEP estimated the PEP_G accurately. With increasing n and Ratio, the performances of both the combined PEP and the separate PEP gradually approached the performance of transfer PEP.

Table 1 The PEP estimation errors of three methods on the simulated data

Simulated MS/MS data

We designed a simulation experiment for identification of variant peptides, i.e. peptides containing single amino acid variations. The simulated MS/MS spectra used here were part of the data used in [19].

A total of 1,038,743 random tryptic peptide sequences were first generated. These peptides served as the non-variant peptides in the database to be searched. Then for each of these peptides, a variant peptide was generated by mutating one randomly selected amino acid of the peptide. Isoleucine and leucine were not allowed to mutate into each other, and the C-terminal residues of the peptides were not mutated. The combination of these non-variant and variant peptides constituted the target database that was searched.

The simulated MS/MS spectra were composed of three parts: 20,000 variant spectra, 20,000 non-variant spectra and 80,000 noise spectra. The variant and non-variant spectra were theoretically generated from variant and non-variant peptides, respectively, which were randomly selected from the target database. The noise spectra were generated from additional sequences that were out of the target database.

In spectrum simulation, the mass-to-charge ratio (m/z) values of singly charged fragment ions of b and y types were predicted. The intensities of the fragment ions were randomly sampled from a uniform distribution. A number of noise peaks were generated and combined with the fragment ions to form the MS/MS spectrum of the peptide. More details about the method used to generate the simulated spectra can be found in Ref. [14].

In each experiment, a dataset was constructed by including n(=1, 5, 10, 20, 50, 100) randomly selected variant spectra and 15000 randomly selected non-variant and noise spectra. The experiment was repeated 1000 times.

Mascot(v2.5.1) [27] was used as the search engine. Trypsin was specified as the proteolytic enzyme and no missed cleavage was allowed. The precursor and fragment mass matching tolerances were both 0.01 Da. No fixed or variable modifications were set for search. The database was searched in the target-decoy strategy by combining the target sequences with their reversed versions.

Figure 3 gives examples of the linear fitting results of γ_G(x) and λ_G(x). As shown, $\hat {\gamma }_{G}(x)$ is closely around the constant 0.5, and $\hat {\lambda }_{G}(x)$ is closely around the constant 0.1. Thus, it was assumed that f_G0=f₀ and f_G1=f₁ held.

Figure 4 plots the estimated iCombined FDR, iSeparate FDR and iTransfer FDR against the real FDR at different FDR control levels (1–10%) for different group sizes n of variant spectra. As shown, iTransfer FDR was closest to the real FDR among the three estimates. Both iCombined FDR and iSeparate FDR remarkably deviated from the real FDR when the number of variant spectra was small. iSeparate FDR gradually approached iTransfer FDR with increasing n, but iCombined FDR did not.

Table 2 compares the three FDR estimation methods, in terms of the mean and SD of the estimation errors as well as the average numbers of all and false variant PSMs obtained at 1% FDR control level. As shown, iCombined FDR dramatically deviated from the real FDR. iSeparate FDR was much better, but still deviated from the real FDR when the number of variant spectra was small (n=1,5,10,20). The results of iCombined and iSeparate FDR gradually approached those of iTransfer FDR with increasing n. iTransfer FDR was the best among these three methods for all the numbers of variant spectra.

Table 2 Results achieved with the three methods for estimating variant FDRs on simulated data, with the FDR control level at 1%

As the size of the group increases, the advantage of transfer PEP over the other methods decreases. When the group size is large enough, the advantage vanishes. However, there is no fixed threshold at which the advantage disappears; it depends on the problem addressed, the dataset analyzed and other experimental conditions. According to our results, our method seems to be most effective when the group size is <50, and becomes comparable with the other methods when the group size is >100.

Real MS/MS data

In this section, we compared the three PEP estimation methods on real MS/MS datasets. Because an objective judgement of the identification correctness is not available for real data, we used the transfer FDR [21] as the comparative reference. Two datasets were used for identification of methylated peptides and variant peptides, respectively.

Methylated peptide identification

The MS/MS spectra in this dataset were from the draft map of human proteome described in Kim et al. [28], and were downloaded from the PRIDE data repository (https://www.ebi.ac.uk/pride/, dataset identifier PXD000561). Briefly, this draft map was from protein samples of 30 human tissues which were analyzed on high-resolution Fourier-transform mass spectrometers using HCD fragmentation. In this paper, only the spectra of brain tissue, which comprised 24 RAW files, were analyzed.

Mascot(v2.5.1) [27] was used to identify the spectra. The protein sequence database searched was UniProt human protein database (v201506). All cysteines were assumed to be carbamidomethylated, and methionines were allowed to be oxidized. N-termini of peptides starting with glutamine residues were allowed to be pyroglutamined. N-termini of proteins were allowed to be acetylated. Both lysines and arginines were allowed to be methylated. Precursor and fragment mass matching tolerances were set as 10 ppm and 0.05 Da, respectively. Trypsin was specified as the proteolytic enzyme and up to two missed cleavages were permitted.

The linear fitting results of γ_G(x) and λ_G(x) are shown in Fig. 5. We can see that $\hat {\gamma }_{G}(x)$ varies in the interval [0,0.5], and $\hat {\lambda }_{G}(x)$ is almost constant at 0.0028. Thus, we assumed f_G0≠f₀ and f_G1=f₁.

Figure 6 shows the numbers of identified methylated PSMs after filtering by the three FDR methods (iCombined FDR, iSeparate FDR and iTransfer FDR) at different FDR control levels (1–10%). Figure 7a shows the consistency of the three methods with transfer FDR. It is clear that iTransfer FDR was the most conservative and the most consistent with transfer FDR, iSeparate FDR was comparable but a little liberal, and iCombined FDR seriously underestimated the FDR.

Variant peptide identification

The data used for identification of variant peptides, i.e. peptides containing single amino acid variations, was part of a colorectal cell line dataset, which has been described in detail in Li et al. [24]. Proteins were digested by trypsin and analyzed on an LTQ-Orbitrap mass spectrometer. Only the spectra of SW480 sample were analyzed in this paper.

Mascot(v2.5.1) [27] was used to identify the spectra. The protein sequence database searched was MS-CanProVar(v1.0) [24], which can be downloaded from http://canprovar.zhang-lab.org/. All cysteines were assumed to be carbamidomethylated, and methionines were allowed to be oxidized. N-termini of peptides starting with glutamine residues were allowed to be pyroglutamined. Precursor and fragment mass matching tolerances were set as 10 ppm and 0.5 Da, respectively. Trypsin was specified as the proteolytic enzyme and two missed cleavages were permitted.

The linear fitting results of γ_G(x) and λ_G(x) are shown in Fig. 8. Accordingly, we assumed f_G0≠f₀ and f_G1=f₁ for this dataset. With the FDR control level set at 1%, 42, 36 and 32 variant PSMs were obtained by iCombined, iSeparate, and iTransfer FDRs, respectively. Figure 7b shows that, similar to the result of methylated peptide identification, iTransfer FDR was the most consistent with transfer FDR.

Conclusions

In this paper, we have presented transfer PEP, the first solution to the problem of PEP estimation for small groups of peptide identifications in proteomics. By using the empirical relationship between the combined null distribution and the group null distribution of identification scores, transfer PEP makes accurate PEP estimation possible for data of very limited sample size. Small groups are not uncommon in proteomics. For example, when one focuses on identifying amino acid mutations [19] or open searching of PTMs [22, 29], the group of interest is often very small, typically <50. Given the group null distribution, transfer PEP uses an iterative semi-parametric method to estimate the group alternative distribution and the null proportion. Because kernel density estimation is used, transfer PEP does not require the distribution forms to be known and thus is applicable to different scoring functions. The performance of transfer PEP was validated on both the simulated data and the real mass spectra datasets. Compared with the combined and separate PEPs, transfer PEP showed much higher accuracy in estimating the PEP of small groups without loss of power. Estimation of PEP enables evaluation of the confidence of individual peptide identifications, which is desirable in many circumstances, e.g. protein inference [30]. Finally, it is worthwhile to note that transfer PEP is in principle adaptable to the small-group PEP estimation problems in other fields, as long as γ_G(x) can be estimated, which is not limited to the linear form.

Availability of data and materials

The transfer PEP algorithm was implemented in Matlab. The source code and the test data are available at http://fugroup.amss.ac.cn/software/TransferPEP/TransferPEP.html. The peptide mass spectra we used are publicly available.

Abbreviations

PSM: peptide-spectrum match
FDR: false discovery rate
PEP: posterior error probability
MS/MS: tandem mass spectrometry
pdf: probability density function
EM: expectation maximization
PTM: post-translational modification
cdf: cumulative distribution function
SD: standard deviation
MSE: mean squared error
m/z: mass-to-charge ratio

References

1. Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature. 2003; 422(6928):198.
2. Nesvizhskii AI, Vitek O, Aebersold R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat Methods. 2007; 4(10):787.
3. Nesvizhskii AI. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics. 2010; 73(11):2092–123.
4. Käll L, Storey JD, MacCoss MJ, Noble WS. Posterior error probabilities and false discovery rates: two sides of the same coin. J Proteome Res. 2007; 7(01):40–4.
5. Choi H, Nesvizhskii AI. False discovery rates and related statistical concepts in mass spectrometry-based proteomics. J Proteome Res. 2007; 7(01):47–50.
6. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodol). 1995; 57(1):289–300.
7. Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods. 2007; 4(3):207–14.
8. Efron B, Tibshirani R. Empirical Bayes methods and false discovery rates for microarrays. Genet Epidemiol. 2002; 23(1):70–86.
9. Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002; 74(20):5383–92.
10. Choi H, Ghosh D, Nesvizhskii AI. Statistical validation of peptide identifications in large-scale proteomics using the target-decoy database search strategy and flexible mixture modeling. J Proteome Res. 2007; 7(01):286–92.
11. Choi H, Nesvizhskii AI. Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. J Proteome Res. 2007; 7(01):254–65.
12. Robin S, Bar-Hen A, Daudin J-J, Pierre L. A semi-parametric approach for mixture models: Application to local false discovery rate estimation. Comput Stat Data Anal. 2007; 51(12):5483–93.
13. Käll L, Storey JD, Noble WS. Non-parametric estimation of posterior error probabilities associated with peptides identified by tandem mass spectrometry. Bioinformatics. 2008; 24(16):i42–8.
14. Fu Y. Bayesian false discovery rates for post-translational modification proteomics. Stat Interface. 2012; 5:47–59.
15. Noble WS. Mass spectrometrists should search only for peptides they care about. Nat Methods. 2015; 12(7):605.
16. Sticker A, Martens L, Clement L. Mass spectrometrists should search for all peptides, but assess only the ones they care about. Nat Methods. 2017; 14(7):643–44.
17. Li H, Park J, Kim H, Hwang K-B, Paek E. Systematic comparison of false-discovery-rate-controlling strategies for proteogenomic search using spike-in experiments. J Proteome Res. 2017; 16(6):2231–9.
18. Efron B. Simultaneous inference: When should hypothesis testing problems be combined? Ann Appl Stat. 2008; 2(1):197–223.
19. Yi X, Wang B, An Z, Gong F, Li J, Fu Y. Quality control of single amino acid variations detected by tandem mass spectrometry. J Proteomics. 2018; 187:144–51.
20. Huttlin EL, Hegeman AD, Harms AC, Sussman MR. Prediction of error associated with false-positive rate determination for peptide identification in large-scale proteomics experiments using a combined reverse and forward peptide sequence database strategy. J Proteome Res. 2007; 6(1):392–8.
21. Fu Y, Qian X. Transferred subgroup false discovery rate for rare post-translational modifications detected by mass spectrometry. Mol Cell Proteomics. 2014; 13(5):1359–68.
22. An Z, Zhai L, Ying W, Qian X, Gong F, Tan M, Fu Y. PTMiner: Localization and quality control of protein modifications detected in an open search and its application to comprehensive post-translational modification characterization in human proteome. Mol Cell Proteomics. 2019; 18(2):391–405.
23. Zhang J, Yang M-K, Zeng H, Ge F. GAPP: a proteogenomic software for genome annotation and global profiling of posttranslational modifications in prokaryotes. Mol Cell Proteomics. 2016; 15(11):116.
24. Li J, Su Z, Ma Z-Q, Slebos RJ, Halvey P, Tabb DL, Liebler DC, Pao W, Zhang B. A bioinformatics workflow for variant peptide detection in shotgun proteomics. Mol Cell Proteomics. 2011; 10(5):M110–006536.
25. Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2010; 22(10):1345–1359.
26. Ma K, Vitek O, Nesvizhskii AI. A statistical model-building perspective to identification of MS/MS spectra with PeptideProphet. BMC Bioinformatics. 2012; 13(S16):1.
27. Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999; 20(18):3551–67.
28. Kim M-S, Pinto SM, Getnet D, Nirujogi RS, Manda SS, Chaerkady R, Madugundu AK, Kelkar DS, Isserlin R, Jain S, et al. A draft map of the human proteome. Nature. 2014; 509(7502):575.
29. Kong AT, Leprevost FV, Avtonomov DM, Mellacheruvu D, Nesvizhskii AI. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics. Nat Methods. 2017; 14(5):513.
30. Nesvizhskii AI, Aebersold R. Interpretation of shotgun proteomic data: the protein inference problem. Mol Cell Proteomics. 2005; 4(10):1419–40.

Acknowledgements

We thank Prof. Mengqiu Dong from National Institute of Biological Sciences, Beijing, and Dr. Kun He from Shenzhen Institute of Computing Sciences, Shenzhen University for helpful discussions.

Funding

This work was supported by the National Key R&D Program of China (2018YFB0704304 and 2017YFC0908400) and the NCMIS CAS. The funders played no role in the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.

Author information

Authors and Affiliations

National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
Xinpei Yi, Fuzhou Gong & Yan Fu
School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
Xinpei Yi, Fuzhou Gong & Yan Fu

Contributions

FG and YF conceived the study. YF and XY designed the algorithm. XY implemented the algorithm and analyzed the data. XY and YF wrote the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Fuzhou Gong or Yan Fu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors consent to the publication of this manuscript.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Yi, X., Gong, F. & Fu, Y. Transfer posterior error probability estimation for peptide identification. BMC Bioinformatics 21, 173 (2020). https://doi.org/10.1186/s12859-020-3485-y

Received: 19 September 2019
Accepted: 08 April 2020
Published: 04 May 2020
DOI: https://doi.org/10.1186/s12859-020-3485-y