Score regularization for peptide identification
 Zengyou He^{1},
 Hongyu Zhao^{2} and
 Weichuan Yu^{3}
https://doi.org/10.1186/1471-2105-12-S1-S2
© He et al; licensee BioMed Central Ltd. 2011
Published: 15 February 2011
Abstract
Background
Peptide identification from tandem mass spectrometry (MS/MS) data is one of the most important problems in computational proteomics. This technique relies heavily on the accurate assessment of the quality of peptide-spectrum matches (PSMs). However, current MS technology and PSM scoring algorithms are far from perfect, leading to the generation of incorrect peptide-spectrum pairs. Thus, it is critical to develop new postprocessing techniques that can effectively distinguish true identifications from false ones.
Results
In this paper, we present a consistency-based PSM reranking method to improve the initial identification results. This method relies on one additional assumption: two peptides belonging to the same protein should be correlated with each other. We formulate an optimization problem that combines two objectives through regularization: the smoothing consistency among scores of correlated peptides and the fitting consistency between new scores and initial scores. This optimization problem can be solved analytically. The experimental study on several real MS/MS data sets shows that this reranking method improves the identification performance.
Conclusions
The score regularization method can be used as a general postprocessing step for improving peptide identifications. Source code and data sets are available at: http://bioinformatics.ust.hk/SRPI.rar.
Background
The identification of peptides by searching tandem mass spectrometry (MS/MS) spectra against a protein database is an essential technology in shotgun proteomics. Current peptide search engines such as Mascot [1] and Sequest [2] work on the principle of “query by spectrum”. They mainly use spectrum-associated information such as peak location (m/z), peak intensity and peak type (e.g., b-ion or y-ion) to perform peptide identification. Such spectrum-based database searching methods are far from satisfactory since random peptide-spectrum matches (PSMs) occur frequently in the identification results. These false assignments can be attributed to the poor quality of spectra, post-translational modifications (PTMs) of proteins and other unpredictable factors, making it challenging to distinguish correct identifications from incorrect ones.
To improve the identification performance, one possible solution is to incorporate extra information. For example, mass spectrometry is usually coupled with liquid chromatography (LC), which provides retention time measurement associated with the general biophysical characteristics of a peptide. The idea of using retention time for peptide identification has been discussed recently [3].
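In this paper, we instead exploit the relationship among peptides that belong to the same protein and rerank PSMs under two objectives:
1. Smoothing consistency: PSMs with similar peptides should have similar scores.
2. Fitting consistency: The initial ranking score provides valuable information. Thus, the new score of each PSM should not deviate too much from its original one.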
Here we use a linear combination of these two objectives and introduce a regularization parameter to control their relative importance. This optimization problem has a closed-form solution. We apply the proposed method to several real MS/MS data sets. Experimental results show that our method consistently outperforms the baseline method.
The rest of the paper is organized as follows: Section 2 presents the problem formulation and our method. Section 3 shows the experimental results. Section 4 discusses related methods and Section 5 concludes the paper.
Methods
In this section, we first introduce the problem and illustrate the underlying assumption using a motivating example. Then, we present a probability-based similarity measure to quantify the interpeptide affinity. Finally, we provide an optimization formulation under the regularization framework and discuss how to find the optimal solution efficiently.
Problem statement
Let ℂ = {(p_{1}, s_{1}), (p_{2}, s_{2}), …, (p_{ n }, s_{ n })} be a set of PSMs, where p_{ i } is a peptide sequence and s_{ i } is an MS/MS spectrum. The set ℂ is associated with a vector of initial ranking scores X = (x_{1}, x_{2}, …, x_{ n })^{ T } provided by a standard peptide search engine (i.e., the baseline ranker). There is no special requirement on the input ranking score: it may be any type of score (e.g., probability score, E-value).
As the baseline ranker tends to be imperfect, the goal of our reranking method is to find a new score vector Y = (y_{1}, y_{2}, …, y_{ n })^{ T } using the interpeptide relationship to improve the ranking.
Note that it is possible to have p_{ i } = p_{ j } for i ≠ j since the same peptide may be identified by multiple spectra.
A motivating example
Practically, the assumption that “peptides from the same protein should have similar ranking scores” is often violated. This is because different peptide sequences may have different physicochemical properties and fragmentation patterns, which result in different scores from database search engines even if the peptides are from the same protein. However, imposing such a consistency constraint is still beneficial for peptide identification. We explain our idea using a toy example.

Case 1: One PSM has a very high score and the other PSM has a score below the threshold. The consistency constraint will force their ranking scores to move towards each other. This may lift the second PSM above the threshold. This change will improve the identification performance if the second PSM is correct, but will lower the performance otherwise. Fortunately, the probability that the second PSM is correct is higher, since the probability that its parent protein exists is high (although the first peptide might be identified simply due to chance, it is unlikely that such an identification would have a high score).

Case 2: The score of the first PSM is barely above the threshold and that of the second PSM is very low. Penalizing their scores according to the consistency assumption will pull the first PSM down (below the threshold). Though the constituent peptide-level match scores of a truly present protein may vary widely, it is unlikely that none of these scores is high. Therefore, the probability that the first PSM is incorrect is high. In other words, there is a good chance of detecting this incorrect identification via consistency-based score adjustment.

Case 3 and Case 4: The result will not change.
The above example shows that penalizing inconsistent PSMs per protein may help us improve the initial identification performance. This observation motivates us to investigate the use of the consistency hypothesis in PSM reranking, even though such an assumption does not always hold in real proteomics experiments.
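The first two cases can be reproduced numerically with the closed-form solution given later as equation (9). The sketch below is only an illustration: the scores and the threshold are made-up values, the two peptides are assumed to be linked with weight 1 (so that S equals W), and λ = 0.5 follows the setting reported in the Results.

```python
import numpy as np

lam = 0.5            # regularization parameter (value used in the Results)
threshold = 3.0      # hypothetical score threshold
S = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # two peptides linked with weight 1, so S = W

def rerank(x):
    # Equation (9): Y* = lam (I - (1 - lam) S)^{-1} X, solved as a linear system
    return np.linalg.solve(np.eye(2) - (1.0 - lam) * S, lam * np.asarray(x))

case1 = rerank([10.0, 2.0])   # high score paired with a sub-threshold score
case2 = rerank([3.5, 0.5])    # barely passing score paired with a very low one
```

Under these assumptions, the low-scoring PSM in Case 1 is lifted above the threshold, while the barely passing PSM in Case 2 is pulled below it.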
Graph construction
where U_{ i } ∩ U_{ j } denotes the set of proteins that contain both p_{ i } and p_{ j }, and |·| denotes the number of elements in a set.
The matrix D, a diagonal matrix whose entries d_{ ii } = Σ_{ j } w_{ ij } are the row sums of W, will be used in the next subsection.
U_{1} = {A}, U_{2} = {A, C}, U_{3} = {B}, U_{4} = {B, C}, U_{5} = {A}.
Here are some characteristics of W:

It is a sparse matrix, which can be stored in a relatively small space.

The number of nonzero entries in each row varies significantly. This is because one peptide may have many neighbors while another peptide may have only a few neighbors.

There are no self-loops in the graph since the diagonal entries of W are zeros (w_{ ii } = 0).
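The following sketch builds a sparse similarity structure for the five protein sets listed above. The similarity formula itself is not reproduced in this text, so the Jaccard ratio used here is a placeholder assumption rather than the measure defined in the paper; the function names are likewise hypothetical. Only nonzero entries are stored, and self-loops are never created.

```python
from itertools import combinations

def build_similarity(protein_sets, similarity):
    """Return sparse symmetric weights {(i, j): w_ij} with no self-loops."""
    weights = {}
    for i, j in combinations(range(len(protein_sets)), 2):
        if protein_sets[i] & protein_sets[j]:  # neighbors share a protein
            w = similarity(protein_sets[i], protein_sets[j])
            weights[(i, j)] = w
            weights[(j, i)] = w
    return weights

def jaccard(u_i, u_j):
    # Placeholder similarity (an assumption, not the paper's measure).
    return len(u_i & u_j) / len(u_i | u_j)

# U_1 ... U_5 from the example, stored with 0-based indices.
U = [{"A"}, {"A", "C"}, {"B"}, {"B", "C"}, {"A"}]
W = build_similarity(U, jaccard)

# Row sums of W, one per peptide.
degree = [sum(w for (i, _), w in W.items() if i == k) for k in range(len(U))]
```

The all-pairs loop is quadratic in the number of peptides; for large data sets, an index from each protein to its peptides would avoid comparing peptides that share no protein.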
Regularization framework
Given the vector of initial scores X and the similarity matrix W, we compute a vector of new scores Y for the same set of PSMs by considering two objectives: smoothing consistency among similar peptides and fitting consistency between new scores and initial identification scores.
Smoothing consistency
where d_{ ii } and d_{ jj } are the diagonal elements of matrix D.
If related peptides (w_{ ij } > 0) have very inconsistent scores, then the value of L(Y) will be high.
In spectral graph theory [5], I – D^{–1/2}WD^{–1/2} is called the normalized Laplacian. The appendix gives the detailed proof of equation (4) and the corresponding interpretation based on spectral graph theory.
Fitting consistency
If peptides have scores that are inconsistent with their original scores, then the value of F(Y) will be high.
Score regularization
where λ ∈ (0, 1) is a regularization parameter controlling the balance between the smoothing consistency and the fitting consistency.
where S = D^{–1/2}WD^{–1/2} and I is the identity matrix.
After some algebraic derivation, we obtain the closed-form solution (see the appendix for a rigorous proof that the inverse always exists):
Y* = λ(I – (1 – λ)S)^{–1}X. (9)
For the toy example shown in Fig. 2, the closed-form solution gives a new ranking list (see the right part of Fig. 2).
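The intermediate equations are not reproduced in this text. One formulation that leads exactly to equation (9) is sketched below. It assumes the standard choices of [6]: a symmetric W, D holding the row sums of W, a smoothing term equal to the quadratic form of the normalized Laplacian, a squared Euclidean fitting term, and weights (1 – λ) and λ on the two terms. The symbol Q for the combined objective is introduced here for illustration only.

```latex
\begin{align*}
L(Y) &= \tfrac{1}{2}\sum_{i,j} w_{ij}
        \left(\frac{y_i}{\sqrt{d_{ii}}} - \frac{y_j}{\sqrt{d_{jj}}}\right)^{2}
      = Y^{T}(I - S)\,Y, \\
F(Y) &= \lVert Y - X \rVert^{2}, \\
Q(Y) &= (1-\lambda)\,L(Y) + \lambda\,F(Y), \\
\nabla Q(Y) &= 2(1-\lambda)(I - S)\,Y + 2\lambda\,(Y - X) = 0 \\
\Longrightarrow\quad
(I - (1-\lambda)S)\,Y^{*} &= \lambda X,
\qquad
Y^{*} = \lambda\,(I - (1-\lambda)S)^{-1} X .
\end{align*}
```

Since the eigenvalues of S lie in [–1, 1] and 0 < 1 – λ < 1, every eigenvalue of I – (1 – λ)S is strictly positive, which is why the inverse exists under these assumptions.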
Miscellaneous issues
Isolated nodes
To compute the closed-form solution, no diagonal entry d_{ ii } of D can be zero. In other words, there should be at least one nonzero entry in each row of W, which means that each peptide must have at least one similar peptide in the graph. For isolated peptides, we have two choices:

Exclude these peptides during graph construction, but keep their identification scores during reranking.

Introduce a dummy node as the neighbor of each isolated node and set the corresponding similarity value to an extremely small positive number (e.g., 10^{–8}).
In this paper, we adopt the second strategy for the sake of implementation simplicity.
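A minimal sketch of the second strategy is given below. It assumes one separate dummy node per isolated peptide, which is only one possible reading of the description; the function name and the dense NumPy representation are illustrative choices, not the authors' implementation. An initial score for each dummy node would also have to be chosen, and the text does not state which value is used.

```python
import numpy as np

def add_dummy_neighbors(W, eps=1e-8):
    """Attach one dummy node to every isolated node of W (zero row sum)."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    isolated = np.flatnonzero(W.sum(axis=1) == 0)
    m = len(isolated)
    W_aug = np.zeros((n + m, n + m))
    W_aug[:n, :n] = W
    for k, i in enumerate(isolated):
        W_aug[i, n + k] = eps  # tiny similarity keeps the row sum positive
        W_aug[n + k, i] = eps  # mirror the entry so W stays symmetric
    return W_aug, isolated

W = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.0, 0.0]])  # node 2 has no neighbors
W_aug, isolated = add_dummy_neighbors(W)
```

After augmentation every row of the matrix has a nonzero entry, so D^{–1/2} is well defined.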
Large-scale implementation
The matrix S is usually very sparse and needs relatively little storage space. However, (I – (1 – λ)S)^{–1} may be very dense and require a huge amount of storage. When computing (I – (1 – λ)S)^{–1} is infeasible due to space limitations, we use the following iteration [6] to find the solution:
Y(t + 1) = λX + (1 – λ)SY(t). (10)
It has been proved that the iteration converges to the closed-form solution Y* [6]. Since S is sparse, this method requires less storage space than computing the closed-form solution directly. Intuitively, the iteration can be understood as an information diffusion process on the graph. In each round, every node updates its score by linearly combining its own score and the scores of its neighbors.
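The agreement between iteration (10) and the closed-form solution (9) can be checked on a small graph. The sketch below uses made-up weights and scores, takes D to be the diagonal matrix of row sums of W (an assumption consistent with the isolated-node discussion above), and uses dense NumPy arrays for brevity; a sparse matrix type would be used for S in a large-scale setting. The stopping tolerance is also an illustrative choice.

```python
import numpy as np

def normalize(W):
    """S = D^{-1/2} W D^{-1/2}, with D assumed to hold the row sums of W."""
    inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return W * inv_sqrt[:, None] * inv_sqrt[None, :]

def closed_form(S, X, lam):
    # Equation (9), solved as a linear system instead of forming the inverse.
    n = S.shape[0]
    return np.linalg.solve(np.eye(n) - (1.0 - lam) * S, lam * X)

def iterate(S, X, lam, tol=1e-10, max_iter=10000):
    # Equation (10): Y(t + 1) = lam X + (1 - lam) S Y(t), started from Y(0) = X.
    Y = X.copy()
    for _ in range(max_iter):
        Y_next = lam * X + (1.0 - lam) * (S @ Y)
        if np.max(np.abs(Y_next - Y)) < tol:
            return Y_next
        Y = Y_next
    return Y

W = np.array([[0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])   # made-up similarity matrix
X = np.array([5.0, 4.0, 0.5, 3.0])      # made-up initial scores
S = normalize(W)
Y_direct = closed_form(S, X, lam=0.5)
Y_iter = iterate(S, X, lam=0.5)
```

Each iteration only needs one matrix-vector product with S, so the memory cost is governed by the number of nonzero entries of S rather than by the dense inverse.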
Protein inference
Peptide identification is only one intermediate step of protein identification. Though the PSM reranking strategy is able to improve peptide identifications, one may wonder whether it really helps in protein identification. Indeed, the fact that better peptide reranking results lead to better protein inference has been experimentally verified several times (e.g., [7]). Therefore, we focus on comparing the identification performance at the peptide level in this paper.
Results
We use the following parameters for peptide identification: monoisotopic masses, a mass tolerance of 2 Da for the precursor, a mass tolerance of 1 Da for fragment ions, a fixed modification on Cys and one missed cleavage site. We only consider b and y fragment ions in PSM scoring. We use the negative logarithm of the E-value of each PSM provided by X!Tandem as the initial ranking score. The criterion for filtering PSMs is E-value ≤ 0.1. Our method generates a set of new scores that better distinguishes correct identifications from incorrect ones. Throughout the experiments, the regularization parameter λ is fixed to 0.5 unless otherwise specified. In performance evaluation, a peptide-spectrum pair (p_{ i },x_{ i }) is labeled as a false positive if p_{ i } belongs to a decoy protein; otherwise, it is a true positive. Given a vector of ranking scores X = {x_{1}, x_{2},…, x_{ n }} and a score threshold δ, the true positive rate (TPR) is defined as the number of true positives above the threshold divided by the total number of true positives:
TPR(X, δ) = |{p_{ i } ∈ T : x_{ i } ≥ δ}| / |{p_{ i } ∈ T}|. (11)
Similarly, the false positive rate (FPR) is defined as:
FPR(X, δ) = |{p_{ i } ∈ ℝ : x_{ i } ≥ δ}| / |{p_{ i } ∈ ℝ}|. (12)
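Equations (11) and (12) translate directly into code once each PSM carries a target or decoy label. In the sketch below, the function name, the scores and the labels are made up for illustration; T corresponds to the PSMs whose peptide is not from a decoy protein and ℝ to those whose peptide is.

```python
def tpr_fpr(scores, is_decoy, threshold):
    """Return (TPR, FPR) at a score threshold, following equations (11) and (12)."""
    targets = [s for s, decoy in zip(scores, is_decoy) if not decoy]
    decoys = [s for s, decoy in zip(scores, is_decoy) if decoy]
    tpr = sum(s >= threshold for s in targets) / len(targets)
    fpr = sum(s >= threshold for s in decoys) / len(decoys)
    return tpr, fpr

scores = [9.1, 7.4, 6.0, 3.2, 2.5, 1.0]            # made-up ranking scores
is_decoy = [False, False, True, False, True, True]  # made-up decoy labels
tpr, fpr = tpr_fpr(scores, is_decoy, threshold=3.0)
```

Sweeping the threshold over all observed scores yields the points of an ROC curve, from which an AUC value can be computed.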
1. Our consistency-based reranking method provides consistent and substantial performance improvement on data sets DS2, DS3 and DS4. Note that our method does not require any prior knowledge or training data.
2. Although the overall improvement on DS1 is only marginal (AUC=0.64 vs. AUC=0.65), our method achieves a significantly higher true positive rate than the baseline method when the false positive rate is around 10%. This is a useful property, since the false positive rate is usually set to a relatively small value in practice.
Discussion
Here we briefly review previous works related to the ideas discussed in this paper.
PSM reranking
Many PSM scoring algorithms have been developed to facilitate accurate peptide identification. Mascot [1], Sequest [2] and X!Tandem [8] are the most widely used PSM scoring algorithms. These algorithms use the information in each single MS/MS spectrum to perform peptide inference. As discussed in the introduction, they suffer from the problem of generating incorrect PSMs for various reasons. An effective postprocessing strategy is to reorder PSMs so as to reduce the overlap between correct identifications and incorrect identifications.
Machine learning techniques are widely used to build reranking models [9–14]. These methods require high-quality MS/MS spectra as training data to generate an accurate classification or regression model. However, it is very difficult to obtain a discriminative model that is universally applicable to different platforms and experimental conditions.
One may argue that some semi-supervised learning methods such as Percolator [10] do not require any training data. This is not true, since they still need to build a predictive model in which the training set is constructed automatically on the fly: the PSMs derived from searching a decoy database are used as negative examples and the high-scoring PSMs derived from searching the target database are used as positive examples. Eliminating the need to construct a training set manually cannot be interpreted as being free of training data.
The proposed method is similar to those learning-based reranking approaches in the sense that it borrows information from different spectra. The novelty that distinguishes our method from previous ones is that we explicitly exploit the rank dependence among peptides from the same protein.
1. It does not need MS/MS spectra as training data. This flexibility makes the algorithm applicable to MS/MS data generated from different platforms and experimental conditions.
2. It utilizes the interpeptide relationship during the reranking process. Such information is readily available in the protein database. Furthermore, this peptide-peptide connection encoded in the protein sequence is very stable and noise-free.
3. The optimization problem in this paper has a closed-form solution, which enables us to obtain the optimal reranking list easily.
Discrete regularization
The idea of regularization has been widely studied in the literature. In particular, similar optimization formulations have been used in semi-supervised learning [6] and information retrieval [15]. To the best of our knowledge, there has been no previous work that applies this idea to peptide identification and PSM reranking.
Since we use the regularization technique in a different problem setting, some subtle differences among the methods exist. For instance, the methods in machine learning [6] and document retrieval [15] usually generate an equal number of neighbors (i.e., the number of nonzero entries in a row of W) for each node, whereas the number of neighbors of different peptides in our similarity matrix may vary significantly.
Peptide dependency
1. Our objective is to rerank PSMs while ProteinProphet aims at protein inference.
2. Our method can lower the score of a high-scoring PSM in the presence of low-scoring matches from the same protein. This will improve the identifications of some one-hit wonders. Otherwise, they may be overwhelmed. In contrast, the adjustment mechanism in ProteinProphet favors peptides having many neighbors.
3. We can find the optimal solution while ProteinProphet does not have such a property.
Conclusions
This paper introduced a consistency-based PSM reranking method: given an initial set of identification scores and the interpeptide similarity matrix (graph), the new method finds a set of new scores by minimizing the score inconsistency among similar peptides and the score inconsistency between the updated and initial identifications. Since the new method only requires the initial identifications as input, we can apply it to initial rankings from any peptide search engine. Thus, this consistency-based score regularization can be used as a general postprocessing step in peptide identification.
The affinity measure in this paper only considers the interpeptide relationship and ignores other sources of information contained in the peptide-spectrum pairs. For instance, many valuable features such as peak offset and sequence composition can help us define more comprehensive similarity metrics. Such extensions will generate an enhanced affinity graph, since two peptide-spectrum pairs may become similar even when their peptides do not belong to the same protein. We will study whether such an extended graph model can further improve the identification performance in future work.
Our model is based on the hypothesis that “similar peptides should have similar ranking scores”. This hypothesis can have different interpretations, making it possible to formulate different optimization problems. For example, we can use the relative rank instead of the ranking score in the objective function. The investigation of alternative optimization formulations is another interesting topic.
Declarations
Acknowledgements
This work was partially supported by the Natural Science Foundation of China under Grant No. 61003176, the Fundamental Research Funds for the Central Universities of China (DUT10JR05 and DUT10ZD110), the General Research Fund 621707 from the Hong Kong Research Grant Council and the Research Proposal Competition Awards RPC07/08.EG25 and RPC10EG04 from the Hong Kong University of Science and Technology.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.
References
1. Perkins DN, Pappin DJC, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20(18):3551–3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2
2. Eng JK, McCormack AL, Yates JR: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry 1994, 5(11):976–989. 10.1016/1044-0305(94)80016-2
3. Klammer AA, Yi X, MacCoss MJ, Noble WS: Improving Tandem Mass Spectrum Identification Using Peptide Retention Time Prediction across Diverse Chromatography Conditions. Analytical Chemistry 2007, 79(16):6111–6118. 10.1021/ac070262k
4. von Luxburg U: A tutorial on spectral clustering. Statistics and Computing 2007, 17(4):395–416. 10.1007/s11222-007-9033-z
5. Chung FRK: Spectral Graph Theory. American Mathematical Society; 1997.
6. Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B: Learning with Local and Global Consistency. In Advances in Neural Information Processing Systems (NIPS03). Edited by: Thrun S, Saul LK, Schölkopf B. MIT Press; 2003.
7. He Z, Yu W: Improving peptide identification with single-stage mass spectrum peaks. Bioinformatics 2009, 25(22):2969–2974. 10.1093/bioinformatics/btp501
8. Craig R, Beavis RC: TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20(9):1466–1467. 10.1093/bioinformatics/bth092
9. Keller A, Nesvizhskii AI, Kolker E, Aebersold R: Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry 2002, 74(20):5383–5392. 10.1021/ac025747h
10. Kall L, Canterbury JD, Weston J, Noble WS, MacCoss MJ: Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods 2007, 4: 923–925. 10.1038/nmeth1113
11. Lin Y, Qiao Y, Sun S, Yu C, Dong G, Bu D: A Fragmentation Event Model for Peptide Identification by Mass Spectrometry. In Proceedings of The 12th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2008), Volume 4955 of LNBI. Edited by: Vingron M, Wong L. Springer; 2008:154–166.
12. Klammer AA, Reynolds SM, Bilmes JA, MacCoss MJ, Noble WS: Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification. Bioinformatics 2008, 24(13):i348–i356. 10.1093/bioinformatics/btn189
13. Ding Y, Choi H, Nesvizhskii AI: Adaptive Discriminant Function Analysis and Reranking of MS/MS Database Search Results for Improved Peptide Identification in Shotgun Proteomics. Journal of Proteome Research 2008, 7(11):4878–4889. 10.1021/pr800484x
14. Frank A: A Ranking-Based Scoring Function For Peptide-Spectrum Matches. Journal of Proteome Research 2009, 8(5):2241–2252. 10.1021/pr800678b
15. Diaz F: Regularizing query-based retrieval scores. Information Retrieval 2007, 10(6):531–562. 10.1007/s10791-007-9034-8
16. Nesvizhskii AI, Keller A, Kolker E, Aebersold R: A statistical model for identifying proteins by tandem mass spectrometry. Analytical Chemistry 2003, 75(17):4646–4658. 10.1021/ac0341261
17. Klimek J, Eddes JS, Hohmann L, Jackson J, Peterson A, Letarte S, Gafken PR, Katz JE, Mallick P, Lee H, Schmidt A, Ossola R, Eng JK, Aebersold R, Martin DB: The Standard Protein Mix Database: A Diverse Dataset to Assist in the Production of Improved Peptide and Protein Identification Software Tools. Journal of Proteome Research 2008, 7: 96–103. 10.1021/pr070244j
18. Whiteaker J, Zhang H, Eng J, Fang R, Piening B, Feng L, Lorentzen T, Schoenherr R, Keane J, Holzman T, Fitzgibbon M, Lin C, Zhang H, Cooke K, Liu T, Camp D, Anderson L, Watts J, Smith R, McIntosh M, Paulovich A: Head-to-head comparison of serum fractionation techniques. Journal of Proteome Research 2007, 6(2):828–836. 10.1021/pr0604920
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.