Score regularization for peptide identification
BMC Bioinformatics volume 12, Article number: S2 (2011)
Abstract
Background
Peptide identification from tandem mass spectrometry (MS/MS) data is one of the most important problems in computational proteomics. This technique relies heavily on the accurate assessment of the quality of peptide-spectrum matches (PSMs). However, current MS technology and PSM scoring algorithms are far from perfect, leading to the generation of incorrect peptide-spectrum pairs. Thus, it is critical to develop new postprocessing techniques that can effectively distinguish true identifications from false ones.
Results
In this paper, we present a consistency-based PSM reranking method to improve the initial identification results. This method uses one additional assumption: two peptides belonging to the same protein should be correlated with each other. We formulate an optimization problem that embraces two objectives through regularization: the smoothing consistency among scores of correlated peptides and the fitting consistency between new scores and initial scores. This optimization problem can be solved analytically. An experimental study on several real MS/MS data sets shows that this reranking method improves the identification performance.
Conclusions
The score regularization method can be used as a general postprocessing step for improving peptide identifications. Source code and data sets are available at: http://bioinformatics.ust.hk/SRPI.rar.
Background
The identification of peptides by searching tandem mass spectrometry (MS/MS) spectra against a protein database is an essential technology in shotgun proteomics. Current peptide search engines such as Mascot [1] and Sequest [2] work on the principle of “query by spectrum”. They mainly use spectrum-associated information such as peak location (m/z), peak intensity and peak type (e.g., b-ion or y-ion) to perform peptide identification. Such spectrum-based database searching methods are far from satisfactory since random peptide-spectrum matches (PSMs) occur frequently in the identification results. These false assignments can be attributed to the poor quality of spectra, post-translational modifications (PTMs) of proteins and other unpredictable factors, making it challenging to distinguish correct identifications from incorrect ones.
To improve the identification performance, one possible solution is to incorporate extra information. For example, mass spectrometry is usually coupled with liquid chromatography (LC), which provides retention time measurement associated with the general biophysical characteristics of a peptide. The idea of using retention time for peptide identification has been discussed recently [3].
We note that peptides are correlated with each other. This observation motivates us to exploit the inter-peptide relationship as an additional source of information. The most straightforward and reliable relationship between two peptides is their coexistence in proteins. Two peptides are said to be “related” or “similar” if they belong to the same protein. We define the similarity between two peptides as the probability of their simultaneous occurrence in the same protein. Intuitively, the identification of one peptide indicates the existence of its related peptides. Therefore, it is reasonable to extend this intuition to the following hypothesis: related peptides should have similar ranking scores. Such a consistency pattern within related peptides can be utilized to reorder PSMs through the manipulation of ranking scores. In this paper, we formulate the consistency-based PSM reranking problem as an optimization problem of balancing the score from initial identification against the scores of related peptides. We attempt to unify two contending goals in one single objective function:

1. Smoothing consistency: The PSMs with similar peptides should have similar scores.
2. Fitting consistency: The initial ranking score provides valuable information. Thus, the new score of each PSM should not deviate too much from its original one.
Here we use a linear combination of these two objectives and introduce a regularization parameter to control their relative importance. This optimization problem has a closed-form solution. We apply the proposed method to several real MS/MS data sets. Experimental results show that our method consistently outperforms the baseline method.
The rest of the paper is organized as follows: Section 2 presents the problem formulation and our method. Section 3 shows the experimental results. Section 4 discusses related methods and Section 5 concludes the paper.
Methods
In this section, we first introduce the problem and illustrate the underlying assumption using a motivating example. Then, we present a probability-based similarity measure to quantify the inter-peptide affinity. Finally, we provide an optimization formulation under the regularization framework and discuss how to find the optimal solution efficiently.
Problem statement
Let ℂ = {(p_{1}, s_{1}), (p_{2}, s_{2}), …, (p_{ n }, s_{ n })} be a set of PSMs, where p_{ i } is a peptide sequence and s_{ i } is an MS/MS spectrum. The set ℂ is associated with a vector of initial ranking scores X = (x_{1}, x_{2}, …, x_{ n })^{T} provided by a standard peptide search engine (i.e., the baseline ranker). There is no special requirement on the input ranking score: it may be any type of score (e.g., probability score, E-value).
As the baseline ranker tends to be imperfect, the goal of our reranking method is to find a new score vector Y = (y_{1}, y_{2}, …, y_{ n })^{T} that uses the inter-peptide relationship to improve the ranking.
Note that it is possible to have p_{ i } = p_{ j } for i ≠ j since the same peptide may be identified by multiple spectra.
A motivating example
In practice, the assumption that “peptides from the same protein should have similar ranking scores” is often violated. This is because different peptide sequences may have different physicochemical properties and fragmentation patterns, which result in different scores from database search engines even if the peptides are from the same protein. However, imposing such a consistency constraint is still beneficial for peptide identification. We explain our idea using a toy example.
Suppose a hypothetical protein consists of two detected peptides. Fig. 1 shows four possibilities according to the distribution of ranking scores and a predefined threshold value.

Case 1: One PSM has a very high score and the other PSM has a score below the threshold. The consistency constraint will force them to move towards each other in terms of ranking scores. This may provide the second PSM a higher score above the threshold. This change will improve the identification performance if the second PSM is correct, but will lower the performance otherwise. Fortunately the probability that the second PSM is correct is higher since the probability that its parent protein exists is high (though the first peptide might be identified simply due to chance, it is unlikely that such identification will have a high score).

Case 2: The score of the first PSM is barely above the threshold and that of the second PSM is very low. Penalizing their scores according to the consistency assumption will pull the first PSM down (below the threshold). Though the constituent peptide-level match scores of a truly present protein may vary widely, it is unlikely that none of these scores is high. Therefore, the probability that the first PSM is incorrect is high. In other words, there is a good chance of detecting this incorrect identification via consistency-based score adjustment.

Case 3 and Case 4: The result will not change.
The above example shows that penalizing inconsistent PSMs per protein may help us improve the initial identification performance. This observation motivates us to investigate the possibility of utilizing the consistency hypothesis in PSM reranking, even when such an assumption does not always hold in real proteomics experiments.
Graph construction
Each peptide may belong to multiple proteins. We use U_{ i } to denote the set of proteins that contain p_{ i }. Given n peptide-spectrum pairs, we construct an n × n symmetric similarity matrix W with its element w_{ ij } measuring the similarity between p_{ i } and p_{ j }. Concretely, w_{ ij } is defined as the probability that these two peptides belong to the same protein:
where U_{ i } ∩ U_{ j } denotes the set of proteins that contain both p_{ i } and p_{ j }, and |·| denotes the number of elements in a set.
Given W, we define a diagonal matrix D whose diagonal entries are the row sums of W:
d_{ii} = \sum_{j=1}^{n} w_{ij}, i = 1, …, n. (2)
The matrix D will be used in the next subsection.
In Fig. 2, we use a toy example to describe the graph construction procedure. In this example, there are five peptide-spectrum pairs and three proteins {A, B, C}. According to the peptide-protein relationship in the left part of Fig. 2, we obtain U_{ i } for each peptide p_{ i }:
U_{1} = {A}, U_{2} = {A, C}, U_{3} = {B}, U_{4} = {B, C}, U_{5} = {A}.
Then, we construct a similarity graph as shown in the middle part of Fig. 2. The corresponding W and D read:
Here are some characteristics of W:

It is a sparse matrix, which can be stored in a relatively small amount of space.

The number of nonzero entries in each row varies significantly. This is because one peptide may have many neighbors while another peptide may have only a few neighbors.

There are no self-loops in the graph since its diagonal entries are zero (w_{ ii } = 0).
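The construction of W and D for the toy example can be sketched in a few lines of Python. The exact expression for w_{ ij } is not reproduced in the text above, so this sketch assumes a Jaccard-style ratio |U_i ∩ U_j| / |U_i ∪ U_j| for i ≠ j, which is only one plausible reading of "the probability that two peptides belong to the same protein". The function name `build_similarity` is hypothetical.

```python
from itertools import combinations


def build_similarity(protein_sets):
    """Return (W, d) for a list of protein sets U_i, one per PSM.

    ASSUMPTION: w_ij = |U_i & U_j| / |U_i | U_j| for i != j (Jaccard ratio);
    the formula used in the paper may differ.
    """
    n = len(protein_sets)
    W = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        shared = protein_sets[i] & protein_sets[j]
        if shared:
            w = len(shared) / len(protein_sets[i] | protein_sets[j])
            W[i][j] = W[j][i] = w  # symmetric; the diagonal stays zero
    d = [sum(row) for row in W]  # diagonal of D: row sums of W
    return W, d


# Toy example from the text: five PSMs and proteins A, B, C.
U = [{"A"}, {"A", "C"}, {"B"}, {"B", "C"}, {"A"}]
W, d = build_similarity(U)
```

Only pairs that share a protein receive a nonzero weight, so a sparse representation would be preferable for real data; dense lists are used here purely for readability.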
Regularization framework
Given the vector of initial scores X and the similarity matrix W, we compute a vector of new scores Y for the same set of PSMs by considering two objectives: smoothing consistency among similar peptides and fitting consistency between new scores and initial identification scores.
Smoothing consistency
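We use the following cost function to quantify the inter-peptide inconsistency:
L(Y) = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} \left( \frac{y_{i}}{\sqrt{d_{ii}}} - \frac{y_{j}}{\sqrt{d_{jj}}} \right)^{2}, (3)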
where d_{ ii } and d_{ jj } are the diagonal elements of matrix D.
If related peptides (w_{ ij } > 0) have very inconsistent scores, then the value of L(Y) will be high.
It is important to mention the following equation [4]:
L(Y) = Y^{T}(I – D^{–1/2}WD^{–1/2})Y. (4)
In spectral graph theory [5], I – D^{–1/2}WD^{–1/2} is called the normalized Laplacian. The appendix gives the detailed proof of equation (4) and corresponding interpretations based on spectral graph theory.
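A quick numerical check makes the identity behind equation (4) concrete. The sketch below assumes the pairwise cost has the form ½ Σ w_ij (y_i/√d_ii − y_j/√d_jj)², as in [6], and compares it with the quadratic form Yᵀ(I − D^{–1/2}WD^{–1/2})Y. The matrix W and the vector Y are made-up values chosen only for illustration.

```python
import math


def smoothing_cost(W, Y):
    """Pairwise form: 0.5 * sum_ij w_ij * (y_i/sqrt(d_ii) - y_j/sqrt(d_jj))^2."""
    d = [sum(row) for row in W]
    total = 0.0
    for i in range(len(Y)):
        for j in range(len(Y)):
            diff = Y[i] / math.sqrt(d[i]) - Y[j] / math.sqrt(d[j])
            total += W[i][j] * diff * diff
    return 0.5 * total


def laplacian_quadratic(W, Y):
    """Quadratic form Y^T (I - S) Y with S = D^{-1/2} W D^{-1/2}."""
    d = [sum(row) for row in W]
    quad = sum(y * y for y in Y)  # Y^T I Y
    for i in range(len(Y)):
        for j in range(len(Y)):
            quad -= Y[i] * W[i][j] * Y[j] / math.sqrt(d[i] * d[j])  # Y^T S Y
    return quad


# Hypothetical symmetric similarity matrix with zero diagonal.
W = [[0.0, 0.5, 1.0],
     [0.5, 0.0, 0.25],
     [1.0, 0.25, 0.0]]
Y = [3.0, 1.0, 2.0]
gap = abs(smoothing_cost(W, Y) - laplacian_quadratic(W, Y))
```

The two values agree up to floating-point rounding, which is what the appendix proves in general.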
Fitting consistency
We define another cost function to penalize the inconsistency between the initial identification score and the new score:
F(Y) = \sum_{i=1}^{n} (y_{i} - x_{i})^{2}. (5)
If peptides have scores that are inconsistent with their original scores, then the value of F(Y) will be high.
Score regularization
We use a linear combination of L(Y) and F(Y) to compose the regularized objective function:
Q(Y) = (1 – λ)L(Y) + λF(Y), (6)
where λ ∈ (0, 1) is a regularization parameter controlling the balance between the smoothing consistency and the fitting consistency.
Thus, the consistency-based PSM reranking problem is formulated as finding an optimal Y* such that:
Y* = \arg\min_{Y} Q(Y). (7)
The above optimization problem has been studied in machine learning [6]. It has a closed-form solution. Concretely, we set the derivative of Q(Y) (with respect to Y) equal to zero:
2(1 – λ)(I – S)Y* + 2λ(Y* – X) = 0, (8)
where S = D^{–1/2}WD^{–1/2} and I is the identity matrix.
After some algebraic derivation, we obtain the closed-form solution (see the appendix for the rigorous proof that the inverse always exists):
Y* = λ(I – (1 – λ)S)^{–1}X. (9)
For the toy example shown in Fig. 2, the closed-form solution gives a new ranking list (see the right part of Fig. 2).
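Equation (9) translates almost directly into NumPy. The following is a minimal sketch, not the authors' released code: the function name `regularize_scores` and the two initial scores are hypothetical, and a linear solve is used in place of an explicit matrix inverse.

```python
import numpy as np


def regularize_scores(W, X, lam=0.5):
    """Closed-form solution Y* = lam * (I - (1 - lam) * S)^{-1} X."""
    W = np.asarray(W, dtype=float)
    X = np.asarray(X, dtype=float)
    d = W.sum(axis=1)
    if np.any(d <= 0):
        raise ValueError("every node needs at least one neighbor (d_ii > 0)")
    inv_sqrt_d = 1.0 / np.sqrt(d)
    S = W * inv_sqrt_d[:, None] * inv_sqrt_d[None, :]  # D^{-1/2} W D^{-1/2}
    A = np.eye(len(X)) - (1.0 - lam) * S
    return lam * np.linalg.solve(A, X)  # solve A y = X, then scale by lam


# Two related PSMs (w_12 = 1) with hypothetical initial scores 9 and 3.
Y = regularize_scores([[0, 1], [1, 0]], [9.0, 3.0], lam=0.5)
# Y is approximately [7., 5.]
```

In this two-node case each new score is two thirds of the PSM's own score plus one third of its neighbor's, so the high score is pulled down and the low score is pulled up, as in Case 1 of the motivating example.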
Miscellaneous issues
Isolated nodes
To compute the closed-form solution, each diagonal entry d_{ii} of D must be nonzero. In other words, there should be at least one nonzero entry in each row of W. This means that each peptide must have some similar peptides in the graph. For isolated peptides, we have two choices:

Exclude these peptides during graph construction, but keep their identification scores during reranking.

Introduce a dummy node as the neighbor of each isolated node. Meanwhile, set the corresponding similarity value to an extremely small positive number (e.g., 10^{–8}).
In this paper, we adopt the second strategy for the sake of implementation simplicity.
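The dummy-node strategy can be sketched as a small augmentation step. The text does not say whether dummies are shared or what initial score they receive, so the sketch below makes two explicit assumptions: every isolated node gets its own dummy, and the dummy starts with the same initial score as its partner. The helper name `add_dummy_neighbors` is hypothetical.

```python
def add_dummy_neighbors(W, X, eps=1e-8):
    """Give every isolated node (all-zero row of W) a private dummy neighbor.

    ASSUMPTIONS (not specified in the text): one dummy per isolated node,
    and each dummy copies the initial score of its partner.
    """
    n = len(W)
    isolated = [i for i, row in enumerate(W) if not any(row)]
    size = n + len(isolated)
    W_aug = [list(row) + [0.0] * len(isolated) for row in W]
    W_aug += [[0.0] * size for _ in isolated]
    X_aug = list(X)
    for k, i in enumerate(isolated):
        W_aug[i][n + k] = W_aug[n + k][i] = eps  # tiny symmetric weight
        X_aug.append(X[i])
    return W_aug, X_aug, n  # the first n entries are the real PSMs


W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]  # node 2 has no neighbors
W_aug, X_aug, n_real = add_dummy_neighbors(W, [5.0, 4.0, 2.0])
```

After normalization, the entry of S that links an isolated node to its private dummy equals 1 whatever the value of ε, so the dummy's initial score matters. With the copy assumption used here, the regularized score of the isolated PSM stays equal to its initial score.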
Large-scale implementation
The matrix S is usually very sparse and needs a relatively small storage space. However, (I – (1 – λ)S)^{–1} may be very dense and requires a huge storage space. When the computation of (I – (1 – λ)S)^{–1} is infeasible due to space limitation, we use the following iteration [6] to find the solution:
Y(t + 1) = λX + (1 – λ)SY(t). (10)
It has been proved that the iteration process converges to the closed-form solution Y* [6]. Since S is sparse, this method requires less storage space than computing the closed-form solution directly. Intuitively, the iteration can be understood as an information diffusion process on the graph. In each round, every node updates its score by linearly combining its own initial score and the current scores of its neighbors.
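A minimal pure-Python sketch of iteration (10) follows. The starting point Y(0) = X, the stopping tolerance and the iteration cap are assumptions made for this example, and S is kept as per-row lists of nonzero entries to mimic sparse storage.

```python
import math


def iterate_scores(W, X, lam=0.5, tol=1e-10, max_iter=1000):
    """Iterate Y(t+1) = lam * X + (1 - lam) * S * Y(t) until it stabilizes."""
    n = len(X)
    d = [sum(row) for row in W]
    # Sparse S = D^{-1/2} W D^{-1/2}: keep only (column, value) pairs per row.
    S = [[(j, W[i][j] / math.sqrt(d[i] * d[j])) for j in range(n) if W[i][j]]
         for i in range(n)]
    Y = list(X)  # Y(0) = X is an assumption; the limit does not depend on it
    for _ in range(max_iter):
        Y_next = [lam * X[i] + (1 - lam) * sum(s * Y[j] for j, s in S[i])
                  for i in range(n)]
        if max(abs(a - b) for a, b in zip(Y, Y_next)) < tol:
            return Y_next
        Y = Y_next
    return Y


# Same two-node example as for the closed form: the limit is about [7, 5].
Y = iterate_scores([[0.0, 1.0], [1.0, 0.0]], [9.0, 3.0])
```

Each pass touches only the nonzero entries of S, so memory grows with the number of edges rather than with n².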
Protein inference
Peptide identification is only one intermediate step of protein identification. Though the PSM reranking strategy is able to improve peptide identifications, one may wonder whether it really helps in protein identification. Indeed, the fact that better peptide reranking results lead to better protein inference has been experimentally verified several times (e.g., [7]). Therefore, we focus on comparing identification performance at the peptide level in this paper.
Results
We apply our method to four real data sets (named DS1–DS4) in Table 1. We use X!Tandem [8] (version 2007.07.01.2) as the baseline ranker to search against a composite target-decoy database. In the composite database, we use proteins from the Swiss-Prot database (release 56.6) as target entries and shuffle these target protein sequences to generate decoy entries. Each decoy protein sequence is a random permutation of the residues in the corresponding target protein. Here we use T and ℝ to denote the set of target proteins and the set of decoy proteins, respectively.
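Generating a shuffled decoy entry amounts to permuting the residues of each target sequence. The sketch below illustrates the idea only: the sequence, the `DECOY_` name prefix and the fixed random seed are all hypothetical choices, not details taken from the paper.

```python
import random


def shuffle_decoy(sequence, rng):
    """Return a random permutation of the residues in a protein sequence."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


rng = random.Random(0)  # fixed seed only to make the example reproducible
targets = {"P1": "MKWVTFISLLLLFSSAYSR"}  # hypothetical target entry
decoys = {"DECOY_" + name: shuffle_decoy(seq, rng)
          for name, seq in targets.items()}
composite = {**targets, **decoys}  # composite target-decoy database
```

Shuffling preserves the length and amino acid composition of each protein, so decoy peptides have mass statistics similar to those of target peptides.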
We use the following parameters for peptide identification: monoisotopic masses, a mass tolerance of 2 Da for the precursor, a mass tolerance of 1 Da for fragment ions, a fixed modification on Cys and one missed cleavage site. We only consider b and y fragment ions in PSM scoring. We use the negative logarithm of the E-value of each PSM provided by X!Tandem as the initial ranking score. The criterion for filtering PSMs is E-value ≤ 0.1. Our method generates a set of new scores that better distinguishes correct identifications from incorrect ones. Throughout the experiments, the regularization parameter λ is fixed to 0.5 unless explicitly specified otherwise. In performance evaluation, a peptide-spectrum pair (p_{ i },x_{ i }) is labeled as a false positive if p_{ i } belongs to a decoy protein; otherwise, it is a true positive. Given a vector of ranking scores X = {x_{1}, x_{2},…, x_{ n }} and a score threshold δ, the true positive rate (TPR) is defined as the number of true positives above the threshold divided by the total number of true positives:
TPR(X, δ) = |{p_{ i } ∈ T : x_{ i } ≥ δ}| / |{p_{ i } ∈ T}|. (11)
Similarly, the false positive rate (FPR) is defined as:
FPR(X, δ) = |{p_{ i } ∈ ℝ : x_{ i } ≥ δ}| / |{p_{ i } ∈ ℝ}|. (12)
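Definitions (11) and (12) can be evaluated directly from a list of scores and decoy labels. The sketch below is illustrative: the function names and the toy scores are hypothetical, and the area under the curve is obtained with the trapezoidal rule over all distinct thresholds, which is one common convention.

```python
def tpr_fpr(scores, is_decoy, delta):
    """Return (TPR, FPR) at threshold delta, following (11) and (12)."""
    targets = [s for s, dec in zip(scores, is_decoy) if not dec]
    decoys = [s for s, dec in zip(scores, is_decoy) if dec]
    tpr = sum(s >= delta for s in targets) / len(targets)
    fpr = sum(s >= delta for s in decoys) / len(decoys)
    return tpr, fpr


def roc_auc(scores, is_decoy):
    """Area under the ROC curve, sweeping delta over all distinct scores."""
    points = [(0.0, 0.0)]
    for delta in sorted(set(scores), reverse=True):
        tpr, fpr = tpr_fpr(scores, is_decoy, delta)
        points.append((fpr, tpr))
    auc = 0.0
    for (f0, t0), (f1, t1) in zip(points, points[1:]):
        auc += (f1 - f0) * (t0 + t1) / 2.0  # trapezoid between ROC points
    return auc


# Hypothetical scores: two target PSMs (9, 3) and two decoy PSMs (7, 5).
scores = [9.0, 3.0, 7.0, 5.0]
is_decoy = [False, False, True, True]
rates = tpr_fpr(scores, is_decoy, delta=6.0)  # (0.5, 0.5)
auc = roc_auc(scores, is_decoy)               # 0.5
```

The same functions can be applied to the initial scores X and to the regularized scores Y to compare the two rankings.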
We plot the receiver operating characteristic (ROC) curves of the baseline method and our method in Fig. 3. We also use the area under the ROC curve (AUC) as a single numeric indicator of overall performance. Fig. 3 shows that:

1. Our consistency-based reranking method provides consistent and substantial performance improvement on data sets DS2, DS3 and DS4. Note that our method does not require any prior knowledge or training data.
2. Though there is only a marginal improvement in overall performance on DS1 (AUC=0.64 vs. AUC=0.65), our method achieves a significantly higher true positive rate than the baseline method when the false positive rate is around 10%. This is a useful property since the false positive rate is usually set to a relatively small value in practice.
To test the sensitivity of our algorithm to the regularization parameter, we vary λ from 0.1 to 0.9 and plot the AUC values in Fig. 4. It shows that the identification result is robust with respect to λ. Increasing λ beyond 0.6 leads to a decrease in AUC, since the regularized score becomes identical to the initial score when λ = 1. Though we cannot determine the optimal λ automatically, we suggest setting λ = 0.5 as a rule of thumb in practice since this setting exhibits good performance on average.
We also plot the initial score distribution and the updated score distribution in Fig. 5. Here we use min-max normalization to transform both the initial identification score and the new reranked score into the interval [0,1]. The plot reveals that the consistency constraint shrinks the scores in each group (true and decoy) towards their mean value. Although the consistency-based reranking method cannot completely separate true identifications from decoys, it does reduce the score overlap on DS2, DS3 and DS4. Note that the consistency-based reranking procedure is less effective on DS1 since there is a serious score overlap. Even in this case, we find that the separation between true and decoy identifications is improved in the lower-score region.
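Min-max normalization maps a list of scores linearly onto [0, 1]. A minimal sketch follows; the function name is hypothetical, and returning zeros when all scores are equal is an assumption made here to avoid division by zero.

```python
def min_max(scores):
    """Linearly rescale scores so that the minimum is 0 and the maximum is 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # degenerate case: an assumption
    return [(s - lo) / (hi - lo) for s in scores]


normalized = min_max([2.0, 4.0, 6.0])  # [0.0, 0.5, 1.0]
```

Because the map is monotone, it changes neither the ranking nor the ROC curve; it only puts the initial and reranked scores on a common axis for plotting.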
Discussion
Here we briefly review previous works related to the ideas discussed in this paper.
PSM reranking
Many PSM scoring algorithms have been developed to facilitate accurate peptide identification. Mascot [1], Sequest [2] and X!Tandem [8] are the most widely used PSM scoring algorithms. These algorithms use the information in each single MS/MS spectrum to perform peptide inference. As discussed in the introduction, they suffer from the problem of generating incorrect PSMs for various reasons. An effective postprocessing strategy is to reorder PSMs so as to reduce the overlap between correct identifications and incorrect identifications.
Machine learning techniques are widely used to build reranking models [9–14]. These methods require high-quality MS/MS spectra as training data to generate an accurate classification or regression model. However, it is very difficult to obtain a discriminative model that is universally applicable to different platforms and experimental conditions.
One may argue that some semi-supervised learning methods such as Percolator [10] do not require any training data. This is not true since they still need to build a predictive model, in which the training set is constructed automatically on the fly: the PSMs derived from searching a decoy database are used as negative examples and the high-scoring PSMs derived from searching the target database are used as positive examples. Eliminating the need to construct a training set manually cannot be interpreted as being free of training data.
The proposed method is similar to those learning-based reranking approaches in the sense that it borrows information from different spectra. The novelty that distinguishes our method from previous ones is that we explicitly exploit the rank dependence among peptides from the same protein.
Overall, the consistency-based reranking model offers several advantages:

1. It does not need MS/MS spectra as training data. This flexibility makes the algorithm applicable to MS/MS data generated from different platforms and experimental conditions.
2. It utilizes the inter-peptide relationship during the reranking process. Such information is readily available in the protein database. Furthermore, this peptide-peptide connection encoded in protein sequences is very stable and noise-free.
3. The optimization problem in this paper has a closed-form solution, which enables us to obtain the optimal reranking list easily.
Discrete regularization
The idea of regularization has been widely studied in the literature. In particular, similar optimization formulations have been used in semi-supervised learning [6] and information retrieval [15]. To the best of our knowledge, there has been no previous work that applies this idea to peptide identification and PSM reranking.
Since we use the regularization technique in a different problem setting, some subtle differences among the methods exist. For instance, the methods in machine learning [6] and document retrieval [15] usually generate an equal number of neighbors (i.e., the number of nonzero entries in a row of W) for each node. In contrast, the number of neighbors of different peptides in our similarity matrix may vary significantly.
Peptide dependency
The idea of incorporating dependencies of peptides from the same protein has been used in ProteinProphet [16]. Here we highlight that there are at least three key differences between our formulation and ProteinProphet.

1. Our objective is to rerank PSMs while ProteinProphet aims at protein inference.
2. Our method can lower the score of a high-scoring PSM in the presence of low-scoring matches from the same protein. This will improve the identifications of some one-hit wonders. Otherwise, they may be overwhelmed. In contrast, the adjustment mechanism in ProteinProphet favors peptides having many neighbors.
3. We can find the optimal solution, whereas ProteinProphet does not have such a property.
Conclusions
This paper introduced a consistency-based PSM reranking method: given an initial set of identification scores and the inter-peptide similarity matrix (graph), the new method finds a set of new scores by minimizing the score inconsistency among similar peptides and the score inconsistency between updated identification and initial identification. Since the new method only requires the initial identifications as input, we can apply it to initial rankings from any peptide search engine. Thus, this consistency-based score regularization can be used as a general postprocessing step in peptide identification.
The affinity measure in this paper only considers the inter-peptide relationship and ignores other sources of information contained in the peptide-spectrum pairs. For instance, many valuable features such as peak offset and sequence composition can help us define more comprehensive similarity metrics. Such extensions will generate an enhanced affinity graph since two peptide-spectrum pairs may become similar even when their peptides do not belong to the same protein. We will study whether such an extended graph model can further improve the identification performance in future work.
Our model is based on the hypothesis that “similar peptides should have similar ranking scores”. This hypothesis can have different interpretations, making it possible to formulate different optimization problems. For example, we can use the relative rank instead of the ranking score in the objective function. The investigation of alternative optimization formulations is another interesting topic.
References
1. Perkins DN, Pappin DJC, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20(18):3551–3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2
2. Eng JK, McCormack AL, Yates JR: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry 1994, 5(11):976–989. 10.1016/1044-0305(94)80016-2
3. Klammer AA, Yi X, MacCoss MJ, Noble WS: Improving Tandem Mass Spectrum Identification Using Peptide Retention Time Prediction across Diverse Chromatography Conditions. Analytical Chemistry 2007, 79(16):6111–6118. 10.1021/ac070262k
4. von Luxburg U: A tutorial on spectral clustering. Statistics and Computing 2007, 17(4):395–416. 10.1007/s11222-007-9033-z
5. Chung FRK: Spectral Graph Theory. American Mathematical Society; 1997.
6. Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B: Learning with Local and Global Consistency. In Advances in Neural Information Processing Systems (NIPS03). Edited by: Thrun S, Saul LK, Schölkopf B. MIT Press; 2003.
7. He Z, Yu W: Improving peptide identification with single-stage mass spectrum peaks. Bioinformatics 2009, 25(22):2969–2974. 10.1093/bioinformatics/btp501
8. Craig R, Beavis RC: TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20(9):1466–1467. 10.1093/bioinformatics/bth092
9. Keller A, Nesvizhskii AI, Kolker E, Aebersold R: Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry 2002, 74(20):5383–5392. 10.1021/ac025747h
10. Käll L, Canterbury JD, Weston J, Noble WS, MacCoss MJ: Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods 2007, 4: 923–925. 10.1038/nmeth1113
11. Lin Y, Qiao Y, Sun S, Yu C, Dong G, Bu D: A Fragmentation Event Model for Peptide Identification by Mass Spectrometry. In Proceedings of The 12th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2008), Volume 4955 of LNBI. Edited by: Vingron M, Wong L. Springer; 2008:154–166.
12. Klammer AA, Reynolds SM, Bilmes JA, MacCoss MJ, Noble WS: Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification. Bioinformatics 2008, 24(13):i348–i356. 10.1093/bioinformatics/btn189
13. Ding Y, Choi H, Nesvizhskii AI: Adaptive Discriminant Function Analysis and Reranking of MS/MS Database Search Results for Improved Peptide Identification in Shotgun Proteomics. Journal of Proteome Research 2008, 7(11):4878–4889. 10.1021/pr800484x
14. Frank A: A Ranking-Based Scoring Function For Peptide-Spectrum Matches. Journal of Proteome Research 2009, 8(5):2241–2252. 10.1021/pr800678b
15. Diaz F: Regularizing query-based retrieval scores. Information Retrieval 2007, 10(6):531–562. 10.1007/s10791-007-9034-8
16. Nesvizhskii AI, Keller A, Kolker E, Aebersold R: A statistical model for identifying proteins by tandem mass spectrometry. Analytical Chemistry 2003, 75(17):4646–4658. 10.1021/ac0341261
17. Klimek J, Eddes JS, Hohmann L, Jackson J, Peterson A, Letarte S, Gafken PR, Katz JE, Mallick P, Lee H, Schmidt A, Ossola R, Eng JK, Aebersold R, Martin DB: The Standard Protein Mix Database: A Diverse Dataset to Assist in the Production of Improved Peptide and Protein Identification Software Tools. Journal of Proteome Research 2008, 7: 96–103. 10.1021/pr070244j
18. Whiteaker J, Zhang H, Eng J, Fang R, Piening B, Feng L, Lorentzen T, Schoenherr R, Keane J, Holzman T, Fitzgibbon M, Lin C, Zhang H, Cooke K, Liu T, Camp D, Anderson L, Watts J, Smith R, McIntosh M, Paulovich A: Head-to-head comparison of serum fractionation techniques. Journal of Proteome Research 2007, 6(2):828–836. 10.1021/pr0604920
Acknowledgements
This work was partially supported by the Natural Science Foundation of China under Grant No. 61003176, the Fundamental Research Funds for the Central Universities of China (DUT10JR05 and DUT10ZD110), the General Research Fund 621707 from the Hong Kong Research Grant Council and the Research Proposal Competition Awards RPC07/08.EG25 and RPC10EG04 from the Hong Kong University of Science and Technology.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/12?issue=S1.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
ZH performed the implementations and drafted the manuscript. HZ and WY conceived the study and finalized the manuscript. All authors read and approved the final manuscript.
Appendix
Rationale Behind Normalized Laplacian
In this section, we present the rationale for using the normalized Laplacian in the cost function of smoothing inconsistency.
To penalize the smoothing inconsistency, one straightforward method is to use the weighted squared difference among scores of similar peptides:
\frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (y_{i} - y_{j})^{2}.
Spectral graph theory [4] shows that the following equation holds:
\frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (y_{i} - y_{j})^{2} = Y^{T}(D – W)Y.
In spectral graph theory, D – W is the unnormalized Laplacian [5]. Recall that different peptides may have different numbers of neighbors. A cost function based on the unnormalized Laplacian therefore places more penalty on peptides with more neighbors. To address this issue, we replace the unnormalized Laplacian with the normalized Laplacian in L(Y).
Detailed Data Description
Some Proofs
Theorem 1 and Theorem 2 imply the positive semidefiniteness of unnormalized Laplacian and normalized Laplacian, respectively.
Theorem 1. For every vector Y, we have
Y^{T}(D – W)Y = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (y_{i} - y_{j})^{2} ≥ 0.
Proof. Here we repeat the proof of [4] for completeness.
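The derivation itself is not reproduced in this text. A sketch of the standard argument from [4], assuming only that W is symmetric with nonnegative entries and that d_{ii} = \sum_{j} w_{ij}, runs as follows:

```latex
\begin{aligned}
Y^{T}(D - W)Y
  &= \sum_{i=1}^{n} d_{ii}\, y_i^{2} - \sum_{i,j=1}^{n} w_{ij}\, y_i y_j \\
  &= \frac{1}{2}\Bigl( \sum_{i=1}^{n} d_{ii}\, y_i^{2}
       - 2\sum_{i,j=1}^{n} w_{ij}\, y_i y_j
       + \sum_{j=1}^{n} d_{jj}\, y_j^{2} \Bigr) \\
  &= \frac{1}{2}\sum_{i,j=1}^{n} w_{ij}\,\bigl( y_i^{2} - 2 y_i y_j + y_j^{2} \bigr) \\
  &= \frac{1}{2}\sum_{i,j=1}^{n} w_{ij}\,(y_i - y_j)^{2} \;\ge\; 0 .
\end{aligned}
```

The third line uses d_{ii} = \sum_{j} w_{ij} to expand the first sum and the symmetry w_{ij} = w_{ji} to expand the last one.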
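Theorem 2. For every vector Y, we have
Y^{T}(I – D^{–1/2}WD^{–1/2})Y = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} \left( \frac{y_{i}}{\sqrt{d_{ii}}} - \frac{y_{j}}{\sqrt{d_{jj}}} \right)^{2} ≥ 0.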
Proof.
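The steps of this proof are not reproduced in this text. One way to obtain the result, sketched here under the additional assumption that every d_{ii} > 0, is to substitute Z = D^{-1/2}Y (so that z_i = y_i / \sqrt{d_{ii}}) and reuse Theorem 1:

```latex
\begin{aligned}
Y^{T}\bigl(I - D^{-1/2} W D^{-1/2}\bigr)Y
  &= \bigl(D^{-1/2}Y\bigr)^{T} (D - W)\, \bigl(D^{-1/2}Y\bigr)
   = Z^{T}(D - W)Z \\
  &= \frac{1}{2}\sum_{i,j=1}^{n} w_{ij}\,(z_i - z_j)^{2}
   = \frac{1}{2}\sum_{i,j=1}^{n} w_{ij}
     \Bigl( \frac{y_i}{\sqrt{d_{ii}}} - \frac{y_j}{\sqrt{d_{jj}}} \Bigr)^{2} \;\ge\; 0 .
\end{aligned}
```

The first equality holds because Z^{T}DZ = Y^{T}Y and Z^{T}WZ = Y^{T}D^{-1/2}WD^{-1/2}Y.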
To compute the closedform solution, the matrix (I – αS) must be invertible, where α = 1 – λ. Here we provide the detailed proof for the sake of completeness since it is omitted by [6].
Before we proceed to prove the existence of the inverse, we first show that the following two lemmas are correct.
Lemma 1. Both (I – S) and (I + S) are positive semidefinite, where S = D^{–1/2}WD^{–1/2}.
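Proof. According to Theorem 2, (I – S) is positive semidefinite since Y^{T}(I – S)Y is always nonnegative.
Similarly, we can show that Y^{T}(I + S)Y = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} \left( \frac{y_{i}}{\sqrt{d_{ii}}} + \frac{y_{j}}{\sqrt{d_{jj}}} \right)^{2} ≥ 0. Thus, (I + S) is also positive semidefinite.
Lemma 2. The eigenvalues of matrix S = D^{–1/2}WD^{–1/2} fall into [–1, 1].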
Proof. According to Lemma 1, (I – S) is positive semidefinite. Meanwhile, (I – S) is symmetric. Thus, all the eigenvalues of (I – S) are nonnegative. Let e be an eigenvalue of S and the associated eigenvector is V, i.e., SV = eV. Then,
(I – S)V = V – SV = V – eV=(1 – e)V.
Therefore, (1 – e) is an eigenvalue of (I – S) and V is its corresponding eigenvector. Thus, 1 – e ≥ 0, i.e., all the eigenvalues of S are not larger than 1.
Similarly, since (I + S) is positive semidefinite, we can prove that each eigenvalue of S is at least –1.
Theorem 3. (I – αS) is invertible, where S = D^{–1/2}WD^{–1/2} and α ∈ (0, 1).
Proof. A matrix is invertible if and only if all its eigenvalues are nonzero. Here we show that each eigenvalue of (I – αS) is nonzero.
Let e be an eigenvalue of S and the associated eigenvector is V. Then,
(I – αS)V = V – αSV = V – αeV = (1 – αe)V.
Hence 1 – αe is an eigenvalue of (I – αS) and V is the corresponding eigenvector. According to Lemma 2, we know that –1 ≤ e ≤ 1. Thus, 1 – αe > 0 holds since α ∈ (0, 1), so every eigenvalue of (I – αS) is nonzero and the inverse exists.
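Lemma 2 and Theorem 3 can be sanity-checked numerically. The sketch below builds a random symmetric weight matrix (a made-up example, not data from the paper) and inspects the spectra of S and I – αS with NumPy; the seed, the matrix size and α = 0.5 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n = 6
upper = np.triu(rng.random((n, n)), k=1)  # random weights above the diagonal
W = upper + upper.T                       # symmetric, zero diagonal
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))           # D^{-1/2} W D^{-1/2}

eig_S = np.linalg.eigvalsh(S)             # S is symmetric, so eigvalsh applies
alpha = 0.5
eig_A = np.linalg.eigvalsh(np.eye(n) - alpha * S)
```

All entries of `eig_S` lie in [–1, 1], with the largest equal to 1 up to rounding, and all entries of `eig_A` are strictly positive, which is why the linear system behind equation (9) can always be solved.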
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
He, Z., Zhao, H. & Yu, W. Score regularization for peptide identification. BMC Bioinformatics 12, S2 (2011). https://doi.org/10.1186/1471-2105-12-S1-S2
Keywords
 Ranking Score
 Peptide Identification
 Similar Peptide
 Spectral Graph Theory
 Area Under Receiver Operating Characteristic Curve