A nonparametric model for quality control of database search results in shotgun proteomics
© Zhang et al; licensee BioMed Central Ltd. 2008
Received: 05 June 2007
Accepted: 21 January 2008
Published: 21 January 2008
Analysis of complex samples with tandem mass spectrometry (MS/MS) has become routine in proteomic research. However, validation of database search results creates a bottleneck in MS/MS data processing. Recently, methods based on a randomized database have become popular for quality control of database search results. However, a consequent problem is the ignorance of how to combine different database search scores to improve the sensitivity of randomized database methods.
In this paper, a multivariate nonlinear discriminate function (DF) based on the multivariate nonparametric density estimation technique was used to filter out false-positive database search results with a predictable false positive rate (FPR). Application of this method to control datasets of different instruments (LCQ, LTQ, and LTQ/FT) yielded an estimated FPR close to the actual FPR. As expected, the method was more sensitive when more features were used. Furthermore, the new method was shown to be more sensitive than two commonly used methods on 3 complex sample datasets and 3 control datasets.
Using the nonparametric model, a more flexible DF can be obtained, resulting in improved sensitivity and good FPR estimation. This nonparametric statistical technique is a powerful tool for tackling the complexity and diversity of datasets in shotgun proteomics.
The objective of proteomics is to investigate proteins on a global scale [1, 2]. The high-throughput and sensitive tandem mass spectrometry (MS/MS) platform is now a supporting technology for protein identification in proteomic research [3, 4]. Using the shotgun strategy, a large number of MS/MS spectra can be gathered in a few hours . The MS/MS data is generally processed by the so-called database searching method . Automated software such as SEQUEST  and MASCOT  can rapidly assign tryptic peptides to MS/MS spectra by searching a protein sequence database and then identify proteins by utilizing the identified peptides. A notable problem in the MS/MS data processing is the high false positive rate (FPR) of the database search results . Thus, validation of database search results is unavoidable and necessary work, particularly when processing the large amount low accuracy MS/MS spectra with SEQUEST .
There are many proposed parameters and algorithms for evaluating SEQUEST database search results [11–32]. Such approaches must confront two main problems: First, the complex physical and chemical mechanisms of the shotgun experiment make it difficult to model the matches between MS/MS spectra and peptides with a one-size-fits-all algorithm . Thus, database search software provides multiple scores, and many empirical and intuitive parameters are used in the validation of database search results. These parameters describe different aspects of the quality of the match and provide complementary information to the validation of the database search results. Combining these parameters while considering their relationships is difficult. Second, many factors can affect the distributions of quality control parameters, including the sample, the database, the experimental conditions, and other random factors [24, 27]. Avoiding the effects of such factors during the validation of database search results is difficult. In addition, large-scale proteomics always uses multiple, complementary MS/MS platforms and multiple database search software tools to acquire more results with a high confidence level. Thus, a universal framework for quality control of results is needed .
Recently, the randomized database method has become an attractive framework for quality control of MS/MS database search results. By constructing a negative control dataset for each experiment MS/MS dataset and the given database, the randomized database method can provide a universal foundation for the result quality control for different types of database search software and minimize the effects of differences in samples, experiment conditions, and databases . In the randomized database method, the negative control dataset is generated by searching the constructed randomized database and used to simulate random matches from the normal database. The false positive rate can be estimated using the numbers of matches from the normal and randomized database given a set of filter criteria.
Moore et al. used the reverse database (a special kind of randomized database) for their Qscore model in 2002 . Subsequently, Qian et al.  and Peng et al.  used the reverse database method to investigate the problem of optimizing the cutoff value of Xcorr and ΔCn in yeast and human proteome research, respectively. Recently, Higdon et al.  investigated some problems encountered in the application of the reshuffled database. As they noted, searching a combined database can yield more accurate FPR estimation than individually searching normal and reshuffled databases. Based on the binomial distribution, Huttlin et al. investigated the minimum error associated with the estimated FPR . They pointed out that the estimated FPR for a large dataset could be quite accurate. Randomized database methods have been widely used in many research projects [34–40]. However, different groups use different criteria; there is no standard statistical framework that can easily integrate commonly used parameters for the quality control of database search results.
where N R and N N are the preserved number of peptide matches that pass certain filter criteria and derive from the randomized and normal databases, respectively. Huttlin et al  have given a statistical interpretation of formula 2 by using the binomial distribution. So, in this paper, we used formula 2 to estimate FPR. Generally, the filter criteria are discriminant functions (DFs) of database search scores. Determining the acceptance boundaries for database search scores (such as Xcorr and ΔCn) is a simple and commonly used method [25, 26]. Lopez-Ferrer et al sought to introduce a statistical model that would provide a more complex DF and thus improve the sensitivity of filter criteria . In their model, XCc(= ln(Xcorr)) and of random matches were considered to follow normal distributions, and the distributions of XCc and DCc were assumed to be independent. The contour line of the estimated joint distribution of XCc and DCc was used as the filter boundary. However, we found that normal distributions do not fit well the distributions of XCc and DCc of the random matches in the LCQ control dataset used in this paper(see "Datasets and database search" section); the χ2 goodness of fit test shows that we can reject the null hypothesis H0 (the distribution is normal) at a significance level of 0.05. Furthermore, the correlation between XCc and DCc is significant (correlation coefficient = 0.1, p-value = 1.8 × 10-24; random matches in the LCQ control dataset, see section "Datasets and database search") which is inconsistent with the independence assumption made by Lopez-Ferrer et al. Another problem with their model is that it cannot be generalized to the situations involving more parameters.
Multivariate nonparametric models can describe data with complex and variable statistical structures. The term nonparametric is not meant to imply that such models do not use any parameters but rather denotes that the number and nature of the parameters are not fixed in advance but flexible. This advantage makes nonparametric models a powerful tool for addressing the problem of multiple parameters with variable distributions in the validation of database search results. Using a set of kernel functions (such as a Gaussian kernel function); the nonparametric model can fit the distribution of multiple parameters directly with considerable accuracy [41, 42]. Generally, parameter estimation for a nonparametric model is an iterative optimization procedure. The fully nonparametric probability density function estimate (FnPDFe) procedure proposed by Archambeau et al.  and David et al. , which is based on a maximum likelihood estimate (MLE) and expectation-maximization (EM) algorithm, is easily implemented with computer programs. In this paper, based on the randomized database searching, FnPDFe was used to estimate the multivariate PDF of the commonly used database scores, the contour lines of the estimated PDF were taken as the candidate DFs. We demonstrated that the FPR estimation errors of the newly introduced method were acceptable on the control datasets from different instruments (LCQ, LTQ and LTQ/FT), its sensitivity was also proved to be improved on the control datasets and the real sample datasets.
In this section, the DFs of the nonparametric model were discussed at first, and then we show that the sensitivity of the model could be improved by incorporating more features. The accuracy of the FPR estimation of the nonparametric model was investigated and the performance of the nonparametric model was proved superior by comparing with other commonly used methods in proteomics.
Nonparametric model and the DF
The model with 3 Gaussian functions for +2 charge observations in the LCQ control dataset
andXcorr > m Xcorr
Incorporating more features
One obvious advantage of the nonparametric model is that it can easily integrate more scores for validating peptide identifications. By taking into account more features and performing the classification in a high-dimension feature space, a more reasonable DF can be found, and thus, higher sensitivity can be achieved. Here, another powerful parameter called Sim introduced by Zhang  in 2004 and discussed by Sun et al.  recently was added to the nonparametric model. Sim measures the similarity between the experiment and the predicted MS/MS spectrum which was generated by the kinetic model introduced by Zhang  and the mass error tolerance for aligning the ions was specified as 0.5.
For the LCQ control dataset, by trial and error, we found a nonparametric model with 5 component GDFs can work well (65 parameters). We also tried a model with 7 component GDFs, but its performance was not improved and two of the component GDFs had a coefficient P i near 0 [see Additional file 1]. Thus, we selected 5 component GDFs to build the model. When the expected FPR was 0.05 and 0.01, the actual FPR was 0.044 and 0.012, respectively. The number of peptide matches after filtering was 765 and 699, which were 104 (approximately 15.6%) and 121 (approximately 20.9%) respectively higher than the results of the nonparametric model using Xcorr and ΔCn, respectively. The sensitivity increased to 0.879 and 0.822 respectively, and the specificity did not change. Thus, by incorporating more features, the nonparametric model can provide greater discriminating power. In the following part of this paper, we discussed the nonparametric model with three features: Xcorr, ΔCn and Sim only. All the model parameters used in this paper were provided in Additional file 1.
The accuracy of the FPR estimation
Actual FPRs and the corresponding estimated FPRs
Expected FPR = 0.05
Expected FPR = 0.01
Total matches/false positive matches
Total matches/false positive matches
In most cases, the FPRs estimated by formula 2 were close to but larger than the actual FPRs. Thus, the quality of the resulting datasets was better than claimed. It facilitates the strict result quality control but some sensitivity is lost.
For little datasets, such as +1 charge state matches of different instruments, the actual FPR was larger than the corresponding estimated FPR. The error of the FPR estimation was also a bit larger. This result agrees with the conclusions of Huttlin et al .
The estimated FPRs were not equal but close to the expected FPR. The smaller the resulting datasets, the larger the difference between estimated FPR and expected FPR. This arises from the rounding error in formula 2. For example, with an expected FPR of 0.01, the allowable number of random matches was less than 1 for the +1 charge dataset of LCQ, because only 62 matches were left after filtering. Thus, it is impossible to have an estimated FPR exactly equal to 0.01. A preferred alternative is rounding the estimated FPR to 0 (Table 2).
The error of the FPR estimation at the expected FPR of 0.01 is larger that of 0.05. This result means that some unexpected contaminants exist. For example, in the LCQ control dataset, peptide "HVGDLGNVTADK" was identified with high database scores Xcorr = 4.5837, ΔCn = 0.542204) and the matched percentage of predicted ions reached 91% (Figure 3). This peptide comes from protein sp|P00441| SODC_HUMAN, which is not a protein in the control sample. But this peptide also belongs to protein sp|P00442|SODC_BOVIN, which may be contaminants in the sample because 4 proteins (ALBU_BOVIN, LACB_BOVIN, LCA_BOVIN and CYC_BOVIN) of bovine were added to the control sample.
Manually checking the confirmed matches by the nonparametric model, we found that some results with large Xcorr but very small ΔCn were confirmed. In some cases, the peptide in the second rank was correct. For example, in the LTQ dataset (D2), a peptide "LEAELEK" was identified with Xcorr = 2.4273 and ΔCn = 0.0533 (+1 charge state). The peptide at the second rank was "LEALEEK", a peptide from control protein P62937|PPIA_HUMAN, because of the theoretic mass spectrum similarity between these peptides, which will result in some FPR estimation error.
Compare the performance of nonparametric model with other methods
Comparison of different methods on the control datasets
Expected FPR = 0.05
Expected FPR = 0.01
Validated matches/false positives
Validated matches/false positives
In each case, the sensitivity of M3 is the highest. The difference in sensitivity of different methods ranges from 3.8% to 27.7%.
For the LCQ and LTQ dataset, the performance of M1 and M2 differs little and Peptideprophet (M2) which was trained by a LCQ control dataset  does not seem to work well on the LTQ/FT dataset.
The performance of the nonparametric model differs little on the dataset of different instruments. When the expected FPR is 0.05, the sensitivity is above 0.85 and it is above 0.74 when the expected FPR is 0.01.
FPR estimation errors exist for different methods. In some cases, the error is large. This may be caused by the calculation errors because of unexpected contaminants and random errors.
Application to large datasets
Shotgun experiments always generate large datasets . Thus, the nonparametric model demonstrated to be effective with the control dataset should be validated using large datasets. At first, we investigated the quality of the confirmed matches by the nonparametric model (The filter criteria can be found in Additional file 7). Another 6 parameters which were commonly used to validate the peptide identifications of SEQUEST database search results were calculated for each match. They are maximal continuous b or y ion series length (CSL) , the matched percentage of the predicted ions by SEQUEST (Ions) , ranked preliminary score (RSp) , the continuity of b or y ion series (Cont) , the matched percentage of ion intensities in the experiment mass spectrum (iIons)  and the matched percentage of the peak number in the experiment mass spectrum (nIons) . The percentages of the confirmed results which passed the empirical rules (Table 4) convinced us that most of these matches had a high confidence level. It must be noted that RSp = 1 is a strict rule  and some correct matches may be lost if we require RSp = 1. For instance, only 76% correct matches are with RSp = 1 in the LTQ control dataset.
Validate the confirmed matches by empirical rules (%).
CSL ≥ 4
Ions ≥ 0.2
RSp = 1
Conts ≥ 0.2
iIons ≥ 0.25
nIons ≥ 0.2
Comparison of different methods on the complex datasets
Expected FPR = 0.05
Expected FPR = 0.01
Proteins with at least 2/3 peptide hits (%)
Proteins with at least 2/3 peptide hits (%)
Due to the complexity of the peptide identification problem, many parameters have been proposed for use in modeling the quality of matches between MS/MS spectra and peptides. For example, Xcorr and Sim assess the similarity between theoretical and experimental spectra, and ΔCn assesses the effect of database size. There are two main reasons for the simultaneous existence of multiple parameters. First, the complex physical and chemical process of the MS/MS platform makes it difficult to model the peptide identification problem universally ). Second, the huge computational burden of the database search makes it difficult to implement complex models. Thus, most MS/MS data processing approaches currently used include two steps: 1) find candidate peptides quickly and thus reduce the search space; 2) validate the results carefully by taking into account more information. As in this paper, a popular way for quality control of data in shotgun proteomics is to generate a set of easily calculated scores measuring the quality of the matches in different ways and then to combine these parameters to validate the results . The randomized database method provides a feasible framework for constructing a negative control dataset and controlling the FPR of the acquired dataset. The nonparametric model introduced in this paper provides a framework for feature integration and determination of nonlinear DFs. However, if too many parameters are used, the nonparametric model will encounter a computational problem. With too many variable parameters in the model, there may be many solutions to the MLE equations. Thus, the iterative process of the EM algorithm may reach a local minimum, and good performance of the model cannot be guaranteed. Thus, when many features are used, it is recommended that the features be partitioned into different groups by hierarchical clustering  and the nonparametric model be applied to each cluster. Other feature-space reduction methods such as principal component analysis (PCA) and partial least squares (PLS) can also be used .
The computational burden of the nonparametric model may be doubted, especially for the huge LTQ dataset. It is lucky that it does not need so many observations to build the nonparametric model. If the dataset is too large, we can resample the observations and use fewer observations to build the model. We tried this approach on the LTQ complex dataset. The results achieved by the model built with randomly selected 30,000 observations differed little from that of the model built with all the 432,338 observations. Thus, in the model building procedure, if the number of the observations exceeds 30,000, we resample the dataset and randomly select 30,000 observations to build the model and if the number of the observations is less than 30,000, all the observations are used. Therefore, the consumed time of the model building was less than 2 min on a PC with Intel Pentium 4 2.8G CPU and 512 MB memory.
In this paper, we provide a framework for validation of peptide identification in shotgun proteomics that is based on the randomized database method and a nonparametric model. The practical problems in implementing the nonparametric model were investigated, and its performance was found to be better than that of traditional methods. The nonparametric model can provide a more flexible and accurate solution for DF determination for quality control of large datasets in shotgun proteomics research. All the programs used in this work are available by request from the authors.
Datasets and database search
The 6 datasets used in this paper.
Real sample dataset
Reference or notes
12 purified proteins + 23 peptides
49 purified human proteins
8 purified proteins
Human K562 cell line
the BIATECH Institute (Bothell, WA 98011, USA)
Proteomics Standards Research Group (sPRG) 
Beijing proteome Research Center (Beijing 102206, China)
Open Proteomics Database (OPD)
Beijing proteome Research Center (Beijing 102206, China)
Beijing proteome Research Center (Beijing 102206, China)
The two unpublished LTQ/FT datasets were provided by Beijing Proteome Research Center (BPRC). The samples were digested with trypsin and then analyzed by a 7-Tesla LTQ/FT mass spectrometer (Thermo Electron, San Jose, CA) coupled with an Agilent 1100 nano-flow liquid chromatography system. The reverse phase C18 trap columns (300 μ m internal diameter × 5 mm long column) were connected with the 6-port column-switching valve for the on-line desalting. A PicoFritTM tip column (BioBasic C18, 5 μ m particle size, 75 μ m internal diameter × 10 cm long column, 15 μ m internal diameter at spray tip, New Objective, Woburn, MA, USA) was used for the following separation. Elution was solvent A (Milli-Q water, 2 % acetonitrile and 0.1%FA, v/v/v) and solvent B (Milli-Q water, 80% acetonitrile and 0.1%FA, v/v/v). The gradient was 15–40% B in 40 min, 40–100% B in 10 min. One FT full MS scan was followed by 5 data-dependent LTQ MS/MS scans on the five most intense ions. The dynamical excluding time was 45 seconds. Ions were accumulated in linear ion trap controlled by AGC. The AGC values were 5 × 105 charges for FT full MS scan and 1 × 104 charges for LTQ MS/MS scan. The resolution was 10,000 for FT full MS scan at m/z 400. The temperature of the ion transfer tube was set at 200°C and the spray voltage was 1.8 KV. The isolation width was 4Da and normalized collision energy was 35% for MS/MS scan. Mass spectra were acquired over the m/z range from 400 to 2000.
All the MS/MS spectra were extracted from the *.raw files by Extract_MSn.exe which is a console program in Bioworks 3.2 (Thermo Finnigan, San Jose, CA). For the LCQ datasets, the minimal total ion intensity is 10,000. For the LTQ or LTQ/FT datasets, the total ion intensity of each MS/MS spectrum is required to exceed 100. For all the datasets, the spectra must have at least 20 ions. Then the database search was performed on a local TurboSEQUEST (version 2.7) server. The fixed modification of oxidation (15.99Da) on the Met residue and the variable modification of carboxyamidomethylation (57.02Da) on the Cys residue were set. The enzyme was trypsin and the maximal allowed missed cleavage sites was 2. Only the b and y ions were taken into account. For the LCQ or LTQ datasets, the precursor mass error tolerance was 3.0Da, and for the LFQ/FT datasets, it was 15ppm.
For all the datasets except D2, which was searched against the database published by sPRG , the searched databases were derived from IPI Human 3.19 . For the control datasets, the control sequences for dataset D1 and D3 [see Additional file 3, 4 and 5] including the sequences of purified proteins or peptides plus the typical sample contaminants such as keratin and trypsin were added into the IPI Human 3.19. The control sequences for D2 were determined according to the report of sPRG (see Additional file 4) . The databases were constructed using the method proposed in one of our previous paper  and could be described as: the protein sequences in the normal database were digested in silico (trypsin), and then the amino acid residues (AAR) (except the one on the C-terminal) of the resulting peptides were reshuffled by using a random number generator. Then the reshuffled peptides were spliced to form new protein sequences in the randomized database. Finally, the normal database and the randomized database were merged to form the searched database.
Database search results of the 6 datasets
Database search results
The workflow of the nonparametric model based method
The workflow of the nonparametric model based method is shown in Figure 7. Firstly, a randomized database was constructed by randomizing the tryptic peptide sequence. Then the MS/MS spectra were searched against the combined database using SEQUEST. Then, matches with an assigned peptide from the randomized database (we call them randomized database matches, RDM) were used to build the nonparametric model. The joint distribution of selected parameters (such as Xcorr, ΔCn and Sim [31, 45]) of random matches was fit with the nonparametric model using the FnPDFe method and the contour lines of the estimated PDF, which are complex nonlinear functions, were used as candidate DFs. The actually used DFs were determined according to the expected FPR and formula 2 for different charge states. Finally, the resulting DFs were used to filter the matches from the normal database. In the model-building step, k-means clustering was used to initialize the EM algorithm procedure.
Initial the nonparametric model with k-means clustering
Thus, a normalization step, which calculated the z-score of the observed values of each feature, was used to eliminate the scale difference, and thus achieve a more reasonable partition (Figure 8).
Nonparametric model and the EM algorithm
and P(i),i = 1,...N satisfies: (1) 0 <P(i) ≤ 1; (2) . μ i , Σ i is the mean vector and covariance matrix of the i-th MGDF.
Initial step: Initialize the objective parameters μ i , Σ i , and P(i) with heuristic knowledge or random values.
- (2)E-step: update the posterior distributions:(9)
- (3)M-step: estimate the current parameters:(10)(11)(12)
(4) Repeat steps 2–3 until the change of parameters is very little.
One problem with implementation of the EM algorithm is how to initialize the parameters. Use of an improper starting point may prolong the converging time of the EM algorithm or cause it to reach a local minimum. In this work, k-means clustering was used to partition the observations into subclasses, and the means and covariance matrixes of the component Gaussian distributions were initialized using the means and covariance matrixes of the subclasses.
Another difficulty in implementing the EM algorithm is the selection of the number of component density functions. Generally speaking, inclusion of more functions will approximate the distributions of the observations more accurately, while allowing more parameters to be determined. However, overly complex models may cause the EM algorithm to reach a local minimum and worsen the performance of the resulting model. In this work, a trial and error procedure was used to select the minimum number of component density functions: try numbers from 2 until the change of the likelihood function value is very little (such as less than 1%).
tandem mass spectrometry
false positive rate
3D quadrupole ion trap
linear trap quadrupole
probability density function
fully nonparametric probability density function estimate
maximum likelihood estimate
multivariate Gaussian density function
randomized database matches
international protein index
maximum likelihood estimate
We thank Dr.Songfeng Wu of the Beijing Proteome Research Centre for his thoughtful discussion. We also thank Master's candidate JieMa of the Beijing Proteome Research Centre for assistance with the database search. The LCQ control dataset was provided by the BIATECH institute and Dr.Zhongqi Zhang kindly provided the program MassAnalyzer, we thank them here. This work was funded by the Chinese Ministry of Science and Technology (2006AA02A312, 2006AA02Z334, 2006CB910803, 2006CB910700), the National Natural Science Foundation of China (30621063, 342123), and the Beijing Municipal Science and Technology Project (H030230280590), Chinese National Key Program of Basic Research (2006CB910700).
- Pandey A, Mann M: Proteomics to study genes and genomes. Nature 2000, 405(6788):837–46. 10.1038/35015709View ArticlePubMedGoogle Scholar
- Patterson SD, Aebersold RH: Proteomics: the first decade and beyond. Nat Genet 2003, 33(Suppl):311–23. 10.1038/ng1106View ArticlePubMedGoogle Scholar
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422(6928):198–207. 10.1038/nature01511View ArticlePubMedGoogle Scholar
- Domon B, Aebersold R: Mass spectrometry and protein analysis. Science 2006, 312(5771):212–7. 10.1126/science.1124619View ArticlePubMedGoogle Scholar
- Nesvizhskii AI, Aebersold R: Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS. Drug Discov Today 2004, 9(4):173–81. 10.1016/S1359-6446(03)02978-7View ArticlePubMedGoogle Scholar
- Eng JK, McCormack AL, Yates JR 3rd: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 1994, 5(11):976–89. 10.1016/1044-0305(94)80016-2View ArticlePubMedGoogle Scholar
- Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20(18):3551–67. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2View ArticlePubMedGoogle Scholar
- Domon B, Aebersold R: Challenges and opportunities in proteomics data analysis. Mol Cell Proteomics 2006, 5(10):1921–6. 10.1074/mcp.R600012-MCP200View ArticlePubMedGoogle Scholar
- Sadygov RG, Cociorva D, Yates JR 3rd: Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book. Nat Methods 2004, 1(3):195–202. 10.1038/nmeth725View ArticlePubMedGoogle Scholar
- Chamrad D, Meyer HE: Valid data from large-scale proteomics studies. Nat Methods 2005, 2(9):667–75. 10.1038/nmeth0905-647View ArticleGoogle Scholar
- Chen Y, Kwon SW, Kim SC, Zhao Y: Integrated Approach for Manual Evaluation of Peptides Identified by Searching Protein Sequence Databases with Tandem Mass Spectra. J Proteome Res 2005, 4(3):998–1005. 10.1021/pr049754tView ArticlePubMedGoogle Scholar
- Tabb DL, McDonald WH, Yates JR 3rd: DTASelect and Contrast: Tools for Assembling and Comparing Protein Identifications from Shotgun Proteomics. J Proteome Res 2002, 1(1):21–6. 10.1021/pr015504qPubMed CentralView ArticlePubMedGoogle Scholar
- Sun W, Li F, Wang J, Zheng D, Gao Y: AMASS: Software for Automatically Validating the Quality of MS/MS Spectrum from SEQUEST Results. Mol Cell Proteomics 2004, 3(12):1194–1199. 10.1074/mcp.M400120-MCP200View ArticlePubMedGoogle Scholar
- Link AJ, Eng J, Schieltz DM, Carmack E, Mize GJ, Morris DR, Garvik BM, Yates JR 3rd: Direct analysis of protein complexes using mass spectrometry. Nat Biotechnol 1999, 17(7):676–82. 10.1038/10890View ArticlePubMedGoogle Scholar
- Keller A, Nesvizhskii AI, Kolker E, Aebersold R: Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search. Anal Chem 2002, 74(20):5383–5392. 10.1021/ac025747hView ArticlePubMedGoogle Scholar
- Lopez-Ferrer D, Martinez-Bartolome S, Villar M, Campillos M, Martin-Maroto F, Vazquez J: Statistical model for large-scale peptide identification in databases from tandem mass spectra using SEQUEST. Anal Chem 2004, 76(23):6853–6860. 10.1021/ac049305cView ArticlePubMedGoogle Scholar
- Eriksson J, Fenyo D: A model of random mass-matching and its use for automated significance testing in mass spectrometric proteome analysis. Proteomics 2002, 2(3):262–270. 10.1002/1615-9861(200203)2:3<262::AID-PROT262>3.0.CO;2-WView ArticlePubMedGoogle Scholar
- Sadygov RG, Yates JR 3rd: A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. Anal Chem 2003, 75(15):3792–3798. 10.1021/ac034157wView ArticlePubMedGoogle Scholar
- Sadygov RG, Liu H, Yates JR 3rd: Statistical Models for Protein Validation Using Tandem Mass Spectral Data and Protein Amino Acid Sequence Databases. Anal Chem 2004, 76(6):1664–1671. 10.1021/ac035112yView ArticlePubMedGoogle Scholar
- Moore RE, Young MK, Lee TD: Qscore: An Algorithm for Evaluating SEQUEST Database Search Results. J Am Soc Mass Spectrom 2002, 13(4):378–386. 10.1016/S1044-0305(02)00352-5View ArticlePubMedGoogle Scholar
- Baüczek T, Bucinski A, Ivanov Ar, Kaliszan R: Artificial Neural Network Analysis for Evaluation of Peptide MS/MS Spectra in Proteomics. Anal Chem 2004, 76(6):1726–1732. 10.1021/ac030297uView ArticleGoogle Scholar
- Razumovskaya J, Olman V, Xu D, Uberbacher EC, VerBerkmoes NC, Hettich RL, Xu Y: A computational method for assessing peptide identification Reliability in tandem mass spectrometry analysis with SEQUEST. Proteomics 2004, 4(4):961–969. 10.1002/pmic.200300656View ArticlePubMedGoogle Scholar
- Anderson DC, Li W, Payan DG, Noble WS: A New Algorithm for the Evaluation of Shotgun Peptide Sequencing in Proteomics: Support Vector Machine Classification of Peptide MS/MS Spectra and SEQUEST Scores. J Proteome Res 2003, 2(2):137–146. 10.1021/pr0255654View ArticlePubMedGoogle Scholar
- Ulintz PJ, Zhu J, Qin ZS, Andrews PC: Improved classification of mass spectrometry database search results using newer machine learning approaches. Mol Cell Proteomics 2006, 5(3):497–509.View ArticlePubMedGoogle Scholar
- Qian WJ, Liu T, Monroe ME, Strittmatter EF, Jacobs JM, Kangas LJ, Petritis K, Camp DG 2nd, Smith RD: Probability-Based Evaluation of Peptide and Protein identifications from Tandem Mass Spectrometry and SEQUEST Analysis: The Human Proteome. J Proteome Res 2005, 4(1):53–62. 10.1021/pr0498638View ArticlePubMedGoogle Scholar
- Peng J, Elias JE, Thoreen CC, Licklider LJ, Gygi SP: Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J Proteome Res 2003, 2(1):43–50. 10.1021/pr025556vView ArticlePubMedGoogle Scholar
- Elias JE, Haas W, Faherty BK, Gygi SP: Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nat Methods 2005, 2(9):667–75. 10.1038/nmeth785View ArticlePubMedGoogle Scholar
- Higdon R, Hogan JM, Van Belle G, Kolker E: Randomized sequence databases for tandem mass spectrometry peptide and protein identification. OMICS 2005, 9(4):364–79. 10.1089/omi.2005.9.364View ArticlePubMedGoogle Scholar
- Li F, Sun W, Gao Y, Wang J: RScore: a peptide randomicity score for evaluating tandem mass spectra. Rapid Commun Mass Spectrom 2004, 18(14):1655–9. 10.1002/rcm.1535View ArticlePubMedGoogle Scholar
- Fenyo D, Beavis RC: A Method for Assessing the Statistical Significance of Mass Spectrometry-Based Protein Identifications Using General Scoring Schemes. Anal Chem 2003, 75(4):768–74. 10.1021/ac0258709View ArticlePubMedGoogle Scholar
- Sun S, Meyer-Arendt K, Eichelberger B, Brown R, Yen CY, Old WM, Pierce K, Cios KJ, Ahn NG, Resing KA: Improved validation of peptide MS/MS assignments using spectral intensity prediction. Mol Cell Proteomics 2007, 6(1):1–17. 10.1074/mcp.M600449-MCP200View ArticlePubMedGoogle Scholar
- Strittmatter EF, Kangas LJ, Petritis K, Mottaz HM, Anderson GA, Shen Y, Jacobs JM, Camp DG 2nd, Smith RD: Application of peptide LC retention time information in a discriminant function for peptide identification by tandem mass spectrometry. J Proteome Res 2004, 3(4):760–9. 10.1021/pr049965yView ArticlePubMedGoogle Scholar
- Huttlin EL, Hegeman AD, Harms AC, Sussman MR: Prediction of Error Associated with False-Positive Rate Determination for Peptide Identification in Large-Scale Proteomics Experiments Using a Combined Reverse and Forward Peptide Sequence Database Strategy. J Proteome Res 2007, 6(1):392–398. 10.1021/pr0603194PubMed CentralView ArticlePubMedGoogle Scholar
- Andersen JS, Lam YW, Leung AK, Ong SE, Lyon CE, Lamond AI, Mann M: Nucleolar proteome dynamics. Nature 2005, 433(7021):77–83. 10.1038/nature03207View ArticlePubMedGoogle Scholar
- Pilch B, Mann M: Large-scale and high-confidence proteomic analysis of human seminal plasma. Genome Biol 2006, 7(5):R40. 10.1186/gb-2006-7-5-r40PubMed CentralView ArticlePubMedGoogle Scholar
- de Godoy LM, Olsen JV, de Souza GA, Li G, Mortensen P, Mann M: Status of complete proteome analysis by mass spectrometry: SILAC labeled yeast as a model system. Genome Biol 2006, 7(6):R50. 10.1186/gb-2006-7-6-r50PubMed CentralView ArticlePubMedGoogle Scholar
- Adachi J, Kumar C, Zhang Y, Olsen JV, Mann M: The human urinary proteome contains more than 1500 proteins including a large proportion of membranes proteins. Genome Biol 2006, 7(9):R80. 10.1186/gb-2006-7-9-r80PubMed CentralView ArticlePubMedGoogle Scholar
- Beausoleil SA, Villen J, Gerber SA, Rush J, Gygi SP: A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat Biotechnol 2006, 24(10):1285–92. 10.1038/nbt1240View ArticlePubMedGoogle Scholar
- Everley PA, Bakalarski CE, Elias JE, Waghorne CG, Beausoleil SA, Gerber SA, Faherty BK, Zetter BR, Gygi SP: Enhanced analysis of metastatic prostate cancer using stable isotopes and high mass accuracy instrumentation. J Proteome Res 2006, 5(5):1224–31. 10.1021/pr0504891View ArticlePubMedGoogle Scholar
- Haas W, Faherty BK, Gerber SA, Elias JE, Beausoleil SA, Bakalarski CE, Li X, Villen J, Gygi SP: Optimization and use of peptide mass measurement accuracy in shotgun proteomics. Mol Cell Proteomics 2006, 5(7):1326–37. 10.1074/mcp.M500339-MCP200View ArticlePubMedGoogle Scholar
- Hwang JN, Lay SR, Lippman A: Nonparametric Multivariate Density Estimation: A Comparative Study. IEEE Transactions on Signal Processing 1994, 42(10):2795–2810. 10.1109/78.324744View ArticleGoogle Scholar
- Archambeau C, Verleysen M: Fully nonparametric probability density function estimation with finite gaussian mixture models. 7th ICPAR Conf 2003, 81–84.Google Scholar
- Duda RichardO, Hart PeterE, Stork DavidG: Pattern Classification, Second Edition. John Wiley 2001, 10: 3–13.Google Scholar
- Resing KA, Meyer-Arendt K, Mendoza AM, Aveline-Wolf LD, Jonscher KR, Pierce KG, Old WM, Cheung HT, Russell S, Wattawa JL, Goehle GR, Knight RD, Ahn NG: Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics. Anal Chem 2004, 76(13):3556–68. 10.1021/ac035229mView ArticlePubMedGoogle Scholar
- Zhang Z: Prediction of Low-Energy Collision-Induced Dissociation Spectra of Peptides. Anal Chem 2004, 76(14):3908–3922. 10.1021/ac049951bView ArticlePubMedGoogle Scholar
- Purvine S, Picone AF, Kolker E: Standard mixtures for proteome studies. OMICS 2004, 8(1):79–92. 10.1089/153623104773547507View ArticlePubMedGoogle Scholar
- Yang X, Dondeti V, Dezube R, Maynard DM, Geer LY, Epstein J, Chen X, Markey SP, Kowalak JA: DBParser: Web-Based Software for Shotgun Proteomic Data Analyses. J Proteome Res 2004, 3(5):1002–08. 10.1021/pr049920xView ArticlePubMedGoogle Scholar
- Nesvizhskii AI, Roos FF, Grossmann J, Vogelzang M, Eddes JS, Gruissem W, Baginsky S, Aebersold R: Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data: toward more efficient identification of post-translational modifications, sequence polymorphisms, and novel peptides. Mol Cell Proteomics 2006, 5(4):652–70.View ArticlePubMedGoogle Scholar
- Handl J, Knowles J, Kell DB: Computational cluster validation in post-genomic data analysis. Bioinformatics 2005, 21(15):3201–12. 10.1093/bioinformatics/bti517View ArticlePubMedGoogle Scholar
- Geladi P, Kowlaski B: Partial least square regression: A tutorial. Analytica Chemica Acta 1986, 35: 1–17. 10.1016/0003-2670(86)80028-9View ArticleGoogle Scholar
- Bilmes JA: A gentle tutorial of the EM algorithm and its applications to parameter estimation for gaussian mixture and hidden Markov models. International Computer Science Institute, Berkeley, California; 1998.Google Scholar
- Nash SG, Sofer : Linear and Nonlinear Programming. McGraw-Hill 1996.Google Scholar
- Scott DW: On optimal and data-based histograms. Biometrika 1979, 66: 605–610. 10.1093/biomet/66.3.605View ArticleGoogle Scholar
- Silverman BW: Density estimation for statistics and data analysis. Chapman Hall: London; 1986.View ArticleGoogle Scholar
- Chen M, Ying W, Song Y, Liu X, Yang B, Wu S, Jiang Y, Cai Y, He F, Qian X: Analysis of human liver proteome using replicate shotgun strategy. Proteomics 2007, 7(14):2479–88. 10.1002/pmic.200600338View ArticlePubMedGoogle Scholar
- Prince JT, Carlson MW, Wang R, Lu P, Marcotte EM: The need for a public proteomics repository. Nat Biotechnol 2004, 22(4):471–2. 10.1038/nbt0404-471View ArticlePubMedGoogle Scholar
- Zhang J, Li J, Xie H, Zhu Y, He F: A new strategy to filter out false positive identifications of peptides in SEQUEST database search results. Proteomics 2007, 7(22):4036–44. 10.1002/pmic.200600929View ArticlePubMedGoogle Scholar
- Modha D, Scott-Spangler W: Feature weighting in k-means clustering. Machine Learning 2003, 52(3):217–237. 10.1023/A:1024016609528View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.