A nonparametric model for quality control of database search results in shotgun proteomics

Zhang, Jiyang; Li, Jianqi; Liu, Xin; Xie, Hongwei; Zhu, Yunping; He, Fuchu

doi:10.1186/1471-2105-9-29

Methodology article
Open access
Published: 21 January 2008

A nonparametric model for quality control of database search results in shotgun proteomics

Jiyang Zhang^1,2,
Jianqi Li²,
Xin Liu²,
Hongwei Xie¹,
Yunping Zhu² &
…
Fuchu He^1,2

BMC Bioinformatics volume 9, Article number: 29 (2008) Cite this article

5602 Accesses
Metrics details

Abstract

Background

Analysis of complex samples with tandem mass spectrometry (MS/MS) has become routine in proteomic research. However, validation of database search results creates a bottleneck in MS/MS data processing. Recently, methods based on a randomized database have become popular for quality control of database search results. However, a consequent problem is the ignorance of how to combine different database search scores to improve the sensitivity of randomized database methods.

Results

In this paper, a multivariate nonlinear discriminate function (DF) based on the multivariate nonparametric density estimation technique was used to filter out false-positive database search results with a predictable false positive rate (FPR). Application of this method to control datasets of different instruments (LCQ, LTQ, and LTQ/FT) yielded an estimated FPR close to the actual FPR. As expected, the method was more sensitive when more features were used. Furthermore, the new method was shown to be more sensitive than two commonly used methods on 3 complex sample datasets and 3 control datasets.

Conclusion

Using the nonparametric model, a more flexible DF can be obtained, resulting in improved sensitivity and good FPR estimation. This nonparametric statistical technique is a powerful tool for tackling the complexity and diversity of datasets in shotgun proteomics.

Background

The objective of proteomics is to investigate proteins on a global scale [1, 2]. The high-throughput and sensitive tandem mass spectrometry (MS/MS) platform is now a supporting technology for protein identification in proteomic research [3, 4]. Using the shotgun strategy, a large number of MS/MS spectra can be gathered in a few hours [5]. The MS/MS data is generally processed by the so-called database searching method [5]. Automated software such as SEQUEST [6] and MASCOT [7] can rapidly assign tryptic peptides to MS/MS spectra by searching a protein sequence database and then identify proteins by utilizing the identified peptides. A notable problem in the MS/MS data processing is the high false positive rate (FPR) of the database search results [8]. Thus, validation of database search results is unavoidable and necessary work, particularly when processing the large amount low accuracy MS/MS spectra with SEQUEST [10].

There are many proposed parameters and algorithms for evaluating SEQUEST database search results [11–32]. Such approaches must confront two main problems: First, the complex physical and chemical mechanisms of the shotgun experiment make it difficult to model the matches between MS/MS spectra and peptides with a one-size-fits-all algorithm [9]. Thus, database search software provides multiple scores, and many empirical and intuitive parameters are used in the validation of database search results. These parameters describe different aspects of the quality of the match and provide complementary information to the validation of the database search results. Combining these parameters while considering their relationships is difficult. Second, many factors can affect the distributions of quality control parameters, including the sample, the database, the experimental conditions, and other random factors [24, 27]. Avoiding the effects of such factors during the validation of database search results is difficult. In addition, large-scale proteomics always uses multiple, complementary MS/MS platforms and multiple database search software tools to acquire more results with a high confidence level. Thus, a universal framework for quality control of results is needed [8].

Recently, the randomized database method has become an attractive framework for quality control of MS/MS database search results. By constructing a negative control dataset for each experiment MS/MS dataset and the given database, the randomized database method can provide a universal foundation for the result quality control for different types of database search software and minimize the effects of differences in samples, experiment conditions, and databases [27]. In the randomized database method, the negative control dataset is generated by searching the constructed randomized database and used to simulate random matches from the normal database. The false positive rate can be estimated using the numbers of matches from the normal and randomized database given a set of filter criteria.

Moore et al. used the reverse database (a special kind of randomized database) for their Qscore model in 2002 [20]. Subsequently, Qian et al. [25] and Peng et al. [26] used the reverse database method to investigate the problem of optimizing the cutoff value of Xcorr and ΔCn in yeast and human proteome research, respectively. Recently, Higdon et al. [28] investigated some problems encountered in the application of the reshuffled database. As they noted, searching a combined database can yield more accurate FPR estimation than individually searching normal and reshuffled databases. Based on the binomial distribution, Huttlin et al. investigated the minimum error associated with the estimated FPR [33]. They pointed out that the estimated FPR for a large dataset could be quite accurate. Randomized database methods have been widely used in many research projects [34–40]. However, different groups use different criteria; there is no standard statistical framework that can easily integrate commonly used parameters for the quality control of database search results.

There are two primary problems with the randomized database method: how to determine the filter criteria and how to estimate the FPR in succession. Based on the hypothesis that random matches are randomly drawn from normal and randomized databases, formula 1 can be used to estimate the actual FPR [25, 26]; Elias et al [27] recommended formula 2 for reliable data quality control:

F P R = \frac{N_{R}}{N_{N}}

(1)

F P R = \frac{2 N_{R}}{N_{N} + N_{R}}

(2)

where N_Rand N_Nare the preserved number of peptide matches that pass certain filter criteria and derive from the randomized and normal databases, respectively. Huttlin et al [33] have given a statistical interpretation of formula 2 by using the binomial distribution. So, in this paper, we used formula 2 to estimate FPR. Generally, the filter criteria are discriminant functions (DFs) of database search scores. Determining the acceptance boundaries for database search scores (such as Xcorr and ΔCn) is a simple and commonly used method [25, 26]. Lopez-Ferrer et al sought to introduce a statistical model that would provide a more complex DF and thus improve the sensitivity of filter criteria [16]. In their model, XCc(= ln(Xcorr)) and $D C c (= \sqrt{Δ C n})$ of random matches were considered to follow normal distributions, and the distributions of XCc and DCc were assumed to be independent. The contour line of the estimated joint distribution of XCc and DCc was used as the filter boundary. However, we found that normal distributions do not fit well the distributions of XCc and DCc of the random matches in the LCQ control dataset used in this paper(see "Datasets and database search" section); the χ² goodness of fit test shows that we can reject the null hypothesis H₀ (the distribution is normal) at a significance level of 0.05. Furthermore, the correlation between XCc and DCc is significant (correlation coefficient = 0.1, p-value = 1.8 × 10^-24; random matches in the LCQ control dataset, see section "Datasets and database search") which is inconsistent with the independence assumption made by Lopez-Ferrer et al. Another problem with their model is that it cannot be generalized to the situations involving more parameters.

Multivariate nonparametric models can describe data with complex and variable statistical structures. The term nonparametric is not meant to imply that such models do not use any parameters but rather denotes that the number and nature of the parameters are not fixed in advance but flexible. This advantage makes nonparametric models a powerful tool for addressing the problem of multiple parameters with variable distributions in the validation of database search results. Using a set of kernel functions (such as a Gaussian kernel function); the nonparametric model can fit the distribution of multiple parameters directly with considerable accuracy [41, 42]. Generally, parameter estimation for a nonparametric model is an iterative optimization procedure. The fully nonparametric probability density function estimate (FnPDFe) procedure proposed by Archambeau et al. [42] and David et al. [43], which is based on a maximum likelihood estimate (MLE) and expectation-maximization (EM) algorithm, is easily implemented with computer programs. In this paper, based on the randomized database searching, FnPDFe was used to estimate the multivariate PDF of the commonly used database scores, the contour lines of the estimated PDF were taken as the candidate DFs. We demonstrated that the FPR estimation errors of the newly introduced method were acceptable on the control datasets from different instruments (LCQ, LTQ and LTQ/FT), its sensitivity was also proved to be improved on the control datasets and the real sample datasets.

Results

In this section, the DFs of the nonparametric model were discussed at first, and then we show that the sensitivity of the model could be improved by incorporating more features. The accuracy of the FPR estimation of the nonparametric model was investigated and the performance of the nonparametric model was proved superior by comparing with other commonly used methods in proteomics.

Nonparametric model and the DF

In order to illustrate the shape of the DFs derived from the nonparametric model, a two dimension model which used Xcorr and ΔCn was investigated at first. Because Xcorr significantly correlate with the charge state (+1, +2, and +3) [15], the matches with different charge states were processed individually. Since a large percentage of correct matches have a double charge, the matches in the control dataset with a double charge are discussed here. Using a trial and error approach, a model with 3 Gaussian functions (18 variables, Table 1) fit the distribution well (χ² goodness of fit test; significance level = 0.05). Figure 1A and Figure 1B show the histogram and density function, respectively. The estimated error for each bin is shown in Figure 1C. The small error (≤ 3.6 × 10^-3) also demonstrates that the fit is accurate.

Table 1 The model with 3 Gaussian functions for +2 charge observations in the LCQ control dataset

Full size table

DFs that can simultaneously reject as many false positives as possible and accept as many true positives as possible are preferred. Thus, the region in the feature space with fewer random matches is more preferred, and the contour lines of the PDF of the random matches are good candidate DFs (Figure 1D). Generally, random matches have a small ΔCn and Xcorr, while correct matches have a large ΔCn and Xcorr. Correct matches with the peptide isoform [44] have a small ΔCn and a large Xcorr. Matches with a small Xcorr and a large ΔCn may be due to the limited search space of the database searching. These matches are rare and more likely to be random matches; they may be localized to the accepted region of the contour line DFs because these results are also rare random events. A new DF of Xcorr was added to exclude such matches: Xcorr > m_Xcorr, where m_Xcorris the mean of Xcorr of randomized database matches (bold red vertical line in Figure 1D). Given an expected FPR α, a target value f_αcan be searched to ensure the calculated FPR (FPR_cal) is less than or equal to α. When searching for f_α, N_Nand N_Rwere counted according to the rules:

\sum_{i = 1}^{N} P (i) f_{G} (X | i) \leq f_{α}

(3)

andXcorr > m_Xcorr

where X = (Xcorr, ΔCn) is the observation, and N = 3 is the number of Gaussian functions. Many f_αsatisfied formula 3 and formula 4. The one with the largest N_Nwas used in the final DF. Figure 2 shows the DFs for different expected FPRs and different charge states. The shapes of the boundaries were significantly different, which indicates that it is difficult to fit all the distributions of different charge states with a simple distribution. The nonparametric model can provide feasible solutions to this complex problem. Since the resulting DFs are smooth, this method is more robust than the K nearest neighbor method [41].

Incorporating more features

One obvious advantage of the nonparametric model is that it can easily integrate more scores for validating peptide identifications. By taking into account more features and performing the classification in a high-dimension feature space, a more reasonable DF can be found, and thus, higher sensitivity can be achieved. Here, another powerful parameter called Sim introduced by Zhang [45] in 2004 and discussed by Sun et al. [31] recently was added to the nonparametric model. Sim measures the similarity between the experiment and the predicted MS/MS spectrum which was generated by the kinetic model introduced by Zhang [45] and the mass error tolerance for aligning the ions was specified as 0.5.

For the LCQ control dataset, by trial and error, we found a nonparametric model with 5 component GDFs can work well (65 parameters). We also tried a model with 7 component GDFs, but its performance was not improved and two of the component GDFs had a coefficient P_inear 0 [see Additional file 1]. Thus, we selected 5 component GDFs to build the model. When the expected FPR was 0.05 and 0.01, the actual FPR was 0.044 and 0.012, respectively. The number of peptide matches after filtering was 765 and 699, which were 104 (approximately 15.6%) and 121 (approximately 20.9%) respectively higher than the results of the nonparametric model using Xcorr and ΔCn, respectively. The sensitivity increased to 0.879 and 0.822 respectively, and the specificity did not change. Thus, by incorporating more features, the nonparametric model can provide greater discriminating power. In the following part of this paper, we discussed the nonparametric model with three features: Xcorr, ΔCn and Sim only. All the model parameters used in this paper were provided in Additional file 1.

The accuracy of the FPR estimation

The control datasets were generated by analyzing a set of known proteins and peptides with MS/MS platforms, which were commonly used to validate the performance of mathematical models for peptide identification [46]. Table 2 reports the actual FPR and the number of validated matches at two commonly expected FPRs of 0.05 and 0.01. From Table 2, the following propositions can be made:

Table 2 Actual FPRs and the corresponding estimated FPRs

Full size table

(1)
In most cases, the FPRs estimated by formula 2 were close to but larger than the actual FPRs. Thus, the quality of the resulting datasets was better than claimed. It facilitates the strict result quality control but some sensitivity is lost.
(2)
For little datasets, such as +1 charge state matches of different instruments, the actual FPR was larger than the corresponding estimated FPR. The error of the FPR estimation was also a bit larger. This result agrees with the conclusions of Huttlin et al [33].
(3)
The estimated FPRs were not equal but close to the expected FPR. The smaller the resulting datasets, the larger the difference between estimated FPR and expected FPR. This arises from the rounding error in formula 2. For example, with an expected FPR of 0.01, the allowable number of random matches was less than 1 for the +1 charge dataset of LCQ, because only 62 matches were left after filtering. Thus, it is impossible to have an estimated FPR exactly equal to 0.01. A preferred alternative is rounding the estimated FPR to 0 (Table 2).
(4)
The error of the FPR estimation at the expected FPR of 0.01 is larger that of 0.05. This result means that some unexpected contaminants exist. For example, in the LCQ control dataset, peptide "HVGDLGNVTADK" was identified with high database scores Xcorr = 4.5837, ΔCn = 0.542204) and the matched percentage of predicted ions reached 91% (Figure 3). This peptide comes from protein sp|P00441| SODC_HUMAN, which is not a protein in the control sample. But this peptide also belongs to protein sp|P00442|SODC_BOVIN, which may be contaminants in the sample because 4 proteins (ALBU_BOVIN, LACB_BOVIN, LCA_BOVIN and CYC_BOVIN) of bovine were added to the control sample.

(5)
Manually checking the confirmed matches by the nonparametric model, we found that some results with large Xcorr but very small ΔCn were confirmed. In some cases, the peptide in the second rank was correct. For example, in the LTQ dataset (D2), a peptide "LEAELEK" was identified with Xcorr = 2.4273 and ΔCn = 0.0533 (+1 charge state). The peptide at the second rank was "LEALEEK", a peptide from control protein P62937|PPIA_HUMAN, because of the theoretic mass spectrum similarity between these peptides, which will result in some FPR estimation error.

Compare the performance of nonparametric model with other methods

Two other methods were also be widely used in the proteomic research. The first one (named M1) searches for the optimized cut-off values of Xcorr and ΔCn simultaneously while making the number of confirmed matches reached its maximum given an expected FPR. The resulting accepted region on the Xcorr-ΔCn plane is a rectangle. The second one (named M2) is Peptideprophet (V1.9), which is an empirical statistic model, introduced by Keller et al [15]. PeptideProphet provided the estimated error rates (EER) at different probability score cut-offs. EER has similar meaning with FPR, so we used it as the measure of the quality of the resulting dataset and only the probability score cut-offs without additional criterion were used to filter the matches. In order to name it easily, we denote the nonparametric model as M3 in the following part of this paper. For the control datasets, the confirmed matches, the actual FPR and the sensitivity were listed in Table 3 (The filter criteria can be found in Additional file 1). Some conclusions can be drawn:

Table 3 Comparison of different methods on the control datasets

Full size table

(1)
In each case, the sensitivity of M3 is the highest. The difference in sensitivity of different methods ranges from 3.8% to 27.7%.
(2)
For the LCQ and LTQ dataset, the performance of M1 and M2 differs little and Peptideprophet (M2) which was trained by a LCQ control dataset [15] does not seem to work well on the LTQ/FT dataset.
(3)
The performance of the nonparametric model differs little on the dataset of different instruments. When the expected FPR is 0.05, the sensitivity is above 0.85 and it is above 0.74 when the expected FPR is 0.01.
(4)
FPR estimation errors exist for different methods. In some cases, the error is large. This may be caused by the calculation errors because of unexpected contaminants and random errors.

Application to large datasets

Shotgun experiments always generate large datasets [8]. Thus, the nonparametric model demonstrated to be effective with the control dataset should be validated using large datasets. At first, we investigated the quality of the confirmed matches by the nonparametric model (The filter criteria can be found in Additional file 7). Another 6 parameters which were commonly used to validate the peptide identifications of SEQUEST database search results were calculated for each match. They are maximal continuous b or y ion series length (CSL) [11], the matched percentage of the predicted ions by SEQUEST (Ions) [44], ranked preliminary score (RSp) [44], the continuity of b or y ion series (Cont) [13], the matched percentage of ion intensities in the experiment mass spectrum (iIons) [13] and the matched percentage of the peak number in the experiment mass spectrum (nIons) [23]. The percentages of the confirmed results which passed the empirical rules (Table 4) convinced us that most of these matches had a high confidence level. It must be noted that RSp = 1 is a strict rule [44] and some correct matches may be lost if we require RSp = 1. For instance, only 76% correct matches are with RSp = 1 in the LTQ control dataset.

As a case study, we investigated the overlaps of the three methods on the LTQ dataset. More than 90% of the matches confirmed by M1 or M2 were covered by M3 (Figure 4), and 89.1 (FPR = 0.05) and 83.6 (FPR = 0.01) of the matches confirmed by the nonparametric model were covered by M1 ∪ M2. Each method of the three can all provide some matches that are not covered by the other two because they utilize different filter boundaries and different parameters.

Table 4 Validate the confirmed matches by empirical rules (%).

Full size table

Figure 5A shows the mesh grids of a DF of M3 (+2 charge state matches in D5, FPR = 0.01). As it appears, the matches with the smaller Xcorr, ΔCn or Sim were discarded by M3, which agrees with the experience that the matches with large scores (Xcorr, ΔCn or Sim) are more possibly correct. Figure 5B~Figure 5E illustrate the score distributions of the matches uniquely confirmed by M1~M3. It is clear that some matches with small Xcorr, ΔCn and Sim were confirmed by PeptideProphet (red points), which integrated some other parameters, such as preliminary score (Sp). M2 confirmed some matches with middle Xcorr and ΔCn but small Sim (green points). M3 confirmed many matches (4714) with relative smaller Xcorr and ΔCn but large Sim, which were discarded by M1 and M3. These results demonstrated that different filter boundaries with different parameters would generate different results with different sensitivity and integrating more complementary parameters by appropriate methods could improve the sensitivity of database search result validation.

In Table 5, we gave the numbers of confirmed matches, non-redundant peptides, identified proteins (Minimal protein list assembled by DBParser algorithm [47]) and the percentage of proteins with at least 2 or 3 peptide hits (The filter criteria can be found in Additional file 7). The nonparametric model can confirm up to 14.5% more proteins than the other two kinds of methods, which indicated that our model has a higher sensitivity. For the same kind of instrument, three methods gave about the same percentage of proteins with at least 2 or 3 peptide hits at different confidence levels. The percentage of proteins with at least 2 peptide hits reaches above 50% for the LCQ or LTQ dataset, but it is about 40% for the LTQ/FT dataset. It is interesting that the percentage of proteins with at least 2 or 3 peptide hits can not be improved by improving the confidence level of the peptide identifications when one method is used.

Table 5 Comparison of different methods on the complex datasets

Full size table

Discussion

Due to the complexity of the peptide identification problem, many parameters have been proposed for use in modeling the quality of matches between MS/MS spectra and peptides. For example, Xcorr and Sim assess the similarity between theoretical and experimental spectra, and ΔCn assesses the effect of database size. There are two main reasons for the simultaneous existence of multiple parameters. First, the complex physical and chemical process of the MS/MS platform makes it difficult to model the peptide identification problem universally [48]). Second, the huge computational burden of the database search makes it difficult to implement complex models. Thus, most MS/MS data processing approaches currently used include two steps: 1) find candidate peptides quickly and thus reduce the search space; 2) validate the results carefully by taking into account more information. As in this paper, a popular way for quality control of data in shotgun proteomics is to generate a set of easily calculated scores measuring the quality of the matches in different ways and then to combine these parameters to validate the results [23]. The randomized database method provides a feasible framework for constructing a negative control dataset and controlling the FPR of the acquired dataset. The nonparametric model introduced in this paper provides a framework for feature integration and determination of nonlinear DFs. However, if too many parameters are used, the nonparametric model will encounter a computational problem. With too many variable parameters in the model, there may be many solutions to the MLE equations. Thus, the iterative process of the EM algorithm may reach a local minimum, and good performance of the model cannot be guaranteed. Thus, when many features are used, it is recommended that the features be partitioned into different groups by hierarchical clustering [49] and the nonparametric model be applied to each cluster. Other feature-space reduction methods such as principal component analysis (PCA) and partial least squares (PLS) can also be used [50].

The EM algorithm is guaranteed to converge [51]. However, if there are too many variables, it may reach a local minimum. For double-charged matches in the LCQ control dataset (here, we only used two variables: Xcorr and ΔCn), we also tried a Gaussian mixed model with 15 components (5 fold of the model we used). The values of the ML function calculated in the iterative process of the EM algorithm increased monotonically for the Gaussian mixed model with 3 components, whereas for the Gaussian mixed model with 15 components they initially increased and then decreased along the iterative step (Figure 6). The performance (χ² statistic; smaller = better) of the 15-mixed models demonstrated the same pattern. It was confirmed that too many variables (90 variables) do not lead to better performance. It is fortunate that the Gaussian model with 3 mixed functions fit the data satisfactorily. For the large dataset and the model with more features, the number of component functions did not exceed 7. If a more complex mixed model is needed, we recommend the following strategies: 1) optimize the ML function directly using more robust nonlinear optimization techniques such as the conjugate gradient and quasi-Newton methods [52]; 2) directly fit the histogram with an optimized binned method (such as Scott's rule [53]) using a RBF neural network; or 3) use another nonparametric model such as the adaptive kernel density estimation proposed by Silverman [54].

The computational burden of the nonparametric model may be doubted, especially for the huge LTQ dataset. It is lucky that it does not need so many observations to build the nonparametric model. If the dataset is too large, we can resample the observations and use fewer observations to build the model. We tried this approach on the LTQ complex dataset. The results achieved by the model built with randomly selected 30,000 observations differed little from that of the model built with all the 432,338 observations. Thus, in the model building procedure, if the number of the observations exceeds 30,000, we resample the dataset and randomly select 30,000 observations to build the model and if the number of the observations is less than 30,000, all the observations are used. Therefore, the consumed time of the model building was less than 2 min on a PC with Intel Pentium 4 2.8G CPU and 512 MB memory.

The nonparametric model proposed in this paper is easy to use. First, a combined database is prepared containing the normal and randomized protein sequence. Then database search is performed on the combined database and the results are collected; the normal and randomized database matches are labeled with the assistance of references provided by the database search software. The randomized database matches are then used to build the nonparametric model. In this step, a parameter set different from that described here can be used. To obtain the final results, a search for the DF described in the "Nonparametric model and filter boundary" section given an expected FPR is performed. The workflow shown in figure 7 (Methods section) has been implemented by several Matlab (MathWorks, Natick, MA) scripts and in-house C++ programs. The database search results were collected using an in-house program called OutSum.exe, which were stored in the *.out files given by SEQUEST. The resulting data, stored in a plain-text file, were loaded into a Matlab workspace. A script called NoParQ.m was used to build the nonparametric model. The programs used in this paper were provided in a compressed archive [see Additional file 2].

Conclusion

In this paper, we provide a framework for validation of peptide identification in shotgun proteomics that is based on the randomized database method and a nonparametric model. The practical problems in implementing the nonparametric model were investigated, and its performance was found to be better than that of traditional methods. The nonparametric model can provide a more flexible and accurate solution for DF determination for quality control of large datasets in shotgun proteomics research. All the programs used in this work are available by request from the authors.

Methods

Datasets and database search

Six datasets generated by three kinds of mass spectrometry platforms (LCQ, LTQ and LTQ/FT) were used to demonstrate the performance of the nonparametric model. Three control datasets were used to validate the accuracy of the FPR estimation and the improvement of the sensitivity. Since the MS/MS datasets generated by the shotgun technique are always large, we also verified the generality of the nonparametric model on the large real sample datasets. The basic information about the six datasets is listed in Table 6.

Table 6 The 6 datasets used in this paper.

Full size table

The two unpublished LTQ/FT datasets were provided by Beijing Proteome Research Center (BPRC). The samples were digested with trypsin and then analyzed by a 7-Tesla LTQ/FT mass spectrometer (Thermo Electron, San Jose, CA) coupled with an Agilent 1100 nano-flow liquid chromatography system. The reverse phase C18 trap columns (300 μ m internal diameter × 5 mm long column) were connected with the 6-port column-switching valve for the on-line desalting. A PicoFritTM tip column (BioBasic C18, 5 μ m particle size, 75 μ m internal diameter × 10 cm long column, 15 μ m internal diameter at spray tip, New Objective, Woburn, MA, USA) was used for the following separation. Elution was solvent A (Milli-Q water, 2 % acetonitrile and 0.1%FA, v/v/v) and solvent B (Milli-Q water, 80% acetonitrile and 0.1%FA, v/v/v). The gradient was 15–40% B in 40 min, 40–100% B in 10 min. One FT full MS scan was followed by 5 data-dependent LTQ MS/MS scans on the five most intense ions. The dynamical excluding time was 45 seconds. Ions were accumulated in linear ion trap controlled by AGC. The AGC values were 5 × 10⁵ charges for FT full MS scan and 1 × 10⁴ charges for LTQ MS/MS scan. The resolution was 10,000 for FT full MS scan at m/z 400. The temperature of the ion transfer tube was set at 200°C and the spray voltage was 1.8 KV. The isolation width was 4Da and normalized collision energy was 35% for MS/MS scan. Mass spectra were acquired over the m/z range from 400 to 2000.

All the MS/MS spectra were extracted from the *.raw files by Extract_MSn.exe which is a console program in Bioworks 3.2 (Thermo Finnigan, San Jose, CA). For the LCQ datasets, the minimal total ion intensity is 10,000. For the LTQ or LTQ/FT datasets, the total ion intensity of each MS/MS spectrum is required to exceed 100. For all the datasets, the spectra must have at least 20 ions. Then the database search was performed on a local TurboSEQUEST (version 2.7) server. The fixed modification of oxidation (15.99Da) on the Met residue and the variable modification of carboxyamidomethylation (57.02Da) on the Cys residue were set. The enzyme was trypsin and the maximal allowed missed cleavage sites was 2. Only the b and y ions were taken into account. For the LCQ or LTQ datasets, the precursor mass error tolerance was 3.0Da, and for the LFQ/FT datasets, it was 15ppm.

For all the datasets except D2, which was searched against the database published by sPRG [55], the searched databases were derived from IPI Human 3.19 [60]. For the control datasets, the control sequences for dataset D1 and D3 [see Additional file 3, 4 and 5] including the sequences of purified proteins or peptides plus the typical sample contaminants such as keratin and trypsin were added into the IPI Human 3.19. The control sequences for D2 were determined according to the report of sPRG (see Additional file 4) [55]. The databases were constructed using the method proposed in one of our previous paper [58] and could be described as: the protein sequences in the normal database were digested in silico (trypsin), and then the amino acid residues (AAR) (except the one on the C-terminal) of the resulting peptides were reshuffled by using a random number generator. Then the reshuffled peptides were spliced to form new protein sequences in the randomized database. Finally, the normal database and the randomized database were merged to form the searched database.

After database searching, the matches with +1, +2 and +3 charge state were extracted (Table 7). For each spectrum, only the first rank match with an assigned peptide with more than 5 AAR was taken into account for further analysis. For the control datasets, the matches which were assigned peptides of control sequences were validated by the following criteria: 1) the b-ion or y-ion series should confirm at least 3 consecutive amino acids of the assigned peptide sequence [12], 2) ranked preliminary score (RSp) ≤ 50. The confirmed matches of control datasets were provided in the supplementary materials [see Additional file 6, 7 and 8).

Table 7 Database search results of the 6 datasets

Full size table

The workflow of the nonparametric model based method

The workflow of the nonparametric model based method is shown in Figure 7. Firstly, a randomized database was constructed by randomizing the tryptic peptide sequence. Then the MS/MS spectra were searched against the combined database using SEQUEST. Then, matches with an assigned peptide from the randomized database (we call them randomized database matches, RDM) were used to build the nonparametric model. The joint distribution of selected parameters (such as Xcorr, ΔCn and Sim [31, 45]) of random matches was fit with the nonparametric model using the FnPDFe method and the contour lines of the estimated PDF, which are complex nonlinear functions, were used as candidate DFs. The actually used DFs were determined according to the expected FPR and formula 2 for different charge states. Finally, the resulting DFs were used to filter the matches from the normal database. In the model-building step, k-means clustering was used to initialize the EM algorithm procedure.

Initial the nonparametric model with k-means clustering

K-means clustering [59] is commonly used to partition observations into different groups according to defined distance (such as Euclidean distance). The optimization goal of k-means clustering is to find a partition in which objects within each cluster are as close as possible to each other and as far as possible from objects in other clusters. However, in practice, the scale of each feature will significantly affect the clustering results when Euclidean distance is used. In our application, Xcorr and ΔCn were two main features. Xcorr is a float point value whose typical value is 2.5 but may be larger than 10; ΔCn is in the range [0, 1]. When directly using the observed values in the k-means clustering, Xcorr will dominate the partition results (Figure 8) because the distance (formula 5) between two observations (Xcorr_i, ΔCn_i), i = 1, 2, is mainly determined by Xcorr, which has a larger scale.

d = \sqrt{{(X c o r r_{1} - X c o r r_{2})}^{2} + {(Δ C n_{1} - Δ C n_{2})}^{2}}

(5)

Thus, a normalization step, which calculated the z-score of the observed values of each feature, was used to eliminate the scale difference, and thus achieve a more reasonable partition (Figure 8).

Nonparametric model and the EM algorithm

The basic objective of nonparametric density estimation is to approximate the distribution of observations using the weighted sum of a series of simple functions, which does not emphasize the physical meaning of the parameters but the accuracy of the approximation. This idea can be implemented using smoothing splines or radial basis function (RBF) neural network to fit the histogram directly [41]. Another way to implement the nonparametric model is to fit the distribution with kernel density functions. The optimization goal of the nonparametric model is to minimize the mean integrated squared error of the fit or to maximize the maximum likelihood function of the observations. Many kinds of nonparametric models have been proposed by different researchers [41]. The FnPDFe procedure [42] is attractive because it is easy to implement and has a clear statistical explanation. Let X be a d dimension random vector X ∈ R^d. Its PDF can be approximated by a Gaussian mixed model that is defined as the linear combination of N multivariate Gaussian density functions (MGDFs):

f (X) = \sum_{i = 1}^{N} P (i) f_{G} (X | i)

(6)

where:

f_{G} (X | i) = \frac{1}{{(2 π)}^{d / 2} {| Σ_{i} |}^{1 / 2}} e^{- \frac{1}{2} {(X - μ_{i})}^{T} Σ_{i}^{- 1} (X - μ_{i})}

(7)

and P(i),i = 1,...N satisfies: (1) 0 <P(i) ≤ 1; (2) $\sum_{i = 1}^{N} P (i) = 1$ . μ_i, Σ_iis the mean vector and covariance matrix of the i-th MGDF.

Consider independent and identically distributed observations set{x₁, x₂,......x_n}; the log-likelihood function of the mixed model is:

L (θ) = \sum_{k = 1}^{n} \ln f (x_{k})

(8)

Generally, MLE can be used to infer the parameters θ in the mixed model. However, the resulting MLE equations cannot be solved analytically. The FnPDFe method uses the EM algorithm to provide iterative solutions for these parameters [43], which can be read as:

(1)
Initial step: Initialize the objective parameters μ_i, Σ_i, and P(i) with heuristic knowledge or random values.
(2)
E-step: update the posterior distributions:
$g^{t + 1} (i | x_{k}) = \frac{f_{G}^{t} (x_{k} | i) P^{t} (i)}{\sum_{j = 1}^{N} f_{G}^{t} (x_{k} | j) P^{t} (j)}$
(9)
(3)
M-step: estimate the current parameters:
$P^{t + 1} (i) = \frac{1}{n} \sum_{k = 1}^{n} g^{t} (i | x_{k})$
(10)

$μ_{i}^{t + 1} = \frac{\sum_{k = 1}^{n} g^{t} (i | x_{k}) x_{k}}{\sum_{k = 1}^{n} g^{t} (i | x_{k})}$
(11)

$Σ_{i}^{t + 1} = \frac{\sum_{k = 1}^{n} g^{t} (i | x_{k}) {(x_{k} - μ_{i}^{t})}^{T} (x_{k} - μ_{i}^{t})}{\sum_{k = 1}^{n} g^{t} (i | x_{k})}$
(12)

(4) Repeat steps 2–3 until the change of parameters is very little.

One problem with implementation of the EM algorithm is how to initialize the parameters. Use of an improper starting point may prolong the converging time of the EM algorithm or cause it to reach a local minimum. In this work, k-means clustering was used to partition the observations into subclasses, and the means and covariance matrixes of the component Gaussian distributions were initialized using the means and covariance matrixes of the subclasses.

Another difficulty in implementing the EM algorithm is the selection of the number of component density functions. Generally speaking, inclusion of more functions will approximate the distributions of the observations more accurately, while allowing more parameters to be determined. However, overly complex models may cause the EM algorithm to reach a local minimum and worsen the performance of the resulting model. In this work, a trial and error procedure was used to select the minimum number of component density functions: try numbers from 2 until the change of the likelihood function value is very little (such as less than 1%).

Abbreviations

MS/MS:: tandem mass spectrometry
DF:: discriminate function
FPR:: false positive rate
LCQ:: 3D quadrupole ion trap
LTQ:: linear trap quadrupole
FT:: Fourier transform
PDF:: probability density function
FnPDFe:: fully nonparametric probability density function estimate
MLE:: maximum likelihood estimate
EM:: expectation-maximization
MGDF:: multivariate Gaussian density function
RDM:: randomized database matches
IPI:: international protein index
MLE:: maximum likelihood estimate
EM:: expectation-maximization.

References

Pandey A, Mann M: Proteomics to study genes and genomes. Nature 2000, 405(6788):837–46. 10.1038/35015709
Article CAS PubMed Google Scholar
Patterson SD, Aebersold RH: Proteomics: the first decade and beyond. Nat Genet 2003, 33(Suppl):311–23. 10.1038/ng1106
Article CAS PubMed Google Scholar
Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422(6928):198–207. 10.1038/nature01511
Article CAS PubMed Google Scholar
Domon B, Aebersold R: Mass spectrometry and protein analysis. Science 2006, 312(5771):212–7. 10.1126/science.1124619
Article CAS PubMed Google Scholar
Nesvizhskii AI, Aebersold R: Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS. Drug Discov Today 2004, 9(4):173–81. 10.1016/S1359-6446(03)02978-7
Article CAS PubMed Google Scholar
Eng JK, McCormack AL, Yates JR 3rd: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 1994, 5(11):976–89. 10.1016/1044-0305(94)80016-2
Article CAS PubMed Google Scholar
Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20(18):3551–67. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2
Article CAS PubMed Google Scholar
Domon B, Aebersold R: Challenges and opportunities in proteomics data analysis. Mol Cell Proteomics 2006, 5(10):1921–6. 10.1074/mcp.R600012-MCP200
Article CAS PubMed Google Scholar
Sadygov RG, Cociorva D, Yates JR 3rd: Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book. Nat Methods 2004, 1(3):195–202. 10.1038/nmeth725
Article CAS PubMed Google Scholar
Chamrad D, Meyer HE: Valid data from large-scale proteomics studies. Nat Methods 2005, 2(9):667–75. 10.1038/nmeth0905-647
Article Google Scholar
Chen Y, Kwon SW, Kim SC, Zhao Y: Integrated Approach for Manual Evaluation of Peptides Identified by Searching Protein Sequence Databases with Tandem Mass Spectra. J Proteome Res 2005, 4(3):998–1005. 10.1021/pr049754t
Article CAS PubMed Google Scholar
Tabb DL, McDonald WH, Yates JR 3rd: DTASelect and Contrast: Tools for Assembling and Comparing Protein Identifications from Shotgun Proteomics. J Proteome Res 2002, 1(1):21–6. 10.1021/pr015504q
Article PubMed Central CAS PubMed Google Scholar
Sun W, Li F, Wang J, Zheng D, Gao Y: AMASS: Software for Automatically Validating the Quality of MS/MS Spectrum from SEQUEST Results. Mol Cell Proteomics 2004, 3(12):1194–1199. 10.1074/mcp.M400120-MCP200
Article CAS PubMed Google Scholar
Link AJ, Eng J, Schieltz DM, Carmack E, Mize GJ, Morris DR, Garvik BM, Yates JR 3rd: Direct analysis of protein complexes using mass spectrometry. Nat Biotechnol 1999, 17(7):676–82. 10.1038/10890
Article CAS PubMed Google Scholar
Keller A, Nesvizhskii AI, Kolker E, Aebersold R: Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search. Anal Chem 2002, 74(20):5383–5392. 10.1021/ac025747h
Article CAS PubMed Google Scholar
Lopez-Ferrer D, Martinez-Bartolome S, Villar M, Campillos M, Martin-Maroto F, Vazquez J: Statistical model for large-scale peptide identification in databases from tandem mass spectra using SEQUEST. Anal Chem 2004, 76(23):6853–6860. 10.1021/ac049305c
Article CAS PubMed Google Scholar
Eriksson J, Fenyo D: A model of random mass-matching and its use for automated significance testing in mass spectrometric proteome analysis. Proteomics 2002, 2(3):262–270. 10.1002/1615-9861(200203)2:3<262::AID-PROT262>3.0.CO;2-W
Article CAS PubMed Google Scholar
Sadygov RG, Yates JR 3rd: A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. Anal Chem 2003, 75(15):3792–3798. 10.1021/ac034157w
Article CAS PubMed Google Scholar
Sadygov RG, Liu H, Yates JR 3rd: Statistical Models for Protein Validation Using Tandem Mass Spectral Data and Protein Amino Acid Sequence Databases. Anal Chem 2004, 76(6):1664–1671. 10.1021/ac035112y
Article CAS PubMed Google Scholar
Moore RE, Young MK, Lee TD: Qscore: An Algorithm for Evaluating SEQUEST Database Search Results. J Am Soc Mass Spectrom 2002, 13(4):378–386. 10.1016/S1044-0305(02)00352-5
Article CAS PubMed Google Scholar
Baüczek T, Bucinski A, Ivanov Ar, Kaliszan R: Artificial Neural Network Analysis for Evaluation of Peptide MS/MS Spectra in Proteomics. Anal Chem 2004, 76(6):1726–1732. 10.1021/ac030297u
Article Google Scholar
Razumovskaya J, Olman V, Xu D, Uberbacher EC, VerBerkmoes NC, Hettich RL, Xu Y: A computational method for assessing peptide identification Reliability in tandem mass spectrometry analysis with SEQUEST. Proteomics 2004, 4(4):961–969. 10.1002/pmic.200300656
Article CAS PubMed Google Scholar
Anderson DC, Li W, Payan DG, Noble WS: A New Algorithm for the Evaluation of Shotgun Peptide Sequencing in Proteomics: Support Vector Machine Classification of Peptide MS/MS Spectra and SEQUEST Scores. J Proteome Res 2003, 2(2):137–146. 10.1021/pr0255654
Article CAS PubMed Google Scholar
Ulintz PJ, Zhu J, Qin ZS, Andrews PC: Improved classification of mass spectrometry database search results using newer machine learning approaches. Mol Cell Proteomics 2006, 5(3):497–509.
Article CAS PubMed Google Scholar
Qian WJ, Liu T, Monroe ME, Strittmatter EF, Jacobs JM, Kangas LJ, Petritis K, Camp DG 2nd, Smith RD: Probability-Based Evaluation of Peptide and Protein identifications from Tandem Mass Spectrometry and SEQUEST Analysis: The Human Proteome. J Proteome Res 2005, 4(1):53–62. 10.1021/pr0498638
Article CAS PubMed Google Scholar
Peng J, Elias JE, Thoreen CC, Licklider LJ, Gygi SP: Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J Proteome Res 2003, 2(1):43–50. 10.1021/pr025556v
Article CAS PubMed Google Scholar
Elias JE, Haas W, Faherty BK, Gygi SP: Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nat Methods 2005, 2(9):667–75. 10.1038/nmeth785
Article CAS PubMed Google Scholar
Higdon R, Hogan JM, Van Belle G, Kolker E: Randomized sequence databases for tandem mass spectrometry peptide and protein identification. OMICS 2005, 9(4):364–79. 10.1089/omi.2005.9.364
Article CAS PubMed Google Scholar
Li F, Sun W, Gao Y, Wang J: RScore: a peptide randomicity score for evaluating tandem mass spectra. Rapid Commun Mass Spectrom 2004, 18(14):1655–9. 10.1002/rcm.1535
Article CAS PubMed Google Scholar
Fenyo D, Beavis RC: A Method for Assessing the Statistical Significance of Mass Spectrometry-Based Protein Identifications Using General Scoring Schemes. Anal Chem 2003, 75(4):768–74. 10.1021/ac0258709
Article PubMed Google Scholar
Sun S, Meyer-Arendt K, Eichelberger B, Brown R, Yen CY, Old WM, Pierce K, Cios KJ, Ahn NG, Resing KA: Improved validation of peptide MS/MS assignments using spectral intensity prediction. Mol Cell Proteomics 2007, 6(1):1–17. 10.1074/mcp.M600449-MCP200
Article CAS PubMed Google Scholar
Strittmatter EF, Kangas LJ, Petritis K, Mottaz HM, Anderson GA, Shen Y, Jacobs JM, Camp DG 2nd, Smith RD: Application of peptide LC retention time information in a discriminant function for peptide identification by tandem mass spectrometry. J Proteome Res 2004, 3(4):760–9. 10.1021/pr049965y
Article CAS PubMed Google Scholar
Huttlin EL, Hegeman AD, Harms AC, Sussman MR: Prediction of Error Associated with False-Positive Rate Determination for Peptide Identification in Large-Scale Proteomics Experiments Using a Combined Reverse and Forward Peptide Sequence Database Strategy. J Proteome Res 2007, 6(1):392–398. 10.1021/pr0603194
Article PubMed Central CAS PubMed Google Scholar
Andersen JS, Lam YW, Leung AK, Ong SE, Lyon CE, Lamond AI, Mann M: Nucleolar proteome dynamics. Nature 2005, 433(7021):77–83. 10.1038/nature03207
Article CAS PubMed Google Scholar
Pilch B, Mann M: Large-scale and high-confidence proteomic analysis of human seminal plasma. Genome Biol 2006, 7(5):R40. 10.1186/gb-2006-7-5-r40
Article PubMed Central PubMed Google Scholar
de Godoy LM, Olsen JV, de Souza GA, Li G, Mortensen P, Mann M: Status of complete proteome analysis by mass spectrometry: SILAC labeled yeast as a model system. Genome Biol 2006, 7(6):R50. 10.1186/gb-2006-7-6-r50
Article PubMed Central PubMed Google Scholar
Adachi J, Kumar C, Zhang Y, Olsen JV, Mann M: The human urinary proteome contains more than 1500 proteins including a large proportion of membranes proteins. Genome Biol 2006, 7(9):R80. 10.1186/gb-2006-7-9-r80
Article PubMed Central PubMed Google Scholar
Beausoleil SA, Villen J, Gerber SA, Rush J, Gygi SP: A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat Biotechnol 2006, 24(10):1285–92. 10.1038/nbt1240
Article CAS PubMed Google Scholar
Everley PA, Bakalarski CE, Elias JE, Waghorne CG, Beausoleil SA, Gerber SA, Faherty BK, Zetter BR, Gygi SP: Enhanced analysis of metastatic prostate cancer using stable isotopes and high mass accuracy instrumentation. J Proteome Res 2006, 5(5):1224–31. 10.1021/pr0504891
Article CAS PubMed Google Scholar
Haas W, Faherty BK, Gerber SA, Elias JE, Beausoleil SA, Bakalarski CE, Li X, Villen J, Gygi SP: Optimization and use of peptide mass measurement accuracy in shotgun proteomics. Mol Cell Proteomics 2006, 5(7):1326–37. 10.1074/mcp.M500339-MCP200
Article CAS PubMed Google Scholar
Hwang JN, Lay SR, Lippman A: Nonparametric Multivariate Density Estimation: A Comparative Study. IEEE Transactions on Signal Processing 1994, 42(10):2795–2810. 10.1109/78.324744
Article Google Scholar
Archambeau C, Verleysen M: Fully nonparametric probability density function estimation with finite gaussian mixture models. 7th ICPAR Conf 2003, 81–84.
Google Scholar
Duda RichardO, Hart PeterE, Stork DavidG: Pattern Classification, Second Edition. John Wiley 2001, 10: 3–13.
Google Scholar
Resing KA, Meyer-Arendt K, Mendoza AM, Aveline-Wolf LD, Jonscher KR, Pierce KG, Old WM, Cheung HT, Russell S, Wattawa JL, Goehle GR, Knight RD, Ahn NG: Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics. Anal Chem 2004, 76(13):3556–68. 10.1021/ac035229m
Article CAS PubMed Google Scholar
Zhang Z: Prediction of Low-Energy Collision-Induced Dissociation Spectra of Peptides. Anal Chem 2004, 76(14):3908–3922. 10.1021/ac049951b
Article CAS PubMed Google Scholar
Purvine S, Picone AF, Kolker E: Standard mixtures for proteome studies. OMICS 2004, 8(1):79–92. 10.1089/153623104773547507
Article CAS PubMed Google Scholar
Yang X, Dondeti V, Dezube R, Maynard DM, Geer LY, Epstein J, Chen X, Markey SP, Kowalak JA: DBParser: Web-Based Software for Shotgun Proteomic Data Analyses. J Proteome Res 2004, 3(5):1002–08. 10.1021/pr049920x
Article CAS PubMed Google Scholar
Nesvizhskii AI, Roos FF, Grossmann J, Vogelzang M, Eddes JS, Gruissem W, Baginsky S, Aebersold R: Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data: toward more efficient identification of post-translational modifications, sequence polymorphisms, and novel peptides. Mol Cell Proteomics 2006, 5(4):652–70.
Article CAS PubMed Google Scholar
Handl J, Knowles J, Kell DB: Computational cluster validation in post-genomic data analysis. Bioinformatics 2005, 21(15):3201–12. 10.1093/bioinformatics/bti517
Article CAS PubMed Google Scholar
Geladi P, Kowlaski B: Partial least square regression: A tutorial. Analytica Chemica Acta 1986, 35: 1–17. 10.1016/0003-2670(86)80028-9
Article Google Scholar
Bilmes JA: A gentle tutorial of the EM algorithm and its applications to parameter estimation for gaussian mixture and hidden Markov models. International Computer Science Institute, Berkeley, California; 1998.
Google Scholar
Nash SG, Sofer : Linear and Nonlinear Programming. McGraw-Hill 1996.
Google Scholar
Scott DW: On optimal and data-based histograms. Biometrika 1979, 66: 605–610. 10.1093/biomet/66.3.605
Article Google Scholar
Silverman BW: Density estimation for statistics and data analysis. Chapman Hall: London; 1986.
Chapter Google Scholar
[http://www.abrf.org/index.cfm/group.show/ProteomicsStandardsResearchGroup.47.htm]
Chen M, Ying W, Song Y, Liu X, Yang B, Wu S, Jiang Y, Cai Y, He F, Qian X: Analysis of human liver proteome using replicate shotgun strategy. Proteomics 2007, 7(14):2479–88. 10.1002/pmic.200600338
Article CAS PubMed Google Scholar
Prince JT, Carlson MW, Wang R, Lu P, Marcotte EM: The need for a public proteomics repository. Nat Biotechnol 2004, 22(4):471–2. 10.1038/nbt0404-471
Article CAS PubMed Google Scholar
Zhang J, Li J, Xie H, Zhu Y, He F: A new strategy to filter out false positive identifications of peptides in SEQUEST database search results. Proteomics 2007, 7(22):4036–44. 10.1002/pmic.200600929
Article CAS PubMed Google Scholar
Modha D, Scott-Spangler W: Feature weighting in k-means clustering. Machine Learning 2003, 52(3):217–237. 10.1023/A:1024016609528
Article Google Scholar
[ftp://ftp.ebi.ac.uk/pub/databases/IPI/old/HUMAN/ipi.HUMAN.v3.19.fasta.gz]

Download references

Acknowledgements

We thank Dr.Songfeng Wu of the Beijing Proteome Research Centre for his thoughtful discussion. We also thank Master's candidate JieMa of the Beijing Proteome Research Centre for assistance with the database search. The LCQ control dataset was provided by the BIATECH institute and Dr.Zhongqi Zhang kindly provided the program MassAnalyzer, we thank them here. This work was funded by the Chinese Ministry of Science and Technology (2006AA02A312, 2006AA02Z334, 2006CB910803, 2006CB910700), the National Natural Science Foundation of China (30621063, 342123), and the Beijing Municipal Science and Technology Project (H030230280590), Chinese National Key Program of Basic Research (2006CB910700).

Author information

Authors and Affiliations

College of Mechanical & Electronic Engineering and Automatization, National University of Defense Technology, Changsha, 410073, China
Jiyang Zhang, Hongwei Xie & Fuchu He
State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing, 102206, China
Jiyang Zhang, Jianqi Li, Xin Liu, Yunping Zhu & Fuchu He

Authors

Jiyang Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Jianqi Li
View author publications
You can also search for this author in PubMed Google Scholar
Xin Liu
View author publications
You can also search for this author in PubMed Google Scholar
Hongwei Xie
View author publications
You can also search for this author in PubMed Google Scholar
Yunping Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Fuchu He
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Yunping Zhu or Fuchu He.

Additional information

Authors' contributions

JZ developed the program for data processing and wrote the main text of the paper. XL finished the experiment to analyze the samples on LTQ/FT platform. HX inspected all the algorithm problems and provided abundant suggestions for improving the implementation of the EM algorithm. YZ and FH reviewed the paper and revised its framework.

Electronic supplementary material

12859_2007_2014_MOESM1_ESM.RAR

Additional file 1: The parameters of the nonparametric models for different datasets. This file collected the parameters of the nonparametric models and filter criteria for different datasets. The file was compressed as RAR archive to reduce the size. (RAR 87 KB)

12859_2007_2014_MOESM2_ESM.RAR

Additional file 2: Program package. This file packaged all the programs used in this work, which include the Microsoft Windows executable EXE files and the Matlab script M files. A readme file is provided in this package to illustrate how to use these programs. (RAR 337 KB)

12859_2007_2014_MOESM3_ESM.RAR

Additional file 3: The control sequences of the LCQ control dataset. This file includes the control sequences for the LCQ control dataset, which include the sequences of control proteins and the common contaminants. The file was compressed as RAR archive to reduce the size. (RAR 22 KB)

12859_2007_2014_MOESM4_ESM.RAR

Additional file 4: The control sequences of the LTQ control dataset. This file includes the control sequences for the LTQ control dataset, which include the sequences of control proteins and the common contaminants. The file was compressed as RAR archive to reduce the size. (RAR 92 KB)

12859_2007_2014_MOESM5_ESM.RAR

Additional file 5: The control sequences of the LTQ/FT control dataset. This file includes the control sequences for the LTQ/FT control dataset, which include the sequences of control proteins and the common contaminants. (RAR 17 KB)

12859_2007_2014_MOESM6_ESM.RAR

Additional file 6: Validated matches in the LCQ control dataset. This file contains the validated correct matches for the LCQ control dataset. The file was compressed as RAR archive to reduce the size. (RAR 79 KB)

12859_2007_2014_MOESM7_ESM.RAR

Additional file 7: Validated matches in the LTQ control dataset. This file contains the validated correct matches for the LTQ control dataset. The file was compressed as RAR archive to reduce the size. (RAR 612 KB)

12859_2007_2014_MOESM8_ESM.RAR

Additional file 8: Validated matches in the LTQ/FT control dataset. This file contains the validated correct matches for the LTQ/FT control dataset. The file was compressed as RAR archive to reduce the size. (RAR 224 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Zhang, J., Li, J., Liu, X. et al. A nonparametric model for quality control of database search results in shotgun proteomics. BMC Bioinformatics 9, 29 (2008). https://doi.org/10.1186/1471-2105-9-29

Download citation

Received: 05 June 2007
Accepted: 21 January 2008
Published: 21 January 2008
DOI: https://doi.org/10.1186/1471-2105-9-29

A nonparametric model for quality control of database search results in shotgun proteomics

Abstract

Background

Results

Conclusion

Background

Results

Nonparametric model and the DF

Incorporating more features

The accuracy of the FPR estimation

Compare the performance of nonparametric model with other methods

Application to large datasets

Discussion

Conclusion

Methods

Datasets and database search

The workflow of the nonparametric model based method

Initial the nonparametric model with k-means clustering

Nonparametric model and the EM algorithm

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Additional information

Authors' contributions

Electronic supplementary material

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us