Metaprotein expression modeling for labelfree quantitative proteomics
 Joseph E Lucas†^{1}Email author,
 J Will Thompson†^{1},
 Laura G Dubois^{1},
 Jeanette McCarthy^{1},
 Hans Tillmann^{2},
 Alexander Thompson^{2},
 Norah Shire^{3},
 Ron Hendrickson^{3},
 Francisco Dieguez^{3},
 Phyllis Goldman^{3},
 Kathleen Schwarz^{4},
 Keyur Patel^{2},
 John McHutchison^{2} and
 M Arthur Moseley^{1}
https://doi.org/10.1186/147121051374
© Lucas et al.; licensee BioMed Central Ltd. 2012
Received: 12 December 2011
Accepted: 4 May 2012
Published: 4 May 2012
Abstract
Background
Labelfree quantitative proteomics holds a great deal of promise for the future study of both medicine and biology. However, the data generated is extremely intricate in its correlation structure, and its proper analysis is complex. There are issues with missing identifications. There are high levels of correlation between many, but not all, of the peptides derived from the same protein. Additionally, there may be systematic shifts in the sensitivity of the machine between experiments or even through time within the duration of a single experiment.
Results
We describe a hierarchical model for analyzing unbiased, labelfree proteomics data which utilizes the covariance of peptide expression across samples as well as MS/MSbased identifications to group peptides—a strategy we call metaprotein expression modeling. Our metaprotein model acknowledges the possibility of misidentifications, posttranslational modifications and systematic differences between samples due to changes in instrument sensitivity or differences in total protein concentration. In addition, our approach allows us to validate findings from unbiased, labelfree proteomics experiments with further unbiased, labelfree proteomics experiments. Finally, we demonstrate the clinical/translational utility of the model for building predictors capable of differentiating biological phenotypes as well as for validating those findings in the context of three novel cohorts of patients with Hepatitis C.
Conclusions
Massspectrometry proteomics is quickly becoming a powerful tool for studying biological and translational questions. Making use of all of the information contained in a particular set of data will be critical to the success of those endeavors. Our proposed model represents an advance in the ability of statistical models of proteomic data to identify and utilize correlation between features. This allows validation of predictors without translation to targeted assays in addition to informing the choice of targets when it is appropriate to generate those assays.
Keywords
Background
Mass spectrometry proteomics
The field of proteomics has made remarkable advances in analytical hardware and software which have provided increasingly sensitive and robust analyses on platforms capable of detecting low abundance proteins from complex mixtures, such as serum and cell lysates. The nanoscale liquid chromatography and mass spectrometry (LCMS) proteomic technology, which has become the stateoftheart for differential expression proteomic studies in most major laboratories around the world, typically uses a ‘bottomup’ approach, where the samples are subjected to a total proteolytic digestion, and the peptide ‘surrogates’ of the protein are quantified and identified using tandem mass spectrometry. There are two general approaches used for bottomup differential expression proteomics via LCMS, those based on the use of stable isotope labeling or tagging of the peptides, and the socalled labelfree methods (shotgun proteomics) [1–4]. Advances in both approaches have occurred in recent years such that currently both relative and absolute quantitation of proteins is possible from complex mixtures by either labeled or labelfree methodologies [5].
Although a specific advantage of the labeling approaches is the ability to heavily fractionate the samples to “dig deeper” into the proteome while maintaining quantitative capabilities, extensive fractionation of the sample is often impractical in the context of a clinical study with tens or even hundreds of samples. The proteomics community has seen a significant increase in the use of the label free approach due to increased instrument stability and software sophistication, and it is emerging as the method of choice for larger clinicallybased studies where use of the labeling strategies is impossible or impractical. In particular, an advantage of labelfree strategies which measure areaunderthecurve (AUC) of the LCMS peak is that any of a number of commercial or opensource software packages can be used to extract ion intensities from each individual analysis, and statistical analysis on the relative abundance of these ions can be performed even in the absence of a peptide identification. The ability to precisely and reproducibly quantify thousands of proteolytic peptides using the labelfree method was demonstrated by Wang, et al. and has been since employed and reproduced in a number of laboratories [4, 6].
Techniques for aggregating peptides into larger units generally revolve around protein identifications. A variety of approaches exist to combine individual peak areas to generate a relative or absolute aggregate expression levels. Once peptides are assigned to their parent proteins, using an algorithm such as ProteinProphet, either the peptide frequency of observation (ßpectral counting) or MS intensity is used to estimate protein abundance [7]. The spectral counting approaches have gained a fairly large degree of use in the community due to their ease of implementation, however they generally suffer from a limited dynamic range and they are insensitive to small changes in expression level due to the large number of species which have peptides observed only 1–3 times [5]. Labelfree AUC approaches generally overcome these limitations by locating a peak in the retentiontime and accuratemass matrix using sophisticated software, and extracting the area under the LCMS peak. An important characteristic of AUC labelfree studies is that they need to be performed on high resolution instruments for the best results, which limits the application of this approach to more expensive QToF, FTICR, or Orbitrap instruments.
Analysis of shotgun proteomics
Error in proteinlevel quantitation can first occur due to incorrect peptide identifications. Even at a relatively low peptide falsediscovery rate (i.e. 1%), the fraction of proteins detected that contain at least one false peptide identification is much higher because multiple peptides match back to the same protein. If a falsepositive peptide is included in the protein level quantitation it can cause increased error in the proteinlevel quantitation. This can be partially overcome utilizing only the highestquality or bestionizing peptides for protein quantitation, however in current implementations of “Top 3” quantitation, the individual peptide confidence is not utilized as an inclusion parameter [8, 9]. A second type of error in protein quantitation occurs when many homologues share a common peptide. In this situation the protein grouping algorithm, such as ProteinProphet, makes an informed decision about which parent sequence a peptide belongs to and typically associates all of the peptide intensity to that parent sequence. This can deliver erroneous protein quantitation results when multiple homologues are present. A final challenge is with protein isoforms, posttranslationally modified or proteolytically processed peptides, which may show a biologically relevant and different expression pattern than the proteotypic peptides. In these cases, they should not be grouped together with the other peptides for the purposes of modeling expression.
This paper describes a statistical model which is designed to allow the inclusion and modeling of correlation structure for the problem of differential expression in mass spectrometry proteomics. There are a number of different approaches designed for protein level quantitation. The simplest of these use direct summarization of all features/isotope groups/peptides that are identified for each protein, such as averaging or robust summarization based on quantiles [10], or averaging only the most abundant three peptides from a protein [8, 9]. In addition to these algorithms, there are ANOVA approaches for protein quantitation [11] and differential expression [12, 13]. These are regression models that variously include or exclude fixed effects for experimental group and random effects to handle repeated measurements of the same sample (technical replicates). All of these approaches rely on protein identifications and none make explicit use of correlations between isotope groups. While we do not consider the introduction of fixed effects for biological phenotypes or the introduction of random effects for cases in which we have replicate measurements from the same sample, the factor model we describe is a regression model. Therefore, one might introduce these features in a relatively straightforward way.

Allows for the subtraction of large scale correlation structure between proteins that likely arise from technical rather than biological variability (batch effects).

Appropriately models both identified and unidentified features of the LCMS output

Utilizes feature identifications from MS/MS spectra, but allows for the probability that some of those identifications will be incorrect

Produces a full posterior distribution on the model parameters, which leads to the quantification of uncertainty in the results.

Admits the possibility that sections of a protein will be posttranslationally modified and therefore may not be representative of the expression pattern of the protein as a whole.

Makes use of correlation structure across samples, which provides significant information about feature relationships that is unused in many other approaches.

Can be used in the creation of predictive models based on multiple proteins, rather than just the enumeration of proteins associated with a particular outcome.
We recognize that there are many excellent approaches to modeling labelfree proteomic data that share some of these properties, however our proposed model is unique in its ability to concurrently model all of them.
In addition to advances in statistical modeling of labelfree, unbiased proteomics data, this paper presents a novel preclinical predictor of response to therapy in patients with Hepatitis C. Finally, we demonstrate the validation of that predictor in two separate cohorts. First, we show that the approach generates a predictor that is reproducible between two different labs that are utilizing entirely different mass spectrometry technologies. Second, we show that the predictor is able to accurately differentiate clinical responders from nonresponders in a novel cohort of pediatric patients with Hepatitis C. In both cases, the predictor is based on unbiased, labelfree mass spectrometry data. We are aware of one previous example of the validation of shotgun proteomics findings with further shotgun proteomics experiments [14], in which a number of individual peptides were validated as markers of central nervous system lymphoma. Standard practice is to validate using a targetted platform such as selected reaction monitoring (SRM). To the best of our knowledge, ours is the first example of the validation of a full predictor. Finally, we verify some of the key elements of the predictor in our original cohort using SRM.
Results and discussion
Metaprotein factor model
In order to estimate metaprotein abundance, we build our model from preprocessed data with intensity estimates aggregated to the isotope group level. In our modeling approach, we allow the possibility that an isotope group will be incorrectly identified, or be correctly identified, but have a pattern of expression that is distinct from the bulk of peptides from the corresponding protein. In practice, this new grouping approach often leads to metaproteins which may be dominated by isotope groups from a particular protein, but which contain isotope groups from other proteins as well.
The P dimensional vector μ has elements μ_{ i } representing the mean expression of isotope group i and 1_{ N } is a column vector of ones. The N × Kdimensional matrix Λ represents latent factors which will be learned from the data and A is a P × Kdimensional matrix of factor loadings with elements a_{ i,k }. The random variable ϵ is a P × N matrix of idiosyncratic noise.
Our goal is to estimate relative protein concentration from this model using the latent factors in Λ. Recall that we have identifications for some subset of the isotope groups. With this in mind, suppose we identify each column of A and the corresponding column of Λ with one identified protein. If we set a_{ i,k } = 1 when isotope group i is from a peptide identified as coming from protein k and a_{ i,k } = 0 otherwise, then our model is describing the expression pattern of each isotope group as a noisy approximation of the expression pattern of the protein, where the protein is known.
This distribution represents our belief that these effects span all isotope groups, but with varying effect sizes. This prior also minimizes identifiability issues between B, which is not sparse, and A which is very sparse with informative priors.
Rather than assigning isotope groups to factors based purely on identification, we want to utilize a prior on A that allows for possible posttranslational modifications and misidentifications. With this in mind, we want to relax our strict assignment of zeros and ones in the loadings matrix A. Instead, our prior distribution for ${a}_{i,k}$ will reflect our level of certainty that we know which factor should represent the expression of this peptide. When we have an identification for peptide i and have mapped that peptide to protein k, our prior distribution will reflect an increased certainty that ${a}_{i,k}\ne 0$.
where ${\alpha}_{k}$ is substantially larger than ${\alpha}_{0}$ to reflect our prior belief that ${z}_{i}=k$. We default to ${\alpha}_{k}=500\cdot {\alpha}_{0}$, but have tried values from 100 through 1000 and these lead to only minor shifts in metaprotein membership. As the weight of this prior decreases we are decreasing the importance of identification information and placing progressively more importance on correlation structure. We find, for the Hepatitis application below, that the association of metaproteins with outcome doesn’t substantially change until we increase the weight of the identification data to very high levels. We find that using ${\alpha}_{0}=1$ leads to interpretable metaproteins without loss of association with the outcomes. For peptides which do not have identifications, we utilize an unbiased prior ${z}_{i}\sim Dir\left({\alpha}_{0}\right)$. Because different peptides showing similar expression patterns may, nonetheless, show a different magnitude of expression of that pattern due to the relative sensitivity of the mass spectrometer for the peptide, we model each of the nonzero elements of A independently, such that ${a}_{i,k}\sim N({m}_{a},{v}_{a})$ when ${z}_{i}=k$ and ${a}_{i,k}=0$ otherwise.
There is not a specific threshold for determining the grouping of isotope groups into metaproteins. Instead, the assignment of an isotope group to a particular metaprotein is a function of the variance associated with that isotope group, the number of isotope groups already assigned to each metaprotein and the level of agreement between the expression pattern of the isotope group and that of the metaprotein. All of these things are estimated using MCMC within the context of the model fitting.
We note that, in the limit as ${\alpha}_{k}\to \infty $ we obtain an ANOVA model in which there is no uncertainty about which metaprotein each isotope group belongs to. This is a fixed effects model with featurespecific variance similar to Clough et al. [11]. That limiting model implies that identifications are assumed to be accurate and assumes that posttranslational modifications are of minor importance. Clough et al. correctly point out that, by collecting features one obtains higher power for detecting associations versus simple tests of association with individual isotope groups. That model, as well as the one we present here, may be expanded with additional design vectors identifying experimental groups or particular interventions if so desired. These may be included as columns in H which are simply not updated in the MCMC (Markov chain Monte Carlo).
To complete the model specification, we assume a conjugate, row specific inverse gamma prior for the variance of ϵ. This is similar to the protein level aggregation model of [11], and allows differing variance estimates for each isotope group. Because we are working with a relatively large number of samples (and thereby have no issues with identifying variance), we use a prior with mean 1 and variance 100. We also assume that the individual columns of Λ arise from a uniform distribution on the Ndimensional sphere of radius $\sqrt{N}$. The model is fit via MCMC and the result of this fit is a set of draws from the posterior distribution of all of the model parameters. All prior distributions are conjugate, and therefore we may use Gibbs sampling to update the model parameters at each step of the MCMC. The data sets we are modeling have been aggregated at the isotope group level, and as such they have between 20 and 40 thousand measurements per sample. While our sampling scheme is able to fit this data in just a few hours on a desktop, we expect that some sort of parallel processing will be desirable for data that is aggregated at the feature level. We have tested our model on multiple simulated data sets of various sizes (both sample size and number of isotope groups) to verify the accuracy of the parameter recovery even in the presence of intentionally mislabeled isotope groups.
Overlapping peaks
This approach allows for the restriction on isotope group association with just one metaprotein to be relaxed. However, as the resolution of mass spectrometry increases, and as fractionation in multiple dimensions (such as 2D chromatography and ionmobility) makes the distinction between polypeptides clearer, this modification to the model will become less and less important. Further, experience suggests that the vast majority of measured peaks are single species. Because of this, the addition of features to deal with overlapping peaks can introduce more noise than it removes, particularly when the number of samples in the experiment (and therefore the amount of information available from the correlation structure) is limited.
Features of the factor model
One of the strengths of our approach is the ability to collect isotope groups based not only on identifications, but also on their coexpression across samples. Perhaps the most common method for visualizing correlation structure in highdimensional data is hierarchical clustering. However, this is most typically used as a visualization strategy and does not, by default, provide quantitative estimates of aggregate behavior. While it is possible to generate models based on what is visualized from hierarchical clustering, nearest centroids for example, these have not to our knowledge been published for proteomic data. In addition, there are questions surrounding how one might combine peptide identifications and correlation structure in a principaled way to jointly model all of the available information.
One can identify collections of coexpressed isotope groups as well as an approximation of the expression patterns of each group from our model based on posterior distributions of the model parameters. The posterior parameters of greatest interest will depend on the specific application, but often we will be most interested in the vector of factor memberships, z, which describes which peptides group together most often. In data sets intended for the generation of predictive models in clinical/translational studies (as well as other types of studies), we will be interested in Λ. The columns of this matrix define our estimates of the expression patterns of the metaproteins across our samples. These can be used to estimate fold change, or can be treated as independent variables in any type of model that is appropriate for the study.
Inconsistent expression of peptides from the same protein
We define a “dominant metaprotein” for a protein to be the metaprotein(s) with more than half of its identified isotope groups contributed by that protein. We note that, because there are many metaproteins it is possible for a dominant metaprotein to consist largely (or even entirely) of peptides from one protein, but not contain all (or even a majority) of the isotope groups from that protein. One of the features we have observed from studying posterior parameters from our model is that there are manyexamples in which an identified isotope group (one with a peptide and protein label) does not follow the expression pattern of its corresponding dominant metaprotein. That is to say, for any given protein there is often a “consensus” expression pattern that many of the isotope groups from that protein follow, but that there is also a large minority of isotope groups which do not follow that expression pattern. We fit our model to all 109 proteins which have more than 1 identified peptide in the data set. Heatmaps similar to Figure 1 for each of these proteins are available in Additional file 1. Examination of these figures shows that the presence of peptides that show expression patterns significantly different from their corresponding dominant metaprotein is the rule, rather than the exception.
There are a few reasonable explanations for this. The most obvious possible explanation is that the poorly conforming peptides are those with the lowest overall intensity, and therefore subject to smaller signal to noise ratios. However, examination of heatmaps showing the exact same peptides, but now sorted by mean intensity across the samples rather than by metaprotein membership demonstrates that there is not a strong predominance of low intensity peptides among those that do not coexpress with the other peptides from the protein. All 109 of those heatmaps are available in the Additional file 2. We tested, using a nonparametric KruskalWallis test, for association between metaprotein membership and mean signal intensity for each ofthe metaproteins, and found that, of the 109 proteins tested, only 2 showed significant association (pvalue < 0.01, APOB and CERU).
Another possible explanation for the presence of poorly coexpressing peptides within a single protein is misalignment between runblocks. The data set was analyzed in three runblocks, one of which occurred months after the original two, and aligned according to the algorithm described in the methods section. There are sometimes shifts in retention time between runblocks which may lead to misalignment, however, we expect misalignment to be a rare occurrence. Additionally, if peptides are misaligned, we would expect to see a peptide that coexpresses with its dominant metaprotein well in two run blocks but is mismatched in the third, and this is not generally the case. We are able to find some examples that fit this pattern, however, it is almost always the case that when a peptide does not share an expression pattern with its dominant metaprotein one runblock, it also does not share that pattern in the other run blocks.
A third explanation for peptides that are uncorrelated is misidentification. However, we are using identification algorithms with parameter settings that lead to very low (approximately 1%) false identifications. We expect around 34 such misidentifications, assuming that identification is correct 99% of the time (based on the 3398 total peptides with identifications). In fact, examining the list of peptides that do not belong to their dominant metaprotein, we see that more than half fall into this category (1640 out of 3398). This is true despite our prior distribution assigning a 500x greater likelihood of a peptide belonging to its dominant metaprotein than to a different metaprotein.
Posttranslational modification of proteins is a well known process, but it is unclear how extensive these modifications are. If proteins are extensively and dynamically modified after translation, then we should expect many of them to exhibit expression patterns that do not match the bulk of peptides from that same protein. Also, if peptide modification is a significant contributor to observed patterns of expression, then we also expect to find peptides that have targets for posttranslational modification to be more likely to be found outside their dominant metaprotein. We examined the probability of peptides containing Glutamine and Asparagine, which are known sites of deamidation, to belong to their dominant metaproteins. Correcting for the number of peptides inside and outside their dominant metaproteins, we find that peptides containing Glutamine are approximately 1.2 times more likely to not follow the dominant expression pattern for any given protein, and that this is a statistically significant difference (pvalue .0013, fisher’s exact test). In addition, peptides with Asparagine are 1.22 times more likely to fail to coexpress with the dominant group of peptides from a protein (pvalue 0.0010). In addition to these two sites ofposttranslational modification, we examined the motif NxS/T, which is a known site of Nlinked glycosylation. For these two motifs, 25 of the 30 peptides which contain the “NxT” motif and 53 of the 60 peptides which contain the “NxS” motif follow expression patterns that are different from their dominant metaproteins (odds ratios 4.3 and 6.6 respectively, pvalues 0.0013 and 2.1e8 respectively). Additionally, both Serine and Threonine are known to be sites of Olinked glycosylationas well as phosphorylation. We find that both Threonine and Serine are also more likely to show odd expression patterns (Threonine: odds ratio 1.2, pvalue .002 and Serine odds ratio 1.3, pvalue 8.1e6). We tested Proline (odds ratio = 1, p = .96) and Histidine (odds ratio = 1.1, p = .40) as negative controls.
We note that we are not directly detecting the posttranslationally modified peptides. Instead, we suspect that we are detecting changes in the expression levels of the unmodified peptides due to post translational modification. Thus, peptides with any of these posttranslational modification motifs are significantly less likely to follow the dominant expression pattern for the protein from which they are derived. It is possible that posttranslational modification is a pervasive feature of plasma proteomics, and that protein level quantitation is likely to either introduce errors by summing across uncorrelated parts of a protein (if quantitation is accomplished through summation across all associated peptides) or to miss critical posttranslational modifications (if quantitation is accomplished by summation of only the top three peptides). It is important to note, however, that inconsistent expression of peptides from the same protein is likely due to multiple different causes, including protein isoforms, increased noise in low abundance peptides, misalignment, misidentification, peptide modification, and sample to sample variation in ion suppression from variable coelution in the LC separation. Regardless of the root cause, the statistical model we propose will identify isotope groups that show this characteristic and treat them appropriately.
Analysis of spikein data: Relabeling of misidentified peptides
We obtained a publically available spikein data set from [19] which consists of three replicate measurements of each of six conditions. The base solution is solidphase Nglycocapture from human serum, and each of six different nonhuman proteins were spiked in at each of six different concentrations in a latin square design. After fitting our model to this data, we find that there is strong evidence for a significant increase in the total number of isotope groups that could be identified as coming from the known six spiked in proteins (Additional file 3). Estimated foldchange for each protein was also calculated for each experimental group and each spiked in protein. We find that, while the estimation works well for the highest four groups for each spiked in protein, the lowest two groups are typically underestimated due to the significant level of missing data.
In order to test the ability of our model to identify and correct inaccurate identifications, we randomly permuted 10% of the identifications in this data and refit our model. This experiment was repeated 100 times. Additional file 3 shows the results of this experiment for the ‘MYG_HORSE’ metaprotein for one such permutation. Note that there are 3 isotope groups that are correctly assigned to the MYG_HORSE metaprotein even though the original identification would have placed them in other proteins. This experiment was repeated 50 times, and of the 295 total reassigned identifications across those experiments, 223 were correctly reassigned to their original protein. Of the remainder, the peptide was fit to a noise factor 61 times and fit to some other protein 11 times. None were assigned to the protein associated with the incorrect relabel.
Comparison with protein level quantitation
While there are similarities between our metaprotein model and various techniques for protein level quantitation, the ability to group peptides based on coexpression across all samples, and therefore identify peptides that show evidence of posttranslational modification is a critical difference that is shared by no protein level quantitation method that we are aware of. Nonetheless, it is interesting to compare our model to protein level quantitation algorithms. For this purpose, we will examine two such algorithms. The summation algorithm estimates protein level quantitation by summing total expression across all peptides from a protein. This algorithm automatically gives peptides with high intensity measurements a larger effect on the estimated protein level quantitation. A second algorithm, Top 3, estimates the protein level expression as the mean of the three peptides with the highest intensity (average across the samples). These two algorithms typically give similar results, however, because peptides from the same protein do not always show consistent expression patterns, these approaches can lead to protein level quantitation that is unnecessarily noisy.
In general, we find that the correlation between our metaprotein model and protein level quantitation for estimation is high when there are a large number of peptides from the given protein. For example, one of the most abundant proteins in this data set is Apolipoprotein B (represented by 409 isotope groups), and correlation between estimated expression from our factor model and from summation of all Apo B identified peptides is 0.97. However, when there are fewer peptides available or if there are many misidentifications or modifications, we find evidence that our factor model gives improved estimation of protein level expression patterns. For example, pregnancy zone protein, which has only three associated isotope groups in the data set, is known to be over expressed in women compared to men. While both the metaprotein model and the summation algorithm show differential expression for this protein, the factor model gives a pvalue of 4.5e4 (statistically significant even after Bonferroni correction for multiple hypotheses) as compared to a pvalue of .0014 for estimation by summation over identified peptides (not significant after multiple hypothesis correction).
However, examination of the top three isotope groups from Apo E (Figure 1) shows a different picture. The second most abundant isotope group from Apo E is in a different metaprotein because it shows a substantially different expression pattern from the bulk of the Apo E isotope groups. In addition, if we delete the two outliers from the data (they are outliers by all three quantitation methods), the correlations between the three top Apo E isotope groups and the antibody assay of Apo E activity are.59, .23 and .56 respectively (pvalues of .0002, .17 and .0004 respectively). Thus, this second most abundant isotope group should be adding noise to the Top 3 protein level estimate of Apo E. Interestingly, the correlation between the Top 3 estimate and the antibody assay is .60. This is higher than any of the three separately, which suggests that the antibody assay is in fact measuring an aggregation of two different forms of Apo E.
Metaprotein expression in a hepatitis C cohort
In addition to the analysis of a publically available latin square data set (Additional file 3), we obtained pretreatment serum samples from 96 patients with Hepatitis C who have a known response or nonresponse to the standard of care treatment with interferon and Ribavirin [20]. Serum from the patients was measured with open platform LCMS/MS. The overall goal of the study was to predict who among the study subjects will respond to therapy and who will not. We are also interested in estimating which proteins and peptides are potential markers of response, allowing for future, targeted assay development.
Analysis of this data set proceeds in two steps. First, the model described above is fit to the proteomic data without regard to the phenotype of interest. There were a total of 6,729 peptides in the data set with either positive identifications or with average expression levels greater than the mean. Of these 3390 had identifications. These were matched with 265 different proteins of which 109 had two or more associated, identified peptides.
Prediction of outcome
As with the computation of protein level expression, our 109 metaproteins may be used in any context as independent predictor variables. Additionally, we may assess the level of association between metaproteins and other biological phenotypes in either case.
In addition to strong associations for three metaproteins, we find that there are a total of 13 metaproteins with pvalues less than .01 (random association would dictate only 1). Thus there is clear evidence of the presence of bloodborne markers of response to therapy in Hepatitis C.
Identification of candidate peptides
We would like to identify a set of candidate peptides for use in future targeted studies such as selected reaction monitoring (SRM) or antibody studies. From the analysis of associations between averaged metaproteins and the phenotype, we are confident that there are markers of interest. In order to identify the most relevant isotope groups, we propose to obtain draws from the MCMC chain, and for each draw build a predictor and observe which peptides are included in that predictor. In this way, the values of Λ are computed directly from the peptides included in the corresponding metaprotein. Additionally, by keeping track of which peptides are most often included in the predictors, we obtain a list of candidate biomarkers for future study.
Because we have 109 metaproteins but only 87 samples, direct regression is not possible in this context. We instead use variable selection with model averaging. Variable selection allows regression with a small subset of the total number of predictors, while model averaging allows us to properly account for uncertainty in which models are correct. In this context, model averaging has been shown to outperform the single best model for predictive accuracy on hold out data sets [21]. We use a publicly available implementation of variable selection and model averaging called Shotgun Stochastic Search [22].
Validation of a predictor discovered in an unbiased, labelfree data set
There are two significant challenges facing users of labelfree, unbiased, mass spectrometry proteomics. First, there are a number of different mass spectrometry machines, and some utilize different physical principals in order to measure masstocharge ratio. Differences in these machines and differences in protocols between laboratories can make reproducing the same results at different labs difficult—even when the samples used are exactly the same. Second, validation of findings in new samples can be difficult, particularly when the validation approach involves the use of different sample preparation or different techniques for measuring the peptides or proteins. It is quite common to see two phase experimental designs in which candidate peptides/proteins are identified by shotgun mass spectrometry then those candidates are validated on a much larger group of samples, but using antibody assays or SRM mass spectrometry. These approaches introduce additional variables in the sample preparation or measurement, and in cases where validation fails the explanation for that failure becomes uncertain—it may have failed because the peptides/proteins are not good candidates or because of the changes in the way they were measured.
We validated our results on the predictor of response to therapy by both testing the consistency of the predictor when measured in different labs and also when testing the accuracy of the predictor in a new set of samples. In each case, we are attempting to validate our predictor through the use of an additional unbiased, labelfree data set. Because of difficulties with alignment between data sets, this approach is challenging and is usually not taken. The approach to this type of validation undertaken successfully by [14] involved the validation of multiple individual makers of disease. That approach produces a set of putative biomarkers, some of which validate in the followup data and some of which do not. It leads to a more highly filtered, and therefore more likely to succeed, list of biomarkers, but it does not lead to a validated predictor. Validation of a such a predictor must take into account uncertainties associated with alignment and false discovery. The advantage of our approach for this purpose is in the use of factors as predictors rather than individual isotope groups. By aggregating multiple isotope groups and using the aggregated expression pattern, the metaprotein model produces predictions that are robust to misalignments or false discovery associated with individual analytes.
Verification of differential expression for individual peptides
Conclusions
To date, the algorithms and models for aggregating mass spectrometry features in larger groups than peptides have relied on protein identifications and do not use correlations across samples in any way. The model we have described makes use of these correlations to identify peptides that do indeed show consistent expression in addition to agreement in identification. This paper describes both a novel statistical approach for the analysis of labelfree, unbiased mass spectrometry proteomics and an approach for the validation of discoveries on that platform by further experiments on the same platform. Furthermore, we have demonstrated the application of this approach in the context of the treatment of Hepatitis C, which is a disease causing significant morbidity and mortality worldwide. The predictor of response to therapy described in this paper has the potential to significantly affect therapeutic decision making.
The model offers an approach to aggregation of mass spectrometry proteomics data that differs from protein level quantitation, and that in some situations, offers significant advantages over previous approaches. The critical difference between protein level quantitation and our metaprotein model is the inclusion of correlation structure in the metaprotein model. We should note that the inclusion of correlation structure may offer only minimal advantage low sample size situations. Additionally, in cases where there is an overwhelming biological effect, that effect can dominate the observed correlations and drown out what would otherwise be differing expression patterns. However, we have demonstrated the utility of the approach for validation studies and for potential clinical applications.
Our model is particularly well suited to situations in which we have a relatively large number of samples (>30) with a high degree of biological variability. We have shown that it can be used to gain insights into translational studies, and have exemplified its use with a study of response to therapy in patients with Hepatitis C. The model is appropriate for any highquality quantitative (areaunderthecurve) proteomics data. This is not limited to dataindependent acquisitions, rather the approach can be utilized on any labelfree quantitative data which is based on precursor intensity for the quantitation and collects enough points across the peak to accurately define the peak area.
Methods
Sample selection
Demographics of the training data
Responders  Nonresponders  

Gender  13 female, 24 male  4 female, 14 male 
Race  50 caucasian, 11 AA  23 caucasian, 8 AA 
Logviral load  14 ± 3.7  15.6 ± 1.6 
Metavir  1.6 ± 1.7  1.7 ± 2.3 
Sample preparation, instrument operation, and data preparation
The plasma sample preparation by immunodepletion and trypsin digestion, as well as the unbiased LCMSE data collection on a nano Acquity and QToF Premier mass spectrometer (Waters Corporation), has been described in detail previously. [20]. Unbiased proteomic data analysis using labelfree areaunderthecurve quantitation was performed in Rosetta Elucidator, utilizing both Mascot (Matrix Sciences, Inc) and PLGS v2.4 (Waters Corporation) and exported for statistical analysis, also as previously described [20]. Peptide annotation for this dataset was performed at a 1 peptide FDR using decoy database validation. To enable external analysis of the quantitative data using alternative methods, we have made the two unbiased quantitative data sets available (see Additional files 4 and 5). LCSRM analyses were performed using a subset of the same samples, which were stored at −80 C between the initial unbiased analyses and the targeted analysis. A scheduled SRM method was generated in Skyline v0.7 (https://brendanxuw1.gs.washington.edu/labkey/project/home/software/Skyline/begin.view) for 82 of the most promising biomarker peptides, and 5 peptides from yeast alcohol dehydrogenase (ADH1_YEAST) were also included as internal standards. The method included up to 3 transitions per precursor ion, and the MS method details as well as quantitative data have been included in Additional file 4. Analyses were performed on a nanoAcquity UPLC system coupled to a Xevo TQ mass spectrometer (Waters Corporation), using a method identical to that described for the unbiased data, with the following exceptions. The gradient length was 30 min, the column utilized was a 75 μm x 150 mm BEH, the flow rate was 0.4 μL/min and column temperature was 35°C. Raw data was imported into Skyline, quantified using a labelfree approach, and exported for statistical analysis.
Preparation of data for analysis
To accomplish data alignment and feature quantitation across all biological samples and thus form the matrix discussed in the statistical methods section below, we utilized Rosetta Elucidator™v3.3 software package (Rosetta Biosoftware) to import and align all MSE and datadependent acquisition (DDA) raw data files [26–30]. Database searches were performed against a forward/reverse Swissprot database (v 56.5) with human taxonomy, using ProteinLynx Global Server v2.4 (IdentityE algorithm, Waters Corporation) for MSE searches or Mascot v2.2 for DDA data. Database searches are either performed externally and results imported (PLGS 2.4) or queued directly from within Elucidator (Mascot) to allow identification of many of the quantified features in the proteomic dataset. All database searches were performed with high mass accuracy on precursor and product ions (typically 20 ppm precursor and 0.04 Da product ion tolerance), with fixed carbamidomethylation(Cys), variable oxidation(Met) and variable deamidation(Asn and Gln). Annotation of the peptides is accomplished at an estimated 1% FDR using the Elucidator implementation of PeptideProphet algorithm [31]. Visual scripting within Elucidator is utilized to extract feature intensities for those features which have quantitative values above the 1000 counts (approximately 10th percentile) in 50% of the samples. The final file for statistical analysis is made up of a matrix of intensities, with the rows corresponding to isotope groups and the columns to technical observations (LCMS analysis). An isotope group is defined as all of the peaks associated with a single peptide at a specific charge state and retention time. This level of quantitation combines peaks from the same peptide that differ according to the number of carbon 13’s incorporated, but does not combine the same peptide measured at different charge states. The intensity of an isotope group for a given sample is the total volume under the feature peaks associated with that isotope group. This is monotonically related to the concentration of that isotope group in the original sample, and it is these intensities that we work with. We have made the matrices of intensity values associated with the study available in Additional file 5.
Technical variation
This data set was collected in three run blocks. Two of these were consecutive and the third was run months later. There are significant batch effects present in comparing the first two to the third even after correcting for observed total protein. Even though we did not include explicit design vectors for batch in our regression, our inclusion of a factor matrix describing systematic effects (β and H) allows this source of technical variation to be automatically modeled without foreknowledge. We find that the first row of H perfectly distinguishes runblocks 1 and 2 from runblock 3, with values between .95 and 1.05 for the former and between − .95 and −1.05 for the latter. This ability to bridge between two different data sets supports our claim that this model is able to identify and subtract out some variation in sensitivity due to batch effects.
Alignment of data sets
The data from a single experiment consists of a list of features along with their associated masstocharge ratios and retention times. Because there is some level of randomness to all of these measurements, there is some uncertainty in the identification of which feature from one experiment should be associated with a given feature in another experiment. The process of matching features across experiments is termed data alignment. For matching features within a single experiment, we utilize Rosetta Elucidator™, which is a commercial package for the processing and analysis of proteomics data. However, there were sufficient differences between the different experiments (which were run months apart, at different labs and on different machines) to make the Rosetta algorithm inadequate for the task of alignment across datasets. For data alignment across the different batches, we utilized the following construction.
Let i be an index over the set of all peptides measurable in our experiment. Further, define ${\gamma}_{i}$ to be 1 if the i th peptide was measured in the experiment. Let ${x}_{i}$ be a vector containing the “true” retention time and masstocharge ratio associated with the i th peptide. Then, if ${\gamma}_{i}=1$, we assume our measured values, ${x}_{i}^{*}$ are normally distributed around ${x}_{i}$ with some shift and scale along with an unknown covariance, ${x}_{i}^{*}\sim N(\mu +\delta {x}_{i},{\Phi}^{1})$. There is a small subset of isotope groups that have been identified in all data sets. We initialize all of our parameters to maximize the likelihood of this small subset, then use a greedy algorithm to select matches for the remainder of the isotope groups. The algorithm stops assigning matches based on the prior probability that ${\gamma}_{i}=1$ (we have used .5).
There is substantial information available that is not being used in this algorithm. First, it would be possible to assign prior distributions to the model parameters and iteratively fit this model. This would lead to better estimates of model parameters allong with full posterior distributions. However, because the distribution is extremely spiky, the estimation of uncertainty from this algorithm is somewhat uninteresting and uninformative. Second, there is information available in the high energy mass spectrometry trace even when that trace is insufficient to fully identify the peptide which one might include in our model as additional dimensions to ${x}_{i}$. Third, we are not making use of the intensity of the measured isotope groups across the samples. This allows for the possibility that there may be drastic changes in peptide concentrations between the two experiments. Even so, it is likely possible to obtain and use reasonable and informative distributions on these intensities for the purposes of alignment. However, because all of our results are based on factors, which are the aggregate expression of multiple isotope groups, a low but nonzero level of inaccurate alignments may lead to a mild increase in noise, but not a drastic change in our overall results.
Full model specification

P is the number of isotope groups measured

N is the number of samples in the study

X is a P × N dimensional array of intensity values with elements ${x}_{i,j}$

ϵ is a P × N matrix of normally distributed noise (residuals). We assume that each isotope group, i, has its own noise variance, ${\tau}_{i}^{1}$

μ is a Pdimensional vector representing mean intensity for each isotope group. The elements of this vector are μ_{ i }.

B is a P × D matrix of loadings and H is a N × D matrix of factors. These are intended to describe patterns of expression that span most or all isotope groups (as reflected by the priors). The elements of extitB are ${b}_{i,d}$ and the elements of H are ${h}_{n,d}$.

A is a P × K matrix of loadings and Λ is a N × K matrix of factors. K is the number of metaproteins. These factors are distinguished from the previous ones by their prior structure. Specifically, A will contain only P nonzero elements, and the nonzero elements for any particular factor should consist of a large number of isotope groups from the same protein. This contrasts with B which contains no nonzero elements and describes expression patterns across the entire set of peptides. The elements of A and Λ are ${a}_{i,k}$ and ${\lambda}_{n,k}$ respectively.

z is a Pdimensional vector (with elements z_{ i }) containing, for each isotope group, the index of the metaprotein to which that isotope group has been assigned.

${\mu}_{0}$, ${\mathit{\varphi}}_{0}$, ${v}_{0}$, ${\nu}_{0}$, ${\delta}_{0}$, ${a}_{0}$ and ${a}_{k}$ are constants set at the time of model fitting.
All of the constants are set in order to minimize the influence of the prior distribution. Specifically, we use ${\mu}_{0}=8$ and ${\mathit{\varphi}}_{0}=100$ corresponding to a 95% confidence interval on the logtransformed intensities between −12 and 28 (significantly overdispersed relative to the empirical distribution). Similarly, we set ${v}_{0}=100$ which allows potential fold changes $>{10}^{8}$. We set ${\nu}_{0}$ and ${\delta}_{0}$ both to .001 corresponding to a mean estimated residual of 1, but with a variance of 1000. We note that estimation of variance can be difficult, however, we have >100 samples in our study. For smaller studies, one might use more informative priors for variance. Both ${a}_{0}$ and ${a}_{k}$ are pseudocounts for our Dirichlet distribution. We use ${a}_{0}=.01$ and ${a}_{k}=500$. We note that this last parameter is the only one in the model that is potentially strongly informative, so we have tried values from 1 to 1000. We considered the percentage of peptides that are members of their dominant metaprotein. We find that, in our study population the affect of varying this parameter is to move that percentage between 43% and 53%. The relatively minor effect of this large change in pseudocounts is probably due, again, to our large sample size. This leads to very high data likelihoods for membership in particular metaproteins for almost all isotope groups. This parameter will have a larger effect in smaller studies.
Notes
Declarations
Acknowledgements
Supported in part by Duke University’s CTSA grant 1 UL1 RR02412801 from NCRR/NIH. Supported in part by a gift from David H. Murdock. We gratefully acknowledge Waters Corporation and Rosetta Biosoftware, Inc for hardware and software support for the data presented in this manuscript. In addition, we would like to acknowledge the PEDS C Clinical Research Network, the NIDDK grant U01DK067767 and Roche Pharmaceuticals, Inc for the collection of the pediatric HCV samples.
Authors’ Affiliations
References
 Ong SE, Blagoev B, Kratchmarova I, Kristensen D, Steen H, Pandey A, Mann M: Stable isotope labeling by amino acids in cell culture, SILAC, as a simple accurate approach to expression proteomics. Mol Cell Proteomics 2002, 1(5):376–386. 10.1074/mcp.M200025MCP200View ArticlePubMedGoogle Scholar
 Wiese S, Reidegeld K, Meyer H, Warscheid B: Protein labeling by iTRAQ: a new tool for quantitative mass spectrometry in proteom research. Proteomics 2007, 7(3):340–350. 10.1002/pmic.200600422View ArticlePubMedGoogle Scholar
 Gygi S, Rist B, Gerber S, Turecek F, Gelb M, Aebersold R: Quantitative analysis of comples protein mixtures using isotopecoded affinity tags. Nat Biotechnol 1999, 17: 994–999. 10.1038/13690View ArticlePubMedGoogle Scholar
 Wang W, Zhou H, Lin H, Roy S, Shaler T, Hill L, Norton S, Kumar P, Anderle M, Becker C: Auantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards. Anal Chem 2003, 75(18):4818–4826. 10.1021/ac026468xView ArticlePubMedGoogle Scholar
 Kito K, Ito T: Mass spectrometrybased paproaches toward absolute quantitave proteomics. Curr Genomics 2008, 9(4):263–274. 10.2174/138920208784533647PubMed CentralView ArticlePubMedGoogle Scholar
 Wang G, Wu W, Zeng W, Chou CL, Shen RF: Labelfree protein quantification using LCcoupled ion trap or FT mass spectrometry: Reproducibility, linearity, and application with complex proteomes. J Proteome Res 2006, 5(5):1214–1243. 10.1021/pr050406gView ArticlePubMedGoogle Scholar
 Nesvizhskii AI, Keller A, Kolker E, Aebersold R: A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem 2003, 75(17):4646–4658. 10.1021/ac0341261View ArticlePubMedGoogle Scholar
 Silva JC, Denny R, Dorschel C, Gorenstein MV, Li GZ, Richardson K, Wall D, Geromanos SJ: Simultaneous qualitative and quantitative analysis of the Escherichia coli proteome: a sweet tale. Mol Cell Proteomics 2006, 5(4):598–607.Google Scholar
 Silva JC, Gorenstein MV, Li GZ, Vissers JPC, Geromanos SJ: Absolute quantification of proteins by LCMSE: a virtue of parallel MS acquisition. Mol Cell Proteomics 2006, 5: 144–156.View ArticlePubMedGoogle Scholar
 Polpitiya A, Qian WJ, Jaitly N, Petyuk V, Adkins J, Camp D, Gordon A, Smith R: DAnTE: a statistical tool for quantitative analysis of omics data. Bioinformatics 2008, 24(13):1556–1558. 10.1093/bioinformatics/btn217PubMed CentralView ArticlePubMedGoogle Scholar
 Clough T, Key M, Ott I, Ragg S, Schadow G, Vitek O: Protein Quantitation in LabelFree LCMS Experiments. J Proteome Res 2009, 8: 5275–5284. 10.1021/pr900610qView ArticlePubMedGoogle Scholar
 Daly DS, Anderson KK, Panisko EA, Purvine SO, Fang R, Monroe ME, Baker SE: MixedEffects Statistical Model for Comparative LCMS Proteomics Studies. J Proteome Res 2008, 7: 1209–1217. 10.1021/pr070441iView ArticlePubMedGoogle Scholar
 Karpievitch Y, Stanley J, Taverner T, Huang J, Adkins JN, Ansong C, Heffron F, Metz TO, Qian WJ, Yoon H, Smith RD, Dabney AR: A statistical framework for protein quantitation in bottomup MSbased proteomics. Bioinformatics 2009, 25(16):2028–2034. 10.1093/bioinformatics/btp362PubMed CentralView ArticlePubMedGoogle Scholar
 Roy S, Josephson SA, Fridlyand J, Karch J, Kadoch C, Karrim J, Damon L, Treseler P, Kunwar S, Shuman MA, Jones T, Becker CH, Schulman H, Rubenstein JL: Protein Biomarker Identification in the CSF of Patients With CNS Lymphoma. J Clin Oncol 2008, 26: 96–105. [http://jco.ascopubs.org/content/26/1/96.abstract] 10.1200/JCO.2007.12.1053PubMed CentralView ArticlePubMedGoogle Scholar
 Carvalho C, Chang J, Lucas J, Nevins J, Wang QL, West M: HighDimensional sparse factor modeling: Applications in gene expression genomics. J Am Stat Assoc 2008, 103: 1438–1456. 10.1198/016214508000000869PubMed CentralView ArticlePubMedGoogle Scholar
 Lucas J, Carvalho C, Wang Q, Bild A, Nevins J, West M: Sparse statistical modelling in gene expression genomics. In Bayesian Inference for Gene Expression and Proteomics. Edited by: Vannucci M, Do KA, Müller P. Cambridge University Press, Cambridge; 2006:155–176.Google Scholar
 Lucas J, Carvalho C, Chen JLY, Chi JT, West M: Crossstudy projections of genomic biomarkers: An evaluation in cancer genomics. PLoS One 2009. [http://dx.plos.org/10.1371/journal.pone.0004523,PMCID:PMC2638006,e4523. doi:10.1371/journal.pone.0004523]Google Scholar
 Lucas J, Carvalho C, West M: A Bayesian Analysis Strategy for CrossStudy Translaton of Gene Expression Biomarkers. Statistical Applications in Genetics and Molecular Biology 18.Google Scholar
 Mueller L, Rinner O, Schmidt A, Letarte S, Bodenmiller B, Brusniak M, Vitek O, Aebersold R, Muller M: SuperHirn—a novel tool for high resolution LCMSbased peptide/protein profiling. Proteomics 2007, 7(19):3470–3480. 10.1002/pmic.200700057View ArticlePubMedGoogle Scholar
 Patel K, Lucas JE, Thompson J, Dubois L, Tillman H, Thompson A, Uzarski D, Califf R, Moseley M, Ginsburg G, McHutchison J, McCarthy J: High Predictive accuracy of an unbiased proteomicd profile for sustained virologic response in chronic hepatitis C patients. Hepatology 2011, 53(6):1809–1818. 10.1002/hep.24284View ArticlePubMedGoogle Scholar
 Raftery A, Madigan D, Hoeting J: Bayesian model averaging for linear regression models. J Am Stat Assoc 1997, 92: 191–197.View ArticleGoogle Scholar
 Hans C, Dobra A, West M: Shotgun stochastic search in regression with many predictors. J Am Stat Assoc 2007, 102: 507–516. 10.1198/016214507000000121View ArticleGoogle Scholar
 Schwarz KB, GonzalezPeralta RP, Murray KF, Molleston JP, Haber BA, Jonas MM, Rosentha P, Mohan P, Balistreri WF, Narkewicz MR, Smith L, Lobritto SJ, Rossi S, Valsamakis A, Goodman Z, Robuck PR, Barton BA: The combination of ribavirin and peginterferon is superior to peginterferon and placebo for children and adolescents with chronic hepatitis C. Gastroenterology 2011, 140(2):450–458. 10.1053/j.gastro.2010.10.047PubMed CentralView ArticlePubMedGoogle Scholar
 Zhang H, Liu Q, Zimmerman LJ, Ham AJL, Slebos RJ, Rahaman J, Kikuchi T, Massion PP, Carbone DP, Billheimer D, Liebler DC: Methods for peptide and protein quantitation by liquid chromatographymultiple reaction monitoring mass spectrometry. Molecular and cellular proteomics 2011., 10(6):Google Scholar
 Tressel T, McCarthy JB, Calaycay J, Lee TD, Legesse K, Shively JE, Pande H: Human plasma fibronectin. Demonstration of structural differences between the A0 and B0chains in the III CS region. Biochemical J 1991, 274(Pt. 3):731–738.View ArticleGoogle Scholar
 Paweletz C, Wiener M, Bondarenko A, Yates N, Song Q, Liaw A, Lee A, Hunt B, Henle E, Meng F, Sleph H, Holahan M, Sankaranarayanan S, Simon A, Settlage R, Sachs J, Shearman M, Sachs A, Cook J, Hendrickson R: Application of an endtoend biomarker discovery platform to identify target engagement markers in cerebrospinal fluid by high resolution differential mass spectrometry. J Proteome Res 2010, 9(3):1392–1401. 10.1021/pr900925dView ArticlePubMedGoogle Scholar
 Sietsema K, Meng F, Yates N, Hendrickson R, Liaw A, Song Q, Brass E, Ulrich R: Potential biomarkers of muscle injury after eccentric exercise. Biomarkers 2010, 15(3):249–258. 10.3109/13547500903502802View ArticlePubMedGoogle Scholar
 Neubert H, Bonnert T, Rumpel K, Hunt B, Henle E, James I: Labelfree detection of differential protein expression by LC/MALDI mass spectrometry. J Proteome Res 2008, 7(6):2270–2279. 10.1021/pr700705uView ArticlePubMedGoogle Scholar
 Meng F, Wiener M, Sachs J, Burns C, Verma P, Paweletz C, Mazur M, Keyanova E, Yates N, Hendrickson R: Quantitative analysis of complex peptide mixtures using FTMS and differential mass spectrometry. J Am Soc Mass Spectrom 2007, 18(2):226–233. 10.1016/j.jasms.2006.09.014View ArticlePubMedGoogle Scholar
 Nittis T, Guittat L, LeDuc R, Dao B, Duxin J, Rohrs H, Townsend R, Stewart S: Revealing novel telomere proteins using in vivo crosslinking, tandem affinity purification, and label free quantitative LCFTICRMS. Mol Cell Proteomics 2010, 9(6):1144–1156. 10.1074/mcp.M900490MCP200PubMed CentralView ArticlePubMedGoogle Scholar
 Keller A, Nesvizhskii A, Kolker E, Aebersold R: Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem 2002, 75: 4646–4658.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.