Statistical protein quantification and significance analysis in label-free LC-MS experiments with complex designs
© Clough et al.; licensee BioMed Central Ltd. 2012
Published: 5 November 2012
Skip to main content
© Clough et al.; licensee BioMed Central Ltd. 2012
Published: 5 November 2012
Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is widely used for quantitative proteomic investigations. The typical output of such studies is a list of identified and quantified peptides. The biological and clinical interest is, however, usually focused on quantitative conclusions at the protein level. Furthermore, many investigations ask complex biological questions by studying multiple interrelated experimental conditions. Therefore, there is a need in the field for generic statistical models to quantify protein levels even in complex study designs.
We propose a general statistical modeling approach for protein quantification in arbitrary complex experimental designs, such as time course studies, or those involving multiple experimental factors. The approach summarizes the quantitative experimental information from all the features and all the conditions that pertain to a protein. It enables both protein significance analysis between conditions, and protein quantification in individual samples or conditions. We implement the approach in an open-source R-based software package MSstats suitable for researchers with a limited statistics and programming background.
We demonstrate, using as examples two experimental investigations with complex designs, that a simultaneous statistical modeling of all the relevant features and conditions yields a higher sensitivity of protein significance analysis and a higher accuracy of protein quantification as compared to commonly employed alternatives. The software is available at http://www.stat.purdue.edu/~ovitek/Software.html.
Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is widely used for relative protein quantification in complex biological mixtures. Several quantification strategies have been described that can broadly be grouped as those using stable isotope labeling and label-free methods [1–3]. Label-free methods can be further differentiated into methods based on spectral counting  and methods that infer analyte quantities from the ion current of the respective molecular ions. The latter method has been implemented under the term label-free LC-MS. It consists of the enzymatic digestion of proteins into a mixture of peptides, the separation of the peptides by capillary liquid chromatography, the ionization of the peptides, and the further separation of the molecular ions by the mass spectrometer according to their ratio of mass to charge. One instrument run yields a two-dimensional LC-MS profile, where peaks correspond to peptide ions, and the intensities of the peaks are related to the abundances of the peptides. The molecular ions representing specific LC-MS peaks can be further fragmented in the mass spectrometer by collision activated dissociation to generate fragment ion spectra (MS/MS), which are informative of the amino acid sequences of the peptides. The label-free shotgun LC-MS/MS workflow [5, 6] is popular because it requires minimal sample processing and is relatively inexpensive. A variety of computational tools are now available to detect, quantify, and align LC-MS peaks across multiple samples and runs, and to annotate them when possible with peptide and protein identities .
The output of the LC-MS/MS workflow is a list of features formed by identified, quantified, and aligned LC-MS peaks. Multiple features can represent a protein when we observe differentially charged ions of the same peptide, or several peptides of the protein. In many investigations, specifically in systems biology  and in the discovery of biomarkers of disease [9, 10], it is necessary and appropriate to summarize all the features into protein-level conclusions. Such summarization has two goals. First, protein significance analysis combines the measurements for a protein across peptides, charge states, and technical replicates across samples and conditions, and detects proteins that change in abundance between conditions more systematically than can be expected by random chance. Second, protein quantification combines the measurements for a protein within a particular biological sample and estimates the abundance of the protein in the sample, e.g., for use in clustering or classification. Thus, the accuracy of both protein significance analysis and protein quantification has important implications for the biological or clinical conclusions.
Many technologies other than LC-MS/MS require measurement summarization, and the solution often involves some form of averaging. E.g., in Affymetrix oligonucleotide microarrays a gene is represented by 11-16 oligonucleotides , and the expression of the respective gene is quantified with Tukey median polish, implemented as part of the robust multi-array averaging (RMA) normalization and summarization [12, 13]. Unfortunately, a similar such summarization of the (log-)intensities of features mapped to a protein fails to produce accurate results in label-free LC-MS/MS, and it undermines the quality of the result . Unlike in microarrays where the probes are designed and optimized, the LC-MS/MS workflow has little control over the observed features. It yields proteins with features that vary in number, have many interferences due to co-eluting, mis-identified or mis-quantified peptides, and have frequently missing intensities of peaks. Furthermore, the peptide ions derived from a specific protein vary in their ionization efficiency.
Measurement summarization in label-free LC-MS/MS is improved by a probabilistic modeling, which explicitly characterizes all the sources of variation. Such models have been increasingly introduced into quantitative proteomics and proven superior to averaging in both accuracy and sensitivity [14–19]. However, most of these models are applicable to restricted experiment types, e.g., the comparison of protein abundances between a limited number of conditions. To compensate for the variation and for the missing values, these models require a fairly large number of biological replicates.
In this work, we argue that the accuracy of protein significance analysis and protein quantification can be enhanced by simultaneously studying a larger number of experimental conditions, even when the number of replicates in each condition is relatively small. Increasing the number of conditions is advantageous biologically, because a deeper insight can be obtained by considering multiple inter-related conditions of the same model organism, or by acquiring repeated measurements on the same biological subject. It is also advantageous statistically, as it improves the accuracy of protein significance analysis and protein quantification.
In previous work, we introduced the linear mixed effects modeling framework for protein significance analysis and protein quantification in label-free LC-MS/MS experiments with simple, comparative designs . The contribution of the current manuscript is to extend this work to arbitrary complex experimental designs, support the proposed framework with a software solution, and demonstrate the advantage of a joint modeling of multiple experimental conditions to practitioners with a limited statistical background. The open-source R -based software package MSstats automatically recognizes the design of the experiment, and can be used for both protein significance analysis and protein quantification. We show that the framework provides meaningful results even with a moderate number of biological replicates, and in experiments with noisy and missing LC-MS peaks.
Independent extraction procedures  were performed on each culture (i.e., biological replicate), the order of the samples was randomized, and three separate mass spectrometry runs (i.e., technical replicates) were acquired for each sample on a hybrid LTQ-Orbitrap mass spectrometer (ThermoFischer Scientific, San Jose, CA, USA), producing 48 runs. LC-MS features were quantified, aligned, and annotated with peptide and protein identities using the OpenMS software . The feature intensities were log-transformed and subjected to constant normalization . The signal processing identified 278 proteins groups with two to 19 unambiguously mapped LC-MS features per group, and with up to 40 missing peak intensities per feature.
Additional details on the dataset are available in Supplementary Section 1.
Albumin was depleted prior to tryptic digest, the order of the samples was randomized, and each sample was analyzed in single replicate runs using Thermo-Finnigan linear ion-trap mass spectrometry, producing 130 runs. LC-MS features were quantified, aligned, and annotated with peptide and protein identities as previously described . When a peak was not detected in a run the workflow reported the background noise, and therefore the dataset had no missing peak intensities. The intensities were log-transformed and subjected to quantile normalization . To reduce the number of peptides with ambiguous mappings to protein isoforms, each peptide was mapped to the underlying Entrez gene ID, and the signal processing identified 121 gene groups with two to 295 peptides per group. For simplicity, we refer to the gene groups as proteins.
We define protein significance analysis as a sequence of four steps: (1) statement of the problem, i.e., definition of the comparisons of interest and of the scope of conclusions before collecting the data, (2) exploratory data analysis, i.e., visualization of protein-specific features for quality control, (3) model-based analysis, i.e., representation of the quantitative measurements in a probabilistic model, and determination of proteins that change in abundance between conditions, while controlling the False Discovery Rate (FDR), and (4) statistical design of the follow-up experiments. The four steps are common to many experiments that study differences in analyte abundance, and are described, e.g., in Chang et al. . In this manuscript we emphasize the details that are specific to label-free LC-MS/MS experiments with complex designs.
Experiments with complex designs simultaneously answer multiple related questions. In the study of breast cancer cell lines we can compare, e.g., the abundance of proteins between oxygen treatments in the high invasive line, at a specific time point such as six hours of treatment. Multiple other, similar questions could be asked from the dataset. Although such comparisons can be performed by separately analyzing the runs within the respective conditions and ignoring the rest of the data, we show in Section 3 that the joint modeling of all the available conditions increases our ability to detect changes that are relevant for the specific question asked. As an illustration, here we compare the protein abundance after six hours of normoxia in the low invasive versus the high invasive cell lines, on average over the biological replicates. In the study of subjects with osteosarcoma we compare protein abundance prior to surgery (week 10) and post-surgery (week 13). The Step 3 below enables arbitrary complex comparisons of this kind.
Statement of the problem also specifies the scope of the conclusions, i.e., the interpretation that we attribute to the biological and technical replication. The reduced scope of biological replication means that we restrict our conclusions to the subjects in the study, e.g., to the 43 subjects in the study of osteosarcoma. This is appropriate for early-stage screening experiments. The expanded scope of biological replication means that we would like our conclusions to hold beyond the selected subjects, and for the underlying populations. E.g., in the study of subjects of osteosarcoma this means that we expand our conclusions beyond the selected 43 subjects, and to the entire population that these subjects represent. This is necessary for experiments at the validation stage. The choice of the scope is required for all statistical models and all experimental designs [25, 26] and Step 3 below enables this choice. Section 3 illustrates that in addition to the important differences in interpretation, the scope of biological replication impacts the sensitivity and the specificity of the results. Therefore, the scope of replication should be specified according to the goals of the experiment, prior to collecting the data.
The plots are generated automatically for all proteins in MSstats with a one-line command. The proteins selected in Figure 3 show only minor interferences in feature intensities, and the large changes between conditions appear systematic. An example of a protein with interferences is given in Supplementary Section 1.2.
The conclusions regarding changes in protein abundance are formalized using statistical modeling and inference. To represent a complex experimental design we introduce a unique identifier for each condition, i.e., for each combination of the available experimental factors. For example, in the experiment of breast cancer cell lines each combination of cell line type, oxygen treatment, and exposure time forms one condition. In the study of subjects with osteosarcoma, each combination of disease status and time forms one condition. We further introduce a unique identifier for each subject, i.e., for each biological replicate.
The model expresses the reduced scope of biological replication by viewing the between-subject changes in abundance as fixed values. Alternatively, it expresses the expanded scope of biological replication by viewing these changes as randomly sampled from the underlying population. In practice a model with one of the two forms is chosen based on the desired scope of conclusions, and is specified for every protein quantified in the investigation. We show in Section 3 that the reduced scope of replication leads to higher sensitivity of detecting changes in abundance as compared to the expanded scope of replication, but that this comes at the cost of a lower specificity.
A property of label-free LC-MS/MS is that, in the absence of technical replication, biological variation is confounded with technical variation. Therefore, conclusions from the model with reduced scope of biological replication are also restricted to the performed mass spectrometry runs, unless the experiment also incorporates technical replication.
The most difficult aspect of these steps is in determining the coefficients of the linear combination. The coefficients depend on the comparison of interest, type, number, and layout of the experimental factors, on the numbers of features mapped to the protein, and on whether we reduce or expand the scope of biological replication. The implementation in MSstats automatically derives all such expressions from the input dataset and streamlines the testing for users with a limited statistics background. Finally, MSstats follows the standard testing procedure and calculates the ratio of the estimated differences and theirs standard error (in statistical terminology, the test statistic) for each protein, compares the test statistics to the Student distribution with the appropriate degrees of freedom to obtain p-values, and adjusts the p-values for multiple comparisons to control the False Discovery Rate in the list of differentially abundant proteins .
and t 1-β, df and t 1-α/2, df are the 100(1 - β) th and the 100(1 - α/2) th percentiles, respectively, of the t-distribution with df = IJK (L - 1) + (I - 1) J (K - 1) degrees of freedom. From the formulae, given a fixed number of features I, and a fixed number of biological replicates K and technical replicates L per condition, increasing the number of related conditions J increases the degrees of freedom, which has the effect of decreasing the lower bound of the detectable fold change. We show this empirically in Section 3.
A distinct goal of label-free LC-MS/MS investigations is protein quantification, i.e., the estimation of protein abundance in a biological replicate, or in one condition on average over all the replicates. In experiments with complex designs these estimates are of a particular interest, as they serve as input to machine learning tools to generate new biological knowledge from the complex datasets. For example, in the study of breast cancer cell lines we can cluster the profiles of protein abundance across conditions to find functionally related proteins. In the study of subjects with osteosarcoma the profiles can be used to predict the subject's therapy response. Such estimates of abundance differ from absolute protein quantification  in that they are only comparable for a same protein across conditions and runs, but not between proteins.
MSstats automatically derives the estimates and streamlines the estimation. In Section 3, we show that the accuracy and the precision of such protein quantification is improved when simultaneously analyzing all the available conditions.
Experiments with complex designs are particularly susceptible to missing values, as they combine intertwined effects of multiple conditions and a small sample size. An advantage of linear mixed effects models is that they tolerate missing LC-MS peaks if each feature has at least one intensity in each condition . In the presence of missing data the model-based estimates of abundance differ from average log-intensities, and reflect more accurately the available information. The estimates of variation are also adjusted. E.g., the confidence intervals in Figure 8(a) vary in width due to the missing feature intensities in the study of breast cancer cell lines, and in Figure 8(b) due to missing time points in the study of subjects with osteosarcoma.
When a feature is missing entirely in one condition, this can be due to the low abundance, but also to other reasons unrelated to abundance, e.g., post-translational modifications, or even due to technical reasons. The extent of missing values is influenced by the experimental settings. There is currently no consensus on how to best account for the missing data in this case. Some authors advocate imputing missing intensities using the estimates of background noise [24, 36] or by using a classification technique such as K-nearest neighbor . Others argue against imputation, and transform the peak intensities to binary present/absent values , treat the missing intensities as censored values as in survival analysis models [18, 39], or advocate testing a peptide for whether its missing intensities occur more frequently in some experimental groups than expected at random, in which case the peptide is viewed as differentially abundant .
Since the goal of MSstats is to flexibly represent any experimental situation, it does not enforce a one-size-fits-all treatment of missing peaks. Instead it allows the user to choose one of three options, based on prior biological expectations and on the exploratory data analysis plots. First, the user can assume that the intensities in the specific condition are missing due to low ion abundance, and impute the missing values in this condition with the average minimum log-intensity across runs. The analysis will reflect the interference in feature intensities, but can violate the assumptions of Normality. Second, the user can make no assumptions regarding the reason for the missing values. This will require an alternative assumption that the features have no interference (in other words, the model will not have the feature × condition statistical interaction in Figure 4). Model-based estimates will account for the missing data, but significance analysis may lose sensitivity. The third option is to assume that this is a generally poor quality feature, and remove it entirely from the dataset. In Section 3, we show that the three treatments of the missing values have only a small effect on the detection of differentially abundant proteins.
MSstats is implemented in the open-source and open-development environment R . Although the models for the factorial and the time course experiments are different, their specification in MSstats is identical. The software requires a unique identifier for each biological replicate, and uses the identities to detect the presence of repeated measurements and to internally choose the correct model.
MSstats is a series of wrappers around the generic methods lm in the library base and lmer in the library lme4 . lm fits linear models where all factors are fixed, and estimates model parameters using ordinary least squares [28, 42]. MSstats uses this function for models with reduced scope of biological replication. lmer fits linear mixed effects models with an arbitrary number of random effects, can handle large and unbalanced datasets, and estimates model parameters using restricted maximum likelihood . MSstats uses this method for models with expanded scope of biological replication.
The use of lm and lmer is adapted to the specifics of LC-MS/MS. First, since each protein can have a different number of features, a different number of model terms (i.e., a different design matrix) is implemented for each protein. Second, options are implemented to handle heterogeneous variance and missing values. Finally, MSstats automatically derives model-based summaries for protein significance analysis and protein quantification, such that the biological comparison is the only input required from the user. Occasionally, it can be of interest to perform an analysis at the feature level instead of at the protein level, and MSstats also supports this option. Supplementary Information contains R-based code for steps of the analysis. The software and the documentation are publicly available at http://www.stat.purdue.edu/~ovitek/Software.html.
We illustrate the performance of the proposed framework in the two case studies. As an example, for the study of breast cancer cell lines, we compare protein abundances between cell line types after six hours of normoxia (i.e., the comparison in Figure 6). In all the results, for features with peak intensities missing in an entire condition the values were imputed with the average minimum log-intensity across all runs of the experiment (as implemented in MSstats) unless stated otherwise. For the study of subjects with osteosarcoma, we compare protein abundances prior to surgery (week 10) and post-surgery (week 13), as illustrated in Supplementary Section 2.2. The results for both datasets were obtained with reduced scope of biological replication unless stated otherwise.
The figure shows, for each analysis, the number of proteins with an FDR-adjusted p-value below various p-value cut-offs. The results indicate an increased sensitivity of joint modeling over the pairwise analysis and the naïve analysis in both case studies. While one can expect, in general, a greater improvement in sensitivity by joint modeling over the alternative analyses when fewer conditions are considered in a comparison, the extent of the sensitivity gain varies between experiments, and typically increases with the sensitivity of the mass spectrometer, and decreases with technical variation in the log-intensities and with the complexity of biological samples.
In addition to the factors outlined in Section 2.3, the sample size is strongly affected by the experimental design. A time course design, such as in the study of subjects with osteosarcoma, separates the within-subject and the between-subject variation. Therefore, a comparison of conditions (or time points) on the same subjects often leads to a smaller required sample size. This is illustrated by the smaller number of biological replicates in the time course study of osteosarcoma in Figure 10(b) as compared to the factorial design of breast cancer cell lines in Figure 10(a). As before, the extent of such differences depends on the specifics of the study.
Our results show that in situations where measurements are available from multiple related conditions, the proposed joint probabilistic modeling and summarization of all the LC-MS/MS runs leads to more sensitive protein significance analysis and more accurate and precise protein quantification than when separately analyzing subsets of conditions. The gain is due to a more efficient use of the data, and to a more accurate understanding of the systematic and random variation.
A distinctive feature of linear mixed effects models is the ability to distinguish two interpretations that we attribute to biological replication. The expanded scope of biological replication means that we expect to reproduce the results in a new set of biological replicates that are randomly selected from the underlying population. Significance testing based on the expanded scope is conservative, and is appropriate for confirmatory investigations. On the other hand, a model specifying the reduced scope of conclusions only expects to reproduce the results in a replicate mass spectrometry analysis of the same biological samples. Protein significance in these models may or may not be reproduced in another set of biological replicates. Such models are appropriate for small-size screening experiments. The distinction between the scopes of biological replication is a property of all biological experiments, and the advantage of linear mixed effects models is that they make this distinction explicit, and allow practitioners to make an informed choice. In all cases, the scope of conclusions should be clearly specified prior to collecting the data, the limits of generalizability should be clearly stated when reporting the results, and the conclusions should be followed by a thorough experimental validation.
An overview of the proposed data analysis workflow.
Statement of the problem
• Specify comparisons of interest
• Express comparisons as statistical hypotheses
• Define scope of biological replication
• Restricted scope suitable for screening; expanded scope required for validation
Exploratory data analysis
• Detect mis-identified features
• Remove obvious outliers
• Detect features with missing values
• Choose imputation strategy
• Fit linear mixed model per protein
• Reduced scope of biological replication = fixed subjects; expanded scope = random subjects
• Check qq-plots plots for Normality
• If deviations, conclusions are approximate only
• Check residual plots for equal variance
• If deviations, use iterative least squares
• Test comparisons of interest
• Adjust p-values per comparison to control FDR
• Quantify protein abundance in conditions or samples of interest
• Use as input with downstream clustering or classification
Design follow-up experiments
• Evaluate power and sample size
• Find minimal sample size for a fold change
• Find minimal fold change for a sample size
The proposed framework, and the software implementation, should not be confused with that of the recently introduced software SRMstats for protein significance analysis in label-based selected reaction monitoring (SRM) workflows . While also based on the linear mixed effects modeling framework, the class of models employed in SRMstats are designed to optimize performance in datasets specific to the SRM workflow. SRM datasets contain extra information from heavy-labeled reference peptides, which facilitates several aspects of an analysis, including normalization to remove systematic between-run variation, separation of true biological variation from non-systematic between-run variation, and a more straightforward handling of missing values within the linear mixed effects modeling framework. The framework introduced in this work is designed specifically for label-free LC-MS/MS investigations, which are characterized by a confounding of variation from multiple convoluted sources, heterogeneous stochastic variation, frequent missing data, and greater uncertainty in the conclusions. The implementation in MSstats is designed to utilize experiments with complex designs in order to better quantify the sources of variation and reduce uncertainty.
The proposed approach has several limitations, most of which are simplifications that reduce the complexity and the analysis time of large-scale datasets. First, MSstats requires the same treatment of missing peak intensities to all proteins with excessive missings. A user-specified protein-specific treatment, motivated by the quality control plots, will likely be implemented in a future version of the software. Second, MSstats applies the same class of models to all proteins in the dataset. While per-protein refinements, such as removing unnecessary interaction terms, are possible they are often impractical, slow the analysis, require adjustments for multiple testing, and may lead to more conservative tests and overfitting. Another potential problem is the assumption of Normally distributed random terms. Although this assumption is rarely satisfied exactly, in our experience the deviations from Normality on the log-intensity scale are quite minor, and linear models are known to be robust to such small deviations. Residual plots help diagnose features with major deviations, which can then be manually excluded from the analysis. The final limitation of the proposed approach is the assumption that the LC-MS features in the dataset are correctly identified, correctly mapped to protein groups, and are informative of protein abundance (e.g., are within the limits of the dynamic range). An incorrectly mapped feature, a feature close to the edge of the signal range, or a feature with a profile distorted by a post-translational modification can undermine the quality of the results. The quality control plots currently help partially alleviate this issue.
Overall, the proposed approach balances accuracy and practicality, and enables the analysis of complex experiments in high throughput. Its open-source implementation is friendly to users with a limited statistics and programming background. We hope that the proposed approach will become a valuable tool for proteomic investigations.
The work was supported by the NSF CAREER grant DBI-1054826 to O.V., by grant 5K23RR019540 from the National Institutes of Health (NIH) to S.R. and by the Indiana University Melvin and Bren Simon Cancer Center ITRAC Pilot Grant Mechanism to S.R.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 16, 2012: Statistical mass spectrometry-based proteomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S16.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.