Corra: Computational framework and tools for LC-MS discovery and targeted mass spectrometry-based proteomics
© Brusniak et al; licensee BioMed Central Ltd. 2008
Received: 20 May 2008
Accepted: 16 December 2008
Published: 16 December 2008
Quantitative proteomics holds great promise for identifying proteins that are differentially abundant between populations representing different physiological or disease states. A range of computational tools is now available for both isotopically labeled and label-free liquid chromatography mass spectrometry (LC-MS) based quantitative proteomics. However, they are generally not comparable to each other in terms of functionality, user interfaces, information input/output, and do not readily facilitate appropriate statistical data analysis. These limitations, along with the array of choices, present a daunting prospect for biologists, and other researchers not trained in bioinformatics, who wish to use LC-MS-based quantitative proteomics.
We have developed Corra, a computational framework and tools for discovery-based LC-MS proteomics. Corra extends and adapts existing algorithms used for LC-MS-based proteomics, and statistical algorithms, originally developed for microarray data analyses, appropriate for LC-MS data analysis. Corra also adapts software engineering technologies (e.g. Google Web Toolkit, distributed processing) so that computationally intense data processing and statistical analyses can run on a remote server, while the user controls and manages the process from their own computer via a simple web interface. Corra also allows the user to output significantly differentially abundant LC-MS-detected peptide features in a form compatible with subsequent sequence identification via tandem mass spectrometry (MS/MS). We present two case studies to illustrate the application of Corra to commonly performed LC-MS-based biological workflows: a pilot biomarker discovery study of glycoproteins isolated from human plasma samples relevant to type 2 diabetes, and a study in yeast to identify in vivo targets of the protein kinase Ark1 via phosphopeptide profiling.
The Corra computational framework leverages computational innovation to enable biologists or other researchers to process, analyze and visualize LC-MS data with what would otherwise be a complex and not user-friendly suite of tools. Corra enables appropriate statistical analyses, with controlled false-discovery rates, ultimately to inform subsequent targeted identification of differentially abundant peptides by MS/MS. For the user not trained in bioinformatics, Corra represents a complete, customizable, free and open source computational platform enabling LC-MS-based proteomic workflows, and as such, addresses an unmet need in the LC-MS proteomics field.
One area of particular interest to the proteomics community is the application of proteomics to the determination of proteins that are differentially expressed or abundant between samples representing different physiological or disease states [1–3]. Typically, such analyses require a quantitative proteomics approach, for which there is a wide range of experimental options available to the researcher. These generally fall into one of two categories, or represent some combined form of both: i) stable isotope labeling, combined with LC-MS/MS identification, providing accurate relative abundance, or, if suitably calibrated peptide or protein reference samples are available, absolute quantification; ii) LC-MS label free quantification (i.e. pattern-based), in which quantification is determined via observed changes in the ion current for individual analytes.
Isotopic labeling and label free approaches each have their own set of challenges and limitations. MS/MS-based isotopic labeling approaches must introduce the label pre- or post-sample isolation. Post-isolation methods include the use of labeling reagents such as ICAT and iTRAQ, whereas pre-isolation labeling methods (i.e. in vivo) include the use of SILAC labeling reagents, for use in cell culture-based experiments. All of these methods, however, limit the number of individual biological samples that can be compared in a single experiment to a very low number, and peptides can generally only be quantified if they are also successfully identified by MS/MS, unless combined with a LC-MS profiling approach. In contrast, LC-MS-based label free approaches are ideal for the comparison of large sets of samples or populations where, in principle, every feature detected by the mass spectrometer is potentially quantifiable. However, since LC-MS approaches rely on some form of data alignment or pattern matching, they require a much higher degree of experimental reproducibility. This can be challenging for LC-MS, when large numbers of consecutive analyses are often required.
As a result of both the increased use of LC-MS-based workflows, and the complex computational challenge that the alignment of large sets of LC-MS data represents, a wide range of tools to address this need has appeared [8–16]. It is also because of the complexity of this problem, and the different computational approaches that can be taken to solving it, that each of the tools has its own, individual set of strengths and weaknesses. Thus for the biologist or proteomics researcher, tool selection can depend on what experiment is being done, or which mass spectrometer or data type is being used. For example, the two LC-MS tools we have implemented in the version of Corra presented here, SpecArray and SuperHirn, each work better than the other with data generated with different types of mass spectrometer, as is discussed further below. Other tool limitations/considerations include: some tools are designed to run on a single processor, whereas others can run on a cluster; some have a limitation on the number of samples (data files) that can be aligned in one experiment; others are designed to use MS/MS identification of some peptides as 'landmarks' during the alignment process. In general, these LC-MS tools also have no, or only very minimal, statistical capability to assign confidence to data alignments, and thus control the false discovery rate. Finally, as each tool was developed in a different academic setting, each with its own set of needs and workflows, the input/output formats of the tools do not make them readily compatible with each other, or with the array of statistical packages that have been developed for analysis of high-dimensional data. For all these reasons, embarking on an LC-MS-based proteomic discovery workflow can be a daunting task for the biologist or other researcher who is not already well versed in proteomics and bioinformatics.
We therefore set out to develop, and here present, Corra: a free, open source and customizable computational platform that enables LC-MS proteomic workflows. The Corra framework extends and adapts existing algorithms used for LC-MS-based proteomics, along with existing statistical algorithms from the microarray community suitable for the analysis of high-dimensional LC-MS data, and also adapts additional software engineering technologies, such as distributed processing and Google Web Toolkit. As such, Corra addresses an unmet need in the LC-MS proteomics field: to provide an open source computational platform that allows biologists and other researchers, not formally trained in bioinformatics, to easily process, visualize and analyze LC-MS data in the manner of their choosing, all within a single application, and on their own workstation. Using a web-based interface, Corra guides the user from MS data generation, through data processing, visualization, and statistical analysis steps, providing for multiple server-side data processing modes and statistical analyses along the way, towards the identification of differentially abundant or expressed candidate features for prioritized targeted identification by subsequent MS/MS. Corra also provides all the information and documentation required for the knowledgeable user to customize the platform, and integrate the data processing and/or statistical analysis tools of their own choosing, according to their specific workflow needs.
To illustrate the implementation of Corra for the analysis, visualization and interpretation of biologically relevant LC-MS data, we present analyses of two biological pilot studies as examples of commonly performed proteomic LC-MS experiments, that each highlight different aspects and uses of Corra. The first pilot study demonstrates the use of Corra for candidate plasma biomarker discovery aimed at human type 2 diabetes. Here we show the use of aligned LC-MS features to correctly classify the normal and disease plasma sample groups, followed by Corra-enabled targeted MS/MS identification of the differentially abundant peptide features. The second pilot study demonstrates the use of Corra for phosphopeptide profiling, where we identify in vivo substrates of a protein kinase using a kinase deletion strain of yeast. Here we expect that the target phosphopeptides of the kinase will be absent from one set of samples. Again, follow-up Corra-enabled targeted MS/MS identified the phosphopeptide features absent in the kinase knockout strain.
Overview of the APML format
To facilitate the integration into a coherent analysis platform of existing software tools with those yet to be developed, it was necessary to define a data format that could capture any and all relevant information relating to the data and the experiment, to ensure portability of data between the disparate tools in use. We therefore defined and implemented the Annotated Putative peptide Markup Language (APML) data format within the Corra framework to both store processed data, and port it from one tool to the next. APML is defined using eXtensible Markup Language (XML). We chose XML since it is a simple language, designed for data representation, and is easily parsed and self-describing by markup tags. APML captures essential LC-MS data information for statistical analysis, as well as additional information for identification, profiling, clustered features, etc. The APML schema and documentation can be found here .
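As a rough illustration of why an XML representation is easy to parse and pass between tools, the following minimal Python sketch reads aligned features from a small XML document. The element and attribute names (`apml`, `feature`, `intensity`, `mz`, `rt`, `charge`, `run`) and the run labels are hypothetical placeholders chosen for this example; they are not taken from the published APML schema, which should be consulted for the actual structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified APML-like document. Element and attribute names
# are illustrative placeholders, NOT the published APML schema.
EXAMPLE = """<apml>
  <feature id="1" mz="650.3312" rt="1250.4" charge="2">
    <intensity run="NGT_01">15200.0</intensity>
    <intensity run="DB_01">3100.0</intensity>
  </feature>
  <feature id="2" mz="812.9120" rt="2011.7" charge="3">
    <intensity run="NGT_01">870.0</intensity>
  </feature>
</apml>"""


def parse_features(xml_text):
    """Return one dict per feature, with intensities keyed by run name."""
    root = ET.fromstring(xml_text)
    features = []
    for node in root.findall("feature"):
        # A run with no <intensity> child is simply absent from the dict,
        # which is how a feature missing from that run shows up here.
        intensities = {
            child.get("run"): float(child.text)
            for child in node.findall("intensity")
        }
        features.append({
            "id": node.get("id"),
            "mz": float(node.get("mz")),
            "rt": float(node.get("rt")),
            "charge": int(node.get("charge")),
            "intensities": intensities,
        })
    return features


features = parse_features(EXAMPLE)
```

In this sketch the second feature has no intensity for run `DB_01`, so a downstream tool can tell that the feature was not aligned in that run without any format-specific convention beyond the tags themselves.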
APML implementation with LC-MS and statistical tools
A goal of Corra was to integrate multiple, disparate LC-MS data analysis tools seamlessly with common statistical packages, to allow for better comparison between differently-processed datasets via the addition of statistical measures of confidence and error rates. The integration of tools was achieved via APML and the various parsers. In the current build of Corra we have integrated SpecArray and SuperHirn, both adapted to generate their final output in APML, as well as to operate in a distributed computing environment. To integrate the tools with the needed statistical analyses, we implemented another APML parser, using R's XML library, to facilitate an interface to CorraStatistics.R. This allows the user to launch within Corra additional statistical data analyses, suitable for quantitative proteomics data analysis, via the use of selected Bioconductor packages.
Bioconductor is a leading open source software project for the statistical analysis of high-throughput biological data, and is primarily based on R, a language and environment for statistical computing and graphics. Bioconductor represents a rich source of statistical packages pertinent to proteomic data analysis, and thus a natural source of tools for Corra's data processing framework. The current version of CorraStatistics.R includes the LIMMA and maSigPro Bioconductor packages to detect statistically significant features that are differentially abundant between related sample sets (e.g. disease control vs. case) and with a controlled false discovery rate, with the option of adding time course information, if required. CorraStatistics.R also allows for the use of MLInterfaces and Hierarchical Clustering for supervised and unsupervised clustering applications.
An inherent aspect of any LC-MS profiling experiment, where features are to be aligned across multiple LC-MS runs, is the issue of missing features across a subset of the LC-MS runs being aligned. One explanation for such 'missing' features is the failure of the feature detection and/or alignment tools to correctly detect and align every true feature in every data file, usually as a result of experimental variation (e.g. variations in LC retention, signal to noise, m/z drift, etc.). However, missing features may also be indicative of real and desired experimental information. Therefore, instead of throwing away aligned features with missing intensity values of a subset of the LC-MS runs, CorraStatistics.R also provides the user with the option to replace such missing intensity values, either with a user-specified value, or the default setting: the minimum intensity value obtained among all aligned features, after intensity normalization, for the given data set. These missing values are then applied to the aligned intensity dataset, allowing subsequent analyses to detect features that were above the limit of detection by LC-MS in one data/sample sub-group, but not in another.
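The replacement rule described above can be sketched in a few lines of Python. This is only an illustration of the logic (fill with a user-specified value, or by default with the minimum observed intensity in the normalized dataset); the function name, the use of `None` to mark missing values, and the list-of-lists layout are assumptions made for this example, not the CorraStatistics.R implementation.

```python
def replace_missing(matrix, fill_value=None):
    """Replace None entries in a feature-by-run intensity matrix.

    matrix: list of rows (one per aligned feature), each a list of
            normalized intensities, with None where the feature was
            not aligned in that run.
    fill_value: user-specified replacement; when None, the minimum
            observed intensity across the whole matrix is used.
    """
    observed = [v for row in matrix for v in row if v is not None]
    if not observed:
        raise ValueError("matrix contains no observed intensities")
    if fill_value is None:
        fill_value = min(observed)
    # Build a new matrix so the caller's data is left untouched.
    return [[fill_value if v is None else v for v in row] for row in matrix]


aligned = [
    [10.0, None, 12.0],   # feature missing in the second run
    [None, 4.0, 5.0],     # feature missing in the first run
]
default_filled = replace_missing(aligned)               # fills with 4.0
custom_filled = replace_missing(aligned, fill_value=1.0)
```

Keeping the original matrix unmodified makes it straightforward to run the same downstream analysis with and without replacement and compare the results.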
APML Parsers and viewers
Corra web graphical interface
Another important design goal for Corra was to enable relative ease of use of its various tools and statistical packages by users with little or no formal bioinformatic training, all within a single computational interface. This was accomplished via the development of a web-based graphical user interface (GUI) to help guide any user through the various desired data processing and analysis steps, in a systematic and straightforward way. The client web application was built using Google Web Toolkit. Access to the interface is via a web-browser, and does not require installation of any custom software on the user's computer. Data processing happens on a remote server that leverages a compute cluster environment to achieve high-throughput and scalability. This has the advantage of not tying up the user's computational resources during time-consuming analyses.
In addition to the number of LC-MS runs to be analyzed and their inherent file-sizes, the processing time is highly variable and dependent on which LC-MS tool is used, user specified setup parameters, as well as the hardware and number of compute nodes that the Corra platform is run on. However, as a guide, utilizing SpecArray for alignment in a multi-threaded and distributed computing environment, feature extraction for 105 LC-MS runs, with a ~3 GB mzXML file per LC-MS run, took Corra ~25 hours. This was done on a six dual-core, dual processor AMD Opteron 275 (2.2 GHz) cluster, each with 1 MB of level 2 cache.
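To put the reported figures in perspective, the short Python sketch below converts them into per-run time and data throughput. The extrapolation helper assumes that processing time scales linearly with the number of runs on the same hardware and settings; that is an assumption of this example, not a claim made in the text, and all function names are hypothetical.

```python
def throughput(runs, gb_per_run, wall_hours):
    """Summarize a processing benchmark as simple per-run figures."""
    total_gb = runs * gb_per_run
    return {
        "total_gb": total_gb,
        "minutes_per_run": wall_hours * 60 / runs,
        "gb_per_hour": total_gb / wall_hours,
    }


# Approximate figures reported above: 105 runs, ~3 GB each, ~25 hours.
reported = throughput(runs=105, gb_per_run=3.0, wall_hours=25.0)


def estimate_hours(new_runs, reference=reported):
    """Rough estimate for a different number of runs.

    Assumes linear scaling on the same cluster with the same tool and
    parameters, which is an assumption of this sketch.
    """
    return new_runs * reference["minutes_per_run"] / 60


hours_for_210_runs = estimate_hours(210)
```

On these figures the benchmark corresponds to about 315 GB of mzXML in total, roughly 14 minutes of wall-clock time per run, and about 12.6 GB processed per hour.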
Implementing new LC-MS tools or statistical packages in Corra
As has been discussed above, the version of Corra presented here includes implementation of SpecArray and SuperHirn for the LC-MS data alignments, and the LIMMA and maSigPro packages for statistical analysis of aligned data. However, researchers with a sufficient bioinformatics background can readily implement additional tools of their choosing. Adding an LC-MS tool would minimally require modifying the new tool's code so that it can input and output data in APML format. Tools that also use MS/MS data would have to be modified to use the pepXML representation of the MS/MS data. Additionally, tools that are not designed to do so could be modified to take advantage of Corra's distributed processing architecture, if desired. Since Corra already includes parsers for R's ExpressionSet data format, introducing existing or new R statistical packages is very simple. Finally, the Corra GUI would have to be updated to show the new tools in the web interface.
Enrichment of N-glycopeptides from human plasma samples relevant to type 2 diabetes
22 human plasma samples were obtained directly from the funding agency (i.e. from the NIDDK, collected specifically for the project PAR-04-076). Diagnostic classifications had been made via the oral glucose tolerance test (OGTT), the current diagnostic 'gold standard' for diabetes. 13 control samples were from subjects with normal glucose tolerance (NGT), with blood glucose levels ranging from 54 to 98 mg/dl 2 hours post-glucose challenge after fasting. 9 samples were from patients newly diagnosed with type 2 (adult onset) diabetes (DB) (i.e. had not received treatment, or made lifestyle adjustments), with blood glucose levels ranging from 202 to 279 mg/dl 2 hours post-glucose challenge after fasting. N-linked glycoproteins were enriched from the plasma samples, and the formerly N-glycosylated peptides recovered, following digestion with trypsin and protein N-glycosidase F (PNGase F) sequentially, essentially as described elsewhere. Sample preparations were performed in a 96-well plate format, using a Freedom Evo robotic workstation (TECAN, Maennedorf, Switzerland) for all liquid handling and incubation steps, following manual transfer of plasma samples to the 96-well plate. Sample locations in the plate were randomized and recorded. Final N-glycopeptide isolates were stored, dry, in glass vials at -80°C until needed. Samples were resuspended in 0.4% acetic acid prior to MS analyses.
Mass spectrometric analyses of human plasma N-glycopeptide isolates
Peptides were separated on an 1100 Series HPLC system (Agilent, Santa Clara, CA) equipped with a nanoflow pump, operating at a flow rate of 1 μl/min. Mobile phase A was 0.1% formic acid in water, and mobile phase B was 0.1% formic acid, 5% water, 95% acetonitrile. A binary linear gradient from 5% to 35% B was used to separate the peptides on a 10 cm monolithic C18 column, with a 100 μm inner diameter (Merck KGaA, Darmstadt, Germany). A self-packed IntegraFrit column (New Objective, Woburn, MA) with a bed of Magic C18 5 μm particles (Michrom Bioresources, Auburn, CA), 2 cm × 100 μm, was used as a pre-column. Sample volumes of 5 μl were injected by the autosampler. Samples were randomized for loading, and re-randomized for all subsequent technical replicates.
Mass analysis was performed on a micrOTOF electrospray time-of-flight mass spectrometer (Bruker Daltonics, Billerica, MA) at a mass accuracy of 5 ppm, and a resolving power of 9,000 or better. The mass scale was calibrated using glu1-fibrinopeptide B (Glufib) (Sigma, Saint-Louis, MO) and mass spectra were acquired at 1 spectrum/s over an m/z range of 300–1,600. High mass accuracy on the micrOTOF was maintained via automated instrument recalibration between each sample injection. This was achieved by injecting 320 fmol of Glufib, before running a 15 minute wash gradient, increasing the cone voltage to 220 V to induce in-source CID. The Glufib fragment ions were then used by a Visual Basic script to recalibrate the mass spectrometer on-the-fly, thereby ensuring high mass accuracy from the first to last sample. This measure also has the benefit of greatly reducing carry-over between complex samples, such as plasma-derived peptide isolates, and also provides a way to monitor sensitivity and reproducibility of the system during large-scale sample batch analyses.
For MS/MS identification of Corra-identified discriminatory N-glycopeptides, an inclusion list for the top-ranked 400 discriminatory peptides was used for targeted MS/MS on an LTQ-FT mass spectrometer (ThermoFisher, San Jose, CA) as described elsewhere.
Enrichment of phosphopeptides from control and Ark1 kinase knockout strains of yeast
Three biological replicates each of the yeast S. cerevisiae wild type (BY7092: can1::STE2pr-Sp_his5 lyp1Delta his3Delta leu2Delta ura3Delta met15Delta) and the deletion strain (BY7092: can1::STE2pr-Sp_his5 lyp1Delta his3Delta leu2Delta ura3Delta met15Delta ark1Delta) were grown to OD ~0.8 at 30°C in synthetic defined (SD) medium (per liter: 1.7 g YNB, 5 g ammonium sulfate, 2% glucose (w/v), 0.03 g isoleucine, 0.15 g valine, 0.04 g adenine, 0.02 g arginine, 0.1 g leucine, 0.03 g lysine, 0.02 g methionine, 0.05 g phenylalanine, 0.2 g threonine, 0.04 g, 0.04 g tryptophan, 0.03 g tyrosine, 0.02 g uracil, 0.1 g glutamic acid and 0.1 g aspartic acid). Cells were harvested at 30°C by centrifugation. The cells were then washed once in SD medium (at 30°C), pelleted by centrifugation, and shock-frozen in liquid nitrogen until needed. Yeast cell pellets were subsequently thawed in ice-cold lysis buffer (20 mM Tris-HCl pH 8.0, 100 mM KCl, 5 mM EDTA, 20 nM calyculin A, 200 nM okadaic acid, 4.8 μM cypermethrin (all from Merck KGaA, Darmstadt, Germany), 2 mM vanadate, 10 mM sodium pyrophosphate and 10 mM NaF) using 1 ml of lysis buffer per gram of yeast. Yeast cells were lysed by glass bead beating (using acid washed glass beads). Phosphopeptides were isolated following tryptic digestion of total protein isolates using titanium dioxide affinity purification, as described elsewhere [27, 28].
Mass spectrometric analyses of yeast phosphopeptide isolates
The phosphopeptide samples were analyzed on a hybrid LTQ-OrbiTrap mass spectrometer (ThermoFisher Scientific, Bremen, Germany) interfaced with a nanoelectrospray ion source. Chromatographic separation of peptides was achieved on an Eksigent nano LC system (Eksigent Technologies, Dublin, CA, USA), equipped with an 11 cm fused silica emitter, 75 μm inner diameter (BGB Analytik, Böckten, Switzerland), packed in-house with Magic C18 AQ 5 μm beads. Samples were loaded from a cooled (4°C) Spark Holland autosampler and separated using an acetonitrile/water solvent system containing 0.1% formic acid, at a flow rate of 200 nl/min. Peptide mixtures were separated with a gradient from 5 to 30% acetonitrile over 80 minutes. For MS/MS data acquisition, one data-dependent MS/MS scan was acquired in the linear ion trap for each OrbiTrap-MS scan, the latter acquired at 60,000 nominal resolution settings (full width at half maximum), with an overall cycle time of ~2 seconds. Charge state screening was employed to select for 2+ ions, rejecting 1+ ions and those with undetermined charge. For injection control, the automatic gain control was set to 5 × 10^5 and 1 × 10^4 for full OrbiTrap-MS and linear ion trap MS/MS, respectively. The instrument was calibrated externally, according to manufacturer's instructions. Data was acquired using internal lock mass calibration on m/z 429.088735 and 445.120025.
Database searching of MS/MS data for peptide identification
Data from targeted LC-MS/MS on an LTQ-FT, for the human plasma study, were searched using SEQUEST v27, and the human IPI database v3.23 (which contains 66,617 proteins). Trypsin was specified for cleavage, allowing one non-tryptic terminus. No mass filtering was used, a 0.1 Da precursor mass tolerance was used, and a fragment ion tolerance of zero was applied (which in this version of SEQUEST corresponds to a tolerance of ~0.5 Da due to the unit mass binning that SEQUEST applies to the input spectra). A stable modification for Cys of +57.05 Da was used, as well as differential modifications for Met of +16.0 Da and for Asn of +0.984 Da. A maximum of 4 modified residues per peptide was allowed. OrbiTrap MS/MS data, for the yeast kinase study, were searched using SORCERER-SEQUEST v3.0.3, running on SageN Sorcerer, and using the Yeast SGD database (Version of 10.20.2007, which contains 6,795 forward protein entries and 6,795 reversed protein entries). Trypsin was specified for cleavage, allowing for two missed cleavages and one non-tryptic terminus. Mass tolerance was set to 25 ppm for the monoisotopic precursor ions, and to 0.5 Da for fragment ions. Stable modification for Cys of +57.0214 Da was used, along with stable modification for all carboxylate groups of +14.0156 Da, and differential modification for Ser, Thr and Tyr of +79.9663 Da was also used. For both datasets, final peptide assignments were made and false discovery rates calculated by PeptideProphet (v3.0) interpretation of SEQUEST search results. For the purposes of this study, modified amino acid assignments, according to above criteria, were made by virtue of the top-ranked SEQUEST match, followed by manual/visual inspection of MS/MS spectra.
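Because the two searches express precursor tolerance in different units (0.1 Da for the LTQ-FT data, 25 ppm for the OrbiTrap data), it can help to see how a relative ppm tolerance translates into an absolute m/z window. The Python sketch below shows the standard conversion; the function names and the example m/z values are illustrative only and do not come from either search engine.

```python
def ppm_window(mz, ppm):
    """Return the (low, high) m/z bounds for a +/- ppm tolerance."""
    delta = mz * ppm / 1e6
    return mz - delta, mz + delta


def within_ppm(observed, theoretical, ppm):
    """True if the observed m/z lies within +/- ppm of the theoretical m/z."""
    error_ppm = abs(observed - theoretical) / theoretical * 1e6
    return error_ppm <= ppm


# At m/z 800, a 25 ppm tolerance corresponds to about +/- 0.02 m/z units.
low, high = ppm_window(800.0, 25.0)
accepted = within_ppm(800.015, 800.0, 25.0)   # 18.75 ppm error
rejected = within_ppm(800.030, 800.0, 25.0)   # 37.5 ppm error
```

A ppm tolerance widens in absolute terms as m/z increases, whereas a tolerance stated in Da is constant across the mass range.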
Results and discussion
Corra data input and software processing
At the outset, (multiple) raw LC-MS data files are first converted to mzXML, prior to importing into the Corra framework. The data can then be processed for feature/peak picking and alignment. The current implementation of Corra uses SuperHirn for very high mass accuracy OrbiTrap or FT-MS data (and is the default tool setting for such data, unless otherwise specified by the user) and SpecArray for high mass accuracy TOF-MS data (similarly the default tool setting for these data). These default settings were, in fact, determined through testing the feature picking/alignment tools on multiple data types, where we observed that a given tool performed better with certain types of MS data, and in a somewhat instrument-dependent way. We reasoned that this effect likely resulted from the original data sets that were used during the initial stages of tool development and testing. SpecArray was initially developed for the analysis of ESI-TOF data, and thus performed better than SuperHirn for the analysis of the ESI-TOF data, such as that obtained for the human type 2 diabetes plasma study shown below. On the other hand, SuperHirn was initially developed for the analysis of very high mass accuracy FT-MS data, and thus performed better than SpecArray for the analysis of very high mass accuracy OrbiTrap FT-MS data, such as that in the yeast kinase knockout study, also shown below. Since it is necessary to have high mass accuracy data (i.e. from TOF or FT mass spectrometry platforms) in order to perform LC-MS profiling of complex samples, the implementation of SpecArray and SuperHirn in this initial version of Corra represents sufficient choice for anyone wanting to perform LC-MS profiling using Corra. However, the Corra platform, being open source, was designed so that additional tools could be integrated, according to project-specific needs, as described under Methods above.
Another feature of the Corra framework is that it facilitates the process of peak picking and alignment on the server side (thus not tying up the user's own computer) utilizing an underlying cluster environment with a job-scheduling system (in this case a Portable Batch System) to process all the LC-MS runs in a given data set. This mode of analysis also ensures that the process is not limited by the size of the data set (i.e. the number of LC-MS runs) as some stand-alone tools can be. Following feature detection, the user can display pertinent statistics to evaluate the data quality and usefulness, such as a distribution for the number of features extracted from each LC-MS run. Finally, the outputs of these analyses are converted to APML, both for storage, and for data visualization via Corra's APML viewer (see Figure 4 and above under Methods).
Corra peak picking and alignment
As discussed above, the current implementation of the Corra framework allows for the use of either SpecArray or SuperHirn for feature picking and alignment purposes. However, in order to facilitate the analysis of large data sets, where MS signal intensity typically varies over time, Corra normalizes the MS signal intensity data, prior to importing into the peak picking and alignment tools, using LIMMA (Linear Models for Microarray Data). Corra, via APML, also allows for subsequent annotation of aligned peaks where MS/MS data is available, for example via subsequent targeted MS/MS identification of differentially abundant peptide features.
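The text does not state which normalization method is applied, so the following Python sketch uses a deliberately simple stand-in, median centering on the log2 scale, purely to illustrate what between-run intensity normalization does. It is not the LIMMA routine used by Corra, and the function name and data layout are assumptions of this example.

```python
import math
from statistics import median


def median_center_log2(runs):
    """Shift log2 intensities so that every run shares the same median.

    runs: dict mapping a run name to a list of positive intensities.
    Returns a dict with the same keys holding normalized log2 values.
    This is a generic illustration, not the method used by LIMMA or Corra.
    """
    logged = {name: [math.log2(v) for v in values]
              for name, values in runs.items()}
    run_medians = {name: median(values) for name, values in logged.items()}
    # Use the median of the per-run medians as the common target level.
    target = median(run_medians.values())
    return {name: [x - run_medians[name] + target for x in values]
            for name, values in logged.items()}


raw = {
    "run_early": [2.0, 4.0, 8.0],     # lower overall signal
    "run_late": [8.0, 16.0, 32.0],    # same pattern, 4x higher signal
}
normalized = median_center_log2(raw)
```

In this toy case the two runs differ only by a constant factor, so after centering they become identical; real data would retain feature-level differences while removing the global intensity shift.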
It is worth noting here that Corra was designed primarily for the analysis of LC-MS-based (i.e. label free) quantitative proteomic data. It is thus highly desirable that the data itself should be acquired under conditions that maximize reproducibility. To this end, in one of the studies discussed below, we instituted an automated calibration of the mass spectrometer via the inclusion of a calibration standard in the blank/wash cycle. This provided for very consistent mass resolution and accuracy, meaning that the major concerns to focus efforts on were maintaining reproducibility for both chromatography and sensitivity. If sensitivity were to drop appreciably, then many features that were above signal-to-noise may no longer be detected. This effect could mislead the user into thinking that the absence of such features was related to biology, rather than machine performance. It is therefore important to be mindful of this issue when analyzing large LC-MS profiling datasets. Indeed, the use of the calibration standard between runs enabled us to closely follow MS sensitivity over the course of large-scale experiments, and several datasets were abandoned before we obtained the data presented in the diabetes example study below, due to the detection of MS sensitivity-related problems. Reproducibility of LC retention time, on the other hand, is somewhat more challenging, the main issues being sample carry-over and gradient drift. Fortunately, the short wash cycle between analyses greatly reduced carry-over, without significantly increasing the time required per sample analysis. Gradient drift can be harder to control. However, improvements in the alignment algorithms currently implemented have meant that, for the higher-end LC systems commonly in use, this is rarely an issue, save for a major breakdown in the LC system.
Statistical data analysis using CorraStatistics.R
To be complete, any single platform for LC-MS data analysis would need to include statistical algorithms, appropriate for the analysis of LC-MS data, to generate measures of significance for (peptide) features that appeared to be differentially expressed or abundant between two or more sample groups. Fortunately, many such algorithms have been established for the analysis of genomic and microarray data, and are now freely available via Bioconductor. To meet this need, Corra thus includes Bioconductor R statistical packages that are useful and appropriate for the interpretation of LC-MS data, and which are contained in the CorraStatistics.R module.
To perform statistical analyses, APML data sets are first imported into CorraStatistics.R, which parses APML's dataset and sample information to create an annotated sample and feature intensity dataset as an ExpressionSet object, the format required by the R statistical packages from Bioconductor. The implementation of CorraStatistics.R presented here uses LIMMA for processing data without time course information, and maSigPro for data that does contain time course information. Regardless of which is used, the final output is always the same: a ranked list of features that best discriminate between two or more biological/physiological/clinical groups. This list can then be used to generate an inclusion list for targeted MS/MS analysis and subsequent identification of the discriminatory peptides/proteins of interest (see Figure 6). Finally, the MS/MS spectra, and resultant peptide sequences identified etc., can be annotated back into the aligned APML file for that particular data set.
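The step from a ranked feature list to an inclusion list can be sketched as follows. The text does not state which multiple-testing adjustment Corra applies, so this example uses the Benjamini-Hochberg procedure, a common choice for controlling the false discovery rate; the feature fields (`mz`, `charge`, `rt`, `p`), the thresholds and the function names are all hypothetical and chosen only for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def inclusion_list(features, max_fdr=0.05, top_n=400):
    """Select the top-ranked features passing an FDR cut-off.

    Returns (m/z, charge, retention time) tuples, which is the kind of
    information a targeted MS/MS method would need. The exact fields an
    instrument expects are not specified here.
    """
    adjusted = benjamini_hochberg([f["p"] for f in features])
    ranked = sorted(zip(adjusted, features), key=lambda pair: pair[0])
    selected = [f for q, f in ranked if q <= max_fdr][:top_n]
    return [(f["mz"], f["charge"], f["rt"]) for f in selected]


# Hypothetical features with raw p-values from a differential analysis.
candidates = [
    {"mz": 650.33, "charge": 2, "rt": 1250.4, "p": 0.010},
    {"mz": 812.91, "charge": 3, "rt": 2011.7, "p": 0.040},
    {"mz": 433.20, "charge": 2, "rt": 900.0, "p": 0.030},
    {"mz": 701.50, "charge": 2, "rt": 1500.0, "p": 0.005},
]
targets = inclusion_list(candidates, max_fdr=0.03)
```

The `top_n` default of 400 mirrors the size of the inclusion list used in the plasma study described under Methods; in practice the list length is limited by what the instrument can target in a single run.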
One drawback of clustering analyses, and indeed many other statistical methods that could be applied, is that they can only use features that aligned across all LC-MS runs. However, it is possible that a given feature may not be present in one sample pool versus another due to a genuine biological effect, rather than it being below the limit of MS detection in one or more LC-MS runs, or due to an error made by either the feature picking or alignment tool. Thus, in order to work around the clustering limitations for such real-life situations, we included an optional function within Corra (called 'n/a replace'), where the user can replace missing intensity data (i.e. given features not aligned across all LC-MS-runs) with the minimum measured intensity for the entire dataset (the default setting), or a specified nominal value of their choosing. When missing intensity values are not replaced, only features that have intensities across all LC-MS runs will be used for supervised or unsupervised clustering analyses. LIMMA analysis of aligned features can then be performed to calculate fold-changes in intensity for each aligned feature across all LC-MS runs which, in turn, can be used to assign a measure of statistical significance for the observed fold changes, for the given dataset. When missing values are replaced prior to clustering, this will produce highly artificial ratios, which can be very misleading if interpreted improperly. Therefore, great care must be taken in applying this optional functionality prior to clustering analysis. For example, if a given feature in the 'control' population aligned across 19 of 20 runs, then replacing the missing feature could be beneficial, since this is likely a 'real' feature that was missed in just one run by the MS or software tools. However, it may be unwise to replace all 17 missing features for another feature that aligned across only 3 runs.
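The 19-of-20 versus 3-of-20 example suggests a simple safeguard: only fill gaps for features that were aligned in most runs. The Python sketch below encodes one such rule. The 90% threshold, the function names and the use of `None` for missing values are assumptions of this example; Corra's 'n/a replace' option itself leaves this judgment to the user, and a feature that is deliberately absent from an entire sample group is a separate case that this rule does not address.

```python
def presence_fraction(intensities):
    """Fraction of runs in which the feature was aligned (value not None)."""
    return sum(v is not None for v in intensities) / len(intensities)


def replace_if_mostly_present(intensities, fill_value, min_fraction=0.9):
    """Fill gaps only when the feature is present in most runs.

    Returns the filled list, or None when the feature is too sparse for
    replacement to be defensible under this hypothetical rule.
    """
    if presence_fraction(intensities) < min_fraction:
        return None
    return [fill_value if v is None else v for v in intensities]


mostly_present = [100.0] * 19 + [None]    # aligned in 19 of 20 runs
sparse = [100.0] * 3 + [None] * 17        # aligned in 3 of 20 runs

filled = replace_if_mostly_present(mostly_present, fill_value=5.0)
skipped = replace_if_mostly_present(sparse, fill_value=5.0)
```

Here the first feature has its single gap filled, while the second is left out of the replaced dataset, so that its intensities are not dominated by substituted values.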
Nevertheless, there are clearly situations where the ability to replace missing features with a nominal value is of use, hence the provision of this function. An example of such a case is given below, where LC-MS profiling was performed on phosphopeptides isolated from a specific protein kinase knockout strain of yeast, in comparison to a wild-type control strain. In this case, we expected missing features in the knockout, when compared with the control. Thus, by using the missing feature replacement function wisely, we were able to successfully cluster the data to identify phosphopeptides that were not present in the profile from the knockout yeast strain. Similarly, in a typical biomarker discovery workflow, there may be markers only present (or absent) in the disease samples, due to a change in gene expression versus the control samples. Thus if one were to observe features that aligned across all (or most) cases, but not the controls, the judicious use of this function would similarly help identify such features. However, since the ratios it generates are highly artificial and therefore open to misinterpretation, it is up to the individual user to ensure that they use this particular function wisely, and to report it when they do so. Indeed, it should be stressed that there are many ways in which high-dimensional data, such as LC-MS data, can be validly analyzed. Thus it is always incumbent on the individual user to first consult the literature, and/or a suitably qualified biostatistician, before embarking on such complex statistical analyses. Finally, the Bioconductor packages implemented so far were chosen for applicability to our current proteomics research. However, alternative approaches to both statistical data analysis and missing feature replacement are enabled by Corra's open software architecture.
With the built-in converter to R's ExpressionSet file format, a user can readily extend the Corra pipeline or plug in their own preferred Bioconductor packages, as discussed under Methods above.
Examples of Corra application to biological studies
We next used Corra for the analysis of LC-MS data from two biological pilot studies, as examples of commonly performed proteomic LC-MS experiments. The first goal of these biological studies was to validate the Corra platform's capability to take LC-MS data all the way through to the identification of statistically credentialed, differentially abundant peptides for targeted MS/MS identification. The second goal was to provide 'real life' examples of discovery-based proteomic experiments to illustrate the type of experiments Corra is useful for analyzing, and to show the type of information it can provide for the biologist end user. These two studies were also chosen because they highlight different aspects of Corra and its flexibility.
The first of these is from a pilot type 2 diabetes biomarker discovery project using human plasma. Here we wanted to be able to initially classify the samples according to disease state via label-free LC-MS analysis, then subsequently identify differentially abundant peptides via MS/MS. The second is from a study to identify candidate protein kinase substrates in vivo via LC-MS phosphopeptide profiling, using kinase deletion strains of the yeast S. cerevisiae. Here we show an example using a yeast deletion strain for the kinase Ark1. In this case, unlike the type 2 diabetes study, we expected the phosphopeptides of interest to be completely absent from the LC-MS profiles compared with a wild-type strain, requiring a different analytical strategy using Corra. Again, we subsequently identified the missing phosphopeptides of interest by MS/MS reanalysis of the wild-type yeast strain.
Application of Corra to plasma biomarker discovery for human type 2 diabetes
The purpose of the pilot study presented here was to apply current LC-MS quantitative proteomics technology to try to identify potential type 2 diabetes candidate plasma biomarkers via profiling of (formerly) N-glycosylated peptides (N-glycosite peptides). To do this, N-glycosite peptides were isolated from plasma samples collected from control individuals with normal glucose tolerance (NGT), as well as from newly diagnosed cases of type 2 diabetes (DB).
N-glycosite peptide isolates were thus prepared from 13 individual NGT plasma samples and 9 individual DB plasma samples, as previously described and summarized above under Methods. The 22 samples were then randomized for LC-MS analysis, followed by 2 additional technical LC-MS replicate analyses of all 22, each with a new randomization of sample run order to reduce potential bias, for a total of 66 LC-MS runs. Following conversion of the raw data to mzXML format, the 66 data files were input into Corra for feature picking and alignment by SpecArray. The aligned datasets were then imported into the CorraStatistics.R module for statistical analysis, as described above.
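The run-order design described above (22 samples, three passes, each pass freshly randomized) can be sketched as follows. This is a hypothetical Python illustration, not the procedure actually used to schedule the instrument; the sample labels, the function name and the seed are assumptions chosen for the example.

```python
import random


def randomized_run_order(sample_ids, replicates, seed=None):
    """Build an LC-MS run schedule in which every replicate pass contains
    each sample exactly once, in an independently shuffled order.

    Returns a list of (replicate_number, sample_id) tuples in run order.
    """
    rng = random.Random(seed)  # a fixed seed makes the schedule reproducible
    schedule = []
    for rep in range(1, replicates + 1):
        block = list(sample_ids)
        rng.shuffle(block)     # new randomization for every pass
        schedule.extend((rep, sample) for sample in block)
    return schedule


# 13 NGT and 9 DB samples, as in the pilot study; the labels are made up.
samples = [f"NGT_{i:02d}" for i in range(1, 14)] + [f"DB_{i:02d}" for i in range(1, 10)]
schedule = randomized_run_order(samples, replicates=3, seed=2008)
```

Shuffling within each pass, rather than across all 66 runs at once, keeps the three technical replicates of a sample spread over the acquisition period, so that instrument drift is less likely to be confounded with disease group.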
Top 20 most discriminatory peptides for diabetes identified by follow-up targeted MS/MS analysis.
[Table body not recovered. Surviving column headers: B (log odds); IPI accession number. Surviving multi-accession entries: (1) IPI00217471, IPI00220706, IPI00473011, IPI00554676, IPI00654755, IPI00657660, IPI00657911, IPI00744503, IPI00749035, IPI00784636, IPI00791558, IPI00796636; (2) IPI00168728, IPI00399007, IPI00426051, IPI00470657, IPI00472345, IPI00784807, IPI00784942.]
Finally, it should be noted that these data are from a pilot study and, as such, none of the proteins identified were, or should be, considered candidate markers for diabetes without further studies. Nevertheless, they do illustrate how the Corra framework was implemented to determine, and then identify, targets of interest in an LC-MS-based biomarker discovery workflow. In turn, they also show how Corra could be of use to biologists or other researchers interested in LC-MS data analysis of any similar comparison of related physiological states.
Application of Corra to phosphopeptide profiling of an Ark1 kinase knockout yeast strain
As a second example of Corra application, we analyzed LC-MS profiles of phosphopeptides isolated from a wild-type yeast strain, for comparison to those obtained from a yeast strain lacking the protein kinase Ark1. The goal was to see whether we could identify potential Ark1 target proteins and phosphorylation sites. Unlike the human diabetes pilot study above, here we fully expected to observe the 'missing features' effect in the Ark1 deletion strain. Additionally, these analyses were performed on a very high mass accuracy OrbiTrap-LTQ spectrometer, and SuperHirn, rather than SpecArray, was used for feature picking and alignment within Corra.
We therefore prepared total phosphopeptide isolates from the two yeast strains, as previously described and summarized under Methods. The two samples were then analyzed on an OrbiTrap-LTQ spectrometer, in triplicate (6 LC-MS runs in total), limiting the LC-retention time range for data analysis to the 20 to 90 minute window, since this was the region where the peptides eluted for these analyses. On average, SuperHirn detected ~23,300 features per LC-MS run, with 54,059 total detected features. Of these, 6,840 aligned across all six LC-MS runs, with 22,562 that aligned across three or more LC-MS runs.
Since there were 2 biological samples, each analyzed in triplicate, we took the 22,562 features that aligned across three or more runs for importing into the CorraStatistics.R module, to search for the differentially abundant features between the control and the kinase knockout yeast strains. In this study, we were especially interested in phosphopeptides not detected in the knockout strain versus the control strain, since these would represent potential targets for the missing kinase. To do this, we used the 'n/a replace' (for missing values) functionality within Corra to set features missing from any of the 6 runs to the minimum intensity value detected for the entire dataset. While we recognized that this would generate artificial ratios when we performed the analysis, it nevertheless provided us with the information we needed, since we were, ultimately, only interested in peptide identity here, and not a quantitative measure of differential abundance.
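The selection logic in this paragraph (keep features aligned across three or more runs, fill the remaining gaps with the dataset minimum, then look for features that are high in wild type relative to the knockout) can be sketched as below. This is a simplified Python illustration under assumptions: the feature IDs and intensities are invented, the minimum is taken over the features retained after filtering, and a plain log2 ratio of group means stands in for the LIMMA analysis performed in CorraStatistics.R.

```python
import math
from statistics import mean


def aligned_in_at_least(matrix, min_runs):
    """Keep features with a measured intensity in at least `min_runs` runs."""
    return {
        fid: row for fid, row in matrix.items()
        if sum(v is not None for v in row) >= min_runs
    }


def fill_with_minimum(matrix):
    """Replace missing values (None) with the minimum measured intensity."""
    floor = min(v for row in matrix.values() for v in row if v is not None)
    return {fid: [floor if v is None else v for v in row] for fid, row in matrix.items()}


def log2_ratio(row, groups, numerator, denominator):
    """log2 of the ratio of mean intensities between two groups of runs."""
    num = mean([v for v, g in zip(row, groups) if g == numerator])
    den = mean([v for v, g in zip(row, groups) if g == denominator])
    return math.log2(num / den)


# Hypothetical data: three wild-type (WT) runs, then three knockout (KO) runs.
groups = ["WT", "WT", "WT", "KO", "KO", "KO"]
matrix = {
    "p1": [900.0, 850.0, 910.0, None, None, None],     # absent from knockout
    "p2": [500.0, 520.0, 480.0, 510.0, 490.0, 505.0],  # unchanged
    "p3": [400.0, None, None, None, None, None],       # aligned in only one run
    "p4": [None, None, None, 12.0, 10.0, 11.0],        # seen only in knockout
}

kept = aligned_in_at_least(matrix, min_runs=3)   # drops p3
filled = fill_with_minimum(kept)                 # gaps become 10.0
ranked = sorted(
    ((fid, log2_ratio(row, groups, "WT", "KO")) for fid, row in filled.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

Here `p1` rises to the top of `ranked` with a very large log2 ratio. That ratio is an artifact of the fill value, not a measurement, which is why the text treats the output as a list of candidate identities rather than as quantitative fold changes.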
Top 12 most discriminatory yeast phosphopeptides identified by follow-up targeted MS/MS.
[Table body not recovered. Surviving column header: B (log odds).]
Here we present Corra, a complete, free and open source computational platform that enables LC-MS-based proteomic workflows, and as such, addresses an unmet need in the LC-MS proteomics field. The Corra framework extends and adapts existing algorithms used for LC-MS-based proteomics, as well as existing statistical algorithms from the microarray community suitable for the analysis of high-dimensional LC-MS data, with a view to controlling false-discovery rates. Corra also adapts additional software engineering technologies, such as distributed processing and Google Web Toolkit. Corra enables the use of an array of LC-MS data alignment tools, integrating them with various forms of statistical analyses via Bioconductor, for the generation of statistically validated lists of differentially abundant peptide features for subsequent targeted LC-MS/MS identification, all within a single computational platform. Corra is not intended as an alternative to pre-existing LC-MS analytical tools, such as SpecArray, SuperHirn, msInspect [8, 13], PEPPeR, and others, but rather provides a framework for using such tools in a way that can overcome some of their limitations, such as a limited capacity to process large datasets, via Corra's distributed computation process. Corra effectively obviates cross-tool incompatibilities through the use of a new common data format, APML, that allows for rich data annotation from multiple experimental workflows, as well as a set of parsers between different data input/output formats, including R's ExpressionSet format for statistical data analysis. Finally, in part through its implementation via a web-based GUI, Corra fills a noticeable gap in the field of LC-MS, or 'label-free', quantitative proteomics, in that it allows biologists and other researchers, not just those trained in bioinformatics, to process, visualize and analyze their data in the manner of their choosing, all within a single application, and on their own workstation.
In the version of Corra presented here, two of the aforementioned LC-MS tools, SpecArray and SuperHirn, were implemented, as were the Bioconductor packages LIMMA and maSigPro for statistical analysis. However, with the description of a common file format for LC-MS data, APML, and with the parsers and documentation also provided with Corra, the ability to incorporate additional software tools and statistical algorithms for additional workflows is fully supported. This allows a user with appropriate bioinformatics training to set up Corra on any Linux server, either as is, or customized for their own or their group's workflows. Finally, Corra is provided as open source (Apache 2.0), and may be downloaded, along with all relevant documentation. Corra is also available from Sourceforge.net (keyword search: Corra). Further development of Corra and APML remains ongoing. Upcoming releases of Corra will include an APML-ready version of the LC-MS analysis tool msInspect, additional functionality to allow for the extraction and quantification of LC-MS data that includes the incorporation of stable isotope labels, and Corra outputs in a format suitable for downstream interaction with selected reaction monitoring-based workflows.
APML: annotated putative peptide markup language
DB: type 2 diabetes
GUI: graphical user interface
LC-MS: liquid chromatography mass spectrometry
LIMMA: linear models for microarray data
MS/MS: tandem mass spectrometry
NGT: normal glucose tolerance
OGTT: oral glucose tolerance test
PNGase F: protein N-glycosidase F
TOF: time of flight
XML: extensible markup language.
This work was supported, in part, with federal funds from the National Heart, Lung, and Blood Institute Seattle Proteome Center (contract No. N01-HV-8179 to RA), the National Cancer Institute (contract No. N01-CO-12400 to JW), the National Institute of Diabetes & Digestive & Kidney Disease (grant No. 1R21-DK71275 to JW), and by the Swiss National Science Foundation (grant No. 31000-10767 to RA). This work was also supported, in part, by a grant from the Entertainment Industry Foundation (EIF) and the EIF Women's Cancer Research Fund to the Breast Cancer Biomarker Discovery Consortium. The Consortium is a collaborative group focused on breast cancer biomarker discovery and includes: Battelle-Pacific Northwest National Laboratory, Broad Institute of MIT and Harvard, Dana-Farber Cancer Institute, Fred Hutchinson Cancer Research Center, Institute for Systems Biology, MD Anderson Cancer Center, and USC/Norris Comprehensive Cancer Center. Plasma samples for human type 2 diabetes were provided by the National Institute of Diabetes & Digestive & Kidney Disease, as a component of the R21 grant to JW (under PAR-04-076). The authors also wish to thank the Functional Proteomics Center, Zurich, Switzerland, especially B. Gerrits, for the use of their OrbiTrap mass spectrometer, and for their support, advice, and fruitful discussions.
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422(6928):198–207. 10.1038/nature01511
- Gillette MA, Mani DR, Carr SA: Place of pattern in proteomic biomarker discovery. J Proteome Res 2005, 4(4):1143–1154. 10.1021/pr0500962
- MacCoss MJ, Matthews DE: Quantitative MS for proteomics: teaching a new dog old tricks. Anal Chem 2005, 77(15):294A-302A. 10.1021/ac053431e
- Mueller LN, Brusniak MY, Mani DR, Aebersold R: An assessment of software solutions for the analysis of mass spectrometry based quantitative proteomics data. J Proteome Res 2008, 7(1):51–61. 10.1021/pr700758r
- Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R: Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 1999, 17(10):994–999. 10.1038/13690
- Ross PL, Huang YN, Marchese JN, Williamson B, Parker K, Hattan S, Khainovski N, Pillai S, Dey S, Daniels S, et al.: Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteomics 2004, 3(12):1154–1169. 10.1074/mcp.M400129-MCP200
- Ong SE, Blagoev B, Kratchmarova I, Kristensen DB, Steen H, Pandey A, Mann M: Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol Cell Proteomics 2002, 1(5):376–386. 10.1074/mcp.M200025-MCP200
- Bellew M, Coram M, Fitzgibbon M, Igra M, Randolph T, Wang P, May D, Eng J, Fang R, Lin C, et al.: A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS. Bioinformatics 2006, 22(15):1902–1909. 10.1093/bioinformatics/btl276
- Du P, Sudha R, Prystowsky MB, Angeletti RH: Data reduction of isotope-resolved LC-MS spectra. Bioinformatics 2007, 23(11):1394–1400. 10.1093/bioinformatics/btm083
- Jaffe JD, Mani DR, Leptos KC, Church GM, Gillette MA, Carr SA: PEPPeR, a platform for experimental proteomic pattern recognition. Mol Cell Proteomics 2006, 5(10):1927–1941. 10.1074/mcp.M600222-MCP200
- Katajamaa M, Miettinen J, Oresic M: MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics 2006, 22(5):634–636. 10.1093/bioinformatics/btk039
- Li XJ, Yi EC, Kemp CJ, Zhang H, Aebersold R: A software suite for the generation and comparison of peptide arrays from sets of data collected by liquid chromatography-mass spectrometry. Mol Cell Proteomics 2005, 4(9):1328–1340. 10.1074/mcp.M500141-MCP200
- May D, Fitzgibbon M, Liu Y, Holzman T, Eng J, Kemp CJ, Whiteaker J, Paulovich A, McIntosh M: A platform for accurate mass and time analyses of mass spectrometry data. J Proteome Res 2007, 6(7):2685–2694. 10.1021/pr070146y
- Mayr BM, Kohlbacher O, Reinert K, Sturm M, Gropl C, Lange E, Klein C, Huber CG: Absolute myoglobin quantitation in serum by combining two-dimensional liquid chromatography-electrospray ionization mass spectrometry and novel data analysis algorithms. J Proteome Res 2006, 5(2):414–421. 10.1021/pr050344u
- Mueller LN, Rinner O, Schmidt A, Letarte S, Bodenmiller B, Brusniak MY, Vitek O, Aebersold R, Muller M: SuperHirn – a novel tool for high resolution LC-MS-based peptide/protein profiling. Proteomics 2007, 7(19):3470–3480. 10.1002/pmic.200700057
- Smith CA, Want EJ, O'Maille G, Abagyan R, Siuzdak G: XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 2006, 78(3):779–787. 10.1021/ac051437y
- Seattle Proteome Center (SPC) – Corra [http://tools.proteomecenter.org/Corra/corra.html]
- Desiere F, Deutsch EW, Nesvizhskii AI, Mallick P, King NL, Eng JK, Aderem A, Boyle R, Brunner E, Donohoe S, et al.: Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry. Genome Biol 2005, 6(1):R9. 10.1186/gb-2004-6-1-r9
- Zhang H, Loriaux P, Eng J, Campbell D, Keller A, Moss P, Bonneau R, Zhang N, Zhou Y, Wollscheid B, et al.: UniPep–a database for human N-linked glycosites: a resource for biomarker discovery. Genome Biol 2006, 7(8):R73. 10.1186/gb-2006-7-8-r73
- Bioconductor: open source software for bioinformatics [http://www.bioconductor.org/]
- Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3.
- Conesa A, Nueda MJ, Ferrer A, Talon M: maSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments. Bioinformatics 2006, 22(9):1096–1102. 10.1093/bioinformatics/btl056
- Keller A, Eng J, Zhang N, Li XJ, Aebersold R: A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol Syst Biol 2005, 1.
- Proteomic and metabolomic approaches to diagnose diabetes and pre-diabetes [http://grants.nih.gov/grants/guide/pa-files/PAR-04-076.html]
- Zhou Y, Aebersold R, Zhang H: Isolation of N-linked glycopeptides from plasma. Anal Chem 2007, 79(15):5826–5837. 10.1021/ac0623181
- Schmidt A, Gehlenborg N, Bodenmiller B, Mueller LN, Campbell D, Mueller M, Aebersold R, Domon B: An integrated, directed mass spectrometric approach for in-depth characterization of complex peptide mixtures. Mol Cell Proteomics 2008, 7(11):2138–2150. 10.1074/mcp.M700498-MCP200
- Bodenmiller B, Mueller LN, Mueller M, Domon B, Aebersold R: Reproducible isolation of distinct, overlapping segments of the phosphoproteome. Nat Methods 2007, 4(3):231–237. 10.1038/nmeth1005
- Bodenmiller B, Mueller LN, Pedrioli PG, Pflieger D, Junger MA, Eng JK, Aebersold R, Tao WA: An integrated chemical, mass spectrometric and computational strategy for (quantitative) phosphoproteomics: application to Drosophila melanogaster Kc167 cells. Mol Biosyst 2007, 3(4):275–286. 10.1039/b617545g
- Keller A, Nesvizhskii AI, Kolker E, Aebersold R: Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem 2002, 74(20):5383–5392. 10.1021/ac025747h
- Pedrioli PG, Eng JK, Hubley R, Vogelzang M, Deutsch EW, Raught B, Pratt B, Nilsson E, Angeletti RH, Apweiler R, et al.: A common open representation of mass spectrometry data and its application to proteomics research. Nat Biotechnol 2004, 22(11):1459–1466. 10.1038/nbt1031
- Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S: Bioinformatics and Computational Biology Solutions Using R and Bioconductor. New York: Springer; 2005.
- Ko GT, Chan JC, Woo J, Lau E, Yeung VT, Chow CC, Cockram CS: The reproducibility and usefulness of the oral glucose tolerance test in screening for diabetes and other cardiovascular risk factors. Ann Clin Biochem 1998, 35(Pt 1):62–67.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.