The APEX Quantitative Proteomics Tool: Generating protein quantitation estimates from LC-MS/MS proteomics results
- John C Braisted†1,
- Srilatha Kuntumalla†1,
- Christine Vogel2,
- Edward M Marcotte2,
- Alan R Rodrigues1,
- Rong Wang1,
- Shih-Ting Huang1,
- Erik S Ferlanti1,
- Alexander I Saeed1,
- Robert D Fleischmann1,
- Scott N Peterson1 and
- Rembert Pieper1
© Braisted et al; licensee BioMed Central Ltd. 2008
Received: 07 August 2008
Accepted: 09 December 2008
Published: 09 December 2008
Mass spectrometry (MS) based label-free protein quantitation has mainly focused on analysis of ion peak heights and peptide spectral counts. Most analyses of tandem mass spectrometry (MS/MS) data begin with an enzymatic digestion of a complex protein mixture to generate smaller peptides that can be separated and identified by an MS/MS instrument. Peptide spectral counting techniques attempt to quantify protein abundance by counting the number of detected tryptic peptides and their corresponding MS spectra. However, spectral counting is confounded by the fact that peptide physicochemical properties severely affect MS detection, resulting in each peptide having a different detection probability. Lu et al. (2007) described a modified spectral counting technique, Absolute Protein Expression (APEX), which improves on basic spectral counting methods by including a correction factor for each protein (called the O i value) that accounts for variable peptide detection by MS techniques. The technique uses machine learning classification to derive peptide detection probabilities that are used to predict the number of tryptic peptides expected to be detected for one molecule of a particular protein (O i ). This predicted spectral count is compared to the protein's observed MS total spectral count during APEX computation of protein abundances.
The APEX Quantitative Proteomics Tool, introduced here, is a free open source Java application that supports the APEX protein quantitation technique. The APEX tool uses data from standard tandem mass spectrometry proteomics experiments and provides computational support for APEX protein abundance quantitation through a set of graphical user interfaces that partition the parameter controls for the various processing tasks. The tool also provides a Z-score analysis for identification of significant differential protein expression, a utility to assess APEX classifier performance via cross validation, and a utility to merge multiple APEX results into a standardized format in preparation for further statistical analysis.
The APEX Quantitative Proteomics Tool provides a simple means to quickly derive hundreds to thousands of protein abundance values from standard liquid chromatography-tandem mass spectrometry proteomics datasets. The APEX tool provides a straightforward intuitive interface design overlaying a highly customizable computational workflow to produce protein abundance values from LC-MS/MS datasets.
The field of proteomics has used mass spectrometry (MS) techniques to provide qualitative results that describe the protein complement of complex protein samples. Researchers also use modifications of these MS technologies for the quantitative analysis of proteins in complex samples [1–3], and often hundreds to thousands of proteins are quantified per experiment. Some quantitative techniques involve peptide isotopic labeling [4–8]. In contrast, label-free techniques have focused on analysis of MS/MS peak heights or observed peptide spectral count information [9–12]. Peptides are produced in an enzymatic digestion of the protein mixture, often using trypsin, which generally cleaves the proteins at the C-terminus of lysine or arginine amino acid residues.
Spectral counting techniques typically infer the relative quantity of a protein by counting the number of MS detected tryptic peptides associated with the protein being quantified as a fraction of all observed peptide counts. However, spectral counting can be confounded by the fact that the likelihood of peptide detection by MS techniques can vary greatly from one peptide to another based on the particular physicochemical properties of the peptide sequences. Peptide physicochemical properties can affect final MS detection through several factors, such as the ability to recover peptides during the cation exchange and reversed phase LC stages of sample preparation, variation in ionization efficiency of the peptide in the ion source of a particular MS instrument, and effects on mass analysis in MS and MS/MS modes [9, 14, 15]. Properties such as peptide length, mass, amino acid composition, solubility, and net charge, among others, can impact peptide detection. This variability in peptide detection can lead to errors in assessing the abundance of the parent protein producing the tryptic peptides.
Lu et al. have described a novel technique for protein quantitation, Absolute Protein Expression measurements (APEX), where machine learning techniques are used to improve quantitation results over basic spectral counting. In the APEX technique, a supervised classification algorithm is used to predict the probability of peptide detection by MS based on the peptide's physicochemical properties. For each protein in the sample, the expected number of peptide observations (spectral counts) is computed based on predicted MS detectability of the corresponding tryptic peptides. In other words, the computationally predicted (expected) spectral counts are corrected for the variable peptide detection probabilities related to peptide physicochemical properties and the specific MS technology in use.
APEX abundance estimates are absolute in the sense that they are not relative to a second dataset representing a different condition or control, as is done in some relative protein quantitation methods such as SILAC. Also, the abundance estimates within a sample are normalized and can be readily compared to estimates from other samples. While a particular protein's abundance is presented relative to all proteins within the sample, multiplication by C puts the abundance values into absolute terms.
This paper describes a new software tool, the APEX Quantitative Proteomics Tool, an implementation of the APEX technique for the quantitation of proteins based on LC-MS/MS proteomics results. The main role of the tool is to compute APEX protein abundance values using equation 1; however, the tool also supports preparation of prior information, such as derivation of O i values for proteins under study, as well as post-processing data analysis.
The second processing task is the generation of an O i value for each protein under study. This step uses the generated training data, peptide physicochemical properties and peptide MS detection calls, to build a classifier to predict peptide detection probabilities. Each protein sequence from a supplied FASTA sequence file undergoes an in silico trypsin digestion and each peptide is assigned an MS detection probability. The probabilities for each peptide derived from protein i are summed to produce the protein's O i value. This O i value is the predicted peptide detection (spectral) count for one molecule of protein i.
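The O i computation described in this step can be sketched in a few lines of Python. This is an illustration only, not the tool's Java implementation: the simplified cleavage rule (cut after every K or R, no proline exception, no missed cleavages), the function names, and the `detect_prob` callable standing in for the trained classifier are all assumptions made for the example.

```python
from typing import Callable, List


def tryptic_digest(protein: str, min_len: int = 1) -> List[str]:
    """In silico digestion: cut after every lysine (K) or arginine (R).

    Assumption: simplified trypsin rule with no proline exception
    and no missed cleavages.
    """
    peptides: List[str] = []
    current: List[str] = []
    for residue in protein:
        current.append(residue)
        if residue in "KR":
            peptides.append("".join(current))
            current = []
    if current:  # C-terminal peptide that does not end in K or R
        peptides.append("".join(current))
    return [p for p in peptides if len(p) >= min_len]


def expected_count(protein: str,
                   detect_prob: Callable[[str], float],
                   min_len: int = 1) -> float:
    """O_i: sum of predicted detection probabilities over tryptic peptides."""
    return sum(detect_prob(p) for p in tryptic_digest(protein, min_len))


def toy_prob(peptide: str) -> float:
    """Hypothetical stand-in for a trained classifier (not a real model)."""
    return min(1.0, len(peptide) / 10.0)


if __name__ == "__main__":
    print(tryptic_digest("MKTAYIAKQRLGK"))
    print(expected_count("MKTAYIAKQRLGK", toy_prob))
```

In practice `detect_prob` would wrap a classifier trained on the peptide property matrix, so the resulting O i values would reflect the instrument and protocol behind the training data.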
The third processing task uses the previously generated O i values and LC-MS/MS experimental results, which provide n i and p i , to produce protein abundance values according to equation 1. These quantitation results can be piped into several post-processing tools.
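As a rough illustration of this third task, the sketch below computes abundance values from n i, p i, and O i. Equation 1 is not reproduced in this text, so the form used here is an assumption based on the published description of the APEX technique: each protein's corrected count n i p i / O i is divided by the sum of corrected counts over all proteins in the sample and then multiplied by the constant C. The function name and input layout are hypothetical.

```python
from typing import Dict, Tuple


def apex_abundances(observations: Dict[str, Tuple[float, float, float]],
                    c: float = 1.0) -> Dict[str, float]:
    """Compute APEX-style abundances.

    observations maps a protein id to (n_i, p_i, O_i):
      n_i  observed total spectral count
      p_i  protein identification probability
      O_i  expected spectral count for one molecule of the protein
    c is the scaling constant C (assumed form; see the lead-in).
    """
    corrected: Dict[str, float] = {}
    for protein_id, (n, p, o) in observations.items():
        if o <= 0:
            raise ValueError(f"O_i must be positive for {protein_id}")
        corrected[protein_id] = n * p / o

    total = sum(corrected.values())
    if total == 0:
        raise ValueError("no spectral counts to normalize")

    # Normalize within the sample, then scale by C.
    return {pid: c * value / total for pid, value in corrected.items()}


if __name__ == "__main__":
    sample = {"protA": (10, 1.0, 2.0), "protB": (5, 1.0, 1.0)}
    print(apex_abundances(sample, c=1000.0))
```

In this toy sample protA has twice the spectral count of protB but also twice the expected count per molecule, so both proteins receive the same abundance estimate, which is the correction that O i is meant to provide.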
Building Training Data
Several peptide physicochemical properties are computed for each of the corresponding tryptic peptides; the APEX tool supports the computation of as many as 35 different properties. Among these properties are peptide mass, length, amino acid composition, and properties related to charge, hydrophobicity measures, and amino acid frequencies within secondary peptide structures. The value of each peptide property, in terms of predicting whether a peptide will be observed by MS, varies based on the MS technology in use. The 35 peptide properties available in the APEX tool are a combination of properties identified in the APEX technique paper by Lu et al. and of those described in the paper by Mallick et al. The list of peptide properties can be found in the APEX manual's Appendix.
Next, prior MS result files in standard protXML format are searched for each tryptic peptide sequence and each tryptic peptide is given a peptide MS detection call which categorizes it as being either observed or not observed within the MS result. The input protXML MS result files are generated by preprocessing standard SEQUEST or Mascot files using PeptideProphet™ and ProteinProphet™ which are part of the Trans-Proteomic Pipeline (TPP) [17, 18]. Once the peptide MS detection calls have been determined, the data is output in a matrix format as depicted in figure 2. Each row in the matrix captures data related to a single peptide and includes a set of computed peptide physicochemical properties and the peptide MS detection call. The training data is output to a file in the Attribute-Relation File Format (ARFF). The ARFF format represents the matrix of training values in a comma delimited format and has a section that identifies the attributes or columns in the data matrix. The ARFF format is used as input by the Weka collection of machine learning data mining tools [19, 20]. The generated training data file merges peptide properties and peptide MS detection calls, and will be used to train a classifier to compute peptide detection probabilities based on peptide physicochemical properties.
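To make the training data layout concrete, the following sketch writes a small peptide matrix in ARFF: a header that declares each column, followed by comma delimited data rows. The relation name, attribute names, and class labels are hypothetical and are not taken from the files the APEX tool actually produces.

```python
from typing import List, Sequence


def to_arff(relation: str,
            attribute_names: Sequence[str],
            rows: Sequence[Sequence[float]],
            calls: Sequence[str]) -> str:
    """Render peptide properties plus a detection call as ARFF text.

    Each row holds the numeric properties of one peptide; calls holds the
    matching class label ("observed" or "not_observed", assumed names).
    """
    if len(rows) != len(calls):
        raise ValueError("rows and calls must have the same length")

    lines: List[str] = [f"@relation {relation}", ""]
    for name in attribute_names:
        lines.append(f"@attribute {name} numeric")
    # Nominal class attribute: the peptide MS detection call.
    lines.append("@attribute detected {observed,not_observed}")
    lines += ["", "@data"]
    for values, call in zip(rows, calls):
        lines.append(",".join(str(v) for v in values) + "," + call)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_arff("peptide_training",
                  ["mass", "length"],
                  [[1045.5, 9], [800.4, 7]],
                  ["observed", "not_observed"]))
```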
O i Value Generation
The classification algorithms in the APEX tool are implementations from the Weka data mining software package. Weka is an extensive collection of open source machine learning algorithms implemented in Java [19, 20]. The APEX tool allows the user to select from three different Weka classifier algorithms: Random Forest, RIDOR (Ripple Down Rule Learner), and J4.8 Decision Trees. The original work on the APEX technique showed that averaging classifier models through bootstrap aggregating (bagging) improved classifier performance. This work also found performance improvements when building the classifier as a cost sensitive classifier to account for the bias in the training data toward non-observed peptides; the training data peptides were not evenly split between observed and non-observed, with non-observed peptides being more prevalent. The APEX tool provides options for both bagging and cost sensitive classification. Although the APEX tool provides three classifier options by default, the tool also includes a classifier configuration file that can be edited to allow one to configure the APEX tool to make use of any classifier algorithm implemented in the Weka tool set. The classifier configuration file lists the available classifiers and defines the unique parameter attributes for each classifier. Random Forest is the default classification algorithm within the APEX tool since this algorithm had been found to perform best. The Random Forest classifier has also worked well in our evaluation. The tool has a utility to permit users to evaluate classifier performance in the context of their own data.
Computing APEX Quantitation Values
The FPR can be used to select a subset of high confidence proteins on which to perform APEX quantitation. The APEX tool thereby provides the user with a choice to determine the cutoff FPR for APEX quantitation; typical cutoffs are 1% or 5%. Following the selection of the protein list, an output file with the APEX quantitation results is generated. The output file captures protein identifier or accession, protein descriptive annotation available in the protXML input file, input parameters, input file paths, input MS values (n i and p i ), O i values, and the APEX abundance values.
APEX Tool Implementation Details and Architecture Overview
User interface classes are separated from processing classes by the use of a processing event dispatcher that spawns processing threads as needed. Developers can easily add new processing tasks by extending an abstract process panel class that presents parameter controls and by adding a new processing class or adding methods to the core processing class. Numerical constants, such as amino acid level physicochemical properties, are collected in a single class. Unified Modeling Language (UML) class diagrams that cover several of the key Java classes within version 1.0 of the APEX tool are available within the APEX tool source code download.
Results and discussion
In addition to protein quantitation, the APEX tool also offers basic utilities for post processing of quantitation results (Figure 5, Utilities and Analysis interface panel). One utility merges multiple APEX result files into a tab delimited matrix that contains protein quantitation results. Each data row contains protein annotations and a set of abundance values that represent protein expression over the various conditions under study. Results are aligned so that each row represents a vector of abundance values for a particular protein. The tab delimited data matrix can be loaded directly into the MultiExperiment Viewer (MeV). MeV contains many methods to cluster proteins based on expression profile and can perform statistical analyses to find proteins showing differential expression in accordance with the experimental conditions.
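The alignment performed by the merge utility can be pictured with the short sketch below, which combines per-condition abundance tables into one tab delimited matrix with a row per protein. The function name, the header layout, and the "NA" placeholder for proteins absent from a condition are assumptions; how the APEX tool itself handles missing proteins is not stated here.

```python
from typing import Dict, List


def merge_results(results: Dict[str, Dict[str, float]],
                  missing: str = "NA") -> str:
    """Merge per-condition abundances into a tab delimited matrix.

    results maps a condition name to {protein id: abundance}.
    Rows are proteins (sorted by id), columns are conditions.
    """
    conditions: List[str] = list(results)
    proteins = sorted({pid for table in results.values() for pid in table})

    lines = ["\t".join(["protein"] + conditions)]
    for pid in proteins:
        row = [pid] + [str(results[c].get(pid, missing)) for c in conditions]
        lines.append("\t".join(row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(merge_results({
        "control": {"protA": 500.0, "protB": 120.0},
        "treated": {"protA": 250.0, "protC": 75.0},
    }))
```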
The APEX tool also provides a two sample Z-score test for differential expression as described in Lu et al. This test handles experimental designs with two samples, each representing a condition or state under study. During the test each protein has a Z-score computed that reflects differential expression by considering the proportion of spectra in the two samples attributable to the protein being scored. Proteins with very different total spectral counts (n i ) between the two samples and whose spectral counts are sufficiently large tend to have high Z-scores. The formula and underlying assumptions behind the test are given in Lu et al. The Z-score has an associated p-value for each protein, which reflects the significance of the observed expression change by reporting the probability of having an absolute Z-score of the observed magnitude or greater. The APEX interface provides a graphical representation of the Z-score results from the two files to allow the selection of a significant protein list based on the user defined p-value cutoff (Figure 5, foreground panel). The test outputs a summary result file that contains a row for each protein listing the protein annotations, APEX values for the two conditions under study, APEX abundance fold change, n i values, computed Z-score and p-value.
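A test of this kind can be sketched as a standard pooled two-proportion Z-test on spectral counts. The exact formula used by the tool is given in Lu et al. and is not reproduced in this text, so the pooled form below is an assumption, as are the function and parameter names. The p-value is two-sided, matching the description of an absolute Z-score of the observed magnitude or greater.

```python
import math
from typing import Tuple


def spectral_count_z_test(n1: int, total1: int,
                          n2: int, total2: int) -> Tuple[float, float]:
    """Pooled two-proportion Z-test (assumed form).

    n1, n2          spectral counts for one protein in samples 1 and 2
    total1, total2  total spectral counts in samples 1 and 2
    Returns (z, two-sided p-value).
    """
    f1 = n1 / total1
    f2 = n2 / total2
    pooled = (n1 + n2) / (total1 + total2)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total1 + 1.0 / total2))
    if std_err == 0:
        raise ValueError("cannot score a protein with no variance in counts")

    z = (f1 - f2) / std_err
    # Two-sided tail probability of a standard normal: P(|Z| >= |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    print(spectral_count_z_test(100, 1000, 50, 1000))
```

Because the standard error shrinks as total counts grow, the same proportional change yields a larger Z-score for proteins with larger spectral counts, consistent with the behavior described above.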
The third utility provides classifier cross-validation which reports on the performance of the selected classifier and particular parameter selections. This process requires an input training data file and iteratively uses a randomly selected subset of the data to train the classifier and tests the classifier's ability to predict peptide detection calls on the rest of the data. A number of performance statistics are reported, for example true positive rate, false positive rate, prediction accuracy, and recall, that can be saved to a text file. This feature allows users to determine which classifiers, classifier parameters, and peptide properties perform best considering the nature of the data and the MS technology in use.
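The bookkeeping behind such a cross-validation can be sketched as follows: peptides are randomly partitioned into folds, and the predictions made on held-out peptides are summarized as rates. The fold scheme, function names, and the subset of statistics shown are assumptions for illustration; the tool relies on Weka for the actual evaluation, and training a classifier is outside the scope of this sketch.

```python
import random
from typing import Dict, List, Sequence


def k_fold_indices(n_items: int, k: int, seed: int = 0) -> List[List[int]]:
    """Randomly partition item indices into k disjoint test folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def detection_metrics(actual: Sequence[bool],
                      predicted: Sequence[bool]) -> Dict[str, float]:
    """Summarize predicted detection calls against the true calls.

    True means the peptide was observed. Recall equals the true
    positive rate.
    """
    tp = sum(a and p for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    return {
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "accuracy": (tp + tn) / len(actual) if len(actual) else 0.0,
    }


if __name__ == "__main__":
    folds = k_fold_indices(10, 3)
    print(folds)
    print(detection_metrics([True, True, False, False],
                            [True, False, True, False]))
```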
Several potential features are targeted as future enhancements of the APEX tool. Future versions of the tool will include improved support for selecting proteins for the generation of peptide sequences for classifier training data. The training data should include a set of peptides with sufficient representation of observed and non-observed peptides based on prior MS results. The future APEX tool will enable the user to set protein selection criteria such as number of proteins or peptides to include and a minimum p i value. The training data selection enhancement will also include randomized selection of training proteins from a larger pool of proteins that pass the imposed criteria. In addition, we will allow users to exclude degenerate peptides that map to more than one parent protein.
Data preprocessing options are another area of future development in the APEX tool. APEX computation requires a protein identification probability (p i ). The original APEX methodology paper and this implementation both depend on the Trans-Proteomic Pipeline (TPP) to preprocess MS data to compute the required p i values. Support for TPP derived input will continue but we will expand input options for users not using the TPP for upstream data processing.
Thus far, our data are based on peptide detection and fragmentation in 3D and linear ion trap mass spectrometers (LCQ and LTQ, Thermo Fisher Scientific Inc.). However, the APEX tool is not limited to data processing from these mass analyzers. The training data generation uses prior MS results to ensure that the training data reflects the peptide detection capabilities of the instrumentation in use. In turn, O i values generated from the training data will adjust based on peptide detection sensitivity tendencies of the instrument in use. The APEX protocol site [23, 24] has posted files containing O i values generated from both LCQ and LTQ-Orbitrap™ MS data for three different organisms, E. coli, yeast, and human. The APEX tool can be used to construct training data and generate O i values for data derived from any MS instruments, reflecting characteristics of the individual instruments. Additional peptide properties can be incorporated as they are identified by users as valuable toward improving peptide detection predictions. Future versions of the APEX tool will include a new utility to assess the predictive value of each peptide property given a particular training data set, classifier algorithm and associated parameters. The accuracy of estimated protein abundances depends on the quality of peptide detection probabilities. Further work in this area will refine understanding of peptide properties that are good predictors of peptide detection by particular MS techniques, as an extension of published work [14, 15].
The APEX Quantitative Proteomics Tool provides researchers with the ability to quantify proteins observed in LC-MS/MS proteomics data. This process requires generation of classifier training data and the computation of O i values, i.e., the expected spectral counts for each protein. Both the training data and O i values are based on prior MS results that in turn relate to the specific conditions within the user's protocol, including sample preparation procedures, MS technology, and instrumentation settings. Customized O i value generation, facilitated by the APEX Tool, means that the final quantitation values take into account the user's settings and are thus more accurate.
The APEX Tool is a free open source tool and has an intuitive user interface that logically subdivides the controls for the various processing tasks and utilities onto separate tabbed panels. The integrated help and information system and the manual describe both the mechanics of processing data as well as the precise details of how the data is handled at each step of the process. The APEX tutorial provides a step-by-step introduction for the first time user. Source code allows those interested in the computational details to fully explore the inner workings of the tool while the simple software architecture will allow developers to modify or expand on existing utilities.
Availability and requirements
Project name: APEX Quantitative Proteomics Tool
Project Home Page: http://pfgrc.jcvi.org/index.php/bioinformatics/apex.html
Operating Systems: Platform Independent
Programming Language: Java
Other Requirements: Java 1.5 or higher, Trans-Proteomic Pipeline (TPP) tools to process MASCOT dat files or SEQUEST HTML summary files to produce protXML input files. TPP tools: http://tools.proteomecenter.org/TPP.php
License: GNU GPL v3.0
Any restrictions to use by non-academics: None.
LC-MS/MS: Liquid Chromatography with Tandem Mass Spectrometry
APEX: Absolute Protein Expression
ARFF: Attribute Relation File Format
FPR: False Positive Rate.
The APEX Quantitative Proteomics Tool development, related software testing and laboratory validation were funded by contract No. N01-AI-15447 awarded by the National Institute of Allergy and Infectious Diseases (NIAID) of the National Institutes of Health to the J. Craig Venter Institute, Rockville, MD. Additional funding was provided by the National Science Foundation, the Welch and Packard Foundations (EMM), and the International Human Frontier Science Program (CV).
The authors wish to thank the members of the PFGRC Bioinformatics-Analysis Team, Vasily Sharov, Jianwei Li, Wei Liang, Chun-Hua Wan, Lynn Stevens, and Miguel Covarrubias, who reviewed and provided valuable feedback on the tool's design.
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422: 198–207. 10.1038/nature01511
- Mueller LN, Brusniak M, Mani DR, Aebersold R: An Assessment of Software Solutions for the Analysis of Mass Spectrometry Based Quantitative Proteomics Data. J Prot Res 2008, 7: 51–71. 10.1021/pr700758r
- Steen H, Pandey A: Proteomics goes quantitative: measuring protein abundance. Trends Biotech 2002, 20(9):361–364. 10.1016/S0167-7799(02)02009-7
- Conrads TP, Issaq HJ, Veenstra TD: New tools for quantitative phosphoproteome analysis. Biochem Biophys Res Commun 2002, 290: 885–890. 10.1006/bbrc.2001.6275
- Mirgorodskaya OA, Kozmin YP, Titov MI, Körner R, Sönksen CP, Roepstorff P: Quantitation of peptides and proteins by matrix-assisted laser desorption/ionization mass spectrometry using 18O-labeled internal standards. Rapid Commun Mass Spectrom 2000, 14: 1226–1232. 10.1002/1097-0231(20000730)14:14%3C1226::AID-RCM14%3E3.0.CO;2-V
- Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R: Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nature Biotech 1999, 17: 994–999. 10.1038/13690
- Zhou H, Ranish JA, Watts JD, Aebersold R: Quantitative proteome analysis by solid-phase isotope tagging and mass spectrometry. Nature Biotech 2002, 20: 512–515. 10.1038/nbt0502-512
- Ong SE, Blagoev B, Kratchmarova I, Kristensen DB, Steen H, Pandey A, Mann M: Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol Cell Proteomics 2002, 1: 376–386. 10.1074/mcp.M200025-MCP200
- Rappsilber J, Ryder U, Lamond AI, Mann M: Large-Scale Proteomic Analysis of the Human Spliceosome. Genome Res 2002, 12: 1231–1245. 10.1101/gr.473902
- Liu H, Sadygov RG, Yates JR III: A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Anal Chem 2004, 76(14):4193–4201. 10.1021/ac0498563
- Gao J, Opiteck GJ, Friedrichs MS, Dongre AR, Hefta SA: Changes in the protein expression of yeast as a function of carbon source. J Proteome Res 2003, 2(6):643–649. 10.1021/pr034038x
- Ishihama Y, Oda Y, Tabata T, Sato T, Nagasu T, Rappsilber J, Mann M: Exponentially modified protein abundance index (emPAI) for estimation of absolute protein amount in proteomics by the number of sequenced peptides per protein. Mol Cell Proteomics 2005, 4(9):1265–1272. 10.1074/mcp.M500061-MCP200
- Washburn MP, Wolters D, Yates JR III: Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotech 2001, 19: 242–247. 10.1038/85686
- Mallick P, Schirle M, Chen SS, Flory MR, Lee H, Martin D, Ranish J, Raught B, Schmitt R, Werner T, Kuster B, Aebersold R: Computational Prediction of Proteotypic Peptides for Quantitative Proteomics. Nat Biotech 2007, 25(1):125–131. 10.1038/nbt1275
- Tang H, Arnold RJ, Alves P, Xun Z, Clemmer DE, Novotny MV, Reilly JP, Radivojac P: A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics 2006, 22(14):e481-e488. 10.1093/bioinformatics/btl237
- Lu P, Vogel C, Wang R, Yao X, Marcotte EM: Absolute Protein Expression Profiling Estimates the Relative Contributions of Transcriptional and Translational Regulation. Nat Biotech 2007, 25(1):117–124. 10.1038/nbt1270
- Keller A, Eng J, Zhang N, Li X, Aebersold R: A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol Syst Biol 2005, 1–17. 10.1038/msb4100024
- Trans-Proteomics Pipeline [http://tools.proteomecenter.org/TPP.php]
- Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2nd edition. San Francisco: Morgan Kaufmann Publishers; 2005.
- Weka Machine Learning Data Mining Tools [http://www.cs.waikato.ac.nz/ml/weka/]
- Keller A, Nesvizhskii A, Kolker E, Aebersold R: Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search. Anal Chem 2002, 74(20):5383–5392. 10.1021/ac025747h
- TM4 Software Suite's MultiExperiment Viewer [http://www.tm4.org/mev.html]
- APEX Protocol Website, Marcotte Lab [http://www.marcottelab.org/APEX_Protocol]
- Vogel C, Marcotte EM: Calculating Absolute and Relative Protein Abundance from Mass Spectrometry-based Protein Expression Data. Nat Protoc 2008, 3(9):1444–1451. 10.1038/nprot.2008.132
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.