PAnalyzer: A software tool for protein inference in shotgun proteomics
© Prieto et al.; licensee BioMed Central Ltd. 2012
Received: 2 June 2012
Accepted: 30 October 2012
Published: 5 November 2012
Protein inference from peptide identifications in shotgun proteomics must deal with ambiguities that arise due to the presence of peptides shared between different proteins, which is common in higher eukaryotes. Recently data independent acquisition (DIA) approaches have emerged as an alternative to the traditional data dependent acquisition (DDA) in shotgun proteomics experiments. MS E is the term used to name one of the DIA approaches used in QTOF instruments. MS E data require specialized software to process acquired spectra and to perform peptide and protein identifications. However the software available at the moment does not group the identified proteins in a transparent way by taking into account peptide evidence categories. Furthermore the inspection, comparison and report of the obtained results require tedious manual intervention. Here we report a software tool to address these limitations for MS E data.
In this paper we present PAnalyzer, a software tool focused on the protein inference process of shotgun proteomics. Our approach considers all the identified proteins and groups them when necessary indicating their confidence using different evidence categories. PAnalyzer can read protein identification files in the XML output format of the ProteinLynx Global Server (PLGS) software provided by Waters Corporation for their MS E data, and also in the mzIdentML format recently standardized by HUPO-PSI. Multiple files can also be read simultaneously and are considered as technical replicates. Results are saved to CSV, HTML and mzIdentML (in the case of a single mzIdentML input file) files. An MS E analysis of a real sample is presented to compare the results of PAnalyzer and ProteinLynx Global Server.
We present a software tool to deal with the ambiguities that arise in the protein inference process. Key contributions are support for MS E data analysis by ProteinLynx Global Server and technical replicates integration. PAnalyzer is an easy to use multiplatform and free software tool.
Shotgun proteomics is still the method most widely used for large-scale identification of proteins in complex biological samples[1, 2]. This approach involves proteolysis of the proteins followed by chromatographic separation of the resulting peptides coupled to tandem mass spectrometry analysis. Then, a search engine assigns fragmentation spectra to peptide sequences by scoring the similarity between observed spectra and the predicted spectra of peptide sequences in a database. Finally, the assigned peptides are grouped to deduce the list of identified proteins in a process termed protein inference. This process is not trivial, especially in the case of higher eukaryote organisms, where a significant amount of peptides derives from several proteins due to the presence of homologous proteins, isoforms or redundant entries in the sequence database. These shared or degenerate peptides introduce ambiguity in determining the identities of sample proteins, complicating the analysis and biological interpretation of the data obtained.
The manner in which different protein inference software tools manage these ambiguities differ (reviewed in). ProteinProphet, probably the most widely used software for peptide assembly, infers the most likely protein corresponding to degenerate peptides by apportioning such peptides among their corresponding proteins according to the estimated probabilities of those proteins in the sample, while in turn computing the protein probabilities taking into account those estimated apportionments. IsoformResolver uses a “peptide-centric” approach where the observed peptides are directly referenced against a peptide database with the context of all possible proteins from which they can derive, which are then clustered in the same group. PeptideClassifier[7, 8], besides MS data, considers data from other sources to facilitate peptide assembly. Each peptide is classified according to its information content with respect to protein sequences and gene models, and shared peptides are distinguished depending on whether the implied proteins could be encoded either by the same or by distinct gene models.
In the typical LC-MS/MS experiment, co-eluting peptides get into the mass spectrometer together and are selected for fragmentation on the basis of their abundance (data dependent acquisition or DDA). As an alternative, the mass spectrometer can be operated to fragment the entire range of co-eluting peptide ions without prior selection. Data-independent acquisition (DIA) based proteomics from complex mixtures of proteins has mainly been performed on Waters’ QTOF instruments where it has been termed MS E .
The commercial software tool ProteinLynx Global Server (PLGS) developed by Waters Corporation is the only one that deals with MS E data. PLGS renders a list of all proteins identified, including those identified only with shared peptides, and is also able to group proteins that share peptides into groups or hits. However, the principles used by this software to group the proteins and to select the protein that will name the group are not always based on peptide evidence. Therefore, the report of the results and the comparison of technical replicates are difficult, and besides manual inspection, PLGS does not provide a way to classify the identified proteins based on peptide evidence. Furthermore, none of the protein inference tools can read the output provided by PLGS.
Here we report PAnalyzer, a simple software tool to group and report the list of identified proteins into four simple categories following the recommendations proposed by Nesvizhskii & Aebersold. PAnalyzer is the only software available that directly imports data files generated by PLGS after the database search using MS E data and it can also import results generated by other tools in the standardized mzIdentML format. The assigned peptides used to infer the proteins can be filtered according to their score. Moreover, multirun analysis is supported and a single list of identified proteins can be created from several replicate runs. Results can be exported as comma separated values files, as HTML reports and also as mzIdentML in the case of a single mzIdentML input file.
PAnalyzer classifies proteins according to their evidence using three types of peptides:
Unique: peptides that can only be assigned to a single protein.
Discriminating: shared peptides which presence is explained by a set of proteins without unique peptides.
Non-discriminating: shared peptides which presence is explained by proteins with unique or discriminating peptides.
Using these three types of peptides, proteins matching a correct peptide identification can be classified into four evidence groups:
Conclusive: proteins with unique peptides.
Non-conclusive: proteins containing only non-discriminating peptides.
Indistinguishable: proteins that have discriminating peptides only shared by them and which also share the rest of the peptides.
Ambiguous group: a group of proteins that explain the presence of a group of discriminating peptides.
An initial peptide classification is carried out to further refine it through iteration in the next step. Initially all the peptides are marked as discriminating. Then, peptides with just one entry in their protein list are marked as unique. Next, proteins with unique peptides in their peptide list are marked as conclusive. Finally, peptides not marked as unique and with conclusive proteins in their protein list and are marked as non-discriminating.
An iterative process is carried out to find the non-discriminating peptides temporary marked as discriminating in the previous step. The entries in the protein list of the first discriminating peptide are considered as a temporary group. Any other of the remaining discriminating peptides shared by all of the proteins in this group and also by other proteins outside the group are marked as non-discriminating. This process is repeated until it reaches the last peptide marked as discriminating.
Once all the peptides are classified, the protein classification and grouping process is completed. Conclusive proteins had already been classified in the first step. Proteins containing only non-discriminating peptides are marked as non-conclusive. Then proteins sharing discriminating peptides are merged into an ambiguous group. Finally groups with proteins sharing the same peptides are marked as indistinguishable.
Integration of replicate runs
It is a common practice to perform different replicate runs of a sample in order to account for artifacts or to analyze it in a more exhaustive way. PAnalyzer accounts for this practice by providing the user the option to specify the minimum number of replicates (runs threshold) in which a peptide must be present to be considered in the protein inference step. This feature allows the users to be as flexible as they desire on the criteria adopted to accept the presence of a peptide in the sample.
Depending on the value selected by the user different approaches can be used. Two particular cases of interest are the following:
Peptide voting: runs threshold equals half plus one of the replicate runs. This is a conservative approach since only peptides present in at least half plus one of the replicate runs are considered for protein inference.
Peptide merging: runs threshold equals one. This is a less conservative approach since all the identified peptides in any of the runs are considered as valid, so all the peptides are merged before protein inference.
Multiplatform support is provided by .NET using the Microsoft implementation for Windows or the Mono free software implementation. This last one has been ported to different platforms, such as GNU/Linux, MacOS X and also Windows. GUI portability is assured by using the free software GTK# toolkit.
Input files can be in the output format of ProteinLynx Global Server, which is the software provided by Waters Corporation for their MS E data, or in the mzIdentML format recently standardized by the HUPO-PSI. After performing the protein classification, results can be saved into comma-separated values (CSV), mzIdentML and HTML files as shown in the results section.
Results and discussion
In order to validate the proposed methodology and software implementation a shotgun proteomics experiment has been carried out using protein extracts of Human Embryonic Kidney 293T (HEK 293T) cells. The results obtained by PAnalyzer are compared with those provided by ProteinLynx Global Server.
PAnalyzer is an easy to use tool to handle with the protein inference problem from peptide identifications. Input file formats supported are the output XML files of protein identifications generated by ProteinLynx Global Server (PLGS) versions 2.4 and 2.5, and the new HUPO-PSI standardized format mzIdentML version 1.0.0 and version 1.1.0. In the case of using PLGS files as input, peptide threshold can be specified as “green”, “yellow” or “red” and the associated numerical values are read from the log files generated by PLGS. In the case of mzIdentML files, peptides are filtered according to the “passThreshold” attribute of their corresponding “SpectrumIdentificationItem”. In any case, multiple input files can be selected and are interpreted as different replicate runs, allowing the user to choose the value of the replicate runs threshold.
A simple peptide and protein browser is provided in order to have a quick look at the resulting peptides and inferred proteins before exporting them to an output file. The results can be saved to CSV, HTML and mzIdentML.
CSV output is provided as a convenient format for researches that only want to see the basic data in a spreadsheet or for those who want to perform basic scripting. An example is provided in Additional file1.
mzIdentML output is only generated if the input data is also in this format and it is a single run analysis. The inferred proteins replace the “ProteinDectectionList” with new “ProteinAmbiguityGroup” entries. These groups can be composed of one or more inferred proteins according to their shared peptides. But this scheme is not enough to account for the protein evidence categories previously presented. PAnalyzer solves this by using the possibility that mzIdentML has to include additional information from a controlled vocabulary. A new entry (accession number MS:1001600) named “Protein Inference Confidence Category” has been specifically included in the PSI mass spectrometry ontology (PSI-MS, version 2.40) and PAnalyzer makes use of it in every “ProteinDetectionHypothesis” within the “ProteinAmbiguityGroup” to specify whether it consists on a conclusive protein, non-conclusive protein, ambiguous group or indistinguishable group. An example is provided in Additional file3.
Proteins were extracted from Human Embryonic Kidney 293T (HEK 293T) cell line by incubating in lysis buffer (10 mM TrisHCl pH 7.4, 1% Triton X-100, 0.5% NP-40, 150 mM NaCl, 1 mM EDTA, 1 mM EGTA, 0.2 mM ortovanadate).Samples were centrifuged at 13000 rpm for 15 min, supernatant was recovered and protein content was measured by Bradford assay (Sigma Aldrich, St. Louis, MO, USA). Protein extracts (50 μ g) were precipitated with 2-D Clean-Up kit (GE Healthcare) resuspended in 0.1% RapiGest SF (Waters, Milford, MA, USA) and digested with trypsin (Roche Diagnostics, Penzberg, Germany) following standard digestion protocol. LC-MS E analysis was performed with a nanoAcquity UPLC System interfaced to a SYNAPT HDMS mass spectrometer (Waters). 1.5 μ g of protein was loaded onto a Symmetry 300 C18, 180 μ m x 20 mm precolumn (Waters). The precolumn was connected to a BEH130 C18, 75 μ m x 200 mm, 1.7 μ m (Waters) column and peptides were directly eluted onto a homemade emitter.
Mass spectra were acquired using data independent acquisition mode (MS E ) described by Geromanos et al. with minor modifications. Spectra were processed with ProteinLynx Global Server 2.4 (Waters). Protein identification was obtained with the embedded database search algorithm of the program and a human UniProtKB/Swiss-Prot database (version 2012_02) was used. For protein identification the following parameters were adopted: carbamidomethylation of C as fixed modification; N-terminal acetylation, N and Q deamidation and M oxidation, as variable modifications; 1 missed cleavage and automatic precursor and fragment error tolerance. A maximum protein false discovery rate of 5% was allowed.
Comparison of results reported by PLGS and PAnalyzer
PLGS groups the identified proteins in hits. A hit is a group of proteins (or a single protein) that share peptides and one of the members has at least one unique and high confidence peptide. Furthermore, the top scoring protein of the group names the hit. This simple grouping cannot address the different scenarios that occur in large-scale identification experiments. For example when a protein shares peptides with more than one protein that have unique and high scoring peptides or when several proteins share all the peptides and have the same score. Moreover, manual inspection of each hit is the only way to know exactly the proteins that belong to a specific group.
As previously described, PAnalyzer groups the identified proteins following the recommendations proposed by Nesvizhskii & Aebersold. The results obtained for the 5 biological replicates of HEK 293T samples by PLGS and PAnalyzer are shown in Table1. Approximately 55% of the identified proteins are conclusive, meaning that they have at least 1 unique peptide and are definitively present in the sample. Nearly 20% of the proteins are non-conclusive and their presence in the sample is not supported by peptide evidence. The rest of the identified proteins belong to groups of proteins, more than 90% to indistinguishable groups with an average of 2.5 proteins per group.
PAnalyzer can also process simultaneously results from multiple replicate analysis. Protein classification can be performed based on peptide score criteria or/and replicate criteria (Table1 and tables in Additional files4 and5). In the results shown in Table1 no peptide filtering has been applied and only replicate criteria (presence in a minimum number of replicates) has been considered. Increase in the replicate criteria or replicate threshold entails a clear decrease in the number of conclusive proteins but it does not produce such a big effect in the number of indistinguishable, ambiguous and non-conclusive proteins.
A software tool for protein inference has been developed according to the protein classification of Nesvizhskii & Aebersold. This tool defines three kinds of peptides (unique, discriminating and non-discriminating) to classify the proteins into four evidence groups: conclusive, non-conclusive, indistinguishable and ambiguous group. It is important to make this protein classification since many studies use proteins that cannot be guaranteed to be present on a sample or, on the contrary, do not consider proteins that could be present in the sample because the protein inference tool based on the Occam’s razor simplification discards them. Conclusive proteins can be considered to be present in the sample, indistinguishable proteins and ambiguous groups usually relate to protein isoforms and it cannot be assured which protein combination is present. Finally non-conclusive proteins are filtered by different software tools according to the Occam’s razor principle, since the observed peptides can already be explained by the presence of the rest of the proteins. But non-conclusive proteins can be present in the sample and instead of being discarded should be tagged as non-conclusive for further analysis.
Multiple replicate run support has also been implemented using a threshold to define the minimum number of runs in which a peptide must be present in order to be considered in the protein inference process.
PAnalyzer is a simple and easy to use software tool to group and report proteins identified in a shotgun proteomics experiment, specially useful when data are obtained by MS E acquisition mode.
Availability and requirements
Project name: PAnalyzer
Project home page: http://code.google.com/p/ehu-bio/wiki/PAnalyzer
Operating systems: Platform independent
Programming language: C#
Other requirements: .NET 4 or higher, GTK# 2.12 or higher
License: GNU GPL
Any restrictions to use by non-academics: None
Proteomics Core Facility-SGIKER is member of ProteoRed-ISCIII and is supported by UPV/EHU, MCINN, GV/EJ, ERDF and ESF. JMA is supported by the Department of Industry of the Basque Government (ETORTEK grant IE09-256).RM is supported by Fundação para a Ciência e a Tecnologia (FCT) Ciência 2007 and by European Social Fund. IPATIMUP is an Associate Laboratory of the Portuguese Ministry of Science, Technology and Higher Education and is partially supported by FCT. RM is further supported by FCT grants (PTDC/QUI-BIQ/099457/2008 and PTDC/EIA-EIA/099458/2008).
- Washburn M, Wolters D, Yates III J: Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol 2001, 19(3):242–247. 10.1038/85686View ArticlePubMedGoogle Scholar
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422(6928):198–207. 10.1038/nature01511View ArticlePubMedGoogle Scholar
- Nesvizhskii A, Aebersold R: Interpretation of shotgun proteomic data. Mol Cell Proteomics 2005, 4(10):1419–1440. 10.1074/mcp.R500012-MCP200View ArticlePubMedGoogle Scholar
- Huang T, Wang J, Yu W, He Z: Protein inference: a review. Briefings Bioinf 2012, 13(5):586–614. 10.1093/bib/bbs004View ArticleGoogle Scholar
- Nesvizhskii A, Keller A, Kolker E, Aebersold R: A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem 2003, 75(17):4646–4658. 10.1021/ac0341261View ArticlePubMedGoogle Scholar
- Meyer-Arendt K, Old W, Houel S, Renganathan K, Eichelberger B, Resing K, Ahn N: IsoformResolver: A peptide-centric algorithm for protein inference. J Proteome Res 2011, 10(7):3060–3075. 10.1021/pr200039pPubMed CentralView ArticlePubMedGoogle Scholar
- Grobei M, Qeli E, Brunner E, Rehrauer H, Zhang R, Roschitzki B, Basler K, Ahrens C, Grossniklaus U: Deterministic protein inference for shotgun proteomics data provides new insights into Arabidopsis pollen development and function. Genome Res 2009, 19(10):1786–1800. 10.1101/gr.089060.108PubMed CentralView ArticlePubMedGoogle Scholar
- Qeli E, Ahrens C: PeptideClassifier for protein inference and targeted quantitative proteomics. Nat Biotechnol 2010, 28(7):647–650. 10.1038/nbt0710-647View ArticlePubMedGoogle Scholar
- Silva J, Gorenstein M, Li G, Vissers J, Geromanos S: Absolute quantification of proteins by LCMSE: a virtue of parallel MS acquisition. Mol Cell Proteomics 2006, 5(1):144–156.View ArticlePubMedGoogle Scholar
- Jones A, Eisenacher M, Mayer G, Kohlbacher O, Siepen J, Hubbard S, Selley J, Searle B, Shofstahl J, Seymour S, et al.: The mzIdentML data standard for mass spectrometry-based proteomics results. Mol Cell Proteomics 2012, 11(7):M111.014381. 10.1074/mcp.M111.014381PubMed CentralView ArticlePubMedGoogle Scholar
- Geromanos S, Vissers J, Silva J, Dorschel C, Li G, Gorenstein M, Bateman R, Langridge J: The detection, correlation, and comparison of peptide precursor and product ions from data independent LC-MS with data dependant LC-MS/MS. Proteomics 2009, 9(6):1683–1695. 10.1002/pmic.200800562View ArticlePubMedGoogle Scholar
- Li G, Vissers J, Silva J, Golick D, Gorenstein M, Geromanos S: Database searching and accounting of multiplexed precursor and product ion spectra from the data independent analysis of simple and complex peptide mixtures. Proteomics 2009, 9(6):1696–1719. 10.1002/pmic.200800564View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.