An automated proteomic data analysis workflow for mass spectrometry
© Pendarvis et al; licensee BioMed Central Ltd. 2009
Published: 8 October 2009
Mass spectrometry-based protein identification methods are fundamental to proteomics. Biological experiments are usually performed in replicates and proteomic analyses generate huge datasets which need to be integrated and quantitatively analyzed. The Sequest™ search algorithm is a commonly used algorithm for identifying peptides and proteins from two dimensional liquid chromatography electrospray ionization tandem mass spectrometry (2-D LC ESI MS2) data. A number of proteomic pipelines that facilitate high throughput 'post data acquisition analysis' are described in the literature. However, these pipelines need to be updated to accommodate the rapidly evolving data analysis methods. Here, we describe a proteomic data analysis pipeline that specifically addresses two main issues pertinent to protein identification and differential expression analysis: 1) estimation of the probability of peptide and protein identifications and 2) non-parametric statistics for protein differential expression analysis. Our proteomic analysis workflow analyzes replicate datasets from a single experimental paradigm to generate a list of identified proteins with their probabilities and significant changes in protein expression using parametric and non-parametric statistics.
The input for our workflow is Bioworks™ 3.2 Sequest (or a later version, including cluster) output in XML format. We use a decoy database approach to assign probability to peptide identifications. The user can set "quality thresholds" on peptide identifications based on the P value. We also estimate the probability of protein identification. Proteins identified with peptides at a user-specified threshold value from biological experiments are grouped as either control or treatment for further analysis in ProtQuant. ProtQuant uses a parametric method (ANOVA) to calculate differences in protein expression based on the quantitative measure ΣXcorr. Alternatively, ProtQuant output can be further processed using non-parametric Monte Carlo resampling statistics to calculate P values for differential expression. Correction for multiple testing of ANOVA and resampling P values is done using Benjamini and Hochberg's method. The results of these statistical analyses are then combined into a single output file containing a comprehensive protein list with probabilities and differential expression analysis, associated P values, and resampling statistics.
For biologists carrying out proteomics by mass spectrometry, our workflow facilitates automated, easy to use analyses of Bioworks (3.2 or later versions) data. All the methods used in the workflow are peer-reviewed and as such the results of our workflow are compliant with proteomic data submission guidelines to public proteomic data repositories including PRIDE. Our workflow is a necessary intermediate step that is required to link proteomics data to biological knowledge for generating testable hypotheses.
Recent advances in genome sequencing projects have facilitated the global analysis of proteins ("proteomics") in order to study their role in health and disease. Proteomic datasets may be generated by coupling nanoflow technology with high-speed, high-resolution mass spectrometers, and these have generated immensely complex and very large mass spectral datasets. Analyzing these huge datasets by hand is a daunting, inefficient, and error-prone task, hence the need for automated data analysis pipelines.
Multidimensional Protein Identification Technology (MudPIT) followed by database searching is commonly used to identify proteins from a biological sample. Biological problems addressed by proteomics often include comparing two different conditions, e.g. normal versus treatment. For comparative proteomics, there is a need to determine which subset of proteins is differentially expressed (DE) at a defined statistical threshold. Sample preparation for proteomics includes total protein isolation from a target biological sample and digestion of these proteins using proteases like trypsin to generate a complex mixture of peptides that then need to be deconvoluted and analyzed by mass spectrometry. One method to reduce the complexity of peptides is separation based on their charge and hydrophobicity using two-dimensional liquid chromatography (2D-LC) before the peptides enter the mass spectrometer for MS/MS analysis.
The flow rates required to separate peptides are in the nanoliter to microliter per minute range, and mass spectrometers must collect data for an extended time, often many hours. The resulting datasets can contain tens to hundreds of thousands of mass spectra, which must then be searched against a protein database to identify the peptides and thus the proteins. The protein database is digested in silico with the protease used for sample preparation to generate a database of peptides and their theoretical spectra that can be matched with the experimental spectra collected by mass spectrometry. Several search algorithms are described in the literature for database searching, including Sequest, MASCOT, and X!Tandem, which match experimental mass spectra to theoretical spectra derived from a protein database. Sequest is a widely used search algorithm and our proteomics workflow is designed to analyze Sequest search results. Sequest computes a cross-correlation (Xcorr) function to assess the quality of peptide-spectrum matches. The better the match between an experimental peptide mass spectrum and its database counterpart, the higher the Xcorr will be. Sequest also computes ΔCn, a normalized score calculated from the Xcorr difference between the best peptide match and the second-best match. ΔCn is dependent on database size, search parameters, and sequence homologies. While both Xcorr and ΔCn have been used widely in the past for filtering search results [5–8], they provide little information for distinguishing correct peptide assignments from false positives. To get the most meaningful biological data from proteomics or any high-throughput experiment, it is necessary to reduce the false discovery rate. Decoy database search methods for estimating probabilities for peptide identifications are described in the literature [9, 10]. However, open source computational tools that automate this estimation are not readily available.
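Two of the steps described above can be sketched in a few lines of Python. The example below is illustrative only and is not taken from Sequest or Bioworks: it assumes the common trypsin rule (cleave after K or R unless the next residue is P, with no missed cleavages) and the usual definition of ΔCn as the Xcorr difference between the top two matches divided by the top Xcorr. The function names are hypothetical.

```python
def tryptic_digest(sequence):
    """In silico digest: cleave after K or R, except when followed by P.

    Simplified rule with no missed cleavages (an assumption for illustration).
    """
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing peptide that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


def delta_cn(xcorr_best, xcorr_second):
    """Normalized Xcorr difference between the best and second-best match."""
    if xcorr_best <= 0:
        raise ValueError("best Xcorr must be positive")
    return (xcorr_best - xcorr_second) / xcorr_best


if __name__ == "__main__":
    print(tryptic_digest("MKAARPGKLLR"))
    print(delta_cn(4.0, 3.0))
```

A real search engine additionally considers missed cleavages, peptide mass limits, and modifications, none of which are modeled here.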
Beyond the identification of peptides and proteins at acceptable statistical thresholds, for expression proteomics the end user requires computational tools for differential protein expression. Label-free protein quantification methods determine relative protein abundances directly from high-throughput proteomic analyses without labeling techniques, using sampling statistics like spectral counting, number of peptides, and ΣXcorr. We developed ProtQuant, a Java-based tool for label-free quantification that uses a spectral counting method with increased specificity based on ΣXcorr. ProtQuant computes the statistical significance of differential protein expression using parametric statistics (ANOVA), assuming that the distribution of the control and treatment datasets closely approximates a normal distribution. However, this assumption may not be valid for shotgun proteomics, due either to the biology under investigation or to the small sample sizes common to proteomic studies, and violating it can result in type I errors (i.e. an increased rate of false positive significance calls). Computer-intensive distribution-free statistics offer a solution to this problem, and we have applied random resampling with replacement to determine statistically significant differences in protein expression from ESI MS2 data.
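The ΣXcorr measure and the resampling idea can be illustrated with a short Python sketch. This is not the ProtQuant or rsProt code: the input layout (protein identifier, Xcorr pairs), the test statistic (absolute difference in group means), and the pooling scheme are assumptions chosen to keep the example small.

```python
import random
from collections import defaultdict


def sum_xcorr(peptide_hits):
    """Sum Xcorr over all peptide hits of each protein (the ΣXcorr measure).

    peptide_hits: iterable of (protein_id, xcorr) pairs (hypothetical layout).
    """
    totals = defaultdict(float)
    for protein, xcorr in peptide_hits:
        totals[protein] += xcorr
    return dict(totals)


def _mean(values):
    return sum(values) / len(values)


def resampling_p_value(control, treatment, n_iter=5000, seed=0):
    """Distribution-free P value by random resampling with replacement.

    Both groups are redrawn from the pooled values, which simulates the null
    hypothesis of no difference between control and treatment.
    """
    rng = random.Random(seed)
    observed = abs(_mean(treatment) - _mean(control))
    pooled = list(control) + list(treatment)
    extreme = 0
    for _ in range(n_iter):
        c = [rng.choice(pooled) for _ in control]
        t = [rng.choice(pooled) for _ in treatment]
        if abs(_mean(t) - _mean(c)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_iter + 1)


if __name__ == "__main__":
    print(sum_xcorr([("P1", 2.5), ("P1", 3.0), ("P2", 1.5)]))
    print(resampling_p_value([1.0, 1.1, 0.9], [10.0, 10.2, 9.8]))
```

With triplicate data the number of distinct resamples is small, so the attainable P values are coarse; this is a property of the sample size rather than of the sketch.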
A recurring theme in high-throughput biology is that collecting orthogonal evidence for the biology under investigation using complementary data analysis platforms could reduce the noise and identify true biological effects. For example, microarray differential expression analysis is often complemented by quantitative RT-PCR. Matching mass spectra using two different algorithms like Sequest and Mascot often generates a list of proteins that overlap, but also proteins uniquely identified by each method. Likewise, given enough computational resources and automated data analysis tools, biologists could evaluate differential protein expression using different statistical tests to identify a core set of differences that could represent true biological changes in expression. Furthermore, proteomic analysis workflows also require corrections for multiple testing to reduce false positive identifications of significant DE based on a single P value cutoff.
To illustrate the functionality of our proteomics workflow, we analyzed the response of Edwardsiella ictaluri to iron restriction induced with the iron chelator 2,2-dipyridyl (DP). E. ictaluri cultures were grown in triplicate and outer membrane proteins were isolated. Mass spectrometry and Sequest searches with a protein and reversed-protein database were done as previously described. SEQUEST results were processed using the tools and scripts described in our workflow.
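The multiple testing correction mentioned above, Benjamini and Hochberg's method, can be written compactly. The following Python sketch implements the standard step-up adjustment; the function name is hypothetical and it is not the code used in the workflow.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(p, round(q, 4))
```

A protein would then be called differentially expressed when its adjusted value falls at or below the chosen cutoff, for example 0.05.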
Our proteomics workflow starts with SEQUEST search results in XML format from the Bioworks 3.2 browser for both protein and reverse database searches. We chose XML as the standard format for Bioworks output because it avoids the 65,536-row limit of some versions of Microsoft Excel spreadsheets. When exporting Bioworks 3.2 search results, we recommend that the user does not apply any filters. If exporting without filters is not practical because of the virtual memory constraints of desktop computers, we suggest applying minimal filters for peptide charge state. End users need to be aware, however, that if the peptide filters are set too high, many positive matches may be lost. We created a Java program named XML2TXT to quickly convert the XML output files to tab-delimited text files, which are used by other scripts and can be opened in Excel or Notepad for viewing.
Once the real and reverse unfiltered data files are formatted properly using XML2TXT, they can be processed by ProbCal. ProbCal is a set of Perl scripts that automate the estimation of peptide probabilities using search results from a protein and a decoy database. A t-score is obtained for each Xcorr and ΔCn pair from the reverse search results, and based on this score a P value is calculated for each peptide identified from the protein database. The results can then be filtered using a probability cutoff, typically p ≤ 0.05.
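The general idea of deriving a P value from decoy search results can be shown with a deliberately simplified Python sketch. It does not reproduce the t-score calculation used by ProbCal: here a single score per peptide is assumed, and the P value is taken as the fraction of decoy (reverse database) scores at least as high as the target score. All names are hypothetical.

```python
from bisect import bisect_left


def empirical_p_values(target_scores, decoy_scores):
    """Empirical P value of each target score against the decoy distribution."""
    decoys = sorted(decoy_scores)
    n = len(decoys)
    p_values = []
    for score in target_scores:
        # Number of decoy scores greater than or equal to this target score.
        n_at_least = n - bisect_left(decoys, score)
        # Add-one correction avoids reporting a P value of exactly zero.
        p_values.append((n_at_least + 1) / (n + 1))
    return p_values


def filter_by_p_value(peptides, p_values, cutoff=0.05):
    """Keep peptides whose P value is at or below the cutoff."""
    return [pep for pep, p in zip(peptides, p_values) if p <= cutoff]


if __name__ == "__main__":
    decoys = [1.0, 1.5, 2.0, 2.5]
    print(empirical_p_values([3.0, 2.0, 0.5], decoys))
```

The resolution of such an estimate is limited by the number of decoy hits, which is one reason to export search results without filters.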
ProtQuant analysis of the E. ictaluri iron replete and DP comparison identified expression of 217 proteins to be significantly increased or decreased (at Benjamini-Hochberg adjusted p ≤ 0.05).
The proteomic data analysis workflow described here for Bioworks Sequest results has a modular design in which different components can be combined to perform different analyses. The workflow can be as simple as identifying proteins at a certain probability threshold or as extensive as comparing two datasets for differential protein expression using multiple statistical methods. All the tools and scripts described here can be implemented and further modified to accommodate additional analysis designs, but doing so requires basic programming skills. All the tools and scripts used are compatible with both Linux and Windows platforms.
XML2TXT is a Java program that converts an XML file into a tab-delimited text file that is used by the other scripts. It is implemented using Xalan-Java, an XSLT (XSL Transformations) processor for transforming XML documents into text document types. The javax.xml.transform interface of the Java API for XML Processing (JAXP) 1.3 is used.
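The conversion that XML2TXT performs can be approximated in Python for readers who want to inspect an export without Java. The sketch below uses the standard library's ElementTree parser rather than XSLT, and the element names (`peptide`, `sequence`, `xcorr`) are placeholders: the actual Bioworks XML schema is not described here and would need to be substituted.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_to_tsv(xml_text, record_tag, fields):
    """Flatten every <record_tag> element into one tab-delimited row.

    record_tag and fields are hypothetical names chosen by the caller.
    """
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)  # header row
    for record in root.iter(record_tag):
        # Missing child elements become empty cells.
        writer.writerow([(record.findtext(f) or "").strip() for f in fields])
    return out.getvalue()


SAMPLE = """<results>
  <peptide><sequence>AAK</sequence><xcorr>3.2</xcorr></peptide>
  <peptide><sequence>LLR</sequence><xcorr>2.1</xcorr></peptide>
</results>"""

if __name__ == "__main__":
    print(xml_to_tsv(SAMPLE, "peptide", ["sequence", "xcorr"]), end="")
```

For very large exports, a streaming parser such as `ET.iterparse` would keep memory use lower than loading the whole document at once.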
The Perl scripts ProbCal, ProbCal-filter, and integrator require the installation of the ActivePerl runtime environment, available at http://www.activestate.com/activeperl/. ProbCal is the implementation of the peptide probability calculation. Individual peptide probabilities are further used to calculate the probability that a protein identification is incorrect. A subsidiary script, ProbCal-filter, uses the peptide probabilities to exclude low-quality peptides from further analysis.
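The formula that ProbCal uses to combine peptide probabilities is not given here. One simple possibility, shown below purely as an assumption, is to treat the peptide identifications of a protein as independent and multiply their P values, so that the product estimates the probability that every supporting peptide is a false match. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def protein_error_probabilities(peptide_hits):
    """Probability that each protein identification is incorrect.

    peptide_hits: iterable of (protein_id, peptide_p_value) pairs.
    Assumes independent peptide identifications (an illustrative assumption).
    """
    result = defaultdict(lambda: 1.0)
    for protein, p_value in peptide_hits:
        if not 0.0 <= p_value <= 1.0:
            raise ValueError("peptide P values must lie in [0, 1]")
        result[protein] *= p_value
    return dict(result)


if __name__ == "__main__":
    hits = [("A", 0.05), ("A", 0.1), ("B", 0.2)]
    print(protein_error_probabilities(hits))
```

Under this assumption a protein supported by several moderately confident peptides receives a lower error probability than a protein supported by a single peptide of the same quality.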
ProtQuant is implemented in Java 5 for platform independence. A self-installing executable for Windows has been generated using Macrovision InstallShield. Instructions for installing and using the tool in a Linux environment are also available. ANOVA is performed using a library from the R statistical package (http://www.rproject.org/). Because of the size of the datasets that ProtQuant must handle, MySQL is used for data storage and efficient data manipulation. ProtQuant uses the file extension of input files to determine their format and includes a custom-built parser for XML files. rsProt, which performs the resampling, requires the installation of MATLAB, available for purchase from MathWorks at http://www.mathworks.com/products/matlab/.
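ProtQuant delegates ANOVA to R, so the following Python sketch is only meant to show what the underlying computation looks like for a single protein. It calculates the one-way ANOVA F statistic from per-replicate values for a control and a treatment group; the example values are invented, and converting F to a P value (which needs the F distribution) is left to a statistics package.

```python
def one_way_anova_f(*groups):
    """One-way ANOVA F statistic for two or more groups of observations.

    Raises ZeroDivisionError when there is no within-group variation.
    """
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


if __name__ == "__main__":
    control = [1.0, 2.0, 3.0]    # invented ΣXcorr values, one per replicate
    treatment = [2.0, 3.0, 4.0]
    print(one_way_anova_f(control, treatment))
```

With only two groups this F statistic equals the square of the pooled-variance two-sample t statistic.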
E. ictaluri Proteomics
E. ictaluri cultures were grown in triplicate in BHI (iron replete) and BHI with 100 μM dipyridyl (iron restriction). Outer membrane proteins were isolated by sodium N-lauroylsarcosinate (SLS) extraction. Protein concentrations were determined using the Plus one 2D quant kit following the manufacturer's protocol (Amersham Biosciences, Piscataway, NJ). Trypsin digestion of proteins and analysis of tryptic peptides by 2-D LC ESI MS/MS were conducted as described previously. For protein identification, all searches were done using TurboSEQUEST™ (Bioworks Browser 3.2, ThermoElectron). Mass spectra and tandem mass spectra were searched against an in silico trypsin-digested E. ictaluri protein database (3786 proteins). Cysteine carboxyamidomethylation and methionine single and double oxidation were included in the search criteria. For decoy searches, a reversed version of the protein database was generated using the reverse database function in Bioworks 3.2. The reversed database was also trypsin digested in silico and used for searches with tandem mass spectra exactly as described for the protein database. Bioworks results were exported in XML format for the proteomic analysis workflow described here.
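A reversed decoy database of the kind described above can be produced with a few lines of Python. This sketch is not the Bioworks reverse database function, whose exact output format is not described here: it simply reverses each protein sequence in a FASTA file and marks the header with a prefix (`REV_`, a hypothetical choice).

```python
def reverse_fasta(fasta_text, prefix="REV_"):
    """Return FASTA text in which every protein sequence is reversed."""
    records, header, chunks = [], None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    # Sequences are written on one line each; headers get the decoy prefix.
    return "".join(f">{prefix}{h}\n{seq[::-1]}\n" for h, seq in records)


if __name__ == "__main__":
    print(reverse_fasta(">p1\nMKAA\nRP\n>p2\nLLR\n"), end="")
```

Reversing preserves the amino acid composition and length of each protein, so the decoy search space has roughly the same size as the real one.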
This project was partially supported by a grant from the National Research Initiative of the USDA Cooperative State Research, Education and Extension Service (grant number 2006-35600-17688) and by the National Science Foundation (EPS-0556308-06040293). We acknowledge Dr. Mark L Lawrence for providing E. ictaluri proteomic datasets. We acknowledge Tibor Pechan of the Life Sciences and Biotechnology Institute, Mississippi State University, for running the mass spectrometer. The Life Sciences and Biotechnology Institute provided salary support for Ken Pendarvis.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 11, 2009: Proceedings of the Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S11.
- Wolters DA, Washburn MP, Yates JR 3rd: An automated multidimensional protein identification technology for shotgun proteomics. Anal Chem 2001, 73(23):5683–90. doi:10.1021/ac010617e
- Yates JR 3rd, et al.: Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Anal Chem 1995, 67(8):1426–36. doi:10.1021/ac00104a020
- Perkins DN, et al.: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20(18):3551–67. doi:10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2
- Craig R, Beavis RC: TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20(9):1466–7. doi:10.1093/bioinformatics/bth092
- Lee SR, et al.: Bovine viral diarrhea virus infection affects the expression of proteins related to professional antigen presentation in bovine monocytes. Biochim Biophys Acta 2009, 1794(1):14–22.
- Lee SR, et al.: Differential detergent fractionation for non-electrophoretic bovine peripheral blood monocyte proteomics reveals proteins involved in professional antigen presentation. Dev Comp Immunol 2006, 30(11):1070–83. doi:10.1016/j.dci.2006.02.002
- Nanduri B, et al.: Effects of subminimum inhibitory concentrations of antibiotics on the Pasteurella multocida proteome. J Proteome Res 2006, 5(3):572–80. doi:10.1021/pr050360r
- Nanduri B, et al.: Proteomic analysis using an unfinished bacterial genome: the effects of subminimum inhibitory concentrations of antibiotics on Mannheimia haemolytica virulence factor expression. Proteomics 2005, 5(18):4852–63. doi:10.1002/pmic.200500112
- Choi H, Nesvizhskii AI: Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. J Proteome Res 2008, 7(1):254–65. doi:10.1021/pr070542g
- Choi H, Nesvizhskii AI: False discovery rates and related statistical concepts in mass spectrometry-based proteomics. J Proteome Res 2008, 7(1):47–50. doi:10.1021/pr700747q
- Liu H, Sadygov RG, Yates JR 3rd: A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Anal Chem 2004, 76(14):4193–201. doi:10.1021/ac0498563
- Gao J, et al.: Changes in the protein expression of yeast as a function of carbon source. J Proteome Res 2003, 2(6):643–9. doi:10.1021/pr034038x
- Bridges SM, et al.: ProtQuant: a tool for the label-free quantification of MudPIT proteomics data. BMC Bioinformatics 2007, 8(Suppl 7):S24. doi:10.1186/1471-2105-8-S7-S24
- Nanduri B, et al.: Quantitative analysis of Streptococcus pneumoniae TIGR4 response to in vitro iron restriction by 2-D LC ESI MS/MS. Proteomics 2008, 8(10):2104–14. doi:10.1002/pmic.200701048
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 1995, 57(1):289–300.
- Lopez-Ferrer D, et al.: Statistical model for large-scale peptide identification in databases from tandem mass spectra using SEQUEST. Anal Chem 2004, 76(23):6853–60. doi:10.1021/ac049305c
- Williams ML, Azadi P, Lawrence ML: Comparison of Cellular and Extracellular Products Expressed by Virulent and Attenuated Strains of Edwardsiella ictaluri. Journal of Aquatic Animal Health 2003, 15: 264–273. doi:10.1577/H03-051.1
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.