multiplierz: an extensible API based desktop environment for proteomics data analysis
© Parikh et al; licensee BioMed Central Ltd. 2009
Received: 09 April 2009
Accepted: 29 October 2009
Published: 29 October 2009
Efficient analysis of results from mass spectrometry-based proteomics experiments requires access to disparate data types, including native mass spectrometry files, output from algorithms that assign peptide sequence to MS/MS spectra, and annotation for proteins and pathways from various database sources. Moreover, proteomics technologies and experimental methods are not yet standardized; hence a high degree of flexibility is necessary for efficient support of high- and low-throughput data analytic tasks. Development of a desktop environment that is sufficiently robust for deployment in data analytic pipelines, and simultaneously supports customization for programmers and non-programmers alike, has proven to be a significant challenge.
We describe multiplierz, a flexible and open-source desktop environment for comprehensive proteomics data analysis. We use this framework to expose a prototype version of our recently proposed common API (mzAPI) designed for direct access to proprietary mass spectrometry files. In addition to routine data analytic tasks, multiplierz supports generation of information-rich, portable spreadsheet-based reports. Moreover, multiplierz is designed around a "zero infrastructure" philosophy, meaning that it can be deployed by end users with little or no system administration support. Finally, access to multiplierz functionality is provided via high-level Python scripts, resulting in a fully extensible data analytic environment for rapid development of custom algorithms and deployment of high-throughput data pipelines.
Collectively, mzAPI and multiplierz facilitate a wide range of data analysis tasks, spanning technology development to biological annotation, for mass spectrometry-based proteomics research.
Mass spectrometry-based proteomics, particularly liquid chromatography coupled to electrospray ionization, has become the predominant technique for identification and quantification of proteins in biological systems. Growing demand for improved annotation of primary proteomics data with biological information from various public databases has catalyzed interest in the development of software tools to support integration of these data types. Unfortunately, a number of factors, including lack of experimental standardization, rapid introduction of novel mass spectrometry technology, and the evolution of proprietary file formats associated with proteomics platforms, represent a significant hurdle to the development of efficient and comprehensive software frameworks.
To accommodate the emergent nature of proteomics-related technologies and the burgeoning number of databases that contain various biological annotations, data analytic systems must emphasize (i) intuitive and interactive interfaces, (ii) user-accessible coding frameworks to facilitate rapid prototyping of algorithms, and (iii) customizable sets of tools that can be readily integrated to provide pipelines that support a variety of proteomic workflows. Task-specific Windows desktop applications such as MSQuant and InsilicosViewer can access a subset of native mass spectrometry data files directly and provide flexibility through adjustable parameters, but are not readily extended across the full spectrum of data analytic activities required in modern proteomics research. To address the full spectrum of analyses, open-source projects such as The OpenMS Proteomics Pipeline (TOPP) and ProteoWizard offer a set of modular tools for generation of pipelines. The C++ coding environment of these tools is designed for performance and throughput, although researchers who lack programming experience often struggle to implement novel algorithms or other ad hoc tasks. Therefore, software libraries such as InSilicoSpectro and mspire have been developed based on high-level languages such as Perl and Ruby, respectively. These libraries allow scripting of common data analysis tasks but cannot access raw binary data directly, and must rely instead on surrogate text files.
Historically the proprietary nature of binary files associated with proteomics technologies represented a significant obstacle to efforts aimed at development of integrated desktop environments. One solution proposed specifically for mass spectrometry is extraction of native data to a common file format, typically a dialect of XML [8, 9]. We and others have challenged the technical merits of this approach. Given that mass spectrometry manufacturers implicitly carry the burden of maintaining up-to-date libraries for access to their native data, we recently proposed that a common API is a more rational solution for shared access to proprietary mass spectrometry files.
Here we define and implement a minimal API (mzAPI) that provides direct, programmatic interaction with binary raw files, and we demonstrate that practical tasks run significantly faster than the equivalent operations on mzXML files. We implement mzAPI in Python to maximize accessibility; similarly, mzAPI is exposed to users through multiplierz, a Python-based desktop environment that combines an intuitive interface with a powerful and flexible high-level scripting platform. Together, mzAPI and multiplierz support a wide range of data analytic tasks and facilitate rapid prototyping of novel algorithms. In addition, the multiplierz environment is designed with a "zero-infrastructure" philosophy, meaning that it can be deployed by end users who lack system administration experience or support. We demonstrate the capabilities of multiplierz through a variety of proteomics case studies such as (i) label-free quantitative comparison and interactive validation of datasets from multi-acquisition experiments, (ii) automatic quality control of mass spectrometer performance, (iii) improved peptide sequence assignment via deisotoping of MS/MS spectra, and (iv) assessment of phosphopeptide enrichment efficiency through programmatic fragment ion extraction.
mzAPI: A Common API for Direct Access to Proprietary File Formats
scan(time) → [(mz, intensity)]
scan_list(start_time, stop_time) → [(time, precursor)]
time_range() → (start_time, stop_time)
scan_time_from_scan_name(scan_name) → time
ric(start_time, stop_time, start_mz, stop_mz) → [(time, intensity)]
The first two procedures in mzAPI return: 1) individual scans in the form of a list of (mz, intensity) pairs, and 2) a catalog of all scan descriptions in the form of a list of (time, precursor) pairs in the experiment. In addition, the API provides: 3) 'time_range' that returns the earliest and latest acquisition times in the experiment, and 4) 'scan_time_from_scan_name' for translation of manufacturer-specific scan nomenclature to the mzAPI naming convention. We opted to rely on acquisition time as a common naming convention. In the case of LC-MS this is equivalent to chromatographic retention time. Finally, a fifth procedure generates a reconstructed ion chromatogram (RIC) for a given time and mass-to-charge range, returned as a set of (time, intensity) pairs. While in principle RICs can be generated using the first two calls, we believe that ubiquitous use of the RIC operation in proteomics data analysis justifies exposure of RIC extraction as a primitive in the API. Given that RIC extraction is provided by all manufacturer libraries, this procedure represents an excellent example of efficient re-use of native data system indexing and software.
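To make the shape of the five procedures concrete, the following is a minimal sketch in Python, not the multiplierz implementation. The class name `InMemoryMzFile`, its constructor, the nearest-scan lookup, the use of `None` as the precursor of an MS scan, and the treatment of scan names as 1-based scan numbers are all assumptions made for illustration. A real implementation would delegate each call to the manufacturer's library. Here `ric` is deliberately built from the first two calls, which is the fallback route mentioned above.

```python
from bisect import bisect_left


class InMemoryMzFile:
    """Toy stand-in for an mzAPI data file (hypothetical, for illustration)."""

    def __init__(self, scans):
        # scans: iterable of (time, precursor, [(mz, intensity), ...]).
        # Assumption: precursor is None for MS scans, an m/z value for MS/MS.
        self._scans = sorted(scans, key=lambda s: s[0])
        self._times = [s[0] for s in self._scans]

    def scan(self, time):
        """Return [(mz, intensity)] for the scan acquired nearest to `time`."""
        i = bisect_left(self._times, time)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self._times)]
        nearest = min(candidates, key=lambda j: abs(self._times[j] - time))
        return list(self._scans[nearest][2])

    def scan_list(self, start_time, stop_time):
        """Return [(time, precursor)] for scans within the time window."""
        return [(t, p) for t, p, _ in self._scans if start_time <= t <= stop_time]

    def time_range(self):
        """Return (earliest, latest) acquisition time."""
        return (self._times[0], self._times[-1])

    def scan_time_from_scan_name(self, scan_name):
        """Translate a scan name to a time (assumed: 1-based scan number)."""
        return self._times[int(scan_name) - 1]

    def ric(self, start_time, stop_time, start_mz, stop_mz):
        """Return [(time, intensity)], summing MS signal inside the m/z window."""
        points = []
        for t, precursor in self.scan_list(start_time, stop_time):
            if precursor is not None:
                continue  # skip MS/MS scans
            total = sum(i for mz, i in self.scan(t) if start_mz <= mz <= stop_mz)
            points.append((t, total))
        return points
```

Building the chromatogram this way touches every scan in the window, which is the cost a native RIC call can avoid by using the data system's own indexing.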
We propose that a proprietary file format is considered mzAPI compliant when the manufacturer provides a freely available, and preferably redistributable, implementation of the aforementioned five core procedures, or an extended version that may evolve from a community-driven standardization effort. For example, ThermoFisher Scientific provides a data access library for .RAW files through the MSFileReader program, freely available for download at: http://sjsupport.thermofinnigan.com/public/detail.asp?id=586 or http://blais.dfci.harvard.edu/research/mass-informatics/mzAPI/vendor-libraries/. Naturally, additional procedure calls, such as charge state or signal-to-noise values for each isotope cluster in MS or MS/MS scans, can be incorporated into the mzAPI framework by subclassing the core mzFile class.
Access Efficiency for Open and Proprietary Mass Spectrometry Data Files.
Given the multidimensional nature of mass spectrometry data, extraction based on specific slices through the data space, rather than random file access, is a more relevant performance metric for mass spectrometry files. Generation of RICs is perhaps the best example of a data slice procedure supported by all manufacturer data systems. Consequently we next sought to test the performance of mzAPI for creation of RICs directly from a .RAW file. As a point of comparison we generated the corresponding mzXML file (using TPP version 4.0) and extracted RICs using a graphical user interface (GUI) based browser tool (InsilicosViewer version 1.5.1) as well as two scriptable platforms, the Perl-based InSilicoSpectro environment (version 1.3.19) and the R-based XCMS (version 1.12.1). Although the latter two are designed to access a number of third-party file formats, none of the GUI- or command-line based tools supports access to specified subsets (in chromatographic time) of the underlying data. As a result we generated RICs by extraction of a specific mass-to-charge range over the full data file, or in the case of InSilicoSpectro, which had no support for RIC generation, we simply timed the opening of mzXML files. While the mzXML schema includes a scan index that provides for random access to scans at speeds competitive with, or exceeding, those of proprietary data systems (in this case ThermoFisher Xcalibur), Table 1 shows that generation of specific data slices, in this case RICs, is 5- to 10-fold faster when leveraging the underlying manufacturer's API compared to GUI or command line based access to mzXML (scripts used for all timings are included in Additional File 1). This result supports the notion that pragmatic data access patterns are well supported by existing, albeit proprietary, manufacturer libraries, and more importantly, that these libraries can be efficiently utilized through a common and redistributable API.
multiplierz: An Open-Source and Interactive Environment for Proteomics Data Analysis
Zero infrastructure integration of peptide identification and associated native mass spectrometry data
Regardless of final experimental goals, peptide identification is often the first or default operation performed subsequent to LC-MS data acquisition. We designed multiplierz to serve as a user-friendly desktop tool for interaction with proteomics database search engines; consistent with our zero-infrastructure philosophy, X!Tandem is fully integrated into the multiplierz installation package. Similarly we include support for automated retrieval of Mascot search results. In this case, the URL for a particular search is easily and unambiguously accessed using the Mascot job ID (after completion of the search, the Mascot job ID appears both on the search submission page and in the Mascot Daemon). The multiplierz module for downloading Mascot search results also allows input for Mascot-specific export options such as "Require Bold Red" and "Maximum Number of Protein hits." Multiple search results are specified using either a comma- or dash-separated list of Mascot job IDs (e.g., 6556, 5878, 5120-5125). Users can optionally include Mascot MS/MS fragment annotation images (which are otherwise displayed across multiple Mascot report web pages) and embed them within a single multiplierz report; thus multiplierz provides users with comprehensive Mascot information, including images, in a convenient and portable report (described below). Importantly, none of the above tasks requires server-level administrative privileges. For example, query of MS/MS peak annotations typically requires logon credentials within the web browser. multiplierz interacts with the browser to "screen scrape" MS/MS images and store them within the default report format. Users with full access to the Mascot server may parse results directly from the .DAT file using .mz scripts (multiplierz reports and .mz scripts are described below). Similar support is also provided for Protein Pilot and OMSSA.
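The comma- or dash-separated job ID syntax described above can be expanded with a few lines of Python. This is a sketch under stated assumptions: the function name `parse_job_ids` is hypothetical, and the way it handles blank tokens and descending ranges is our choice, not documented multiplierz behavior.

```python
def parse_job_ids(spec):
    """Expand a string such as '6556, 5878, 5120-5125' into a list of integers.

    Hypothetical helper for illustration; not part of multiplierz.
    """
    job_ids = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate trailing or doubled commas
        if "-" in token:
            # A dash denotes an inclusive range of consecutive job IDs.
            first, last = (int(part) for part in token.split("-", 1))
            if last < first:
                raise ValueError(f"descending range: {token!r}")
            job_ids.extend(range(first, last + 1))
        else:
            job_ids.append(int(token))
    return job_ids
```

The example list from the text, `parse_job_ids("6556, 5878, 5120-5125")`, expands to eight job IDs, with the range endpoints included.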
For maximum flexibility in conversion of parsed data from other search engines we include modules for generation of multiplierz-compatible spreadsheets.
Calculation of a false discovery rate (FDR) for peptide sequence identifications is one mechanism to assess the overall quality of search results [17, 18]. multiplierz supports calculation of an FDR upon retrieval of peptide identification data from both forward and reverse database searches. The FDR for a given score threshold is calculated as the ratio of the number of reverse database search identifications to the number from the forward plus reverse searches, counting only identifications with a score greater than or equal to the chosen threshold. The FDR thus represents the percentage of identified peptides in the forward search that would also be detected in the reverse database search. multiplierz identifies score thresholds for commonly used FDR values (1%, 2%, and 5%) and calculates the FDR for each forward peptide score via an .mz script (see below; scripts for generating a reverse database and calculating the FDR are included in Additional File 2).
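The FDR definition above translates directly into code. The sketch below is illustrative only and is not the script from Additional File 2; the function names are hypothetical. Because the FDR need not fall monotonically as the threshold rises, choosing the lowest observed score that meets the target is an assumption of this sketch.

```python
def fdr(forward_scores, reverse_scores, threshold):
    """FDR at `threshold`: reverse hits / (forward + reverse hits), score >= threshold."""
    n_forward = sum(1 for s in forward_scores if s >= threshold)
    n_reverse = sum(1 for s in reverse_scores if s >= threshold)
    total = n_forward + n_reverse
    return n_reverse / total if total else 0.0


def threshold_for_fdr(forward_scores, reverse_scores, target):
    """Lowest observed score whose FDR does not exceed `target`, or None.

    Assumption: the lowest qualifying score is the one wanted.
    """
    for score in sorted(set(forward_scores) | set(reverse_scores)):
        if fdr(forward_scores, reverse_scores, score) <= target:
            return score
    return None
```

Calling `threshold_for_fdr` with targets of 0.01, 0.02, and 0.05 gives the score cut-offs for the three commonly used FDR values.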
Correlation of identified peptide sequences with specific features in the source mass spectrometry data, such as chromatographic peak width or maximum precursor intensity, is often complicated by the requirement for users to move between disparate programs and interfaces. The multiplierz desktop environment provides users with a centralized point of interaction with both search results and the underlying mass spectrometry files. For example, high-confidence peptide identifications may be used for direct generation of RICs across user-defined time and mass-to-charge ranges. Various metrics such as full peak width at half maximum (FWHM), peak area, and apex precursor intensity for peptide elution profiles are included in the output report. As described below these data are combined, annotated, and made available in portable, user-friendly reports.
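As a rough illustration of how the reported metrics can be derived from a RIC, the sketch below computes apex intensity, peak area, and FWHM from a list of (time, intensity) pairs. It is not the multiplierz algorithm. The function name `peak_metrics`, the trapezoidal integration, the linear interpolation at half maximum, and the assumption that the RIC window holds a single peak are all ours.

```python
def peak_metrics(ric):
    """Compute apex intensity, area, and FWHM for a single-peak RIC.

    ric: list of (time, intensity) pairs sorted by time (at least two points).
    """
    times = [t for t, _ in ric]
    ints = [i for _, i in ric]
    apex_idx = max(range(len(ints)), key=ints.__getitem__)
    apex = ints[apex_idx]
    half = apex / 2.0

    # Peak area by the trapezoidal rule over the whole window.
    area = sum(
        (times[k + 1] - times[k]) * (ints[k] + ints[k + 1]) / 2.0
        for k in range(len(ric) - 1)
    )

    def crossing(k_in, k_out):
        # Linearly interpolate the time at which intensity equals half maximum
        # between a point at or above half (k_in) and one below it (k_out).
        t_in, i_in = ric[k_in]
        t_out, i_out = ric[k_out]
        return t_in + (half - i_in) * (t_out - t_in) / (i_out - i_in)

    # Walk outward from the apex until the trace drops below half maximum;
    # fall back to the window edge if it never does.
    left = times[0]
    for k in range(apex_idx, 0, -1):
        if ints[k - 1] < half:
            left = crossing(k, k - 1)
            break
    right = times[-1]
    for k in range(apex_idx, len(ints) - 1):
        if ints[k + 1] < half:
            right = crossing(k, k + 1)
            break

    return {"apex_intensity": apex, "area": area, "fwhm": right - left}
```

When the trace never falls below half maximum inside the window, the reported FWHM is only a lower bound, so the RIC time range should be chosen wide enough to contain the full elution profile.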
Generation of portable multi-file reports
Interactive and dynamic analysis of native mass spectrometry data files
Other desktop tools
Researchers are increasingly focused on integration of disparate data types in order to better understand biological phenomena at the so-called network or systems level. As a first step in support of these and similar activities, multiplierz automatically downloads GenBank data over the internet based on an identified protein list, parses information such as gene ontology and domain classification, along with the corresponding Entrez Gene, HPRD, HGNC, and OMIM entries, and then creates hyperlinks directly in the spreadsheet reports. This and other tools including an in silico protein digestion tool and a peptide fragment calculator are described in Additional File 3.
Scripting capability for user-defined customization
While multiplierz includes many built-in features and tools, we also recognize the difficulty of building a "one size fits all" application given the diversity of ideas and efforts pursued within individual research laboratories. Hence multiplierz includes a command line console as well as scripting capability (through ".mz" scripts), which together support ad hoc data analysis tasks. The scripting capability is particularly useful for niche experiments or proteomics workflows not otherwise supported by other open-source or proprietary data systems. All multiplierz tools are available through both the desktop GUI and scriptable procedures. In addition, a pre-launch initialization ("rc.mz") script enables full customization of the application and its interfaces without recompiling the underlying code.
Finally, we note that programmatic access to mzAPI allows incorporation of multiplierz into automated data-analytic pipelines. For example, users can submit jobs through a laboratory information management system (LIMS). Upon completion of LC-MS acquisition(s) and database search(es), multiplierz executes .mz scripts to access both the search results and underlying .RAW or .WIFF file(s), in order to create a spreadsheet-based report. Users can be notified by email and access their results via the multiplierz desktop environment. Importantly, multiplierz spreadsheet reports, whether generated in low- or high-throughput mode, are portable and readily formatted in accordance with journal-specific requirements for proteomics data.
Collectively the features described above facilitate a wide range of data analysis tasks for mass spectrometry-based proteomics activities from technology development and evaluation to prioritization of protein identifications for subsequent biochemical validation. Importantly, multiplierz provides these capabilities to individual users at the desktop level.
In the following sections, we demonstrate the functionality of multiplierz through relevant examples based on data and results from work in our laboratory. Significantly we note that these examples encompass data generated on mass spectrometers manufactured by ThermoFisher Scientific and AB-SCIEX.
Optimization of LC Assemblies and Methods
Figure 2 (see above) shows an example of a multiplierz standard format report. To simplify the display we generated a comparison report (using multiplierz) for the two extremes in the 11 LC-MS acquisitions described above. The insets show examples of optional embedded images. We note that, unlike many web-based reports that often require frequent page updates, multiplierz images display immediately upon mouse-over, and hence facilitate rapid data validation and interrogation exercises.
Optimization of Phosphopeptide Enrichment Methods
Improved Peptide Sequence Assignment via De-isotoped MS/MS Spectra
Table: Improved Peptide Sequence Assignment via De-isotoped MS/MS Spectra (columns: # Peptides Before Deisotoping; # Peptides After Deisotoping; # New Peptides; % New Peptides).
Label-Free Quantitative Proteomics
Automated Quality Control of Mass Spectrometer Instrument Performance
We recognize that some aspects of our proposal diverge from current efforts to establish community standards in proteomics. For example, the use of mzAPI within multiplierz to provide direct access to binary mass spectrometry files does not rely on XML-based surrogate files. We note, however, that the two strategies are not mutually exclusive; that is, support for mzXML, or the recently described mzML, can be readily incorporated into mzAPI. Similarly, output from multiplierz can be readily formatted in pepXML. In addition, recent discussions focused on data sharing in proteomics suggest that standards may evolve beyond XML-based formats [30, 31]. Equally important, the emergence of translation layers such as cygwin and Wine continues to blur inter-platform boundaries, such that software solutions amenable to the widest audience may eclipse those based largely on platform independence. In fact, our use of Microsoft Excel as the default report output for multiplierz is one such example. Similar image-enhanced spreadsheets may be generated in open formats such as OpenOffice.org XML (see Additional File 4), but our experience to date indicates that the majority of biomedical researchers still opt for commercial spreadsheet solutions, either out of familiarity or because of existing institutional support.
The multiplierz framework is accessible to a wide range of researchers, and simultaneously provides support for novel algorithm development as well as deployment of automated data pipelines. As a central point of integration for information from publicly available databases and native data from proprietary instrument platforms, multiplierz offers a compelling addition to the ongoing discourse aimed at identifying an effective means to enable broad access and data exchange in the proteomics community. In particular, incorporation of mzAPI into the multiplierz desktop architecture may offer a better impedance match between the rate of proprietary mass spectrometry innovation and researchers' demands for increased autonomy in their data analysis tasks.
Availability and Requirements
This work was supported by the Dana-Farber Cancer Institute and the National Human Genome Research Institute (P50HG004233). JP was supported by National Science Foundation Integrative Graduate Education and Research Traineeship (IGERT) grant DGE-0654108. Eric D. Smith provided valuable assistance in preparation of figures and critical reading of this manuscript.
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422(6928):198–207. 10.1038/nature01511
- Kohlbacher O, Reinert K, Gropl C, Lange E, Pfeifer N, Schulz-Trieglaff O, Sturm M: TOPP--the OpenMS proteomics pipeline. Bioinformatics 2007, 23(2):e191–197. 10.1093/bioinformatics/btl299
- Kessner D, Chambers M, Burke R, Agus D, Mallick P: ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 2008, 24(21):2534–2536. 10.1093/bioinformatics/btn323
- Colinge J, Masselot A, Carbonell P, Appel RD: InSilicoSpectro: An Open-Source Proteomics Library. Journal of Proteome Research 2006, 5(3):619–624. 10.1021/pr0504236
- Prince JT, Marcotte EM: mspire: mass spectrometry proteomics in Ruby. Bioinformatics 2008, 24(23):2796–2797. 10.1093/bioinformatics/btn513
- Orchard S, Taylor C, Hermjakob H, Zhu W, Julian R, Apweiler R: Current status of proteomic standards development. Expert Rev Proteomics 2004, 1(2):179–183. 10.1586/14789450.1.2.179
- Pedrioli PG, Eng JK, Hubley R, Vogelzang M, Deutsch EW, Raught B, Pratt B, Nilsson E, Angeletti RH, Apweiler R, et al.: A common open representation of mass spectrometry data and its application to proteomics research. Nat Biotechnol 2004, 22(11):1459–1466. 10.1038/nbt1031
- Askenazi M, Parikh JR, Marto JA: mzAPI: a new strategy for efficiently sharing mass spectrometry data. Nat Methods 2009, 6(4):240–241. 10.1038/nmeth0409-240
- Lin SM, Zhu L, Winter AQ, Sasinowski M, Kibbe WA: What is mzXML good for? Expert Rev Proteomics 2005, 2(6):839–845. 10.1586/14789450.2.6.839
- Keller A, Eng J, Zhang N, Li XJ, Aebersold R: A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol Syst Biol 2005, 1: 2005 0017. 10.1038/msb4100024
- Smith CA, Want EJ, O'Maille G, Abagyan R, Siuzdak G: XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 2006, 78(3):779–787. 10.1021/ac051437y
- Craig R, Beavis RC: TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20(9):1466–1467. 10.1093/bioinformatics/bth092
- Shilov IV, Seymour SL, Patel AA, Loboda A, Tang WH, Keating SP, Hunter CL, Nuwaysir LM, Schaeffer DA: The Paragon Algorithm, a Next Generation Search Engine That Uses Sequence Temperature Values and Feature Probabilities to Identify Peptides from Tandem Mass Spectra. Mol Cell Proteomics 2007, 6(9):1638–1655. 10.1074/mcp.T600050-MCP200
- Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, Yang X, Shi W, Bryant SH: Open mass spectrometry search algorithm. J Proteome Res 2004, 3(5):958–964. 10.1021/pr0499491
- Moore RE, Young MK, Lee TD: Qscore: an algorithm for evaluating SEQUEST database search results. J Am Soc Mass Spectrom 2002, 13(4):378–386. 10.1016/S1044-0305(02)00352-5
- Kall L, Storey JD, MacCoss MJ, Noble WS: Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J Proteome Res 2008, 7(1):29–34. 10.1021/pr700600n
- Bradshaw RA, Burlingame AL, Carr S, Aebersold R: Reporting protein identification data: the next generation of guidelines. Mol Cell Proteomics 2006, 5(5):787–788. 10.1074/mcp.E600005-MCP200
- Ficarro SB, Zhang Y, Lu Y, Moghimi AR, Askenazi M, Hyatt E, Smith ED, Boyer L, Schlaeger TM, Luckey CJ, et al.: Improved electrospray ionization efficiency compensates for diminished chromatographic resolution and enables proteomics analysis of tyrosine signaling in embryonic stem cells. Anal Chem 2009, 81(9):3440–3447. 10.1021/ac802720e
- Steen H, Kuster B, Fernandez M, Pandey A, Mann M: Detection of tyrosine phosphorylated peptides by precursor ion scanning quadrupole TOF mass spectrometry in positive ion mode. Anal Chem 2001, 73(7):1440–1448. 10.1021/ac001318c
- Olsen JV, Macek B, Lange O, Makarov A, Horning S, Mann M: Higher-energy C-trap dissociation for peptide modification analysis. Nat Methods 2007, 4(9):709–712. 10.1038/nmeth1060
- Ficarro SB, Parikh JR, Blank NC, Marto JA: Niobium(V) oxide (Nb2O5): application to phosphoproteomics. Anal Chem 2008, 80(12):4606–4613. 10.1021/ac800564h
- Zhang Y, Ficarro SB, Li S, Marto JA: Optimized Orbitrap HCD for quantitative analysis of phosphopeptides. J Am Soc Mass Spectrom 2009, 20(8):1425–1434. 10.1016/j.jasms.2009.03.019
- Wehofsky M, Hoffmann R: Automated deconvolution and deisotoping of electrospray mass spectra. J Mass Spectrom 2002, 37(2):223–229. 10.1002/jms.278
- America AH, Cordewener JH: Comparative LC-MS: a landscape of peaks and valleys. Proteomics 2008, 8(4):731–749. 10.1002/pmic.200700694
- Bondarenko PV, Chelius D, Shaler TA: Identification and relative quantitation of protein mixtures by enzymatic digestion followed by capillary reversed-phase liquid chromatography-tandem mass spectrometry. Anal Chem 2002, 74(18):4741–4749. 10.1021/ac0256991
- Kaiser NK, Anderson GA, Bruce JE: Improved mass accuracy for tandem mass spectrometry. J Am Soc Mass Spectrom 2005, 16(4):463–470. 10.1016/j.jasms.2004.12.005
- Deutsch E: mzML: a single, unifying data format for mass spectrometer output. Proteomics 2008, 8(14):2776–2777. 10.1002/pmic.200890049
- Rodriguez H: International summit on proteomics data release and sharing policy. J Proteome Res 2008, 7(11):4609. 10.1021/pr800779q
- Cottingham K: Proteomics researchers now agree on some aspects of data sharing. J Proteome Res 2008, 7(11):4612. 10.1021/pr800781d
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.