GNU polyxmass: a software framework for mass spectrometric simulations of linear (bio-)polymeric analytes
© Rusconi; licensee BioMed Central Ltd. 2006
Received: 01 February 2006
Accepted: 27 April 2006
Published: 27 April 2006
Nowadays, a variety of (bio-)polymers can be analyzed by mass spectrometry. The detailed interpretation of the spectra requires a huge number of "hypothesis cycles", each comprising the following three actions: 1) put forth a structural hypothesis, 2) test it, 3) (in)validate it. This time-consuming and painstaking data scrutiny is alleviated by using specialized software tools. However, all the software tools available to date are polymer chemistry-specific. This imposes a heavy overhead on researchers who do mass spectrometry on a variety of (bio-)polymers, as each polymer type requires a different software tool to perform data simulations and analyses. We developed software to address the lack of an integrated framework able to deal with different polymer chemistries.
The GNU polyxmass software framework performs common (bio-)chemical simulations, along with simultaneous mass spectrometric calculations, for any kind of linear bio-polymeric analyte (DNA, RNA, saccharides or proteins). The framework is organized into three modules, all accessible from one single binary program. The modules let the user 1) define brand new polymer chemistries, 2) perform quick mass calculations using a desktop calculator paradigm, 3) graphically edit polymer sequences and perform (bio-)chemical/mass spectrometric simulations. Any aspect of the mass calculations, polymer chemistry reactions or graphical polymer sequence editing is configurable.
The scientist who uses mass spectrometry to characterize (bio-)polymeric analytes of different chemistries is provided with a single software framework for data prediction and analysis, whatever the polymer chemistry involved.
Mass spectrometry has proven essential in structural studies involving biopolymer molecules of a variety of polymer chemistries. Indeed, while proteins were once the main biopolymeric analytes studied by mass spectrometry, oligo(deoxy)ribonucleotides and saccharides are also routinely analyzed today, and mass spectrometry is used, for example, for the characterization of DNA-protein complexes or for the gas phase sequencing of saccharides (for reviews, see [1–4]). The current and ever-increasing variety of mass spectrometer designs affords a rather large array of experiments that can be performed on different biopolymers. Thus, the variety of polymer chemistries analyzable by mass spectrometry is compounded by the variety of mass spectrometric experiments, producing an extremely diverse set of mass data to be either predicted or analysed with the help of appropriate software tools. It is noteworthy that, while some experiments are almost completely automatable (as in high-throughput proteomics), a majority of the experiments performed in mass spectrometry facilities are neither automatable nor high-throughput. For example, one field of biochemistry that has massively benefited from the mass spectrometry improvements over the last ten years is the fine structural characterization of chemically modified biopolymers, such as the post-translational modification studies in protein biochemistry (see  for a review). Studies like the ones described in  and  were not automatable at the time they were performed, and they are still not automatable today: flexible tools are still needed to help human-driven interpretation of mass spectrometric data obtained on structurally complex biological analytes.
As soon as biopolymers were successfully analyzed by mass spectrometry, a number of tools were made available to perform calculations on proteins or nucleic acids. These tools might be used across the network, such as the tools accessible at the Expasy portal  or the ones published in . Locally-installed software is also available, with the massXpert software , the GPMAW program (reviewed in ) or the SOS program . One problem with the current situation, though, is that each of these programs was designed to work with a given polymer chemistry; thus, each time a mass spectrometric analysis is performed on a different biopolymer, a different tool has to be used. The GNU polyxmass software framework was designed to solve this problem by providing a number of modules where the user might 1) define any number of polymer chemistries, 2) use polymer chemistry definitions to perform quick mass calculations and 3) graphically edit polymer sequences and perform complex simulations of (bio-)chemical reactions and mass spectrometric experiments.
Creating mass spectrometry software usable with any kind of polymer chemistry required designing and implementing an abstraction layer between the graphical sequence editor on the one hand and the simulation and mass computation engines on the other. The implementation of this abstraction layer takes the final form of a number of packages, as described below.
Software development and source tree architecture
The software is developed on a Debian GNU/Linux platform in the "testing/etch" version. This project is structured in a number of packages, two of which are binary packages (libpolyxmass and polyxmass-bin) and two of which are data packages (polyxmass-common and polyxmass-data).
The binary packages
The data packages
The polyxmass-common and polyxmass-data packages contain polymer chemistry data files and example polymer sequence files (XML format files parsed using the libxml2 library). The former package is considered essential and contains the reference atom definition file (atoms.xml) along with all the files that together make up the "protein" polymer chemistry definition. The latter package is optional and contains the polymer chemistry definition data for polymer types "dna", "rna" and "saccharide".
Configuration data are stored in simple text format files which are read by custom-written functions (in libpolyxmass).
Code is subjected to versioning control (using the tla implementation of the GNU arch specification) and is available from http://arch.debian.org
Before describing the details of the software project, it is necessary to set out the terminological decisions we had to make. Indeed, each bio-polymer chemistry has its own set of specific terms to refer to chemical entities in its realm. For example, sugars have "reducing ends" and "monoses", proteins have "N-terminal" and "C-terminal" ends and "residues", and nucleic acids have "3'-OH" and "5'-P" ends and "nucleotides". During the development of GNU polyxmass, we promoted the use of a unified glossary in describing bio-polymer entities, so as to be as generic as possible. "Monomer" refers to the chemical entity that has become part of a polymer sequence chain, that is, after its polymerization (equivalent to "residue" in protein chemistry). The extremities of the polymer sequence are called "left end" and "right end" (equivalent to the protein chemistry "N-terminus" and "C-terminus", respectively). The chemical groups that cap the polymer sequence are called "caps". There is one chemical cap per end. Finally, the equivalent of a protein sequence is called a "polymer sequence" and the equivalent of an oligonucleotide is called an "oligomer".
Results and discussion
The software complexity has been concealed behind an intuitive graphical user interface. GNU polyxmass is mainly operated through a single binary program (polyxmass binary file) that integrates three distinct functionalities organized in modules that can be conveniently called through menu items:
polyxdef: this module lets the user perform the definition of atoms and of brand new polymer chemistries (their file formats are described below);
polyxcalc: this module provides a desktop polymer chemistry-aware mass calculator in which mass computations can be performed without the need to edit polymer sequences;
polyxedit: this module provides the main functionalities of the software framework. It provides a polymer sequence editor in an environment where the user will trigger all the biochemical and mass spectrometric simulations through simple menu interactions.
In the following section, the operation of each module will be reviewed, along with a description of its tasks.
The polyxdef module
Atom definition file format
No mass is ever hard-coded in the software: in GNU polyxmass, any entity that is ponderable (i.e., that has a molecular weight) must have a formula associated with it. When masses are computed, formulas are checked against an atom definition file that lists, for each chemical element, its corresponding isotopic mass/abundance pairs. The following excerpt from the atoms.xml file in the distribution shows the XML structure that was used to allow the dynamic generation of fully characterized atoms.
Each chemical element has a monoisotopic mass that corresponds to the lightest isotope's mass; the average mass is computed by considering all the isotopes of the chemical element. The atom average masses are computed dynamically once, when the atom definition file is loaded from disk. There can be any number of distinct atom definition files on the system.
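The two mass types described above can be illustrated with a minimal sketch. This is not the libpolyxmass code: the `Isotope` class and the function names are hypothetical, and the carbon isotope values are standard reference figures rather than data taken from atoms.xml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Isotope:
    mass: float       # isotopic mass
    abundance: float  # fractional natural abundance


def monoisotopic_mass(isotopes):
    # Follows the convention stated in the text: the lightest isotope's mass.
    return min(iso.mass for iso in isotopes)


def average_mass(isotopes):
    # Abundance-weighted mean over all isotopes of the element.
    total = sum(iso.abundance for iso in isotopes)
    return sum(iso.mass * iso.abundance for iso in isotopes) / total


# Illustrative mass/abundance pairs for carbon (assumed reference values).
CARBON = [Isotope(12.0, 0.9893), Isotope(13.0033548, 0.0107)]
```

Because the average depends only on the mass/abundance pairs, computing it once at load time, as the text describes, is sufficient.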
Polymer chemistry definition file format
The polymer chemistry definition is the most evident mechanism by which the sequence editor is fully uncoupled from the mass calculation engine. Each polymer chemistry type ("protein", for example) has such a definition file, where all aspects of its chemistry are documented. The following code excerpts were taken from the protein.xml file in the distribution. They illustrate the way in which the polymer chemistry definition is divided into XML nodes documenting the different chemical entities that make up such a definition: monomers, modifications, chemical/enzymatic cleavages, gas-phase fragmentations, chemical end caps, ionization rule. All of these entities are reviewed below.
Because polymers are the result of the concatenation of monomers, listing the monomers that might enter into the composition of the polymer sequence is the first step in defining a polymer chemistry. Monomers are defined as having a name, a code and a formula. The <codelen> element specifies the maximum number of characters allowed in a monomer code. This value has a polymer chemistry definition scope. One common arbitrary limitation of mass spectrometry programs is that the sequence can be edited using only one-letter monomer codes (thus limiting the number of codes to 26). GNU polyxmass removes this limitation with algorithms that make it possible to use any number of alphabetical characters to form a monomer code. The syntactic rule governing the formation of a code is that the first letter must be uppercase and all the remaining ones lowercase. This is particularly useful when working with modified monomers, in which case using more letters may help differentiate monomers with different modifications. For example, with a <codelen> element of value 2, "Y" could be used for unmodified tyrosyl residues, "Yp" after phosphorylation and "Ys" after sulphation.
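The code syntax rule above (one uppercase letter followed by lowercase letters, up to <codelen> characters in total) makes a sequence string unambiguous to split. The sketch below shows one possible way to do this; it is not the algorithm used in GNU polyxmass, and `split_codes` and its parameters are hypothetical names.

```python
import re


def split_codes(sequence, codelen, known_codes):
    """Split a sequence string into monomer codes.

    A code is one uppercase letter followed by at most (codelen - 1)
    lowercase letters, and must be listed in known_codes.
    """
    pattern = re.compile(r"[A-Z][a-z]{0,%d}" % (codelen - 1))
    codes = []
    pos = 0
    while pos < len(sequence):
        match = pattern.match(sequence, pos)
        if match is None:
            raise ValueError(f"invalid character at position {pos}")
        code = match.group()
        if code not in known_codes:
            raise ValueError(f"unknown monomer code {code!r}")
        codes.append(code)
        pos = match.end()
    return codes


# Hypothetical code set with phospho- and sulpho-tyrosine variants.
CODES = {"A", "G", "S", "Y", "Yp", "Ys"}
print(split_codes("AYpSYsG", 2, CODES))
```

A code longer than <codelen> is rejected because the extra lowercase letter cannot start a new code.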
Modifications are also polymer chemistry-specific. Note the peculiar formula that characterizes them, which is called an "actform", short for "action-formula". Upon modification of the polymer sequence, the net formula is computed first, by subtracting the atoms prefixed with the '-' sign from the atoms prefixed with the '+' sign (if any); the net mass change is then derived from that net formula. This notation has the advantage of being more meaningful from a chemical reaction standpoint, as it closely represents the reaction, with the interplay of the leaving and the entering groups at the target biopolymer molecule level.
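As an illustration of the actform notation, the following sketch computes a net formula and the corresponding monoisotopic mass change. It is a simplified reading of the notation, not the GNU polyxmass parser: the function names are hypothetical, an unsigned term is assumed to be additive, and the small mass table holds standard reference values instead of being loaded from an atom definition file.

```python
import re
from collections import Counter

# Monoisotopic masses (assumed reference values, normally read from atoms.xml).
MONO = {"H": 1.00782503, "C": 12.0, "N": 14.0030740,
        "O": 15.9949146, "P": 30.9737615, "S": 31.9720707}

TERM = re.compile(r"([+-]?)([A-Za-z0-9]+)")
ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Return element counts for a plain formula such as 'CH2S'."""
    counts = Counter()
    pos = 0
    while pos < len(formula):
        match = ELEMENT.match(formula, pos)
        if match is None:
            raise ValueError(f"bad formula {formula!r}")
        counts[match.group(1)] += int(match.group(2) or 1)
        pos = match.end()
    return counts


def net_formula(actform):
    """Return signed element counts for an actform such as '- CH2S + O'."""
    compact = "".join(actform.split())  # whitespace is not significant
    net = Counter()
    pos = 0
    while pos < len(compact):
        match = TERM.match(compact, pos)
        if match is None:
            raise ValueError(f"bad actform {actform!r}")
        sign = -1 if match.group(1) == "-" else 1
        for element, count in parse_formula(match.group(2)).items():
            net[element] += sign * count
        pos = match.end()
    return dict(net)


def mass_delta(actform):
    """Monoisotopic mass change brought about by the actform."""
    return sum(MONO[el] * n for el, n in net_formula(actform).items())
```

With these values, `mass_delta("- CH2S + O")` evaluates to roughly -29.993, the mass difference between a methionyl and a homoseryl residue.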
Enzymatic/chemical cleavage agents (<cls> elements) belong to the polymer chemistry definition. The example of cyanogen bromide, as a cleaving chemical agent, illustrates the flexibility of the definition language: when a protein sequence is cleaved with cyanogen bromide, the methionyl residue that got cleaved at its C-terminal side is chemically transformed to a homoseryl residue. That reaction is described using the actform "- CH2S + O" and is only applied to the generated peptides if the methionyl residue is located at its C-terminus (that is, the right end of the oligomer; see the <re-mnm-code> element above).
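The cyanogen bromide case can be sketched as a total cleavage followed by a right-end check on each resulting oligomer. This is a deliberately simplified model under stated assumptions: partial cleavages are ignored, the rule is keyed only on the right-end monomer code (mirroring the <re-mnm-code> condition), and `cleave` and its arguments are hypothetical names, not the polyxmass API.

```python
def cleave(codes, cleave_after, right_end_rules):
    """Cleave a list of monomer codes after each code in cleave_after.

    Returns (oligomer, actform) pairs, where actform is the chemical
    change to apply when the oligomer's right-end monomer has a rule,
    or None otherwise.
    """
    if not codes:
        return []
    oligomers = []
    start = 0
    for index, code in enumerate(codes):
        # No cut after the very last monomer: nothing lies to its right.
        if code in cleave_after and index < len(codes) - 1:
            oligomers.append(codes[start:index + 1])
            start = index + 1
    oligomers.append(codes[start:])
    return [("".join(o), right_end_rules.get(o[-1])) for o in oligomers]


# Cyanogen bromide: cleaves right of 'M'; Met becomes homoserine.
peptides = cleave(list("AMGKMTR"), {"M"}, {"M": "-CH2S+O"})
print(peptides)
```

Each returned actform would then be applied to the mass of its oligomer only, leaving the other oligomers untouched.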
Gas-phase fragmentation patterns ("specifications"; <fgs> elements) are defined using a powerful syntax. A fragmentation specification lists the following data:
name: this is the name of the fragmentation pattern, and will be used to craft each fragment's name;
end: this datum specifies which side of the polymer chain will make the fragment after the fragmentation occurred. In protein chemistry, the 'a' fragmentation pattern describes the ions corresponding to the sequence portion left of the fragmentation location . In GNU polyxmass, three values are acceptable: "LE" for left end, "RE" for right end and "NE" for specific fragmentation cases like immonium ions in protein chemistry;
fgr: any number of fragmentation rules ("fragrule"; <fgr> elements) can be defined that will be applied to the fragment ions if sequence topological conditions are met. The example a-fgr-1 fragrule states that if fragmentation occurs at a monomer of code 'D' (<this-mnm-code> element) AND the previous monomer in the sequence has a code 'E' AND the next monomer in the sequence has a code 'F', then the actform "-H2O" should be applied to the fragment ion.
This fragmentation specification grammar is flexible enough to allow the description of highly complex fragmentation patterns such as those observed in saccharidic gas-phase fragmentations . Indeed, fragmentation patterns that depend on the identity of the monomer at which the fragmentation occurs (and, sometimes, also the identity of the bordering monomers) can be modelled using the provided grammar. For example, the "a-B" fragmentation pattern observed in oligonucleotide fragmentation is easily supported in GNU polyxmass. The generation of internal fragments is not supported yet.
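The way a fragrule is tested against the sequence context can be sketched as follows. The `FragRule` class and `rule_applies` function are hypothetical, and treating an unset condition as "no constraint" is an assumption of this sketch rather than documented polyxmass behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FragRule:
    name: str
    actform: str
    prev_code: Optional[str] = None  # required code of the previous monomer
    this_code: Optional[str] = None  # required code at the fragmentation site
    next_code: Optional[str] = None  # required code of the next monomer


def rule_applies(rule, codes, index):
    """True if every condition set on the rule holds at codes[index]."""
    prev_code = codes[index - 1] if index > 0 else None
    next_code = codes[index + 1] if index + 1 < len(codes) else None
    checks = [
        (rule.prev_code, prev_code),
        (rule.this_code, codes[index]),
        (rule.next_code, next_code),
    ]
    # Conditions are combined with a logical AND, as in the a-fgr-1 example.
    return all(want is None or want == have for want, have in checks)


A_FGR_1 = FragRule("a-fgr-1", "-H2O", prev_code="E", this_code="D", next_code="F")
sequence = list("AEDFK")
hits = [i for i in range(len(sequence)) if rule_applies(A_FGR_1, sequence, i)]
print(hits)
```

Only the fragment produced at a matching position would receive the rule's actform; all other fragments keep the mass given by the base fragmentation pattern.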
End cap chemistry
The mere concatenation of monomers only yields a residual chain of monomers, and not a polymer in its finished polymerization state. Thus, it is necessary to provide chemical definitions of how the residual monomeric chain gets capped to finish the polymerization. In protein chemistry, the left (<leftcap> element) and right (<rightcap> element) caps are the proton and the hydroxyl group, respectively.
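The role of the caps in the mass calculation can be shown with a short sketch: the mass of the finished polymer is the sum of the monomer masses plus the masses of the two caps. The function names and the two-entry monomer table are hypothetical, and the monoisotopic masses are standard reference values rather than data read from the distribution files.

```python
import re

# Monoisotopic masses (assumed reference values).
MONO = {"H": 1.00782503, "C": 12.0, "N": 14.0030740, "O": 15.9949146}
ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def formula_mass(formula):
    """Monoisotopic mass of a plain formula such as 'C2H3NO'."""
    mass = 0.0
    pos = 0
    while pos < len(formula):
        match = ELEMENT.match(formula, pos)
        if match is None:
            raise ValueError(f"bad formula {formula!r}")
        mass += MONO[match.group(1)] * int(match.group(2) or 1)
        pos = match.end()
    return mass


def polymer_mass(codes, monomers, left_cap, right_cap):
    """Residual chain mass plus the left and right cap masses."""
    chain = sum(formula_mass(monomers[code]) for code in codes)
    return chain + formula_mass(left_cap) + formula_mass(right_cap)


# Glycyl and alanyl residues; protein caps are H (left) and OH (right).
MONOMERS = {"G": "C2H3NO", "A": "C3H5NO"}
mass_ga = polymer_mass(["G", "A"], MONOMERS, "H", "OH")
```

For the dipeptide Gly-Ala this gives a monoisotopic mass close to 146.069, that is, the two residues plus one water molecule.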
A default ionization rule is defined in the <ionizerule> element according to the following scheme:
actform: this is the chemical representation of the ionization reaction. For proteins, protonation is by far the most often used ionization mechanism, while for a number of polymers (including synthetic ones) the best ionization mechanism is not protonation, but cationization with metal ions;
charge: this is the charge that is brought to the polymer sequence after the ionization reaction took place. Protonation-based ionizations bring a charge of one;
level: this is the number of times that the ionization reaction should be performed on the polymer sequence. Usually mono-protonation is the preferred ionization level for peptides.
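The three data of the ionization rule combine into an m/z value as sketched below. This is an illustrative reading of the scheme, not the polyxmass implementation: `IonizeRule` stores the mass change of the actform directly, and the proton is approximated by the mass of a hydrogen atom (electron mass neglected), which is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IonizeRule:
    actform_mass: float  # mass change of one ionization reaction
    charge: int          # charge brought by one ionization reaction
    level: int           # number of times the reaction is performed


def mass_to_charge(neutral_mass, rule):
    """m/z of an analyte of the given neutral mass after ionization."""
    total_charge = rule.charge * rule.level
    if total_charge == 0:
        raise ValueError("an ionized species needs a non-zero charge")
    ion_mass = neutral_mass + rule.level * rule.actform_mass
    return ion_mass / abs(total_charge)


HYDROGEN = 1.00782503  # hydrogen atom mass used for the "+H" actform
mono_protonated = mass_to_charge(1000.0, IonizeRule(HYDROGEN, 1, 1))
di_protonated = mass_to_charge(1000.0, IonizeRule(HYDROGEN, 1, 2))
```

Changing the ionization agent amounts to swapping in another rule, for example one whose actform mass is that of a sodium cation.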
Once a polymer chemistry has been defined, its definition file might be used to compute masses with the calculator (polyxcalc module) or to edit sequences and to trigger biochemical and mass spectrometric simulations on these (polyxedit module).
The polyxcalc module
The polyxedit module
The polyxedit module is the most feature-rich module of the program. This section first describes the critical mechanisms that ensure that all the simulations performed with polymer sequences use the proper polymer chemistry definition. Next, the different functionalities provided by polyxedit are reviewed briefly.
Chemical consistency between a sequence and the biochemical and mass spectrometric simulations
The non-shaded part of the figure shows the mechanisms by which the program ensures, upon opening of a polymer sequence file, that the proper polymer chemistry definition is available, either already in memory or accessible on the disk. Indeed, a polymer chemistry definition is absolutely required in order to interpret correctly the sequence data contained in the sequence file: a given sequence ("ATGC", for example) might mean different things depending on the polymer chemistry for which it was edited (that is, this sequence could be a nucleic acid sequence or a protein sequence).
The action "Open sequence file" triggers code that inspects the polymer sequence file in search of a <type> XML element that documents the polymer chemistry definition of the sequence. That discovery process yields, in our example, "protein" as the polymer chemistry definition type. Thus, the sequence being opened is a protein sequence. The algorithm checks whether the polymer chemistry definition is already available in the global array of polymer chemistry definitions. If so, there is no need to load that chemistry definition file into memory and the program goes on to load the polymer sequence. If not, the program first has to load the polymer chemistry definition from a file. The name of that file is discovered by reading a dictionary file that links a polymer chemistry definition type ("protein") to the location of its corresponding file on disk. When the polymer chemistry definition file is successfully loaded, it is made available to the whole program by storing its pointer in the global array of polymer chemistry definitions mentioned above. The loading of the polymer sequence can then continue and, if successful, the sequence will point to the polymer chemistry definition just loaded. This mechanism enforces the rule that any given sequence loaded from disk has to be attached to the proper polymer chemistry definition, thus ensuring that the biochemical and mass spectrometric simulations performed on the sequence actually make use of the proper chemical entities as defined in the right polymer chemistry definition file.
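The lookup-then-load logic described above can be sketched as a small registry. This is a schematic Python rendering, not the program's actual code: the `DefinitionRegistry` class, the injected `loader` callable, the file path and every XML element name other than <type> are hypothetical.

```python
import xml.etree.ElementTree as ET


class DefinitionRegistry:
    """Keeps one loaded polymer chemistry definition per polymer type."""

    def __init__(self, catalogue, loader):
        self._catalogue = catalogue  # polymer type -> definition file path
        self._loader = loader        # callable that loads a definition file
        self._loaded = {}            # definitions already in memory

    def definition_for(self, sequence_xml):
        # Step 1: discover the polymer type documented in the sequence file.
        type_node = ET.fromstring(sequence_xml).find(".//type")
        if type_node is None or not (type_node.text or "").strip():
            raise ValueError("sequence data has no <type> element")
        polymer_type = type_node.text.strip()
        # Step 2: reuse the definition if it is already in memory.
        if polymer_type not in self._loaded:
            # Step 3: otherwise locate its file through the catalogue and load it.
            if polymer_type not in self._catalogue:
                raise LookupError(f"no definition file for {polymer_type!r}")
            path = self._catalogue[polymer_type]
            self._loaded[polymer_type] = self._loader(path)
        return self._loaded[polymer_type]


# Hypothetical sequence document; only <type> is named in the text.
SEQUENCE_XML = "<polseqdata><type>protein</type><polseq>ATGC</polseq></polseqdata>"
```

Opening a second sequence of the same type returns the same definition object, so every sequence of a given chemistry shares one set of chemical entities.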
Graphical editing of polymer sequences
Chemical modifications of the monomers can be rendered graphically by compositing a transparent pixmap onto the monomer icon to be modified. For example, in the left sequence of Figure 6, the 'S' monomer icon was modified by compositing onto it a transparent pixmap with a red 'P' graphical element (the seryl residue is phosphorylated). This rendering is fully configurable, because the transparent graphics files representing chemical modifications can, as described above, be freely modified or created anew.
When a polymer sequence is opened in the sequence editor, as shown in Figure 6, all the simulations and computations are automatically made available in the editor menus. The simulations and computations that are available through the sequence editor menus are:
Chemical modifications of individual monomers or of the polymer sequence as a whole on its left/right ends (see Figure 7 for an overview);
Chemical/enzymatic cleavage of a polymer sequence, with automatic fully configurable mass calculations (see Figure 8);
Gas-phase fragmentation of an oligomer, with automatic fully configurable mass calculations. The process here is in every respect comparable to what was described above for the cleavage of polymer sequences;
Mass-to-charge (m/z) ratio calculations with the ability to change the chemical ionization agent on the fly (see Figure 9);
Calculation of the composition of a polymer sequence (both monomeric and elemental);
Isoelectric point calculations of polymer sequences, optionally taking into account chemical modifications (see Figure 10 for an overview);
Matching operations between data extracted from a real mass spectrum and data generated in silico by GNU polyxmass. The procedure is to deisotope the mass spectrum and to compute the centroid for the remaining peaks. All the centroid peaks are listed in a peak list. This list is then fed into GNU polyxmass, and all the masses it contains can be matched against a GNU polyxmass-generated theoretical mass spectrum;
Annotation of the whole polymer sequence or of single monomers;
Fully customizable find/replace operations;
Full reporting of all the results in a spreadsheet-friendly ASCII format.
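The matching operation listed above, between an experimental peak list and in silico masses, can be sketched as a tolerance search. The absolute mass tolerance and the `match_peaks` function are assumptions made for this illustration; the text does not specify how GNU polyxmass defines a match.

```python
from bisect import bisect_left, bisect_right


def match_peaks(experimental, theoretical, tolerance):
    """Pair each experimental mass with theoretical masses within tolerance.

    Returns a list of (experimental, theoretical) tuples.
    """
    sorted_theoretical = sorted(theoretical)
    matches = []
    for mass in experimental:
        # Binary search for the window [mass - tolerance, mass + tolerance].
        low = bisect_left(sorted_theoretical, mass - tolerance)
        high = bisect_right(sorted_theoretical, mass + tolerance)
        for candidate in sorted_theoretical[low:high]:
            matches.append((mass, candidate))
    return matches


# Made-up centroid peak list and theoretical masses, for illustration only.
peak_list = [1000.52, 1500.00]
predicted = [1499.00, 1000.50, 1200.30]
print(match_peaks(peak_list, predicted, 0.05))
```

Sorting the theoretical masses once keeps each lookup logarithmic, which matters when a cleavage with partial cleavages generates many oligomers.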
The GNU polyxmass software is an integrated mass spectrometric software framework that allows biochemists and mass spectrometrists to perform biochemical and mass spectrometric simulations on polymer sequences of any polymer chemistry. The software design that drove the implementation of GNU polyxmass constitutes an improvement over the current situation, in which each time a mass spectrometric experiment is performed on a different polymer chemistry (protein, nucleic acid or saccharide, for example), the data it yields have to be analyzed using a different software tool. We find GNU polyxmass useful in our day-to-day mass spectrometric work on proteins and peptides, serving as a mass data prediction and mass data analysis tool. We have also used GNU polyxmass successfully as an education tool, as it behaves exactly like a mass spectrometer, allowing a great number of mass spectrometry concepts to be explored virtually. This software framework was implemented with configurability and flexibility in mind and with the idea that no arbitrary limit should be imposed on the user. It is published under a Free Software license in the hope of forming a community around it that will further its adoption and development as a research and education tool.
The GNU polyxmass software framework will be made extensible by use of the Python scripting language. Further, it is envisaged to integrate a mass spectrum visualization tool that will be able to render mass spectrometric data files in the recently published XML-based open formats mzXML  and mzData [16, 17]. Easy graphical differential display between theoretically and experimentally obtained mass spectra might thus be made possible in GNU polyxmass. Features like cross-links and ramification of polymers are being elaborated upon and might find their way into future releases.
Availability and requirements
This work is dedicated to Dr Richard M. Stallman of the Free Software Foundation. The author wishes to extend special thanks to Dr Jeff Bizzarro (Bioinformatics.org, Inc., Boston, MA) for providing infrastructure for the world-wide development/distribution of GNU polyxmass. A number of contributors to the project (packaging, suggestions, bug reporting) are acknowledged in the software distribution. Dr María Cecilia Angulo (CNRS, Paris, France) is warmly thanked for innumerable enlightening discussions. Dr Lionel Elie Mamane (Institute for Computing and Information Science, Radboud University Nijmegen, The Netherlands) and Dr Mark Tracy (Senior Chemist, Dionex Corporation, USA; GNU polyxmass has no link with Dionex Corporation) are thanked for sharing software engineering knowledge and for interesting discussions about the architecture of the project. Dr Philippe Garrigues (CNRS, Talence, France) and Prof J.S. Sun (MNHN, Paris, France) are acknowledged for support. Mrs Jeanne Jordanov (Délégation Régionale du CNRS, Talence, France) is acknowledged for formally allowing publication of this work under the GNU General Public License.
- Rudd P, Dwek R: Rapid, sensitive sequencing of oligosaccharides from glycoproteins. Curr Opin Biotechnol 1997, 8(4):488–97. 10.1016/S0958-1669(97)80073-0
- Crain P, McCloskey J: Applications of mass spectrometry to the characterization of oligonucleotides and nucleic acids. Curr Opin Biotechnol 1998, 9: 25–34. 10.1016/S0958-1669(98)80080-3
- Huber C, Oberacher H: Analysis of nucleic acids by on-line liquid chromatography-mass spectrometry. Mass Spectrom Rev 2001, 20(5):310–43. 10.1002/mas.10011
- Rusconi F, Guillonneau F, Praseuth D: Contributions of mass spectrometry in the study of nucleic acid-binding proteins and of nucleic acid-protein interactions. Mass Spectrom Rev 2002, 21(5):305–48. 10.1002/mas.10036
- Mann M, Jensen O: Proteomic analysis of post-translational modifications. Nat Biotechnol 2003, 21(3):255–61. 10.1038/nbt0303-255
- Redeker V, Rusconi F, Mary J, Promé D, Rossier J: Structure of the C-terminal tail of alpha-tubulin: increase of heterogeneity from newborn to adult. J Neurochem 1996, 67(5):2104–14.
- Rusconi F, Potier M, Le Caer J, Schmitter JM, Rossier J: Characterization of the chicken telokin heterogeneity by time-of-flight mass spectrometry. Biochemistry 1997, 36(36):11021–6. 10.1021/bi970752e
- ExPASy Proteomics Server [http://www.expasy.ch]
- Rozenski J: Nucleic acids masspec toolbox (Mongo Oligo Mass Calculator v2.06). [http://medlib.med.utah.edu/masspec]
- Rusconi F, Belghazi M: Desktop prediction/analysis of mass spectrometric data in proteomic projects by using massXpert. Bioinformatics 2002, 18(4):644–5. 10.1093/bioinformatics/18.4.644
- Peri S, Steen H, Pandey A: GPMAW-a software tool for analyzing proteins and peptides. Trends Biochem Sci 2001, 26(11):687–9. 10.1016/S0968-0004(01)01954-5
- Rozenski J, McCloskey JA: SOS: a simple interactive program for ab initio oligonucleotide sequencing by mass spectrometry. J Am Soc Mass Spec 2001.
- Roepstorff P, Fohlman J: Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed Mass Spectrom 1984, 11(11):601. 10.1002/bms.1200111109
- Domon B, Costello C: A systematic nomenclature for carbohydrate fragmentations in FAB-MS/MS spectra of glycoconjugates. Glycoconjugate Journal 1988, 5: 397–409. 10.1007/BF01049915
- Pedrioli P, Eng J, Hubley R, Vogelzang M, Deutsch E, Raught B, Pratt B, Nilsson E, Angeletti R, Apweiler R, Cheung K, Costello C, Hermjakob H, Huang S, Julian R, Kapp E, McComb M, Oliver S, Omenn G, Paton N, Simpson R, Smith R, Taylor C, Zhu W, Aebersold R: A common open representation of mass spectrometry data and its application to proteomics research. Nat Biotechnol 2004, 22(11):1459–66. 10.1038/nbt1031
- Orchard S, Hermjakob H, Julian R Jr, Runte K, Sherman D, Wojcik J, Zhu W, Apweiler R: Common interchange standards for proteomics data: Public availability of tools and schema. Proteomics 2004, 4(2):490–1. 10.1002/pmic.200300694
- Orchard S, Hermjakob H, Binz P, Hoogland C, Taylor C, Zhu W, Julian R Jr, Apweiler R: Further steps towards data standardisation: the Proteomic Standards Initiative HUPO 3(rd) annual congress, Beijing 25–27(th) October, 2004. Proteomics 2005, 5(2):337–9. 10.1002/pmic.200401158
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.