PathwayBooster: a tool to support the curation of metabolic pathways

Background Despite several recent advances in the automated generation of draft metabolic reconstructions, the manual curation of these networks to produce high quality genome-scale metabolic models remains a labour-intensive and challenging task. Results We present PathwayBooster, an open-source software tool to support the manual comparison and curation of metabolic models. It combines gene annotations from GenBank files and other sources with information retrieved from the metabolic databases BRENDA and KEGG to produce a set of pathway diagrams and reports summarising the evidence for the presence of a reaction in a given organism’s metabolic network. By comparing multiple sources of evidence within a common framework, PathwayBooster assists the curator in the identification of likely false positive (misannotated enzyme) and false negative (pathway hole) reactions. Reaction evidence may be taken from alternative annotations of the same genome and/or a set of closely related organisms. Conclusions By integrating and visualising evidence from multiple sources, PathwayBooster reduces the manual effort required in the curation of a metabolic model. The software is available online at http://www.theosysbio.bio.ic.ac.uk/resources/pathwaybooster/. Electronic supplementary material The online version of this article (doi:10.1186/s12859-014-0447-2) contains supplementary material, which is available to authorized users.


Introduction
PathwayBooster is an open-source software tool to support the comparison and curation of metabolic models. It combines gene annotations from GenBank files and other sources with information retrieved from the metabolic databases BRENDA and KEGG to produce a set of pathway diagrams and reports summarising the evidence for the presence of a reaction in a given organism's metabolic network. By comparing multiple sources of evidence within a common framework, PathwayBooster assists the curator in the identification of likely false positive (misannotated enzyme) and false negative (pathway hole) reactions. Reaction evidence may be taken from alternative annotations of the same genome and/or a set of closely related organisms.
This document provides information on how to install and run PathwayBooster. The software has been built and tested with Python 2.6 and newer, on Windows, Mac OS X, and Linux platforms. A version of PathwayBooster with a graphical interface is also available, for Windows only.
For support and other queries, send e-mail to j.pinney@imperial.ac.uk.

Installation terminal version
• PIL Python module. If you encounter problems installing PIL or running PathwayBooster, with an error like 'ImportError: The imagingft C module is not installed', try installing a PIL version with precompiled libraries. With Enthought Python, PIL should already be installed.
• BRENDA flatfile. Once extracted from the zipfile, move the file brenda_download.txt into the PathwayBooster/files directory.
PathwayBooster does not need to be run from the PathwayBooster directory. To run it from a different directory, type:
python path/to/PathwayBooster.py [setupFilename.xml]
The results will be saved in the current working directory.

Installation GUI version
This section describes how to install PathwayBoosterGUI and get it working. The GUI version is only available for Windows.
Download the BRENDA flatfile (freely obtainable from http://www.brenda-enzymes.info/) and unpack it. Save it in the PathwayBooster/files directory as brenda_download.txt. To start PathwayBoosterGUI, double-click the PathwayBoosterGUI.exe file.
Instructions on how to use PathwayBoosterGUI can be found below and in the help provided within PathwayBoosterGUI.

Setup File
The setup file is constructed in XML format and is divided into three parts that reflect groups of information to be provided by the user: <pathwayList>, <genomeList> and, optionally, <blockList>. An example setup file is shown in Fig. 2.1.

<pathwayList>
In this section, the user specifies the set of KEGG pathways to be processed as a series of <pathway> elements. There are two ways to specify pathways:
• using KEGG metabolic function groups, e.g. carbohydrate metabolism has id 1.1. The declaration <pathway id="1.1"> means that all pathways in this group are processed.
• using the global KEGG id of an individual pathway, e.g. <pathway id="00010"> corresponds to Glycolysis/Gluconeogenesis.
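The two kinds of id can be told apart by their shape. The following is a minimal sketch, assuming that group ids always look like 1.1 and individual pathway ids are always five digits like 00010; the function name classify_pathway_id and both patterns are illustrative assumptions, not taken from the PathwayBooster code.

```python
import re

# Assumed id shapes (not taken from the PathwayBooster source):
# a function group looks like "1.1", an individual pathway like "00010".
GROUP_ID = re.compile(r"^\d+\.\d+$")
PATHWAY_ID = re.compile(r"^\d{5}$")


def classify_pathway_id(pathway_id):
    """Return "group" or "pathway" for the id attribute of a <pathway> element."""
    value = pathway_id.strip()
    if GROUP_ID.match(value):
        return "group"
    if PATHWAY_ID.match(value):
        return "pathway"
    raise ValueError("unrecognised pathway id: %r" % pathway_id)


if __name__ == "__main__":
    for example in ("1.1", "00010"):
        print(example, classify_pathway_id(example))
```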

<genomeList>
This section requires the user to specify genome information for the species of interest and other reference organisms. For each organism to be included, the user must define a <genome> element. The attribute name is a species identifier, which the software uses for display. For each <genome>, the user may provide multiple <annotation> sources. These can be of three different kinds, defined by the attribute type: kegg, genbank or embl. For genbank and embl, the user must provide a filename for a genome annotation in the respective file format. For kegg annotations, the user provides the keggId for the given genome. For example, in the case of Bacillus subtilis, set keggId="bsu". The user can provide more than one annotation of each type; however, each annotation must have a unique id attribute.
For each <genome> there are multiple options available, specified by the following attributes:

• filename
The user may supply a FASTA-format file containing amino acid sequences for the predicted proteome.
• query
When set to true, this signifies that this genome is the one of main interest. If no genome is set with query=true, the first genome for which a genome annotation sequence file is provided will be considered the query genome.
• brenda
The full taxonomic name of the organism, which will be used by PathwayBooster to search the BRENDA database in order to retrieve publication information.
• color
The color used to identify the genome in the PathwayBooster display. The accepted format is the RGB color model: three numbers between 0 and 255, separated by commas, for example color="30,40,200". If the user does not specify a color, PathwayBooster will assign one automatically.
• pathway
This controls whether the genome is included in the pathway visualisation (default is true). A maximum of 7 genomes can be displayed. A reaction is considered present if it has either an annotated gene (from any of the annotations provided) or literature evidence (if brenda is provided).
• hamming
This controls whether the genome is included in the Hamming distance matrix (default is true).

<blockList>
This optional section can be used to specify more complex display preferences. For example, if the user wants to compare the annotations obtained from two different sources for the same organism, these can be separated into different <block> elements. All options available for a <genome> are also available for a <block>, with the exception of filename.
When the <blockList> section is present, the <genome> attributes will be overridden for the pathway map, Hamming distance and literature evidence displays.
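To make the layout of the setup file more concrete, here is a minimal sketch that parses a small example and applies the query-genome rule described above. The element and attribute names come from this manual. The root element name setup, the nesting of <annotation> inside <genome>, the file names, and the reading of "genome annotation sequence file" as a genbank or embl annotation with a filename are all assumptions made for illustration. The real setup file may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical setup file: root element name, nesting and file names are assumed.
SETUP_XML = """
<setup>
  <pathwayList>
    <pathway id="1.1"/>
    <pathway id="00270"/>
  </pathwayList>
  <genomeList>
    <genome name="bsu" brenda="Bacillus subtilis" color="30,40,200">
      <annotation id="bsu_kegg" type="kegg" keggId="bsu"/>
    </genome>
    <genome name="query_strain" hamming="false">
      <annotation id="ann_a" type="genbank" filename="annotation_a.gbk"/>
      <annotation id="ann_b" type="embl" filename="annotation_b.embl"/>
    </genome>
  </genomeList>
</setup>
"""


def is_true(element, attribute, default):
    """Read a true/false attribute, falling back to the given default string."""
    return element.get(attribute, default).strip().lower() == "true"


def pick_query_genome(root):
    """Prefer a genome with query=true, else the first one with an annotation file."""
    genomes = root.findall("./genomeList/genome")
    for genome in genomes:
        if is_true(genome, "query", "false"):
            return genome.get("name")
    for genome in genomes:
        for annotation in genome.findall("annotation"):
            has_file = annotation.get("filename") is not None
            if annotation.get("type") in ("genbank", "embl") and has_file:
                return genome.get("name")
    return None


root = ET.fromstring(SETUP_XML.strip())
pathway_ids = [p.get("id") for p in root.findall("./pathwayList/pathway")]
annotation_ids = [a.get("id") for a in root.iter("annotation")]
query_genome = pick_query_genome(root)

if __name__ == "__main__":
    print("pathways:", pathway_ids)
    print("query genome:", query_genome)
    print("annotation ids unique:", len(annotation_ids) == len(set(annotation_ids)))
```

In this example no genome carries query=true, so the sketch falls back to the first genome that has a genbank or embl file.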

Graphical interface manual
This section provides a detailed description of PathwayBooster GUI. The objective of PathwayBooster GUI is to help the user to build the xml setup file (for more details, see section 2.3) and define some preferences. It consists of four main sections organised in tabs: Preferences, Organisms, Pathways and Run.

• Preferences
In this tab the user can define preferences such as the name of the report and where it will be saved.
PathwayBooster home: Specify where the annotation files are. This makes finding the annotation files easier and faster.
Output directory: Specify the directory where you want to save the PathwayBooster Report. By default the report will be saved in the same directory as PathwayBooster.
Report Name: Indicate the name of the PathwayBooster Report. By default the name is PathwayBoosterReport.
PYTHONHOME: Indicate the directory that contains the Python "Lib" folder.
PYTHONPATH: Indicate the directories that contain the Python modules needed to run PathwayBooster.

• Organisms
This tab relates to the genomeList section of the xml setup file. Here, the user specifies genome information for the species of interest and other reference organisms.
Figure 2.3: Organisms tab.
01 - add: Adds an organism object to the display.
02 - remove: Removes that specific organism.
03 - ID: Refers to the organism identifier, which will be used by the software for display. Each ID must be unique.
04 - Name: The full taxonomic name of the organism, which will be used by PathwayBooster to search the BRENDA database in order to retrieve publication information.
05 - fasta: The user may supply a FASTA-format file containing amino acid sequences for the predicted proteome. Clicking the fasta button opens a popup window with a file-handling menu.
06 - P: This controls whether the genome is included in the pathway visualisation (default is true, shown with a blue background). A maximum of 7 genomes can be displayed. A reaction is considered present if it has either an annotated gene (from any of the annotations provided) or literature evidence (if brenda is provided).
07 - H: This controls whether the genome is included in the Hamming distance matrix (default is true, shown with a blue background).
08 - Color: The color used to identify the genome in the PathwayBooster display. It accepts three different formats:
RGB color model: three numbers between 0 and 255 separated by commas (e.g. 255,102,0);
RGB color model: three numbers between 0 and 1 separated by commas (e.g. 1,0.4,0);
hexadecimal color model: a # followed by three hexadecimal numbers between 00 and FF (e.g. #FF6700).
If the user does not specify a color, PathwayBooster will assign one automatically.
09 - Q: When set to true (blue background), this signifies that this genome is the one of main interest. If no genome is set with query=true, the first genome for which a genome annotation sequence file is provided will be considered the query genome.
A1 - add: Adds an annotation object to the organism.
A2 - remove: Removes that specific annotation.
A3 - ID: Refers to the annotation identifier. Each ID must be unique.
A4 - Type: This can be one of three kinds: kegg, genbank or embl. For kegg annotations, the user provides the KEGG id for the given genome. For genbank and embl, the user must provide a filename for a genome annotation in the respective file format. Avoid typing the path; use the file handler provided when pressing the filename button.
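The three color formats accepted by the Color field can be normalised to a single representation. The sketch below is only an illustration: the function name parse_color is hypothetical, and so is the rule used to tell the two comma-separated formats apart (a triple whose values all lie between 0 and 1 is read as fractions). The manual does not say how PathwayBooster resolves that ambiguity.

```python
def parse_color(text):
    """Convert a color string to an (R, G, B) tuple of integers in 0..255.

    Accepted inputs: "255,102,0", "1,0.4,0" or "#FF6700".
    Assumption: a comma-separated triple with all values in [0, 1]
    is interpreted as fractions of 255.
    """
    value = text.strip()
    if value.startswith("#"):
        digits = value[1:]
        if len(digits) != 6:
            raise ValueError("expected # followed by six hex digits: %r" % text)
        return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

    parts = [float(part) for part in value.split(",")]
    if len(parts) != 3:
        raise ValueError("expected three comma-separated numbers: %r" % text)
    if all(0.0 <= part <= 1.0 for part in parts):
        return tuple(int(round(part * 255)) for part in parts)
    if all(0.0 <= part <= 255.0 for part in parts):
        return tuple(int(round(part)) for part in parts)
    raise ValueError("color components out of range: %r" % text)


if __name__ == "__main__":
    for example in ("255,102,0", "1,0.4,0", "#FF6700"):
        print(example, parse_color(example))
```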

• Pathways
In this section, the user specifies the set of KEGG pathways to be processed as a series of <pathway> elements in the xml setup file (see manual for more details). There are two ways to specify pathways:
using KEGG metabolic function groups, e.g. carbohydrate metabolism has id 1.1. This means that all pathways in this group are processed.
using the global KEGG id of an individual pathway, e.g. 00010 corresponds to Glycolysis/Gluconeogenesis.
To select a pathway, the user just needs to click on it. All the selected pathways appear in the Selected sliding window and have a blue background.
Selected: All the selected pathways appear here. To remove a selected pathway, the user just needs to click on it.
• Run
In this section the user can run the PathwayBooster tool with the settings shown in the Pathways and Organisms tabs. The user can also import a previously built setup file, or save the settings shown in the Pathways and Organisms tabs to an xml file.

PathwayBooster Report Index
After running PathwayBooster, a report will be produced. The index page will contain links to every pathway requested. Figure 2.6 shows an example of a possible index page. In this case, the user requested all the Energy metabolism pathways and two other pathways grouped in the Other Pathways section. The links provided for each pathway will contain the information described in the paper and in the Example Application below.

Example Application
Geobacillus thermoglucosidasius NCIMB 11955 is a thermophilic bacterium with the potential to convert lignocellulose to ethanol in a highly productive manner. Thermophilic bacteria are especially useful in biofuel production since they can withstand the high temperatures that are unavoidable at certain stages of fermentation. Given these interesting properties, we would like to understand the metabolism of this organism in more detail.
As an example of the use of PathwayBooster, we present results for cysteine and methionine metabolism (KEGG id = 00270). Initial genome annotations were generated by ERGO™ (Integrated Genomics; Overbeek et al., 2003) and the RAST annotation server (Aziz et al., 2008). The agreements and differences between these annotations were used in the first stages of metabolic model curation for this organism. At a later stage, reference organisms were selected, including Escherichia coli, Bacillus subtilis, Geobacillus thermoglucosidasius C56-YS93, Geobacillus thermodenitrificans and Geobacillus kaustophilus.

Filling pathway holes
Looking at the visual representation of this pathway (Fig 3.1) and the Hamming distance heatmap (Figure 3.2) generated by PathwayBooster, it can be observed that enzymes with the EC numbers 4.2.1.109, 3.1.3.77, 1.13.11.53 and 5.3.1.23 are not annotated for the query strain (red), but are present in most of the reference organisms. It is possible that these enzymes were missed by the ERGO/RAST annotation servers.
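The reasoning above can be sketched in a few lines. The code below is not PathwayBooster's implementation. It assumes that each genome is summarised as a set of EC numbers, that the Hamming distance counts EC numbers present in exactly one of two genomes, and that "most of the reference organisms" means more than half. The function names, the reference names ref_1 to ref_3 and the presence/absence values are invented for illustration and are not results.

```python
def hamming_distance(ecs_a, ecs_b):
    """Number of EC numbers present in exactly one of the two genomes."""
    return len(set(ecs_a) ^ set(ecs_b))


def hole_candidates(query_ecs, reference_ecs, min_fraction=0.5):
    """EC numbers missing from the query but found in more than
    min_fraction of the reference genomes (assumed threshold)."""
    counts = {}
    for ecs in reference_ecs.values():
        for ec in set(ecs):
            counts[ec] = counts.get(ec, 0) + 1
    threshold = min_fraction * len(reference_ecs)
    return sorted(
        ec for ec, count in counts.items()
        if ec not in query_ecs and count > threshold
    )


# Toy presence/absence data, invented for this example.
query = {"3.2.2.9", "3.2.2.16"}
references = {
    "ref_1": {"3.2.2.9", "4.2.1.109", "3.1.3.77", "5.3.1.23"},
    "ref_2": {"3.2.2.9", "4.2.1.109", "3.1.3.77"},
    "ref_3": {"3.2.2.9", "4.2.1.109"},
}

if __name__ == "__main__":
    for name in sorted(references):
        print(name, hamming_distance(query, references[name]))
    print("candidate pathway holes:", hole_candidates(query, references))
```

Candidates found this way are only prompts for further checking, for example with the BLAST reports described below.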
The article was easily accessible via its PubMed accession number. Information on enzyme mass and activity will be used in the design of laboratory experiments, but the retrieved gene sequence was immediately useful for finding similar genes within the genome of G. thermoglucosidasius NCIMB 11955.
Sequences for all relevant genes from related organisms can be easily accessed from the 'Annotations' report in PathwayBooster. Here, hyperlinks redirect the user to the KEGG website (Kanehisa et al., 2000, 2012), where information on a gene, including its nucleotide and amino acid sequence, can be accessed. Using the 'BLAST bidirectional hits' report, a candidate gene within the G. thermoglucosidasius NCIMB 11955 genome was identified. It is also possible for the user to view the top three BLAST hits for the model organism, accompanied by the sequence similarity information, e-value and overall BLAST score. The above procedure was applied to all remaining omitted genes, and all of them were successfully found in our query strain of G. thermoglucosidasius.
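A bidirectional (reciprocal) best hit is commonly defined as a pair of genes that are each other's top-scoring BLAST hit. The sketch below illustrates that general idea under stated assumptions: hits are given as (query id, subject id, bit score) tuples, and the gene names and scores are invented. It does not describe how PathwayBooster itself builds its 'BLAST bidirectional hits' report.

```python
def best_hits(hits):
    """Map each query id to the subject id of its highest-scoring hit.

    hits is an iterable of (query_id, subject_id, bit_score) tuples.
    """
    best = {}
    for query_id, subject_id, score in hits:
        if query_id not in best or score > best[query_id][1]:
            best[query_id] = (subject_id, score)
    return dict((query_id, pair[0]) for query_id, pair in best.items())


def bidirectional_best_hits(forward_hits, reverse_hits):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return sorted(
        (a, b) for a, b in forward.items() if reverse.get(b) == a
    )


# Invented example: reference genes searched against the query genome,
# and query genes searched back against the reference genome.
forward_hits = [
    ("ref_gene_1", "query_gene_7", 250.0),
    ("ref_gene_1", "query_gene_9", 80.0),
    ("ref_gene_2", "query_gene_9", 120.0),
]
reverse_hits = [
    ("query_gene_7", "ref_gene_1", 245.0),
    ("query_gene_9", "ref_gene_3", 140.0),
    ("query_gene_9", "ref_gene_2", 118.0),
]

if __name__ == "__main__":
    print(bidirectional_best_hits(forward_hits, reverse_hits))
```

Here ref_gene_2 is not paired, because its best hit in the query genome prefers a different reference gene in the reverse search.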

Identifying misannotated enzymes
In contrast, the enzyme 5'-methylthioadenosine nucleosidase (EC 3.2.2.16) was found in the annotation of the query strain but not in the closely related reference organisms. There are two possible explanations for this: either G. thermoglucosidasius NCIMB 11955 has acquired an enzyme that its close relatives lack, or else this enzyme has been misannotated by one of the annotation servers used. Examining the PathwayBooster 'Annotations' and 'BLAST hits' sections, where no hits were found, we decided to look more closely at the gene encoding this enzyme. RTMO02286 has been assigned two potential functions: 5'-methylthioadenosine nucleosidase (EC 3.2.2.16) and S-adenosylhomocysteine nucleosidase (EC 3.2.2.9). Given that there were no hits found for EC 3.2.2.16 by PathwayBooster, our focus shifted to EC 3.2.2.9. This enzyme is assigned to all reference organisms and after examining the annotations and BLAST hits (Fig. 3.3), we concluded that EC 3.2.2.9 is the more probable annotation for RTMO02286.
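The comparison made in this paragraph can be written down as a small sketch. It assumes that each genome is summarised as a set of EC numbers, flags EC numbers that appear in the query but in none of the references, and ranks the alternative functions of a gene by how many references carry them. The function names and the reference names ref_1 to ref_3 are placeholders. The presence values mirror the text above (EC 3.2.2.9 in every reference, EC 3.2.2.16 in none) and are not additional data.

```python
def unsupported_ecs(query_ecs, reference_ecs):
    """EC numbers annotated in the query but in none of the references."""
    seen = set()
    for ecs in reference_ecs.values():
        seen.update(ecs)
    return sorted(set(query_ecs) - seen)


def rank_assignments(candidate_ecs, reference_ecs):
    """Order alternative EC assignments for a gene by reference support."""
    support = []
    for ec in candidate_ecs:
        count = sum(1 for ecs in reference_ecs.values() if ec in ecs)
        support.append((ec, count))
    # Highest support first; ties broken by EC number for a stable order.
    support.sort(key=lambda item: (-item[1], item[0]))
    return support


# Placeholder reference names; values follow the description in the text.
references = {
    "ref_1": {"3.2.2.9"},
    "ref_2": {"3.2.2.9"},
    "ref_3": {"3.2.2.9"},
}
query = {"3.2.2.9", "3.2.2.16"}

if __name__ == "__main__":
    print("possible misannotations:", unsupported_ecs(query, references))
    print("ranked functions:", rank_assignments(["3.2.2.16", "3.2.2.9"], references))
```

A low support count is a reason to inspect the gene more closely, not proof of misannotation, since the query organism may genuinely have acquired the enzyme.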