- Research article
- Open Access
GEM System: automatic prototyping of cell-wide metabolic pathway models from genomes
BMC Bioinformatics volume 7, Article number: 168 (2006)
Successful realization of a "systems biology" approach to analyzing cells is a grand challenge for our understanding of life. However, current modeling approaches to cell simulation are labor-intensive, manual affairs, and therefore constitute a major bottleneck in the evolution of computational cell biology.
We developed the Genome-based Modeling (GEM) System for the purpose of automatically prototyping simulation models of cell-wide metabolic pathways from genome sequences and other public biological information. Models generated by the GEM System include an entire Escherichia coli metabolism model comprising 968 reactions of 1195 metabolites, achieving 100% coverage when compared with the KEGG database, 92.38% with the EcoCyc database, and 95.06% with iJR904 genome-scale model.
The GEM System prototypes qualitative models to reduce the labor-intensive tasks required for systems biology research. Models of over 90 bacterial genomes are available at our web site.
Given the burgeoning wealth of knowledge in molecular biology, including the ever more rapidly accumulating quantitative high-throughput data, and with more than a hundred complete genomes now at hand, the grand challenge of what we might call "the post-genome era" is to obtain a system-level understanding of the dynamic behavior of the mechanisms of life. However, the dynamic behavior of biological systems, a result of the diverse nonlinear interactions of multiple molecular components possessing various properties, is complex and unintuitive. An integrative systems biology approach is therefore required to complement traditional reductionism, and computer simulation has proven to be an invaluable tool for system-level analysis . Simulation-based research facilitates the understanding of the complex underlying structure of a system, and detailed models can be used to help generate testable predictions and hypotheses for experiments.
Several simulation studies of large-scale biological systems have been reported, but most are achieved by manual modeling of the cellular networks and simulation of network models by the use of approaches such as biochemical systems theory  and flux balance analysis . Investigations of dynamic behavior thus far have been limited in scale, focusing on minimized models  or specific pathways . This is mostly because dynamic modeling for biosimulation requires a multitude of parameters, and collection and organization of the information required is extremely time-consuming, labor-intensive, manually precise work. The modeling procedure usually involves three major steps: (1) qualitative modeling, where the network structure or the pathway map including all the necessary inhibitors, activators, reversibility, and feedback regulation is constructed from established biological knowledge and hypotheses; (2) quantitative modeling, where quantitative data such as the metabolite and enzyme concentrations, accurate rate equations, and kinetic parameters are incorporated so as to formulate a mathematical system model; and (3) cell programming, where the above information is translated into a machine-readable modeling language such as the Systems Biology Markup Language (SBML)  ready for simulation software, with specifications for simulation such as the type of integrators, integration step size, and simulation procedures . The manual modeling process is the most serious bottleneck in systems biology, and an intelligent environment with both sophisticated data and knowledge bases is necessary for the next step in the evolution of computational cell biology. For example, the biochemical simulator GEPASI/COPASI  provides an intuitive graphical user interface to aid the cell programming process in modeling, yet the users are required to obtain the qualitative and quantitative information manually, and current biochemical simulation software suites do not provide the automatic qualitative and quantitative modeling components.
Although quantitative modeling currently requires a thorough bottom-up approach with expert knowledge and a large amount of public kinetic information, a substantial part of qualitative modeling can be automated by integrating information from numerous databases. Although not intended for generating simulation models, several software tools for the reconstruction of the pathway database from the genome exist, including metaSHARK , IdentiCS , and the PathoLogic program in the BioCyc Pathway Tools software suite . However, PathoLogic, for example, heavily relies on text-based annotation of the genome, and sometimes requires another pipeline such as the GeneQuiz system for annotation beforehand . Since all three software tools contain no stoichiometric information and lack the cell programming step, they all require considerable time and effort in order to use the results for simulation. In contrast, a system dedicated to the prototyping of pathway simulation models can be highly optimized for speed and ease of use.
Here we describe a novel database-driven intelligent software system named the Genome-based Modeling (GEM) System, which automatically generates a cell-wide metabolic pathway simulation model suitable as a draft model to build upon for computer-based studies. The model is based on complete genome sequence data, and the software provides an environment that allows analysis of the system-level behavior of the organism of interest.
A traditional modeling approach scales up from a basic model by adding new information by hand, but because our aim is to provide a draft model to reduce the labor-intensive qualitative modeling steps, we take an opposite, top-down approach of automated modeling. A rough image of the entire metabolic network is extracted from genome data and modeled on the basis of genetic information, and then more specific information is later added from expert knowledge with minimal manual work. Another merit to this approach is that the process starts from the genome sequence. In whole-cell modeling, integration of different biological databases is a challenging task, because the target field is broad and the scheme of each database differs. Moreover, the names of genes and proteins are often ambiguous and thus difficult to match. However, most databases contain a link to the genome sequence regardless of the subject, so by modeling from the genome as the starting point, it becomes possible to link a large amount of biological information by an automated method.
The GEM System takes the complete genome database, both annotated and unannotated data, as input, and automatically goes through several steps to produce a simulation model of the organism in a flat-file format that can be readily converted to standard SBML suitable for simulation with various simulation software systems (Figure 1) [7, 13]. When an unannotated genome is given, the system predicts genes with Glimmer , which produces a high false-positive rate and a low false-negative rate. Alternatively, users may use existing sophisticated pipelines such as GenDB , GeneQuiz , and EnsEMBL  to identify the coding regions and for the functional annotation to be used in the following steps.
The second step matches the genes to the product protein through a combined method of annotation reference, homology search, and orthology search. The system first checks for a direct external database link (db_xref) to SWISS-PROT and TrEMBL  in the annotation frequently provided in the EMBL complete genome database  to find the EC number of the protein product of a gene. If annotation is not available, the system performs a BLASTP search  against the SWISS-PROT database with a default cutoff value of e-25, which is also configurable, and the SWISS-PROT entry with the best e-value is used as its homolog. Homology searching is a powerful technique for conserved proteins, but sometimes is insufficient in functional genomics . If there is no hit above the cutoff e-value, an orthology search of the amino acid sequence of the gene by using the COGnitor program provided with the COGs database  is then performed with a cut-off value at 3 clades. This step can alternatively be carried out by direct reference to the annotated COG entry given in the PTT database distributed in the GenBank genome flat-file . The obtained COG entry is matched to the corresponding EC number by reference to the KEGG Orthology database .
When the SWISS-PROT and TrEMBL entries match with the annotation and the homology search does not contain an EC number, there is an additional search in the ortholog entries in the same WIT Cluster  category. The WIT Cluster provides a list of orthologous genes identified under strict criteria; therefore, the genes in the same WIT Cluster are expected to have identical functions. Because the GEM System keeps track of the enzymes by the EC number, when the SWISS-PROT and TrEMBL entries are not annotated with an EC number, the system looks through all orthologous genes in the same WIT Cluster to find entries annotated with an EC number, and uses those for annotation. When multiple entries with EC numbers are found, the entry with the lowest e-value is used. Currently, the GEM System cannot account for any enzymes that are not EC-encoded, and unspecific or incomplete codes are treated the same as complete codes. This is a limitation of the system, but since the KEGG database that the system uses to check the pathway is also mostly based on EC numbers and treats unspecific or incomplete codes similarly, our system follows this approach.
The GEM System holds information on enzymes extracted from the major enzyme and pathway databases [18, 24–27] and curates it for consistency of nomenclature in the form of an internalized database, and the EC numbers obtained are matched to the corresponding stoichiometric enzyme reaction equation. Here each gene is assumed to have a one-to-one enzyme relationship, so the system cannot distinguish between isozymes and heteropolymeric enzymes. To resolve this problem and to recover false negative matches, the stoichiometric reaction list undergoes a pathway check that compares the extracted list with the general reference pathway of the KEGG and MetaCyc databases . For the problem of many-to-many relationships between reactions, firstly the multiple rows with the same stoichiometry (the reactions are equivalent in each instance) are collapsed to a single row. This procedure is equivalent to resolving the problem of heteropolymeric enzymes and ignoring the presence of isozymes. With this new stoichiometric matrix, where all rows represent unique reactions, each reaction is searched in the reference pathways, and the row is duplicated when the reaction exists in multiple pathways, to recover the necessary isozymes. Then for the pathway connectivity check, when fewer than "Y" steps of a gap exist between more than "X" connected steps upstream and downstream of the pathway, the gap is filled from the information in the pathway databases. X and Y are configurable, but are set to 3 and 1, respectively, by default. For example, when there is a continuous pathway containing 7 enzymes such as A-B-C-D-E-F-G exist, and when the consecutive sets of three enzymes A-B-C and E-F-G are identified by functional annotation, the gap enzyme D is filled in. All 7 enzymes in this example must be consecutive reactions, but they are not required to be in linear order or to belong in the same pathway. Gap filling in pathway reconstruction is a challenging task, and the filled gap would be better to be reconfirmed by sequence alignment. Although only the most straightforward method is currently implemented, the filled gap is clearly marked as such in the model, therefore enabling the users to easily take out uncertain reactions. This gap-filling process is useful for flux-based analysis where pathway connectivity is essential, but since the data is uncertain, these reactions are not included for the following validations. A graphical user interface allows the configuration of options and optimal execution of the system.
To summarize the procedures described so far, let us follow the workflow taking fumarate hydratase (18.104.22.168) and glutamate synthase (22.214.171.124) in E. coli as examples. Firstly, the coding regions are identified by Glimmer unless an annotated complete genome is available, and the nucleotide sequences are translated into amino acid sequences. If database reference to SWISS-PROT is available in annotation, this reference is used to identify the gene and the EC number of enzyme coded by the gene, and if the reference is not available, BLAST similarity search matches the amino acid sequence to the corresponding SWISS-PROT entry. In this case, gene located on the complementary strand of position 1684755 to 1686401 is identified to be FUMA_ECOLI that is 126.96.36.199 in EC number with complete identify (e-value of zero), and likewise, FUMB_ECOLI (e-value of zero) is also identified to be 188.8.131.52. Similarly, two genes GLTB_ECOLI (e-value of zero) and GLTD_ECOLI (e-value of 6.5e-282) are identified to be 184.108.40.206. Even when the homology search fails, orthology search identifies fumA and fumB genes to be both COG1838 (Tartrate dehydratase beta subunit/Fumarate hydratase class I, C-terminal domain), and the majority of genes belonging to the orthologous cluster is identified to be 220.127.116.11 in WIT Cluster. Likewise, gltB is identified to be COG0069 (Glutamate synthase domain 2) and gltD to be COG0493 (NADPH-dependent glutamate synthase beta chain and related oxidoreductases) and subsequently to be 18.104.22.168 with WIT Cluster. FumA is an aerobic isozyme and FumB is an anaerobic isozyme, whereas GltB and GltD are subunits of glutamate synthase. This is correctly identified by pathway check, because 22.214.171.124 occurs both in aerobic citrate cycle and in anaerobic reductive carboxylate cycle pathways, but 126.96.36.199 only occurs in Glutamate metabolism pathway (188.8.131.52 actually also exists in nitrogen metabolism pathway in the GEM model, but this is from another isozyme, gltF). Reaction for 184.108.40.206 is a reversible reaction of (S)-malate = fumerate + H2O, and that of 220.127.116.11 is an irreversible reaction of 2 L-glutamate + NADP = L-glutamate + 2-oxoglutarate + NADPH + H. Taking account of the existence of isozymes (here we igonore gltF for convenience), the stoichiometric matrix is derived as follows (Table 1):
where the first four rows represent two sets of reversible reaction of 18.104.22.168 and the fifth row represents the irreversible reaction of 22.214.171.124, and the columns represent (S)-malate, fumerate, H2O, L-glutamate, NADP, 2-oxoglutarate, NADPH, and H, respectively. For gap-filling, the reaction connecting 2-Hydroxy-ethyl-ThPP and Acetaldehyde, namely 126.96.36.199 is suggested from the connectivity of upstream reactions (188.8.131.52, 184.108.40.206, and 220.127.116.11) and downstream reactions (18.104.22.168, 22.214.171.124, and 126.96.36.199) in the glycolysis pathway.
Validation of the qualitative modeling step
Stoichiometric simulation models of all available complete annotated bacterial genomes have been generated by using the GEM System with the default parameters. Complete genome flatfiles were obtained from the EMBL database, and corresponding PTT files were used for COG annotation. Using these inputs, the BLAST searches were limited to the genes that did not contain direct external database link (db_xref) to SWISS-PROT or TrEMBL, and COG searches were replaced with data integration of PTT file with EMBL data. In this way, functional annotation process was optimized and therefore the calculation speed was remarkably fast, finishing the entire process in a few hours on a dual-processor PC server (Pentium 4 Xeon 2.8 GHz, 4 GB RAM). Statistics describing the scale of computer-based cell models with over 500 enzymes are shown in Table 2 (the complete list of 90 models is available at our web site ). Here the KEGG coverage is calculated as the percentage of correctly extracted enzymes in the corresponding organism-specific pathways in the KEGG database. The model organism E. coli K12 MG1655 yielded the best numbers, with 968 reactions of 835 enzymes and 1195 metabolites in the computer-based model with an accuracy of 100% KEGG coverage, in other words, without any false negatives. Organisms that are not well understood are limited by this database driven approach, but even the model with lowest coverage, in this case Mycobacterium leprae, achieved over 90% coverage.
Accuracy comparison with E. coli specific entries of KEGG PATHWAY and SWISS-PROT data for pathways with unidentified genes or genes that have no EC number assigned are shown in Table 3. Since the overall KEGG coverage is 100% in E. coli, the EC coverage is obviously 100% for all pathways. However, there are several reactions that cannot be EC coded in some metabolic pathways, and a large fraction of reactions for pathways other than metabolism. Although the reactions that are not EC coded is not included in the stoichiometric model since they are beyond the purpose of this work to create metabolic pathway models, GEM System correctly identified all genes except for 3 cases in comparison with SWISS-PROT entries, so that the information can easily be incorporated for future applications. Three misidentified genes were actually correctly identified but the homology search identified them in different organisms or strains, namely, GPMA_ECOLI was identified to be GPMA_SHIFL (same gene in Shigella flexneri), ISPH_ECOLI and UPK_ECOLI were identified to be ISPH_ECO57 and UPPP_ECO57 (same gene in O157 strain of E. coli).
The E. coli model was further compared with the genome-scale metabolic flux model of Reed et al. (iJR904)  and the EcoCyc database . EC numbers are directly compared, and all of the 54 enzymes that are not included in the E. coli specific entries of KEGG or SWISS-PROT are manually checked through EcoCyc and iJR904 as shown in Table 4. There were 6 possible mis-annotations by the GEM System, but the majority of the enzymes were mis-identified due to the inconsistencies of the EC notation among databases. After correction of obsolete or deleted EC numbers, iJR904 contained 388 out of 425 EC numbers in common (91.29% accuracy), and EcoCyc had 651 out of 701 EC numbers in common (92.38% accuracy). 16 enzymes that were assigned different EC numbers between the SWISS-PROT database and the iJR904 model, although the genes were correctly identified, so our model has overall 95.06% accuracy compared with the iJR904 model. Five enzymes out of the 21 false negatives in comparison with iJR904 and 38 out of the 49 false negatives in comparison with EcoCyc have no corresponding genes as of now. This fact emphasizes the importance of manual refinement, but since this process is required for less than 5% of the model in E. coli, our automatic modeling system should keep the manual effort to a minimum. Obviously E. coli is the most well studied organism, and the manual procedure required for other organisms would be greater than 5%. However, most of the other models also yielded over 500 reactions at 90% or more KEGG coverage, and since the models are provided with pathway-wise accuracy table similar to Table 3 at GEM System web-site  the user can easily identify which pathway is incomplete and thus requires manual checking.
The number of EC numbers extracted from the iJR904 model, 425, may seem small compared with the total number of reactions, 931. However, iJR904 contains 184 transporters that cannot be EC-coded, and it contains many enzymes without EC numbers that are EC-coded in KEGG database. Moreover, since many enzymes have multiple reactions, the total number of EC-coded reactions in iJR904 is 519. It is worth noting that iJR904 selects the pathways to include in the model, whereas the GEM System takes a greedy approach where every possible enzyme that is predicted to exist in a genome is included, regardless of the types of pathway the enzyme belongs to, leading to greater number of enzymes than in the iJR904 model. To summarize, the generated model has very high coverage (91~100%) compared to KEGG, EcoCyc, and iJR904, and the overall accuracy is also high, with false-positives of 6 entries (0.72%) and possible false-negatives of less than 43 entries (5.14%).
Database of generated models
Our web site  makes publicly available all genome-scale models with enzyme or metabolite lists with reactions, gene lists with matched product and BLAST e-values, stoichiometric matrices for static simulation and metabolic flux analysis, interactive pathway maps generated with a Java applet for visualizing protein-protein interactions , and a tool to view the extracted enzymes mapped on the KEGG pathway database by using KEGG API.
We have developed the GEM System, automated software for the rapid construction of draft simulation models of cell-wide metabolic pathways from genome sequence information by integration of public biological databases. Automatic generation of the models is currently limited to metabolism in bacteria, and depends on the availability of EC numbers in public databases, but we have shown that qualitative models of the metabolic pathways of bacteria can be generated with low false positives and negatives, as validated by the comparison with KEGG, EcoCyc, and Reed et al.'s model. Although the generated models are draft models and thus still require expert curation to ensure the accuracy of simulations, manual involvement is minimized.
There are, however, several limitations of this approach. Firstly, although EC numbers are generally effective for enzyme data representation for well known pathways, certain number of reactions have no EC number assigned, and therefore majority of the transporters are identified as genes but not included as reactions in GEM System, making a large fraction of the model different from iJR904. Secondly, some EC numbers are incomplete and therefore ambiguous, and some become quickly obsolete, being assigned to new EC numbers. This resulted in more than 40 inconsistent enzyme assignments in GEM System. Thirdly, since the GEM System identifies enzymes and the corresponding reactions based on the genome information, it cannot identify reactions that are experimentally observed but with no corresponding gene found. To overcome these problems, more general nomenclature for enzymes should be used in addition to the EC numbers and integrate necessary information that have no link to the gene sequences.
The system generates a stoichiometric simulation model in SBML format, which is readily applicable to flux-based analyses on a number of simulation platforms. The stoichiometric models can be used for metabolic flux analyses by supplying experimental data for exchange fluxes as reported elsewhere [33, 34]. One potential application of GEM System using this stoichiometric matrix is for dynamic large-scale simulation of metabolic pathways with hybrid dynamic/static simulation method . Using this method, quasi-dynamic simulation is achieved by subdividing the model into multiple "static modules" connected by "dynamic modules", and by calculating the flux distribution of static modules using the stoichiometry and boundary flux of the dynamic module that is modeled with traditional enzyme kinetics methods. In this way, necessary kinetic equations and parameters are significantly reduced while maintaining simulation accuracy. Most reactions with high elasticities can be included in the static module, for which the stoichiometric matrix generated by the GEM System is directly applicable.
The GEM System can generate models automatically from public databases, but can also utilize private databases if such experimental data becomes available. Mining of high-throughput data by bioinformatics may facilitate the quantitative modeling step; for example, it should be possible to take advantage of recent progress in "metabolomics". Once genome-wide metabolome data becomes available via high-throughput techniques such as the capillary electrophoresis – electrospray ionization – mass spectrometry (CE-ESI-MS) method, metabolome data can be used to add unknown pathways, to supply the initial values of the metabolites, and to optimize kinetic parameters. Parameter fitting of time-series metabolite concentration data to general dynamic equations such as Generalized Mass Action is a possible substitution for accurate kinetic modeling, at least in the given time frame of the data set used for parameter optimization.
Our next step is to model the gene expression layer, including transcription, translation, and degradation processes. The GEM System is a powerful platform for this purpose, in no small part because the genome-based approach enables a link to databases of different fields based on the nucleotide sequences already described. Because the GEM System has been based on a generic bioinformatics workbench, that is, the G-language Genome Analysis Environment , the system can directly access genome sequences and perform computational genome data-mining. Required parameters or information such as the structure of a promoter can be directly obtained from the genome sequence as the simulation takes place. In this respect, GEM System can be extended to be applicable for the modeling of eukaryotes, by identifying protein subcellular localizations from database reference and with predictable methods [37, 38]. Although the parameters in the functional annotation process should be revised to cope with the information availability and the existence of a multitude of duplicate gene paralogs, by selecting tissue specific gene expression pattern with expressed sequence tags (EST) or microarray data, the general approach of the GEM System should also be applicable for tissue specific cellular models of higher eukaryotes. In sum, the rapid accumulation of biological information now allows the realization of integrative systems biology, but at the same time makes manual modeling unrealistic; therefore, a genome-based automatic modeling procedure is a crucial step forward for the grand challenge to construct life in a computer.
The GEM System facilitates systems biology research by prototyping a metabolic pathway simulation model from a genome. Given a complete genome, all modeling procedures are automated with configurable options, generating stoichiometric models in SBML format that are readily usable by cell simulators. In comparison with the KEGG organism-specific databases, the qualitative modeling step has high accuracy, with few false positives and negatives. More than 90 models generated from complete bacterial genomes are available for download online, with visualized pathway maps and gene lists.
Kitano H: Systems biology: a brief overview. Science 2002, 295: 1662–1664. 10.1126/science.1069492
Ni TC, Savageau MA: Application of biochemical systems theory to metabolism in human red blood cells. Signal propagation and accuracy of representation. J Biol Chem 1996, 271: 7927–7941. 10.1074/jbc.271.14.7927
Covert MW, Knight EM, Reed JL, Herrgard MJ, Palsson BO: Integrating high-throughput and computational data elucidates bacterial networks. Nature 2004, 429: 92–96. 10.1038/nature02456
Normile D: Building working cells 'in silico'. Science 1999, 284: 80–81. 10.1126/science.284.5411.80b
Kitano H: Computational systems biology. Nature 2002, 420: 206–210. 10.1038/nature01254
Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H, Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A, Cuellar AA, Dronov S, Gilles ED, Ginkel M, Gor V, Goryanin II, Hedley WJ, Hodgman TC, Hofmeyr JH, Hunter PJ, Juty NS, Kasberger JL, Kremling A, Kummer U, Le Novere N, Loew LM, Lucio D, Mendes P, Minch E, Mjolsness ED, Nakayama Y, Nelson MR, Nielsen PF, Sakurada T, Schaff JC, Shapiro BE, Shimizu TS, Spence HD, Stelling J, Takahashi K, Tomita M, Wagner J, Wang J: The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003, 19: 524–531. 10.1093/bioinformatics/btg015
Takahashi K, Yugi K, Hashimoto K, Yamada Y, Pickett CJF, Tomita M: Computational Challenges in Cell Simulation. IEEE Intelligent Systems 2002, 17: 64–71. 10.1109/MIS.2002.1039834
Mendes P: GEPASI: a software package for modelling the dynamics, steady states and control of biochemical and other systems. Comput Appl Biosci 1993, 9: 563–571.
Pinney JW, Shirley MW, McConkey GA, Westhead DR: metaSHARK: software for automated metabolic network prediction from DNA sequence and its application to the genomes of Plasmodium falciparum and Eimeria tenella. Nucleic Acids Res 2005, 33: 1399–1409. 10.1093/nar/gki285
Sun J, Zeng AP: IdentiCS--identification of coding sequence and in silico reconstruction of the metabolic network directly from unannotated low-coverage bacterial genome sequence. BMC Bioinformatics 2004, 5: 112. 10.1186/1471-2105-5-112
Karp PD, Paley S, Romero P: The Pathway Tools software. Bioinformatics 2002, 18 Suppl 1: S225–32.
Yeh I, Hanekamp T, Tsoka S, Karp PD, Altman RB: Computational analysis of Plasmodium falciparum metabolism: organizing genomic information to facilitate drug discovery. Genome Res 2004, 14: 917–924. 10.1101/gr.2050304
Slepchenko BM, Schaff JC, Macara I, Loew LM: Quantitative cell biology with the Virtual Cell. Trends Cell Biol 2003, 13: 570–576. 10.1016/j.tcb.2003.09.002
Delcher AL, Harmon D, Kasif S, White O, Salzberg SL: Improved microbial gene identification with GLIMMER. Nucleic Acids Res 1999, 27: 4636–4641. 10.1093/nar/27.23.4636
Meyer F, Goesmann A, McHardy AC, Bartels D, Bekel T, Clausen J, Kalinowski J, Linke B, Rupp O, Giegerich R, Puhler A: GenDB--an open source genome annotation system for prokaryote genomes. Nucleic Acids Res 2003, 31: 2187–2195. 10.1093/nar/gkg312
Andrade MA, Brown NP, Leroy C, Hoersch S, de Daruvar A, Reich C, Franchini A, Tamames J, Valencia A, Ouzounis C, Sander C: Automated genome sequence analysis and annotation. Bioinformatics 1999, 15: 391–412. 10.1093/bioinformatics/15.5.391
Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Gilbert J, Hammond M, Herrero J, Hotz H, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Kokocinsci F, London D, Longden I, McVicker G, Melsopp C, Meidl P, Potter S, Proctor G, Rae M, Rios D, Schuster M, Searle S, Severin J, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C, Birney E: Ensembl 2005. Nucleic Acids Res 2005, 33 Database Issue: D447–53.
Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31: 365–370. 10.1093/nar/gkg095
Brooksbank C, Camon E, Harris MA, Magrane M, Martin MJ, Mulder N, O'Donovan C, Parkinson H, Tuli MA, Apweiler R, Birney E, Brazma A, Henrick K, Lopez R, Stoesser G, Stoehr P, Cameron G: The European Bioinformatics Institute's data resources. Nucleic Acids Res 2003, 31: 43–50. 10.1093/nar/gkg066
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410. 10.1006/jmbi.1990.9999
Lester PJ, Hubbard SJ: Comparative bioinformatic analysis of complete proteomes and protein parameters for cross-species identification in proteomics. Proteomics 2002, 2: 1392–1405. 10.1002/1615-9861(200210)2:10<1392::AID-PROT1392>3.0.CO;2-L
Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29: 22–28. 10.1093/nar/29.1.22
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank: update. Nucleic Acids Res 2004, 32 Database issue: D23–6. 10.1093/nar/gkh045
Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet. Nucleic Acids Res 2002, 30: 42–46. 10.1093/nar/30.1.42
Overbeek R, Larsen N, Pusch GD, D'Souza M, Selkov EJ, Kyrpides N, Fonstein M, Maltsev N, Selkov E: WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res 2000, 28: 123–125. 10.1093/nar/28.1.123
Arita M: Additional paper: computational resources for metabolomics. Brief Funct Genomic Proteomic 2004, 3: 84–93. 10.1093/bfgp/3.1.84
Schomburg I, Chang A, Hofmann O, Ebeling C, Ehrentreich F, Schomburg D: BRENDA: a resource for enzyme data and metabolic information. Trends Biochem Sci 2002, 27: 54–56. 10.1016/S0968-0004(01)02027-8
Krieger CJ, Zhang P, Mueller LA, Wang A, Paley S, Arnaud M, Pick J, Rhee SY, Karp PD: MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res 2004, 32 Database issue: D438–42. 10.1093/nar/gkh100
Arakawa K: GEM System Database.[http://www.g-language.org/gem/]
Reed JL, Vo TD, Schilling CH, Palsson BO: An expanded genome-scale model of Escherichia coli K-12 (iJR904 GSM/GPR). Genome Biol 2003, 4: R54. 10.1186/gb-2003-4-9-r54
Keseler IM, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S, Paulsen IT, Peralta-Gil M, Karp PD: EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res 2005, 33 Database Issue: D334–7.
Mrowka R: A Java applet for visualizing protein-protein interaction. Bioinformatics 2001, 17: 669–671. 10.1093/bioinformatics/17.7.669
Christensen B, Nielsen J: Metabolic network analysis. A powerful tool in metabolic engineering. Adv Biochem Eng Biotechnol 2000, 66: 209–231.
Price ND, Reed JL, Palsson BO: Genome-scale models of microbial cells: evaluating the consequences of constraints. Nat Rev Microbiol 2004, 2: 886–897. 10.1038/nrmicro1023
Yugi K, Nakayama Y, Kinoshita A, Tomita M: Hybrid dynamic/static method for large-scale simulation of metabolism. Theor Biol Med Model 2005, 2: 42. 10.1186/1742-4682-2-42
Arakawa K, Mori K, Ikeda K, Matsuzaki T, Kobayashi Y, Tomita M: G-language Genome Analysis Environment: a workbench for nucleotide sequence data mining. Bioinformatics 2003, 19: 305–306. 10.1093/bioinformatics/19.2.305
Donnes P, Hoglund A: Predicting protein subcellular localization: past, present, and future. Genomics Proteomics Bioinformatics 2004, 2: 209–215.
Nakai K: Protein sorting signals and prediction of subcellular localization. Adv Protein Chem 2000, 54: 277–344.
The authors would like to thank the members of the GEM project: Haruo Suzuki, Koya Mori, Tatekimi Matsuzaki, Seira Nakamura, Kenji Higashi, Hiromi Komai, Hayataro Kouchi, Atsuko Kishi, and Yukino Ogawa for their time and effort. Members of E2coli proposed many insightful ideas. During implementation of the GEM System, some ideas were inspired by discussions with Ryo Hattori and Daisuke Kyuma of the G-language Project. We are grateful for the generous time and effort of the staff of the Institute for Advanced Biosciences--Hirotada Mori, Takaaki Nishioka, Akio Kanai, Tomoyoshi Soga, Takeshi Ara, Tomoya Baba, and Aya Itoh--for discussions. We would like to thank Dr. Dietmar Schomburg, Cologne University, for kindly granting us permission to make our dynamic models using kinetic information from the Brenda database publicly available. This research was supported by the Japan Society for the Promotion of Science (JSPS), and supported in part by the Ministry of Education, Culture, Sports, Science and Technology, with a Grant-in-Aid for the 21st Century Center of Excellence (COE) Program entitled "Understanding and Control of Life's Function via Systems Biology (Keio University)".
KA conceived the system, developed the software, carried out the validations, and drafted the manuscript. YY and KS participated in the design of the software. YN and MT supervised the work. All authors read and approved the final manuscript.
About this article
Cite this article
Arakawa, K., Yamada, Y., Shinoda, K. et al. GEM System: automatic prototyping of cell-wide metabolic pathway models from genomes. BMC Bioinformatics 7, 168 (2006) doi:10.1186/1471-2105-7-168
- Metabolic Flux Analysis
- System Biology Markup Language
- Fumarate Hydratase
- Stoichiometric Matrix
- Draft Model