Correcting ligands, metabolites, and pathways
© Ott and Vriend; licensee BioMed Central Ltd. 2006
Received: 11 July 2006
Accepted: 28 November 2006
Published: 28 November 2006
A wide range of research areas in bioinformatics, molecular biology and medicinal chemistry require precise chemical structure information about molecules and reactions, e.g. drug design, ligand docking, metabolic network reconstruction, and systems biology. Most available databases, however, treat chemical structures more as illustrations than as a datafield in its own right. Lack of chemical accuracy impedes progress in the areas mentioned above. We present a database of metabolites called BioMeta that augments the existing pathway databases by explicitly assessing the validity, correctness, and completeness of chemical structure and reaction information.
The main bulk of the data in BioMeta were obtained from the KEGG Ligand database. We developed a tool for chemical structure validation which assesses the chemical validity and stereochemical completeness of a molecule description. The validation tool was used to examine the compounds in BioMeta, showing that a relatively small number of compounds had an incorrect constitution (connectivity only, not considering stereochemistry) and that a considerable number (about one third) had incomplete or even incorrect stereochemistry. We made a large effort to correct the errors and to complete the structural descriptions. A total of 1468 structures were corrected and/or completed. We also established the reaction balance of the reactions in BioMeta and corrected 55% of the unbalanced (stoichiometrically incorrect) reactions in an automatic procedure. The BioMeta database was implemented in PostgreSQL and provided with a web-based interface.
We demonstrate that the validation of metabolite structures and reactions is a feasible and worthwhile undertaking, and that the validation results can be used to trigger corrections and improvements to BioMeta, our metabolite database. BioMeta provides some tools for rational drug design, reaction searches, and visualization. It is freely available at http://www.cmbi.ru.nl/biometa/ provided that the copyright notice of all original data is cited. The database will be useful for querying and browsing biochemical pathways, and to obtain reference information for identifying compounds. However, these applications require that the underlying data be correct, and that is the focus of BioMeta.
The importance of knowledge about metabolites for understanding life is well demonstrated by their prominent role in the Kyoto Encyclopedia of Genes and Genomes [1–5], MetaCyc, the Boehringer-Mannheim charts [7, 8], Brenda [9, 10], ExPASy, ChEBI, or PubChem. These databases vary considerably in their focus. Some have a strong emphasis on enzymatic information, while others are metabolic databases containing, for example, information about metabolites, reactions, enzymes, and genes. Most of these systems also contain a limited number of small xenobiotic compounds.
Three frequently used pathway databases are KEGG, MetaCyc, and Brenda. KEGG is a suite of databases and associated software, interlinking data on small compounds, reactions, enzymes, and genes. The graphical pathway maps to which the databases are linked are an important feature of KEGG. MetaCyc is a curated database of experimentally elucidated metabolic pathways from many organisms. It contains data about pathways and their associated small compounds, enzymes, and genes. KEGG and MetaCyc both contain data on metabolites; unfortunately, MetaCyc does not hold atomic information on small compounds. The metabolite data in KEGG (the Compound section of the Ligand database) have been organized such that they are easily downloadable as chemical structure files in the MDL molfile format.
The Boehringer-Mannheim wall charts offer a glimpse on the enormous complexity of the interlinked metabolic network. The small-molecule part of these charts has been extracted into a C@rol database called BioPath. Brenda is a curated enzyme database that provides pictures of reaction diagrams and chemical structures of small compounds. ChEBI is a dictionary of molecular entities focusing on small compounds. PubChem is a database of chemical structures of small compounds and information on their biological activities. Many of these databases, especially ChEBI and PubChem, contain cross-references to other databases, notably KEGG. PubChem merely lists these references, but in ChEBI the entries are curated and classified using a chemical ontology.
Even though the systems mentioned above provide a wealth of data, they cover only a very small portion of all possible metabolites. Estimates on the total number of metabolites range from 200,000 to about 1,000,000, but even this higher estimate may be conservative. If plant and bacterial secondary metabolites (metabolites that are not necessary to keep the organism alive) are included then the numbers are enormously larger. The probable number of metabolites is also considerably larger than the number of corresponding genes, so it seems that the currently available databases cover at best 2% of the total number of metabolites. Of course, this discussion includes only metabolites from biochemical pathways, not the catabolism of xenobiotics – the number of small compounds involved in those processes may go up indefinitely as many thousands of xenobiotics are being developed every year.
The limited availability of metabolite data stands in marked contrast to the high demand for them. A wide range of research areas in bioinformatics, molecular biology, and medicinal chemistry require chemical structure information about molecules and reactions. This need is best seen for fields like total synthesis of natural products, drug design, ligand docking, metabolomics, metabolic network reconstruction, or systems biology. Metabolites have been used in several ways in drug design. First, endogenous human metabolites can be used as leads in drug design. Second, many metabolites from plants or other sources are medicines or good leads for drug design. All such applications require the molecular information to be correct, complete, and accurate. We have therefore set out to design and implement BioMeta, a database that aims at providing correct metabolite structures and correct reactions. The philosophy behind the correction principles is that enzymes cannot invent new chemistry; they can only speed up existing chemistry. So, if a metabolic conversion does not make sense from an organic chemistry point of view, it also does not make sense from a metabolic point of view.
Structure descriptions of compounds can be checked automatically for incorrect valences and undefined stereocenters, and reactions can be checked automatically for incorrect stoichiometry. Once a structure description is administratively correct and completely defined, further error checking (incorrect composition, connectivity, or stereochemistry) will require manual inspection and comparison to other sources, e.g., original references and other compounds related to it through known reactions. However, even for the automatic validations, no general tools are currently available, so we developed them specially for BioMeta.
BioMeta is a relational database containing information about known metabolites and the validation of their structures. It also holds metabolic reactions. It is based entirely on freely available metabolite data (mainly from KEGG) and is freely available as a web service (provided that the copyright notices of the original data providers are respected).
Construction and Content
BioMeta database design
The reactions table contains information pertaining to reactions as a whole, such as reversibility, balance, or the KEGG accession number. The relations between molecules and reactions are stored in the Rxn-Mol link table, each row in this table describing the role (reactant, product) and stoichiometry of a particular molecule in a particular reaction. The relations between reactions and enzymes are stored in the Rxn-Enz link table; each row in this table indicates that a particular enzyme catalyzes a particular reaction. The database does not contain other information about pathways or pathway maps, nor does it contain gene, species, or cellular localization information.
An additional data table (not shown in Figure 1) is used to store molecular formula information. This table contains the appropriate coefficient for each compound/element combination (e.g., the 2 in H2O). The field ElemCount in the Compounds data table contains the number of different elements in the formula of a compound. In combination, they allow formula searches such as "all compounds with twenty carbon atoms and at least 38 hydrogen atoms and at most three different elements".
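The formula search described above can be sketched as a join between the compounds table and the formula table. This is a minimal illustration, not the BioMeta schema itself: the table and column names (`compounds`, `formula`, `elem_count`, `coefficient`) and the compound identifiers are assumptions, and Python's built-in `sqlite3` module stands in for PostgreSQL so that the example runs without a server.

```python
import sqlite3

# In-memory stand-in for the real database (BioMeta itself uses PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compounds (
    compound_id TEXT PRIMARY KEY,   -- hypothetical identifier
    name        TEXT,
    elem_count  INTEGER             -- number of different elements (cf. ElemCount)
);
CREATE TABLE formula (
    compound_id TEXT REFERENCES compounds(compound_id),
    element     TEXT,
    coefficient INTEGER,            -- e.g. the 2 in H2O
    PRIMARY KEY (compound_id, element)
);
""")

# A few illustrative compounds with their molecular formulas.
rows = {
    "X1": ("icosane", {"C": 20, "H": 42}),
    "X2": ("icosanoic acid", {"C": 20, "H": 40, "O": 2}),
    "X3": ("arachidonic acid", {"C": 20, "H": 32, "O": 2}),
    "X4": ("water", {"H": 2, "O": 1}),
}
for cid, (name, formula) in rows.items():
    conn.execute("INSERT INTO compounds VALUES (?, ?, ?)",
                 (cid, name, len(formula)))
    conn.executemany("INSERT INTO formula VALUES (?, ?, ?)",
                     [(cid, el, n) for el, n in formula.items()])


def search_formula(conn):
    """All compounds with 20 C atoms, at least 38 H atoms, at most 3 elements."""
    query = """
    SELECT c.name
    FROM compounds c
    JOIN formula fc ON fc.compound_id = c.compound_id AND fc.element = 'C'
    JOIN formula fh ON fh.compound_id = c.compound_id AND fh.element = 'H'
    WHERE fc.coefficient = 20
      AND fh.coefficient >= 38
      AND c.elem_count <= 3
    ORDER BY c.name
    """
    return [row[0] for row in conn.execute(query)]


hits = search_formula(conn)
print(hits)
```

Storing one row per compound/element pair keeps each element constraint a simple join condition, while the precomputed element count makes the "at most three different elements" clause a plain comparison.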
Compounds and reactions in the KEGG Ligand database
The KEGG metabolic pathways are graphical maps displaying compounds and reactions from the Ligand database [1–4]. This Ligand database is tightly coupled to the KEGG pathway maps. It consists of three sections: Compound, Reaction, and Enzyme. The Compound section contains about 13,000 small compounds, most of which are involved in enzymatic reactions as substrates, products, cofactors, or inhibitors. A number of drugs and xenobiotics have also been included but these are currently being transferred to a separate Drug section in the KEGG Ligand database. Each compound entry contains an ID code, CAS registry number, common name, synonyms, systematic name, chemical formula, structure as an MDL molfile with a GIF image, reaction links, and enzyme links. The Reaction section contains about 6,500 reactions. Each reaction entry contains an ID code, name of the enzyme, a textual description of the reaction, chemical structures of the substrates and products as an MDL rxnfile and as a GIF image, an equation expressed in compound ID codes, links to Enzyme entries, and a link to the corresponding KEGG pathway map. The rxnfiles are constructed from the molfiles of the participating compounds. The Enzyme section (about 4,500 entries) contains the enzymes, indexed by their EC number. The majority of entries (compounds, reactions, and enzymes) in BioMeta were obtained from KEGG.
Lack of stereochemical completeness may also prevent database normalization. When a compound is entered in a relational database, duplicate checking must prevent redundant entries. If the new structure is actually the same as one already present in the database but it is not completely described, the duplicate check is likely to fail and a new compound entry is wrongly introduced. In the case of metabolic modeling, incomplete or erroneous networks may be built because the chemical identity of two compounds from different reactions goes undetected.
We obtained the reactions from the KEGG Ligand database in the form of an ASCII file. This file contains neither information about reversibility nor, for irreversible reactions, about their direction. Reversibility/direction information is obtained from a separate ASCII file which KEGG maintains in connection with their graphical maps. Another important issue is the reaction balance, which indicates whether an equal number of atoms of each element and an equal total charge are present on both sides of the reaction arrow. The KEGG Reaction section of the Ligand database contained 6089 reactions, of which 5323 were provided with fully described and non-polymeric structures. The other 766 reactions either had missing structures (e.g., "acceptor" or "phosphorylated protein") or involved polymeric compounds (e.g., "oligopeptide" or "starch"), preventing assessment of their balance. We found that 3711 reactions were balanced and that 1612 were unbalanced. Unbalanced reactions obviously cannot be used for the automatic construction of reaction networks as is done in metabolic modeling and systems biology. It is an easy matter to identify the unbalanced reactions, but a major problem to correct them. The cases where just a simple component such as H+, H2O, CO2, or H3PO4 is missing could be amenable to automatic correction. Most cases, however, will require tedious manual correction. Using an automatic procedure, we have corrected the reactions where the "imbalance" was H2O, H+, or 2H+, accounting for 893 reactions (55% of 1612). Limited resources have prevented us from making a more thorough attempt.
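The balance check and the simple automatic correction can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used for BioMeta: formulas are taken to be flat strings without brackets, each reaction side is a list of hypothetical `(coefficient, formula, charge)` tuples, and the function names are invented for this example.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Parse a flat formula such as 'C6H12O6' into element counts (no brackets)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_totals(side):
    """Sum atoms and charge over one side: [(coefficient, formula, charge), ...]."""
    atoms, charge = Counter(), 0
    for coeff, formula, q in side:
        for element, n in parse_formula(formula).items():
            atoms[element] += coeff * n
        charge += coeff * q
    return atoms, charge


def imbalance(reactants, products):
    """Return (element difference, charge difference), reactants minus products."""
    left_atoms, left_q = side_totals(reactants)
    right_atoms, right_q = side_totals(products)
    elements = set(left_atoms) | set(right_atoms)
    diff = {e: left_atoms[e] - right_atoms[e]
            for e in elements if left_atoms[e] != right_atoms[e]}
    return diff, left_q - right_q


# The simple species considered for automatic correction in the text.
SIMPLE_SPECIES = {
    "H2O": ({"H": 2, "O": 1}, 0),
    "H+": ({"H": 1}, 1),
    "2 H+": ({"H": 2}, 2),
}


def suggest_fix(diff, dq):
    """Suggest adding one simple species if that alone balances the reaction."""
    for name, (atoms, q) in SIMPLE_SPECIES.items():
        if diff == {e: -n for e, n in atoms.items()} and dq == -q:
            return f"add {name} to reactants"
        if diff == atoms and dq == q:
            return f"add {name} to products"
    return None


# Ester hydrolysis written without water: ethyl acetate -> acetic acid + ethanol.
reactants = [(1, "C4H8O2", 0)]
products = [(1, "C2H4O2", 0), (1, "C2H6O", 0)]
diff, dq = imbalance(reactants, products)
print(diff, dq, suggest_fix(diff, dq))
```

A reaction is balanced when the element difference is empty and the charge difference is zero; anything that does not match one of the simple species falls through to manual inspection.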
Chemical structure validation software
Determining and checking valency;
Ring and aromaticity detection;
Calculation of molecular formula, weight, and exact mass;
Calculation of canonical string identifiers.
MDL molfiles describe 2D chemical structures in a valence-bond representation. Valences can therefore be checked using the Lewis structure concept (i.e., the number of electrons in the valence shell of first-row elements is usually eight and can only be less, never more). As a rule, the structures are hydrogen-suppressed (hydrogen atoms occur only when needed to indicate stereochemical configurations), so the valence detection gives the number of (implicit) hydrogen atoms on each atom, which is, of course, needed for the calculation of the molecular formula and weight.
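A minimal sketch of this idea for hydrogen-suppressed structures is given below. It is not the Fortran validation program described here: the small valence table covers only a few common (element, charge) combinations, and the input representation (a list of atoms plus a list of bonds with integer bond orders) is an assumption made for illustration.

```python
from collections import Counter

# Allowed total valence for a few (element, formal charge) combinations.
# Illustrative subset only; a real validator needs a much fuller table.
ALLOWED_VALENCE = {
    ("C", 0): 4,
    ("N", 0): 3, ("N", 1): 4,
    ("O", 0): 2, ("O", -1): 1, ("O", 1): 3,
}


def implicit_hydrogens(atoms, bonds):
    """Return the implicit hydrogen count per atom.

    atoms: list of (element, charge); bonds: list of (i, j, order).
    Raises ValueError if an atom exceeds its allowed valence.
    """
    bond_sum = [0] * len(atoms)
    for i, j, order in bonds:
        bond_sum[i] += order
        bond_sum[j] += order
    hydrogens = []
    for index, (element, charge) in enumerate(atoms):
        if (element, charge) not in ALLOWED_VALENCE:
            raise ValueError(f"no valence rule for {element} with charge {charge}")
        free = ALLOWED_VALENCE[(element, charge)] - bond_sum[index]
        if free < 0:
            raise ValueError(f"valence exceeded on atom {index} ({element})")
        hydrogens.append(free)
    return hydrogens


def molecular_formula(atoms, bonds):
    """Element counts including the implicit hydrogens."""
    counts = Counter(element for element, _ in atoms)
    total_h = sum(implicit_hydrogens(atoms, bonds))
    if total_h:
        counts["H"] += total_h
    return dict(counts)


# Acetate, CH3-C(=O)-O(-), drawn without hydrogens.
acetate_atoms = [("C", 0), ("C", 0), ("O", 0), ("O", -1)]
acetate_bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
print(implicit_hydrogens(acetate_atoms, acetate_bonds))
print(molecular_formula(acetate_atoms, acetate_bonds))
```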
Similarly, C=C, C=N, and N=N double bonds were examined for possible cis/trans isomerism, excluding aromatic double bonds and those in cumulenes such as allenes. A bond is a stereo double bond if its "inversion" (cis-trans isomerization) would change the structure into a different stereoisomer. The 2D coordinates suffice for establishing the configuration. Only if one of the atoms on the bond is singly substituted and the bond angle at that atom is 180 degrees can the stereochemistry of a double bond remain unknown, i.e., undefined (Figure 5). Finally, the program determines whether the molecule is chiral. A molecule is chiral only if it is not superimposable onto its mirror image. The mirror image is easily obtained by inverting all stereocenters. If the mirror image is not identical to the original molecule (determined by the canonicalization routine described below), then the molecule must be chiral. If the structure in a molfile is chiral, the intended structure may be the enantiomer as it has been drawn (absolute stereochemistry) or it may be the racemic mixture of that structure (relative stereochemistry) or, perhaps, a single but unknown enantiomer. In the molfile this is indicated through the so-called "chiral flag" which is set to 1 in the case of absolute stereochemistry. If a structure is chiral, but the flag has not been set to 1 in the molfile, the validation program issues a warning, since for the purpose of a biochemical database the intended structure is expected to be a single, known enantiomer.
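The chiral-flag check can be illustrated with a short sketch. It assumes the fixed-column V2000 molfile layout, in which the counts line is the fourth line of the file and the chiral flag is its fifth three-character field. The function names are hypothetical, and the chirality of the structure itself is passed in as an argument because the mirror-image comparison by canonicalization is outside the scope of this example.

```python
def read_chiral_flag(molfile_text):
    """Return the chiral flag (0 or 1) from the counts line of a V2000 molfile."""
    lines = molfile_text.splitlines()
    counts_line = lines[3]            # three header lines precede the counts line
    field = counts_line[12:15].strip()  # fifth 3-character field
    return int(field) if field else 0


def chirality_warning(molfile_text, is_chiral):
    """Warn if a chiral structure lacks the flag for absolute stereochemistry.

    is_chiral is assumed to come from a separate mirror-image test.
    """
    if is_chiral and read_chiral_flag(molfile_text) != 1:
        return "warning: chiral structure, but chiral flag is not set to 1"
    return None


# Header and counts line only; atom and bond blocks are omitted for brevity.
header_flag_set = "\n".join([
    "example compound",
    "  hypothetical-program",
    "",
    "  5  4  0  0  1  0  0  0  0  0999 V2000",
])
header_flag_unset = "\n".join([
    "example compound",
    "  hypothetical-program",
    "",
    "  5  4  0  0  0  0  0  0  0  0999 V2000",
])

print(chirality_warning(header_flag_set, is_chiral=True))
print(chirality_warning(header_flag_unset, is_chiral=True))
```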
Validation of compounds and reactions from the KEGG Ligand database
Table: Detected and corrected problems in the BioMeta database. Columns: Type of Problem; # in KEGG; # in BioMeta. Rows: Undefined stereo double bond(s); Invalid sp3 stereocenter(s); Ambiguous sp3 stereocenter(s); Undefined sp3 stereocenter(s); Undefined sp3 stereochemistry.
Table: Statistics of sp3 stereochemical content in the KEGG Compound and BioMeta databases. Columns: # in KEGG; # in BioMeta. Rows: Undefined (i.e., omitted); Incompletely defined – meso; Incompletely defined – absolute; Incompletely defined – relative; Completely defined – meso; Completely defined – absolute; Completely defined – relative; Total not OK.
We also assessed the balance (stoichiometry) of the reactions. BioMeta contains 5323 reactions with fully described and non-polymeric structures, of which 3711 were balanced and 1612 were unbalanced. We also determined the "imbalance" of these reactions, and those for which the imbalance was H2O, H+, or 2H+ were corrected, accounting for 893 reactions (55% of 1612). Limited resources have prevented us from making a more thorough attempt.
KEGG version 3.6 contained the reaction "Fe + O2 + 4 H+ <=> Fe + 2 H2O" which prompted us to manually review all metal cations in the database. A number of those were present as "generic" cations, without an actual charge specification. To remedy this situation, six metal cations having definite oxidation states (Mn3+, Mn2+, Fe3+, Fe2+, Co3+, and Cu+) were added. Co2+ and Cu2+ were already present in KEGG. In the meantime, KEGG has also carried out this correction for the iron cations (in version 3.8) but not for manganese.
A variety of methods was used to determine the correct or intended structure. The name often provided sufficient information, but in many cases the reactions in which a compound was involved had to be consulted, either in the KEGG database or in other databases such as Brenda [9, 10], MetaCyc, or ExPASy. In the cases where database information was insufficient, the original literature had to be consulted; Brenda proved most useful for obtaining those references. We will discuss three examples of database corrections to illustrate the kinds of problems encountered, but also to illustrate the importance of these corrections for, e.g., systems biology.
Examples of validations and corrections
Database implementation details
The BioMeta database was implemented in PostgreSQL, an open-source relational database management system. Its contents are also stored in text (ASCII) files, and Python scripts have been written to import these files into the database and to export the database contents into the text files. When the database is being filled, the output from the chemical validation software is included in the database import. The validation software has been written in Fortran. Python scripts have also been used for the web interface.
Utility and Discussion
The web interface displays the various data fields calculated from the structures and the reactions, including the validation results. For compounds, the stereochemical information (field "Stereochemistry") is displayed with respect to completeness: "None" if the compound cannot exhibit stereoisomerism, "None (i.e., undefined)" if stereoisomerism is possible but stereochemistry is completely absent, "Meso" if the compound is achiral, "Relative" if the compound is chiral but a racemic mixture is indicated (this may or may not be intentional; drugs are often racemates), and finally "Absolute" if the compound is chiral and the enantiomer shown is the intended one. "Meso", "Relative", and "Absolute" may be followed by the remark "partially defined" if one or more stereocenters are undefined. For reactions, the field "Balanced" indicates whether the reaction is balanced or not. In the case of an unbalanced reaction, the word "No" is followed by a chemical formula representing the difference between the reactants and products. If one or more compounds have a polymeric structure or do not have a structure at all, the balance is displayed as "Unknown".
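The display rules above amount to a small decision procedure, sketched here. The function names, argument names, and exact label strings (including "Yes" for a balanced reaction and the parentheses around the extra remarks) are assumptions for illustration; only the categories themselves are taken from the description.

```python
def stereo_label(can_have_stereo, n_defined, n_undefined, is_chiral, chiral_flag):
    """Derive the 'Stereochemistry' field from validation results.

    All arguments are assumed outputs of a structure validation step.
    """
    if not can_have_stereo:
        return "None"
    if n_defined == 0:
        return "None (i.e., undefined)"
    if not is_chiral:
        label = "Meso"
    elif chiral_flag == 1:
        label = "Absolute"
    else:
        label = "Relative"
    if n_undefined > 0:
        label += " (partially defined)"
    return label


def balance_label(all_structures_known, imbalance_formula):
    """Derive the 'Balanced' field; imbalance_formula is '' when balanced."""
    if not all_structures_known:
        return "Unknown"
    if not imbalance_formula:
        return "Yes"
    return f"No ({imbalance_formula})"


print(stereo_label(True, 3, 1, True, 1))
print(balance_label(True, "H2O"))
```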
We expect that BioMeta will prove useful for querying and browsing biochemical pathways, for searching connecting reaction paths between metabolites, for viewing (calculated) three-dimensional models of the structures, for obtaining reliable molecular data on metabolites, etc. Three-dimensional structures (calculated by Corina) are already available for compounds with stereochemically completely defined structures. In the future, BioMeta may also provide the basis of several inference engines. For example, graph-theoretical approaches can be applied to determine pathways from series of individual enzymatic reactions.
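As a hedged illustration of such a graph-theoretical approach, the sketch below runs a breadth-first search over a compound graph built from reactions. It is not a feature of BioMeta as described here. The reaction identifiers and tuple layout are hypothetical, and the search naively links every reactant to every product, so a practical version would have to handle ubiquitous compounds such as ATP or water separately.

```python
from collections import deque


def build_graph(reactions):
    """Map each compound to [(reaction_id, neighbour), ...].

    reactions: list of (reaction_id, reactants, products, reversible).
    """
    graph = {}
    for rid, reactants, products, reversible in reactions:
        for a in reactants:
            for b in products:
                graph.setdefault(a, []).append((rid, b))
                if reversible:
                    graph.setdefault(b, []).append((rid, a))
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns [(reaction_id, compound), ...] or None."""
    queue = deque([start])
    parent = {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while parent[node] is not None:
                rid, previous = parent[node]
                path.append((rid, node))
                node = previous
            return list(reversed(path))
        for rid, neighbour in graph.get(node, []):
            if neighbour not in parent:
                parent[neighbour] = (rid, node)
                queue.append(neighbour)
    return None


# Two illustrative reactions with made-up identifiers.
reactions = [
    ("R1", ["glucose", "ATP"], ["glucose 6-phosphate", "ADP"], False),
    ("R2", ["glucose 6-phosphate"], ["fructose 6-phosphate"], True),
]
graph = build_graph(reactions)
forward = shortest_path(graph, "glucose", "fructose 6-phosphate")
backward = shortest_path(graph, "fructose 6-phosphate", "glucose")
print(forward, backward)
```

A path search of this kind is only meaningful if the compounds shared between reactions are recognized as identical, which is why complete and correct structure descriptions matter for network construction.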
We demonstrate that the validation of metabolite structures and reactions is a feasible and worthwhile undertaking, and that the validation results can be used to trigger corrections and improvements to BioMeta, our metabolite database. BioMeta provides some tools for rational drug design, reaction searches, and visualization. The database will be useful for querying and browsing biochemical pathways, and to obtain reference information for identifying compounds, and for all other applications that require the underlying molecular data to be correct.
We have made our corrections available to KEGG and will keep doing so for the foreseeable future.
Availability and requirements
The BioMeta database is freely available as a web service provided the copyright notice of all original data is cited. The restrictions for use of the database are the same as those for the use of the KEGG Ligand database. Academic users may freely use the web site. Non-academic users may also use the web site as end users, but any form of distribution is not allowed.
The interface makes use of the JME (Java Molecular Editor) to display structures and to draw structure queries, so the browser needs to be Java-enabled.
Project name: The BioMeta Database
Project home page: http://www.cmbi.ru.nl/biometa/
Browser requirements: Microsoft Internet Explorer works best, but other browsers (e.g., Firefox) will function satisfactorily.
Programming language: Java (no version restrictions) for the JME applet and for Jmol (to display 3D structures).
The authors are indebted to KEGG (Kyoto Encyclopedia of Genes and Genomes) for making their molecular data publicly available. Use of the JME Molecular Editor, courtesy of Peter Ertl (Novartis AG), is gratefully acknowledged. The authors appreciate many stimulating discussions with the members of the CDD group at the CMBI and Organon NV. GV acknowledges financial support from the BioRange programme of NBIC, which is supported by a BSIK grant through NGI, and the BioSapiens EU FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health" contract number LSHG-CT-2003-503265.
- KEGG (Kyoto Encyclopedia of Genes and Genomes) Ligand database[http://www.genome.ad.jp/kegg/]
- Kanehisa M: A database for post-genome analysis. Trends Genet 1997, 13: 375–376. 10.1016/S0168-9525(97)01223-7
- Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 2000, 28: 27–30. 10.1093/nar/28.1.27
- Kanehisa M, Goto S: LIGAND: chemical database of enzyme reactions. Nucleic Acids Res 2000, 28: 380–382. 10.1093/nar/28.1.27
- Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 2006, 34: D354–357. 10.1093/nar/gkj102
- Krieger CJ, Zhang P, Mueller LA, Wang A, Paley S, Arnaud M, Pick J, Rhee SY, Karp PD: MetaCyc: A Multiorganism Database of Metabolic Pathways and Enzymes. Nucleic Acids Res 2004, 32: D438–442. 10.1093/nar/gkh100
- The Roche Applied Science "Biochemical Pathways" wall chart Boehringer Mannheim GmbH – Biochemica 1993.
- Michal G: Biochemical Pathways: An Atlas of Biochemistry and Molecular Biology. New York: Wiley & Sons; 1999.
- Schomburg I, Chang A, Schomburg D: BRENDA, enzyme data and metabolic information. Nucleic Acids Res 2002, 30: 47–49. 10.1093/nar/30.1.47
- Schomburg I, Chang A, Ebeling C, Gremse M, Heldt C, Huhn G, Schomburg D: BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res 2004, 32: D431–433. 10.1093/nar/gkh081
- Gasteiger E, Gattiker A, Hoogland C, Ivanyi I, Appel RD, Bairoch A: ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res 2003, 31: 3784–3788. 10.1093/nar/gkg563
- De Matos P, Ennis M, Darsow M, Guedj M, Degtyarenko K, Apweiler R: ChEBI – Chemical Entities of Biological Interest. Nucleic Acids Res 2006. Database Summary Paper 646.
- PubChem, a database of 'small' molecules and their biological activities[http://pubchem.ncbi.nlm.nih.gov/]
- Dalby A, Nourse JG, Hounshell WD, Gushurst AKI, Grier DL, Leland BA, Laufer J: Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited. J Chem Inf Comput Sci 1992, 32: 244–255. 10.1021/ci00007a012
- C@rol, a chemical warehouse system by Molecular Networks GmbH[http://www.mol-net.de/]
- Biochemical Pathways Database (BioPath) by Molecular Networks GmbH[http://www.mol-net.de/]
- Ceres, Inc[http://www.ceres-inc.com/techno/platforms/metab.html]
- Wink M: Plant breeding: importance of plant secondary metabolites for protection against pathogens and herbivores. Theor Appl Genet 1988, 75: 225–233. 10.1007/BF00303957
- Schwab W: Metabolome diversity: too few genes, too many metabolites? Phytochemistry 2003, 62: 837–849. 10.1016/S0031-9422(02)00723-9
- Lee K-H: Anticancer Drug Design Based on Plant-Derived Natural Products. J Biomed Sci 1999, 6: 236–250.
- BioMeta database[http://www.cmbi.ru.nl/biometa/]
- Morgan HL: The generation of a unique machine description for chemical structures – A technique developed at chemical abstracts service. J Chem Doc 1965, 5: 107–113. 10.1021/c160017a018
- Wipke WT, Dyott TM: Stereochemically Unique Naming Algorithm. J Am Chem Soc 1974, 96: 4834–4842. 10.1021/ja00822a021
- Weininger D, Weininger A, Weininger JL: SMILES. 2. Algorithm for Generation of Unique SMILES Notation. J Chem Inf Comput Sci 1989, 29: 97–101. 10.1021/ci00062a008
- Van Aalten DMF, Bywater R, Findlay JBC, Hendlich M, Hooft RWW, Vriend G: PRODRG: a program for generating molecular topologies and unique molecular descriptors from coordinates of small molecules. J Comput-Aided Mol Des 1996, 10: 255–262. 10.1007/BF00355047
- The IUPAC International Chemical Identifier (InChI)[http://www.iupac.org/inchi/]
- CrossFire Beilstein, a large organic chemistry database[http://mdl.com/products/knowledge/crossfire_beilstein/]
- SciFinder, a tool to query the Chemical Abstracts Services database[http://www.cas.org/SCIFINDER/]
- Volk R, Bacher A: Biosynthesis of Riboflavin. Studies on the mechanism of L-3,4-dihydroxy-2-butanone 4-phosphate synthase. J Biol Chem 1991, 266: 20610–20618.
- Williams DR, Trudgill PW, Taylor DG: Metabolism of 1,8-cineole by a Rhodococcus species: Ring cleavage reactions. J Gen Microbiol 1989, 135: 1957–1967.
- PostgreSQL, an open-source relational database management system[http://www.postgresql.org/]
- Python, a dynamic object-oriented programming language[http://www.python.org/]
- Ertl P, Jacob O: WWW-based chemical information system. Theochem 1997, 419: 113–120. 10.1016/S0166-1280(97)00179-6
- Corina, a generator of 3D structures from connection tables by Molecular Networks GmbH[http://www.mol-net.de/]
- Arita M: The metabolic world of Escherichia coli is not small. Proc Nat Acad Sci USA 2004, 101: 1543–1547. 10.1073/pnas.0306458101
- Jmol, an interactive web browser applet for viewing molecules[http://jmol.sourceforge.net/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.