A new network representation of the metabolism to detect chemical transformation modules
© Sorokina et al. 2015
Received: 1 July 2015
Accepted: 29 October 2015
Published: 14 November 2015
Metabolism is generally modeled by directed networks where nodes represent reactions and/or metabolites. In order to explore metabolic pathway conservation and divergence among organisms, previous studies were based on graph alignment to find similar pathways. Few years ago, the concept of chemical transformation modules, also called reaction modules, was introduced and correspond to sequences of chemical transformations which are conserved in metabolism. We propose here a novel graph representation of the metabolic network where reactions sharing a same chemical transformation type are grouped in Reaction Molecular Signatures (RMS).
RMS were automatically computed for all reactions and encode changes in atoms and bonds. A reaction network containing all available metabolic knowledge was then reduced by an aggregation of reaction nodes and edges to obtain a RMS network. Paths in this network were explored and a substantial number of conserved chemical transformation modules was detected. Furthermore, this graph-based formalism allows us to define several path scores reflecting different biological conservation meanings. These scores are significantly higher for paths corresponding to known metabolic pathways and were used conjointly to build association rules that should predict metabolic pathway types like biosynthesis or degradation.
This representation of metabolism in a RMS network offers new insights to capture relevant metabolic contexts. Furthermore, along with genomic context methods, it should improve the detection of gene clusters corresponding to new metabolic pathways.
In bioinformatics, metabolism is generally modeled by directed networks where nodes represent reactions and/or metabolites and edges the product/substrate exchanges between reactions . Metabolic network reconstruction of a given organism generally starts with its genome annotation that predicts enzymatic activities from coding sequences and, therefore, the corresponding reactions and metabolites of the network. However, two main bottlenecks limit today this reconstruction by homology: the difficulty in associating correct functions to genes and the lack of experimental characterization of enzyme activities for which proteins are sometimes unknown, i.e. orphan enzymes .
Subgraphs of these networks are often used to represent metabolic pathways that group sets of connected reactions involved in a same biological process. Several hypotheses on the origin and evolution of metabolic pathways have been proposed, including patchwork evolution by enzyme recruitment in new metabolic pathways [3, 4], retrograde synthesis which postulates that metabolic pathways are constructed starting from the final metabolite , and the theory on metabolic pathway duplication . Despite their differences, these hypotheses agree about the importance of enzyme promiscuity in the evolution of metabolic pathways, i.e. the capacity of enzymes to catalyze one or several types of reactions on more or less different substrates. A recent study in Escherichia coli successfully brings out this enzyme capacity to adapt themselves to new substrates .
In order to explore metabolic pathway conservation and divergence among organisms, previous studies were based on pathway alignment to find similar pathways within or between organisms using the Enzyme Commission (EC) numbers to define reaction similarities [8–11]. Due to limitations of the EC classification, the notion of reaction similarity for pathway alignment was improved using metabolite similarity  or substructure changes . Another approach, that does not require predefined pathways, was based on the detection of motifs in a reaction network . Few years ago, the concept of chemical transformation modules, also called reaction modules, was introduced by Muto et al. . They correspond to sequences of chemical transformations which are conserved in metabolism. These modules capture the chemical logic of pathways that may correspond or not to conserved sets of enzymes. Muto et al. made a systematic analysis of the conservation of reaction modules by aligning metabolic pathways from KEGG  and used RClass (Reaction Class)  to group reactions having same patterns of chemical transformations. The same year, Barba et al.  published a study on the modularity of the purine and pyrimidine metabolism, which presents chemical reaction similarities, and also enriched the reaction module definition with the notion of enzyme homology.
In the present work, we propose a different formalism for the detection of reaction modules, although we use the same definition of modules as Muto et al. . Instead of using pathway alignment, we adopt an innovative graph representation of the metabolism where the reaction network is reduced in a Reaction Molecular Signature (RMS) network. For that, RMS are automatically computed for all reactions and encode changes in atoms and bonds as described in . Thereby, reactions sharing a same signature are grouped together. Paths in the RMS network are then explored to detect conserved modules. Furthermore, this graph-based formalism allows us to define several path scores reflecting different biological conservation meanings. These scores are finally analyzed for all possible paths in the network and for known metabolic ones and used to build association rules that should predict metabolic pathway types like metabolite biosynthesis or degradation.
Metabolic data was extracted from MetaCyc public database version 19.0 . MetaCyc contains a large collection of curated metabolic pathways from all domains of life. In addition, metabolites, reactions, enzymes and genes are also listed. Metabolic pathways described in MetaCyc are generally short (4.3 reactions on average) and have been experimentally elucidated in at least one organism. A metabolic network was reconstructed using MetaCyc reactions as nodes. We linked two reactions by a directed edge when the product of one reaction is the substrate of the other one. However, to avoid the high connectivity problems that are common when building such metabolic networks, we limited shared compounds to “main compounds”, i.e. metabolites deemed biologically relevant to both reactions in at least one metabolic pathway. Only reactions that belong to a metabolic pathway were taken into account, as only these ones have distinction between main metabolites and co-substrates supporting the reaction such as water, ATP or NAD. Transport reactions, for which translocated substrate remains unchanged, were excluded from the network construction and from further analysis, e.g. ABC transporter ATPase reactions corresponding to 3.6.3.- EC class.
Reaction molecular signatures
Reaction Molecular Signatures (RMS) were computed for all MetaCyc reactions, belonging or not to a metabolic pathway, as described in . These signatures encode changes in atoms and bonds where the reaction is taking place. First, structures of all molecules involved in a reaction were downloaded from MetaCyc website in MDL Molfile format. Using ChemAxon MolConvert software , all molecules were standardized by adding implicit hydrogen atoms and applying aromatization when needed. Stereo signature molecular descriptors  were then computed for heights 1 and 2 with the MolSig software (http://molsig.sourceforge.net). These molecular signatures are encoded using SMILES-like strings  and the height parameter corresponds to a distance for the inclusion of neighbour atoms and bonds up from a given atom. Second, corresponding RMS were generated for each molecular signature height by calculating the difference between the signatures of the products and of the substrates. To obtain correct RMS, reaction equations have to be balanced with explicit compounds for which Molfile structures are available. It should be noticed that (i) for a given height, a reaction has only one RMS signature (ii) reactions sharing a same RMS have similar chemical transformations (iii) the higher the height value is more the signature is precise. RMS of height 1 (RMS-H1) capture the reaction center with atom and bond changes. To compute RMS of height 2 (RMS-H2), RMS-H1 were partitioned in sub-groups having similar signatures at height 2. Distances between signatures were computed using an approximate string matching algorithm . Then, a hierarchical clustering was build on these distances using the Ward algorithm  and the tree was cut at a cophenetic distance threshold of 90. To deal with reaction directionality, RMS having strictly opposite signatures were merged in a single entry. Higher values of the height parameter were not used because they lead to too precise signatures with many describing only one reaction. The RMS classification of reactions is available in Additional file 1 and the source code for the RMS computation was deposited in GitHub (https://github.com/mSorok/createRMS.git). The RMS method has been chosen in this work as it guarantees that all reactions described by the same signature perform the same chemical transformation, making manual post-process unnecessary.
RMS node weighting
Several weights, reflecting different biological conservation meanings, have been computed on nodes of the RMS networks. The first weight, wRea, corresponds to the number of MetaCyc reactions associated to a given RMS, whether they are present or not in the initial reaction network. It gives a quantitative measure of the diversity of reactions represented by a RMS.
A second weight, wPageRank, is computed using PageRank algorithm  implemented in the Jung 2.0 Java library . This topological weight is based on a network architecture exploration in order to locate influential nodes in the RMS network with the assumption that most important chemical transformations are likely to have more incoming links from other transformations.
This weight gives a quantitative measure of the diversity of enzymes associated to a RMS. High value of wProt may indicate that the chemical transformation is widely represented among organisms and/or that many enzymes catalyze this transformation because of many gene duplications or many enzyme families.
RMS path enumeration and scoring
An enumeration of all paths of length 1 (one edge and two RMS nodes) to 4 (four edges and five nodes) was made in both RMS networks using the Grph Java library . In this path enumeration, loops were not allowed (i.e. a node cannot be found more than once in a path). To make them comparable, metabolic pathways from MetaCyc were translated in overlapping RMS paths of the same length. In addition, a Pathway Conservation Index (PCI) was computed for each RMS path and represents the number of distinct corresponding reaction paths that are present in at least one MetaCyc pathway.
ScorePageRank and scoreProt are computed in the same way using wPageRank and wProt, respectively.
Results and discussion
From reaction to RMS networks
Reaction molecular signature statistics
Number of RMS
Number of reactions by RMS
Statistics on reaction network and RMS networks
Number of nodes
Number of edges
Average node degree
Average node out degree
Average node in degree
Node reduction rate
Conserved RMS paths in metabolic pathways
Number of conserved modules (P C I≥2)
Number of pathways containing at least one conserved module (length 2, P C I≥2) classified by their type
263 (42 %)
154 (24 %)
172 (47 %)
95 (25 %)
3 (27 %)
3 (23 %)
61 (78 %)
51 (65 %)
19 (33 %)
10 (17 %)
518 (46 %)
313 (27 %)
RMS path scoring and learning
Statistics on conservation scores for paths of length 2 in the RMS-H2 network
paths (n = 72173)
Paths in known
pathways (n = 3001)
Next, these scores were analyzed in the light of MetaCyc pathway classification using five main types of biological processes: biosynthesis, degradation/utilization/assimilation, detoxification, generation of precursor metabolites and energy, and a last type, called “others”, that gathers other MetaCyc main pathway classes. By performing pairwise comparisons of pathway types (i.e. Kruskal-Wallis rank sum tests completed by post-hoc Tukey’s HSD tests, see Additional file 5), we found significant differences (p-values <0.05) among all pathway types for at least one of the three conservation scores. These results presume that pathway types could be predicted by machine learning using a combination of the three scores. Thus, pathway assignment rules were generated with the NNge algorithm [35, 36] implemented in Weka . As the number of RMS paths per pathway type is very unbalanced (e.g. the “biosynthesis” class contains almost twice the number of paths than other types), classes were virtually balanced using resampling function of Weka. We successfully obtained rules that correctly classify RMS paths in pathway types with an accuracy greater than 89 % (see Additional file 6).
We present here a novel metabolic network representation where nodes are chemical transformations depicted by reaction molecular signatures. This data model is particularly useful for finding conserved chemical transformation modules in metabolic pathways as they correspond to paths in the RMS network. An important number of modules was detected and could be integrated in metabolic databases, like KEGG  or MetaCyc , to help biologists looking for similar pathways. Furthermore, new metrics (i.e. scoreRea, scoreProt and scorePageRank) were introduced to evaluate module conservation according to different biological meanings. We show that known metabolic paths present higher score values than random ones and that the scores, used conjointly, may predict module pathway types. In terms of improvement of the graph reduction method, it may be of interest to dynamically adapt the precision of the reaction signatures when merging reaction nodes to take into account the local graph topology. This could be achieved taking inspiration from the method proposed by Xu et al.  in which the maximum entropy principle and the Markov chain model-reduction problem were applied. Finally, it should be highlighted that our method can be easily adapted to other types of reaction classifications based on chemical transformations.
Although its construction is based on an initial reaction network, the RMS network offers new insights into metabolism as it could capture relevant metabolic contexts even without precise definition of initial reaction sets or metabolite structures. Indeed, more than two thousand reactions lacking a metabolic pathway were integrated in the RMS network and now share common contexts with reactions from known pathways. Furthermore, considering that many orphan enzymes have network neighbours that are orphans themselves , computational tools [39, 40] have difficulties to find candidate genes for these missing enzymes by defining correct genomic contexts (e.g. chromosomal clusters, co-occurrence profiles) that include candidate proteins and known enzymes. As a perspective, one of the possible improvements of these methods could be the use of a RMS network instead of a reaction network as it may be easier to find proper genomic contexts using relaxed notions of metabolic context. This enhancement may also be applied in the discovery of gene clusters corresponding to new metabolic pathways.
We would like to thank Anne Zaparucha and Carine Vergne-Vaxelaire for their valuable advice in chemistry, and, also, Karine Bastard and Mark Stam for their helpful suggestions on the manuscript.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Lacroix V, Cottret L, Thébault P, Sagot MF. An introduction to metabolic networks and their structural analysis. IEEE/ACM Trans Computational Biology and Bioinformatics. 2008; 5(4):594–617.View ArticleGoogle Scholar
- Sorokina M, Stam M, Médigue C, Lespinet O, Vallenet D. Profiling the orphan enzymes. Biol Direct. 2014; 9:10.View ArticlePubMedPubMed CentralGoogle Scholar
- Jensen RA. Enzyme recruitment in evolution of new function. Ann Rev Microbiol. 1976; 30:409–25.View ArticleGoogle Scholar
- Ycas M. On earlier states of the biochemical system. J Theor Biol. 1974; 44(1):145–60.View ArticlePubMedGoogle Scholar
- Horowitz NH. On the Evolution of Biochemical Syntheses. Proc Nat Acad Sci USA. 1945; 31(6):153–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Schmidt S, Sunyaev S, Bork P, Dandekar T. Metabolites: a helping hand for pathway evolution?Trends Biochem Sci. 2003; 28(6):336–41.View ArticlePubMedGoogle Scholar
- Notebaart RA, Szappanos B, Kintses B, Pal F, Gyorkei A, Bogos B, et al.Network-level architecture and the evolutionary potential of underground metabolism. Proc Nat Acad Sci USA. 2014; 111(32):11762–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Dandekar T, Schuster S, Snel B, Huynen M, Bork P. Pathway alignment: application to the comparative analysis of glycolytic enzymes. Biochemic J. 1999; 343:115–24.View ArticleGoogle Scholar
- Tohsato Y, Matsuda H, Hashimoto A. A multiple alignment algorithm for metabolic pathway analysis using enzyme hierarchy. In: Proceedings International Conference on Intelligent Systems for Molecular Biology: 2000. p. 376–83. http://europepmc.org/abstract/med/10977098.
- Pinter RY, Rokhlenko O, Yeger-Lotem E, Ziv-Ukelson M. Alignment of metabolic pathways. Bioinformatics. 2005; 21(16):3401–8.View ArticlePubMedGoogle Scholar
- Wernicke S, Rasche F. Simple and fast alignment of metabolic pathways by exploiting local diversity. Bioinformatics. 2007; 23(15):1978–85.View ArticlePubMedGoogle Scholar
- Ay F, Kellis M, Kahveci T. Submap: aligning metabolic pathways with subnetwork mappings. J Comput Biol. 2011; 18(3):219–35.View ArticlePubMedPubMed CentralGoogle Scholar
- Tohsato Y, Nishimura Y. Reaction similarities focusing substructure changes of chemical compounds and metabolic pathway alignments. IPSJ Trans Bioinformatics. 2009; 2:15–24.View ArticleGoogle Scholar
- Lacroix V, Fernandes CG, Sagot MF. Motif search in graphs: Application to metabolic networks. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics: 2006. p. 360–8. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4015378.
- Muto A, Kotera M, Tokimatsu T, Nakagawa Z, Goto S, Kanehisa M. Modular architecture of metabolic pathways revealed by conserved sequences of reactions. J Chem Inform Model. 2013; 53(3):613–22.View ArticleGoogle Scholar
- Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M. Data, information, knowledge and principle: back to metabolism in kegg. Nucleic Acids Res. 2014; 42(D1):199–205.View ArticleGoogle Scholar
- Kotera M, Okuno Y, Hattori M, Goto S, Kanehisa M. Computational assignment of the ec numbers for genomic-scale analysis of enzymatic reactions. J Am Chem Soc. 2004; 126(50):16487–98.View ArticlePubMedGoogle Scholar
- Barba M, Dutoit R, Legrain C, Labedan B. Identifying reaction modules in metabolic pathways: bioinformatic deduction and experimental validation of a new putative route in purine catabolism. BMC Syst Biol. 2013; 7:99.View ArticlePubMedPubMed CentralGoogle Scholar
- Carbonell P, Planson AG, Fichera D, Faulon JL. A retrosynthetic biology approach to metabolic pathway design for therapeutic production. BMC Syst Biol. 2011; 5(1):122.View ArticlePubMedPubMed CentralGoogle Scholar
- Caspi R, Altman T, Billington R, Dreher K, Foerster H, Fulcher CA, et al.The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res. 2014; 42(D1). http://nar.oxfordjournals.org/content/42/D1/D459.short.
- ChemAxon. JChem Base was used for structure searching and chemical database access and management. 2012. http://www.chemaxon.com.
- Carbonell P, Carlsson L, Faulon JL. Stereo signature molecular descriptor. J Chem Inform Model. 2013; 53(4):887–97.View ArticleGoogle Scholar
- Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inform Model. 1988; 28(1):31–6.View ArticleGoogle Scholar
- Diaz-Gonzalez J. FuzzyWuzzy. https://github.com/seatgeek/fuzzywuzzy 2015.
- Jr Ward JH. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 1963; 58(301):236–44.View ArticleGoogle Scholar
- Page L, Brin S, Motwani R, Winograd T. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab. 1999. http://ilpubs.stanford.edu:8090/422/.
- Team TJFD. JUNG — the Java Universal Network/Graph Framework. http://jung.sourceforge.net/ 2013.
- UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015; 43:D204–D212. http://nar.oxfordjournals.org/content/43/D1/D204.View ArticleGoogle Scholar
- Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: the protein families database. Nucl Acids Res. 2014; 42:D222–30.View ArticlePubMedGoogle Scholar
- Hogie L. Grph:The high performance graph library for Java. 2013. http://www.i3s.unice.fr/~hogie/grph/.
- Bairoch A. The ENZYME data bank. Nucleic Acids Res. 1994; 22(17):3626–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Nath N, Mitchell JBO. Is EC class predictable from reaction mechanism?BMC Bioinformatics. 2012; 13(1):60.View ArticlePubMedPubMed CentralGoogle Scholar
- Rahman SA, Cuesta SM, Furnham N, Holliday GL, Thornton JM. EC-BLAST: a tool to automatically search and compare enzyme reactions. Nat Methods. 2014; 11(2):171–4.View ArticlePubMedPubMed CentralGoogle Scholar
- Latino DARS, Zhang QY, Aires-de-Sousa JA. Genome-scale classification of metabolic reactions and assignment of EC numbers with self-organizing maps. Bioinformatics. 2008; 24(19):2236–44.View ArticlePubMedGoogle Scholar
- Roy S. Nearest Neighbor With Generalization. Christchurch, New Zealand. 2002. http://weka.sourceforge.net/doc.packages/NNge/weka/classifiers/rules/NNge.html.
- Martin B. Instance-based learning: Nearest neighbor with generalization. 1995.Google Scholar
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA Data Mining Software: An Update; SIGKDD Explorations. 2009; 11(1). Accessed 2015-04-28.Google Scholar
- Xu Y, Salapaka SM, Beck CL. On reduction of graphs and markov chain models. In: Decision and Control and European Control Conference (CDC-ECC), 2011 50th IEEE Conference On: 2011. p. 2317–322. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6160882&tag=1.
- Yamada T, Waller AS, Raes J, Zelezniak A, Perchat N, Perret A, et al. Prediction and identification of sequences coding for orphan enzymes using genomic and metagenomic neighbours. Mol Syst Biol. 2012; 8:581.View ArticlePubMedPubMed CentralGoogle Scholar
- Smith AAT, Belda E, Viari A, Medigue C, Vallenet D. The CanOE strategy: Integrating genomic and metabolic contexts across multiple prokaryote genomes to find candidate genes for orphan enzymes. PLoS Comput Biol. 2012; 8(5). http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002540.