Network analysis of metabolic enzyme evolution in Escherichia coli
© Light and Kraulis 2004
Received: 18 July 2003
Accepted: 18 February 2004
Published: 18 February 2004
The two most common models for the evolution of metabolism are the patchwork evolution model, where enzymes are thought to diverge from broad to narrow substrate specificity, and the retrograde evolution model, according to which enzymes evolve in response to substrate depletion. Analysis of the distribution of homologous enzyme pairs in the metabolic network can shed light on the respective importance of the two models. We here investigate the evolution of the metabolism in E. coli viewed as a single network using EcoCyc.
Sequence comparison between all enzyme pairs was performed and the minimal path length (MPL) between all enzyme pairs was determined. We find a strong over-representation of homologous enzymes at MPL 1. We show that the functionally similar and functionally undetermined enzyme pairs are responsible for most of the over-representation of homologous enzyme pairs at MPL 1.
The retrograde evolution model predicts that homologous enzyme pairs are at short metabolic distances from each other. In general agreement with previous studies we find that homologous enzymes occur close to each other in the network more often than expected by chance, which lends some support to the retrograde evolution model. However, we show that the homologous enzyme pairs which may have evolved through retrograde evolution, namely the pairs that are functionally dissimilar, show a weaker over-representation at MPL 1 than the functionally similar enzyme pairs. Our study indicates that, while the retrograde evolution model may have played a small part, the patchwork evolution model is the predominant process of metabolic enzyme evolution.
In 1945 one of the first theories regarding the evolution of metabolic pathways, often referred to as the retrograde evolution model, was proposed by Horowitz. It states that during evolution pathways were assembled backward relative to the direction of the pathway, in response to depletion of substrates from the environment. As an example, consider the following scenario: enzyme E1 catalyzes reaction A → B, where B is essential to the organism. When A is depleted from the environment, an organism harboring an enzyme E2 that can catalyze a reaction producing A from some other substrate in the environment would be at an advantage. Since E1 can already bind A, there is a greater chance that E1, rather than an enzyme without an affinity for A, will be duplicated and mutated into E2.
In 1976 Jensen proposed the recruitment evolution theory, more often referred to as the patchwork evolution model. The patchwork evolution model states that enzymes initially have broad substrate specificities and that specialization takes place by way of gene duplication. As an example, consider the following: an enzyme E1 catalyzes a reaction where either one of the substrates S1 and S2 is accepted. Through gene duplication and random mutation two different versions of E1 evolve: E1', which only accepts S1 as a substrate, and E1'', which only accepts S2 as a substrate.
The retrograde evolution model was at first supported by the discovery of operons, where the functionally related genes in operons were thought to have evolved through tandem duplications. The theory that operons are the remnants of tandem duplication has recently been criticized by Lawrence and Roth, who instead proposed that horizontal gene transfer may be the underlying mechanism for the occurrence of gene clusters. There are a few known homologous genes coding for enzymes which catalyze consecutive reactions and which therefore represent possible cases of retrograde evolution: trpC, trpA and trpB (in E. coli trpA and trpB are fused) that catalyze consecutive steps of the tryptophan biosynthesis, hisF and hisA in the histidine biosynthetic pathway, and metB and metC in the methionine biosynthesis.
A few recent studies give some support for the retrograde evolution model. Saqi & Sternberg showed that a super-family has a general tendency to appear in one or two particular pathway(s). Rison et al. showed that homologous enzymes are found at close distances within the (extended) pathways of E. coli, and Alves et al. showed that homologous enzymes are also found close to each other in the whole metabolic network using a modified version of the KEGG database [12, 13].
The patchwork evolution model holds that there should be many pairs of homologous enzymes that catalyze basically the same kind of reaction, where one or more substrates are non-identical but similar. Support for this theory is more abundant than for the retrograde evolution model [14–17]. The TIM-barrel containing enzymes have been found in many different pathways, and the homologous pairs of small molecule metabolism enzymes of E. coli have been shown to be evenly distributed within and across pathways [15, 16].
Metabolic networks are often partitioned into pathways, which are considered to be functionally separate units of the network. The partitioning of the metabolic network into pathways is not always straightforward. As a result there may be correlations that are not visible in a pathway-oriented perspective but which emerge in a whole-network-oriented view. There is also an element of arbitrariness involved in which compounds are considered promiscuous (compounds involved in many reactions, e.g. H2O, ATP and cofactors) and which are not. We have chosen to apply a simple network-based criterion. We count the number of reactions a compound participates in within the complete metabolic network of E. coli. The most common compounds are then considered promiscuous and are excluded from part of the analysis.
In the study presented here we investigate whether homologous E. coli enzymes can be found close to each other in the complete, unpartitioned metabolic network of E. coli as derived from the EcoCyc database. We subsequently investigate the homologous enzyme pairs found close to each other in the network and classify these enzyme pairs as cases of retrograde evolution and patchwork evolution respectively. Finally we investigate whether the correlation between metabolic network distance and homology differs for the enzyme pairs classified as retrograde evolution cases compared to the patchwork evolution cases.
Results and Discussion
Metabolic pathway information is available in several different databases such as EcoCyc, WIT, BRENDA and KEGG. EcoCyc is an E. coli-specific database and contains the metabolic complement of E. coli. We chose EcoCyc over the other available databases for three reasons: 1) EcoCyc contains information about the directionality of the enzymatic reactions; 2) some enzymes have not yet been classified according to the Enzyme Commission (EC) system, and EcoCyc contains both EC-classified enzymes and enzymes that fall outside of the EC classification (there are 1172 enzyme entries in the database and 781 of these have EC numbers); and 3) EcoCyc is freely available for universities and non-profit research institutes.
Some reactions are physiologically irreversible and should be represented accordingly. To encompass the directionality information we use a directed graph to represent the metabolic network of E. coli. As a result there are not necessarily paths from every enzyme vertex in the network to every other enzyme vertex.
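As an illustration of what the minimal path length (MPL) means in a directed graph, the following minimal sketch computes MPLs from one enzyme vertex by breadth-first search. It is not the implementation used in this study; the function name `minimal_path_lengths`, the adjacency-dictionary representation and the toy enzymes E1 to E4 are assumptions made for the example.

```python
from collections import deque

def minimal_path_lengths(graph, source):
    """Breadth-first search over a directed graph.

    graph: dict mapping each vertex to an iterable of its successors.
    Returns a dict of MPLs from `source` to every reachable vertex.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph.get(v, ()):
            if w not in dist:          # first visit gives the shortest distance
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

# Hypothetical toy network: E4 -> E1 -> E2 -> E3 (all edges one-directional)
toy = {"E1": ["E2"], "E2": ["E3"], "E3": [], "E4": ["E1"]}
mpl_from_e1 = minimal_path_lengths(toy, "E1")
mpl_from_e4 = minimal_path_lengths(toy, "E4")
```

In the toy network E4 is absent from `mpl_from_e1` because no directed path leads from E1 to E4, which mirrors the remark that not every pair of enzyme vertices is connected.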
One of the most problematic aspects of metabolic network analysis is how to handle the promiscuous compounds. One may argue that because promiscuous compounds are usually not the limiting factors of reactions, the network would become more biochemically meaningful if these compounds were removed. However, to our knowledge there is no generally accepted criterion to determine whether a compound participating in a reaction is a current compound, cofactor or main metabolite. In this study, we have chosen to formulate and apply a simple network-based criterion. We count the number of times a compound occurs as part of an edge in the network. The most common compounds are then considered promiscuous compounds. As in the study performed by Wagner and Fell, we conduct our study with one network that includes all compounds, including the promiscuous compounds, and another network where the most promiscuous compounds have been removed.
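The frequency criterion can be sketched as follows. This is only an illustration under stated assumptions: an edge is assumed to run from enzyme X to enzyme Y whenever a product of X is a substrate of Y, each edge is labeled with the shared compound, and the helper names (`build_edges`, `most_promiscuous`, `remove_compounds`) and the toy reactions are hypothetical.

```python
from collections import Counter

def build_edges(reactions):
    """reactions: dict enzyme -> (substrates, products).

    Assumption: one directed edge (src, dst, compound) for every compound
    that is a product of `src` and a substrate of `dst`.
    """
    edges = []
    for src, (_, products) in reactions.items():
        for dst, (substrates, _) in reactions.items():
            if src == dst:
                continue
            for compound in sorted(set(products) & set(substrates)):
                edges.append((src, dst, compound))
    return edges

def most_promiscuous(edges, n):
    """Return the n compounds that occur most often as part of an edge."""
    counts = Counter(compound for _, _, compound in edges)
    return [compound for compound, _ in counts.most_common(n)]

def remove_compounds(edges, compounds):
    """Drop every edge that is mediated by one of the given compounds."""
    banned = set(compounds)
    return [edge for edge in edges if edge[2] not in banned]

# Hypothetical toy reactions
reactions = {
    "E1": (["A", "ATP"], ["B", "ADP"]),
    "E2": (["B", "ATP"], ["C", "ADP"]),
    "E3": (["C", "ADP"], ["D", "ATP"]),
}
edges = build_edges(reactions)
promiscuous = most_promiscuous(edges, 2)
reduced = remove_compounds(edges, promiscuous)
```

In this toy case ATP and ADP are the two most frequent edge compounds, and removing them leaves only the edges mediated by B and C.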
Determination of network parameters
Table 1. Network parameters. The table contains some important characteristics of the network (see text for more detailed description). The second column contains the network parameters for the whole network while the third column contains the network parameters for the network where the 20 most promiscuous compounds have been removed. The average path length (D) is determined by taking the average over the MPLs between all vertex pairs. The connectivity of a vertex is defined as the number of vertices which it is connected to by an outgoing edge.
Columns: 0 compounds removed; 20 compounds removed.
Rows: Number of vertices (u); Number of edges (k); Mean connectivity; Connectivity standard deviation; Average path length (D); Average path length, random network (D r); Clustering coefficient (C n); Clustering coefficient, random network (C r).
The most promiscuous compounds in the network. The frequency of a compound is the number of times the compound occurs as part of an edge in the network.
The enzymes with the highest connectivities (k) of the network for the whole and the reduced networks. The connectivity of an enzyme is here defined as the number of enzymes which it is connected to by an outgoing edge.
Enzymes listed: Carbamoyl phosphate synthase; S-adenosylmethionine synthetase I; Asparagine synthase B; Methionine adenosyltransferase 2; Fructose 1,6-bisphosphatase II; Glutamate synthase (NADPH).
It has been shown in previous studies that the metabolic network of E. coli is a small world network [24–26]. A small world graph is defined by two properties: its average path length (D) is on the same order as the average path length of a random graph with the same number of vertices (n) and mean connectivity ($\bar{k}$), but its clustering coefficient ($C_n$) far exceeds that of the random graph. $C_n$ is a measure between 0 and 1 of the cliquishness of the graph. The clustering coefficient (C) of each vertex is calculated as the number of edges between the vertex's neighbors (m) divided by the maximum number of edges between the vertex's neighbors, i.e., if u is the number of neighbors and the graph is directed, $C = m/(u(u-1))$. $C_n$ is the average clustering coefficient for the graph.
Our network graph has $C_n = 0.72$, which is large compared to the clustering coefficient of the random graph, $C_r = (\bar{k} - 1)/n = 0.16$ (Table 1). The average path length for the Erdős-Rényi random graph is $D_r \sim \ln(n)/\ln(\bar{k}) = 1.3$, which is smaller than but on the same order as the average path length for the metabolic network (Table 1). Given the large $C_n$ and the small D we can conclude that the metabolic network of E. coli constructed from EcoCyc shows some characteristics of a small world network.
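The quantities above can be illustrated with a minimal sketch. It is not the code used in this study. Two assumptions are made and marked in the comments: the neighbors of a vertex are taken to be all vertices joined to it by an edge in either direction, and a vertex with fewer than two neighbors is assigned C = 0. The function names and the toy graph are hypothetical.

```python
import math

def clustering_coefficient(graph, v):
    """C = m / (u(u - 1)) for vertex v of a directed graph.

    graph: dict vertex -> iterable of successors.
    Assumption: neighbors are predecessors and successors of v.
    """
    successors = set(graph.get(v, ()))
    predecessors = {x for x, ws in graph.items() if v in ws}
    neighbors = (successors | predecessors) - {v}
    u = len(neighbors)
    if u < 2:
        return 0.0  # assumption: undefined cases count as zero
    # m: directed edges running between two neighbors of v
    m = sum(
        1
        for a in neighbors
        for b in set(graph.get(a, ()))
        if b in neighbors and b != a
    )
    return m / (u * (u - 1))

def mean_clustering(graph):
    """Average clustering coefficient over all vertices of the graph."""
    vertices = set(graph) | {w for ws in graph.values() for w in ws}
    return sum(clustering_coefficient(graph, v) for v in vertices) / len(vertices)

def random_graph_estimates(n, k_mean):
    """Random-graph reference values as given in the text:
    clustering (k_mean - 1)/n and path length ln(n)/ln(k_mean)."""
    return (k_mean - 1) / n, math.log(n) / math.log(k_mean)

# Hypothetical toy graph
toy = {"A": ["B", "C"], "B": ["C"], "C": ["B"], "D": ["A"]}
c_a = clustering_coefficient(toy, "A")
c_r, d_r = random_graph_estimates(100, 10)
```

For vertex A in the toy graph the neighbors are B, C and D, two directed edges (B to C and C to B) run between them, and so C = 2/(3·2) = 1/3.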
Correlation between MPL and homology
We used PSI-BLAST with an E-value cut-off of $10^{-6}$ and 3 iterations for an all-against-all sequence comparison of the 1105 protein sequences of the metabolic enzymes in E. coli. We chose the relatively strict E-value cut-off in order to minimize the number of false positives. We also ran PSI-BLAST against the SCOP database to increase the sensitivity and collect further pairs of homologous enzymes. The proteins in the SCOP database are ordered into a hierarchy consisting of class, fold, super-family and family, with an increasing level of structural similarity between the proteins. In our study, enzymes belonging to the same super-family are considered homologous. Using these methods 8,218 homologous enzyme pairs were found.
There are 209 E. coli genes which are each associated with at least two enzymatic functions. Most of these multifunctional enzymes are enzymes with broad substrate specificities. 18 of these genes consist of two separate regions which are clearly associated with two (or more) different enzymatic functions (see Additional file 1 and for instance [31, 32]). We used EcoCyc to identify these genes and Pfam to localize the domains on the genes. The domains were separated from each other and included in the analysis as partial genes.
The numbers of homologous enzyme pairs and enzyme pairs found at MPLs 1–12. The fraction is simply the ratio between the number of homologous enzyme pairs and the enzyme pairs. The reduced network is the network where the 20 most promiscuous compounds have been removed.
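The fraction described in this caption can be sketched as below. This is an illustration only: it assumes that MPLs are stored per ordered enzyme pair (the network is directed) and that homology is stored as a set of unordered pairs. The function name and the toy data are hypothetical.

```python
from collections import Counter

def homolog_fraction_by_mpl(mpl, homologous):
    """mpl: dict (a, b) -> minimal path length from a to b.
    homologous: set of frozenset({a, b}) for homologous enzyme pairs.
    Returns dict MPL -> (homologous pairs) / (all pairs) at that MPL.
    """
    total = Counter()
    homolog = Counter()
    for (a, b), d in mpl.items():
        total[d] += 1
        if frozenset((a, b)) in homologous:
            homolog[d] += 1
    return {d: homolog[d] / total[d] for d in sorted(total)}

# Hypothetical toy data
mpl = {("E1", "E2"): 1, ("E2", "E3"): 1, ("E4", "E1"): 1, ("E1", "E3"): 2}
homologous = {frozenset(("E1", "E2"))}
fractions = homolog_fraction_by_mpl(mpl, homologous)
```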
It could be argued that the over-representation of homologous enzyme pairs at MPL 1 is due to the fact that not all cofactors have been removed by removing the 20 most promiscuous compounds. To investigate this possibility an alternative network was constructed where all the cofactors, as defined by EcoCyc, were removed from the network. In this alternative network the number of homologous enzyme pairs at MPL 2 is just within the boundaries of 3 standard deviations for the randomized networks. The correlation at MPL 1 remained unchanged (data not shown). Hence, the correlation between homology and MPL at MPL 1 is not due to cofactor-binding regions alone.
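The comparison against randomized networks can be expressed as a simple z-score check. The sketch below is an assumption-laden illustration, not the procedure of this study: how the randomized networks are generated is not specified here, so the function simply takes a list of counts obtained from randomizations, and the names `z_score` and `is_over_represented` are hypothetical.

```python
import statistics

def z_score(observed, randomized_counts):
    """Number of sample standard deviations by which `observed`
    deviates from the mean of the randomized counts."""
    mean = statistics.mean(randomized_counts)
    sd = statistics.stdev(randomized_counts)
    if sd == 0:
        raise ValueError("randomized counts have zero variance")
    return (observed - mean) / sd

def is_over_represented(observed, randomized_counts, threshold=3.0):
    """True if the observed count lies more than `threshold` standard
    deviations above the randomized mean."""
    return z_score(observed, randomized_counts) > threshold

# Hypothetical counts of homologous pairs at one MPL
z = z_score(10, [4, 5, 6])
```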
Rison et al. found an over-representation within the extended pathways of E. coli at pathway distances 1, 2 and 3. Alves et al. found that there is clustering of homologous enzyme pairs at MPL 1 and 2 in the metabolic networks of several organisms. We detect a clearly significant over-representation only at MPL 1 in the metabolic network of E. coli. Two possible explanations for the differences between our results are that Alves et al. performed a multi-organism analysis while we analyze only E. coli, and that Rison et al. looked at extended pathways rather than at the whole network.
Analysis of the homologous enzyme pairs
Patchwork model: Homologous enzymes with similar functions probably evolved through patchwork evolution events. Therefore homologous enzymes that evolved through the patchwork evolution model should share the same primary EC number.
Retrograde evolution: Homologous enzymes with dissimilar functions are less likely to have evolved through patchwork evolution events. Therefore homologous enzymes that have different primary EC numbers are candidates for retrograde evolution.
We find that 304 enzymes (26%) have not been EC classified. Reactions that do not have EC numbers have been classified by EcoCyc according to an EcoCyc-specific scheme. We consider enzymes that belong to the same EcoCyc reaction type as functionally similar. There are some reactions that remain unclassified in EcoCyc. Pairs including such enzymes are regarded as 'undetermined' in our analysis.
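The classification rules can be summarized in a short sketch. It is illustrative only and rests on assumptions: each enzyme is reduced to a single functional class (its primary EC number if it has one, otherwise its EcoCyc reaction type), and how a pair mixing an EC class with an EcoCyc-specific type should be treated is not stated in the text, so here such a pair is simply compared by class label. The function names, EC numbers and reaction-type labels are hypothetical.

```python
def functional_class(ec_number=None, reaction_type=None):
    """Primary EC number if available, else the EcoCyc reaction type,
    else None (unclassified)."""
    if ec_number:
        return "EC " + ec_number.split(".")[0]
    return reaction_type or None

def classify_pair(class_a, class_b):
    """Classify a homologous enzyme pair by comparing functional classes."""
    if class_a is None or class_b is None:
        return "undetermined"
    return "similar" if class_a == class_b else "dissimilar"

# Hypothetical examples
same_primary = classify_pair(functional_class("1.1.1.1"), functional_class("1.2.1.3"))
different_primary = classify_pair(functional_class("1.1.1.1"), functional_class("4.2.1.1"))
unclassified = classify_pair(functional_class("1.1.1.1"), functional_class())
```

Under the models described above, "similar" pairs would be read as patchwork evolution cases and "dissimilar" pairs as candidates for retrograde evolution.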
Table 5. The numbers of functionally similar/dissimilar/undetermined homologous enzyme pairs at MPLs 1–11 in the network where the 20 most promiscuous compounds have been removed.
90 (21%) of the homologous enzyme pairs at MPL 1 are functionally dissimilar (Table 5). One instance is component I of anthranilate synthase (trpE), which is homologous to the two isozymes of isochorismate synthase (entC and menF) (Figure 8b). If substrate depletion was the primary selective pressure on the E. coli ancestor of these enzymes, chorismate was probably the substrate being depleted, because it is the only compound that these two reactions have in common. The candidates for retrograde evolution are found primarily among the 297 homologous enzyme pairs with dissimilar functions at MPL 1 and 2. However, it should be noted that some previously identified candidates for retrograde evolution are not included among ours. We classify the enzymes catalyzing the last two consecutive steps of tryptophan biosynthesis (trpA/trpB and trpC) as functionally similar because they share the same primary EC number. In the same manner, we classify as functionally similar the enzymes catalyzing two consecutive steps in methionine biosynthesis (metB and metC), as well as the four homologous consecutive enzymes in peptidoglycan biosynthesis.
The correlation between homology and network distance at MPL 1 is mostly due to the homologous enzyme pairs with similar functions, which have evolved through patchwork evolution, and to homologous enzyme pairs with undetermined function. While there is some statistically significant over-representation of functionally dissimilar homologous enzyme pairs at MPL 1 which cannot easily be explained by the patchwork evolution model, it appears that the evolution of an enzyme that catalyzes a new type of reaction is rare, and that evolution driven by increased enzyme specificity is more common, which is in general agreement with recent studies [15, 16].
We constructed a representation of the whole metabolic network of E. coli from EcoCyc and analyzed the distribution of homologous enzyme pairs over the network. We conclude from our study that homologous pairs of enzymes are more common at minimal path length (MPL) 1 than expected by chance. This correlation persists after the systematic removal of the most promiscuous compounds from the network.
The retrograde evolution model predicts that homologous enzyme pairs will be found at close distances in the metabolic network. Like previous studies, our study seems to lend some support to the retrograde evolution model. To investigate this support further, we analyzed the homologous pairs of enzymes in order to distinguish between cases of patchwork evolution (broad-to-narrow evolution of enzyme substrate specificity) on the one hand and cases of retrograde evolution (evolution of a different reaction mechanism) on the other. We found that the correlation between homology and network distance at MPL 1 is mostly due to homologous enzyme pairs with similar functions, which have evolved through patchwork evolution, and to homologous enzyme pairs with undetermined function. However, there is a statistically significant over-representation of functionally dissimilar homologous enzyme pairs at MPL 1 which cannot easily be explained by the patchwork evolution model. In conclusion, our study indicates that while the retrograde evolution model may have played a small part, the patchwork evolution model is the predominant process of metabolic enzyme evolution.
The record of evolutionary history that is present in modern genome sequences does not give much support for the retrograde evolution model, while Jensen's patchwork evolution model has substantially more support. Horowitz aimed at explaining the emergence of metabolic pathways at the origin of life. It is possible that subsequent mutations and gene rearrangements have obliterated the traces of ancient retrograde evolution. The patchwork evolution cases that we identify could be examples of more recent events in evolutionary history. Further phylogenetic analysis of other genomes may shed light on this issue.
EcoCyc is distributed as several different flat-files. We used the enzrxns.dat file to extract the enzyme identifiers and the reaction directions, the proteins.dat file to extract the proteins that constitute the enzymes, the genes.dat file to find the genes coding for the proteins, the genes.col file to find the Blattner identification number and the compounds.dat file to extract the compounds that are included in the reactions.
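Reading such a flat-file can be sketched as follows. This is a minimal illustration, not the parser used in this study. It assumes the attribute-value layout of the .dat files (one `ATTRIBUTE - value` pair per line, records terminated by a line containing `//`, comment lines starting with `#`, continuation lines starting with `/`). The attribute names and identifiers in the sample are illustrative and may differ from those in the EcoCyc version used here.

```python
def parse_dat(text):
    """Parse attribute-value flat-file text into a list of records.

    Each record is a dict mapping an attribute name to a list of values,
    since an attribute may occur several times within one record.
    """
    records = []
    current = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # blank line or comment
        if line.strip() == "//":          # end of record
            if current:
                records.append(current)
            current = {}
            continue
        if line.startswith("/"):
            continue                      # continuation line, ignored in this sketch
        key, sep, value = line.partition(" - ")
        if sep:
            current.setdefault(key.strip(), []).append(value.strip())
    if current:
        records.append(current)
    return records

# Hypothetical sample in the assumed layout
sample = """# illustrative content only
UNIQUE-ID - ENZRXN-1
ENZYME - PROT-1
REACTION - RXN-1
REACTION-DIRECTION - LEFT-TO-RIGHT
//
UNIQUE-ID - ENZRXN-2
ENZYME - PROT-2
REACTION - RXN-2
//
"""
records = parse_dat(sample)
```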
There were some reactions that were EC-number classified but did not have a link to the corresponding enzyme in the enzrxns.dat file. We could find 46 such cases and used KEGG and BRENDA to correct this problem (see Additional file 2).
Of the 1,172 enzyme identifiers available in EcoCyc, 44 were not connected to the rest of the network. Some other enzymes have not yet been located in the genome, which makes sequence comparison impossible, leaving the final number of enzymes for our study at 1,085. The 1,105 protein sequences of these enzymes were extracted from the Wisconsin-Madison E. coli genome project's flat-file. There were 519 compounds that were involved in the reactions extracted.
This work was supported by the Foundation for Strategic Research (SSF).
1. Horowitz NH: On the evolution of biochemical syntheses. Proc Natl Acad Sci USA 1945, 31: 153–157.
2. Jensen RA: Enzyme recruitment in evolution of new function. Annu Rev Microbiol 1976, 30: 409–425. 10.1146/annurev.mi.30.100176.002205
3. Lazcano A, Miller SL: The origin and early evolution of life: prebiotic chemistry, the pre-RNA world, and time. Cell 1996, 85: 793–798. 10.1016/S0092-8674(00)81263-5
4. Horowitz NH: The evolution of biochemical syntheses – retrospect and prospect. In Evolving genes and proteins (Edited by: Bryson V, Vogel HJ). New York: Academic Press 1965, 15–23.
5. Lawrence JG, Roth JR: Selfish operons: horizontal transfer may drive the evolution of gene clusters. Genetics 1996, 143: 1843–1860.
6. Wilmanns M, Hyde CC, Davies DR, Kirschner K, Jansonius JN: Structural conservation in parallel beta/alpha-barrel enzymes that catalyze three sequential reactions in the pathway of tryptophan biosynthesis. Biochemistry 1991, 30: 9161–9169.
7. Fani R, Lio P, Lazcano A: Molecular evolution of the histidine biosynthetic pathway. J Mol Evol 1995, 41: 760–774.
8. Belfaiza J, Parsot C, Martel A, de la Tour CB, Margarita D, Cohen GN, Saint-Girons I: Evolution in biosynthetic pathways: two enzymes catalyzing consecutive steps in methionine biosynthesis originate from a common ancestor and possess a similar regulatory region. Proc Natl Acad Sci USA 1986, 83: 867–871.
9. Saqi MA, Sternberg MJ: A structural census of metabolic networks for E. coli. J Mol Biol 2001, 313: 1195–1206. 10.1006/jmbi.2001.5107
10. Rison SC, Teichmann SA, Thornton JM: Homology, pathway distance and chromosomal localisation of the small molecule metabolism enzymes in Escherichia coli. J Mol Biol 2002, 318: 911–932. 10.1016/S0022-2836(02)00140-7
11. Alves R, Chaleil RA, Sternberg MJ: Evolution of enzymes in metabolism: a network perspective. J Mol Biol 2002, 320: 751–770. 10.1016/S0022-2836(02)00546-6
12. Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet. Nucleic Acids Res 2002, 30: 42–46. 10.1093/nar/30.1.42
13. Goto S, Okuno Y, Hattori M, Nishioka T, Kanehisa M: LIGAND: database of chemical compounds and reactions in biological pathways. Nucleic Acids Res 2002, 30: 402–404. 10.1093/nar/30.1.402
14. Copley RR, Bork P: Homology among (β/α)8 barrels: implications for the evolution of metabolic pathways. J Mol Biol 2000, 303: 627–641. 10.1006/jmbi.2000.4152
15. Teichmann SA, Rison SC, Thornton JM, Riley M, Gough J, Chothia C: The evolution and structural anatomy of the small molecule metabolic pathways in Escherichia coli. J Mol Biol 2001, 311: 693–708. 10.1006/jmbi.2001.4912
16. Tsoka S, Ouzounis CA: Functional versatility and molecular diversity of the metabolic map of Escherichia coli. Genome Res 2001, 11: 1503–1510. 10.1101/gr.187501
17. Petsko GA, Kenyon GL, Gerlt JA, Ringe D, Kozarich JW: On the origin of enzymatic species. Trends Biochem Sci 1993, 18: 372–376. 10.1016/0968-0004(93)90091-Z
18. Schuster S, Fell DA, Dandekar T: A general definition of metabolic pathways useful for systematic organization and analysis of complex metabolic networks. Nat Biotechnol 2000, 18: 326–332. 10.1038/73786
19. Karp PD, Riley M, Saier M, Paulsen IT, Collado-Vides J, Paley SM, Pellegrini-Toole A, Bonavides C, Gama-Castro S: The EcoCyc database. Nucleic Acids Res 2002, 30: 56–58. 10.1093/nar/30.1.56
20. Overbeek R, Larsen N, Pusch GD, D'Souza M Jr, Selkov E Jr, Kyrpides N, Fonstein M, Maltsev N, Selkov E: WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res 2000, 28: 123–125. 10.1093/nar/28.1.123
21. Schomburg I, Chang A, Schomburg D: BRENDA, enzyme data and metabolic information. Nucleic Acids Res 2002, 30: 47–49. 10.1093/nar/30.1.47
22. Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB): Enzyme Nomenclature. San Diego, California: Academic Press 1992.
23. Gerrard JA, Sparrow AD, Wells JA: Metabolic databases – what next? Trends Biochem Sci 2001, 26: 137–140. 10.1016/S0968-0004(00)01759-X
24. Wagner A, Fell DA: The small world inside large metabolic networks. Proc R Soc Lond B Biol Sci 2001, 268: 1803–1810. 10.1098/rspb.2001.1711
25. Ma H, Zeng AP: Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics 2003, 19: 270–277. 10.1093/bioinformatics/19.2.270
26. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL: The large-scale organization of metabolic networks. Nature 2000, 407: 651–654. 10.1038/35036627
27. Watts DJ, Strogatz SH: Collective dynamics of 'small-world' networks. Nature 1998, 393: 440–442. 10.1038/30918
28. Erdős P, Rényi A: On random graphs. I. Publicationes Mathematicae (Debrecen) 1959, 6: 290–297.
29. Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF: Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 2001, 29: 2994–3005. 10.1093/nar/29.14.2994
30. LoConte L, Ailey B, Hubbard TJ, Brenner SE, Murzin AG, Chothia C: SCOP: a structural classification of proteins database. Nucleic Acids Res 2000, 28: 257–259. 10.1093/nar/28.1.257
31. Calhoun DH, Bonner CA, Gu W, Xie G, Jensen RA: The emerging periplasm-localized subclass of AroQ chorismate mutases, exemplified by those from Salmonella typhimurium and Pseudomonas aeruginosa. Genome Biol 2001, 2(8): RESEARCH0030.
32. Veron M, Falcoz-Kelly F, Cohen GN: The threonine-sensitive homoserine dehydrogenase and aspartokinase activities of Escherichia coli K12. The two catalytic activities are carried by two independent regions of the polypeptide chain. Eur J Biochem 1972, 28: 520–527.
33. Sonnhammer EL, Eddy SR, Durbin R: Pfam: a comprehensive database of protein families based on seed alignments. Proteins 1997, 28: 405–420. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L
34. Blattner FR, Plunkett G 3rd, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, Gregor J, Davis NW, Kirkpatrick HA, Goeden MA, Rose DJ, Mau B, Shao Y: The complete genome sequence of Escherichia coli K-12. Science 1997, 277: 1453–1474. 10.1126/science.277.5331.1453
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.