Network analysis of metabolic enzyme evolution in Escherichia coli

Background The two most common models for the evolution of metabolism are the patchwork evolution model, where enzymes are thought to diverge from broad to narrow substrate specificity, and the retrograde evolution model, according to which enzymes evolve in response to substrate depletion. Analysis of the distribution of homologous enzyme pairs in the metabolic network can shed light on the respective importance of the two models. We here investigate the evolution of the metabolism in E. coli viewed as a single network using EcoCyc. Results Sequence comparison between all enzyme pairs was performed and the minimal path length (MPL) between all enzyme pairs was determined. We find a strong over-representation of homologous enzymes at MPL 1. We show that the functionally similar and functionally undetermined enzyme pairs are responsible for most of the over-representation of homologous enzyme pairs at MPL 1. Conclusions The retrograde evolution model predicts that homologous enzymes pairs are at short metabolic distances from each other. In general agreement with previous studies we find that homologous enzymes occur close to each other in the network more often than expected by chance, which lends some support to the retrograde evolution model. However, we show that the homologous enzyme pairs which may have evolved through retrograde evolution, namely the pairs that are functionally dissimilar, show a weaker over-representation at MPL 1 than the functionally similar enzyme pairs. Our study indicates that, while the retrograde evolution model may have played a small part, the patchwork evolution model is the predominant process of metabolic enzyme evolution.


Background
In 1945 one of the first theories regarding the evolution of metabolic pathways, often referred to as the retrograde evolution model, was proposed by Horowitz [1]. It states that during evolution pathways assembled backward compared to the direction of the pathway in response to depletion of substrates from the environment. As an example consider the following scenario: Enzyme E1 catalyzes reaction A → B, where B is essential to the organ-ism. A is depleted from the environment, which means that an organism harboring an enzyme E2 that can catalyze a reaction producing A from some other substrate in the environment would be at an advantage. Since E1 can already bind A there is a greater chance that El rather than an enzyme without an affinity for A will be duplicated and mutated into E2.
In 1976 Jensen [2] proposed the recruitment evolution theory, more often referred to as the patchwork evolution model [3]. The patchwork evolution model states that enzymes initially have broad substrate specificities and that specialization takes place by way of gene duplication. As an example consider the following: An enzyme E1 catalyzes a reaction where either one of the substrates S1 and S2 is accepted. Through gene duplication and random mutation two different versions of E1 evolve; E1', which only accepts S1 as a substrate, and E1" which only accepts S2 as a substrate. The retrograde evolution model was at first supported by the discovery of operons, where the functionally related genes in operons were thought to have evolved through tandem duplications [4]. The theory that operons are the remnants of tandem duplication has recently been criticized by Lawrence and Roth [5], who instead proposed that horizontal gene transfer may be the underlying mechanism for the occurrence of gene clusters. There are a few known homologous genes coding for enzymes which catalyze consecutive reactions and which therefore represent possible cases of retrograde evolution: trpC, trpA and trpB (in E. coli trpA and trpB are fused) that catalyze consecutive steps of the tryptophan biosynthesis [6], hisF and hisA in the histidine biosynthetic pathway [7] and metB and metC in the methionine biosynthesis [8].
A few recent studies give some support for the retrograde evolution model. Saqi & Sternberg [9] showed that a super-family has a general tendency to appear in one or two particular pathway(s). Rison et al [10] showed that homologous enzymes are found at close distances within the (extended) pathways of E. coli and Alves et al [11] showed that homologous enzymes are also found close to each other in the whole metabolic network using a modified version of the KEGG database [12,13]. The patchwork evolution model holds that there should be many pairs of homologous enzymes that catalyze basically the same kind of reaction, where one or more substrates are non-identical but similar. Support for this theory is more abundant than for the retrograde evolution model [14][15][16][17]. The TIM-barrel containing enzymes have been found in many different pathways [14] and the homologous pairs of small molecule metabolism enzymes of E. coli have been shown to be evenly distributed within and across pathways [15,16].
Metabolic networks are often partitioned into pathways, which are considered to be functionally separate units of the network. The partitioning of the metabolic network into pathways is not always straightforward [18]. As a result there may be correlations that are not visible in a pathway oriented perspective which will emerge in a whole-network oriented view. There is also an element of arbitrariness involved in which compounds are considered promiscuous (compounds involved in many reactions, e. g. H 2 O, ATP and cofactors) and which are not. We have chosen to apply a simple network-based criterion. We count the number of reactions a compound participates in within the complete metabolic network of E. coli. The most common compounds are then considered promiscuous and are excluded from part of the analysis.
In the study presented here we investigate whether homologous E. coli enzymes can be found close to each other in the complete, unpartitioned metabolic network of E. coli as derived from the EcoCyc database [19]. We subsequently investigate the homologous enzyme pairs found close to each other in the network and classify these enzyme pairs as cases of retrograde evolution and patchwork evolution respectively. Finally we investigate whether the correlation between metabolic network distance and homology differs for the enzyme pairs classified as retrograde evolution cases compared to the patchwork evolution cases.

Databases
Metabolic pathway information is available in several different databases such as EcoCyc [19], WIT [20], BRENDA [21] and KEGG [12]. EcoCyc is an E. coli-specific database and contains the metabolic complement of E. coli. We chose EcoCyc over the other available databases for three reasons: 1) EcoCyc contains information about the directionality of the enzymatic reactions, 2) Some enzymes have not yet been classified according to the Enzyme Commission (EC) system [22] (EcoCyc contains both ECclassified enzymes and enzymes that fall outside of the EC classification; There are 1172 enzyme entries in the database and 781 of these have EC numbers.) and 3) EcoCyc is freely available for universities and non-profit research institutes.

Representation framework
The vertices of the graph we use represent the enzymes catalyzing the reactions. One edge represents one or more compounds. There will be an edge leading from an enzyme E1 to an enzyme E2 if E1 catalyzes a reaction where compound A is produced and E2 takes A as a substrate. Reversible reactions, such as A B, are separated into two reactions, A → B and B → A. There can be at most one edge in each direction between a pair of enzymes. Note that the representation used herein is different from the common representation of metabolic pathways where the substrates and products are the vertices and the enzymes catalyzing the reactions are the edges (Figure 1). The type of network representation used in our study has been used before for metabolic network analysis where it has been referred to as 'protein-centric' graphs [23] and 'reaction graphs' [24] (Figure 2). Some reactions are physiologically irreversible and should be represented accordingly. To encompass the directional-ity information we use a directed graph to represent the metabolic network of E. coli. As a result there are not necessarily paths from every enzyme vertex in the network to every other enzyme vertex. Figure 1 Common representation of the biotin metabolism. In the most common representation of biochemical reactions the reactants represent the vertices of the network and the enzymes represent the edges. The drawing of the biotin metabolism was redrawn from EcoCyc [19]. Coenzyme A CO2 L−alanine One of the most problematic aspects with metabolic network analysis is how to handle the promiscuous compounds. One may argue that because promiscuous compounds are usually not the limiting factors of reactions, the network would become more biochemically meaningful if these compounds are removed [25]. However, to our knowledge there is no generally accepted criterion to determine whether a compound participating in a reaction is a current compound, cofactor or main metabolite. In this study, we have chosen to formulate and apply a simple network-based criterion. We count the number of times a compound occurs as part of an edge in the network. The most common compounds are then considered as promiscuous compounds [11]. As in the study performed by Wagner and Fell [24] we here conduct our study with one network that includes all compounds, including the promiscuous compounds, and another network where the most promiscuous compounds have been removed.

Common representation of the biotin metabolism
Protein-centric representation of the biotin metabolism Figure 2 Protein-centric representation of the biotin metabolism. The protein-centric or reaction graph representation which was used in our study has enzymes as vertices and reactants as edges. EC

Determination of network parameters
The directed graph derived from EcoCyc contains 1,172 vertices and 224,972 edges ( Table 1). The network consists of two components; one contains 2 vertices and a single edge while the other component consists of the remaining vertices. Except for the two enzymes (two chloride ion transporters) the whole metabolic network of E. coli is therefore connected as one would expect from a unicellular, autonomous and non-parasitic organism.
We examined which compounds are the most promiscuous in the metabolic network graph by determining the number of times the compounds occur as part of edges connecting two vertices in the network. The most promiscuous compound is H 2 O, which appears as part of an edge between 79,485 enzyme pairs ( Table 2). The compound frequency plot shows some scale-free network characteristics in that there are a few compounds that occur very often and most compounds occur as parts of edges 1, 2 or 3 times ( Figure 3). We also determine which enzymes are the most highly connected. The most highly connected enzyme has 735 enzyme neighbors (Carbamoyl phosphate synthase) ( Table 3).
It has been shown in previous studies that the metabolic network of E. coli is a small world network [24][25][26]. The definition of a small world graph is that its average path length (D) is on the same order as the average path length for a random graph, with the same number of vertices (n) and mean connectivity ( ), but its clustering coefficient (C n ) exceeds that of the random graph by far [27]. C n is a measure between 0 and 1 of the cliquishness of the graph [27]. Each vertex' clustering coefficient (C) is calculated by taking the number of edges between the vertex' neighbors (m) divided by the maximum number of edges between the vertex' neighbors, i.e if u is the number of neighbors and the graph is directed C = m/(u(u -1)). C n is the average clustering coefficient for the graph.
Our network graph has C n = 0.72 which is large compared to the clustering coefficient of the random graph C r = ( -1)/n = 0.16 ( Table 1). The average path length for the Erdös-Rényi random graph [28] is D r ~ ln(n)/ln( ) = 1.3, which is smaller than but on the same order as the average path length for the metabolic network (Table 1). Given the large C n and the small D we can conclude that the metabolic network of E. coli constructed from EcoCyc shows some characteristics of a small world network.

Correlation between MPL and homology
We used PSI-BLAST [29] with an E-value cut-off of 10 -6 and 3 iterations for an all-against-all sequence comparison of the 1105 protein sequences coding for the metabolic enzymes in E. coli. We chose the relatively strict Evalue cut-off in order to minimize the number of false positives. We also ran PSI-BLAST against the SCOP database [30] to increase the sensitivity and collect further pairs of homologous enzymes. The proteins in the SCOP database are ordered into a hierarchy consisting of class, fold, super-family and family, with an increasing level of structural similarity between the proteins. In our study, enzymes belonging to the same super-family are considered homologous. Using these methods 8,218 homologous enzyme pairs were found.
There are 72 cases where two homologous genes code for the same enzyme. The reason may for instance be that the genes code for different subunits of the enzyme, that there are multiple copies of one gene or that the genes code for two isozymes. In addition there are 169 pairs of enzymes which are clearly isozymes because they are homologous, identified as separate enzymes in EcoCyc and catalyze the same reaction. It is not clear what relationship these 241 pairs should have in our graph representation of the metabolic network. They may be included at minimal path  Figure 4) 1 since they catalyze identical reactions, or they may be included as a special category at MPL 0. For our analysis however, since we are interested in the relationship between clearly different enzymes, we chose not to include these probably fairly recent gene duplications.
There are 209 E. coli genes which are each associated with at least two enzymatic functions. Most of these multifunc-tional enzymes are enzymes with broad substrate specificities. 18 of these genes consist of two separate regions which are clearly associated with two (or more) different enzymatic functions (see Additional file 1 and for instance [31,32]). We used EcoCyc to identify these genes and Pfam [33] to localize the domains on the genes. The domains were separated from each other and included in the analysis as partial genes.
Compound frequency distribution histogram for the metabolic network of E. coli Figure 3 Compound frequency distribution histogram for the metabolic network of E. coli. The 20 most promiscuous compounds have been removed in the network analyzed here. The histogram shows how many compounds (y-axis, logged scale) that occur with a certain frequency (x-axis) in the network. Compound frequency is defined as the number of times a compound occurs as part of an edge.
There are 644,656 pairs of enzymes in the whole network (Table 4). Of these, only 7,989 (1.2%) are homologous enzyme pairs. There are 121,295 enzyme pairs at MPL 1. 4,662 (3.8%) of these are homologous enzyme pairs. There appears to be a 3-fold over-representation of homologous enzyme pairs at MPL 1 in the whole network. In order to determine whether the observed overrepresentation of homologous enzyme pairs at MPL 1 is statistically significant randomized networks were constructed. There are many alternative ways to construct randomized networks. We have chosen an approach that preserves the topological properties of the network since it has been shown that the metabolic network of many organisms have some of the properties of scale-free networks [26]. We constructed many randomized networks by starting from the original real network and shuffling the enzyme identities between randomly chosen pairs of vertices in the graph which preserves the network topology ( Figure 5).
The number of homologous enzyme pairs plotted against MPL for the whole metabolic network of E. coli and the standard deviations for the 1,000 randomized networks are shown in Figure 6. In the real network there are 4,662 pairs of homologous genes at MPL 1, which is 58% of all homologous pairs found (Table 4). In contrast typically 1,454 homologous enzyme pairs (18%) are found at MPL 1 in the randomized networks. Compared to the randomized networks there is again a 3-fold over-representation of homologous enzyme pairs at MPL 1. If the homologies were randomly distributed in the network we would expect to see the line representing the real network within the boundaries of the standard deviations for the randomized networks in Figure 6. Instead, we find that there is a significantly greater likelihood of enzymes at distance 1 from each other to be homologous in the metabolic network of E. coli than in the randomized network.
The network includes promiscuous compounds such as H 2 O, ATP and NAD. There are therefore homologous enzyme pairs containing for instance NAD-binding  domains that are at MPL 1 because their gene products catalyze reactions involving that cofactor. It could be argued that such coenzyme binding domains give rise to skewed results in our analysis. To remedy this complication we removed 20 compounds, starting from the most promiscuous compound (H 2 O) down to the 20th most promiscuous compound (O 2 ). We find that the correlation between MPL and homology is preserved (Figure 7), indicating that the abundance of homologous enzyme pairs at MPL 1 is not the result of common cofactor-binding domains alone. We could also detect a marginally significant correlation between MPL and homology at MPL 2 when the 20 most promiscuous compounds had been excluded from the network (Figure 7). No correlation could be detected at MPLs greater than 2. These results were robust for variations in the number of compounds that were removed from the network, i.e. removing between 17 and 23 of the most promiscuous compounds generated the same result (data not shown).
It could be argued that the over-representation of homologous enzyme pairs at MPL 1 is due to the fact that not all cofactors have been removed by removing the 20 most promiscuous compounds. To investigate this possibility an alternative network was constructed where all the cofactors, as defined by EcoCyc, were removed from the network. In this alternative network the number of homologous enzyme pairs at MPL 2 is just within the boundaries of 3 standard deviations for the randomized networks. The correlation at MPL 1 remained unchanged (data not shown). Hence, the correlation between homology and MPL at MPL 1 is not due to cofactor-binding regions alone.
Rison et al [10] found an over-representation within the extended pathways of E. coli at pathway distances 1, 2 and 3. Alves et al found that there is clustering of homologous enzyme pairs at MPL 1 and 2 in the metabolic networks of several organisms. We detect a clearly significant over-representation only at MPL 1 in the metabolic network of E. coli. Two possible explanations for the differences between our results are that Alves et al performed a multiorganism analysis while we analyze only E. coli and that Rison et al looked at extended pathways rather than at the whole network.

Analysis of the homologous enzyme pairs
We wished to investigate to which extent the retrograde evolution model and the patchwork evolution model respectively are responsible for the over-representation of homologous enzyme pairs seen at MPL 1. In order to be able to analyze the homologous enzyme pairs at MPL 1 further we use a rough criterion for discriminating between cases that fit the retrograde evolution model or the patchwork evolution model: 1) Patchwork model: Homologous enzymes with similar functions probably evolved through patchwork evolution events. Therefore homologous enzymes that evolved through the patchwork evolution model should share the same primary EC number.
2) Retrograde evolution: Homologous enzymes with dissimilar functions are less likely to have evolved through patchwork evolution events. Therefore homologous enzymes that have different primary EC numbers are candidates for retrograde evolution. Preservation of network topology   (Table 5). One instance is 1-phosphofructokinase and 6-phosphofructokinase, which catalyze similar reactions with the difference that 1-phosphofructokinase catalyzes a reaction involving fructose-1phosphate while 6-phosphofructokinase has fructose-6phosphate as a substrate (Figure 8a). These are clearly analogous reactions whose homology can readily be explained by the patchwork evolution model. 90 (21%) of the homologous enzyme pairs at MPL 1 are functionally dissimilar (Table 5). One instance is component I of anthranilate synthase (trpE) which is homologous to the two isozymes of isochorismate synthase (entC and menF) (Figure 8b). If substrate depletion was the pri-mary selective pressure for the E. coli ancestor of these enzymes, chorismate was probably the substrate being depleted because it is the only compound that these two reactions have in common. It is primarily among the 297 homologous enzyme pairs with dissimilar functions at MPL 1 and 2 that the candidates for retrograde theory enzymes can be found. However, it should be noted that some of the candidates for retrograde evolution that have been identified before are not included among our retrograde evolution candidates: We classify the enzymes, coding for the last two consecutive steps in the tryptophan biosynthesis (trpA/trpB and trpC) as functionally conserved because these two enzymes have the same primary EC numbers (4. The functionally similar, dissimilar and undetermined homologous enzyme pairs were plotted against MPL and normalized against 1,000 randomized networks by the same method as described above. From this procedure we Homology vs minimal path length (MPL) for the whole meta-bolic network of E. coli could determine that most of the correlation between homology and MPL at MPL 1 can be attributed to enzymes with similar functions (Figure 9) and enzymes with undetermined function (data not shown). However, there is still an over-representation of functionally dissimilar homologous enzyme pairs at MPL 1 ( Figure 10) indicating that there is a statistically significant proportion of the homologous enzyme pairs whose homology cannot be attributed to chemical similarity. We also note an overrepresentation at MPL 10 for the functionally similar homologous pairs (Figure 9). The over-representation consists of 9 pairs of homologous inner membrane MFS transporters which show 15-20% sequence identity.   The correlation between homology and network distance at MPL 1 is mostly due to the homologous enzyme pairs with similar functions which have evolved through patchwork evolution and to homologous enzyme pairs with undetermined function. While there is some statistically significant over-representation of functionally dissimilar homologous enzyme pairs at MPL 1 which cannot easily be explained by the patchwork evolution model, it appears that the evolution of an enzyme that catalyzes a new type of reaction is rare and increased enzyme specificity driven evolution is more common which is in general agreement with recent studies [16,15].

Conclusions
We constructed a representation of the whole metabolic network of E. coli from EcoCyc and analyzed the distribution of homologous enzyme pairs over the network. We conclude from our study that homologous pairs of enzymes are more common at minimal path length (MPL) 1 than expected by chance. This correlation persists after the systematic removal of the most promiscuous compounds from the network.
The retrograde evolution model predicts that homologous enzyme pairs will be found at close distances in the metabolic network. Like previous studies our study seems to lend some support to the retrograde evolution model. To investigate the support for the retrograde evolution model further we analyzed the homologous pairs of enzymes in order to distinguish between on the one hand cases of patchwork evolution (broad-to-narrow evolution of enzyme substrate specificity) and on the other hand cases of retrograde evolution (evolution of a different reaction mechanisms). We found that the correlation between homology and network distance at MPL 1 is mostly due to homologous enzyme pairs with similar functions which have evolved through patchwork evolution and to homologous enzyme pairs with undetermined function. However, there is a statistically significant overrepresentation of functionally dissimilar homologous enzyme pairs at MPL 1 which cannot easily be explained by the patchwork evolution model. In conclusion, our study indicates that while the retrograde evolution model may have played a small part, the patchwork evolution model is the predominant process of metabolic enzyme evolution.
The record of evolutionary history that is present in modern genome sequences does not give much support for the retrograde evolution model [1] while Jensen's patchwork evolution model [2] has substantially more support. Horowitz aimed at explaining the emergence of metabolic pathways at the origin of life. It is possible that the subsequent mutations and gene rearrangements have obliterated the traces of ancient retrograde evolution. The patchwork evolution cases that we identify could be the examples of more recent events in evolutionary history. Further phylogenetic analysis of other genomes may shed light on this issue.

Databases
EcoCyc is arranged into several different flat-files. We used the enzrxns.dat file to extract the enzyme identifiers and the reaction directions, the proteins.dat file to extract the proteins coding for the enzymes, the genes.dat file to find the genes coding for the proteins, the genes.col file to find the Blattner identification number and the compounds.dat file to extract the compounds that are included in the reactions.
There were some reactions that were EC-number classified but did not have a link to the corresponding enzyme in the enzrxns.dat file. We could find 46 such cases and used KEGG [12] and BRENDA [21] to correct this problem (see Additional file 2).
Of the 1,172 enzyme identifiers available in EcoCyc, 44 were not connected to the rest of the network. Some other enzymes have not yet been located in the genome, which makes sequence comparison impossible, leaving the final number of enzymes for our study at 1,085. The 1,105 protein sequences coding for these enzymes were extracted from the Wisconsin-Madison E. coli genome project's flatfile [34]. There were 519 compounds that were involved in the reactions extracted.

Programs
A set of programs were constructed for determination of the minimal path lengths (MPLs) for all enzyme pairs in the network graph. For each E. coli enzyme the neighboring enzymes were determined (Figure 11a). The main program is a breadth-first search implementation which determines all possible MPLs between all pairs of enzymes (Figure 11b). The general idea for the algorithm is the same as described in Jeong et al [26]. It should be noted that because the graph representing the metabolic network of E. coli is directed there may not be paths between all enzyme pairs in one graph component. All the scripts used in this study were written in Python.

Authors' contributions
SL performed the analysis as a graduate student under the supervision of PK. Figure 11 Main algorithm. a) Determination of neighbors for each enzyme. b) Determination of minimal path length (MPL). The algorithm is a bread-first search for the shortest distance between all pairs of enzymes. MAX_MPL is the maximum MPL under investigation.