Metabolite coupling in genome-scale metabolic networks

Background Biochemically detailed stoichiometric matrices have now been reconstructed for various bacteria, yeast, and for the human cardiac mitochondrion based on genomic and proteomic data. These networks have been manually curated based on legacy data and elementally and charge balanced. Comparative analysis of these well curated networks is now possible. Pairs of metabolites often appear together in several network reactions, linking them topologically. This co-occurrence of pairs of metabolites in metabolic reactions is termed herein "metabolite coupling." These metabolite pairs can be directly computed from the stoichiometric matrix, S. Metabolite coupling is derived from the matrix ŜŜT, whose off-diagonal elements indicate the number of reactions in which any two metabolites participate together, where Ŝ is the binary form of S. Results Metabolite coupling in the studied networks was found to be dominated by a relatively small group of highly interacting pairs of metabolites. As would be expected, metabolites with high individual metabolite connectivity also tended to be those with the highest metabolite coupling, as the most connected metabolites couple more often. For metabolite pairs that are not highly coupled, we show that the number of reactions a pair of metabolites shares across a metabolic network closely approximates a line on a log-log scale. We also show that the preferential coupling of two metabolites with each other is spread across the spectrum of metabolites and is not unique to the most connected metabolites. We provide a measure for determining which metabolite pairs couple more often than would be expected based on their individual connectivity in the network and show that these metabolites often derive their principal biological functions from existing in pairs. Thus, analysis of metabolite coupling provides information beyond that which is found from studying the individual connectivity of individual metabolites. Conclusion The coupling of metabolites is an important topological property of metabolic networks. By computing coupling quantitatively for the first time in genome-scale metabolic networks, we provide insight into the basic structure of these networks.

of one or more metabolites. Subject to the law of mass conservation, the net sum of elements and electrical charge is conserved in each reaction and thus in the network as a whole. Each transformation by definition must involve more than one metabolite. Herein, we define "metabolite coupling" as the appearance of a pair of metabolites in the same biochemical transformation.
Any examination of carefully balanced biochemical transformations immediately displays the ubiquity of the proton (H + ) and water in metabolic reactions. The ubiquity of the proton in metabolic networks is interesting in light of the fact that at standard pH a typical bacterial cell contains approximately 60 free protons. Assuming pH = 7 [1] and the volume of the cell is equal to 1 μm 3 , the number of free protons in solution is equal to (10 -7 mol H + /L)(6.022 × 10 23 molecules/mol)(cell volume L); unit conversions yield the final answer. Thus, through the medium of water, the protons must be rapidly shuttled around the cell as metabolic transfers are taking place. Although the inclusion of protons is not complete in most metabolic pathway databases, protons have been included where appropriate in the networks analyzed in the present work. While the inclusion of protons in genome scale metabolic reconstructions is not frequent, it does allow for the computation of important effects of H + balancing on network functions. For example, the change in pH of the growth media for cultures of Escherichia coli was predicted by a balanced model [2]. Careful elemental and charge balancing of the human mitochondrial metabolic network also allowed for the computational determination of ATP yield [3].
With the exception of the proton and water, the most commonly coupled metabolites are those generally referred to as cofactors (i.e. ATP/ADP, NAD + /NADH). Cofactors fulfil a range of roles in cells, from transferring energy and redox potential to carrying key metabolic intermediates. In some cases, different common cofactors can serve the same purpose -ATP and GTP can both deliver energy; NADH and NADPH can both supply reducing power -but one is often preferred for a given purpose in an organism. For example, NADH is more commonly used during the oxidative reactions that comprise catabolism whereas NADPH is preferred for the reductive reactions that are involved in anabolism [4]. The exact use of cofactors in metabolic reactions is typically not fully defined in genome annotations, and thus metabolic pathways derived from a sequence-based annotation may not correctly specify the cofactor used in a particular reaction in a network [5].
Topological analyses of the cellular metabolism of microorganisms have generated significant interest recently [6]. Metabolic networks have been shown to exhibit scale-free behavior, where the number of reactions in which a given metabolite participates follows a power-law distribution, meaning that the probability of a metabolite having k connections to other metabolites is approximately equal to k -γ , where γ is constant for a given network [7,8]. The reaction fluxes in the central metabolism of E. coli have been shown to behave in a similar manner; the probability that a reaction has a given flux v is proportional to (v+v 0 ) -α , with v 0 and α constant [9]. The overall hierarchical and modular organization of metabolic networks has been demonstrated [10]. Protein domain networks also have been suggested to have a scale-free nature [11]. The methodology and principal results of the topological analysis of biological networks has been recently reviewed [6]. More recently, it has been asserted that a power law is a natural consequence of a high variability process, much as the central limit theorem dictates a Gaussian distribution for lower variability processes [12].
Metabolic networks of differing levels of detail and accuracy can be encoded using a variety of formalisms, including (hyper)graphs [7], petri nets [13], and stoichiometric matrices [14,15]. The level of detail of a metabolic network is primarily determined by the method of reconstruction, independent of the specific encoding used. For example, networks generated in a semi-automated fashion principally from metabolic pathway databases enable high-throughput reconstructions for many organisms but sacrifice accuracy with regard to fine details such as cofactor preference (NADH vs. NADPH) and proton balancing. In contrast, the metabolic networks considered herein were manually reconstructed from diverse sources, including metabolic databases, legacy biochemical data, and physiological data [16]. They are curated to the point where they can be used for computation, and the subsequent comparison of these computations to experimental data provides further evidence as to the quality of the reconstructions [17,18]. These networks were initially published using the stoichiometric matrix formalism. Furthermore, all of the metabolic networks analyzed herein were scrutinized for cofactor preference on a reaction-byreaction basis as well as elementally and charge balanced before the networks were initially published, meaning that each reaction was examined to determine if the addition of protons was necessary to enforce the conservation of elements and charge at cellular pH. A study of metabolite coupling requires both this balancing and the most accurate assignment of cofactors to reactions possible for each organism.
In this paper, we present a topological analysis of metabolite coupling in genome-scale metabolic networks. We find that most pairs of metabolites never participate in a reaction together, smaller numbers of pairs occur together in a few reactions, and a select few pairs, such as cofactors, occur in a significant fraction of the reactions in the network, in many cases more often than would be expected based on random connectivity. This work is distinct from the previous analyses because we consider the consequences of pairs of metabolites occurring together on a genome scale, rather than individual metabolites. The motivation for studying metabolite pairs is that many biomolecules are of more interest as pairs than as single molecules. For example, ATP is most important because of its ability to couple with ADP, P i and H + in reactions that transfer the phosphate moiety as a form of energy currency for a cell. Studies of metabolite coupling can be used to highlight such chemically-based network properties as they appear throughout the entire network for multiple organisms. Because we focus on coupling, having essential detail in cofactor usage is essential. This necessitates that we use only the hand-curated metabolic networks that are available for a select few microorganisms and the human cardiac mitochondrion.

Basic network properties
We characterized six metabolic networks based on topological features, with representatives from each of the three primary domains of life. The metabolic networks considered represent three bacteria (E. coli [2], Helicobacter pylori [19], Staphylococcus aureus [20]), one member of the archea (Methanosarcina barkeri [21]), one unicellular eukaryote (S. cerevisiae [22]), and one human cellular organelle (cardiac mitochondrion [3]). These are, to the best of our knowledge, the only genome-scale metabolic networks that are manually curated as well as elementally and charge balanced available to date.
We computed the symmetric metabolite coupling matrix M for each network by multiplying the binary form of the stoichiometric matrix S by its transpose (see Methods for details). Each diagonal element of M (m ii ) represents the number of reactions in which a particular metabolite appears and each off-diagonal element (m ij , i ≠ j) represents the number of reactions in which two particular metabolites appear together. Figure 1 shows a graphical representation of the first 15 rows and columns of M for all six networks. Both the box size and the histogram height are logarithmically scaled and normalized by the number of unique metabolite pairs that occur in each network. This scaling and normalization allows for the visualization of pairs of metabolites that participate in anywhere from zero to several hundred reactions together in any given network. The color of each box and bar represents one particular network. Any matrix constructed by the multiplication of another matrix with its transpose is symmetric, and thus the points above the diagonal of M are identical to those below; thus, each histogram below the diagonal corresponds to one set of colored boxes above the diagonal. It is notable that, of the 15 most connected metabolites, only three pairs never couple in the six networks analyzed here: ADP/NADH, ADP/NADPH, and NH 4 /CoA. The first two metabolite pairs that never couple are perhaps surprising, because it is well known that cells use electron carriers to phosphorylate ADP. However, because this process occurs in a series of reactions instead of a single reaction, these metabolites are not considered to be coupled. The computation of seconddegree coupling (where two metabolites are coupled if they couple with at least one common third metabolite) would find this relationship, but would also introduce significant difficulties because many metabolite pairs would be second-degree coupled solely because they are individually coupled with the proton or water.
Basic properties of each network and its corresponding M are shown in table 1. The metabolic network of E. coli has 621 compounds and thus its M is a square symmetric matrix with 621 2 elements. Each compound in the network must participate in at least one reaction, so each diagonal element of M must be greater than zero. However, there is no general topological reason that an arbitrary pair of metabolites must participate in any reaction together, so zeroes are feasible for off-diagonal elements. In fact, the density of M for E. coli is 2.2%, meaning that approximately 98% of the (((621 2 )/2)-621) unique offdiagonal elements are zero. Figure 2 is a binary representation of M for E. coli, constructed by keeping all zeros in place (white spaces) and replacing all non-zeros with 1 (black points), that clearly demonstrates this sparsity. Smaller networks tend to have a denser M.

Metabolite coupling
For each of the six networks, we determined the number of reactions in which each possible pair of metabolites is coupled. We defined two metabolites as coupled in a given reaction if they both occur in that reaction, in contrast to the recently presented concept of metabolite concentration coupling analysis which studies metabolites whose concentrations are linked by network stoichiometry [23]. The distribution of metabolite coupling is shown for E. coli (figure 3) and S. cerevisiae (figure 4). Most pairs of metabolites are never coupled and cannot be shown on a log-log plot. Many pairs of metabolites occur together in only one reaction and are illustrated by the leftmost data point of each figure. The pairs of coupled metabolites that occur in many reactions together are the rightmost points in each plot and represent pairs of metabolites that are either small and ubiquitous (e.g. H + , PP i ) or traditionally referred to as cofactors (e.g. NAD + , NADH). In the six networks considered, the two most common metabolite pairs are always H + /H 2 O and ATP/ADP. The pair ATP/ ADP is ranked first in H. pylori and S. aureus but second in the remaining four networks. The identity and order of coupled metabolite pairs diverges after these two pairs, as illustrated by the plots for the two model microorganisms, E. coli and S. cerevisiae.
The number of reactions in which each possible combination of two metabolites participates approximates a line on a log-log scale in all networks studied, with average slopes between -2 and -3 ( figure 5). The slope of the bestfit line (linear on a log-log scale) was determined for each network from the left-most 10 points and is indicated in the figure legend, along with the corresponding r 2 value. While the dominance of certain cofactor pairs in metabolic networks was known, the strong fit to a line for the less connected pairs demonstrates significant regularity in the use of pairs of metabolites in metabolic networks. The slope of this line provides a quantitative measure to describe the organization of metabolite coupling in the network and can be used to compare coupling properties of different networks of various sizes. Because the coupling relationships fit a line so well, it is possible in principle to determine the relative dominance of certain pairs of metabolites across networks. A smaller (negative) slope generally implies that there are either fewer metabolite pairs that only appear once or twice or that there are more pairs of metabolites that participate in an intermediate number of reactions (for example, 5 to 10). A smaller slope suggests that relatively more metabolite pairs influence more than one reaction and link reactions together.
The coupling of highly connected metabolites   6). The first reconstructed genome-scale network for Haemophilus influenzae metabolism [24] suggested that this property had a power-law distribution and fit a line on a log-log scale. Similar calculations made for many networks [7] using a somewhat different network formalism described earlier yielded similar results. The fit to a line in figure 6 begins after the number of occurrences is two or greater, since a metabolite must participate in at least two reactions to be involved in reactions that can be active at steady state. This result is different from the graph-based approaches that plot in-degree and out-degree separately [7], essentially converting all metabolites that participate in two reactions into two occurrences of one reaction each.
Furthermore, the slope of each line for single metabolite connectivities (figure 6) is different than the slope of the corresponding line for metabolite coupling ( figure 5).
When the ratio of the slope for metabolite coupling to the slope for single metabolite participation is computed, the results range from 0.95 (mitochondrion) to 1.54 (S.aureus) with a mean of 1.11 and standard deviation of 0.10. Furthermore, when the slopes are rank ordered from greatest to least, it is observed that the networks are ordered differently when considering single metabolites than when considering metabolite coupling. Thus, the widely differing slope ratios and differential ordering indicates that the distribution of metabolite coupling (offdiagonal elements in M) is not simply a direct result of the distribution of metabolite connectivity (diagonal elements in M).

Relation between metabolite coupling and metabolite connectivity
Although the slopes of best-fit lines vary considerably for single metabolites and metabolite pairs, there is an important relationship between metabolite coupling and individual metabolite connectivity because a particular metabolite cannot be highly coupled with other metabolites if it does not appear in many reactions. Thus, each off-diagonal element of M (m ij ) is subject to both of the following criteria: The metabolite coupling and connectivity is shown graphically for E. coli in figure   counting the diagonal). The matrix is symmetric, so there are also 479 black points in column 1. Figure 7 further illustrates that the most connected metabolites couple with considerably more different metabolites than lessconnected metabolites. The sum of each row of figure 2, added to the sum of all previous rows and normalized gives a data point in figure 7. For all networks, the first point, representing hydrogen, is between 0.05 and 0.06 in figure 7, signifying that hydrogen accounts for between 5% and 6% of all coupling interactions. The second point, representing the percentage of coupling interactions involving either hydrogen or water, is close to 0.10 (10%) in all networks. The number of metabolites that couple with of the top 15 most connected metabolites is listed in table 2, the second column of which is the numerical result of summing across a row in figure 2. It is not surprising that metabolites like the proton and ATP, which each occur in many reactions, couple with many different metabolites. In figure 7, the nearly constant slope in the middle region of the curve (between approximately metabolite 100 and 500) for all networks except the mitochondrion is interesting, however. While it takes fewer than the most-connected 100 metabolites to account for 50% of the coupling, it takes between 300 and 500 metabolites to account for 90% of the coupling in any given network, excluding the mitochondrion. Furthermore, each of the 300 to 500 metabolites in that region between 50% and 90% participates roughly equally in the cumulative coupling interactions; that is, each of these metabolites couples with a roughly equal number of other metabolites. This is likely due to the simple fact that most of these metabolites each participate in two reactions. The slight decrease in slope for the final 50 to 100 metabolites suggests that the least connected metabolites each account for The distribution of metabolite coupling in E. coli Figure 3 The distribution of metabolite coupling in E. coli. The number of metabolite pairs that occur together in each possible number of reactions is plotted on a log-log scale. There are many metabolite pairs that never occur together, and there are many particular numbers of reactions in which no metabolite pairs occur, but these are not shown because zero does not appear on a log-log plot. The points and most coupled metabolite pairs are colored to improve clarity and readability, but they are all in the E. coli network. The top four metabolite pairs, located at the right of the figure, all occur in over 100 reactions together. slightly less coupling, which is unsurprising since most of these metabolites occur in only one reaction, and are dead-ends in their respective networks.
The mitochondrion is a special case that does not conform well to the previous generalizations. An examination of the mitochondrial metabolic network shows that, relative to its size, the mitochondrion contains a disproportionately high number of metabolites that are highly coupled in other networks. Whereas the 15 most connected metabolites represent only 2.4% of the 621 metabolites present in E. coli, they account for over 10% of the mitochondrion's 145 metabolites. We present the results for the mitochondrion in the interest of completeness, but caution that it may be difficult to extract meaning from a direct comparison to other, more comprehensive, metabolic networks. While normalization of the results to some measure of network size would bring the data in figure 7 into closer agreement, it would not change the simple fact that the mitochondrial network is fundamentally more limited in scope and over-represents highly connected metabolites relative to the genome-scale matrices. This high connectivity and coupling within the mitochondrion suggests that the individual metabolic reactions therein are more interrelated and may affect each other to a greater extent, on average, than those within a genomescale bacterial metabolic network.

Preferentially coupled metabolite pairs
There are some pairs of metabolites that occur in reactions together at a higher rate than would be expected given their individual connectivities. In order to locate these pairs, we used a Monte Carlo approach to uniformly sample the possible values of each off-diagonal element of the metabolite coupling matrix given that the diagonal (the frequency of a given metabolite's participation in the network) remained unchanged. Based on the two diagonal elements corresponding to each off-diagonal element (m ii and m jj correspond to m ij ), each off-diagonal element is assigned a random, feasible value. Complete details are in The distribution of metabolite coupling in S. cerevisiae Figure 4 The distribution of metabolite coupling in S. cerevisiae. Identical to figure 3, but using the S. cerevisiae network. The general trends are similar but the metabolite identities and precise distribution differ. the methods section. After doing this many times for each pair of metabolites in each network, it is possible to determine which metabolite pairs in each network are coupled more than expected for a random network with the same individual connectivity. The pairs that never couple as often in 10,000 randomizations as they do in the real network are listed for E. coli in table 3. The spread of preferentially coupled metabolite pairs (those that couple more often in the real network than in 99% of the randomizations) is graphically represented for E. coli in figure 8. The very existence of preferentially coupled metabolites demonstrates the non-random organization of metabolite coupling. This non-random coupling can be a function of fundamental chemical principles or biological needs, such as the general need for both ATP and ADP to transfer a phosphate moiety.
Furthermore, a number of less connected metabolites preferentially couple, demonstrating that this effect is not limited to the highly-connected metabolites. Although many of the preferentially coupled pairs are clustered toward the left side of figure 8 and correspond to the most connected metabolites, the remainder are spread through-out much of the figure, and as a result, the network as a whole. Thus, participating in many reactions is not necessary to detect non-random preferential coupling. This suggests that metabolites with low connectivity overall in the network are still tightly connected to certain other metabolites, often through shared chemical structural properties. Examination of the list presented in table 3 supports this assertion. The proton couples preferentially with several metabolites that gain or lose a proton during the course of balanced biochemical transformations. The adenine and phosphate containing metabolites preferentially couple in a number of cases. Sugars preferentially couple with other sugars, and fatty acids preferentially couple with other fatty acids.

Preferentially uncoupled metabolite pairs
The same Monte Carlo randomization used to locate overly coupled pairs of metabolites was also used to find metabolites that occur together less often than expected. This procedure located relatively high connectivity metabolites (those that individually participate in many reactions) that do not couple as often as they would in a network with random coupling; results are shown for E.    table 4. For example, this procedure indicates that ATP and NADH couple less than expected based on the number of reactions in which they participate together in every genome-scale network. This preferential uncoupling between ATP and NADH allows the network more flexibility in controlling energy and redox needs. Although these two metabolites never occur together in any reaction in the mitochondrion, having even 95% confidence that this is a non-random result would require that they participate in fewer than zero reactions together and thus this observation was not statistically significant. This demonstrates a limitation of the Monte Carlo approach to finding under-coupled metabolite pairs -it cannot locate with high confidence metabolites that are relatively scarcely connected individually.

Conclusion
This study presents the first calculations of metabolite coupling for curated, genome-scale metabolic networks. As expected, metabolite coupling is dominated by a few highly connected pairs (H + /H 2 O, ATP/ADP, etc.). However, it was also shown that the coupling of metabolites with lower individual connectivity demonstrated a striking amount of regularity, with the number of reactions shared by pairs of metabolites in a given metabolic network closely approximating a line on a log-log plot. We also probed the network to discover which metabolite pairings occurred much more or less than would be expected from their individual connectivities alone. We found that the highly connected metabolites contributed a disproportionate percentage of the enriched coupling interactions in that highly connected metabolites are more connected to each other than would be expected from their individual connectivities alone. These preferentially coupled pairs of metabolites highlight chemical relationships between molecules. In this study, it is interesting that all of the top 15 most coupled metabolites have at least one other metabolite with which they preferentially couple in E. coli, highlighting the importance of metabolite coupling to the emergence of highly-connected compounds in metabolic networks.

Basics
We study the distribution of metabolite coupling in the chemically-detailed and highly curated metabolic networks of E. coli [2], S. cerevisiae [22], Helicobacter pylori [19], Staphylococcus aureus [20], M. barkeri [21], and the The distribution of metabolite connectivity in all networks       human cardiac mitochondrion [3]. We derived the stoichiometric matrix S from the publicly-available reaction lists for each organism. Each row of S represents a metabolite, each column represents a reaction, and each element is a stoichiometric coefficient. If a compound existed in more than one cellular compartment (for example, the lysosome), the row(s) representing a given compound in any compartment(s) other than the cytosol were added to the cytosolic row and then eliminated. This pre-processing of S assured that the coupling of two metabolites was considered identically regardless of the sub-cellular localization of the metabolites.

Calculating the metabolite coupling matrix
The binary form of S, termed Ŝ, was formed for each network by replacing every non-zero element in S with unity. The symmetric "metabolite coupling matrix" M was then computed for each network by Eq. 1.
The density of each M is defined as the number of nonzero elements divided by the total number of elements in the matrix. In order to visualize the entire range of data with meaningful size and greyscale differences differences, particularly between zero and small numbers, each element of M was transformed for figure 1 such that new value = log 2 (old value + 1)/(number of unique metabolite pairs in network). Figure 2 uses the binary form of M, created by replacing all non-zero elements with 1, and was constructed in Matlab (The Mathworks, Inc.) using the imagesc function. All numerical calculations were done with the same software. (Figures 3, 4, 5, 6) For each network, the number of metabolite pairs (y-axis) that participate together in a given number of reactions (xaxis) was plotted against that number of reactions. The leftmost 10 points were used to calculate a best-fit line on a log-log scale for the less connected metabolites (those that participate in 1 to 10 reactions together). The same procedure was done for individual metabolites, except that points 2 to 11 were used to compute the best-fit line because relatively few metabolites participate in only one reaction. The best-fit lines were calculated in Excel (Microsoft).

Metabolite coupling and connectivity
The number of unique coupling interactions for each metabolite was computed by first sorting the rows and columns of each M separately, according to the ordering of the diagonal elements, such that the most connected metabolites occupy the first rows and columns. The number of non-zero elements in each row of the sorted M was plotted against the row number in a cumulative fashion and normalized by dividing by the total number of non-zero elements in the matrix. Preferentially coupled metabolite pairs in E. coli Figure 8 Preferentially coupled metabolite pairs in E. coli. The metabolite pairs in E. coli that are greater than 99% likely to be preferentially coupled in a non-random fashion are summed by row/column and binned. The ordering of the rows in this figure is identical to figure 3, proceeding from most to least connected metabolites from left to right. Each preferential coupling is counted twice, once for each metabolite involved. The most connected metabolites occupy the bins on the left, and clearly account for most of the preferential coupling detected.