Reconstructing phylogeny from metabolic substrate-product relationships
© Chang et al; licensee BioMed Central Ltd. 2011
Published: 15 February 2011
Many approaches utilize metabolic pathway information to reconstruct the phyletic tree of fully sequenced organisms, but how metabolic networks can add information to original genomic annotations has remained open.
We translated enzyme reactions assigned in 1075 organisms into substrate-product relationships to represent the metabolic information at a finer resolution than enzymes and compounds. Each organism was represented as a vector of substrate-product relationships and the phyletic tree was reconstructed by a simple hierarchical method. Obtained results were compared with several other approaches that use genome information and network properties.
Phyletic trees without consideration of network properties can already extract organisms in anomalous environments. This efficient method can add insights to traditional genome-based phylogenetic reconstruction.
Structural relationships among metabolites can highlight parasitic or symbiotic species such as spirochaete and chlamydia. The method assists understanding of species-environment interaction when used in combination with traditional phylogenetic methods.
Understanding the phyletic relationship among living organisms has been a fundamental challenge ever since the concept of evolution emerged. Traditionally, molecular biologists constructed phylogenetic trees based on the sequence similarity of small subunit ribosomal RNA or other single genes. As whole-genome sequencing technologies advance, vast amounts of sequence data are becoming available for download and analysis. The comparative analysis of whole genomes can provide more information for reconstructing the phylogeny than individual genes do. Consequently, numerous methods have been proposed to reconstruct phylogenetic trees from whole-genome features such as oligonucleotide compositions, genome fragment occurrence, and absence/presence of metabolic features.
In parallel with genomic comparisons, many studies have focused on the similarity of metabolic processes. The metabolic profile of a living organism is strongly related to its environment, and metabolism is adapted to balance compounds taken up from its surroundings [5, 6]. Thus, metabolic considerations can add insights into species-environment interactions such as symbiosis or convergent adaptation to extreme environments. There are at least three approaches to analyzing the phyletic relationship in metabolic capability. The first is machine learning. Oh et al. used a distance computed by the exponential graph kernel, i.e., the weighted sum of similarities between adjacency matrices of 1-step neighbors, 2-step neighbors, and so on, for 81 organisms. The second is network comparison. Zhang et al. defined existence/absence of metabolic pathways and computed a network similarity measure for 47 organisms. The last is EC-based classification. Clemente et al. used sets of EC numbers to define pathway similarity and compared the metabolism of 8 bacteria.
Metabolic data are well standardized in previous approaches because all of these works depended on the bulk-downloadable KEGG database. Less attention, however, has been paid to the strategy for transforming enzymatic reactions into graphs (or networks). Depending on the strategy, the resulting networks differ drastically, enough to change fundamental network centralities. For example, Borenstein et al. converted each enzymatic reaction to a fully connected bipartite graph between substrates and products to enhance connectivity and defined ‘seed’ compounds for each organism as the union of essential metabolites in all environments. This transformation is known to overestimate the ability to synthesize/degrade metabolites. On the other hand, using EC numbers for pathway analysis tends to underestimate the metabolic network because the numbers are assigned to biochemical transformations, and not to the enzymes themselves. We here propose a more suitable data representation, and elucidate the phylogenies across the three domains of life. Its effectiveness is shown in comparison with previous approaches.
Enzyme annotation for organisms
Enzyme annotations and corresponding EC reactions for 1075 organisms (895 bacteria, 67 archaea, and 113 eukaryotes) were obtained from the KEGG database through its application program interface. The number of EC reactions was 3116, covering as many as 154 pathway maps. Metabolic annotations in each species were represented as a set of substrate-product relationships by transforming all assigned EC reactions into a set of metabolite pairs (see the next section). Most EC-numbered entries correspond to multiple enzymatic reactions. For example, alcohol dehydrogenase (EC 1.1.1.1) can act on a multitude of compounds with a hydroxyl group. For such generic EC-numbered functions we manually integrated possible reactions to ensure the coverage of the biochemical transformations shown in the metabolic maps.
Strategy for graph transformation
An enzymatic reaction usually has multiple inputs (substrates) and outputs (products). Although standard metabolic pathway charts are depicted as hypergraphs, substrate-product relationships must be specified for each reaction to transform it into a graph. A standard way is to use a fully connected bipartite graph [7, 8, 12]. The network connectivity then portrays the ‘reaction membership’; frequently occurring metabolites become hub nodes in the resulting graph. This representation, however, does not capture biochemical transformation between compounds because any two metabolites can be falsely linked through metabolic hubs regardless of their structures.
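The bypassing effect of the fully connected bipartite expansion can be illustrated with a minimal sketch. The two toy reactions below are simplified kinase steps chosen for illustration only, and the function names `bipartite_edges` and `degrees` are hypothetical; this is not the code used in the cited studies.

```python
from collections import Counter
from itertools import product


def bipartite_edges(reactions):
    """Link every substrate to every product of each reaction (undirected)."""
    edges = set()
    for substrates, products in reactions:
        for s, p in product(substrates, products):
            edges.add(frozenset((s, p)))
    return edges


def degrees(edges):
    """Count how many distinct neighbours each metabolite has."""
    deg = Counter()
    for edge in edges:
        for node in edge:
            deg[node] += 1
    return deg


# Toy input (assumption): two simplified kinase reactions.
reactions = [
    (("D-Glucose", "ATP"), ("D-Glucose 6-phosphate", "ADP")),
    (("D-Fructose", "ATP"), ("D-Fructose 6-phosphate", "ADP")),
]

edges = bipartite_edges(reactions)
deg = degrees(edges)

# D-Glucose is linked to ADP although the two share no structural moiety,
# and D-Glucose reaches D-Fructose in two steps through the ADP node.
false_link = frozenset(("D-Glucose", "ADP")) in edges
```

In this toy graph ATP and ADP each acquire three neighbours while the sugars have two, which shows how currency metabolites turn into hubs simply through reaction membership.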
To avoid this bypassing effect, we employ the substrate-product decomposition of reactions. In this scheme, each reaction is decomposed into a set of substrate-product pairs that are structurally related at the atomic scale. The data are also available from the RPAIR database, and the same method has been used in several recent works [15–17]. This representation avoids bias originating from currency metabolites. In other words, the method focuses on the variation of structural transformations, not the occurrence of each metabolite. The decomposition results of EC-numbered reactions are accessible at our wiki-based site: http://metabolomics.jp/wiki/Enzyme:[EC-number]. For example, the details of hexokinase can be accessed at http://metabolomics.jp/wiki/Enzyme:2.7.1.1. In the transformation, we replaced generic names such as alcohol or amino acids with concrete compound names. For hexokinase, as many as 15 reactions are included depending on hexose types. Through this decomposition, a set of enzymatic reactions becomes a set of substrate-product pairs. We did not consider the multiplicity of each pair in our analysis.
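As a minimal sketch of this step, the following code collects the substrate-product pairs of an organism from a lookup table of decomposed reactions. The table `DECOMPOSITION`, its reaction identifiers, and the pairs listed in it are hypothetical placeholders, not entries from our wiki or from RPAIR; treating a pair as an ordered (substrate, product) tuple is also an assumption made here for simplicity.

```python
# Hypothetical decomposition table: reaction id -> structurally related pairs.
# Real decompositions come from atom-level curation, which is not reproduced here.
DECOMPOSITION = {
    "R_glucokinase": [
        ("D-Glucose", "D-Glucose 6-phosphate"),
        ("ATP", "ADP"),
    ],
    "R_fructokinase": [
        ("D-Fructose", "D-Fructose 6-phosphate"),
        ("ATP", "ADP"),
    ],
}


def pair_set(reaction_ids, decomposition=DECOMPOSITION):
    """Return the set of substrate-product pairs for the given reactions.

    Using a set discards multiplicity: a pair produced by several
    reactions is recorded only once.
    """
    pairs = set()
    for rid in reaction_ids:
        pairs.update(decomposition[rid])
    return pairs


organism_pairs = pair_set(["R_glucokinase", "R_fructokinase"])
```

The pair ("ATP", "ADP") arises from both reactions but appears once in `organism_pairs`, so a frequently used cofactor contributes no more than any other transformation.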
Phyletic trees were created by a hierarchical clustering method (pairwise complete linkage algorithm) using the Cluster 3.0 software program. Each organism was represented as a vector of substrate-product pairs, where the absence/presence of each relationship was denoted as 0 or 1. The Dendroscope software program was used to visualize and compare phyletic trees. The simple algorithm employed may be controversial for phyletic reconstruction, and will be discussed later.
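The procedure can be sketched in plain Python as follows. This is an illustrative reimplementation, not the Cluster 3.0 code; the Hamming count used as the distance is an assumption (the distance setting is not stated above), and the three organisms and their pairs are invented toy data.

```python
def to_vectors(organism_pairs):
    """Encode each organism as a 0/1 vector over all observed pairs."""
    features = sorted(set().union(*organism_pairs.values()))
    vectors = {
        org: [1 if f in pairs else 0 for f in features]
        for org, pairs in organism_pairs.items()
    }
    return features, vectors


def hamming(u, v):
    """Number of positions where two binary vectors differ (assumed distance)."""
    return sum(a != b for a, b in zip(u, v))


def complete_linkage(vectors, dist=hamming):
    """Agglomerative clustering; cluster distance is the maximum member distance.

    Returns a nested tuple (left, right, merge_distance) with names at leaves.
    """
    clusters = [(name, [name]) for name in sorted(vectors)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(dist(vectors[x], vectors[y])
                        for x in clusters[i][1] for y in clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = ((clusters[i][0], clusters[j][0], d),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Toy data (assumption): three organisms described by four pairs.
P1 = ("D-Glucose", "D-Glucose 6-phosphate")
P2 = ("ATP", "ADP")
P3 = ("L-Glutamate", "L-Glutamine")
P4 = ("Pyruvate", "Acetyl-CoA")
toy = {"orgA": {P1, P2, P3}, "orgB": {P1, P2}, "orgC": {P3, P4}}

features, vectors = to_vectors(toy)
tree = complete_linkage(vectors)
```

Here orgA and orgB merge first at distance 1, and orgC joins at distance 4, the larger of its distances to the two members, which is what distinguishes complete linkage from single or average linkage.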
We compared results of our data representation with several recent, well known studies.
Phyletic trees for multi-domains of life based on substrate-product relationships
Phyletic trees with or without network connectivity
Comparison with EC number-based classification
We previously argued that metabolic hubs are better identified in the substrate-product graph than in other graph representations, because the approach does not count the frequency of metabolite names in reactions but the number of structural transformations . The number of transformations roughly reflects the structural variation of catalytic sites of respective enzymes, and therefore reflects the diversity of metabolic capabilities.
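A minimal sketch of this hub definition counts, for each metabolite, the number of distinct partners it has in the substrate-product graph. The function names and the five toy pairs are illustrative assumptions and do not reproduce the data behind our hub lists.

```python
from collections import defaultdict


def transformation_counts(pairs):
    """Map each metabolite to its number of distinct structural partners."""
    partners = defaultdict(set)
    for a, b in pairs:
        partners[a].add(b)
        partners[b].add(a)
    return {metabolite: len(p) for metabolite, p in partners.items()}


def top_hubs(pairs, k=3):
    """Return the k metabolites with the most transformations (ties by name)."""
    counts = transformation_counts(pairs)
    return sorted(counts, key=lambda m: (-counts[m], m))[:k]


# Toy pairs (assumption), listed once each regardless of how many
# reactions would produce them.
pairs = [
    ("L-Glutamate", "L-Glutamine"),
    ("L-Glutamate", "2-Oxoglutarate"),
    ("L-Glutamate", "4-Aminobutanoate"),
    ("Pyruvate", "L-Alanine"),
    ("Pyruvate", "Acetyl-CoA"),
]

counts = transformation_counts(pairs)
hubs = top_hubs(pairs, k=2)
```

Because partners are stored in sets, a metabolite that takes part in many reactions but always undergoes the same transformation still counts only once.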
Most differently transforming metabolites in the three domains. The full list is available at http://sarst.life.nthu.edu.tw/metabolic/SD.csv.
List of hubs in the descending order of appearance
Archaea (67 spp)
ATP, L-Glutamate, CO2, Acetyl-CoA, L-Glutamine, Pyruvate, NH3, L-Aspartate, AMP, 5P-alpha-D-ribose 1P, S-Adenosyl-L-methionine
Bacteria (895 spp)
CO2, ATP, NH3, Pyruvate, L-Glutamate, Acetyl-CoA, 5P-alpha-D-ribose 1P, CoA, Malonyl-ACP, L-Aspartate, Glutathione, AMP, S-Adenosyl-L-methionine, Glycine
Eukaryotes (113 spp)
L-Glutamate, CO2, Acetyl-CoA, NH3, CoA, ATP, AMP, Glutathione, Pyruvate, Malonyl-ACP, S-Adenosyl-L-methionine, Glycine, D-Galactose, UDP-glucuronate, L-Serine, L-Glutamine
Metabolic differences between bacteria, archaea, and eukaryotes
Our reconstruction using substrate-product relationships efficiently extracted metabolically interesting species in comparison with the standard phylogenetic approach. Previous approaches which used metabolic information could also produce informative results [7–9, 12], but the achievements were similar to those found by genetic comparisons [2–4]. This is understandable because in their approach metabolic reactions correspond roughly one-to-one to enzymes or genes.
Why can substrate-product relationships add insights?
Our approach is more robust to pathway gaps (incomplete annotation) and to currency metabolites because it evaluates each biochemical transformation with an equal weight. It is also robust to biases from the number of genes or their multiplicity. Standard phylogenetic methods can elucidate evolutionary relationships, whereas our approach can locate species whose metabolism is anomalous or interesting when the two results are compared. Therefore, the method is useful in combination with (not as a replacement for) existing phyletic/phylogenetic clustering.
Our method is also computationally lightweight and scalable, requiring $O(N^2 V)$ time for computing pairwise similarity, where N is the number of organisms and V is the maximum number of reactions in one organism. In contrast, the exponential graph kernel, for example, requires $O(NV^3 + N^2V^2)$ time to compute the similarity. Our computational complexity is equivalent to that of the recently presented pathway alignment method, but that method exploits the graph topology and its result is expected to be similar to the one by Zhang et al. Lastly, the ‘seed’ approach uses a heuristic to find metabolic seeds, but an accurate identification of metabolic seeds is NP-complete. Our method is therefore far more scalable than the other metabolic approaches.
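The pairwise cost can be seen in a short sketch: every one of the N(N-1)/2 organism pairs requires a single pass over at most two feature sets. The function name `pairwise_distances`, the symmetric-difference count used as the dissimilarity, and the toy feature labels are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_distances(organism_pairs):
    """Dissimilarity for every pair of organisms.

    The outer loop visits N(N-1)/2 organism pairs; each symmetric
    difference of two hash sets takes time linear in their sizes on
    average, so the total work grows as N^2 * V.
    """
    names = sorted(organism_pairs)
    dist = {}
    for a, b in combinations(names, 2):
        dist[(a, b)] = len(organism_pairs[a] ^ organism_pairs[b])
    return dist


# Toy feature sets (assumption): labels stand in for substrate-product pairs.
toy = {
    "orgA": {"p1", "p2", "p3"},
    "orgB": {"p1", "p2"},
    "orgC": {"p3", "p4"},
}

dist = pairwise_distances(toy)
```

No graph traversal or matrix exponentiation is involved, which is the source of the gap in scalability relative to kernel-based or topology-based comparisons.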
Algorithms to find phylogeny
Our method uses a simplistic complete linkage clustering algorithm to reconstruct the phylogeny. This may sound inappropriate but is grounded on our data representation. Since the substrate-product relationship disregards the occurrence of metabolites, a frequently appearing reaction type (e.g. ATP-kinase) and a rare reaction type (e.g. sterol synthase) are given the same weights. For this reason, standard parsimony or evolutionary distance does not properly reflect the distance between species in our scheme. Since we wanted to focus on metabolic differences, the complete linkage method was employed. However, other algorithms should be systematically tested and evaluated for their appropriateness, which is left as our future work.
Sharing metabolic knowledge through wiki
We publicize the substrate-product relationships on a wiki-based site so that readers can check every detail of our analysis. This is especially important in the era of high-throughput data management because more and more research results tend to become irreproducible due to the insufficiency of publicized data or incomplete description of methods. To overcome this difficulty, the traceability and transparency of data and their analysis is important in the evaluation of research.
Phylogeny was reconstructed by using structural relationships between annotated metabolites. This method is robust to pathway gaps or gene copy numbers, and can extract metabolically anomalous species by comparing the result with other phyletic or phylogenetic reconstructions. Through several comparisons, our method could highlight metabolic anomalies in chlamydia and spirochaete, both of which are well-known parasitic species. The metabolic comparison thus assists understanding of species-environment interaction in combination with other gene-oriented strategies.
CWC conducted research in Japan for 6 months under the Elite Scholarship Program, Ministry of Education, Taiwan. The authors thank Prof. Kenta Nakai (University of Tokyo) and Dr. Kazuhiro Takemoto (University of Tokyo) for helpful comments on our draft.
This work is supported by Grant-in-Aid for Scientific Research on Priority Areas “Systems Genomics” from Ministry of Education, Culture, Sports, Science and Technology, Japan.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.
1. Woese CR, Kandler O, Wheelis ML: Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A 1990, 87: 4576–4579. 10.1073/pnas.87.12.4576
2. Ma HW, Zeng AP: Phylogenetic comparison of metabolic capacities of organisms at genome level. Mol Phylogenet Evol 2004, 31: 204–213. 10.1016/j.ympev.2003.08.011
3. Lin J, Gerstein M: Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res 2000, 10: 808–818. 10.1101/gr.10.6.808
4. Aguilar D, Aviles FX, Querol E, Sternberg MJ: Analysis of phenetic trees based on metabolic capabilites across the three domains of life. J Mol Biol 2004, 340: 491–512. 10.1016/j.jmb.2004.04.059
5. Almaas E, Kovacs B, Vicsek T, Oltvai ZN, Barabasi AL: Global organization of metabolic fluxes in the bacterium Escherichia coli. Nature 2004, 427: 839–843. 10.1038/nature02289
6. Varma A, Palsson BO: Stoichiometric flux balance models quantitatively predict growth and metabolic byproduct secretion in wild-type Escherichia coli W3110. Appl Environ Microbiol 1994, 60: 3724–3731.
7. Oh SJ, Joung JG, Chang JH, Zhang BT: Construction of phylogenetic trees by kernel-based comparative analysis of metabolic networks. BMC Bioinformatics 2006, 7: 284. 10.1186/1471-2105-7-284
8. Zhang Y, Li S, Skogerbo G, Zhang Z, Zhu X, Sun S, Lu H, Shi B, Chen R: Phylophenetic properties of metabolic pathway topologies as revealed by global analysis. BMC Bioinformatics 2006, 7: 252. 10.1186/1471-2105-7-252
9. Clemente JC, Satou K, Valiente G: Phylogenetic reconstruction from non-genomic data. Bioinformatics 2007, 23: e110–115. 10.1093/bioinformatics/btl307
10. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 1999, 27: 29–34. 10.1093/nar/27.1.29
11. Arita M: The metabolic world of Escherichia coli is not small. Proc Natl Acad Sci U S A 2004, 101: 1543–1547. 10.1073/pnas.0306458101
12. Borenstein E, Kupiec M, Feldman MW, Ruppin E: Large-scale reconstruction and phylogenetic analysis of metabolic environments. Proc Natl Acad Sci U S A 2008, 105: 14482–14487. 10.1073/pnas.0806162105
13. Arita M: In silico atomic tracing by substrate-product relationships in Escherichia coli intermediary metabolism. Genome Res 2003, 13: 2455–2466. 10.1101/gr.1212003
14. Kotera M, Hattori M, Oh M-A, Yamamoto R, Komeno T, Yabuzaki J, Tonomura K, Goto S, Kanehisa M: RPAIR: a reactant-pair database representing chemical changes in enzymatic reactions. Genome Informatics 2004, 15: P062. (poster abstract)
15. Faust K, Croes D, van Helden J: Metabolic pathfinding using RPAIR annotation. J Mol Biol 2009, 388: 390–414. 10.1016/j.jmb.2009.03.006
16. Pitkänen E, Jouhten P, Rousu J: Inferring branching pathways in genome-scale metabolic networks. BMC Syst Biol 2009, 3: 103.
17. Tohsato Y, Nishimura Y: Reaction Similarities Focusing Substructure Changes of Chemical Compounds and Metabolic Pathway Alignments. Inform Media Technol 2009, 4: 390–399.
18. de Hoon MJ, Imoto S, Nolan J, Miyano S: Open source clustering software. Bioinformatics 2004, 20: 1453–1454. 10.1093/bioinformatics/bth078
19. Huson DH, Richter DC, Rausch C, Dezulian T, Franz M, Rupp R: Dendroscope: An interactive viewer for large phylogenetic trees. BMC Bioinformatics 2007, 8: 460. 10.1186/1471-2105-8-460
20. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward automatic reconstruction of a highly resolved tree of life. Science 2006, 311: 1283–1287. 10.1126/science.1123061
21. Gamulin V, Muller IM, Muller WEG: Sponge proteins are more similar to those of Homo sapiens than to Caenorhabditis elegans. Biol J Linnean Soc 2000, 71: 821–828. 10.1111/j.1095-8312.2000.tb01293.x
22. Caetano-Anollés G, Kim HS, Mittenthal JE: The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture. Proc Natl Acad Sci U S A 2007, 104: 9358–9963.
23. Mano A, Tuller T, Béjà O, Pinter RY: Comparative classification of species and the study of pathway evolution based on the alignment of metabolic pathways. BMC Bioinformatics 2010, 11(Suppl 1):S38. 10.1186/1471-2105-11-S1-S38
24. Pitkänen E, Rantanen A, Rousu J, Ukkonen E: Finding Feasible Pathways in Metabolic Networks. Lecture Notes in Comput Sci 2005, 3746: 123–133.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.