Reconstruction of phyletic trees by global alignment of multiple metabolic networks
© Ma et al.; licensee BioMed Central Ltd. 2013
Published: 21 January 2013
In the last decade, a considerable amount of research has been devoted to investigating the phylogenetic properties of organisms from a systems-level perspective. Most studies have focused on the classification of organisms based on structural comparison and local alignment of metabolic pathways. In contrast, global alignment of multiple metabolic networks complements sequence-based phylogenetic analyses and provides more comprehensive information.
We explored the phylogenetic relationships between microorganisms through global alignment of multiple metabolic networks. The proposed approach integrates sequence homology data with topological information of metabolic networks. In general, compared to recent studies, the resulting trees reflect the living style of organisms as well as classical taxa. Moreover, for phylogenetically closely related organisms, the classification results are consistent with specific metabolic characteristics, such as the light-harvesting systems, fermentation types, and sources of electrons in photosynthesis.
We demonstrate the usefulness of global alignment of multiple metabolic networks to infer phylogenetic relationships between species. In addition, our exhaustive analysis of microbial metabolic pathways reveals differences in metabolic features between phylogenetically closely related organisms. With the ongoing increase in the number of genomic sequences and metabolic annotations, the proposed approach will help identify phenotypic variations that may not be apparent based solely on sequence-based classification.
One of the major challenges in biology is to reconstruct phyletic relationships between living organisms. Various phylogenetic inference methods have been proposed to address this problem by using genomic data; different phylogenetic trees have been reconstructed based on the similarity of sequences of genes encoding 16S ribosomal RNAs and other marker genes [3–5].
With the increasing availability of whole-genome sequences, proteomic data, and annotated metabolic reactions, more homologous characters between different organisms can be identified to infer phylogenetic trees. In addition to genomic comparisons, a number of recent studies have begun to explore phylogenetic distance between species based on metabolic properties, either alone or in combination with sequence features [6–17]. Conserved metabolic pathways have been used to explicitly derive phylogenetic trees through a variety of approaches. For example, Forst et al. measured distances between organisms by iteratively aligning enzymes based on sequence similarities. Heymans et al. conducted a pairwise comparison of a single common metabolic pathway between organisms to build phylogenetic trees; they created a distance matrix based on topological relationships among enzymes (reaction graph). Clemente et al. hierarchically compared EC (Enzyme Commission) numbers of a common metabolic pathway among multiple organisms to measure pathway similarity. All these studies, however, compared only a single metabolic pathway independently when retrieving metabolic network information.
Subsequently, Clemente et al. extended the EC-based classification method to compare all the common metabolic pathways between multiple species. On the other hand, Oh et al. used a machine learning approach for computing a distance metric using an exponential graph kernel based on nine common pathways. Another way to compare a pair of metabolic pathways between organisms is to use topological properties to define the presence or absence of metabolic pathways among organisms; it is thus a network comparison-based method. Mazurie et al. used descriptors of structure and complexity of metabolic reactions to calculate phylogenetic distances. Borenstein et al. devised a seed approach based on essential metabolites to carry out large-scale reconstruction of phylogenetic trees. Recently, Chang et al. proposed an approach from the perspective of enzyme substrates and corresponding products in which each organism is represented as a vector of substrate-product pairs, and the vectors are then compared to reconstruct a phylogenetic tree. Furthermore, Mano et al. considered the topology of pathways as chains and used the pathway alignment method developed by Pinter et al. to classify species. Although comparison and alignment of metabolic networks have been applied to reconstruct phyletic relationships [9, 10, 12–16], previous studies considered only pairwise structural comparison of conserved metabolic pathways in a local fashion.
Network alignment has become central to systems biology; it can be divided into two types: local and global alignment. Local network alignment is defined as an alignment of small subnetworks from one network with one or more subnetworks in another network. Because such alignments allow one node to have different pairings in different subnetworks, local network alignment may generate ambiguous results. On the other hand, global network alignment can provide a one-to-one mapping for all nodes between networks. That is, the aim is to find multiple independent regions of localized network similarity. Global alignment of multiple networks provides clusters across species that best represent conserved biological functions. Therefore, to investigate phyletic relationships from metabolic networks, we selected IsoRankN, a global multiple-network alignment tool that simultaneously integrates sequence information with topological properties to cluster functionally similar proteins across species.
We used IsoRankN to generate a biologically relevant multipartite mapping between organisms. The clusters of enzymes across the networks in the mapping derived by IsoRankN represent conserved biological reactions and functions. We adapted an entropy measure as the filtering criterion to remove non-consistent enzyme clusters (see Methods). To construct a phyletic tree comprising multiple species, we defined a pairwise distance measure between two organisms. Data for all the metabolic networks and the enzyme sequences used in this study were retrieved from the KEGG database. Additional file 1 lists information for the organisms we tested.
First, we classified 26 organisms at the phylum scale and compared our results with recent studies. Moreover, the approach was applied to phylogenetically closely related organisms to reconstruct phyletic relationships concerning specific metabolic characteristics, such as the light-harvesting systems between Prochlorococcus and Synechococcus groups, fermentation types between Lactobacillus, and sources of electrons used for photosynthesis between green sulfur and green nonsulfur bacteria.
The above result shows that our method can correctly classify organisms into main categories. For the cases shown below, we tested our method with consideration of specific metabolic features.
Based on global alignment of multiple metabolic networks, our approach can classify organisms into main categories that reflect living style and phenotypes. The above cases clearly show that the resulting phyletic trees reflect specific metabolic characteristics among species. Thus, our approach can provide phyletic reconstructions at high resolution and characterize differences in metabolic features between phylogenetically closely related organisms.
We employed IsoRankN to explore functional similarities and differences in multiple metabolic networks. The key idea of IsoRankN is briefly introduced in Additional file 3, and a detailed description has been published elsewhere. IsoRankN is a global multiple-network alignment tool based on spectral clustering methods. Given several metabolic networks, in which the enzymes and metabolites are represented as nodes and the reactions catalyzed by enzymes are represented as edges in each network, the algorithm first computes pairwise functional similarity scores between all the cross-species enzymes. The next step uses the concept of the star alignment approach and personalized spectral clustering. In addition, we used the functional consistency measure to further refine the clusters obtained by IsoRankN.
The entropy of a cluster $S_V$ is defined as

$$H(S_V) = -\sum_{i=1}^{d} p_i \log p_i$$

where $p_i$ is the fraction of $S_V$ with KEGG group ID $i$ and $d$ is the number of distinct KEGG group IDs in the cluster. A cluster with lower entropy implies greater within-cluster consistency with respect to KEGG annotations, and thus we select the clusters with lower entropy to extract a greater amount of information on the phylogenetic relationships between the test organisms.
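The entropy-based filter can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: it assumes plain Shannon entropy with the natural logarithm and no normalization, the helper names `cluster_entropy` and `filter_clusters` are hypothetical, and the group labels in the usage lines are made up. The default cutoff of 0.5 is the threshold stated in the Methods.

```python
import math
from collections import Counter


def cluster_entropy(group_ids):
    """Shannon entropy (natural log) of the KEGG group IDs in one cluster.

    group_ids: list with one KEGG group ID per enzyme in the cluster.
    The log base and the lack of normalization are assumptions.
    """
    total = len(group_ids)
    counts = Counter(group_ids)
    # p * log(1/p) is the same as -p * log(p) and avoids returning -0.0.
    return sum((c / total) * math.log(total / c) for c in counts.values())


def filter_clusters(clusters, max_entropy=0.5):
    """Keep only clusters whose entropy does not exceed max_entropy."""
    return [c for c in clusters if cluster_entropy(c) <= max_entropy]


# Hypothetical clusters: one pure, one mostly pure, one evenly mixed.
pure = ["K1", "K1", "K1"]
mostly_pure = ["K1"] * 9 + ["K2"]
mixed = ["K1", "K2"]
kept = filter_clusters([pure, mostly_pure, mixed])
```

In this toy input the evenly mixed cluster is discarded, because its entropy (log 2) exceeds the cutoff, while the other two are retained.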
A phyletic tree comprising multiple species is reconstructed based on a distance measure defined by the fraction of the identified clusters in which the constituent enzymes appear in the two organisms. The distance between two organisms A and B is defined as follows:

$$D(A, B) = 1 - \frac{|S_{A \cap B}|}{|S_{A \cup B}|}$$

where $|S_{A \cap B}|$ denotes the number of clusters that contain enzymes in both organisms A and B, and $|S_{A \cup B}|$ denotes the number of clusters in which the constituent enzymes are in either organism A or B. Only clusters with low mean entropy are considered. The mean entropy of a cluster measures its functional consistency: lower entropy implies greater within-cluster consistency with respect to KEGG annotations. Thus, to obtain consistency with respect to sequence-based KEGG annotation and topological features, we select the clusters having entropy no larger than 0.5.
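The pairwise distance can be illustrated with a short Python sketch. It assumes that the distance is one minus the ratio of shared clusters to clusters touching either organism (a Jaccard-style distance), that each cluster has already passed the entropy filter, and that a cluster is represented simply as the set of organism codes whose enzymes it contains. The function names and the toy clusters are hypothetical.

```python
def organism_distance(clusters, a, b):
    """Distance between organisms a and b from shared enzyme clusters.

    clusters: iterable of sets of organism codes (one set per cluster).
    Assumes distance = 1 - |clusters with both| / |clusters with either|.
    """
    both = sum(1 for orgs in clusters if a in orgs and b in orgs)
    either = sum(1 for orgs in clusters if a in orgs or b in orgs)
    if either == 0:
        # No cluster touches either organism: treat as maximally distant.
        return 1.0
    return 1.0 - both / either


def distance_matrix(clusters, organisms):
    """Symmetric matrix of pairwise distances, in the order of `organisms`."""
    clusters = list(clusters)
    return [[0.0 if a == b else organism_distance(clusters, a, b)
             for b in organisms] for a in organisms]


# Toy clusters using KEGG organism codes from the text; membership is made up.
toy_clusters = [{"lga", "ljo"}, {"lga", "ljo", "lfe"}, {"lfe"}]
toy_matrix = distance_matrix(toy_clusters, ["lga", "ljo", "lfe"])
```

In the toy data, lga and ljo co-occur in every cluster that contains either of them, so their distance is 0, whereas lga and lfe share only one of three clusters.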
Based on the above process, a distance matrix can be obtained. We then used PHYLIP to build a phyletic tree based on the distance matrix. The visualization tool Dendroscope was used to display the phyletic trees. All experiments were performed on a platform consisting of Intel(R) Xeon(R) CPU E31230 (3.20 GHz, 16 GB memory) machines running Linux.
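As an illustration of the hand-off to PHYLIP, the sketch below writes a distance matrix in the square PHYLIP layout: a first line with the number of taxa, then one row per taxon with the name padded to 10 characters followed by its distances. Which PHYLIP program and options the authors used is not stated here, so this only shows the input format; the function name and the example values are hypothetical.

```python
def format_phylip_matrix(names, matrix):
    """Return a square distance matrix as PHYLIP-formatted text.

    names: taxon labels (truncated or padded to 10 characters).
    matrix: list of rows of floats, in the same order as `names`.
    """
    if len(names) != len(matrix):
        raise ValueError("names and matrix must have the same length")
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)
        lines.append(label + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


# Made-up distances for two organism codes.
phylip_text = format_phylip_matrix(["lga", "ljo"], [[0.0, 0.25], [0.25, 0.0]])
```

The resulting text can be saved to a file and supplied to one of PHYLIP's distance-based tree programs.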
Establishing network alignments is critical in evolutionary and systems biology. Several approaches to multiple network alignment have been developed to infer the global homologous characters between complete networks; these approaches include Græmlin [35, 36], NetworkBLAST-M, IsoRank, IsoRankN, GRAAL, and SubMAP. Græmlin is a machine learning approach implemented by initially using sequence features and then incorporating local network information. However, it is difficult to select training data for reconstructing phyletic relationships between closely related organisms. NetworkBLAST-M is a local network alignment tool and therefore cannot reveal complete topological information. Kuchaiev et al. developed the pairwise sequence-free global network alignment tool, GRAAL, with which they defined a distance metric between two species by using the edge correctness ratio of pairwise metabolic network alignment results and reconstructed phylogenetic trees. Because the tool only considers topological information of metabolic networks, it ignores sequence features that may play important biological roles in phylogeny. The first global network alignment algorithm, IsoRank, uses a spectral graph algorithm to measure an alignment between two networks based on both sequence similarity between nodes and topological similarity of their neighborhoods. Ay et al. extended the idea of the IsoRank algorithm for pairwise network alignment to metabolic networks but did not consider multiple network alignment. Therefore, for our purpose we selected IsoRankN, a global multiple network alignment tool that simultaneously integrates sequence information with topological properties to cluster functionally similar proteins across species. Liao et al. demonstrated that IsoRankN outperformed existing algorithms for global multiple network alignment of protein interaction networks with respect to coverage and consistency.
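To make the idea of combining sequence similarity with neighborhood topology concrete, the following Python sketch implements a simplified version of the pairwise IsoRank recurrence as it is commonly described: the score of a node pair is a weighted sum of the scores of its neighboring pairs and a normalized sequence-similarity term. It is not IsoRankN (no star alignment or spectral clustering step), and the function name, the weight `alpha`, the iteration count, and the toy graphs are all assumptions made for illustration.

```python
def isorank_scores(adj1, adj2, seq_sim, alpha=0.6, iters=50):
    """Simplified pairwise IsoRank-style scores via power iteration.

    adj1, adj2: dict mapping node -> set of neighbors (undirected graphs).
    seq_sim: dict mapping (node1, node2) -> non-negative similarity score.
    alpha: weight of topology versus sequence similarity (assumed value).
    """
    pairs = [(i, j) for i in adj1 for j in adj2]
    total = sum(seq_sim.get(p, 0.0) for p in pairs)
    # Normalized sequence term; uniform when no similarity data is given.
    e = {p: (seq_sim.get(p, 0.0) / total if total else 1.0 / len(pairs))
         for p in pairs}
    r = {p: 1.0 / len(pairs) for p in pairs}
    for _ in range(iters):
        new = {}
        for i, j in pairs:
            # Each neighboring pair spreads its score over its own neighbors.
            topo = sum(r[(u, v)] / (len(adj1[u]) * len(adj2[v]))
                       for u in adj1[i] for v in adj2[j])
            new[(i, j)] = alpha * topo + (1 - alpha) * e[(i, j)]
        norm = sum(new.values())
        r = {p: s / norm for p, s in new.items()}
    return r


# Two toy path graphs, a-b-c and x-y-z, with no sequence information.
g1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
g2 = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
scores = isorank_scores(g1, g2, seq_sim={})
best_pair = max(scores, key=scores.get)
```

With purely topological input, the two central nodes of the toy paths receive the highest score, which reflects how neighborhood structure alone can drive the pairing when sequence evidence is uninformative.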
We then applied the same analysis to phylogenetically closely related organisms in Lactobacillus. For our reconstruction (see Figure 3), we consider three pairs of organisms with high 16S rRNA sequence similarity: Lactobacillus gasseri (lga) versus Lactobacillus johnsonii NCC 533 (ljo), Lactobacillus fermentum IFO 3956 (lfe) versus Lactobacillus reuteri SD2112 (lru), and finally lfe versus lga. In each of the first two pairs, both organisms fall in the same group in our reconstruction, whereas the last pair spans two different groups. As shown in Additional file 5, the pair (lga, ljo) in the homofermentation group shares more enzymes than the pair (lfe, lga) from different groups according to the statistics of the KEGG pathways (Additional file 5a); similarly, (lfe, lru) has more common enzymes than (lfe, lga) (Additional file 5b). That is, Lactobacillus species in the same group in our classification show more functional similarity than species from different groups. More precisely, in the glycolysis/gluconeogenesis pathway, ko00010, (lga, ljo) and (lfe, lru) share more constituent enzymes than (lfe, lga). These results show that our reconstruction can reveal specific metabolic features.
We also analyzed species from Prochlorococcus and Synechococcus, which have different light-harvesting systems. For our reconstruction (see Figure 4), we consider three pairs of organisms: Prochlorococcus marinus SS120 (pma) versus Prochlorococcus marinus MIT 9515 (pmc), Synechococcus sp. WH8102 (syw) versus Synechococcus sp. WH7803 (syx), and finally pma versus syx. In each of the first two pairs, both organisms fall in the same group in our reconstruction, whereas the last pair spans two different groups. However, there is no obvious difference when we compare (pma, pmc) and (syw, syx) with (pma, syx) (Additional file 6a and 6b). In such a case, counting shared enzymes alone cannot explicitly classify species with high sequence similarity according to their particular metabolic features.
In contrast, our classification by using global alignment of multiple metabolic networks can successfully determine phenotypic similarity (Figure 4). Because our approach incorporates topology features of metabolic networks with sequence similarity, it affords a more in-depth analysis of the phyletic reconstruction.
Most studies have focused on the classification of organisms based on structural comparison and local alignment of metabolic pathways. In contrast, global alignment of multiple metabolic networks, which complements sequence-based phylogenetic analyses, may provide more comprehensive information. Therefore, we propose a new approach that uses the global network alignment tool, IsoRankN, to reconstruct phyletic relationships of multiple species. Our phyletic trees lie between conventional genotypic construction and phenotypic reconstruction. We demonstrated that our reconstruction has the capacity to explore more in-depth metabolic features and subtle phenotypic differences, such as light-harvesting systems, fermentation type, and sources of electrons for photosynthesis.
The growing mass of systems-level data allows our approach to find more applications to identify phenotypic variations hidden behind sequence-based classification [1, 40]. In addition to metabolic network information, Suthram et al. showed that phylogenetic relationships may be inferred from protein interaction networks. They identified conserved species-specific complexes in protein interaction networks and built a phylogenetic tree based on the complexes because interactions between proteins may imply conservation of specific groups. Although false-positives exist in protein-protein interaction data, comparative analysis of protein-protein interaction networks of closely related organisms can reveal phenotypic properties. Therefore, global alignment of multiple protein-protein interaction networks may provide a high-resolution look at phyletic reconstruction. It is worthwhile to explore the phenotypic differences between global network alignment of multiple metabolic networks and protein interaction networks. In the future, better quantitative and qualitative analyses of metabolic pathways between organisms would also be of interest.
This work was funded in part by the National Science Council of Taiwan under Grants NSC100-2221-E-007-108-MY3 (to C.-S.L.) and NSC100-2221-E-126-011-MY3 (to C.Y.T.), NIH Grant GM081871 (to B.B.), and MOE Grant 101N2074E1 (to C.-S.L.). The publication costs for this article were funded by the National Science Council of Taiwan.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 2, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S2.
We thank Masanori Arita at the University of Tokyo for helpful comments. We are also grateful to the National Center for High-performance Computing for computer time and facilities. C.-S.L. acknowledges support from the Sayling Wen Cultural and Educational Foundation.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.