Construction of phylogenetic trees by kernel-based comparative analysis of metabolic networks
 S June Oh^{1},
 Je-Gun Joung^{2, 3},
 Jeong-Ho Chang^{4} and
 Byoung-Tak Zhang^{2, 3, 4}
DOI: 10.1186/1471-2105-7-284
© Oh et al; licensee BioMed Central Ltd. 2006
Received: 05 October 2005
Accepted: 06 June 2006
Published: 06 June 2006
Abstract
Background
To infer the tree of life requires knowledge of the common characteristics of each species descended from a common ancestor as the measuring criteria and a method to calculate the distance between the resulting values of each measure. Conventional phylogenetic analysis based on genomic sequences provides information about the genetic relationships between different organisms. In contrast, comparative analysis of metabolic pathways in different organisms can yield insights into their functional relationships under different physiological conditions. However, evaluating the similarities or differences between metabolic networks is a computationally challenging problem, and systematic methods of doing this are desirable. Here we introduce a graph-kernel method for computing the similarity between metabolic networks in polynomial time, and use it to profile metabolic pathways and to construct phylogenetic trees.
Results
To compare the structures of metabolic networks in organisms, we adopted the exponential graph kernel, which is a kernel-based approach with a labeled graph that includes a label matrix and an adjacency matrix. To construct the phylogenetic trees, we used an unweighted pair-group method with arithmetic mean, i.e., a hierarchical clustering algorithm. We applied the kernel-based network profiling method in a comparative analysis of nine carbohydrate metabolic networks from 81 biological species encompassing Archaea, Eukaryota, and Eubacteria. The resulting phylogenetic hierarchies generally support the tripartite scheme of three domains rather than the two domains of prokaryotes and eukaryotes.
Conclusion
By combining the kernel machines with metabolic information, the method infers the context of biosphere development that covers physiological events required for adaptation by genetic reconstruction. The results show that one may obtain a global view of the tree of life by comparing the metabolic pathway structures using meta-level information rather than sequence information. This method may yield further information about biological evolution, such as the history of horizontal transfer of each gene, by studying the detailed structure of the phylogenetic tree constructed by the kernel-based method.
Background
The availability of pathway databases such as Kyoto Encyclopedia of Genes and Genomes (KEGG), What is there? (WIT3), PathDB, and MetaCyc opens up various new possibilities for comparative analysis. In particular, information about metabolic pathways in different organisms yields important information about their evolution and offers a complementary approach to phylogenetic analysis. Here we present a comparative metabolomic approach to constructing phylogenetic trees that uses physiological functions of the organisms by computing the structural similarity of metabolic networks. The consideration of metabolic components complements the conventional approaches to phylogeny based on genome sequences. Recognizing the similarities and differences in metabolic functions between species may provide insights into other applications in biotechnology, ecology, and evolutionary studies. Several researchers have attempted to rebuild evolutionary history by comparing ribosomal RNA sequences [1], by phylogenomics [2], or by comparing whole genomes to overcome the limitations of the gene-sequence analyses [3–5].
Several recent studies have extended conventional phylogenetic analysis to incorporate metabolic pathway information. Forst and Schulten [6, 7] presented one of the earliest approaches to extend the conventional sequence comparison and phylogenetic analysis of individual enzymes to metabolic networks. They also presented a method to calculate distances between metabolic networks based on sequence information of the biomolecules involved and information about the corresponding reaction networks. Dandekar et al. (1999) combined strategies in a systematic comparison of the enzymes and corresponding sequence information of the glycolytic pathway [8]. Other approaches involving the reconstructed phylogenies from gene-order data have been based on simulating genome evolution [9], and studying the genome evolution resulting from the metabolic adaptation of the organism to the surrounding environment. Liao et al. (2002) presented a method to group organisms by comparing the profiles of metabolic pathways, where the profiling was based simply on binary attributes (e.g., by denoting the presence or absence of pathways in the organisms) [10].
Whereas the previous approaches incorporated information about the additional metabolic pathways, systematic methods to calculate the similarities between metabolic networks are lacking or contain gaps in some of the biological assumptions. In this paper, we introduce the concept of graph kernels to calculate the similarities between two different network structures. The graph kernel-based approach can compute the similarity of two graph structures more efficiently through a kernel function that extracts important features from the graph. Our approach contrasts with that of Forst and Schulten [6, 7] in that the graph kernel calculates the distance at the network level rather than from sequence information about the biomolecules involved.
In their comparative analysis of metabolic pathways, Heymans and Singh [11] showed that phylogenetic trees could be made from the graph similarities of metabolic networks. They applied a distance measure between metabolic graphs of the glycolytic pathway and the citric acid cycle from 16 organisms. However, some of their data on phylogenetic inference did not correspond entirely with the conventional taxonomy and did not provide a global view of the specialization of species according to the scale of analyzed species and metabolic pathways.
Several more recent attempts have reconstructed genome trees using different formalisms such as gene ordering [12, 13], measuring gene contents [3, 14], comparing sequence similarities [15, 16], comparing proteome strings [17], and phylogenomics [18, 19]. All are based on the principle of genome sequences, but none has applied the concepts of effectible physiology to the phylogenetic analyses. We report on our comparative results and discuss our findings.
Results and discussion
Statistics for the dataset according to the number of enzymes and their relationships
Statistic  Enzyme  Relation  

# of total occurrences  35,134  17,567 
# of unique elements  218  1,275 
max # per organism  544  123 
min # per organism  46  26 
avg # per organism  68  217 
stdev across organisms  26  133 
The nine reference pathways used in the analysis
MAP No. (KEGG)  pathway name 

00010  glycolysis/gluconeogenesis 
00020  citrate cycle (TCA cycle) 
00030  pentose phosphate pathway 
00051  fructose and mannose metabolism 
00052  galactose metabolism 
00620  pyruvate metabolism 
00630  glyoxylate and dicarboxylate metabolism 
00640  propanoate metabolism 
00650  butanoate metabolism 
The 81 organisms included in the phylogenetic analysis. Full scientific names are abbreviated to a three-character notation (Abbr.), and the phylogenetic domain of each organism is given as a single character: Eubacteria (B), Archaea (A), or Eukaryota (E).
Abbr.  Domain  Organism  Abbr.  Domain  Organism 

Aae  B  Aquifex aeolicus  Mth  A  Methanobacterium thermoautotrophicum 
Ana  B  Anabaena sp.  Mtu  B  Mycobacterium tuberculosis H37Rv 
Atc  B  Agrobacterium tumefaciens C58 Cereon  Nma  B  Neisseria meningitidis serogroup A 
Ath  E  Arabidopsis thaliana  Nme  B  Neisseria meningitidis serogroup B 
Atu  B  Agrobacterium tumefaciens C58 UWash  Oih  B  Oceanobacillus iheyensis 
Bha  B  Bacillus halodurans  Pab  A  Pyrococcus abyssi 
Bme  B  Brucella melitensis  Pae  B  Pseudomonas aeruginosa 
Bsu  B  Bacillus subtilis  Pai  A  Pyrobaculum aerophilum 
Cac  B  Clostridium acetobutylicum  Pfu  A  Pyrococcus furiosus 
Ccr  B  Caulobacter crescentus  Pho  A  Pyrococcus horikoshii 
Cel  E  Caenorhabditis elegans  Pmu  B  Pasteurella multocida 
Cje  B  Campylobacter jejuni  Rno  E  Rattus norvegicus 
Cmu  B  Chlamydia muridarum  Rso  B  Ralstonia solanacearum 
Cpa  B  Chlamydophila pneumoniae AR39  Sam  B  Staphylococcus aureus MW2 
Cpe  B  Clostridium perfringens  Sau  B  Staphylococcus aureus N315 
Cpj  B  Chlamydophila pneumoniae J138  Sav  B  Staphylococcus aureus Mu50 
Cpn  B  Chlamydophila pneumoniae CWL029  Sce  E  Saccharomyces cerevisiae 
Cte  B  Chlorobium tepidum  Sco  B  Streptomyces coelicolor 
Ctr  B  Chlamydia trachomatis  Sme  B  Sinorhizobium meliloti 
Dme  E  Drosophila melanogaster  Spg  B  Streptococcus pyogenes M3 
Dra  B  Deinococcus radiodurans  Spm  B  Streptococcus pyogenes M18 
Ece  B  Escherichia coli O157 EDL933  Spo  E  Schizosaccharomyces pombe 
Ecj  B  Escherichia coli K12 W3110  Spy  B  Streptococcus pyogenes 
Eco  B  Escherichia coli K12 MG1655  Sso  A  Sulfolobus solfataricus 
Ecs  B  Escherichia coli O157 Sakai  Stm  B  Salmonella typhimurium 
Fnu  B  Fusobacterium nucleatum  Sto  A  Sulfolobus tokodaii 
Hal  A  Halobacterium sp.  Sty  B  Salmonella typhi 
Hin  B  Haemophilus influenzae  Syn  B  Synechocystis sp. 
Hpj  B  Helicobacter pylori J99  Tac  A  Thermoplasma acidophilum 
Hpy  B  Helicobacter pylori 26695  Tel  B  Thermosynechococcus elongatus 
Hsa  E  Homo sapiens  Tma  B  Thermotoga maritima 
Lin  B  Listeria innocua  Tpa  B  Treponema pallidum 
Lla  B  Lactococcus lactis  Tte  B  Thermoanaerobacter tengcongensis 
Lmo  B  Listeria monocytogenes  Tvo  A  Thermoplasma volcanium 
Mac  A  Methanosarcina acetivorans  Vch  B  Vibrio cholerae 
Mja  A  Methanococcus jannaschii  Xax  B  Xanthomonas axonopodis 
Mle  B  Mycobacterium leprae  Xca  B  Xanthomonas campestris 
Mlo  B  Mesorhizobium loti  Xfa  B  Xylella fastidiosa 
Mma  A  Methanosarcina mazei  Ype  B  Yersinia pestis 
Mmu  E  Mus musculus  Ypk  B  Yersinia pestis KIM 
Mtc  B  Mycobacterium tuberculosis CDC1551 
The phylogeny took two directions: conventional taxonomy that focused on the morphological and physiological features to classify species, and a numerical taxonomy that stressed the historical changes in biological sequences. Phylogenies based on the ribosomal RNA molecules led to the proposal of a new tripartite scheme of three domains: Bacteria, Archaea, and Eukarya [20]. Although each approach is feasible on its own, it cannot provide a holistic view of the organism. Current phylogenetic studies indicate that horizontal gene transfer may have played a vital role in the evolution of major lineages [21]. Lake and Moore [22] also noted the pitfalls of comparative genomics based on molecular sequences. Our kernel-based method provides an alternative way to infer an evolutionary scenario and allows a higher-level comparison of phylogenetic trees by measuring the distances between pathways using metabolic network data.
Consistency with conventional taxonomy
Figure 2(a) shows that archaeal metabolic networks are more closely related to the eukaryotic networks than to the eubacterial networks. This corresponds with a comparison of the information-transfer pathways and pathway-level organization between two domains [23], whereas eukaryotic metabolic enzymes are primarily of bacterial origin [24].
Inferring hidden order by network clustering
The conventional sequence-based analysis passes over or does not embrace the discordant evolution of each species or the horizontal gene transfer [25]. Our method can cope with this limitation by taking into account the structural features of individual metabolic networks. The disagreement between the molecular sequence data of operational genes and the rRNA tree suggests that different genes have different evolutionary histories [26, 27]. To address this problem, Li studied the mitochondrial genomes in relation to the problem of whole-genome phylogeny, where evolutionary events, such as genetic rearrangements that include gene transfer from the exterior, make genome alignments difficult [28].
To compare the kernel-based comparative analysis of metabolic networks to the sequence-based phylogenetic analysis, we analyzed two enzyme sequences that participate in carbohydrate metabolism in all 81 species together using a multiple sequence alignment (Figure 2(b)). In the resulting phylogenetic tree, Archaea and Eukaryota are clustered at each terminal; however, short-distance neighboring node members belong to fairly distant taxonomic groups. The overall structure of the tree differs markedly not only from that of the kernel-based method (Figure 2(a)), but also from that of current taxonomy (Figure 3). Although the phylogenetic tree constructed from the multiple alignment of two enzyme sequences shows a few unusual characteristics, our approach provides a good solution. The cluster mainly comprised archaeal species including the bacterial members Chlamydia and Chlamydophila, and had long branches at the root of the tree. Three eubacterial members (Ana, Tel, and Syn) are more closely related to Eukaryota than to Eubacteria. Moreover, eubacterial groups are separated over the topology, and the Eukaryota are inserted between them (Figure 2(b)).
Comparison of similarity scores with respect to NCBI taxonomy for 65 organisms with the glycolysis pathway (β = 0.8).
Method  Similarity score 

Our method  0.196 
Heymans and Singh [11]  0.154 
In this paper, we intended to present a meta-level analysis of biological systems to construct a unitary phylogenetic tree that could be used to interpret the context of biological evolution. Our results suggest that the phylogenetic analysis with sub-metabolic network information might also allow us to infer horizontal or lateral gene transfer. Our results also support the tripartite scheme of the three domains, Bacteria, Archaea and Eukaryota [20].
Comparing the pathogenic bacterial genomes by focusing on the pathways of bacterial and eukaryotic aminoacyl-tRNA synthesis showed that this pathway is uniquely prokaryotic/archaeal and that it is found widely among the pathogenic bacteria. This suggests that members of this pathway can be used as targets for novel antimicrobial drugs [32]. Metabolic analysis of pathogenic organisms may play a critical role in the selective treatment or prevention of diseases caused by these organisms by using this innovative concept to develop new drugs.
Conclusion
Biological classification, taxonomy, and systematics are the profound themes in biology. Using phylogeny in evolutionary classification implies functional and morphological innovation, adaptive range, parallelism, and convergence. We have used a method based on the graph kernel to compare information on each metabolic network including cardinality, distance, and topology relating to metabolic networks as a type of undirected graph. Our results showed that our approach has potential in the macroscopic analysis of phylogenetic relationships among organisms in relation to horizontal gene transfer. To obtain information about each causal mechanism in the context of a similar phenotype, one should first analyze the phenomena at the level of a protein network. The analysis of a metabolic network is an example of this type of analysis. Biological entities that interact with the environment and eventually influence adaptation are a function of the activity of proteins and other bioactive molecules rather than gene order or genetic history.
The overall structure of the phylogenetic tree constructed from our experiments supports the tripartite scheme of the three domains Archaea, Eubacteria, and Eukaryota as described in an early report of Woese et al. [20]. The structures of metabolic pathways deduced from Archaea are more similar to those from Eukaryota than to those from Eubacteria. This agrees with the rooted universal tree of life [33, 34] and the tree of life [35, 36]. The metabolic network structures of organisms reflect their functional relationship with the environment, and the similarity might provide a measure of the organism's physiological functions. The trajectory of an organism's adaptation can be explained using the structure of its metabolic contents. Our approach can be extended to more organisms and applied to other types of biomolecular interactions, such as physical protein interactions in regulatory networks, to provide a basis for understanding the functional relationships between biological networks in different organisms.
Methods

Step 1: Build the enzyme-enzyme relation lists.

Step 2: Convert the lists to graph structures.

Step 3: Compute similarity by graph kernels.

Step 4: Build the phylogenetic trees.
Data preparation
Dataset
We chose the KEGG database [37] as the resource for the preceding phylogenetic analysis. KEGG provides both an online map of pathways and the ability to focus on metabolic reactions in specific organisms. Each reaction may be uni- or bidirectional.
Representation of organisms
Let $O = \{O_1, \ldots, O_N\}$ be a set of $N$ organisms and $P = \{P_1, \ldots, P_M\}$ be a set of $M$ reference pathways. Here a reference pathway contains all known alternatives of reaction paths. The set of organism-specific pathways is defined as $P' = \{P'_1, \ldots, P'_M\}$, which contains organism-specific reactions. If we define a set of enzyme-enzyme relations as $R = \{r_1, \ldots, r_K\}$, then a subset of $R$ constitutes $P_j$ or $P'_j$ ($1 \le j \le M$). Here, $r_k$ ($1 \le k \le K$) is a pair of enzymes $\{e_u, e_v\}$, which means that $e_u$ directly connects with $e_v$. The specific organism $O_i$ ($1 \le i \le N$) contains a set of pathways, defined as $P'(O_i)$, which includes a subset of $R$ for the specific organism, $R'(O_i)$.
Enzyme-enzyme relation lists of organisms
The pathways provided in KEGG are visualized on manually drawn pathway maps or XML-based graphics. To construct enzyme-enzyme relation lists, we used information about chemical compounds and chemical reactions contained in the LIGAND database [38]. The LIGAND database provides detailed molecular information about one type of the generalized protein-protein interaction, namely, the enzyme-enzyme relation. LIGAND is a composite database of ENZYME and COMPOUND. The ENZYME section contains information about enzymatic reactions and enzyme molecules, and the COMPOUND section contains more than 6,000 chemical compounds. The enzyme-enzyme relationship can be extracted from information about enzymes contained in the COMPOUND entries. We automatically extracted information about enzymes of a specific organism from the ENZYME section.
The enzyme-enzyme relationships of a specific organism were extracted from the enzyme-enzyme relation list. If both enzymes of $r_k$ in the enzyme-enzyme relation list existed in the enzyme list of a specific organism, we inserted $r_k$ into $R'(O_i)$. $P'(O_i)$ can then be constructed from $R'(O_i)$.
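The filtering rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `organism_relations`, the representation of each relation as a `frozenset` of two EC numbers, and the toy data are all hypothetical and are not taken from the KEGG or LIGAND files.

```python
from typing import FrozenSet, Iterable, Set

Relation = FrozenSet[str]  # an unordered enzyme pair {e_u, e_v}, keyed by EC number


def organism_relations(reference: Iterable[Relation], enzymes: Set[str]) -> Set[Relation]:
    """Keep a relation r_k only if both of its enzymes occur in the organism's enzyme list."""
    return {r for r in reference if r <= enzymes}


# Hypothetical toy data: three reference relations and one organism's enzyme list.
reference = {
    frozenset({"2.7.1.1", "5.3.1.9"}),
    frozenset({"5.3.1.9", "2.7.1.11"}),
    frozenset({"2.7.1.11", "4.1.2.13"}),
}
enzymes_of_organism = {"2.7.1.1", "5.3.1.9", "4.1.2.13"}

# Only the first relation survives: the organism lacks 2.7.1.11.
r_prime = organism_relations(reference, enzymes_of_organism)
```

Storing relations as unordered pairs matches the later treatment of the network as an undirected graph.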
Data analysis
Metabolic networks as labeled graphs
Our approach to estimating the distance between two metabolic networks is based on graph comparison. Using the relation list of enzymes, the metabolic network of each organism is represented by a labeled graph $\Gamma = (\mathcal{V}, \mathcal{E}, f)$, where $\mathcal{V}$ is a vertex set, $\mathcal{E}$ is an edge set, and $f: \mathcal{V} \to \mathcal{L}$ is a vertex-labeling function, with $\mathcal{L} = \{\ell_l\}$ the set of possible labels for vertices.
For an organism $O_i$, each vertex $v \in \mathcal{V}_i$ corresponds to an enzyme of $O_i$, and the cardinality $|\mathcal{V}_i|$ is equal to the number of distinct enzymes in the enzyme-enzyme relation list $R(O_i)$ of the organism. When an entry for two enzymes $e_u$ and $e_v$ is found in $R(O_i)$, the corresponding vertices $u$ and $v$ are directly connected by an edge $(u, v) \in \mathcal{E}_i$ (denoted by $u \sim v$). The set $\mathcal{L}$ contains the unique identifiers (i.e., EC numbers) of all enzymes found in $\{R(O_i)\}_{i=1}^{N}$ of all selected organisms.
A matrix representation of a labeled graph $\Gamma_i$ can be given by an adjacency matrix $H_i$ and a label matrix $L_i$, where $H_i$ is a $|\mathcal{V}_i| \times |\mathcal{V}_i|$ square matrix and $L_i$ is a $|\mathcal{L}| \times |\mathcal{V}_i|$ matrix. Each element $H_i(a, b)$ is given by
$$H_i(a, b) = \begin{cases} w_{(v_a, v_b)} & \text{if } v_a \sim v_b, \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$
where $w_{(v_a, v_b)}$ is the weight of the edge $(v_a, v_b)$. Whenever the vertices $v_a$ and $v_b$ are joined by an edge, we set the weight such that $w_{(v_a, v_b)} = C_{\Gamma_i} \cdot \frac{1}{\deg(v_a)}$, where $\deg(v_a)$ is the degree of $v_a$ and $C_{\Gamma_i}$ is a constant for the graph $\Gamma_i$. Then $w_{(v_a, v_b)}$ is proportional to the probability $1/\deg(v_a)$ of visiting $v_b$ in one step of a random walk starting from $v_a$. We set $C_{\Gamma_i} = \frac{\sum_{v \in \mathcal{V}_i} \deg(v)}{|\mathcal{V}_i|}$, so that the sum of all elements of $H_i$ remains equal to the number of edges in a bidirectional representation of $\Gamma_i$.
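As a minimal sketch of this weighting, the following Python builds the weighted adjacency matrix from an undirected edge list and checks that the entries sum to twice the number of undirected edges. The function name `adjacency_matrix` and the toy graph are hypothetical, and the sketch assumes a simple graph without self-loops or isolated vertices.

```python
from collections import defaultdict


def adjacency_matrix(edges):
    """Return (vertices, H) with H[a][b] = C / deg(v_a) if v_a ~ v_b, else 0.

    C is the mean vertex degree, as described in the text.
    Assumes an undirected simple graph (no self-loops).
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    vertices = sorted(neighbours)
    index = {v: i for i, v in enumerate(vertices)}
    degree = {v: len(neighbours[v]) for v in vertices}
    c = sum(degree.values()) / len(vertices)  # the graph constant C
    n = len(vertices)
    H = [[0.0] * n for _ in range(n)]
    for u in vertices:
        for v in neighbours[u]:
            H[index[u]][index[v]] = c / degree[u]
    return vertices, H


# Toy star graph: B is joined to A, C and D (3 undirected edges).
edges = [("A", "B"), ("B", "C"), ("B", "D")]
vertices, H = adjacency_matrix(edges)
total = sum(sum(row) for row in H)  # 6.0 = 2 * 3, one count per edge direction
```

Note that the resulting matrix is generally not symmetric, because each row is scaled by the degree of its own vertex.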
An element of the matrix $L_i$ is defined as
$$L_i(l, a) = \begin{cases} 1 & \text{if } \ell_l = f(v_a), \\ 0 & \text{otherwise,} \end{cases} \qquad (2)$$
with $1 \le l \le |\mathcal{L}|$, $1 \le a \le |\mathcal{V}_i|$. This means that $L_i(l, a)$ is 1 only when the label of vertex $v_a$ is $\ell_l$. Since we represent a metabolic pathway in such a way that every vertex (enzyme) in it has a unique EC number, every column sum of $L_i$ is 1, that is, $\sum_l L_i(l, a) = 1$ for all $a$. In terms of the rows of $L_i$, $\sum_a L_i(l, a) = 1$ if $\ell_l = f(v)$ for some $v \in \mathcal{V}_i$, and $\sum_a L_i(l, a) = 0$ otherwise. To compare the structures between metabolic networks of two organisms represented in graphs as described above, we adopted a kernel-based approach called the exponential graph kernel [39].
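The label matrix and its row and column sums can be illustrated as follows. This is a sketch only: `label_matrix` is a hypothetical helper, and the EC numbers are toy values standing in for the global label set and one organism's vertices.

```python
def label_matrix(labels, vertex_labels):
    """L[l][a] = 1 if vertex a carries label labels[l], else 0."""
    row = {lab: l for l, lab in enumerate(labels)}
    L = [[0] * len(vertex_labels) for _ in labels]
    for a, lab in enumerate(vertex_labels):
        L[row[lab]][a] = 1
    return L


# Hypothetical global label set (all organisms) and the vertices of one organism.
labels = ["1.1.1.1", "2.7.1.1", "4.1.2.13", "5.3.1.9"]
vertex_labels = ["2.7.1.1", "5.3.1.9"]

L = label_matrix(labels, vertex_labels)
# Every vertex has exactly one label, so each column sums to 1.
column_sums = [sum(L[l][a] for l in range(len(labels))) for a in range(len(vertex_labels))]
# A row sums to 1 if the organism has that enzyme and to 0 otherwise.
row_sums = [sum(r) for r in L]
```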
Comparison of metabolic networks: graph kernel
Given two graphs $\Gamma_i = (L_i, H_i)$ and $\Gamma_j = (L_j, H_j)$, the first simple approach to the graph comparison is to count the common vertices with the same labels in both $\Gamma_i$ and $\Gamma_j$. This similarity (or kernel) can be calculated by $k(\Gamma_i, \Gamma_j) = \langle L_i L_i^T, L_j L_j^T \rangle$, where the inner product $\langle M_i, M_j \rangle$ between two matrices of the same dimension is defined as
$$\langle M_i, M_j \rangle = \sum_{a, b} M_i(a, b)\, M_j(a, b). \qquad (3)$$
Based on the definition of the label matrix in Equation (2), the matrix $M_i = L_i L_i^T$ is a $|\mathcal{L}| \times |\mathcal{L}|$ diagonal matrix where $M_i(l, l) = 1$ only when $f(v) = \ell_l$ for some $v \in \mathcal{V}_i$, and $M_i(l, l) = 0$ otherwise. However, this approach considers only the presence or absence of vertices (enzymes) and does not consider the structure of the graph, so the successive enzymes or reaction steps cannot be taken into account when comparing metabolic networks. To capture the structure of the graph, one must also consider vertices that can be reached from a vertex by a subsequent traverse.
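The vertex-counting kernel can be checked on a toy example. The sketch below assumes the element-wise (Frobenius) inner product between matrices and uses hypothetical helper names and made-up labels; it shows that the kernel value equals the number of labels shared by the two graphs.

```python
def label_matrix(labels, vertex_labels):
    """L[l][a] = 1 if vertex a carries label labels[l], else 0."""
    row = {lab: l for l, lab in enumerate(labels)}
    L = [[0] * len(vertex_labels) for _ in labels]
    for a, lab in enumerate(vertex_labels):
        L[row[lab]][a] = 1
    return L


def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(r, c)) for c in zip(*B)] for r in A]


def transpose(A):
    return [list(c) for c in zip(*A)]


def inner(A, B):
    """Element-wise inner product of two matrices of the same shape."""
    return sum(x * y for ra, rb in zip(A, B) for x, y in zip(ra, rb))


def vertex_count_kernel(labels, vl_i, vl_j):
    """k = <L_i L_i^T, L_j L_j^T>, ignoring all edge information."""
    Li, Lj = label_matrix(labels, vl_i), label_matrix(labels, vl_j)
    return inner(matmul(Li, transpose(Li)), matmul(Lj, transpose(Lj)))


labels = ["a", "b", "c", "d"]      # hypothetical global label set
vl_i = ["a", "b", "c"]             # labels of the vertices of graph i
vl_j = ["b", "c", "d"]             # labels of the vertices of graph j
k = vertex_count_kernel(labels, vl_i, vl_j)  # shared labels: b and c
```

Because the two label-product matrices are diagonal indicator matrices, the same value could be obtained from a set intersection; the matrix form is kept here only to mirror the notation of the text.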
In the exponential graph-kernel method, the similarity between two graphs $\Gamma_i$ and $\Gamma_j$ is defined as
$$k(\Gamma_i, \Gamma_j) = \langle L_i e^{\beta H_i} L_i^T,\; L_j e^{\beta H_j} L_j^T \rangle, \qquad (4)$$
with the matrix exponential
$$e^{\beta H} = \sum_{n=0}^{\infty} \frac{\beta^n}{n!} H^n, \qquad (5)$$
where $\beta$ ($\ge 0$) is a real-valued parameter whose value is chosen by trial and error. When $\beta = 0$, the kernel recovers the simple common-vertex-counting measure since $\exp(0H) = I$, the $|\mathcal{V}| \times |\mathcal{V}|$ identity matrix. Each element $H^n(a, b)$ of the matrix $H^n$ in Equation (5) represents the number of walks of length $n$ (admitting cycles) from $v_a$ to $v_b$, and allows the representation of the global structure of a graph.
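The following is a minimal sketch of the exponential graph kernel, assuming the form $k = \langle L_i e^{\beta H_i} L_i^T, L_j e^{\beta H_j} L_j^T \rangle$ with an element-wise inner product. The matrix exponential is approximated by a truncated power series rather than by diagonalization, and the toy graphs use plain 0/1 adjacency matrices instead of the degree-scaled weights, purely for brevity. All function names are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inner(A, B):
    """Element-wise inner product of two matrices of the same shape."""
    return sum(x * y for ra, rb in zip(A, B) for x, y in zip(ra, rb))


def expm_taylor(H, beta, terms=30):
    """Truncated series: sum over n < terms of beta^n H^n / n!."""
    n = len(H)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # n = 0 term
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term_k = term_{k-1} * H * beta / k
        term = [[beta / k * x for x in row] for row in matmul(term, H)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


def label_matrix(labels, vertex_labels):
    row = {lab: l for l, lab in enumerate(labels)}
    L = [[0.0] * len(vertex_labels) for _ in labels]
    for a, lab in enumerate(vertex_labels):
        L[row[lab]][a] = 1.0
    return L


def exp_graph_kernel(labels, graph_i, graph_j, beta):
    """Each graph is a pair (vertex_labels, H)."""
    def feature(graph):
        vertex_labels, H = graph
        L = label_matrix(labels, vertex_labels)
        return matmul(matmul(L, expm_taylor(H, beta)), transpose(L))
    return inner(feature(graph_i), feature(graph_j))


labels = ["a", "b", "c", "d"]
path = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # 3-vertex path
g_i = (["a", "b", "c"], path)   # path a - b - c
g_j = (["b", "c", "d"], path)   # path b - c - d
k0 = exp_graph_kernel(labels, g_i, g_j, beta=0.0)    # reduces to vertex counting
k08 = exp_graph_kernel(labels, g_i, g_j, beta=0.8)   # also rewards shared walks
```

With $\beta = 0$ the sketch returns the number of shared labels; with $\beta > 0$ the shared edge between b and c contributes additional similarity. For matrices of realistic size, a library routine for the matrix exponential would be preferable to the hand-written series.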
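Substituting Equation (5) into Equation (4), we can decompose the kernel function $k(\Gamma_i, \Gamma_j)$ into two meaningful parts, $k(\Gamma_i, \Gamma_j) = k_1(\Gamma_i, \Gamma_j) + k_2(\Gamma_i, \Gamma_j)$, where
$$k_1(\Gamma_i, \Gamma_j) = \sum_{n=0}^{\infty} \frac{\beta^{2n}}{(n!)^2} \langle L_i H_i^n L_i^T,\; L_j H_j^n L_j^T \rangle,$$
$$k_2(\Gamma_i, \Gamma_j) = \sum_{n \ne m} \frac{\beta^{n+m}}{n!\, m!} \langle L_i H_i^n L_i^T,\; L_j H_j^m L_j^T \rangle.$$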
The kernel function $k_1$ contributes by considering walks of the same length in both graphs, and $k_2$ can take into account the insertion or deletion of vertices in the graph [39]. As the number of movements in a graph increases, the significance of walks of length $n$ decreases as $\frac{\beta^n}{n!}$. Eventually, the exponential matrix $e^{\beta H}$ can be interpreted as the product of a continuous process $H$, from which the identity matrix expands gradually to the matrix of the global structure of $\Gamma$ [40].
The exponential graph kernel requires the exponentiation of the square matrices $H_i$. This can be performed by matrix diagonalization, with a time complexity of about $O(|\mathcal{V}_i|^3)$ for $H_i$, and thus for $\Gamma_i$ [39]. The time complexity of the elementwise product of two matrices in $k(\Gamma_i, \Gamma_j)$ is $O(\max(|\mathcal{V}_i|^2, |\mathcal{V}_j|^2))$. Finally, with $N$ graphs, the total time complexity for constructing the kernel matrix $K = \{k_{ij}\}$ ($1 \le i, j \le N$) is $O(NV^3 + N^2V^2)$, where $V = \max_i |\mathcal{V}_i|$.
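From the kernel $k(\Gamma_i, \Gamma_j)$, the dissimilarity metric is defined in the standard manner, that is,
$$d(\Gamma_i, \Gamma_j) = \sqrt{k(\Gamma_i, \Gamma_i) - 2k(\Gamma_i, \Gamma_j) + k(\Gamma_j, \Gamma_j)}.$$
If we use the normalized kernel,
$$\tilde{k}(\Gamma_i, \Gamma_j) = \frac{k(\Gamma_i, \Gamma_j)}{\sqrt{k(\Gamma_i, \Gamma_i)\, k(\Gamma_j, \Gamma_j)}},$$
then the distance metric is simplified as
$$d(\Gamma_i, \Gamma_j) = \sqrt{2\left(1 - \tilde{k}(\Gamma_i, \Gamma_j)\right)}.$$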
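A short sketch of turning a kernel matrix into a distance matrix is given below. It assumes the usual kernel-induced distance, which for a normalized kernel $\tilde{k}$ (where $\tilde{k}(\Gamma, \Gamma) = 1$) reduces to $\sqrt{2 - 2\tilde{k}(\Gamma_i, \Gamma_j)}$. The function name and the toy kernel values are hypothetical.

```python
import math


def kernel_distance_matrix(K):
    """Distances from a kernel matrix K via the normalized kernel.

    k_norm(i, j) = K[i][j] / sqrt(K[i][i] * K[j][j])
    d(i, j)      = sqrt(2 - 2 * k_norm(i, j))
    """
    n = len(K)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            k_norm = K[i][j] / math.sqrt(K[i][i] * K[j][j])
            # Clamp at zero to guard against tiny negative round-off.
            D[i][j] = math.sqrt(max(0.0, 2.0 - 2.0 * k_norm))
    return D


# Toy 2 x 2 kernel matrix (made-up values, not from the paper).
K = [[4.0, 2.0],
     [2.0, 9.0]]
D = kernel_distance_matrix(K)
# k_norm(0, 1) = 2 / sqrt(36) = 1/3, so d(0, 1) = sqrt(4/3)
```

Normalizing first keeps organisms with large networks from appearing distant from everything merely because their self-similarity is large.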
To summarize, metabolic networks constructed from reference pathways of $N$ organisms were first converted to labeled undirected graphs. Each graph $\Gamma_i$ ($1 \le i \le N$) was then represented by two matrices: the vertex-label matrix $L_i$ and the adjacency matrix $H_i$. Using these two matrices alone, we can take into account only the local structure (the direct connectivities between enzymes in pathways) of networks. To compare networks in terms of their global structure, we adopted a kernel-based method, the exponential graph kernel. Finally, the distance matrix acquired from the kernel function was fed into a hierarchical clustering algorithm to construct the phylogenetic trees.
Constructing phylogenetic trees
The distance between two organisms was calculated by comparing their metabolic networks using the measures mentioned earlier. To construct a phylogenetic tree, we used an unweighted pair-group method with arithmetic mean (UPGMA) [41, 42], a hierarchical agglomerative clustering algorithm. Given N organisms, the algorithm starts by initializing N clusters, each of which contains exactly one distinct organism, and proceeds by iteratively merging the two nearest clusters until only one cluster (called the root of the tree) remains. The dendrogram derived from UPGMA is a binary tree, which we consider may represent a binary phylogenetic tree.
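The merging procedure can be sketched as below. This is a plain-Python illustration of UPGMA on a small made-up distance matrix, not the software used in the study; the function name `upgma` and the nested-tuple tree representation are assumptions, and branch lengths are omitted.

```python
def upgma(names, D):
    """Return a nested-tuple binary tree built by UPGMA from distance matrix D."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}          # id -> (subtree, size)
    dist = {(i, j): D[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)                        # two nearest clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del dist[(a, b)]
        for c in list(clusters):
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            # Size-weighted average = arithmetic mean over all member pairs.
            dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _size), = clusters.values()                        # the root
    return tree


# Toy distances among four hypothetical organisms.
names = ["A", "B", "C", "D"]
D = [[0, 2, 6, 10],
     [2, 0, 6, 10],
     [6, 6, 0, 10],
     [10, 10, 10, 0]]
tree = upgma(names, D)  # A and B merge first, then C joins, then D
```

Ties between equally near cluster pairs are broken here by dictionary insertion order, which is an arbitrary choice of this sketch.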
Declarations
Acknowledgements
This research was supported by the National Research Laboratory (NRL) Program (M104120009504J000003610) of Korean Ministry of Science and Technology and the Inje University research grant.
References
 1. Whiting MF, Carpenter JC, Wheeler QD, Wheeler WC: The Strepsiptera problem: phylogeny of the holometabolous insect orders inferred from 18S and 28S ribosomal DNA sequences and morphology. Syst Biol 1997, 46: 1–68.
 2. Delsuc F, Brinkmann H, Philippe H: Phylogenomics and the reconstruction of the tree of life. Nature Rev Genet 2005, 6: 361–375. 10.1038/nrg1603
 3. FitzGibbon ST, House CH: Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res 1999, 27: 4218–4222. 10.1093/nar/27.21.4218
 4. Lin J, Gerstein M: Whole-genome trees based on the occurrence of folds and orthologs: Implications for comparing genomes on different levels. Genome Res 2000, 10: 808–818. 10.1101/gr.10.6.808
 5. Otu HH, Sayood K: A new sequence distance measure for phylogenetic tree construction. Bioinformatics 2003, 19: 2122–2130. 10.1093/bioinformatics/btg295
 6. Forst CV, Schulten K: Evolution of metabolisms: A new method for the comparison of metabolic pathways using genomics information. J Comp Biol 1999, 6: 343–360. 10.1089/106652799318319
 7. Forst CV, Schulten K: Phylogenetic analysis of metabolic pathways. J Mol Evol 2001, 52: 471–489.
 8. Dandekar T, Schuster S, Snel B, Huynen M, Bork P: Pathway alignment: application to the comparative analysis of glycolytic enzymes. Biochem J 1999, 343: 115–124. 10.1042/0264-6021:3430115
 9. Moret BME, Wang LS, Warnow T, Wyman SK: New approaches for reconstructing phylogenies from gene order data. Bioinformatics 2001, 17: S165–S173.
 10. Liao L, Kim S, Tomb JF: Genome comparisons based on profiles of metabolic pathways. Proceedings of the Sixth International Conference on Knowledge-based Intelligent Information & Engineering Systems 2002, 469–476.
 11. Heymans M, Singh AK: Deriving phylogenetic trees from the similarity analysis of metabolic pathways. Bioinformatics 2003, 19: i138–i146. 10.1093/bioinformatics/btg1018
 12. Wolf YI, Rogozin IB, Grishin NV, Tatusov RL, Koonin EV: Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol Biol 2001, 1: 8. 10.1186/1471-2148-1-8
 13. Korbel JO, Snel B, Huynen MA, Bork P: SHOT: a web server for the construction of genome phylogenies. Trends Genet 2002, 18: 158–162. 10.1016/S0168-9525(01)02597-5
 14. Tekaia F, Lazcano A, Dujon B: The genomic tree as revealed from whole proteome comparison. Genome Res 1999, 9: 550–557.
 15. Grishin NV, Wolf YI, Koonin EV: From complete genomes to measures of substitution rate variability within and between proteins. Genome Res 2000, 10: 991–1000. 10.1101/gr.10.7.991
 16. Henz SR, Huson DH, Auch AF, Nieselt-Struwe K, Schuster SC: Whole-genome prokaryotic phylogeny. Bioinformatics 2005, 21: 2329–2335. 10.1093/bioinformatics/bth324
 17. Qi J, Wang B, Hao BI: Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach. J Mol Evol 2004, 58: 1–11. 10.1007/s00239-003-2493-7
 18. Daubin V, Gouy M, Perriere G: A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res 2002, 12: 1080–1090. 10.1101/gr.187002
 19. Rokas A, Williams BL, King L, Carroll SB: Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 2003, 425: 798–804. 10.1038/nature02053
 20. Woese CR, Kandler O, Wheelis ML: Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci USA 1990, 87: 4576–4579. 10.1073/pnas.87.12.4576
 21. Doolittle WF: Phylogenetic classification and the universal tree. Science 1999, 284: 2124–2128. 10.1126/science.284.5423.2124
 22. Lake JA, Moore JE: Phylogenetic analysis and comparative genomics. Trends Guide to Bioinformatics 1998.
 23. Podani J, Oltvai ZN, Jeong H, Tombor B, Barabasi AL: Comparable system-level organization of Archaea and Eukaryotes. Nat Genet 2001, 29: 54–56. 10.1038/ng708
 24. Rivera MC, Jain R, Moore JE, Lake JA: Genomic evidence for two functionally distinct gene classes. Proc Natl Acad Sci USA 1998, 95: 6239–6244. 10.1073/pnas.95.11.6239
 25. Canback B, Andersson SGE, Kurland CG: The global phylogeny of glycolytic enzymes. Proc Natl Acad Sci USA 2002, 99: 6097–6102. 10.1073/pnas.082112499
 26. Jain R, Rivera MC, Lake JA: Horizontal gene transfer among genomes: the complexity hypothesis. Proc Natl Acad Sci USA 1999, 96: 3801–3806. 10.1073/pnas.96.7.3801
 27. Doolittle WF: Lateral genomics. Trends Cell Biol 1999, 9: M5–M8. 10.1016/S0962-8924(99)01664-5
 28. Li M, Badger JH, Chen X, Kwong S, Kearney P, Zhang H: An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 2001, 17: 149–154. 10.1093/bioinformatics/17.2.149
 29. Keeling PJ, Palmer JD: Lateral transfer at the gene and subgenic levels in the evolution of eukaryotic enolase. Proc Natl Acad Sci USA 2001, 98: 10745–10750. 10.1073/pnas.191337098
 30. Nye TMW, Lio P, Gilks WR: A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics 2006, 22: 117–119. 10.1093/bioinformatics/bti720
 31. Zhang K, Wang JTL, Shasha D: On the editing distance between undirected acyclic graphs. Int J Foundations Comput Sci 1996, 7: 43–57. 10.1142/S0129054196000051
 32. Fritz B, Raczniak GA: Bacterial genomics: potential for antimicrobial drug discovery. Biodrugs 2002, 16: 331–337. 10.2165/00063030-200216050-00002
 33. Doolittle WF, Brown JR: Tempo, mode, the progenote, and the universal root. Proc Natl Acad Sci USA 1994, 91: 6721–6728. 10.1073/pnas.91.15.6721
 34. Doolittle WF, Logsdon JM Jr: Archaeal genomics: Do archaea have a mixed heritage? Curr Biol 1998, 8: R209–R211. 10.1016/S0960-9822(98)70127-7
 35. Wolf YI, Rogozin IB, Grishin NV, Koonin EV: Genome trees and the tree of life. Trends Genet 2002, 18: 472–479. 10.1016/S0168-9525(02)02744-0
 36. Tree of Life [http://tolweb.org]
 37. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 1999, 27: 29–34. 10.1093/nar/27.1.29
 38. Goto S, Okuno Y, Hattori M, Nishioka T, Kanehisa M: LIGAND: database of chemical compounds and reactions in biological pathways. Nucleic Acids Res 2002, 30: 402–404. 10.1093/nar/30.1.402
 39. Gärtner T: Exponential and Geometric Kernels for Graphs. NIPS 2002 Workshop on Unreal Data: Principles of Modeling Nonvectorial Data 2002.
 40. Kondor RI, Lafferty J: Diffusion kernels on graphs and other discrete input spaces. Proceedings of 19th International Conference on Machine Learning 2002, 315–322.
 41. Jain AK, Dubes RC: Algorithms for Clustering Data. 2nd edition. Prentice Hall; 1988.
 42. Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press; 1998.
 43. Page RDM: TREEVIEW: An application to display phylogenetic trees on personal computers. CABIOS 1996, 12: 357–358.
 44. NCBI taxonomy [http://www.ncbi.nlm.nih.gov/Taxonomy/]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.