 Research
 Open access
 Published:
A novel approach to phylogenetic tree construction using stochastic optimization and clustering
BMC Bioinformatics volume 7, Article number: S24 (2006)
Abstract
Background
The problem of inferring the evolutionary history and constructing the phylogenetic tree with high performance has become one of the major problems in computational biology.
Results
A new phylogenetic tree construction method from a given set of objects (proteins, species, etc.) is presented. As an extension of ant colony optimization, this method proposes an adaptive phylogenetic clustering algorithm based on a digraph to find a tree structure that defines the ancestral relationships among the given objects.
Conclusion
Our phylogenetic tree construction method is tested to compare its results with that of the genetic algorithm (GA). Experimental results show that our algorithm converges much faster and also achieves higher quality than GA.
Background
An evolutionary tree, or phylogenetic tree, is a model of the evolutionary history for a set of species. With more and more DNA and protein sequences have been obtained [1–3], the problem of inferring the evolutionary history and constructing the phylogenetic tree has become one of the major problems in computational biology. This is because the evolutionary relationship of species provides a great deal of information about their biochemical machinery. For example, RNA's secondary structure is most accurately determined by selecting correlated mutations of a class of related species.
A phylogenetic tree is a tree showing the evolutionary interrelationships among various species or other entities that are believed to have a common ancestor. In a phylogenetic tree, each node with descendants represents the most recent common ancestor of the descendants, and the edge lengths correspond to time estimates. Each node in a phylogenetic tree is called a taxonomic unit, and the leaves usually denote a set of objects (proteins, species, etc.). Internal nodes are generally referred to as Hypothetical Taxonomic Units (HTUs) as they cannot be directly observed [3–6].
To construct a tree from a set of species, one must have a metric to decide if a tree is better than another one. Many criteria have been proposed. But in general, they turn out to be NPhard to optimize. There is still no consensus in the biology community on how to make a good tree.
One way to counteract this problem is to execute many different phylogenetic clustering methods resulting in various starting tree topologies. A choice from the generated trees gives rise to the best one. Another way to handle this problem is to use a global optimization technique to derive the optimal topology of the tree. In this paper, ant colony algorithm is applied both as a clustering method and as a global optimization technique so that the optimal tree can be found even with a bad initial tree topology.
During the past years, a number of efforts have been contributed to phylogenetic analyses using genomic sequences, which could be either whole genomes (complete gene sequence sets) or complete protein sequence sets [7–9]. There are three main methods for constructing phylogenetic trees: distancebased methods such as neighbourjoining, parsimonybased methods such as maximum parsimony, and characterbased methods such as maximum likelihood or Bayesian inference [1, 2]. The distance based approaches avoid the high computational complexity of multiple sequence alignment (including genome reorganization) to compute an evolutionary distance and try to construct the phylogenetic trees efficiently. The phylogenetic clustering method in this article belongs to the distance based category.
Ant Colony Optimization (ACO) is a new evolution simulation algorithm proposed by Italian researchers Dorigo et al [10]. Inspired by studies on biological ant colony, they recognize the similarities between the ants' foodhunting activities and TSP, and successfully resolve the TSP problems using the same principle that the ants have used to find the shortest route to food source via communication and cooperation, and it has been applied to lots of combinational optimization problems [10, 11].
We note that in the ant colony algorithm, ants can volatilize a kind of chemical odour called pheromone when they encounter each other or in the process of seeking their fellows. Enlightened by this fact, we first apply weights of rejection and acceptance between the objects to form a complete digraph in which the vertexes represent the objects and the initial weight of each edge between vertexes is the weights of acceptance between the objects. The novel clustering process by artificial ants is illustrated in Fig. 1, 2, 3 and 4, during the process, the pheromone on each edge of the digraph will be updated with the artificial ants' adaptive movements, and some adaptive strategies are also presented to speed up the clustering progress. Finally the clusters got by the ants are used to progressively construct the phylogenetic trees.
Results
Constructing a specific digraph for the objects
Ants can volatilize a kind of chemical odour called pheromone when they encounter each other or in the process of seeking their fellows. Based on this kind of odour, ants will naturally attract those who have similar features and repel those that are different. In this paper, artificial ants were set to travel on the graph and deposits pheromone on the edges they passed. As showed in Fig. 1 and Fig. 2, in each step, the artificial ant selects the next vertex according to the acceptance weight in digraph and some heuristic information. The pheromone on each edge of the digraph will be updated with the artificial ants' adaptive movements, and some adaptive strategies are also presented to speed up the clustering progress.
Strong component analysis
The more similar the objects are, the higher the quantity of pheromone may be deposited on the edge between their vertexes. To make full use of the quantity of pheromone on each edge, we omit some connections whose pheromone value is less than a certain threshold to get a new digraph, and the strong connected components of the new digraph forms the finial clusters. This way, the initial objects are separated into a few clusters by the ant subcolony. Finally these clusters obtained by the ants are used to construct the phylogenetic trees progressively.
Optimizing the phelogenetic trees
Artificial ants in the same subcolony try to construct an independent phylogenetic tree as a solution of the problem by their cooperation; and different subcolonies construct different trees so as to maintain diversity of candidates. After optimizing these trees, the performance of these solutions is improved. Meanwhile, the pheromones on the edges of high fitness valued trees are increased to strengthen the ants' clustering process.
The phylogenetic tree construction method showed in this paper is tested to compare its results with that of GA, experimental results show that our algorithm is easier to implement and more efficient. Comparing to GA, it can converge much faster and obtain higher solution quality.
Discussion
Reconstruction of the phylogeny is one of the most important problems in evolutionary study, which is very difficult for large data sets in macromolecular databases. The number of possible phylogenetic trees is exponentially large and the space of topologies cannot be searched exhaustively. Even heuristic searches can be very slow in this case, especially when computationally intensive optimality criteria such as maximum likelihood (ML) are used.
An exhaustive search for the ML topology is usually computationally prohibitive for more than 20 taxa (species). At the same time, the clustering approach for a phylogeny inference is advantageous for a number of reasons, including the ability to model a variety of factors affecting nucleotide sequence evolution, robustness to violations of model assumptions, and resistance to long branch attractions. The use of stochastic algorithm provides an opportunity to develop new efficient and fast methods for phylogeny analysis.
Conclusion
The proposed adaptive ant colony algorithm for phylogenetic tree construction method (AAPTC) consists of three components, including initialization, constructing phylogenetic trees through clustering, and phylogenetic tree optimization.
In the stage of initialization, a weighted digraph is built where the vertices represent the data to be clustered and the weight is the acceptance rate between the two objects it connected.
In the course of constructing the phylogenetic trees, the ants travel in the digraph and update the pheromone on the paths it passed. At each step, ants choose the next vertex according to a certain probability depending on the pheromone and the heuristic information of the edge. The digraph is first modified by omitting some edges whose pheromone value is less than a threshold, and then the strong connected components of the updated digraph are computed to form the clusters which are used to construct the phylogenetic trees.
After getting a group of phylogenetic trees, the ant colony and its pheromone feedback system act as a global optimization technique to derive the optimal topology of the phylogenetic tree.
The algorithm showed in this paper is tested using randomly generated sequences. Using the same sequences, we also test the GA method. Our experiments were implemented on Dell Precession workstation 380 with IntelP4 Hyper Threading Processor of 3.2 GHz and 800 M Front Bus Speed.
As showed in [3], the simulated data sets used are generated in two different ways. The first set of simulated data consists of trees where the topology of the tree is fixed and randomly generated branch lengths are assigned to each nodetonode connection. The resulting distance matrix is used as input for the methods to be tested. In this way four sets were generated (S11, S12, S13 and S14) which consists of distance matrices defining ancestral relationships among 24, 96, 1000, and 4000 objects, respectively. The second sets S21, S22, S23 and S24 include stochastically generated distance matrices. For these data sets, the optimal tree is not known.
In fact, AAPTC not only provides a novel clustering method to obtain a group of good initial tree topologies, but also has global optimization on these trees. In this way, the AAPTC produces an ensemble of trees of almost similar quality. Whereas the GA method cannot guarantee the topology quality since its sole initial tree topology was generated by some other clustering methods such as NJ or FITCH. The ensemble of high qualified solutions of AAPTC allows experts to decide which topology is most likely since the quality criterion (the fitness value) used does not guarantee the optimal tree topology.
Tables 1, 2, 3 and 4 show the performance comparison of the two methods. In all the tables, the performance of a method is measured by the fitness value between the original and calculated distance matrices. The number of examined trees is depicted in parentheses. For AAPTC the mean, standard deviation, highest, and lowest fitness value derived from 50 independent trials are given. The basic parameters are set as m = n ρ = 0.05 C = 10 q0 = 0.95. We also use the vertebrate dataset [5, 12] to evaluate the performance of our algorithm. Vertebrate database contains in total 832 mitochondrial proteins from 64 vertebrates. The results of the neighbour join based phylogeny and taxonomy tree is shown in [8], and ant colony based phylogeny is shown as fig 5.
Methods
Ant Colony Algorithm
Here we briefly introduce AC and its applications using TSP as an example. In the TSP, a given set of n cities has to be visited exactly once and the tour ends in the initial city. We denote the edge between city i and j as (i, j) and its distance as d_{ij} (i, j ∈ [1, n]). Let τ_{ij}(t) be the intensity of pheromone on (i, j) at time t, and use τ_{ij}(t) to simulate the pheromone of real ants. Suppose m is the total number of ants, at time t the kth ant selects from its current city i to city j according to the following probability distribution:
Where allowedk is a set of the cities can be chosen by the kth ant at city i for the next step, η_{ij} is a heuristic function which is defined as the visibility of the link between cities i and j, for instance it can defined as 1/d_{ ij }.
The relative influence of the trail information τ_{ij}(t) and the visibility η_{ij} are determined by the parameters α, β. When α = 1 and β = 0, the algorithm becomes a complete heuristic algorithm with positive feedback and when α = 0 and β = 1, it is just a traditional greedy algorithm. For every ant, its path traversing all the cities forms a solution. The intensity of pheromone is updated by Eq. (2):
τ_{ ij }(t + 1) = ρ τ_{ ij }(t)+Δτ_{ ij } (2)
Where 0<ρ<1 represents the evaporation of τ_{ij}(t) between time t and t+1, Δτ_{ij} is the increment of the pheromone on (i, j) in step t, and Δτ_{ij}^{k} is the pheromone laid by the kth ant on it, it takes different formula depending on the model used.
For example, in the most popularly used model called "ant circle system", it is given as Eq.(4).
where Q is a constant and L_{k} is the total length travelled by the k^{th} ant.
Constructing phylogenetic trees by clustering and optimization
The overall algorithm for constructing the phylogenetic trees is given below.
Begin

1
Initialization
1.1 Initialize parameters: minC, m ε, γ, α, β;
1.2 Initialize the pheromone digraph;
1.3 For each ant in each subcolony do
Chooses an initial object to visit randomly;
End for

2
While (not termination) do // 500 iterations
2.1 For each subcolony do //m subcolonies
2.1.1 Set a root node rt as the ancestor of all the objects;and let CO denote the current object set, the initial value of CO equals the given object set of the problem
2.1.2 While (not termination) do
//500 interations
For each ant k in current subcolony do
//m ants
While (allowed_{k} not empty) do
Compute probability function p;
Select the next object j to visit;
allowedk =allowedk { j };
End while
Reset allowedk= CO;
End for
Have local pheromone updating on each edge in the digraph according to the evolutionary distance between objects;
Adaptively update the parameters of α, β;
End while
2.1.3 Transfer the pheromone digraph to another digraph by omitting the edges whose pheromone value is less thanε ; find out the strongest connected components of the updated digraph as clusters. Join the small clusters with the nearest cluster till there are two clusters left, we denote them as clu1 and clu2;
2.1.4 let clu1 and clu2 be the internal node to denote the children of the root node rt
2.1.5 If the number of objects in clu1 is lower than 1 let rt=clu1, CO = { clu1} break;
2.1.6 If the number of objects in clu2 is lower than 1 let rt=clu1, CO = { clu2} break;
2.2 obtain each phylogenetic tree constructed by each subcolony
End for
2.3 Calculate the fitness value of each subcolony
2.4 Have crossover and mutation operation to improve the quality of the trees
2.5 Have global pheromone updating operation according to the fitness value of the constructed phylogenetic trees
End while

3
Output the phylogenetic trees constructed by the colony
End
In the while loop between line 2 and line 3, based on τ_{ave}, the parameter of threshold ε could be defined as ε = γ*τ_{ave}, where γ is a constant. The population size of the ant colony m is normally equal to n/2, here n is the number of the given objects. The value of parameters α and β are subject to be adjusted adaptively in process of the algorithm. In line of 2.4, the crossover and mutation operation are executed by branch moving, swapping techniques introduced in [13].
Initialization of the Pheromone Diagraph
The initialization stage of the algorithm constructs a weighted digraph with the vertexes representing the given objects and the weighted edges between vertexes representing the acceptance weight between the two objects it connected. The acceptance weight between two objects can be calculated from the evolutionary distance between the objects.
Definition 1: The set of objects
A set of n objects is defined as S=(CO, RT) where CO = {object_{1}, object_{2},...object_{ n }} represents the object set, and rt is the ancestor of all the objects in CO.
A similarity or evolutionary distance is often obtained by pairwise comparisons of DNA or protein sequences. The measurements of the evolutionary distance can be classified into the following three categories: the first type usually let the number of homologous genes divided by the total number of genes, or its variants be the evolutionary distance [14]; in the second kind, regularities identified in genetic sequences by compression algorithms are used to represent biological significance for evolutionary history, but these data compression based methods often involve of aggregated errors [15]; in the third category, the evolutionary distance is measured by string composition based on the singular value decomposition (SVD) of a string frequency matrix [13], or on the composition vector on short strings of a fixed length or the information discrepancy on short strings of a fixed length [1, 2]. In this paper, we use the cosine distance introduced in [4, 5] as the evolutionary distance between the objects.
Definition 2: The evolutionary distance between objects
The evolutionary distance d(object_{ i }, object_{ j }) between object_{i} and object_{j} is defined as:
Here, C (object_{i}, object_{j}) is the cosine of the angle between vector i and vector j defined in [3, 4].
Definition 3: The mean distance and the shortest distance
We use d_{ mean }(object_{ i }) to denote the mean distance from object_{i} to all the other objects, namely
We also denote the shortest distance from object_{ i }to all the other objects as d_{min}(object_{ i }).
Definition 4: The acceptance weights
For two objects object_{ i }and object_{ j }the acceptance weights for object_{ i }to object_{ j }is defined as Eq.(8):
Similarly, the acceptance weights for object_{ j }to object_{ i }is as Eq.(9):
From the definition we can see that the more similar two objects are, the greater acceptance weight to each other will be. We also can see that acceptance weight between two objects is not symmetric, namely, normally
accept_{j} (object_{i}) ≠ accept_{i}(object_{j}) (10)
According to the definitions above, we could form a weighed digraph where each vertex represents an object. Denote the weight of the directed edge from object_{i} to object_{j} as τ_{ ij }(0). This value will be updated according to the pheromone deposited by the ants passing it. Its initial value τ_{ ij }(0) is set as the acceptance weight:
τ_{ ij }(0) = accept_{ i }(object_{ j }) (11)
In traditional ant colony algorithm, pheromone on all edges is usually initialized as zero. This is not helpful for ants to choose path at the early stages. However, in AAPTC, the proposed initial pheromone value set on the digraph is much important for ants' latter movements, that is to say it can make great influence on the initial topology of the phylogenetic trees. Based on this initial value, in the latter stages the ants will update this pheromone digraph for the construction and optimization of the phylogenetic trees.
Heuristic function
The heuristic function ηij in Eq.(1) is a problem dependent function that measures the "quality" of the edge (i, j) which connects the vertexes i and j representing the two objects. Here the "quality" means the preference of the edge to be selected by the ants. Obviously, the less distance between the two connected objects, the more preferred the edge should have. Therefore, η_{ij} should be associated with the distance between objects. So it is given by the following formula.
η_{ ij }= 1/d (object_{ i },object_{ j }) (12)
Different from pheromone τ_{ij}, η_{ij} is static and unidirectional heuristic information determined by the distance information.
Pheromone Updating
In the algorithm, based on the following formula, pheromone on edge (i,j) is updated on the paths the ants just passed after each iteration.
Here constant ρ ∈ (0,1) is the coefficient of evaporation. At an individual iteration the pheromone on each path will be evaporated by a rate of ρ.
In the local updating period,
is the increment of τ_{ij} by ant k, which is defined by Eq.(14)
Here Q is a constant. From the formulas above, it is easy to see that the more ants pass through an edge, the more pheromone deposited on it, and the more probability for the two vertexes connected by the edge to be included in the same strong connected component of the weighted digraph constructed in the third stage of the algorithm. In the global updating period,
is the increment of τ_{ ij }by subcolony k, which is defined as follows :
Here, Tree_{k} denotes the phylogenetic tree constructed by subcolony_{k}, fitness(subcolony_{k}) is the fitness value of Tree_{k}. According to [9], once the topology and the branch lengths of Treek are determined, a new distance matrix can be deduced. By comparing this distance matrix D^{k}with the original distance matrix D (calculated from the given objects), a quality measurement showed in Eq.(16) can be assigned to the tree as its fitness value:
The summation extends over all n(n1)/2 distances between the n objects. If the distances are concentrated within a narrow scope, a high fitness value will be assigned to the tree, and if the reconstructed distance equal the original distances, the fitness will reach the highest value C. By global pheromone updating, the pheromone deposited on the edges of high fitness valued trees will be much higher than others, thus the objects connected by these edges can hardly be separated by the ants during the clustering process.
Updating Parameters
The second stage of the algorithm consists the step of updating the value of α, β which are the parameters of the Eq.(1) which is the probability distribution for the ant's selecting the next vertex. In Eq.(1), parameters α, β determine the relative influence of the trail strength τ_{ ij }and the heuristic information η_{ ij }. At the initial stage of the algorithm, the pheromone value on each edge is relatively small. To speedup the convergence, the ants should select the path mainly according to the heuristic information η_{ ij }. Therefore, the value of α should be relatively large in this stage. After some iteration, the pheromone values on the edges are increased, their influence become more and more important. Therefore the value of β should be relatively large. Since the adjustment of the values of α and β should be based on the strength of pheromone on the edges In Eq.(17) we define τ_{ave} as the average amount of pheromone on the pheromone digraph and in Eq.(18) define δ as the pheromone distributing weight to measure the distribution of pheromone on the graph.
Using the pheromone distributing weight δ, the algorithm updates the value of α, β as follows:
α = log ^{1+δ} (19)
Once the pheromone digraph is updated, α and β will be adaptively modified by the pheromone distributing weight and make influence on the effect of pheromone and heuristic function. By adjusting the value of α, β adaptively, the algorithm can accelerate the convergence and also can avoid local convergence and precocity. Therefore, this adaptive procedure is much important for AAPTC. Furthermore, since the amount of pheromone is an important measure for tree construction, the pheromone distributing weight δ is also a critical factor to terminate the iterations of the algorithm.
References
Hodge T, Cope MJTV: A Myosin Family Tree. Journal of Cell Science 2000, 113: 3353–3354.
Saitou N, Nei M: The neighborjoining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4: 406–425.
Reijmers TH, Wehrens R, Daeyaert FD, Lewi PJ, et al.: Using genetic algorithms for the construction of phylogenetic trees: application to Gprotein coupled receptor sequences. Biosystems 1999, 49: 31–43.
Hao B, Qi J: Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance. Proceedings of the 2003 IEEE Bioinformatics Conference 2003, 375–385.
Xiaomeng W, Xiufeng W, Gang W, Dong X, Guohui L: Phylogenetic Analysis Using Complete Signature Information of Whole Genomes and Clustered NeighborJoining Method. International Journal on Bioinformatics Research and Applications 2006, 2(3):219–248.
Wu X, Wan XF, Xu D, Lin GH: Whole genome phylogeny based on clustered signature string composition. Posters in 2005 IEEE Computational Systems Bioinformatics Conference (CSB2005) 2005, 53–54.
Chen X, Wong SK, Li M: A compression algorithm for DNA sequences and its applications in genome comparison. In proceedings of the sixth annual international computing and combinatorics conference (RECOMB). ACM press; 2000:107–117.
Grumbach S, Tahi F: A new challenge for compression algorithms: genetic sequences. Journal of Information Processing Management 1994, 30: 866–875.
Hao B: Fractals from genomesexact solutions of a biology inspired problem. Physica 2000, A282: 225–246.
Dorigo M, Maniezzo V, Colomi A: Ant system: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and CyberneticsPart B 1996, 26: 29–41.
Ling C, Jie S, Ling Q: An Adaptive Ant Colony Algorithm Based on Equilibrium of Distribution. Journal of Software 2003, 14: 1148–1151.
Stuart G, Moffett K, Bozarth RF: A comprehensive vertebrate phylogeny using vector representation of protein sequences from whole genomes. Molecular Biology and Evolution 2002, 19: 554–562.
Kuntz P, Snyder D: New results on antbased heuristic for highlighting the organization of large graphs. In Proceedings of the 1999 Congress or Evolutionary Computation. IEEE Press, Piscataway, NJ; 1999:1451–1458.
Herniou E, et al.: Use of Whole genome sequence data to infer baculovirus phylogency. Journal of virology 2001, 75: 8117–8126.
Grumbach S, Tahi F: A new challenge for compression algorithms: genetic sequences. Journal of Information Processing Management 1994, 30: 875–866.
Acknowledgements
This paper is supported in part by a US Department of Energy ECPI grant, the Chinese National Natural Science Foundation under grant No. 60673060, Chinese National Foundation for Science and Technology Development under contract 2003BA614A14, and the Natural Science Foundation of Jiangsu Province under contract BK2005047.
This article has been published as part of BMC Bioinformatics Volume 7, Supplement 4, 2006: Symposium of Computations in Bioinformatics and Bioscience (SCBB06). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/7?issue=S4.
Author information
Authors and Affiliations
Corresponding author
Additional information
Authors' contributions
LQ conceived, designed and performed the study under the supervision of LC and YP; LQ and YC (Yixin Chen) collected and analyzed the data; LQ wrote the computer code; LQ and YC designed algorithms; LQ and LC wrote the manuscript; All authors have read and approved the final manuscript.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Qin, L., Chen, Y., Pan, Y. et al. A novel approach to phylogenetic tree construction using stochastic optimization and clustering. BMC Bioinformatics 7 (Suppl 4), S24 (2006). https://doi.org/10.1186/147121057S4S24
Published:
DOI: https://doi.org/10.1186/147121057S4S24