 Research article
 Open Access
Shortest triplet clustering: reconstructing large phylogenies using representative sets
 Le Sy Vinh^{2} and
 Arndt von Haeseler^{1, 2}
https://doi.org/10.1186/1471-2105-6-92
© Sy Vinh and von Haeseler; licensee BioMed Central Ltd. 2005
 Received: 29 November 2004
 Accepted: 08 April 2005
 Published: 08 April 2005
Abstract
Background
Understanding the evolutionary relationships among species based on their genetic information is one of the primary objectives in phylogenetic analysis. Reconstructing phylogenies for large data sets is still a challenging task in Bioinformatics.
Results
We propose a new distance-based clustering method, the shortest triplet clustering algorithm (STC), to reconstruct phylogenies. The main idea is the introduction of a natural definition of so-called k-representative sets. Based on k-representative sets, shortest triplets are reconstructed and serve as building blocks for the STC algorithm, which agglomerates sequences for tree reconstruction in O(n^{2}) time for n sequences.
Simulations show that STC gives better topological accuracy than the other tested methods that also build a first starting tree. STC thus appears to be a very good method for starting the tree reconstruction. However, all tested methods give similar results if balanced nearest neighbor interchange (BNNI) is applied as a post-processing step. BNNI leads to an improvement in all instances. The program is available at http://www.bi.uniduesseldorf.de/software/stc/.
Conclusion
The results demonstrate that the new approach efficiently reconstructs phylogenies for large data sets. We found that BNNI boosts the topological accuracy of all methods, including STC; therefore, one should use BNNI as a post-processing step to obtain better topological accuracy.
Keywords
 Path Length
 Cluster Algorithm
 Average Path Length
 Stochastic Error
 Large Simulation
Background
Reconstructing the evolutionary relationships among species based on their genetic information is one of the primary objectives in phylogenetic analysis. In recent years, numerous heuristics to reconstruct phylogenies for large data sets have been proposed [1–11]. In addition, parallel tree-reconstruction programs have been implemented [12–15].
To date, distance-based methods introduced by Cavalli-Sforza and Edwards [16] and by Fitch and Margoliash [17] appear most appropriate to reconstruct phylogenies based on thousands of sequences. These methods are a compromise between computational speed and topological accuracy [1, 3, 5–7] and typically run in O(n^{3}) time for n sequences [1, 3, 5] or in O(n^{2}) time for recently suggested approaches [6, 7]. Clustering algorithms form a major class of distance-based methods [18]. They do not have an explicit objective function that needs to be optimized. Rather, they group sequences (or taxa) iteratively to reconstruct a distance-based phylogenetic tree. UPGMA is a popular method to infer phylogenies under the constraint that a molecular clock is imposed on the evolutionary process. Other clustering approaches have been proposed to relax the molecular clock assumption [1, 3, 5, 19–21].
One attempt to boost the accuracy and to reduce the computational burden is the introduction of k-representative set concepts [10, 11]. k-representative sets consist of at most k elements but retain the most important information from the whole sets. In this paper, we extend our original approach [10] by introducing a more natural k-representative set concept. In a nutshell, representative sets are regarded as components to construct shortest triplets, each of which comprises three closely related sequences from three k-representative sets. The collection of shortest triplets serves as the building block for a new distance-based clustering method called the shortest triplet clustering algorithm (STC).
Results
Simulations were run on a PC cluster with 16 nodes. Each node has two 1.8 GHz processors and 2 GB RAM. Seq-Gen [22] was used to evolve sequences along trees using the Kimura two-parameter model [23] with a transition/transversion ratio of 2.0. We generated 100 simulated data sets of 500 sequences each with sequence lengths of 500, 1000 and 2000 nucleotides (nt), respectively. As one model tree, we used the rbcl gene tree with diameter 0.36 substitutions per site as inferred from an alignment of 500 rbcl genes [10]. We call this the rbcl-simulation.
In a second experiment, the so-called large simulation, tree topologies were drawn from the Yule-Harding distribution [24], and edge lengths were drawn from an exponential distribution and subsequently rescaled such that the mean diameter of the tree was either 0.1, 0.5, 1.0, or 1.5. For each value of the diameter we generated 100 trees with 1000 sequences and 100 trees with 5000 sequences. Thus, a total of 800 trees were used.
Finally, we tested the accuracy and runtime of STC and compared it with six other commonly used distance-based methods. More specifically, we investigated the performance of the Neighbor-Joining method (NJ) [1] implemented in PAUP* 4.0 [25], BIONJ [3], Weighbor 1.2 [5], Harmony Greedy Triplet and Four Point Condition (HGT/FP) [7], as well as Greedy Minimum Evolution (GME) and Balanced Minimum Evolution (BME) [6]. Unfortunately, no distance-based program is available for the disk-covering method [4]. All methods were combined with DNADIST version 3.5 [26], and pairwise distances were corrected for multiple hits according to the model used in the simulation. Moreover, we examined the performance of all methods when the balanced nearest neighbor interchange (BNNI) [6] is used as a post-processing step.
Further, to illustrate the performance of STC, we reanalyzed the 96-taxon alignments of sequence length 500 nt that were analyzed in [6] and are available at http://www.lirmm.fr/~guindon/simul/. The 6000 trees were split into three groups called "slow" (0.2 substitutions per site), "moderate" (0.4 substitutions per site) and "fast" (1.0 substitutions per site). We call this the reanalyzed simulation.
The accuracy of a tree reconstruction method for a simulated data set is measured by the Robinson and Foulds (RF) distance [27] between the inferred tree and the model tree used to generate the data set. The RF distance between two trees is the number of bipartitions present in one of the two trees but not the other, divided by the number of possible bipartitions. Thus, the smaller the RF distance between two trees, the closer their topologies are. In other words, the smaller the RF distance between the inferred tree and the model tree, the higher the topological accuracy of the tree reconstruction method.
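To make the accuracy measure concrete, the following is a minimal sketch of a normalized RF distance. It assumes each tree is supplied as a list of its nontrivial bipartitions (one side of each split as a set of leaf names) and that both trees are binary with at least four leaves, so that the number of possible bipartitions is taken to be 2(n - 3). The function names `canonical` and `rf_distance` and the five-leaf example are illustrative and are not taken from the STC program.

```python
def canonical(side, leaves):
    """Name a bipartition by the side that does not contain the smallest leaf."""
    side, leaves = frozenset(side), frozenset(leaves)
    return leaves - side if min(leaves) in side else side


def rf_distance(splits_a, splits_b, leaves):
    """Normalized RF distance between two trees given as nontrivial splits."""
    a = {canonical(s, leaves) for s in splits_a}
    b = {canonical(s, leaves) for s in splits_b}
    # Assumption: both trees are binary, so each has n - 3 nontrivial splits.
    possible = 2 * (len(leaves) - 3)
    return len(a ^ b) / possible


leaves = ["A", "B", "C", "D", "E"]
model_tree = [{"A", "B"}, {"D", "E"}]      # topology ((A,B),C,(D,E))
inferred_tree = [{"A", "C"}, {"D", "E"}]   # topology ((A,C),B,(D,E))
print(rf_distance(model_tree, inferred_tree, leaves))  # 0.5
```

Here one of the two splits of each tree is missing from the other, so two of the four possible bipartitions differ.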
In the following, we discuss the results of the rbcl-simulation, the large simulation and the reanalyzed simulation.
rbcl-simulation
The average Robinson and Foulds distance over 100 simulated data sets of 500 sequences each with sequence lengths 500, 1000 and 2000 nt (rbcl-simulation). Methods are used without BNNI.
| sequence length (nt) | NJ | BIONJ | Weighbor | HGT/FP | GME | BME | STC (k = 5) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 500 | .190 | .188 | .194 | .512 | .240 | .184 | .177 |
| 1000 | .100 | .098 | .099 | .409 | .144 | .096 | .088 |
| 2000 | .049 | .048 | .050 | .313 | .082 | .046 | .040 |
The average Robinson and Foulds distance over 100 simulated data sets of 500 sequences each with sequence lengths 500, 1000 and 2000 nt (rbcl-simulation). Methods are used with BNNI.
| sequence length (nt) | NJ | BIONJ | Weighbor | HGT/FP | GME | BME | STC (k = 5) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 500 | .162 | .162 | .162 | .166 | .163 | .163 | .162 |
| 1000 | .079 | .079 | .079 | .079 | .080 | .079 | .079 |
| 2000 | .035 | .035 | .035 | .036 | .036 | .035 | .035 |
Large simulation
The average Robinson and Foulds distance over 100 simulated data sets of 1000 taxa for each tree diameter 0.1, 0.5, 1.0 and 1.5 and with sequence length 1000 nt (large simulation). Methods are used without BNNI.
| number of sequences (tree diameter) | NJ | BIONJ | HGT/FP | GME | BME | STC (k = 5) |
| --- | --- | --- | --- | --- | --- | --- |
| 1000 (0.1) | .146 | .146 | .378 | .168 | .143 | .139 |
| 1000 (0.5) | .093 | .089 | .193 | .126 | .075 | .066 |
| 1000 (1.0) | .094 | .090 | .188 | .132 | .074 | .062 |
| 1000 (1.5) | .097 | .091 | .182 | .138 | .073 | .061 |
The average Robinson and Foulds distance over 100 data sets of 5000 taxa for each tree diameter 0.1, 0.5, 1.0 and 1.5 and with sequence length 1000 nt (large simulation). Methods are used without BNNI.
| number of sequences (tree diameter) | NJ | BIONJ | HGT/FP | GME | BME | STC (k = 5) |
| --- | --- | --- | --- | --- | --- | --- |
| 5000 (0.1) | .178 | .179 | .442 | .207 | .173 | .170 |
| 5000 (0.5) | .109 | .105 | .210 | .156 | .084 | .072 |
| 5000 (1.0) | .107 | .102 | .192 | .155 | .073 | .064 |
| 5000 (1.5) | .112 | .106 | .188 | .164 | .072 | .063 |
The average Robinson and Foulds distance over 100 simulated data sets of 1000 taxa for each tree diameter 0.1, 0.5, 1.0 and 1.5 and with sequence length 1000 nt (large simulation). Methods are used with BNNI.
| number of sequences (tree diameter) | NJ | BIONJ | HGT/FP | GME | BME | STC (k = 5) |
| --- | --- | --- | --- | --- | --- | --- |
| 1000 (0.1) | .137 | .137 | .137 | .137 | .137 | .138 |
| 1000 (0.5) | .061 | .061 | .061 | .061 | .061 | .061 |
| 1000 (1.0) | .057 | .057 | .057 | .057 | .057 | .056 |
| 1000 (1.5) | .055 | .055 | .055 | .055 | .055 | .055 |
The average Robinson and Foulds distance over 100 data sets of 5000 taxa for each tree diameter 0.1, 0.5, 1.0 and 1.5 and with sequence length 1000 nt (large simulation). Methods are used with BNNI.
| number of sequences (tree diameter) | NJ | BIONJ | HGT/FP | GME | BME | STC (k = 5) |
| --- | --- | --- | --- | --- | --- | --- |
| 5000 (0.1) | .168 | .168 | .168 | .168 | .168 | .168 |
| 5000 (0.5) | .066 | .066 | .066 | .066 | .066 | .066 |
| 5000 (1.0) | .057 | .057 | .057 | .057 | .057 | .057 |
| 5000 (1.5) | .055 | .055 | .055 | .055 | .055 | .055 |
Reanalyzed simulation
The average RF distance for the 96-taxon alignments of sequence length 500 nt that were analyzed in [6]. The 6000 trees were split into three groups called "slow" (0.2 substitutions per site), "moderate" (0.4 substitutions per site) and "fast" (1.0 substitutions per site). Except for STC, the accuracies of the methods were taken from [6]. Methods are used without BNNI.
| number of sequences (group) | NJ | BIONJ | Weighbor | HGT/FP | GME | BME | STC (k = 5) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 96 (slow) | .183 | .180 | .178 | .512 | .199 | .186 | .179 |
| 96 (moderate) | .136 | .134 | .129 | .480 | .158 | .137 | .125 |
| 96 (fast) | .115 | .112 | .103 | .465 | .144 | .117 | .102 |
The average RF distance for the 96-taxon alignments of sequence length 500 nt that were analyzed in [6]. The 6000 trees were split into three groups called "slow" (0.2 substitutions per site), "moderate" (0.4 substitutions per site) and "fast" (1.0 substitutions per site). Except for STC, the accuracies of the methods were taken from [6]. Methods are used with BNNI.
| number of sequences (group) | NJ | BIONJ | Weighbor | HGT/FP | GME | BME | STC (k = 5) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 96 (slow) | .173 | .173 | .173 | .175 | .173 | .173 | .173 |
| 96 (moderate) | .119 | .118 | .118 | .123 | .118 | .118 | .116 |
| 96 (fast) | .090 | .090 | .091 | .098 | .091 | .090 | .090 |
Another look at the performance
The percentage of cases where STC is at least as good as the other tested methods in terms of RF distance. The number in parentheses is the percentage of cases where STC is equally good as the other tested method. Methods are used without BNNI. (n.d.: not determined.)
| number of sequences (sequence length) | NJ | BIONJ | Weighbor | HGT/FP | GME | BME |
| --- | --- | --- | --- | --- | --- | --- |
| 96 (500 nt) | 68 (16) | 65 (15) | 57 (16) | 100 (0) | 73 (10) | 70 (14) |
| 500 (500 nt) | 80 (4) | 76 (4) | 88 (3) | 100 (0) | 100 (0) | 66 (1) |
| 500 (1000 nt) | 88 (3) | 79 (4) | 84 (4) | 100 (0) | 100 (0) | 83 (6) |
| 500 (2000 nt) | 92 (6) | 90 (4) | 92 (3) | 100 (0) | 100 (0) | 86 (9) |
| 1000 (1000 nt) | 95 (2) | 95 (1) | n.d. | 100 (0) | 100 (0) | 89 (15) |
| 5000 (1000 nt) | 100 (0) | 99 (0) | n.d. | 100 (0) | 100 (0) | 95 (1) |
The percentage of cases where STC is better than the other tested methods in terms of RF distance. The number in parentheses is the percentage of cases where STC is worse than the other tested method. Methods are used with BNNI. (n.d.: not determined.)
| number of sequences (sequence length) | NJ | BIONJ | Weighbor | HGT/FP | GME | BME |
| --- | --- | --- | --- | --- | --- | --- |
| 96 (500 nt) | 9 (8) | 8 (8) | 10 (10) | 12 (10) | 10 (8) | 10 (9) |
| 500 (500 nt) | 34 (37) | 35 (39) | 35 (36) | 59 (29) | 46 (33) | 41 (39) |
| 500 (1000 nt) | 22 (19) | 17 (23) | 18 (22) | 23 (28) | 30 (20) | 24 (20) |
| 500 (2000 nt) | 10 (13) | 8 (7) | 10 (8) | 9 (8) | 12 (10) | 7 (10) |
| 1000 (1000 nt) | 30 (28) | 27 (29) | n.d. | 28 (22) | 30 (24) | 28 (27) |
| 5000 (1000 nt) | 48 (40) | 42 (44) | n.d. | 45 (45) | 52 (37) | 43 (43) |
Conclusion
We have presented k-representative sets, which allow us to design a fast and accurate method to reconstruct phylogenies from large data sets with 1000 or more taxa. Simulations show that STC gives better results than the other tested methods in terms of topological accuracy. However, if BNNI is introduced as a subsequent optimization step, the differences in performance disappear and all methods show more or less the same accuracy. Thus, one should apply BNNI to improve the topological accuracy.
The time to reconstruct a tree of up to 1000 sequences is not really an issue for any of the tested distance-based methods, with the exception of Weighbor. Weighbor needed about 19 minutes to reconstruct a tree with 500 sequences; thus it is only applicable to data sets with up to a few hundred sequences. For data sets with up to 1000 sequences, the remaining methods needed less than one minute to output a tree, so the runtime differences between them are negligible. For data sets with 5000 sequences, STC, GME, HGT/FP and BME with BNNI took about 2.0, 2.5, 3.0 and 3.5 minutes, respectively, to reconstruct a tree. NJ and BIONJ with BNNI were slower and took approximately six minutes to output a tree. In short, the combination of STC and BNNI efficiently reconstructs trees for large data sets in terms of both topological accuracy and runtime.
Methods
In this section we introduce a new clustering algorithm to reconstruct phylogenies based on distance matrices.
Additive distances
Let S = {s_{1}, s_{2},..., s_{n}} be a set of n objects (typically contemporary sequences/taxa), and let D = (D(uv)) be a distance matrix, where D(uv) is the distance between two objects u and v.
Definition 1
The distance matrix D is additive if and only if it satisfies the four-point condition [28]: for any quartet {u, v, w, x},
D(uv) + D(wx) ≤ max{D(uw) + D(vx), D(ux) + D(vw)}.
In this case, the objects s ∈ S are related by a tree T = (V, E), where V is the set of vertices such that S ⊂ V and E ⊆ {{v_{1}, v_{2}} | v_{1}, v_{2} ∈ V} is the set of edges. A vertex with one adjacent edge is called a leaf; all other vertices are called internal nodes. We let L ⊂ V be the leaf set of the tree T. Note that we typically require L ⊆ S in the phylogenetic setting.
If D is additive, then there exists a map φ: S → V and a length function ℓ: E → ℝ^{+} such that
D(uv) = Σ_{e ∈ p(φ(u), φ(v))} ℓ(e) = D_{T}(φ(u), φ(v))
for all u, v ∈ S, where p(φ(u), φ(v)) is the unique path connecting φ(u) and φ(v) in T and D_{T} denotes the distances between vertices in T (cf. [29]). ℓ(e) is called the edge length of the edge e. To avoid unnecessary complication, we consider only one-to-one maps from S onto the leaf set L of T. If D is additive, the reconstruction of the tree T and of ℓ is trivial. If D is not additive, methods are available that try to fit a tree T to D with respect to an objective function (cf. [30]). Thus, in the following we consider arbitrary distance matrices and we want to reconstruct a tree together with a length function ℓ.
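As an illustration of Definition 1, the sketch below tests the four-point condition on a small matrix. It assumes D is a symmetric list of lists and uses the fact that the condition holds for every labelling of a quartet exactly when the two largest of the three pairwise sums are equal. The function name `is_additive`, the tolerance `tol` and the example tree (a quartet ((a,b),(c,d)) with hypothetical edge lengths 1, 2, 3, 4 and an internal edge of length 5) are illustrative choices.

```python
from itertools import combinations


def is_additive(D, tol=1e-9):
    """Check the four-point condition for every quartet of objects."""
    for u, v, w, x in combinations(range(len(D)), 4):
        sums = sorted([D[u][v] + D[w][x],
                       D[u][w] + D[v][x],
                       D[u][x] + D[v][w]])
        # The two largest sums must coincide (up to tol) for additivity.
        if sums[2] - sums[1] > tol:
            return False
    return True


# Path-length distances of the quartet tree ((a,b),(c,d)); order: a, b, c, d.
D = [[0, 3, 9, 10],
     [3, 0, 10, 11],
     [9, 10, 0, 7],
     [10, 11, 7, 0]]

perturbed = [row[:] for row in D]
perturbed[0][2] = perturbed[2][0] = 9.5  # simulated stochastic error

print(is_additive(D), is_additive(perturbed))  # True False
```

The perturbed matrix mimics an estimated distance matrix, which is the case the remainder of this section is concerned with.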
Estimating edge lengths using triplets
We consider a subset X of S. Then φ induces a map onto a subtree of T such that the relationships of the objects in X are displayed by the subtree with leaf set φ(X). We call the complement S_{0}(X) = S − X the unclassified object set, because the relationships of the objects in S_{0}(X) to X are not known from the subtree. Note that we will use S_{0} instead of S_{0}(X) if X is clear from the context.
Let T_{r} = (V_{r}, E_{r}) denote a rooted tree with root r and leaf set L_{r}, and let S_{r} be a subset of S such that φ(S_{r}) = L_{r}. For convenience, we use S_{r} and L_{r} interchangeably.
Now consider a rooted tree T_{r} with the inferred tree-like metric D_{T_{r}}. The rooted tree T_{r} consists of two rooted subtrees T_{r_{1}} and T_{r_{2}} (see Figure 4b). For convenience, we will use T_{i} instead of T_{r_{i}} if r_{i} is clear from the context. The leaf set is S_{r} = S_{1} ∪ S_{2}, where S_{r} ⊂ S, and S_{0} = S − S_{r} is not represented in T_{r}. Then, for each triplet (s_{0}, s_{1}, s_{2}) ∈ S_{0} × S_{1} × S_{2}, we can compute estimates of the lengths ℓ(r_{1}r) and ℓ(r_{2}r) of the two edges adjacent to the root r.
Note that, if D is additive and T_{1}, T_{2} are isometric subtrees of T, the lengths ℓ(r_{1}r) and ℓ(r_{2}r) do not depend on the choice of the triplet (s_{0}, s_{1}, s_{2}).
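For reference, the standard three-point relations on which such triplet estimates rest can be written as follows. This is a sketch under the assumption that the per-triplet lengths are obtained from the usual three-point formula and that the path length inside T_1 is measured by a tree metric written here as D_{T_1}; it is not necessarily the exact form of the equations used in the article.

```latex
\begin{align*}
\ell(s_0, \mathrm{mrca} \mid s_0 s_1 s_2)
  &= \tfrac{1}{2}\bigl(D(s_0 s_1) + D(s_0 s_2) - D(s_1 s_2)\bigr),\\
\ell(s_1, \mathrm{mrca} \mid s_0 s_1 s_2)
  &= \tfrac{1}{2}\bigl(D(s_1 s_0) + D(s_1 s_2) - D(s_0 s_2)\bigr),\\
\ell(r_1 r \mid s_0 s_1 s_2)
  &= \ell(s_1, \mathrm{mrca} \mid s_0 s_1 s_2) - D_{T_1}(s_1 r_1).
\end{align*}
```

The expression for the edge r_2 r follows by exchanging the roles of s_1 and s_2. If D is additive, the first two lines are exact distances to the branching point of the triplet, which is why the resulting edge lengths are then independent of the chosen triplet.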
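Regardless of additivity considerations, we may define the average length for a fixed s_{0} ∈ S_{0} by averaging the triplet-based lengths over all pairs (s_{1}, s_{2}) ∈ S_{1} × S_{2} (Equation 4).
We can then estimate the edge lengths ℓ(r_{1}r) and ℓ(r_{2}r) by averaging over all possible triplets (s_{0}, s_{1}, s_{2}) ∈ S_{0} × S_{1} × S_{2} (Equation 5).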
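The averaging over all triplets can be sketched as follows. The sketch assumes the standard three-point formula for each triplet, a symmetric distance matrix D given as a list of lists, and dictionaries `depth1` and `depth2` (hypothetical names) holding the path length from each leaf to the root of its own subtree. The function name `estimate_root_edges` and the example values are illustrative.

```python
from itertools import product


def estimate_root_edges(D, S0, S1, S2, depth1, depth2):
    """Average triplet-based estimates of the two edges adjacent to the root."""
    total1 = total2 = 0.0
    count = 0
    for s0, s1, s2 in product(S0, S1, S2):
        # Three-point formula: distance from s1 (s2) to the triplet's branching point.
        to_mrca_1 = 0.5 * (D[s1][s0] + D[s1][s2] - D[s0][s2])
        to_mrca_2 = 0.5 * (D[s2][s0] + D[s2][s1] - D[s0][s1])
        # Subtract the part of the path that lies inside the subtree.
        total1 += to_mrca_1 - depth1[s1]
        total2 += to_mrca_2 - depth2[s2]
        count += 1
    return total1 / count, total2 / count


# Additive distances of the quartet tree ((a,b),(c,d)); order: a, b, c, d.
# Hypothetical edge lengths: a=1, b=2, c=3, d=4, internal edge=5.
D = [[0, 3, 9, 10],
     [3, 0, 10, 11],
     [9, 10, 0, 7],
     [10, 11, 7, 0]]

# T_1 is the cherry (a, b), T_2 is the single leaf c, and d is unclassified.
edges = estimate_root_edges(D, S0=[3], S1=[0, 1], S2=[2],
                            depth1={0: 1.0, 1: 2.0}, depth2={2: 0.0})
print(edges)  # (5.0, 3.0)
```

Because the example matrix is additive, both triplets give the same values, namely the internal edge length 5 and the edge length 3 of leaf c.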
Recovering a tree from a distance matrix
The largest path length criterion
We want to reconstruct a tree T = (V, E) with respect to a distance matrix D such that D_{T} represents D. To this end, we use triplets and the notation of a rooted tree T_{r} together with Equations 4 and 5.
If D is additive, ℓ(m, mrca | m S_{i} S_{j}) is exactly the path length from the most recent common ancestor (mrca) of (T_{i}, T_{j}) to m. In other words, the pair (T_{i_{0}}, T_{j_{0}}) for which the path length from the mrca to m is largest is a true neighboring pair. However, in real applications D is rarely additive. Therefore, the root m is selected so as to avoid noise from the stochastic errors involved with large distance estimates [17]. To this end, m is selected such that the distance from the farthest object to the root m is minimal:
med = argmin_{m' ∈ S}{max{D(m's_{x}) | x = 1,..., n}} (7)
med is called a median object.
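Equation 7 amounts to picking the row of D with the smallest maximum. A minimal sketch, assuming D is a symmetric list of lists indexed by object number; the function name `median_object` and the three-object example are illustrative, and ties are broken here in favor of the lowest index, which the text does not specify.

```python
def median_object(D):
    """Return the index of the object whose farthest object is closest."""
    return min(range(len(D)), key=lambda m: max(D[m]))


# Three objects on a line at positions 0, 2 and 6.
D_line = [[0, 2, 6],
          [2, 0, 4],
          [6, 4, 0]]
print(median_object(D_line))  # 1
```

Scanning all rows takes O(n^{2}) time, in line with the cost stated for this step in the complexity analysis.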
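For each rooted tree T_{i} ∈ 𝒯, where 𝒯 denotes the current set of rooted trees, its potential neighbor T_{i'} is
T_{i'} = argmax{ℓ(med, mrca | med, S_{i}, S_{j}) | T_{j} ∈ 𝒯, T_{j} ≠ T_{i}}. (8)
In the following, we present a natural clustering algorithm to reconstruct trees based on distance matrices and the largest path length criterion.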
Clustering Algorithm

Initial step: Find the median object med using Equation 7. Set 𝒯 = {T_{1},..., T_{n}} − {T_{med}}. Find for each T_{i} ∈ 𝒯 its potential neighbor T_{i'} ∈ 𝒯 using Equation 8.
Selection step (largest path length criterion): Find the neighboring pair (T_{i_{0}}, T_{j_{0}}) using Equation 9.
Agglomeration step: Combine T_{i_{0}} and T_{j_{0}} into a new rooted tree T_{i_{0}j_{0}} with root i_{0}j_{0}, and estimate the new edge lengths of T_{i_{0}j_{0}} using Equation 5. Delete T_{i_{0}} and T_{j_{0}} from 𝒯 and add T_{i_{0}j_{0}} to 𝒯. Find the potential neighbor of the new rooted tree T_{i_{0}j_{0}} using Equation 8, and replace T_{i'} for each T_{i} ∈ 𝒯 by T_{i_{0}j_{0}} if T_{i_{0}j_{0}} is its potential neighbor.
Stopping step: If |𝒯| > 1, go to the Selection step; otherwise output the tree.
This algorithm is similar to approaches described elsewhere [19–21]. However, an essential difference is that we estimate path lengths and edge lengths by using triplets.
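The steps above can be sketched in a deliberately simplified form. This sketch assumes that the path length from med to the mrca of two subtrees is the average of the three-point estimates over all leaf pairs, and that the selection step picks the pair with the globally largest value. It recomputes all pairs in every round instead of maintaining potential neighbors, omits edge lengths, and therefore does not reach the running time of the algorithm described in the text. The names `path_length` and `cluster` and the five-leaf example are illustrative.

```python
from itertools import combinations


def median_object(D):
    """Object whose farthest object is closest (ties: lowest index)."""
    return min(range(len(D)), key=lambda m: max(D[m]))


def path_length(D, med, Si, Sj):
    """Average triplet estimate of the distance from med to mrca(Si, Sj)."""
    total = sum(0.5 * (D[med][a] + D[med][b] - D[a][b]) for a in Si for b in Sj)
    return total / (len(Si) * len(Sj))


def cluster(D):
    """Agglomerate subtrees by the largest path length criterion."""
    med = median_object(D)
    # Each entry is (topology as nested tuples, list of leaves).
    trees = [(i, [i]) for i in range(len(D)) if i != med]
    while len(trees) > 1:
        i, j = max(combinations(range(len(trees)), 2),
                   key=lambda p: path_length(D, med, trees[p[0]][1], trees[p[1]][1]))
        (ti, si), (tj, sj) = trees[i], trees[j]
        merged = ((ti, tj), si + sj)
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged]
    return med, trees[0][0]


# Additive distances of the tree ((a,b),c,(d,e)); order: a, b, c, d, e.
D5 = [[0, 5, 7, 8, 11],
      [5, 0, 8, 9, 12],
      [7, 8, 0, 9, 12],
      [8, 9, 9, 0, 7],
      [11, 12, 12, 7, 0]]
print(cluster(D5))  # (3, (4, (2, (0, 1))))
```

In the example, object d (index 3) is the median object, the cherry (a, b) is joined first because its mrca is farthest from d, and the nested result corresponds to the unrooted topology ((a,b),c,(d,e)).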
Local rearrangement
The heart of the clustering algorithm is the largest path length criterion, in which the path length from the mrca of (T_{i}, T_{j}) to med is estimated by ℓ(med, mrca | med, S_{i}, S_{j}) using Equation 4. Thus, as path length we take the average of the lengths obtained from at most O(n^{2}) triplets (med, s_{i}, s_{j}) ∈ {med} × S_{i} × S_{j}. This average may not be a representative estimate of the true path length. Moreover, the root med may be too far away from the mrca, which leads to an inaccurate estimate of the path length.
For convenience, we write ℓ(S_{0}(S_{ij}) | S_{x}S_{y}) for the average path length from the unclassified object set S_{0}(S_{ij}) to the mrca of (T_{x}, T_{y}). We now use the average path lengths from Equation 10 to decide which pair of subtrees among (T_{x}, T_{y}), (T_{x}, T_{j}) and (T_{y}, T_{j}) is preferred. More specifically, if
ℓ(S_{0}(S_{ij}) | S_{x}S_{y}) ≥ max{ℓ(S_{0}(S_{ij}) | S_{x}S_{j}), ℓ(S_{0}(S_{ij}) | S_{y}S_{j})}
we stick to the suggested grouping of T_{x} and T_{y} (see Figure 6a). Otherwise, if ℓ(S_{0}(S_{ij}) | S_{x}S_{j}) or ℓ(S_{0}(S_{ij}) | S_{y}S_{j}) is larger than the remaining average path lengths, we swap T_{y} and T_{j} or T_{x} and T_{j} as displayed in Figure 6b or 6c, respectively. Note that this decision can be considered a correction of the largest path length criterion that takes all possible triplets into account. We call this correction the largest average path length criterion.
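The decision rule of the largest average path length criterion can be written as a small function. This is a sketch that assumes the three average path lengths have already been computed; the function name `choose_grouping` and the returned labels are illustrative, and the behavior when the two alternative groupings tie is an assumption, since the text does not specify it.

```python
def choose_grouping(l_xy, l_xj, l_yj):
    """Pick the preferred pair among (T_x, T_y), (T_x, T_j) and (T_y, T_j)."""
    if l_xy >= max(l_xj, l_yj):
        return "keep (T_x, T_y)"      # Figure 6a: no interchange
    if l_xj > l_yj:
        return "swap T_y and T_j"     # Figure 6b: T_x and T_j become neighbors
    # Assumption: an exact tie between l_xj and l_yj falls through to this case.
    return "swap T_x and T_j"         # Figure 6c: T_y and T_j become neighbors


print(choose_grouping(0.30, 0.10, 0.12))  # keep (T_x, T_y)
print(choose_grouping(0.10, 0.30, 0.12))  # swap T_y and T_j
```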
We now explain the preorder traversal procedure [31] to reconstruct the rooted tree T_{i} using the nearest neighbor interchange operation based on the largest average path length criterion (T_{i} is a subtree of T_{ij} = (T_{i}, T_{j})):
Preorder traversal procedure (T_{i})
Step 1: If T_{i} is a single leaf, return.
Step 2: Otherwise, T_{i} consists of two subtrees T_{x} and T_{y}. Do the nearest neighbor interchange operation around the root of T_{i} based on the largest average path length criterion (Equation 10). If T_{x} and T_{j} (or T_{y} and T_{j}) were exchanged, estimate the new edge lengths using Equation 5.
Step 3: Apply the preorder traversal procedure to the two rooted subtrees of T_{i}.
Representative sets and shortest triplets
For a set S of sequences (or taxa), the (genetic) distance matrix D is typically not additive due to stochastic errors [17]. Larger distances between two sequences are less accurately estimated. This leads to a low performance of both the clustering algorithm and the preorder traversal procedure for divergent data sets.
The second approach takes the average distance
Both approaches suffer from noise. Estimating the path length using Equation 11 may be inaccurate because it randomly picks a pair (s_{i}, s_{j}) that may not be representative. Equation 12 may be problematic because it can include long distances with large stochastic errors.
To overcome these problems, we select only the min(k, |S_{i}|) and min(k, |S_{j}|) leaves closest to the roots of T_{i} and T_{j}, respectively, with respect to the path length. To illustrate, for k = 3 we pick {1, 2} from T_{i} and {4, 5, 6} from T_{j} in Figure 7.
These leaves form the k-representative leaf sets of T_{i} and T_{j}, and the resulting path length estimate is based only on them. Now we can perform the clustering algorithm with reduced complexity. However, we also want to improve the preorder traversal procedure. The average path length from the unclassified object set S_{0}(S_{ij}) to the mrca of (T_{i}, T_{j}) is estimated by Equation 10, which also suffers from noise. To overcome this problem, we select only the min(k, |S_{0}(S_{ij})|) unclassified objects s_{0} ∈ S_{0}(S_{ij}) that are closest to the root of the tree T_{ij} with respect to their distances. We call this subset the k-representative unclassified object set of T_{ij}.
In short, the preorder traversal procedure uses only shortest triplets to estimate path lengths as well as edge lengths.
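When two subtrees are joined, the k leaves closest to the new root must come from the k closest leaves of each subtree, so the new representative set can be obtained from the two old ones. A minimal sketch, assuming each representative set is stored as a list of (path length to the subtree root, leaf) pairs and that the lengths of the two new root edges are known; the names `merge_representatives`, `edge_i` and `edge_j` and the numbers in the example are hypothetical. It uses `heapq.nsmallest` for brevity instead of the linear merge of two sorted lists that would match the O(k) bound stated in the text.

```python
import heapq


def merge_representatives(rep_i, rep_j, edge_i, edge_j, k):
    """k-representative leaf set of the tree joining T_i and T_j."""
    # Path lengths to the new root grow by the length of the new root edge.
    shifted = [(d + edge_i, leaf) for d, leaf in rep_i]
    shifted += [(d + edge_j, leaf) for d, leaf in rep_j]
    # Keep the (at most) k leaves closest to the new root, sorted by distance.
    return heapq.nsmallest(k, shifted)


rep_i = [(1.0, 1), (2.0, 2)]             # T_i has only two leaves
rep_j = [(0.5, 4), (1.5, 5), (3.0, 6)]   # three closest leaves of T_j
print(merge_representatives(rep_i, rep_j, edge_i=1.0, edge_j=0.5, k=3))
# [(1.0, 4), (2.0, 1), (2.0, 5)]
```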
Shortest triplet clustering algorithm (STC)
We now introduce the shortest triplet clustering algorithm by combining the clustering algorithm, the local rearrangement, the k-representative sets, and the shortest triplets approach.
Shortest triplet clustering algorithm (STC)
Initial step:
(i): Find the median object med using Equation 7.
(ii): Set 𝒯 = {T_{1},..., T_{n}} − {T_{med}} and set the k-representative leaf set of each T_{i} ∈ 𝒯 to {i}.
(iii): Find for each T_{i} ∈ 𝒯 its potential neighbor T_{i'} ∈ 𝒯 using Equation 8.
Selection step (largest path length criterion): Find the neighboring pair (T_{i_{0}}, T_{j_{0}}) using Equation 9.
Agglomeration step:
(i): Combine T_{i_{0}} and T_{j_{0}} into a new rooted tree T_{i_{0}j_{0}} with root i_{0}j_{0}, and estimate the new edge lengths of T_{i_{0}j_{0}} using Equation 5 based on shortest triplets.
(ii): Compute the k-representative leaf set of T_{i_{0}j_{0}} based on the k-representative leaf sets of T_{i_{0}} and T_{j_{0}}.
(iii): Compute the k-representative unclassified object set of T_{i_{0}j_{0}}.
(iv): Delete T_{i_{0}} and T_{j_{0}} from 𝒯 and add T_{i_{0}j_{0}} to 𝒯.
(v): Find the potential neighbor of the new rooted tree T_{i_{0}j_{0}} using Equation 8 based on representative sets, and replace T_{i'} for each T_{i} ∈ 𝒯 by T_{i_{0}j_{0}} if T_{i_{0}j_{0}} is its potential neighbor.
Local rearrangement step: Apply the preorder traversal procedure to the rooted subtrees T_{i_{0}} and T_{j_{0}} of the new rooted tree T_{i_{0}j_{0}} based only on shortest triplets.
Stopping step: If |𝒯| > 1, go to the Selection step; otherwise output the tree.
The complexity of STC
We now briefly describe the complexity of STC. In the initial step, (i), (ii), and (iii) are done in O(n^{2}), O(n) and O(n^{2}) time, respectively. Thus, the complexity of the initial step is O(n^{2}). The selection step is done in O(n) time. In the agglomeration step, (i), (ii), (iii), (iv), and (v) are done in O(k^{3}), O(k), O(nk^{2}), O(1), and O(nk^{2}) time, respectively. Thus, the complexity of the agglomeration step is O(nk^{2} + k^{3}). Finally, we estimate the complexity of the preorder traversal procedure based only on shortest triplets. Step 1 is done in constant time. In step 2, the nearest neighbor interchange operation around the root of T_{i} costs O(k^{3}) time, and estimating the new edge lengths is done in O(k^{3}) time. Recomputing the k-representative leaf set of T_{i} based on the k-representative leaf sets of its rooted subtrees T_{x} and T_{y} costs O(k) time. Recomputing the k-representative unclassified object set of T_{i} based on the k-representative leaf set of T_{j} and the k-representative unclassified object set of T_{ij} is also done in O(k) time. Thus, the complexity of step 2 is O(k^{3}). Step 3 is done in constant time. Steps 1, 2, and 3 are repeated O(n) times, so the complexity of the preorder traversal procedure is O(nk^{3}).
In the STC algorithm, the selection step, the agglomeration step and the local rearrangement step are repeated (n − 2) times, so the overall complexity of the STC algorithm is O(n^{2}k^{3}). In practice, we chose k = 5 as a good compromise between accuracy and computational complexity for all data sets. With k fixed in this way, the complexity of the STC algorithm is O(n^{2}).
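The individual bounds stated above can be combined into a single expression. The following is a sketch of that bookkeeping, where the symbol T_STC(n, k) for the total running time is introduced here only for illustration.

```latex
\begin{align*}
T_{\mathrm{STC}}(n,k)
  &= \underbrace{O(n^{2})}_{\text{initial step}}
   + (n-2)\Bigl[\underbrace{O(n)}_{\text{selection}}
   + \underbrace{O(nk^{2}+k^{3})}_{\text{agglomeration}}
   + \underbrace{O(nk^{3})}_{\text{local rearrangement}}\Bigr] \\
  &= O(n^{2}) + O(n^{2}k^{3}) = O(n^{2}k^{3}).
\end{align*}
```

For the fixed value k = 5, the factor k^{3} = 125 is a constant, which gives the O(n^{2}) bound.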
Declarations
Acknowledgements
We would like to express special thanks to Heiko Schmidt for his technical support. We thank Gunter Weiss, Ingo Ebersberger, Tanja Gesell and Jutta Buschbom for carefully reading the manuscript. We acknowledge the use of supercomputing resources of the ZAM/NIC at the Research Center Jülich. We thank three anonymous referees for helpful comments.
References
1. Saitou N, Nei M: The Neighbor-joining Method: A New Method for Reconstructing Phylogenetic Trees. Mol Biol Evol 1987, 4: 406–425.
2. Strimmer K, von Haeseler A: Quartet Puzzling: A Quartet Maximum-Likelihood Method for Reconstructing Tree Topologies. Mol Biol Evol 1996, 13: 964–969.
3. Gascuel O: BIONJ: An Improved Version of the NJ Algorithm Based on a Simple Model of Sequence Data. Mol Biol Evol 1997, 14: 685–695.
4. Huson DH, Nettles SM, Warnow TJ: Disk-Covering, a Fast-Converging Method for Phylogenetic Reconstruction. J Comput Biol 1999, 6: 369–386. 10.1089/106652799318337
5. Bruno WJ, Socci ND, Halpern AL: Weighted Neighbor Joining: A Likelihood-Based Approach to Distance-Based Phylogeny Reconstruction. Mol Biol Evol 2000, 17: 189–197.
6. Desper R, Gascuel O: Fast and Accurate Phylogeny Reconstruction Algorithms Based on the Minimum-Evolution Principle. J Comput Biol 2002, 9: 687–706. 10.1089/106652702761034136
7. Csürös M: Fast Recovery of Evolutionary Trees with Thousands of Nodes. J Comput Biol 2002, 9: 277–297. 10.1089/10665270252935467
8. Guindon S, Gascuel O: A Simple, Fast and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood. Syst Biol 2003, 52: 696–704. 10.1080/10635150390235520
9. Stamatakis A, Ludwig T, Meier H: RAxML-III: A fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 2005, 21: 456–463. 10.1093/bioinformatics/bti191
10. Vinh LS, von Haeseler A: IQPNNI: Moving fast through tree space and stopping in time. Mol Biol Evol 2004, 21: 1565–1571. 10.1093/molbev/msh176
11. Vinh LS, Schmidt HA, von Haeseler A: PhyNav: A Novel Approach to Reconstruct Large Phylogenies. In Proceedings of the 28th Annual German Classification Society Conference (GfKl 2004). Dortmund, Germany; 2004: in press.
12. Charleston MA: Hitch-Hiking: A Parallel Heuristic Search Strategy, Applied to the Phylogeny Problem. J Comput Biol 2001, 8: 79–91. 10.1089/106652701300099137
13. Brauer MJ, Holder MT, Dries LA, Zwickl DJ, Lewis PO, Hillis DM: Genetic Algorithms and Parallel Processing in Maximum-Likelihood Phylogeny Inference. Mol Biol Evol 2002, 19: 1717–1726.
14. Schmidt HA, Strimmer K, Vingron M, von Haeseler A: TREE-PUZZLE: Maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 2002, 18: 502–504. 10.1093/bioinformatics/18.3.502
15. Schmidt HA, von Haeseler A: Maximum Likelihood Analysis Using TREE-PUZZLE. In Current Protocols in Bioinformatics. Edited by: Baxevanis AD, Davison DB, Page RDM, Stormo G, Stein L. New York, USA: Wiley and Sons; 2003: 6.6.1–6.6.25.
16. Cavalli-Sforza L, Edwards AWF: Phylogenetic analysis: Models and estimation procedures. Am J Hum Genet 1967, 19: 233–257.
17. Fitch W, Margoliash E: Construction of Phylogenetic trees. Science 1967, 155: 279–284.
18. Hartigan JA: Clustering Algorithms. John Wiley and Sons, Inc; 1975.
19. Farris J: On the phenetic approach to vertebrate classification. 1977, 17: 823–850.
20. Klotz LC, Komar N, Blanken RL, Mitchell RM: Calculation of evolutionary trees from sequence data. Proc Natl Acad Sci USA 1979, 76: 4516–4520.
21. Li WH: Simple method for constructing phylogenetic trees from distance matrices. Proc Natl Acad Sci USA 1981, 78: 1085–1089.
22. Rambaut A, Grassly NC: Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci 1997, 13: 235–238.
23. Kimura M: A Simple Method for Estimating Evolutionary Rates of Base Substitutions through Comparative Studies of Nucleotide Sequences. J Mol Evol 1980, 16: 111–120.
24. Harding EF: The probabilities of rooted tree-shapes generated by random bifurcation. Adv Appl Prob 1971, 3: 44–77.
25. Swofford DL, Olsen GJ, Waddell PJ, Hillis DM: Phylogeny Reconstruction. In Molecular Systematics. 2nd edition. Edited by: Hillis DM, Moritz C, Mable BK. Sunderland, Massachusetts: Sinauer Associates; 1996: 407–514.
26. Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.5c. Department of Genetics, University of Washington, Seattle 1993. [http://evolution.genetics.washington.edu/phylip.html]
27. Robinson DF, Foulds LR: Comparison of phylogenetic trees. Mathematical Biosciences 1981, 53: 131–147. 10.1016/0025-5564(81)90043-2
28. Buneman P: The recovery of trees from measures of dissimilarity. In Mathematics in the archaeological and historical sciences. Edited by: Hodson, Kendall, Tautu. Edinburgh: Edinburgh University Press; 1971.
29. Semple C, Steel M: Phylogenetics. Oxford University Press; 2003.
30. Felsenstein J: Inferring Phylogenies. Sunderland, Massachusetts: Sinauer Associates; 2004.
31. Aho AV, Hopcroft JE, Ullman JD: The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company; 1974.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.