A topological transformation in evolutionary tree search methods based on maximum likelihood combining p-ECR and neighbor joining

Background Inference of evolutionary trees using the maximum likelihood principle is NP-hard. Therefore, all practical methods rely on heuristics. The topological transformations often used in heuristics are Nearest Neighbor Interchange (NNI), Subtree Prune and Regraft (SPR) and Tree Bisection and Reconnection (TBR). However, these topological transformations often fall easily into local optima, since there are not many trees accessible in one step from any given tree. Another more exhaustive topological transformation is p-Edge Contraction and Refinement (p-ECR). However, due to its high computation complexity, p-ECR has rarely been used in practice. Results To make the p-ECR move more efficient, this paper proposes a new method named p-ECRNJ. The main idea of p-ECRNJ is to use neighbor joining (NJ) to refine the unresolved nodes produced in p-ECR. Conclusion Experiments with real datasets show that p-ECRNJ can find better trees than the best known maximum likelihood methods so far and can efficiently improve local topological transforms in reasonable time.


Background
The inference of evolutionary trees with computational methods has many important applications in medical and biological research, such as drug discovery and conservation biology. A rich variety of tree reconstruction methods based on sequences have been developed, which fall into three categories, (a) maximum parsimony methods, (b) distance based methods and (c) approaches applying the maximum likelihood principle. The latter two are the most popular. Distance based methods calculate pair-wise distances between the sequences with each other, and support the tree that best fits these observed distances. The prominent distance based method is Neighbor joining (NJ) [1], in which partial trees are iteratively combined to form a larger tree in a bottom-up manner. Due to low computational time complexity and demonstrated topological accuracy for small data sets, NJ and its variants have been widely used.
Maximum likelihood methods aim to find the tree that gains the maximum likelihood value to have produced the underlying data. A number of studies [2,3] have shown that maximum likelihood programs can recover the correct tree from simulated datasets more frequently than other methods, which supports numerous observations from real data and explains their popularity.
However, the main disadvantage of maximum likelihood methods is that they require much computational effort. Maximum likelihood reconstruction consists of two tasks. The first task involves edge length estimation: Given the topology of a tree, find edge lengths to maximize the likelihood function. This task is accomplished by iterative methods such as expectation maximization or using Newton-Raphson optimization. Each iteration of these methods requires computations that take on the order of the number of sequences times the number of sequence positions. The second, more challenging, task is to find a tree topology that maximizes the likelihood. The number of potential topologies grows exponentially with the number of sequences n, e.g. for n = 50 sequences there already exist 2.84*10 76 alternative topologies; a number almost as large as the number of atoms in the universe (≈10 80 ). In fact, it has already been demonstrated that finding the optimal tree under the maximum likelihood criterion is NP-hard [4]. Consequently, the introduction of heuristics to reduce the search space in terms of potential topologies evaluated becomes inevitable, such as, the hill climbing based reconstruction algorithms [5][6][7]; the genetic algorithm based ones [8,9], etc.
Although using different search strategies, the heuristics are all to try to improve a starting tree/starting trees by a series of elementary topological rearrangements, until local optima is found. It is obvious that the performance of the heuristics depends on the degree of exhaustiveness of the topological rearrangements on some extent. The three often used topological rearrangements include Nearest Neighbor Interchange (NNI), Subtree Prune and Regraft (SPR) and Tree Bisection and Reconnection (TBR).
The NNI move swaps one rooted subtree or leave on one side of an internal edge e with another on the other side. For every internal edge, a NNI move can produce two different topologies as shown in Figure 1. For a tree containing n sequences, the size of the neighborhood induced by NNI is O(n). The neighborhood of an evolutionary tree T under a topological rearrangement move is defined as be the set of all trees that can be obtained from T by one move.
A SPR move on a tree T is defined as cutting any edge and thereby pruning a subtree t, and then regrafting the subtree by the same cut edge to a new vertex obtained by subdividing a pre-existing edge in T-t. For a tree containing n sequences, the size of the neighborhood induced by SPR is O(n 2 ).
In a TBR move an edge is removed from T, creating subtrees t and T-t, and then a new edge is added between the midpoints of any two edges in t and T-t, creating a new tree. For a tree containing n sequences, the size of the neighborhood induced by TBR is O(n 3 ). See Figure 2 for schematic representation of SPR and TBR.
As shown above, the neighborhood size produced by NNI, SPR and TBR acting on an evolutionary tree T comprising of n sequences is O(n), O(n 2 )and O(n 3 ) respectively. Thus, TBR are the most exhaustive. Even TBR searches, however, can often get trapped in local optima, since there are not many trees accessible in one step from any given tree, which motivates the introduction of p-  Edge Contraction and Refinement (p-ECR) [10]. A p-ECR move means to contract p edges all at once, creating unresolved nodes in the process, then refine these unresolved nodes to give back a binary tree. A contraction collapses an edge in the tree and identifies its two end points, while a refinement expands an unresolved node into two nodes connected by an edge. For example, the trees T 1 and T 5 in Figure 3 are separated by one 2-ECR move. From the definition of p-ECR, NNI is the special case of p-ECR when p equals to 1.

SPR and TBR
Let S u be the number of unresolved nodes produced in p-ECR and d i the degree of the unresolved node i, since the number of trees produced by n sequences is (2n-5)!!, refining the unresolved nodes can produce different trees. When p(>1) edges are deleted from a tree, the location relationship between the deleted edges determines the number of unresolved nodes produced and the degrees of the unresolved nodes. Now two extreme special cases are analyzed.
The first extreme special case is when all the p edges are adjacent. In this case, only one unresolved node with degree-(2p) is produced. Then the number of trees produced by one p-ECR is (4p-5)!!. Another extreme special case is when all the p edges disjoin. In this case, p unresolved nodes with degree-4 are produced. Then the number of trees produced by one p-ECR is ((2 × 4-5)!!) p , that is 3 p .
In other cases, the number of possible trees produced is intermediate of the two special cases. Observe that there are ways of selecting p edges to contract, there are Ω(n n 3 p ) trees produced by p-ECR. Thus, although an every sequence of p NNI moves on a tree is a p-ECR move on that tree, there are p-ECR moves that can not be performed by a sequence of p NNI move(the neighborhood size produced by p NNI moves is O(n p )). With such a wide search space, getting trapped in bad local optima can often be avoided, resulting in an exhaustive local search. Moreover, the exhaustiveness degree of a p-ECR move is dependent on the value of p, that is, a larger p means a larger search space for the correct tree, which could be potentially useful in selecting a suitable range of p.
However, how to quickly select the best one from so many possible evolutionary trees is a hard problem facing the p-ECR move, since there are so many potential topologies to evaluate and it is very time-consuming to compute the likelihood of a given topology as mentioned above. The straight answer is to simply evaluate every potential tree and select the best. Even for medium size of p, the answer is apparently impossible. Until now, there is no an efficient and general implementation of p-ECR. Consequently, people often yield up the exhaustive p-ECR and The illustration of a 2-ECR process Figure 3 The illustration of a 2-ECR process. Refine turn to some simpler one, such as NNI. In order to make p-ECR efficient, a method called p-ECRNJ motivated by NJ is presented in this paper. The main idea of p-ECRNJ is to use NJ to refine the unresolved nodes produced in p-ECR. In this paper, we use NJ to refine the unresolved nodes to improve p-ECR.

NJ
NJ is a greedy algorithm, which attempts to minimize the sum of all branch-lengths on the constructed tree. Conceptually, it starts out with a star-formed tree where each leaf corresponds to a sequence, and iteratively picks two nodes adjacent to the root and joins them by inserting a new node between the root and the two selected nodes. When joining nodes, the method selects the pair of nodes i, j that minimizes where d ij is the distance between node i and j(assumed symmetric, i.e., d ij = d ji ), R k is the sum over row k of the distance matrix: R k = ∑ x d kx (where x ranges over all nodes adjacent to the root node), and r is the remaining number of nodes adjacent to the root. Once the pair i, j to agglomerate is selected, a new node C which represents the root of the new cluster is created. Then the length of branches (C, i) and (C, j) is estimated by the following Eq. (2) Finally the distance matrix is reduced by replacing the distances relative to sequence i and sequence j by those between the new node C and any other node k using This formulation of NJ gives rise to a canonical algorithm that performs a search for min i, j Q ij , using time O(r 2 ), and joins i and j, using time O(r) to update d. The search and joining is continued until only there three nodes are adjacent to the root. The total time complexity becomes O(n 3 ), and the space complexity becomes O(n 2 ) (for representing the distance matrix d). An example of NJ is illustrated in Figure 4.
With a running time of O(n 3 ) on n sequences, NJ is fast and widely used. Moreover, empirical work shows it to be quite accurate, at least for small data sets. St. John et al. [11] even suggest it as a standard against which new phylogeny reconstruction methods should be evaluated. In this paper, we use NJ to refine the unresolved nodes to improve p-ECR.

Results
In The illustration of NJ Figure 4 The illustration of NJ. All the programs are run with default options. In addition, the parameter p in ECRML and ECRML+PHYML is set as 4 and iteration times as 20. Computing time is measured on a PC Pentium IV 2.99 GHz running with Windows XP.
Since the BioNJ cannot compute the likelihood values of final trees and there is a difference between all the maximum likelihood algorithms in the way of likelihood computation, all final trees found by BioNJ, PHYML, RAxML and fastDNAml are re-evaluated using ECRML to enable a direct comparison. The main results are shown in Table 2, Table 3 and Table 4. In addition, Stars in Table 2 and  Table 4 indicate entries where the algorithm was deemed to be too slow to bother with that test. Table 2 shows the maximum likelihood values of the evolutionary trees reconstructed by BioNJ, PHYML, RAxML, fastDNAml and ECRML on different datasets. Every of the former four algorithms include two columns: the first column lists the maximum likelihood values of the evolutionary trees reconstructed by the algorithm on different datasets; the second column lists the difference of the likelihood value between the algorithm and ECRML on corresponding dataset. A difference that is smaller than 0 means that ECRML can find an evolutionary tree with higher likelihood value than the algorithm on corresponding dataset and vice versa. ECRML only include one column, where lists the likelihood values of ECRML on different datasets. From Table 2, we can see that on every dataset, the values in the second column of BioNJ, PHYML, RAxML and fastDNAml are all smaller than 0. This means that ECRML can find better trees than these four algorithms on all datasets and in further proves that p-ECRNJ has a wider search space. Table 3 shows the likelihood values of the evolutionary trees reconstructed by PHYML, ECRML and ECRML+ PHYML on different datasets. Similar to Table 2, every of PHYML and ECRML includes two columns: the first column lists the likelihood values of the evolutionary trees reconstructed by the algorithm on different datasets; the second column lists the difference of the likelihood value between the algorithm and ECRML+ PHYML on corresponding dataset. A difference that is smaller than 0 means ECRML + PHYML can find an evolutionary tree with higher likelihood value than the algorithm on corresponding dataset and vice versa. ECRML + PHYML only include one column, where lists the likelihood values of ECRML + PHYML on different datasets. From Table 3, we can see that on every dataset, the values in the second column of PHYML are all smaller than 0. This support that p-ECRNJ can find better trees than other local rearrangements such as NNI and can further efficiently improve them. At the same time, we can also see from Table 3 that there are 10 values smaller than 0, one equal to 0 and one larger than 0 in the second column of ECRML. This means that ECRML + PHYML can often get better trees than ECRML, although they all include a p-ECRNJ search. This  is mainly due to that in p-ECRNJ, p edges are randomly deleted; then there is randomicity in p-ECRNJ, which can be eliminated by much iteration. However, when there is no enough iteration, the resulting trees may show some of the defects of starting trees. ECRML is start from a tree reconstructed by NJ and ECRML+PHYML is start from a tree reconstructed by PHYML in each iteration. PHYML can often get better trees than NJ as shown in Table 2. This explains that ECRML+PHYML can often get better trees than ECRML in Table 3.

Methods
In order to make p-ECR efficient, a method p-ECRNJ combining the exhaustiveness of the p-ECR move and the efficiency of NJ is presented in this paper and detailed here. Before p-ECRNJ, several concepts are introduced at first. An evolutionary tree is an unrooted or rooted tree whose leaves have degree one, and all of whose internal nodes have degree at least three. An internal node with degree more than three is called unresolved. A supernode α in a tree T is a degree-1 non leaf vertex, denoting some collapsed subtree.
The main idea of the p-ECRNJ is to randomly contract p edges from an evolutionary tree T, and the consequent refinement of the unresolved nodes is accomplished by NJ. As show in Figure 4, only one unresolved node Y is successively resolved in NJ. Generally, contracted tree T* in p-ECRNJ contains c (1 ≤ c ≤ p) unresolved nodes. Then, a collapsing procedure is needed before the refinement using NJ. That is, to select an unresolved node to refine and to root the tree at the node, then to collapse every subtree rooted at the node adjacent to the root node into a supernode respectively. This collapsing procedure produces a tree containing only one unresolved node; consequently, the unresolved node is refined according to NJ. The refinement process is continued until there is no unresolved node in T*.
A p-ECRNJ move on an evolutionary tree T is described in detail as follows.
(1) Contraction stage: to randomly select p edges to contract all at once and the unresolved tree T* is resulted; (2) Refinement stage: The refinement stage includes the following two steps: Step 1: Collapsing step: Select an unresolved node to refine and root T* at the unresolved node, collapse every subtree rooted at the internal node adjacent to the root node into a supernode respectively; Step 2: NJ step: 1 Build the distance matrix M of the nodes or supernodes in the collapsed tree T*. The distance between two nodes is the distance between the two sequences corresponding the two nodes, respectively. The distance between two supernodes is more sophisticated and computed as follows. Let a and β denote two supernodes respectively. It is assumed that a and β respectively represents a subtree containing x leaves α i (i = 0,..., x-1) and a subtree containing y leaves β i (i = 0,..., y-1). Then the distance between α and β is estimated using Eq. (4). The distance between a single node and a supernode is a special case where x = 1 or y = 1. The illustration of a 2-ECRNJ, where the black nodes denote supernodes, white nodes denote leaves or internal nodes, solid and dashed lines represent branches, dashed lines denotes branch to be deleted and the nodes in dashed cycles denote the ones to be refined Figure 5 The illustration of a 2-ECRNJ, where the black nodes denote supernodes, white nodes denote leaves or internal nodes, solid and dashed lines represent branches, dashed lines denotes branch to be deleted and the nodes in dashed cycles denote the ones to be refined.
The heuristic ECRML which is based on p-ECRNJ and hill climbing Figure 6 The heuristic ECRML which is based on p-ECRNJ and hill climbing. 5 Reduce the distance matrix by replacing the distances relative to i and j by those between the new node C and any other node k using Eq. (3).
The process of 2, 3, 4 and 5 is repeated until r = 2. The process of collapsing -NJ is repeated until there is no unresolved node in the tree, that is, the number of unresolved nodes S u is 0. The whole p-ECRNJ process is illustrated in Figure 5.
Due to that p edges are randomly selected, p-ECRNJ is performed repeatedly a pre-defined times k to perform in actual application instead of considering all ways of selecting p edges to contract. The value of k depends on the time allowed, if there is enough time, k can be set to . A simple heuristic named ECRML based on p-ECRNJ and hill climbing is shown in Figure 6.
As shown in Figure 6, for every unresolved node, the NJ method is run one time. The time complexity of the NJ method on n sequences is O(n 3 ). So, the total time complexity of a p-ECRNJ move is , where S u is the number of unresolved nodes and d i is the degree of the unresolved node. As mentioned above, when p edges are deleted from a tree, the location relationship between the deleted edges determines the number of unresolved nodes produced and the degrees of the unresolved nodes. When all the p edges are adjacent, only one unresolved node with degree-(2p) is produced. Then the time complexity of p-ECRNJ is O(8p 3 ), that is O(p 3 ); when all the p edges disjoin, p unresolved nodes with degree-4 are produced.
Then the time complexity of p-ECRNJ is O(4 3 p), that is O(p). In other cases, the time complexity is intermediate of the two special cases. Consequently, it takes at most O(p 3 ) to refine unresolved nodes in every run of p-ECRNJ(step a). After every p-ECRNJ, it need to optimize the branches and re-compute the likelihood of the current tree(Step b). The time complexity in this step is O(lmn), where l is the iteration times in the optimization of branches, m and n is the number of sites and number of sequences respectively. So, the total time complexity of ECRML is O(k*(p 3 + lmn)).
In an actual tree search, besides used as a topological transformation operation as shown in Figure 6, the p-ECRNJ move can be combined with a local topological transforms, such as NNI, where rounds of NNI and p-ECRNJ are alternated. For example, ECRML+PHYML in Results is based on the combination of p-ECRNJ and NNI.