Skip to main content
  • Methodology article
  • Open access
  • Published:

Recrafting the neighbor-joining method

Abstract

Background

The neighbor-joining method by Saitou and Nei is a widely used method for constructing phylogenetic trees. The formulation of the method gives rise to a canonical Θ(n3) algorithm upon which all existing implementations are based.

Results

In this paper we present techniques for speeding up the canonical neighbor-joining method. Our algorithms construct the same phylogenetic trees as the canonical neighbor-joining method. The best-case running time of our algorithms are O(n2) but the worst-case remains O(n3). We empirically evaluate the performance of our algoritms on distance matrices obtained from the Pfam collection of alignments. The experiments indicate that the running time of our algorithms evolve as Θ(n2) on the examined instance collection. We also compare the running time with that of the QuickTree tool, a widely used efficient implementation of the canonical neighbor-joining method.

Conclusion

The experiments show that our algorithms also yield a significant speed-up, already for medium sized instances.

Background

The neighbor-joining method is a distance based method for constructing evolutionary trees. It was introduced by Saitou and Nei [1], and the running time was later improved by Studier and Keppler [2]. It has become a mainstay of phylogeny reconstruction, and is probably the most widely used distance based algorithm in practice. With a running time of O(n3) on n taxa [2], it is fast for small input, and empirical work shows it to be reasonable accurate, at least for cases where the rate of evolution is not extremely high or low. St. John et al. [3] even suggest it as a standard against which new phylogenetic methods should be evaluated. The aim of this paper is to improve on the running time of neighbor-joining tree reconstruction to make it applicable for larger datasets, e.g. [4]. Whether the accuracy supplied by the neighbor-joining method is useful for a particular data set in a particular situation is an independent issue outside of the scope of this paper.

The neighbor-joining method is a greedy algorithm which attempts to minimize the sum of all branch-lengths on the constructed phylogenetic tree. Conceptually, it starts out with a star-formed tree where each leaf corresponds to a species, and iteratively picks two nodes adjacent to the root and joins them by inserting a new node between the root and the two selected nodes. When joining nodes, the method selects the pair of nodes i, j that minimizes the branch-length sum of the resulting new tree. One way of achieving this [2] is always to select the pair of nodes i, j that minimizes

Q ij = (r - 2) d ij - (R i + R j ),     (1)

where d ij is the distance between nodes i and j (assumed symmetric, i.e., d ij = d ji ), R k is the row sum over row k of the distance matrix: R k = ∑ i d ik (where i ranges over all nodes adjacent to the root node), and r is the remaining number of nodes adjacent to the root. When nodes i and j are joined, they are replaced with a new node, A, with distance to a remaining node k given by

d Ak = (d ik + d jk - d ij )/2.     (2)

This formulation of the neighbor-joining method gives rise to a canonical algorithm that performs a search for mini,jQ i j , using time O(r2), and joins i and j, using time O(r) to update d. This search and join is continued until only three nodes are adjacent to the root (i.e. for n - 3 joins where n is the total number of species). The total time complexity becomes O(n3), and the space complexity becomes O(n2) (for representing the distance matrix d). For further discussions of the neighbor-joining method, see e.g. [57].

In this paper, we present techniques for speeding up the canonical neighbor-joining algorithm. Our algorithms construct the same phylogenetic trees as the canonical algorithm, but attempt to reduce the search time for mini,jQ i j a quad-tree [8] built on top of the Q matrix, or on a matrix that approximates the Q matrix.

We evaluate the performance of our algorithms empirically on distance matrices obtained from the Pfam collection of alignments [9, 10], and compare the running time with that of the QuickTree tool [11], a widely used efficient implementation of the canonical neighbor-joining algorithm, which previously was shown to run faster than the implementations in the CLUSTAL W, and PHYLIP packages, and faster than the BIONJ implementation of a variant of the neighbor-joining method. The results show that the presented algorithms can give a significant speed-up over the standard neighbor-joining method, already for moderately sized instances. Indeed, evidence is given that the running time of the best of our algorithms evolves as Θ(n2) on the examined instance collection, as opposed to Θ(n3) for QuickTree.

Results and discussion

To evaluate the presented methods, we have implemented them in a tool, QuickJoin, available at [12]. For evaluating the performance of QuickJoin we have compared the QuickJoin tree creation with the canonical neighbor-joining tree creation method, as implemented in the tool QuickTree [11]. The QuickJoin program takes a distance matrix of the taxa for input, and produces a tree as output. The QuickTree tool, likewise, can take a distance matrix as input and produce a tree as output. Additionally, it can take a multiple alignment as input, and produce either a distance matrix or a tree as output. When comparing the running time of the two tools, we call both tools with a distance matrix as input.

The platform where the experiments were conducted was a Linux RedHat 8.0 kernel 2.4.18–19.7, Pentium 4 2.66 GHz, 512 KB cache, 1 GB ram, both the QuickJoin program and the QuickTree program was compiled using gcc/g++ 3.1.1 with optimization -O3. To measure the running time of the programs we used the GNU time tool, the time report is the user time obtained by the time -f %e option (wall-time in seconds). For QuickJoin we examine both the method based on a depth-first search with cutoffs and the method based on a priority queue search — see the Methods section for details. For QuickTree there is only one way of building trees.

Results on Pfam data

The data used for the first experiment were protein sequence alignments taken from the Pfam database [9, 10], and translated into distance matrices using QuickTree.

We first evaluated the performance of your method without the linear functions approximation of Q (see Methods). Figure 1 shows a plot of the walltime performance of the new methods with this approximation, compared to QuickTree on the alignments from Pfam with 200–1000 sequences. We can observe that the performance of the depth-first search method without sampling has a quite unstable performance, whereas the other methods achieve a performance comparable with that of the QuickTree implementation.

Figure 1
figure 1

Performance of our methods using the simple approximation to Q. The plots show the running time of QuickTree, and QuickJoin with the depth-first search (DFS) method and with the priority queue (p.queue) method with and without sampling (see Methods), with the first approximation of Q described in the Methods section. The input for the runs is distance matrices for the Pfam alignments with 200 to 1000 sequences. The depth-first search without sampling performs very poorly and is removed on the plot on the right to better show the performance of the remaining methods.

We then evaluated the performance of the new methods when also using the linear functions approximation to Q. The input for the runs is distance matrices for the Pfam alignments with 200 to 8000 sequences, and the results are shown in Figure 2. We can observe that the running time of all the presented methods are at the same level, and that all the methods outperform the QuickTree implementation.

Figure 2
figure 2

Performance of our methods using the linear functions approximation to Q. The plots show the running time of QuickTree, and QuickJoin with the depth-first search method and with the priority queue method with and without sampling, with the linear functions approximation of Q described in the Methods section. The input for the runs is distance matrices for the Pfam alignments with 200 to 8000 sequences. The new methods perform significantly better than the basic neighbor-joining method, as implemented in QuickTree. To better compare the new methods the QuickTree plot is removed on the right.

The way QuickJoin is implemented, the memory usage for representing the quad-tree is increased (by a factor of four) each time the number of taxa is increased to the next power of two. That is, the memory usage is close to constant between powers of twos, and grows by a factor of four when the input size crosses a power of two. As the memory usage grows, the number of page faults when running the program grows. This slows down the program, and is the explanation behind the increase in running time at 212 = 4096 in Figure 2. A similar increase in running time is observed at 211 = 2048 when running QuickJoin on a machine with less RAM. The canonical neighbor-joining method does not rely on a quad-tree and as such can run on less memory; it still needs to represent a distance matrix and a tree, however, and as such can only save about a factor of four compared to QuickJoin.

Results on data provided by Georg Fuellen

We have also used QuickJoin on two datasets supplied by Georg Fuellen, Integrated Functional Genomics, University Hospital Muenster, who used neighbor-joining to produce large phylogenies as described in [4]. Dataset A is a multiple sequence alignment of 1138 species, and dataset B is a multiple sequence alignment of 1863 species. Both multiple sequence alignments were converted into corresponding distance matrices. Building trees using QuickTree took 8.29 sec for dataset A and 34.67 sec for dataset B. Building trees using QuickJoin took 3.09 (3.38) sec for dataset A and 6.50 (7.56) sec for dataset B when using the depth-first search (priority queue search) method.

Conclusion

We have suggested methods for speeding up the search for mini,jQ ij in neighbor-joining based on a quad-tree storing information about known lower bounds on parts of the Q matrix. All our methods have a space bound of O(n2) and a time bound of the form O(nS + U), where S is the time used (on average) in each search and U is the time used for updating and rebuilding the quad-tree and other auxiliary data structures. For the suggested methods, the update time has a worst case bound of O(n2) if we rebuild the quad-tree whenever we have halved the number of remaining nodes. A worst case bound for S is O(n2), resulting in a combined O(n3) time bound for the methods, i.e., the same asymptotic bound as the original method.

We have conducted experiments, evaluating the performance of the methods implemented in QuickJoin on data from the Pfam database and have shown that the methods perform favorably compared to the canonical algorithm as implemented in QuickTree and achieves a significant speed up. QuickTree is stated to be an optimized implementation of the Neighbor-Joining tree building algorithm [11]. We expect that if we apply a similar level of code optimization techniques to the implementation of QuickJoin used for the experiments we will be able achieve an improved performance increasing the gap between the performance of QuickTree and QuickJoin.

Methods

Our algorithms construct the same phylogenetic trees as the canonical algorithm, but attempt to reduce the search time for mini,jQ ij , see Eq. (1), by using a quad-tree [8] built on top of the Q matrix, or on a matrix that approximates the Q matrix but does not need to be recomputed after each join. The nodes of the quad-tree store information guiding the search for the minimum, and the crux of our methods is to define this information in a way which will guide the search well for many iterations before it needs updating. The time complexity of our methods are given by O(nS + U), where S is the average search time for finding nodes i and j minimizing Q ij , and U is the time used, throughout the algorithm, for updating the quad-tree and other bookkeeping information, e.g., the distance matrix. The worst case time complexity remains O(n3), but the anticipation is that our methods on real data is significantly faster. The space complexity after adding the quad-tree is still O(n2) since a quad-tree with n2 leaves can be represented in O(n2) space.

Using a quad-tree

A quad-tree [8] is a four-ary tree modeling of a two-dimensional area recursively divided into quadrants. In the following description we assume for the sake of simplicity that r is a power of two. Figure 3 shows the tree resulting from a three-level recursive process.

Figure 3
figure 3

A quad-tree with three levels of nodes. A quad-tree with three levels of nodes, and the corresponding subdivision of a square. The root covers the entire square, its children each of the four quadrants, and the leaves a further division of these.

By building a quad-tree of height log r — where r is the number of remaining neighbors to the root node — on top of the r × r matrix Q and storing in nodes of the quad-tree the minimal values in the subtree rooted at that node, we can search for the pair of nodes minimizing Q ij in time O(log r). However, by Eq. 1, in each iteration of the algorithm, all entries in Q need to be updated: the value r is decreased by one, and each row and column in d has a new distance to the joined node A added and two distances removed, thereby changing R k for all k. Updating Q after each iteration therefore takes time O(r2), leading to a running time of O(n3). There is no asymptotic gain, and in practice the quad-tree solution will be significantly slower than the basic, non-quad-tree, algorithm, as a consequence of the added overhead. Simply building a quad-tree on top of Q will not improve the running time.

The problem with building the quad-tree on top of Q is that all entries in Q change with each join. To decrease the update time, we need to build the quad-tree on some information that does not completely change with each join. If, for instance, we only need to update a single row and column per join, we can do that in O(r) time.

Using approximations of Q

If we assume that the relative differences between the Q ij values do not change dramatically between joins — that is, we assume that the ordering of Q ij values is not randomly permuted after a join — we would expect that we could use the old Q ij values to guide the search for the current minimal Q ij . Let Q' denote the Q matrix at some earlier point, and let r' denote the number of remaining nodes adjacent to the root at that point. Similarly, let R k ' MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGsbGudaqhaaWcbaGaem4AaSgabaGaei4jaCcaaaaa@303B@ denote the row sum of row k in Q', and let δ k denote the difference between R k and R k ' MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGsbGudaqhaaWcbaGaem4AaSgabaGaei4jaCcaaaaa@303B@ : R k = R k ' MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGsbGudaqhaaWcbaGaem4AaSgabaGaei4jaCcaaaaa@303B@ + δ k . Based on these definitions we can rewrite Eq. 1 to the following:

Q i j = r 2 r ' 2 Q i j ' + r r ' r ' 2 ( R i ' + R j ' ) ( δ i + δ j ) . ( 3 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGrbqudaWgaaWcbaGaemyAaKMaemOAaOgabeaakiabg2da9maalaaabaGaemOCaiNaeyOeI0IaeGOmaidabaGaemOCaiNaei4jaCIaeyOeI0IaeGOmaidaaiabdgfarnaaDaaaleaacqWGPbqAcqWGQbGAaeaacqGGNaWjaaGccqGHRaWkdaWcaaqaaiabdkhaYjabgkHiTiabdkhaYjabcEcaNaqaaiabdkhaYjabcEcaNiabgkHiTiabikdaYaaacqGGOaakcqWGsbGudaqhaaWcbaGaemyAaKgabaGaei4jaCcaaOGaey4kaSIaemOuai1aa0baaSqaaiabdQgaQbqaaiabcEcaNaaakiabcMcaPiabgkHiTiabcIcaOGGaciab=r7aKnaaBaaaleaacqWGPbqAaeqaaOGaey4kaSIae8hTdq2aaSbaaSqaaiabdQgaQbqabaGccqGGPaqkcqGGUaGlcaWLjaGaaCzcamaabmaabaGaeG4mamdacaGLOaGaayzkaaaaaa@6040@

This equation expresses the current Q ij values in terms of the old values and some correction terms, given by the R' and δ vectors. Because of these terms, the minimal Q i j ' MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGrbqudaqhaaWcbaGaemyAaKMaemOAaOgabaGaei4jaCcaaaaa@3192@ does not necessarily identify the nodes i, j that minimize Q ij , so we cannot use a quad-tree of Q' alone to find the nodes to join. We can, however, use a quad-tree over Q' to get lower bounds for the minimal Q ij value in parts of the Q matrix, as described in the following.

Let Q MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFqeFuaaa@3841@ denote a quad-tree built on top of Q' such that Q MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFqeFuaaa@3841@ [i, j, l] denotes the minimum value at level l, where leaves are at level zero. More precisely, let Q MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFqeFuaaa@3841@ [i, j, 0] = Q i j ' MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGrbqudaqhaaWcbaGaemyAaKMaemOAaOgabaGaei4jaCcaaaaa@3192@ , and

Q [ i , j , l ] = min { Q [ 2 i , 2 j , l 1 ] Q [ 2 i + 1 , 2 j , l 1 ] Q [ 2 i , 2 j + 1 , l 1 ] Q [ 2 i + 1 , 2 j + 1 , l 1 ] . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFqeFucqGGBbWwcqWGPbqAcqGGSaalcqWGQbGAcqGGSaalcqWGSbaBcqGGDbqxcqGH9aqpcyGGTbqBcqGGPbqAcqGGUbGBdaGabaqaauaabaqaeeaaaaqaaiab=br8rjabcUfaBjabikdaYiabdMgaPjabcYcaSiabikdaYiabdQgaQjabcYcaSiabdYgaSjabgkHiTiabigdaXiabc2faDbqaaiab=br8rjabcUfaBjabikdaYiabdMgaPjabgUcaRiabigdaXiabcYcaSiabikdaYiabdQgaQjabcYcaSiabdYgaSjabgkHiTiabigdaXiabc2faDbqaaiab=br8rjabcUfaBjabikdaYiabdMgaPjabcYcaSiabikdaYiabdQgaQjabgUcaRiabigdaXiabcYcaSiabdYgaSjabgkHiTiabigdaXiabc2faDbqaaiab=br8rjabcUfaBjabikdaYiabdMgaPjabgUcaRiabigdaXiabcYcaSiabikdaYiabdQgaQjabgUcaRiabigdaXiabcYcaSiabdYgaSjabgkHiTiabigdaXiabc2faDjabc6caUaaaaiaawUhaaaaa@86DC@

With this definition, we have

Q [ i , j , l ] = min 2 l i i ' < 2 l ( i + 1 ) , 2 l j j ' < 2 l ( j + 1 ) Q i ' j ' ' . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFqeFucqGGBbWwcqWGPbqAcqGGSaalcqWGQbGAcqGGSaalcqWGSbaBcqGGDbqxcqGH9aqpdaWfqaqaaiGbc2gaTjabcMgaPjabc6gaUbWcbaGaeGOmaiZaaWbaaWqabeaacqWGSbaBaaWccqWGPbqAcqGHKjYOcqWGPbqAcqGGNaWjcqGH8aapcqaIYaGmdaahaaadbeqaaiabdYgaSbaaliabcIcaOiabdMgaPjabgUcaRiabigdaXiabcMcaPiabcYcaSiabbccaGiabikdaYmaaCaaameqabaGaemiBaWgaaSGaemOAaOMaeyizImQaemOAaO2aaWbaaWqabeaacqGGNaWjaaWccqGH8aapcqaIYaGmdaahaaadbeqaaiabdYgaSbaaliabcIcaOiabdQgaQjabgUcaRiabigdaXiabcMcaPaqabaGccqWGrbqudaqhaaWcbaGaemyAaKMaei4jaCIaemOAaOMaei4jaCcabaGaei4jaCcaaOGaeiOla4caaa@6FE3@

Let MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFSeIqaaa@377E@ denote a binary tree for the correction terms built as follows, where MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFSeIqaaa@377E@ [k, l] denotes the k th node at level l:

[ k , 0 ] = r r r 2 R k δ k [ k , l ] = min { [ 2 k , l 1 ] , [ 2 k + 1 , l 1 ] } . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakqaaeeqaaGWaaiab=XsicjabcUfaBjabdUgaRjabcYcaSiabicdaWiabc2faDjabg2da9maalaaabaGaemOCaiNaeyOeI0IafmOCaiNbauaaaeaacuWGYbGCgaqbaiabgkHiTiabikdaYaaacuWGsbGugaqbamaaBaaaleaacqWGRbWAaeqaaOGaeyOeI0ccciGae4hTdq2aaSbaaSqaaiabdUgaRbqabaaakeaacqWFSeIqcqGGBbWwcqWGRbWAcqGGSaalcqWGSbaBcqGGDbqxcqGH9aqpcyGGTbqBcqGGPbqAcqGGUbGBcqGG7bWEcqWFSeIqieaacqqFBbWwcqaIYaGmcqWGRbWAcqGGSaalcqWGSbaBcqGHsislcqaIXaqmcqGGDbqxcqGGSaalcaaMc8Uae8hlHiKaei4waSLaeGOmaiJaem4AaSMaey4kaSIaeGymaeJaeiilaWIaemiBaWMaeyOeI0IaeGymaeJaeiyxa0LaeiyFa0NaeiOla4caaaa@74F8@

We have

[ k , l ] = min 2 l k k ' < 2 l ( k + 1 ) [ k ' , 0 ] . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFSeIqcqGGBbWwcqWGRbWAcqGGSaalcqWGSbaBcqGGDbqxcqGH9aqpdaWfqaqaaiGbc2gaTjabcMgaPjabc6gaUbWcbaGaeGOmaiZaaWbaaWqabeaacqWGSbaBaaWccqWGRbWAcqGHKjYOcqWGRbWAcqGGNaWjcqGH8aapcqaIYaGmdaahaaadbeqaaiabdYgaSbaaliabcIcaOiabdUgaRjabgUcaRiabigdaXiabcMcaPaqabaGccqWFSeIqcqGGBbWwcqWGRbWAcqGGNaWjcqGGSaalcqaIWaamcqGGDbqxcqGGUaGlaaa@5BCC@

>From the rewriting of Q by Eq. 3 and the trees above, we define

[ i , j , l ] = r 2 r ' 2 Q [ i , j , l ] + [ i , l ] + [ j , l ] . ( 4 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectcqGGBbWwcqWGPbqAcqGGSaalcqWGQbGAcqGGSaalcqWGSbaBcqGGDbqxcqGH9aqpdaWcaaqaaiabdkhaYjabgkHiTiabikdaYaqaaiabdkhaYjabcEcaNiabgkHiTiabikdaYaaacqWFqeFucqGGBbWwcqWGPbqAcqGGSaalcqWGQbGAcqGGSaalcqWGSbaBcqGGDbqxcqGHRaWkcqWFSeIqcqGGBbWwcqWGPbqAcqGGSaalcqWGSbaBcqGGDbqxcqGHRaWkcqWFSeIqcqGGBbWwcqWGQbGAcqGGSaalcqWGSbaBcqGGDbqxcqGGUaGlcaWLjaGaaCzcamaabmaabaGaeGinaqdacaGLOaGaayzkaaaaaa@6748@

We observe that MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ [i, j, 0] = Qi,jand that MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ [i, j, l] is a lower bound on the Q matrix entries in rows 2li to 2l (i + 1) - 1 and columns 2lj to 2l(j + 1) - 1:

[ i , j , l ] min 2 l i i ' < 2 l ( i + 1 ) , 2 l j j ' < 2 l ( j + 1 ) Q i ' j ' MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectcqGGBbWwcqWGPbqAcqGGSaalcqWGQbGAcqGGSaalcqWGSbaBcqGGDbqxcqGHKjYOdaWfqaqaaiGbc2gaTjabcMgaPjabc6gaUbWcbaGaeGOmaiZaaWbaaWqabeaacqWGSbaBaaWccqWGPbqAcqGHKjYOcqWGPbqAcqGGNaWjcqGH8aapcqaIYaGmdaahaaadbeqaaiabdYgaSbaaliabcIcaOiabdMgaPjabgUcaRiabigdaXiabcMcaPiabcYcaSiaaykW7cqaIYaGmdaahaaadbeqaaiabdYgaSbaaliabdQgaQjabgsMiJkabdQgaQjabcEcaNiabgYda8iabikdaYmaaCaaameqabaGaemiBaWgaaSGaeiikaGIaemOAaOMaey4kaSIaeGymaeJaeiykaKcabeaakiabdgfarnaaBaaaleaacqWGPbqAcqGGNaWjcqWGQbGAcqGGNaWjaeqaaaaa@6E85@

The MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ values can be seen as a quad-tree, although it is implicitly defined by Q MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFqeFuaaa@3841@ and MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFSeIqaaa@377E@ .

Searching the quad-tree

We cannot simply search for the minimum valued leaf in MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ in the usual quad-tree search fashion, since we are no longer storing the minimum value in a range, but rather a lower bound on the minimum value. Instead, we will use the MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ [i, j, l] values to guide our search for the minimal Q ij values.

Two approaches present themselves: A depth-first traversal of MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ with cut-offs when the lower bound is greater than a known Q ij value, and priority queue based search that always expands the MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ [i, j, l] value with the lowest lower bound.

Depth-first search

In the depth-first search approach, we simply explore MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ in a depth-first manner, looking for the minimal MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ [i, j, 0] value. By definition, this is also the minimal Q ij value. In itself, this will not speed up the search for the minimal value — although still in O(r2), traversing MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ is significantly slower than traversing the Q matrix to begin with — however, we can avoid exploring parts of the tree by cutting off searches of sub-trees. When we see a node MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ [i, j, l], whose lower bound is greater than an already seen Q value at the bottom level, we need not explore the sub-tree rooted in MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ [i, j, l] since none of the leaves in this tree will contain the minimal value.

The efficiency of this search greatly depends on how much of the tree can be discarded by cut-offs. In the worst case, no cut-offs are possible and we explore the entire MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ , with a search time in O(r2), giving us a combined search time of O(n3). If, on average, we only need to explore O(r) nodes, the combined search time is down to O(n2).

Priority queue search

In the priority queue approach, we use a priority queue to expand the MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ [i, j, l] nodes in a lowest-lower-bound-first order. This is based on the assumption that the lowest lower bound is more likely to contain the real lowest value. In each step, deletion of the minimum element in the priority queue gives us the unexplored node with the current lowest lower bound, and each of the children of the node are then inserted into the priority queue. Once a deletion produces an element on level 0, we have found the minimal Q ij value and need search no further.

As with the depth-first search, the efficiency of this search depends on how the lower bounds corresponds to the actual leaf-values in the tree. In the worst case, we need to explore the entire tree at a cost of O(r2 log r), with a total search time of O(n3 log n), while if, on average, we only search O(r) nodes we have a cost of O(r log r), with a total search time of O(n2 log n).

Random sampling

Both the depth-first search and the priority queue search approaches can be extended with an initial random sampling of, e.g., O(r) entries of the Q ij matrix. The minimum of these values can then be used as the initial cut-off value. For the depth-first search approach this allows the algorithm to make more qualified cut-offs already from the beginning of the search, whereas for the priority queue search approach the gain is minimal since the cut-off only reduces the number of insertions into the priority queue — the number of deletions remains unchanged.

Updating the quad-tree

In each join of two nodes, we need to delete two rows and two columns from Q — the two nodes we join — and add one new row and column — for the new node. This update must also be represented in MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ , which means that we need to update Q MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFqeFuaaa@3841@ and MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFSeIqaaa@377E@ . If we store the new row/column in one of the deleted rows/columns, say i, we need to update two rows/columns in Q' and all values in δ as follows (where x ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWG4baEgaqeaaaa@2E3D@ denotes the updated value of x):

d i k ¯ = ( d i k + d j k d i j ) / 2 δ k ¯ = δ k + d i k ¯ d i k d j k Q i k ' ¯ = ( r ' 2 ) d i k ¯ ( R i ' + R k ' ) Q j k ' ¯ = ( effectively deleting  j ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakqGaeeqaaaMbbaWaa0aaaeaacqWGKbazdaWgaaWcbaGaemyAaKMaem4AaSgabeaaaaGccqGH9aqpcqGGOaakcqWGKbazdaWgaaWcbaGaemyAaKMaem4AaSgabeaakiabgUcaRiabdsgaKnaaBaaaleaacqWGQbGAcqWGRbWAaeqaaOGaeyOeI0Iaemizaq2aaSbaaSqaaiabdMgaPjabdQgaQbqabaGccqGGPaqkcqGGVaWlcqaIYaGmaeaadaqdaaqaaGGaciab=r7aKnaaBaaaleaacqWGRbWAaeqaaaaakiabg2da9iab=r7aKnaaBaaaleaacqWGRbWAaeqaaOGaey4kaSYaa0aaaeaacqWGKbazdaWgaaWcbaGaemyAaKMaem4AaSgabeaaaaGccqGHsislcqWGKbazdaWgaaWcbaGaemyAaKMaem4AaSgabeaakiabgkHiTiabdsgaKnaaBaaaleaacqWGQbGAcqWGRbWAaeqaaaGcbaWaa0aaaeaacqWGrbqudaqhaaWcbaGaemyAaKMaem4AaSgabaGaei4jaCcaaaaakiabg2da9iabcIcaOiabdkhaYjabcEcaNiabgkHiTiabikdaYiabcMcaPmaanaaabaGaemizaq2aaSbaaSqaaiabdMgaPjabdUgaRbqabaaaaOGaeyOeI0IaeiikaGIaemOuai1aa0baaSqaaiabdMgaPbqaaiabcEcaNaaakiabgUcaRiabdkfasnaaDaaaleaacqWGRbWAaeaacqGGNaWjaaGccqGGPaqkaeaadaqdaaqaaiabdgfarnaaDaaaleaacqWGQbGAcqWGRbWAaeaacqGGNaWjaaaaaOGaeyypa0JaeyOhIuQaaCzcaiabcIcaOiabbwgaLjabbAgaMjabbAgaMjabbwgaLjabbogaJjabbsha0jabbMgaPjabbAha2jabbwgaLjabbYgaSjabbMha5jabbccaGiabbsgaKjabbwgaLjabbYgaSjabbwgaLjabbsha0jabbMgaPjabb6gaUjabbEgaNjabbccaGiabdQgaQjabcMcaPaaaaa@9D16@

for all ki, j. Updating δ and Q' this way takes time O(r). Rebuilding MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFSeIqaaa@377E@ from the new δ and updating Q MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFqeFuaaa@3841@ based on the change of two rows/columns in Q' can also be done in time O(r). Over the n iterations of the algorithm, this updating contributes O(n2) to the running time.

As the distance between r and r' grows, the information stored in Q' and R', and thus in Q MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFqeFuaaa@3841@ , diverges from the real values from Q and R. Consequently, the lower bounds in MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ becomes less accurate, and we expect to search more of MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ before we find a minimal leaf. It is therefore necessary to regularly update MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ , by setting Q' to the current Q, updating R' and δ correspondingly, and rebuild Q MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFqeFuaaa@3841@ .

A rebuild takes time O(r2), so if we rebuild too frequently, there will be no gain in running time — rebuilding in each iteration, for instance, will result in an O(n3) algorithm. On the other hand, if we rebuild too infrequently, the search time will suffer due to the worse lower bounds.

We chose to rebuild each time we have processed a fraction of the remaining nodes, i.e. after r ' m MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdkhaYjabcEcaNaqaaiabd2gaTbaaaaa@3062@ iterations, for some fixed m. Since the size of the matrices constructed decreases exponentially, this implies that we spend O(n2) time on rebuilding all in all. Together with the updating performed in each iteration, this gives a total update time of O(n2).

Limitations of the approach

For the methods to be useful, i.e. to yield a speed-up compared to a standard neighbor-joining implementation, it is essential that the derived lower bounds MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ [i, j, l] for a node in the quad-tree are close to the minimum value among the leafs in the subtree spanned by the quad-tree node. Comparing Eq. 3 and Eq. 4 this might be infeasible if the correction terms

r r ' r ' 2 ( R i ' + R j ' ) ( δ i + δ j ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdkhaYjabgkHiTiabdkhaYjabcEcaNaqaaiabdkhaYjabcEcaNiabgkHiTiabikdaYaaacqGGOaakcqWGsbGudaqhaaWcbaGaemyAaKgabaGaei4jaCcaaOGaey4kaSIaemOuai1aa0baaSqaaiabdQgaQbqaaiabcEcaNaaakiabcMcaPiabgkHiTiabcIcaOGGaciab=r7aKnaaBaaaleaacqWGPbqAaeqaaOGaey4kaSIae8hTdq2aaSbaaSqaaiabdQgaQbqabaGccqGGPaqkaaa@492C@

span quite different values, since we use the minimum over all the correction values in the subtree.

Unfortunately, this is what we expect for the R i values. In our experiments presented in the Results section, we observe in Figure 1 that the performance of the above developed techniques, except for the depth first search without sampling, is essentially the same as those obtained by the QuickTree algorithm.

Approximation of Q using Linear functions

In Eq. 3 we based our search on an old Q matrix. In this section we base the approach on the following rewriting of Eq. 1 that only depends on old row sums R k ' MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGsbGudaqhaaWcbaGaem4AaSgabaGaei4jaCcaaaaa@303B@ :

Q i j = ( d i j R i ' + R j ' r ' ) r 2 d i j + r R i ' r ' R i + r R j ' r ' R j = f i j ( r ) + c i ( r ) + c j ( r ) , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqaaeWadaaabaGaemyuae1aaSbaaSqaaiabdMgaPjabdQgaQbqabaaakeaacqGH9aqpaeaadaqadaqaaiabdsgaKnaaBaaaleaacqWGPbqAcqWGQbGAaeqaaOGaeyOeI0YaaSaaaeaacqWGsbGudaqhaaWcbaGaemyAaKgabaGaei4jaCcaaOGaey4kaSIaemOuai1aa0baaSqaaiabdQgaQbqaaiabcEcaNaaaaOqaaiabdkhaYjabcEcaNaaaaiaawIcacaGLPaaacqWGYbGCaeaaaeaaaeaacqGHsislcqaIYaGmcqWGKbazdaWgaaWcbaGaemyAaKMaemOAaOgabeaakiabgUcaRmaalaaabaGaemOCaiNaemOuai1aa0baaSqaaiabdMgaPbqaaiabcEcaNaaaaOqaaiabdkhaYjabcEcaNaaacqGHsislcqWGsbGudaWgaaWcbaGaemyAaKgabeaakiabgUcaRmaalaaabaGaemOCaiNaemOuai1aa0baaSqaaiabdQgaQbqaaiabcEcaNaaaaOqaaiabdkhaYjabcEcaNaaacqGHsislcqWGsbGudaWgaaWcbaGaemOAaOgabeaaaOqaaaqaaiabg2da9aqaaiabdAgaMnaaBaaaleaacqWGPbqAcqWGQbGAaeqaaOGaeiikaGIaemOCaiNaeiykaKIaey4kaSIaem4yam2aaSbaaSqaaiabdMgaPbqabaGccqGGOaakcqWGYbGCcqGGPaqkcqGHRaWkcqWGJbWydaWgaaWcbaGaemOAaOgabeaakiabcIcaOiabdkhaYjabcMcaPiabcYcaSaaaaaa@7956@

where

f i j ( r ) = ( d i j R i + R j r ) r 2 d i j ( 5 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGMbGzdaWgaaWcbaGaemyAaKMaemOAaOgabeaakiabcIcaOiabdkhaYjabcMcaPiabg2da9maabmaabaGaemizaq2aaSbaaSqaaiabdMgaPjabdQgaQbqabaGccqGHsisldaWcaaqaaiqbdkfaszaafaWaaSbaaSqaaiabdMgaPbqabaGccqGHRaWkcuWGsbGugaqbamaaBaaaleaacqWGQbGAaeqaaaGcbaGafmOCaiNbauaaaaaacaGLOaGaayzkaaGaemOCaiNaeyOeI0IaeGOmaiJaemizaq2aaSbaaSqaaiabdMgaPjabdQgaQbqabaGccaWLjaGaaCzcamaabmaabaGaeGynaudacaGLOaGaayzkaaaaaa@4F1A@

c i ( r ) = r R i r R i . ( 6 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGJbWydaWgaaWcbaGaemyAaKgabeaakiabcIcaOiabdkhaYjabcMcaPiabg2da9iabdkhaYnaalaaabaGafmOuaiLbauaadaWgaaWcbaGaemyAaKgabeaaaOqaaiqbdkhaYzaafaaaaiabgkHiTiabdkfasnaaBaaaleaacqWGPbqAaeqaaOGaeiOla4IaaCzcaiaaxMaadaqadaqaaiabiAda2aGaayjkaiaawMcaaaaa@41C7@

The rewriting expresses Q ij as a linear function f ij over r plus some correction terms c i and c j — which, assuming R k r R k ' r ' MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdkfasnaaBaaaleaacqWGRbWAaeqaaaGcbaGaemOCaihaaiabgIKi7oaalaaabaGaemOuai1aa0baaSqaaiabdUgaRbqaaiabcEcaNaaaaOqaaiabdkhaYjabcEcaNaaaaaa@3888@ for k = i, j, is likely to be small. Note that f ij only depends on the current value of d ij and the values of r' and R'; we only need to update f ij when i or j is joined in a new node, i.e., we only need to update a linear number of functions for each join.

We will define below a quad-tree with the f ij functions at the leaves and where each internal node ideally should store the function that is the minimum over all the linear functions stored at the leaves of the subtree rooted at the node. Unfortunately this is not a linear function but a convex function consisting of piecewise linear functions. To achieve an efficient algorithm we instead, for the interval of r values of interest, maintain a lower bound for the convex function that is a linear function.

Assume we decide to rebuild the structure after (at most) r ' m MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdkhaYjabcEcaNaqaaiabd2gaTbaaaaa@3062@ iterations, for some fixed m. For two linear functions f ij and f i'j' , define min m {f ij , f i'j' } to be the linear function that passes through the two points

(r' - r'/m, min {f ij (r' - r' /m), f i'j' (r' - r'/m)})

and

(r', min {f ij (r'), f i'j' (r')}),

as illustrated by Figure 4. Defined this way, min m {f ij , f i'j' } is a lower bound for both of the functions until the next rebuilding:

Figure 4
figure 4

The lower bound linear function. A linear function that is the best lower bound of two other linear functions on the interval r' - r'/m to r'. The dashed line is the linear function that is the greatest lower bound of the two linear functions shown as solid lines.

min m {f ij , f i'j' } (r) ≤ min{f ij (r), f i'j' (r)},     (7)

for all r [r' - r'/m, r']. This minimum-operation is easily generalized to take the minimum of four functions, and we define a quad-tree MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFXeIraaa@3787@ over the functions by:

MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFXeIraaa@3787@
(1)

[i, j, 0] = f ij (r)

[ i , j , l ] = min m { [ 2 i , 2 j , l 1 ] [ 2 i + 1 , 2 j , l 1 ] for  l > 0 [ 2 i , 2 j + 1 , l 1 ] [ 2 i + 1 , 2 j + 1 , l 1 ] MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFXeIrcqGGBbWwcqWGPbqAcqGGSaalcqWGQbGAcqGGSaalcqWGSbaBcqGGDbqxcqGH9aqpcyGGTbqBcqGGPbqAcqGGUbGBdaWgaaWcbaGaemyBa0gabeaakmaaceaabaqbaeaabqGaaaaabaGae8xmHyKaei4waSLaeGOmaiJaemyAaKMaeiilaWIaeGOmaiJaemOAaOMaeiilaWIaemiBaWMaeyOeI0IaeGymaeJaeiyxa0fabaaabaGae8xmHyKaei4waSLaeGOmaiJaemyAaKMaey4kaSIaeGymaeJaeiilaWIaeGOmaiJaemOAaOMaeiilaWIaemiBaWMaeyOeI0IaeGymaeJaeiyxa0fabaGaeeOzayMaee4Ba8MaeeOCaiNaeeiiaaIaemiBaWMaeyOpa4JaeGimaadabaGae8xmHyKaei4waSLaeGOmaiJaemyAaKMaeiilaWIaeGOmaiJaemOAaOMaey4kaSIaeGymaeJaeiilaWIaemiBaWMaeyOeI0IaeGymaeJaeiyxa0fabaaabaGae8xmHyKaei4waSLaeGOmaiJaemyAaKMaey4kaSIaeGymaeJaeiilaWIaeGOmaiJaemOAaOMaey4kaSIaeGymaeJaeiilaWIaemiBaWMaeyOeI0IaeGymaeJaeiyxa0fabaaaaaGaay5Eaaaaaa@8C35@

By induction on the number of minimum operations and a generalization of Eq. 7 we get

[ i , j , l ] ( r ) min 2 l i i ' < 2 l ( i + 1 ) , 2 l j j ' < 2 l ( j + 1 ) f i ' j ' ( r ) , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFXeIrcqGGBbWwcqWGPbqAcqGGSaalcqWGQbGAcqGGSaalcqWGSbaBcqGGDbqxcqGGOaakcqWGYbGCcqGGPaqkcqGHKjYOdaWfqaqaaiGbc2gaTjabcMgaPjabc6gaUbWcbaGaeGOmaiZaaWbaaWqabeaacqWGSbaBaaWccqWGPbqAcqGHKjYOcqWGPbqAcqGGNaWjcqGH8aapcqaIYaGmdaahaaadbeqaaiabdYgaSbaaliabcIcaOiabdMgaPjabgUcaRiabigdaXiabcMcaPiabcYcaSiaaykW7cqaIYaGmdaahaaadbeqaaiabdYgaSbaaliabdQgaQjabgsMiJkabdQgaQjabcEcaNiabgYda8iabikdaYmaaCaaameqabaGaemiBaWgaaSGaeiikaGIaemOAaOMaey4kaSIaeGymaeJaeiykaKcabeaakiabdAgaMnaaBaaaleaacqWGPbqAcqGGNaWjcqWGQbGAcqGGNaWjaeqaaOGaeiikaGIaemOCaiNaeiykaKIaeiilaWcaaa@75F0@

for all r [r' - r'/m, r'].

We can use this tree, together with a binary correction tree C MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFce=qaaa@3825@ defined by

C MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFce=qaaa@3825@

[k, 0] = c k (r)

C MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFce=qaaa@3825@

[k, l] = min{ C MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFce=qaaa@3825@ [2k, l - 1], C MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFce=qaaa@3825@ [2k + 1, l - 1]} for l > 0

to define the implicit quad-tree

MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@

[i, j, l](r) = MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFXeIraaa@3787@ [i, j, l](r) + Q i j ' MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGrbqudaqhaaWcbaGaemyAaKMaemOAaOgabaGaei4jaCcaaaaa@3192@ [i, l] + C MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFce=qaaa@3825@ [j, l]

satisfying

[ i , j , l ] ( r ) min 2 l i i ' < 2 l ( i + 1 ) , 2 l j j ' < 2 l ( j + 1 ) Q i j MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectcqGGBbWwcqWGPbqAcqGGSaalcqWGQbGAcqGGSaalcqWGSbaBcqGGDbqxcqGGOaakcqWGYbGCcqGGPaqkcqGHKjYOdaWfqaqaaiGbc2gaTjabcMgaPjabc6gaUbWcbaGaeGOmaiZaaWbaaWqabeaacqWGSbaBaaWccqWGPbqAcqGHKjYOcqWGPbqAcqGGNaWjcqGH8aapcqaIYaGmdaahaaadbeqaaiabdYgaSbaaliabcIcaOiabdMgaPjabgUcaRiabigdaXiabcMcaPiabcYcaSiaaykW7cqaIYaGmdaahaaadbeqaaiabdYgaSbaaliabdQgaQjabgsMiJkabdQgaQjabcEcaNiabgYda8iabikdaYmaaCaaameqabaGaemiBaWgaaSGaeiikaGIaemOAaOMaey4kaSIaeGymaeJaeiykaKcabeaakiabdgfarnaaBaaaleaacqWGPbqAcqWGQbGAaeqaaaaa@6FF8@

for the current r, assuming (a) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFXeIraaa@3787@ is updated along with the functions f ij whenever i or j is joined, (b) r [r' - r'/m, r'], and (c) C MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFce=qaaa@3825@ is current. Condition a is necessary since f ij depends on d ij which changes when i or j is joined, and condition b is necessary because of the way the minimum operation is defined. Condition c simply states that since C MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFce=qaaa@3825@ [k, 0] depend on the current value of R k , it must be updated whenever a join is performed.

Searching

Before each iteration MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ we must rebuild a current version of C MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFce=qaaa@3825@ , which takes time O(r). After this, we can search for the minimal Q ij using the lower bounds in MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFsectaaa@376E@ , as described in the previous section, using either a depth-first search with cut-offs or a priority queue. If the number of nodes visited during a search on average is linear, the total search time is O(n2) for the depth-first approach, or O(n2 log n) for the priority queue approach.

Updating

For each join we must update two rows/columns in MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFXeIraaa@3787@ taking time O(r) for a total of O(n2). Furthermore, we must completely rebuild the function tree MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFXeIraaa@3787@ whenever r reaches r' - r'/m. For fixed m, this has a total cost of O(n2).

References

  1. Saitou N, Nei M: The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees. Mol Biol Evol 1987, 4(4):406–425.

    CAS  PubMed  Google Scholar 

  2. Studier JA, Keppler KJ: A Note on the Neighbor-Joining Method of Saitou and Nei. Mol Biol Evol 1988, 5(6):729–731.

    CAS  PubMed  Google Scholar 

  3. St John K, Warnow T, Moret B, Vawter L: Performance Study of Phylogenetic Methods: (Unweighted) Quartet Methods and Neighbor-Joining. J Algorithms 2003, 48: 173–193. [(Special issue on best papers from SODA'01.)]. 10.1016/S0196-6774(03)00049-X

    Article  Google Scholar 

  4. Fuellen G, Spitzer M, Cullen P, Lorkowski S: BLASTing Proteomes, Yielding Phylogenies. Silico Biology 2003., 3: [To appear.].

    Google Scholar 

  5. Gascuel O: A note on Sattath and Tversky's, Saitou and Nei's, and Studier and Keppler's algorithms for inferring phylogenies from evolutionary distances. Mol Biol Evol 1994, 11(6):961–963.

    CAS  PubMed  Google Scholar 

  6. Nei N, Kumar S: Molecular Evolution and Phylogenetics. Volume chap 6.4. Oxford University Press; 2000:103–110.

    Google Scholar 

  7. Saitou N: Reconstruction of Gene Trees from Sequence Data. Methods in Enzymology 1996, 266: 427–448.

    Article  CAS  PubMed  Google Scholar 

  8. Finkel RA, Bentley JL: Quad Trees: A Data Structure for Retrieval by Composite Key. Acta Informatica 1974, 4: 1–9. 10.1007/BF00288933

    Article  Google Scholar 

  9. Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy S, Griffiths-Jones S, Howe K, Marshall M, Sonnhammer E: The Pfam Protein Families Database. Nucleic Acids Res 2002, 30: 276–280. 10.1093/nar/30.1.276

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  10. Pfam: Protein Families Database of Alignments and HMMs[http://www.sanger.ac.uk/Software/Pfam/]

  11. Howe K, Bateman A, Durbin R: QuickTree: Building Huge Neighbour-Joining Trees of Protein Sequences. Bioinformatics 2002, 18(11):1546–1547. 10.1093/bioinformatics/18.11.1546

    Article  CAS  PubMed  Google Scholar 

  12. QuickJoin[http://www.birc.dk/Software/QuickJoin/index.html]

Download references

Acknowledgements

This work was partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT). Gerth S. Brodal was supported by the Carlsberg Foundation (contract number ANS-0257/20).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Thomas Mailund.

Additional information

Authors' contributions

TM implemented the algorithms in the QuickJoin tool. TM and RF conducted the experiments. All authors participated in the development of the algorithms, designing the experiments, and writing the paper.

Authors’ original submitted files for images

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Mailund, T., Brodal, G.S., Fagerberg, R. et al. Recrafting the neighbor-joining method. BMC Bioinformatics 7, 29 (2006). https://doi.org/10.1186/1471-2105-7-29

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2105-7-29

Keywords