Existing agglomerative algorithms are defined by criteria and formulae which all can be rewritten as distance averages. These algorithms (e.g. NJ [1, 2], BIONJ [3] and MVR [4]) are generalized to incomplete distance matrices by estimating these averages using available distances, when some of those are missing. In the following, we first define notation and present a generic agglomerative scheme that covers all the algorithms being discussed here. Then, we describe for each of the three agglomeration steps (pair selection, branch length estimation, and matrix reduction), how NJ is generalized into NJ* to deal with missing distances. NJ* is further refined by BIONJ* that incorporates a first simple estimation of the variance associated to each evolutionary distance. Finally, a second, more accurate estimation of this variance is used by MVR* that generalizes the weighted least-squares (WLS) version of the MVR [4] approach.
Notation
Let L
n
= {1,2, ..., n} be the set of all taxa numbered from 1 to n, and (Δ
ij
) a distance matrix, where Δ
ij
corresponds to the evolutionary distance between taxa i, j ∈ L
n
, and Δ
ii
= 0, ∀i ∈ L
n
. Distance-based algorithms build a tree T (also denoted as , depending on the context) from (Δ
ij
), and estimate all branch lengths T
uv
, where uv is any pair of sibling nodes in T. At each agglomeration stage, a taxon pair xy is selected, connected to a new internal node u, and replaced by u in (Δ
ij
). Thus, at each stage, the set L
r
= {1,2, ..., r} of non-agglomerated taxa drops in cardinality by 1, and r is changed into r - 1. Tree reconstruction stops when r = 2.
Agglomerative algorithms with complete distance matrices
A number of existing agglomerative algorithms to deal with complete matrices can be summarized using the following scheme [4]:
-
(a)
Select the xy pair to be merged into u by optimizing an agglomeration criterion;
-
(b)
Estimate the branch lengths T
xu
and T
yu
:
with
(1)
-
(c)
Reduce the distance matrix (Δ
ij
) for all i ≠ x, y:
Δ
ui
= λ
i
(Δ
xi
- T
xu
) + (1 - λ
i
)(Δ
yi
- T
yu
) with λ
i
∈ [0,1] (2)
-
(d)
r = r - 1;
Step (a) in this generic scheme searches for the taxon pair xy to be merged by optimizing an agglomeration criterion. NJ, BIONJ and MVR select the pair which maximizes [1, 2]:
(3)
Let (Δ
ij
) be additive [61], i.e. be defined as the path-length distance between taxa in a phylogenetic tree T with positive branch lengths; then, maximizing Q
xy
over all taxon pairs selects a cherry of T, i.e. a pair of taxa being separated by a unique internal node in T. In other words, Q
xy
is consistent [36]. However, it is easily shown (using counter-examples) that the second best taxon pair (based on Q
xy
values) is not necessarily a cherry of T.
Conversely, the ADDTREE [5] pair selection criterion implies that all cherries of T have highest criterion value. The ADDTREE criterion counts the number of times where the xy pair is a cherry in all taxon quartets xyij:
(4)
where H(t) = 1 if t ≥ 0, and H(t) = 0 if t < 0. This criterion has integer values ranging from 0 to (n - 2)(n - 3)/2, and this maximum value is reached for all cherries (but for the cherries only) with additive distance matrices. Careful implementation [39] of ADDTREE allows for O(n4) run time. NJ, BIONJ and MVR are much faster. They first compute all R
z
sums in Equation (3), and then compute in O(1) the Q
xy
value of each xy pair. Each agglomeration stage thus requires O(r2) time (branch-length estimation and matrix reduction are achieved in O(r)), and the whole algorithm is in O(n3). Moreover, Q
xy
can be seen as a continuous version of N
xy
[62].
After xy pair selection, x and y are connected to the new node u, and the lengths of xu and yu branches are estimated using Equation (1). Assuming that (Δ
ij
) is additive and corresponds to tree T, we have T
xu
= (Δ
xy
+ Δ
xi
- Δ
yi
)/2, ∀i ≠ x, y. Equation (1) averages these elementary estimators using various (w
i
) weightings. With NJ, the average is equally-weighted and we have w
i
= w = 1/(2(r - 2)). We shall see that MVR uses different w
i
weights.
Finally (step (c)), (Δ
ij
) is reduced by replacing x and y with the new node u, and by computing all Δ
ui
distances, ∀i ≠ x, y. When (Δ
ij
) is additive and corresponds to tree T, we have Δ
ui
= Δ
xi
- T
xu
= Δ
yi
- T
yu
. Equation (2) averages these two elementary estimators. NJ uses equal weights (λ
i
= 1 - λ
i
= 1/2) while BIONJ and MVR adjust λ
i
in order to minimize the variance of Δ
ui
and to have reliable distance estimates during all agglomeration stages. For this purpose, BIONJ and MVR use (approximate) models for the variances and covariances of the distance estimates in (Δ
ij
).
NJ*: generalizing NJ to incomplete distance matrices
When (Δ
ij
) is incomplete (missing entries are denoted as ∅), the criteria and equations above do not apply. We shall see in this section how they are generalized to define the NJ* algorithm, which keeps NJ's O(n3) time complexity and is nearly equivalent to NJ with complete matrices.
(a) Agglomeration criterion
Let = Q
xy
/(r - 2). Maximizing is the same as maximizing Q
xy
(Equation (3)), and we have:
, where ,
which can be rewritten as:
(5)
The sum in Equation (5) relates to terms representing how distant is the path joining x to y from other taxa i ≠ x, y (Δ
xi
+ Δ
yi
- Δ
xy
equals twice the distance between u and i), whereas the first term expresses the additional distance induced by Δ
xy
. It has been shown [63, 64] that the relative weight of these two factors is unique, due to consistency requirement, and can be interpreted as the mean acentrality of the xy pair [35, 36]. To extend this criterion to incomplete distance matrices, we estimate it using the set of taxa with non-missing distances: = {i ∈ L
r
: Δ
xi
, Δ
yi
≠ ∅}. Moreover, we assume Δ
xy
≠ ∅, and thus x, y ∈ . The normalization factor is then equal to || - 2 (instead of r - 2) and we obtain the following generalization of Equation (5):
which applies to incomplete distance matrices, and is identical to with complete ones. This equation further simplifies into:
(6)
Other solutions are possible to extend Equation (5), e.g. preserving Δ
xy
/(r - 2) term rather than transforming it into Δ
xy
/( - 2). Simulation results (not shown) indicate that criterion (6) has better topological accuracy than these alternatives. Theoretical results would be desirable to explain these observations and establish the properties of criterion (6), but a first simple explanation is that Equation (6) precisely corresponds to the value being computed on taxon subset. To be consistent on the whole set of taxa (L
r
), it is mandatory that the criterion is consistent on taxon subsets (, here), and Equation (6) satisfies this requirement.
Maximizing seems to require O(r3) time for each iteration, and thus a total time complexity of O(n4). However, efficient implementation allows for O(n3) total run time. At the first stage (r = n), and || values are computed and stored for all x, y ∈ L
n
, which requires O(n3) time. In the subsequent agglomeration stages, these values are updated as follows:
-
After step (a), for all i, j ∈ L
r
- {x, y} we remove from and || : Δ
xi
and Δ
xj
(if Δ
xi
≠ ∅ and Δ
xj
≠ ∅), and Δ
yi
and Δ
yj
(if Δ
ui
≠ ∅ and Δ
uj
≠ ∅).
-
After step (c), we compute and || for all i ∈ L
r
- {u}, and
-
for all i, j ∈ L
r
- {u}, we add Δ
ui
and Δ
uj
to and || (if Δ
ui
≠ ∅ and Δ
uj
≠ ∅).
Each of these three updating routines requires O(r2) time, just as pair selection using criterion (6), meaning that using instead of Q
xy
does not change the total O(n3) time complexity of the original NJ algorithm.
However, as discussed earlier, a limitation of criterion is that: (1) it cannot be computed when Δ
xy
= ∅, and (2) only the best pair is guaranteed (with additive distance) to be a cherry in the correct tree. When xy is the best pair in the complete additive distance matrix, but Δ
xy
is missing in the available distance matrix, then using does not provide any guaranty of correctness. This difficulty is partly alleviated when using a generalization of N
xy
, as this criterion selects all cherries in the correct tree with complete additive distances. When some of the cherries correspond to missing distances, we are still able to select the others that correspond to non-missing entries. Our generalization of N
xy
(Equation (4)) to incomplete distances is defined as follows. Let:
(7)
(8)
differs from N
xy
in that we sum both H terms, instead of multiplying them. This way we exploit all available information. Indeed, when Δ
xj
= ∅ and/or Δ
yi
= ∅ but the other entries are available, we still use H(Δ
xi
+ Δ
yi
- Δ
xy
- Δ
ij
) in while a multiplicative solution in the line of N
xy
would discard this term. Moreover, it is easily seen that = 2N
xy
with complete additive distances. To select among taxon pairs, we use the averaged form of , that is:
(9)
which expresses the mean number of quartets where the xy pair corresponds to a cherry.
However, selecting pairs using sometimes produces ties. In this case, we select the pair with higher || value, that is the pair which is supported by the larger number of quartets. But ties may still occur, in which case we use:
(10)
where Miss(z) = {i ∈ L
r
, ≠ z : Δ
iz
= ∅ } corresponds to missing entries for taxon z. counts the number of missing entries in the current matrix that will be removed in the next step (see reduction procedure (13)). Maximizing tends to quickly fill missing entries in the running distance matrix, which both frees from Δ
xy
≠ ∅ limitation and allows using Q
xy
pair selection criterion only. Finally, in some (very rare) cases, we still have ties and then maximize the continuous version [62] of :
(11)
Pair selection criteria (9), || (8), (10) and (11) are used in a lexicographic way: taxon pairs are ranked based on the first criterion (), the second one (||) is used in case of ties, etc. However, using these four criteria only would result in O(n4) time complexity. In order to preserve O(n3) run times, we first select the s top pairs based on criterion (6), and then use the other criteria in lexicographic order to select the pair to be agglomerated among these s pairs. As computing Equations (7) to (11) requires O(r2) or less per taxon pair, the total time complexity of pair selection is O(n3) (first selection using (6)) plus O(s∑r2) (final selection using (8) to (11)), i.e. O(n3). As explained above, does not provide any guaranty of correctness with missing distances, while and partly circumvent the difficulty. However, enables to extract the most promising pairs for agglomeration and we have seen (Figure 1) that using for s a small constant (typically 15) is sufficient to obtain high accuracy, meaning that, in practice, run times are in O(n3).
(b) Branch length estimation
Equation (1) is easily rewritten using non-missing entries only:
(12)
NJ uses the same weight w
i
for every taxon i. The same holds for NJ*, that is, w
i
= w = 1/(2(|| - 2)). Note that for the selected pair we have Δ
xy
, ≠ ∅, meaning that Equation (12) is always applicable. Just as with NJ, branch length estimation (12) requires O(r) time at each agglomeration stage.
(c) Matrix reduction
Equation (2) averages two elementary estimators, and with NJ this average is equally weighted. With missing distances it may occur that one of these two estimators is not applicable (e.g. when Δ
xi
≠ ∅), that both are applicable, or that none is applicable. Thus, in NJ* Equation (2) becomes:
(13)
where λ
i
= λ = 1/2. In the second and third cases, entries missing in the previous matrix are now present in the new, reduced matrix. We have seen that criterion (10) tends to maximize the number of such entries, in order to fill as fast as possible the missing distances in the running matrix. Just as branch length estimation (12), matrix reduction (13) requires O(r) time at each stage and does not impact total time complexity. Thus, NJ* requires O(n3) run times, when s is kept constant.
BIONJ*: improving the reduction step, a first simple solution
BIONJ* uses the same pair selection criteria as NJ*, and adapts to missing distances BIONJ reduction procedure. BIONJ uses the degree of freedom corresponding to the λ
i
parameter in Equation (2), in order to minimize the variance of the new Δ
ui
estimates in step (c). For this purpose, BIONJ assumes a simple Poisson model of the variances in the original (Δ
ij
) matrix, stating that the variance V
ij
of Δ
ij
is proportional to Δ
ij
. BIONJ also accounts for the covariances in (Δ
ij
) (see [3] for more details). It uses a single λ parameter for every xy pair, which does not depend on i and is given by
(14)
Again, this equation may be seen as an average and can be rewritten using available entries only as:
(15)
The reduction step (c) is achieved by BIONJ* as defined by Equation (13), but using so-defined λ* (instead of 1/2) when Δ
xi
≠ ∅ and Δ
yi
≠ ∅.
Moreover, BIONJ starts with variance matrix (V
ij
) = (Δ
ij
) and reduces this matrix at each stage using λ value from Equation (14) and equation:V
ui
= λV
xi
+ (1-λ)V
yi
- λ(1 - λ)V
xy
.
BIONJ* combines this formula with Equation (13) and (15) to reduce the variance matrix, that is:
(16)
Computing λ* using Equation (15) and achieving matrix reductions (13) and (16) requires O(r) run times. Thus, BIONJ* has O(n3) time complexity (when s is kept constant, else O(sn3)).
MVR*: improving BIONJ* using variances dedicated to distance supermatrices
The BIONJ variance model is well suited for one-gene studies where distance estimations all use the same number of sites (at least when gaps are removed). With phylogenomic studies, some distances are computed using a large number of genes, and thus are reliable, while other distances are based on a few genes and are poorly estimated. Moreover, some distances may be missing due to the absence of common genes between the two species being compared. Altogether, this implies that the BIONJ and BIONJ* variance model can be improved to better fit phylogenomic requirements. This section describes the MVR* algorithm that is intended to this purpose.
Steps (b) and (c) in the generic scheme are based on w
i
and λ
i
parameters, respectively. The MVR algorithm [4] generalizes the BIONJ approach and uses these degrees of freedom in order to minimize the variance of the new estimates T
ux
, T
uy
and Δ
ui
. The main difference from BIONJ is that MVR is able to deal with any variance-covariance model of the δ
ij
distance estimates, while BIONJ is restricted to the Poisson model. The MVR variant that we use here only considers the variances and neglects the covariances, thus assuming a weighted least-squares model (it was called MVR-WLS in [4], but is named MVR here for simplicity). Thus, MVR inputs a distance matrix (Δ
ij
) and the corresponding (V
ij
) variance matrix. We shall see in the next section how (V
ij
) is calculated to deal with phylogenomic data, and describe now the way MVR and MVR* use and update these matrices all along the agglomeration procedure.
MVR uses Q
xy
pair selection criterion (3), just as NJ and BIONJ, while MVR* uses the same criteria and selection procedure as NJ* and BIONJ*.
In MVR step (b), i.e. branch length estimation, w
i
weights in Equation (1) depend on i and are given by:
(17)
MVR* uses Equation (12) (instead of Equation (1)) to deal with missing entries, and adapts above Equation (17) by replacing L
r
by .
In MVR step (c), i.e. matrix reduction, a different λ
i
parameter is associated in Equation (2) to each taxon i ≠ x, y using:
(18)
This value puts more weight and confidence on (Δ
xi
- T
xu
) when the associated variance V
xi
is low, compared to V
yi
. Equation (18) is also used by MVR*, but combined with Equation (13) to deal with missing distances.
Finally, MVR (just like BIONJ) reduces the variance matrix at each agglomeration stage. To this purpose, MVR uses the following equation:
This equation is also used by MVR* in combination with Equation (16).
All the computations described above (except pair selection) require O(r) run times at each agglomeration stage, and thus MVR* has O(n3) time complexity, just as do NJ* and BIONJ*.
Estimating the variances associated to distance supermatrices
Distance supermatrices are computed [6, 19] from source matrices which are first rescaled, and then averaged. SDM [6] inputs a collection C = of k distance matrices – each defined on taxon set L
p
and estimated from sequences with size s
p
—, and deforms them, without changing their topological signal, so as to bring them as close as possible to each other before averaging. The first deformation is scaling, which multiplies each () distance matrix by a factor α
p
. The second (optional in SDM) deformation adds a constant a
ip
to every non-diagonal entries. Then, SDM averages the resulting modified matrices to obtain the () super-matrix that is defined by:
(19)
Neglecting the variance of the deformation factors, we obtain a simple expression of the variance of :
(20)
where is the variance of . Note that no covariance terms between any and estimates appear in Equation (20), as these source distances are estimated from different genes and are independent. Moreover, the covariances between the entries in the SDM supermatrix are neglected, as is the case in a number of (WLS) approaches [30, 32, 40].
Several studies have shown that the variance V
ij
associated with the evolutionary distance Δ
ij
(estimated from a single gene) is approximately equal to with ρ ≈ 2 [11, 65]. Based on various experiments (not shown), we have chosen the usual formula , which corresponds to default option in FITCH program. Equation (20) then becomes: