The classical supertree problem is to decide whether a set of partial trees is consistent and, if so, to construct a tree containing them all. Here, we introduce the classical methods for solving this problem, and explore natural generalizations to the supergenetree problem.
Let Γ be a set of n taxa (usually species in the case of the supertree problem, and genes in the case of the supergenetree problem), let Γi, 1 ≤ i ≤ k, be possibly overlapping subsets of Γ, and let 𝒢 = {G1, ..., Gk} be a set of trees where, for each 1 ≤ i ≤ k, Gi is a tree for Γi. Let tr(𝒢) be the set of triplets of 𝒢, defined as tr(𝒢) = tr(G1) ∪ ... ∪ tr(Gk). Let B(𝒢) = (Γ, E) be the triplet graph with the set of vertices Γ and the set of edges E = {xy : ∃ z ∈ Γ such that xy|z ∈ tr(𝒢)} (see Figure 2 for an example).
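The definitions above can be illustrated with a minimal Python sketch. It assumes a representation that is not part of the original text: a rooted tree is a nested tuple whose leaves are strings, and a triplet xy|z is stored as the pair (frozenset({x, y}), z). The function names are hypothetical.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree (assumed encoding)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def triplets(tree):
    """Return all triplets xy|z of a rooted tree as (frozenset({x, y}), z)."""
    if isinstance(tree, str):
        return set()
    result = set()
    child_leaves = [leaves(child) for child in tree]
    for i, child in enumerate(tree):
        # Triplets whose three leaves all lie below this child.
        result |= triplets(child)
        # x, y below the same child and z below another child give xy|z.
        outside = set().union(*(k for j, k in enumerate(child_leaves) if j != i))
        for x, y in combinations(sorted(child_leaves[i]), 2):
            for z in outside:
                result.add((frozenset((x, y)), z))
    return result

def triplet_graph(trees):
    """Return (vertices, edges) of the triplet graph of a list of trees."""
    vertices = set().union(*(leaves(t) for t in trees))
    edges = {pair for t in trees for pair, _ in triplets(t)}
    return vertices, edges
```

For instance, the two trees ((a, b), c) and ((c, d), a) yield the vertex set {a, b, c, d} and the two edges ab and cd.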
The classical BUILD algorithm [1] determines, in polynomial time, whether a set of triplets is consistent and, if so, constructs a tree T, possibly non-binary, compatible with them. The algorithm takes as input the graph B(𝒢). Let 𝒞 = {C1, ..., Cm} be the set of connected components of B(𝒢). If B(𝒢) has at least three vertices and m = 1, then 𝒢 is inconsistent, and the algorithm terminates. For example, the set of five gene trees of Figure 2 is inconsistent, as the corresponding triplet graph (including dotted lines) is connected. Otherwise, if m > 1, a polytomy is created over 𝒞, the internal node of the polytomy being the root r(T) of the compatible tree T under construction and its children being m subtrees with leafsets V(C1), ..., V(Cm), with their topology yet to be determined (where V(Ci) ⊆ Γ denotes the set of taxa appearing in Ci). The algorithm then recurses into each connected component, i.e. the subtree for V(Ci) is determined recursively from the graph defined by E|Ci = {xy : ∃ z ∈ Γ such that xy|z ∈ tr(𝒢) and x, y, z ∈ V(Ci)}. If, at any step, the considered graph has a single component containing more than two vertices, then 𝒢 is reported as an inconsistent set of trees and the algorithm terminates. Otherwise, recursion terminates when the graph has at most two vertices, eventually returning a supertree T. See Figure 3 for an example.
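The recursion just described can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation of [1]: it assumes triplets are given as tuples (x, y, z) meaning xy|z, that leaf labels are sortable strings, and that the output tree is a nested tuple; `components` and `build` are hypothetical names.

```python
def components(vertices, edges):
    """Connected components of an undirected graph, found by depth-first search."""
    adj = {v: set() for v in vertices}
    for x, y in edges:
        adj[x].add(y)
        adj[y].add(x)
    seen, comps = set(), []
    for v in sorted(vertices):
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def build(leafset, triplets):
    """Return a tree compatible with the triplets, or None if they are inconsistent."""
    leafset = set(leafset)
    if len(leafset) == 1:
        return next(iter(leafset))
    if len(leafset) == 2:
        return tuple(sorted(leafset))
    # Keep only the triplets whose three leaves lie in the current leafset.
    local = [(x, y, z) for x, y, z in triplets if {x, y, z} <= leafset]
    comps = components(leafset, [(x, y) for x, y, _ in local])
    if len(comps) == 1:
        return None  # a single component on three or more vertices: inconsistent
    subtrees = []
    for comp in comps:
        sub = build(comp, local)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)  # polytomy over the components
```

With the triplets ab|c and cd|a on {a, b, c, d}, the sketch returns (("a", "b"), ("c", "d")); with ab|c and bc|a on {a, b, c}, it returns None.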
The BUILD algorithm has been generalized into an algorithm called AllTrees [20] that outputs all supertrees compatible with a set of triplets when consistency holds. Instead of taking each element of 𝒞 as a separate child of r(T), all possible groupings, in other words all partitions of 𝒞, are considered (see Figure 3, right, for a choice of bipartitions). For each partition 𝒫 of 𝒞, a polytomy is created over 𝒫. The algorithm then iterates by considering each possible partition of each subgraph induced by each element of 𝒫. The algorithm is polynomial in the size of the output, which may be exponential in the size of the input.
A tree T compatible with 𝒢 such that no internal edge of T can be contracted so that the resulting tree is also compatible with 𝒢 is called a minimally resolved supertree. Minimally resolved supertrees contain all the information about all supertrees compatible with 𝒢, but in a "compressed" format. By exhibiting some properties of graph components, Semple shows in [27] how some partitions of the triplet graph components can be avoided without loss of generality. The resulting algorithm, named AllMinTrees [27], outputs a minimally resolved tree in polynomial time. However, it was shown in [15] that the cardinality of the solution space can be exponential in n = |Γ|, so that enumerating all minimally resolved supertrees takes exponential time.
Notice that, in general, the trees output by all these methods are non-binary trees.
Extensions to the SuperGeneTree problem
Natural exact solutions for the supertree problem can be extended to the supergenetree problem as follows:
(1) Use AllMinTrees to output all minimally resolved supertrees and, for each one (non-binary in general), find in linear time a resolution minimizing the reconciliation [16, 32] or duplication [31] cost. Among all optimally resolved trees, select one of minimum cost. This approach has the same complexity as the AllMinTrees algorithm, multiplied by a factor of n to resolve each tree.
(2) As we are seeking a binary tree, each created node x of the supergenetree T under construction should determine a bipartition of the leaves below it. Therefore, the AllTrees algorithm can be simplified by considering, instead of all partitions of 𝒞, only the bipartitions of the set of triplet graph components. See an example in Figure 3, right. Notice that this simplification is not applicable to the AllMinTrees algorithm since, by imposing bipartitions, the minimum resolution condition cannot be guaranteed.
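To make the search space of method (2) concrete, the following sketch enumerates the unordered bipartitions of a list of components. It is a hypothetical helper, not part of AllTrees as published; fixing the first component on the left side avoids listing each bipartition twice, so m components yield 2^(m-1) - 1 bipartitions.

```python
from itertools import combinations

def bipartitions(items):
    """Yield each unordered bipartition (left, right) of items exactly once."""
    items = list(items)
    if len(items) < 2:
        return  # fewer than two components: nothing to split
    rest = range(1, len(items))
    # r counts the components that join items[0]; at least one is left out.
    for r in range(len(items) - 1):
        for chosen in combinations(rest, r):
            left = [items[0]] + [items[i] for i in chosen]
            right = [items[i] for i in rest if i not in chosen]
            yield left, right
```

Three components C1, C2, C3 give the three bipartitions {C1} | {C2, C3}, {C1, C2} | {C3} and {C1, C3} | {C2}.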
A branch-and-bound approach
The tree space which is explored by the two exact methods described above can be reduced by using a branch-and-bound approach. Consider for example method (1) using the AllMinTrees algorithm. At each iteration of computing one minimally resolved tree, resolve the intermediate non-binary tree obtained at this step, using for example the linear-time algorithm presented in [16]. If its reconciliation cost is greater than the cost of a full tree already obtained at a previous stage of the AllMinTrees algorithm, then stop expanding this tree as this can only increase the reconciliation cost.
A dynamic programming approach
The recursive top-down method (2) can instead be handled by a dynamic programming approach that computes the minimum reconciliation cost of a tree on a subset of Γ from the reconciliation costs of trees on smaller subsets, similarly to the work done in [13].
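More precisely, let P be an arbitrary subset of Γ, and denote by R(P) the minimum duplication cost of a tree T|P having leafset P and compatible with the set tr(𝒢)|P of triplets of tr(𝒢) whose three leaves lie in P. Let B(𝒢)|P be the BUILD graph restricted to P and tr(𝒢)|P, and 𝒞P the set of its connected components. If C ⊆ 𝒞P, by V(C) we mean the union of the V(Ci) over all Ci ∈ C. Denote the complement of C by C̄ = 𝒞P \ C. Finally set d(V(C), V(C̄)) to 0 if s(V(C)) and s(V(C̄)) are separated in S, in which case (V(C), V(C̄)) is the bipartition of a speciation node, and to 1 otherwise, i.e. if (V(C), V(C̄)) is the bipartition of a duplication node. Then:

$$R(P) = \begin{cases} 0 & \text{if } |P| = 1, \\ \min\limits_{\emptyset \neq C \subsetneq \mathcal{C}_P} \left[ R(V(C)) + R(V(\overline{C})) + d(V(C), V(\overline{C})) \right] & \text{otherwise,} \end{cases}$$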
the value of interest being R(Γ). First note that, assuming constant-time lca queries over S, d(V(C), V(C̄)) can be computed in constant time if s(V(C)) and s(V(C̄)) can be accessed in constant time, since it suffices to check that the lca of s(V(C)) and s(V(C̄)) differs from both. To achieve this, we precompute s(X) for every subset X of Γ of size 1, 2, ..., n in increasing order. Noting that if |X| > 1, then for any x ∈ X, s(X) = lcaS(s(X \ {x}), s(x)), s(X) can be computed in constant time assuming that s(X \ {x}) was computed previously and assuming constant-time lca queries. As there are 2^n subsets of Γ, each computed in constant time, this preprocessing step takes time O(2^n).
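The preprocessing step can be sketched with subsets encoded as bitmasks. The sketch below rests on assumptions not found in the text: the species tree S is given as a hypothetical `parent` dictionary (the root maps to None), `species_of` maps each gene to its species, and the lca routine is a naive walk up the tree rather than the constant-time structure assumed in the analysis.

```python
def make_lca(parent):
    """Return a naive lca function for a rooted tree given as child -> parent."""
    def depth(v):
        d = 0
        while parent[v] is not None:
            v = parent[v]
            d += 1
        return d

    def lca(u, v):
        du, dv = depth(u), depth(v)
        while du > dv:
            u, du = parent[u], du - 1
        while dv > du:
            v, dv = parent[v], dv - 1
        while u != v:
            u, v = parent[u], parent[v]
        return u

    return lca

def species_lca_table(genes, species_of, lca):
    """s[mask] is the lca in S of the species of the genes selected by mask."""
    n = len(genes)
    s = [None] * (1 << n)
    for mask in range(1, 1 << n):
        low = mask & -mask            # lowest set bit: the gene x removed from X
        rest = mask ^ low             # X \ {x}, numerically smaller, so already filled
        leaf = species_of[genes[low.bit_length() - 1]]
        s[mask] = leaf if rest == 0 else lca(s[rest], leaf)
    return s
```

Iterating over masks in increasing numeric order guarantees that s(X \ {x}) is available when s(X) is computed, which plays the same role as processing subsets by increasing size.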
As for R(Γ), we can simply ensure that each R(P) is computed at most once by storing its value in a table for subsequent accesses (i.e. when R(P) is needed, we use its value if it has been computed, or compute it and store it otherwise). In this manner, each subset P takes time, not counting the recursive calls, proportional to the cost of constructing B(𝒢)|P, finding 𝒞P, and evaluating each bipartition of 𝒞P. We will simply use the fact that this cost is in O(2^n). As this has to be done for, at worst, each of the 2^n subsets of Γ, we get a total time O(2^n + 2^n · 2^n) = O(4^n). Note that this analysis probably overestimates the actual complexity of the algorithm, as we are assuming that each subset P and each component set 𝒞P are both always of size n. It is also worth mentioning that the R(P) recurrence can easily be adapted to the mutation cost (duplications + losses).
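A memoized version of the R(P) computation might look as follows. This is a sketch under several assumptions that are not in the original: triplets are tuples (x, y, z) meaning xy|z, the species tree is a hypothetical `parent` dictionary (root mapped to None), s(X) and lca are computed naively instead of being precomputed, and an inconsistent restriction is given infinite cost.

```python
from functools import lru_cache
from itertools import combinations

def min_duplications(genes, triplets, species_of, parent):
    """Minimum duplication cost R(Γ) over binary trees compatible with the triplets."""
    def ancestors(v):
        path = [v]
        while parent[path[-1]] is not None:
            path.append(parent[path[-1]])
        return path

    def lca(u, v):
        anc = set(ancestors(u))
        return next(w for w in ancestors(v) if w in anc)

    def s(leafset):
        nodes = [species_of[g] for g in leafset]
        node = nodes[0]
        for other in nodes[1:]:
            node = lca(node, other)
        return node

    def components(P):
        # Components of the BUILD graph restricted to P (naive union-find).
        rep = {g: g for g in P}
        def find(g):
            while rep[g] != g:
                g = rep[g]
            return g
        for x, y, z in triplets:
            if x in P and y in P and z in P:
                rep[find(x)] = find(y)
        groups = {}
        for g in P:
            groups.setdefault(find(g), set()).add(g)
        return [frozenset(c) for c in groups.values()]

    @lru_cache(maxsize=None)
    def R(P):
        if len(P) == 1:
            return 0
        comps = components(P)
        if len(comps) == 1:
            return float("inf")  # inconsistent restriction (assumed convention)
        first, rest = comps[0], comps[1:]
        best = float("inf")
        for r in range(len(rest)):            # keep at least one component out
            for chosen in combinations(rest, r):
                left = frozenset().union(first, *chosen)
                right = P - left
                sl, sr = s(left), s(right)
                d = 0 if lca(sl, sr) not in (sl, sr) else 1  # 0: speciation
                best = min(best, d + R(left) + R(right))
        return best

    return R(frozenset(genes))
```

On a species tree ((A, B), C) with one gene per species and no triplet, the sketch returns 0; adding the triplet a1c1|b1 forces one duplication.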
A greedy heuristic for the duplication cost
Instead of trying all partitions of the set of triplet graph components at each step of the AllTrees or AllMinTrees algorithms, if the goal is to minimize the duplication cost, then a natural greedy approach is to choose the best partition at each iteration, namely the one that minimizes the number of duplications preceding each speciation event. Such an approach pushes duplications down the tree. It leads to the following restricted version of the supergenetree problem.
MINIMUM PRE-SPECIATION DUPLICATION PROBLEM (MINPRESPEDUP PROBLEM):
Input: A species set Σ and a binary species tree S for Σ; a gene family Γ, a set Γi, 1 ≤ i ≤ k, of subsets of Γ, and a set 𝒢 = {G1, ..., Gk} of consistent gene trees where, for each 1 ≤ i ≤ k, Gi is a tree on Γi.
Output: Among all gene trees for Γ compatible with 𝒢, one tree T with a minimum number of pre-speciation duplication nodes.
We will show in the following section that even this restricted version of the supergenetree problem is hard. Here, we give the intuition behind a natural way of solving this problem, which reduces to repeated applications of the Max-Cut problem. Although Max-Cut is known to be NP-hard, efficient approximation algorithms exist (with an approximation factor of 0.878 [12]) that can be used for our purpose.
For the supertree problem, the triplet graph represents all triplets of the input trees that have to be combined. In the case of the supergenetree problem, another tree is available, the species tree S. A triplet xy|z found in the input trees can be reconciled with S, and if r(xy|z) is a duplication, then any tree compatible with 𝒢 must contain this duplication. Say that r(xy|z) is a required duplication mapped to r(S) if s(r(xy|z)) = r(S) and r(xy|z) is a duplication. Let us include this information in B(𝒢). More precisely, let 𝒞 denote the set of connected components of B(𝒢), and consider the component graph whose vertex set is 𝒞 and in which C1, C2 ∈ 𝒞 share an edge if C1 has vertices x, y and C2 has a vertex z such that xy|z is a triplet in tr(𝒢) with r(xy|z) being a required duplication mapped to r(S). If there are, say, d distinct such triplets, one can assign a weight of d to the C1C2 edge. See Figure 4 for an example.
Consider the problem of clustering the components of B(𝒢) into two parts B1, B2 of a bipartition in a way that minimizes the number of duplications preceding the speciation event r(S). For each C1 ∈ B1 and C2 ∈ B2 such that C1C2 is an edge of the graph on the components, a tree T rooted at the bipartition (B1, B2) contains the required duplications mapped to r(S) represented by the C1C2 edge. If there are k such edges between B1 and B2 totaling a weight of w, the single duplication at the root of T contains those w required duplications. In other words, we have "merged" w required duplications into one. It then becomes natural to find the bipartition of 𝒞 that merges a maximum number of duplications, i.e. for which the set of edges crossing between the two parts has maximum weight. This is the well-known Max-Cut problem. For instance, in Figure 4, the Max-Cut has a weight of 3 and leads to the optimal tree T1. Any other bipartition sends a required duplication to a lower level and is hence suboptimal. The T2 tree is obtained by first taking the suboptimal ({a1, b1, d1}, {c1, e1, f1}) bipartition, which creates a duplication at the root and defers the c1e1|f1 required duplication for later.
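The bipartition step can be illustrated on a small weighted graph. The sketch below solves Max-Cut by exhaustive search, which is only practical for a handful of components; it is not the approximation algorithm of [12], and the edge encoding (a dictionary from frozenset({u, v}) to a weight) is an assumption made for the example.

```python
from itertools import combinations

def max_cut(nodes, weights):
    """Return (best_weight, (left, right)) over all bipartitions of nodes."""
    nodes = list(nodes)
    first, rest = nodes[0], nodes[1:]
    best_weight, best_cut = -1, None
    for r in range(len(rest)):                 # keep at least one node on the right
        for chosen in combinations(rest, r):
            left = {first, *chosen}
            right = set(nodes) - left
            # An edge crosses the cut when exactly one endpoint is on the left.
            w = sum(wt for edge, wt in weights.items() if len(edge & left) == 1)
            if w > best_weight:
                best_weight, best_cut = w, (left, right)
    return best_weight, best_cut
```

On a triangle C1, C2, C3 with weights 2 on C1C2 and 1 on the two other edges, the best cut has weight 3, for example with C1 alone on one side.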
Note however that the components of B(𝒢) may themselves contain required duplications, which are not represented by the edges of the graph on the components. Thus, a Max-Cut must be applied recursively to both parts of the chosen bipartition. Consequently, this method does not benefit directly from the approximation factor known for the Max-Cut problem, as the approximation error accumulates with each application. In the next section, we show that, unlike Max-Cut, the MinPreSpeDup problem does not admit a constant factor approximation (unless P = NP).