Reconstructing a SuperGeneTree minimizing reconciliation
BMC Bioinformatics volume 16, Article number: S4 (2015)
Abstract
Combining a set of trees on partial datasets into a single tree is a classical method for inferring large phylogenetic trees. Ideally, the combined tree should display each input partial tree, which is only possible if input trees do not contain contradictory phylogenetic information. The simplest version of the supertree problem is thus to state whether a set of trees is compatible, and if so, construct a tree displaying them all. Classically, supertree methods have been applied to the reconstruction of species trees. Here we rather consider reconstructing a super gene tree in light of a known species tree S. We define the supergenetree problem as finding, among all supertrees displaying a set of input gene trees, one supertree minimizing a reconciliation distance with S. We first show how classical exact methods to the supertree problem can be extended to the supergenetree problem. As all these methods are highly exponential, we also exhibit a natural greedy heuristic for the duplication cost, based on minimizing the set of duplications preceding the first speciation event. We then show that both the supergenetree problem and its restriction to minimizing duplications preceding the first speciation are NP-hard to approximate within an n^{1-ϵ} factor, for any 0 < ϵ < 1. Finally, we show that a restriction of this problem to uniquely labeled speciation gene trees, which is relevant to many biological applications, is also NP-hard. Therefore, we introduce new avenues in the field of supertrees, and set the theoretical basis for the exploration of various algorithmic aspects of the problems.
Introduction
A fundamental task in evolutionary biology is to combine a collection of rooted trees on partial, possibly overlapping, sets of data, into a single rooted tree on the full set of data. This is the goal of supertree methods, mainly designed and used for the purpose of reconstructing a species supertree from a set of species trees (see overviews of early methods in [4–6], and more recent methods in [2, 10, 21, 24, 25, 28, 29]).
Ideally, the combined supertree should "display" each of the input trees, in the sense that by restricting the supertree to the leaf set of an input tree, we obtain the same input tree. However, this is not always possible, as the input trees may contain conflicting phylogenetic information. Note that considering a set of input trees that are not all compatible leads to the questions of correcting input gene trees or finding a subset of compatible input trees or subtrees [26]. Here, we leave these questions open and study the more direct formulation of the supertree problem, which is to consider a set of compatible input trees and find a supertree displaying them all. The BUILD algorithm by Aho et al. [1] can be used to test, in polynomial time, whether a collection of rooted trees is compatible, and if so, construct a compatible supertree, not necessarily fully resolved. This algorithm has been generalized in [9, 20] to output all compatible supertrees, and adapted in [27] to output all minimally resolved compatible supertrees.
Although supertree methods are classically applied to the construction of species trees, they can be used as well for the purpose of constructing gene trees. Several gene tree databases are available (see for example Ensembl Compara [30], Hogenom [22], Phog [11], MetaPHOrs [23], PhylomeDB [14], Panther [19]). For a gene family of interest, many different gene trees can therefore be available, and finding one single supertree displaying them all leads to a supertree question. On the other hand, given a gene of interest, a homology-based search tool is usually used to output all homologs in a set of genomes. The resulting gene family may be very large, involving distant gene sequences that may be hard to align, leading to weakly supported trees or, even worse, highly supported gene trees that are in fact incorrect. A standard way of reducing such errors is then to use a clustering algorithm based on sequence similarity, such as OrthoMCL [18], InParanoid [3], Proteinortho [17] or many others (see Quest for Orthologs links at http://questfororthologs.org/), to group genes into smaller sets of orthologs or inparalogs (paralogs that arose after a given speciation). Trees obtained for such partial gene families can then be combined by using a supertree method.
Considering input trees as parts of gene trees rather than as parts of species trees does not make any difference regarding the compatibility test procedure. However, for reconstructing a compatible "super gene tree", if a species tree is known for the taxa of interest, then it can be used as additional information to choose among all possible supertrees displaying the input partial gene trees. Indeed, a natural optimization criterion is to minimize the reconciliation cost, i.e. either the duplication or the duplication plus loss cost, induced by the output tree. We call the problem of finding a compatible supertree minimizing a reconciliation cost the supergenetree problem.
In this paper, we first show how the exact methods developed for the supertree problem can be adapted to the supergenetree problem. As for the original algorithms, all the extensions also have exponential worst-case time complexity. We then exhibit a heuristic, which can be seen as a greedy approach classically used for the supertree problems, that consists in progressively constructing the tree from its root to its leaves. The main module of this heuristic is to infer the minimum number of duplications preceding the first speciation, which we call the Minimum preSpeciation Duplication problem. We show that the supergenetree problem for the duplication cost, and even its restricted version the Minimum preSpeciation Duplication problem, are NP-hard to approximate within an n^{1-ϵ} factor, for any 0 < ϵ < 1 (n being the number of genes). Moreover, these inapproximability results even hold for instances in which there is only one gene per species in the input trees. Finally we consider the supergenetree problem with restrictions on input trees that are relevant to many biological applications. Namely, we require each gene to appear in at most one tree, and genes of any tree to be related through orthology only. This is for example the case of gene trees obtained for OrthoMCL clusters called orthogroups [18]. We show that even for this restriction, the supergenetree problem remains NP-hard for the duplication cost.
The following section introduces preliminary notations that will be required in the rest of the paper.
Preliminaries
Notations on trees
Given a set L, a tree T for L is a rooted tree whose leafset \mathcal{L}(T) is in bijection with L. We denote by V(T) the set of nodes and by r(T) the root of T. Given an internal node x of T, the subtree of T rooted at x is denoted T_{x}. The degree of an internal node x of T is the number of children of x. If T is binary, we arbitrarily set one of the two children of x as the left child x_{l} and the other as the right child x_{r}. We call (\mathcal{L}(T_{x_l}), \mathcal{L}(T_{x_r})) the bipartition of a node x of degree 2. (The term 'bipartition' is sometimes used, in the context of unrooted trees, to denote the nodes or leaves of the two components obtained after removing a given edge; this is not what we mean here.)
A node x is an ancestor of y if x is on the (inclusive) path between y and the root, and we then call y a descendant of x. Two nodes x and y are separated in T if none is an ancestor of the other. The lowest common ancestor (lca) of a subset L' of \mathcal{L}(T), denoted lca_{T}(L'), is the ancestor common to all nodes in L' that is the most distant from the root. The restriction T|_{L'} of T to L' is the tree with leafset L' obtained from the subtree of T rooted at lca_{T}(L') by removing all leaves that are not in L', and contracting all internal nodes of degree 2, except the root. We generalize this notation to a set of trees: for a set \mathcal{T} of trees on L, \mathcal{T}|_{L'} = {T|_{L'} : T ∈ \mathcal{T}}. Let T' be a tree such that \mathcal{L}(T') = L' ⊆ \mathcal{L}(T). We say that T displays T' iff T|_{L'} is the same tree as T'.
A triplet is a binary tree on a set L with |L| = 3. For L = {x, y, z}, we denote by xy|z the unique triplet t on L with root r(t) for which lca_{t}(x, y) ≠ r(t) holds.
A polytomy (or star tree) over a set L is a tree for L with a single internal node, which is of degree |L|.
A resolution B(T) of a nonbinary tree T is a binary tree respecting all the ancestral relations given by T. More precisely, B(T) is a binary tree such that \mathcal{L}(B(T)) = \mathcal{L}(T), and for any u, v ∈ V(T), if u is an ancestor of v in T, then lca_{B(T)}(\mathcal{L}(T_{u})) is an ancestor of lca_{B(T)}(\mathcal{L}(T_{v})).
Gene and species trees
Figure 1 is an illustration of the notations defined in this section.
A species tree S for a set Σ = {σ_{1},⋯,σ_{ t }} of species represents an ordered set of speciation events that have led to Σ: an internal node is an ancestral species at the moment of a speciation event, and its children are the new descendant species. Inside the species' genomes, genes undergo speciation when the species to which they belong do, but also duplications and losses (other events such as transfers can happen, but we ignore them here). A gene family is a set of genes Γ accompanied with a mapping function s : Γ → Σ mapping each gene to its corresponding species.
Consider a gene family Γ where each gene x ∈ Γ belongs to a species s(x) of Σ. The evolutionary history of Γ can be represented as a gene tree T for Γ, which is a rooted binary tree with its leafset in bijection with Γ, where each internal node refers to an ancestral gene at the moment of an event (either speciation or duplication). The mapping function s is generalized as follows: if x is an internal node of T, then s(x) = lca_{S}({s(x') : x' ∈ \mathcal{L}(T_{x})}).
An internal node x of T is called a speciation node if s(x_{l}) and s(x_{r}) are separated in S. Otherwise, x is a duplication node preceding the speciation event lca_{S}(s(x_{l}), s(x_{r})) if lca_{S}(s(x_{l}), s(x_{r})) is an internal node of S, otherwise it is a duplication inside the extant species lca_{S}(s(x_{l}), s(x_{r})). A duplication node x such that s(x) = r(S) is called a prespeciation duplication node. A gene tree T with all internal nodes being speciation nodes is called a speciation tree. Two genes x, y of \mathcal{L}(T) are orthologs in T if lca_{T}(x, y) is a speciation node.
The duplication cost of T is the number of duplication nodes of T. It reflects the minimum number of duplications required to explain the evolution of the gene family inside the species tree S according to T. A well-known reconciliation approach [7, 8] further makes it possible to recover, in linear time, the minimum number of losses underlying such an evolutionary history. We refer to the minimum number of duplications and losses required to explain T with respect to S as the reconciliation cost of T with respect to S, or simply the reconciliation cost if there is no ambiguity on the considered trees.
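As an illustration of these definitions, the following minimal sketch counts duplication nodes under the lca mapping. It assumes a representation that is not part of the paper: binary trees are nested 2-tuples whose leaves are strings, and `s` is a dictionary from gene names to species names. The function names are hypothetical, and the implementation is deliberately naive (it is not the linear-time reconciliation of [7, 8]).

```python
def leaves(t):
    """Return the set of leaf labels of a nested-tuple binary tree."""
    return {t} if isinstance(t, str) else leaves(t[0]) | leaves(t[1])

def lca_cluster(S, species):
    """Leaf set of the lowest node of S whose subtree contains all of `species`."""
    node = S
    while not isinstance(node, str):
        below = [c for c in node if species <= leaves(c)]
        if not below:
            break          # no child contains every species: node is the lca
        node = below[0]
    return frozenset(leaves(node))

def duplication_cost(T, S, s):
    """Number of duplication nodes of gene tree T reconciled with species tree S."""
    if isinstance(T, str):
        return 0
    left, right = T
    sl = lca_cluster(S, {s[g] for g in leaves(left)})
    sr = lca_cluster(S, {s[g] for g in leaves(right)})
    # s(x_l) and s(x_r) are separated in S exactly when their leaf sets are
    # disjoint; otherwise one is an ancestor of the other and x is a duplication.
    is_duplication = bool(sl & sr)
    return int(is_duplication) + duplication_cost(left, S, s) + duplication_cost(right, S, s)

# Toy instance (invented for illustration): species tree ((A,B),C).
S = (("A", "B"), "C")
s = {"a1": "A", "b1": "B", "c1": "C", "a2": "A", "b2": "B"}
T = ((("a1", "b1"), "c1"), ("a2", "b2"))
print(duplication_cost(T, S, s))
```

In this toy instance only the root of T is a duplication, since its two children map to nested nodes of S.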
Supergenetree problem statement
A set \mathcal{G} of gene trees is said to be consistent if there is a tree T, called a supergenetree for \mathcal{G}, displaying each tree of \mathcal{G}, and inconsistent otherwise. A supergenetree T for \mathcal{G} is said to be compatible with \mathcal{G}. For example, the four triplets in Figure 2 are consistent, and the gene tree T of Figure 1 is compatible with them. However, adding the dotted tree to the set of triplets makes the gene tree set inconsistent. Consistency of a set of trees can be tested in polynomial time [1]. For a consistent set of trees, the problem considered here is to find a compatible gene tree of minimum reconciliation cost with respect to a given species tree. A formal statement of the general problem follows.
MINIMUM SUPERGENETREE PROBLEM (MINSGT PROBLEM):
Input: A species set Σ and a binary species tree S for Σ; a gene family Γ, a set Γ_{i, 1 ≤ i ≤ k} of subsets of Γ, and a set \mathcal{G} = {G_{1}, G_{2}, ..., G_{k}} of consistent gene trees where, for each 1 ≤ i ≤ k, G_{i} is a tree for Γ_{i}.
Output: Among all gene trees for Γ compatible with \mathcal{G}, one tree T of minimum reconciliation cost.
When the considered cost is the duplication cost, the problem is called the Minimum Duplication SuperGeneTree Problem (MinDupSGT problem).
From the SuperTree to the SuperGeneTree Problem
The classical supertree problem is to state whether or not a set of partial trees are consistent, and if so construct a tree containing them all. Here, we introduce the classical methods for solving this problem, and explore natural generalizations to the supergenetree problem.
Let Γ be a set of n taxa (usually species in case of the supertree problem, and genes in case of the supergenetree problem), Γ_{i, 1 ≤ i ≤ k} be a set of possibly overlapping subsets of Γ, and \mathcal{G} = {G_{1}, G_{2}, ..., G_{k}} be a set of trees where, for each 1 ≤ i ≤ k, G_{i} is a tree for Γ_{i}. Let tr(\mathcal{G}) be the set of triplets of \mathcal{G}, defined as tr(\mathcal{G}) = {xy|z : ∃ 1 ≤ i ≤ k such that G_{i}|_{\{x, y, z\}} = xy|z}. Let \mathcal{T}(Γ, E) be the triplet graph with the set of vertices Γ and the set of edges E = {xy : ∃ z ∈ Γ such that xy|z ∈ tr(\mathcal{G})} (see Figure 2 for an example).
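The triplet set and the edges of the triplet graph can be computed directly from these definitions. The sketch below is only an illustration under assumed conventions: trees are nested tuples with string leaves, a triplet xy|z is encoded as the tuple `(x, y, z)` with x < y, and the function names are hypothetical.

```python
from itertools import combinations

def leaves(t):
    """Leaf labels of a nested-tuple tree (internal nodes may have any degree)."""
    return {t} if isinstance(t, str) else set().union(*(leaves(c) for c in t))

def triplets(t):
    """All triplets xy|z displayed by t, encoded as (x, y, z) with x < y."""
    if isinstance(t, str):
        return set()
    out = set()
    for child in t:
        out |= triplets(child)
    for i, child in enumerate(t):
        inside = leaves(child)
        outside = set().union(*(leaves(d) for j, d in enumerate(t) if j != i))
        # x and y lie below the same child and z below another one,
        # so lca(x, y) is a proper descendant of lca(x, y, z).
        for x, y in combinations(sorted(inside), 2):
            for z in outside:
                out.add((x, y, z))
    return out

def triplet_graph_edges(trees):
    """Edge set E of the triplet graph: one edge xy per triplet xy|z."""
    return {frozenset((x, y)) for t in trees for x, y, _ in triplets(t)}

G = [(("a", "b"), ("c", "d"))]
print(sorted(triplets(G[0])))
print(triplet_graph_edges(G))
```

For the single input tree ((a,b),(c,d)), the graph has the two edges ab and cd, and therefore two connected components.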
The classical BUILD algorithm [1] determines, in polynomial time, whether a set of triplets is consistent and if so constructs a tree T, possibly nonbinary, compatible with them. The algorithm takes as input the graph \mathcal{T} = \mathcal{T}(Γ, E). Let \mathcal{C}(\mathcal{T}) = {C_{1}, ..., C_{m}} be the set of connected components of \mathcal{T}. If \mathcal{T} has at least three vertices and |\mathcal{C}(\mathcal{T})| = 1, then \mathcal{G} is inconsistent, and the algorithm terminates. For example, the set of five gene trees of Figure 2 is inconsistent, as the corresponding triplet graph (including dotted lines) is connected. Otherwise, if |V(\mathcal{T})| ≥ 3, a polytomy is created over \mathcal{C}(\mathcal{T}), the internal node of the polytomy being the root r(T) of the compatible tree T under construction and its children being m subtrees with leafsets V(C_{1}), ..., V(C_{m}), with their topology yet to be determined (where V(C_{i}) ⊆ Γ denotes the set of taxa appearing in C_{i}). The algorithm then recurses into each connected component, i.e. the subtree for V(C_{i}) is determined recursively from the graph \mathcal{T}(V(C_{i}), E_{C_{i}}) defined by E_{C_{i}} = {xy : ∃ z ∈ Γ such that xy|z ∈ tr(\mathcal{G}|_{V(C_{i})})}. If, at any step, the considered graph has a single component containing more than two vertices, then \mathcal{G} is reported as an inconsistent set of trees and the algorithm terminates. Otherwise, recursion terminates when the graph has at most two vertices, eventually returning a supertree T.
See Figure 3 for an example.
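The recursion described above can be sketched in a few lines. This is a simplified illustration rather than the implementation of [1]: triplets xy|z are assumed to be given as tuples `(x, y, z)`, trees are returned as nested tuples, connected components are found with a small union-find, and the function name `build` is hypothetical.

```python
def build(leaf_set, triplets):
    """Return a tree displaying all triplets, or None if they are inconsistent."""
    leaf_set = set(leaf_set)
    if len(leaf_set) == 1:
        return next(iter(leaf_set))
    if len(leaf_set) == 2:
        return tuple(sorted(leaf_set))
    # Restrict to the triplets whose three leaves are all in the current set.
    local = [t for t in triplets if set(t) <= leaf_set]
    # Connected components of the triplet graph (edge xy for each xy|z).
    parent = {v: v for v in leaf_set}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for x, y, _ in local:
        parent[find(x)] = find(y)
    comps = {}
    for v in leaf_set:
        comps.setdefault(find(v), set()).add(v)
    if len(comps) == 1:
        return None  # a single component on three or more vertices: inconsistent
    children = []
    for comp in sorted(comps.values(), key=min):
        subtree = build(comp, local)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)  # polytomy over the components

print(build("abcd", [("a", "b", "c"), ("c", "d", "a")]))
print(build("abc", [("a", "b", "c"), ("b", "c", "a")]))
```

The first call returns a tree grouping a with b and c with d; the second returns `None`, because the two triplets ab|c and bc|a contradict each other.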
The BUILD algorithm has been generalized in an algorithm called AllTrees [20] to output all supertrees compatible with a set of triplets in case consistency holds. Instead of taking each element of \mathcal{C}(\mathcal{T}) as a separate child of r(T), all possible groupings, in other words all partitions of \mathcal{C}(\mathcal{T}), are considered (see Figure 3, right, for a choice of bipartitions). For each partition \mathcal{P}(\mathcal{C}(\mathcal{T})) of \mathcal{C}(\mathcal{T}), a polytomy is created over \mathcal{P}(\mathcal{C}(\mathcal{T})). The algorithm then iterates by considering each possible partition of each subgraph induced by each element of \mathcal{P}(\mathcal{C}(\mathcal{T})). The algorithm is polynomial in the size of the output, which may be exponential in the size of the input.
A tree T compatible with \mathcal{G} such that no internal edge of T can be contracted so that the resulting tree is also compatible with \mathcal{G} is called a minimally resolved supertree. Minimally resolved supertrees contain all the information about all supertrees compatible with \mathcal{G} but in a "compressed" format. By exhibiting some properties on graph components, Semple shows in [27] how some partitions of the triplet graph components can be avoided without loss of generality. The resulting algorithm, named AllMinTrees [27], outputs a minimally resolved tree in polynomial time. However, it was shown in [15] that the cardinality of the solution space can be exponential in n = |Γ|, leading to an exponential-time algorithm running in \Omega((n/2)^{n/2}).
Notice that, in general, the trees output by all these methods are nonbinary trees.
Extensions to the SuperGeneTree problem
Natural exact solutions for the supertree problem can be extended to the supergenetree problem as follows:

(1) Use AllMinTrees to output all minimally resolved supertrees, and for each one, which is nonbinary in general, find in linear time a resolution minimizing the reconciliation [16, 32] or duplication [31] cost. Among all optimally resolved trees, select one of minimum cost. Clearly this approach has the same complexity as the AllMinTrees algorithm, multiplied by a factor of n to resolve each tree, which is \Omega(n \cdot (n/2)^{n/2}).

(2) As we are seeking a binary tree, each created node x of the supergenetree T under construction should determine a bipartition (\mathcal{L}(T_{x_l}), \mathcal{L}(T_{x_r})). Therefore, the AllTrees algorithm can be simplified by considering, instead of all partitions of \mathcal{C}(\mathcal{T}), only all bipartitions of the triplet graph components set. See an example in Figure 3, right. Notice that this simplification approach is not applicable to the AllMinTrees algorithm, as by imposing bipartitions, the minimum resolution condition cannot be guaranteed.
A branch-and-bound approach
The tree space which is explored by the two exact methods described above can be reduced by using a branch-and-bound approach. Consider for example method (1) using the AllMinTrees algorithm. At each iteration of computing one minimally resolved tree, resolve the intermediate nonbinary tree obtained at this step, using for example the linear-time algorithm presented in [16]. If its reconciliation cost is greater than the cost of a full tree already obtained at a previous stage of the AllMinTrees algorithm, then stop expanding this tree as this can only increase the reconciliation cost.
A dynamic programming approach
The recursive top-down method (2) can instead be handled by a dynamic programming approach computing the minimum reconciliation cost of a tree on a subset of Γ according to the reconciliation costs of trees on smaller subsets, similarly to the work done in [13].
More precisely, let P be an arbitrary subset of Γ, and denote by R(P) the minimum duplication cost of a tree T_{P} having leafset P and compatible with the set \mathcal{G}|_{P} = {G_{1}|_{P}, G_{2}|_{P}, ..., G_{k}|_{P}}. Let \mathcal{T}(P, E_{P}) be the BUILD graph restricted to P and \mathcal{G}|_{P}, and \mathcal{C}(\mathcal{T}) = {C_{1}, ..., C_{m}} the set of its connected components. If C ⊆ \mathcal{C}(\mathcal{T}), by V(C) we mean \bigcup_{C_{i} ∈ C} V(C_{i}). Denote the complement of C by \bar{C} = \mathcal{C}(\mathcal{T}) \setminus C. Finally set d(V(C), V(\bar{C})) to 0 if s(V(C)) and s(V(\bar{C})) are separated in S, in which case (V(C), V(\bar{C})) is the bipartition of a speciation node, and 1 otherwise, i.e. if (V(C), V(\bar{C})) is the bipartition of a duplication node. Then R(P) = 0 if |P| = 1, and otherwise:
R(P) = \min_{\emptyset \neq C \subsetneq \mathcal{C}(\mathcal{T})} \{ R(V(C)) + R(V(\bar{C})) + d(V(C), V(\bar{C})) \},
the value of interest being R(Γ). First note that, assuming constant-time lca queries over S, d(V(C), V(\bar{C})) can be computed in constant time if s(V(C)) and s(V(\bar{C})) can be accessed in constant time, since it suffices to check that the lca of s(V(C)) and s(V(\bar{C})) differs from both. To achieve this, we precompute s(X) for every subset X of Γ of size 1, 2, ..., n in increasing order. Noting that if |X| > 1, then for any x ∈ X, s(X) = lca_{S}(s(X \ {x}), s(x)), s(X) can be computed in constant time assuming that s(X \ {x}) was computed previously and assuming constant-time lca queries. As there are 2^{n} subsets of Γ, each computed in constant time, this preprocessing step takes time O(2^{n}).
As for R(Γ), we can simply ensure that each R(P) is computed at most once by storing its value in a table for subsequent accesses (i.e. when R(P) is needed, we use its value if it has been computed, or compute it and store it otherwise). In this manner, each subset P takes time, not counting the recursive calls, proportional to |P||\mathcal{G}| + |P| + 2^{|\mathcal{C}(\mathcal{T})|} to construct \mathcal{T}(P, E_{P}), find \mathcal{C}(\mathcal{T}), and evaluate each bipartition of \mathcal{C}(\mathcal{T}). We will simply use the fact that |P||\mathcal{G}| + |P| + 2^{|\mathcal{C}(\mathcal{T})|} is in O(2^{n}). As this has to be done for, at worst, each of the 2^{n} subsets of Γ, we get a total time O(2^{n} + 2^{n} · 2^{n}) = O(4^{n}). Note that this analysis probably overestimates the actual complexity of the algorithm, as we are assuming that each subset P and each component set \mathcal{C}(\mathcal{T}) are both always of size n. It is also worth mentioning that the R(P) recurrence can easily be adapted to the mutation cost (duplications + losses).
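The memoized computation of R(P) can be sketched as follows. This is a naive illustration, not the authors' implementation: it assumes triplets xy|z are given as tuples `(x, y, z)`, the species tree is a nested 2-tuple with string leaves, `s` is a gene-to-species dictionary, and lca queries are done by traversal rather than in constant time. All names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations

def min_dup_cost(genes, triplets, S, s):
    """Minimum duplication cost of a binary tree on `genes` displaying `triplets`."""
    def leaves(t):
        return {t} if isinstance(t, str) else leaves(t[0]) | leaves(t[1])

    def lca_cluster(species):
        node = S
        while not isinstance(node, str):
            below = [c for c in node if species <= leaves(c)]
            if not below:
                break
            node = below[0]
        return frozenset(leaves(node))

    def components(P):
        # Connected components of the BUILD graph restricted to P.
        parent = {v: v for v in P}
        def find(v):
            while parent[v] != v:
                v = parent[v]
            return v
        for x, y, z in triplets:
            if x in P and y in P and z in P:
                parent[find(x)] = find(y)
        groups = {}
        for v in P:
            groups.setdefault(find(v), set()).add(v)
        return [frozenset(g) for g in groups.values()]

    @lru_cache(maxsize=None)
    def R(P):
        if len(P) == 1:
            return 0
        comps = components(P)
        if len(comps) == 1:
            return float("inf")  # the restriction to P is inconsistent
        best = float("inf")
        first, rest = comps[0], comps[1:]
        # Keep comps[0] on the left side so each bipartition is tried once;
        # k < len(rest) guarantees that the right side is never empty.
        for k in range(len(rest)):
            for extra in combinations(rest, k):
                left = frozenset().union(first, *extra)
                right = P - left
                sl = lca_cluster({s[g] for g in left})
                sr = lca_cluster({s[g] for g in right})
                d = 1 if sl & sr else 0  # 0 iff separated in S (speciation)
                best = min(best, R(left) + R(right) + d)
        return best

    return R(frozenset(genes))

s = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
print(min_dup_cost(s, [], ("A", "B"), s))
print(min_dup_cost(s, [("a1", "a2", "b1"), ("b1", "b2", "a1")], ("A", "B"), s))
```

Without constraints, a single duplication at the root suffices for this toy family; forcing a1a2|b1 and b1b2|a1 pushes one duplication into each species and raises the cost to 2.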
A greedy heuristic for the duplication cost
Instead of trying all partitions of the triplet graph components set at each step of the AllTrees or AllMinTrees algorithms, if the goal is to minimize the duplication cost, then a natural greedy approach would be to choose the best partition at each iteration, namely the one allowing to minimize the number of duplications preceding each speciation event. Such an approach would result in pushing duplications down the tree. It leads to the following restricted version of the supergenetree problem.
MINIMUM PRESPECIATION DUPLICATION PROBLEM (MINPRESPEDUP PROBLEM):
Input: A species set Σ and a binary species tree S for Σ; a gene family Γ, a set Γ_{i, 1 ≤ i ≤ k} of subsets of Γ, and a set \mathcal{G} = {G_{1}, G_{2}, ..., G_{k}} of consistent gene trees where, for each 1 ≤ i ≤ k, G_{i} is a tree on Γ_{i}.
Output: Among all gene trees for Γ compatible with \mathcal{G}, one tree T with a minimum number of prespeciation duplication nodes.
We will show in the following section that even this restricted version of the supergenetree problem is hard. Here, we give the intuition of a natural way of solving this problem, which reduces to repeated applications of the MaxCut problem. Although MaxCut is known to be NP-hard, efficient approximation algorithms exist (achieving an approximation ratio of 0.878 [12]) that can be used for our purpose.
For the supertree problem, the triplet graph \mathcal{T} = \mathcal{T}(Γ, E) represents all triplets of the input trees that have to be combined. In the case of the supergenetree problem, another tree is available, the species tree S. A triplet xy|z found in the input trees \mathcal{G} can be reconciled with S, and if r(xy|z) is a duplication, then any tree compatible with \mathcal{G} must contain this duplication. Say that r(xy|z) is a required duplication mapped to r(S) if s(r(xy|z)) = r(S) and r(xy|z) is a duplication. Let us include this information in \mathcal{T}. More precisely, let \mathcal{C} = \mathcal{C}(\mathcal{T}) denote the set of connected components of \mathcal{T}, and let \mathcal{T}(\mathcal{C}) be the graph whose vertex set is \mathcal{C}, and in which C_{1}, C_{2} ∈ \mathcal{C} share an edge if C_{1} has vertices x, y and C_{2} has a vertex z such that xy|z is a triplet in \mathcal{G} with r(xy|z) being a required duplication mapped to r(S). If there are, say, d distinct such triplets, one can possibly set a weight of d to the C_{1}C_{2} edge. See Figure 4 for an example.
Consider the problem of clustering the components of \mathcal{T}(\mathcal{C}) into two parts B_{1}, B_{2} of a bipartition in a way minimizing the number of duplications preceding the speciation event r(S). For each C_{1} ∈ B_{1} and C_{2} ∈ B_{2} such that C_{1}C_{2} is an edge of \mathcal{T}(\mathcal{C}), a tree T rooted at the bipartition (B_{1}, B_{2}) contains the required duplications mapped to r(S) represented by the C_{1}C_{2} edge. If there are k such edges between B_{1} and B_{2} totaling a weight of w, the single duplication at the root of T contains those w required duplications. In other words, we have "merged" w required duplications into one. It then becomes natural to find the bipartition of \mathcal{T}(\mathcal{C}) that merges a maximum of duplications, i.e. that contains a set of edges crossing between the two parts of maximum weight. This is the well-known MaxCut problem. For instance in Figure 4, the MaxCut has a weight of 3 and leads to the optimal tree T_{1}. Any other bipartition sends a required duplication to a lower level and is hence suboptimal. The T_{2} tree is obtained from first taking the suboptimal ({a_{1}, b_{1}, d_{1}}, {c_{1}, e_{1}, f_{1}}) bipartition, which creates a duplication at the root and defers the c_{1}e_{1}f_{1} required duplication for later.
Note however that the components of \mathcal{T} may contain required duplications themselves, which are not represented by the edges of \mathcal{T}(\mathcal{C}). Thus, a MaxCut must then be applied recursively on both parts of the chosen bipartition. Therefore, this method does not benefit directly from the efficient approximation factor known for the MaxCut problem, as the approximation error stacks with each application. In the next section, we show that, unlike MaxCut, the MinPreSpeDup problem cannot admit a constant factor approximation (unless P = NP).
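To make one step of this heuristic concrete, the sketch below finds a maximum-weight bipartition of the component graph by exhaustive search. It is an exponential-time illustration for small instances only, not the 0.878-approximation of [12]. The encoding is an assumption: components are arbitrary hashable labels and `weights` maps `frozenset({u, v})` to the number of required duplications on the edge uv.

```python
from itertools import combinations

def max_cut(vertices, weights):
    """Exhaustive weighted MaxCut over bipartitions with two non-empty sides.

    Requires at least two vertices. Returns (weight, side_1, side_2).
    """
    vertices = list(vertices)
    assert len(vertices) >= 2, "a bipartition needs at least two components"
    first, rest = vertices[0], vertices[1:]
    best_weight, best_side = -1, None
    # Fix `first` on side 1 so that each bipartition is examined once;
    # k < len(rest) keeps side 2 non-empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            side = {first, *extra}
            # An edge crosses the cut when exactly one endpoint is in `side`.
            w = sum(wt for edge, wt in weights.items() if len(edge & side) == 1)
            if w > best_weight:
                best_weight, best_side = w, side
    return best_weight, best_side, set(vertices) - best_side

# Invented toy component graph with three components.
w = {frozenset({1, 2}): 2, frozenset({2, 3}): 1, frozenset({1, 3}): 1}
print(max_cut([1, 2, 3], w))
```

In the heuristic, the two returned sides would become the bipartition at the current node, and the same search would then be repeated inside each side.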
Inapproximability of the MinDupSGT and MinPreSpeDupSGT problems
Through the rest of this section, we denote by n = |Γ| the size of the considered gene family. We show that both the MinDupSGT problem and its restriction the MinPreSpeDupSGT problem are NP-hard.
Theorem 1 The MinDupSGT and MinPreSpeDupSGT problems are both NP-hard to approximate within a factor of n^{1-ϵ} for any constant 0 < ϵ < 1. Moreover, this result holds for both problems even when restricted to instances having at most one gene per species in Γ.
Proof We use a reduction from the minimum k-colorability problem. Recall that a graph H = (V, E) is k-colorable if there is a partition {V_{1}, V_{2}, ..., V_{k}} of V into independent sets (i.e. if x, y ∈ V_{i} for some 1 ≤ i ≤ k, then xy ∉ E). It is now well-known [33] that the smallest k for which H is k-colorable cannot be approximated within a factor of |V|^{1-ϵ} unless P = NP.
Now, given a graph H = (V, E), we construct a gene set Γ, a set of rooted triplet gene trees \mathcal{G} and a species tree S such that H is k-colorable if and only if \mathcal{G} is compatible with some gene tree T having at most k − 1 duplications when reconciled with S. Using the same construction, we also show that H is k-colorable if and only if \mathcal{G} is compatible with some gene tree T having at most k − 1 prespeciation duplications when reconciled with S. In both cases, the gene-species mapping s is bijective, proving the second part of the theorem statement.
Let Γ = {v_{1}, v_{2} : v ∈ V} and for each edge vw ∈ E, add the triplets v_{1}v_{2}|w_{1}, v_{1}v_{2}|w_{2}, w_{1}w_{2}|v_{1} and w_{1}w_{2}|v_{2} to \mathcal{G}. Observe that this forces any tree T that displays \mathcal{G} to display the tree ((v_{1}, v_{2}), (w_{1}, w_{2})). Add one species to Σ for each gene of Γ so that the gene-species mapping s is bijective. As for S, first let S_{1} be any binary tree with one leaf for each member of {s(v_{1}) : v ∈ V}, and in the same manner let S_{2} be any binary tree with one leaf for each member of {s(v_{2}) : v ∈ V}. Obtain S by connecting the root of S_{1} and the root of S_{2} under a common parent r(S). Thus s(v_{1}) and s(v_{2}) are separated by r(S) for any v ∈ V. Clearly, \mathcal{G} and S can be constructed in polynomial time.
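The construction is simple enough to write down explicitly. The sketch below builds the instance from a graph given as a vertex list and an edge list. The naming scheme (`v_1`, `v_2`, `sp_...`) and the choice of caterpillar trees for S_{1} and S_{2} are assumptions made for illustration; the proof allows any binary trees there.

```python
def reduction_instance(V, E):
    """Build (genes, triplets, S, s) from a graph H = (V, E).

    Triplets xy|z are encoded as tuples (x, y, z); S is a nested 2-tuple.
    """
    genes = [f"{v}_{i}" for v in V for i in (1, 2)]
    triplets = []
    for v, w in E:
        v1, v2, w1, w2 = f"{v}_1", f"{v}_2", f"{w}_1", f"{w}_2"
        # Together these four triplets force ((v1, v2), (w1, w2)).
        triplets += [(v1, v2, w1), (v1, v2, w2), (w1, w2, v1), (w1, w2, v2)]
    # One species per gene, so the gene-species mapping is bijective.
    s = {g: "sp_" + g for g in genes}

    def caterpillar(labels):
        tree = labels[0]
        for label in labels[1:]:
            tree = (tree, label)
        return tree

    S1 = caterpillar([s[f"{v}_1"] for v in V])
    S2 = caterpillar([s[f"{v}_2"] for v in V])
    return genes, triplets, (S1, S2), s

genes, triplets, S, s = reduction_instance(["u", "v", "w"], [("u", "v")])
print(len(genes), len(triplets))
print(S)
```

Each vertex contributes two genes and each edge four triplets, so the instance has size linear in the size of H.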
Claim 1 : if H is kcolorable, then we can find a tree T compatible with \mathcal{G}\phantom{\rule{0.5em}{0ex}} having at most k − 1 duplications. Moreover each such duplication x is a prespeciation duplication (i.e. s(x) = r(S)).
Let {V_{1}, V_{2},..., V_{ k }} be a kcoloring of H. For each 1 ≤ i ≤ k, let T_{ i } be the tree with leafset {V}_{i}^{\prime}=\left\{{v}_{1},\phantom{\rule{2.36043pt}{0ex}}{v}_{2}:v\in {V}_{i}\right\} that has only speciations, i.e. T_{ i } is S{}_{s\left({V}_{i}^{\prime}\right)} (because all genes in V'_{ i }belong to a different species). Notice that s(r(T_{ i })) = r(S), since r(S) separates v_{1} from v_{2} for all v ∈ V. Obtain T by taking any binary tree on k leaves (and hence k − 1 internal nodes), then replacing each leaf by a distinct T_{ i }. In this manner, T has k − 1 duplications since only the internal nodes of T that do not belong to any T_{ i } need to be duplications. Moreover, each duplication node x has s(x) = r(S). It remains to show that T is compatible with \mathcal{G}\phantom{\rule{0.5em}{0ex}}. It suffices to observe that all triplets of \mathcal{G}\phantom{\rule{0.5em}{0ex}} are of the form v_{1}v_{2}w_{ h } with h ∈ {1, 2}, and that such a triplet being in \mathcal{G}\phantom{\rule{0.5em}{0ex}} implies that vw ∈ E. For such a triplet, we must then have v ∈ V_{ i } and w ∈ V_{ j } with i ≠ j, implying {v}_{1},\phantom{\rule{2.36043pt}{0ex}}{v}_{2}\in {V}_{i}^{\prime} and {w}_{h}\in {V}_{j}^{\prime}. By the construction of T, v_{1}v_{2}w_{ h } must be a triplet of T, as desired.
Claim 2: if there is a tree T compatible with \mathcal{G} having k − 1 duplications, then H is k-colorable. Moreover, if T has k − 1 duplications such that each duplication x has s(x) = r(S), then H is k-colorable.
Let T be a tree compatible with \mathcal{G} having k − 1 duplications. Call a node x of T S-maximal if x is not a duplication node mapped to r(S) but every proper ancestor of x is a duplication mapped to r(S). Let X = {x_{1}, x_{2}, ..., x_{m}} be the set of S-maximal nodes of T. Note that if y ≠ r(T) is a duplication mapped to r(S), then so is the parent of y. This implies that every leaf ℓ of T has at least one ancestor x_{i} in X, namely the highest (i.e. closest to the root) ancestor of ℓ that is not a duplication mapped to r(S) (such a node always exists, since ℓ is itself one such node). Moreover, x_{i} is unique, as no other x_{j} ∈ X can be an ancestor of x_{i}. Therefore, {\mathcal{L}(T_{x_{1}}), ..., \mathcal{L}(T_{x_{m}})} is a partition of \mathcal{L}(T). We next show that m ≤ k. Let T' be the tree obtained by removing all proper descendants of x_{i} in T, for all 1 ≤ i ≤ m. Then T' is a binary tree with m leaves, and all its m − 1 internal nodes are duplications mapped to r(S). Since T has no more than k − 1 duplications (in either case of the claim), T' has at most k − 1 internal nodes and therefore at most k leaves. We deduce that m ≤ k.
Observe that if vw ∈ E, then α = lca(v_{1}, v_{2}, w_{1}, w_{2}) must be a duplication such that s(α) = r(S). Indeed, α separates lca(v_{1}, v_{2}) from lca(w_{1}, w_{2}) since T displays ((v_{1}, v_{2}), (w_{1}, w_{2})). But since s(lca(v_{1}, v_{2})) = s(lca(w_{1}, w_{2})) = r(S) by the construction of S, s(α) can only be r(S) as well, and so α must be a duplication.
Now, let V_{ i } = {v : v_{1} is a descendant of x_{ i }} for each 1 ≤ i ≤ m. Take v, w ∈ V_{ i } for some i. We show that vw ∉ E, and thus that {V_{1},..., V_{ m }} forms a coloring of H with at most k colors. The argument applies whether each duplication maps to r(S) or not, proving both parts of the claim. Suppose for the sake of contradiction that vw ∈ E, but v, w ∈ V_{ i }. In T, lca(v_{1}, w_{1}) must be a descendant of x_{ i }, since x_{ i } is a common ancestor of v_{1} and w_{1} by the definition of V_{ i }. Moreover, lca(v_{1}, w_{1}) ≠ x_{ i } since lca(v_{1}, w_{1}) = lca(v_{1}, v_{2}, w_{1}, w_{2}) is a duplication mapped to r(S), as shown above, while x_{ i } is not such a duplication, by its definition. Therefore, lca(v_{1}, w_{1}) is a proper descendant of x_{ i }. But s(lca(v_{1}, w_{1})) = r(S) = s(x_{ i }) implies that x_{ i } is a duplication mapped to r(S), a contradiction. We conclude that {V_{1},..., V_{ m }} with m ≤ k is a proper coloring of H.
This reduction, together with the fact that the k-coloring problem is NP-hard to approximate within a factor of n^{1−ϵ}, proves the Theorem. □
Independent Speciation trees
We now consider the MinDupSGT problem in the special case where the input gene trees are independent speciation trees, meaning: (1) each gene of Γ appears in at most one gene tree leafset, and (2) the gene trees of \mathcal{G} = {G_{1}, G_{2}, ..., G_{k}} are all speciation trees with respect to the species tree S. Our objective is to find a gene tree T compatible with \mathcal{G} minimizing duplications that also maintains the orthology relationships specified by \mathcal{G}. In other words, we require that for every G_{i} ∈ \mathcal{G}, T|_{\mathcal{L}(G_{i})} has only speciations. We say that a gene tree T that satisfies this property preserves the speciations of \mathcal{G}. Note that if T preserves the speciations of \mathcal{G}, then it is necessarily compatible with \mathcal{G}. We call T|_{\mathcal{L}(G_{i})} the copy of G_{i} in T.
MINIMUM SPECIATION SUPERGENETREE (MINSPECSGT PROBLEM):
Input: A species set Σ and a binary species tree S for Σ; a gene family Γ, a set {Γ_{i}}_{1≤i≤k} of disjoint subsets of Γ, and a set \mathcal{G} = {G_{1}, G_{2}, ..., G_{k}} of consistent independent speciation trees such that, for each 1 ≤ i ≤ k, G_{i} is a tree for Γ_{i}.
Output: Among all gene trees for Γ that preserve the speciations of \mathcal{G}, one tree T of minimum duplication cost.
Notice that, since no gene of Γ appears more than once in the set of input trees, \mathcal{G} always admits a solution. Indeed, taking any binary tree on k leaves and replacing each leaf by a distinct G_{i} achieves the desired result. However, although this restricted problem appears easier, we show that finding such a gene tree T minimizing the number of duplications is still hard.
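The trivial solution described above, and its duplication cost, can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: trees are nested pairs with string leaves, the function names are hypothetical, and duplications are counted with the usual LCA-mapping rule (a node x is a duplication when s(x) equals s of one of its children), which we take to be the reconciliation convention used in this paper.

```python
def caterpillar(items):
    """Join the items (leaves or subtrees) into an arbitrary binary tree."""
    tree = items[0]
    for item in items[1:]:
        tree = (tree, item)
    return tree


def clades(tree):
    """Leaf sets of all nodes of a nested-pair tree; the root's clade is last."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    left, right = clades(tree[0]), clades(tree[1])
    return left + right + [left[-1] | right[-1]]


def count_duplications(gene_tree, species_of, species_tree):
    """Count duplications of gene_tree under the LCA mapping into species_tree."""
    s_clades = clades(species_tree)

    def lca_map(species):
        # Smallest clade of S containing the given set of species.
        return min((c for c in s_clades if species <= c), key=len)

    dups = 0

    def visit(node):
        nonlocal dups
        if isinstance(node, str):
            return lca_map(frozenset([species_of[node]]))
        left, right = visit(node[0]), visit(node[1])
        here = lca_map(left | right)
        if here in (left, right):  # mapped to the same species node as a child
            dups += 1
        return here

    visit(gene_tree)
    return dups


# Hypothetical example: two speciation trees with disjoint genes.
S = (("A", "B"), "C")
species_of = {"a1": "A", "b1": "B", "a2": "A", "c2": "C"}
G1, G2 = ("a1", "b1"), ("a2", "c2")
T = caterpillar([G1, G2])  # any binary tree on k = 2 leaves, leaves replaced by G_i
```

In this example the single joining node is a duplication, so the trivial supertree costs k − 1 = 1, while each G_{i} on its own has no duplication.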
Theorem 2 The decision version of the MinSpecSGT problem is NP-complete, i.e. it is NP-complete to decide if a species tree S and a set of independent speciation trees \mathcal{G} admit a supertree T that preserves the speciations of \mathcal{G} with at most k duplications.
Proof The problem is easily seen to be in NP, as it is easy to verify in polynomial time that a given gene tree T is compatible with \mathcal{G}, preserves its speciations and has at most k duplications. For NP-hardness, we turn to the decision version of the k-colorability problem. That is, for a given k, deciding if a graph H = (V, E) is k-colorable is NP-hard. We create from H a species tree S and a set of independent speciation trees \mathcal{G} such that H is k-colorable if and only if S and \mathcal{G} admit a supertree T with at most k − 1 duplications.
Let n = |V|, and denote V = {v_{1}, ..., v_{n}}. To create S, start with any binary tree S' on \binom{n}{2} leaves. Denote this leafset W = {w_{i,j} : 1 ≤ i < j ≤ n}, so that there is a one-to-one correspondence between W and the unordered pairs of V. Then, add a special leaf a by joining it with the root of S' under a common parent p, and finally obtain S by adding another special leaf b by joining it with p under a common parent. Therefore, the species set is Σ = \mathcal{L}(S) = W ∪ {a, b}.
For the construction of each gene tree G ∈ \mathcal{G}, we simplify notation by labeling each leaf g of G by s(g) directly (e.g. if we say that G is of the form (a, b), we mean that G has two leaves g_{a}, g_{b} such that s(g_{a}) = a and s(g_{b}) = b). In this manner, since all trees of \mathcal{G} contain only speciations, each tree G ∈ \mathcal{G} must be a subtree of S (or it is obtained from such a subtree by contracting edges). Also recall that we are assuming that each gene appears in at most one gene tree of \mathcal{G}, and so the genes from two distinct trees must also be distinct (even if they share the same label).
In \mathcal{G}, we first add k trees of the form (a, b), plus one tree G_{i} for each vertex v_{i} in H. The tree G_{i} corresponding to v_{i} ∈ V is a copy of S from which we remove every leaf except those w_{j,k} for which one of j = i or k = i holds, and v_{j}v_{k} ∈ E (i.e. we keep the leaves of W that correspond to an edge incident to v_{i}). Also contract the degree-2 nodes of G_{i}. Notice that if v_{i}v_{j} ∈ E and i < j, then both G_{i} and G_{j} contain a gene in the w_{i,j} species. Also, if v_{i}v_{j} ∉ E, then G_{i} and G_{j} have no genes from a common species.
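A small Python sketch of this construction may help. It is a hedged illustration only: the function names, the labels "w_i,j", the nested-pair tree encoding and the use of a caterpillar for the arbitrary tree S' are our assumptions. As in the text, gene trees are labeled directly by species. A vertex with a single neighbor yields a one-leaf tree, and an isolated vertex yields no tree (None).

```python
from itertools import combinations


def caterpillar(items):
    """Join the items into an arbitrary binary tree (nested pairs)."""
    tree = items[0]
    for item in items[1:]:
        tree = (tree, item)
    return tree


def restrict(tree, keep):
    """Restrict a tree to the leaves in keep, contracting degree-2 nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def build_instance(n, edges, k):
    """Vertices are 1..n (n >= 2); edges are pairs of vertices; k is the color bound."""
    edge_set = {tuple(sorted(e)) for e in edges}
    # One species w_{i,j} per unordered pair of vertices.
    w = {pair: f"w_{pair[0]},{pair[1]}"
         for pair in combinations(range(1, n + 1), 2)}
    s_prime = caterpillar(list(w.values()))
    species_tree = ((s_prime, "a"), "b")       # add leaf a, then leaf b
    ab_trees = [("a", "b") for _ in range(k)]  # k trees of the form (a, b)
    vertex_trees = {}
    for i in range(1, n + 1):
        # Keep the species that correspond to edges incident to v_i.
        keep = {w[e] for e in edge_set if i in e}
        vertex_trees[i] = restrict(species_tree, keep)
    return species_tree, ab_trees, vertex_trees


# Path v_1 - v_2 - v_3 with k = 2.
S, ab_trees, G = build_instance(3, [(1, 2), (2, 3)], 2)
```

For this path, the trees of v_{1} and v_{2} share the species w_{1,2}, whereas the trees of the non-adjacent vertices v_{1} and v_{3} share none, as noted in the text.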
Claim 1: if H is k-colorable, then S and \mathcal{G} admit a supertree T having at most k − 1 duplications.
Let {V_{1}, ..., V_{k}} be a k-partition of V into independent sets. Take any h such that 1 ≤ h ≤ k. Recall that if v_{i}, v_{j} ∈ V_{h}, then G_{i} and G_{j} share no gene from a common species (since v_{i}v_{j} ∉ E). Thus the trees in \mathcal{G}_{h} = {G_{i} : v_{i} ∈ V_{h}} are all disjoint in terms of species. Let Σ_{h} be the set of species that appear in some tree of \mathcal{G}_{h}. Then, the tree S|_{Σ_{h}} contains a copy of each tree in \mathcal{G}_{h}, and none of these copies overlap. Obtain T_{h} by joining a gene labeled a to r(S|_{Σ_{h}}) under a common parent p, then joining a gene labeled b to p under a new common parent. Now, T_{h} contains a copy of each tree in \mathcal{G}_{h} and a copy of one of the (a, b) trees. By taking a tree with k leaves (where, at worst, each of the k − 1 internal nodes is a duplication), and replacing its leaves by the speciation trees T_{1}, ..., T_{k}, we obtain a gene supertree T, which preserves the speciations of \mathcal{G} and has at most k − 1 duplications.
Claim 2: if S and \mathcal{G} admit a supertree T having k − 1 duplications, then H is k-colorable.
We first show that if T has k − 1 duplications, then it must have exactly k speciations mapped to r(S). It cannot have more, as there would then be more than k − 1 duplications. Suppose instead that there are k' < k such speciations, and denote them x_{1}, ..., x_{k'}. Note that there must be at least k' − 1 duplications among the ancestors of the x_{i}'s. Now, for 1 ≤ i ≤ k', T_{x_{i}} must contain a certain number of copies of a and b. Let m_{i}(a) and m_{i}(b) denote, respectively, the number of copies of a and b contained in T_{x_{i}}, noting that in total there are k copies of each, since there are k trees of the form (a, b) in \mathcal{G}. Since x_{i} is a speciation mapped to r(S), it separates the a copies from the b copies, thus the T_{x_{i}} subtree must contain at least m_{i}(a) − 1 + m_{i}(b) − 1 duplications. Denote by d(T) the number of duplications in T. It follows that d(T) ≥ k' − 1 + \sum_{i=1}^{k'} (m_{i}(a) + m_{i}(b) − 2) = k' − 1 − 2k' + \sum_{i=1}^{k'} m_{i}(a) + \sum_{i=1}^{k'} m_{i}(b) = −k' − 1 + k + k = 2k − k' − 1 > k − 1 when k' < k, a contradiction.
Now, we can let x_{1}, ..., x_{k} be the k speciation nodes of T mapped to r(S). The k − 1 duplications of T must then all be ancestors of the x_{i}, and they are all mapped to r(S). Therefore the T_{x_{1}}, ..., T_{x_{k}} subtrees each contain only speciations. For any G_{i} ∈ \mathcal{G} corresponding to v_{i}, one of the T_{x_{h}} must contain the copy of G_{i} (for otherwise, the root of the copy of G_{i} in T would be a duplication, while it should be a speciation). Take any h such that 1 ≤ h ≤ k. We claim that V_{h} = {v_{i} : T_{x_{h}} contains the copy of G_{i}} forms an independent set. Since T_{x_{h}} contains only speciations, it cannot contain two genes from the same species. Thus for any G_{i}, G_{j} contained in T_{x_{h}} we must have v_{i}v_{j} ∉ E, as otherwise G_{i} and G_{j} would share a gene from the same species. Therefore V_{h} is an independent set. Thus {V_{1}, ..., V_{k}} forms a k-coloring of H, and the proof is completed. □
It is interesting to note that this does not show the NP-hardness of the special case in which the input trees are only triplets. Indeed, a tree G_{i} created in this reduction has as many leaves as the number of neighbors of its corresponding vertex v_{i}. Therefore, if H is a cubic graph (i.e. 3-regular), one can generate an input with only triplets. However, deciding if a cubic graph is k-colorable can be done in linear time, and thus the triplets case cannot be shown NP-hard through this reduction. The 3-colorability problem is NP-hard on 4-regular graphs though, showing the NP-hardness of the problem on input trees having at most 4 leaves.
Conclusion
We introduce the supergenetree problem which aims at constructing a supertree that displays a set of input gene trees while minimizing the reconciliation cost with respect to an input species tree. This problem is a natural formulation of the question of combining a set of gene trees obtained for subsets of a gene family into a full gene tree for the whole gene family.
The supergenetree problem is an extension of the classical supertree problem on a set of input leaf-labeled trees, where the input trees are gene trees and a species tree is used in order to evaluate the reconciliation/duplication cost of a supergenetree. We show how existing exact and greedy heuristic algorithms for the supertree problem can be used to devise approaches for solving the supergenetree problem. The resulting approaches have exponential worst-case time complexity, as do the original supertree algorithms.
We show that the supergenetree problem for the duplication cost is NP-hard to approximate within a factor essentially better than n, and this complexity remains the same even when the problem is restricted, in a greedy approach, to finding a supertree with a minimum number of duplications before each speciation of the species tree. We also consider a restriction of the supergenetree problem relevant to many biological applications where subsets of orthologs are studied separately and then amalgamated into a single tree. Even this restriction is shown to be NP-complete. The reconciliation cost remains to be studied, although we conjecture that all of the above-mentioned problems are hard in this case also.
These negative complexity results are not surprising, as they extend an already large set of problems related to supertrees that are known to be NP-hard. We think that appropriate heuristics for various classes of input trees are worth considering in future projects. Removing the assumption that the input gene trees are compatible would also lead to new interesting problems. A promising avenue would be to consider constructive FPT algorithms that can be integrated in greedy heuristics or dynamic programming algorithms. Other restrictions on the input gene trees can also be explored, hopefully leading to polynomial problems. Constructing gene trees by amalgamating smaller trees for subsets of orthologous genes is a natural way of constructing large trees that would benefit from a thorough theoretical and algorithmic analysis.
References
Aho AV, Yehoshua S, Szymanski TG, Ullman JD: Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J Comput. 1981, 10 (3): 405-421. 10.1137/0210030.
Bansal M, Burleigh J, Eulenstein O, Fernández-Baca D: Robinson-Foulds supertrees. Alg Mol Biol. 2010, 5 (18):
Berglund AC, Sjolund E, Ostlund G, Sonnhammer EL: InParanoid 6: eukaryotic ortholog clusters with inparalogs. Nucleic Acids Research. 2008, 36: D263-D266.
Bininda-Emonds O, editor: Phylogenetic Supertrees combining information to reveal The Tree Of Life. Computational Biology. 2004, Kluwer Academic, Dordrecht, the Netherlands
Bininda-Emonds ORP, Gittleman J, Steel MA: The super tree of life: Procedures, problems and prospects. Annu Rev Ecol Syst. 2002, 33: 265-289. 10.1146/annurev.ecolsys.33.010802.150511.
Bryant D: A classification of consensus methods for phylogenetics. DIMACS series in Discrete Math and Theo Comput Sci. 2003
Chauve C, El-Mabrouk N: New perspectives on gene family evolution: losses in reconciliation and a link with supertrees. RECOMB of LNCS, Springer. 2009, 5541: 46-58.
Chen K, Durand D, Farach-Colton M: Notung: Dating gene duplications using gene family trees. Journal of Computational Biology. 2000, 7: 429-447. 10.1089/106652700750050871.
Constantinescu M, Sankoff D: An efficient algorithm for supertrees. J Classif. 1995, 12: 101-112. 10.1007/BF01202270.
Cotton JA, Wilkinson M: Majority-rule supertrees. Syst Biol. 2007, 56 (3): 445-452. 10.1080/10635150701416682.
Datta RS, Meacham C, Samad B, Neyer C, Sjölander K: Berkeley phog: Phylofacts orthology group prediction web server. Nucleic Acids Res. 2009, 37: W84-W89. 10.1093/nar/gkp373.
Goemans Michel, Williamson David: Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM). 1995, 42 (6): 1115-1145. 10.1145/227683.227684.
Hallett Mike, Lagergren Jens, Tofigh Ali: Simultaneous identification of duplications and lateral transfers. Proceedings of the eighth annual international conference on Research in computational molecular biology ACM. 2004, 347-356.
Huerta-Cepas J, Capella-Gutierrez S, Pryszcz LP, Denisov I, Kormes D, Marcet-Houben M, Gabaldón T: Phylomedb v3.0: an expanding repository of genome-wide collections of trees, alignments and phylogeny-based orthology and paralogy predictions. Nucleic Acids Res. 2011, 39: D556-D560. 10.1093/nar/gkq1109.
Jansson J, Lemence RS, Lingas A: The complexity of inferring a minimally resolved phylogenetic supertree. SIAM J on computing. 2012, 41 (1): 272-291. 10.1137/100811489.
Lafond M, Swenson KM, El-Mabrouk N: An optimal reconciliation algorithm for gene trees with polytomies. LNCS, of WABI. 2012, 7534: 106-122.
Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ: Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinformatics. 2011, 12: 124. 10.1186/1471-2105-12-124.
Li L, Stoeckert CJ, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research. 2003, 13: 2178-2189. 10.1101/gr.1224503.
Mi H, Muruganujan A, Thomas PD: Panther in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 2012, 41: D377-D386.
Ng MP, Wormald NC: Reconstruction of rooted trees from subtrees. Discrete Appl Math. 1996, 69: 19-31. 10.1016/0166-218X(95)00074-2.
Nguyen N, Mirarab S, Warnow T: MRL and SuperFine+MRL: new supertree methods. J Algo for Mol Biol. 2012, 7 (3):
Penel Simon, Arigon Anne-Muriel, Dufayard Jean-François, Sertier Anne-Sophie, Daubin Vincent, Duret Laurent, Gouy Manolo, Perrière Guy: Databases of homologous gene families for comparative genomics. BMC Bioinformatics. 2009, 10 (Suppl 6): S3. 10.1186/1471-2105-10-S6-S3.
Pryszcz LP, Huerta-Cepas J, Gabaldón T: MetaPhOrs: orthology and paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score. Nucleic Acids Research. 2011, 39: e32. 10.1093/nar/gkq953.
Ranwez V, Berry V, Criscuolo A, Fabre P, Guillemot S, Scornavacca C, Douzery E: PhySIC: a veto supertree method with desirable properties. Syst Biol. 2007, 56 (5): 798-817. 10.1080/10635150701639754.
Ranwez V, Criscuolo A, Douzery EJ: SuperTriplets: a triplet-based supertree approach to phylogenomics. Bioinformatics. 2010, 26 (12): i115-i123. 10.1093/bioinformatics/btq196.
Scornavacca Celine, Jacox Edwin, Szöllősi Gergely: Joint amalgamation of most parsimonious reconciled gene trees. Bioinformatics. 2014, btu728
Semple C: Reconstructing minimal rooted trees. Discrete Appl Math. 2003, 127 (3):
Steel M, Rodrigo A: Maximum likelihood supertrees. Syst Biol. 2008, 57 (2): 243-250. 10.1080/10635150802033014.
Swenson MS, Suri R, Linder CR, Warnow T: SuperFine: fast and accurate supertree estimation. Sys Biol. 2012, 61 (2): 214-227. 10.1093/sysbio/syr092.
Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E: EnsemblCompara gene trees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Research. 2009, 19: 327-335.
Zheng Y, Wu T, Zhang L: A linear-time algorithm for reconciliation of non-binary gene tree and binary species tree. Combinatorial Optimization and Applications of LNCS. 2013, 8287: 190-201. 10.1007/978-3-319-03780-6_17.
Zheng Yu, Zhang Louxin: Reconciliation with non-binary gene trees revisited. Research in Computational Molecular Biology, Springer. 2014, 418-432.
Zuckerman David: Linear degree extractors and the inapproximability of max clique and chromatic number. Proceedings of the thirty-eighth annual ACM symposium on Theory of computing ACM. 2006, 681-690.
Declarations
Publication of this work is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), Fonds de Recherche Nature et Technologies of Quebec (FRQNT) and the Canada Research Chair (CRC) in Biological and Computational Biology.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 14, 2015: Proceedings of the 13th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S14.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
ML, AO, NE devised the proofs and algorithms and wrote the paper.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Lafond, M., Ouangraoua, A. & El-Mabrouk, N. Reconstructing a SuperGeneTree minimizing reconciliation. BMC Bioinformatics 16 (Suppl 14), S4 (2015). https://doi.org/10.1186/1471-2105-16-S14-S4
DOI: https://doi.org/10.1186/1471-2105-16-S14-S4