Volume 13 Supplement 10
Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11)
Consensus properties for the deep coalescence problem and their application for scalable tree search
 Harris T Lin^{1},
 J Gordon Burleigh^{2} and
 Oliver Eulenstein^{1}
DOI: 10.1186/1471-2105-13-S10-S12
© Lin et al.; licensee BioMed Central Ltd. 2012
Published: 25 June 2012
Abstract
Background
To infer a species phylogeny from unlinked genes, phylogenetic inference methods must confront the biological processes that create incongruence between gene trees and the species phylogeny. Intraspecific gene variation in ancestral species can result in deep coalescence, also known as incomplete lineage sorting, which creates incongruence between gene trees and the species tree. One approach to account for deep coalescence in phylogenetic analyses is the deep coalescence problem, which takes a collection of gene trees and seeks the species tree that implies the fewest deep coalescence events. Although this approach is promising for phylogenetics, the consensus properties of this problem are mostly unknown and analyses of large data sets may be computationally prohibitive.
Results
We prove that the deep coalescence consensus tree problem satisfies the highly desirable Pareto property for clusters (clades). That is, in all instances, each cluster that is present in all of the input gene trees, called a consensus cluster, will also be found in every optimal solution. Moreover, we introduce a new divide and conquer method for the deep coalescence problem based on the Pareto property. This method refines the strict consensus of the input gene trees, thereby, in practice, often greatly reducing the complexity of the tree search and guaranteeing that the estimated species tree will satisfy the Pareto property.
Conclusions
Analyses of both simulated and empirical data sets demonstrate that the divide and conquer method can greatly improve upon the speed of heuristics that do not consider the Pareto consensus property, while also guaranteeing that the proposed solution fulfills the Pareto property. The divide and conquer method extends the utility of the deep coalescence problem to data sets with enormous numbers of taxa.
Introduction
The rapidly growing abundance of genomic sequence data has revealed extensive incongruence among gene trees (e.g., [1, 2]) that may be caused by processes such as deep coalescence (incomplete lineage sorting), gene duplication and loss, or lateral gene transfer (see [3–5]). In these cases, phylogenetic methods must account for and explain the patterns of variation among gene tree topologies, rather than simply assuming the gene tree topology reflects the relationships among species. In particular, there has been much recent interest in phylogenetic methods that account for deep coalescence, which may occur in any sexually reproducing organism (e.g., [6–8]). One such approach is the deep coalescence problem, which, given a collection of gene trees, seeks a species tree that minimizes the number of deep coalescence events [4, 9]. Although the deep coalescence problem is NP-hard [10], recent algorithmic advances enable scientists to solve instances with a small number of taxa [11, 12] and efficiently compute heuristic solutions for data sets with slightly more species [13]. Still, the heuristics are based on generic local tree search strategies with no performance guarantees, and they cannot handle enormous data sets. In this study, we prove that the deep coalescence problem satisfies the Pareto consensus property. We then describe a new divide and conquer approach, based on the Pareto property, that, in practice, can greatly extend the utility of existing heuristics while guaranteeing that the inferred species tree also has the Pareto property with respect to the input gene trees.
Related work
The deep coalescence problem is an example of a supertree problem, in which input trees with taxonomic overlap are combined to build a species tree that includes all of the taxa found in the input trees (see [14]). In fact, it is among the few supertree methods that use a biologically based optimality criterion. One way of evaluating supertree methods is by characterizing their consensus properties (e.g., [15, 16]). The consensus tree problem is the special case of the supertree problem in which all the input trees contain the same taxa. Since all supertree problems generally seek to retain phylogenetic information from the input trees, one of the most desirable consensus properties is the Pareto property. A consensus tree problem satisfies the Pareto property on clusters (or triplets, quartets, etc.) if every cluster (or triplet, quartet, etc.) that is present in every input tree appears in the consensus tree [15–17]. Many supertree problems satisfy the Pareto property for clusters in the consensus setting [15, 16]. However, this has not been shown for the deep coalescence problem.
Our contributions
We prove that the deep coalescence consensus tree problem satisfies the Pareto property for clusters. This result provides useful guidance for the species tree search. Instead of evaluating all possible species trees, to find the optimal solution we need only examine trees that satisfy the Pareto property on clusters. These trees will all be refinements of the strict consensus of the gene trees. Furthermore, the Pareto property allows us to show that the problem can be divided into smaller independent subproblems based on the strict consensus tree. We apply this property and describe a new divide and conquer method, and our experiments demonstrate that this method can greatly improve the speed of deep coalescence tree heuristics, potentially enabling efficient and effective estimates from inputs with several thousands of taxa. Future work will exploit the independence of the subproblems and solve these on parallel machines, which should result in even larger and more accurate solutions.
Methods
Basic definitions, notations, and preliminaries
In this section we introduce basic definitions and notations and then define preliminaries required for this work. For brevity some proofs are omitted in the text but available in Additional file 1.
A graph G is an ordered pair (V, E) consisting of a nonempty set V of nodes and a set E of edges. We denote the set of nodes and edges of G by V(G) and E(G), respectively. If e = {u, v} is an edge of a graph G, then e is said to be incident with u and v. If v is a node of a graph G, then the degree of v in G is the number of edges in G that are incident with v.
A tree T is a connected graph with no cycles. T is rooted if it has exactly one distinguished node of degree one, called the root, and we denote it by $\mathrm{R}\mathrm{o}\left(T\right)$. The unique edge incident with $\mathrm{R}\mathrm{o}\left(T\right)$ is called the root edge.
Let T be a rooted tree. We define ≤_{ T } to be the partial order on V (T) where x ≤_{ T } y if y is a node on the path between $\mathrm{R}\mathrm{o}\left(T\right)$ and x. If x ≤_{ T } y we call x a descendant of y, and y an ancestor of x. We also define x <_{ T } y if x ≤_{ T } y and x ≠ y; in this case we call x a proper descendant of y, and y a proper ancestor of x. The set of minima under ≤_{ T } is denoted by $\mathrm{L}\mathrm{e}\left(T\right)$ and its elements are called leaves. A node is internal if it is not a leaf. The set of all internal nodes of T is denoted by I(T). Further, we will frequently refer to the subset of I(T) whose degree is two, and we denote this by I_{2}(T).
For $X\subseteq \mathrm{L}\mathrm{e}\left(T\right)$, we write $\overline{X}$ to denote the leaf complement of X when the tree T is clear from the context, where $\overline{X}=\mathrm{L}\mathrm{e}\left(T\right)\backslash X$.
If {x, y} ∈ E(T) and x <_{ T } y then we call y the parent of x denoted by ${\mathrm{P}\mathrm{a}}_{T}\left(x\right)$ and we call x a child of y. The set of all children of y is denoted by ${\mathrm{C}\mathrm{h}}_{T}\left(y\right)$. If two nodes in T have the same parent, they are called siblings. The least common ancestor (LCA) of a nonempty subset X ⊆ V(T), denoted as lca_{ T }(X), is the unique smallest upper bound of X under ≤_{ T }.
If e ∈ E(T), we define T/e to be the tree obtained from T by identifying the ends of e and then deleting e. T/e is said to be obtained from T by contracting e. If v is a vertex of T with degree one or two, and e is an edge incident with v, the tree T/e is said to be obtained from T by suppressing v.
T is binary if every node has degree one or three. Throughout this paper, the term tree refers to a rooted binary tree unless otherwise stated. Also, the subscript of a notation may be omitted when it is clear from the context.
Deep coalescence
Throughout this section we assume T and S are trees over the same leaf set.
Definition 1 (Path length). Suppose x ≤_{ T } y. The path length from x to y, denoted pl_{ T }(x, y), is the number of edges in the path from x to y. Further, for $X\subseteq Y\subseteq \mathrm{L}\mathrm{e}\left(T\right)$, we extend the path length function by pl_{ T }(X,Y) ≜ pl_{ T }(lca_{ T }(X), lca_{ T }(Y)).
Definition 2 (LCA mapping). For v ∈ V(T), the LCA mapping of v in S, denoted M_{ T⊳S }(v), is defined by ${M}_{T\rhd S}\left(v\right)\triangleq lc{a}_{S}\left(\mathrm{L}\mathrm{e}\left({T}_{v}\right)\right)$.
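As an illustration of Definitions 1 and 2, the following Python sketch computes path lengths and LCA mappings on small trees and combines them into a deep coalescence cost. Several choices here are assumptions made for the example rather than part of the text above: trees are encoded as nested tuples, nodes are identified by their clusters, there is no separate degree-one root node, and all function names are hypothetical. The cost function assumes the common formulation that sums pl_{ S }(M(u), M(v)) − 1 over the edges {u, v} of the gene tree.

```python
def index_nodes(tree):
    """Return {cluster: depth} for every node of a nested-tuple tree."""
    depths = {}

    def walk(node, depth):
        if isinstance(node, tuple):
            leafset = frozenset().union(*(walk(c, depth + 1) for c in node))
        else:
            leafset = frozenset([node])
        depths[leafset] = depth
        return leafset

    walk(tree, 0)
    return depths


def lca(depths, leafset):
    """Cluster of lca(leafset): the smallest cluster that contains leafset."""
    return min((c for c in depths if leafset <= c), key=len)


def path_length(depths, a, b):
    """pl(A, B) for A <= B: number of edges between lca(A) and lca(B)."""
    return depths[lca(depths, a)] - depths[lca(depths, b)]


def dc_cost(gene, species):
    """Assumed cost: sum over gene tree edges of (mapped path length - 1)."""
    species_depths = index_nodes(species)
    total = 0

    def walk(node):
        nonlocal total
        if not isinstance(node, tuple):
            return frozenset([node])
        child_sets = [walk(c) for c in node]
        here = frozenset().union(*child_sets)
        for child in child_sets:
            # The LCA mapping of a gene tree node is lca_S of its leaf set.
            total += path_length(species_depths, child, here) - 1
        return here

    walk(gene)
    return total


species = ((("a", "b"), "c"), "d")
gene = ((("a", "c"), "b"), "d")
print(dc_cost(species, species), dc_cost(gene, species))
```

Under these assumptions a gene tree that matches the species tree costs nothing, while the conflicting gene tree forces one extra lineage on the species tree edge above (a, b).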
Consensus tree
Definition 4 (Consensus tree problem). Let $f:{\mathcal{T}}_{X}\times {\mathcal{T}}_{X}\to \mathbb{R}$ be a cost function where X is a leaf set and ${\mathcal{T}}_{X}$ is the set of all trees over X. A consensus tree problem based on f is defined as follows.
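Instance: A tuple of n trees (T_{1},...,T_{ n }) over X
Find: The set of all trees S over X that minimize the aggregated cost ${\sum}_{i=1}^{n}f\left({T}_{i},S\right)$.
This set is also called the solutions for the consensus tree instance.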
Definition 5 (Deep coalescence consensus tree problem). We define the deep coalescence consensus tree problem to be the consensus tree problem based on the deep coalescence cost function.
Cluster and Pareto
Definition 6 (Cluster). Let T be a tree. The clusters induced by T, denoted $\mathrm{C}\mathrm{l}\left(T\right)$, are $\mathrm{C}\mathrm{l}\left(T\right)\triangleq \left\{\mathrm{L}\mathrm{e}\left({T}_{v}\right):v\in V\left(T\right)\right\}$. Further, $X\in \mathrm{C}\mathrm{l}\left(T\right)$ is called a trivial cluster if $X=\mathrm{L}\mathrm{e}\left(T\right)$ or |X| = 1; it is called nontrivial otherwise. For $Y\subseteq \mathrm{L}\mathrm{e}\left(T\right)$, we say that T contains (cluster) Y if $Y\in \mathrm{C}\mathrm{l}\left(T\right)$.
Definition 7 (Pareto on clusters). Let P be a consensus tree problem based on some cost function. We say that P is Pareto on clusters if: for all instances I = (T_{1},...,T_{ n }) of P, for all solutions S of I, we have ${\cap}_{i=1}^{n}\mathrm{C}\mathrm{l}\left({T}_{i}\right)\subseteq \mathrm{C}\mathrm{l}\left(S\right)$.
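Definitions 6 and 7 translate directly into set operations. The sketch below, which assumes a nested-tuple encoding of trees and uses hypothetical function names, collects the clusters of each tree and checks whether a candidate tree contains every cluster shared by all input trees.

```python
def clusters(tree):
    """Cl(T): the leaf set below every node of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, tuple):
            leafset = frozenset().union(*(walk(c) for c in node))
        else:
            leafset = frozenset([node])
        found.add(leafset)
        return leafset

    walk(tree)
    return found


def consensus_clusters(trees):
    """Clusters present in every input tree."""
    return set.intersection(*(clusters(t) for t in trees))


def contains_consensus_clusters(candidate, trees):
    """True if the candidate contains all consensus clusters of the instance."""
    return consensus_clusters(trees) <= clusters(candidate)


genes = [((("a", "b"), "c"), "d"), ((("a", "b"), "d"), "c")]
print(contains_consensus_clusters((("a", "b"), ("c", "d")), genes))
print(contains_consensus_clusters((("a", "c"), ("b", "d")), genes))
```

Here the only nontrivial consensus cluster is {a, b}, so the first candidate passes the check and the second does not. Note that the check applies to a single candidate tree; the Pareto property of Definition 7 is a statement about every solution of every instance.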
Theorem overview
We wish to show that the deep coalescence consensus tree problem is Pareto on clusters. We describe the high-level structure of the proof in this section and provide the necessary supporting lemmata in the next section. The proof proceeds by contradiction, assuming that the deep coalescence consensus tree problem is not Pareto on clusters. By Def. 7, the assumption implies that there exists an instance I = (T_{1},...,T_{ n }), a solution S for I, and a cluster $X\subseteq \mathrm{L}\mathrm{e}\left(S\right)$ where $X\in {\cap}_{i=1}^{n}\mathrm{C}\mathrm{l}\left({T}_{i}\right)$ but $X\notin \mathrm{C}\mathrm{l}\left(S\right)$. Since S is a solution for I, Def. 4 implies that the aggregated deep coalescence cost, i.e. ${\sum}_{i=1}^{n}DC\left({T}_{i},S\right)$, is minimized. Then, based on the existence of the cluster X, we edit S and form a new tree R using a tree edit operation that is introduced in the next section. The properties of this new operation, together with the properties of X (proved in the next section), provide the key ingredients to calculate the changes in deep coalescence costs. With some further arithmetic, this allows us to conclude that R in fact has a smaller aggregated deep coalescence cost, i.e. ${\sum}_{i=1}^{n}DC\left({T}_{i},S\right)>{\sum}_{i=1}^{n}DC\left({T}_{i},R\right)$, hence contradicting the assumption that S is a solution for I.
Supporting lemmata
Shallowest regrouping operation
Definition 8 (Node depth). The depth of a node v ∈ V(T), denoted dep_{ T }(v), is $p{l}_{T}\left(v,\mathrm{R}\mathrm{o}\left(T\right)\right)$.
Definition 9 (Shallowest nodes). Let T be a tree and X ⊆ V(T), the shallowest function, denoted shallowest_{ T }(X), is the set of nodes in X which have the minimum depth among all nodes in X. Formally, we define shallowest_{T} (X) ≜ argmin_{ v∈X } (dep_{ T }(v)).
Now we have the necessary mechanics to define the new tree edit operation. In what follows, we assume S to be a tree, $\varnothing \subset X\subset \mathrm{L}\mathrm{e}\left(S\right)$, and $S\prime =S\left(\overline{X}\right)$.
Definition 10 (Regroup). Let v ∈ I_{2}(S'). The regrouping operation of S by X on v, denoted Γ(S, X, v), is the tree obtained from S' by
1. (R1) Identify Ro(SX) and v. In other words we adjoin the root of tree SX onto the node v.
2. (R2) Suppress all nodes with degree two.
Definition 11 (Shallowest regroup). The shallowest regrouping operation of S by X, denoted $\hat{\Gamma}\left(S,X\right)$, defines a set of trees by $\hat{\Gamma}\left(S,X\right)\triangleq \left\{\Gamma \left(S,X,v\right):v\in {\mathrm{shallowest}}_{{S}^{\prime}}\left({I}_{2}\left({S}^{\prime}\right)\right)\right\}$.
As Figure 3 shows, the shallowest regrouping operation pulls apart X from S and regroups X back onto each of the shallowest nodes in S.
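The regrouping operation can be sketched in Python as follows. This is an illustrative reading of Definitions 9 to 11, not a reference implementation: trees are assumed to be nested tuples without a separate degree-one root node, a node is identified by its cluster, a degree-two node of S' is taken to be a node of S that keeps exactly one child once the leaves in X are pruned, and all function names are hypothetical.

```python
def leaves(tree):
    """Leaf set of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaves(c) for c in tree))


def restrict(tree, keep):
    """Subtree induced by the leaves in keep, with degree-two nodes suppressed."""
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kids = [k for k in (restrict(c, keep) for c in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def degree_two_nodes(s, x, depth=0):
    """Yield (depth, cluster) for nodes of s left with one child after pruning x."""
    if not isinstance(s, tuple):
        return
    surviving = [c for c in s if leaves(c) - x]
    if len(surviving) == 1:
        yield depth, leaves(s)
    for child in surviving:
        yield from degree_two_nodes(child, x, depth + 1)


def regroup(s, x, v):
    """Prune x from s, then graft the subtree induced by x at node v."""
    grafted = restrict(s, x)

    def build(node):
        if not isinstance(node, tuple):
            return None if node in x else node
        kids = [k for k in (build(c) for c in node) if k is not None]
        if leaves(node) == v:
            assert len(kids) == 1, "v must be a degree-two node after pruning x"
            return (kids[0], grafted)  # step (R1): attach X at v
        if not kids:
            return None
        return kids[0] if len(kids) == 1 else tuple(kids)  # step (R2)

    return build(s)


def shallowest_regroup(s, x):
    """All regroupings of s by x on the shallowest degree-two nodes."""
    candidates = list(degree_two_nodes(s, x))
    best = min(depth for depth, _ in candidates)
    return [regroup(s, x, v) for depth, v in candidates if depth == best]


s = ((("a", "c"), "b"), "d")
print(shallowest_regroup(s, frozenset("ab")))
```

In this example X = {a, b} is not a cluster of S. Pruning X leaves two degree-two nodes, and regrouping on the shallower one yields a tree in which {a, b} is a cluster.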
Counting the number of degree-two nodes
The regrouping operation includes the step of suppressing nodes with degree two. Since this step affects path lengths and ultimately deep coalescence costs, we must carefully count the number of degree-two nodes under various conditions. Here we assume that T is a tree and {X, Y} is a bipartition of $\mathrm{L}\mathrm{e}\left(T\right)$. We begin with two observations: the first asserts the existence of degree-two nodes, and the second asserts the existence of leaf sets given a degree-two node.
Observation 1. I_{2}(T(X)) ≠ ∅ and I_{2}(T(Y)) ≠ ∅.
Observation 2. If v ∈ I_{2}(T(X)), then $\mathrm{L}\mathrm{e}\left({T}_{v}\right)\cap X\ne \varnothing $ and $\mathrm{L}\mathrm{e}\left({T}_{v}\right)\cap Y\ne \varnothing $.
The next lemma says that if the root of T is the parent of lca(X), then the number of degree-two nodes in T(X) is at least the depth of v, where v is a shallowest degree-two node of T(Y).
Lemma 1. If $\mathrm{P}\mathrm{a}\left(lca\left(X\right)\right)=\mathrm{R}\mathrm{o}\left(T\right)$ and v ∈ shallowest(I_{2}(T(Y))), then dep(v) ≤ |I_{2}(T(X))|.

v_{ n }= lca(X) because $\mathrm{P}\mathrm{a}\left(lca\left(X\right)\right)=\mathrm{R}\mathrm{o}\left(T\right)$.

B ⊆ X and A_{1} ∩ Y ≠ ∅, because v is a degree-two node of T(Y).

A_{1} ∩ Y ≠ ∅ implies that A_{2},..., A_{ n } each contain at least one element of Y. Otherwise, one of v_{2}, ...,v_{ n } would be a degree-two node in T(Y), contradicting the assumption that v = v_{1} is the shallowest degree-two node in T(Y).
In order to obtain T(X), we must prune subtrees in A_{1} whose leaves are in Y (which could be the entire subtree A_{1}). Thus there must be at least one degree-two node in A_{1} (or v_{1} if A_{1} is pruned). Similarly, for 1 < i ≤ n, either v_{ i } has degree two or there exists a degree-two node in A_{ i }. Overall T(X) has at least n degree-two nodes, as required. □
Properties of the regrouping operation
We examine some properties of the regrouping operation in this section. In general, these properties show that the path lengths defined by LCAs do not increase under several different assumptions. These bounds on path lengths later assist in the calculation of deep coalescence costs. Throughout this section, we assume S to be a tree, $\varnothing \subset X\subset \mathrm{L}\mathrm{e}\left(S\right)$, and ${S}^{\prime}=S\left(\overline{X}\right)$. Further we let R = Γ(S, X, v) where v ∈ I_{2}(S').
Lemma 2. If $A\subseteq B\subseteq \mathrm{L}\mathrm{e}\left(S\right)$ and $B\subseteq \overline{X}$, then pl_{ S }(A, B) = pl_{ S' }(A, B).
Lemma 3. If $A\subseteq B\subseteq \mathrm{L}\mathrm{e}\left(S\right)$ and $B\subseteq \overline{X}$, then pl_{ S }(A, B) ≥ pl_{ R }(A, B).
Lemma 4. If $A\subseteq B\subseteq \mathrm{L}\mathrm{e}\left(S\right)$ and $B\subseteq X$, then pl_{ S }(A, B) ≥ pl_{ R }(A, B).
Lemma 5. If $A\subseteq B\subseteq \mathrm{L}\mathrm{e}\left(S\right)$, $A\subseteq \overline{X}$, and X ⊆ B, then pl_{ S }(A, B) ≥ pl_{ R }(A, B).
Proof. Let S" be the tree obtained from S' by identifying Ro(SX) and v. In other words, S" is the tree after step (R1) of the regroup operation Γ(S, X, v). We will show that pl_{ S }(A, B) ≥ pl_{ S" }(A, B) ≥ pl_{ R }(A, B). We begin with the first inequality. First, since $A\subseteq \overline{X}$ we know that lca_{ S }(A) = lca_{ S' }(A) = lca_{ S" }(A). Let x = lca_{ S }(X) and b = lca_{ S }(B). Then the assumption X ⊆ B implies x ≤_{ S } b. Since v has degree two in S', we know that $\mathrm{L}\mathrm{e}\left({S}_{v}\right)\cap X\ne \varnothing $ (Observation 2), and so v ≤_{ S } x. Now let x" = lca_{ S" }(X) and b" = lca_{ S" }(B). By (R1) we have that x" ≤_{ S" } v, and so x" ≤_{ S" } x, which implies b" ≤_{ S" } b. Furthermore, lca_{ S }(A) = lca_{ S" }(A) is a descendant of both b and b" because A ⊆ B, and hence b" ≤_{ S" } b implies that pl_{ S }(A, B) ≥ pl_{ S" }(A, B).
Next, by (R2) R is obtained from S" by suppressing some nodes, therefore a path in S" can only be made shorter in R, hence we have pl_{ S" }(A, B) ≥ pl_{ R }(A, B).
Finally, combining the above results we have pl_{ S }(A, B) ≥ pl_{ R }(A, B). □
Main theorem
Theorem 1. The deep coalescence consensus tree problem is Pareto on clusters.
Proof. Assume, for a contradiction, that the claim does not hold. Then there exists an instance I = (T_{1},...,T_{ n }), a solution S for I, and a cluster $X\subseteq \mathrm{L}\mathrm{e}\left(S\right)$ where $X\in {\cap}_{i=1}^{n}\mathrm{C}\mathrm{l}\left({T}_{i}\right)$ but $X\notin \mathrm{C}\mathrm{l}\left(S\right)$. Since $X\notin \mathrm{C}\mathrm{l}\left(S\right)$, X must be nontrivial, therefore $\hat{\Gamma}\left(S,X\right)$ does not contain S and is not empty. Let $R\in \hat{\Gamma}\left(S,X\right)$. We will show that (∀ 1 ≤ i ≤ n) (DC(T_{ i }, S) > DC(T_{ i }, R)), which implies ${\sum}_{i=1}^{n}DC\left({T}_{i},S\right)>{\sum}_{i=1}^{n}DC\left({T}_{i},R\right)$, contradicting the assumption that S is a solution for I.
We identify some specific nodes in order to partition the edges of T. Let ${S}^{\prime}=S\left(\overline{X}\right)$ and $w\in {I}_{2}\left({S}^{\prime}\right)$, where R = Γ(S, X, w). Since $X\notin \mathrm{C}\mathrm{l}\left(S\right)$, S' contains at least two nodes with degree two. Let w' ∈ I_{2}(S') such that w' ≠ w, then S_{ w' } contains some leaf y ∉ X (Observation 2).
1. E_{1} ≜ {{u, v} ∈ E(T) : u < v ≤ x} = All edges under x
2. E_{2} ≜ {{u, v} ∈ E(T) : x ≤ u < v} = Edges forming the path from x to $\mathrm{R}\mathrm{o}\left(T\right)$
3. E_{3} ≜ {{u, v} ∈ E(T) : y ≤ u < v ≤ z} = Edges forming the path from y to z
4. E_{4} ≜ E(T) \ (E_{1} ∪ E_{2} ∪ E_{3})
Let x' = lca_{ S }(X) and p = pl_{ S } (w, x') + 1. For each i ∈ {1, 2, 3, 4}, we claim and prove the bound of Σ_{ i } as follows.
Claim 1. Σ_{1} ≥ p
Proof. First we observe that the difference for each path length in this partition is ≥ 0 (Lemma 4), so we have Σ_{1} ≥ 0. Since x' = lca_{ S }(X), we only need to consider the subtree S_{ x' } in computing the path lengths in this partition. Define U = S_{ x' }. In particular, the number of degree-two nodes in U(X) gives us a lower bound on the total decrease in path lengths, because these nodes are removed to obtain UX, which is a subtree of R. That is, Σ_{1} ≥ |I_{2}(U(X))|. Lemma 1 applies to U with bipartition $\left\{X,\mathrm{L}\mathrm{e}\left(U\right)\backslash X\right\}$ and the node w, so we have |I_{2}(U(X))| ≥ dep_{ U }(w). The depth dep_{ U }(w) is with respect to U, and we relate it to a path length in S by taking away the root edge, that is $de{p}_{U}\left(w\right)-1=p{l}_{S}\left(w,x\prime \right)$. Finally, using the definition of p we obtain Σ_{1} ≥ |I_{2}(U(X))| ≥ dep_{ U }(w) = pl_{ S }(w, x') + 1 = p.
Claim 2. Σ_{2} = −p
The fourth equality holds because w is the shallowest degree-two node in S', so that no edges along the path from w to x' are contracted in R, hence pl_{ R }(w, x') = pl_{ S }(w, x').
Claim 3. Σ_{3} ≥ 1
1. If $B\subseteq \overline{X}$, then Lemma 3 applies on S, R, A, B, so pl_{ S }(A, B) − pl_{ R }(A, B) ≥ 0.
2. If X ⊆ B, then Lemma 5 applies on S, R, A, B, so pl_{ S }(A, B) − pl_{ R }(A, B) ≥ 0.
In any case, we have pl_{ S }(A, B) − pl_{ R }(A, B) ≥ 0 for each edge {a, b} ∈ E_{3}. This implies that Σ_{3} ≥ 0. Further, since w' ∈ I_{2}(S') and w' ≠ w, w' does not exist in R. We also know that $y{<}_{S}w\prime {<}_{S}lc{a}_{S}\left(X\cup \left\{y\right\}\right)$ by the definitions of w' and y. Therefore there exists an edge {a, b} ∈ E_{3} such that pl_{ S }(A, B) − pl_{ R }(A, B) ≥ 1. Hence we have Σ_{3} ≥ 1.
Claim 4. Σ_{4} ≥ 0
Proof. Let {a, b} ∈ E_{4} where a <_{ T } b, $A=\mathrm{L}\mathrm{e}\left({T}_{a}\right)$, and $B=\mathrm{L}\mathrm{e}\left({T}_{b}\right)$. The proof follows from the same argument as in Claim 3 where we have pl_{ S } (A, B) − pl_{ R }(A, B) ≥ 0 for each edge {a, b} ∈ E_{4}, hence Σ_{4} ≥ 0.
Finally, we have Σ_{1} + Σ_{2} + Σ_{3} + Σ_{4} ≥ p + (−p) + 1 + 0 = 1 > 0. Hence (2) is satisfied, and so is (1). In sum, we have constructed a tree R and showed that ${\sum}_{i=1}^{n}DC\left({T}_{i},S\right)>{\sum}_{i=1}^{n}DC\left({T}_{i},R\right)$, which contradicts the assumption that S is a solution for I, in other words the assumption that S has the minimum aggregated cost with respect to the deep coalescence cost function. □
Algorithm for improving a candidate solution
Algorithm 1 takes a consensus tree problem instance and a candidate solution as inputs. If the candidate solution does not display the consensus clusters, it is transformed into one that includes all of the consensus clusters and has a lower deep coalescence cost.
Algorithm 1 Deep coalescence consensus clusters builder
1: procedure DCConsensusClustersBuilder (I, T)
Input: A consensus tree problem instance I = (T_{1},...,T_{ n }), a candidate solution T for I
Output: T, or an improved solution R that contains all consensus clusters of I
2: R ← T
3: C ← Set of all consensus clusters of I
4: for all clusters X ∈ C do
5: if R does not contain X then
6: v ← A node in shallowest$\left({I}_{2}\left(R\left(\overline{X}\right)\right)\right)$ (shallowest degree-two node of $R\left(\overline{X}\right)$)
7: R ← Γ(R, X, v) (regrouping operation of R by X on v)
8: end if
9: end for
10: return R
11: end procedure
The correctness of Algorithm 1 follows from the proof of Theorem 1. We now analyze its time complexity. Let m be the number of taxa present in the input trees. Line 3 takes O(nm) time. Lines 5, 6, and 7 each take O(m) time, and there are O(m) iterations. Overall, Algorithm 1 takes O(nm + m^{2}) time.
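A compact Python rendering of Algorithm 1 is sketched below. It assumes nested-tuple trees over a common leaf set, identifies nodes by their clusters, and finds a shallowest degree-two node by breadth-first search; the function names are hypothetical. The sketch favors readability over the O(nm + m^{2}) bound, since it recomputes leaf sets and clusters repeatedly.

```python
def leaves(t):
    """Leaf set of a nested-tuple tree."""
    if not isinstance(t, tuple):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))


def clusters(t):
    """All clusters of t."""
    if not isinstance(t, tuple):
        return {leaves(t)}
    return {leaves(t)}.union(*(clusters(c) for c in t))


def restrict(t, keep):
    """Subtree induced by keep, with degree-two nodes suppressed."""
    if not isinstance(t, tuple):
        return t if t in keep else None
    kids = [k for k in (restrict(c, keep) for c in t) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def shallowest_degree_two_node(r, x):
    """Cluster of a shallowest node of r left with one child after pruning x."""
    level = [r]
    while level:
        next_level = []
        for node in level:
            if isinstance(node, tuple):
                surviving = [c for c in node if leaves(c) - x]
                if len(surviving) == 1:
                    return leaves(node)
                next_level.extend(surviving)
        level = next_level
    raise ValueError("x must be a nonempty proper subset of the leaves")


def regroup(r, x, v):
    """Prune x from r and graft the subtree induced by x at node v."""
    grafted = restrict(r, x)

    def build(node):
        if not isinstance(node, tuple):
            return None if node in x else node
        kids = [k for k in (build(c) for c in node) if k is not None]
        if leaves(node) == v:
            return (kids[0], grafted)
        if not kids:
            return None
        return kids[0] if len(kids) == 1 else tuple(kids)

    return build(r)


def dc_consensus_clusters_builder(gene_trees, candidate):
    """Regroup the candidate until it contains every consensus cluster."""
    r = candidate
    consensus = set.intersection(*(clusters(t) for t in gene_trees))
    for x in consensus:
        if x not in clusters(r):
            v = shallowest_degree_two_node(r, x)
            r = regroup(r, x, v)
    return r


genes = [((("a", "b"), "c"), ("d", "e")), ((("a", "b"), ("d", "e")), "c")]
candidate = ((("a", "d"), ("b", "e")), "c")
print(dc_consensus_clusters_builder(genes, candidate))
```

In this toy instance the candidate splits both consensus clusters {a, b} and {d, e}; after two regrouping steps the returned tree contains both, whichever cluster is processed first.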
General method for improving a search algorithm
Definition 12 (Strict consensus tree [18]). Given a tuple of n trees I = (T_{1},...,T_{ n }), the strict consensus tree of I, denoted StrictCon(I), is the unique tree that contains those clusters common to all the input trees. Formally, StrictCon(I) is a (possibly nonbinary) tree S such that $\mathrm{C}\mathrm{l}\left(S\right)={\bigcap}_{i=1}^{n}\mathrm{C}\mathrm{l}\left({T}_{i}\right)$.
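Because the clusters shared by all input trees are pairwise nested or disjoint, the strict consensus tree can be assembled by nesting them. The following sketch assumes nested-tuple trees with sortable leaf labels and uses hypothetical function names; it is meant to illustrate Definition 12, not to be efficient.

```python
def clusters(tree):
    """All clusters of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {frozenset([tree])}
    below = [clusters(c) for c in tree]
    here = frozenset().union(*(max(b, key=len) for b in below))
    return {here}.union(*below)


def strict_consensus(trees):
    """Possibly nonbinary tree whose clusters are exactly the shared clusters."""
    common = set.intersection(*(clusters(t) for t in trees))

    def build(cluster):
        if len(cluster) == 1:
            return next(iter(cluster))
        inside = [c for c in common if c < cluster]
        # Children are the maximal shared clusters strictly inside this one.
        kids = [c for c in inside if not any(c < d for d in inside)]
        return tuple(build(c) for c in sorted(kids, key=sorted))

    return build(max(common, key=len))


genes = [((("a", "b"), "c"), ("d", "e")), ((("a", "b"), ("d", "e")), "c")]
print(strict_consensus(genes))
```

For these two gene trees the shared nontrivial clusters are {a, b} and {d, e}, so the strict consensus has a root with three children.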
Definition 13 (Cut on trees). Let H and T be two trees over the same leaf set, such that H is a nonbinary tree and T is a binary tree that refines H. Given an internal node h in H, a cut on T via H and h, denoted ${\mathrm{Cut}}_{H,h}\left(T\right)$, is the minimal connected subtree of T that contains $\left\{{M}_{H\rhd T}\left(c\right):c\in {\mathrm{C}\mathrm{h}}_{H}\left(h\right)\right\}$, and we rename each leaf x by $\mathrm{L}\mathrm{e}\left({T}_{x}\right)$.
We further extend this to a tuple of trees I = (T_{1},...,T_{ n }) by ${\mathrm{Cut}}_{H,h}\left(I\right)\triangleq \left({\mathrm{Cut}}_{H,h}\left({T}_{1}\right),\dots ,{\mathrm{Cut}}_{H,h}\left({T}_{n}\right)\right)$.
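One way to picture the cut is as a collapse: every child cluster of h becomes a single leaf labeled by that cluster, and everything outside h is dropped. The sketch below follows this reading under the assumption that trees are nested tuples and that each child cluster of h is a cluster of T (which holds when T refines H); the function name and the list-of-clusters interface are hypothetical.

```python
def cut(tree, child_clusters):
    """Collapse each child cluster of h to one leaf and drop all other leaves."""
    label = {leaf: cluster for cluster in child_clusters for leaf in cluster}

    def walk(node):
        if not isinstance(node, tuple):
            return label.get(node)  # None for leaves outside the subtree of h
        kids = [k for k in (walk(c) for c in node) if k is not None]
        if not kids:
            return None
        if all(k == kids[0] for k in kids):
            # Either a single surviving child, or a node wholly inside one
            # child cluster: both collapse to that child.
            return kids[0]
        return tuple(kids)

    return walk(tree)


ab, c, de = frozenset("ab"), frozenset("c"), frozenset("de")
gene = ((("a", "b"), "c"), ("d", "e"))
print(cut(gene, [ab, c, de]))
```

The result is a three-leaf tree over the labels {a, b}, {c} and {d, e}, which is the size of the subproblem attached to h.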
Theorem 2. Let I = (T_{1},...,T_{ n }) be an instance of the deep coalescence consensus tree problem, and let S be a solution for I (having the optimal deep coalescence cost). Further suppose H is the strict consensus tree of I, and h is an internal node in H. Then ${\mathrm{Cut}}_{H,h}\left(S\right)$ is a solution for the instance ${\mathrm{Cut}}_{H,h}\left(I\right)$ of the deep coalescence consensus tree problem.
1. Remove all edges of S', and remove all nodes of S' except the root and the leaves.
2. Identify $\mathrm{R}\mathrm{o}\left({S}^{\prime}\right)$ with $\mathrm{R}\mathrm{o}\left({R}^{\prime}\right)$.
3. For each leaf v of S', identify v with a leaf x of R' where $x=\mathrm{L}\mathrm{e}\left({S}_{v}\right)$.
Let the resulting tree be R. We will show that R has a lower deep coalescence cost, contradicting the assumption that S is a solution for I.
For convenience, let ${\mathrm{C}\mathrm{h}}_{H}\left(h\right)=\left\{{c}_{1},\dots ,{c}_{m}\right\}$, ${h}^{\prime}={M}_{H\rhd T}\left(h\right)$, and ${c}_{j}^{\prime}={M}_{H\rhd T}\left({c}_{j}\right)$ where 1 ≤ j ≤ m.
1. ${E}_{under}\triangleq \left\{\left\{u,v\right\}\in E\left(T\right):u<v\text{ and }\left(\exists j\right)\left(v\le {c}_{j}^{\prime}\right)\right\}$
2. ${E}_{out}\triangleq \left\{\left\{u,v\right\}\in E\left(T\right):u<v\text{ and }v\nleqq {h}^{\prime}\right\}$
3. ${E}_{in}\triangleq E\left(T\right)\backslash \left({E}_{under}\cup {E}_{out}\right)$
Recall that the modification of S into R only involves the subtree S', therefore ${M}_{T\rhd S}\left(v\right)$ is unchanged for every v that occurs in E_{ under } and E_{ out }. Hence it suffices to evaluate (3) on E_{ in } only. However, we have already assumed that ${\sum}_{i=1}^{n}DC\left({T}_{i}^{\prime},{S}^{\prime}\right)>{\sum}_{i=1}^{n}DC\left({T}_{i}^{\prime},{R}^{\prime}\right)$, therefore (3) holds. Overall we have that R has a lower deep coalescence cost, contradicting the assumption that S is a solution for I. □
Theorem 2 implies that every internal node of the strict consensus tree defines an independent subproblem, and solutions of these subproblems can be combined to give a solution to the original deep coalescence consensus tree problem. This leads to the following general divide and conquer method that improves an existing search algorithm.
Method 1 Deep coalescence consensus tree method
1: procedure DCConsensusTreeMethod(I)
Input: A DC consensus tree problem instance I =(T_{1},...,T_{ n }), an external program DCSOLVER.
Output: A candidate solution T for I
2: H ← StrictCon(I)
3: for all internal nodes h of H do
4: I_{ h } ← ${\mathrm{Cut}}_{H,h}\left(I\right)$
5: S_{ h } ← DCSOLVER(I_{ h })
6: Refine the children of h on H by the tree S_{ h }
7: end for
8: return H
9: end procedure
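The control flow of Method 1 can be sketched in Python as below. The sketch assumes nested-tuple trees with sortable leaf labels and treats DCSOLVER as any callable that maps a list of small trees to one binary tree over the same labels. The solver used in the example, first_tree_solver, is only a placeholder that returns the first cut gene tree; it is not a deep coalescence heuristic, and all function names are hypothetical.

```python
def leaves(t):
    """Leaf set of a nested-tuple tree."""
    if not isinstance(t, tuple):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))


def clusters(t):
    """All clusters of t."""
    if not isinstance(t, tuple):
        return {leaves(t)}
    return {leaves(t)}.union(*(clusters(c) for c in t))


def strict_consensus(trees):
    """Tree whose clusters are exactly those shared by all input trees."""
    common = set.intersection(*(clusters(t) for t in trees))

    def build(cluster):
        if len(cluster) == 1:
            return next(iter(cluster))
        inside = [c for c in common if c < cluster]
        kids = [c for c in inside if not any(c < d for d in inside)]
        return tuple(build(c) for c in sorted(kids, key=sorted))

    return build(max(common, key=len))


def cut(tree, child_clusters):
    """Collapse each child cluster to one leaf labeled by that cluster."""
    label = {leaf: c for c in child_clusters for leaf in c}

    def walk(node):
        if not isinstance(node, tuple):
            return label.get(node)
        kids = [k for k in (walk(c) for c in node) if k is not None]
        if not kids:
            return None
        return kids[0] if all(k == kids[0] for k in kids) else tuple(kids)

    return walk(tree)


def dc_consensus_tree_method(trees, solver):
    """Refine every polytomy of the strict consensus with the solver's answer."""
    def resolve(h):
        if not isinstance(h, tuple):
            return h
        kids = [resolve(c) for c in h]
        if len(kids) == 2:
            return tuple(kids)  # already binary, no subproblem needed
        child_clusters = [leaves(c) for c in h]
        solved = solver([cut(t, child_clusters) for t in trees])
        by_label = dict(zip(child_clusters, kids))

        def substitute(node):
            if isinstance(node, tuple):
                return tuple(substitute(c) for c in node)
            return by_label[node]

        return substitute(solved)

    return resolve(strict_consensus(trees))


def first_tree_solver(instance):
    """Placeholder for DCSOLVER: returns the first tree of the subproblem."""
    return instance[0]


genes = [((("a", "b"), "c"), ("d", "e")), ((("a", "b"), ("d", "e")), "c")]
print(dc_consensus_tree_method(genes, first_tree_solver))
```

Because each polytomy is handled by an independent call to the solver, the loop over internal nodes could be distributed across workers. Whatever solver is plugged in, the returned tree refines the strict consensus and therefore contains every consensus cluster.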
Results
We used simulation experiments to (i) test if the solutions obtained from efficient heuristics presented in [13] display the Pareto property, and (ii) compare the performance of our new divide and conquer approach based on the Pareto property to the generic heuristic in [13].
Experiment results 1
First, to examine whether subtree pruning and regrafting (SPR) heuristic solutions from [13] display the Pareto property, we generated a series of four 14-taxon trees that share few clusters. To do this, we first generated random 11-taxon trees. Next, we generated random 4-taxon trees containing the species 11–14. We then replaced one of the leaves in the 11-taxon tree with the random 4-taxon tree. This procedure produces gene trees that share at least a single 4-taxon cluster. Although this simulation does not reflect a biological process, it represents extreme cases of error or incongruence among gene trees. In three cases with the 14-taxon gene trees, we found that the SPR heuristic did not return a result that contained the consensus cluster. In these cases, our proof demonstrates that there exists a better solution that also contains the consensus cluster. However, the failure of the SPR heuristic in these cases appears to depend on the starting tree; these data sets did not fail with all starting trees. Thus, the shortcomings of the SPR heuristic may be ameliorated by performing multiple runs from different starting trees.
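The construction of these test trees can be sketched as follows. The text does not specify how the random trees were drawn, so the sketch assumes a simple scheme that repeatedly joins two randomly chosen subtrees, and it assumes that the replaced leaf of the 11-taxon tree is the one labeled 11; the function names are hypothetical.

```python
import random


def random_tree(labels, rng):
    """Random rooted binary tree: repeatedly join two randomly chosen subtrees."""
    forest = list(labels)
    while len(forest) > 1:
        a = forest.pop(rng.randrange(len(forest)))
        b = forest.pop(rng.randrange(len(forest)))
        forest.append((a, b))
    return forest[0]


def graft(tree, leaf, subtree):
    """Replace one leaf of a nested-tuple tree by a subtree."""
    if tree == leaf:
        return subtree
    if isinstance(tree, tuple):
        return tuple(graft(c, leaf, subtree) for c in tree)
    return tree


def simulated_gene_tree(rng):
    """14-taxon tree that always contains the cluster {11, 12, 13, 14}."""
    backbone = random_tree(range(1, 12), rng)      # random 11-taxon tree
    clade = random_tree([11, 12, 13, 14], rng)     # random 4-taxon tree
    return graft(backbone, 11, clade)


def clusters(tree):
    """All clusters of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {frozenset([tree])}
    below = [clusters(c) for c in tree]
    here = frozenset().union(*(max(b, key=len) for b in below))
    return {here}.union(*below)


rng = random.Random(2012)  # arbitrary seed for reproducibility
gene_trees = [simulated_gene_tree(rng) for _ in range(4)]
shared = set.intersection(*(clusters(t) for t in gene_trees))
print(frozenset({11, 12, 13, 14}) in shared)
```

Every tree built this way contains the 4-taxon cluster by construction, while the remaining clusters are shared only by chance.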
Experiment results 2
Experiment results 3
Finally, we examined the performance of Method 1 and compared it to the standalone SPR heuristic using more biologically plausible coalescence simulations. We followed the general structure of the coalescence simulation protocol described by Maddison and Knowles [9]. First, we generated 40 256-taxon species trees based on a Yule pure birth process using the r8s software package [19]. To transform the branch lengths from the Yule simulation to represent generations, we multiplied them all by 1,000,000. Next, we simulated coalescence within each species tree (assuming no migration or hybridization) using Mesquite [20]. All simulations produced a single gene copy from each species. For each species tree, we simulated 20 gene trees assuming a constant population size. The population size affects the number of deep coalescence events, with larger populations leading to more incomplete lineage sorting and consequently less agreement among the gene trees. Thus, to incorporate different levels of incomplete lineage sorting, for 20 of the species trees, we used a constant population size of 10,000, and for 20 we used a constant population size of 100,000. Thus, in total, we produced 40 sets of 20 gene trees, with each set simulated from a different 256-taxon species tree.
For each data set, we performed phylogenetic analyses using Method 1 and also using only the SPR heuristic from Bansal et al. [13]. In contrast to the simulations in Experiment 1, the standalone SPR heuristic of Bansal et al. [13] always returned species trees with all consensus clusters. Of course, all solutions from Method 1 must display the Pareto property. The deep coalescence reconciliation scores for the best trees were similar with both algorithms. When the population size was 10,000, the average coalescence cost was 279, and all the gene trees shared an average of 29.4 clusters. In 19 out of the 20 of these simulations, both approaches produced the same results, while in one case, Method 1 found a species tree with one fewer implied deep coalescence event. When the population size was 100,000, the average coalescence cost was 2142, and all the gene trees shared an average of 19.1 clusters. Although the reconciliation cost never differed by more than 15, Method 1 had a better score in 6 replicates, and the standalone SPR had a better score in 11 replicates. All analyses finished within 30 seconds on a laptop PC, but Method 1 was always faster than SPR alone.
Discussion
In addition to offering a biologically informed optimality criterion to resolve incongruence among gene trees, the deep coalescence problem, as we prove, is guaranteed to retain the phylogenetic clusters on which all gene trees agree. Since the deep coalescence problem is NP-hard [10], most meaningful instances will require heuristics to estimate a solution. We demonstrate that the Pareto property can be leveraged to vastly improve upon the running time of heuristics. Method 1 represents a new general approach to phylogenetic algorithms. In most cases, heuristics to estimate solutions for phylogenetic inference problems are based on a few generic search strategies, such as local search heuristics based on nearest neighbor interchange (NNI), SPR, or tree bisection and reconnection (TBR) branch swapping. Although these search strategies often appear to perform well, they are not connected to any specific phylogenetic problem or optimality criterion. Ideally, however, efficient and effective heuristics should be tailored to the properties of the phylogenetic problem. In the case of the deep coalescence consensus tree problem, the Pareto property provides an informative guiding constraint for the tree search. Specifically, when considering possible solutions, we need only consider solutions that contain all clusters shared by all of the input gene trees, or, in other words, that refine the strict consensus of the input gene trees.
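The constraint described above can be checked mechanically. The sketch below, which again assumes trees written as nested tuples of taxon names and uses hypothetical function names, collects the clusters present in every gene tree and tests whether a candidate species tree contains all of them. It only illustrates the Pareto condition and is not the search procedure of Method 1.

```python
def clusters(tree):
    """Return the set of clusters (leaf sets of all subtrees) of a nested-tuple tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def consensus_clusters(gene_trees):
    """Clusters present in every input gene tree (requires at least one tree)."""
    return set.intersection(*(clusters(tree) for tree in gene_trees))


def has_pareto_property(candidate, gene_trees):
    """True if the candidate species tree contains every consensus cluster."""
    return consensus_clusters(gene_trees) <= clusters(candidate)


gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "B"), ("C", "D")), "E"),
]
print(has_pareto_property((("A", "B"), (("C", "D"), "E")), gene_trees))
print(has_pareto_property(((("A", "C"), "B"), ("D", "E")), gene_trees))
```

In this small example the only non-trivial consensus cluster is {A, B}, so any candidate that separates A from B can be discarded without computing its deep coalescence cost.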
Still, our simulation experiments suggest that, in many cases, the SPR local search heuristic described by Bansal et al. [13] performs well. While we identified cases in which the estimate from the SPR heuristic did not contain the Pareto clusters, in most cases SPR alone found trees as good as, or even slightly better than, those found by Method 1. We note that the size of the simulated coalescence data sets, 256 taxa, exceeds the size of the largest published analysis of the deep coalescence consensus tree problem and is far beyond the largest instances (8 taxa) for which exact solutions have been calculated [11], and the SPR heuristic found good solutions within 30 seconds. However, the running time of the SPR heuristic does not always scale well, and the results of Experiment 2 suggest that it might not be tractable for extremely large data sets. In these cases, Method 1 may in practice vastly improve upon the running time, while guaranteeing a solution with the Pareto property.
Further, Theorem 2 shows that the deep coalescence consensus tree problem exhibits independent optimal substructures. This implies that, once we compute the strict consensus tree of the problem instance, the rest of Method 1 can be directly parallelized, regardless of which external deep coalescence solver is used. If the external solver guarantees exact solutions, our method also gives exact solutions, but it can potentially solve instances with many more taxa than the external solver can handle alone.
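One way to picture the independent substructures is to list the subproblems induced by the unresolved nodes of the strict consensus tree. The sketch below is an assumption-laden illustration rather than the authors' implementation: trees are nested tuples of taxon names, every function name is hypothetical, and each subproblem is formed by restricting every gene tree to a consensus cluster and collapsing the consensus clusters directly below it into single labels. Each resulting set of reduced gene trees could then be handed to any deep coalescence solver, and the subproblems do not depend on one another.

```python
def leaf_set(tree):
    """Return the frozenset of leaf labels under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clusters(tree):
    """Return the set of leaf sets of all subtrees, including leaves and the root."""
    if isinstance(tree, str):
        return {leaf_set(tree)}
    return {leaf_set(tree)}.union(*(clusters(child) for child in tree))


def consensus_clusters(gene_trees):
    """Clusters present in every input gene tree."""
    return set.intersection(*(clusters(tree) for tree in gene_trees))


def children_of(cluster, shared):
    """Maximal shared clusters strictly inside `cluster` (its strict-consensus children)."""
    inside = [c for c in shared if c < cluster]
    return [c for c in inside if not any(c < d for d in inside)]


def find_clade(tree, cluster):
    """Return the subtree whose leaf set equals `cluster`, or None if absent."""
    leaves = leaf_set(tree)
    if leaves == cluster:
        return tree
    if isinstance(tree, str) or not cluster <= leaves:
        return None
    for child in tree:
        hit = find_clade(child, cluster)
        if hit is not None:
            return hit
    return None


def collapse(tree, parts):
    """Replace each clade whose leaf set is one of `parts` by a single joined label."""
    leaves = leaf_set(tree)
    if leaves in parts:
        return "+".join(sorted(leaves))
    if isinstance(tree, str):
        return tree
    return tuple(collapse(child, parts) for child in tree)


def subproblems(gene_trees):
    """Yield (cluster, reduced gene trees) for every unresolved strict-consensus node."""
    shared = consensus_clusters(gene_trees)
    for cluster in shared:
        parts = children_of(cluster, shared)
        if len(parts) > 2:  # more than two children: the node still needs resolving
            yield cluster, [collapse(find_clade(g, cluster), parts) for g in gene_trees]


gene_trees = [
    (((("A", "B"), "C"), "D"), "E"),
    (((("A", "B"), "D"), "C"), "E"),
]
for cluster, reduced in subproblems(gene_trees):
    print(sorted(cluster), reduced)
```

In this example the strict consensus has a single unresolved node over {A, B, C, D}, and the shared clade {A, B} is treated as one taxon, so the solver only has to arrange three units instead of five taxa.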
Although the Pareto property for the deep coalescence consensus tree problem is desirable, and the divide and conquer method is promising for large-scale analyses, there are limitations to their use. First, the Pareto property and Method 1 are limited to the consensus case, that is, instances in which all of the input gene trees contain sequences from all of the species. Also, the Pareto property is only useful when all input trees share some clusters. If there are no consensus clusters among the input trees, then Method 1 conveys no runtime benefits. While this may seem like an extreme case, it is possible with high levels of incomplete lineage sorting, or, perhaps more likely, with much error in the gene tree estimates. Also, as we add more and more gene trees, we would expect more instances of conflict among the gene trees, potentially converging towards the elimination of consensus clusters. Than and Rosenberg [21] recently proved the existence of cases in which the deep coalescence problem is inconsistent, that is, converges on the wrong species tree estimate with increasing gene tree data. Although inconsistency is concerning, the Pareto property provides some reassurance. Even in a worst-case scenario in which the deep coalescence problem is misled, the optimal solutions will still contain all of the agreed-upon clades from the gene trees. Perhaps the greatest advantage of the deep coalescence problem, especially compared to likelihood and Bayesian approaches that infer species trees based on coalescence models (e.g., [22–24]), is its computational speed and the feasibility of estimating a species tree from large-scale genomic data sets representing hundreds or even thousands of taxa [13]. Not only can our method improve the performance of any existing heuristic, but the Pareto property also describes a limited subset of possible species trees that must contain the optimal solution.
Conclusions
We prove that the deep coalescence consensus tree problem satisfies the Pareto property for clusters and describe an efficient algorithm that, given a candidate solution that does not display the consensus clusters, transforms the solution so that it includes all the consensus clusters and has a lower deep coalescence cost. We extend the result and prove that the problem exhibits optimal substructures based on the strict consensus tree of the input gene trees. Based on this property, we suggest a new, parallelizable tree search method, in which we refine the strict consensus of the input gene trees. In contrast to previously proposed heuristics, this method guarantees that the proposed solution will contain the Pareto clusters. Also, as our experiments demonstrate, this method can greatly improve the speed of deep coalescence tree heuristics, potentially enabling efficient and effective estimates from input with thousands of taxa.
List of abbreviations used
LCA: least common ancestor
SPR: subtree pruning and regrafting
NNI: nearest neighbor interchange
TBR: tree bisection and reconnection
Declarations
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and for providing a simpler proof of Lemma 1. This work was conducted with support from the Gene Tree Reconciliation Working Group at NIMBioS through NSF award #EF0832858, with additional support from the University of Tennessee. HTL and OE were supported in part by NSF awards #0830012 and #10117189.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 10, 2012: "Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11)". The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S10.
References
1. Rokas A, Williams BL, King N, Carroll SB: Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature. 2003, 425 (6960): 798-804. 10.1038/nature02053.
2. Pollard DA, Iyer VN, Moses AM, Eisen MB: Widespread Discordance of Gene Trees with Species Tree in Drosophila: Evidence for Incomplete Lineage Sorting. PLoS Genet. 2006, 2 (10): e173.
3. Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G: Fitting the Gene Lineage into its Species Lineage, a Parsimony Strategy Illustrated by Cladograms Constructed from Globin Sequences. Systematic Zoology. 1979, 28 (2): 132-163. 10.2307/2412519.
4. Maddison WP: Gene Trees in Species Trees. Systematic Biology. 1997, 46 (3): 523-536. 10.1093/sysbio/46.3.523.
5. Nichols R: Gene trees and species trees are not the same. Trends in Ecology & Evolution. 2001, 16 (7): 358-364. 10.1016/S0169-5347(01)02203-0.
6. Edwards SV: Is a new and general theory of molecular systematics emerging?. Evolution; International Journal of Organic Evolution. 2009, 63: 1-19. 10.1111/j.1558-5646.2008.00549.x.
7. Knowles LL: Estimating Species Trees: Methods of Phylogenetic Analysis When There Is Incongruence across Genes. Systematic Biology. 2009, 58 (5): 463-467. 10.1093/sysbio/syp061.
8. Yu Y, Warnow T, Nakhleh L: Algorithms for MDC-based multi-locus phylogeny inference. Proceedings of the 15th Annual international conference on Research in computational molecular biology. 2011, RECOMB, Berlin, Heidelberg: Springer-Verlag, 531-545.
9. Maddison WP, Knowles LL: Inferring Phylogeny Despite Incomplete Lineage Sorting. Systematic Biology. 2006, 55: 21-30. 10.1080/10635150500354928.
10. Zhang L: From gene trees to species trees II: Species tree inference in the deep coalescence model. IEEE/ACM Trans Comput Biol Bioinformatics. 2011, 8 (6): 1685-1691.
11. Than C, Nakhleh L: Species Tree Inference by Minimizing Deep Coalescences. PLoS Computational Biology. 2009, 5 (9): e1000501. 10.1371/journal.pcbi.1000501.
12. Than C, Nakhleh L: Inference of parsimonious species tree phylogenies from multi-locus data by minimizing deep coalescences. Estimating Species Trees: Practical and Theoretical Aspects. 2010, Chichester: Wiley-VCH, 79-98.
13. Bansal M, Burleigh JG, Eulenstein O: Efficient genome-scale phylogenetic analysis under the duplication-loss and deep coalescence cost models. BMC Bioinformatics. 2010, 11 (Suppl 1): S42. 10.1186/1471-2105-11-S1-S42.
14. Bininda-Emonds ORP: Phylogenetic supertrees: combining information to reveal the Tree of Life. 2004, Springer.
15. Bryant D: A classification of consensus methods for phylogenies. BioConsensus, DIMACS. AMS. 2003, 163-184.
16. Wilkinson M, Cotton JA, Lapointe F, Pisani D: Properties of Supertree Methods in the Consensus Setting. Systematic Biology. 2007, 56 (2): 330-337. 10.1080/10635150701245370.
17. Wilkinson M, Thorley J, Pisani D, Lapointe FJ, McInerney J: Some desiderata for liberal supertrees. Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life. 2004, Dordrecht, the Netherlands: Springer, 227-246.
18. McMorris FR, Meronk DB, Neumann DA: A view of some consensus methods for trees. Numerical Taxonomy. 1983, 122-125.
19. Sanderson MJ: r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics (Oxford, England). 2003, 19 (2): 301-302. 10.1093/bioinformatics/19.2.301.
20. Maddison WP, Maddison D: Mesquite: a modular system for evolutionary analysis. 2001, [http://mesquiteproject.org]
21. Than CV, Rosenberg NA: Consistency properties of species tree inference by minimizing deep coalescences. Journal of Computational Biology. 2011, 18: 1-15. 10.1089/cmb.2010.0102.
22. Liu L: BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics. 2008, 24 (21): 2542-2543. 10.1093/bioinformatics/btn484.
23. Kubatko LS, Carstens BC, Knowles LL: STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics. 2009, 25 (7): 971-973. 10.1093/bioinformatics/btp079.
24. Heled J, Drummond AJ: Bayesian Inference of Species Trees from Multilocus Data. Molecular Biology and Evolution. 2010, 27 (3): 570-580. 10.1093/molbev/msp274.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.