Refining discordant gene trees

Background Evolutionary studies are complicated by discordance between gene trees and the species tree in which they evolved. Dealing with discordant trees often relies on comparison costs between gene and species trees, including the well-established Robinson-Foulds, gene duplication, and deep coalescence costs. While these costs have provided credible results for binary rooted gene trees, corresponding cost definitions for non-binary unrooted gene trees, which occur frequently in practice, are not well adapted to biological realism. Results We propose a natural extension of the well-established costs for comparing unrooted and non-binary gene trees with rooted binary species trees using a binary refinement model. For the duplication cost we describe an efficient algorithm that is based on a linear time reduction and also computes an optimal rooted binary refinement of the given gene tree. Finally, we show that similar reductions lead to solutions for computing the deep coalescence and the Robinson-Foulds costs. Conclusion Our binary refinement of Robinson-Foulds, gene duplication, and deep coalescence costs for unrooted and non-binary gene trees, together with the linear time reductions provided here for computing these costs, significantly extends the range of trees that can be incorporated into approaches dealing with discordance.


Introduction
Gene trees represent estimates of evolutionary histories of gene families, and are fundamental for evolutionary biological research [1,2]. Often gene trees are assumed to reflect the evolutionary history of species, or species tree, from which their sequences were sampled, presenting a common approach of species tree inference [3][4][5][6][7]. Gene trees can also provide fundamental information to study the evolution of biochemical function in gene families [8].
Gene trees can be inferred from multiple sequence alignments of sequences culled from a gene family. The number of these sequences as well as their evolutionary complexity has expanded on an unprecedented scale in recent years [9], prompting the estimation of ever larger and more credible gene trees. Despite this promise, evolutionary biologists have long recognized the potential for substantial discordance among the gene trees as well as between the gene trees and the species tree in which they evolve [10][11][12][13][14], challenging traditional phylogenetic gene tree and species tree estimation. Discordance can be caused by error as well as by major evolutionary processes, such as the duplication of genes or deep coalescence. Complicating matters further, such error and evolutionary processes can occur on a staggering scale [15,16]. For example, simulations with realistic parameters suggested that analyses of individual avian genes frequently resulted in trees with substantial error [17], and evolutionary processes cause discordance among evolutionary relationships of major avian groups [18]. Consequently, phylogenetic approaches are challenged to deal with error as well as complex histories of evolutionary processes in order to explain discordance in gene trees [19][20][21].
A common approach to deal with discordance in gene trees is to represent them with an estimate of the species tree that is thought to be the median tree of the gene trees under a particular (topological comparison) cost from a gene tree to a species tree; such an estimate is often referred to as a supertree [22]. A median tree S for a given cost and a collection of trees minimizes the sum of the pairwise costs from every gene tree to S. While various costs have been proposed [23][24][25][26], here we are concerned with the well-researched Robinson-Foulds, duplication, and deep coalescence costs. The Robinson-Foulds cost measures quantitative dissimilarities between two trees without relying on an evolutionary model, and is therefore well suited to address discordance caused by error [27,28]. In contrast, the costs for the evolutionary events gene duplication and deep coalescence are both based on an evolutionary parsimony model, which allows discordance to be resolved based on such events [29,30].
However, the presented costs are not well adapted to biological realism [31,32]. In practice gene trees are frequently inferred from sequences that do not permit reliable estimations of rootings or bifurcations [33], and therefore are unrooted and non-binary. The original evolutionary costs for gene duplication and deep coalescence cannot be applied to such trees, since they are only defined for rooted and binary gene trees. In contrast, the Robinson-Foulds distance is formally defined for unrooted and non-binary trees, but it interprets multifurcations in phylogenetic trees as true evolutionary multifurcations (hard multifurcations). However, non-binary relationships in gene trees typically represent uncertainties about the correct binary relationships (soft multifurcations), rather than hard multifurcations, which are rare [34]. Consequently, none of the presented costs is applicable to a large number of gene trees in practice.
More recently, a binary refinement model for the duplication cost [35] and the deep coalescence cost [36,37] for rooted gene trees that are non-binary were introduced. Here we propose a natural extension of this model for our costs to compare unrooted and non-binary gene trees with rooted binary species trees, and describe linear time reductions to compute these costs.

Related work
Here we provide definitions as well as computational and applicability results, first for the Robinson-Foulds cost, and then for the duplication and deep coalescence costs.
The Robinson-Foulds cost is an elementary tool for estimating quantitative dissimilarities between phylogenetic trees [38][39][40]. This cost is defined for two trees to be the cardinality of the symmetric difference of their split presentations for unrooted trees, and of their cluster presentations for rooted trees. The split presentation of an unrooted tree is the set of all bipartitions, called splits, of the tree's taxon set induced by the removal of an edge [39,41]. Analogously, the cluster presentation of a rooted tree is the set of all taxon sets of its full subtrees [39]. The Robinson-Foulds cost for two trees, both either unrooted or rooted, satisfies the metric properties [38], and can be computed in linear time [42]. A randomized approximation scheme computes, in sublinear time and with high probability, a (1 + ε) approximation of the Robinson-Foulds cost [43]. More recently, the Robinson-Foulds cost between an unrooted tree and a rooted tree was introduced in [44] to be the minimum cost under all pairs consisting of a rooting of the unrooted tree and the rooted tree. In fact, this cost is still computable in linear time [44]. Moreover, the distribution of the Robinson-Foulds distance relative to a fixed tree can be computed in linear time [45]. Note that the skewed distribution of the Robinson-Foulds metric suggests that it is only of use when the trees to be compared are quite similar [46]. While the Robinson-Foulds cost is widely used for the comparative analysis of phylogenetic trees, it does not rely on a biological model explaining the difference between trees. Therefore, the Robinson-Foulds cost is generally applicable to any type of trees, e.g. linguistic trees [47] and trees representing dominance hierarchies [48].
In contrast, the duplication and the deep coalescence costs rely on a biological model explaining the discordance between a gene tree and a species tree based on evolutionary events. For a gene and a species tree, both rooted and binary, the duplication cost and the deep coalescence cost are defined to be the minimum number of gene duplications and coalescences, respectively, required to reconcile the gene tree with the species tree [49,50]. While these costs are not symmetric, they are computable in linear time [51,52], and allow credible species trees to be inferred [53][54][55][56][57]. Furthermore, gene trees that are reconciled by the minimum number of evolutionary events allow studying complex histories of evolutionary events [54,58]. The gene duplication and deep coalescence costs can also be defined for binary unrooted gene trees and binary rooted species trees as the minimum cost under all rootings of the gene tree, and can then be computed in linear time [32,59,60]. However, in practice gene trees are often unrooted and non-binary. While existing definitions for such gene trees and rooted binary species trees are linear time computable [31,32], they are not well adapted to biological realism. More recently, cost definitions for non-binary rooted gene trees were introduced that are based on a binary refinement model, by choosing the minimum cost between every binary refinement of a rooted gene tree and a rooted binary species tree, which are polynomial time computable [35,61]. In contrast, finding the minimum cost between a rooted binary gene tree and all binary refinements of a rooted non-binary species tree is NP-hard [37]. However, costs under a binary refinement model for unrooted and non-binary gene trees have not been addressed in the literature. For a detailed overview of gene tree reconciliation the interested reader is referred to [62].
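The cluster-based definition for rooted trees can be illustrated with a short sketch. The nested-tuple tree encoding and the function names `clusters` and `rf_rooted` are assumptions made for this example only, and the code makes no attempt at the linear time bound cited above.

```python
def clusters(tree):
    """Return the cluster presentation of a rooted tree.

    A tree is either a leaf label (str) or a tuple of subtrees.
    Each cluster is the frozenset of leaf labels below one node.
    """
    found = set()

    def visit(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*[visit(child) for child in node])
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def rf_rooted(t1, t2):
    """Cardinality of the symmetric difference of the two cluster sets."""
    return len(clusters(t1) ^ clusters(t2))


t1 = ((("a", "b"), "c"), "d")
t2 = ((("a", "c"), "b"), "d")
t3 = (("a", "b", "c"), "d")      # same as t1 with one edge contracted

print(rf_rooted(t1, t2))  # 2: clusters {a,b} and {a,c} are not shared
print(rf_rooted(t1, t3))  # 1: the multifurcation is treated as a real difference
```

The second result shows the "hard multifurcation" reading discussed in the Introduction: contracting an edge of t1 already yields a non-zero cost, even though t1 is a binary refinement of t3.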

Contributions
Here, we define the Robinson-Foulds, duplication, and deep coalescence costs for unrooted and non-binary gene trees and a rooted binary species tree under the binary refinement model. To compute the duplication cost we describe a linear time reduction from the problem of computing optimal binary refinements of unrooted gene trees to the problem of computing such refinements for rooted gene trees. The latter problem can be solved in linear time [37]. Then, based on the theory of unrooted tree reconciliation [32,44,63,59], we prove that the duplication cost has similar properties to the deep coalescence and Robinson-Foulds costs when comparing unrooted and non-binary gene trees with rooted species trees. From this it follows that we can prove linear time reductions for the deep coalescence and the Robinson-Foulds costs that are similar to our reduction for the duplication cost. Since our reductions require only linear time, the runtime to compute the optimal binary refinements of unrooted gene trees is bounded by the time complexity of computing optimal binary refinements for rooted gene trees.

Basic definitions and preliminaries
An unrooted tree T is an acyclic, connected, and undirected graph that has no degree-two nodes, and in which every degree-one node is labeled with a species name. The degree-one nodes are called leaves, and the remaining nodes are called internal nodes. A tree is binary if every internal node has degree three. A rooted tree is defined similarly to an unrooted tree, with the difference that it has a distinguished node, called the root. A contraction of an edge e of an (un)rooted tree T removes e from T and merges both ends of e into a single node. A binary refinement of an unrooted or rooted tree T is a binary tree that can be transformed into T by contractions. By L(T) we denote the set of all leaf labels in T.
A rooted tree S with a unique leaf labeling is called a species tree. For two nodes a, b of S, a ⊕ b is the least common ancestor of a and b in S. Let T be a rooted tree (called rooted gene tree) such that L(T) ⊆ L(S). By M : T → S we denote the least common ancestor (lca) mapping between the nodes of T and S that preserves the labeling of the leaves. The duplication cost between T and S is defined by: D(T, S) := |{g : g is an internal node of T and M(g) = M(c) for some child c of g}|.
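The lca-mapping and the duplication cost can be sketched as follows. This is a minimal illustration, assuming trees are encoded as nested tuples with species names as leaves and that each node of S is identified by its leaf set; the name `duplication_cost` is hypothetical, and the quadratic lca search is chosen for brevity rather than efficiency.

```python
def duplication_cost(gene, species):
    """Count internal gene nodes g with M(g) == M(c) for some child c.

    Trees are nested tuples; a leaf is a species name (str).
    Nodes of the species tree are identified by their leaf sets.
    """
    s_nodes = []  # leaf set of every node of the species tree

    def collect(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*[collect(child) for child in node])
        s_nodes.append(leaves)
        return leaves

    collect(species)

    def lca(leaves):
        # The lca is the smallest species-tree node containing all the leaves.
        return min((n for n in s_nodes if leaves <= n), key=len)

    duplications = 0

    def mapping(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca(frozenset([node]))
        child_images = [mapping(child) for child in node]
        image = lca(frozenset().union(*child_images))
        if image in child_images:      # M(g) equals M(c) for some child c
            duplications += 1
        return image

    mapping(gene)
    return duplications


S = (("a", "b"), ("c", "d"))
print(duplication_cost((("a", "b"), ("c", "d")), S))   # 0
print(duplication_cost((("a", "c"), ("b", "d")), S))   # 1
print(duplication_cost(((("a", "c"), "b"), "d"), S))   # 2
```

In the second example only the root is a duplication node: both of its children already map to the root of S, while each child's own children map to distinct leaves.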
Let G = 〈V_G, E_G〉 be an unrooted tree (called unrooted gene tree). A rooting of G is defined by choosing an edge e from G on which the root is to be placed. Such a rooted tree will be denoted by G_e. Note that G_e has one more node (the root) than G. A rooted binary refinement of an unrooted gene tree G is a binary refinement of a rooting of G.
The unrooted duplication (urD) cost between an unrooted gene tree G and a species tree S is defined as urD(G, S) := min{D(G_e, S) : e ∈ E_G}. The edges with minimal cost will be called optimal. In the remainder of this work we show first how to compute urD in linear time and space, and then solve the following problem. Observe that, in contrast to our previous studies [44,32,64], here, for the first time, we extend the notion of rooting by incorporating rooting at nodes.
Problem 1 For a given unrooted gene tree G and a binary species tree S, find a binary refinement under all rootings of G that minimizes the duplication cost.
A similar problem for rooted gene trees was solved in [35]. In the remaining section we show how to reduce Problem 1 to the rooted problem in linear time.
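As an illustration of the unrooted duplication cost and of optimal edges, the following brute-force sketch roots a small unrooted gene tree on every edge and keeps the minimum. The adjacency-dictionary encoding, the integer node ids, and the function names are assumptions of this example; the approach costs one full cost evaluation per edge and is not the linear time method developed in this work.

```python
def duplication_cost(gene, species):
    """Duplication cost of a rooted gene tree (nested tuples) against S."""
    s_nodes = []

    def collect(n):
        s = frozenset([n]) if isinstance(n, str) else \
            frozenset().union(*[collect(c) for c in n])
        s_nodes.append(s)
        return s

    collect(species)
    count = 0

    def lca(leaves):
        return min((n for n in s_nodes if leaves <= n), key=len)

    def mapping(n):
        nonlocal count
        if isinstance(n, str):
            return lca(frozenset([n]))
        images = [mapping(c) for c in n]
        image = lca(frozenset().union(*images))
        count += image in images
        return image

    mapping(gene)
    return count


def rooting(adj, label, u, v):
    """Root the unrooted tree on the edge {u, v} (the tree G_e)."""
    def subtree(node, parent):
        if node in label:
            return label[node]
        return tuple(subtree(w, node) for w in adj[node] if w != parent)
    return (subtree(u, v), subtree(v, u))


def unrooted_duplication_cost(adj, label, species):
    """Return the minimum cost over all edge rootings and the optimal edges."""
    costs = {(u, v): duplication_cost(rooting(adj, label, u, v), species)
             for u in adj for v in adj[u] if u < v}
    best = min(costs.values())
    return best, sorted(e for e, c in costs.items() if c == best)


# Unrooted quartet ab|cd: internal nodes 0 and 1, leaves 2..5.
adj = {0: [1, 2, 3], 1: [0, 4, 5], 2: [0], 3: [0], 4: [1], 5: [1]}
label = {2: "a", 3: "b", 4: "c", 5: "d"}
S = (("a", "b"), ("c", "d"))
print(unrooted_duplication_cost(adj, label, S))   # (0, [(0, 1)])
```

Here the internal edge is the unique optimal edge; rooting on any leaf edge forces a duplication at the root.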

Unrooted reconciliation
First we provide definitions introducing the basics of unrooted reconciliation. This approach is partially based on our previous papers [32,44,63,59]. However, for the first time, we prove properties of urD for trees with multifurcations. We assume that G is an unrooted gene tree and S is a species tree. We transform G into a directed graph by replacing each edge {v, w} by a pair of directed edges 〈v, w〉 and 〈w, v〉. We label the directed edges by the nodes of S as follows. If v ∈ G is a leaf labeled by a, then the edge 〈v, w〉 is labeled by the node in S whose label is a. Let v ∈ G have exactly k siblings w_1, w_2, . . . , w_k. If a_i and b_i are the labels of 〈v, w_i〉 and 〈w_i, v〉, respectively, then a_i = ⊕_{j=1, j≠i}^{k} b_j, that is, a_i is the least common ancestor in S of all b_j with j ≠ i. Let ⊤ be the root of S. Each internal node v ∈ G defines a star with the center v as indicated in Figure 1a. We refer to the undirected edge {v, w_i} as e_i, for all i = 1, 2, . . . , k.
There is a limited number of star types in gene trees [44]. Let K be a star with center v and k siblings as indicated in Figure 1a.
Proposition 1 For a given unrooted gene tree G and a species tree S, G can have any number of stars M1. For the remaining stars we have three mutually exclusive cases: (i) G has an empty edge, (ii) G has a double edge or (iii) G has only single edges.
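The labeling of directed edges can be sketched as below. The sketch assumes, as read from the recurrence above, that the label of 〈x, y〉 is the lca in S of all leaves on the x side of the edge; the adjacency-dictionary encoding and the name `edge_labels` are hypothetical, and species-tree nodes are again represented by their leaf sets.

```python
from functools import lru_cache


def edge_labels(adj, label, species):
    """Label every directed edge <x, y> of an unrooted gene tree by a node of S."""
    s_nodes = []

    def collect(n):
        s = frozenset([n]) if isinstance(n, str) else \
            frozenset().union(*[collect(c) for c in n])
        s_nodes.append(s)
        return s

    collect(species)

    def lca(leaves):
        return min((n for n in s_nodes if leaves <= n), key=len)

    @lru_cache(maxsize=None)
    def out_label(x, y):
        if x in label:                               # x is a leaf
            return lca(frozenset([label[x]]))
        # lca of the labels of all edges entering x, except the one from y
        incoming = [out_label(w, x) for w in adj[x] if w != y]
        return lca(frozenset().union(*incoming))

    return {(x, y): out_label(x, y) for x in adj for y in adj[x]}


adj = {0: [1, 2, 3], 1: [0, 4, 5], 2: [0], 3: [0], 4: [1], 5: [1]}
label = {2: "a", 3: "b", 4: "c", 5: "d"}
S = (("a", "b"), ("c", "d"))
labels = edge_labels(adj, label, S)
top = frozenset("abcd")                              # the root of S
print(sorted(e for e, l in labels.items() if l == top))
# [(0, 2), (0, 3), (1, 4), (1, 5)]
```

In this quartet neither direction of the internal edge {0, 1} is labeled by ⊤, whereas every leaf edge carries ⊤ in exactly one direction.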
Proof The proof follows easily from the properties of stars. See also Lemma 2 from [44]. □ Observe that in case (i) G has one or two stars M2, in case (ii) G has a star of type M3-M5 and in (iii) G has exactly one star of type M6.
The next proposition states a crucial difference between binary and general trees. For the proof please refer to [44].
Proposition 2 If both an unrooted gene tree G and a species tree S are binary then G has at least one empty or double edge.

Polytomies and the duplication cost
The next two propositions show how the cost changes when we move the position of the root in G.
Proposition 3 Under the notation from Figure 1, if for some i ∈ {1, 2, . . . , k} one of the following conditions is true:
• the star type is M1 or M3 and b_i = ⊤,
• the star type is M2 and a_i ≠ ⊤ ≠ b_i,
then D(G_{e_i}, S) ≤ D(G_{e_j}, S) for every j = 1, 2, . . . , k.
Proof All rootings of G share the same subtrees attached to w_1, w_2, . . . , w_k. Therefore, all costs share the same component c coming from the partial duplication cost for these subtrees. The remainder follows from the definition of the duplication cost and Figure 1 and Figure 2. For l ∈ {1, 2, . . . , k} let M_l be the lca-mapping from G_{e_l} to S. In the case of stars M1 or M3 we have M_j(v) = M_j(w_i) = ⊤ for j ≠ i. Therefore, both nodes, v and the root of G_{e_j}, are duplication nodes; that is, D(G_{e_j}, S) = c + 2. However, in G_{e_i}, v can be a non-duplication node, thus D(G_{e_i}, S) ≤ c + 2 = D(G_{e_j}, S). In the case of M2, we have M_i(w_i) ≠ ⊤ ≠ M_i(v) and M_i(w_i) ⊕ M_i(v) = ⊤, thus the root of G_{e_i} is a non-duplication node. On the other hand, M_j(v) = ⊤ and the root of G_{e_j} is a duplication node. We conclude c ≤ D(G_{e_i}, S) ≤ c + 1 ≤ D(G_{e_j}, S). □
Proposition 4 Using the notation from Proposition 3, if the star type is M4 − M6 then D(G_{e_i}, S) = D(G_{e_j}, S) for all i and j.
Proof Similarly to the proof of the previous proposition, it is easy to show that the root of G_{e_i} is a duplication node, while v is a duplication node if and only if the star is of type M4 or M5. Therefore, for every i, D(G_{e_i}, S) = c + 2 if the star is of type M4 or M5, and D(G_{e_i}, S) = c + 1 if it is of type M6. □
This observation leads to a linear time and space reduction for urD computation similar to algorithms from [32,44]. Now we reduce Problem 1 to the problem where gene trees are rooted. In the special case of star M6, we need to root a tree at a node instead of an edge. For a non-leaf node v ∈ V_G, by G_v we denote the tree rooted at v. We refer to the algorithm for refining rooted gene trees from [65] by Bin(T, S), where T is a rooted tree and S is a binary species tree. It is known that Bin(T, S) runs in O(|T||S|) time [65].
Theorem 2 Algorithm 1 infers a rooted binary refinement G* of an unrooted gene tree G such that D(G*, S) = min {urD(G', S) : G' is a binary refinement of G}.
Proof The correctness of Algorithm 1 follows from the property that the refinement operation will not change the labels of an existing edge in G and properties of stars for binary trees [63]. We analyze the cases from Proposition 1. (i) If G has a double edge e, then in every (unrooted) binary refinement of G, e is a double edge. Thus, by Proposition 1 e is optimal in every binary refinement of G. We conclude that rooting G at e and removing polytomies from G_e by applying the solution for rooted trees will infer an optimal rooted refinement of G. (ii) The same result applies when G has an empty edge. (iii) When G has only single edges, then the elements of the unique star M6 in G are optimal edges in G. Similarly to the previous cases, these single edges will be present in any (unrooted) binary refinement of G (see Figures 3, 4, 5 for example). However, by Proposition 2 and Proposition 1 they are not necessarily optimal in such refinements. To address this problem, observe that any binary unrooted refinement of G will have either empty or double edges "surrounded" by the edges previously present in the star of type M6. Thus, we can simply root G at the center of the star M6 and then proceed with the refinement procedure for rooted trees. Clearly, the refinement procedure will infer a rooted gene tree T such that its unrooted variant is a binary refinement of G with the minimal duplication cost. An example of a gene tree with star M6 with all binary refinements is depicted in Figure 5.
In summary, it is sufficient to identify an optimal edge in G, and then proceed accordingly with the refinement procedure. In steps 3-5 the algorithm evaluates the labels of the directed edges of G. The optimal edge is found in the loop present in steps 6-7. Finally, the refinement procedure is called in steps 9-10 depending on the type of the star. □
Theorem 3 Algorithm 1 requires O(|G||S|) time, while the reduction (steps 1-7) can be completed in O(|G| + |S|) time and space.
Figure (caption fragment): representation of edges (empty, single and double) that will be used through the rest of this work. The notation ≠ ⊤ denotes that the label is a non-root node from S.
Górecki and Eulenstein BMC Bioinformatics 2014, 15(Suppl 13):S3 http://www.biomedcentral.com/1471-2105/15/S13/S3
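The quantity targeted by Theorem 2 can be cross-checked on very small inputs by exhaustive search. The sketch below is not Algorithm 1 and not the rooted procedure Bin(T, S); it simply enumerates every rooting of G (on edges and at internal nodes), every binary refinement of each rooting, and keeps a tree of minimum duplication cost. All names and the tree encodings are assumptions of this example, and the enumeration is exponential in the size of the polytomies.

```python
from itertools import combinations, product


def duplication_cost(gene, species):
    """Duplication cost of a rooted gene tree (nested tuples) against S."""
    s_nodes = []

    def collect(n):
        s = frozenset([n]) if isinstance(n, str) else \
            frozenset().union(*[collect(c) for c in n])
        s_nodes.append(s)
        return s

    collect(species)
    count = 0

    def lca(leaves):
        return min((n for n in s_nodes if leaves <= n), key=len)

    def mapping(n):
        nonlocal count
        if isinstance(n, str):
            return lca(frozenset([n]))
        images = [mapping(c) for c in n]
        image = lca(frozenset().union(*images))
        count += image in images
        return image

    mapping(gene)
    return count


def binary_trees(items):
    """All rooted binary trees whose leaves are the given subtrees."""
    if len(items) == 1:
        yield items[0]
        return
    first, rest = items[0], items[1:]
    for size in range(len(rest)):              # the right part stays non-empty
        for idx in combinations(range(len(rest)), size):
            left = [first] + [rest[i] for i in idx]
            right = [rest[i] for i in range(len(rest)) if i not in idx]
            for l, r in product(list(binary_trees(left)),
                                list(binary_trees(right))):
                yield (l, r)


def refinements(tree):
    """All binary refinements of a rooted tree."""
    if isinstance(tree, str):
        yield tree
        return
    for kids in product(*[list(refinements(c)) for c in tree]):
        yield from binary_trees(list(kids))


def rootings(adj, label):
    """Rootings of an unrooted tree at internal nodes and on edges."""
    def subtree(node, parent):
        if node in label:
            return label[node]
        return tuple(subtree(w, node) for w in adj[node] if w != parent)
    for u in adj:
        if u not in label:                     # root at an internal node
            yield tuple(subtree(w, u) for w in adj[u])
        for v in adj[u]:
            if u < v:                          # root on the edge {u, v}
                yield (subtree(u, v), subtree(v, u))


def best_rooted_refinement(adj, label, species):
    candidates = (t for r in rootings(adj, label) for t in refinements(r))
    return min(((duplication_cost(t, species), t) for t in candidates),
               key=lambda pair: pair[0])


# A star tree on four leaves: one internal node of degree four.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
label = {1: "a", 2: "b", 3: "c", 4: "d"}
S = (("a", "b"), ("c", "d"))
print(best_rooted_refinement(adj, label, S))
# (0, (('a', 'b'), ('c', 'd')))
```

In this example the cost-0 refinement is only reached by rooting at the central node, since every rooting on an edge of the star places a single leaf directly below the root; this mirrors the role of node rootings discussed in the proof above.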

Figure 3
Gene tree and species trees. An example of an unrooted gene tree G with three multifurcations and a species tree S. The gene tree G is depicted with a star topology, and it has one star of type M2 and three stars of type M1. Every edge e of G is decorated with the duplication cost D(G e , S) (note that the rooting G e is not refined). Observe, that the optimal edge (empty edge) is adjacent to a leaf labelled by a. Rooting at this edge yields the duplication cost 0.
Examples of (unrooted) binary refinements with costs of all rootings of an unrooted gene tree with multifurcations are depicted in Figures 3, 4
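The count of 27 unrooted binary refinements reported for the gene tree of Figure 3 can be reproduced with a small calculation. The sketch relies on the standard fact that an internal node of degree d in an unrooted tree can be resolved in (2d − 5)!! ways, and on the assumption (not stated explicitly in the text, but the only combination consistent with three multifurcations and 27 refinements) that each of the three multifurcations has degree four. The function names are hypothetical.

```python
def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 (returns 1 for n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def count_unrooted_refinements(internal_degrees):
    """Number of binary refinements of an unrooted tree.

    Each internal node of degree d is resolved independently in
    (2d - 5)!! ways, the number of unrooted binary trees on d leaves.
    """
    total = 1
    for d in internal_degrees:
        total *= double_factorial(2 * d - 5)
    return total


# Assumed degrees: three internal nodes of degree 4 and one of degree 3.
print(count_unrooted_refinements([4, 4, 4, 3]))   # 27
```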

Polytomies and other cost functions
Similarly to the gene duplication cost we show results for other cost functions that are related to the duplication cost [63]. Here, we introduce for the first time a general approach, similar to [32,44], for the case where both trees, i.e., a gene tree and a species tree, can be non-binary.
Costs can be defined for rooted trees as follows: rK(T, S) := Σ_{v ∈ I(T)} ξ_K(v), where T is a rooted gene tree and S is a species tree such that L(T) ⊆ L(S), I(T) is the set of all internal nodes of T, K is a cost name and ξ_K : I(T) → R is a contribution function that for an internal node v of T defines a contribution of v to the cost K when comparing T and S. For a node v in a rooted tree, by c(v) we denote the cluster of v defined as the set of all leaf labels visible from v. The contribution functions for standard costs are defined as follows. Let g be an internal node of T and M be the lca-mapping from T to S.
Figure 4
Binary refinements and unrooted reconciliation. All 27 unrooted binary refinements of the gene tree G from Figure 3 shown in star-like topology. Observe that the edge adjacent to a leaf labelled by a is optimal in every refinement of G, and it has the same type as in G (i.e., it is an empty edge). The optimal duplication cost equals 1. The optimal edges with this cost are marked in gene trees G_19, G_23 and G_25. The bottom-right part of this figure depicts the embedding of the optimal rooting of G_25 into the species tree S.
Figure 5
Star M6 and binary refinements. The special case of a refinement when the star M6 is present in a gene tree. An optimal edge can be found after rooting at the center node of star M6 and then applying the refinement procedure for rooted gene trees (see Algorithm 1). An optimal edge of every binary refinement of G is "surrounded" by the edges related to the star M6 present in G. For example, the candidates are two internal edges of G_i for each i. The optimal binary refinement of G has the gene duplication cost equal to 0 and it is obtained by rooting G_7 at the left internal edge. See also bottom part of this figure. Clearly, it has the same topology as the species tree. Similarly to Figure 3, each edge e of the gene tree G is decorated with the duplication cost D(G_e, S), where G_e is a (not refined) rooting of G.
Note that the classical Robinson-Foulds distance can be obtained by RF (T, S) = |I(S)| + 2 * rRF (T, S) − |I(T )|. Additionally, we have to assume that for the RF distance T is bijectively labelled by the labels present in L(S). For more details and discussion please refer to [44,63].
For an unrooted gene tree G and a species tree S, the unrooted cost is defined by: Δ(G, S, f) := min{f(e) : e ∈ E_G}, where f : E_G → R is a cost function usually defined for a cost K by f(e) = rK(G_e, S). Assume that f_S(e) = D(G_e, S); then it can be proved that urD(G, S) = Δ(G, S, f_S).
In the previous section we described the solution to Problem 1 defined for the duplication cost by reducing the unrooted problem to a rooted one in linear time and space. Here, we show that the same kind of reduction can be applied for the DC and RF cost functions.
Problem 2 (Unrooted refinement under DC cost) For a given unrooted gene tree G and a binary species tree S, find a binary refinement under all rootings of G that minimizes the DC cost.
Problem 3 (Unrooted refinement under RF cost) For a given unrooted gene tree G and a binary species tree S, find a binary refinement under all rootings of G that minimizes the RF cost.
The result for the DC and the RF cost follows from [32] (Proposition 1 and Proposition 2) and [44] (Proposition 1 and Proposition 2), respectively. We conclude, that the statement from Theorem 1 also holds for the DC and RF cost functions. Therefore, Algorithm 1 can be used for locating an optimal edge or star M6 in an unrooted gene tree with multifurcations. Then after such a rooting is identified, one can apply the solution that removes polytomies from rooted gene trees. Clearly this reduction can be performed in linear time and space for both cost functions.
Problem 4 (Rooted refinement under DC cost) For a given rooted gene tree G and a binary species tree S, find a binary refinement of G that minimizes the DC cost.
Problem 5 (Rooted refinement under RF cost) For a given rooted gene tree G and a binary species tree S, find a binary refinement of G that minimizes the RF cost.
To our knowledge, Problem 4 and Problem 5 are open, with the exception that Problem 4 can be solved in quadratic time for the case when the gene tree has a bijective leaf labelling [36]. We conjecture that these two problems can be solved in polynomial time similarly to the problem under the duplication cost [35] (see Bin(G_e, S) in Algorithm 1). Our reduction shows that Problem 2 and Problem 3 have the same time complexity as the rooted ones.

Conclusion
To deal with discordance in practice, we introduced a binary refinement model for the well-studied Robinson-Foulds, duplication, and deep coalescence costs. To compute these costs we described novel linear time reductions, from which quadratic time algorithms follow for the duplication cost and for the deep coalescence cost when constrained to bijective labelings. Our binary refinement model together with the efficient algorithms allows the exploitation of the full range of available gene trees. Finally, our algorithms not only compute optimal binary refinement costs efficiently, but also simultaneously root and refine gene trees optimally. However, the time complexity of the Robinson-Foulds cost for unrooted and non-binary gene trees will depend on the time complexity of computing this cost for rooted non-binary gene trees, which is unknown to the best knowledge of the authors.