Structural properties of the reconciliation space and their applications in enumerating nearly-optimal reconciliations between a gene tree and a species tree

BMC Bioinformatics 2011, 12(Suppl 9):S7

DOI: 10.1186/1471-2105-12-S9-S7

Published: 5 October 2011

Abstract

Introduction

A gene tree for a gene family is often discordant with the containing species tree because of its complex evolutionary course, during which gene duplication, gene loss and incomplete lineage sorting events might occur. Hence, it is a great challenge to infer the containing species tree from a set of gene trees. One common approach to this inference problem is through gene tree and species tree reconciliation.

Results

In this paper, we generalize the traditional least common ancestor (LCA) reconciliation to define a reconciliation between a gene tree and a species tree under the tree homomorphism framework. We then study the structural properties of the space of all reconciliations between a gene tree and a species tree in terms of the gene duplication, gene loss or deep coalescence costs. As applications, we show that the LCA reconciliation is the unique one that has the minimum deep coalescence cost, provide a novel characterization of the reconciliations with the optimal duplication cost, and present efficient algorithms for enumerating (nearly-)optimal reconciliations with respect to each cost.

Conclusions

This work provides a new graph-theoretic framework for studying gene tree and species tree reconciliations.

Background

With much higher speed than the traditional Sanger sequencing technology, the ultra-deep sequencing technology has made huge amounts of molecular data available for genomics study [1]. It provides an unprecedented opportunity to infer phylogenetic trees from multilocus and genomics data. One approach to inferring phylogeny from multilocus data is to reconstruct a gene tree from each locus and then to combine the resulting trees into a phylogeny, called the containing species tree. Gene trees are often different since each gene family might undergo different mutational events such as gene duplication and loss, horizontal gene transfer, and incomplete lineage sorting [2, 3]. Therefore, the containing species tree is inferred from gene trees by reconciling it with each gene tree to minimize the total number of hypothetical evolutionary events that are responsible for the discordance between the trees.

The gene tree and species tree reconciliation was first introduced by Goodman et al. [4] and formally defined by Page [5]. Given a gene tree for a gene family and a containing species tree, a reconciliation between them represents an evolutionary scenario of the gene family within the evolutionary history represented by the species tree [4]. To study gene duplication history, the gene tree and the species tree are reconciled to minimize the number of gene duplications and/or losses. The mathematical and algorithmic issues of gene tree and species tree reconciliations have been intensively studied in the past decade [6–14]. For example, it has been shown that the so-called least common ancestor (LCA) reconciliation has the minimum duplication and loss cost [9, 15].

Although the LCA reconciliation is optimal in terms of the duplication cost, it may not represent the true evolution of the gene family being considered. Indeed, recent studies suggest that more than one reconciliation may occur with the highest probability [16, 17]. Such studies [3, 14, 17, 18] in the stochastic framework assume that the discordance between a gene tree and a species tree is caused by incomplete lineage sorting and adopt Kingman’s coalescent theory from population genetics [19].

The fact that the LCA reconciliation may not be the unique optimal one with respect to the duplication cost motivates researchers to study the space of all the reconciliations and develop algorithms to enumerate nearly-optimal reconciliations for a species tree and a gene tree [20, 21]. In this paper, we take a different approach to these two issues. We generalize the LCA reconciliation to define an arbitrary reconciliation as a vertex-mapping from a gene tree to a species tree that preserves the hierarchical structure of the gene tree. Our approach is essentially different from the existing ones [20, 22], where the specific mutation events are used and a gene tree vertex is mapped to a species tree branch to specify a duplication event. One advantage of our approach over the others is that we separate the reconciliation concept from the cost models that are used to measure the tree discordance. Because of this, we are able to study the structural properties of the space of all reconciliations between a gene tree and a species tree in the same manner for each of the three cost models. We show that the LCA reconciliation has not only the minimum duplication and loss cost [9, 15], but also the minimum deep coalescence cost. We also present a novel characterization of the reconciliations with the optimal duplication cost, and develop efficient algorithms for enumerating (nearly-)optimal reconciliations with respect to each cost model.

Methods

Basic notations

Species evolve from their common ancestor through a series of speciation events. A species tree represents the evolutionary history of a set of species. A gene family might evolve from its common ancestral gene through gene duplication and loss events. Here we will assume that no lateral gene transfer has occurred.

Both gene and species trees are rooted trees with labeled leaves. In a species tree, a leaf x represents a species, the label of x. Hence, the species tree is uniquely leaf-labeled. In a gene tree, a leaf y represents a gene found in a species. To infer the duplication history of a gene family, its gene tree and the containing species tree are reconciled [4]. For this purpose, a leaf of a gene tree is labeled with the containing species. Since a species may contain duplicate genes, two leaves in a gene tree can have the same label.

Let T be a species or gene tree; its vertex set and edge set are denoted by V(T) and E(T), respectively. Given two vertices u and v in T, there exists a unique path P(u, v) from u to v. The number of edges in P(u, v), denoted by d(u, v), is called the distance between u and v. Note that d(u, v) = 0 if and only if u = v. The node v is a descendant of u, or u is an ancestor of v, denoted by v ≤ u, if u is on the unique path from r(T), the root of T, to v. For simplicity, we also write v < u if v ≤ u and v ≠ u. Given a set A of vertices in T, u is a common ancestor of A if and only if v ≤ u for every v ∈ A. In addition, if u ≤ u′ for any other common ancestor u′ of A, then we say u is the least common ancestor of A, written as lca(A), or lca(u_1, ⋯, u_k) if A = {u_1, ⋯, u_k}.

For each vertex u in T with u ≠ r(T), the parent of u, denoted by p(u), is the unique vertex in T that is adjacent to u and contained on the path from r(T) to u. In this case, u is also called a child of p(u). The out-degree of u, denoted by d(u), is defined as the number of the children of u. Obviously, a node is a leaf if and only if its out-degree is 0. Non-leaf nodes are internal nodes; they form a subset of V(T). If every internal vertex has out-degree two, then T is binary. For an internal vertex u in a binary tree, its two children are denoted by u_1 and u_2, unless stated otherwise. In this study, we will focus on the case that gene trees and species trees are binary. For a vertex u, we use L(u) to denote the set of the labels of its leaf descendants and call it the cluster induced by u. Finally, we use L(T) to denote the set of leaf labels, i.e., the cluster induced by the root of T.

Reconciliation between gene tree and species tree

Let S be a species tree over a set of species and G a gene tree such that L(G) ⊆ L(S), i.e., G is over all the homologous genes of a gene family found in some species. A map f from V(G) to V(S) is order-preserving if for each pair of vertices u, v in G, u ≤ v implies f(u) ≤ f(v); it is leaf-preserving if, for each leaf x in G, f(x) is the unique leaf in S that has the same label.

A reconciliation between a gene tree G and a species tree S is a leaf-preserving and order-preserving map from V(G) to V(S). Clearly, a reconciliation f between G and S is necessarily an inclusion-preserving mapping (see [8]), that is, for each pair of vertices u, v in G, u ≤ v implies L(f(u)) ⊆ L(f(v)). However, the reverse statement is not true. For instance, the mapping that maps each vertex of G to the root of S is an inclusion-preserving mapping, but according to our definition, it is not leaf-preserving, and hence not a reconciliation.
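
The two defining conditions can be checked mechanically. The sketch below is a minimal illustration under an assumed dictionary encoding (child-to-parent maps for both trees, plus a map from gene leaves to species leaves); the toy trees and names are hypothetical. Because ≤ on V(S) is transitive, it suffices to test the order-preserving condition on each edge of G.

```python
# Hypothetical toy instance: S = ((A,B)X,C)R and G = ((a1,b1)g1,(a2,b2)g2)g0.
S_PARENT = {"R": None, "X": "R", "C": "R", "A": "X", "B": "X"}
G_PARENT = {"g0": None, "g1": "g0", "g2": "g0",
            "a1": "g1", "b1": "g1", "a2": "g2", "b2": "g2"}
LEAF_SPECIES = {"a1": "A", "b1": "B", "a2": "A", "b2": "B"}

def ancestors(parent, v):
    """Return the chain v, p(v), ..., root."""
    chain = []
    while v is not None:
        chain.append(v)
        v = parent[v]
    return chain

def is_reconciliation(f, g_parent, s_parent, leaf_species):
    """Check that f: V(G) -> V(S) is leaf-preserving and order-preserving."""
    # Leaf-preserving: every gene leaf goes to the species leaf with its label.
    if any(f[x] != species for x, species in leaf_species.items()):
        return False
    # Order-preserving: f(u) <= f(p(u)) for every non-root vertex u of G.
    for u, p in g_parent.items():
        if p is not None and f[p] not in ancestors(s_parent, f[u]):
            return False
    return True

# Three candidate maps: a valid one, the all-to-root map, and an order violation.
valid_map = {**LEAF_SPECIES, "g0": "X", "g1": "X", "g2": "X"}
all_to_root = {v: "R" for v in G_PARENT}
bad_order = {**LEAF_SPECIES, "g0": "X", "g1": "R", "g2": "X"}
```

Here `all_to_root` is the inclusion-preserving map mentioned above; the check rejects it because it is not leaf-preserving.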

Note that our definition is consistent with the one used in [20], where a reconciliation is defined as a mapping from V(G) to V(S) ∪ E(S) that satisfies three constraints: base constraint, tree mapping constraint and ancestor consistency constraint. Roughly speaking, our order-preserving condition corresponds to the ancestor consistency constraint, and the leaf-preserving condition is related to the base constraint, while the tree mapping constraint is not needed in our setting. The main difference between these two frameworks is the model used to interpret mappings. For example, in [20], a duplication event is associated with a vertex v in G if and only if v is mapped to an edge, while in our model, whether v is associated with a duplication event is not solely determined by the image of v.

A reconciliation represents a hypothetical evolutionary history of the gene family. In a gene tree, an internal vertex u represents the common ancestor of the genes represented by the leaves below it. The order-preserving property reflects the intuitive fact that u is an ancient gene appearing in some common ancestor of the species from which the genes are taken. Recall that in a species tree each branch represents an ancestral species. Under the reconciliation f, we consider u as the gene ancestor found in the species represented by the branch entering f(u).

There is a canonical partial order ≼ on the set of reconciliations between G and S: for any f′ and f, f′ ≼ f if and only if f′(v) ≤ f(v) holds for every vertex v in G. Define a mapping M from G to S recursively as:
\[ M(u) = \begin{cases} \text{the unique leaf of } S \text{ with the same label as } u, & \text{if } u \text{ is a leaf of } G, \\ \mathrm{lca}(M(u_1), M(u_2)), & \text{otherwise.} \end{cases} \]

M is called the least common ancestor (LCA) reconciliation between G and S. Note that we have M ≼ f for every reconciliation f between G and S, because it is easy to see that M(u) ≤ f(u) holds for all u ∈ V(G), by a bottom-up traversal.
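
The recursive definition of M translates directly into a post-order traversal of G. The following sketch assumes the gene tree is given as a dictionary from each internal vertex to its two children and the species tree as a child-to-parent dictionary; the toy trees and function names are hypothetical.

```python
# Hypothetical toy instance: S = ((A,B)X,C)R and G = ((a1,b1)g1,(a2,c1)g2)g0.
S_PARENT = {"R": None, "X": "R", "C": "R", "A": "X", "B": "X"}
G_CHILDREN = {"g0": ("g1", "g2"), "g1": ("a1", "b1"), "g2": ("a2", "c1")}
LEAF_SPECIES = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}

def ancestors(parent, v):
    """Return the chain v, p(v), ..., root."""
    chain = []
    while v is not None:
        chain.append(v)
        v = parent[v]
    return chain

def lca(parent, a, b):
    """Least common ancestor of two vertices."""
    anc = set(ancestors(parent, a))
    while b not in anc:
        b = parent[b]
    return b

def lca_reconciliation(root, g_children, s_parent, leaf_species):
    """Compute M bottom-up: leaves by label, internal vertices by lca."""
    M = {}

    def visit(u):
        if u in leaf_species:
            M[u] = leaf_species[u]
        else:
            u1, u2 = g_children[u]
            visit(u1)
            visit(u2)
            M[u] = lca(s_parent, M[u1], M[u2])

    visit(root)
    return M

M = lca_reconciliation("g0", G_CHILDREN, S_PARENT, LEAF_SPECIES)
```

On this instance, g1 is mapped to X, while g2 and g0 are both mapped to the root R.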

Inference of gene duplications

If the discord of a gene tree G and its containing species tree S is due to gene duplication, a reconciliation f between them represents a plausible duplication history of genes. For an internal vertex u, a duplication event is associated with u if and only if one of the following two conditions holds: (D-i) f(u) = f(u 1), f(u) = f(u 2) or both hold; (D-ii) P(f(u), f(u 1)) and P(f(u), f(u 2)) contain a common edge. In the literature (see [5]), when the LCA reconciliation M is used for inferring gene duplications, the duplication condition used is (D-i). This is correct for the LCA reconciliation between a gene tree and a species tree. However, this stringent condition is no longer appropriate as the definition of duplication events for arbitrary reconciliations. For example, consider the reconciliation f between the gene tree G and the species tree S as in Figure 1. If the original definition is used, as proposed in [8], only one duplication is inferred, which is associated with r. However, one duplication cannot produce such a gene family having the gene tree G. On the other hand, if our proposed definition is used, two duplications are inferred, one associated with r and the other with b; the implied duplication scenario is given in Figure 2.
http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S7/MediaObjects/12859_2011_4814_Fig1_HTML.jpg

Figure 1

http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S7/MediaObjects/12859_2011_4814_Fig2_HTML.jpg

Figure 2

Now, for an internal node u, we let δ f (u) = 1 if there is a duplication event associated with it, and δ f (u) = 0 otherwise. Then the gene duplication cost gd (f) of f is defined as:
\[ gd(f) = \sum_{u \text{ internal vertex of } G} \delta_f(u). \]
(1)

Gene loss cost

Let G be a gene tree, S a species tree and f a reconciliation between G and S. Then the number of losses l_f(u) associated with an internal vertex u is defined as:
\[ l_f(u) = \begin{cases} d(f(u), f(u_1)) + d(f(u), f(u_2)), & \text{if } \delta_f(u) = 1, \\ d(f(u), f(u_1)) + d(f(u), f(u_2)) - 2, & \text{if } \delta_f(u) = 0. \end{cases} \]
Note that our definition of l_f(u) is a generalization of the one introduced by Ma et al. in [6], and is consistent with the one in [20]. When f is the LCA reconciliation, our definition agrees with the traditional one [5, 6]. For later use, it is often convenient to combine the two formulae in the above definition, i.e., we have:
\[ l_f(u) = d(f(u), f(u_1)) + d(f(u), f(u_2)) - 2\,(1 - \delta_f(u)). \]
(2)
For simplicity, we also set l_f(x) = 0 for any leaf x of G. The gene loss cost gl(f) of f is defined as:
\[ gl(f) = \sum_{u \in V(G)} l_f(u). \]
(3)
For example, for the reconciliation f in Figure 1, we have gl(f) = 7 by noting that:
http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S7/MediaObjects/12859_2011_4814_Equc_HTML.gif

which can also be observed from Figure 2.
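
The gene loss cost can be sketched in code as follows. This example assumes that the combined formula (2) reads l_f(u) = d(f(u), f(u_1)) + d(f(u), f(u_2)) − 2(1 − δ_f(u)), and it decides δ_f(u) by testing whether f(u) coincides with the image of a child or lies strictly above lca(f(u_1), f(u_2)). The toy trees, the two maps and all names are hypothetical.

```python
# Hypothetical toy instance: S = ((A,B)X,C)R and G = ((a1,b1)g1,(a2,c1)g2)g0.
S_PARENT = {"R": None, "X": "R", "C": "R", "A": "X", "B": "X"}
G_CHILDREN = {"g0": ("g1", "g2"), "g1": ("a1", "b1"), "g2": ("a2", "c1")}
LEAVES = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
M = {**LEAVES, "g0": "R", "g1": "X", "g2": "R"}   # LCA reconciliation
F = {**LEAVES, "g0": "R", "g1": "R", "g2": "R"}   # g1 moved up to the root

def ancestors(parent, v):
    chain = []
    while v is not None:
        chain.append(v)
        v = parent[v]
    return chain

def lca(parent, a, b):
    anc = set(ancestors(parent, a))
    while b not in anc:
        b = parent[b]
    return b

def up(parent, low, high):
    """d(high, low) when high is an ancestor of (or equal to) low."""
    return ancestors(parent, low).index(high)

def delta(f, u, g_children, s_parent):
    """1 if a duplication is associated with internal vertex u, else 0."""
    u1, u2 = g_children[u]
    if f[u] in (f[u1], f[u2]):
        return 1
    # For a reconciliation, lca(f(u1), f(u2)) <= f(u), so "!=" means "<".
    return 1 if lca(s_parent, f[u1], f[u2]) != f[u] else 0

def gene_loss_cost(f, g_children, s_parent):
    """gl(f): sum of l_f(u) over internal vertices (leaves contribute 0)."""
    total = 0
    for u, (u1, u2) in g_children.items():
        d1 = up(s_parent, f[u1], f[u])
        d2 = up(s_parent, f[u2], f[u])
        total += d1 + d2 - 2 * (1 - delta(f, u, g_children, s_parent))
    return total
```

On this toy instance the LCA reconciliation `M` costs 2 losses, while `F`, which lies above `M` in the partial order, costs 5.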

Deep coalescence cost

If the discord of a gene tree G and a species tree S is due to incomplete lineage sorting, a reconciliation f between them is measured by the deep coalescence cost [3]. Given a branch e in S, we say that there are k (k > 0) extra lineages (with respect to f) failing to coalesce on e, denoted by τ_f(e) = k, if there exist exactly k + 1 distinct edges (u_i, v_i) (1 ≤ i ≤ k + 1) in G such that e is on the path P(f(u_i), f(v_i)) for each i; otherwise, we let τ_f(e) = 0. The deep coalescence cost dc(f) of f is then defined as:
\[ dc(f) = \sum_{e \in E(S)} \tau_f(e), \]
i.e., the total number of the extra lineages with respect to f on all branches of S. For example, for the reconciliation f in Figure 1, we have dc(f) = 3 by noting that:
http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S7/MediaObjects/12859_2011_4814_Eque_HTML.gif
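
A minimal sketch of the deep coalescence cost: for every edge of G, walk the path between the images of its endpoints in S and count one lineage per species branch; each branch carrying n ≥ 1 lineages contributes n − 1 extra lineages. The dictionary encoding, toy trees and names below are assumptions for illustration, and the input map is assumed to be a valid reconciliation.

```python
from collections import Counter

# Hypothetical toy instance: S = ((A,B)X,C)R and G = ((a1,b1)g1,(a2,c1)g2)g0.
S_PARENT = {"R": None, "X": "R", "C": "R", "A": "X", "B": "X"}
G_CHILDREN = {"g0": ("g1", "g2"), "g1": ("a1", "b1"), "g2": ("a2", "c1")}
LEAVES = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
M = {**LEAVES, "g0": "R", "g1": "X", "g2": "R"}   # LCA reconciliation
F = {**LEAVES, "g0": "R", "g1": "R", "g2": "R"}   # g1 moved up to the root

def lineage_counts(f, g_children, s_parent):
    """Count, for each branch of S, the gene edges whose image path uses it."""
    counts = Counter()
    for u, children in g_children.items():
        for c in children:
            s = f[c]
            while s != f[u]:          # climb from f(c) up to f(u)
                counts[(s_parent[s], s)] += 1
                s = s_parent[s]
    return counts

def deep_coalescence_cost(f, g_children, s_parent):
    """dc(f): total number of extra lineages over all branches of S."""
    counts = lineage_counts(f, g_children, s_parent)
    return sum(n - 1 for n in counts.values())
```

For the toy instance, `M` has two extra lineages (one on each of the branches entering X and A), and moving g1 to the root adds a third.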

Results

The monotonicity of the reconciliation costs

We first have the following useful observations on the gene duplication cost.

Lemma 1 Let f be a reconciliation between a gene tree G and a species tree S. If u is an internal vertex in G with children u_1 and u_2, then the following observations hold.

(i): δ_f(u) = 1 if and only if f(u) ∈ {f(u_1), f(u_2)} or lca(f(u_1), f(u_2)) < f(u).

(ii): δ_f(u) = 0 if and only if f(u_1) ≠ f(u) ≠ f(u_2) and lca(f(u_1), f(u_2)) = f(u).

(iii): If L(f(u_1)) ∩ L(f(u_2)) ≠ ∅, then δ_f(u) = 1.

(iv): If f(u) > M(u), then δ_f(u) = 1.

(v): If δ_f(u) = 0, then f(u) = M(u) and L(f(u_1)) ∩ L(f(u_2)) = ∅.

Proof: Since δ f (u) is either 0 or 1, (ii) clearly follows from (i), and (v) follows from (iii) and (iv).

To establish (i), it suffices to show that lca(f(u_1), f(u_2)) < f(u) if and only if P(f(u), f(u_1)) and P(f(u), f(u_2)) share a common edge. Indeed, if we have lca(f(u_1), f(u_2)) < f(u), then P(f(u), f(u_1)) and P(f(u), f(u_2)) share the edge that joins lca(f(u_1), f(u_2)) and its parent. On the other hand, if P(f(u), f(u_1)) and P(f(u), f(u_2)) share a common edge (s, s′) with s′ < s, then s′ is a common ancestor of f(u_1) and f(u_2) such that s′ < s ≤ f(u). Therefore we have lca(f(u_1), f(u_2)) ≤ s′ < f(u), as required.

Now we proceed to prove (iii). If L(f(u 1)) ∩ L(f(u 2)) ≠ ∅, then we have either f(u 1) ≤ f(u 2) or f(u 2) ≤ f(u 1). By symmetry, we may assume f(u 1) ≤ f(u 2), and hence lca(f(u 1), f(u 2)) = f(u 2) holds. Now there are two cases to be considered, i.e., f(u2) = f(u) and f(u2) <f(u). By (i), we can conclude δ f (u) = 1 in both of them.

It remains to show (iv). Note first that we can assume lca(f(u_1), f(u_2)) > M(u), because otherwise we have lca(f(u_1), f(u_2)) = M(u) < f(u), and hence δ_f(u) = 1 by (i). It follows that f(u_i) > M(u) for some i ∈ {1, 2}. Therefore, by switching u_1 and u_2 if necessary, we can further assume f(u_1) > M(u). Now we need to consider two cases: f(u_2) > M(u) and f(u_2) ≤ M(u). If f(u_2) > M(u), then f(u_1) and f(u_2) are both contained in the path P(f(u), M(u)), and thus L(f(u_1)) ∩ L(f(u_2)) ≠ ∅ holds. On the other hand, f(u_2) ≤ M(u) implies f(u_2) ≤ f(u_1), and hence also L(f(u_1)) ∩ L(f(u_2)) ≠ ∅. Since in both cases we have L(f(u_1)) ∩ L(f(u_2)) ≠ ∅, by (iii) we obtain δ_f(u) = 1, as required. Q.E.D.
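
Characterization (i) of Lemma 1 is convenient for computation. The sketch below evaluates δ_f(u) in two ways, once from the original conditions (D-i) and (D-ii) via shared path edges and once from (i), so that the two can be compared on a toy instance. The encoding, trees and names are hypothetical, and the maps are assumed to be valid reconciliations.

```python
# Hypothetical toy instance: S = ((A,B)X,C)R and G = ((a1,b1)g1,(a2,c1)g2)g0.
S_PARENT = {"R": None, "X": "R", "C": "R", "A": "X", "B": "X"}
G_CHILDREN = {"g0": ("g1", "g2"), "g1": ("a1", "b1"), "g2": ("a2", "c1")}
LEAVES = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
M = {**LEAVES, "g0": "R", "g1": "X", "g2": "R"}   # LCA reconciliation
F = {**LEAVES, "g0": "R", "g1": "R", "g2": "R"}   # g1 moved up to the root

def ancestors(parent, v):
    chain = []
    while v is not None:
        chain.append(v)
        v = parent[v]
    return chain

def lca(parent, a, b):
    anc = set(ancestors(parent, a))
    while b not in anc:
        b = parent[b]
    return b

def path_edges(parent, high, low):
    """Edges of the path from an ancestor `high` down to `low`."""
    edges, s = set(), low
    while s != high:
        edges.add((parent[s], s))
        s = parent[s]
    return edges

def delta_by_definition(f, u, g_children, s_parent):
    u1, u2 = g_children[u]
    if f[u] in (f[u1], f[u2]):                      # condition (D-i)
        return 1
    shared = (path_edges(s_parent, f[u], f[u1])
              & path_edges(s_parent, f[u], f[u2]))  # condition (D-ii)
    return 1 if shared else 0

def delta_by_lemma(f, u, g_children, s_parent):
    u1, u2 = g_children[u]
    # lca(f(u1), f(u2)) <= f(u) for a reconciliation, so "!=" means "<".
    strictly_below = lca(s_parent, f[u1], f[u2]) != f[u]
    return 1 if f[u] in (f[u1], f[u2]) or strictly_below else 0

def gd(f, g_children, s_parent):
    """Gene duplication cost: number of internal vertices with delta = 1."""
    return sum(delta_by_lemma(f, u, g_children, s_parent) for u in g_children)
```

On this instance both evaluations agree at every internal vertex, with gd(M) = 1 and gd(F) = 2.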

Note that (i) in the above lemma provides an additional characterization of gene duplication events. This characterization is easier for calculation while the original definition is more natural, from an evolutionary point of view. By (v) in the above lemma, if a speciation event happens at u, i.e., δ f (u) = 0, then we have f(u) = M(u). This agrees with the definition of reconciliation in [20]. Now we have the following main result.

Theorem 2 Let f and f′ be two distinct reconciliations between a gene tree G and a species tree S with f′ ≼ f; then we have:
\[ gd(f') \le gd(f), \qquad gl(f') < gl(f), \qquad dc(f') < dc(f). \]
(4)

In addition, δ_f′(u) ≤ δ_f(u) for each u ∈ V(G), where the equality holds for each u ∈ V(G) if and only if gd(f′) = gd(f).

Proof: Let D(f′, f) be the number of vertices v in V(G) with f(v) ≠ f′(v); then the following observation plays an important role in our proof of the theorem.

Lemma 3 Let f and f′ be the two reconciliations as given in the theorem. Then there exists a reconciliation f* between G and S that satisfies the following three conditions:
\[ \text{(i) } f' \preceq f^* \preceq f; \qquad \text{(ii) } D(f^*, f) = 1; \qquad \text{(iii) } D(f', f^*) = D(f', f) - 1. \]
Proof: To establish the above lemma, we select a minimal element v_min (with respect to the partial order ≤ on V(G)) in the set {u ∈ V(G) : f(u) ≠ f′(u)}, which is necessarily non-empty by the assumption f ≠ f′. In other words, f(v) = f′(v) holds for any v such that v < v_min. Now consider the map f* defined as:
\[ f^*(u) = \begin{cases} f'(u), & \text{if } u = v_{\min}, \\ f(u), & \text{otherwise.} \end{cases} \]
Then f* is a reconciliation between G and S. To see this, note first that f and f′ are reconciliations, and hence they are leaf-preserving. Therefore we know f* is also leaf-preserving. Let u and v be a pair of vertices in G with u ≤ v. If u, v ∈ V(G) \ {v_min}, then f*(u) ≤ f*(v) because f is order-preserving. On the other hand, if u = v_min and v ≠ u, then we also have:
\[ f^*(u) = f'(u) \le f'(v) \le f(v) = f^*(v), \]

where we use the fact that f′ is order-preserving and f′ ≼ f in the first and second inequality, respectively.

Finally, suppose v = v_min and u ≠ v. By the way that v_min is chosen, we have f′(u) = f(u), and hence also:
\[ f^*(u) = f(u) = f'(u) \le f'(v) = f^*(v). \]

This shows f* is order-preserving, and hence f* is indeed a reconciliation between G and S.

It remains to show that f* satisfies the three conditions required in the claim. Since f′ ≼ f, from the construction of f* we have f′ ≼ f* ≼ f. Noting that v_min is the only vertex in V(G) that is mapped to different images by f and f*, we have D(f*, f) = 1. Finally, for any v in G, f′(v) ≠ f*(v) if and only if v ≠ v_min and f′(v) ≠ f(v). In other words, we have D(f′, f*) = D(f′, f) − 1, which completes the proof of Lemma 3. Q.E.D.

Now it suffices to prove the theorem for the special case D(f, f′) = 1. Indeed, if D(f, f′) = m > 1, then by Lemma 3, there exist m + 1 reconciliations f_1 := f′, f_2, …, f_{m+1} := f so that f_i ≼ f_{i+1} and D(f_i, f_{i+1}) = 1 for 1 ≤ i ≤ m. Applying the theorem (in the special case mentioned above) to each pair of reconciliations f_i and f_{i+1}, we have gd(f_i) ≤ gd(f_{i+1}) for 1 ≤ i ≤ m, and hence gd(f′) = gd(f_1) ≤ gd(f_{m+1}) = gd(f). Similarly, we can show gl(f′) < gl(f), dc(f′) < dc(f), and δ_f′(u) ≤ δ_f(u) for each u ∈ V(G), among which the last one implies that gd(f) = gd(f′) if and only if δ_f(u) = δ_f′(u) for each u ∈ V(G).

Now let v be the unique vertex in G with f(v) ≠ f′(v). Clearly, v is an internal vertex. If v is not the root, let v_0 := p(v) be its parent and v_3 be its sibling, that is, the other child of v_0. The remaining argument is divided into three cases, according to the cost measure considered.

Duplication cost case

Noting that f(v i ) = f′(v i ) ≤ f′(v) <f(v) for i = 1, 2, we have:
\[ \mathrm{lca}(f(v_1), f(v_2)) \le f'(v) < f(v). \]

By (i) in Lemma 1, this shows δ f (v) = 1, and hence δ f (v) ≥ δ f′ (v). If v is the root of G, then we have gd(f) – gd(f′) = δ f (v) – δ f′ (v) ≥ 0, as required.

Now we assume v is not the root, and proceed to show δ f (v 0) ≥ δ f′ (v 0). To begin with, we can assume δ f′ (v 0) = 1, because otherwise the inequality trivially holds. In addition, we can further assume f(v) <f(v 0) and f(v 3) <f(v 0), because otherwise we have δ f (v 0) = 1, which also implies the inequality. It follows that we have:
\[ f'(v) < f(v) < f(v_0) = f'(v_0) \quad \text{and} \quad f'(v_3) = f(v_3) < f(v_0) = f'(v_0). \]

By (i) in Lemma 1, this leads to lca(f′(v), f′(v_3)) < f′(v_0) = f(v_0). Let s be the child of f(v_0) so that lca(f′(v), f′(v_3)) ≤ s. Since f′(v) ≤ f(v) < f(v_0) and f′(v_3) = f(v_3), s is also a common ancestor of f(v) and f(v_3). Therefore we have lca(f(v), f(v_3)) ≤ s < f(v_0). Using (i) in Lemma 1 again, we can conclude δ_f(v_0) = 1, as required.

Since v is the only vertex in G with f(v) ≠ f′(v), for each internal vertex g ∈ V(G) \ {v, v_0} and its two children g_1 and g_2, we have:
\[ f(g) = f'(g), \qquad f(g_1) = f'(g_1), \qquad f(g_2) = f'(g_2). \]

By definition, this implies δ_f(g) = δ_f′(g) for all internal vertices g ∈ V(G) \ {v, v_0}. Combining the above observations, we can conclude that δ_f(u) ≥ δ_f′(u) for each u ∈ V(G). This leads to gd(f) ≥ gd(f′), where the equality holds if and only if δ_f(u) = δ_f′(u) for each u ∈ V(G).

Gene loss case

Since f(v_i) ≤ f′(v) < f(v) holds for i = 1, 2, we have:
\[ d(f(v), f(v_i)) = d(f(v), f'(v)) + d(f'(v), f(v_i)), \qquad i = 1, 2. \]
(5)
Together with the definition of l_f, we obtain:
\[ l_f(v) - l_{f'}(v) = 2\,d(f(v), f'(v)) + 2\,(\delta_f(v) - \delta_{f'}(v)). \]

Since δ_f(v) ≥ δ_f′(v), following the proof of the duplication cost case, and d(f(v), f′(v)) > 0, we can conclude that l_f(v) − l_f′(v) > 0. If v is the root of G, then this leads to gl(f) > gl(f′), as required.

Now we assume v is not the root of G. Then we have:
\[ l_f(v_0) - l_{f'}(v_0) = d(f(v_0), f(v)) - d(f(v_0), f'(v)) + 2\,(\delta_f(v_0) - \delta_{f'}(v_0)) = -d(f(v), f'(v)) + 2\,(\delta_f(v_0) - \delta_{f'}(v_0)), \]
where we use the observation that f′(v) < f(v) ≤ f(v_0) implies:
\[ d(f(v_0), f'(v)) = d(f(v_0), f(v)) + d(f(v), f'(v)). \]
Combining these results, we have:
\[ gl(f) - gl(f') = d(f(v), f'(v)) + 2\,(gd(f) - gd(f')). \]

Since gd(f) ≥ gd(f′), following the proof of the duplication cost case, and d(f(v), f′(v)) > 0, we obtain gl(f) > gl(f′), which completes the proof of this case.

Deep coalescence case

Let E f (S) be the set of edges e in S such that there exists an edge (u,u′) in G such that e is contained in the directed path from f(u) to f(u′). Now by counting extra lineages in terms of the edges contained in paths that have form P(f(u), f(u′)) for some edge (u, u′) in G, we have:
\[ dc(f) = \sum_{(u, u') \in E(G)} d(f(u), f(u')) - |E_f(S)|. \]
(6)
Since E_f(S) = E_f′(S) and f(u) = f′(u) for u ≠ v, the above formula implies:
\[ dc(f) - dc(f') = \big[ d(f(v_0), f(v)) - d(f(v_0), f'(v)) \big] + \sum_{i=1}^{2} \big[ d(f(v), f(v_i)) - d(f'(v), f(v_i)) \big] = -d(f(v), f'(v)) + 2\,d(f(v), f'(v)) = d(f(v), f'(v)) > 0 \]
if v is not the root of G. Here in the second equality we use the observation that f(v) is on the directed path from f(v_0) to f′(v), and for i = 1, 2, f′(v) is on the directed path from f(v) to f(v_i). If v is the root of G, then a similar argument leads to:
\[ dc(f) - dc(f') = d(f(v), f'(v)) > 0, \]

which completes the proof. Q.E.D.

Since the LCA reconciliation is the minimal element in the space of reconciliations, the above theorem leads directly to the following result.

Corollary 4 Among all reconciliations between a gene tree G and a species tree S, the LCA reconciliation (a) has the minimum gene duplication cost [9], and (b) is the unique one with the optimal gene loss cost [15] and the unique one with the optimal deep coalescence cost.

Note that there is a close relationship among the gene duplication, gene loss and deep coalescence costs [7]. From this relationship, one can easily obtain the fact that the LCA reconciliation is the unique one with the optimal gene loss cost from the fact that it is the unique one with the optimal deep coalescence cost, but the reverse is not clear.

Gd-optimal reconciliations

By Corollary 4, the LCA reconciliation is the unique optimal reconciliation for the gene loss cost, as well as the deep coalescence cost. However, the LCA reconciliation may not be the unique optimal one for the gene duplication cost (see [15]). For example, for the reconciliation f in Figure 1 and the LCA reconciliation M between the gene tree and species tree in Figure 1, we have gd(f) = gd(M) = 2. Since the reconciliations with the minimum gene duplication cost, which we shall refer to as gd-optimal reconciliations, may not be unique, in this section we will present a characterization of them, using the theoretical results developed above.

By Theorem 2, a reconciliation f is gd-optimal if and only if δ_f(u) = δ_M(u) holds for each vertex u in G. Based on this, we will show that there exists a unique maximal gd-optimal reconciliation M* so that f is gd-optimal if and only if f ≼ M* holds. The reconciliation M* between a gene tree G and a species tree S can be constructed as follows. For all u ∈ V(G) with δ_M(u) = 0, M* maps u to M(u), i.e., M*(u) = M(u). For those u ∈ V(G) with δ_M(u) = 1, we shall define M*(u) recursively. If u = r(G), i.e., it is the root of G, then M*(u) is defined as r(S), the root of S. Otherwise, M*(p(u)) has been defined, and M*(u) is defined as:
\[ M^*(u) = \begin{cases} M^*(p(u)), & \text{if } \delta_M(p(u)) = 1, \\ \max\{ s \in V(S) : M(u) \le s < M^*(p(u)) \}, & \text{if } \delta_M(p(u)) = 0. \end{cases} \]
If u is a vertex in G such that u ≠ r(G), then δ_M(p(u)) = 0 implies M(u) < M(p(u)) ≤ M*(p(u)), hence the mapping M* is well defined. In addition, M* is also a reconciliation between G and S. To see this, note that if u is a leaf in G, then we have δ_M(u) = 0, which implies M*(u) = M(u) and hence M* is leaf-preserving. On the other hand, by the construction of M*, it is order-preserving. For example, for the gene tree and species tree in Figure 1, the reconciliation M* is defined as:
http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S7/MediaObjects/12859_2011_4814_Equt_HTML.gif

In this example, it is not difficult to check that gd(f) = gd(M*) holds for all f ≼ M*, which also follows directly from the following general result.

Theorem 5 Given a gene tree G and a species tree S, a reconciliation f is gd-optimal if and only if M ≼ f ≼ M* holds. In particular, M* is the unique maximal gd-optimal reconciliation between G and S.

Proof: We need only to show that gd(f) = gd(M) for a reconciliation f if and only if M ≼ f ≼ M* holds, because this implies M* is indeed the unique maximal gd-optimal reconciliation.

To show that gd(M) = gd(f) holds for every reconciliation f with M ≼ f ≼ M*, it suffices to prove gd(M*) = gd(M), because together with Theorem 2, this implies gd(f) = gd(M) = gd(M*). To this end, we need only to show δ_M(u) = δ_M*(u) for each internal vertex u in G. Now fix an internal vertex u in G. Since M ≼ M*, we have δ_M*(u) ≥ δ_M(u) by Theorem 2. If δ_M(u) = 1, then we have δ_M*(u) = 1 = δ_M(u). Therefore it remains to consider the case δ_M(u) = 0. By (ii) in Lemma 1, δ_M(u) = 0 implies M(u_1) ≠ M(u) ≠ M(u_2). Together with the construction of M*, we have M*(u_1) ≠ M*(u) ≠ M*(u_2). Since M(u_i) ≤ M*(u_i) ≤ M*(u) for i = 1, 2, we have:
\[ \mathrm{lca}(M(u_1), M(u_2)) \le \mathrm{lca}(M^*(u_1), M^*(u_2)) \le M^*(u). \]
By the construction of M, we know lca(M(u_1), M(u_2)) = M(u), and hence:
\[ M(u) \le \mathrm{lca}(M^*(u_1), M^*(u_2)) \le M^*(u) = M(u). \]

By (ii) in Lemma 1, this shows δ M * (u) = 0, as required.

To establish the other direction, assume gd(f) = gd(M) for a reconciliation f, and we shall show f ≼ M*, i.e., f(u) ≤ M*(u) for each internal u in V(G). To this end, fix an internal vertex u in G, and denote its two children by u_1 and u_2. If δ_f(u) = 0, then by (v) in Lemma 1 we have f(u) = M(u), and hence f(u) = M*(u). Therefore, it remains to prove f(u) ≤ M*(u) for δ_f(u) = 1, which will be established by induction. The base case is u being the root of G; then M*(u) is the root of S, and f(u) ≤ M*(u) trivially holds. For the induction step, let u_0 := p(u) be the parent of u; then the induction assumption is f(u_0) ≤ M*(u_0). Now if δ_M(u_0) = 1, then by the definition of M* we have:
\[ f(u) \le f(u_0) \le M^*(u_0) = M^*(u). \]
Otherwise, we have δ_M(u_0) = 0. Together with M ≼ f, M ≼ M* and gd(M) = gd(f) = gd(M*), this leads to δ_f(u_0) = δ_M*(u_0) = 0 by Theorem 2. In view of (v) in Lemma 1, we obtain M(u_0) = f(u_0) = M*(u_0). Since δ_f(u_0) = 0, (ii) in Lemma 1 implies f(u) < f(u_0), and hence also:
\[ M(u) \le f(u) < f(u_0) = M^*(u_0). \]

By definition, M*(u) is the largest vertex in the set {s : M(u) ≤ s <M*(u 0)}. Since M*(u 0) = M(u 0), we can conclude f(u) ≤ M*(u), which completes the proof. Q.E.D.
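
The construction of M* described in this section can be sketched as a top-down traversal of G. The code below assumes that M and δ_M have already been computed (here they are written out by hand for a hypothetical toy instance), and that in the case δ_M(p(u)) = 0 the vertex M*(u) is the child of M*(p(u)) on the path down to M(u), i.e., the largest s with M(u) ≤ s < M*(p(u)). All names are illustrative.

```python
# Hypothetical toy instance: S = ((A,B)X,C)R, and the gene tree
# G = (((a1,b1)g3,(a2,b2)g4)g1, ((a3,a4)g5,b3)g2)g0.
S_PARENT = {"R": None, "X": "R", "C": "R", "A": "X", "B": "X"}
S_ROOT = "R"
G_CHILDREN = {"g0": ("g1", "g2"), "g1": ("g3", "g4"), "g3": ("a1", "b1"),
              "g4": ("a2", "b2"), "g2": ("g5", "b3"), "g5": ("a3", "a4")}
# LCA reconciliation M and its duplication indicators, precomputed by hand.
M = {"a1": "A", "b1": "B", "a2": "A", "b2": "B", "a3": "A", "a4": "A",
     "b3": "B", "g3": "X", "g4": "X", "g5": "A", "g1": "X", "g2": "X",
     "g0": "X"}
DELTA_M = {"g0": 1, "g1": 1, "g2": 0, "g3": 0, "g4": 0, "g5": 1}

def max_gd_optimal(root, g_children, s_parent, s_root, M, delta_M):
    """Build M* top-down from M and delta_M."""
    M_star = {}

    def visit(u, parent_u):
        if u not in g_children or delta_M[u] == 0:
            M_star[u] = M[u]                 # leaves and speciation vertices
        elif parent_u is None:
            M_star[u] = s_root               # duplication at the root of G
        elif delta_M[parent_u] == 1:
            M_star[u] = M_star[parent_u]     # parent is also a duplication
        else:
            # Parent is a speciation: climb from M(u) to the child of
            # M*(p(u)), the largest s with M(u) <= s < M*(p(u)).
            s = M[u]
            while s_parent[s] != M_star[parent_u]:
                s = s_parent[s]
            M_star[u] = s
        for c in g_children.get(u, ()):
            visit(c, u)

    visit(root, None)
    return M_star

M_STAR = max_gd_optimal("g0", G_CHILDREN, S_PARENT, S_ROOT, M, DELTA_M)
```

In this instance the two stacked duplications g0 and g1 are lifted to the root R, whereas g5 stays at A because its parent g2 is a speciation vertex mapped to X.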

Enumerate nearly-optimal reconciliations

Recall that reconciliations other than the LCA reconciliation may also have the minimum duplication cost. Moreover, in a biological study, a nearly-optimal reconciliation could be the correct solution to the problem under study. Therefore, it is of interest to study the following problem [20]: Given a positive number ε, compute the set of nearly-optimal reconciliations whose duplication cost is less than or equal to gd(M) + ε, where gd(M) is the minimum duplication cost a reconciliation between the gene tree and the species tree can have. This set of nearly-optimal reconciliations is denoted by Γ_ε(G, S, gd); it is a subset of Γ(G, S), the set of all reconciliations between G and S.

In this section we will present an algorithm for enumerating Γ_ε(G, S, gd). To this end, we need to introduce some additional definitions. Following [20], for a vertex u ∈ V(G), let id(u) be the number of vertices that precede u according to the prefix traversal of G, where the left child u_1 of an internal vertex u of G is visited before the right child u_2. For a reconciliation f in Γ(G, S), and an internal vertex u of G with f(u) ≠ r(S), f[u] is a mapping defined as:
\[ f[u](v) = \begin{cases} p(f(u)), & \text{if } v = u, \\ f(v), & \text{otherwise.} \end{cases} \]
For an internal vertex u with u ≠ r(G), f[u] is a reconciliation if and only if f(u) < f(p(u)); for the root r(G) of G, f[r(G)] is a reconciliation if and only if f(r(G)) < r(S). In both cases, we say that the reconciliation f[u] is obtained from f by applying a Nearest Mapping Change (NMC) operator on u; this operator is adapted from the one introduced in [20]. Similarly, we can define f[u_1, ..., u_k] for a sequence of (not necessarily distinct) vertices in G. Note that for a reconciliation f in Γ(G, S) with f ≠ M, there exists a unique sequence u_1, ..., u_k such that f = M[u_1, ..., u_k] and id(u_i) ≤ id(u_{i+1}) for i = 1, ..., k − 1; now id(f) is defined as id(u_k), where u_k is the last vertex in this sequence. For completeness, we use the convention id(M) = 0. Finally, for a reconciliation f in Γ(G, S), we set:
K(f) := {u ∈ V(G) : f[u] is in Γ(G, S) and id(u) ≥ id(f)},

where K(f) will be regarded as an ordered list (with the order induced by id).
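The NMC operator can be illustrated with a minimal Python sketch. It assumes that f[u] agrees with f everywhere except at u, which is moved to the parent of f(u) in S, and it checks the two validity conditions stated above. The parent-dictionary encoding of the trees and the names `nmc`, `strictly_below`, `G_parent` and `S_parent` are illustrative assumptions, not part of the original paper.

```python
def strictly_below(S_parent, a, b):
    """Return True if species vertex a is a proper descendant of b."""
    a = S_parent[a]
    while a is not None:
        if a == b:
            return True
        a = S_parent[a]
    return False


def nmc(f, u, G_parent, S_parent):
    """Return the mapping f[u], or None if f[u] is not a reconciliation."""
    if u not in G_parent.values():      # leaves of G are never moved
        return None
    if S_parent[f[u]] is None:          # f(u) = r(S): no parent to move to
        return None
    p = G_parent[u]
    if p is not None and not strictly_below(S_parent, f[u], f[p]):
        return None                     # a non-root u requires f(u) < f(p(u))
    g = dict(f)                         # copy, so that f itself is unchanged
    g[u] = S_parent[f[u]]
    return g


# Toy instance (hypothetical): G and S share the shape ((a,b),c).
S_parent = {"a": "x", "b": "x", "x": "r", "c": "r", "r": None}
G_parent = {"ga": "v", "gb": "v", "v": "w", "gc": "w", "w": None}
M = {"ga": "a", "gb": "b", "gc": "c", "v": "x", "w": "r"}

print(nmc(M, "v", G_parent, S_parent))  # v is moved from x to r
print(nmc(M, "w", G_parent, S_parent))  # None, since M(w) = r(S)
```

Returning a fresh dictionary keeps each reconciliation immutable, which is convenient when several children of the same f are generated in turn.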

The NMC operator induces a tree structure on the set Γ(G, S): the root is M, and f′ is a child of f if and only if f′ = f[u] for some u ∈ K(f). This tree, whose vertex set is Γ(G, S), will be denoted by T(G, S). The idea of considering a tree structure on the space of reconciliations was introduced in [20]. Clearly, by Theorem 2, the restriction of T(G, S) to Γ_ε(G, S, gd) is a subtree, which will be referred to as T_ε(G, S, gd). We can now state our algorithm, which enumerates Γ_ε(G, S, gd) by a traversal of T_ε(G, S, gd). Here ⊔ stands for disjoint union.
http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S7/MediaObjects/12859_2011_4814_Equaa_HTML.gif

To analyze the running time of the above algorithm, note first that for a reconciliation f, K(f) is a subset of the set of internal vertices of G, and for each internal vertex u of G, whether u ∈ K(f) or not can be determined in constant time when id(u) and id(f) are known. In addition, if δ_M is given, then lines 5a and 5b can be computed in constant time; the proof of this observation will be presented in the full version of this paper. Therefore, the above algorithm runs in time O(|V(G)| · |Γ_ε(G, S, gd)|), plus additional preprocessing time to compute id(u) and δ_M(u) for each internal vertex u of G.
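The traversal idea can be sketched in Python as follows. This is not the authors' listing: it is a plain depth-first walk of the NMC tree that recomputes the cost of every candidate child instead of using the constant-time updates mentioned above, and it leaves the cost model to the caller through a hypothetical `cost` callable, which is assumed never to decrease from a parent to a child in the tree. The tree encodings and all function names are illustrative assumptions.

```python
def enumerate_near_optimal(G_children, root_G, S_parent, M, cost, eps):
    """Yield each reconciliation f in the NMC tree with cost(f) <= cost(M) + eps.

    G_children maps every internal vertex of G to its [left, right] children.
    """
    # id(u): position of u in the prefix traversal, left child first.
    ident, G_parent, stack = {}, {root_G: None}, [root_G]
    while stack:
        u = stack.pop()
        ident[u] = len(ident)
        for child in reversed(G_children.get(u, [])):
            G_parent[child] = u
            stack.append(child)
    internal = sorted(G_children, key=ident.get)

    def strictly_below(a, b):
        a = S_parent[a]
        while a is not None:
            if a == b:
                return True
            a = S_parent[a]
        return False

    def nmc(f, u):
        # f[u], or None when f[u] is not a reconciliation.
        if S_parent[f[u]] is None:
            return None
        p = G_parent[u]
        if p is not None and not strictly_below(f[u], f[p]):
            return None
        g = dict(f)
        g[u] = S_parent[f[u]]
        return g

    bound = cost(M) + eps
    todo = [(M, 0)]                      # pairs (f, id(f)); id(M) = 0
    while todo:
        f, id_f = todo.pop()
        yield f
        for u in internal:               # candidates for K(f), in id order
            if ident[u] < id_f:
                continue
            g = nmc(f, u)
            if g is not None and cost(g) <= bound:
                todo.append((g, ident[u]))


# Toy instance (hypothetical): G and S share the caterpillar shape (((a,b),c),d).
S_parent = {"a": "x", "b": "x", "x": "y", "c": "y", "y": "r", "d": "r", "r": None}
G_children = {"vr": ["vy", "gd"], "vy": ["vx", "gc"], "vx": ["ga", "gb"]}
M = {"ga": "a", "gb": "b", "gc": "c", "gd": "d", "vx": "x", "vy": "y", "vr": "r"}

# With a constant cost nothing is pruned, so the whole tree T(G, S) is listed.
everything = list(enumerate_near_optimal(G_children, "vr", S_parent, M, lambda f: 0, 0))
print(len(everything))                   # 5 reconciliations on this instance
```

Because each rejected child is still generated and evaluated, this sketch also visits reconciliations just outside the bound, so its work is not proportional to the output size alone.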

Two facts prevent us from designing a better algorithm for the enumeration problem. The first one concerns the boundary set B_ε(G, S, gd), which consists of all reconciliations f in Γ(G, S) \ Γ_ε(G, S, gd) such that, for some f* ∈ Γ_ε(G, S, gd), f is a child of f* in T(G, S). In order to enumerate Γ_ε(G, S, gd), an algorithm typically needs to visit not only the reconciliations in Γ_ε(G, S, gd), but also those in B_ε(G, S, gd). However, |B_ε(G, S, gd)| could be as large as O(|V(G)| · |Γ_ε(G, S, gd)|). For instance, if G and S have the same tree structure on n + 1 leaves, then Γ_0(G, S, gd) = {M} but B_0(G, S, gd) contains n − 1 reconciliations. Furthermore, we have |Γ_1(G, S, gd)| = n and |B_1(G, S, gd)| = Θ(n^2).

The other concern is about the set K_ε(f) := {u ∈ V(G) : f[u] is in Γ_ε(G, S, gd) and id(u) ≥ id(f)}, which is needed if we want to explore Γ_ε(G, S, gd) without visiting the boundary set B_ε(G, S, gd). However, the two sets K(f) and K_ε(f) do not share all of their properties. For instance, the following property of K(f) is crucial to the optimal algorithm for exploring Γ(G, S) (see Property 5 and Proposition 4 in [20]): If u is the first vertex in K(f) and f′ = f[u], then we have K(f) \ K(f′) ⊆ {u}. However, this does not hold for K_ε. To see this, consider the example mentioned in the previous paragraph and denote the first child of r(G) by r_1; then K_1(M) consists of all internal vertices of G other than r(G), while K_1(M[r_1]) = ∅.

Since Γ_0(G, S, gd) contains the gd-optimal reconciliations, the above algorithm also provides a method for enumerating all the optimal reconciliations between a gene tree and a species tree. Since T_ε(G, S, gl), as well as T_ε(G, S, dc), is also a subtree of T(G, S), the algorithm can be modified to list nearly-optimal reconciliations with respect to the gene loss or deep coalescence cost. Due to limited space, the details of these algorithms are omitted here; the reader is referred to the full version of this work, available on our personal website. As ongoing work, the algorithms presented here will be implemented in C++ and evaluated by comparing them with the existing ones on simulated data.

Conclusions

To investigate all reconciliations between a gene tree and a species tree, we have generalized the LCA reconciliation to define an arbitrary reconciliation as a vertex mapping from the gene tree to the species tree. This provides a new framework for investigating various mathematical issues of the reconciliation space. It allows us to give a unified approach to studying reconciliations under each of the cost models. As applications, we show that the LCA reconciliation is the unique one having the smallest deep coalescence cost, and present a characterization of the reconciliations with the minimum gene duplication cost; we also develop efficient algorithms to enumerate nearly-optimal reconciliations with respect to each cost model. In the future, we shall incorporate other evolutionary forces behind gene tree heterogeneity, such as horizontal gene transfer and recombination, into this framework.

Declarations

Acknowledgements

The work was financially supported by the Singapore MOE grant R-146-000-134-112. We thank three anonymous referees for providing constructive comments.

This article has been published as part of BMC Bioinformatics Volume 12 Supplement 9, 2011: Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S9.

Authors’ Affiliations

(1)
Department of Mathematics, National University of Singapore

References

  1. Metzker M: Sequencing technologies - the next generation. Nature Reviews Genetics 2010, 11:31–46.
  2. Pamilo P, Nei M: Relationship between gene trees and species trees. Mol. Biol. Evol 1988, 5:568–583.
  3. Maddison W: Gene trees in species trees. Syst. Biol 1997, 46:523–536.
  4. Goodman M, Czelusniak J, Moore G, Romero-Herrera A, Matsuda G: Fitting the gene lineage into its species lineage: A parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool 1979, 28:132–168.
  5. Page R: Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst. Biol 1994, 43:58–77.
  6. Ma B, Li M, Zhang L: From gene trees to species trees. SIAM J. Comput 2010, 30:729–752.
  7. Zhang L: From gene trees to species trees II: species tree inference by minimizing deep coalescence event. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011. accepted
  8. Bonizzoni P, Della Vedova G, Dondi R: Reconciling a gene tree to a species tree under the duplication cost model. Theoretical Computer Science 2005, 347:36–53.
  9. Gorecki P, Tiuryn J: DLS-trees: A model of evolutionary scenario. Theoretical Computer Science 2006, 359:378–399.
  10. Arvestad L, Lagergren J, Sennblad B: The gene evolution model and computing its associated probabilities. J. ACM 2009, 56(2):1–44.
  11. Chen K, Durand D, Farach-Colton M: Notung: A program for dating gene duplications and optimizing gene family trees. Journal of Computational Biology 2000, 7:429–447.
  12. Eulenstein O, Mirkin B, Vingron M: Duplication-based measures of difference between gene and species trees. Journal of Computational Biology 1998, 5:135–148.
  13. Bansal M, Eulenstein O: The multiple gene duplication problem revisited. Bioinformatics 2008, 23:132–138.
  14. Liu L, Yu L, Kubatko L, Pearl D, Edwards S: Coalescent methods for estimating phylogenetic trees. Mol. Phylogenet. Evol 2009, 53:320–328.
  15. Chauve C, El-Mabrouk N: New perspectives on gene family evolution: Losses in reconciliation and a link with supertrees. In Research in Computational Molecular Biology, LNCS 5541. Edited by: Batzoglou S. Springer Berlin/Heidelberg; 2009:46–58.
  16. Degnan J, Rosenberg N: Gene tree discordance, phylogenetic inference, and the multispecies coalescent. Trends in Ecology and Evolution 2009, 24:332–340.
  17. Than C, Rosenberg N: Consistency properties of species tree inference by minimizing deep coalescences. Journal of Computational Biology 2011, 18:1–15.
  18. Than C, Nakhleh L: Species tree inference by minimizing deep coalescences. PLoS Computational Biology 2009, 5(9):e1000501.
  19. Kingman J: Origins of the coalescent. 1974–1982. Genetics 2000, 156:1461–1463.
  20. Doyon J, Chauve C, Hamel S: Space of gene/species trees reconciliations and parsimonious models. Journal of Computational Biology 2009, 16:1399–1418.
  21. Doyon J, Hamel S, Chauve C: An efficient method for exploring the space of gene tree/species tree reconciliations in a probabilistic framework. preprint 2010.
  22. Arvestad L, Berglund A, Lagergren J, Sennblad B: Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. Proceedings of the eighth annual international conference on Research in computational molecular biology, RECOMB ’04 2004, 326–335.

Copyright

© Wu and Zhang; licensee BioMed Central Ltd. 2011

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
