Maximum parsimony reconciliation in the DTLOR model

Background Analyses of microbial evolution often use reconciliation methods. However, the standard duplication-transfer-loss (DTL) model does not account for the fact that species trees are often not fully sampled and thus, from the perspective of reconciliation, a gene family may enter the species tree from the outside. Moreover, within the genome, genes are often rearranged, causing them to move to new syntenic regions. Results We extend the DTL model to account for two events that commonly arise in the evolution of microbes: origin of a gene from outside the sampled species tree and rearrangement of gene syntenic regions. We describe an efficient algorithm for maximum parsimony reconciliation in this new DTLOR model and then show how it can be extended to account for non-binary gene trees to handle uncertainty in gene tree topologies. Finally, we describe preliminary experimental results from the integration of our algorithm into the existing xenoGI tool for reconstructing the histories of genomic islands in closely related bacteria. Conclusions Reconciliation in the DTLOR model can offer new insights into the evolution of microbes that is not currently possible under the DTL model.

(e.g., only duplication and loss) or different types of events (e.g,. incomplete lineage sorting). While the DTL model is applicable to evolution in microbes, it only allows horizontal transfer between species that are part of the species tree. In the analysis of the evolution of microbes in particular, it is quite common that the species tree is not fully sampled. Thus, from the perspective of performing a reconciliation analysis, a gene family may effectively enter the given species tree via transfer from the outside [2,3].
In this paper we describe the DTLOR model that addresses this issue by extending the DTL model to allow some or all of the evolution of a gene family to occur outside of the given species tree and for transfers events to occur from the outside. To facilitate the recognition of such entry events, the model also keeps track of the syntenic region of each gene as it evolves in the species tree. Two genes are said to be in the same syntenic region if they share a substantial fraction of core genes in a relatively large window around them and, second, they share a certain amount of similarity among all genes in a smaller window around them [2]. Thus, in addition to duplication, transfer, and loss events, the DTLOR model adds origin events to indicate that a gene is transferred from outside of the species tree and rearrangement events that account for changes in the syntenic regions of genes on the genome.
In the DTL model, reconciliation is generally performed using a maximum parsimony formulation. A positive cost is associated with each type of event and the objective is to find a reconciliation that minimizes the total cost of the incurred events. Efficient algorithms have been developed for maximum parsimony reconciliations (MPRs) in the DTL model [4][5][6], and several software tools implement these algorithms [7][8][9][10].
In earlier related work, Delabre et al. studied a related problem of reconciliation with synteny information in the Duplication-Loss model; horizontal transfer events were not considered in that work [11]. Szöllősi et al. [3] proposed an event called "transfer from the dead" to account for gene evolution that occurs outside of the species tree and Jacox et al. [10] described an extension of an existing DTL maximum parsimony reconciliation algorithm to compute most parsimonious reconciliations with this additional event. Our work differs in two significant ways from that prior work. First, while "transfer from the dead" allows gene lineages to transfer out of and back into the sampled species trees multiple times, the DTLOR model only permits transfer from the outside under the assumption that the species tree comprises closely-related species and, thus, transfers out of the species tree and back in are considered to be relatively rare. Second, the DTLOR model captures rearrangement events, which are not considered in conjunction with DTL events in previous models. Reconstructing rearrangement events is particularly important in identifying genomic islands in bacteria [2].
In summary, in this paper we extend the DTL model to allow origin (O) and rearrangement (R) events. We give an exact polynomial-time algorithm for maximum parsimony reconciliation in the DTLOR model. Since gene trees are often non-binary due to lack of signal in their sequence data, we show how maximum parsimony reconciliations can be found in fixed-parameter polynomial time for non-binary gene trees where the parameter is the maximum branching factor of a node. Finally, we describe preliminary results from the integration of the DTLOR MPR algorithm into the xenoGI tool [2] which may provide new insights into microbial evolution.

Definitions
An instance of the DTLOR reconciliation problem comprises undated rooted species and gene trees, S and G, respectively; a positive integer syntenic region number associated with each leaf vertex (extant gene) in G; and a mapping of the leaves of G to the leaves of S. We assume that both trees are binary, but consider the case that gene trees may be non-binary in Sect. 4. Some leaves of the gene tree may be in the same syntenic region while others may be in unique syntenic regions. The DTLOR model comprises the standard DTL events (duplication, transfer, and loss; described in detail below) [4] and two additional events called origin and rearrangement. Each of those five event types has an associated positive cost.
A syntenic region number is a positive integer from the set of syntenic region numbers of the leaves of the gene tree (called an actual syntenic region) or the special unknown syntenic region symbol * . When a gene vertex is labeled with * , that vertex is assumed to be evolving outside the species tree. When a gene vertex is assigned an actual syntenic region but its parent has unknown syntenic region, this means that the gene entered the species tree through transfer from outside, inducing an origin event. Rearrangement indicates a change in syntenic region that happens during the course of evolution within the species tree.
The rules for syntenic region numbering are as follows: 1. If a vertex u is labeled * and v is a child of u, then v may be labeled with either * or an actual syntenic region number. 2. If a vertex u is labeled with an actual syntenic region number, then its children must be labeled with actual syntenic region numbers. Note that this implies that any vertex labeled with an actual syntenic region number has the property that all of its descendants are also labeled with actual syntenic region numbers.
Constraint 1 ensures that genes can originate outside the species tree while constraint 2 ensures that once a gene is found in the species tree it continues evolving within the tree.
The DTL events in this model are analogous to those in the DTL model. O and R events are induced as follows: 1. If a vertex u is labeled * and a child v is labeled with an actual syntenic region number, then vertex v induces an O event. 2. If a vertex u and its child v have actual syntenic region numbers and those two syntenic region numbers are different, then an R event is induced on the edge between u and v.
The objective of the DTLOR maximum parsimony reconciliation (DTLOR MPR) problem is to map the vertices and edges of the gene tree onto the species tree and to identify a syntenic region number with each internal vertex in the gene tree, minimizing the total cost of the induced events. Note that this model implicitly assumes that duplications are tandem or proximal duplications, and thus a duplication event by itself does not imply a change of syntenic region. A duplication that gives rise to a copy at a different syntenic region is modeled implicitly by a duplication and a rearrangement event. The model can be extended to permit other types of duplication events.

Notation
Let S and G denote a pair of undated species and gene trees, respectively. Throughout this and the next section, we assume that S and G are binary. In Sect. 4 we extend results to non-binary trees. For a tree T, let root(T) be the root and Le(T ) be the set of leaves or tips. For a nonroot vertex v in the tree, p(v) is the parent of v. For a non-leaf vertex v, v 1 and v 2 denote its two children. We assume that each tree T has an additional handle edge, namely an edge (u, root(T )) . The handle of S is denoted e S and the handle of G is denoted e G . For a vertex v of T, we let T(v) be the subtree of T rooted at v, including its own handle edge e v from p(v) to v. An edge of a tree T is said to be a leaf edge if its terminus is a leaf and is said to be an internal edge otherwise.

The DTLOR MPR problem
An instance of the DTLOR-MPR problem is a 10-tuple (S, G, L, φ, γ , D, T, L, O, R) where: are binary species and gene trees, respectively; • L is a finite set of syntenic regions which are represented by counting numbers; • φ : Le(G) → Le(S) is a mapping that associates each leaf of G with a leaf of S; • γ : Le(G) → L is a surjective mapping that associates each leaf of G with a syntenic region; • Parameters D, T, L, O, R are positive costs for duplication, transfer, loss, origin, and rearrangement events, described in detail below.
A reconciliation in the DTLOR model comprises a pair of mappings (�, Ŵ) that extend the mappings φ and γ . Specifically, � : V (G) → V (S) ∪ {N } maps the vertices of G to the vertices of S or the special N location representing a species that is not in the species tree S. The constraints on are as follows: 1. �(g) = φ(g) for each leaf g of G; 2. If g is an internal vertex of G and �(g) = N then the children of g, denoted g 1 and g 2 , have the properties that (a) �(g 1 ) = N and �(g 2 ) = N; (b) Neither �(g 1 ) nor �(g 2 ) is an ancestor of �(g) ; and (c) At least one of �(g 1 ) or �(g 2 ) is equal to or a descendant of �(g).
Constraint 1 ensures that the mapping is consistent with the leaf mapping φ while constraint 2 ensures that for any gene vertex mapped to a species vertex, (a) the children of g are also mapped to species vertices, (b) the children are not mapped to species vertices that are ancestral to their parent, and (c) at most one child can transfer to a different clade.
Note that we assume that the trees are undated and it is therefore possible that a mapping that satisfies these constraints is, nonetheless, time-inconsistent in the sense that there is no ordering of the internal nodes of the species tree that is consistent with the set of duplication, transfer, and loss events. But time-inconsistencies in an MPR can be detected in polynomial-time [12,13]. Moreover, the problem of finding MPRs that are guaranteed to be time-consistent is NP-hard [14].
Note also that unlike the DTL model, which requires every gene vertex to be mapped to a vertex in the species tree, the DTLOR model allows gene vertices to be mapped to the N location which is outside the sampled species tree.
The mapping induces four types of events. For an internal gene tree vertex g, with children g 1 and g 2 , and �(g) = N , the events induced by are as follows: Speciation event: Vertex g induces a speciation event if one of �(g 1 ) and �(g 2 ) is in the left subtree and the other is in the right subtree of �(g). Duplication event: Vertex g induces a duplication event if each of �(g 1 ) and �(g 2 ) is either equal to or a descendant of �(g) but does not satisfy the requirements for a speciation event. Transfer event: Vertex g induces a transfer event if exactly one of �(g 1 ) and �(g 2 ) is either equal to or a descendant of �(g) and the other is neither an ancestor nor a descendant of �(g). Loss events: Each non-root vertex g (including leaf vertices) may induce zero or more loss events as follows: If �(p(g)) = N is ancestral to �(g) , then each species vertex s on the path from �(p(g)) to �(g) induces a loss event, except for �(g) and also not �(p(g)) if p(g) induces a speciation event. For each loss induced by a vertex s on the path from �(p(g)) to �(g) , we say that g passes through s.
If �(g) = N then g induces none of these four types of events. The mapping Ŵ : V (G) → L ∪ { * } maps each vertex g in G to an element of L or the special syntenic region represented by * indicating that it is in an unknown syntenic region because it occurs outside of the species tree. The constraints on Ŵ and its relationship to are as follows: 3. If Ŵ(g) = * and g has children g 1 and g 2 then Ŵ(g 1 ) = * and Ŵ(g 2 ) = * .
Constraint 1 ensures that the mapping Ŵ is consistent with the leaf mapping γ , constraint 2 ensures that if a gene vertex is mapped outside of the species tree then its syntenic region is not yet established, and constraint 3 ensures that once the syntenic region for a gene node is established, the syntenic regions of its children are also established. The mapping Ŵ induces events as follows: Origin event: A non-root vertex g induces an origin event if Ŵ(p(g)) = * and Ŵ(g) = * . The root vertex root(G) induces an origin event if Ŵ(root(G)) = * . Rearrangement event: A non-root vertex g induces a rearrangement event if Ŵ(g) = * , Ŵ(p(g)) = * , and Ŵ(p(g)) = Ŵ(g).
The cost of a reconciliation is defined to be the sum of the number of duplication, transfer, loss, origin, and rearrangement events scaled by the event costs D, T, L, O, and R, respectively. Speciation events are assigned an implicit cost of zero because a gene is expected to diverge when the species that carries it diverges.

Methods
When a gene vertex g induces an origin event, all of the genes in the subtree G(g) rooted at g must have actual syntenic regions (by rule 3 in the definition of Ŵ ), and the genes in that subtree are mapped to species in S (by rule 2 in the definition of Ŵ ), that is, �(g ′ ) ∈ V S and Ŵ(g ′ ) ∈ L for all g ′ ∈ G(g) . The mappings and Ŵ are only related by the constraint that �(g) = N iff Ŵ(g) = * . Thus, if g induces an origin event, then the pair of mappings and Ŵ restricted to the domain G(g) are independent. Therefore, for an origin subtree, a subtree of G whose root induces an origin event, the process of finding an optimal species mapping can be decoupled from the process of finding an optimal syntenic region mapping Ŵ . Further, by definition, vertices that induce origin events cannot be ancestrally related. Thus, in a reconciliation (�, Ŵ) where g ′ , g ′′ induce origin events, the species and syntenic region mappings restricted to the origin subtree G(g ′ ) are independent of the mappings restricted to the origin subtree G(g ′′ ).
For binary gene trees, we use a dynamic programming algorithm to compute the optimal cost of a species mapping of each subtree of the gene tree. Then, we use a second dynamic programming algorithm to compute the optimal cost for the syntenic region mapping for each subtree. Finally, a third algorithm combines these results to find an optimal solution to the DTLOR MPR problem. For non-binary gene trees, this decoupling is no longer possible and a different (and less efficient) algorithm is presented in Sect. 9.

Computing the species map
Next, we give an efficient algorithm for computing an optimal species mapping for each origin subtree G(g). The algorithm is similar to other DTL reconciliation algorithms [4], but the variant used here is useful in the extensions and generalizations in later sections.
For a species mapping , or its restriction to an origin subtree of the gene tree, we say that a gene tree edge e g is placed on species tree edge e s if either �(g) = s or if the path from �(p(g)) to �(g) includes vertex s, unless p(g) is involved in a speciation event at s. As a special case, if g is the root of an origin subtree, then �(p(g)) = N . In this case, there is no path from �(p(g)) to �(g) , so e g is placed on e s if and only if �(g) = s . If �(g) = s we say that e g terminates on edge e s and if �(g) is a descendant of s then a loss event is induced and we say that e g continues on the corresponding child edge of e s .
Let C(g) denote the optimal cost for a species mapping restricted to the domain of G(g) and let C(e g , e s ) denote the optimal cost for a species mapping of G(g) such that e g is placed on e s . Then C(g) = min e s ∈E S C(e g , e s ).
We now describe an algorithm for computing C(e g , e s ) . The algorithm computes the C table by considering edges in the gene tree bottom-up (postorder): An edge e g is considered if either g is a leaf or the children edges e g 1 and e g 2 have already been considered. For each edge e g under consideration, we now consider each edge e s in the species tree in postorder.
To compute C(e g , e s ) , we enumerate the four possible cases: • In the base case, if g and s are leaves, then: • If neither g nor s is a leaf, then either g is mapped to s or not. If g is not mapped to s, then it induces a loss at s by being mapped to one of its children. Otherwise, g is mapped to s, which induces either a speciation, duplication, or transfer event, which incurs a corresponding cost. Thus, where the computation of Spec , Loss , Dup , and Transfer are described below. • If g is not a leaf but s is a leaf, then speciation and loss at g are not possible, so: • If g is a leaf but s is not a leaf, then speciation, duplication and transfer at g are not possible, so: The functions Spec(e g , e s ) , Loss(e g , e s ) , Dup(e g , e s ) , and Transfer(e g , e s ) are computed as follows: (1) (2) C(e g , e s ) = min{Spec(e g , e s ), Loss(e g , e s ) Dup(e g , e s ), Transfer(e g , e s )} (3) C(e g , e s ) = min{Dup(e g , e s ), Transfer(e g , e s )} (4) C(e g , e s ) = Loss(e g , e s ) (5) Spec(e g , e s ) = min{C(e g 1 , e s 1 ) + C(e g 2 , e s 2 ), C(e g 1 , e s 2 ) + C(e g 2 , e s 1 )} (6) Loss(e g , e s ) =L + min{C(e g , e s 1 ), C(e g , e s 2 )}

(7)
Dup(e g , e s ) =D + C(e g 1 , e s ) + C(e g 2 , e s ) Transfer(e g , e s ) = T+ The speciation term (5) considers both ways in which the children edges of e g can be placed onto the children edges of e s in a speciation event. The loss term (6) considers both ways in which edge e g can continue, either on one child of e s or the other. The duplication term (7) places both children edges of e g on e s . In the transfer term (8), we consider both ways of selecting the transferred child edge. The non-transferred child edge of e g remains on e s , but the transferred child edge is placed on a species edge determined by Best-Transfer ; Best-Transfer(e j , e s ) denotes the minimum cost of a mapping of the subtree G(g j ) assuming that e j is placed on a species edge that is neither ancestral nor descendant to e s . In order to compute these values, the algorithm maintains another table called Best-Entry(e g , e s ) which stores the minimum of C(e g , e i ) over all e i in the subtree rooted at e s . The algorithm is given in Algorithm 1.

Computing the synteny map
We use another dynamic programming algorithm to find the optimal cost for a syntenic region mapping for each subtree G(g). Let syn(g) denote the optimal cost for a syntenic (8) min C(e g 1 , e s ) + Best-Transfer(e g 2 , e s ), C(e g 2 , e s ) + Best-Transfer(e g 1 , e s ) region mapping of G(g). Let syn(g, ℓ) denote the optimal cost for a syntenic region mapping of G(g) such that g has the syntenic region ℓ . Then syn(g) = min ℓ∈L syn(g, ℓ).
If g is a leaf, then Ŵ(g) = γ (g) (by rule 1 in the definition of Ŵ ). Thus, If g is not a leaf, then (recalling that g 1 and g 2 denote the children of g): This accounts for each child g 1 and g 2 either remaining in the same syntenic region as g or potentially changing to a new region and incurring a cost of R . The algorithm for computing syn(g, ℓ) is summarized in Algorithm 2.

Solving the DTLOR MPR problem
Let Origin(g) denote the cost of reconciling an origin subtree G(g) which, as noted earlier, can be computed as Origin(g) = O + C(g) + syn(g) . To find a maximum parsimony reconciliation, we must therefore determine the optimal locations for origin events. Let Null(g) be the optimal cost of reconciling G(g) such that g has the unknown syntenic region * . Since the given mapping γ of leaves to syntenic regions must be respected, g may not be assigned syntenic region * if g is a leaf. Thus, Null(g) is calculated as: The optimal cost for reconciling the entire gene tree G is given by: The algorithm for computing Null(g) is summarized in Algorithm 3.
if g is a leaf min{Null(g 1 ), Origin(g 1 )} + min{Null(g 2 ), Origin(g 2 )} otherwise (12) Opt = min{Null(root(G)), Origin(root(G))} Note that if we wish to reconstruct an optimal solution, the DP tables C, syn, Null can be annotated in the standard way, allowing the solutions to be reconstructed by tracing through the table. We first trace through the Null table to find a set of origin events that produce an optimal solution. For any gene vertex not in any of the origin subtrees induced by these origin events, they are labeled with the unknown syntenic region * . For each origin subtree, we trace through the syn table to get an optimal syntenic region mapping, and then we trace through the C table to get an optimal species mapping. Because of loss events, there can be multiple C(e g , e s ) entries that involve the same gene vertex in an optimal solution. The mapping of that gene vertex corresponds to the lowest such e s .
The proofs of the following are given in the Appendix:

Lemma 1 Algorithm 1 correctly computes C(g) for every gene vertex g ∈ V (G).
Lemma 2 Algorithm 2 correctly computes syn(g) for every gene vertex g ∈ V (G).

Time complexity
Computing each entry of the C

Non-binary gene trees
While it is generally possible to construct accurate species trees using a variety of methods, gene trees are susceptible to ambiguity due to the relatively little information available in their sequence data. Consequently, phylogenetic trees for genes often have non-binary vertices, also known as multifurcations or soft polytomies, which correspond to an unknown ordering of the underlying sequence of divergences [15]. In this case, we wish to expand each multifurcation into a sequence of binary divergences, leading to a binary gene tree. Such an expansion is called a resolution or binarization of the nonbinary tree. The DTLOR MPR problem for non-binary trees seeks to find the optimal reconciliation of G and S over all possible resolutions of G. Unfortunately, the number of resolutions of a non-binary tree can be exponential in the number of vertices in the tree. It is, therefore, impractical to explicitly consider every resolution. However, Kordi and Bansal [15] and Jacox et al. [16] demonstrated the existence of fixed-parameter polynomial-time algorithms for finding maximum parsimony reconciliations for non-binary trees in the DTL model. These algorithms operate in polynomial time assuming the maximum number of children of any nonbinary vertex is bounded by some constant k. More precisely, a fixed-parameter algorithm in this context runs in time O(f(k)p(m, n)) where m and n denote the sizes of the gene and species trees, k is the maximum branching factor of any gene vertex, p(m, n) is a polynomial in m and n, and f(k) is some function of k which may even be exponential in k. In particular, f(k) in this context is the number of distinct binary resolutions of a tree comprising a root and k children times the size of such a binary resolution, which can be shown to be f (k) = O(2 k (k − 1)!) . For any fixed k, this value is a fixed constant. Jacox et al. [16] offer an approach that results in an f(k) which is smaller but still potentially exponential in k. Importantly, a fixed-parameter polynomial-time algorithm is much more efficient and practical than an exponential time algorithm such as the naive approach of enumerating all possible resolutions of a non-binary tree which would have running time O((2 k (k − 1)!) n ).
In this section, we describe a fixed-parameter polynomial time algorithm for the DTLOR MPR problem. Following the approach of Kordi and Bansal [15], our algorithm expands each individual non-binary vertex into every possible binary resolution but avoids enumerating all possible binary resolutions of the entire tree, thus resulting in a fixed-parameter polynomial-time algorithm rather than an exponentialtime algorithm. While our algorithm leverages the important ideas for resolving nonbinary vertices first proposed by Kordi and Bansal [15], it requires a new algorithm due to the advent of O and R events.
Each binary resolution of a non-binary gene tree implies a different topology for the gene tree, which induces potentially different costs for both the species mapping and the syntenic region mapping. Note that one resolution may be most favorable for minimizing the cost of the species mapping while a different one may admit the least expensive syntenic region mapping (Fig. 1). Thus, while in binary gene trees the species mapping and syntenic region mappings could be efficiently solved independently and then merged into an optimal solution for the DTLOR MPR problem, the situation is more complicated in the presence of non-binary gene trees. The algorithm presented here considers the species mapping and syntenic region mapping simultaneously as non-binary vertices are resolved one-by-one. Importantly, once the best resolution is found for the subtree rooted at a given gene vertex, that value can be saved and used as the dynamic program processes the ancestors of g. For this reason, it is not necessary to consider complete resolutions of the gene tree but, instead, the non-binary vertices can be resolved one-at-a-time. This results in an algorithm that is additive, rather than multiplicative, in the number of resolutions of individual nonbinary vertices.
The following definitions and equations assume that the terminus vertex g of edge e g is either a leaf or has exactly two children. Later, we show how to apply these to nonbinary trees. Let C(e g , e s , ℓ) denote the optimal cost of reconciling the subtree G(g) with S such that e g is placed on e s and g has the syntenic region ℓ . Note that in contrast to C(e g , e s ) used in the previous section, C(e g , e s , ℓ) also encodes the constraint that g has syntenic region ℓ and the total cost includes the cost of rearrangement events in the subtree G(g). We define Best-Entry(e g , e s , ℓ) and Best-Transfer(e g , e s , ℓ) analogously to Best-Entry(e g , e s ) and Best-Transfer(e g , e s ) , respectively, in the previous section. We define C(e g , e s , L) = min ℓ∈L C(e g , e s , ℓ) and Fig. 1 An example showing that one resolution of a multifurcation can be optimal for species mapping while another may be optimal for syntenic region mapping. a A gene tree with six leaves labeled with their syntenic regions. b A species tree. The tip association is φ(g i ) = s i , 1 ≤ i ≤ 6 . c A binary resolution of the gene tree in which the optimal species mapping cost is necessarily greater than the origin cost O since this tree is not isomorphic to the species tree; the optimal number of rearrangements in this case is 1. d Another binary resolution of the gene tree in which the optimal species mapping cost is just the origin cost O since this tree is isomorphic to the species tree; the optimal number of rearrangements in this case is 2 We compute C(e g , e s , ℓ) in postorder. There are four cases: • In the base case, if g and s are leaves, then: • If neither g nor s is a leaf, then: where the computation of Spec , Loss , Dup , and Transfer are given below. • If g is not a leaf but s is a leaf, then: • If g is a leaf but s is not a leaf, then: The functions Spec(e g , e s , ℓ) , Loss(e g , e s , ℓ) , Dup(e g , e s , ℓ) , and Transfer(e g , e s , ℓ) are computed as follows: Spec(e g , e s , ℓ) = min{ Dup(e g , e s , ℓ) = Transfer(e g , e s , ℓ) = T + min{ Best-Transfer(e g , e s , L) = min ℓ∈L Best-Transfer(e g , e s , ℓ) (13) C(e g , e s , ℓ) = 0 if φ(g) = s and γ (g) = ℓ ∞ otherwise (14) C(e g , e s , ℓ) = min{Spec(e g , e s , ℓ), Loss(e g , e s , ℓ), Dup(e g , e s , ℓ), Transfer(e g , e s , ℓ)} (15) C(e g , e s , ℓ) = min{Dup(e g , e s , ℓ), Transfer(e g , e s , ℓ)} (16) C(e g , e s , ℓ) = Loss(e g , e s , ℓ) (17) min{C(e g 1 , e s 1 , ℓ), R + C(e g 1 , e s 1 , L)}+ min{C(e g 2 , e s 2 , ℓ), R + C(e g 2 , e s 2 , L)}, min{C(e g 1 , e s 2 , ℓ), R + C(e g 1 , e s 2 , L)}+ min{C(e g 2 , e s 1 , ℓ), R + C(e g 2 , e s 1 , L)}} (18) Loss(e g , e s , ℓ) =L + min{C(e g , e s 1 , ℓ), C(e g , e s 2 , ℓ)} (19) D + min{C(e g 1 , e s , ℓ), R + C(e g 1 , e s , L)} + min{C(e g 2 , e s , ℓ), R + C(e g 2 , e s , L)} (20) min{C(e g 1 , e s , ℓ), R + C(e g 1 , e s , L)}+ min{Best-Transfer(e g 2 , e s , ℓ), R + Best-Transfer(e g 2 , e s , L)}, min{C(e g 2 , e s , ℓ), R + C(e g 2 , e s , L)}+ min{Best-Transfer(e g 1 , e s , ℓ), R + Best-Transfer(e g 1 , e s , L)}} In order to compute Best-Transfer , we compute Best-Entry(e g , e s , ℓ) as follows. If s is a leaf, then Otherwise, Best-Transfer(e g , e s , ℓ) is then computed in preorder: First, for the handle edge e S For all other edges, e s with child edges e s 1 and e s 2 Best-Transfer(e g , e s 1 , ℓ) = min{ Best-Transfer(e g , e s 2 , ℓ) = min{ Now, we consider the case that each internal vertex g has an arbitrary number of children denoted g 1 , . . . g k , k ≥ 2 . A binary resolution for g is defined to be a binary tree whose root is g and whose leaves are g 1 , g 2 , . . . , g k . Let BR(g) denote the set of all binary resolutions for g. Note that if g has two children, then it has just one binary resolution. Note also that a binary resolution for a vertex g is different from a binary resolution for the entire gene tree; the former only resolves g into a binary subtree whereas the latter resolves all non-binary vertices in G.
Let Null(g) be the optimal cost of reconciling G(g) with S such that g has the unknown syntenic region * . Let Origin(g) be the optimal cost of reconciling G(g) with S such that g induces an origin event. Let H be a binary resolution for g and let G H (g) denote the subtree G(g), along with its handle, such that g and its children have been replaced by H. Note that if g has exactly two children, then G H (g) = G(g) . Let C H , Best-Entry H , Best-Transfer H , Origin H , Null H correspond to C, Best-Entry , Best-Transfer , Origin , and Null (from the previous section) for G H (g).
Let e h be an edge in H. If e h is a leaf edge in H (and thus h is one of the children of g in G), then C H (e h , e s , ℓ) = C(e h , e s , ℓ) for all e s , ℓ , Origin H (h) = Origin(h) , and Null H (h) = Null(h) . Thus, as the algorithm considers each gene edge e g in postorder, it then considers each binary resolution H of g and the induced subtree G H (g) . Within G H (g) it considers the edges of H in postorder to compute the optimal reconciliation for G H (g) . Finally, the optimal reconciliation over all binary resolutions H of g yields the optimal reconciliation for G(g). The algorithm is summarized in Algorithm 4. (Recall that h 1 and h 2 denote the two children of vertex h.) (21) Best-Entry(e g , e s , ℓ) = C(e g , e s , ℓ).

Theorem 2 Algorithm 4 correctly computes the optimal solution to the DTLOR-MPR problem for non-binary gene trees.
The proof is summarized in the Appendix.

Time complexity
The algorithm first initializes Origin

Results
The implementation of the DTLOR MPR Algorithm (Algorithm 3) was integrated into the xenoGI software package [2] which seeks to reconstruct the history of genome evolution in clades of microbes. xenoGI takes as input a set of sequenced genomes, identifies gene families within this set, and groups those families by common origin. The previous version of xenoGI created gene families in a species-tree aware way, but did not make use of reconciliations. It was able to map gene families onto the species tree and identify their point of origin. However, it was not able reconstruct events in the subsequent evolution of the gene family (e.g. losses or rearrangements). The integration of the DTLOR MPR algorithm allows xenoGI to reconstruct these subsequent events, providing potentially important new insights into microbial evolution.
Within the new DTLOR version of xenoGI, we construct a gene tree for every family, then reconcile it with the species tree. The resulting reconciliation can be used to refine the family (e.g. split it into multiple parts based on the placement of origin events) and to provide detailed information about the family's subsequent evolution. Table 1 shows the running time for the DTLOR MPR Algorithm within xenoGI on all gene trees given inputs ranging from 4 to 15 bacterial genomes (species). Trees were constructed using FastTree [17] and MUSCLE [18] In each case, DTLOR was run on every binary gene tree with more than two leaves, with gene trees being rooted in all possible ways. These calculations were performed on a commodity server (50 AMD Opteron 6276 2.3 GHz processors, 503 GB RAM). The DTLOR costs were set at 1, 1, 1, 2, 2, respectively.
In one of our enteric bacterial test data sets (Dataset B in Table 1), we examined DTLOR output for a known genomic island, the Acid Fitness Island (AFI) [19]. The corresponding species tree is shown in Fig. 2. This island is thought to have originated with an insertion of 19 genes on branch s2 (the branch leading to the internal vertex s2) via horizontal transfer from outside the clade. It then evolved in the clade and was inherited in the four descendant strains, with the notable loss of nine genes along the branch leading to E. coli K12 [2]. For nearly all the gene families in this island, DTLOR produced reconciliations that place the origin of the family on the branch leading to s2, and it correctly recognized loss events on the K12 branch where those occurred. In a few cases, there were multiple most parsimonious reconciliations (MPRs), one of which agreed with the above scenario and was deemed correct, and the others did not agree. Finally there is one family (glutamate decarboxylase) with an evidently complicated post-insertion evolution that is not fully understood. In this case, none of the MPRs using the selected event costs appear to be correct. (The evolution of this family likely involved gene conversion, but the MPR lacks transfer events using the event costs that we used in this experiment.)

Discussion
Several important problems remain to be studied. First, the impact of event costs is not well-understood. Just as in the case of the DTL model, different event costs can give rise to different reconciliations which, in turn, can lead to different conclusions. We believe that the costscape algorithm developed for the DTL model [20] may be extendible to the DTLOR model, which would provide insights into the impact of event costs to the solution space. Second, it is often the case that there are many distinct MPRs. In fact, even in the DTL model, the number of MPRs can be exponential in the size of the two trees [21]. A subset of these MPRs may be of particular interest because they contain certain evolutionary events that are strongly believed to have occurred (e.g., a horizontal transfer on a particular branch of the species tree). It is desirable, therefore, to efficiently filter the set of MPRs to only include those that contain a specified set of events, count the number of MPRs in the filtered set, compute support values on the constituent events in that space of MPRs, and select representative reconciliations from this set.
Finally, further systematic studies are needed to determine the full impact of the DTLOR MPR algorithm on the analyses that can now be performed with the enhanced xenoGI tool, including the case of non-binary gene trees.

Conclusions
In this paper, we have described the DTLOR model which extends the well-known DTL model to include origin and rearrangement events. This model is particularly applicable to the evolution of microbes where the species tree is, in many cases, not fully sampled. Therefore, reconciliations must be able to account for transfer events from outside the sampled tree. In addition, the DTLOR model allows for syntenic rearrangement, which is also prevalent in microbial gene families.
We have described efficient algorithms for maximum parsimony reconciliation in the DTLOR model. In binary gene trees, our algorithm solves the DTL reconciliation problem and the sytnenic region problems independently and then combines the results of those two algorithms, resulting in a particularly efficient solution. When the gene tree is non-binary, the two subproblems can no longer be decoupled in this way, and our algorithm for this case considers all of the events simultaneously.
correctly for each descendant edge e s ′ of e s . The ways for e g to enter the subtree rooted at e s are at e s , the left subtree of e s , or the right subtree of e s , thus Best-Entry(e g , e s ) is computed correctly on line 19. Now we prove the base case for Best-Transfer(e g , ·) using structural induction on S from the handle edge e S . In the base case where e s = e S , all edges in S is a descendent of e S , so there is no valid species edge for the child edge of e g to transfer to. Thus Best-Transfer(e g , e S ) is computed correctly on line 22. In the inductive step, we consider a non-root edge e s 1 , which has a sibling edge e s 2 . By the inductive hypothesis Best-Transfer(e g , p(e s 1 )) is computed correctly and by the inductive proof on Best-Entry(e g , ·) , Best-Entry(e g , e s 2 ) is computed correctly. Since the species edges that e g are allowed to transfer to from e s 1 not only include the same edges if e g were to transfer from p(e s 1 ) , but also edges in the subtree rooted at e s 2 , and the optimal cost of placing e g inside the subtree rooted at e s 2 is given by Best-Entry(e g , e s 2 ) , Best-Transfer(e g , e s ) is computed correctly. This concludes the base case for Best-Transfer(e g , ·).
To conclude the proof of correctness of C(e g , ·) , we consider a non-leaf edge e g . We use structural induction on S. In the base case, e s is a leaf edge; the only two possibilities are e g duplicates on e s or transfers on e s . The correctness of Dup(e g , e s ) is guaranteed by the correctness of C(e g ′ , ·) , while the correctness of Transfer(e g , e s ) is guaranteed by the correctness of both C(e g ′ , ·) and Best-Transfer(e g ′ , ·) for each descendant edge e g ′ of e g . In the inductive step, e s is not a leaf edge, then C(e g , e s ) is, by definition, the minimum of Spec(e g , e s ) , Dup(e g , e s ) , Transfer(e g , e s ) , and Loss(e g , e s ) (see Eq. 2). Again, Spec(e g , e s ), Dup(e g , e s ) and Transfer(e g , e s ) are computed correctly by the correctness of C(e g ′ , ·) and Best-Transfer(e g ′ , ·) . The correctness of Loss(e g , e s ) is guaranteed by the inductive hypothesis on the correctness of C(e g , e s ′ ) for every descendant edge e s ′ of e s . This concludes the inductive step for C(e g , e s ) . The inductive step for Best-Transfer(e g , ·) is analogous to the proof for the base case.
Finally, C(g) is the optimal cost for a species mapping of G(g) such that g can be mapped to any species in S. By definition, C(g) = min s C(g, s) where C(g, s) denotes the optimal cost for a species mapping of G(g) in which g is mapped to s.
Consider min e s C(e g , e s ) . Since C(e g , e s ) is the optimal cost of a species mapping where e g is placed on e s , this implies a mapping of g to s or one of its descendants. If g is mapped to s, then C(e g , e s ) = C(g, s) . If C(e g , e s ) involves a mapping of g to s ′ which is a descendant of s, then it induces loss events. Because loss events have a non-negative cost, C(e g , e ′ s ) ≤ C(e g , e s ) , and thus min e s C(e g , e s ) only includes those entries of C(e g , e s ) where g is mapped to s. Thus, C(g) = min s C(g, s) = min e s C(e g , e s ).