Model
Given a rooted, binary gene tree, T_{
GS
}= (V_{
G
}, E_{
G
}), a rooted, binary domain tree, T_{
D
}= (V_{
D
}, E_{
D
}), and a mapping, {M}_{L}^{DG}from L(T_{
D
}), the leaves in the domain tree, to L(T_{
GS
}), the leaves in gene tree, the goal of domain shuffling event inferenceis to infer the association between ancestral taxa in the species, gene, and domain trees and the set of events that best explains this association. In our model, we define the following set of events:
Codivergence ( C )A bifurcation in the domain tree that arose through a bifurcation in the gene tree. The gene tree bifurcation may have arisen via speciation (C_{
S
}), gene duplication (C_{
D
}), or gene transfer (C_{
T
}).
Duplication \left(\mathcal{D}\right)A single domain is copied resulting in two separate copies of the domain within the same gene.
Duplication \left(\mathcal{L}\right)A domain is deleted from the gene (and genome).
Domain insertion ( I )A new copy of the domain is inserted into a different gene within the same genome.
Horizontal domain transfer \left(\mathcal{T}\right)A new copy of a domain is inserted into a gene in a different genome.
Reconciliation is the process of inferring M ^{DG}: V_{
D
}→ V_{
G
}, the association between ancestral domains and ancestral genes. The result is a reconciled domain tree, T_{
DG
}= (V_{
D
}, E_{
D
}), in which every node, d∈ V_{
D
}, is annotated with M ^{DG}(d) = g, where gis the ancestral gene that contained domain d; \mathcal{L}\phantom{\rule{2.36043pt}{0ex}}\left(d\right), the domains lost on the edge leading to d; and E(d), the event that caused the divergence at d. Codivergences and domain duplications correspond to internal nodes in T_{
DG
}. Each insertion and transfer corresponds to an edge, (d_{1}, d_{2}) ∈ E_{
D
}, where d_{2}is the inserted (or transferred) domain and d_{1}is the donor domain. In a parsimony framework, the cost of a reconciliation is \kappa ={\sum}_{\forall \epsilon}{\kappa}_{\epsilon}\cdot {n}_{\epsilon}, where κ_{
ε
}is the cost of event ε and n_{
ε
}is the number of occurrences of ε in the reconciliation.
Formally, we define the problem of inferring a domain event history as follows:
Domain Event Inference with Transfers (DEDTL)
Domain events: \left\{{C}_{S},\phantom{\rule{2.36043pt}{0ex}}{C}_{D},\phantom{\rule{2.36043pt}{0ex}}{C}_{T},\phantom{\rule{2.36043pt}{0ex}}\mathcal{D},\phantom{\rule{2.36043pt}{0ex}}\mathcal{T},\phantom{\rule{2.36043pt}{0ex}}I,\phantom{\rule{2.36043pt}{0ex}}\mathcal{L}\right\}.
Input:A rooted, binary domain tree, T_{
D
}= (V_{
D
}, E_{
D
}); a rooted, binary, DTLreconciled gene tree, T_{
GS
}; and a leaf mapping, {M}_{L}^{DG}\phantom{\rule{2.36043pt}{0ex}}:\phantom{\rule{2.36043pt}{0ex}}L\left({T}_{D}\right)\to L\left({T}_{GS}\right).
Output:The set of all temporally feasible, domain shuffling histories T_{
DG
}that minimize
\kappa ={\kappa}_{\mathcal{D}}{n}_{\mathcal{D}}+{\kappa}_{\mathcal{T}}{n}_{\mathcal{T}}+{\kappa}_{I}{n}_{I}+{\kappa}_{\mathcal{L}}{n}_{\mathcal{L}}.
Solving DEDTL entails challenges that do not arise in gene treespecies tree reconciliation. First, when a domain codivergence is inferred, additional information is required to determine whether the codivergence is the result of a speciation, a gene duplication, or a gene transfer. Second, domain insertions are horizontal events between genes in the same species. In contrast, domain transfers are horizontal events between different species. An extra test is needed to ensure that the correct event is inferred. Third, a missing taxon in the domain tree may be due to a domain loss or the loss of a gene. If the latter, then the loss should not be included in the event cost of the domaingene reconciliation. Finally, when testing for temporal feasibility, temporal constraints arising from gene transfers, domain transfers, and domain insertions must all be considered.
Notation:Given a tree T_{
i
}= (V_{
i
}, E_{
i
}), ρ_{
i
}designates the root of the tree. For nodes u, v∈ V_{
i
}, p(v) denotes the parent of vand l(v) and r(v) denote the left child and right child of v, respectively. If uis on the path from vto ρ_{
i
}, then uis an ancestor of v, designated u≥_{
i
} v, and vis a descendant of u, designated v≤_{
i
} u. If v{\ngeqq}_{i}uand u{\ngeqq}_{i}v, then uand vare incomparable, designated u{\lessgtr}_{i}v. Given a reconciled gene tree T_{
GS
}and a reconciled domain tree T_{
DG
}, s= M ^{GS}(g) is the species s∈ V_{
S
}that contained gene g∈ V_{
G
}and g= M ^{DG}(d) is the gene gthat contained domain d∈ V_{
D
}. The species containing dis s= M ^{DS}(d), where M ^{DS}(d) = M ^{GS}(M ^{DG}(d)).
Domain shuffling with transfers and insertions
Here, we present an algorithm for the domain event inference problem for multidomain families evolving according to a locus model[42], in which novel domain arrangements arise through internal duplication, loss, and insertion of domains into an existing gene. This restriction justifies the premises that the history of the family as a whole can be described by a tree. This assumption is consistent with the existence of promiscuous domains that lend themselves to insertion in new chromosomal environments [1, 3, 43, 44] and reports of young genes that arose through duplication of existing genes, followed by acquisition of additional domains [2, 45–48]. Moveover, domain insertion into an existing gene is more likely to be viable since all regulatory and termination signals required for successful transcription are present. In addition, we assume that domain insertions and transfers only involve domains within the same gene family. In other words, for a given domain family, we assume that the domain instances that appear in the gene family under consideration form a clade in the domain tree.
In the context of this model, we introduce an algorithm (Alg. 1) for inferring domain events by reconciling a domain tree with a gene tree that has been previously reconciled with a species tree.
Algorithm 1DEDTL
Input:T_{S}; T_{GS}; T_{D}; {M}_{L}^{DG}\phantom{\rule{2.36043pt}{0ex}}:\phantom{\rule{2.36043pt}{0ex}}L\left({T}_{D}\right)\to L\left({T}_{G}\right)
Output:
\left\{{T}_{DG}^{1}\dots {T}_{DG}^{f}\right\},\phantom{\rule{2.36043pt}{0ex}}\kappa
1
{T}_{GS}^{*}=addLoss\phantom{\rule{2.36043pt}{0ex}}\left({T}_{GS},\phantom{\rule{2.36043pt}{0ex}}{T}_{S}\right)
2
costCalc
\left({\rho}_{D},\phantom{\rule{2.36043pt}{0ex}}{T}_{GS}^{*},\phantom{\rule{2.36043pt}{0ex}}{T}_{S}\right)
3
\left\{{T}_{DG}^{1}\dots {T}_{DG}^{m}\right\}=traceback\phantom{\rule{2.36043pt}{0ex}}\left({\rho}_{D},\phantom{\rule{2.36043pt}{0ex}}{T}_{GS}^{*},\phantom{\rule{2.36043pt}{0ex}}{T}_{S}\right)
4
\left\{{T}_{DG}^{1}\dots {T}_{DG}^{f}\right\}=checkFeasibility\phantom{\rule{2.36043pt}{0ex}}\left(\left\{{T}_{DG}^{1}\dots {T}_{DG}^{m}\right\}\right)
The algorithm proceeds in four steps. First, an additional data structure, {T}_{GS}^{*}, is constructed. {T}_{GS}^{*}consists of an extended reconciled gene tree that contains additional nodes and leaves representing taxa that are missing due to gene losses. This additional data structure is used to determine whether or not the donor and recipient of a horizontal event are in the same genome or a different genome, and to distinguish between domain losses and gene losses. Next, candidate reconciliations are generated in two passes over the domain tree. In the first pass, the dynamic program costCalc visits each d∈ V_{
D
}in postorder and determines the cost of the subtree rooted at dfor each possible event at d. This information is stored in the cost and event tables, K_{
d
}and H_{
d
}. In the second pass, traceback traverses T_{
D
}topdown to generate candidate reconciliations from the information stored in the cost and event tables. In general, there may be more than one optimal set of domain events that reconcile T_{
D
}with T_{
GS
}. The second pass generates all candidate event histories of minimum cost, {T}_{DG}^{1}\dots {T}_{DG}^{m}, where mis the number of candidate histories. In the final step, each candidate history is tested for conflicting temporal constraints. The output is the set of all temporally feasible histories, \left\{{T}_{DG}^{1}\dots {T}_{DG}^{f}\right\}, where fis the number of feasible, optimal reconciliations.
Our domain shuffling event inference algorithm is based on the existing framework for genespecies tree reconciliation with a DTL model [36, 49], but contains additional features to address the complications that arise in reconciliation with three nested trees. We discuss these features in detail, here.
addLoss: We construct {T}_{GS}^{*}=\left({V}_{G}^{*},\phantom{\rule{2.36043pt}{0ex}}{E}_{G}^{*}\right)from T_{
GS
}by placing pseudonodes on each edge (p(g), g) ∈ E_{
G
}, on which losses occurred. Each pseudonode represents an ancestral gene that was present in the species lineage from M ^{GS}(p(g)) to M ^{GS}(g), but cannot be observed because of a gene loss in one child of each of the intervening species. Let {s}_{1}\cdots {s}_{l}be the species between M ^{GS}(g) and M ^{GS}(p(g)) that are absent from T_{
G
}due to gene losses. We insert pseudonodes ϕ_{
s1
}⋯ ϕ_{
sl
}between gand p(g) such that ϕ_{
s1
}becomes the new parent of g, ϕ_{
si
}is the parent of ϕ_{
si−1
}for i= 2 ⋯ l, and p(g) becomes the new parent of ϕ_{
sl
}. For each pseudonode, ϕ_{
s
}, we attached a pseudoleaf, λ_{
s
}, where s\prime =l\left(s\right), if {M}^{GS}\left(g\right)\lessgtr l\left(s\right), and s\prime =r\left(s\right), otherwise. In other words, S'is the child of sthat is not on the path from M ^{GS}(g) to M ^{GS}(p(g)) and \left(s\prime ,\phantom{\rule{2.36043pt}{0ex}}s\right)is the species tree branch on which the loss occurred. Note that a pseudoleaf in {T}_{GS}^{*}may correspond to an internal node in T_{
S
}, because if a gene is missing from an entire clade of species, then the most parsimonious explanation is a single gene loss in the root of the clade.
The addition of pseudonodes allow for more precise estimation of the association between a node in the domain tree and a lineage in the gene tree. Consider, for example, the evolutionary history of the hypothetical family shown in Figure 3, and the reconcilied gene and domain trees for this family, shown in Figure 4. We can infer that the ancestral domain associated with node uin T_{
D
}was in a genome on the EudicotRosid lineage because of the position of the pseudonode ϕ_{
R
}, between uand d1_g1_A. Without the pseudonode, it would not be possible to determine whether or not upredates the divergence between Apple and Berry.
Pseudoleaves are also used to distinguish between gene and domain losses. The domain tree in Figure 4(c) has three stubs indicating missing taxa. Two of these are associated with gene losses (ℓ_{
B
}and ℓ_{
C
}), indicating that only one of the missing domains is due to a domain loss (d2_g2_B).
costCalc: Once the extended reconciled tree has been constructed, the first pass takes T_{
D
}and {T}_{GS}^{*}as input and calls costCalc(d) for each d∈ V_{
D
}in postorder (Alg. 2). The gene association and event at ddepend on g_{
l
}and g_{
r
}, the genes associated with the children of d. The outer loop in cost Calc enumerates all possible (g_{
l
}, g_{
r
}) pairs, and for each pair determines the gene associations and events implied by M ^{DG}(l(d)) = g_{
l
}and M ^{DG}(r(d)) = g_{
r
}. The cost of each configuration is stored in K_{
d
}. A tuple consisting of the event and the mappings of the children of dare stored in H_{
d
}. The logic for these assignments is as follows.
Domain duplication is the only event that results in children mapped to comparable genes. Thus, if g_{
l
}and g_{
r
}are comparable, then \epsilon =\mathcal{D}and the gene associated with dis g= lca(g_{
l
}, g_{
r
}). Horizontal events and codivergences both result in {g}_{l}\lessgtr {g}_{r}. Therefore, if g_{
l
}and g_{
r
}are incomparable, both possibilities are considered. For the codivergence case, the associated gene is again g= lca(g_{
l
}, g_{
r
}). The type of codivergence is determined by inspecting the gene event associated with gene node gin {T}_{GS}^{*}. For the horizontal case, the event is a domain insertion if the donor and recipient gene are in the same genome and a domain transfer if they are in different genomes. For both insertions and transfers, either g_{
l
}or g_{
r
}may be associated with d; i.e., either g_{
l
}or g_{
r
}could be the donor of the domain.
For each scenario, the cost of domain losses must also be determined, excluding cases where domain loss is due to gene loss. Recall that pseudonodes correspond to gene losses. Therefore, the number of losses, {n}_{\mathcal{L}}, on the edge from dto p(d) is the number of nonpseudonodes between g= M ^{DG}(d) and g_{
p
}= M ^{DG}(p(d)). If gand g_{
p
}are not pseudonodes, then this quantity is Δ(g) − Δ(g_{
p
}) − 1, where Δ (g) is the depth of gin the original gene tree, T_{
G
}. However, if gis a pseudonode, ϕ, then Δ (g) is undefined. We define Δ (ϕ) to be the depth of the first nonpseudonode ancestor of ϕ in {T}_{GS}^{*}. In other words, Δ (ϕ) = Δ (u), where u≥_{
G
}ϕ and there exists no v∈ V_{
G
}, such that u≥_{
G
} v≥_{
G
}ϕ. This effectively jumps over all pseudonodes between ϕ and u. If gis a pseudonode and its first nonpseudonode ancestor is a loss, then directly using the depth of the first nonpseudonode ancestor will fail to count this loss. In this case, the number of losses is incremented by one (lines 15 and 16). No such correction is needed when g_{
p
}is a pseudonode.
This formulation allows for efficient calculation of the number of losses under each scenario considered in costCalc. If E(d) is a codivergence, then the number of losses is (Δ (g_{
l
}) − Δ (g) − 1) + (Δ (g_{
r
}) − Δ (g) − 1). If E(d) is a duplication, g_{
l
}, g_{
r
}, and gshould all be at the same depth, and the number of losses is (Δ (g_{
l
}) − Δ (g)) + (Δ (g_{
r
}) − Δ (g)).
traceback: Once the first pass is complete, the cost and event tables are filled for each node in T_{
D
}. The traceback algorithm constructs a minimumcost reconciliation using these tables in a preorder traversal. At every node d∈ V_{
D
}, the appropriate tuple in the event table H_{
d
}is used to assign an event to E(d) and determine the labels of the genes associated with the children of d. The losses that occurred between dand p(d) are inferred in a climbing procedure [38, 50]. For each ancestral node gthat is missing between M ^{DG}(d) and M ^{DG}(p(d)), a loss is inferred in g', the child of gthat is incomparable to M ^{DG}(d). If gis a pseudonode in T_{
GS
}, then g'is a gene loss, λ_{
s
}. Otherwise, a domain loss in g'is inferred.
Algorithm 2DEDTL: CostCalc
Input:{T}_{GS}^{*}; T_{D}; {M}_{L}^{DG}:L\left({T}_{D}\right)\to L\left({T}_{G}\right)
Output:K_{d}, {H}_{d}\phantom{\rule{1em}{0ex}}\forall d\in {V}_{D}; ν
costCalc(d) {
1 if d∈ L(T_{
D
}) {
2 for each g\in {V}_{G}^{*}{
3 if {M}_{L}^{DG}\left(d\right)\lessgtr g{table(d, g, ∞, ∞, null, null) }
4 else{
5
{n}_{\mathcal{L}}=\Delta \left({M}_{L}^{DG}\left(d\right)\right)\Delta \left(g\right)
6 table(d, g, C_{
S
},{n}_{\mathcal{L}}, null, null)
7 }
8 }
9
return
10}
11 costCalc(l(d)), costCalc(r(d))
12 for each\left({g}_{l},\phantom{\rule{2.36043pt}{0ex}}{g}_{r}\right)\in {V}_{G}^{*}\times {V}_{G}^{*}{
13 g= lca(g_{
l
}, g_{
r
})
14
{n}_{\mathcal{L}}=\Delta \left({g}_{l}\right)+\Delta \left({g}_{r}\right)2\cdot \left(\Delta \left(g\right)+1\right)
15 if(g_{l}is pseudonode) { {n}_{\mathcal{L}}+ + }
16 if(g_{r}is pseudonode) { {n}_{\mathcal{L}}+ + }
17 if NOT\left({g}_{l}\lessgtr {g}_{r}\right){
18
// Duplication case;
19
\epsilon \leftarrow \mathcal{D}
20
{n}_{\mathcal{L}}={n}_{\mathcal{L}}+2
21 table(d, g, ε, {n}_{\mathcal{L}}, g_{
l
}, g_{
r
})
22}
23 else{
24
// Codivergence case;
25 if E\left(g\right)=\mathcal{D}\left\{\epsilon \leftarrow {C}_{\mathcal{D}}\right\}// Duplication
26
else if
E\left(g\right)=\mathcal{T}\left\{\epsilon \leftarrow {C}_{\mathcal{T}}\right\}
// Transfer
27 else{ ε ← C_{
S
}} // Speciation
28 table(d, g, ε, {n}_{\mathcal{L}}, g_{
l
}, g_{
r
})
29
// Domain insertion or transfer case;
30 ifM^{GS}(g_{
l
}) = M ^{GS}(g_{
r
}) { // Same genome
31 // Domain insertion case;
32 table(d, g_{
l
}, I, 0, g_{
l
}, g_{
r
}) // g_{
l
} to g_{
r
}
33 table(d, g_{
r
}, I, 0, g_{
l
}, g_{
r
}) // g_{
r
} to g_{
l
}
34} else if {M}^{GS}\left({g}_{l}\right){\lessgtr}_{S}{M}^{GS}{g}_{r}{
35 // Domain transfer case;
36 table(d, g_{
l
}, \mathcal{T}0, g_{
l
}, g_{
r
}) // g_{
l
} to g_{
r
}
37 table(d, g_{
r
}, \mathcal{T}, 0, g_{
l
}, g_{
r
}) // g_{
r
} to g_{
l
}
38}
39 }
40}
table(d, g, ε, {n}_{\mathcal{L}}, g_{
l
}, g_{
r
}) {
41
\kappa \leftarrow \kappa \left(\epsilon \right)+{K}_{l\left(d\right)}\left[{g}_{l}\right]+{K}_{r\left(d\right)}\left[{g}_{r}\right]+{\kappa}_{\mathcal{L}}\cdot {n}_{\mathcal{L}}
42 if\kappa <{K}_{d}\left[g\right]{
43
{K}_{d}\left[g\right]\leftarrow \kappa
44
{H}_{d}\left[g\right]\leftarrow \phantom{\rule{2.36043pt}{0ex}}\mathit{l}\mathit{i}\mathit{s}\mathit{t}\phantom{\rule{2.36043pt}{0ex}}\left(\left(\epsilon ,\phantom{\rule{2.36043pt}{0ex}}{g}_{l},\phantom{\rule{2.36043pt}{0ex}}{g}_{r}\right)\right)
45} else if\kappa ={K}_{d}\left[g\right]{ list({H}_{d}\left[g\right], (ε, g_{
l
}, g_{
r
})) }
46}
If there is more than one optimal candidate reconciliation, T_{
D
}is traversed repeatedly until all minimum cost histories have been generated. The traceback requires no modification to address reconciliation on three levels and is essentially the same as the traceback algorithm for gene treespecies tree reconciliation, as described in [38].
checkFeasibility: Once all candidate reconciliations have been returned, their temporal feasibility is checked to ensure that each solution is valid. This requires criteria for temporal feasibility that accommodate the interplay of domain evolution with gene and species evolution. To be temporally feasible, an inferred domain event history must satisfy two criteria. First, it must be possible to assign a timestamp to every species tree node, such that the timestamps are consistent with the temporal constraints imposed by the combined set of gene and domain transfers. Second, it must be possible to assign a timestamp to every gene tree node, such that the timestamps are consistent with the temporal constraints imposed by the combined set of domain transfers and domain insertions.
For the first criterion, we introduce a species timing graph, {G}_{t}^{S}=\left({V}_{t}^{S},\phantom{\rule{2.36043pt}{0ex}}{E}_{t}^{S}\right), in which the nodes represent species (i.e., {V}_{t}^{S}={V}_{S}) and the edges represent temporal constraints introduced by gene and domain transfers, as well as the temporal relationships imposed by directed tree edges. If gene gis horizontally transferred from species s_{1}to s_{2}, then every domain in gis also implicitly transferred from s_{1}to s_{2}. Thus, every gene transfer, (g_{1}, g_{2}), corresponds to one or more edges in the domain tree. Let {\Lambda}_{G}\subset {E}_{G}^{*}and Λ_{
D
}⊂ E_{
D
}be gene and domain transfer edges in {T}_{GS}^{*}and T_{
DG
}, respectively, and let {\Lambda}_{G}^{}be the set of edges (d_{1}, d_{2}), such that ∃(g_{1}, g_{2}) ∈ Λ_{
G
}, where g_{1}= M ^{DG}(d_{1}), g_{2}= M ^{DG}(d_{2}). Then, {\Lambda}_{G}^{1}\cup {\Lambda}_{D}represents all the temporal constraints arising from horizontal transfers of both genes and domains.
The temporal constraints on species are encoded in {G}_{t}^{S}by adding edges to {E}_{t}^{S}that represent the following three temporal constraints:
1 If species s_{
i
}= p(s_{
j
}), then s_{
i
}must have predated s_{
j
}. ∀(s_{
i
}, s_{
j
}) ∈ E_{
S
}, add (s_{
i
}, s_{
j
}) to {E}_{t}^{S}.
2 If a transfer from species s_{
i
}to species s_{
j
}occurred, then s_{
i
}and s_{
j
}must have coexisted. Further, the parent of s_{
j
}must have predated s_{
i
}, and vice versa. \forall \left({d}_{i},\phantom{\rule{2.36043pt}{0ex}}{d}_{j}\right)\in {\Lambda}_{G}^{1}\cup {\Lambda}_{D}, add (p(s_{
i
}), s_{
j
}) and (p(s_{
j
}), s_{
i
}) to {E}_{t}^{S}, where s_{
i
}= M ^{DS}(d_{
i
}) and s_{
j
}= M ^{DS}(d_{
j
}).
3 If two transfers occur in the same lineage in the reconciled domain tree (i.e., one is ancestral to the other), then the species corresponding to the donor and recipient of the more ancestral event must have occurred no later than the donor and recipient species of the more recent event. Let (d_{1}, d_{2}) and ({d}_{1}^{\prime}, {d}_{2}^{\prime}) be a pair of transfers in {\Lambda}_{G}^{1}\cup {\Lambda}_{D}, such that {d}_{2}{\ge}_{D}{d}_{1}^{\prime}. Then s_{1}= M ^{DS}(d_{1}) and s_{2}= M ^{DS}(d_{2}) must have occurred no later than species {s}_{1}^{\prime}={M}^{DS}\left({d}_{1}^{\prime}\right)and {s}_{2}^{\prime}={M}^{DS}\left({d}_{2}^{\prime}\right). Add (p(s_{1}), {s}_{1}^{\prime}), (p(s_{1}), {s}_{2}^{\prime}), (p(s_{2}), {s}_{1}^{\prime}), and (p(s_{2}), {s}_{2}^{\prime}) to {E}_{t}^{S}.
Note that the third constraint considers all pairs of comparable transfers. When mapping gene transfer events to the reconciled domain tree, each pair of transfers that is comparable in T_{
G
}corresponds to at least one pair of edges in {\Lambda}_{G}^{}that is comparable in T_{
D
}.
For the second criterion, we construct a gene timing graph, {G}_{t}^{G}=\left({V}_{t}^{G},\phantom{\rule{2.36043pt}{0ex}}{E}_{t}^{G}\right), in which the nodes represent genes and the edges represent temporal constraints implied by domain transfers, Λ_{
D
}, and insertions, ○_{
D
}, where ○_{
D
}⊂ E_{
D
}denotes the set of domain insertions in T_{
DG
}. The nodes in {G}_{t}^{G}are the nodes in {T}_{GS}^{*}, including pseudonodes (i.e., {V}_{t}^{G}={V}_{G}^{*}). Edges are added to {E}_{t}^{G}to represent the following three temporal constraints:
1 If gene g_{
i
}= p(g _{
j
}), then g_{
i
}must have predated g_{
j
}. \forall \left({g}_{i},\phantom{\rule{2.36043pt}{0ex}}{g}_{j}\right)\in {E}_{G}^{*}, add (g_{
i
}, g_{
j
}) to {E}_{t}^{G}.
2 If a domain was transferred or inserted from gene g_{
i
}to g_{
j
}, then g_{
i
}and g_{
j
}must have coexisted. Further, the parent of g_{
i
}must have predated g_{
j
}and vice versa. ∀(g_{
i
}, g _{
j
}) ∈ {(M ^{DG}(d_{1}), M ^{DG}(d_{2}))  (d_{1}, d_{2}) ∈ Λ_{
D
}∪ ○_{
D
}}, add (p(g_{
i
}), g_{
j
}) and (p(g_{
j
}), g_{
i
}) to {E}_{t}^{G}.
3 Let (d_{1}, d_{2}) and ({d}_{1}^{\prime}, {d}_{2}^{\prime}) be domain insertions and/or transfers in Λ_{
D
}∪ ○_{
D
}, such that {d}_{2}{\ge}_{D}{d}_{1}^{\prime}. Then g_{1}= M ^{DG}(d_{1}) and g_{2}= M ^{DG}(d_{2}) must have existed no later than {g}_{1}^{\prime}={M}^{DG}\left({d}_{1}^{\prime}\right)and {g}_{2}^{\prime}={M}^{DG}\left({d}_{2}^{\prime}\right). Add (p(g_{1}), {g}_{1}^{\prime}), (p(g_{1}), {g}_{2}^{\prime}), (p(g_{2}), {g}_{1}^{\prime}), and (p(g_{2}), {g}_{2}^{\prime}) to {E}_{t}^{G}.
A domain event history is temporally feasible iffboth timing graphs are acyclic. If either graph contains a cycle, then the candidate history is infeasible and is not reported. A modified topological sorting algorithm in Θ (V_{
t
} + E_{
t
}) [51] is used to test for cycles in G_{
t
}.
Complexity
Our algorithm infers domain shuffling events in polynomial time. addLoss constructs {T}_{GS}^{*}by adding pseudonodes to T_{
GS
}. The number of pseudonodes that can be added to the edge above a node in T_{
GS
}is at most h_{
S
}, where h_{
S
}is the height of T_{
S
}. Therefore, the complexity of addLoss is O\left(\left{V}_{G}^{*}\right\right)=O\left({h}_{S}\left{V}_{G}\right\right). costCalc visits each domain node d∈ V_{
D
}and loops through all pairs of gene nodes \left({g}_{l},\phantom{\rule{2.36043pt}{0ex}}{g}_{r}\right)\in {V}_{G}^{*}\times {V}_{G}^{*}. Since the calculations for each combination of d, g_{
l
}, and g_{
r
}requires only constant time, the total complexity is O\left(\left{V}_{D}\right{V}_{G}^{*}{}^{2}\right)=O\left(\left{V}_{D}\right{h}_{S}^{2}{V}_{G}{}^{2}\right).
traceback constructs a single candidate solution in a preorder traversal of the domain tree. Looking up the event, the number of losses, and mapping of each node requires constant time. For each node d∈ V_{
D
}, the losses between dand p(d) are calculated by climbing from M ^{DG}(d) to M ^{DG}(p(d)). This requires O\left({h}_{G}^{*}\right)time for each d, where {h}_{G}^{*}is the height of {T}_{GS}^{*}. Therefore, the total complexity for returning a single condidate reconciliation in the second pass is O\left({h}_{G}^{*}\left{V}_{D}\right\right). Since the number of pseudonodes added to an edge in T_{
GS
}is bounded by h_{
S
}and there are at most h_{
G
}nodes that are not pseudonodes contributing to {h}_{G}^{*}, {h}_{G}^{*}\le {h}_{S}{h}_{G}. As a result, the complexity of the second pass is O(h_{
S
}h_{
G
}V_{
D
}).
To determine whether a candidate solution is temporally feasible, checkFeasibility tests both the gene and species timing graphs for cycles, using a topological sorting algorithm that runs in O(V_{
t
} + E_{
t
}).
For the species timing graph, {V}_{t}^{S}={V}_{S}. The number of edges in {E}_{t}^{S}depends on the three constraints described in the previous section:

1
E_{
S
} edges are added to {E}_{t}^{S}, one for each species tree edge.

2
Two edges are added to {E}_{t}^{S}for each transfer in {\Lambda}_{G}^{}\cup {\Lambda}_{D}. This results in the addition of at most 2E_{
D
} edges because {\Lambda}_{G}^{}\cup {\Lambda}_{D}<\left{E}_{D}\right.

3
Four edges are added to {E}_{t}^{S}for every pair of comparable transfers in {\Lambda}_{G}^{}\cup {\Lambda}_{D}. Since the number of pairs is bounded by E_{
D
}^{2}, the number of added edges is bounded by 4E_{
D
}^{2}.
Combining all three constraints, the complexity of cycle checking in the species timing graph is O(V_{
S
} + E_{
S
} + E_{
D
} + E_{
D
}^{2}). Because V_{
S
} = O(E_{
S
}) and E_{
D
} ≥ E_{
S
}, E_{
D
}^{2}is the dominant term. Therefore, the complexity can be written as O(E_{
D
}^{2}).
For the gene timing graph, {V}_{t}^{G}={V}_{G}^{*}and the number of edges in {E}_{t}^{G}depends on the three previously described constraints. The first constraint contributes \left{E}_{G}^{*}\rightedges to {E}_{t}^{G}. In the worst case, the second and third constraints contribute the same number of edges for the gene timing graph as for the species timing graph, that is O(E_{
D
}) and O(E_{
D
}^{2}). Thus, the complexity of cycle checking in the gene timing graph is O\left(\left{V}_{G}^{*}\right+\left{E}_{G}^{*}\right+\left{E}_{D}\right+{E}_{D}{}^{2}\right). Recall that \left{E}_{G}^{*}\right=O\left(\left{V}_{G}^{*}\right\right)=O\left({h}_{S}\left{V}_{G}\right\right). Because E_{
D
} ≥ V_{
G
} and E_{
D
} ≥ h_{
S
}, this complexity can be written as O\left(\left{E}_{D}^{2}\right\right).
Domain shuffling with insertions only
The DEDTLmodel includes horizontal transfer of both genes and domains and is suitable for reconstructing the history of domain shuffling events in species that accept foreign DNA. However, this model is not appropriate for analysis of multidomain families in species that do not participate in genetic exchange with other species. For such families, we also consider domain shuffling inference for the restricted model without transfer events. This model is particularly wellsuited to the large and complex multidomain families in vertebrates [12], in which HGT is thought not to occur.
Domain Event Inference without Transfers (DEDL)
Domain events: \left\{{C}_{S},\phantom{\rule{2.36043pt}{0ex}}{C}_{D},\phantom{\rule{2.36043pt}{0ex}}\mathcal{D},\phantom{\rule{2.36043pt}{0ex}}I,\phantom{\rule{2.36043pt}{0ex}}\mathcal{L}\right\}.
Input:A rooted, binary domain tree, T_{
D
}= (V_{
D
}, E_{
D
}); a rooted, binary, DLreconciled gene tree, T_{
GS
}; and a leaf mapping, {M}_{L}^{DG}:L\left({T}_{D}\right)\to L\left({T}_{G}\right)
Output:The set of all temporally feasible, domain shuffling histories T_{
DG
}that minimize
\kappa ={k}_{\mathcal{D}}{n}_{\mathcal{D}}+{\kappa}_{I}{n}_{I}+{\kappa}_{\mathcal{L}}{n}_{\mathcal{L}}.
The overall structure of the algorithm for DEDL is the same as for DEDTL, i.e., Alg. 1. However, since transfers are not allowed, the input gene tree must be reconciled with a species tree under the DLmodel, not the DTLmodel. An extended gene tree, {T}_{GS}^{*}, is constructed using the same procedure as before. With minor modifications, Alg 2 generates solutions to the DEDL problem. In the codivergence case, line 26 is omitted to exclude codivergences with gene transfers \left({C}_{\mathcal{T}}\right)from the event set. The domain transfer case (lines 3437) is eliminated altogether. The second pass is identical for both problems, with the exception that transfers do not appear in the cost and event tables.
The worstcase time complexity for costCalc is the same as in DEDTL, but because domain transfers are not considered and domain insertions are only allowed between genes in the same species, the number of mappings that can be considered is greatly restricted. This suggests a faster runtime in practice.
Determining temporal feasibility is a simpler procedure, since only the gene timing graph must be constructed and tested for cycles. The complexity for gene tree timing graph is O(E_{
D
}^{2}).
Case studies
Here we demonstrate the practical application of our approach using several examples from the Membraneassociated guanylate kinase (Maguk) family [52, 53].
The Maguks are multidomain scaffolding proteins (Figure 5) that play important roles in cellcell communication and adhesion including mediating cell polarity [54, 55], cell proliferation [56], and synaptic plasticity [57, 58]. Scaffolding proteins assemble the components of a signaling cascade in the appropriate configuration [59]. The multidomain architecture is integral to this function because each of the constituent domains is responsible for anchoring specific proteins to the signaling complex. Therefore, acquisition, loss, or replacement of a domain with another from the same family could result in an immediate and dramatic change in interaction partners.
All Maguks have an inactive Guanylate Kinase (GuK) domain, in combination with various adapter domains that anchor downstream pathway proteins to the scaffold. There are six Maguk subfamilies, each with a characteristic domain architecture (Figure 5). The identity, copy number, and order of the auxiliary domains is largely conserved within each subfamily, with minor variations.
To analyze the history of domain shuffling in the Maguk family, we require a species tree, a gene tree, and a tree for each domain of interest. For these case studies, we restrict our analysis to three species, human, mouse and chicken, for which the species tree is well known. For the gene tree, we require a phylogeny that represents the history of the entire gene family locus. This can be a challenge for multidomain families with variable domain content, including the Maguks, because it is not possible to obtain a full length sequence alignment, from which to construct a gene tree. In this analysis, we use the phylogeny of the GuKdomain as a proxy for the history of the gene family. This domain is present in all Maguk architectures and is unique to the Maguk family. In Maguks, the GuKand SH3domains participate in intramolecular interactions that cause them to function together as a unit [56, 60]. This suggests that the GuKdomain is under tight structural and functional constraints and, hence, is unlikely to participate in domain shuffling, making it a suitable proxy for the history of the locus.
With this in mind, we constructed a GuKdomain phylogeny (see Methods) for all members of the Maguk family in mouse, human, and chicken (Figure 6and Fig. S1 in Additional file 1). Using this gene tree, we investigated possible domain shuffling for two Maguk constituent domains: the L27domain in the Membraneassociated Proteins, Palmitoylated (MPP) subfamily, and the PDZdomain, which is found in all Maguk subfamilies, except the calcium channel β (CACNB) proteins.
The L27 domain. The MPPsmediate protein complex formation at cell junctions and play a role in establishing cell polarity during development [54]. All MPPscontain a core GuKSH3PDZdomain architecture and, except MPP1, two Nterminal L27domains (Figure 5).
To determine whether the L27domains coevolved by vertical descent with the Maguk structural core, we constructed trees from the sequences of the first and second L27domains (L271and L272) in human, mouse, and chicken MPPs(see Methods). Outside of the MPPsubfamily, only a few proteins encoded in the human genome possess L27domains (DLG1, Lin7A/B/C, MPDZ, INADL). These proteins possess only a single L27domain, rather than a tandem pair. Structural studies partition singlecopy and tandem L27domains into two distinct subtypes, with distant sequence homology [61]. This suggests that MPP L27domains are more closely related to each other than to other L27domains and, thus, satisfy the assumption of our algorithm, which only considers domains within the current gene family.
The L27phylogeny was reconciled with the GuKreference tree (Figure 7(a)) using the DEDL model and event costs {\kappa}_{\mathcal{D}}=3, {\kappa}_{I}=1.5, and {\kappa}_{\mathcal{L}}=3. The L272subtree (not shown) is consistent the hypothesis that the L272and GuKdomains coevolved without shuffling. However, the reconciled L271tree (Figure 7(b)) suggests that the L271domain in the MPP2/6ancestor was replaced by a copy of L271from the MPP3/7ancestor, by a single domain insertion (Figure 7(c)).
Phylogenetic error is of particular concern in domain tree reconstruction, because domain sequences tend to be short and weakly conserved [42]. To investigate the impact of phylogenetic error on the inferred L271domain insertion, we generated the 95% confidence set of trees with high likelihood scores [62, 63], as described in Methods. Only two topologies are supported by at least 25% of the trees in the confidence set, as shown in the consensus network [64] in Figure 8. In both topologies, MPP2/6/3/7form a clade, which is consistent with the hypothesis that the L271domain was replaced in the MPP2/6ancestor.
Previous models of MPPevolution [52] predicted that the L271domains in MPP2/6and CASKshare a unique, common ancestor. In contrast, our analysis predicts that the L271domains in MPP2/6and MPP3/7are more closely related. This has functional, as well as evolutionary, implications. A vertical descent model implies that MPP2/6are likely to be most functionally similar to CASK, their closest MPPrelative. Our analysis predicts that MPP2/6are more likely to be similar to MPP3/7, with respect to L27mediated interactions. This is consistent with recent experimental results that show that MPP2, MPP3, and MPP7share a common L27mediated complex formation mechanism, while CASKuses a different mechanism [65].
The PDZ domain. All members of the Maguk family, except for the CACNBs, have at least one PDZdomain. Using our newly developed methods, we sought to understand how the varied number of PDZdomains arose in this family. Was there a single PDZin the earliest Maguk that expanded independently in different subfamilies? Or was there a round of ancestral domain duplication followed reciprocal, independent losses?
The maximum likelihood (ML) tree for the PDZdomains (Fig. S2 in Additional file 1) suggests a number of domain shuffling events when reconciled with the GuKreference tree. However, many of the relationships in this tree are associated with low bootstrap values and may not reflect an accurate evolutionary history. To investigate the robustness of the inferred domain shuffling events, we generated the 95% confidence set of PDZtrees as described in Methods. All 121 trees in this set imply that a number of domain shuffling events occurred  some of these events are unique to a single tree, while others are common to a large fraction of the trees. We defined the support of an inferred event in this set to be the fraction of reconciled PDZtrees in which that event occurs (see Methods). The inferred events were scored and ranked based on this support score (Table S1 and Fig. S3 in Additional file 1). The reconciled trees from the PDZconfidence set were then ranked based on the number of highscoring events found in their history. Tree 84 (Fig. S4 in Additional file 1) has nine of the top ten highscoring insertion events. Only one tree has all top ten events, but that tree results in a much higher reconciliation cost (439.9), compared with a cost of 399.1 for Tree 84. We therefore use Tree 84 in the following analysis of PDZdomain evolution in the Maguks (Figure 9).
Our reconstructed history of PDZdomain shuffling shows several interesting trends that disagree with prior analyses based on Wagner parsimony. This is especially true for the ZOand Carmasubfamilies.
All members of the ZO/DLG5subfamily have a cassette of at least three tandemly arranged PDZdomains (and a fourth in DLG5). It has been previously been proposed [52] that two domain duplications in the ZO/DLGancestor gave rise to this cassette, which was subsequently inherited by all ZOsand DLG5, with an additional duplication resulting in the fourth PDZin DLG5.
Our reconstruction, based on domain tree reconciliation, tells a very different story (Figure 10). First, the ZOancestor had a single PDZdomain that was vertically inherited in ZO2and ZO3, but lost in ZO1. Second, the cassette expansion was the result of a domain insertion from MAGIinto the common ancestor of ZO2and ZO3. While these insertions are weakly supported, the fact that both the donor domains (PDZ4and PDZ5in MAGI) and the recipient domains (PDZ1and PDZ2in ZO2and ZO3) are adjacent is suggestive. Third, the PDZcassette in ZO1was the result of three, highly supported domain insertion events from ZO2in amniotes. Considering the fact that these three PDZsare tandemly located in the domain architecture, it is possible that this was the result of a single insertion event involving all three PDZdomains simultaneously. Interestingly, the Cterminal PDZdomain in DLG5and the Cterminal PDZin the ZOproteins were both vertically inherited from the same ancestral copy. The other DLG5 PDZdomains were the result of insertions from the DLGand Carmasubfamilies.
Also of particular interest are the highscoring insertion events in the Carmasubfamily. All Carmagenes have a single PDZdomain that is predicted to be the result of vertical inheritance in Wagner parsimony analysis. According to our domain tree reconciliation analysis, however, this is not the case (Figure 11). Although the single PDZdomain in Carma1and Carma2was vertically inherited, Carma3experienced both an expansion and contraction of its PDZrepertoire in the amniote ancestor. This expansion was due to two independent and highlysupported insertion events  one from MPP4and the other from Carma1. Parallel losses followed, with the copy from MPP4lost in birds and the copy from Carma1lost in mammals. As a result, even though all contemporary Carma3's have the same domain architectures, the PDZdomain in chicken Carma3is paralogous, not orthologous, to the PDZdomain in mouse and human.
These case studies highlight the importance of using information from three levels of evolution  domain, gene, and species  when analyzing the history of a multidomain family. Without multiple species, the divergent history of the PDZsin Carma3in birds and mammals would not be apparent. Note also the pseudonode representing the ancestral Carma2in amniotes. This was the result of a loss of the Carma2gene in chicken. Our algorithm correctly identifies the missing PDZin chicken as a gene loss, not a domain loss.