Basic notation and preliminaries
Given a rooted tree T, we denote its node set, edge set, and leaf set by V (T), E(T), and Le(T) respectively. The root node of T is denoted by rt(T). Given a node v ∈ V (T), we denote its parent by pa
_{
T
} (v), its set of children by Ch
_{
T
} (v), and the subtree of T rooted at v by T
_{
v
}. If two nodes in T have the same parent, they are called siblings. The set of internal nodes of T, denoted I(T), is defined to be V (T)\Le(T). We define ≤ _{
T
} to be the partial order on V (T) where x ≤_{
T
}
y if y is a node on the path between rt(T) and x. The least common ancestor of a nonempty subset L ⊆ V (T) in tree T, denoted as lca_{
T
} (L), is the unique smallest upper bound of L under ≤_{
T
}.
Given x, y ∈ V (T), x →_{
T
}
y denotes the unique path from x to y in T. We denote by d
_{
T
} (x, y) the number of edges on the path x →_{
T
}
y. T is fully binary if every node has either zero or two children. Throughout this paper, the term tree refers to a rooted fully binary tree. Given T and a set L ⊆ Le(T), let T' be the minimal rooted subtree of T with leaf set L. We define the leaf induced subtree T[L] of T on leaf set L to be the tree obtained from T' by successively removing each nonroot node of degree two and adjoining its two neighbors.
The DuplicationLoss problem
A species tree is a tree that depicts the evolutionary relationships of a set of species. Given a gene family for a set of species, a gene tree is a tree that depicts the evolutionary relationships among the sequences encoding only that gene family in the given set of species. Thus, the nodes in a gene tree represent genes. We assume that each leaf of the gene trees is labeled with the species from which that gene was sampled. In order to compare a gene tree G with a species tree S, we require a mapping from each gene g ∈ V (G) to the most recent species in S that could have contained g.
Definition 1 (Mapping). The leafmapping ℒ_{
G, S
} : Le(G) → Le(S) maps a leaf node g ∈ Le(G) to that unique leaf node s ∈ Le(S) which has the same label as g. The extension ℳ_{
G, S
} : V (G) → V (S) of ℒ_{
G, S
}
is the mapping defined by ℳ_{
G, S
}(g) = lca(ℒ_{
G, S
}(Le(G
_{
g
}))).
For any node s ∈ V (S), we use
to denote the set of nodes in G that map to node s ∈ V (S) under the mapping ℳ_{
G, S
}.
Definition 2 (Comparability). Given trees G and S, we say that G is comparable to S if, for each g ∈ Le(G), the leafmapping ℒ_{
G, S
}(g) is well defined. A set of gene trees
is comparable to S if each gene tree in
is comparable to S.
Throughout this paper we use the following terminology:
is a set of gene trees that is comparable to a species tree S, and
.
Definition 3 (Duplication). A node g ∈ I(G) is a (gene) duplication if ℳ_{
G, S
}(g) ∈ ℳ_{
G, S
}(Ch(g)) and we define Dup(G, S) = {g ∈ I(G): g is a duplication}.
Following [8], we define the number of losses as follows.
Definition 4 (Losses). The number of losses Loss(G, S, g) at a node g ∈ I(G), is defined to be:

0, if ℳ_{
G, S'
}(g) = ℳ_{
G, S″} (g') ∀ g' ∈ Ch(g), and

∑_{
g'∈Ch(g)} d
_{
S'
}(ℳ_{
G, S'
}(g), ℳ_{
G, S'
}(g'))  1, otherwise;
where S' = S[Le(G)]. We define Loss(G, S) = Σ_{
g∈I(G)}
Loss(G, S, g) to be the number of losses in G.
Under the duplicationloss model, the reconciliation cost of G with S is simply the duplicationloss cost; i.e., the number of duplications and losses.
Definition 5 (Reconciliation cost).
We define reconciliation costs for gene and species trees as follows:
 1.
Δ(G, S) =  Dup(G, S) + Loss(G, S) is the reconciliation cost from G to S.
 2.
is the reconciliation cost from
to S.
 3.
Problem 1 (DuplicationLoss). Given a set
of gene trees, the DuplicationLoss problem is to find a species tree S* comparable with
, such that
.
Local search problems
Here we first provide the definition of an SPR edit operation [18] and then formulate the related local search problems. The definition and associated local search problems for the TBR edit operation are considered later.
For technical reasons, before we can define the SPR operation, we need the following definition.
Definition 6 (Planted tree). Given a tree T, the planted tree Φ(T) is the tree obtained by adding a root edge {p, rt(T)}, where p ∉ V (T), to T.
Definition 7 (
SPR operation).
Let T be a tree, e = {
u,
v} ∈
E(
T),
where u =
pa(
v),
and X,
Y be the connected components that are obtained by removing edge e from T such that v ∈
X and u ∈
Y.
We define
SPR
_{
T
} (
v,
y)
for y ∈
Y to be the tree that is obtained from Φ(
T)
by first removing edge e, and then adjoining a new edge f between v and Y as follows:
 1.
Create a new node y' that subdivides the edge {pa(y), y}.
 2.
Add edge f between nodes v and y'.
 3.
Suppress the node u, and rename y' as u.
 4.
We say that the tree
SPR
_{
T
} (v, y) is obtained from T by a subtree prune and regraft (SPR) operation that prunes subtree T
_{
v
}
and regrafts it above node y.
Notation. We define the following:
1.
SPR
_{
T
} (v) =∪_{
y∈Y
}{SPR
_{
T
} (v, y)}
2.
SPR
_{
T
} = ∪ _{(u, v)∈E(T)}
SPR
_{
T
} (v)
Throughout the remainder of this manuscript, S denotes a species tree such that
, and v is a nonroot node in V (S).
We now define the relevant local search problems based on the SPR operation.
Problem 2 (SPRScoring (SPRS)).
Given
and S, find a tree T *∈ SPR
_{
S
}
such that Δ (
, T *) =
.
Our goal is to solve the SPRS problem efficiently. To that end, we first define a restricted version of the SPRS problem, called the SPRRestricted Scoring problem.
Problem 3 (SPRRestricted Scoring (SPRRS)).
Given
, S, and v, find a tree T *∈ SPR
_{
S
}(v) such that Δ (
, T *) =
.
Let n =  Le(S), m =  Le(S) +  Le(G) and
, and let us assume, for convenience, that all
have approximately the same size. In the following, we show how to solve the SPRRS problem in O(km) time. Since SPR
_{
S
} = ∪ _{{pa(v), v} ∈ E(S)}
SPR
_{
S
}(v), it is easy to see that the SPRS problem can be solved by solving the SPRRS problem O(n) times. This yields an O(kmn) time algorithm for the SPRS problem. Later, we show that the local search problem corresponding to the TBR operation reduces to solving O(n
^{2}) SPRRS problems; which yields an O(kmn
^{2}) time algorithm for the TBR local search problem. In the interest of brevity, all lemmas and theorems in this paper appear with proofs omitted; however, all proofs are available in [29].
Solving the SPRRS problem
Throughout this section, we limit our attention to one gene tree G; in particular, we show how to solve the SPRRS problem for G in O(m) time. Our algorithm extends trivially to solve the SPRRS problem on the set of gene trees
in O(km) time. For simplicity, we will assume that Le(G) = Le(S). Indeed, if Le(G) ≠ Le(S) then we can simply set the species tree to be S[Le(G)]; this takes O(n) time and, consequently, does not affect the time complexity of our algorithm.
In order to solve the SPRRS problem for G, it is sufficient to compute the values  Dup(G, S') and Loss(G, S') for each S' ∈ SPR
_{
S
}(v). Bansal et al. [22] showed how to compute the value  Dup(G, S') for each S' ∈ SPR
_{
S
}(v), in O(m) time. Losses, however, behave in a very different and much more complex manner compared to gene duplications and it has remained unclear if their computation could be similarly optimized. In this paper we show that it is indeed possible to compute the value Loss(G, S') for each S' ∈ SPR
_{
S
}(v) in O(m) time as well. Altogether, this implies that the SPRRS problem for G can be solved in O(m) time. Next, we introduce some of the basic structural properties that are helpful in the current setting.
Basic structural properties
Consider the tree N
^{
S, v
} = SPR
_{
S
}(v, rt(S)). Observe that, since
, solving the SPRRS problem on instance ⟨{G}, S, v⟩ is equivalent to solving it on the instance ⟨{G}, N
^{
S, v
}, v⟩. Thus, in the remainder of this section, we will work with tree N
^{
S, v
} instead of tree S; the reason for this choice becomes clear in light of Lemmas 3 and 4.
Since S and v are fixed in the current context, we will, in the interest of clarity, abbreviate N
^{
S, v
} simply to N. Similarly, in the remainder of this section, we abbreviate ℳ_{
G, T
} to ℳ_{
T
}, for any species tree T.
Throughout the remainder of this work, let u denote the sibling of v in N. We color the nodes of N asfollows: (i) All nodes in the subtree N
_{
v
} are colored red, (ii) the root node of N is colored blue, and (iii) all the remaining nodes, i.e. all nodes in N
_{
u
}, are colored green. Correspondingly, we color the nodes of G by assigning to each g ∈ V (G) the color of the node ℳ_{
N
}(g).
Definition 8 (Γ). We define Γ to be the tree obtained from G by removing all red nodes (along with any edges incident on these red nodes). Observe that while
Γ
must be binary, it might not be fully binary.
The significance of the tree Γ stems from the following two Lemmas which, together, completely characterize the mappings from nodes in V (G) for each S' ∈ SPR
_{
S
}(v). This characterization is the basis of Lemmas 3 through 8.
Lemma 1. Given G and N, if g ∈ V (G) is either red or green, then ℳ_{
S'
}(g) = ℳ_{
N
}(g) for all S' ∈ SPR
_{
N
} (v).
Lemma 2. Given G and N, if g ∈ V (G) is a blue node, then ℳ_{
S'
} (g) = lca_{
S'
} (v, ℳ_{Γ, N
} (g)) for any S' ∈ SPR
N(v).
Characterizing losses
To solve the SPRRS problem efficiently we rely on the following six lemmas, which make it possible to efficiently infer the value of Loss(G, S', g) for any S' ∈ SPR
_{
N
} (v) and any g ∈ V (G).
Consider any g ∈ I(G), and let g' and g" be its two children. Let a = ℳ_{
N
}(g), b = ℳ_{
N
} (g') and c = ℳ_{
N
} (g"). Without loss of generality, node g must correspond to one of the following six categories: 1) g is red, 2) g is green, 3) g, g', and g" are all blue, 4) g and g' are blue, and g" is green, 5) g and g' are blue, and g" is red, or, 6) g is blue, g' is red, and g" is green.
Lemmas 3 through 8 characterize the behavior of the loss cost Loss(G, S', g), for each S' ∈ SPR
_{
N
} (v), for each of these six cases. At this point, it would help to observe that SPR
_{
N
} (v) = {SPR
_{
N
} (v, s): s ∈ V (N
_{
u
})}.
Lemma 3. If g is red then Loss(G, S', g) = Loss(G, N, g) for all S' ∈ SPR
_{
N
} (v).
Lemma 4. If g is green then Loss(G, S', g) = Loss(G, N, g) + 1 if S' = SPR
_{
N
} (v, x) where b ≤_{
N
}
x <_{
N
}
a or c ≤_{
N
}
x <_{
N
}
a, and Loss(G, S', g) = Loss(G, N, g) otherwise.
Lemma 5.
Let g, g'
and g"
all be blue nodes, x ∈
V (
N
_{
u
}),
and let a' = ℳ
_{Γ, N
} (
g),
b' = ℳ
_{Γ, N
} (
g')
and c' = ℳ
_{Γ, N
} (
g").
 1.
If S' = SPR
_{
N
} (v, x) where x ≮_{
N
}
a', then Loss(G, S', g) = Loss(G, N, g).
 2.
If S' =
SPR
_{
N
} (
v,
x)
where x <
_{
N
}
a', and S" =
SPR
_{
N
} (
v,
pa(
x)),
then,
 (a)
Loss(G, S', g) = Loss(G, S", g) + 1 if b' ≤_{
N
}
x <_{
N
}
a' or c' ≤_{
N
}
x <_{
N
}
a', and,
 (b)
Loss(G, S', g) = Loss(G, S", g) otherwise.
Lemma 6.
Let g and g'
be blue nodes and g"
be a green node, x ∈
V (
N
_{
u
})\{
u},
and let a' = ℳ
_{Γ, N
} (
g),
b' = ℳ
_{Γ, N
} (
g')
and c' = ℳ
_{Γ, N
} (
g").
 1.
If S' =
SPR
_{
N
} (
v,
x)
where x ≮
_{
N
}
a', and S" =
SPR
_{
N
} (
v,
pa(
x)),
then,
 (a)
Loss(G, S', g) = Loss(G, S", g)  1 if a' ≤_{
N
}
x <_{
N
}
u,
 (b)
Loss(G, S', g) = Loss(G, S", g)  1 if a' ≤ _{
N
}
pa(x) <_{
N
}
u but x is not such that a' ≤_{
N
}
x <_{
N
}
u, and,
 (c)
Loss(G, S', g) = Loss(G, S", g) otherwise.
 2.
Let S' =
SPR
_{
N
} (
v,
x)
where x <
_{
N
}
a' and S" =
SPR
_{
N
} (
v,
pa(
x)).
 (a)
If a' ≠
b'
and b"
denotes the child of a'
along the path a' →
_{
N
}
b',
then,
 i.
Loss(G, SPR
_{
N
} (v, b"), g) = Loss(G, SPR
_{
N
} (v, a'), g)  2 if a' ≠ c'. And,
Loss(G, SPR
_{
N
} (v, b"), g) = Loss(G, SPR
_{
N
} (v, a'), g) if a' = c',
 ii.
Loss(G, S', g) = Loss(G, S", g) + 1 if b' ≤ _{
N
}
x <_{
N
}
b",
 iii.
Loss(G, S', g) = Loss(G, S", g) if x is such that x ∈ V (N
_{
b"}) but not such that b' ≤ _{
N
}
x <_{
N
}
a',
 iv.
Loss(G, S', g) = Loss(G, SPR
_{
N
} (v, a'), g) if c' ≤_{
N
}
x <_{
N
}
a', and,
 v.
Loss(G, S', g) = Loss(G, SPR
_{
N
} (v, a'), g) 1 otherwise.
 (b)
If a' =
b',
then,
 i.
Loss(G, S', g) = Loss(G, SPR
_{
N
} (v, a'), g) if c' ≤ _{
N
}
x <_{
N
}
a', and,
 ii.
Loss(G, S', g) = Loss(G, SPR
_{
N
} (v, a'), g) 1 otherwise.
Lemma 7.
Let g and g'
be blue nodes and c be a red node, x ∈
V (
N
_{
u
}),
and let a' = ℳ
_{Γ, N
}(
g).
 1.
If S' = SPR
_{
N
} (v, x) where x <_{
N
}
a', and S" = SPR
_{
N
} (v, pa(x)), then Loss(G, S', g) = Loss(G, S", g) + 1.
 2.
If S' =
SPR
_{
N
} (
v,
x)
where x ≮
_{
N
}
a',
then,
 (a)
Loss(G, S', g) = Loss(G, N, g) if a' ≤_{
N
}
x ≤_{
N
}
u, and,
 (b)
Loss(G, S', g) = Loss(G, S", g) + 1 for S" = SPR
_{
N
} (v, pa(x)) otherwise.
Lemma 8.
Let g be blue, g'
be red, and g'' be green. Let x ∈
V (
N
_{
u
})\{
u}
and a' = ℳ
_{Γ, N
} (
g).
 1.
If S' =
SPR
_{
N
} (
v,
x)
where x ≮
_{
N
}
a',
and S" =
SPR
_{
N
} (
v,
pa(
x)),
then,
 (a)
Loss(G, S', g) = Loss(G, S", g)  1 if a' ≤ _{
N
}
x <_{
N
}
u,
 (b)
Loss(G, S', g) = Loss(G, S", g) if a' ≤ _{
N
}
pa(x) <_{
N
}
u but x is not such that a' ≤_{
N
}
x <_{
N
}
u, and,
 (c)
Loss(G, S', g) = Loss(G, S", g) + 1 otherwise.
 2.
If S' =
SPR
_{
N
} (
v,
x)
where x <
_{
N
}
a',
and S" =
SPR
_{
N
} (
v,
pa(
x)),
then,
 (a)
Loss(G, S', g) = Loss(G, SPR
_{
N
} (v, a'), g) + 2 if x ∈ Ch
_{
N
} (a'), and,
 (b)
Loss(G, S', g) = Loss(G, S", g) + 1 otherwise.
The algorithm
Observe that SPR
_{
N
} (v) = {SPR
_{
N
} (v, s): s ∈ V (N
_{
u
})}. Therefore, the goal of our algorithm is to compute at each node s ∈ V (N
_{
u
}) the value Loss(G, S'), where S' = SPR
_{
N
} (v, s). To do this efficiently, we rely on the characterization of losses given in Lemmas 3 through 8.
The first step of the algorithm is to compute the value Loss(G, N). This "loss value" is assigned to the node u. To compute the loss value for the rest of the nodes our algorithm makes use of six different types of counters at each node in N
_{
u
}; we refer to these counters as counteri, for i ∈ {1,..., 6}. The reason for using these six counters is that the behavior of the loss values can be explained by using six types of patterns (captured by the six counters). These counters make it possible to efficiently compute the difference between the values Loss(G, N) and Loss(G, S'), where S' = SPR
_{
N
} (v, s), for each s ∈ V (N
_{
u
}). Next, we describe each of these six counters; throughout our description, s represents some node in N
_{
u
}.
counter1 If the value of counter1 is x at node s then this implies that the tree SPR
_{
N
} (v, s) incurs x additional losses over the value Loss(G, N).
counter2 If the value of counter2 is x at node s, then this implies that for each t ≤_{
N
}
s the tree SPR
_{
N
} (v, t) incurs an additional x losses over Loss(G, N).
counter3 If the value of counter3 is x at node s, then this implies that for each t≤
_{
N
}
s the tree SPR
_{
N
} (v, t) loses x losses from Loss(G, N).
counter4 If the value of counter4 is x at node s, then this implies that for each t ≤ _{
N
}
s the tree SPR
_{
N
} (v, t) incurs α
_{
t
} · x additional losses over Loss(G, N, where α
_{
t
} = d
_{
N
} (pa(s), t).
counter5 If the value of counter5 is x at node s, then it is equivalent to incrementing counter4 at the sibling of each node on the path u → _{
N
}
s, except at u, by x.
counter6 If the value of counter6 is x at node s, then it is equivalent to incrementing counter4 at both children (if they exist) of the sibling of each node along the path u → _{
N
}
s, except u, and incrementing counter3 at each node along the path u → _{
N
}
s, except at u, by x.
In the remainder of this section we first show how to compute the values of these counters, and then the final loss values, at each node in N
_{
u
}.
Computing the counters
We now describe how the values of the six counters are computed. Initially, each counter at each node in N
_{
u
} is set to 0. Consider any g ∈ I(G), and let g' and g" be its two children. Recall that node g must fall under one of the following six categories: 1) g is red, 2) g is green, 3) g, g', and g" are all blue, 4) g and g ' are blue, and g" is green, 5) g and g' are blue, and g" is red, or, 6) g is blue, g' is red, and g" is green.
Let a = ℳ_{
N
} (g), b = ℳ_{
N
} (g') and c = ℳ_{
N
} (g"). Also, whenever properly defined, let a' = ℳ_{Γ, N
} (g), b' = ℳ_{Γ, N
} (g') and c' = ℳ_{Γ, N
} (g"). Based on Lemmas 3 through 8, we now study how the six counters can be updated so as to capture the behavior of losses in each of these cases.
Case 1. By Lemma 3, we do nothing in this case.
Case 2. Based on Lemma 4, the contribution of any node g that satisfies the condition of case 2 can be captured by incrementing the value of counter1 by one at each node on paths a → _{
N
}
b and a → _{
N
}
b, except at node a.
Case 3. From Lemma 5 it follows that in this case the contribution of g to the loss value changes in a way that is captured by incrementing counter2 by 1, at each node, except a', on the paths a' → _{
N
}
b' and a'→ _{
N
}
c'.
Case 4. According to Lemma 6, if N
_{
v
} is regrafted on an edge of N
_{
u
} that is not in N
_{
a'
}, then the contribution of g to the loss cost is captured by incrementing counter3 by 1 at each node except u along the path u →_{
N
}
a', and at their siblings. If N
_{
v
} is regrafted on an edge of N
_{
u
} that is in N
_{
a'
} then there are two possible cases:
a' ≠ b': Recall that b" represents the child of a' along the path a' → _{
N
}
b'. Here, the contribution of g to the loss cost is captured by (i) incrementing counter3 by two at node b", (ii) incrementing counter2 by one at each node except b" along the path b" →_{
N
}
b'', (iii) incrementing counter3 by one at the sibling of b", and (iv) incrementing counter1 by one at each node except a' on the path a' → _{
N
}
c'.
a' = b': In this case, the contribution of g to the loss cost is captured by (i) incrementing counter3 by one at both children of a', and (ii) incrementing counter1 by one at each node except a' on the path a' → _{
N
}
c'.
Case 5. By Lemma 7, for this case, the change in the loss contribution of g is captured by incrementing counter5 by 1 at node a', and by incrementing counter4 by 1 at both children of a' in N.
Case 6. By Lemma 8, for this case, the change in the loss contribution of g is captured by incrementing counter6 by 1 at node a', and by incrementing counter4 and counter2 by 1 each at both children of a' in N.
Computing the final loss values
Our algorithm considers each internal node of gene tree G, one at a time, and updates the relevant counters at the relevant nodes in N
_{
u
}, as shown in the previous subsection. Then, based on the counters, it computes, at each node s ∈ V (N
_{
u
}) the value α (s) = Loss(G, S')  Loss(G, N), where S' = SPR
_{
N
} (v, s); this can be achieved by performing a constant number of pre and postorder traversals of N
_{
u
}. A final traversal of the tree now allows us to compute the value Loss(G, S') = a(s) + Loss(G, N) at each s ∈ V (N
_{
u
}). Due to space limitations, a more complete description of the algorithm is omitted from this manuscript (but is available in [29]).
For simplicity in stating the time complexity of our algorithm, we assume that all
have approximately the same size. Recall that n =  Le(S), m =  Le(S) +  Le(G) and
.
Lemma 9. The
SPRRS problem on the input instance ⟨{G}, S, v⟩ can be solved in O(m) time.
Remark on Lemma 9: Observe that in cases 2, 3 and 4 (from the previous subsection), handling each g might require updating the counters at Θ(n) nodes, yielding a total time complexity of O(nm) for these cases. However, it is still possible to obtain the O(m) time bound.
Thus, we have the following theorem.
Theorem 1. The
SPRRS and
SPRS problems can be solved in O(km) and O(kmn) time respectively.
The time complexity of the best known (naïve) solution for the SPRS problem is Θ (kmn
^{2}). Our algorithm improves on this by a factor of n.
Speedingup the TBR local search problem
Intuitively, a (rooted) TBR operation may be viewed as being like an SPR operation except that the TBR operation allows the pruned subtree to be arbitrarily rerooted before being regrafted. The TBRS problem is defined analogously to the SPRS problem.
Observe that there are Θ (n) different ways to select a subtree of S to be pruned. Furthermore, there are O(n) different ways to reroot the pruned subtree. The idea is to directly use the solution to the SPRRS problem to compute the duplication and loss costs for the O(n)cardinality subset of TBR
_{
S
} defined by any fixed pruned subtree and its fixed rooting. This yields the following theorem.
Theorem 2.
The
TBRS problem can be solved in O(kmn
^{2}) time.
This improves on the best known solution for the TBRS problem by a factor of n.
The deep coalescence cost model
Our algorithms for efficient SPR and TBR local searches for the duplicationloss model apply directly to the corresponding SPR and TBR local search problems for the deep coalescence model. This can be achieved using any of the following two methods. The first method is to make use of the result of Zhang [15] who showed showed that if G is a uniquely leaflabeled gene tree and S is a species tree such that Le(G) = Le(S), then the deep coalescence cost of G and S is equal to Loss(G, S)  2 Dup(G, S). Thus, our algorithms imply that we can compute the deep coalescence cost of each tree in the SPR (resp. TBR) neighborhood of S in O(n
^{2}) (resp. O(n
^{3})) time. The second method is to use the algorithm for computing losses, presented in this paper, and slightly modifying it to directly compute the deep coalescence cost. Owing to the similarity in the definition of the loss cost and the deep coalescence cost, this can be done in a straightforward manner (details omitted for brevity). Overall our algorithms yield speedups of a factor of n over the fastest current approaches for SPR and TBR local searches for the deep coalescence cost model.