Skip to main content

An efficient exact algorithm for computing all pairwise distances between reconciliations in the duplication-transfer-loss model

Abstract

Background

Maximum parsimony reconciliation in the duplication-transfer-loss model is widely used in studying the evolutionary histories of genes and species and in studying coevolution of parasites and their hosts and pairs of symbionts. While efficient algorithms are known for finding maximum parsimony reconciliations, the number of reconciliations can grow exponentially in the size of the trees. An understanding of the space of maximum parsimony reconciliations is necessary to determine whether a single reconciliation can adequately represent the space or whether multiple representative reconciliations are needed.

Results

We show that for any instance of the reconciliation problem, the distribution of pairwise distances can be computed exactly by an efficient polynomial-time algorithm with respect to several different distance metrics. We describe the algorithm, analyze its asymptotic worst-case running time, and demonstrate its utility and viability on a large biological dataset.

Conclusions

This result provides new insights into the structure of the space of maximum parsimony reconciliations. These insights are likely to be useful in the wide range of applications that employ reconciliation methods.

Background

Phylogenetic tree reconciliation is an important method for studying the evolutionary histories of pairs of taxa such as genes and species, parasites and their hosts, and pairs of symbiont species. Given two phylogenetic trees and the association of their extant taxa, the objective is to find an association of the two trees that best explains their incongruity using a biological model of evolutionary events. In the widely-used Duplication-Transfer-Loss (DTL) model, the events considered are contemporaneous speciation, duplication, host/horizontal gene transfer, and loss/extinction.

Reconciliation is typically performed in a maximum parsimony framework in which each type of event is assigned a non-negative cost and the objective is to find a mapping of one tree (e.g., the gene tree or parasite tree) onto the other tree (e.g., the species tree or host tree) that minimizes the total sum of the costs of the constituent events. A reconciliation of minimum cost is called a maximum parsimony reconciliation (MPR). Although reconciliation is also possible in probabilistic frameworks, the underlying algorithms are generally prohibitively slow and are particularly sensitive to the choice of their many parameters.

While a single maximum parsimony reconciliation can be found in polynomial time [13], the number of MPRs can grow exponentially with the size of the trees [1, 4]. For example, in a benchmark Tree of Life dataset with 100 primarily prokaryotic species [5], over 10% of gene families had more than 105 MPRs and some gene families induced over 10100 MPRs [1]. Moreover, the choice of event costs can have a significant impact on the space of MPRs [6]. Consequently, making inferences from a single MPR may lead to conclusions that are not supported, or even contradicted, by other MPRs.

It is important, therefore, to understand the structure and diversity of MPR space. For example, if MPRs are largely similar to one another, then a single MPR may suffice to make robustly supported conclusions. In other cases, MPR space may be so diverse that conclusions drawn from a single MPR, or even a sample of MPRs, may not be reliable.

A number of prior studies have explored the size and structure of MPR space. The number of MPRs can be computed in polynomial time as a byproduct of computing a maximum parsimony reconciliation [13]. Using a compact representation of MPR space called the reconciliation graph [7], Nguyen et al. [8] showed how a single “median” MPR can be computed in polynomial time. Ozdemir et al. [9] showed that this result can be generalized to find a set of k medoids or k centers that represent MPR space. Ma et al. [10] gave a polynomial-time 2-approximation algorithm for covering MPR space with a set of MPRs that, collectively, contain all of the events that arise in MPR space. Haack et al. [11] showed how the diameter of MPR space can be computed in polynomial time and demonstrated that, in many cases, MPR space is very diverse.

Recently, Huber et al. [12] proposed computing the distribution of pairwise distances between MPRs, with respect to a given distance metric, as a tool for obtaining a deeper understanding of MPR space than was previously possible. If, for example, the pairwise distances between MPRs tend to be small, then choosing a single MPR to represent that space may be justifiable. However, if the pairwise distances are large or the distribution of the distances is multimodal, then the conclusions drawn from a single MPR are likely to be less robust.

Since MPR space is large, the approach proposed by Huber et al. selects a sample of MPRs and iteratively computes their pairwise distances. Because the number of pairs grows quadratically with the sample size, this approach is only viable for small sample sizes, providing a potentially coarse approximation of the true distribution. The problem of whether the exact distribution of pairwise distances across the entire space of MPRs can be computed efficiently was left open [12].

We solve this problem by showing how the pairwise distances between all pairs of MPRs can be computed exactly, without sampling, in time polynomial in the size of the trees. For concreteness, our presentation uses the symmetric distance metric [8] which measures the distance between two reconciliations as the number of events that are found in one reconciliation or the other, but not both. We show that our results are easily extendible to other distance metrics as well. Importantly, the asymptotic running time is not a function of the number of MPRs. In practice, our algorithm computes this distribution in seconds, even for problem instances inducing over 10100 MPRs.

Using a Tree of Life dataset [5], we show that the distributions of pairwise distances can vary dramatically across problem instances and are also sensitive to the event costs. We believe that the PairTree software that accompanies this paper will provide an important tool when performing analyses using DTL maximum parsimony reconciliation.

In summary, the contributions of this paper are:

  1. 1

    An efficient polynomial-time algorithm for computing the pairwise distances between all MPRs for any instance of the DTL reconciliation problem;

  2. 2

    Experimental results that demonstrate both the speed of the algorithm on real datasets and its utility in obtaining new insights into the space of MPRs; and

  3. 3

    A Python implementation of our algorithm in the PairTree package (www.cs.hmc.edu/~hadas/pairtree).

Maximum parsimony reconciliations

A DTL-MPR instance is a 6-tuple (S,G,ϕ,d,t,) where S=(VS,ES) and G=(VG,EG) are binary trees, ϕ is a mapping from the leaves of G to the leaves of S (the mapping need not be one-to-one nor onto), and d, t, and are non-negative costs corresponding to duplication, transfer, and loss events, respectively, which are explained below. We assume that the trees are undated, but all results in this paper can be easily adapted to dated trees as well.

A reconciliation mapping for a given instance is a mapping Φ from the vertices of G to the vertices of S such that Φ(g)=ϕ(g) for each leaf g of G and, if g is an internal vertex of G with children g and g′′, then (1) Φ(g) cannot be a descendant of either Φ(g) or Φ(g′′) and (2) at least one of Φ(g) or Φ(g′′) is equal to or a descendant of Φ(g).

A reconciliation mapping induces four types of events: Each internal vertex gVG induces exactly one speciation, duplication, or transfer event and zero or more loss events. For an internal gene tree vertex g, with children g and g′′, the events induced by Φ are as follows:

  • Speciation event: Vertex g induces a speciation event if one of Φ(g) and Φ(g′′) is in the left subtree and the other is in the right subtree of Φ(g).

  • Duplication event: Vertex g induces a duplication event if each of Φ(g) and Φ(g′′) is either equal to or a descendant of Φ(g) but does not satisfy the requirements for a speciation event.

  • Transfer event: Vertex g induces a transfer event if exactly one of Φ(g) and Φ(g′′) is either equal to or a descendant of Φ(g) and the other is neither an ancestor nor a descendant of Φ(g).

  • Loss events: Each non-root vertex g (including leaf vertices) may induce zero or more loss events as follows: Let p(g) denote the parent of g in tree G. If Φ(p(g)) is ancestral to Φ(g), then each species vertex s on the path from Φ(p(g)) to Φ(g) induces a loss event, except for Φ(g) and also not Φ(p(g)) if p(g) induces a speciation event. For each loss induced by a vertex s on the path from Φ(p(g)) to Φ(g), we say that gpasses through s.

The cost of a reconciliation mapping is the sum of the number of events of each type, weighted by their event costs. Speciations are generally considered null events and thus have cost zero. A reconciliation mapping of minimum cost is called a maximum parsimony reconciliation (MPR). Figure 1a shows an example of a DTL-MPR instance and Fig. 1b, c shows two different MPRs for that instance using duplication, transfer, and loss costs of 1, 4, and 1, respectively.

Fig. 1
figure 1

DTL reconciliation. a An instance of the DTL reconciliation problem comprising a species tree (black), a gene tree (gray), and a leaf mapping. Duplication, transfer and loss costs are 1, 4, and 1, respectively. b and c Two different MPRs, each with total cost 4. d The associated reconciliation graph. Mapping nodes are indicated with double line borders. Event nodes are designated with \(\mathbb {S}\) (speciation event), \(\mathbb {D}\) (duplication event), \(\mathbb {T}\) (transfer event), or \(\mathbb {L}\) (loss event). The reconciliation traversal indicated by solid edges corresponds to the MPR in (b) and the reconciliation traversal indicated by dashed edges corresponds to the MPR in (c); bold edges indicate shared elements of the two MPRs. Figure adapted from Haack et. al [11] with permission

An MPR can be found in time O(|G||S|) [1, 13], although the problem becomes NP-complete if the reconciliation is required to be temporally feasible in the sense that there exists a total ordering of the events such that an event involving a gene vertex g comes earlier in the ordering than any event involving a descendant of g. Fortunately, temporal infeasiblity can be detected when it occurs [13, 14] and experimental results suggest that it is not common [14, 15].

Reconciliation graphs and traversals

The space of all MPRs can be compactly represented using a reconciliation graph (Fig. 1d). This representation was originally developed by Scornavacca et al. [7] for dated trees and later modified and adapted for undated trees [10]. For completeness, this representation is summarized below.

Consider a DTL-MPR instance (S,G,ϕ,d,t,). Let Φ denote the set of all MPRs for this instance. For a gene vertex g, let the children of g be denoted by g and g′′. Then, events(g,s) is the set of the following tuples induced by each MPR ΦΦ:

  • \((\mathbb {S}_{(g,s)}, \{(g^{\prime }, s^{\prime }), (g^{\prime \prime }, s^{\prime \prime })\})\) for each speciation in which g is mapped to s, g is mapped to s or one of its descendants, and g′′ is mapped to s′′ or one of its descendants, where s and s′′ denote the children of s;

  • \((\mathbb {D}_{(g,s)}, \{(g^{\prime }, s), (g^{\prime \prime }, s)\})\) for each duplication in which g is mapped to s.

  • \((\mathbb {T}_{(g,s)}, \{(g^{\prime }, s), (g^{\prime \prime }, \hat {s})\})\) for each transfer in which g is mapped to s and one child, wlog g′′, is mapped to a vertex \(\hat {s}\) that is not ancestrally related to s;

  • \((\mathbb {L}_{(g,s)}, \{(g, s^{\prime })\})\) for each loss in which g passes through s, and s is the vertex that follows s on the path from Φ(p(g)) to Φ(g); and

  • \((\mathbb {C}_{(g,s)}, \varnothing)\) for a contemporaneous leaf association where g and s are leaves and ϕ(g)=s.

We make several observations about this tuple representation of events. First, if g is mapped to s as a speciation event, the children of g, denoted g and g′′, are mapped to descendents of s. However, the speciation event is represented by associating g with one child of s (denoted s) and associating g′′ with the other child of s (denoted s′′). Loss events are introduced for each loss incurred as g (or g′′) passes through species vertices on the path from s (or s′′) to Φ(g) (or Φ(g′′)). Similarly, for a duplication event in which g is mapped to s, the children of g may be mapped to s or descendants of s. However, the duplication event is represented by associating both g and g′′ with s and then loss events are introduced for each loss on the path from s to Φ(g) and on the path from s to Φ(g′′). Finally, if g is mapped to s as a transfer event, then one child of g, wlog g, is mapped to g or one of its descendants while the other child, g′′ is mapped to a vertex \(\hat {s}\) that is not ancestrally related to s. The transfer event is represented by associating g with s (and associating g′′ with \(\hat {s}\)); loss events are introduced for each loss on the path from s to Φ(g).

For each such tuple e, let type(e) denote its first element, namely the event type and the ordered pair (g,s), and let associations(e) denote its second element, namely a set of zero or more ordered pairs. Note that if e corresponds to a speciation, duplication, or transfer event, then associations(e) is a set containing two ordered pairs, each representing an association between a gene tree vertex and a species tree vertex. If e is a loss event, then associations(e) is a set containing one such ordered pair indicating where the loss is incurred.

Reconciliation graph

The reconciliation graph contains a mapping node for each (g,s) pair where g is mapped to s in some MPR and, if not already included, a node (g,s) is also introduced if g passes through s due to a loss event. The reconciliation graph also contains an event node corresponding to each tuple in events(g,s). There is a directed edge from each mapping node (g,s) to each event node in events(g,s) and a directed edge from each event node e to a mapping node corresponding to an ordered pair in associations(e).(Throughout this paper, we use the term vertex for an element of the gene or species tree and the term node for an element of the reconciliation graph.)

The representation is compact by merit of the fact that, while a mapping (g,s) and its events may arise in many different MPRs, they are shared in this graph representation. Therefore, the size of the reconciliation graph is easily seen to be polynomial in the size of the two trees.

Ma et al. give a formal description of the algorithm for constructing undated reconciliation graphs, a derivation of its O(|G||S|2) running, and show that undated reconciliation graphs are acyclic [10]. Figure 1d shows the reconciliation graph for the DTL-MPR instance in Fig. 1a when duplication and loss have cost one and transfer has cost four.

Reconciliation traversal

Next, we define reconciliation traversals, which correspond to MPRs. Let \(\mathbf {sources}(\mathcal {R})\) denote the set of source nodes of reconcilation graph \(\mathcal {R}\) which, by definition, are mapping nodes of the form (rg,·) where rg represents the root of tree G.

For a reconciliation graph \(\mathcal {R}\), a reconciliation traversal (abbreviated as traversal) is a subgraph of \(\mathcal {R}\) whose root is a mapping node in \(\mathbf {sources}(\mathcal {R})\). Each non-leaf mapping node added to the traversal has exactly one of its event node children added to the traversal. Each event node added to the traversal has all of its mapping node children added to the traversal. Figure 1d shows two traversals corresponding to the two MPRs in Fig. 1b, c.

There is a straightforward bijection between the set of MPRs and the set of traversals in the reconciliation graph [10]. A traversal, in turn, can be represented as the set of event nodes that it comprises. Thus, we may represent an MPR as the set of event nodes in the corresponding traversal. For an MPR R, let E(R) denote the set of event nodes in that reconciliation.

The PDV Algorithm described in the next section computes the pairwise distances using a dynamic programming formulation that operates on increasingly larger subgraphs of the reconciliation graph. For any mapping node (g,s) in the reconciliation graph, the subgraph of all nodes reachable from (g,s) is called the reconciliation subgraph rooted at (g,s).

The definition of a traversal is also generalized to begin at any mapping node (g,s) in the reconciliation graph. This is called a reconciliation subtraversal rooted at (g,s). Therefore, a traversal is a subtraversal whose root is in \(\mathbf {sources}(\mathcal {R})\).

Distance

Let XY denote the symmetric set difference of sets X and Y, namely the set of elements that are found in X or Y but not both. Given two reconciliations R1 and R2, the symmetric set distance between them, denoted d(R1,R2) is defined to be |E(R1)E(R2)|. It is easily verified that the symmetric set distance satisfies the requirements of a distance metric.

This distance metric has been used to measure the distance between reconciliations in prior work [79, 11, 16] and has been shown to have a number of desirable properties [8, 16]. Other distance metrics are also possible [12, 16, 17] and in a later section we show how our algorithm can be applied to several of those metrics. Henceforth, unless specified otherwise, we use the term “distance” to mean symmetric set distance.

Vector operations

Throughout this work, vectors are understood to be over the non-negative integers. For a vector v, let v[i] denote the element at index i. We use several standard operations on vectors. Given two vectors u and v, the sum of u and v, denoted u+v is defined by (u+v)[i]=u[i]+v[i]. The convolution of u and v, denoted uv is defined by:

$$(u \ast v) [i] = \sum_{j = 0}^{i} u[j] v[i-j] $$

Given a vector v, σj, the j-place shift of v is defined by:

$$\sigma^{j}(v)[i] = \left\{ \begin{array}{lr} v[i-j] & : i \geq j \\ 0 & : i < j \end{array} \right. $$

Henceforth, we use the shorthand σ for σ1. Note also that σ0(v)=v. Finally, for a vector v and positive integer c, the scale of v by c is defined by (cv)[i]=c×v[i].

Computing pairwise distances

For a given DTL-MPR instance I=(S,G,ϕ,d,t,), let \({\mathcal {S}}(I)\) denote the set of all MPRs. The pairwise distance vector is the vector v such that v[i] denotes the number of pairs of MPRs in \({\mathcal {S}}(I)\) whose distance is exactly i. Note that v[0] is exactly the number of MPRs in \({\mathcal {S}}(I)\) since two MPRs R1 and R2 are at distance 0 if and only if R1=R2. Moreover, the maximum value of i such that v[i]>0 is the maximum distance between any two MPRs, namely the diameter of the space. For convenience in describing the algorithm, we treat these vectors as having infinite dimension. We later establish a bound on the diameter of the space which allows the implementation to use vectors of finite dimension.

The Algorithm

To compute the pairwise distance vector, we first compute the reconciliation graph as described in the previous section. Consider a DTL-MPR instance I=(S,G,ϕ,d,t,) and its reconciliation graph \(\mathcal {R}\). Let PDVg(s1,s2) denote the vector v such that v[i] counts the number of pairs of subtraversals, one rooted at (g,s1) and the other rooted at (g,s2), that differ in exactly i event nodes. Note that s1 and s2 need not be distinct. Since some (g,s) pairs may not be present as mapping nodes in \(\mathcal {R}\), we define PDVg(s1,s2) to be the zero vector if either (g,s1) or \((g, s_{2}) \not \in \mathcal {R}\).

Recall that rg denotes the root of G and each element of \(\mathbf {sources}(\mathcal {R})\) is a mapping node of the form (rg,·). Since every traversal, and thus every MPR, must begin at an element of \(\mathbf {sources}(\mathcal {R})\), the pairwise distance vector for the entire MPR space is:

$$v = \ \underset{(rg, s_{1}), (rg, s_{2}) \in \mathbf{sources}(\mathcal{R})}{\sum} \mathbf{PDV}_{rg}(s_{1}, s_{2}) $$

where the summation represents the vector sum.

We describe a recursive formulation for PDVg(s1,s2) which allows us to compute these values efficiently via dynamic programming (DP). In fact, our approach uses mutual recursion which results in multiple DP tables. In addition to the PDV table, the two “helper” tables are called bothSDT and topSDT. We begin by describing the bothSDT table.

The bothSDT table

A speciation, duplication, or transfer event is called an SDT event to distinguish it from a loss event. We define bothSDTg(s1,s2) as the vector of pairwise distances for the subtraversals rooted at mapping nodes (g,s1) and (g,s2) assuming that the first event nodes in both subtraversals are not loss events.

For the base case, if g is a leaf of the gene tree, and s1=s2, then let s=s1=s2. If ϕ(g)=s, then:

$$\begin{array}{@{}rcl@{}} \mathbf{bothSDT}_{g}(s, s)[0] &=& 1 \\ \mathbf{bothSDT}_{g}(s, s)[i] &=& 0 : i \geq 1 \end{array} $$

This represents the fact that there is a single subtraversal rooted at (g,s), namely the subtraversal that maps g to s and its distance to itself is 0. If ϕ(g)≠s or s1s2 then bothSDTg(s1,s2) is the zero vector since there are no valid pairs of subtraversals rooted at (g,s1) and (g,s2) in those cases.

If g is not a leaf, then let the children of g be denoted g,g′′ and let SDTevents(g,s) denote the set of speciation, duplication, or transfer (not loss) event node children of the mapping node (g,s). If eSDTevents(g,s), the children of e will be mapping nodes of the form (g,κ) and (g′′,τ). Let \(\mathbf {spec}_{{g^{\prime }}}({e}) = \kappa \) and \(\mathbf {spec}_{{g^{\prime \prime }}}({e}) = \tau \).

Our objective is to count the number of pairs of subtraversals rooted at (g,s1) and (g,s2) that begin with SDT events. (Recall that s1 and s2 may be equal.) Let e1 and e2 denote such a pair of SDT events. Note that if e1e2, then these events contribute two to the distance between the subtraversals. If e1=e2 (which can only occur if s1=s2), then these events do not contribute to the pairwise distance.

Consider a given pair of subtraversals (A,A′′) rooted at \((g^{\prime }, \mathbf {spec}_{{g^{\prime }}}({e_{1}}))\) and \((g^{\prime \prime }, \mathbf {spec}_{{g^{\prime \prime }}}({e_{1}}))\), respectively, and another pair (B,B′′) rooted at \((g^{\prime }, \mathbf {spec}_{{g^{\prime }}}({e_{2}}))\) and \((g^{\prime \prime }, \mathbf {spec}_{{g^{\prime \prime }}}({e_{2}}))\), respectively. In the treatment below, we begin by assuming that AB and A′′B′′ and later correct for the cases when one or both pairs may be equal.

To construct a pair of subtraversals rooted at (g,s1) and (g,s2) that result from the events e1 and e2 and the subtraversals A,A′′,B,B′′, we first choose a subtraversal for each of event e1’s two children. Note that A and A′′ can always be chosen, but if, for example, \(\mathbf {spec}_{{g^{\prime }}}({e_{1}}) = \mathbf {spec}_{{g^{\prime }}}({e_{2}})\), then we can also choose e1 with B and A′′. In any case, the second subtraversal, which involves e2, is uniquely induced by the first subtraversal. If e1 with A and A′′ are chosen for the first subtraversal then e2 with B and B′′ comprise the second subtraversal and if e1 with B and A′′ are chosen for the first subtraversal then e2 with A and B′′ comprise the second subtraversal.

Since the relationships of e1 and e2 and their children dictate what choices of subtraversals are possible, we now enumerate the possible cases. In each case, we define the value of a function choices(e1,e2) which is used later.

Case 1: e 1e 2 and their four mapping node children are all distinct.

In this case, exactly one pair of subtraversals is induced: e1,A,A′′ with e2,B,B′′. Thus, choices(e1,e2)=1.

Case 2: e 1e 2 and exactly one mapping node child is shared.

In this case, two pairs of subtraversals are induced. Without loss of generality, assume that \(\mathbf {spec}_{{g^{\prime }}}({e_{1}}) = \mathbf {spec}_{{g^{\prime }}}({e_{2}})\). Then, the two pairs of subtraversals are e1,A,A′′ with e2,B,B′′ and e1,B,A′′ with e2,A,B′′. Thus, choices(e1,e2)=2.

Case 3: e 1e 2 and both pairs of mapping node children are shared.

In this case, four pairs of subtraversals are induced: e1,A,A′′ with e2,B,B′′, e1,A,B′′ with e2,B,A′′, e1,B,A′′ with e2,A,B′′, and e1,B,B′′ with e2,A,A′′. Thus, choices(e1,e2)=4.

Case 4: e 1=e 2.

In this case, two pairs of subtraversals are induced. Letting e=e1=e2, they are e,A,A′′ with e,B,B′′ and e,A,B′′ with e,B,A′′. Thus, choices(e1,e2)=2.

Recall that the ith element of vector bothSDTg(s1,s2) denotes the number of pairs of subtraversals, one rooted at (g,s1) and the other rooted at (g,s2), that begin with SDT events and whose distance is exactly i. Let \(u = \mathbf {PDV}_{g^{\prime }}(\mathbf {spec}_{{g^{\prime }}}({e_{1}}), \mathbf {spec}_{{g^{\prime }}}({e_{2}}))\) and \(v = \mathbf {PDV}_{g^{\prime \prime }}(\mathbf {spec}_{{g^{\prime \prime }}}({e_{1}}), \mathbf {spec}_{{g^{\prime \prime }}}({e_{2}}))\). Observe that \(\sum _{k=0}^{j} \mathbf {choices}(e_{1}, e_{2}) \times u[k] \times v[j-k]\) counts the number of pairs of subtraversals rooted at (g,s1) and (g,s2), using e1 and e2, that differ in exactly j events, without the distance contribution of e1,e2 if e1e2.

The above case analysis, and thus the value of choices(e1,e2), assumes that AB and A′′B′′. If one of A=B or A′′=B′′ then we have overcounted by a factor of two and must adjust accordingly. If both A=B and A′′=B′′, we must divide the count by a factor of four. Note that A=B and A′′=B′′ correspond to a pair of subtraversals counted in u[0] and v[0], respectively.

These observations allow us to compute bothSDTg(s1,s2) using vector convolution, addition, and scaling. First, for a given vector u, define u0 to be the vector that is identical to u at index 0 and is zero at all other indices. Let u−0 denote the vector that is zero at index 0 and is identical to u at all other indices. Letting c=choices(e1,e2), define the scaled convolution operator c as:

$$ {}u \ast_{c} v = c (u_{-0} \ast v_{-0}) + \left\lceil \frac{c}{2} \right\rceil (u_{0} \ast v_{-0} + u_{-0} \ast v_{0}) + \left\lceil \frac{c}{4} \right\rceil (u_{0} \ast v_{0}) $$
(1)

Note that the term c(u−0v−0) accounts for the case that AB and A′′B′′, the next term accounts for the case that exactly one of A=B or A′′=B′′, and the last term accounts for the case that A=A′′ and B=B′′. The ceilings \(\left \lceil \frac {c}{2} \right \rceil \) and \(\left \lceil \frac {c}{4} \right \rceil \) are used so that the corresponding distance vectors are not omitted when c is either 1 or 2.

For two events e1,e2 we define δ(e1,e2)=1 if e1e2 and 0 otherwise, we can now compute bothSDTg(s1,s2) by summing Eq. 1 over all event pairs e1,e2, and shifting the distance vector by 2 if e1e2 to account for the difference in those two events:

$$\begin{array}{@{}rcl@{}} \mathbf{bothSDT}_{g}(s_{1}, s_{2}) & = & \\ & & {\underset{\substack{e_{1} \in \mathbf{SDTevents}(g,s_{1})\\ e_{2} \in \mathbf{SDTevents}(g, s_{2})\\k=\delta(e_{1},e_{2})}}{\sum} \sigma^{2k}(} \\ & & {\mathbf{PDV}_{g^{\prime}}(\mathbf{spec}_{{g^{\prime}}}({e_{1}}), \mathbf{spec}_{{g^{\prime}}}({e_{2}}))} \\ & & {\ast_{\mathbf{choices}(e_{1}, e_{2})}} \\ & & {\mathbf{PDV}_{g^{\prime\prime}}(\mathbf{spec}_{{g^{\prime\prime}}}({e_{1}}), \mathbf{spec}_{{g^{\prime\prime}}}({e_{2}})))} \end{array} $$

The PDV Table

For a given gene vertex g and pair of species vertices s1,s2, the computation of PDVg(s1,s2) depends on the relationship between s1 and s2. There are three possible cases: (1) s1 and s2 are not ancestrally related, (2) s1=s2, and (3) s1 is ancestral to s2 (or vice versa). If s1 and s2 are not ancestrally related, then

$$\begin{array}{@{}rcl@{}} \mathbf{PDV}_{g}(s_{1}, s_{2}) & = & \mathbf{bothSDT}_{g}(s_{1}, s_{2}) \ + \\ & & \underset{t_{1}\in\mathbf{losses}(s_{1})}{\sum} \sigma(\mathbf{PDV}_{g}(t_{1}, s_{2})) \ + \\ & & \underset{t_{2}\in\mathbf{losses}(s_{2})}{\sum} \sigma(\mathbf{PDV}_{g}(s_{1}, t_{2})) \ - \\ & & \underset{\substack{t_{1} \in \mathbf{losses}(s_{1}) \\ t_{2} \in \mathbf{losses}(s_{2})}}{\sum} \sigma^{2}(\mathbf{PDV}_{g}(t_{1}, t_{2})) \end{array} $$

where losses(g,s) denotes the set of children s of s such that there exists a mapping node (g,s) reachable from (g,s) through a single loss event. The first term is the pairwise distance vector for the case that both subtraversals begin with an SDT event. Otherwise, at least one subtraversal begins with a loss. The second term is the pairwise distance vector for the case that the subtraversal rooted at (g,s1) begins with a loss and the third is for the case that the subtraversal rooted at (g,s2) begins with a loss. Those vectors are shifted by one since one subtraversal incurs a loss that the other does not. The last term accounts for the overcounting that occurs when both (g,s1) and (g,s2) begin with losses.

If s1=s2 or s1 and s2 are ancestrally related (in which case we assume WLOG that s1 is ancestral to s2), a special case arises that requires a third DP table in order to avoid overcounting. This table, called topSDTg(s1,s2), computes the distance vector for the case that mapping node (g,s1) begins with an SDT event but (g,s2) begins with either an SDT or a loss event:

$$\begin{array}{@{}rcl@{}} \mathbf{topSDT}_{g}(s_{1}, s_{2}) & = & \mathbf{bothSDT}_{g}(s_{1}, s_{2}) + \\ & & \underset{{t \in \mathbf{losses}(s_{2})}}{\sum}\!\!\!\!\!\!\! \sigma(\mathbf{topSDT}(s_{1}, t)) \end{array} $$

The first term accounts for pairs of subtraversals that both begin with SDT events, and the second term accounts for pairs where the subtraversal rooted at (g,s2) incurs a loss but the subtraversal rooted at (g,s1) begins with an SDT event. The shift in that term accounts for a contribution of one to the distance due to that loss event.

Now, if s1=s2, then let s=s1=s2. In this case:

$$\begin{array}{@{}rcl@{}} \mathbf{PDV}_{g}(s, s) & = & \mathbf{bothSDT}_{g}(s, s) + \\ & & \underset{t \in \mathbf{losses}(s)}{\sum}\!\!\!\! \sigma(\mathbf{topSDT}(s, t)) + \\ & & \underset{\substack{t_{1} \in \mathbf{losses}(s)\\ t_{2} \in \mathbf{losses}(s) \\ k = \delta(t_{1}, t_{2}))}}{\sum}\!\!\!\!\! \sigma^{2k}(\mathbf{PDV}_{g}(t_{1}, t_{2})) \end{array} $$

The first term is the pairwise distance vector when both subtraversals begin with an SDT event. Otherwise, either one or both subtraversals begin with a loss. The second term accounts for pairs where one incurs a loss and the other begins with an SDT event. The third term is for pairs where both begin with a loss. If those losses are identical, they do not contribute to the distance and no shift is required. If those losses are distinct, they contribute two to the distance which results in shifting the distance vector by two.

Next, we consider the remaining case that s1 is ancestral to s2. This case is divided into two subcases. If s2 is not a child of s1 then:

$${}\mathbf{PDV}_{g}(s_{1}, s_{2}) = \mathbf{topSDT}_{g}(s_{1}, s_{2}) \ + \! \underset{t \in \mathbf{losses}(s_{1})}{\sum} \sigma(\mathbf{PDV}_{g}(t, s_{2})) $$

The first term accounts for the case that the subtraversal rooted at (g,s1) begins with an SDT event while the second term accounts for the case that (g,s1) begins with a loss; the shift in the summation accounts for a contribution of one to the distance due to that loss event.

The other subcase is that s2 is a child of s1. In this case, the subtraversal rooted at (g,s1) either begins with an SDT event, a loss event to mapping node (g,s2), or a loss event to a mapping node (g,t) where ts2. If it begins with a loss event e to mapping node (g,s2), the vector PDVg(s2,s2) undercounts the number of resulting pairs of subtraversals because a pair of subtraversals A,B that are both rooted at (g,s2) induce two pairs of subtraversals rooted at (g,s1) and (g,s2) respectively when AB: One pair is e,A with B and the other is e,B with A. Each pair A,B such that AB corresponds to an entry in PDVg(s2,s2) at index greater than 0. Therefore, we must increase those counts by a factor of two by introducing a vector operator ρ(v)=2vv0. Now we have:

$$\begin{array}{@{}rcl@{}} \mathbf{PDV}_{g}(s_{1}, s_{2}) & = & \mathbf{topSDT}_{g}(s_{1}, s_{2}) + \\ & & \sigma(\rho(\mathbf{PDV}_{g}(s_{2}, s_{2}))) + \\ & & \underset{\substack{t \in \mathbf{losses}(s_{1})\\ t \neq s_{2}}}{\sum} \sigma(\mathbf{PDV}_{g}(t, s_{2})) \end{array} $$

Finally, the recursively defined tables PDV, topSDT, and bothSDT are computed via dynamic programming by traversing the reconciliation graph in postorder (i.e., bottom up) so that when computing each table entry, the entries that are required for that computation have already been computed and saved.

Asymptotic runtime analysis

Given a DTL-MPR instance (S=(VS,ES),G=(VG,EG),ϕ,d,t,), let n=|VS| and let m=|VG|. The reconciliation graph can be constructed in time O(mn2) [10]. In order to analyze the running time of computing the dynamic programming tables PDV, topSDT, and bothSDT used by the PDV Algorithm, we first show that the number of events in an MPR is bounded by O(m). From this result, it follows that the diameter is bounded by O(m) since, in the worst case, two MPRs could differ on every event.

Lemma 1

The number of events in an MPR is bounded by O(m).

Proof

First, note that a valid (although not necessarily minimum cost) reconciliation can be constructed by mapping rg to any leaf of S then, using only transfer events, the remainder of G can be embedded on the leaves of S, resulting in a reconciliation that incurs a cost of t for each of the internal vertices of G. Since t is some fixed positive constant, the total cost of the reconciliation is bounded by mt. This establishes a cost bound of O(m) on an MPR. Each internal vertex of G induces exactly one speciation, duplication, or transfer event and thus the total number of such events is O(m). Since an upper bound on the cost of an MPR is tm, and the loss cost is strictly positive, it follows that the total cost incurred by losses in an MPR is bounded by \(\frac {mt}{\ell }\). Since m and are positive constants, the total number of losses is therefore bounded by O(m). Therefore, the total number of events in an MPR is bounded by O(m). □

It follows that the diameter of MPR space for the symmetric set distance metric is bounded by O(m) and thus all of the vectors maintained by the PDV Algorithm can be treated as having dimension O(m). Therefore, all vector addition, shift, and scale operations can be performed in O(m) time while convolution can be performed in time O(m logm) using the Discrete Fast Fourier Transform [18].

Lemma 2

For any pair (g,s), |SDTevents(g,s)|O(n).

Proof

There can be at most two speciation event nodes for (g,s) since g and g′′ can be associated with s and s′′ in two ways and there can be only one duplication event node. There can be O(n) transfer event nodes since one of g or g′′ must remain on s and the other can be transferred to at most one of n vertices of tree S. Thus, the mapping node (g,s) has a number of event node children bounded by O(n). □

Lemma 3

The PDV algorithm has worst-case running time bounded by O(n4m2 logm).

Proof

Computing bothSDTg(s1.s2) requires considering all pairs of SDT events in SDTevents(g,s1) and SDTevents(g,s2) and, for each combination, the dominant cost is due to the convolution. By Lemma 2, the number of pairs is bounded by O(n2) and, as a corollary of Lemma 1, the vector convolutions can be computed in time O(m logm). Thus, the total cost of computing bothSDTg(s1,s2) is bounded by O(n2m logm). Since there are O(n2m) entries in the bothSDT table, the total cost is bounded by O(n4m2 logm). It is easily verified that all of the other table entries require asymptotically less time, and thus the total running time of the algorithm is bounded by O(n4m2 logm). □

Application to other distance metrics

The PDV algorithm was described for the symmetric distance metric but can be applied to other distance metrics as well. For example, in the path distance metric [12, 16, 17], the distance between two reconciliations Φ1 and Φ2 is defined to be \(\sum _{g \in G} d_{S}(\Phi _{1}(g), \Phi _{2}(g))\) where dS(s1,s2) is the distance between vertices s1 and s2 in the species tree, S. The modification to the PDV algorithm for this metric simply replaces the 2k shift in the bothSDT computation by a shift equal to dS(s1,s2) to account for the distance incurred by mapping g to s1 in one MPR and mapping g to s2 in another. The maximum diameter under this distance metric is O(mn) and thus the vector operations now take time O(mn logmn). Therefore, the asymptotic running time for the PDV algorithm for this distance metric is O(n5m2 logmn). Similarly, the PDV algorithm can be easily adapted for other distance metrics such as the binary discrete distance metric [12].

Experimental results

We implemented the PDV Algorithm in Python and validated the implementation by comparing the results to those found by a brute-force solver for all pairs of phylogenetic trees and leaf associations with up to six leaves and with a large sample of trees with up to 10 leaves. Our implementation, called PairTree, is available at www.cs.hmc.edu/ hadas/pairtree.

We tested our code on a widely-used Tree of Life dataset comprising 100 primarily prokaryotic species and 4849 gene trees [5] using duplication, transfer, and loss costs of (2,3,1), (1,2,1), and (1,1,1) since these costs are frequently used in the literature [5, 19]. Although event costs (1,1,1) are perhaps the least likely to be biologically realistic, they induce very large reconciliation graphs which was useful for performance evaluation. Since the motivation for computing pairwise distance vectors is to understand large MPR spaces, we only considered gene families that induced at least 104 MPRs.

Runtime

We used a commodity server (AMD Opteron 6276 2.3 GHz, 503 GB RAM) for all of our experiments and running times are summarized in Table 1. Event costs (1,1,1) induce relatively large reconciliation graphs since all events have equal cost. In particular, an n2 factor in the O(n4m2 logm) worst-case analysis derived previously is due to an upper-bound of O(n) on the number of events associated with a given mapping node. In practice, most reconciliation graphs have relatively few events associated with each mapping node. However, there are cases where the number of event nodes can approach this bound for event costs (1,1,1) where all events have the same cost.

Table 1 Running time and average normalized distance with and without loss events

Distance statistics

We computed statistics on all pairwise distances across all gene trees that induced over 104 MPRs. The normalized distance between a pair of MPRs is defined to be their distance divided by twice the number of internal vertices in the gene tree [11]. In the absence of loss events, the normalized distance is 0 if the two MPRs agree on all speciation, duplication, and transfer events and is 1.0 if the two MPRs disagree on all such events. Due to loss events, the normalized distance can exceed 1.0 in theory. Therefore, we also computed the normalized distances without losses so that the scale is from 0 to 1.0 and a distance of 1.0 implies disagreement on all speciation, duplication, and transfer events. Table 1 summarizes these results and demonstrates that, on average, MPRs disagree on a significant fraction of their constituent events, whether or not losses are considered.

Pairwise distance distributions

An important application of PairTree is in exploring the structure of the MPR space for a specific dataset of interest in order to determine the diversity of solutions and, ultimately, whether a single MPR is likely to be an adequate representative of the space. As an example, Fig. 2 shows the results from three gene trees that exhibit different distance distributions. These examples demonstrate that the pairwise distance distributions are dependent both on the trees themselves and on the event costs.

Fig. 2
figure 2

Pairwise distances (with losses) for three phylogenetic trees for three gene families in the Tree of Life dataset. COG0466 has 87 leaves, COG0651 has 84 leaves, and COG0703 has 85 leaves. All are reconciled to a species tree with 100 leaves. Each of the three rows corresponds to one gene family and the three columns correspond to the DTL cost parameters (2,3,1), (1,2,1), and (1,1,1), respectively. The entry at index 0 of each vector is omitted. These examples demonstrate that the pairwise distance distributions are sensitive to event costs and may be multimodal, indicating the presence of two or more clusters in MPR space

While many of the pairwise distance distributions are unimodal, there are also multimodal distributions which suggest the presence of distinct clusters of MPRs. The number of clusters cannot be directly ascertained from the number of modes in the distribution; MPR space is high-dimensional so a large number of clusters may induce only two modes.

Some distributions are comb-like with many pairs of MPRs at even distances and a smaller number of (or zero) pairs of MPRs at odd distance, or vice versa. This phenomenon is due to some MPRs having few or no losses. In the absence of loss events, the pairwise distance must be even by the definition of the symmetric set distance. Since losses result in shifting the distribution, comb-like distributions with odd distances are also possible.

Conclusions

We have given an efficient polynomial-time algorithm for computing the pairwise distances between all maximum parsimony reconciliations in the DTL model that applies to several distance metrics. Further work is required to determine how these distributions should be interpreted with respect to the number of MPRs required to adequately represent the space. However, the mean and standard deviation alone, both reported by PairTree, provide important insights into the variation of solutions. Moreover, multimodal distributions are indicative of clusters of MPRs, which suggest that multiple MPRs are needed to adequately represent the space.

If there is evidence of clusters in MPR space, it is desirable to find a small set of representative MPRs, with at least one for each cluster. Ozdemir et al. [9] showed how Park’s k-medoid heuristic [20] and González’s k-centers 2-approximation algorithm can be adapted to find k medoids or centers, respectively. However, these algorithms have running times of the form O(nk+3 logn) rendering them viable only for small values of k. Thus, developing efficient clustering algorithms and methods for determining the appropriate number of clusters are important problems for future research.

Availability of data and materials

The software in this paper and the data used in the experiments are available at www.cs.hmc.edu/~hadas/pairtree.

Abbreviations

DTL:

Duplication-transfer-loss

MPR:

Maximum parsimony reconciliation

PDV:

Pairwise distance vector

References

  1. Haack J, Ramirez A, Zupke E, Wu Y, Libeskind-Hadas R. Computing the diameter of the space of maximum parsimony reconciliations in the duplication-transfer-loss model. IEEE/ACM Trans Comput Biol Bioinform. 2019; 16(1):14–22. https://doi.org/10.1109/TCBB.2018.2849732.

    Article  Google Scholar 

  2. Doyon J-P, Scornavacca C, Gorbunov KY, Szöllosi J G, Ranwez V, Berry V. An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. Comp Genom. 2011; 6398:93–108.

    Article  Google Scholar 

  3. Tofigh A, Hallett MT, Lagergren J. Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Trans Comp Bio Bioinfo. 2011; 8(2):517–35.

    Article  Google Scholar 

  4. Chen Z. -Z., Deng F, Wang L. Simultaneous identification of duplications, losses, and lateral gene transfers. IEEE/ACM Trans Comput Biol Bioinform. 2012; 9(5):1515–28.

    Article  Google Scholar 

  5. David LA, Alm EJ. Rapid evolutionary innovation during an archaean genetic expansion. Nature. 2011; 469:93–96.

    Article  CAS  Google Scholar 

  6. Libeskind-Hadas R, Wu Y-C, Bansal MS, Kellis M. Pareto-optimal phylogenetic tree reconciliation. Bioinformatics. 2014; 30(12):87–95.

    Article  Google Scholar 

  7. Scornavacca C, Paprotny W, Berry V, Ranwez V. Representing a set of reconciliations in a compact way. J Bioinforma Comput Biol. 2013; 11(02):1250025.

    Article  Google Scholar 

  8. Nguyen T-H, Ranwez V, Berry V, Scornavacca C. Support measures to estimate the reliability of evolutionary events predicted by reconciliation methods. PLoS ONE. 2013; 8(10):73667.

    Article  Google Scholar 

  9. Ozdemir A, Sheely M, Bork D, Cheng R, Hulett R, Sung J, Wang J, Libeskind-Hadas R. In: Figueiredo D, Martín-Vide C, Pratas D, Vega-Rodríguez MA, (eds).Clustering the Space of Maximum Parsimony Reconciliations in the Duplication-Transfer-Loss Model. Cham: Springer; 2017, pp. 127–39.

    Google Scholar 

  10. Ma W, Smirnov D, Forman J, Schweickart A, Slocum C, Srinivasan S, Libeskind-Hadas R. DTL-RnB: Algorithms and tools for summarizing the space of DTL reconciliations. IEEE/ACM Trans Comp Bio Bioinfo. 2016; 15:411–21.

    Article  Google Scholar 

  11. Haack J, Ramirez A, Zupke E, Wu Y, Libeskind-Hadas R. Computing the diameter of the space of maximum parsimony reconciliations in the duplication-transfer-loss model. IEEE Trans Comput Biol Bioinforma. 2018. Special Issue for APBC 2018.

  12. Huber KT, Moulton V, Sagot M. -F., Sinaimeri B. Exploring and Visualizing Spaces of Tree Reconciliations. Syst Biol. 2018; 68:607–18.

    Article  Google Scholar 

  13. Tofigh A. Using trees to capture reticulate evolution: Lateral gene transfers and cancer progression. PhD thesis, KTH Royal Institute of Technology. 2009.

  14. Ma W, Smirnov D, Libeskind-Hadas R. DTL reconciliation repair. BMC Bioinformatics. 2017; 18(3):76.

    Article  Google Scholar 

  15. Addario-Berry L, Hallett MT, Lagergren J. Towards identifying lateral gene transfer events. Pac Symp Biocomput. 2003; 8:279–90.

    Google Scholar 

  16. Grueter M, Duran K, Ramalingam R, Libeskind-Hadas R. Reconciliation reconsidered: In search of a most representative reconciliation in the duplication-transfer-loss model. In: Proceedings of the 17th Asia Pacific Bioinformatics Conference: 2019. https://doi.org/10.1109/tcbb.2019.2942015.

  17. Huber KT, Moulton V, Sagot M. -F., Sinaimeri B. Geometric medians in reconciliation spaces of phylogenetic trees. Inf Process Lett. 2018; 136:96–101.

    Article  Google Scholar 

  18. Winograd S. On computing the Discrete Fourier Transform. Math Comput. 1978; 32(141):175–99.

    Article  Google Scholar 

  19. Urbini L, Sinaimeri B, Matias C, Sagot M. Exploring the robustness of the parsimonious reconciliation method in host-symbiont cophylogeny. IEEE/ACM Trans Comput Biol Bioinforma. 2018; 1. https://doi.org/10.1109/tcbb.2018.2838667.

    Article  Google Scholar 

  20. Park H-S, Jun C-H. A simple and fast algorithm for k-medoids clustering. Expert Syst Appl. 2009; 36(2):3336–41.

    Article  Google Scholar 

Download references

Acknowledgements

The authors thank Jordan Haack for the proof of Lemma 1, Yi-Chieh Wu for valuable discussions, and the anonymous reviewers for their useful suggestions, which helped improve the exposition of this paper.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 20 Supplement 20, 2019: Proceedings of the 17th Annual Research in Computational Molecular Biology (RECOMB) Comparative Genomics Satellite Workshop: Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-20.

Funding

Funding for this research was provided by the R. Michael Shanahan Endowment and support from Harvey Mudd College. Publication costs are funded by support from Harvey Mudd College.

Author information

Authors and Affiliations

Authors

Contributions

RLH conceived the research. SS, RM, and RLH developed the algorithm. SS and RM implemented the algorithm. RLH and RM developed the experimental design and RM conducted all experiments. RM and RLH wrote the paper. All authors read and approved the paper.

Corresponding author

Correspondence to Ran Libeskind-Hadas.

Ethics declarations

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Santichaivekin, S., Mawhorter, R. & Libeskind-Hadas, R. An efficient exact algorithm for computing all pairwise distances between reconciliations in the duplication-transfer-loss model. BMC Bioinformatics 20 (Suppl 20), 636 (2019). https://doi.org/10.1186/s12859-019-3203-9

Download citation

  • Published:

  • DOI: https://doi.org/10.1186/s12859-019-3203-9

Keywords