### Basic definitions

Let *H* and *P* be two phylogenetic trees. *H* and *P* will be called host tree, respectively, parasite tree. Let *φ* : *L*(*P*) × *L*(*H*) be a relation over the set of leaf nodes of the parasite tree and the leaf nodes of the host tree. *φ;* is used to describe known host-parasite interactions. A toy example for a cophylogenetic system of four hosts and four parasites and their associations is given in Figure 1 (left).

In order to investigate whether there exists coevolution between hosts and their parasites, their common history is reconstructed from the phylogenies and the known current relationships. Typically, four different types of events are considered for the coevolutionary reconstruction of host-parasite systems: cospeciation events, duplication events, sorting events, and switching events. Cospeciation events refer to simultaneous speciations of host and parasite, duplication events are independent parasite speciations, sorting events correspond to lineage sorting (i.e., a parasite species that lives on a host species remains on only one of the resulting species after a host speciation), and switching events correspond to host shifts. As has been done by other authors (e.g., [3] and [9]) we consider a switch as a speciation of the parasite where one of the resulting species switches to another host. The four event types, that are also utilized in CoRe-PA are depicted in Figure 2.

We need the following definitions. If *p* is a node of a tree, then *p*.*i* denotes the *i*-th child node of *p*. The out-degree of node *p* is denoted with deg(*p*). An association of a parasite *p* ∈ *P* to a host *h* ∈ *H* is denoted as (*p*, *h*). A reconstruction *R* is the set of all associations of all parasites to nodes in the host tree, i.e., for each node *p* ∈ *P* it exists an *h* ∈ *H* such that (*p*, *h*) ∈ *R*. A reconstruction is valid if i) all parasite leaves are mapped to host leaves according to *φ*, ii) if node *p* is mapped to node *h*, then no descendant of *p* is associated with an ancestor of *h*, as this would induce an inconsistency, and iii) at least one child *p*.*i* of *p* has to be associated with an descendant of *h*. We do not consider the case of a speciation of the parasite *p* where both child species change to hosts that are outside of the subtree with root h because such events can not be traced back (many other studies also do not allow such events, e.g., [9]).

Based on a valid reconstruction *R*, the events implied by the associations in *R* can be inferred as follows. For all non-leaf nodes *p* ∈ *P* the association of *p* and of all its children *p*.*i*, 1 ≤ *i* ≤ deg(*p*), is considered. If for example, in the case of binary trees, the association (*p*, *h*) exists, and *p*.1 is mapped to one child of *h* and *p*.2 is mapped to the other child of *h*, then this implies either i) one cospeciation event, or ii) a duplication and two sorting events. This association triple technique has been used before in Tarzan and leads to an efficient reconstruction method (for details see [6]). A valid reconstruction for the coevolutionary system of Figure 1 (left) is depicted in Figure 1 (right). In the reconstruction the three associations (*p*_{3}, *h*_{2}), (*p*_{4}, *h*_{3}), and (*p*_{2}, *h*_{1}) induce one cospeciation and one sorting event (in general many different sets of events may be possible). The three associations (*p*_{2}, *h*_{1}), (*p*_{5}, *h*_{4}), and (*p*_{1}, *h*_{1}) induce one duplication and two sorting events. The depicted reconstruction requires two cospeciations, one duplication, and three sortings.

We will discuss divergence timing information and incompatible reconstruction only briefly in this article and refer to [9] and [6]. Considering again an association (*p*, *h*), where one child *p*.*i* is mapped to a node *h'*, and *h'* is not a descendant of *h*, then this implies (at least) one host switching event. A problem with switches in a reconstruction is that they induce a timing relation between the take-off site and the landing site. A consequence is that the occurrence of several switches in a valid reconstruction can lead to timing relations which are not possible. CoRe-PA includes more sophisticated methods for detecting these so-called incompatible (in contrast to compatible) reconstructions than, for example, Tarzan (for details how these incompatibilities can be resolved see [6]). However we will focus on the parameter-adaptive reconstruction approach in this article. Furthermore, we point out that CoRe-PA includes the same handling of divergence timing information as Tarzan, i.e., nodes can be labeled with divergence timing information and an association (*p*, *h*) is only allowed, if the timing information of *p* and *h* do not disallow this association.

### Dynamic programming approach

In the following a dynamic programming formulation (DP) for the reconstruction problem is given, which is a key component of CoRe-PA. We briefly discuss how the usage of divergence timing information is included, and explain details of runtime optimization techniques that are used. We omit a detailed discussion of how multifurcations and multiple-host parasites are handled (instead of resolving multifurcations by iterating over all possible binary subtrees, as done in Tarzan, non-binary cophylogenetic events where introduced, e.g., events that represents a composition of multiple duplications and cospeciations occurring consecutively taking into account the structure of multifurcating host and parasite nodes).

#### Initial DP formulation

The basic idea of the dynamic programming approach is to traverse the parasite tree *P* in a bottom-up manner. The cheapest cost *C*_{p, h}for a node *p* of *P*, that is mapped on a node *h* of *H*, is stored in the dynamic programming table. If *p* is a leaf node, then the mapping for *p* is defined by the relation *φ* and induces no costs as no coevolutionary event occurs. In the recursive step of the dynamic programming we map all children *p*.1, ..., *p*.deg(*p*) of *p* to nodes in *H*. The mapping of the nodes *p*.*i* to nodes *h*_{
i
}∈ *H* induces i) the recursively computed cost for each association, plus ii) the cost from the cheapest set of events due to *p* being associated with *h*, and the nodes *p*.1, ..., *p*.deg(*p*) being associated with the corresponding *h*_{
i
}. Note that there may exist several possibilities for this set of events to explain the given associations, and the cost-wise cheapest of those is taken. These costs are denoted by min(*E*(*h*, *h*_{1}, ..., *h*_{deg(p)})). Let us consider again the binary example where *h*_{1} and *h*_{2} are children of *h* (i.e., *h*_{1} = *h*.1 and *h*_{2} = *h*.2, or *h*_{1} = *h*.2 and *h*_{2} = *h*.1). In this example min(*E*(*h*, *h*_{1}, *h*_{2})) refers either to the costs for one cospeciation event or to the costs for one duplication and two sorting events. The dynamic programming formulation is as follows:

For details on how min(*E*(*h*, *h*_{1}, ..., *h*_{deg(p)})) is computed in the binary case see Algorithm 1.

**Algorithm 1**: Computing min(*E*(*h*, *h*_{1}, *h*_{2})) in the binary case of Equation 1

**Input**: *h*, *h*_{1}, *h*_{2}, cospeciationCost, sortingCost, duplicationCost, hostswitchCost

**Output**: costs *E*

**1** *E* ← ∞;

**2 if**
*h is not a descendant of h*
_{1}
*or h*
_{2}
**then**

**3** *S*← Compute number of sortings from *h* to *h*_{1} plus sortings from *h* to *h*_{2};

**4** **if** *h*_{1} and *h*_{2}*are in the subtree with root h* **then**

**5** *E* ← *duplicationCost* + (*S* * *sortingCost*);

**6** **if** *h*_{1} and *h*_{2}*are in different subtrees with root h*.1 *and h*.2 **then**

**7** *E* ← min(*E*, *cospeciationCost* + ((*S* - 2) * *sortingCost*);

**8** **end**

**9** **end**

**10** **if** *either h*_{1}*or h*_{2}*is in subtree with root h* **then**

**11** *E* ← *hostswitchCost* + (*S* * *sortingCost*);

**12** **end**

**13 end**

**14 return** *E*;

#### Inclusion of divergence timing information

Similar to the approach in [6], algorithm CoRe-PA allows assigning intervals of time zones to the nodes in one of the trees, e.g., the parasite tree. The nodes in the other tree, e.g., the host tree, have to be assigned to a single time zone. The reason for this is that the reconstruction problem becomes much more complex when nodes in both trees are assigned to time zone intervals [6]. For each possible association (*p*, *h*) we define a value *Z*_{p, h}. The value of *Z*_{p, h}is 0 if the association is valid with respect to the timing information, and it is ∞ otherwise. For the revised DP formulation we add the value *Z*_{p, h}in the recursion step of Equation 1.

#### Optimization

A direct implementation of the DP formulation, as given in Equation 1, would not perform very well, as all possible combinations of all possible associations of nodes *p*.*i* to nodes *h*_{
i
}would be considered in order to compute *C*_{p, h}, i.e., any of the *n*^{deg(p)}combinations of choosing *deg*(*p*) hosts out of the n nodes in the host phylogeny have to be considered. Therefore several improvements are included into the implementation of CoRe-PA. The most important improvement reduces the number of combinations of associations that have to be considered significantly as described in the following. If the costs for *C*_{p, h}are computed according to Equation 1, all possible mappings of each *p*.*i* to all *h* ∈ *H* are considered. Let us assume two possibilities for mappings of *p*.*i*, namely *p*.*i* being mapped to *h'* and *p*.*i* being mapped to *h"*. Let us further assume that *h'* and *h"* are both in a subtree of *H* that has a child of *h* as a root node. As we know the values of *C*_{p.i, h'}and *C*_{p.i, h″}(due to the recursive approach) and as the number of sorting events induced by the pair of associations (*p*, *h*) and (*p*.*i*, *h'*) (respectively (*p*, *h*) and (*p*.*i*, *h"*)) is known, one of the associations (either (*p*.*i*, *h'*) or (*p*.*i*, *h"*)) will dominate the other (unless the costs are equal). This is true for every pair of host nodes that occur in the same subtree of *H* that have a child of *h* as root node. Therefore, only the association that induces the smallest cost in such a subtree must be considered and the number of combinations to be considered in the recursive approach is reduced significantly. This is not only true for all these subtrees, but also for the set of all other nodes that are neither *h* itself nor in one of the just described subtrees. For the binary case pseudocode is given in Algorithm 2.

**Algorithm 2**: Computing *C*_{p, h}in the binary case of Equation 1

**Input**: parasite *p*, host *h*, *C*_{p.1, h'}, *C*_{p.2, h"}, for all *h'*, *h"* ∈ *H*

**Output**: *C*_{p, h}

**1 if** *p* ∈ *L*(*P*) **then**

**2** **return** ((*p*, *h*) ∈ *φ*)?0:∞;

**3 end**

**4** *T* ← 4-partition of *H* specified by {nodes of subtree *h*.1, nodes of subtree *h*.2, {*h*}, remaining nodes};

**5** ← for each *T*^{j}∈ *T* choose the *h'* ∈ *T*^{j}with min(*C*_{p.1, h'}+ sorting costs from *h* to *h'*);

**6** ← for each T ^{j}∈ *T* choose the *h"* ∈ *T*^{j}with min(*C*_{p.2, h"}+ sorting costs from *h* to *h''*);

**7** *C*_{p, h}← ∞;

**8 foreach**
*h*
_{1}
*from*
**do**

**9** **foreach** *h*_{2}*from***do**

**10** *C*_{p, h}← min(*C*_{p, h}, + + min(*E*(*h*, *h*_{1}, *h*_{2})));

**11** **end**

**12 end**

**13 return** *C*_{p, h};

In addition to this dominance-based optimization CoRe-PA utilize tables of precomputed event costs. Assume that an arbitrary parasite node *p* is being mapped on *h* and a child *p*.*i* of *p* is being mapped on *h'*. A certain set of events that have to occur can be precomputed independent from the specific choice of *p* and *p*.*i*: for example, if *h'* is a descendant of *h*, the number of sorting events can be computed; in other cases host switches can be inferred beforehand. In order to perform such precomputations, it is assumed that each possible *h* and *h'* for the mapping of an arbitrary *p* and the child node *p*.*i* is considered. Also in the case that divergence timing information is used, the best take-off and landing sites can be precomputed in the same manner.

Let *n* be the maximal number of nodes in the host or in the parasite tree. It is not difficult to see, that computing a reconstruction with CoRe-PA runs in order of *O*(*n*^{3}), if the maximal degree of the nodes in the trees is assumed to be constant.

### Parameter-adaptive cophylogenetic reconstruction

Several optimization criteria have been investigated in the literature that utilize event-based cophylogenetic reconstruction methods. Examples include the minimization of overall reconstruction costs or the maximization of the number of cospeciations. But all methods are strongly dependant on a good estimation of the cost vector, that assigns costs to the events. Often cospeciation costs are considered to be small (for example ≤ 0), and duplication and host switch costs are usually assumed to be high. However, from a biological point of view, the exact values for these costs are basically unknown. In [3] an inspiring comment is given: "*If each event is associated with a cost that is inversely related to the likelihood of the event (the more likely the event, the smaller the cost) then the most parsimonious reconstruction will also, in some sense, be the most likely explanation of the observed data*.". This comment nicely reflects the underlying idea of the parameter-adaptive approach of CoRe-PA, that will be described in the following. Unlike other methods CoRe-PA does not require any restrictions on the cost values. However, for the parameter-adaptive approach we assume all event costs are between 0 and 1 (If they are larger this can be achieved by multiplication with a suitable factor, as only the ratio between the event costs have an effect on the reconstruction and not the values themselves). Let = (*c*_{1}, ..., *c*_{
m
}) be the cost vector for the *m* possible events. Based on this cost setting it is expected that the event indexed by *i* occurs with probability

i.e., the probability for a certain event is the normalized value of the reciprocal event cost. This ensures that also the ratio between the probabilities of two events are inversely proportional to the ratio between the corresponding cost values. Note that negative cost values can not be considered in this parameter-adaptive approach, as negative probability values can not be interpreted reasonably. However, from a parsimony perspective negative cost values are questionable anyway (see [3]).

Based on the cost vector a cost-minimal reconstruction is inferred using the DP formulation as given above; this in turn leads to relative event frequencies *r*_{
i
}of the events, based on the computed reconstruction. Assume that cost vector is used to determine a reconstruction. The obvious method to determine how good the reconstruction and the cost vector fit, is based on the sum of the differences of the probabilities *p*_{
i
}and the corresponding relative event frequencies *r*_{
i
}of the reconstruction. Formally,

By using as an optimization criteria, a cost vector is sought such that is minimized. The value can be interpreted as a quantification of how unlikely a reconstruction is. Furthermore, if, based on some significance test, there is a strong support for coevolution, but the corresponding is very high, then the support for the coevolutionary signal has still to be questioned.

The parameter-adaptive approach reduces the parameterized cophylogenetic reconstruction problem to a parameter-adaptive optimization problem. Of course, many sophisticated methods are known for finding a good vector , like meta-heuristics [10] or utilizing the concept of a simplex (like in the Nelder-Mead downhill simplex method [11]). In order to be able to present a reasonable statistical analysis of the parameter-adaptive component of CoRe-PA and not to be biased by an underlying optimization method, we present only results that are based on randomly chosen (uniform distribution) cost vectors (although the Nelder-Mead simplex method is already included in CoRe-PA).

### Randomized tests in CoRe-PA

In order to evaluate whether the number of different phylogenetic events of a reconstruction indicates significant coevolution, different randomization tests can be used (see, e.g., [12]). The idea of these tests is to create reconstructions for scenarios where part of the problem instance is randomly changed, e.g., the hosts and parasite associations can be changed randomly. Then the number of events in the reconstructions for the random scenarios can be compared to the reconstruction for the original host parasite scenario. Different opinions have been stated in the literature about what part of the host-parasite data should be randomized when creating random instances for a significance test. Some possibilities are to randomize the parasite tree, the host tree, both trees, or the associations between host and parasites (see [12]). It is important that the random instances are biologically plausible because otherwise the significance results that can be obtained with the tests are biologically useless. Therefore, different methods have been proposed how the random instances should be generated (see [13] for an overview). One randomization test that is integrated in TreeMap is the most often used test in literature on host parasite coevolution (see, e.g., [14]). The test asks whether the maximum proportion of cospeciating nodes inferred is greater than the maximum proportion that can be inferred when one of the phylogenies is randomized. TreeMap allows to randomize either one tree (the host or the parasite tree) or both trees. All these possibilities have been used in the literature.

In [12] the proper use of randomization methods in order to analyze, whether the fit between hosts and parasites can be explained by coevolution, is discussed. It was argued that for a corresponding test it is not appropriate to make random changes in the host or parasite tree. Instead it was proposed to keep the phylogenies of the hosts and the parasites as well as the number of associations. Only the associations between the hosts and parasites should be randomized. This method has been used, e.g., in [14]. For many host parasite systems it can be observed that the number of different parasite species on one host species is small. For such a system it might not be biologically meaningful if a random association between hosts and parasites is created by assigning each parasite a random host with equal probability. Therefore we propose here that random associations should be created that keep the character of the host parasite assignment in the following sense. The number of hosts that have *k* parasite species should be the same in the original host parasite system and the random instance for all integers *k*. We call this a character preserving association. All the discussed methods are included in CoRe-PA. In the case that random trees have to be generated, the well known *β*-splitting model [13] is employed. The *β*-splitting model includes the Markov model and the PDA model as special cases. The method for randomizing the parasite tree (resp. the host tree and both trees) is denoted by RND-parasite (resp. RND-host and RND-both); the character preserving association is denoted by RND-assoc.