PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships
 Cuong Than†^{1},
 Derek Ruths†^{1} and
 Luay Nakhleh†^{1}
DOI: 10.1186/1471-2105-9-322
© Than et al; licensee BioMed Central Ltd. 2008
Received: 04 March 2008
Accepted: 28 July 2008
Published: 28 July 2008
Abstract
Background
Phylogenies, i.e., the evolutionary histories of groups of taxa, play a major role in representing the interrelationships among biological entities. Many software tools for reconstructing and evaluating such phylogenies have been proposed, almost all of which assume the underlying evolutionary history to be a tree. While trees give a satisfactory first-order approximation for many families of organisms, other families exhibit evolutionary mechanisms that cannot be represented by trees. Processes such as horizontal gene transfer (HGT), hybrid speciation, and interspecific recombination, collectively referred to as reticulate evolutionary events, result in networks, rather than trees, of relationships. Various software tools have recently been developed to analyze reticulate evolutionary relationships, including SplitsTree4, LatTrans, EEEP, HorizStory, and T-REX.
Results
In this paper, we report on the PhyloNet software package, which is a suite of tools for analyzing reticulate evolutionary relationships, or evolutionary networks, which are rooted, directed, acyclic graphs, leaf-labeled by a set of taxa. These tools can be classified into four categories: (1) evolutionary network representation: reading/writing evolutionary networks in a newly devised compact form; (2) evolutionary network characterization: analyzing evolutionary networks in terms of three basic building blocks – trees, clusters, and tripartitions; (3) evolutionary network comparison: comparing two evolutionary networks in terms of topological dissimilarities, as well as fitness to sequence evolution under a maximum parsimony criterion; and (4) evolutionary network reconstruction: reconstructing an evolutionary network from a species tree and a set of gene trees.
Conclusion
The software package, PhyloNet, offers an array of utilities to allow for efficient and accurate analysis of evolutionary networks. The software package will help significantly in analyzing large data sets, as well as in studying the performance of evolutionary network reconstruction methods. Further, the software package supports the proposed eNewick format for compact representation of evolutionary networks, a feature that allows for efficient interoperability of evolutionary network software tools. Currently, all utilities in PhyloNet are invoked on the command line.
Background
A phylogenetic tree models the evolutionary history of a set of taxa from their most recent common ancestor. The assumptions of strict divergence and vertical inheritance render trees appropriate for modeling the evolutionary histories of several groups of species or organisms. However, when reticulate evolutionary events such as horizontal gene transfer or interspecific recombination occur, the evolutionary history is more appropriately modeled by an evolutionary network.
Evidence of reticulate evolution has been shown in various domains in the Tree of Life. Bacteria obtain a large proportion of their genetic diversity through the acquisition of sequences from distantly related organisms, via horizontal gene transfer (HGT) [1]. Furthermore, evidence of widespread HGT in plants has recently been emerging [2–4]. Interspecific recombination is believed to be ubiquitous among viruses [5, 6], and hybrid speciation is a major evolutionary mechanism in plants and in groups of fish and frogs [7–10]. All of these processes result in networks, rather than trees, of evolutionary relationships, even though at the gene level evolutionary histories may be tree-like, as we now describe.
In the case of interspecific recombination, as illustrated in Figure 1(b), some genetic material is exchanged between pairs of species; in this example, species B and C exchange genetic material. The genes involved in this exchange have an evolutionary history (gene tree T') that is different from that of the genes that are vertically transmitted (gene tree T).
In the case of hybrid speciation, as illustrated in Figure 1(c), the two parents contribute equally to the genetic material of the hybrid: in diploid hybridization, each parent contributes a single copy of each of its chromosomes, while in polyploid hybridization, each parent contributes all copies of its chromosomes. Thus, each set of "orthologous sites" from all taxa has an evolutionary history that is depicted by one of the trees inside the network.
A few software tools for analyzing reticulate evolutionary relationships have been developed recently. The SplitsTree4 tool, which incorporates several algorithms that have been developed by Daniel Huson and his coworkers, is a tool for reconstructing and visualizing splits networks [11]. The tool enables constructing networks from several types of data, including sequence data, distance matrices, and sets of trees. Two major differences exist between SplitsTree4 and PhyloNet. First, SplitsTree4 constructs and analyzes splits networks, which are graphical models of incompatibility in the data, whereas PhyloNet constructs and analyzes evolutionary networks, which are rooted, directed, acyclic graphs that represent evolutionary relationships. Second, the two tools differ in the utilities they provide, and we view them as complementary. While SplitsTree4 is mainly aimed at reconstructing networks, PhyloNet has several utilities for evaluating networks.
Programs such as EEEP [12], HorizStory [13], LatTrans [14], and T-REX [15] are aimed at detecting horizontal gene transfer by reconciling a pair of species/gene trees. The PhyloNet software package that we developed contains an extended implementation of the RIATA-HGT algorithm [16] with several improved algorithmic techniques for computing multiple solutions and handling non-binary trees [17]. The new version of RIATA-HGT significantly outperforms, in terms of speed, EEEP, HorizStory and LatTrans, and performs at least as well in terms of accuracy [17, 18]. We have recently added a new heuristic for inferring the support of HGT moves from bootstrap values of gene tree edges. Further, we have added the capability of visualizing the networks computed by RIATA-HGT. Besides RIATA-HGT, the PhyloNet software package implements methods for comparing and characterizing evolutionary networks, which include: (1) evolutionary network representation: reading/writing evolutionary networks in a newly devised compact form; (2) evolutionary network characterization: analyzing evolutionary networks in terms of three basic building blocks – trees, clusters, and tripartitions; (3) evolutionary network comparison: comparing two evolutionary networks in terms of topological dissimilarities, as well as fitness to sequence evolution under a maximum parsimony criterion; and (4) evolutionary network reconstruction: reconstructing an evolutionary network from a species tree and a set of gene trees. Furthermore, since various evolutionary network utilities use functionalities from the phylogenetic trees domain, PhyloNet provides a set of standalone phylogenetic tree analysis tools.
Results and discussion
The evolutionary network model
In this paper, we assume the "evolutionary network" model, which was formulated independently by Moret et al. [19] and Baroni et al. [20]. We now describe the model as well as some basic definitions and notations that we will use later.
Let T = (V, E) be a tree, where V and E are the tree nodes and tree edges, respectively, and let L(T) denote the tree's leaf set. Further, let χ be a set of taxa (organisms). Then, T is a phylogenetic tree over χ if there is a bijection between χ and L(T). Henceforth, we will identify the taxa set with the leaves they are mapped to, and let [n] = {1,..., n} denote the set of leaf labels. A tree T is said to be rooted if the set of edges E is directed and there is a single node r ∈ V with indegree 0. Let T be a phylogenetic tree on set χ of taxa, and let χ' ⊆ χ be a subset of taxa; then, we denote by T|_{χ'} the subtree with the minimum number of nodes and edges that spans the leaves in χ' (in other words, T|_{χ'} is the tree T restricted to subset χ' of its leaves).
For two nodes u and v in a directed graph G, we say that v is reachable from u, denoted by $u \rightsquigarrow v$, if there exists a directed path from u to v in G. For three nodes u, v and x in a directed graph G, we write $u \rightsquigarrow^{[x]} v$ if all directed paths from u to v go through node x; $u \not\rightsquigarrow^{[x]} v$ if no directed paths from u to v go through node x; and $u \rightsquigarrow^{[\hat{x}]} v$ if at least one directed path from u to v goes through node x and at least one directed path from u to v does not go through node x. For example, in network N_{1} in Figure 2, rooted at node r_{1}, we have $r_{1} \rightsquigarrow^{[Y]} D$, $r_{1} \not\rightsquigarrow^{[Y]} E$, and $r_{1} \rightsquigarrow^{[\hat{X}]} D$.
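The three path relations defined above can be checked with two reachability queries: one for paths that pass through x, and one for paths that avoid x. The following Python fragment is a minimal sketch of that idea, not PhyloNet code. The four-leaf network and the names `reachable` and `classify` are hypothetical and chosen only for illustration (the network is not the one in Figure 2).

```python
def reachable(children, u, v, banned=None):
    """Return True if a directed path from u to v exists that avoids node `banned`."""
    stack, seen = [u], set()
    while stack:
        node = stack.pop()
        if node == banned or node in seen:
            continue
        if node == v:
            return True
        seen.add(node)
        stack.extend(children.get(node, []))
    return False


def classify(children, u, v, x):
    """Classify the directed paths from u to v with respect to node x (x differs from u and v)."""
    # In an acyclic graph, a path u -> x joined with a path x -> v is a path through x.
    through = reachable(children, u, x) and reachable(children, x, v)
    around = reachable(children, u, v, banned=x)
    if through and around:
        return "some"         # at least one path uses x and at least one avoids it
    if through:
        return "all"          # every path from u to v uses x
    if around:
        return "none"         # no path from u to v uses x
    return "unreachable"      # v is not reachable from u at all


# Hypothetical network: h has two parents (a and b), so it is a network node.
network = {"r": ["a", "b"], "a": ["A", "h"], "b": ["h", "C"], "h": ["B"]}

print(classify(network, "r", "B", "h"))  # all
print(classify(network, "r", "B", "a"))  # some
print(classify(network, "r", "C", "a"))  # none
```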
Evolutionary network representation
Let N = (V = (V_{ N } ∪ V_{ T }); E) be an evolutionary network, with |V_{ N }| = ℓ. We create a forest of ℓ trees as follows.

For every u_{ i } ∈ V_{ N }:

  Compute the set V_{ i } = {v ∈ V : (v, u_{ i }) ∈ E} of u_{ i }'s parents;

  Create k = |V_{ i }| new leaves, all labeled with x_{ i } ({x_{ i }} ∩ L(N) = ∅);

  Delete from E the set of all edges in V_{ i } × {u_{ i }};

  Add to E the set of edges V_{ i } × {x_{ i }};

  Assign x_{ i } as the name of the tree rooted at node u_{ i };
The result is a forest of trees $\mathcal{F}$ = {t_{1},..., t_{ℓ}} such that (1) |L(t_{ i })| ≥ 1 for every 1 ≤ i ≤ ℓ, (2) $\bigcup_{i=1}^{\ell} L(t_{i}) = L(N)$ and (3) L(t_{ i }) ∩ L(t_{ j }) = ∅ for every 1 ≤ i, j ≤ ℓ and i ≠ j. We call $\mathcal{F}$ the tree decomposition of N. Then, the eNewick representation of N is the ℓ-tuple ⟨n(t_{1});...; n(t_{ℓ})⟩, where n(t_{ i }) is the Newick representation of tree t_{ i }. Figure 3 shows the tree decomposition and eNewick representation of the network N_{1} in Figure 2.
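The decomposition procedure above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the PhyloNet implementation: the network is a hypothetical child-list dictionary, network nodes are taken to be the nodes with more than one parent, the fresh labels x1, x2, ... and the key "root" are invented names, and the sketch also returns the tree that contains the root. Cutting an edge into a network node and attaching a new leaf are done in one step, by printing the label x_i wherever a network node occurs as a child.

```python
from collections import defaultdict


def tree_decomposition(children, root):
    """Return a dict mapping tree names to Newick strings for a rooted DAG."""
    parents = defaultdict(list)
    for u, kids in children.items():
        for v in kids:
            parents[v].append(u)

    # Assumption: a network node is any node with more than one parent.
    network_nodes = sorted(v for v, ps in parents.items() if len(ps) > 1)
    names = {u: f"x{i}" for i, u in enumerate(network_nodes, start=1)}

    def newick(node):
        kids = children.get(node, [])
        if not kids:
            return node  # a leaf is written as its label
        # A child that is a network node is replaced by a new leaf labeled x_i.
        parts = [names[k] if k in names else newick(k) for k in kids]
        return "(" + ",".join(parts) + ")"

    forest = {"root": newick(root) + ";"}
    for u in network_nodes:
        forest[names[u]] = newick(u) + ";"  # tree rooted at u_i, named x_i
    return forest


# Hypothetical network: h is a network node with parents a and b.
network = {"r": ["a", "b"], "a": ["A", "h"], "b": ["h", "C"], "h": ["B"]}
for name, tree in tree_decomposition(network, "r").items():
    print(name, "=", tree)
# root = ((A,x1),(x1,C));
# x1 = (B);
```

The exact textual layout that PhyloNet uses for eNewick strings is not reproduced here; the dictionary only mirrors the idea of one named Newick string per tree in the forest.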
In the case of modeling networks with horizontal gene transfer events, it is often very helpful to the biologist to know what the species tree is and what the additional set of HGT events are. Such information is "lost" in an eNewick representation, unless the representation is extended further to keep a record of the "species tree parent" of each network node. Therefore, in this case (which is the output of RIATA-HGT) we opt for the format of a species tree T, in Newick format, followed by a list of the HGT edges, each written as X → Y, where X and Y are two nodes in T.
Evolutionary network characterization
Given an evolutionary network N, the set $\mathcal{T}$(N) is unique. Further, this set informs about the possible gene histories that the network reconciles.
Table 1. The clusters of the two networks in Figure 2.
Network N_{1} | Network N_{2}
------------- | -------------
{B, C} | {B, C}
{C, D} | {C, D}
{B, C, D} | {B, C, D}
{D, E} | {D, E}
{E, F} | {E, F}
{D, E, F} | {D, E, F}
{B, C, D, E, F} | {B, C, D, E, F}
{A, B, C} | {A, B, C}
{A, B, C, D} | {A, B, C, D}
{A, B, C, D, E, F} | {A, B, C, D, E, F}
{E, F, G} | {E, F, H}
{D, E, F, G} | {D, E, F, H}
{G, H} | {G, H}
{D, E, F, G, H} | {D, E, F, G, H}
This weighting scheme informs not only about the clusters of taxa that the network represents, but also how many gene trees in the input share each cluster. It is important to note here that this weighting of a cluster should not be confused with, or used in lieu of, support values of clusters, since a cluster may appear in only one gene tree and have a high support (e.g., by having a high bootstrap value on the edge that defines it) whereas a poorly supported cluster may appear in several trees.
Table 2. The tripartitions of the two networks in Figure 2.
Edge label | Network N_{1} | Network N_{2}
---------- | ------------- | -------------
1 | ⟨{A, B, C}, {D, E, F}, {G, H}⟩ | ⟨{A, B, C}, {D, E, F}, {G, H}⟩
2 | ⟨{G, H}, {D, E, F}, {A, B, C}⟩ | ⟨{G, H}, {D, E, F}, {A, B, C}⟩
3 | ⟨{B, C}, {D, E, F}, {A, G, H}⟩ | ⟨{B, C}, {D, E, F}, {A, G, H}⟩
4 | ⟨{G}, {D, E, F}, {A, B, C, H}⟩ | ⟨{H}, {D, E, F}, {A, B, C, G}⟩
5 | ⟨{B, C}, {D}, {A, E, F, G, H}⟩ | ⟨{B, C}, {D}, {A, E, F, G, H}⟩
6 | ⟨{C}, {D}, {A, B, E, F, G, H}⟩ | ⟨{C}, {D}, {A, B, E, F, G, H}⟩
7 | ⟨{E, F}, {D}, {A, B, C, G, H}⟩ | ⟨{E, F}, {D}, {A, B, C, G, H}⟩
8 | ⟨{E}, {D}, {A, B, C, F, G, H}⟩ | ⟨{E}, {D}, {A, B, C, F, G, H}⟩
9 | ⟨{D}, {}, {A, B, C, E, F, G, H}⟩ | ⟨{D}, {}, {A, B, C, E, F, G, H}⟩
10 | ⟨{D}, {}, {A, B, C, E, F, G, H}⟩ | ⟨{D}, {}, {A, B, C, E, F, G, H}⟩
11 | ⟨{E, F}, {D}, {A, B, C, G, H}⟩ | ⟨{E, F}, {D}, {A, B, C, G, H}⟩
12 | ⟨{E, F}, {D}, {A, B, C, G, H}⟩ | ⟨{E, F}, {D}, {A, B, C, G, H}⟩
Tripartition-based characterization of an evolutionary network helps to identify clades across which no genetic transfer occurred. If A_{ e } = X and B_{ e } = ∅ for an edge e = (u, v), this implies that the clade rooted at node v has set X of leaves, and there does not exist any exchange or transfer of genetic material between any organism in X and another organism that is not in X. Equivalently, an evolutionary network can be partitioned into a collection {N_{1}, N_{2},..., N_{ k }} of evolutionary networks that result from N by deleting every edge e for which B_{ e } = ∅. Such a partition informs about the "locality" of reticulation events: each event in N is local to one of the k components in {N_{1}, N_{2},..., N_{ k }}. Further, this partition implies that each of the trees in $\mathcal{T}$(N) has k clades that have the sets {L(N_{1}), L(N_{2}),..., L(N_{ k })} of leaves.
Evolutionary network comparison
Researchers are often interested in quantifying the similarities and differences between two phylogenies reconstructed either from two different sources of data or from two different reconstruction methods. Such a quantification provides insights into agreements and disagreements among analyses, confidence values for different parts of the phylogenies, and metrics for comparing the performance of phylogenetic reconstruction methods. In the context of phylogenetic trees, this quantification is most commonly done based on one of two criteria:

Topological differences. The topologies, or shapes, of two phylogenetic trees are compared, and their differences are quantified. Several measures have been introduced to quantify topological differences and similarities between a pair of trees, such as the Robinson-Foulds measure and the SPR distance; see [25, 26] for a description of several such measures.

Fitness to sequence evolution. When two phylogenies are reconstructed from the same sequence data set, it is common to compare them in terms of how well they model the evolution of the sequences. The most commonly used criteria for measuring such fitness are maximum parsimony, maximum likelihood, and the Bayesian posterior probability; see [25] for a detailed discussion of all three criteria.
In this section, we report on the capabilities in PhyloNet for comparing two evolutionary networks in terms of their topological differences and similarities, as well as in terms of their fitness to sequence evolution based on the maximum parsimony criterion.
For quantifying the dissimilarity between two evolutionary network topologies N_{1} and N_{2}, we want a measure m(·,·) that satisfies three conditions:
Identity: m(N_{1}, N_{2}) = 0 if and only if N_{1} and N_{2} are equivalent;
Symmetry: m(N_{1}, N_{2}) = m(N_{2}, N_{1}); and
Triangle inequality: m(N_{1}, N_{3}) + m(N_{3}, N_{2}) ≥ m(N_{1}, N_{2}) for any evolutionary network N_{3}.
This issue of evolutionary network equivalence was discussed in [19]. The three characterizations of evolutionary networks that we described above induce three measures which we now define. Let N_{1} and N_{2} be two evolutionary networks on the same set X of leaves; we define the three measures as follows.
Tree-based comparison
Shown on the right of Figure 6 is the minimum-weight edge cover of G, which is the set of edges that satisfies two conditions: (1) each node in G must be the endpoint of at least one edge in the set, and (2) the sum of the weights of the edges in the set is minimum among all sets of edges satisfying condition 1. In this case, the four edges shown are a cover, since each node in G is "covered" by at least one edge (here, each node is covered by exactly one edge). Further, it is of minimum weight, which equals 2, since a simple inspection yields that every other cover has a weight larger than 2. Since the cover has four edges in it, we have m^{ tree }(N_{1}, N_{2}) = (0 + 0 + 1/6 + 1/6)/4 = 1/12. If we use the raw RF values, then m^{ tree }(N_{1}, N_{2}) = (0 + 0 + 1 + 1)/4 = 1/2.
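The arithmetic in this example can be reproduced with a brute-force search for a minimum-weight edge cover of a small complete bipartite graph. The Python sketch below is illustrative only and is not how PhyloNet computes the measure. Only the four cover weights (0, 0, 1, 1 as raw RF values, and the divisor 6 implied by the value 1/6) come from the text; the remaining entries of the weight matrix are hypothetical, and `tree_measure` is an invented name. The search is exponential in the number of edges, so it is suitable only for tiny inputs, and ties between covers of equal weight are broken by enumeration order.

```python
from fractions import Fraction


def tree_measure(weights):
    """weights[i][j] is the distance between tree i of N1 and tree j of N2.

    Returns (cover weight / number of cover edges, cover) for a
    minimum-weight edge cover found by exhaustive search.
    """
    rows, cols = len(weights), len(weights[0])
    edges = [(i, j) for i in range(rows) for j in range(cols)]
    best = None
    for mask in range(1, 1 << len(edges)):
        chosen = [e for k, e in enumerate(edges) if (mask >> k) & 1]
        # Condition 1: every node on both sides is an endpoint of a chosen edge.
        if {i for i, _ in chosen} != set(range(rows)):
            continue
        if {j for _, j in chosen} != set(range(cols)):
            continue
        # Condition 2: keep the cover with the smallest total weight.
        total = sum(weights[i][j] for i, j in chosen)
        if best is None or total < best[0]:
            best = (total, chosen)
    total, cover = best
    return Fraction(total) / len(cover), cover


# Hypothetical raw RF matrix; only the diagonal values are taken from the text.
raw = [[0, 2, 2, 2],
       [2, 0, 2, 2],
       [2, 2, 1, 3],
       [2, 2, 3, 1]]
normalized = [[Fraction(w, 6) for w in row] for row in raw]

print(tree_measure(raw)[0])         # 1/2
print(tree_measure(normalized)[0])  # 1/12
```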
Cluster-based comparison
The rationale behind this measure is that it is the sum of the ratios of clusters present in one but not both networks. The cluster-based measure was first introduced by Nakhleh et al. [29]. The sets C_{1} = $\mathcal{C}$(N_{1}) and C_{2} = $\mathcal{C}$(N_{2}) of the two networks N_{1} and N_{2} in Figure 2 are listed in Table 1, with |C_{1}| = |C_{2}| = 14. Since |C_{1} − C_{2}| = |C_{2} − C_{1}| = 2 (the two highlighted clusters in Table 1), we have m^{ cluster }(N_{1}, N_{2}) = 1/7. A similar weighting scheme to that described in the previous section can be used to incorporate the fraction of trees in which a cluster appears into the measure calculation.
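As a check on the value 1/7, the cluster sets of Table 1 can be compared directly. The sketch below assumes that the measure is half the sum of the two ratios, i.e., (1/2)(|C_1 − C_2|/|C_1| + |C_2 − C_1|/|C_2|); this form is inferred from the worked numbers in the text (2/14 on each side giving 1/7) rather than quoted from a formula, and the function name `cluster_measure` is hypothetical.

```python
from fractions import Fraction


def cluster_measure(c1, c2):
    """Half the sum of the fractions of clusters found in one set but not the other."""
    only_in_1 = Fraction(len(c1 - c2), len(c1))
    only_in_2 = Fraction(len(c2 - c1), len(c2))
    return (only_in_1 + only_in_2) / 2


def clusters(text):
    """Turn 'BC CD ...' into a set of clusters, each a frozenset of taxon names."""
    return {frozenset(word) for word in text.split()}


# Clusters of N1 and N2 as listed in Table 1 (one letter per taxon).
shared = "BC CD BCD DE EF DEF BCDEF ABC ABCD ABCDEF GH DEFGH"
c1 = clusters(shared + " EFG DEFG")
c2 = clusters(shared + " EFH DEFH")

print(len(c1), len(c2))         # 14 14
print(cluster_measure(c1, c2))  # 1/7
```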
Tripartition-based comparison
This measure views the two networks in terms of the sets of edges they define (where an edge is in a 1-1 correspondence with a tripartition) and computes the sum of the ratios of edges present in one but not both networks. The tripartition-based measure was devised by Moret et al. [19]. The sets θ_{1} = θ(N_{1}) and θ_{2} = θ(N_{2}) of the two networks N_{1} and N_{2} in Figure 2 are listed in Table 2, with |θ_{1}| = |θ_{2}| = 12. Since |θ_{1} − θ_{2}| = |θ_{2} − θ_{1}| = 1 (the highlighted tripartition in Table 2), we have m^{ tripartition }(N_{1}, N_{2}) = 1/12.
Which measure to use?
Several distance measures, such as the Robinson-Foulds measure and the Subtree Prune and Regraft (SPR) distance, have been introduced over the years to quantify the difference between the topologies of a pair of phylogenetic trees; e.g., see [25, 26] for a description of many of these measures. Even though these measures may compute different distance values on the same pair of trees, there has been no consensus as to which measure should be used in general [30]. It may be the case that the Robinson-Foulds measure is more commonly used than the others, but this may be a mere reflection of its very low time requirements as compared to the other, more compute-intensive, measures.
Regarding the three measures for comparing networks, a scenario analogous to that in phylogenetic trees arises here: each measure gives a different quantification of the dissimilarity between two networks based on one of the three ways to characterize a given network. As shown in the examples above, some or all of these measures may compute the same value for a given pair of networks, but that may not always be the case. Tree-based comparison of networks can be viewed as a method to quantify how similar, or dissimilar, two networks are in terms of their quality as a summary of a collection of trees. In some cases, even though two networks "look different," they may be identical in terms of the trees they induce – this is the issue of indistinguishability of networks from a collection of trees that Nakhleh and colleagues discussed in [19]. In such a case, using the tree-based comparison, or equivalently the cluster-based comparison, is most appropriate. However, if the similarity/dissimilarity of two networks means something close to an isomorphism, then the tripartition-based measure is more appropriate. It is important to note, though, that none of the three measures described here is a metric on the general space of all evolutionary networks labeled by a given set of taxa.
A practical distinction among the three measures can be derived based on the methods used to infer the evolutionary history of the set of species under study. Methods such as SplitsTree [23] and NeighborNet [24] represent the evolutionary history as a set of splits, or clusters, hence making it more natural to use cluster-based comparison to study their performance. Methods such as RIATA-HGT [16] and LatTrans [14] compute evolutionary networks that are rooted, directed, acyclic graphs, where internal nodes have an evolutionary implication in terms of ancestry. For these two methods, all three measures are appropriate. When the evolutionary history of a set of species is represented as a collection of its constituent gene trees, the tree-based measure is most appropriate.
Finally, a clear distinction can be made among the methods in terms of computational requirements. In their current implementations, the tripartition-based measure is very fast in practice, taking time that is polynomial in the size of the two networks. On the other hand, the tree- and cluster-based measures are much slower, taking time that is exponential in the number of network nodes in the two networks (since these measures compute explicitly all trees inside each of the two networks). In light of recent complexity results that we obtained [31], it is very likely that no polynomial-time algorithms exist for computing the tree- and cluster-based measures in general.
Parsimony of evolutionary networks
Nakhleh and colleagues have recently formalized a maximum parsimony (MP) criterion for evolutionary networks [32] and demonstrated its utility in reconstructing evolutionary networks on both biological and synthetic data sets [33]. In this section, we describe a PhyloNet utility that allows for comparing two evolutionary networks in terms of their fitness to the evolution of a sequence data set, based on the MP criterion. We first begin with a brief review of the MP criterion, based on the exposition in [32].
The relationship between an evolutionary network and its constituent trees, as described in the background section, is the basis for the MP extension to evolutionary networks.
Definition 1 The Hamming distance between two equal-length sequences x and y, denoted by H(x, y), is the number of positions j such that x_{ j } ≠ y_{ j }.
Given a fully-labeled tree T, i.e., a tree in which each node v is labeled by a sequence s_{ v } over some alphabet Σ, we define the Hamming distance of an edge e ∈ E(T), denoted by H(e), to be H(s_{ u }, s_{ v }), where u and v are the two endpoints of e. We now define the parsimony score of a tree T.
Definition 2 The parsimony score of a fully-labeled tree T is Σ_{e ∈ E(T)}H(e). Given a set S of sequences, a maximum parsimony tree for S is a tree leaf-labeled by S and assigned labels for the internal nodes, of minimum parsimony score.
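Definitions 1 and 2 translate directly into code. The following Python sketch is illustrative only; the tree, its node names, and its sequences are hypothetical, and the tree is given simply as a list of edges together with a sequence label for every node.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(x, y))


def parsimony_score(edges, labels):
    """Sum of the Hamming distances over all edges of a fully-labeled tree."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Hypothetical fully-labeled tree: root r, internal node p, leaves A, B, C.
edges = [("r", "p"), ("r", "C"), ("p", "A"), ("p", "B")]
labels = {"r": "ACGA", "p": "ACGA", "A": "ACGT", "B": "ACGA", "C": "TCGA"}

print(parsimony_score(edges, labels))  # 2
```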
The parsimony definitions can be extended in a straightforward manner to incorporate different site substitution matrices, where different substitutions do not necessarily contribute equally to the parsimony score, by simply modifying the formula H(x, y) to reflect the weights. Let Σ be the set of states that the two sequences x and y can take (e.g., Σ = {A, C, T, G} for DNA sequences), and W the site substitution matrix such that W[σ_{1}, σ_{2}] is the weight of replacing σ_{1} by σ_{2}, for every σ_{1}, σ_{2} ∈ Σ. In particular, the identity site substitution matrix satisfies W[σ_{1}, σ_{2}] = 0 when σ_{1} = σ_{2}, and W[σ_{1}, σ_{2}] = 1 otherwise. The weighted Hamming distance between two sequences is H(x, y) = Σ_{1 ≤ i ≤ k}W[x_{ i }, y_{ i }], where k is the length of the sequences x and y. The rest of the definitions are identical to the simple Hamming distance case.
As described above, the evolutionary history of a single (non-recombining) gene is modeled by one of the trees contained inside the evolutionary network of the species containing that gene. Therefore the evolutionary history of a site s is also modeled by a tree contained inside the evolutionary network. A natural way to extend the tree-based parsimony score to fit a dataset that evolved on a network is to define the parsimony score for each site as the minimum parsimony score of that site over all trees contained inside the network.
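Definition 3 The parsimony score of a network N leaf-labeled by a set S of sequences is NCost(N, S) = $\sum_{s_{i} \in S} \min_{T \in \mathcal{T}(N)} TCost(T, s_{i})$, where the sum ranges over the sites s_{ i } of S and TCost(T, s_{ i }) is the parsimony score of site s_{ i } on tree T.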
Since large segments of DNA, rather than single sites, usually evolve together, Definition 3 can easily be extended to reflect this fact, by partitioning the sequences S into non-overlapping blocks b_{ i } of sites, rather than sites s_{ i }, and replacing s_{ i } by b_{ i } in Definition 3. This extension may be very significant if, for example, the evolutionary history of a gene includes some recombination events, and hence that evolutionary history is not a single tree. In this case, the recombination breakpoint can be detected by experimenting with different block sizes.
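The per-site version of the network parsimony score (for each site, the minimum score over the trees inside the network, summed over all sites) can be sketched as follows. This is a minimal illustration, not the PhyloNet implementation: it assumes that the trees inside the network are already given as rooted binary trees written as nested tuples, it uses Fitch's algorithm with unit substitution costs to score one site on one tree, and the names `fitch_site` and `ncost`, the two trees, and the alignment are all hypothetical.

```python
def fitch_site(tree, states):
    """Minimum number of state changes for one site on a rooted binary tree (Fitch)."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1                       # disjoint state sets force one change
        return left | right

    visit(tree)
    return changes


def ncost(trees, alignment):
    """Sum over sites of the best (minimum) per-site score among the given trees."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        site = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += min(fitch_site(tree, site) for tree in trees)
    return total


# Two hypothetical trees assumed to be contained in the same network.
t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
alignment = {"A": "AGA", "B": "ACA", "C": "TGA", "D": "TCA"}

print(ncost([t1], alignment))      # 3
print(ncost([t2], alignment))      # 3
print(ncost([t1, t2], alignment))  # 2
```

In this toy data set each tree alone needs three changes, whereas letting every site choose its better tree needs only two, which is the effect the network score is designed to capture. Scoring blocks instead of sites would amount to taking the minimum over trees once per block.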
The MP utility in PhyloNet allows the user to specify two evolutionary networks (either or both of which can be a tree) N_{1} and N_{2} and a sequence data set S, and computes the parsimony scores NCost(N_{1}, S) and NCost(N_{2}, S). The user can then compare the two scores and evaluate the fitness of the networks to the data set S based on the difference in the scores. Further, the utility allows the user, for example, to evaluate the significance of each network edge in a network N by comparing the parsimony scores of two different versions of N that contain different subsets of the network edges in N.
Reconstructing evolutionary networks from species/gene trees
Assuming incongruence among gene and species trees is the result of HGT events only, the Phylogenybased HGT Reconstruction Problem, or HGT Reconstruction Problem for short, is defined as follows:
Problem 1 (HGT Reconstruction Problem)
Input: A species tree ST and a set $\mathcal{T}$ = {T_{1},..., T_{ p }} of gene trees.
Output: An evolutionary network N, obtained by adding a minimal set of edges Ξ to ST, such that N contains every tree T_{ i } ∈ $\mathcal{T}$.
The minimization criterion is a reflection of Occam's razor: in the absence of any additional biological knowledge, HGT events should be used sparingly to explain data features otherwise explainable under a tree model. The problem of finding a minimum-cardinality set of HGT events whose occurrence on species tree ST would give rise to the gene trees in set $\mathcal{T}$ is computationally hard [34]. In [16], Nakhleh et al. introduced an accurate, polynomial-time heuristic, RIATA-HGT, for solving the HGT Reconstruction Problem for a pair of species and gene trees (in other words, RIATA-HGT currently handles the case where |$\mathcal{T}$| = 1). In a nutshell, the method computes the maximum agreement subtree [35] of the species tree and each of the gene trees, and adds HGT edges to connect all subtrees that do not appear in the maximum agreement subtree. Theoretically, RIATA-HGT may not compute the minimum-cardinality set of HGT events; nonetheless, experimental results show very good empirical performance on synthetic as well as biological data [16].
Computing multiple solutions and the graphical output
RIATA-HGT was designed originally to compute a single solution to the problem, and was mainly aimed at binary trees. Later, Than and Nakhleh [17] extended the method to compute multiple solutions and to handle non-binary trees. These two features are very significant: the former allows biologists to explore multiple potential HGT scenarios, whereas the latter allows for analyzing trees in which some edges were contracted due to inaccuracies (see [36] for example). We have conducted an experimental study to compare the performance of RIATA-HGT with LatTrans [18]. Although RIATA-HGT and LatTrans [14] have almost the same performance in terms of the number of HGT solutions and the solution size, the former runs much faster than the latter.
For a compact representation of multiple solutions, we introduce four terms:

An event: this is a single HGT edge, written in the form of X → Y, where X and Y are two nodes in the species tree.

A subsolution: this is an atomic set of events, which forms a part of an overall solution. In other words, either all or none of the events of a subsolution are taken in a solution.

A component: a set of components and/or subsolutions. Two components at the same level of decomposition are independent, in that an element of each component is needed to form a solution.

A solution: the union of a single element from each component at the highest level.
To illustrate these concepts, consider species tree (((a, b), c), (d, (e, f))) and the gene tree (((a, c), b), ((d, f), e)). Observe that each HGT event required to reconcile the two trees has both endpoints in the subtree ((a, b), c) or both endpoints in the subtree (d, (e, f)), and no HGT event has endpoints in both subtrees. In this case, RIATA-HGT divides the pair of trees into two pairs:

Pair 1: ((a, b), c) and ((a, c), b)

Pair 2: (d, (e, f)) and ((d, f), e),
The compact representation of RIATA-HGT's output in terms of subsolutions and components is especially helpful when the number of solutions is large. RIATA-HGT also has an option to display all complete solutions, in which case it enumerates every complete solution encoded by the compact representation described in the preceding paragraphs. Each solution, which is a set of HGT events, along with the species tree defines an evolutionary network, which RIATA-HGT displays in the eNewick format. For example, for the trees (((a, b), c), (d, (e, f))) and (((a, c), b), ((d, f), e)), RIATA-HGT outputs 9 different networks in the eNewick format, if RIATA-HGT's option for displaying complete solutions is on. Figure 7(b) shows the corresponding eNewick representations.
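The way complete solutions arise from independent components can be illustrated with a Cartesian product: one subsolution is chosen from each component, and the chosen events are pooled. The Python sketch below handles only a single level of decomposition (components that directly contain subsolutions, without nested components). It assumes that each of the two components in the example offers three alternative single-event subsolutions, which is consistent with the 9 networks reported above but is not stated explicitly. The event strings are placeholders and are not actual RIATA-HGT output.

```python
from itertools import product


def enumerate_solutions(components):
    """Combine one subsolution (a set of events) from every component."""
    return [frozenset().union(*choice) for choice in product(*components)]


# Placeholder events; X1..X6 and Y1..Y6 stand for nodes of the species tree.
component_1 = [{"X1 -> Y1"}, {"X2 -> Y2"}, {"X3 -> Y3"}]   # for ((a, b), c)
component_2 = [{"X4 -> Y4"}, {"X5 -> Y5"}, {"X6 -> Y6"}]   # for (d, (e, f))

solutions = enumerate_solutions([component_1, component_2])
print(len(solutions))  # 9
for events in solutions[:2]:
    print(sorted(events))
```

Storing the components instead of the expanded list keeps the output linear in the number of subsolutions, whereas the number of complete solutions grows multiplicatively with the number of components.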
From the multiple comparisons between a species tree and a set of gene trees, RIATA-HGT offers a (strict) consensus network. For each pair of species tree and gene tree, RIATA-HGT computes a set of HGT events for reconciling them. To obtain the consensus network, RIATA-HGT retains only HGT events that appear in every set of solutions for every pair of species tree and gene tree. Those events are then added to the species tree to build the consensus network.
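Read literally, the strict consensus keeps an event only if it occurs in every solution of every species tree/gene tree pair, which is a plain set intersection. The fragment below is a sketch of that reading; the function name and the placeholder events are hypothetical.

```python
def strict_consensus(solutions_per_gene_tree):
    """Events shared by every solution of every species/gene tree pair."""
    all_solutions = [set(s) for sols in solutions_per_gene_tree for s in sols]
    if not all_solutions:
        return set()
    return set.intersection(*all_solutions)


# Placeholder solutions: two for the first gene tree, one for the second.
solutions_per_gene_tree = [
    [{"X1 -> Y1", "X2 -> Y2"}, {"X1 -> Y1", "X3 -> Y3"}],
    [{"X1 -> Y1"}],
]
print(strict_consensus(solutions_per_gene_tree))  # {'X1 -> Y1'}
```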
We note here that while offering a simple summary of solutions, this way of computing consensus networks may not be appropriate in general; work is under way to address this issue more properly.
Finally, RIATA-HGT may report '[time violation?]' next to an inferred HGT X → Y. If this is the case, this indicates that node X lies on the path from Y to the root of the species tree. Theoretically, this indicates that two nodes that do not coexist in time, X and Y in this case, shared genetic material, and hence the warning of 'time violation.' However, this may be the case simply due to incomplete taxon sampling, as discussed in [19]. Therefore, the warning is issued in this case so as to alert the user that this inferred HGT edge is worth further inspection.
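The condition behind this warning, namely that X lies on the path from Y to the root, is an ancestor test in the species tree. A minimal sketch follows, assuming the species tree is stored as a child-to-parent map; the internal node names (ab, abc, ef, def, root) are invented labels for the species tree (((a, b), c), (d, (e, f))), and `is_time_violation` is a hypothetical helper, not a PhyloNet function.

```python
def is_time_violation(parent, x, y):
    """Return True if x is a proper ancestor of y, i.e., x lies on the path from y to the root."""
    node = parent.get(y)
    while node is not None:
        if node == x:
            return True
        node = parent.get(node)
    return False


# Child-to-parent map of (((a, b), c), (d, (e, f))) with invented internal names.
parent = {
    "a": "ab", "b": "ab", "ab": "abc", "c": "abc", "abc": "root",
    "e": "ef", "f": "ef", "ef": "def", "d": "def", "def": "root",
}

print(is_time_violation(parent, "abc", "a"))  # True: abc is an ancestor of a
print(is_time_violation(parent, "c", "a"))    # False: c is not an ancestor of a
```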
Assessing the support of HGT edges
Other utilities
As evident from the description of the methods above, there are fundamental correlations between phylogenetic trees and networks. Hence, many of the evolutionary network utilities use functionalities from the phylogenetic trees domain, which we have implemented and provide as standalone tools in PhyloNet:

A tool for computing the maximum agreement subtree (MAST) of a pair of trees using the algorithm of Steel and Warnow [35]. We also extended the algorithm so that it computes all MASTs of a pair of trees, and this feature is implemented as well.

A tool for computing the Robinson-Foulds distance measure between two phylogenetic trees [27].

A tool for computing the last common ancestor (lca) of a set of nodes in a phylogenetic tree.
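Of the tools listed above, the last common ancestor is the simplest to illustrate. The sketch below intersects root-ward paths in a child-to-parent map; it is an illustration of the concept rather than the algorithm used by PhyloNet's lca tool, and the internal node names are invented labels for the tree (((a, b), c), (d, (e, f))).

```python
def lca(parent, nodes):
    """Last common ancestor of a non-empty collection of nodes in a rooted tree."""

    def path_to_root(node):
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    nodes = list(nodes)
    # Candidates are ordered from the first node up to the root.
    common = path_to_root(nodes[0])
    for node in nodes[1:]:
        others = set(path_to_root(node))
        common = [ancestor for ancestor in common if ancestor in others]
    return common[0]  # the deepest node shared by all root-ward paths


parent = {
    "a": "ab", "b": "ab", "ab": "abc", "c": "abc", "abc": "root",
    "e": "ef", "f": "ef", "ef": "def", "d": "def", "def": "root",
}

print(lca(parent, ["a", "b"]))       # ab
print(lca(parent, ["e", "f", "d"]))  # def
print(lca(parent, ["a", "e"]))       # root
```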
Additionally, PhyloNet provides an implementation of the parsimony-based method RECOMP of Ruths and Nakhleh [40, 41] for detecting interspecific recombination in a sequence alignment.
Implementation
A major goal for the PhyloNet package was to make its functionality platform-independent and accessible both to end users for data analysis and to researchers designing new computational methods and techniques. In order to encompass as many platforms as possible, PhyloNet was implemented in Java. As a result, any system with the Java 2 Platform (Version 5.0 or higher) installed can run PhyloNet.
PhyloNet can be used in two ways, depending on how the functionality needs to be accessed. A command-line interface exposes all of PhyloNet's tools on a Unix or DOS command line. Each command accepts input from and writes output to text files. This allows PhyloNet's functionality to be used for manual data analysis or integrated into scripts for performing larger-scale processing. Additionally, a rich and thoroughly documented object model allows the incorporation of any of PhyloNet's functionality into existing Java programs. Also bundled are various programmatic utilities that represent trees and networks, and that read and write these data structures to and from files.
The command line interface
List of tools and their descriptions.
Tool name   Purpose
charnet     Computing clusters, trees and tripartitions in a network
cmpnets     Computing the distance between two networks
lca         Finding the last common ancestor of a set of nodes
mast        Computing the maximum agreement subtree
netpars     Scoring the parsimony of sequences on a pair of networks
riatahgt    Reconstructing HGT events from a pair of species/gene trees
rf          Computing the Robinson-Foulds tree measure
With the exception of the RECOMP tool, all the functionality of PhyloNet is independent of third-party tools. Because RECOMP must compute many maximum parsimony trees, this tool requires that PAUP* [42] be installed on the local system. To run a tool in PhyloNet, invoke the executable .jar file downloaded from the PhyloNet project homepage:
java -jar phylonet.jar charnet -i net.in -m tree
Here phylonet.jar is the executable jar downloaded from the project homepage (the file is assumed to be in the current directory where the user invokes this command), and charnet is the name of the tool, which in this example decomposes the network contained in file net.in into a set of trees. The reference manual included with the executable jar provides detailed instructions on how to run each tool in the PhyloNet package.
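For larger-scale processing, the same invocation can be assembled and launched from a program. The sketch below is a hypothetical wrapper, not part of PhyloNet: the class `PhyloNetCommandSketch` and the method `buildCommand` are illustrative names, and the option spellings `-i` and `-m` are assumed to carry leading hyphens as in the example command. The tool is launched only if both phylonet.jar and net.in exist in the current directory.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; these names are hypothetical and not part of PhyloNet's API. */
final class PhyloNetCommandSketch {

    /** Builds the argument list: java -jar <jarPath> <tool> <toolArgs...>. */
    static List<String> buildCommand(String jarPath, String tool, String... toolArgs) {
        List<String> cmd = new ArrayList<>(Arrays.asList("java", "-jar", jarPath, tool));
        cmd.addAll(Arrays.asList(toolArgs));
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        // Option names are assumed to match the example invocation in the text.
        List<String> cmd = buildCommand("phylonet.jar", "charnet", "-i", "net.in", "-m", "tree");
        System.out.println(String.join(" ", cmd));

        // Launch only when the jar and the input file are actually present.
        if (new File("phylonet.jar").isFile() && new File("net.in").isFile()) {
            int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
            System.out.println("exit code: " + exit);
        }
    }
}
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.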
Programmatic interface
Many phylogenetic methods comprise critical, but intermediate, steps in much larger methods, so there is also a need for the functionality in PhyloNet to be available for incorporation into larger programs. To this end, all of PhyloNet's functionality is exposed through an extensive set of Java classes. Each tool is contained within its own Java class and exposes a carefully constructed set of public methods that will be preserved and maintained even as PhyloNet grows. This modular design allows for the easy addition of functionality in the future without affecting existing programs that use PhyloNet as a programmatic library. In addition to exposing a consistent API, PhyloNet also provides implementations of the most common phylogenetic data structures: trees and networks. Utility classes are also included that read and write these data structures to and from files. These classes can accelerate not only incorporation of PhyloNet's functionality, but also the development of new phylogenetic functionality within other applications. As PhyloNet grows, programmatic interfaces will be added to provide access to new functionality and tools. Detailed documentation of these libraries is available in JavaDoc form on the PhyloNet website.
Conclusion
Analyzing and understanding reticulate evolutionary relationships have been hindered by the lack of software tools for conducting these tasks. The proposed software package, PhyloNet, offers an array of utilities to allow for efficient and accurate analysis of such evolutionary relationships. These utilities allow for representing networks in a compact way, characterizing networks in terms of basic building blocks and comparing them based on these characterizations, comparing networks in terms of their fitness to the evolution of a given data set of sequences under the maximum parsimony model, and reconstructing networks from species/gene trees.
The software package will help significantly in analyzing large data sets, as well as in studying the performance of evolutionary network reconstruction methods. Further, the software package offers the novel eNewick format for compact representation of evolutionary networks, a feature that allows for efficient interoperability of evolutionary network software tools.
Availability and requirements
1. Project name: PhyloNet - Phylogenetic Networks Toolkit.
2. Project home page: http://bioinfo.cs.rice.edu/phylonet/index.html.
3. Operating system: Platform independent.
4. Programming language: Java.
5. Other requirements: Java 1.5, PAUP* (for some applications).
6. License: GNU GPL.
7. Any restrictions to use by nonacademics: The GNU GPL license applies.
Declarations
Acknowledgements
The authors would like to acknowledge the very helpful comments from three anonymous reviewers, which significantly improved both the manuscript and the software tool. This work is supported in part by the Department of Energy grant DE-FG02-06ER25734, the National Science Foundation grant CCF-0622037, the George R. Brown School of Engineering Roy E. Campbell Faculty Development Award, and the Department of Computer Science at Rice University.
Authors’ Affiliations
References
1. Ochman H, Lawrence J, Groisman E: Lateral gene transfer and the nature of bacterial innovation. Nature 2000, 405(6784):299–304.
2. Bergthorsson U, Adams K, Thomason B, Palmer J: Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature 2003, 424:197–201.
3. Bergthorsson U, Richardson A, Young G, Goertzen L, Palmer J: Massive horizontal transfer of mitochondrial genes from diverse land plant donors to basal angiosperm Amborella. Proc Natl Acad Sci U S A 2004, 101(51):17747–17752.
4. Mower J, Stefanovic S, Young G, Palmer J: Gene transfer from parasitic to host plants. Nature 2004, 432:165–166.
5. Posada D, Crandall K, Holmes E: Recombination in Evolutionary Genomics. Annu Rev Genet 2002, 36:75–97.
6. Posada D, Crandall K: The effect of recombination on the accuracy of phylogeny estimation. J Mol Evol 2002, 54(3):396–402.
7. Ellstrand N, Whitkus R, Rieseberg L: Distribution of spontaneous plant hybrids. Proc Natl Acad Sci U S A 1996, 93(10):5090–5093.
8. Buerkle C, Morris R, Asmussen M, Rieseberg L: The likelihood of homoploid hybrid speciation. Heredity 2000, 84(4):441–451.
9. Rieseberg L, Baird S, Gardner K: Hybridization, introgression, and linkage evolution. Plant Molecular Biology 2000, 42:205–224.
10. Linder C, Rieseberg L: Reconstructing patterns of reticulate evolution in plants. Am J Bot 2004, 91:1700–1708.
11. Huson DH, Bryant D: Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 2006, 23(2):254–267.
12. Beiko R, Hamilton N: Phylogenetic identification of lateral genetic transfer events. BMC Evolutionary Biology 2006, 6:
13. MacLeod D, Charlebois R, Doolittle F, Bapteste E: Deduction of probable events of lateral gene transfer through comparison of phylogenetic trees by recursive consolidation and rearrangement. BMC Evol Biol 2005, 5(1):27.
14. Hallett M, Lagergren J: Efficient algorithms for lateral gene transfer problems. In Proc 5th Ann Int'l Conf Comput Mol Biol (RECOMB01). New York: ACM Press; 2001:149–156.
15. Makarenkov V: T-REX: Reconstructing and visualizing phylogenetic trees and reticulation networks. Bioinformatics 2001, 17(7):664–668.
16. Nakhleh L, Ruths D, Wang L: RIATA-HGT: A fast and accurate heuristic for reconstructing horizontal gene transfer. In Proceedings of the Eleventh International Computing and Combinatorics Conference (COCOON 05). Edited by: Wang L. 2005, 84–93. [LNCS #3595]
17. Than C, Nakhleh L: SPR-based tree reconciliation: Nonbinary trees and multiple solutions. Proceedings of the Sixth Asia Pacific Bioinformatics Conference (APBC) 2008, 251–260.
18. Than C, Ruths D, Innan H, Nakhleh L: Confounding factors in HGT detection: Statistical error, coalescent effects, and multiple solutions. Journal of Computational Biology 2007, 14:517–535.
19. Moret B, Nakhleh L, Warnow T, Linder C, Tholse A, Padolina A, Sun J, Timme R: Phylogenetic networks: modeling, reconstructibility, and accuracy. IEEE/ACM Trans Comput Biol Bioinform 2004, 1(1):13–23.
20. Baroni M, Semple C, Steel M: A framework for representing reticulate evolution. Annals of Combinatorics 2004, 8(4):391–408.
21. Felsenstein J: The Newick Tree Format. 1986. [http://evolution.genetics.washington.edu/phylip/newicktree.html]
22. Morin MM, Moret BME: NETGEN: generating phylogenetic networks with diploid hybrids. Bioinformatics 2006, 22(15):1921–1923.
23. Huson D: SplitsTree: a program for analyzing and visualizing evolutionary data. Bioinformatics 1998, 14:68–73.
24. Bryant D, Moulton V: NeighborNet: An agglomerative method for the construction of planar phylogenetic networks. Proc 2nd Int'l Workshop Algorithms in Bioinformatics (WABI02), Volume 2452 of Lecture Notes in Computer Science, Springer-Verlag 2002, 375–391.
25. Felsenstein J: Inferring Phylogenies. Sunderland, MA: Sinauer Associates, Inc; 2003.
26. Semple C, Steel M: Phylogenetics. Oxford Lecture Series in Mathematics and its Applications 24, Oxford University Press; 2003.
27. Robinson D, Foulds L: Comparison of phylogenetic trees. Math Biosciences 1981, 53:131–147.
28. Nakhleh L, Sun J, Warnow T, Linder R, Moret B, Tholse A: Towards the development of computational tools for evaluating phylogenetic network reconstruction methods. Proc 8th Pacific Symp on Biocomputing (PSB03), World Scientific Pub 2003, 315–326.
29. Nakhleh L, Warnow T, Linder C: Reconstructing reticulate evolution in species: theory and practice. Proc 8th Ann Int'l Conf Comput Mol Biol (RECOMB04) 2004, 337–346.
30. Penny D, Hendy MD: The use of tree comparison metrics. Systematic Zoology 1985, 34:75–82.
31. Kanj I, Nakhleh L, Than C, Xia G: Seeing the trees and their branches in the network is hard. Theoretical Computer Science 2008, 401:153–164.
32. Nakhleh L, Jin G, Zhao F, Mellor-Crummey J: Reconstructing phylogenetic networks using maximum parsimony. Proc IEEE Comput Syst Bioinform Conf 2005, 93–102.
33. Jin G, Nakhleh L, Snir S, Tuller T: Inferring Phylogenetic Networks by the Maximum Parsimony Criterion: A Case Study. Mol Biol Evol 2007, 24:324–337.
34. Bordewich M, Semple C: On the computational complexity of the rooted subtree prune and regraft distance. Annals of Combinatorics 2005, 8(4):409–423.
35. Steel M, Warnow T: Kaikoura tree theorems: computing the maximum agreement subtree. Information Processing Letters 1993, 48:77–82.
36. Ruths D, Nakhleh L: Techniques for Assessing Phylogenetic Branch Support: A Performance Study. Proceedings of the 4th Asia Pacific Bioinformatics Conference 2006, 187–196.
37. Than C, Jin G, Nakhleh L: Integrating sequence and topology for efficient and accurate detection of horizontal gene transfer. Proceedings of the Sixth RECOMB Comparative Genomics Satellite Workshop 2008.
38. Bergthorsson U, Richardson AO, Young GJ, Goertzen LR, Palmer JD: Massive horizontal transfer of mitochondrial genes from diverse land plant donors to the basal angiosperm Amborella. Proc Natl Acad Sci USA 2004, 101(51):17747–17752.
39. Shimodaira H, Hasegawa M: Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Molecular Biology and Evolution 1999, 16:1114–1116.
40. Ruths D, Nakhleh L: Recombination and phylogeny: effects and detection. Int J Bioinform Res Appl 2005, 1(2):202–212.
41. Ruths D, Nakhleh L: RECOMP: A Parsimony-based Method for Detecting Recombination. Proceedings of the 4th Asia Pacific Bioinformatics Conference 2006, 59–68.
42. Swofford DL: PAUP*: Phylogenetic Analysis Using Parsimony (and Other Methods). Version 4.0. Sinauer Associates, Sunderland, Massachusetts; 1996.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.