A Perl package and an alignment tool for phylogenetic networks

Background: Phylogenetic networks are a generalization of phylogenetic trees that allow for the representation of evolutionary events acting at the population level, like recombination between genes, hybridization between lineages, and lateral gene transfer. While most phylogenetics tools implement a wide range of algorithms on phylogenetic trees, there exist only a few applications to work with phylogenetic networks. None of them is an open-source library, and none allows for the comparative analysis of phylogenetic networks by computing distances between them or aligning them.

Results: In order to improve this situation, we have developed a Perl package that relies on the BioPerl bundle and implements many algorithms on phylogenetic networks. We have also developed a Java applet that makes use of the aforementioned Perl package and allows users to run simple experiments with phylogenetic networks without having to write a program or Perl script themselves.

Conclusion: The Perl package is available as part of the BioPerl bundle, and can also be downloaded. A web-based application is also available (see Availability and requirements). The Perl package includes full documentation of all its features.


Background
We briefly recall some definitions and results from [2] on phylogenetic networks.
A phylogenetic network on a set S of taxa is any rooted directed acyclic graph whose leaves (those nodes without outgoing edges) are bijectively labeled by the set S.
Let N = (V, E) be a phylogenetic network on S. A node u ∈ V is said to be a tree node if it has at most one incoming edge; otherwise it is called a hybrid node. A phylogenetic network on S is a tree-child phylogenetic network if every node either is a leaf or has at least one child that is a tree node.
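As an illustration of these definitions, the following minimal Python sketch classifies the nodes of a network by in-degree and checks the tree-child condition. The dictionary representation (node mapped to its list of children) and the function names `classify_nodes` and `is_tree_child` are assumptions made for this example; they are not part of the Perl package described here.

```python
from collections import Counter

def classify_nodes(children):
    """Split nodes into (tree_nodes, hybrid_nodes) according to in-degree."""
    indegree = Counter()
    for node, kids in children.items():
        indegree[node] += 0          # make sure every node is recorded
        for kid in kids:
            indegree[kid] += 1
    tree_nodes = {v for v, d in indegree.items() if d <= 1}
    hybrid_nodes = {v for v, d in indegree.items() if d >= 2}
    return tree_nodes, hybrid_nodes

def is_tree_child(children):
    """True if every non-leaf node has at least one child that is a tree node."""
    tree_nodes, _ = classify_nodes(children)
    for node, kids in children.items():
        if kids and not any(k in tree_nodes for k in kids):
            return False
    return True

# Hypothetical encoding of a small network with one hybrid node h
# (parents x and y, single child 2) over the leaves 1, 2, 3.
network = {
    "r": ["x", "y"],
    "x": ["1", "h"],
    "y": ["h", "3"],
    "h": ["2"],
    "1": [], "2": [], "3": [],
}

# A network that is not tree-child: the only child of a is the hybrid node h.
not_tree_child = {
    "r": ["a", "b"],
    "a": ["h"],
    "b": ["h", "3"],
    "h": ["2"],
    "2": [], "3": [],
}
```

In the first network every internal node has a leaf or another tree node among its children, so the check succeeds; in the second, node a violates the condition.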
Let $S = \{\ell_1, \ldots, \ell_n\}$ be the set of leaves. We define the µ-vector of a node $u \in V$ as the vector $\mu(u) = (m_1(u), \ldots, m_n(u))$, where $m_i(u)$ is the number of different paths from $u$ to the leaf $\ell_i$. The multiset $\mu(N) = \{\mu(v) \mid v \in V\}$ is called the µ-representation of $N$ and, provided that $N$ is a tree-child phylogenetic network, it turns out to completely characterize $N$, up to isomorphism, among all tree-child phylogenetic networks on $S$.
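The µ-vectors can be computed bottom-up, since the number of paths from a node to a leaf is the sum of the corresponding counts over its children. The sketch below is one possible way to do this in Python; the dictionary representation of the network and the names `mu_vectors` and `mu_representation` are assumptions of this example, not the interface of the Perl package.

```python
from collections import Counter

def mu_vectors(children, leaves):
    """Return {node: mu-vector}; entry i counts the paths from node to leaves[i]."""
    memo = {}

    def mu(node):
        if node in memo:
            return memo[node]
        kids = children.get(node, [])
        if not kids:
            # A leaf reaches only itself, through the trivial path.
            vec = tuple(1 if node == leaf else 0 for leaf in leaves)
        else:
            # Paths from a node are the paths from its children, added up.
            vec = tuple(sum(col) for col in zip(*(mu(k) for k in kids)))
        memo[node] = vec
        return vec

    for node in children:
        mu(node)
    return memo

def mu_representation(children, leaves):
    """The multiset of mu-vectors, stored as a Counter."""
    return Counter(mu_vectors(children, leaves).values())

# Hypothetical example: hybrid node h has parents x and y and child 2.
network = {
    "r": ["x", "y"],
    "x": ["1", "h"],
    "y": ["h", "3"],
    "h": ["2"],
    "1": [], "2": [], "3": [],
}
vectors = mu_vectors(network, ["1", "2", "3"])
```

Here the root obtains the vector (1, 2, 1), because leaf 2 can be reached through either parent of the hybrid node.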
This allows us to define a distance on the set of tree-child phylogenetic networks on $S$: the µ-distance between two given networks $N_1$ and $N_2$ is the cardinality of the symmetric difference of their µ-representations,
$$d_\mu(N_1, N_2) = |\mu(N_1) \,\triangle\, \mu(N_2)|.$$
This defines a true distance, and when $N_1$ and $N_2$ are phylogenetic trees, it coincides with the well-known partition distance [8].
This representation also allows us to define an optimal alignment between two tree-child phylogenetic networks on $S$, with $n = |S|$. Given two such networks $N_1 = (V_1, E_1)$ and $N_2 = (V_2, E_2)$ where, for the sake of simplicity, we assume $|V_1| \le |V_2|$, an alignment between them is an injective mapping $M : V_1 \to V_2$. The weight of an alignment $M$ is
$$w(M) = \sum_{v \in V_1} \bigl( \|\mu(v) - \mu(M(v))\| + \chi(v, M(v)) \bigr),$$
where $\|\cdot\|$ stands for the Manhattan norm of a vector and $\chi(u, v)$ is 0 if both $u$ and $v$ are tree nodes or both are hybrid nodes, and $1/(2n)$ if one of them is a tree node and the other one is a hybrid node. An optimal alignment is, then, an alignment with minimal weight.
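To make the distance and the alignment weight concrete, here is a small Python sketch. It assumes that the weight of an alignment is the sum, over matched pairs of nodes, of the Manhattan distance between their µ-vectors plus the penalty χ, and that an alignment is an injective mapping from the nodes of the smaller network into those of the larger one. The function names, the input format (dictionaries of precomputed µ-vectors) and the brute-force search are assumptions of this example; they are meant only for very small networks and are not the algorithm used by the Perl package.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def mu_distance(mu1, mu2):
    """Size of the multiset symmetric difference of two mu-representations."""
    c1, c2 = Counter(mu1.values()), Counter(mu2.values())
    return sum((c1 - c2).values()) + sum((c2 - c1).values())

def pair_cost(u, v, mu1, mu2, hybrids1, hybrids2, n):
    """Manhattan distance between mu-vectors plus the penalty chi(u, v)."""
    manhattan = sum(abs(a - b) for a, b in zip(mu1[u], mu2[v]))
    same_kind = (u in hybrids1) == (v in hybrids2)
    chi = 0 if same_kind else Fraction(1, 2 * n)
    return manhattan + chi

def optimal_alignment(mu1, hybrids1, mu2, hybrids2):
    """Brute-force search; assumes len(mu1) <= len(mu2) and tiny inputs."""
    n = len(next(iter(mu1.values())))      # number of leaves
    nodes1 = list(mu1)
    best = None
    for image in permutations(mu2, len(nodes1)):
        weight = sum(pair_cost(u, v, mu1, mu2, hybrids1, hybrids2, n)
                     for u, v in zip(nodes1, image))
        if best is None or weight < best[0]:
            best = (weight, dict(zip(nodes1, image)))
    return best

# Hypothetical inputs: a tree ((1,2),3) and a network with one hybrid node h.
tree_mu = {"1": (1, 0, 0), "2": (0, 1, 0), "3": (0, 0, 1),
           "a": (1, 1, 0), "t": (1, 1, 1)}
net_mu = {"1": (1, 0, 0), "2": (0, 1, 0), "3": (0, 0, 1), "h": (0, 1, 0),
          "x": (1, 1, 0), "y": (0, 1, 1), "r": (1, 2, 1)}

distance = mu_distance(tree_mu, net_mu)
weight, matching = optimal_alignment(tree_mu, set(), net_mu, {"h"})
```

For larger networks the exhaustive search is impractical; a minimum-weight bipartite matching algorithm would be the natural replacement for the loop over permutations.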

The Extended Newick Format
The eNewick (for "extended Newick") string defining a phylogenetic network appeared in the packages PhyloNet [7] and NetGen [5] related to phylogenetic networks, with some differences between them. The former encodes a phylogenetic network with k hybrid nodes as a series of k trees in Newick format, while the latter encodes it as a single tree in Newick format but with k repeated nodes.
Whereas the Perl module we introduce here accepts both formats as input, a complete standard for eNewick is implemented, based mainly on NetGen and following the suggestions of D. Huson and M. M. Morin (among others), to make it as complete as possible. The adopted standard has the practical advantage of encoding a whole phylogenetic network as a single string, and it also includes mandatory tags to distinguish among the various hybrid nodes in the network.
The procedure to obtain the eNewick string representing a phylogenetic network $N$ goes as follows. Let $\{H_1, \ldots, H_m\}$ be the set of hybrid nodes of $N$, ordered in any fixed way. For each hybrid node $H = H_i$, say with parents $u_1, u_2, \ldots, u_k$ and children $v_1, v_2, \ldots, v_\ell$: split $H$ into $k$ different nodes; let the first copy be a child of $u_1$ and have all of $v_1, v_2, \ldots, v_\ell$ as its children; let the other copies be children of $u_2, \ldots, u_k$ (one for each) and have no children. Then label each of the copies of $H$ as

[label]#[type]tag[:branch_length]

where the parameters are:
• label (optional): string providing a labelling for the node;
• type (optional): string indicating whether the node $H$ corresponds to a hybridization (indicated by H) or a lateral gene transfer (indicated by LGT) event; note that other types can be considered in the future;
• tag (mandatory): integer $i$ identifying the node $H = H_i$;
• branch_length (optional): number giving the length of the branch from the copy of $H$ under consideration to its parent.
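The label syntax above can be recognized with a regular expression. The following Python sketch is an illustration only: the pattern, the function name `parse_hybrid_label` and the choice to restrict the type to letters and the branch length to a plain decimal number are assumptions of this example, not part of the standard.

```python
import re

HYBRID_LABEL = re.compile(
    r"^(?P<label>[^#:]*)"                              # optional node label
    r"#(?P<type>[A-Za-z]*)"                            # optional type, e.g. H or LGT
    r"(?P<tag>\d+)"                                    # mandatory integer tag
    r"(?::(?P<length>\d+(?:\.\d+)?(?:[eE][-+]?\d+)?))?$"  # optional branch length
)

def parse_hybrid_label(text):
    """Split '[label]#[type]tag[:branch_length]' into its parts, or return None."""
    match = HYBRID_LABEL.match(text)
    if match is None:
        return None                                    # not a hybrid node label
    length = match.group("length")
    return {
        "label": match.group("label") or None,
        "type": match.group("type") or None,
        "tag": int(match.group("tag")),
        "branch_length": float(length) if length is not None else None,
    }
```

For instance, `parse_hybrid_label("h#H1")` yields the label h, the type H and the tag 1, while an ordinary node label without the `#` separator yields `None`.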
In this way, we get a tree whose set of leaves is the set of leaves of the original network together with the set of hybrid nodes (possibly repeated). Then, the Newick string of the obtained tree (note that some internal nodes will be labeled and some leaves will be repeated) is the eNewick string of the phylogenetic network. The leftmost occurrence of each hybrid node in an eNewick string corresponds to the full description of the network rooted at that node, and although node labels are optional, all labeled occurrences of a hybrid node in an eNewick string must carry the same label.
Consider, for example, the phylogenetic network depicted together with its decomposition in Figure 1. The eNewick string for this network would be ((1,(2)#H1),(#H1,3)); or ((1,(2)h#H1)x,(h#H1,3)y)r; if all internal nodes are labeled. The leftmost occurrence of the hybrid node in the latter string corresponds to the full description of the network rooted at that node: (2)h#H1.
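The procedure can be sketched as a depth-first traversal that expands each hybrid node only the first time it is reached and writes a childless copy on every later visit. The Python code below is a minimal illustration under stated assumptions: the network is a dictionary from nodes to child lists, `hybrid_tags` maps each hybrid node to its integer tag, every hybrid node is written with type H, and branch lengths are omitted. The function name `to_enewick` is hypothetical and is not the interface of the Perl package.

```python
def to_enewick(children, root, hybrid_tags, label_internal=True):
    """Serialize a network; hybrid nodes are expanded at their leftmost occurrence."""
    expanded = set()

    def name(node):
        is_leaf = not children.get(node)
        base = node if (is_leaf or label_internal) else ""
        if node in hybrid_tags:
            return f"{base}#H{hybrid_tags[node]}"   # type fixed to H in this sketch
        return base

    def visit(node):
        kids = children.get(node, [])
        if node in hybrid_tags:
            if node in expanded:
                return name(node)                    # later copies have no children
            expanded.add(node)
        if not kids:
            return name(node)
        return "(" + ",".join(visit(k) for k in kids) + ")" + name(node)

    return visit(root) + ";"

# The example network: hybrid node h with parents x and y and child 2.
network = {
    "r": ["x", "y"],
    "x": ["1", "h"],
    "y": ["h", "3"],
    "h": ["2"],
    "1": [], "2": [], "3": [],
}

labeled = to_enewick(network, "r", {"h": 1})
unlabeled = to_enewick(network, "r", {"h": 1}, label_internal=False)
```

With this input the two calls reproduce the labeled and unlabeled strings given above for the example network.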
Obviously, the procedure to recover a network from its eNewick string is as simple as recovering the tree and identifying those nodes that are labeled as hybrid nodes with the same identifier.
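A minimal Python sketch of this recovery step follows. It parses the string as an ordinary Newick tree and merges all nodes that carry the same hybrid tag. The function name `parse_enewick`, the generated names for unlabeled internal nodes (`_n1`, `_n2`, ...) and the use of `#` plus the tag as the identifier of a hybrid node are assumptions of this example; branch lengths are discarded, and quoted labels or whitespace are not handled.

```python
import re

def parse_enewick(text):
    """Return (children, root); children maps node id -> list of child ids."""
    text = text.strip().rstrip(";")
    pos = 0
    counter = 0
    children = {}

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    def node_id(raw):
        nonlocal counter
        raw = raw.split(":")[0]              # drop an optional branch length
        match = re.search(r"#[A-Za-z]*(\d+)$", raw)
        if match:
            return "#" + match.group(1)      # copies of one hybrid share an id
        if raw:
            return raw
        counter += 1
        return f"_n{counter}"                # name for an unlabeled internal node

    def parse_node():
        nonlocal pos
        kids = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                         # consume "("
            kids.append(parse_node())
            while text[pos] == ",":
                pos += 1                     # consume ","
                kids.append(parse_node())
            pos += 1                         # consume ")"
        node = node_id(read_label())
        # Childless copies of a hybrid node add nothing to its child list.
        children.setdefault(node, []).extend(kids)
        return node

    root = parse_node()
    return children, root

children, root = parse_enewick("((1,(2)h#H1)x,(h#H1,3)y)r;")
```

After parsing, the hybrid node appears once in `children`, as a child of both x and y and with the single child 2.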
Notice that gene transfer events can be represented in a unique way as hybrid nodes. Consider, for example, the lateral gene transfer event depicted in Figure 2, where a gene is transferred from species 2 to species 3 after the divergence of species 1 from species 2. The eNewick string ((1,(2,(3)h#LGT1)y)x,h#LGT1)r; describes such a phylogenetic network. A program interpreting the eNewick string can use the information on node types in different ways; for instance, to render tree nodes circled, hybridization nodes boxed, and lateral gene transfer nodes as arrows between edges.

Figure 1: A phylogenetic network N (left), and tree (right) associated to N for computing its eNewick string.