A Perl package and an alignment tool for phylogenetic networks
 Gabriel Cardona^{1},
 Francesc Rosselló^{1} and
 Gabriel Valiente^{2}
https://doi.org/10.1186/1471-2105-9-175
© Cardona et al; licensee BioMed Central Ltd. 2008
Received: 20 November 2007
Accepted: 27 March 2008
Published: 27 March 2008
Abstract
Background
Phylogenetic networks are a generalization of phylogenetic trees that allow for the representation of evolutionary events acting at the population level, like recombination between genes, hybridization between lineages, and lateral gene transfer. While most phylogenetics tools implement a wide range of algorithms on phylogenetic trees, only a few applications work with phylogenetic networks. None of them is an open-source library, and none supports the comparative analysis of phylogenetic networks by computing distances between them or aligning them.
Results
To improve this situation, we have developed a Perl package that relies on the BioPerl bundle and implements many algorithms on phylogenetic networks. We have also developed a Java applet that makes use of this Perl package and allows users to run simple experiments with phylogenetic networks without having to write a program or Perl script themselves.
Conclusion
The Perl package is available as part of the BioPerl bundle, and can also be downloaded. A web-based application is also available (see Availability and requirements). The Perl package includes full documentation of all its features.
Background
We briefly recall below some definitions and results from [4] on phylogenetic networks. See [5] for an introduction to reticulation in phylogenetic analysis.
A phylogenetic network on a set S of taxa is any rooted directed acyclic graph whose leaves (those nodes without outgoing edges) are bijectively labeled by the set S.
Let N = (V, E) be a phylogenetic network on S. A node u ∈ V is said to be a tree node if it has at most one incoming edge; otherwise it is called a hybrid node. A phylogenetic network on S is a tree-child phylogenetic network if every node either is a leaf or has at least one child that is a tree node. Tree-child phylogenetic networks include galled trees [6, 7] as a particular case.
Let S = {ℓ_{1}, ..., ℓ_{n}} be the set of leaves. We define the μ-vector of a node u ∈ V as the vector μ(u) = (m_{1}(u), ..., m_{n}(u)), where m_{i}(u) is the number of different paths from u to the leaf ℓ_{i}. The multiset μ(N) = {μ(v) | v ∈ V} is called the μ-representation of N and, provided that N is a tree-child phylogenetic network, it turns out to completely characterize N, up to isomorphism, among all tree-child phylogenetic networks on S.
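The definition of the μ-vector can be illustrated with a minimal Python sketch. It is independent of the Perl package: the function name mu_vectors, the dictionary-of-children encoding, and the small example network are all assumptions made for illustration. The sketch relies on the observation that the number of paths from a node to a leaf is the sum of the corresponding path counts of its children.

```python
from functools import lru_cache


def mu_vectors(children, leaves):
    """Return {node: mu-vector} for a rooted DAG given as {node: [children]}.

    `leaves` fixes the order of the coordinates of every mu-vector.
    """
    index = {leaf: i for i, leaf in enumerate(leaves)}

    @lru_cache(maxsize=None)
    def mu(node):
        kids = children.get(node, [])
        if not kids:
            # A leaf has exactly one path (the trivial one) to itself.
            return tuple(1 if i == index[node] else 0 for i in range(len(leaves)))
        # Paths from `node` are the paths from each child, added coordinate-wise.
        return tuple(sum(column) for column in zip(*(mu(k) for k in kids)))

    nodes = set(children) | {k for kids in children.values() for k in kids}
    return {v: mu(v) for v in nodes}


# Hypothetical tree-child network on leaves a, b, c with one hybrid node h,
# whose parents are u and v.
network = {
    "r": ["u", "v"],
    "u": ["a", "h"],
    "v": ["h", "c"],
    "h": ["b"],
}
vectors = mu_vectors(network, ["a", "b", "c"])
print(vectors["r"])  # (1, 2, 1): two distinct paths lead from the root to b
```

In this example the hybrid node h and the leaf b share the μ-vector (0, 1, 0), which is why the μ-representation has to be treated as a multiset rather than a set.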
This allows us to define a distance on the set of tree-child phylogenetic networks on S: the μ-distance between two given networks N_{1} and N_{2} is the cardinality of the symmetric difference of their μ-representations, taken as multisets,
d_{μ}(N_{1}, N_{2}) = |μ(N_{1}) Δ μ(N_{2})|.
This defines a true distance, and when N_{1} and N_{2} are phylogenetic trees, it coincides with the well-known partition distance [8].
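As a rough illustration of the μ-distance, the following Python sketch counts the elements of the multiset symmetric difference of two μ-representations. The function name mu_distance and the hand-written μ-representations are assumptions for this example; they are not taken from the Perl package.

```python
from collections import Counter


def mu_distance(mu_rep_1, mu_rep_2):
    """Size of the multiset symmetric difference of two mu-representations.

    Each argument is an iterable of mu-vectors (tuples), with repetitions.
    """
    c1, c2 = Counter(mu_rep_1), Counter(mu_rep_2)
    # Counter subtraction keeps only positive counts, so (c1 - c2) + (c2 - c1)
    # is the symmetric difference with multiplicities.
    return sum(((c1 - c2) + (c2 - c1)).values())


# Hypothetical inputs on leaves a, b, c (coordinates in that order).
tree_1 = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]  # ((a,b),c)
tree_2 = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]  # (a,(b,c))
# A network with one hybrid node whose only child is the leaf b.
network = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0),
           (1, 1, 0), (0, 1, 1), (1, 2, 1)]

print(mu_distance(tree_1, tree_2))   # 2
print(mu_distance(tree_1, network))  # 4
```

For the two trees, the only differing vectors are (1, 1, 0) and (0, 1, 1), which correspond to the clusters {a, b} and {b, c}, so the value 2 agrees with counting the clusters present in one tree but not the other.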
In the weight of an alignment between two networks, ‖·‖ stands for the Manhattan norm of a vector and χ(u, v) is 0 if u and v are both tree nodes or both hybrid nodes, and 1/(2n) if one of them is a tree node and the other one is a hybrid node. An optimal alignment is then an alignment with minimal weight, which can be computed using the Hungarian algorithm [9].
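A small Python sketch can make the notion of a minimal-weight alignment concrete. Two points are assumptions rather than statements from the text: the weight of a matched pair (u, v) is taken to be ‖μ(u) − μ(v)‖ + χ(u, v), summed over all matched pairs, and an alignment is taken to be an injective map from the nodes of the smaller network into those of the larger one. For brevity the sketch searches all injective maps by brute force instead of using the Hungarian algorithm, so it is only usable on very small inputs. All names are hypothetical.

```python
from itertools import permutations


def pair_weight(mu_u, mu_v, hybrid_u, hybrid_v, n):
    """Assumed weight of matching u with v: Manhattan distance plus chi."""
    manhattan = sum(abs(a - b) for a, b in zip(mu_u, mu_v))
    # chi is 0 for nodes of the same kind and 1/(2n) for a tree/hybrid pair.
    chi = 0.0 if hybrid_u == hybrid_v else 1.0 / (2 * n)
    return manhattan + chi


def optimal_alignment(nodes_1, nodes_2, n):
    """Brute-force search for a minimal-weight injective map.

    nodes_*: {name: (mu_vector, is_hybrid)}, with len(nodes_1) <= len(nodes_2).
    n: number of leaves. Stands in for the Hungarian algorithm on tiny inputs.
    """
    names_1, names_2 = list(nodes_1), list(nodes_2)
    best_weight, best_map = None, None
    for image in permutations(names_2, len(names_1)):
        weight = sum(
            pair_weight(nodes_1[u][0], nodes_2[v][0],
                        nodes_1[u][1], nodes_2[v][1], n)
            for u, v in zip(names_1, image)
        )
        if best_weight is None or weight < best_weight:
            best_weight, best_map = weight, dict(zip(names_1, image))
    return best_weight, best_map


# Hypothetical inputs on leaves a, b, c: a tree and a one-hybrid network.
tree = {
    "a": ((1, 0, 0), False), "b": ((0, 1, 0), False), "c": ((0, 0, 1), False),
    "ab": ((1, 1, 0), False), "root": ((1, 1, 1), False),
}
network = {
    "a": ((1, 0, 0), False), "b": ((0, 1, 0), False), "c": ((0, 0, 1), False),
    "h": ((0, 1, 0), True), "u": ((1, 1, 0), False),
    "v": ((0, 1, 1), False), "r": ((1, 2, 1), False),
}
weight, mapping = optimal_alignment(tree, network, n=3)
print(weight, mapping["ab"])
```

In this example the leaf b is matched with the leaf b rather than with the hybrid node h, although both have μ-vector (0, 1, 0), because the χ term penalizes matching a tree node with a hybrid node.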
Implementation and results
The extended Newick format
The eNewick (for "extended Newick") string defining a phylogenetic network appeared in the packages PhyloNet [10] and NetGen [11] related to phylogenetic networks, with some differences between them. The former encodes a phylogenetic network with k hybrid nodes as a series of k trees in Newick format, while the latter encodes it as a single tree in Newick format but with k repeated nodes.
The Perl module we introduce here accepts both formats as input. It also implements a complete standard for eNewick, based mainly on NetGen and following the suggestions of D. Huson and M. M. Morin (among others) to make it as complete as possible. The adopted standard has the practical advantage of encoding a whole phylogenetic network as a single string, and it also includes mandatory tags to distinguish among the various hybrid nodes in the network.
The procedure to obtain the eNewick string representing a phylogenetic network N goes as follows: Let {H_{1}, ..., H_{m}} be the set of hybrid nodes of N, ordered in any fixed way. For each hybrid node H = H_{i}, say with parents u_{1}, u_{2}, ..., u_{k} and children v_{1}, v_{2}, ..., v_{ℓ}: split H into k different nodes; let the first copy be a child of u_{1} and have all v_{1}, v_{2}, ..., v_{ℓ} as its children; let the other copies be children of u_{2}, ..., u_{k} (one for each) and have no children. Label each of the copies of H as
[label]#[type]tag[:branch_length]
where the parameters are:

label (optional): string providing a label for the node;

type (optional): string indicating whether the node H corresponds to a hybridization (indicated by H) or a lateral gene transfer (indicated by LGT) event; note that other types can be considered in the future;

tag (mandatory): integer i identifying the node H = H_{i};

branch_length (optional): number giving the length of the branch from the copy of H under consideration to its parent.
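The label syntax above can be checked with a short Python sketch. The regular expression and the function name parse_hybrid_label are assumptions made for illustration, not the parser used by the Perl module; in particular, the sketch assumes that the type is a run of letters and that the tag is a run of digits.

```python
import re

# [label]#[type]tag[:branch_length], with optional parts in square brackets.
HYBRID_LABEL = re.compile(
    r"^(?P<label>[^#:(),;]*)"           # optional node label
    r"#(?P<type>[A-Za-z]*)"             # optional type, e.g. H or LGT
    r"(?P<tag>\d+)"                     # mandatory integer tag
    r"(?::(?P<length>[0-9.eE+-]+))?$"   # optional branch length
)


def parse_hybrid_label(text):
    """Split one hybrid-node label into its parts, or return None."""
    match = HYBRID_LABEL.match(text)
    if match is None:
        return None  # not a hybrid-node label (no '#tag' part)
    length = match.group("length")
    return {
        "label": match.group("label") or None,
        "type": match.group("type") or None,
        "tag": int(match.group("tag")),
        "branch_length": float(length) if length else None,
    }


print(parse_hybrid_label("h#H1:0.5"))
print(parse_hybrid_label("#2"))
print(parse_hybrid_label("leaf"))
```

Only the tag is required, so a bare label such as #2 is accepted, while an ordinary leaf name without the # separator is rejected.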
We obtain a tree from this procedure whose set of leaves is the set of leaves of the original network together with the set of hybrid nodes (possibly repeated). The Newick string of the obtained tree (note that some internal nodes will be labeled and some leaves will be repeated) is the eNewick string of the phylogenetic network. The leftmost occurrence of each hybrid node in an eNewick string corresponds to the full description of the network rooted at that node. Although node labels are optional, all labeled occurrences of a hybrid node in an eNewick string must carry the same label.
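The splitting procedure described above can be sketched in Python as a depth-first traversal that expands each hybrid node the first time it is visited and writes every later copy as a childless node. The function name enewick, the dictionary-of-children encoding, and the example network are assumptions for illustration; the sketch also assumes that every hybrid node is of type H and omits branch lengths.

```python
def enewick(children, root, hybrid_tags):
    """Write a rooted DAG as an eNewick string.

    children: {node: [children]}; hybrid_tags: {hybrid node: integer tag}.
    """
    expanded = set()  # hybrid nodes whose children have already been written

    def write(node):
        kids = children.get(node, [])
        if node in hybrid_tags:
            name = f"{node}#H{hybrid_tags[node]}"
            if node in expanded:
                return name  # later copies are written without children
            expanded.add(node)
        else:
            name = node
        if not kids:
            return name
        return "(" + ",".join(write(k) for k in kids) + ")" + name

    return write(root) + ";"


# Hypothetical network on leaves a, b, c with one hybrid node h.
network = {
    "r": ["u", "v"],
    "u": ["a", "h"],
    "v": ["h", "c"],
    "h": ["b"],
}
text = enewick(network, "r", {"h": 1})
print(text)  # ((a,(b)h#H1)u,(h#H1,c)v)r;
```

The leftmost occurrence of h#H1 carries the subtree below the hybrid node, and the second occurrence appears as a leaf with the same label and tag.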
The procedure to recover a network from its eNewick string simply requires recovering the tree and identifying those nodes that are labeled as hybrid nodes with the same identifier.
The Perl module
The Perl module Bio::PhyloNetwork, available as part of the BioPerl bundle [12], implements all the data structures needed to work with tree-child phylogenetic networks, as well as algorithms for:

reconstructing a network from its eNewick string (in all its different flavours),

reconstructing a network from its μ-representation,

exploding a network into the set of its induced subtrees,

computing the μ-representation of a network and the μ-distance between two networks,

computing an optimal alignment between two networks,

computing tripartitions [13, 14] and the tripartition error between two networks, and

testing if a network is time consistent [15], and in such a case, computing a temporal representation.
The underlying data structure is a Graph::Directed object, with some extra data, for instance the μ-representation of the network. It makes use of the Perl module Bio::PhyloNetwork::muVector, which implements basic arithmetic operations on μ-vectors. Two extra modules, Bio::PhyloNetwork::Factory and Bio::PhyloNetwork::RandomFactory, are provided for the sequential and random generation (respectively) of all tree-child phylogenetic networks on a given set of taxa.
The web interface and the Java applet
The web interface allows the user to input one or two phylogenetic networks, given by their eNewick strings. A Perl script processes these strings and uses the Bio::PhyloNetwork package to compute all available data for them, including a plot of the networks that can be downloaded in PS format; these plots are generated through the application GraphViz and its companion Perl package.
Given two networks on the same set of leaves, their μ-distance is also computed, as well as an optimal alignment between them. The algorithm to compute such an alignment relies on the Hungarian algorithm [9]. If their sets of leaves are not the same, their topological restrictions to the set of common leaves are computed first, and the μ-distance and an optimal alignment are then computed on these restrictions.
A Java applet displays the networks side by side, and whenever a node is selected, the corresponding node in the other network (with respect to the optimal alignment) is highlighted, provided it exists. This is also extended to edges. Similarities between the networks are thus evident at a glance and, since the weight of each matched node is also shown, it is easy to see where the differences are.
Conclusion
The Perl module Bio::PhyloNetwork relies on the BioPerl bundle and implements several algorithms on phylogenetic networks, from parsing and temporal representation to distances between phylogenetic networks and optimal alignments. The companion Java applet and web-based application make use of the Bio::PhyloNetwork module and allow users to run simple experiments with phylogenetic networks without having to write a program or Perl script themselves.
While the Bio::PhyloNetwork module currently computes distances between galled trees and tree-child phylogenetic networks, it will also support the more general tree-sibling phylogenetic networks in a future release.
Availability and requirements
The Perl package is available as part of the BioPerl bundle, at the URL http://www.bioperl.org/. It can also be downloaded from the URL http://dmi.uib.es/~gcardona/BioInfo/BioPhyloNetwork.tgz (see Additional file 1). The web-based application is available at the URL http://dmi.uib.es/~gcardona/BioInfo/. The Perl package includes full documentation of all its features.
Declarations
Acknowledgements
The research described in this paper has been partially supported by the Spanish CICYT project TIN2004-07925-C03-01 GRAMMARS and by Spanish DGI projects MTM2006-07773 COMGRIO and MTM2006-15038-C02-01.
References
1. Strimmer K, Moulton V: Likelihood Analysis of Phylogenetic Networks using Directed Graphical Models. Mol Biol Evol. 2000, 17 (6): 875-881.
2. Strimmer K, Wiuf C, Moulton V: Recombination Analysis using Directed Graphical Models. Mol Biol Evol. 2001, 18: 97-99.
3. Hillis DM, Wilcox TP: Phylogeny of the New World True Frogs (Rana). Mol Phylogenet Evol. 2005, 34 (2): 299-314. 10.1016/j.ympev.2004.10.007.
4. Cardona G, Rosselló F, Valiente G: Comparison of Tree-Child Phylogenetic Networks. IEEE T Comput Biol. 2008.
5. Posada D, Crandall KA: Intraspecific Gene Genealogies: Trees grafting into Networks. Trends Ecol Evol. 2001, 16 (1): 37-45. 10.1016/S0169-5347(00)02026-7.
6. Gusfield D, Eddhu S, Langley C: Optimal, Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination. J Bioinformatics Comput Biol. 2004, 2 (1): 173-213. 10.1142/S0219720004000521.
7. Gusfield D, Eddhu S, Langley C: The Fine Structure of Galls in Phylogenetic Networks. INFORMS J Comput. 2004, 16 (4): 459-469. 10.1287/ijoc.1040.0099.
8. Robinson DF, Foulds LR: Comparison of Phylogenetic Trees. Math Biosci. 1981, 53 (1/2): 131-147. 10.1016/0025-5564(81)90043-2.
9. Munkres J: Algorithms for the Assignment and Transportation Problems. J SIAM. 1957, 5: 32-38. [http://siamdl.aip.org/getabs/servlet/GetabsServlet?prog=normal&id=SMJMAP000005000001000032000001&idtype=cvips&gifs=Yes]
10. Rice University BioInformatics Group: PhyloNet: Phylogenetic Networks Toolkit (v. 1.4). [http://bioinfo.cs.rice.edu/phylonet/]
11. Morin MM, Moret BME: NetGen: Generating Phylogenetic Networks with Diploid Hybrids. Bioinformatics. 2006, 22 (15): 1921-1923. 10.1093/bioinformatics/btl191.
12. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E: The BioPerl Toolkit: Perl Modules for the Life Sciences. Genome Res. 2002, 12 (10): 1611-1618. 10.1101/gr.361602. [http://www.bioperl.org/]
13. Moret BME, Nakhleh L, Warnow T, Linder CR, Tholse A, Padolina A, Sun J, Timme R: Phylogenetic Networks: Modeling, Reconstructibility, and Accuracy. IEEE T Comput Biol. 2004, 1 (1): 13-23. 10.1109/TCBB.2004.10.
14. Cardona G, Rosselló F, Valiente G: Tripartitions do not always discriminate Phylogenetic Networks. Math Biosci. 2008, 211 (2): 356-370. 10.1016/j.mbs.2007.11.003.
15. Baroni M, Semple C, Steel M: Hybrids in Real Time. Syst Biol. 2006, 55 (1): 46-56. 10.1080/10635150500431197.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.