A perl package and an alignment tool for phylogenetic networks
BMC Bioinformatics volume 9, Article number: 175 (2008)
Abstract
Background
Phylogenetic networks are a generalization of phylogenetic trees that allow for the representation of evolutionary events acting at the population level, such as recombination between genes, hybridization between lineages, and lateral gene transfer. While most phylogenetics tools implement a wide range of algorithms on phylogenetic trees, only a few applications work with phylogenetic networks. None of them is an open-source library, and none supports the comparative analysis of phylogenetic networks by computing distances between them or aligning them.
Results
To improve this situation, we have developed a Perl package that relies on the BioPerl bundle and implements many algorithms on phylogenetic networks. We have also developed a Java applet that uses this Perl package and lets users run simple experiments with phylogenetic networks without having to write a program or Perl script themselves.
Conclusion
The Perl package is available as part of the BioPerl bundle, and can also be downloaded. A web-based application is also available (see Availability and requirements). The Perl package includes full documentation of all its features.
Background
Phylogenetic networks have been studied in recent years as a richer model of the evolutionary history of sets of organisms than phylogenetic trees, because they take into account not only mutation events but also evolutionary events acting at the population level, such as recombination between genes, hybridization between lineages, and lateral gene transfer. The latter turn phylogenies into reticulate networks, which are best modeled as directed acyclic graphs [1, 2]. For instance, Figure 1 shows two phylogenies inferred from evolutionary distances among three species of frog: R. aurora, R. boylii and R. temporaria [3], enriched with a hypothetical reticulation event (between the R. Amerana and R. Laurasiana groups), which turned them into phylogenetic networks.
We briefly recall below some definitions and results from [4] on phylogenetic networks. See [5] for an introduction to reticulation in phylogenetic analysis.
A phylogenetic network on a set S of taxa is any rooted directed acyclic graph whose leaves (those nodes without outgoing edges) are bijectively labeled by the set S.
Let N = (V, E) be a phylogenetic network on S. A node u ∈ V is said to be a tree node if it has at most one incoming edge; otherwise it is called a hybrid node. A phylogenetic network on S is a tree-child phylogenetic network if every node either is a leaf or has at least one child that is a tree node. Tree-child phylogenetic networks include galled trees [6, 7] as a particular case.
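The definitions above translate directly into a check on an edge list. The following is a minimal Python sketch, not the Bio::PhyloNetwork API: the function names, the edge-list representation, and the two example networks are assumptions made for illustration only.

```python
from collections import defaultdict


def classify_nodes(edges):
    """Split the nodes of a DAG, given as (parent, child) pairs,
    into leaves, tree nodes and hybrid nodes."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))
    leaves = {v for v in nodes if not children[v]}
    # A hybrid node has more than one incoming edge; every other node
    # (including the root) is a tree node.
    hybrids = {v for v in nodes if indegree[v] > 1}
    return leaves, nodes - hybrids, hybrids, children


def is_tree_child(edges):
    """True if every node is a leaf or has at least one tree-node child."""
    leaves, tree_nodes, hybrids, children = classify_nodes(edges)
    return all(
        v in leaves or any(c in tree_nodes for c in children[v])
        for v in tree_nodes | hybrids
    )


# Hypothetical network with one hybrid node h (parents x and y).
EXAMPLE_EDGES = [("r", "x"), ("r", "y"), ("x", "1"), ("x", "h"),
                 ("y", "h"), ("y", "3"), ("h", "2")]

# Hypothetical counterexample: every child of b is a hybrid node.
STACKED_HYBRIDS = [("r", "a"), ("r", "b"), ("a", "1"), ("a", "h1"),
                   ("b", "h1"), ("h1", "h2"), ("b", "h2"), ("h2", "2")]

print(is_tree_child(EXAMPLE_EDGES), is_tree_child(STACKED_HYBRIDS))
```

In the second network the node b is not a leaf and both of its children are hybrid nodes, so the tree-child condition fails.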
Let S = {ℓ_{1}, ..., ℓ_{n}} be the set of leaves. We define the μ-vector of a node u ∈ V as the vector μ(u) = (m_{1}(u), ..., m_{n}(u)), where m_{i}(u) is the number of different paths from u to the leaf ℓ_{i}. The multiset μ(N) = {μ(v) | v ∈ V} is called the μ-representation of N and, provided that N is a tree-child phylogenetic network, it completely characterizes N, up to isomorphism, among all tree-child phylogenetic networks on S.
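Since every path from an internal node passes through exactly one of its outgoing edges, the path counts of a node are the sums of the path counts of its children, and a leaf has a single (trivial) path to itself. The Python sketch below uses this recurrence; it is an illustration under assumed names (mu_vectors, EXAMPLE_EDGES) and is not the code of Bio::PhyloNetwork::muVector.

```python
from collections import defaultdict
from functools import lru_cache


def mu_vectors(edges, leaf_order):
    """Return {node: mu-vector} for a DAG given as (parent, child) pairs.

    leaf_order fixes the position of each leaf in the vectors and is
    assumed to list every leaf of the network exactly once.
    """
    children = defaultdict(list)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        nodes.update((parent, child))
    index = {leaf: i for i, leaf in enumerate(leaf_order)}

    @lru_cache(maxsize=None)
    def mu(v):
        if v in index:
            # A leaf reaches only itself, by the trivial path.
            return tuple(int(i == index[v]) for i in range(len(leaf_order)))
        # Internal node: add up the vectors of its children, component-wise.
        return tuple(sum(col) for col in zip(*(mu(c) for c in children[v])))

    return {v: mu(v) for v in nodes}


# Hypothetical network with one hybrid node h (parents x and y).
EXAMPLE_EDGES = [("r", "x"), ("r", "y"), ("x", "1"), ("x", "h"),
                 ("y", "h"), ("y", "3"), ("h", "2")]

MU = mu_vectors(EXAMPLE_EDGES, ["1", "2", "3"])
for node in sorted(MU):
    print(node, MU[node])
```

In this example the root reaches leaf 2 along two different paths (through x and through y), which is why the second component of its μ-vector is 2.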
This allows us to define a distance on the set of tree-child phylogenetic networks on S: the μ-distance between two given networks N_{1} and N_{2} is the cardinality of the symmetric difference of their μ-representations (as multisets),
d_{μ}(N_{1}, N_{2}) = |μ(N_{1}) Δ μ(N_{2})|.
This defines a true distance, and when N_{1} and N_{2} are phylogenetic trees, it coincides with the well-known partition distance [8].
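As an illustration, the μ-distance can be sketched in Python with multisets represented as collections.Counter objects. The function name and the three hard-coded μ-representations (two trees and one network on the leaves 1, 2, 3) are assumptions chosen for this example, not data from the paper.

```python
from collections import Counter


def mu_distance(mu_rep_1, mu_rep_2):
    """Size of the multiset symmetric difference of two mu-representations,
    each given as an iterable of mu-vectors (tuples)."""
    c1, c2 = Counter(mu_rep_1), Counter(mu_rep_2)
    # Counter subtraction keeps only positive multiplicities, so the two
    # one-sided differences together form the symmetric difference.
    return sum(((c1 - c2) + (c2 - c1)).values())


# mu-representations on leaves (1, 2, 3); hypothetical examples.
TREE_A = [(1, 1, 1), (1, 1, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]   # ((1,2),3)
TREE_B = [(1, 1, 1), (0, 1, 1), (1, 0, 0), (0, 1, 0), (0, 0, 1)]   # (1,(2,3))
NETWORK = [(1, 2, 1), (1, 1, 0), (0, 1, 1), (0, 1, 0),             # r, x, y, h
           (1, 0, 0), (0, 1, 0), (0, 0, 1)]                        # leaves

print(mu_distance(TREE_A, TREE_B))
print(mu_distance(TREE_A, NETWORK))
```

For the two trees the only differing vectors are (1, 1, 0) and (0, 1, 1), which correspond to the clusters {1, 2} and {2, 3}. Note that the vector (0, 1, 0) occurs twice in the network (once for the hybrid node and once for leaf 2), so multiplicities matter.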
This representation also allows us to define an optimal alignment between two tree-child phylogenetic networks on S, say with n = |S|. Given two such networks N_{1} = (V_{1}, E_{1}) and N_{2} = (V_{2}, E_{2}) (where, for the sake of simplicity, we assume |V_{1}| ≤ |V_{2}|), an alignment is just an injective mapping M : V_{1} → V_{2}. The weight of this alignment is
w(M) = Σ_{v ∈ V_{1}} ( ‖μ(v) − μ(M(v))‖ + χ(v, M(v)) ),
where ‖·‖ stands for the Manhattan norm of a vector and χ(u, v) is 0 if u and v are both tree nodes or both hybrid nodes, and 1/(2n) if one of them is a tree node and the other is a hybrid node. An optimal alignment is, then, an alignment with minimal weight, which can be computed using the Hungarian algorithm [9].
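The sketch below illustrates the notion of optimal alignment on a very small example. It assumes that the weight of an alignment M adds up, over the nodes v of N_{1}, the Manhattan distance between μ(v) and μ(M(v)) plus the penalty χ(v, M(v)). For brevity it enumerates all injective mappings by brute force instead of using the Hungarian algorithm, so it is only practical for a handful of nodes. All names and both example networks are hypothetical.

```python
from itertools import permutations


def manhattan(u, v):
    """Manhattan norm of the difference of two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def alignment_weight(mapping, nodes1, nodes2, n):
    """Weight of an alignment; nodes map a name to (mu_vector, is_hybrid)."""
    total = 0.0
    for v, image in mapping.items():
        mu1, hybrid1 = nodes1[v]
        mu2, hybrid2 = nodes2[image]
        # Penalty 1/(2n) when a tree node is matched with a hybrid node.
        chi = 0.0 if hybrid1 == hybrid2 else 1.0 / (2 * n)
        total += manhattan(mu1, mu2) + chi
    return total


def optimal_alignment(nodes1, nodes2, n):
    """Brute-force search over all injective mappings V1 -> V2.
    Requires len(nodes1) <= len(nodes2)."""
    names1 = list(nodes1)
    best_map, best_weight = None, float("inf")
    for images in permutations(nodes2, len(names1)):
        mapping = dict(zip(names1, images))
        weight = alignment_weight(mapping, nodes1, nodes2, n)
        if weight < best_weight:
            best_map, best_weight = mapping, weight
    return best_map, best_weight


# Tree ((1,2),3) and a network with one hybrid node h, on leaves (1, 2, 3).
TREE = {"root": ((1, 1, 1), False), "t12": ((1, 1, 0), False),
        "1": ((1, 0, 0), False), "2": ((0, 1, 0), False),
        "3": ((0, 0, 1), False)}
NETWORK = {"r": ((1, 2, 1), False), "x": ((1, 1, 0), False),
           "y": ((0, 1, 1), False), "h": ((0, 1, 0), True),
           "1": ((1, 0, 0), False), "2": ((0, 1, 0), False),
           "3": ((0, 0, 1), False)}

MAPPING, WEIGHT = optimal_alignment(TREE, NETWORK, n=3)
print(MAPPING, WEIGHT)
```

The penalty χ breaks the tie for leaf 2 of the tree: both leaf 2 and the hybrid node h of the network have μ-vector (0, 1, 0), but matching a tree node with a hybrid node costs an extra 1/6. A production implementation would build the same cost matrix and hand it to an assignment solver such as the Hungarian algorithm.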
Implementation and results
The extended Newick format
The eNewick (for "extended Newick") string defining a phylogenetic network appeared in the packages PhyloNet [10] and NetGen [11] related to phylogenetic networks, with some differences between them. The former encodes a phylogenetic network with k hybrid nodes as a series of k trees in Newick format, while the latter encodes it as a single tree in Newick format but with k repeated nodes.
Whereas the Perl module we introduce here accepts both formats as input, a complete standard for eNewick is implemented, based mainly on NetGen and following the suggestions of D. Huson and M. M. Morin (among others), to make it as complete as possible. The adopted standard has the practical advantage of encoding a whole phylogenetic network as a single string, and it also includes mandatory tags to distinguish among the various hybrid nodes in the network.
The procedure to obtain the eNewick string representing a phylogenetic network N goes as follows. Let {H_{1}, ..., H_{m}} be the set of hybrid nodes of N, ordered in any fixed way. For each hybrid node H = H_{i}, say with parents u_{1}, u_{2}, ..., u_{k} and children v_{1}, v_{2}, ..., v_{ℓ}: split H into k different nodes; let the first copy be a child of u_{1} and have all of v_{1}, v_{2}, ..., v_{ℓ} as its children; let the other copies be children of u_{2}, ..., u_{k} (one for each) and have no children. Label each of the copies of H as
[label]#[type]tag[:branch_length]
where the parameters are:

label (optional): string providing a labelling for the node;

type (optional): string indicating if the node H corresponds to a hybridization (indicated by H) or a lateral gene transfer (indicated by LGT) event; note that other types can be considered in the future;

tag (mandatory): integer i identifying the node H = H_{i};

branch_length (optional): number giving the length of the branch from the copy of H under consideration to its parent.
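The label format can be checked with a regular expression. The following Python sketch is one possible reading of the format, under the assumptions that the type consists of letters only and that the branch length is a plain decimal number with an optional exponent; the pattern and the function name are hypothetical and not taken from the Perl module.

```python
import re

# [label]#[type]tag[:branch_length], with only the tag being mandatory.
HYBRID_LABEL = re.compile(
    r"^(?P<label>[^#:]*)"                  # optional node label
    r"#(?P<type>[A-Za-z]*)"                # optional type, e.g. H or LGT
    r"(?P<tag>\d+)"                        # mandatory integer tag
    r"(?::(?P<branch_length>\d+(?:\.\d*)?(?:[eE][+-]?\d+)?))?$"
)


def parse_hybrid_label(text):
    """Return the four fields of a hybrid-node label as a dict,
    or None if the text is not a hybrid-node label."""
    match = HYBRID_LABEL.match(text)
    if match is None:
        return None
    length = match.group("branch_length")
    return {
        "label": match.group("label") or None,
        "type": match.group("type") or None,
        "tag": int(match.group("tag")),
        "branch_length": float(length) if length is not None else None,
    }


for sample in ("h#H1", "#LGT2:0.5", "#3", "plain_leaf"):
    print(sample, parse_hybrid_label(sample))
```

Ordinary (non-hybrid) node names contain no # sign and are rejected by the pattern, which makes the function usable as a test for hybrid-node copies while scanning a Newick string.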
We obtain a tree from this procedure whose set of leaves is the set of leaves of the original network together with the set of hybrid nodes (possibly repeated). The Newick string of the obtained tree (note that some internal nodes will be labeled and some leaves will be repeated) is the eNewick string of the phylogenetic network. The leftmost occurrence of each hybrid node in an eNewick string corresponds to the full description of the network rooted at that node. Although node labels are optional, all labeled occurrences of a hybrid node in an eNewick string must carry the same label.
Consider, for example, the phylogenetic network depicted together with its decomposition in Figure 2. The eNewick string for this network would be ((1, (2)#H1), (#H1,3)); or ((1, (2)h#H1)x, (h#H1,3)y)r; if all internal nodes are labeled. The leftmost occurrence of the hybrid node in the latter string corresponds to the full description of the network rooted at that node: (2)h#H1.
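The procedure can be sketched in Python as a depth-first traversal that expands each hybrid node the first time it is reached and emits only its label afterwards. This is an illustrative sketch and not the writer used by Bio::PhyloNetwork: the function name and edge-list input are assumptions, every hybrid node is given type H, tags are assigned in order of first visit, and branch lengths are omitted. The edge list is a hypothetical encoding of a network with one hybrid node, chosen to match the strings quoted above.

```python
from collections import defaultdict


def to_enewick(edges, root, label_internal=False):
    """Write a rooted DAG, given as (parent, child) pairs, in eNewick."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    tags = {}         # hybrid node -> integer tag
    expanded = set()  # hybrid nodes whose subnetwork was already written

    def write(v):
        is_leaf = not children[v]
        name = v if (is_leaf or label_internal) else ""
        if indegree[v] > 1:                      # hybrid node
            if v not in tags:
                tags[v] = len(tags) + 1
            name += "#H%d" % tags[v]
            if v in expanded:
                return name                      # later copy: no children
            expanded.add(v)                      # first copy: keep children
        if is_leaf:
            return name
        return "(" + ",".join(write(c) for c in children[v]) + ")" + name

    return write(root) + ";"


EXAMPLE_EDGES = [("r", "x"), ("r", "y"), ("x", "1"), ("x", "h"),
                 ("y", "h"), ("y", "3"), ("h", "2")]

print(to_enewick(EXAMPLE_EDGES, "r"))
print(to_enewick(EXAMPLE_EDGES, "r", label_internal=True))
```

Because children are visited from left to right, the expanded copy of each hybrid node is always its leftmost occurrence in the output string.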
The procedure to recover a network from its eNewick string simply requires recovering the tree and identifying those nodes that are labeled as hybrid nodes with the same identifier.
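A minimal Python sketch of this recovery step follows. It parses the string as an ordinary Newick tree with a small recursive-descent parser and gives all copies of a hybrid node the same identity, derived from the tag. The function name, the generated names for unlabeled internal nodes, and the decision to ignore branch lengths and node types are assumptions of this sketch; it is not the parser of the Perl module.

```python
import re

NAME = re.compile(r"[^(),:;]*")
BRANCH_LENGTH = re.compile(r":[^(),;]*")
HYBRID = re.compile(r"[^#]*#[A-Za-z]*(\d+)")


def parse_enewick(text):
    """Return (root, edges) for an eNewick string; edges are (parent, child)."""
    text = re.sub(r"\s+", "", text).rstrip(";")
    pos = 0
    unnamed = 0
    edges = []

    def parse_node():
        nonlocal pos, unnamed
        kids = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                                   # consume "("
            kids.append(parse_node())
            while pos < len(text) and text[pos] == ",":
                pos += 1
                kids.append(parse_node())
            pos += 1                                   # consume ")"
        name = NAME.match(text, pos).group(0)
        pos += len(name)
        length = BRANCH_LENGTH.match(text, pos)
        if length:                                     # branch length ignored
            pos = length.end()
        if "#" in name:
            tagged = HYBRID.fullmatch(name)
            if tagged is None:
                raise ValueError("hybrid node without integer tag: %r" % name)
            node = "#" + tagged.group(1)               # same tag, same node
        elif name:
            node = name
        else:
            unnamed += 1
            node = "_n%d" % unnamed                    # unlabeled internal node
        edges.extend((node, kid) for kid in kids)
        return node

    return parse_node(), edges


ROOT, EDGES = parse_enewick("((1, (2)h#H1)x, (h#H1,3)y)r;")
print(ROOT, sorted(EDGES))
```

Keying hybrid nodes by tag rather than by label is deliberate: labels are optional, so some copies of a hybrid node may be unlabeled, whereas the tag is mandatory on every copy.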
Notice that gene transfer events can be represented in a unique way as hybrid nodes. Consider, for example, the lateral gene transfer event depicted in Figure 3, where a gene is transferred from species 2 to species 3 after the divergence of species 1 from species 2. The eNewick string ((1, (2, (3)h#LGT1)y)x, h#LGT1)r; describes such a phylogenetic network. A program interpreting the eNewick string can use the information on node types in different ways; for instance, to render tree nodes circled, hybridization nodes boxed, and lateral gene transfer nodes as arrows between edges.
The Perl module
The Perl module Bio::PhyloNetwork, available as part of the BioPerl bundle [12], implements all the data structures needed to work with tree-child phylogenetic networks, as well as algorithms for:

reconstructing a network from its eNewick string (in all its different flavours),

reconstructing a network from its μ-representation,

exploding a network into the set of its induced subtrees,

computing the μ-representation of a network and the μ-distance between two networks,

computing an optimal alignment between two networks,

computing tripartitions [13, 14] and the tripartition error between two networks, and

testing if a network is time consistent [15], and in such a case, computing a temporal representation.
The underlying data structure is a Graph::Directed object, with some extra data, for instance the μ-representation of the network. It makes use of the Perl module Bio::PhyloNetwork::muVector, which implements basic arithmetic operations on μ-vectors. Two extra modules, Bio::PhyloNetwork::Factory and Bio::PhyloNetwork::RandomFactory, are provided for the sequential and random generation (respectively) of all tree-child phylogenetic networks on a given set of taxa.
The web interface and the Java applet
The web interface allows the user to input one or two phylogenetic networks, given by their eNewick strings. A Perl script processes these strings and uses the Bio::PhyloNetwork package to compute all available data for them, including a plot of the networks that can be downloaded in PS format; these plots are generated through the application GraphViz and its companion Perl package.
Given two networks on the same set of leaves, their μ-distance is also computed, as well as an optimal alignment between them, which is obtained with the Hungarian algorithm [9]. If their sets of leaves differ, the topological restriction of each network to the set of common leaves is computed first, and the μ-distance and an optimal alignment are then computed on the restricted networks.
A Java applet displays the networks side by side, and whenever a node is selected, the corresponding node in the other network (with respect to the optimal alignment) is highlighted, provided it exists. This is also extended to edges. Similarities between the networks are thus evident at a glance and, since the weight of each matched node is also shown, it is easy to see where the differences are.
Conclusion
The Perl module Bio::PhyloNetwork relies on the BioPerl bundle and implements several algorithms on phylogenetic networks, from parsing and temporal representation to distances between phylogenetic networks and optimal alignments. The companion Java applet and web-based application make use of the Bio::PhyloNetwork module and let users run simple experiments with phylogenetic networks without having to write a program or Perl script themselves.
While the Bio::PhyloNetwork module currently computes distances between galled trees and tree-child phylogenetic networks, a future release will also support the more general tree-sibling phylogenetic networks.
Availability and requirements
The Perl package is available as part of the BioPerl bundle, at the URL http://www.bioperl.org/. It can also be downloaded from the URL http://dmi.uib.es/~gcardona/BioInfo/BioPhyloNetwork.tgz (see Additional file 1). The web-based application is available at the URL http://dmi.uib.es/~gcardona/BioInfo/. The Perl package includes full documentation of all its features.
References
1. Strimmer K, Moulton V: Likelihood Analysis of Phylogenetic Networks using Directed Graphical Models. Mol Biol Evol. 2000, 17 (6): 875-881.
2. Strimmer K, Wiuf C, Moulton V: Recombination Analysis using Directed Graphical Models. Mol Biol Evol. 2001, 18: 97-99.
3. Hillis DM, Wilcox TP: Phylogeny of the New World True Frogs (Rana). Mol Phylogenet Evol. 2005, 34 (2): 299-314. 10.1016/j.ympev.2004.10.007.
4. Cardona G, Rosselló F, Valiente G: Comparison of Tree-Child Phylogenetic Networks. IEEE T Comput Biol. 2008,
5. Posada D, Crandall KA: Intraspecific Gene Genealogies: Trees grafting into Networks. Trends Ecol Evol. 2001, 16 (1): 37-45. 10.1016/S0169-5347(00)02026-7.
6. Gusfield D, Eddhu S, Langley C: Optimal, Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination. J Bioinformatics Comput Biol. 2004, 2 (1): 173-213. 10.1142/S0219720004000521.
7. Gusfield D, Eddhu S, Langley C: The Fine Structure of Galls in Phylogenetic Networks. INFORMS J Comput. 2004, 16 (4): 459-469. 10.1287/ijoc.1040.0099.
8. Robinson DF, Foulds LR: Comparison of Phylogenetic Trees. Math Biosci. 1981, 53 (1/2): 131-147. 10.1016/0025-5564(81)90043-2.
9. Munkres J: Algorithms for the Assignment and Transportation Problems. J SIAM. 1957, 5: 32-38. [http://siamdl.aip.org/getabs/servlet/GetabsServlet?prog=normal&id=SMJMAP000005000001000032000001&idtype=cvips&gifs=Yes]
10. Rice University BioInformatics Group: PhyloNet: Phylogenetic Networks Toolkit (v. 1.4). [http://bioinfo.cs.rice.edu/phylonet/]
11. Morin MM, Moret BME: NetGen: Generating Phylogenetic Networks with Diploid Hybrids. Bioinformatics. 2006, 22 (15): 1921-1923. 10.1093/bioinformatics/btl191.
12. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E: The BioPerl Toolkit: Perl Modules for the Life Sciences. Genome Res. 2002, 12 (10): 1611-1618. 10.1101/gr.361602. [http://www.bioperl.org/]
13. Moret BME, Nakhleh L, Warnow T, Linder CR, Tholse A, Padolina A, Sun J, Timme R: Phylogenetic Networks: Modeling, Reconstructibility, and Accuracy. IEEE T Comput Biol. 2004, 1 (1): 13-23. 10.1109/TCBB.2004.10.
14. Cardona G, Rosselló F, Valiente G: Tripartitions do not always discriminate Phylogenetic Networks. Math Biosci. 2008, 211 (2): 356-370. 10.1016/j.mbs.2007.11.003.
15. Baroni M, Semple C, Steel M: Hybrids in Real Time. Syst Biol. 2006, 55 (1): 46-56. 10.1080/10635150500431197.
Acknowledgements
The research described in this paper has been partially supported by the Spanish CICYT project TIN 2004-07925-C03-01 GRAMMARS and by Spanish DGI projects MTM2006-07773 COMGRIO and MTM2006-15038-C02-01.
Authors' contributions
All authors conceived the method, prepared the manuscript, contributed to the discussion, and have approved the final manuscript. GC implemented the software. GV also implemented part of the software.
Electronic supplementary material
12859_2007_2160_MOESM1_ESM.tgz
Additional file 1: BioPhyloNetwork. Compressed (gzip) archive (tar) of the perl module Bio::PhyloNetwork (containing the files Bio/PhyloNetwork/Factory.pm, Bio/PhyloNetwork/RandomFactory.pm, Bio/PhyloNetwork/muVector.pm, Bio/PhyloNetwork/FactoryX.pm, Bio/PhyloNetwork/TreeFactory.pm, Bio/PhyloNetwork/GraphViz.pm, Bio/PhyloNetwork/TreeFactoryMulti.pm, and Bio/PhyloNetwork/TreeFactoryX.pm) and the corresponding test module (containing the files Bio/PhyloNetwork/t/Factory.t, Bio/PhyloNetwork/t/TreeFactory.t, Bio/PhyloNetwork/t/muVector.t, Bio/PhyloNetwork/t/GraphViz.t, Bio/PhyloNetwork/t/RandomFactory.t, Bio/PhyloNetwork/t/lib/BioperlTest.pm, Bio/t/PhyloNetwork.t, and Bio/t/lib/BioperlTest.pm). (TGZ 28 KB)
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Cardona, G., Rosselló, F. & Valiente, G. A perl package and an alignment tool for phylogenetic networks. BMC Bioinformatics 9, 175 (2008). https://doi.org/10.1186/1471-2105-9-175
Keywords
 Directed Acyclic Graph
 Perl Script
 Lateral Gene Transfer
 Tree Node
 Optimal Alignment