Phylogenetic networks provide an explicit representation of the evolutionary relationships among sequences, genes, chromosomes, genomes, or species. They differ from phylogenetic trees by the explicit modeling, by means of hybrid nodes instead of only tree nodes, of reticulate evolutionary events such as recombination, hybridization, or lateral gene transfer, and differ also from the implicit networks that allow for visualization and analysis of incompatible phylogenetic signals [1]. Phylogenetic networks have been extensively used in evolutionary studies, especially at the population level, where reticulate evolutionary events are quite common [2].
Over the past two decades, plant and animal biologists have been performing molecular phylogenetic analyses and submitting phylogenetic trees and associated data matrices to TreeBASE, a repository of phylogenetic trees [3], and some journals either require or encourage authors to submit their phylogenetic data to databases such as TreeBASE [4, 5]. Sharing phylogenetic data has been eased by the multiple efforts to maintain such a database of published phylogenetic trees, which already contains over 5,000 phylogenetic trees with about 100,000 taxa from about 2,000 studies made by over 3,000 authors, but also by the adoption of a standard format for representing phylogenetic trees, the Newick format.
The Newick format [6, 7] was adopted on June 26, 1986 by an informal committee meeting at Newick's seafood restaurant in Dover, New Hampshire, USA, during the Society for the Study of Evolution meeting in Durham, New Hampshire. It is the de facto standard for representing phylogenetic trees, and it is convenient because it describes a whole phylogenetic tree in linear form in a unique way, once the phylogenetic tree is drawn or the ordering among child nodes is fixed. The Newick description of a phylogenetic tree is a string of nested parentheses annotated with taxa names and possibly also with branch lengths or bootstrap values, obtained by traversing the phylogenetic tree in postorder and following some simple rules that allow for parsing a Newick string into the corresponding phylogenetic tree, and vice versa. In fact, almost every phylogenetic software tool includes an option to export phylogenetic trees in Newick format, and open-source code for parsing Newick strings is readily available in a number of software toolkits, including for instance BioPerl, BioPython, BioJava, and BioRuby.
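The postorder rule can be illustrated with a minimal Python sketch. The nested (label, children) tuple encoding and the function name to_newick are assumptions made for this illustration only, and branch lengths and bootstrap values are left out.

```python
def to_newick(node):
    """Write a (label, children) tuple in Newick form by postorder traversal."""
    label, children = node
    if not children:
        # A terminal node is written as its label alone.
        return label
    # The children are written first, in their fixed order, then the parent label.
    return "(" + ",".join(to_newick(child) for child in children) + ")" + label


# Hypothetical three-taxon tree: root r with children a and 3, where a has children 1 and 2.
tree = ("r", [("a", [("1", []), ("2", [])]), ("3", [])])
newick = to_newick(tree) + ";"
print(newick)  # ((1,2)a,3)r;
```

Because the order of the children is fixed by the input, the same tree always yields the same string, which is the uniqueness property mentioned above.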
When it comes to phylogenetic networks, however, there is not yet, to the best of our knowledge, any effort to build and maintain a database of published phylogenetic networks, nor do journals require or even encourage authors to submit explicit phylogenetic networks as supplementary material. The adoption of a standard representation for phylogenetic networks would certainly be an important first step towards reversing this situation. Every evolutionary biologist might certainly benefit from such a standard, because of the importance of standards for sharing data [8, 9], and every computational biologist might benefit as well, given the importance of standards for tool development [10].
A first proposal of a compact representation for phylogenetic networks can be found in the NetGen package for phylogenetic networks [11], where a phylogenetic network with k hybrid nodes is represented as a single phylogenetic tree in Newick format but with k repeated nodes. For example, the phylogenetic network with two hybrid nodes of Figure 1 is transformed by replicating each hybrid node as shown in Figure 2, and the resulting representation is the following Newick string:
((1, ((2, (3, (4)Y#H1)g)e, (((Y#H1, 5)h, 6)f)X#H2)c)a, ((X#H2, 7)d, 8)b)r;
The representation of a whole phylogenetic network as a single string facilitates phylogenetic data sharing and exchange, because the string can then be embedded in the text of email messages, a collection of these strings can be put together in a text file with one line for each string, etc. It also facilitates the use of phylogenetic networks in computer programs, because in programming languages such as C, C++ and Java, in scripting languages such as Perl and Python, and in text processing tools such as awk, sed and grep, for instance, the single string representing a phylogenetic network can be easily input to a program or script through a command line interface or a graphical user interface, or read as a single line from a text file.
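To illustrate how such a single string can be consumed by a program, the following Python sketch parses the string above into a list of parent-child edges, treating repeated labels as one and the same node. The function name parse_enewick, the tokenizer, and the placeholder names for unlabeled nodes are assumptions of this sketch; branch lengths and quoted labels are not handled, and malformed input is not reported gracefully.

```python
import re
from collections import defaultdict

DELIMITERS = {"(", ")", ",", ";"}


def parse_enewick(text):
    """Parse a Newick string with repeated hybrid labels into (root, edges)."""
    tokens = re.findall(r"[(),;]|[^\s(),;]+", text)
    pos = 0
    anonymous = 0
    edges = []

    def parse_node():
        nonlocal pos, anonymous
        children = []
        if tokens[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while tokens[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if tokens[pos] != ")":
                raise ValueError("expected ')'")
            pos += 1  # consume ")"
        if pos < len(tokens) and tokens[pos] not in DELIMITERS:
            label = tokens[pos]
            pos += 1
        else:
            # Unlabeled node: invent a placeholder name (assumption of this sketch).
            anonymous += 1
            label = f"_node{anonymous}"
        edges.extend((label, child) for child in children)
        return label

    root = parse_node()
    if pos >= len(tokens) or tokens[pos] != ";":
        raise ValueError("expected ';'")
    return root, edges


network = "((1, ((2, (3, (4)Y#H1)g)e, (((Y#H1, 5)h, 6)f)X#H2)c)a, ((X#H2, 7)d, 8)b)r;"
root, edges = parse_enewick(network)

# A label that occurs twice denotes one node, so its parents accumulate.
parents = defaultdict(set)
for parent, child in edges:
    parents[child].add(parent)

hybrids = {node: sorted(ps) for node, ps in parents.items() if len(ps) > 1}
print(root)     # r
print(hybrids)  # {'Y#H1': ['g', 'h'], 'X#H2': ['c', 'd']}
```

Identifying nodes by label is what turns the tree with k repeated nodes back into a network: the two occurrences of each hybrid label collapse into a single node with two parents.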
A second proposal of a compact representation for phylogenetic networks can be found in the PhyloNet package for phylogenetic trees and networks [12], where a phylogenetic network with k hybrid nodes is represented as a series of k + 1 phylogenetic trees in Newick format. For example, the phylogenetic network with two hybrid nodes of Figure 1 is decomposed into three phylogenetic trees as shown in Figure 3, and the resulting representation is the following series of Newick strings:
((1, ((2, (3, Y)g)e, X)c)a, ((X, 7)d, 8)b)r;
(((Y, 5)h, 6)f)X;
(4)Y;
In the actual representation used in [12], however, the phylogenetic trees are represented by Newick strings (without the final semicolon, without root node label, and without any internal node labels) assigned either to the whole phylogenetic network or to the hybrid node, such as
N = ((1, ((2, (3, Y)), X)), ((X, 7), 8))
X = (((Y, 5), 6))
Y = (4)
for the phylogenetic network with two hybrid nodes of Figure 1.
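The relation between the two proposals can be made concrete with a small Python sketch that folds the labeled series of Newick strings given above into a single string with repeated nodes. The function name merge_trees, the dictionary layout, and the explicit assignment of the tags H1 and H2 to the hybrid nodes are assumptions of this sketch and are not part of either package.

```python
import re

TOKEN = re.compile(r"[(),;]|[^\s(),;]+")


def merge_trees(trees, tags, root):
    """Inline each hybrid subtree at its first leaf occurrence and tag every other occurrence."""
    expanded = set()

    def expand(name):
        out = []
        previous = None
        for token in TOKEN.findall(trees[name]):
            if token == ";":
                continue
            if token in tags:
                # A label right after "(" or "," is a leaf occurrence of the hybrid node.
                is_leaf = previous in ("(", ",")
                if is_leaf and token not in expanded:
                    expanded.add(token)
                    out.extend(expand(token))
                else:
                    out.append(f"{token}#{tags[token]}")
            else:
                out.append(token)
            previous = token
        return out

    return "".join(expand(root)).replace(",", ", ") + ";"


trees = {
    "N": "((1, ((2, (3, Y)g)e, X)c)a, ((X, 7)d, 8)b)r;",
    "X": "(((Y, 5)h, 6)f)X;",
    "Y": "(4)Y;",
}
tags = {"Y": "H1", "X": "H2"}  # tag assignment chosen to match the running example

single = merge_trees(trees, tags, root="N")
print(single)
```

For the running example, the result coincides with the single Newick string with two repeated nodes obtained under the first proposal.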
In any case, the representation of a phylogenetic network as a set of several strings makes it more difficult to share and exchange phylogenetic data, because it requires additional mark-up to properly keep the strings of different phylogenetic networks apart, especially when the strings of several phylogenetic networks are assembled together in a text file. Even in the case of a single phylogenetic network, the number of strings comprising the representation of the phylogenetic network is not made explicit in the representation, and thus additional mark-up is also needed to indicate the end of the series of Newick strings. This second proposal thus results in a longer and more involved representation than the first proposal, using k more strings and 2k more symbols to represent a phylogenetic network with k hybrid nodes.
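The count can be checked on the running example with a short Python sketch. It assumes that a symbol is one token, that is, a parenthesis, a comma, a semicolon, or a node label, with a label such as Y#H1 counted once; this reading is an assumption of the sketch, not a definition taken from the text.

```python
import re


def count_symbols(newick):
    """Count tokens: parentheses, commas, semicolons, and whole node labels."""
    return len(re.findall(r"[(),;]|[^\s(),;]+", newick))


# First proposal: one string with repeated hybrid nodes.
first = ["((1, ((2, (3, (4)Y#H1)g)e, (((Y#H1, 5)h, 6)f)X#H2)c)a, ((X#H2, 7)d, 8)b)r;"]

# Second proposal: k + 1 strings, one per hybrid node plus the main tree.
second = [
    "((1, ((2, (3, Y)g)e, X)c)a, ((X, 7)d, 8)b)r;",
    "(((Y, 5)h, 6)f)X;",
    "(4)Y;",
]

k = 2  # number of hybrid nodes in the running example
extra_strings = len(second) - len(first)
extra_symbols = sum(map(count_symbols, second)) - sum(map(count_symbols, first))
print(extra_strings, extra_symbols)  # 2 4
```

Under this reading, each hybrid node costs the second proposal one additional string, and within it one additional node label and one additional semicolon, which accounts for the k strings and 2k symbols.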
An additional drawback of this representation of a phylogenetic network as a set of several strings is the incomplete modeling of lateral gene transfer events, where the distinction between the reticulate edge and the other edge coming into the hybrid node is lost. The representation of a phylogenetic network is thus complemented in [12] with an explicit list of the lateral gene transfer arrows. For example, the lateral gene transfer event depicted in Figure 4 would be represented as follows.
N = ((1, (2, H)), H)
H = (3)
2 -> 3
Recently, in an open session at the Current Challenges and Problems in Phylogenetics workshop, held at the Isaac Newton Institute for Mathematical Sciences, Cambridge, UK in September 2007, several computational biologists gathered and agreed upon an extended Newick format as a standard for the representation of phylogenetic networks. The meeting was followed by extensive discussion by email, with important contributions by Gabriel Cardona, Daniel Huson, Monique Morin, David Posada, and Gabriel Valiente.
The second proposal was discarded because of the practical issues of dealing with several Newick strings as the representation of a single phylogenetic network, and the first proposal was adopted with a few minor improvements. The agreed-upon standard was then included in the Bio::PhyloNetwork package for phylogenetic networks in Perl [13] and made available as part of the BioPerl bundle [14]. The extended Newick format is described in detail in the Results section.
Later on, however, the second proposal was claimed in [15] to be the extended Newick format for phylogenetic networks, although it had already been discarded at the Current Challenges and Problems in Phylogenetics workshop for the reasons given above, and despite the fact that the extended Newick format for phylogenetic networks had already been published [13]. This is, in our opinion, unfortunate for a series of reasons, the most important being that the publication under the same name and in the same journal of two standard formats for representing phylogenetic networks will only add confusion and make it harder for biologists to share phylogenetic network data. This will be further addressed in the Discussion section.
Furthermore, there are a number of mistakes in the description of the second proposal as published in [15]. First, the description of a network with ℓ hybrid nodes is defined as a set of ℓ trees, but the description has ℓ + 1 trees instead. Second, in the procedure for decomposing a network into a set of trees, k new terminal nodes labeled x_i are created for each hybrid node u_i with k parent nodes V_i, the edges V_i × {u_i} are removed and the edges V_i × {x_i} are added, but the latter is not well-defined because there are multiple nodes labeled x_i while these edges are to be added from each parent in V_i to only one new terminal node labeled x_i. Third, the resulting set of trees is claimed to have the same terminal nodes as the network, but this is clearly false as new terminal nodes are introduced, multiple times indeed. Fourth, the resulting trees are also claimed to have disjoint sets of terminal nodes but this is also false; it does not even hold for the example in [15, Figure 3].