Taxon ordering in phylogenetic trees: a workbench test
© Cerutti et al; licensee BioMed Central Ltd. 2011
Received: 8 September 2010
Accepted: 22 February 2011
Published: 22 February 2011
Phylogenetic trees are an important tool for representing evolutionary relationships among organisms. In a phylogram or chronogram, the ordering of taxa is not considered meaningful, since complete topological information is given by the branching order and length of the branches, which are represented in the root-to-node direction. We apply a novel method based on a (λ + μ)-Evolutionary Algorithm to give meaning to the order of taxa in a phylogeny. This method applies random swaps between two taxa connected to the same node, without changing the topology of the tree. The evaluation of a new tree is based on different distance matrices, representing non-phylogenetic information such as other types of genetic distance, geographic distance, or combinations of these. To test our method we use published trees of Vesicular stomatitis virus, West Nile virus and Rice yellow mottle virus.
Best results were obtained when taxa were reordered using geographic information. Information supporting phylogeographic analysis was recovered in the optimized tree, as evidenced by clustering of geographically close samples. Improving the trees using a separate genetic distance matrix altered the ordering of taxa, but not topology, moving the longest branches to the extremities, as would be expected since they are the most divergent lineages. Improved representations of genetic and geographic relationships between samples were also obtained when merged matrices (genetic and geographic information in one matrix) were used.
Our innovative method makes phylogenetic trees easier to interpret, adding meaning to the taxon order and helping to prevent misinterpretations.
Phylogenetic trees are an important tool in evolutionary biology for representing the evolutionary history of organisms. They are composed of nodes, representing hypothetical ancestors, and branches or edges, reflecting the relationship between nodes. Terminal nodes, or tips, represent the taxa whose evolution has been investigated, and they can represent extant or extinct organisms. In phylograms and chronograms, branches contain information, e.g. character changes or evolutionary time; in both cases they represent distances between nodes. Consequently, the taxon order is meaningless, and the closeness of taxa can be misleading. For example, two taxa lying adjacent to each other on a phylogenetic tree may actually be very distantly related, creating problems of interpretation.
Previous work has already taken up the challenge of ordering taxa so that the order reflects genetic distance. The software Neighbor-Net has been shown to build a network that also minimizes the distance among taxa, given a matrix that fulfills the Kalmanson inequalities [3, 4]. This software relies on phylogenetic networks, using an algorithm based on Neighbor Joining. Levy and Pachter showed that the algorithm, when the problem is considered as a "traveling salesman" problem, is robust for ordering taxa according to a distance matrix. Other studies have endeavored to minimize the distance between taxa as a minimum Hamiltonian path [6, 7], either while building the tree or after having built it.
In the current study we apply our evolutionary-algorithm method to different data sets using different types of distances. We chose published phylogenetic trees of three RNA viruses infecting different hosts with different modes of transmission: Vesicular stomatitis virus (VSV) presented by Perez et al., West Nile virus (WNV) presented by Bertolotti et al. and Rice yellow mottle virus (RYMV) presented by Abubakar et al. We reorganized each tree using genetic and geographic distance matrices as well as combined distances (geographic and genetic) to improve graphical representation by optimizing the order of taxa.
In the original paper, the authors declared that more than 90% of viral genetic variance was contained within sampling sites. Modifying the tree using a matrix of genetic distances yielded a tree that did not tend to group taxa by geographic location (Figure 4c); rather, the modified tree acquired the aforementioned "C"-like shape. As reported in the same paper, the authors found a significant association between viral genetic and spatial distances, with samples collected from the same site likely to be genetically similar. The modified tree obtained with merged genetic and geographic distance matrices tended to move samples collected from the same location closer together, supporting the original contention that viruses from the same location are also genetically similar (Figure 4d). Furthermore, the "C"-like shape of the tree was conserved even when merged matrices were used. The new trees obtained therefore graphically support the statistical analyses in the original publication showing weak genetic-geographic association.
In our previous studies [8, 9] we introduced an innovative method to give a meaning to the order of taxa on a phylogenetic tree using an evolutionary algorithm. Here we test the algorithm using different viral systems, trees and distances with the aim of improving the graphic representation of the tree without modifying its topology. Qualitatively, the most improved tree was obtained when we evolved the VSV phylogenetic tree using a matrix of geographic distances, since the order of taxa on the improved tree closely mirrored geography. In the case of WNV, our algorithm generated trees that support previous studies of genetic diversity and spatial correlation. In the case of RYMV, we identified a possible limitation of the algorithm: the improvement in graphical representation may be less pronounced when the tree contains small numbers of taxa. For this reason we evaluated the role of r, highlighting that, in the case of small sample sets, small radii may be sufficient to achieve marked improvements. The strength of this method is that a phylogeny can be inferred using any available method for building trees, with any distance and algorithm, from a simple approach (like Neighbor-Joining) to a more accurate Bayesian inference. The user can choose the appropriate method, using the genetic information contained in the considered sequences and, in the next step, add a second matrix, containing for example geographic information. The tests of the algorithm presented here showed promise. Very good results were achieved when the geographic distribution of the samples was linear, as in the VSV and RYMV cases. When the geographic distribution was more complex, as in the case of WNV, the grouping of taxa collected from the same site was hard to see. In this case, geographic information alone is not enough, but much greater improvement is gained when both genetic and geographic information are incorporated, as shown in Figure 3d.
In conclusion, the algorithm tested in this paper offers a customizable method that can help biologists to better represent results of their phylogenetic analyses, improving the interpretation of phylogenetic trees and making them more understandable.
The (5 + 5)-EA used in the present study is a particular case of the large family of (λ + μ)-EAs. As briefly reported in Figure 1, the algorithm starts from an original tree and creates λ-1 trees by random swaps of pairs of internal nodes. In this way, a total of λ starting trees, with the same topology but different order of taxa, are obtained. In each generation, μ trees are generated by random mutation of those selected with μ tournaments between pairs chosen by random sampling with replacement among the λ trees. The parents of the next generation are then the λ fittest trees among the λ + μ ones.
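The loop described above can be sketched in a few lines. This is a minimal illustration in Python, not the authors' R implementation: the nested-tuple tree representation, the function names (`tips`, `count_internal`, `mutate`, `evolve`) and the convention that a lower fitness value is better are all assumptions made for the example.

```python
import random

def tips(tree):
    """Left-to-right tip order of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return [tree]
    return [t for child in tree for t in tips(child)]

def count_internal(tree):
    """Number of internal nodes, i.e. places where a swap can happen."""
    if isinstance(tree, str):
        return 0
    return 1 + sum(count_internal(child) for child in tree)

def mutate(tree, rng):
    """Reverse the children of one random internal node; topology is unchanged."""
    target = rng.randrange(count_internal(tree))
    seen = [0]  # preorder counter shared with the nested function

    def walk(node):
        if isinstance(node, str):
            return node
        index = seen[0]
        seen[0] += 1
        children = tuple(walk(child) for child in node)
        return children[::-1] if index == target else children

    return walk(tree)

def evolve(tree, fitness, lam=5, mu=5, generations=100, seed=0):
    """(lam + mu) loop: assumes `fitness` returns a cost to be minimized."""
    rng = random.Random(seed)
    # lam starting trees: the original plus lam-1 randomly swapped copies
    population = [tree] + [mutate(tree, rng) for _ in range(lam - 1)]
    for _ in range(generations):
        offspring = []
        for _ in range(mu):
            # binary tournament between two trees sampled with replacement
            a, b = rng.choice(population), rng.choice(population)
            winner = a if fitness(a) <= fitness(b) else b
            offspring.append(mutate(winner, rng))
        # plus-selection: keep the lam best among parents and offspring
        population = sorted(population + offspring, key=fitness)[:lam]
    return population[0]
```

Because parents compete with their offspring for survival, the best tree in the population can never get worse from one generation to the next, and because `mutate` only reorders children, every tree in the population has the same set of clades as the starting tree.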
The selection of the best trees is based on the fitness, evaluated as the sum of the distances between each taxon and the next r tips, where r is the radius in the fitness evaluation. The distances are contained in a matrix, used as input for the algorithm. Any available data that are able to discriminate among taxa can be used to generate the distance matrix. As a starting tree, an original tree obtained by any available method of phylogenetic inference (e.g. distance-based, parsimony-based, or likelihood-based) is suitable.
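As a worked illustration of this fitness, the following Python sketch sums, for each tip in the plotted order, the distances to the next r tips. The function name `ordering_fitness`, the dictionary-of-dictionaries matrix layout and the choice not to wrap around at the end of the tip list are assumptions for the example; the source does not specify these details.

```python
def ordering_fitness(order, dist, r=1):
    """Sum of distances between each tip and the next r tips in `order`.

    `order` is the list of tip labels as plotted; `dist[a][b]` is the
    distance between taxa a and b. Tips near the end of the list simply
    have fewer than r successors (no wrap-around is assumed).
    """
    total = 0.0
    n = len(order)
    for i in range(n):
        for j in range(i + 1, min(i + r, n - 1) + 1):
            total += dist[order[i]][order[j]]
    return total

# Hypothetical example: four taxa sampled along a line at positions 0..3.
position = {"A": 0, "B": 1, "C": 2, "D": 3}
dist = {a: {b: abs(position[a] - position[b]) for b in position} for a in position}

print(ordering_fitness(["A", "B", "C", "D"], dist, r=1))  # 3.0
print(ordering_fitness(["A", "C", "B", "D"], dist, r=1))  # 5.0
print(ordering_fitness(["A", "B", "C", "D"], dist, r=2))  # 7.0
```

The order that follows the sampling line scores lower than the shuffled one, which is the behaviour the optimization exploits. Increasing r makes each tip "see" more of its neighbourhood.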
For the experimental validation of the algorithm, we selected three phylogenetic trees from published works, following two criteria: (i) availability of genetic and geographic data, and (ii) availability of published phylogenetic trees with associated phylogeographic interpretations. All the distance matrices were normalized to the highest value in order to have values lying in the range [0, 1].
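The normalization step amounts to dividing every cell by the largest entry. A minimal Python sketch, assuming the matrix is stored as a list of rows with non-negative entries (the function name `normalize` is hypothetical):

```python
def normalize(matrix):
    """Scale a non-negative distance matrix so its largest entry is 1."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        # all taxa coincide; nothing to scale
        return [list(row) for row in matrix]
    return [[value / peak for value in row] for row in matrix]

# Hypothetical 3x3 distance matrix in arbitrary units
raw = [[0, 4, 2],
       [4, 0, 1],
       [2, 1, 0]]
print(normalize(raw))  # [[0.0, 1.0, 0.5], [1.0, 0.0, 0.25], [0.5, 0.25, 0.0]]
```

Scaling each matrix this way puts genetic and geographic distances on a common scale, whatever their original units.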
We reconstructed the tree presented by Perez et al. using the maximum-likelihood optimality criterion as implemented in PAUP* version b10 and the nucleotide substitution parameters as estimated using Modeltest, version 3.7, as reported in . Unlike the original tree, the new tree contains only 55 taxa instead of 59, because both sequences and geographic coordinates were available for only 55 of them. For the genetic distance matrix, we used nucleotide-level distances corrected with the HKY substitution model. For the geographic distance matrix, we used Euclidean distances between collection sites in a Latitude/Longitude coordinate system. The combined matrix of distances was created by averaging the corresponding cells of the genetic and geographic matrices.
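The two matrix constructions described here can be sketched as follows. This Python example treats latitude and longitude as plain planar coordinates, as the text describes, and averages two matrices cell by cell with equal weights. The function names and the sample coordinates are hypothetical, and both inputs to `combine` are assumed to be already normalized to [0, 1].

```python
import math

def euclidean_matrix(coords):
    """Pairwise planar distances; `coords` maps site name -> (lat, lon)."""
    names = list(coords)
    return [[math.hypot(coords[a][0] - coords[b][0],
                        coords[a][1] - coords[b][1])
             for b in names]
            for a in names]

def combine(genetic, geographic):
    """Cell-by-cell average of two equally sized distance matrices."""
    return [[(g + s) / 2 for g, s in zip(g_row, s_row)]
            for g_row, s_row in zip(genetic, geographic)]

# Hypothetical collection sites in degrees (latitude, longitude)
sites = {"site1": (0.0, 0.0), "site2": (3.0, 4.0)}
print(euclidean_matrix(sites))  # [[0.0, 5.0], [5.0, 0.0]]

genetic = [[0.0, 1.0], [1.0, 0.0]]     # assumed already normalized
geographic = [[0.0, 0.5], [0.5, 0.0]]  # assumed already normalized
print(combine(genetic, geographic))    # [[0.0, 0.75], [0.75, 0.0]]
```

Both matrices must list the taxa in the same row and column order for the averaged cells to refer to the same pair of samples.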
The tree is from Bertolotti et al. and contains 140 samples collected from Chicago, USA. We re-evaluated the best evolution model with Modeltest, version 3.7, and inferred the phylogeny using MrBayes software [22, 23]. The genetic distance matrix was created from nucleotide sequence data and an uncorrected p-distance. The geographic distance matrix was created from Euclidean distances between collection sites in a Latitude/Longitude coordinate system. The combined matrix of distances was created with the same method used for the VSV example.
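The uncorrected p-distance is simply the proportion of aligned sites at which two sequences differ. A minimal Python sketch, assuming two aligned sequences of equal length and no special handling of gaps or ambiguity codes (the function name and the example sequences are hypothetical):

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)

# Two hypothetical aligned fragments differing at 2 of 8 sites
print(p_distance("ACGTACGT", "ACGTTCGA"))  # 0.25
```

Production tools usually also decide how to treat gaps and ambiguous bases, which this sketch deliberately leaves out.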
RYMV is a positive-sense single-stranded RNA virus belonging to the genus Sobemovirus and it is considered to be among the most important rice pathogens within sub-Saharan Africa [17, 18]. We considered the tree published by Abubakar et al. and re-built it using the neighbor-joining method and pairwise nucleotide sequence distances with the Kimura two-parameter model, as reported in the original paper, but removing the out-group sequence; thus the tree has 39 taxa instead of 40. The geographic distances were computed as Euclidean distances between the centroids of the district (for Tanzania) or of the area of the state (for the other states) in a UTM coordinate system, zone 36.
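For reference, the Kimura two-parameter distance is d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites showing transitions and transversions. The Python sketch below computes it for two aligned sequences. It is an illustration only, not the software used in the study; it assumes unambiguous A/C/G/T characters, and the function name is hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for x, y in zip(seq_a, seq_b):
        if x == y:
            continue
        same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
        if same_class:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    p = transitions / len(seq_a)
    q = transversions / len(seq_a)
    # The formula is undefined when the log argument is not positive
    # (very divergent sequences); math.log then raises ValueError.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

# One transition (A -> G) in ten sites: slightly above the raw 0.1
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

The corrected distance exceeds the raw proportion of differing sites because the model accounts for multiple substitutions at the same site.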
Data on country borders were obtained from shapefiles available online, and were managed and plotted with R software and the packages maptools and shape. In particular, USA border data were downloaded from http://www.census.gov/geo/www/cob/st2000.html in ESRI Shapefile (.shp) format for all 50 States, D.C., and Puerto Rico; the Mexico shapefile was downloaded from http://www.vdstech.com/map_data.htm and the Africa shapefile was downloaded from http://www.maplibrary.org/index.php. Collection site coordinates for VSV were kindly provided by L. Rodriguez, A. Perez and S. Pauszek. WNV collection site coordinates were already available. RYMV collection site coordinates were extracted from the shapefile using GRASS-GIS software; specifically, coordinates of Tanzanian collection sites were selected as the centroid of the collection district (Mwanza, Mbeya, Morogoro, Pemba, as described in ), and coordinates of the other states were selected as the centroids of those states.
Algorithms were written in R, using the package 'ape'. The runs were performed on the cluster IBM-BCX available at the Supercomputing Group of the CINECA Systems & Tecnologies Department between February and June 2010.
The authors thank the Supercomputing Group of the CINECA Systems & Tecnologies Department for support with the computations. MG acknowledges funding (60% grant) by the Ministero dell'Università e della Ricerca Scientifica e Tecnologica. LB gratefully acknowledges financial support by Ricerca Sanitaria Finalizzata 2008-Regione Piemonte. This work is supported by the National Science Foundation/National Institutes of Health Ecology of Infectious Diseases Program under award number EF-0840403. Furthermore, the authors thank L. Rodriguez, A. Perez and S. Pauszek for kindly sharing genetic and geographic data. The authors also acknowledge Donal Bisanzio for his valuable help with geographic data management and map plotting.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.