(5 + 5)-EA
The (5 + 5)-EA used in the present study is a particular case of the large family of (λ + μ)-EAs. As briefly reported in Figure 1, the algorithm starts from an original tree and creates λ - 1 further trees by random swaps of pairs of internal nodes. In this way, a total of λ starting trees, with the same topology but a different order of taxa, is obtained. In each generation, μ trees are generated by random mutation of the winners of μ tournaments, each held between a pair of trees sampled at random with replacement from the λ trees. The parents of the next generation are then the λ fittest trees among the λ + μ available.
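The generation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' R implementation: trees are represented as hypothetical nested 2-tuples with string tips, a "swap" is interpreted as exchanging the two child subtrees of one randomly chosen internal node (which keeps the topology and changes the tip order), each starting tree receives a single swap, and lower fitness is taken to be better. The names `tips`, `swap_at`, `mutate`, and `evolve`, and the default number of generations, are illustrative only.

```python
import random

# Assumption: a tree is a nested 2-tuple and a tip is a string label.

def tips(tree):
    """Return the tip labels in left-to-right plotting order."""
    if isinstance(tree, str):
        return [tree]
    return tips(tree[0]) + tips(tree[1])

def count_internal(tree):
    """Number of internal nodes in the tree."""
    if isinstance(tree, str):
        return 0
    return 1 + count_internal(tree[0]) + count_internal(tree[1])

def swap_at(tree, k):
    """Swap the two children of the k-th internal node (pre-order index)."""
    if k == 0:
        return (tree[1], tree[0])
    left, right = tree
    n_left = count_internal(left)
    if k - 1 < n_left:
        return (swap_at(left, k - 1), right)
    return (left, swap_at(right, k - 1 - n_left))

def mutate(tree, rng):
    """Rotate one randomly chosen internal node; the topology is unchanged."""
    return swap_at(tree, rng.randrange(count_internal(tree)))

def evolve(start, fitness, lam=5, mu=5, generations=100, seed=0):
    """(lam + mu)-EA over tip orderings; assumes lower fitness is better."""
    rng = random.Random(seed)
    # The original tree plus lam - 1 randomly rotated copies.
    population = [start] + [mutate(start, rng) for _ in range(lam - 1)]
    for _ in range(generations):
        offspring = []
        for _ in range(mu):
            # Binary tournament between two trees sampled with replacement.
            a, b = rng.choice(population), rng.choice(population)
            winner = a if fitness(a) <= fitness(b) else b
            offspring.append(mutate(winner, rng))
        # Keep the lam fittest trees among parents and offspring.
        population = sorted(population + offspring, key=fitness)[:lam]
    return population[0]
```

Because parents compete with their offspring for survival, the best fitness in the population cannot get worse from one generation to the next.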
The selection of the best trees is based on a fitness value computed as the sum of the distances between each taxon and the next r tips, where r is the radius of the fitness evaluation. The distances are contained in a matrix that is given as input to the algorithm. Any available data able to discriminate among taxa can be used to generate the distance matrix. Any tree obtained with an available method of phylogenetic inference (e.g. distance-based, parsimony-based, or likelihood-based) is suitable as the starting tree.
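As an illustration of the fitness just described, the sketch below sums, for each tip in plotting order, the distances to the next r tips. It assumes that tips are identified by integer indices into a square distance matrix and that the window is simply truncated at the end of the ordering (no wrap-around); neither detail is specified in the text, and the function name `radius_fitness` is hypothetical.

```python
def radius_fitness(order, dist, r):
    """Sum of distances between each tip and the next r tips.

    order: tip indices in plotting order (assumed representation)
    dist:  square distance matrix indexed by tip index
    r:     radius of the fitness evaluation
    """
    total = 0.0
    n = len(order)
    for i in range(n):
        # Assumption: the window is truncated at the last tip.
        for j in range(i + 1, min(i + r + 1, n)):
            total += dist[order[i]][order[j]]
    return total
```

With r = 1 only adjacent tips contribute; larger radii also reward orderings in which similar taxa are close without being direct neighbours.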
Experimental validations
For the experimental validation of the algorithm, we selected three phylogenetic trees from published works according to two criteria: the availability of both genetic and geographic data, and the availability of published phylogenetic trees with associated phylogeographic interpretations. All distance matrices were normalized by dividing by their highest value, so that all values lie in the range [0, 1].
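The normalization step amounts to dividing every cell by the largest entry of the matrix. A minimal sketch, assuming the matrix is a list of rows of non-negative numbers (the function name `normalize` and the handling of an all-zero matrix are illustrative choices, not taken from the study):

```python
def normalize(matrix):
    """Scale a non-negative distance matrix so its largest value is 1."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        # Assumption: an all-zero matrix is returned unchanged.
        return [list(row) for row in matrix]
    return [[value / peak for value in row] for row in matrix]
```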
VSV data
We reconstructed the tree presented by Perez et al. using the maximum-likelihood optimality criterion as implemented in PAUP* version b10 [19] and the nucleotide substitution parameters estimated with Modeltest, version 3.7 [20], as reported in [12]. Unlike the original tree, the new tree contains 55 taxa instead of 59, because only the taxa with both sequences and geographic coordinates available were retained. For the genetic distance matrix, we used nucleotide-level distances corrected with the HKY substitution model [21]. For the geographic distance matrix, we used Euclidean distances between collection sites in a Latitude/Longitude coordinate system. The combined distance matrix was created by averaging the corresponding cells of the genetic and geographic matrices.
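The geographic and combined matrices described here can be sketched as below. The example treats latitude and longitude as plain planar coordinates, as the text states, and assumes that the two matrices being averaged have already been scaled to [0, 1]; the function names `euclidean_matrix` and `combine` are hypothetical.

```python
import math

def euclidean_matrix(coords):
    """Pairwise Euclidean distances between (latitude, longitude) pairs,
    treated as planar coordinates."""
    return [[math.hypot(a[0] - b[0], a[1] - b[1]) for b in coords]
            for a in coords]

def combine(genetic, geographic):
    """Cell-by-cell average of two equally sized distance matrices.
    Assumption: both inputs are already normalized to [0, 1]."""
    return [[(g + s) / 2 for g, s in zip(g_row, s_row)]
            for g_row, s_row in zip(genetic, geographic)]
```

Averaging gives the genetic and geographic information equal weight in each cell of the combined matrix.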
WNV data
The tree is from Bertolotti et al. [13] and contains 140 samples collected in Chicago, USA. We re-evaluated the best evolutionary model with Modeltest, version 3.7, and inferred the phylogeny using MrBayes software [22, 23]. The genetic distance matrix was created from nucleotide sequence data using an uncorrected p-distance [24]. The geographic distance matrix was created from Euclidean distances between collection sites in a Latitude/Longitude coordinate system. The combined distance matrix was created with the same method used for the VSV example.
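For reference, the uncorrected p-distance is the proportion of aligned sites at which two sequences differ. The sketch below assumes pre-aligned sequences of equal length and skips sites where either sequence has a gap or an ambiguous base; the treatment of such sites in the study is not stated, so this choice and the name `p_distance` are assumptions.

```python
def p_distance(seq_a, seq_b):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: sites with gaps or ambiguous bases are ignored.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)
```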
RYMV data
RYMV is a positive-sense single-stranded RNA virus belonging to the genus Sobemovirus, and it is considered to be among the most important rice pathogens in sub-Saharan Africa [17, 18]. We considered the tree published by Abubakar et al. and re-built it using the neighbor-joining method and pairwise nucleotide sequence distances computed with the Kimura two-parameter model, as reported in the original paper [14], but removing the out-group sequence; thus the tree has 39 taxa instead of 40. The geographic distances were computed as Euclidean distances between the centroids of the districts (for Tanzania) or of the states (for the other states) in a UTM coordinate system, zone 36.
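The Kimura two-parameter distance corrects for multiple substitutions by treating transitions and transversions separately: with P and Q the observed proportions of transitional and transversional differences, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)). A minimal sketch of this standard formula follows; it assumes pre-aligned DNA-alphabet sequences (U would need converting to T for RNA data) and skips gapped or ambiguous sites, and the name `k2p_distance` is hypothetical rather than taken from the software used in the study.

```python
import math

PURINES = {"A", "G"}

def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences."""
    n = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumption: gapped or ambiguous sites are skipped
        n += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    if n == 0:
        raise ValueError("no comparable sites")
    p, q = transitions / n, transversions / n
    a, b = 1 - 2 * p - q, 1 - 2 * q
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(a * math.sqrt(b))
```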
Geographic data
Data on country borders were obtained from shapefiles available online, and were managed and plotted with R software [25] and the packages maptools [26] and shape [27]. In particular, USA border data were downloaded from http://www.census.gov/geo/www/cob/st2000.html in ESRI Shapefile (.shp) format for all 50 States, D.C., and Puerto Rico; the Mexico shapefile was downloaded from http://www.vdstech.com/map_data.htm and the Africa shapefile was downloaded from http://www.maplibrary.org/index.php. Collection site coordinates for VSV were kindly provided by L. Rodriguez, A. Perez and S. Pauszek. WNV collection site coordinates were already available. RYMV collection site coordinates were extracted from the shapefile using GRASS-GIS software [28]; specifically, the coordinates of Tanzanian collection sites were taken as the centroids of the collection districts (Mwanza, Mbeya, Morogoro, Pemba, as described in [14]), and the coordinates for the other states were taken as the centroids of those states.
Computational performance
Algorithms were written in R, using the package 'ape' [29]. The runs were performed between February and June 2010 on the IBM-BCX cluster available at the Supercomputing Group of the CINECA Systems & Technologies Department.