Maximum parsimony heuristics
We study heuristics that use the maximum parsimony (MP) optimization criterion for inferring the evolutionary history among a collection of taxa. Each of the taxa in the input is represented by a molecular sequence such as DNA or RNA. These sequences are put into a multiple alignment, so that they all have the same length. Maximum parsimony then seeks a tree, along with inferred ancestral sequences, that minimizes the total number of evolutionary events, where only point mutations are counted.
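To make the criterion concrete, the following minimal sketch scores a fixed tree with Fitch's algorithm, a standard way to count the minimum number of substitutions on a given topology. It assumes equally weighted, unordered characters and a tree written as nested tuples of taxon names; the names `fitch_site`, `parsimony_score`, and the four-taxon alignment are illustrative and do not come from the study. The heuristics discussed here search over many trees, whereas this sketch only evaluates one.

```python
def fitch_site(tree, column):
    """Return (candidate_states, substitutions) for one alignment column.

    `tree` is a taxon name (leaf) or a pair (left, right) of subtrees.
    `column` maps each taxon name to its state at this site.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0          # a leaf carries its observed state
    left, right = tree
    left_states, left_cost = fitch_site(left, column)
    right_states, right_cost = fitch_site(right, column)
    shared = left_states & right_states
    if shared:                            # children agree: no extra change
        return shared, left_cost + right_cost
    # children disagree: one substitution is needed at this node
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch cost over all sites of an aligned set of sequences."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical toy alignment (four taxa, four sites).
alignment = {"A": "ACGT", "B": "ACGA", "C": "TCGA", "D": "TCGT"}
tree_ab_cd = (("A", "B"), ("C", "D"))
tree_ac_bd = (("A", "C"), ("B", "D"))

print(parsimony_score(tree_ab_cd, alignment))
print(parsimony_score(tree_ac_bd, alignment))
```

On this toy input the first topology needs fewer substitutions than the second, so it would be preferred under MP.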
Parsimony ratchet
Parsimony ratchet is a particular kind of phylogenetic search performed with alternating cycles of reweighting and Tree Bisection and Reconnection (TBR). The approach works as follows: starting with an initial tree, a subset of the characters (between 5% and 25%) is sampled and reweighted. It suffices to say here that reweighting of characters involves duplicating the characters so that each shows up twice (or more) in the resulting dataset. Then, using these reweighted characters, TBR search is performed until a new starting tree is reached. This new starting tree is then used with the original dataset to repeat the phylogenetic search. Parsimony ratchet thus tries to refine the search by generating a tree from a perturbed version of the data and using it as a new starting point. If the new tree is better than the old one, then the new one is used as the new starting tree. Otherwise, the old one is kept.
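The cycle described above can be sketched as a short loop. This is only an outline under stated assumptions: `search` and `score` are hypothetical callables standing in for a TBR search and a weighted parsimony score (neither is PAUP*'s actual interface), reweighting is modeled by doubling integer site weights, and the toy `toy_search` and `toy_score` functions exist solely so that the loop can be executed.

```python
import random


def ratchet(start_tree, n_sites, search, score,
            iterations=10, fraction=0.25, seed=0):
    """Outline of a ratchet loop.

    search(tree, weights) -> tree    hypothetical local (TBR-like) search
    score(tree, weights)  -> number  hypothetical weighted parsimony score
    """
    rng = random.Random(seed)
    original = [1] * n_sites                       # all sites equally weighted
    best = search(start_tree, original)
    for _ in range(iterations):
        # Perturb: double the weight of a random sample of the sites.
        perturbed = original[:]
        sample_size = max(1, round(fraction * n_sites))
        for site in rng.sample(range(n_sites), sample_size):
            perturbed[site] *= 2
        # Search on the reweighted data, then again on the original data.
        candidate = search(best, perturbed)
        candidate = search(candidate, original)
        # Keep the candidate only if it improves the original-weight score.
        if score(candidate, original) < score(best, original):
            best = candidate
    return best


# Toy stand-ins (assumptions for illustration): a "tree" is an integer
# and the optimum is the value 7.
def toy_score(x, weights):
    return abs(x - 7) * sum(weights)


def toy_search(x, weights):
    return x + 1 if x < 7 else x


result = ratchet(0, n_sites=8, search=toy_search, score=toy_score)
```

Passing the search and scoring routines as parameters keeps the loop independent of any particular phylogenetics package.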
Rec-I-DCM3
Recursive-Iterative DCM3 (Rec-I-DCM3) [4] implements a disk-covering method (DCM) [9–11] to improve the score of the trees it finds. A DCM is a divide-and-conquer technique that consists of four stages: divide, solve, merge, and refine. At a high level, these stages follow directly from DCM being a divide-and-conquer technique.
Rec-I-DCM3 involves all of the above DCM stages but, in addition, is both recursive and iterative. The recursive part concerns the divide stage of the DCM, where overlapping subsets of the input tree's leaf nodes may be further divided into yet smaller subsets (or subproblems). This is an important enhancement to the DCM approach because, for very large datasets, the subproblems would otherwise remain too large for an immediate solution. Thanks to the recursion, the subproblems eventually become small enough to be solved directly using some chosen base method. At this point, Rec-I-DCM3 uses the strict consensus merger to recombine the overlapping subtrees into a single tree. The iterative part of Rec-I-DCM3 refers to the repetition of the entire process just described. That is, the resulting tree becomes the input tree for a subsequent iteration of Rec-I-DCM3.
Comparing collections of trees
RF distance matrix
Given a collection of t evolutionary trees, we would like to quantify the topological differences that exist between them. We compute the t × t Robinson-Foulds (RF) matrix, which represents the dissimilarity between each pair of trees. Cell (i, j) in the t × t RF matrix represents the RF distance between the two trees labeled $T_i$ and $T_j$. The RF distance counts the number of bipartitions (or evolutionary relationships) that differ between the two trees. A bipartition is defined by an internal edge e of a phylogenetic tree, which separates the taxa on one side from the taxa on the other. The division of the taxa into two subsets is the bipartition $B_i$ associated with edge $e_i$. Let Σ(T) be the set of bipartitions defined by all edges in tree T. The RF distance between trees $T_1$ and $T_2$ is defined as

$$d_{RF}(T_1, T_2) = \frac{|\Sigma(T_1) - \Sigma(T_2)| + |\Sigma(T_2) - \Sigma(T_1)|}{2}.$$
Our figures plot the RF rate, which is obtained by normalizing the RF distance by the number of internal edges and multiplying by 100. Assuming n is the number of taxa, there are n - 3 internal edges in a binary tree. Hence the maximum RF distance between two binary trees is n - 3, which results in an RF rate of 100%. The RF rate allows us to compare topological differences when the number of taxa is different. Thus, the RF rate varies between 0% and 100%, which signify that trees T1 and T2 are identical and maximally different, respectively.
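As an illustration of these definitions, the sketch below extracts the non-trivial bipartitions of small trees and computes the RF distance, the RF rate, and the full RF matrix. It assumes the halved symmetric-difference convention, which is the one consistent with a maximum distance of n - 3 for binary trees. The nested-tuple tree encoding and all function names are illustrative assumptions, and the naive pairwise loop is not the HashRF algorithm.

```python
def leaves(tree):
    """Set of taxon names below a node (a name or a tuple of subtrees)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions, each stored as the side without the anchor taxon."""
    taxa = leaves(tree)
    anchor = min(taxa)                      # fixed taxon used to canonicalize sides
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        if 2 <= len(side) <= len(taxa) - 2:  # skip trivial splits
            found.add(side if anchor not in side else taxa - side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def rf_distance(t1, t2):
    b1, b2 = bipartitions(t1), bipartitions(t2)
    return (len(b1 - b2) + len(b2 - b1)) / 2


def rf_rate(t1, t2):
    """RF distance as a percentage of n - 3 (both trees share the same taxa)."""
    n = len(leaves(t1))
    return 100.0 * rf_distance(t1, t2) / (n - 3)


def rf_matrix(trees):
    return [[rf_distance(a, b) for b in trees] for a in trees]


# Three hypothetical five-taxon trees.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
t3 = ((("A", "D"), "B"), ("C", "E"))
matrix = rf_matrix([t1, t2, t3])
```

Here `t1` and `t2` share one of their two bipartitions, while `t1` and `t3` share none, so the corresponding RF rates are 50% and 100%.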
Relative entropy
Entropy represents the amount of disorder in a system. We use entropy to quantitatively capture the distribution of parsimony scores and RF rates among the collection of trees of interest. In our plots, we show relative entropy, which is a normalization of entropy, to allow the comparison of entropy values across different population sizes. Relative entropy ranges from 0% to 100%. Higher entropy values indicate more diversity (heterogeneity) among the population of trees. Lower entropy values indicate less diversity (homogeneity) in the population.
Let λ represent the total number of objects (parsimony scores or RF rates) in the population of trees. For example, suppose we want to partition a population of 10 trees based on their parsimony scores. Then, λ = 10. However, if we are interested in partitioning the 10 trees based on the upper triangle of the corresponding 10 × 10 RF matrix, then $\lambda = \frac{10 \times 9}{2} = 45$, since the RF matrix is symmetric. Next, we group the λ objects into P total partitions. Each partition i contains $n_i$ individuals with identical values. For RF, each individual in partition i will have the same RF value. An individual in the RF matrix refers to a cell location (p, q).
We can compute the entropy ($E_T$) of the collection of parsimony scores as:

$$E_T = -\sum_{i=1}^{P} p_i \log p_i,$$

where $p_i = n_i / \lambda$. The highest entropy value ($E_{max}$) is $\log \lambda$. Relative entropy ($E_{rel}$) is defined as the quotient of the entropy $E_T$ and the maximum entropy $E_{max}$, multiplied by 100 to obtain a percentage. Thus,

$$E_{rel} = \frac{E_T}{E_{max}} \times 100.$$
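A short sketch of this calculation follows. It groups identical values into partitions, takes each partition's share of the λ objects as its probability, and normalizes the resulting entropy by log λ. The function names are illustrative assumptions, and the choice of logarithm base does not matter because it cancels in the quotient.

```python
import math
from collections import Counter


def relative_entropy(values):
    """Relative entropy (in percent) of a list of scores or RF values."""
    lam = len(values)                     # total number of objects
    if lam < 2:
        return 0.0                        # log(1) = 0, so nothing to normalize
    counts = Counter(values)              # partition sizes n_i
    # -p * log(p) is written as p * log(1 / p) with p = n_i / lam.
    entropy = sum((n / lam) * math.log(lam / n) for n in counts.values())
    return 100.0 * entropy / math.log(lam)


def upper_triangle(matrix):
    """Cells (p, q) with p < q of a square, symmetric matrix."""
    size = len(matrix)
    return [matrix[p][q] for p in range(size) for q in range(p + 1, size)]


# Hypothetical parsimony scores for four trees.
identical = relative_entropy([8698, 8698, 8698, 8698])
all_distinct = relative_entropy([8698, 8699, 8700, 8701])
two_groups = relative_entropy([8698, 8698, 8700, 8700])
```

A population in which every tree has the same score yields 0%, one in which every score is distinct yields 100%, and two equally sized groups among four trees yield 50%. For a 10 × 10 RF matrix, `upper_triangle` returns the 45 cells that serve as the objects.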
Resolution rate
For n taxa, a resolved, unrooted binary tree will have n - 3 bipartitions (or internal edges). Trees with fewer than n - 3 bipartitions are considered to have unresolved relationships among the n taxa. In general, binary (or 100% resolved) trees are preferred by life scientists. The resolution rate of a tree is the number of bipartitions it contains, expressed as a percentage of the n - 3 bipartitions of a fully resolved tree. One common use of this measure is related to evaluating consensus trees, which are used to summarize the information from a set of t trees. The strict consensus method returns a tree whose bipartitions are only those that occur in all of the t trees. The majority consensus tree incorporates those bipartitions that occur in more than 50% of the t trees of interest. A highly resolved consensus tree indicates a high degree of similarity among the collection of trees.
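The following sketch computes the bipartitions of strict and majority consensus trees and their resolution rates. It assumes each tree is already given as a set of canonical bipartitions (here, frozensets of the taxa on one side), and it uses a strict majority (more than half of the trees), which guarantees that the selected bipartitions are mutually compatible. The function names and the three hand-written bipartition sets are illustrative only.

```python
from collections import Counter


def strict_consensus(bipartition_sets):
    """Bipartitions present in every tree."""
    return set.intersection(*(set(s) for s in bipartition_sets))


def majority_consensus(bipartition_sets):
    """Bipartitions present in more than half of the trees."""
    counts = Counter(b for s in bipartition_sets for b in set(s))
    half = len(bipartition_sets) / 2
    return {b for b, count in counts.items() if count > half}


def resolution_rate(bipartitions, n_taxa):
    """Percentage of the n - 3 possible bipartitions that are present."""
    return 100.0 * len(bipartitions) / (n_taxa - 3)


# Three hypothetical five-taxon trees, each with n - 3 = 2 bipartitions.
DE = frozenset("DE")
CDE = frozenset("CDE")
BDE = frozenset("BDE")
trees = [{DE, CDE}, {DE, BDE}, {DE, CDE}]

strict = strict_consensus(trees)
majority = majority_consensus(trees)
strict_rate = resolution_rate(strict, n_taxa=5)
majority_rate = resolution_rate(majority, n_taxa=5)
```

In this example only one bipartition is shared by all three trees, so the strict consensus is 50% resolved, whereas the majority consensus also keeps the bipartition found in two of the three trees and is 100% resolved.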
Experimental methodology
Datasets
We used the following biological datasets as input to study the behavior of the maximum parsimony heuristics.
1. A 60-taxon dataset (2,000 sites) of ensign wasps composed of three genes (28S ribosomal RNA (rRNA), 16S rRNA, and cytochrome oxidase I (COI)) [12]. The best-known parsimony score is 8,698, which was established by both Pauprat and Rec-I-DCM3.
2. A 174-taxon dataset (1,867 sites) of insects and their close relatives for the nuclear small subunit ribosomal RNA (SSU rRNA) gene (18S). The sequences were manually aligned according to the secondary structure of the molecule [13]. The best-known parsimony score is 7,440, which was established by both Pauprat and Rec-I-DCM3.
3. A set of 500 aligned rbcL DNA sequences (759 parsimony-informative sites) [14] of seed plants. The best-known parsimony score is 16,218, which both Pauprat and Rec-I-DCM3 found.
4. A set of 567 "three-gene" (rbcL, atpB, and 18S) aligned DNA sequences (2,153 sites) of angiosperms [15]. The best-known parsimony score is 44,165, which both Pauprat and Rec-I-DCM3 found.
Starting trees
All methods used PAUP*'s random sequence addition module to generate the starting trees. First, the ordering of the sequences in the dataset is randomized. Afterwards, the first three taxa are used to create an unrooted binary tree, T. The fourth taxon is added to the edge of T that results in the best MP score. This process continues, one taxon at a time, until all taxa are added to the tree. The resulting tree is then used as the starting tree for a phylogenetic analysis.
Implementation and platform
We set the parameters of the Pauprat and Rec-I-DCM3 algorithms according to the recommended settings in the literature. We used PAUP* [6] to analyze our four datasets with the parsimony ratchet heuristic. For our analysis, we randomly selected 25% of the sites and doubled their weight; initially, all sites are equally weighted. On each dataset, we performed 5 independent runs of the parsimony ratchet, each time running the heuristic for 1,000 iterations. For Rec-I-DCM3, it is recommended that the maximum subproblem size be 50% of the number of sequences for datasets with 1,000 or fewer sequences and 25% of the number of sequences for larger datasets containing no more than 10,000 sequences. We used the recommended settings established by Roshan et al. [4] for using TNT as a base method within the Rec-I-DCM3 algorithm.
We used the HashRF algorithm [16, 17] to compute the RF distances between trees. Each heuristic was run five times on each of the biological datasets. All experiments were run on a Linux Beowulf cluster, which consists of four 64-bit, quad-core processor nodes (16 total CPUs with gigabit-switched interconnects). Each node contains four 2 GHz AMD Opteron processors, which share 4 GB of memory. We note that both Rec-I-DCM3 and parsimony ratchet are sequential algorithms. The parallel computing environment was used only as a way to execute multiple, independent batch runs concurrently.