SlowFaster, a user-friendly program for slow-fast analysis and its application on phylogeny of Blastocystis
© Kostka et al; licensee BioMed Central Ltd. 2008
Received: 04 October 2007
Accepted: 15 August 2008
Published: 15 August 2008
Slow-fast analysis is a simple and effective method to reduce the influence of substitution saturation, one of the causes of phylogenetic noise and long branch attraction (LBA) artifacts. In several steps of increasing stringency, the slow-fast analysis omits the fastest substituting alignment positions from the analysed dataset and thus increases its signal/noise ratio.
Our program SlowFaster automates the process of assessing the substitution rate of the alignment positions and the process of producing new alignments by deleting the saturated positions. Its use is very simple. It goes through the whole process in several steps: data input – necessary choices – production of new alignments.
SlowFaster is a user-friendly tool providing new alignments prepared with slow-fast analysis. These data can be used for further phylogenetic analyses with lower risk of long branch attraction artifacts.
The long branch attraction (LBA) artifact  still remains one of important causes of biases and mistakes in phylogenetic analyses of sequence data . LBA causes taxa with long branches to be artifactually grouped with or attracted to other long branched taxa (i.e., fast evolving taxa or taxa evolving for a long time separate from other groups, e.g. outgroups). An important source of LBA is substitution saturation of positions in alignment (the term "mutational saturation" is also used, although it is not correct in this context). It would be ideal to have positions that underwent a single or a few changes during evolution, but many positions in real alignments are subject to multiple substitutions. This subset of rapidly evolving positions is the source of stochastic noise rather than useful signal. However, these saturated positions are responsible for a major part of information used in phylogenetic analyses , which could confuse most of the tree-reconstructing methods. Because there are only four possible states for nucleic acid data (20 for amino acids), it is probable that a part of saturated positions will evolve randomly – convergently into the same state. It could then be erroneously judged as a synapomorphy. LBA can thus be a major problem especially in maximum parsimony, but occurs also in other analyses . Maximum likelihood can, under an appropriate model of evolution, deal better with saturated positions, but datasets containing sites with different rates of substitution across the tree (covarion-like) may still be problematic . Besides LBA, a high level of saturation in the dataset may cause signal simply to be overwhelmed by noise at least at some points of the tree topology. Such nodes could be resolved incorrectly or (at least) with a low statistical support.
It has been shown that in real alignments, LBA can be a major problem . An effective way to estimate and reduce the effect of substitution saturation and LBA is removal of fast-evolving data. One such method is slow-fast analysis of the dataset . The positions of the alignment are divided into several classes according to their substitution rate (estimated within a priori defined monophyletic groups). Several new alignments are then created, which contain only positions with a substitution rate lower than several thresholds, ranging from maximum to minimum rate. Thus the signal/noise ratio of the alignments successively increases, however, on the expense of amount of positions included in the alignment. Technically, the Slow-Fast method needs some input tree topology to work with. The topology must be provided by primary phylogenetic analysis of the dataset or by another independent method. This topology is needed for recognition of some monophyletic groups (whose relative positions on the tree is not necessary to know before slow-fast analysis). Maximum parsimony is then used to determine the number of changes for each position within the monophyletic subgroups. Substitution rates assigned to positions are thus independent from interrelationships among the monophyletic groups, and therefore, these interrelationships may in turn be investigated without the fear of circularity. When each position is assigned its number of changes, those with the highest substitution rate are gradually omitted from new alignments. The following phylogenetic analyses of these new datasets (starting from the dataset containing the positions with the highest substitution rate) then provide results based on decreasing number of sequence data, however, with decreasing risk of artefactual groupings of long branches. There are several good examples of successful use of slow-fast analysis, see e.g. [6–11].
Although the slow-fast analysis is relatively powerful and very simple in principle, it is quite demanding when one wants to determine the number of changes for individual alignment positions (e.g., with the help of PAUP , using the "describetree" command) and the manual procedure of deleting of positions by editing the dataset is especially very time consuming. We believe that this is one of the most important reasons why this method is used relatively scarcely. Clearly, a computer program that provides this evaluation of positions and which produces new alignments would be handy. To our knowledge, the only software providing slow-fast analysis have been MUST . MUST is a complex package, yet it still does not provide a quick and easily operated tool for this type of analysis. This is what our program SlowFaster does. It is a user-friendly tool to conduct slow-fast analysis and produce a set of new alignments without fast evolving positions. It have several additional functions. Note that another program for slow-fast analysis was presented recently .
SlowFaster was programmed in Borland Delphi and runs under MS Windows. Both the executable file [see Additional file 1] and the source code [see Additional file 2] are available as supplements. The program leads the user in several steps through the process of generating new datasets. Original alignment is loaded in FASTA, Phylip or NEXUS format. The program works with both nucleic acid and amino acid alignments and supports usual ambiguity coding. The topology needed for the recognition of monophyla is loaded as a tree in the Newick ("bracketed") format (PAUP users can use "savetree format = phylip" command to obtain tree in Newick format). After choosing the monophyletic groups by simply clicking on the branches of the depicted tree, parsimony is used to count the number of changes of every alignment position within the selected groups. Finally, new alignments are produced (in FASTA, Phylip or NEXUS format). Each of the new datasets has a number which is a threshold: positions with greater number of changes were omitted from this dataset. As the threshold gets lower and lower, the datasets contain fewer and fewer data because the more saturated positions were deleted from them. These datasets can be then further analysed to obtain phylogenies with a lower risk of LBA. During the whole process, there are hints shown in a window, telling the user what to do in the given step.
The software was tested thoroughly on several model datasets [see Additional file 3] and also on dataset of Hampl et al. . In this latter case, we obtained the same new datasets with our program (Hampl et al. obtained them with the help of PAUP and through careful manual deletion of positions).
An interesting alternative to slow-fast method is using substitution rates estimated with maximum likelihood (ML). Although ML estimates are not implemented in SlowFaster, this program enables production of alignments without positions with high rates through the "Load changes" button. The rates can be counted in another software. E.g. Tree-Puzzle , if rate heterogeneity is selected, gives information on the rate category of each position in its outfile under "Combination of categories that contributes the most to the likelihood". These data can be simply copied in a file which is then loaded in SlowFaster. New alignments are then produced directly from these data. More generally, any sequence of any (even real) numbers can be loaded and the software will divide positions in rate categories (their number is specified by the user) based on these values.
The program also creates a log file which contains useful information, most notably groups used for changes counting, list of positions with certain number of changes and number of changes for all positions from the first one to the last.
Results and discussion
As an example, we analysed an alignment of 34 SSU rDNA sequences of 31 isolates of Blastocystis + 3 outgroups. Blastocystis is an unusual protist, a sister group of slopalinids (used as the outgroup) within the group of stramenopiles. See e.g. [15, 16] for a review of Blastocystis. Although these nonflagellated, multinucleated gut commensals comprise a single genus, their SSU rDNA phylogeny shows clearly that they are rather long branched taxa in comparison to other stramenopiles. Their branches are even longer than, for example, branches separating classes of autotrophic stramenopiles. This group is therefore suspected of a high level of substitution saturation. We sequenced SSU rRNA genes of five Blastocystis isolated from tortoises to improve taxon sampling by increasing the number of non-mammalian and non-bird isolates in the analysis (the vast majority of Blastocystis sequences available in GenBank are from bird or mammalian isolates). The accession numbers of the five new sequences (GERA3b, GERA3a, GECA2, KINIX2 and GEPA2) are [GenBank:EF209016], [GenBank:EF209017], [GenBank:EF209018], [GenBank:EF209019] and [GenBank:EF209020], respectively.
Use of SlowFaster
At this time, our SlowFaster program was employed to perform the slow-fast analysis. First, the alignment used in our analyses was loaded via the "Load alignment" button. Then the tree topology shown in fig. 1 was loaded via the "Load tree" button. In typical slow-fast analyses, several monophyletic subgroups are chosen in this step. We decided to select the single subtree of all Blastocystis isolates. This arrangement was enabled by the fact that we were mostly interested in resolving the two nodes represented in the input tree by trichotomies. Assigning substitution rates to alignment positions was thus independent from the true topology of these nodes. When the Blastocystis-containing subgroup was chosen in the tree window of SlowFaster program, new datasets in NEXUS format were created by clicking the "New alignments" button. Also, alignments of the same length as these new datasets, but shortened by random deletion of positions, were prepared by checking the "jackknives" checkbox on the program screen. These were used to test whether the loss of informative positions influences decrease of bootstrap support of the resulting tree topology more than shortening the datasets itself. We did not use the "Weights" feature of the program. When this checkbox is checked, the algorithm will assign different weights to changes within different chosen monophyletic groups. Changes within smaller groups would have assigned greater weight (if group A is twice as taxon-rich as group B, changes within it will have half the weight of the weight of changes in group B). The impact of large monophyla is then not dominant just because they contain more taxa.
The maximum number of observed changes in a position of our alignment was 9. Thus, nine new alignments were created. They were labeled BlastoS8 down to BlastoS0, where the number is the threshold. BlastoS0 alignment was of course of no use in this particular case (the analysis with just one monophyletic group) as it contained only those positions that did not change during the evolution of Blastocystis. All other alignments were analysed phylogenetically by all four methods (ML, MP, LD, MD) and topologies of the 32 resulting trees were bootstrapped.
It is highly probable that in some point of the slow-fast analysis, the profit from diminishing noise is lower than the loss from diminishing information. To roughly estimate the effect of the lack of information, we used average values of bootstraps as a measure of reliability of the alignments . We found that this average value drops suddenly for the alignment BlastoS1 which is therefore likely to suffer from lack of information and the resulting trees obtained from this dataset were not taken into account. To further prove this decision, "jackknifed" datasets of the same length but shortened by random deletion of position were also analysed. For each of eight datasets (Blasto_S1 to S8), ten of these randomly shortened datasets were analysed (80 alignments on the whole: Blasto_J1_1 to J1_10, J2_1 to J2_10, ... J8_1 to Blasto_J8_10). Within each dataset, the average value of bootstraps was determined and average of these averages for ten dataset of the same length were compared to average bootstrap value of the respective dataset resulting from slow-fast analysis. This comparison showed that the bootstrap values does not change much when analysing J8_x down-to J1_x datasets (e.g. all these average values ranged from 84.76 to 86.36 in ML analyses or from 90.15 to 91.5 in LD). On the contrary, the downfall of bootstraps was much more prominent in Blasto_S1 dataset when compared to Blasto_S2 – BlastoS8 datasets (e.g. 87.19 for original dataset, 86.90 for Blasto_S2, but 81.13 for Blasto_S1 in ML analyses, or 91.29 and 88.03 vs. 79.13, respectively, for LD).
Overview of results from slow-fast analysis of Blastocystis alignment
Overall, the slow-fast analysis, provided by the program SlowFaster, proved to be a useful tool to solve uncertain phylogenies by increasing the signal/noise ratio. In the Blastocystis SSU rDNA tree it was able to make a choice among competing hypotheses and add more confidence in some other cases. Our software automates quite time-consuming slow-fast analysis.
Availability and requirements
Project name: SlowFaster
Project home page: http://natur.cuni.cz/flegr/programs/slowfaster.htm
Operating system: MS Windows
Programming language: Borland Delphi
Any restrictions to use by non-academics: none
The software can be accessed through the project home page and its current version is included with the manuscript as an additional file.
This work was supported by Czech Ministry of Education (projects MSM 0021620828 and MSM 6007665806) and Czech Science Foundation (project 206/05/0371). We also want to thank Vladimir Hampl for providing us with his data for software testing. Lastly, we owe a huge thank to four anonymous reviewers. Without them the SlowFaster would not be what it is.
- Felsenstein J: Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool 1978, 27: 401–410. 10.2307/2412923View ArticleGoogle Scholar
- Brinkmann H, Giesen M, Zhou Y, de Raucourt GP, Philippe H: An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics. Syst Biol 2005, 54: 743–757. 10.1080/10635150500234609View ArticlePubMedGoogle Scholar
- Gribaldo S, Philippe H: Pitfalls in tree reconstruction and the phylogeny of eukaryotes. In Organelles, genomes and Eukaryote phylogeny. Edited by: Hirt RP, Horner DS. CRC Press, Boca Raton/London/New York/Washington, D.C; 2004:133–152.View ArticleGoogle Scholar
- Bergsten J: A review of long-branch attraction. Cladistics 2005, 21: 163–193. 10.1111/j.1096-0031.2005.00059.xView ArticleGoogle Scholar
- Inagaki Y, Susko E, Fast NM, Roger AJ: Covarion Shifts Cause a Long-Branch Attraction Artifact That Unites Microsporidia and Archaebacteria in EF-1 a Phylogenies. Mol Biol Evol 2004, 21: 1340–1349. 10.1093/molbev/msh130View ArticlePubMedGoogle Scholar
- Brinkmann H, Philippe H: Archaea Sister Group of Bacteria? Indications from Tree Reconstruction Artifacts in Ancient Phylogenies. Mol Biol Evol 1999, 16: 817–825.View ArticlePubMedGoogle Scholar
- Zhu G, Keithly JS, Philippe H: What is the phylogenetic position of Cryptosporidium? IJSEM 2000, 50: 1673–1681.PubMedGoogle Scholar
- Brochier C, Phillipe H: Phylogeny: a non-hyperthermophilic ancestor for bacteria. Nature 2002, 417: 224. 10.1038/417244aView ArticleGoogle Scholar
- Hampl V, Cepicka I, Flegr J, Tachezy J, Kulda J: Critical analysis of the topology and rooting of the parabasalian 16S rRNA tree. Mol Phyl Evol 2004, 32: 711–723. 10.1016/j.ympev.2004.03.005View ArticleGoogle Scholar
- Hampl V, Vrlik M, Cepicka I, Pecka Z, Kulda J, Tachezy J: Affiliation of Cochlosoma to trichomonads confirmed by phylogenetic analysis of the small-subunit rRNA gene and a new family concept of the order Trichomonadida. IJSEM 2006, 56: 305–312.PubMedGoogle Scholar
- Bapteste E, Susko E, Leigh J, Ruiz-Trillo I, Bucknam J, Doolittle WF: Alternative methods for concatenation of core genes indicate a lack of resolution in deep nodes of the prokaryotic phylogeny. Mol Phyl Evol 2008, 25: 83–91. 10.1093/molbev/msm229View ArticleGoogle Scholar
- Swofford DL: PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts; 2002.Google Scholar
- Philippe H: MUST, a computer package of management utilities for sequences and trees. Nucleic Acids Res 1993, 21: 5264–5272. 10.1093/nar/21.22.5264PubMed CentralView ArticlePubMedGoogle Scholar
- Schmidt HA, Strimmer K, Vingron M, von Haeseler A: TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 2002, 18: 502–504. 10.1093/bioinformatics/18.3.502View ArticlePubMedGoogle Scholar
- Stenzel DJ, Boreham PFL: Blastocystis hominis revisited. Clin Microbiol Rev 1996, 9: 563–584.PubMed CentralPubMedGoogle Scholar
- Tan KSW, Singh M, Yap EH: Recent advances in Blastocystis hominis research: hot spots in terra incognita. Int J Parasitol 2002, 32: 789–804. 10.1016/S0020-7519(02)00005-XView ArticlePubMedGoogle Scholar
- Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 1997, 24: 4876–4882. 10.1093/nar/25.24.4876View ArticleGoogle Scholar
- Hall TA: BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl Acids Symp Ser 1999, 41: 95–98.Google Scholar
- Posada D, Crandall CA: MODELTEST: testing the model of DNA substitution. Bioinf 1998, 14: 817–818. 10.1093/bioinformatics/14.9.817View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.