Surprising results on phylogenetic tree building methods based on molecular sequences

Gonnet, Gaston H

doi:10.1186/1471-2105-13-148

Methodology article
Open access
Published: 27 June 2012

BMC Bioinformatics volume 13, Article number: 148 (2012)

Abstract

Background

We analyze phylogenetic tree building methods from molecular sequences (PTMS). These are methods which base their construction solely on sequences, coding DNA or amino acids.

Results

Our first result is a statistically significant evaluation of 176 PTMSs done by comparing trees derived from 193138 orthologous groups of proteins using a new measure of quality between trees. This new measure, called the Intra measure, is very consistent between different groups of species and strong in the sense that it separates the methods with high confidence.

The second result is the comparison of the trees against trees derived from accepted taxonomies, the Taxon measure. We consider the NCBI taxonomic classification and their derived topologies as the most accepted biological consensus on phylogenies, which are also available in electronic form. The correlation between the two measures is remarkably high, which supports both measures simultaneously.

Conclusions

The big surprise of the evaluation is that the maximum likelihood methods do not score well; minimal evolution distance methods over MSA-induced alignments score consistently better. This comparison also allows us to rank different components of the tree building methods, like MSAs, substitution matrices, ML tree builders, distance methods, etc. It is also clear that there is a difference between Metazoa and the rest, which points to evolution leaving different molecular traces. We also think that these measures of quality of trees will motivate the design of new PTMSs, as it is now easier to evaluate them with certainty.

Background

Phylogenetic tree reconstruction from molecular sequences (PTMS) was first suggested by Emile Zuckerkandl and Linus Pauling [1] and is now one of the major tools in the arsenal of bioinformatics. By PTMS we mean methods that build a phylogenetic tree based solely on sequences, either coding DNA or amino acids. Of the many people who have contributed to this field, J. Felsenstein deserves special mention for his many contributions, summarized in his book [2].

Computing phylogenies is ubiquitous, and not only of academic interest, but also quite practical: selecting model organisms [3], tracing disease [4], finding vectors [5], finding suitable defenses to new viruses [6], maximizing diversity for species conservation [7], tracing ancestry and population movements [8, 9] and many other problems are solved with the aid of good phylogenetic trees.

The state of testing of PTMS is far from satisfactory. This is obvious when we see the discrepancies between the results from bioinformatics and the accepted taxonomies produced by biologists, and the high confidence measures that bioinformaticians have tried to attach to their results [10–12]. In short, in our experience, the distrust that biologists may have in PTMS is justifiable.

Most results in the literature supporting PTMSs use: (i) extensive simulations, (ii) measures of quality, (iii) small scale comparisons of some specific trees, (iv) some intuition. These techniques are useful, but limited. Specifically, simulations are excellent to discover errors and to find the variability that we may expect from the methods. Yet simulations usually rely on a model of evolution (e.g. Markovian evolution). It is then expected that a method which uses the same model will perform best. Measures of quality include bootstrapping, branch support confidence and indices on trees (like least squares error in distance trees or likelihood in maximum likelihood (ML) trees). These measures also rely on some statistical model which is essentially an approximation of reality. Bootstrapping values have suffered from over-confidence and/or misinterpretation and are sensitive to model violations [13–16]. Furthermore, these techniques are directed towards assessing a particular tree rather than assessing the methods. Small scale comparisons are valuable but usually lack the sample size to make the results statistically strong. We consider any evidence based on fewer than 100 cases to be “anecdotal”. Any study where a subset of cases is selected is a candidate to suffer from the bias arising from an author trying to show the best examples for his/her method. Finally, intuitions are very valuable, but cannot stand scientific scrutiny. By intuitions we mean decisions which are not based on strict optimality criteria, e.g. character weights in traditional parsimony methods, the use of global or local alignments, the choice of method for MSA computation, the choice of distance measure, etc.

The main problem is that there is no “gold-standard” against which methods can be evaluated. Hopefully this paper will provide two such standards.

Computing phylogenetic trees consumes millions of hours in computers around the world. Because some of these computations are so expensive and not reliable, biologists are tempted to use faster, lower quality methods. This evaluation (which itself consumed hundreds of thousands of hours) will help bioinformaticians get the most out of their computations. In particular, as we show, some of the best PTMS are remarkably fast to compute.

We measure the quality of the PTMS in two ways: by their average difference on trees which have followed the same evolution, and by their average distance to taxonomic trees. This allows us to find the best methods and, by averaging in different ways, the best components of the methods.

There is no single method that is best in all circumstances. Some of the classes of species show a preference for a particular method. This should not come as a surprise: different organisms may leave different molecular imprints of their evolution.

Results

We now introduce the two measures on PTMSs.

The Intra measure

For a given PTMS and several orthologous groups (OGs) we can construct a tree for every OG. The trees should all follow the same evolutionary history, hence the trees should all be compatible (Figure 1, shaded yellow). The average distance between trees built from different OGs is thus a measure of quality of the method (the smaller the distance, the better the method). We call this measure the Intra measure. Since the PTMS does not get any information about the species of the input sequences, the only way for it to produce a smaller distance between trees is by extracting information from the sequences. In this sense, the best algorithm is the algorithm which extracts the most relevant information from the sequences to derive the phylogeny; which is exactly what we want. In mathematical terms the Intra measure of a PTMS M is the expected value:

Intra(M) = E[d(M(g_i), M(g_j))]    (1)

where g_i and g_j are two different orthologous groups. The distance d(.,.) is the Robinson-Foulds distance [17] between two trees built with the same PTMS over different OGs. It is computed only over the species appearing in both OGs (Figure 2). We estimate this expected value from all the available pairs of OGs. The measure will be incorrect in cases of lateral gene transfer (LGT), where the sequences do not follow the same evolution. We expect LGT events to be few and, since all methods are affected by them, we do not expect them to introduce a bias.
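The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the nested-tuple tree representation, the function names (`leaves`, `prune`, `splits`, `rf_distance`, `intra`), the use of the absolute (unnormalized) Robinson-Foulds count, and the choice to skip pairs of trees that share fewer than four species are all assumptions made for the example.

```python
from itertools import combinations

# Assumed representation: a tree is a leaf name (str) or a tuple of subtrees.

def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result

def prune(tree, keep):
    """Restrict a tree to the leaves in `keep`, collapsing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(ch, keep) for ch in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)

def splits(tree):
    """Non-trivial bipartitions of the tree, read as unrooted.

    Each bipartition is stored as the side that does not contain a
    fixed reference leaf, so equal splits compare equal.
    """
    all_leaves = frozenset(leaves(tree))
    ref = min(all_leaves)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(c) for c in node))
        side = all_leaves - below if ref in below else below
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return below

    walk(tree)
    return found

def rf_distance(t1, t2):
    """Robinson-Foulds distance over the leaves shared by both trees.

    Returns None when fewer than 4 leaves are shared (assumption of this
    sketch: such pairs carry no topological information and are skipped).
    """
    shared = leaves(t1) & leaves(t2)
    if len(shared) < 4:
        return None
    return len(splits(prune(t1, shared)) ^ splits(prune(t2, shared)))

def intra(trees):
    """Estimate Intra(M): mean RF distance over all pairs of OG trees."""
    dists = []
    for a, b in combinations(trees, 2):
        d = rf_distance(a, b)
        if d is not None:
            dists.append(d)
    return sum(dists) / len(dists) if dists else float("nan")

# Hypothetical trees built by one method from three OGs.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
t3 = (("A", "B"), ("C", "D"))  # species E is missing from this OG
print(intra([t1, t2, t3]))
```

In the toy input, `t3` lacks species E, so its distances to `t1` and `t2` are computed on the four shared species only, mirroring the restriction to species appearing in both OGs.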

The Taxon measure

This measures how far the computed tree is from the true taxonomic tree. A smaller distance, averaged over a large number of OGs, means a better method. For a given PTMS and several orthologous groups (OGs) we compute the distance between the tree built on each OG and the true taxonomic tree (or its approximation from NCBI, Figure 1, shaded blue). We call this average distance the Taxon measure. The trees derived from the taxonomy represent the consensus and summary of many scientific papers, databases and experts and could be described as the “state of the art”. Errors in the taxonomy should affect all methods equally and will be like random noise. (Biases derived from the use of these methods for building the taxonomy are discussed in the Caveats (iv) section.) In mathematical terms the Taxon measure of a PTMS M is the expected value:

Taxon(M) = E[d(M(g), T_g)]    (2)

where d(.,.) is the RF distance between two trees, g is an orthologous group, M(g) is the tree produced by M applied to the sequences in g and T_g is the taxonomic tree for the species in the group g. We estimate the expected value by the average over all the orthologous groups available to us. Note that while the taxonomic tree is a single tree, we will be sampling tens of thousands of different subsets of this single tree (and many hundreds of totally independent subsets). See Methods, Table 1, for full results. In [18, 19] a similar idea is used, that of comparing the trees against a small, indisputable, topology.
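Written out, the averaging described above is a plain sample mean. The hat notation, the index k and the symbol N (the number of orthologous groups used) are notational assumptions introduced here for illustration; they do not appear in the original text.

```latex
\widehat{\mathrm{Taxon}}(M) \;=\; \frac{1}{N} \sum_{k=1}^{N} d\bigl(M(g_k),\, T_{g_k}\bigr)
```

Here g_1, ..., g_N are the available orthologous groups and T_{g_k} is the taxonomic tree restricted to the species present in g_k, so each term compares the computed tree with a different subset of the single taxonomic tree.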

Table 1 Taxon and Intra measure, output 1
