Improving duplicated nodes position in vertebrate gene trees
© Peres and Crollius; licensee BioMed Central Ltd. 2015
Published: 13 February 2015
While gene phylogenies are essential for many biological evolutionary studies, phylogenetic reconstructions are difficult to model, especially when they include gene duplications. In this study, we have developed a method to improve the positions of duplications in gene trees produced by TreeBest, a widely used method at the core of the "Ensembl compara" pipeline.
In order to automatically identify incorrectly positioned duplications, we investigated a method that relies on the confidence score, a measure between 0 and 1 introduced by TreeBest that is assigned to each duplication node. This score reflects the ratio between the number of species with a duplicated gene and the total number of species derived from this node. A well-supported duplication will thus have a score closer to 1.
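The score described above can be illustrated with a minimal sketch. It assumes that "species with a duplicated gene" means species present in both child subtrees of the duplication node, and that "total number of species" means the union of species under the node. The function name and the species lists are hypothetical and are not taken from TreeBest.

```python
def duplication_confidence(left_species, right_species):
    """Fraction of the species under a duplication node found in both child subtrees.

    Assumption: a species "has a duplicated gene" when it appears on both
    sides of the duplication node.
    """
    left, right = set(left_species), set(right_species)
    total = left | right
    if not total:
        raise ValueError("a duplication node needs at least one descendant species")
    return len(left & right) / len(total)


# Hypothetical example: every species kept both copies after the duplication.
well_supported = duplication_confidence(
    ["human", "mouse", "dog"], ["human", "mouse", "dog"]
)

# Hypothetical example: only one of four species carries both copies.
poorly_supported = duplication_confidence(
    ["human", "mouse", "dog", "chicken"], ["human"]
)
```

Under these assumptions the first node scores 1.0 and the second 0.25, so the second is the kind of node a low threshold would flag for editing.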
With our method, if a duplication node is considered to be poorly supported, it is replaced by a speciation node and the duplication is moved to the following node, which is tested in the same way. If the new duplication node passes the test, the duplication is maintained at this new position in the tree.
To test our method comprehensively, we ran it on all 20194 phylogenetic trees available in the Ensembl compara database version 71. The resulting 20194 edited gene trees were then compared with the original Ensembl gene trees by feeding both databases to AGORA, an algorithm developed in our laboratory to reconstruct ancestral gene orders. This tool allowed us to assess the quality of the new gene trees because its performance is very sensitive to the quality of its input: in particular, the length of the reconstructed ancestral chromosomal regions varies substantially with the quality of the input gene trees.
We find that the confidence score method significantly improves the positions of duplications within gene trees compared to the initial Ensembl gene tree database. The best results are obtained with a threshold score of 0.3, at which 39% of the 197 894 duplication nodes of the Ensembl gene tree database are edited, resulting in an increase in the N50 length of the ancestral reconstructions for the 58 vertebrate ancestors. These results suggest that our improved gene trees are more reliable.
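For readers unfamiliar with the N50 statistic, the sketch below shows one standard way to compute it: the length of the block at which the cumulative length, counted from the longest block downwards, first reaches half of the total. The block lengths are hypothetical and are not results from this study; how AGORA measures block length is not specified here.

```python
def n50(block_lengths):
    """Return the N50 of a collection of block lengths.

    Blocks are visited from longest to shortest; the N50 is the length of
    the block at which the running total first reaches half of the sum.
    """
    lengths = sorted(block_lengths, reverse=True)
    if not lengths:
        raise ValueError("N50 is undefined for an empty collection")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical reconstructions covering the same total length (20 units).
fragmented = n50([2] * 10)    # many short ancestral blocks
contiguous = n50([10, 6, 4])  # fewer, longer ancestral blocks
```

With these made-up inputs the fragmented reconstruction has an N50 of 2 and the contiguous one an N50 of 10, which is why a higher N50 is read as a more contiguous reconstruction.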
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.