DiscML: an R package for estimating evolutionary rates of discrete characters using maximum likelihood
© Kim and Hao; licensee BioMed Central Ltd. 2014
Received: 26 July 2014
Accepted: 25 September 2014
Published: 27 September 2014
The study of discrete characters is crucial for the understanding of evolutionary processes. Even though great advances have been made in the analysis of nucleotide sequences, computer programs for non-DNA discrete characters are often dedicated to specific analyses and lack flexibility. Discrete characters often have different transition rate matrices, variable rates among sites and sometimes contain unobservable states. To obtain the ability to accurately estimate a variety of discrete characters, programs with sophisticated methodologies and flexible settings are desired.
DiscML performs maximum likelihood estimation for evolutionary rates of discrete characters on a provided phylogeny with the options that correct for unobservable data, rate variations, and unknown prior root probabilities from the empirical data. It gives users options to customize the instantaneous transition rate matrices, or to choose pre-determined matrices from models such as birth-and-death (BD), birth-death-and-innovation (BDI), equal rates (ER), symmetric (SYM), general time-reversible (GTR) and all rates different (ARD). Moreover, we show application examples of DiscML on gene family data and on intron presence/absence data.
DiscML was developed as a unified R program for estimating evolutionary rates of discrete characters with no restriction on the number of character states, and with flexibility to use different transition models. DiscML is ideal for the analyses of binary (1s/0s) patterns, multi-gene families, and multistate discrete morphological characteristics.
Many evolutionary processes involve transitions among different discrete characteristic states, including changes in morphological characteristics , sequence gain and loss [2, 3], gene family expansion and contraction , gain and loss of mobile promoters  and epigenetic characteristics such as methylation . Evolutionary rates of discrete characters have been estimated using programs primarily developed for constructing ancestral character states such as the ACE function of the APE package  in R, standalone programs BayesTraits  and Mesquite . Recently, great efforts have been made to estimate gene family turnover rates. The GLOOME program maps gain and loss rates using binary characters (or 1s/0s) , while Count , BEGFE , BadiRate , and CAFE3  employ birth-and-death (BD) models to study gene family expansion and contraction.
Some of these programs have advanced (or realistic) features that are not implemented in other programs. For instance, the BayesTraits program implements a Γ-distribution for rate variation . The GLOOME program allows the estimation of prior root probabilities of the character states [10, 15]. The BadiRate program allows variable birth rates and death rates, and corrects for unobservable data . Furthermore, many multistate characters do not necessarily evolve in a BD manner , and should therefore be modeled using transition rate matrices other than BD.
In order to perform accurate rate estimation on a variety of discrete characters, we have developed a unified program DiscML by implementing the advanced features mentioned above as well as flexible options for transition rate matrices.
DiscML estimates the evolutionary rates of discrete characters by fitting the distribution of all character states (the data) on a given phylogeny. The data need to be in a matrix format (vector format for a single site) as required in many other phylogenetic programs in R (see examples in Additional file 1). The provided phylogeny is required to have branch lengths, as branch lengths will be used as a relative time scale in the analysis. The evolutionary rates, transition rate matrices, and additional parameters discussed below will be optimized to maximize the likelihood of the data. The optimization is achieved using the PORT routines  implemented in the nlminb function in R.
Implementation of rate variation in the analysis
Rate variation among the character sites has long been recognized and implemented in DNA analyses , but has been missing from most analyses of non-DNA discrete characters (but see ). DiscML considers rate variation among the character sites by implementing a discrete Γ distribution (with the option of alpha=TRUE).
Estimation of prior root probabilities
Most programs for the analysis of discrete characters assume only uniformly distributed prior root probabilities, e.g., , (a is the total number of character states). DiscML allows the estimation of prior root probabilities ( π a ) for different character states (with the option of rootprobability=TRUE).
Flexibility on both the transition model and the number of character states
so that the evolutionary rate parameter ( μ) is the average number of transition events per site per evolutionary time unit .
Forced reversibility and flexible irreversible options
In DiscML, the default setting is reversible=FALSE and users have the flexibility to conduct analysis by assuming irreversible evolutionary processes. Unlike in reversible processes, the root position can greatly affect the maximum likelihood calculation in irreversible cases [22, 23]. Therefore, it is only meaningful to perform irreversible analysis on a rooted tree. If the provided phylogenetic tree is unrooted, DiscML will first reroot the tree by midpoint rooting, and perform analysis on the midpoint rooted tree.
Correction for unobservable data
DiscML estimates from the gene family data in the Bacillaceae (B1, B2, B3) clades
ER+0+ π + Γ
ER+0+ πREV + Γ
Site and branch specific estimations
Even though the default setting of DiscML is to perform rate estimation by fitting the distribution pattern of all character sites on a phylogeny, there is an option to perform rate estimation on individual sites (ind=TRUE). Individual rates can be graphically displayed using plotmu=TRUE. Furthermore, DiscML allows branch specific rate estimation, which can be specified using ‘$’ on branches in the provided tree file. For instance, (((taxon1$1: 0.01, taxon2$1: 0.01)$3: 0.01, taxon3$2: 0.02)$3: 0.01, taxon4$2: 0.03); specifies three rates, one for the branches leading to taxon1 and taxon2 ($1), one for the branches leading to taxon3 and taxon4 ($2), and one for the remaining branches ($3). The modified tree files are no longer in the conventional Newick format, we have developed a function read.tree2 in DiscML to read such modified tree files.
DiscML allows binary (1s/0s) analysis on data with more than two character states by converting all non-zero characters to 1s with simplify=TRUE.
Results and discussion
Computational time on an Intel Core i7 (3.4 Ghz) 16 GB RAM Dell desktop to generate the results in Table 1
ER (1s/0s only)
0 m 49 s
1 m 00 s
1 m 26 s
ER+0 (1s/0s only)
1 m 39 s
2 m 01 s
3 m 03 s
0 m 48 s
1 m 06 s
1 m 36 s
1 m 58 s
2 m 20 s
3 m 01 s
7 m 54 s
6 m 58 s
8 m 28 s
1 m 04 s
1 m 15 s
1 m 17 s
3 m 14 s
4 m 47 s
5 m 31 s
9 m 53 s
9 m 12 s
16 m 59 s
9 m 04 s
9 m 54 s
11 m 44 s
1 m 36 s
2 m 34 s
2 m 21 s
2 m 41 s
3 m 13 s
4 m 40 s
12 m 00 s
39 m 01 s
45 m 23 s
ER+0+ π + Γ
82 m 22 s
81 m 20 s
178 m 27 s
ER+0+ πREV + Γ
80 m 13 s
67 m 33 s
91 m 42 s
Separate rates on branches estimated from the gene family data in the Bacillaceae (B1, B2, B3) clades
We illustrated the versatility of DiscML on different types of data and analyses. With a great flexibility and fast computational speed, we are confident that DiscML can be used in a variety of studies on different discrete characters.
Availability and requirements
Project name: DiscMLProject home page:http://cran.r-project.org/web/packages/DiscML/index.htmlOperating system(s): Platform independent.Programming language: R.Other requirements: R (2.14 or newer); R-package: ape from CRAN.License: GNU GPL
The authors would like to thank two anonymous reviewers for their helpful comments, Dr. Edward Golenberg for critical reading of the manuscript, Dr. Brian Golding for suggestions during the development of DiscML, Dr. Baojun Wu for assistance in collecting yeast mitochondrial data used in Figures 2 and 3. The work was supported by funds from Wayne State University to WH.
- Lewis PO: A likelihood approach to estimating phylogeny from discrete morphological character data. Syst Biol. 2001, 50: 913-915. 10.1080/106351501753462876.View ArticlePubMedGoogle Scholar
- Csűrös M: Likely scenarios of intron evolution. Comparative Genomics. Lecture Notes in Computer Science. Edited by: McLysaght A, Huson DH. 2005, Berlin: Springer, 47-60.Google Scholar
- Hao W, Golding GB: The fate of laterally transferred genes: life in the fast lane to adaptation or death. Genome Res. 2006, 16: 636-643. 10.1101/gr.4746406.View ArticlePubMed CentralPubMedGoogle Scholar
- Hahn MW, De Bie T, Stajich JE, Nguyen C, Cristianini N: Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Res. 2005, 15: 1153-1160. 10.1101/gr.3567505.View ArticlePubMed CentralPubMedGoogle Scholar
- van Passel MW, Nijveen H, Wahl LM: Birth, death, and diversification of mobile promoters in prokaryotes. Genetics. 2014, 197: 291-299. 10.1534/genetics.114.162883.View ArticlePubMed CentralPubMedGoogle Scholar
- Schmitz RJ, Schultz MD, Urich MA, Nery JR, Pelizzola M, Libiger O, Alix A, McCosh RB, Chen H, Schork NJ, Ecker JR: Patterns of population epigenomic diversity. Nature. 2013, 495: 193-198. 10.1038/nature11968.View ArticlePubMed CentralPubMedGoogle Scholar
- Paradis E, Claude J, Strimmer K: APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics. 2004, 20: 289-290. 10.1093/bioinformatics/btg412.View ArticlePubMedGoogle Scholar
- Pagel M, Meade A, Barker D: Bayesian estimation of ancestral character states on phylogenies. Syst Biol. 2004, 53: 673-674. 10.1080/10635150490522232.View ArticlePubMedGoogle Scholar
- Maddison WP, Maddison DR: Mesquite: a modular system for evolutionary analysis. 2011, Version 2.75, [http://mesquiteproject.org],Google Scholar
- Cohen O, Ashkenazy H, Belinky F, Huchon D, Pupko T: GLOOME: gain loss mapping engine. Bioinformatics. 2010, 26: 2914-2915. 10.1093/bioinformatics/btq549.View ArticlePubMedGoogle Scholar
- Csűrös M: Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood. Bioinformatics. 2010, 26: 1910-1912. 10.1093/bioinformatics/btq315.View ArticlePubMedGoogle Scholar
- Liu L, Yu L, Kalavacharla V, Liu Z: A Bayesian model for gene family evolution. BMC Bioinformatics. 2011, 12: 426-10.1186/1471-2105-12-426.View ArticlePubMed CentralPubMedGoogle Scholar
- Librado P, Vieira FG, Rozas J: BadiRate: estimating family turnover rates by likelihood-based methods. Bioinformatics. 2012, 28: 279-281. 10.1093/bioinformatics/btr623.View ArticlePubMedGoogle Scholar
- Han MV, Thomas GW, Lugo-Martinez J, Hahn MW: Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Mol Biol Evol. 2013, 30: 1987-1987. 10.1093/molbev/mst100.View ArticlePubMedGoogle Scholar
- Cohen O, Rubinstein ND, Stern A, Gophna U, Pupko T: A likelihood framework to analyse phyletic patterns. Philos Trans R Soc Lond B Biol Sci. 2008, 363: 3903-3911. 10.1098/rstb.2008.0177.View ArticlePubMed CentralPubMedGoogle Scholar
- Hibbett DS: Trends in morphological evolution in homobasidiomycetes inferred using maximum likelihood: a comparison of binary and multistate approaches. Syst Biol. 2004, 53: 889-903. 10.1080/10635150490522610.View ArticlePubMedGoogle Scholar
- Gay DM: Usage Summary for Selected Optimization Routines. Computing science technical report 153. 1990, Murray Hill: AT&T Bell LaboratoriesGoogle Scholar
- Yang Z: Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 1994, 39: 306-314. 10.1007/BF00160154.View ArticlePubMedGoogle Scholar
- Yap VB, Speed T: Rooting a phylogenetic tree with nonreversible substitution models. BMC Evol Biol. 2005, 5: 2-10.1186/1471-2148-5-2.View ArticlePubMed CentralPubMedGoogle Scholar
- Hao W, Golding GB: Inferring bacterial genome flux while considering truncated genes. Genetics. 2010, 186: 411-426. 10.1534/genetics.110.118448.View ArticlePubMed CentralPubMedGoogle Scholar
- Waddell PJ, Steel MA: General time-reversible distances with unequal rates across sites mixing gamma and inverse Gaussian distributions with invariant sites. Mol Phylogenet Evol. 1997, 8: 398-414. 10.1006/mpev.1997.0452.View ArticlePubMedGoogle Scholar
- Felsenstein J: Inferring Phylogenies. 2004, Sunderland: Sinauer Associates, Inc.Google Scholar
- Boussau B, Gouy M: Efficient likelihood computations with nonreversible models of evolution. Syst Biol. 2006, 55: 756-758. 10.1080/10635150600975218.View ArticlePubMedGoogle Scholar
- Felsenstein J: Phylogenies from restriction sites: a maximum- likelihood approach. Evolution. 1992, 46: 159-173. 10.2307/2409811.View ArticleGoogle Scholar
- Nelder JA, Mead R: A simplex method for function minimization. Comput J. 1965, 7: 308-313. 10.1093/comjnl/7.4.308.View ArticleGoogle Scholar
- Hao W, Golding GB: Uncovering rate variation of lateral gene transfer during bacterial genome evolution. BMC Genomics. 2008, 9: 235-10.1186/1471-2164-9-235.View ArticlePubMed CentralPubMedGoogle Scholar
- Lopez P, Casane D, Philippe H: Heterotachy, an important process of protein evolution. Mol Biol Evol. 2002, 19: 1-7. 10.1093/oxfordjournals.molbev.a003973.View ArticlePubMedGoogle Scholar
- Wang HC, Spencer M, Susko E, Roger AJ: Testing for covarion-like evolution in protein sequences. Mol Biol Evol. 2007, 24: 294-305.View ArticlePubMedGoogle Scholar
- Spencer M, Sangaralingam A: A phylogenetic mixture model for gene family loss in parasitic bacteria. Mol Biol Evol. 2009, 26: 1901-1908. 10.1093/molbev/msp102.View ArticlePubMedGoogle Scholar
- Hao W, Golding GB: Patterns of bacterial gene movement. Mol Biol Evol. 2004, 21: 1294-1307. 10.1093/molbev/msh129.View ArticlePubMedGoogle Scholar
- Wu B, Hao W: Horizontal transfer and gene conversion as an important driving force in shaping the landscape of mitochondrial introns. G3 (Bethesda). 2014, 4: 605-612. 2014.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License(http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.