# MixtureTree: a program for constructing phylogeny

- Shu-Chuan Chen
^{1}Email author, - Michael S Rosenberg
^{2}and - Bruce G Lindsay
^{3}

**12**:111

**DOI: **10.1186/1471-2105-12-111

© Chen et al; licensee BioMed Central Ltd. 2011

**Received: **1 September 2010

**Accepted: **21 April 2011

**Published: **21 April 2011

## Abstract

### Background

MixtureTree v1.0 is a Linux based program (written in C++) which implements an algorithm based on mixture models for reconstructing phylogeny from binary sequence data, such as single-nucleotide polymorphisms (SNPs). In addition to the mixture algorithm with three different optimization options, the program also implements a bootstrap procedure with majority-rule consensus.

### Results

The MixtureTree program written in C++ is a Linux based package. The User's Guide and source codes will be available at http://math.asu.edu/~scchen/MixtureTree.html

### Conclusions

The efficiency of the mixture algorithm is relatively higher than some classical methods, such as Neighbor-Joining method, Maximum Parsimony method and Maximum Likelihood method. The shortcoming of the mixture tree algorithms, for example timing consuming, can be improved by implementing other revised Expectation-Maximization(EM) algorithms instead of the traditional EM algorithm.

## Background

Methods for constructing a phylogeny given a set of the DNA sequences is always a popular topic in both biological and statistical research. Many classical methods are popular, such as Neighbor-Joining (NJ) method, Maximum Parsimony (MP) method, Maximum Likelihood (ML) method, and Bayesian (MCMC) approaches ([1] and [2]). There are also many programs which implement these methods, including PHYLIP [3], PAUP [4], and MEGA [5]. Chen and Lindsay introduced a mixture likelihood algorithm as a novel and natural way to deal with such problems because the distribution of offsprings' sequence is *a mixture* of parental distributions [6]. Unlike the classic methods, this approach uses the frequencies of each sequence within the population to help reconstructing the phylogeny from binary sequences. The model proposed by Chen and Lindsay [6] was done mathematically, as a first step, and that it should not be particularly problematic with most SNPs are bi-allelic with only a very small proportion (well less than 1% and probably less than 0.1%) tri-allelic or quad-allelic ([7] and [8]). The algorithm uses a K-component bernoulli mixture to model binary sequences. It is well-known that when the parameter takes values in an infinite dimensional space, the maximum likelihood estimation often fails. To overcome the above issue, we can first maximize over a constrained subspace of the parameter space then relax the constraint as the sample size grows. In this case, the maximum likelihood estimation will then work. The parameter used to create a constrained subspace is called a sieve parameter [9]. In our case, the sieve parameter is the mutation rate *p* which is considered as a function of time in the mixture model. By varying *p* from 0 to 0.5, the mixture algorithm can give an estimate of the recent common ancestors of the given sequences. In order to obtain the mixture tree of the observed sequences, the Expectation-Maximization (EM) algorithm is employed. To overcome the small weights (*π*) problem in the regular EM, the nature way comes up is that we do not update the weights *π*. Such an algorithm, we call it FixEM. The Modal EM is a nonparametric statistical approach to clustering via mode identification in the Bernoulli mixtures ([10] and [11]). The MixtureTree program contains the regular EM algorithm plus these two related optimization algorithms, Fixed EM (FixEM) algorithm and Modal EM (ModalEM) algorithm. Any and all can be chosen to estimate the ancestral sequences. We have found that the FixEM and ModalEM algorithms have better computational efficiency over the regular EM algorithm [11]. After constructing the phylogeny, it is common for researchers to carry out a nonparametric bootstrap analysis ([12, 13], and [14]) in which one creates bootstrap samples from the empirical distribution of sites from the original sequence data. The MixtureTree algorithm also implements a majority-rule consensus tree method from PHYLIP. This method is originally proposed by Margush and McMorris [15] and also allows one to easily find the strict consensus tree.

## Implementation

The input function in the MixtureTree program can read DNA sequences in a simple tabular format, in which all the sequences should be stacked in the form:

*Sequence Name* *Sequence* *Sequence' s Frequency*

The parameters setting can be changed in the parameter file. Three different optimization options can be chosen in the algorithm. The output function of the program writes the estimated mixture tree in the commonly-used Newick format which can be read, viewed, and manipulated by many other programs. Whether the EM algorithm converges can be checked in the output file *em.txt*. If chosen as an option, the bootstrap trees will also be output in Newick format. The summary of the bootstrap trees will be in a separate le. All of the output trees can be easily imported into other packages, such as the R package APE for further manipulations. Details can be found in the UserGuide at http://math.asu.edu/~scchen/MixtureTree.html.

## Results and Discussion

## Conclusions

From the comparison above, we can see that the efficiency of the mixture algorithm is relatively higher than the other three methods. However, there are also some shortcomings of the mixture algorithm, for example, it is more time-consuming to obtain the phylogeny than the other two methods. This shortcoming can be solved by implementing the Fixed EM or Modal EM instead of the traditional EM algorithm.

## Availability and requirements

The MixtureTree construction project and source codes can be found in the link http://math.asu.edu/~scchen/MixtureTree.html. It is Linux based program, written in C++, which implements an algorithm based on mixture models for reconstructing phylogeny from binary sequence data, such as single-nucleotide polymorphisms (SNPs). Any user uses the program needs to cite the MixtureTree website and the papers listed there.

## Declarations

### Acknowledgements and Funding

The authors thank referees for their valuable comments. SC and MR also thank their group members, Reynaldo Castro-Estrada, Mingze Li, and Ran Wang for helping with preparing the package and writing the manual of the package. This work is supported by National Science Foundation under the award number DMS 0714949 to Shu-Chuan Chen and Michael S Rosenberg and DMS 0714839 to Bruce G Lindsay. In addition, SC also acknowledge the support of National Center for Theoretical Sciences (South), Taiwan.

## Authors’ Affiliations

## References

- Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP: Bayesian inference of phylogeny and its impact on evolutionary biology. Science 2001, 294: 2310–2314. 10.1126/science.1065889View ArticlePubMedGoogle Scholar
- Ronquist F, Huelsenbeck JP: MRBAYES 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 2003, 19: 1572–1574. 10.1093/bioinformatics/btg180View ArticlePubMedGoogle Scholar
- Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 1989, 5: 164–166.Google Scholar
- Swofford DL: PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Sunderland: MA: Sinauer; 1998.Google Scholar
- Kumar S, Dudley J, Nei M, Tamura K: MEGA: A biologist-centric software for evolutionary analysis of DNA and protein sequences. Briefings in Bioinformatics 2008, 9: 299–306. 10.1093/bib/bbn017PubMed CentralView ArticlePubMedGoogle Scholar
- Chen SC, Lindsay B: Building mixture trees from binary sequence data. Biometrika 2006, 93(4):843–860. 10.1093/biomet/93.4.843View ArticleGoogle Scholar
- Ohtani T, Ikeda S, Lwin H, Arai T, Muramatsu M, Sawabe M: Polymorphisms of the formylpeptide receptor gene (FPR1) and susceptibility to stomach cancer in 1531 consecutive autopsy cases. Biochemical and Biophysical Research Communications, in press.Google Scholar
- Harismendy O, Bansal V, Bhatia G, Nakano M, Scott M, Wang X, Dib C, Turlotte E, Sipe JC, Murray SS, Deleuze JF, Bafna V, Topol J E, Frazer KA: Population sequencing of two endocannabinoid metabolic genes identifies rare and common regulatory variants associated with extreme obesity and metabolite level. Genome Biology 2010, 11(11):R118. 10.1186/gb-2010-11-11-r118PubMed CentralView ArticlePubMedGoogle Scholar
- Geman S, Hwang CR: Nonparametric maximum likelihood estimation by the method of sieves. The Annals of Statistics 1982, 10(2):401–414. 10.1214/aos/1176345782View ArticleGoogle Scholar
- Li J, Ray S, Lindsay B: A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research 2007, 8: 1687–1723.Google Scholar
- Chen SC, Li M, Rosenberg M, Lindsay B: Mixture tree construction and its applications. In Handbook of Computational Statistics: Statistical Bioinformatics. Edited by: Lu HS, Schölkopf B, Zhao H. Springer-Verlag; in press.Google Scholar
- Felsenstein J: Statistical inference of phylogenies (with Discussion). J R Statist Soc A 1983, 146: 246–272. 10.2307/2981654View ArticleGoogle Scholar
- Felsenstein J: Inferring Phylogenies. Boston: Sinauer; 2003.Google Scholar
- Holmes S: Bootstrapping phylogenetic trees: theory and methods. Statist Sci 2003, 18: 241–255. 10.1214/ss/1063994979View ArticleGoogle Scholar
- Margush T, McMorris FR: Note Consensus n-Trees. Bulletin of Mathematical Biology 1981, 43(2):239–244.Google Scholar
- Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 2002, 18: 337–338. 10.1093/bioinformatics/18.2.337View ArticlePubMedGoogle Scholar
- Robinson DF, Foulds LR: Comparison of Phylogenetic Trees. Mathematical Biosciences 1981, 53: 131–147. 10.1016/0025-5564(81)90043-2View ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.