The package GenNon-h presented here simulates DNA sequences evolving on a given phylogenetic tree directly from the transition matrices. In other words, the goal of developing GenNon-h was to generate DNA sequences following discrete-time Markov models on a given phylogenetic tree (with assigned branch lengths measured as the expected number of substitutions per site). Given a phylogenetic tree T, the parameters of these models are the root distribution and the substitution matrices P
e
assigned to each edge e of T. The entries of P
e
correspond to the conditional probabilities P(x|y e) that a nucleotide y at the parent node of e is substituted by nucleotide x at the child node (we refer to[1, chapter 8] for an introduction to these type of Markov processes). There is no extra restriction on these matrices, in contrast to continuous-time processes, where the substitution matrices are the exponential of a rate matrix Q multiplied by the number of events. Whereas generating matrices of type exp(tQ) with a preassigned number of substitutions (branch length) is relatively easy, the task of generating (discrete-time) Markov matrices with a preassigned expected number of substitutions per site has been recently solved in[2]. The software GenNon-h is implemented for the (discrete-time) Jukes-Cantor JC69, Kimura two K80 and three K81 parameters, strand symmetric SSM, and general Markov models GMM.
The shape of the substitution matrices (in all cases sum of rows is equal to 1 and the entries are nonnegative) for JC69, K80, K81, SSM and GMM models is given by:
For these models, the algorithms given in[2] can generate any stochastic matrix corresponding to a given branch length (except for the general Markov model, where the algorithms cover a dense set of matrices that is not proved to be complete). We refer the reader to[2] for more details and references of these models.
In the continuous-time models, the process is often assumed to be homogeneous (with the same rates of substitutions in all lineages) and the rate matrix Q is common throughout the entire tree. This is not an implicit hypothesis for the discrete-time Markov processes, thus the evolutionary processes simulated using GenNon-h are nonhomogeneous. Although a variety of methods exist for simulating alignments under the continuous-time models (see for example Rose[3], Seq-Gen[4], and Evolver in PAML[5]), we provide the first software for generating multiple sequence alignments evolving under the discrete-time Markov processes on trees.
Other powerful software include the Bayesian phylogenetic methods of[6] and[7]. MRBAYES[6] is a robust inference and model selection method, which provides a variety of tools for nucleotide and amino-acid data analysis. BEAST[7] enables to generate MSA with rate matrices differing at distinct edges. However, both packages are based on the rate-based continuous-time models for nucleotide data and as such, assume the exponential form of the substitution matrices. This in turn restricts the space of possible matrices assigned to a given edge length. GenNon-h, which aims to explore the space of all matrices, mimics the stochastic character of evolution as attempted to be captured by mathematical models. GenNon-h is the first software tailored specifically for the purpose of generating the DNA sequence alignments evolving on phylogenetic trees under the nonhomogeneous models. GenNon-h does not aim to replace the existing methods, but to serve as an option for researchers, whose interest lies in testing the performance of the algorithms on data generated under the assumption of more general models. A newly created software Empar (http://genome.crg.es/cgi-bin/phylo_mod_sel/Empar.pl,[8]), enables to infer the parameters of the model considered in this work.
It is worth pointing out that the strand symmetric model and the general Markov model considered in GenNon-h do not lie in the class of stationary models, which adds to the flexibility of the framework presented here. As a comparison, we note that the software packages that are prevalently in use consider only stationary and time-reversible models. This is due to the fact that the continuous-time models require computing exponentials of rate matrices, which is a nontrivial task for a general matrix. This contributes to the practice of using the time-reversible rate-based models irrespective of the setting under consideration.
As shown in[9], the substitution parameters for the GMM model (and thus for all its submodels), are identifiable up to permutation. In GenNon-h we fabricate matrices of the Diagonal Largest in Column (DLC) type[10], whenever possible, i.e. matrices whose largest entry in every column is placed on the diagonal. DLC matrices share an important feature of being identifiable– there exists a unique set of substitution matrices satisfying the DLC condition and a unique root distribution that leads to the given joint distribution at the leaves. In other words, the data generated from the DLC matrices and sufficient alignment lengths have high likelihood of being identifiable and therefore can be safely used to test hypotheses about the tree or the data.