Recodon: Coalescent simulation of coding DNA sequences with recombination, migration and demography
© Arenas and Posada; licensee BioMed Central Ltd. 2007
Received: 02 August 2007
Accepted: 20 November 2007
Published: 20 November 2007
Coalescent simulations have proven very useful in many population genetics studies. In order to arrive at meaningful conclusions, it is important that these simulations resemble the process of molecular evolution as closely as possible. To date, no single coalescent program is able to simulate codon sequences sampled from populations with recombination, migration and growth.
We introduce a new coalescent program, called Recodon, which is able to simulate samples of coding DNA sequences under complex scenarios in which several evolutionary forces can interact simultaneously (namely, recombination, migration and demography). The basic codon model implemented is an extension to the general time-reversible model of nucleotide substitution with a proportion of invariable sites and among-site rate variation. In addition, the program implements non-reversible processes and mixtures of different codon models.
Recodon is a flexible tool for the simulation of coding DNA sequences under realistic evolutionary models. These simulations can be used to build parameter distributions for testing evolutionary hypotheses using experimental data. Recodon is written in C, can run in parallel, and is freely available from http://darwin.uvigo.es/.
Coalescent theory provides a very powerful framework for the simulation of samples of DNA sequences. Coalescent simulations can be very useful to understand the statistical properties of these samples under different evolutionary scenarios, to evaluate and compare different analytical methods, to estimate population parameters and for hypothesis testing. Not surprisingly, several simulation programs have recently been developed under this framework [6–12]. In order to obtain meaningful biological inferences from simulated data, it is important that the generating models are as realistic as possible. However, increasing model complexity usually results in longer computing times, and most programs focus on a restricted set of biological scenarios. Currently, we lack a tool for the simulation of samples of coding sequences that have evolved in structured populations with recombination and fluctuating size, as is typical, for example, of fast-evolving pathogens and MHC genes [13, 14]. Here, we introduce a new simulation program, called Recodon, to fill this gap.
The simulation of data in Recodon is accomplished in two main steps. First, the genealogy of the sample is simulated under the coalescent framework with recombination, migration and demographics. Second, codon sequences are evolved along this genealogy according to a nucleotide or codon substitution model.
Simulation of genealogies
For each replicate, genealogies are simulated according to the coalescent under a neutral Wright-Fisher model [15, 16]. Waiting times to a coalescence, recombination or migration event are exponentially distributed, and depend on the number of lineages, the effective population size (N), and the recombination, migration and growth rates. Time is scaled in units of 2N generations. Recombination occurs with the same probability between different sites (either nucleotides or codons). A finite island model [16, 17] is assumed, where migration takes place at a constant rate between different demes. Multiple demographic periods can be specified, each one with its own initial and final effective population size, and length (number of generations). Positive or negative exponential growth is assumed.
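The competing exponential waiting times described above can be illustrated with a minimal Python sketch. It is not Recodon's code: the function name `next_event`, the per-lineage scaled rates `rho` and `mig`, the constant population size, and the pairwise coalescence rate of 1 per unit of 2N generations are all assumptions made for illustration.

```python
import random

def next_event(lineages_per_deme, rho, mig, rng=random):
    """Draw (waiting_time, event_type) for the next event, going backward in time.

    Assumptions (illustrative, not Recodon's documented parameterization):
    time is in units of 2N generations, population size is constant,
    each pair of lineages in the same deme coalesces at rate 1,
    and rho and mig are per-lineage scaled recombination and migration rates.
    """
    n_total = sum(lineages_per_deme)
    rates = {
        # Only lineages in the same deme can coalesce.
        "coalescence": sum(k * (k - 1) / 2.0 for k in lineages_per_deme),
        "recombination": rho * n_total,
        # Migration needs at least two demes.
        "migration": mig * n_total if len(lineages_per_deme) > 1 else 0.0,
    }
    total = sum(rates.values())
    if total <= 0.0:
        raise ValueError("no event can occur with these rates")
    # The minimum of independent exponential times is exponential
    # with the summed rate.
    wait = rng.expovariate(total)
    # The event type is chosen with probability proportional to its rate.
    kind = rng.choices(list(rates), weights=list(rates.values()))[0]
    return wait, kind
```

A full simulator would repeat this draw, update the lineages after each event, and stop when every site has reached its most recent common ancestor. With exponential growth the coalescence rate changes over time, which this constant-size sketch does not cover.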
Simulation of nucleotide and codon sequences
Table 1. Key arguments for Recodon. The user can specify several parameters to implement different simulation scenarios. These arguments can be entered in the command line or read from a text file.
Number of replicates
Number of sites (bp or codons)
Effective population size
Exponential growth rate
2.1 × 10⁻⁵
1000 5000 200
5 × 10⁻⁶
1.2 × 10⁻⁴
Number of demes
5.1 × 10⁻⁴
0.4 0.3 0.1 0.2
Relative substitution rates
1.0 2.3 2.1 3.0 4.2 1.0
Nonsynonymous/synonymous rate ratio
Rate variation among sites
Proportion of invariable sites
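To make the role of the relative substitution rates and the nonsynonymous/synonymous rate ratio concrete, the following Python sketch builds a codon rate matrix from GTR-style exchangeabilities. It is only one plausible construction, not Recodon's implementation: the function name `codon_rate_matrix`, the dictionary layout, and the use of the target nucleotide frequency are assumptions, and among-site rate variation and invariable sites are left out.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c in CODE if CODE[c] != "*"]  # the 61 sense codons

def codon_rate_matrix(freqs, rates, omega):
    """Return a 61 x 61 instantaneous rate matrix as a list of lists.

    freqs: dict mapping each base to its frequency.
    rates: dict mapping frozenset({x, y}) to the relative rate for x <-> y.
    omega: nonsynonymous/synonymous rate ratio.
    """
    n = len(SENSE)
    q = [[0.0] * n for _ in range(n)]
    for i, ci in enumerate(SENSE):
        for j, cj in enumerate(SENSE):
            diffs = [p for p in range(3) if ci[p] != cj[p]]
            if len(diffs) != 1:
                continue  # only single-nucleotide changes are allowed
            x, y = ci[diffs[0]], cj[diffs[0]]
            rate = rates[frozenset((x, y))] * freqs[y]
            if CODE[ci] != CODE[cj]:
                rate *= omega  # nonsynonymous change
            q[i][j] = rate
        # The diagonal is still zero here, so this sums the off-diagonals.
        q[i][i] = -sum(q[i])
    return q
```

Changes to or from stop codons are excluded by restricting the state space to sense codons. In practice the matrix would also be rescaled so that branch lengths are measured in expected substitutions per site.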
The input of the program consists of a series of arguments that can be entered in the command line or, more conveniently, specified in a text file (Table 1). These arguments fully parameterize the simulations, and control the amount of information that is sent to the console or output files.
The principal output of the program is a set of sampled aligned nucleotide or codon sequences in sequential Phylip format. Additional information that can be saved to different files includes the genealogies, divergence times, breakpoint positions, or the ancestral sequences. Replicates can be filtered out depending on the number of recombination events, and an independent outgroup sequence can also be evolved. At the end of the simulations, a summary of the different events is printed to the console.
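For readers unfamiliar with the sequential Phylip format mentioned above, the sketch below writes an alignment in its strict form: a header with the number of sequences and sites, then one line per sequence with the name padded to 10 characters. The function name `to_phylip_sequential` is hypothetical, and Recodon's own output may differ in details such as spacing.

```python
def to_phylip_sequential(seqs):
    """Format a dict {name: sequence} as sequential PHYLIP text.

    Names are truncated or padded to exactly 10 characters, as required
    by the strict format. All sequences must have the same length.
    """
    if not seqs:
        raise ValueError("no sequences given")
    names = list(seqs)
    length = len(seqs[names[0]])
    if any(len(s) != length for s in seqs.values()):
        raise ValueError("sequences are not aligned to the same length")
    lines = [f" {len(names)} {length}"]
    for name in names:
        lines.append(f"{name[:10]:<10}{seqs[name]}")
    return "\n".join(lines) + "\n"
```

For example, `to_phylip_sequential({"seq1": "ATGAAA", "seq2": "ATGAAG"})` yields a header line ` 2 6` followed by two lines, each starting with a 10-character name field.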
Results and Discussion
We have developed a new program, called Recodon, for the simulation of coding DNA sequences. The program can run in parallel over multiple processors using the MPI libraries. The models implemented imitate the simultaneous action of several evolutionary processes, like recombination, migration, non-constant population size or selection at the molecular level. Understanding the joint effects of these processes is important in order to obtain more realistic estimates of population genetic parameters from real data [3, 20–22].
Recodon has been validated in several ways. The output of the program was compared with the theoretical expectations for the means and variances of different quantities, such as the number of recombination and migration events, or the times to the most recent common ancestor. In addition, results obtained with Recodon were in agreement with those obtained with other programs under different evolutionary scenarios. Finally, substitution and codon model parameters were estimated from the simulated data using
Coalescent simulations like those implemented in Recodon can be used to generate numerical expectations for different parameters under complex evolutionary scenarios, in which different processes act simultaneously. This can be very important for understanding the interaction of different parameters, which enormously complicates their estimation. Indeed, realistic simulation models are essential to evaluate different methods and strategies for estimating parameters and testing hypotheses from real data.
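As a small illustration of comparing simulated values with theoretical expectations, the Python sketch below simulates the time to the most recent common ancestor (TMRCA) in the simplest case: a single population of constant size, without recombination. The function names are hypothetical, and the sketch assumes that each pair of lineages coalesces at rate 1 per unit of 2N generations, in which case the mean TMRCA for a sample of n is 2(1 - 1/n).

```python
import random

def simulate_tmrca(n, rng=random):
    """Simulate the TMRCA of n lineages (no recombination, constant size)."""
    t = 0.0
    for k in range(n, 1, -1):
        # With k lineages, the next coalescence occurs at rate k(k-1)/2.
        t += rng.expovariate(k * (k - 1) / 2.0)
    return t

def expected_tmrca(n):
    """Theoretical mean: sum over k of 2/(k(k-1)) = 2(1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n)

def variance_tmrca(n):
    """Theoretical variance: sum over k of (2/(k(k-1)))^2."""
    return sum((2.0 / (k * (k - 1))) ** 2 for k in range(2, n + 1))
```

Averaging `simulate_tmrca(n)` over many replicates should approach `expected_tmrca(n)`, and the sample variance should approach `variance_tmrca(n)`. Scenarios with recombination, migration or growth need different expectations.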
One potential application of Recodon could be the study of fast-evolving pathogens like HIV-1, which show high recombination and adaptation rates for coding genes. For example, we could use this program to understand whether intrapatient genetic diversity for the env gene should increase with decreasing migration rates. Then we could test whether the number and diversity of env haplotypes sampled from a patient, all other conditions being equal, resemble the simulated cases with (or without) compartmentalization. Simulated data can also be used to obtain numerical estimates of population genetic parameters using approximate Bayesian computation [4, 27–30]. Estimation by simulation can be especially useful in situations where the likelihood for a model is not known, or is computationally prohibitive to evaluate, which is often the case under complex biological scenarios.
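The idea of estimation by simulation can be sketched with a toy rejection version of approximate Bayesian computation. A very simple simulator stands in here for a full program such as Recodon: it draws the number of segregating sites in a sample of n sequences from a constant-size population without recombination. All names, the uniform prior and the tolerance are illustrative assumptions.

```python
import random

def poisson(lam, rng=random):
    """Poisson draw: count unit-rate arrivals in the interval [0, lam]."""
    count, acc = 0, rng.expovariate(1.0)
    while acc < lam:
        count += 1
        acc += rng.expovariate(1.0)
    return count

def simulate_segsites(n, theta, rng=random):
    """Segregating sites for n sequences; theta = 4N * mutation rate."""
    # Total tree length in units of 2N generations.
    length = sum(k * rng.expovariate(k * (k - 1) / 2.0)
                 for k in range(n, 1, -1))
    return poisson(theta / 2.0 * length, rng)

def abc_rejection(s_obs, n, prior_max, n_draws, tol, rng=random):
    """Keep prior draws of theta whose simulated summary is close to s_obs."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, prior_max)  # draw from a uniform prior
        if abs(simulate_segsites(n, theta, rng) - s_obs) <= tol:
            accepted.append(theta)
    return accepted
```

The accepted values approximate the posterior distribution of theta given the observed summary statistic. With a richer simulator, the same loop would compare several summary statistics computed from the simulated alignments.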
In the future we plan to relax some of the current assumptions, like a homogeneous recombination rate.
Recodon is a versatile program for the simulation of codon alignments under complex population models. This program fills a gap in the current array of coalescent programs for the simulation of DNA sequences, as no single program is able to simulate codon sequences sampled from populations with recombination, migration and growth. Data simulated with this program can be used to study both theoretical and empirical properties of DNA samples under biologically realistic scenarios.
Availability and requirements
Recodon is written in ANSI C, and it has been compiled without problems on Mac OS X, Linux Debian and Windows. It can run in parallel using the MPI libraries on architectures with several processors. The program is freely available at http://darwin.uvigo.es/, including executables, source code and documentation. The program is distributed under the GNU GPL.
This work was partially supported by grant BFU2004-02700 (MCyT) to DP and by the FPI fellowship BES-2005-9151 (MEC) to MA from the Spanish government. Several functions were taken from code provided by R. Nielsen and Z. Yang. We want to thank J. Carlos Mouriño at the Supercomputing Center of Galicia (CESGA) for extensive help with code parallelization.
1. Kingman JFC: The coalescent. Stochastic Processes and their Applications 1982, 13: 235–248. doi:10.1016/0304-4149(82)90011-4
2. Innan H, Zhang K, Marjoram P, Tavaré S, Rosenberg NA: Statistical tests of the coalescent model based on the haplotype frequency distribution and the number of segregating sites. Genetics 2005, 169(3):1763–1777. doi:10.1534/genetics.104.032219
3. Carvajal-Rodriguez A, Crandall KA, Posada D: Recombination Estimation Under Complex Evolutionary Models with the Coalescent Composite-Likelihood Method. Mol Biol Evol 2006, 23(4):817–827. doi:10.1093/molbev/msj102
4. Beaumont MA, Zhang W, Balding DJ: Approximate Bayesian computation in population genetics. Genetics 2002, 162(4):2025–2035.
5. DeChaine EG, Martin AP: Using coalescent simulations to test the impact of quaternary climate cycles on divergence in an alpine plant-insect association. Evolution Int J Org Evolution 2006, 60(5):1004–1013.
6. Excoffier L, Novembre J, Schneider S: SIMCOAL: a general coalescent program for the simulation of molecular data in interconnected populations with arbitrary demography. J Hered 2000, 91: 506–509. doi:10.1093/jhered/91.6.506
7. Spencer CC, Coop G: SelSim: a program to simulate population genetic data with natural selection and recombination. Bioinformatics 2004, 20(18):3673–3675. doi:10.1093/bioinformatics/bth417
8. Mailund T, Schierup MH, Pedersen CN, Mechlenborg PJ, Madsen JN, Schauser L: CoaSim: a flexible environment for simulating genetic data under coalescent models. BMC Bioinformatics 2005, 6: 252. doi:10.1186/1471-2105-6-252
9. Marjoram P, Wall JD: Fast "coalescent" simulation. BMC Genet 2006, 7: 16. doi:10.1186/1471-2156-7-16
10. Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 2002, 18(2):337–338. doi:10.1093/bioinformatics/18.2.337
11. Hellenthal G, Stephens M: msHOT: modifying Hudson's ms simulator to incorporate crossover and gene conversion hotspots. Bioinformatics 2007, 23(4):520–521. doi:10.1093/bioinformatics/btl622
12. Posada D, Wiuf C: Simulating haplotype blocks in the human genome. Bioinformatics 2003, 19(2):289–290. doi:10.1093/bioinformatics/19.2.289
13. Edwards SV, Hedrick PW: Evolution and ecology of MHC molecules: from genomics to sexual selection. Trends in Ecology and Evolution 1998, 13(8):305–311. doi:10.1016/S0169-5347(98)01416-5
14. Awadalla P: The evolutionary genomics of pathogen recombination. Nat Rev Genet 2003, 4(1):50–60. doi:10.1038/nrg964
15. Fisher RA: The Genetical Theory of Natural Selection. Oxford: Oxford University Press; 1930.
16. Wright S: Evolution in Mendelian populations. Genetics 1931, 16: 97–159.
17. Hudson RR: Island models and the coalescent process. Mol Ecol 1998, 7: 413–418. doi:10.1046/j.1365-294x.1998.00344.x
18. Tavaré S: Some probabilistic and statistical problems in the analysis of DNA sequences. In Some mathematical questions in biology – DNA sequence analysis. Volume 17. Edited by: Miura RM. Providence, RI: Amer. Math. Soc; 1986:57–86.
19. Goldman N, Yang Z: A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 1994, 11(5):725–736.
20. Anisimova M, Nielsen R, Yang Z: Effect of Recombination on the Accuracy of the Likelihood Method for Detecting Positive Selection at Amino Acid Sites. Genetics 2003, 164(3):1229–1236.
21. Shriner D, Nickle DC, Jensen MA, Mullins JI: Potential impact of recombination on sitewise approaches for detecting positive natural selection. Genet Res 2003, 81: 115–121. doi:10.1017/S0016672303006128
22. Posada D: Evaluation of methods for detecting recombination from DNA sequences: empirical data. Mol Biol Evol 2002, 19(5):708–717.
23. Hudson RR: Gene genealogies and the coalescent process. Oxf Surv Evol Biol 1990, 7: 1–44.
24. Kosakovsky Pond SL, Frost SD, Muse SV: HYPHY: Hypothesis testing using phylogenies. Bioinformatics 2005, 21: 676–679. doi:10.1093/bioinformatics/bti079
25. Swofford DL: PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods). 4th edition. Sunderland, Massachusetts: Sinauer Associates; 2000.
26. Rambaut A, Posada D, Crandall KA, Holmes EC: The causes and consequences of HIV evolution. Nature Reviews Genetics 2004, 5: 52–61. doi:10.1038/nrg1246
27. Excoffier L, Estoup A, Cornuet JM: Bayesian analysis of an admixture model with mutations and arbitrarily linked markers. Genetics 2005, 169(3):1727–1738. doi:10.1534/genetics.104.036236
28. Tanaka MM, Francis AR, Luciani F, Sisson SA: Using approximate Bayesian computation to estimate tuberculosis transmission parameters from genotype data. Genetics 2006, 173(3):1511–1520. doi:10.1534/genetics.106.055574
29. Tallmon DA, Luikart G, Beaumont MA: Comparative evaluation of a new effective population size estimator based on approximate Bayesian computation. Genetics 2004, 167(2):977–988. doi:10.1534/genetics.103.026146
30. Shriner D, Liu Y, Nickle DC, Mullins JI: Evolution of intrahost HIV-1 genetic diversity during chronic infection. Evolution Int J Org Evolution 2006, 60(6):1165–1176.
31. Wiuf C, Posada D: A coalescent model of recombination hotspots. Genetics 2003, 164(1):407–417.
32. Nei M, Gojobori T: Simple method for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 1986, 3(5):418–426.
33. Korber B: HIV Signature and Sequence Variation Analysis. In Computational Analysis of HIV Molecular Sequences. Edited by: Rodrigo AG, Learn GH. Dordrecht, Netherlands: Kluwer Academic Publishers; 2000:55–72.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.