mbs: modifying Hudson's ms software to generate samples of DNA sequences with a biallelic site under selection
© Teshima and Innan; licensee BioMed Central Ltd. 2009
Received: 19 December 2008
Accepted: 30 May 2009
Published: 30 May 2009
The pattern of single nucleotide polymorphisms, or SNPs, contains a tremendous amount of information with respect to the mechanisms of the micro-evolutionary process of a species. The inference of the roles of these mechanisms, including natural selection, relies heavily on computer simulations. A coalescent simulation is extremely powerful in generating a large number of samples of DNA sequences from a population (species) when all mutations are neutral, and Hudson's ms software is frequently used for this purpose.
However, it has been difficult to incorporate natural selection into the coalescent framework.
We herein present a software application to generate samples of DNA sequences when there is a biallelic site targeted by selection. This software application, referred to as mbs, is developed by modifying Hudson's ms. The mbs software is so flexible that it can incorporate any arbitrary histories of population size changes and any mode of selection as long as selection is operating on a biallelic site.
mbs provides opportunities to investigate the effect of any mode of selection on the pattern of SNPs under various demography.
The coalescent provides a very efficient simulation tool for generating DNA samples drawn from populations [1, 2]. Hudson's software, ms, is widely used in population genetics largely because of its flexibility . ms can generate patterns of DNA polymorphism under the infinite-site model with a complicated demographic history, given that all mutations are neutral. ms is frequently used for estimating demographic and mutational parameters (including point mutation and recombination rates) and for testing for natural selection.
However, provided that a number of genes in a genome are subject to selection, understanding how selection affects the pattern of DNA polymorphism is very important in population genetics (e.g., ). Incorporating selection into the coalescent has been a challenging problem, and one approach has been to consider a biallelic-structure with the original and derived allelic classes . The frequencies of the two allelic classes can change over time, and once their historical trajectory is given, the coalescent algorithm can trace the ancestral lineages of sampled chromosomes backward in time conditional on the trajectory. A simple application of this idea is to perform a selective sweep , in which the trajectory of a beneficial allele in its quick fixation process is given in a deterministic form.
In addition, it is possible to apply this idea to more complex modes of selection. The most important point is that the coalescent works as long as a trajectory of the two allelic classes is given. The trajectory can be obtained by any method, including theory and simulation. This flexibility allows us to incorporate any mode of selection together with the effect of random genetic drift at the biallelic site. The changes of the past population sizes can also be simultaneously considered. However, modification of the standard coalescent algorithm to incorporate these complexities is relatively difficult and feasible for only a limited number of specialists [6, 7].
Here, we provide a very user-friendly software application to generate a biallelic sample of DNA sequences (called mbs), which incorporates any change in the trajectory of allelic frequency and population size (Additional file 1). The software has inline commands and an output form similar to those of Hudson's ms . The allelic frequency trajectory and population size changes must be prepared (either theoretically or by simulation, or even arbitrarily) and stored in an input file before running mbs. This flexibility enables users to simulate patterns of DNA polymorphism in any situation, as long as selection works at a single biallelic site. The software can be widely used for advanced purposes, including simultaneous inferences of selection and demography.
The software application assumes the Wright-Fisher model in a finite population. Following Hudson's ms , the standard coalescent assumptions are used to simulate a random genealogical history of a recombining chromosome and to place random mutations on the chromosome. The basic neutral parameters, (i.e., population mutation and recombination rates) are given by per-site rates, 4N0μ and 4N0r, respectively, where N0 is the current population size and μ and r are the mutation and recombination rates, respectively, per bp per generation. Multiple mutations at a single site are allowed. In addition, we assume that there is a single biallelic site targeted by selection in the simulated region, which consists of a finite number of neutral sites. At the selected site, all chromosomes have either of the two states, 0 or 1, representing the original and derived allelic classes, respectively. The treatment of these two allelic classes is similar to that of two subpopulations. In brief, coalescent events are limited to chromosomes within the same allelic class. The time to the next coalescent event (backward in time) depends on the population size (or frequency) of the allelic class. A mutation at the selected site will change the allelic class. In addition, migration of a partial segment occurs through recombination between the two allelic classes.
Results and Discussion
To generate samples conditional on input files, mbs requires a command line similar to that of ms. An example is:
./mbs nsam -t 0.01 -r 0.01 -s 1000 250 -f 2 5 traj
This command generates a sample of chromosomes of size nsam, which will be randomly assigned to either the ancestral or derived allelic classes according to their current frequencies. Here, -t and -r specify the population mutation and recombination rates per bp, which are set to be 4N0 μ = 4N0r = 0.01 in this example. The integers following the '-s' switch represent the number of sites in the simulated region and the position of the selected site, respectively. The command line argument '-f 2 5 traj' means that the software performs five replications of the simulation for two independent histories of N and p (for a total of 10 replications), which are stored in separate files named "traj_0.dat" and "traj_1.dat". In general, a single run of mbs accepts a finite number (say, k) of input files named "traj_0.dat", "traj_1.dat", ⋯, "traj_k-1.dat".
The output of mbs is almost identical to that of Hudson's ms with a few modifications. For each replication, the simulated pattern of polymorphism is output as follows. The first line indicates that this is the result of the first replication with the first trajectory file ("traj_0.dat"). The allelic status of the selected site (a: ancestral, d: derived) for the sample is also given. The second line gives S, i.e., the number of neutral polymorphic sites, the positions of which are provided in the next line. The following nsam lines are for the polymorphism information of the sampled chromosomes. Each line consists of a string of 0 s and 1 s, representing the allelic status at the S neutral polymorphic sites listed above. The following is an example:
//0-1 allele: a a d d
positions: 76 205 213 240 251 335 370 506 599 698 749 948 984 997
where the sample size is nsam = 4, the first two samples of which are assigned to the ancestral allelic class and the final two samples of which are assigned to the derived allelic class.
There are other possible complications, which are briefly described below. Detailed documentation is available at the web site indicated below. We also provide a simple simulation program to generate trajectories of allele frequencies under typical modes of selection, including directional selection with arbitrary dominance (including selective sweep) and overdominant selection.
While an output consists of sequences with two allelic states, 0 and 1, with the default setting, an optional command allows the creation of sequences with four allelic states, 0, 1, 2, and 3, which represent the four nucleotides, A, T, G, and C.
Recombination rate heterogeneity
The default setting assumes a uniform distribution of recombination over the simulated region, but recombination hot spots or any kind of distribution of recombination rate can be incorporated. In this case, another input file is required.
We have presented a software application, mbs, to generate samples of DNA sequences when there is a biallelic site targeted by selection. mbs was developed by modifying commonly used Hudson's ms software, so that it has inline commands and an output form similar to those of ms. The mbs software is so flexible that it can incorporate any arbitrary histories of population size changes and any mode of selection. This provides opportunities to investigate the effect of any mode of selection on the pattern of SNPs under various demography.
Availability and requirements
Project name: mbs
Project home page: http://www.sendou.soken.ac.jp/esb/innan/InnanLab/software.html
Operating system: Platform independent
Programming language: C
Other requirements: none
Any restrictions on use by non-academics: none
The present study was supported by grants to HI from the Graduate University for Advanced Studies and from the Japan Society for the Promotion of Science.
- Hudson RR: Gene genealogies and the coalescent process. In Oxford Surveys in Evolutionary Biology. Volume 7. Edited by: Futuyma D, Antonovics J. New York: Oxford University Press; 1990:1–44.Google Scholar
- Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 2002, 18: 337–338. 10.1093/bioinformatics/18.2.337View ArticlePubMedGoogle Scholar
- Nielsen R, Hellmann I, Hubisz M, Bustamante C, Clark AG: Recent and ongoing selection in the human genome. Nat Rev Genet 2007, 8: 857–868. 10.1038/nrg2187PubMed CentralView ArticlePubMedGoogle Scholar
- Hudson RR, Kaplan NL: The coalescent process in models with selection and recombination. Genetics 1988, 120: 831–840.PubMed CentralPubMedGoogle Scholar
- Bravermann JM, Hudson RR, Kaplan NL, Langley CH, Stephan W: The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics 1995, 140: 783–796.Google Scholar
- Kim Y, Stephan W: Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 2002, 160: 765–777.PubMed CentralPubMedGoogle Scholar
- Teshima KM, Coop G, Przeworski M: How reliable are empirical genomic scans for selective sweeps? Genome Res 2006, 16: 702–712. 10.1101/gr.5105206PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.