 Methodology article
 Open Access
A new method for modeling coalescent processes with recombination
 Ying Wang†^{1},
 Ying Zhou†^{2},
 Linfeng Li†^{3},
 Xian Chen^{1},
 Yuting Liu^{4},
 ZhiMing Ma^{1}Email author and
 Shuhua Xu^{2}Email author
https://doi.org/10.1186/1471210515273
© Wang et al.; licensee BioMed Central Ltd. 2014
 Received: 7 March 2014
 Accepted: 17 July 2014
 Published: 11 August 2014
Abstract
Background
Recombination plays an important role in the maintenance of genetic diversity in many types of organisms, especially diploid eukaryotes. Recombination can be studied and used to map diseases. However, recombination adds a great deal of complexity to the genetic information. This renders estimation of evolutionary parameters more difficult. After the coalescent process was formulated, models capable of describing recombination using graphs, such as ancestral recombination graphs (ARG) were also developed. There are two typical models based on which to simulate ARG: backintime model such as ms and spatial model including Wiuf&Hein’s, SMC, SMC’, and MaCS.
Results
In this study, a new method of modeling coalescence with recombination, Spatial Coalescent simulator (SC), was developed, which considerably improved the algorithm described by Wiuf and Hein. The present algorithm constructs ARG spatially along the sequence, but it does not produce any redundant branches which are inevitable in Wiuf and Hein’s algorithm. Interestingly, the distribution of ARG generated by the present new algorithm is identical to that generated by a typical backintime model adopted by ms, an algorithm commonly used to model coalescence. It is here demonstrated that the existing approximate methods such as the sequentially Markov coalescent (SMC), a related method called SMC′, and Markovian coalescent simulator (MaCS) can be viewed as special cases of the present method. Using simulation analysis, the time to the most common ancestor (TMRCA) in the local trees of ARGs generated by the present algorithm was found to be closer to that produced by ms than time produced by MaCS. Sampleconsistent ARGs can be generated using the present method. This may significantly reduce the computational burden.
Conclusion
In summary, the present method and algorithm may facilitate the estimation and description of recombination in population genomics and evolutionary biology.
Keywords
 Recombination
 Coalescence
 Ancestral recombination graph (ARG)
 Sequentially markov coalescent (SMC)
Background
The genealogical relationship between sequences in a population is an important issue in recent analyses of the dynamics of sequence evolution at the population level. The genealogical relationship among a number of sampled sequences drawn from a particular generation of a large haploid population can be modeled using the Kingman’s coalescent process [1, 2]. This method has been successfully applied in haploid type data such as bacteria simulation, the estimation of population genetics parameters, and the inference of demographic events. However, the coalescent process involves no recombination and this cannot be ignored when studying diploid populations. For example, the histories of different loci in a genomic region may differ due to recombination events.
The first model of coalescence with recombination was described by Hudson [3]. This was shortly after Kingman’s coalescent process was formulated. Due to the increased complexity added by recombination, a graph rather than a single tree is needed to describe the genealogical relationship. This graph, called an ancestral recombination graph (ARG), is made up of many local coalescent trees [4]. An ARG can be considered a random graph. Each branch in the ARG represents a lineage that carries some ancestral material to the sample. Here, the term "ancestral material" refers to chromosomal regions that are eventually inherited by any of the samples of interest drawn from the presentday population. The node in ARG at which two branches converge denotes a coalescent event, and the node at which one branch splits into two denotes a recombination event.
An algorithm that can rapidly generate independent ARGs from populations evolving with both coalescence and recombination can be of great use. First, they can facilitate data analysis. Samples produced using various models can be combined with data to test hypotheses. Second, it can be used to estimate the recombination rate. The question of whether recombination events are clustered in hotspots is of enormous interest at present. It also has considerable relevance to the efficient design of association studies [5].
There are two main representative algorithms that can simulate ARG according to a given recombination rate. One is Hudson’s ms [6]. It is the simplest and is used in many applications. The other is Wiuf and Hein’s spatial algorithm [7]. These two algorithms stress different aspects of the process. The algorithm of ms has a Markovian structure and is computationally straightforward. Ancestral lineages related to the sampled chromosomes remain unchanged until coalescence or recombination. In contrast, Wiuf and Hein’s spatial approach of simulating genealogies along a sequence has a complex, nonMarkovian structure. The distribution of the next local tree depends on all previous local trees rather than on the current genealogy alone. It begins with a coalescent tree at the left end of the sequence and adds more different local trees gradually along the sequence, which form part of the ARG. The algorithm terminates at the right end of the sequence when the full ARG is determined.
Because only type 1 and type 2 contribute to the gene structure of the sample, ARG should in principle contain only these two types of recombination and the branches containing other types of recombination are regarded as redundant ones in simulated ARG. It seems that ms is the briefest way to simulate the ARG according to its distribution because it considers only type 1 and type 2 recombination. Wiuf and Hein’s method also simulates the other three types of recombination, which may produce some redundant branches and increase computation burden in generating ARG. When simulating hundreds of thousands of ARGs with a large recombination rate is required (e.g. to estimate recombination rate of a long DNA sequence based on full likelihood), even ms is not efficient enough, neither is it easy to approximate. Although the original spatial algorithm of Wiuf and Hein’s method produces a lot of redundant branches, several approximate spatial algorithms have been developed to reduce the redundant branches in ARG and simulate large samples of long sequences [8–10].
Likelihoodbased inference is one statistical method that is commonly used to apply the corresponding algorithm to estimations of the recombination rate. The likelihood can be estimated by simulating ARGs from the coalescent distribution given the recombination rate r and mutation rate θ. The simulated data can be examined to see if they match the observed data. By repeating this process many times with different values of θ and r, maximum likelihood estimations of the statistics can be obtained [9]. However, because the vast majority of ARGs is not consistent with the sample and contributes nothing to the likelihood, this naïve method is infeasible. With a complete history, it is easy to calculate both the probability of the data and of the history given the coalescent model and associated parameters. The central difficulty is that, from an essentially infinite set of histories that could give rise to the data, it is hard to find histories that are highly probable under the assumed model. There are two approaches that have been developed to handle this difficulty [11].
The first approach involves sophisticated Monte Carlo methods such as importance sampling [12] and Markov Chain Monte Carlo [5, 13, 14]. MCMC starts from an initial guess and then tends to make subsequent modifications that are more likely to be accepted, with a probability that is proportional to how likely they occur under the assumed model. Importance sampling approximates the optimal proposal density to calculate the likelihood. Both methods create bias towards the simulation of ARGs, which makes significant contributions to the likelihood.
As a supplementary method, the second approach of estimating recombination rate is to simplify or approximate the coalescent model itself. Based on Wiuf and Hein’s spatial point of view, McVean and Cardin developed sequentially Markov coalescent (SMC), an approximation of the standard coalescent process [9]. This algorithm reduces the topology simulated to a tree rather than a graph. If two ancestral lineages have no interval in common where they share ancestral material, they are not allowed to coalesce. By restricting coalescent events in this way, the resulting process has a Markovian structure in the sequential generation of genealogies along a chromosome. The SMC starts with a coalescent tree at the lefthand end of the sequence and progressively modifies the tree with recombination events as it moves to the right. Marjoram and Wall modified the approximation [8]. In their system, the old lineage is not removed until after the point of coalescence of the new one has been determined. This allows for the possibility that the new lineage coalesces with the one that was to be erased and that no change occurs in the genealogy. Chen, Marjoram, and Wall described an intermediate approach (MaCS), which is a compromise between the accuracy of the standard coalescence and the speed of the SMC [10]. In the SMC, coalescent events are restricted to edges within the last local tree only. While in MaCS, coalescent events are restricted to edges among any of the last k (denoted as the tree retention parameter) local trees. It models the relationships between recombination events that are physically close to each other and treats those that are far apart as independent.
The essence of these approximate methods is to simplify the recombination of type 2 events and the coalescence of lineages that contain distant ancestral material. These methods offer significant improvements with respect to computational efficiency and sequence length. However, the effects of these simplifications have not yet been clearly classified.
This paper reports the establishment of a new method of modeling coalescence with recombination. It offers several improvements over that of Wiuf and Hein’s method. A new algorithm based on the new model is proposed to generate ARGs equal to ms. Similar to the algorithm designed by Wiuf and Hein, the present algorithm constructs ARG spatially along the sequence [7]. However, it will not produce any redundant branches, which are inevitable in Wiuf and Hein’s algorithm. It is here suggested that the above approximate methods (SMC [9], SMC′ [8], MaCS [10]) be viewed as special cases of our new algorithm. Using simulated analysis, the present algorithm was compared to MaCS. The time to the most common ancestor (TMRCA) in the local trees of ARGs generated by the present algorithm was even more similar to that produced by ms than that produced by MaCS was. The present method can generate sampleconsistent ARGs, which might significantly reduce the computational burden.
Results
Model assumptions
 (1)
A gene, here treated as a length of DNA, is represented by the unit interval [0,1).
 (2)
The population is assumed to evolve through discrete generations in a WrightFisher manner, which means that each generation is of 2 N genes in size. As usual, time is measured in units of 2 N generations. N → ∞ and 4Nr → ρ remains fixed, where r is the regional recombination rate per generation per gene, and ρ is the global population recombination rate.
 (3)
The present algorithm was designed under the infinite sites model, in which the mutation is independent of the coalescent with recombination so that it occurs with a Poisson distribution on the ARG.
 (4)
In the present algorithm, a gene copies the genetic information of its parents if recombination does not occur. If recombination does occur, only one breakpoint is assumed and the genetic information of the gene comes from its parents. Specifically, each gene chooses its parents from the previous generation according to the following rules: a) with probability 1–r, a gene from the previous generation is uniformly chosen; b) with probability r, a recombination event occurs, and two genes are uniformly chosen from the previous generation; c) each gene chooses its parents independently.
Meanwhile, the position of the breakpoint S is selected (independent of the generating events of the other genes) according to a given distribution with density p(s). The intervals [0,S) and [S,1), which are from the first and second parents, respectively, form the offspring gene. Here the random variable S possesses a continuous distribution.
Generation of ARG along sequences with SC
Generating ARG is key to simulating coalescent processes with recombination. Most studies of theoretical population genomics do not generate ARG, even though many algorithms have been developed to do so. These include Hudson’s ms [6], Wiuf and Hein’ algorithm [7], Chen et al.’s MaCS [10], SMC [9], and SMC′ [8]. These algorithms can be roughly classified into two categories according to the different methodologies they use. One generates ARG back in time. Hudson’s ms is a representative example [6]. It is the most accurate algorithm because of its complete ARG space. The other generates ARG by gradually constructing a series of local trees along the sequence, such as Wiuf and Hein’ algorithm [7], MaCS [10], SMC [9], and SMC′ [8]. These algorithms are collectively called spatial algorithms. Particularly, MaCS, SMC and SMC’ are approximate spatial algorithms, which can generate ARG with longer sequences than ms can because they lose some information during the generation of ARG space.

Step 1. j?=?0, S_{0}?=?0, Build a standard coalescent tree X^{0}.

Step 2. Generate a breakpoint of recombination S_{j + 1} in the interval [S,1) and choose a location from the current ARG X^{S}_{ j } .

Step 3. Build a new ARG X^{Sj+1} by adding a new coalescent branch to the current ARG X^{Sj}, and the coalescent event begins from the selected location and ends to any position of X^{Sj}.

Step 4. Repeat steps 2 to 4 until S_{j+1}?>?1, and then take the total ARG X^{1} as the ARG X^{Sj}.
 (1)
Wiuf and Hein’s algorithm simulates type 1, 2, and other three types of recombination events because this algorithm fixes any recombination breakpoints without determining its type during step 2. It is therefore obvious that Wiuf and Hein’s algorithm generates more redundant information than ms (ms simulates types 1 and 2 but no other unnecessary recombination [8]).
 (2)
MaCS simulates only type 1 recombination events and does not consider other types. This is because, during step 2, MaCS just fixes the type 1 recombination breakpoints by choosing a location from the current tree, but not the whole current ARG. In this way, MaCS ignores type 2 recombination information which ms does not.
Now the key problem for spatial algorithms is distinguishing type 2 from type 3 and adding useful recombination to the final total ARG.
To solve this type of problem, a new algorithm is here proposed to distinguish type 2 from type 3 events. In this algorithm, it is type 2 rather than type 3 recombination events that are entered into previous local trees after the latter type 1 breakpoints have been fixed. Specifically, the second step of this process differs from that of Wiuf&Hein’s, which we put recombination break points on the current local tree instead of the current ARG. This caused the type 1 recombination breakpoints to be fixed after step 2. Then step 3 is refined: A coalescent event beginning at the selected location was added to the current tree, and the end of the coalescent event is also located. If the location is on an old branch, then type 2 recombination breakpoints can be found back on this old branch. Finally, a full ARG can be built without missing any type 2 recombination events.
In fact, the probability distribution on ARG space generated by this algorithm, which is based on a spatial model along a sequence, is identical to that produced by an algorithm based on the backintime model. To see this intuitively, we construct the whole ARG by constructing its constrain or projection on [0, s) from s = 0 to s = 1. The difference between the constrain of [0, S_{ j }) and [0, S_{j + 1}) (X^{Sj} and X^{Sj+1}, respectively) can be more than 1 branch. That’s because if there is a recombination with break point in [S_{ j }, S_{j + 1}), we can obtain it at least when we obtained X^{Sj+1}. So, in order to get X^{Sj+1}, we choose to add one or more than one branches to X^{Sj}.
The experimental proof of the above identification can be seen in the performance of SC and SCsample section. The details of the procedures associated with the present algorithm can be found in the methods section. The Mathematical framework of the algorithm is provided in another paper [16].
Generate sampleconsistent ARG
The idea of sampleconsistent ARG first appeared in the work of Song [17]. In their study, ARG was used to estimate the minimal number of recombination events. ARG with the minimal recombination numbers definitely helped the study of recombination. We do not think that ARG with minimal number of recombination represents a true ARG. Therefore, we attempt to design an algorithm that can model a group of ARG which are consistent with sample in a reasonable way, rather than simply produce all the ARGs in the whole ARG space, neither does it generate ARGs with minimal number of recombination events. We believe that by this way our method generates sampleconsistent ARGs which reflect true genealogical information of the samples and it will definitely help us estimate parameters of population demographic history.

Step 1 Generate a standard coalescent tree X^{ 0 }, which is consistent with the left site of the sequence.

Step 2 One by one, confirm whether the following sites are consistent with the current tree X^{ 0 }, and find the first inconsistent site P_{1}. Then generate a breakpoint of type 1 recombination S_{1} in the interval [0, P_{1}], and choose a location from the current ARG X^{ 0 }.

Step 3 Build a new ARG X^{S1} by adding branches to X^{ 0 } at the chosen location to make the local tree after S_{ 1 } consistent with the first site after S_{ 1 }.
Repeat step 2 to 3 until S_{j + 1} > 1 to get an ARG consistent with the sample for the full sequence.
Performance of SC and SCsample
The present results show that SC is more similar to ms with respect to both the mean and variance of the TMRCA than MaCS is. This confirms that the results of the theoretical study that ARGs generated by SC and ms share similar statistical properties, indicating that SC performs better than MaCS in the modeling of ARG.
Comparison of average time cost and memory usage between SC and ms
Sample size  Region (L)  ρ(4N_{ e }Lr_{ p })  SC  ms 

20  10 Mb  1000  45 s (14 MB)  13 s (63 MB) 
10 Mb  10,000  3 h 43 min 22 s (210 MB)  2 h 3 min 53 s (344 MB)  
100 Mb  1000  57 s (14 MB)  N/A  
100 Mb  10,000  4 h 30 min 18 s (238 MB)  N/A  
100  10 Mb  1000  1 min 57 s (19 MB)  23 s (148 MB) 
10 Mb  10,000  7 h 11 min 34 s (281 MB)  N/A  
100 Mb  1000  2 min 11 s (20 MB)  N/A  
100 Mb  10,000  7 h 13 min 3 s (289 MB)  N/A  
1000  1 Mb  1000  3 min 13 s (26 MB)  12 s (215 MB) 
10 Mb  3 min 20 s (26 MB)  N/A 
Discussion
In the present study, a new method for modeling coalescent processes with recombination was developed. This method offers some improvements over Wiuf and Hein’s method. It covers all the commonly used spatial simulation algorithms with approximations, i.e. SMC, SMC′, and MaCS. Based on this method, a new algorithm, SC, was developed for the simulation of ARG and the generation of data. This algorithm has been shown to be able to simulate ARG with the same distribution as that produced by a backintime simulation algorithm. Another relevant algorithm, SCsample, was also developed for generating sampleconsistent ARG. The present method and algorithms have considerable potential to facilitate modeling and statistical inference of recombination.
Another study showed that the distribution of ARG generated by the new algorithm is identical to that generated by a typical backintime model [16]. In this study, computer simulation experiments confirmed that this feature was common to both the backintime method and the alongsequence method with respect to the simulation of ARG. In practice, SC takes slightly more time than ms for the same simulation but use less RAM. However, ms does not work as well as SC when the sequence is very long (e.g., 100 Mb). These comparisons indicate that neither of these two methods can generate ARG of long sample sequences in a satisfactory manner. Considering the above situation and rapid accumulation of huge genomic data, one possible solution or a tradeoff could be effected by approximating ARG to replace full ARG.
Several approximate methods have been developed for the simulation of ARG, such as the sequentially Markov coalescent (SMC), a related method called SMC′ and Markovian coalescent simulator (MaCS). These existing methods can be considered as special cases of the present method. Marjoram and Wall classified recombination into 5 types [8]. However, these methods all ignore type 2 recombination and some of the coalescent events associated with it. This may affect ARG reconstruction and statistical inference of recombination.
The influence of type 2 recombination on ARG was assessed by comparing the statistical properties of ms, SC, and MaCS (h = L). It was concluded that ignoring type 2 recombination would reduce both the mean and variance of the times to MRCA, although the reduction is not very remarkable. These results indicated that type 1 recombination might play a more important role in history than type 2 recombination. However, type 2 recombination was found to be present extensively and much more common than type 1 when long sequences are considered and simulated, suggesting that this type 2 recombination should not be ignored in simulation, especially for the simulation of long sequences. Therefore, an algorithm that takes into account type 2 recombination should be used for the simulation of long sequences. This is one of the reasons why the SC algorithm was developed.
During the simulation of recombination, regardless of which algorithm was used, the process is complex and time consuming. Many nonsampleconsistent ARGs are generated. For this reason, a new concept, sampleconsistent ARG, was developed, and an algorithm, SCsample, was used to simulate sampleconsistent ARG. However, on the one hand, simulations based on SCsample save a considerable amount of time and significantly increase efficiency over that of any nonsampleconsistent algorithm.
Taken together, the two algorithms developed in this study improved the modeling of coalescence with recombination. In a future study, new approximated methods should be developed to handle largescale simulation of big data. Coalescence with recombination can be modeled using a random sequence {(S_{ i }, Z^{ i }) : i ≥ 0}. These different methods of approximation actually use different S_{ i } and Z^{ i }. One possible solution is to approximate the random sequence {(S_{ i }, Z^{ i }) : i ≥ 0} from the mathematical aspects. These methods, when well established, can greatly facilitate studies of recombination modeling and recombination rate estimation.
Conclusions
In this study, we developed a new method for modeling coalescent processes with recombination, and we demonstrated that our method has comparable performances with ms in generating ARG, a computer program commonly used to simulate coalescence. An outstanding feature of our method is that it does not produce any redundant branches which are inevitable in Wiuf and Hein’s algorithm. In addition, our method can generate sampleconsistent ARGs. Interestingly, we elucidated that the existing approximate methods (SMC, SMC’, MaCS) are all special cases of our method. We believe our new method and algorithm will facilitate the modeling of recombination and advance our understanding of evolution of recombination events within and between populations.
Methods
SC
Step 1. Construct a standard coalescent tree ${\mathcal{T}}^{0}$(c.f. [2]) at the position S_{0} = 0 (the left endpoint of the sequence) and assign each branch of the tree with label 0. Let ${\mathrm{X}}^{0}={\mathcal{T}}^{0}$.
Here, ${L}_{{S}_{i}}$ is the total branch length of the current local tree ${\mathcal{T}}^{{S}_{i}}$, ρ is global population recombination rate, and p(u) is the density of the distribution of break point (see Model Assumptions for the explanation of p(u)). If S_{i + 1} ≥ 1, break; otherwise, go to step 3.
Step 3. Uniformly choose a recombination location on ${\mathcal{T}}^{{S}_{i}}$. Let j = 0, and let ${T}_{0}^{i+1}$ denote the latitude (i.e. the height from the bottom to the location) of the chosen location.
Step 4. At the site of recombination, a new branch with label i + 1 is created by forking off the recombination node and moving backward in time (i.e. along the direction of increasing latitude). With equal exponential rate 1, the new branch will tend to coalesce to each branch in ${X}^{{S}_{i}}$ that has a higher latitude than ${T}_{j}^{i+1}$. In this way, if there are l branches in ${X}^{{S}_{i}}$ at the current latitude, then the time before coalescence is exponentially distributed with parameter l. Note that at different latitudes there may be a different number of l of branches. Let the branch to which the new branch coalesces be called EDGE, and let ${T}_{j+1}^{i+1}$ be the latitude of the coalescent point and j = j + 1.
Step 5. If the EDGE is labeled with i, which means the EDGE has coalesced to the current tree, go to step 6; if the EDGE is labeled with a k and k is less than i, which means the EDGE has coalesced to an old lineage, then a potential recombination event should be considered. The waiting time t of the possible recombination event on the EDGE is exponentially distributed with parameter ${2}^{1}\rho {\displaystyle {\int}_{{S}_{k+1}}^{{S}_{i+1}}p\left(u\right)\mathit{\text{du}}}$.
Case 5.1. If ${T}_{j}^{i+1}+t$ is less than the latitude of the upper node of the EDGE which is denoted by H, then it is the next recombination location. Let ${T}_{j+1}^{i+1}={T}_{j}^{i+1}+t$. The part of the branch above ${T}_{j+1}^{i+1}$ is no longer called EDGE. Let j = j + 1 and go to step 4.
Case 5.2. If ${T}_{j}^{i+1}+t\ge H$, choose the upper edge of the current EDGE with larger labels to be the next EDGE. Let ${T}_{j+1}^{i+1}=H$, j = j + 1 and go to step 5.
Step 6. Let ${X}^{{S}_{i+1}}$ be the collection of all the branches in ${X}^{{S}_{i}}$ and all the new branches labeled i +1. Starting from each node 1 ≤ m ≤ n at the bottom of the graph, specify a path moving along the edges in ${X}^{{S}_{i+1}}$ increasing in latitude until it reaches the top of the graph. Whenever a recombination node is encountered, choose the edge with the larger label. The collection of all the paths then forms the local tree ${\mathcal{T}}^{{S}_{i+1}}$. Update all the branches in ${\mathcal{T}}^{{S}_{i+1}}$ with label i + 1.
It is here noted that step 5 shows the key differences between the present method and other spatial algorithms. This step finds the missing type 2 recombination events. Essentially, the present algorithm gives the new branch a chance to leave when it coalesces into an old branch.
SCsample

Step 1. Value the coalescent tree from bottom to the top according to the following rules: (a) Value a node with 1 if its two son nodes are all with value 1. (b) Value a node with 0 if its two son nodes are all with value 0. (d) Value a node with 2 if one of its two son nodes is with value 0, the other is 1. (e) Value a node with 1 if one of its two son nodes is with value 1, the other is 2. (f) Value a node with 0 if one of its two son nodes is with value 0, the other is 2. (g) Value a node with 2 if its two son nodes both have value 2.

Step 2. For the top node of the coalescent tree, (a) change its value to 0 if it is valued with 2, (b) remain its value if it is valued with 0 or 1.

Step 3. Revalue each 2valued node with its parent node’s value.

Step 4. The number of the mutation edges is the MMN.
See Additional file 4: Figure S4 for an example of the MMN algorithm.

Step 1. At position S_{0}?=?0, a coalescent tree T_{0} is constructed as a modification of the standard coalescent: (a) Give the value of the left site (first site) of the sample to the leaf node. (b) Randomly choose two nodes with the same value to coalesce and give the same value to the parent node. (c) When only one node with value 0 or 1 remains, the simulation becomes the standard coalescent simulation.

Step 2. Assign each leaf node of the current tree a value of the sites after the current position of the sample, and then use the above MMN algorithm to assign every node of the current tree by making sure the number of edges with the value of the top and bottom nodes difference is minimal. In this way, each node on the current tree is valued by a 0–1 vector. Edges that have different values at the top and bottom nodes at a given are called mutation edges of the site.

Step 3. Denote the first site that has more than one mutation edges as P_{i + 1}. The next recombination point S_{i + 1} is uniformly chosen between S_{ i } and P_{i + 1} (or S_{i + 1} is regenerated as before until S_{i + 1}?<?P_{i + 1}).

Step 4. Randomly choose a recombination site on the mutation edges of the site at position P_{i + 1}.

Step 5. The new lineage can only coalesce into 3 different types of edges: type A, the edges in the current tree with their values from S_{ i } to P_{i + 1} are the same, type B, an old branch that leads to type A edges in the current tree, type C, a branch beyond the local MRCA.
Uniform distribution was used when finding the next recombination point because the algorithm was designed to be independent of a prior recombination rate. Any ARG generated using this method can be viewed as a randomly selected ARG consistent with the sample. In principle, all kinds of algorithms can be modified. These can be considered special cases in our model (such as SMC and MaCS) to generate ARG consistent with the sample.
Calculation of the differences among ARGs by the mean and variance of the heights of local trees
Here, ${m}_{\mathit{\text{SC}}}^{i}$ represents mean height of the i^{th} local tree generated by SC. It ${v}_{\mathit{\text{SC}}}^{i}$ represents variance of the height of the i^{th} local tree. So, based on the mean and variance of the local tree’s height, the difference in ARGs generated by difference algorithm can be estimated.
Availability and requirements
The algorithm for SC and SCsample are implemented in C++ by modifying the code of MaCS. SC is implemented by the same set of demographic models used in ms, but SCsample can only handle the classical homogeneous effective population size model (constant population size). Like MaCS, SC considers variations in recombination rate and intragenic gene conversion using a piecewise constant model and intragenic gene conversion, respectively [18]. The source code of SC and SCsample can be downloaded from the website: http://www.picb.ac.cn/PGG/resource.php

Project name: SC and SCSample

Project home page: http://www.picb.ac.cn/PGG/resource.php

Operating system: GNU/Linux

Programming language: C++

Other requirements: g++ version 4.4.6 or higher; boost version 1.41 or higher

License: GNU GPL

Any restrictions to use by nonacademics: license needed
Notes
Declarations
Acknowledgments
We would like to thank the colleagues for their helpful discussion. In particular, we are greatly indebted to Drs. DeXin Zhang and WeiWei Zhai for offering their valuable insight idea and constant encouragement throughout this work. These studies were supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (CAS) (XDB13040100), by 973 project (2011CB808000), NSFC Creative Research Groups (11021161), National Natural Science Foundation of China (NSFC) grants (91331204, 31171218, 11001010), by the Fundamental Research Funds for the Central Universities (2011JBZ019, 2012RC048). S.X. is MaxPlanck Independent Research Group Leader and member of CAS Youth Innovation Promotion Association. ZM.M. also gratefully acknowledges the support by National Center for Mathematics and Interdisciplinary Sciences (NCMIS). S.X. also gratefully acknowledges the National Program for Topnotch Young Innovative Talents of The "TenThousandTalents" Project and the support of K.C.Wong Education Foundation, Hong Kong. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Authors’ Affiliations
References
 Kingman JFC: On the genealogy of large populations. J Appl Probab. 1982, 19A: 2743.View ArticleGoogle Scholar
 Kingman JFC: The coalescent. Stoch ProcessAppl. 1982, 13: 235248.View ArticleGoogle Scholar
 Hudson RR: Properties of a neutral allele model with intragenic recombination. Theor Popul Biol. 1983, 23 (2): 183201.View ArticlePubMedGoogle Scholar
 Griffiths RCaM P: An ancestral recombination graph. Prog Popul Genet Hum Evol. 1997, 87: 257270.View ArticleGoogle Scholar
 Stumpf MPH, Goldstein DB: Demography, recombination hotspot intensity, and the block structure of linkage disequilibrium. Curr Biol. 2003, 13: 18.View ArticlePubMedGoogle Scholar
 Hudson RR: Generating samples under a Wright&Fisher neutral model of genetic variation. Bioinformatics. 2002, 18: 337338.View ArticlePubMedGoogle Scholar
 Wiuf C, Hein J: Recombination as a point process along sequences. Theor Popul Biol. 1999, 55: 248259.View ArticlePubMedGoogle Scholar
 Marjoram P, Wall JD: Fast "coalescent" simulation. BMC Genet. 2006, 7: 16View ArticlePubMed CentralPubMedGoogle Scholar
 McVean GA, Cardin NJ: Approximating the coalescent with recombination. Philos Trans R Soc Lond B Biol Sci. 2005, 360 (1459): 13871393.View ArticlePubMed CentralPubMedGoogle Scholar
 Chen GK, Marjoram P, Wall JD: Fast and flexible simulation of DNA sequence data. Genome Res. 2009, 19: 136142.View ArticlePubMed CentralPubMedGoogle Scholar
 Stumpf MPH, McVean GAT: Estimating recombination rates from populationgenetic data. Nat Rev Genet. 2003, 4 (12): 959968.View ArticlePubMedGoogle Scholar
 Fearnhead P, Donnelly P: Estimating recombination rates from population genetic data. Genetics. 2001, 159: 12991318.PubMed CentralPubMedGoogle Scholar
 Kuhner MK, Beerli P, Yamato J, Felsenstein J: Usefulness of single nucleotide polymorphism data for estimating population parameters. Genetics. 2000, 156: 439447.PubMed CentralPubMedGoogle Scholar
 Nielsen R: Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics. 2000, 154: 931942.PubMed CentralPubMedGoogle Scholar
 Griffiths RC, Marjoram P: Ancestral inference from samples of DNA sequences with recombination. J Comput Biol. 1996, 3: 479502.View ArticlePubMedGoogle Scholar
 Chen X, Ma ZM, Wang Y: Markov jump processes in modeling coalescent with recombination. Ann Stat. 2014, 42 (4): 13611393.View ArticleGoogle Scholar
 Song YS, Hein J: Constructing minimal ancestral recombination graphs. J Comput Biol. 2005, 12 (2): 147169.View ArticlePubMedGoogle Scholar
 Wiuf C, Hein J: The coalescent with gene conversion. Genetics. 2000, 155 (1): 451462.PubMed CentralPubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.