ACG: rapid inference of population history from recombining nucleotide sequences

Background Reconstruction of population history from genetic data often requires Monte Carlo integration over the genealogy of the samples. Among tools that perform such computations, few are able to consider genetic histories including recombination events, precluding their use on most alignments of nuclear DNA. Explicit consideration of recombinations requires modeling the history of the sequences with an Ancestral Recombination Graph (ARG) in place of a simple tree, which presents significant computational challenges. Results ACG is an extensible desktop application that uses a Bayesian Markov chain Monte Carlo procedure to estimate the posterior likelihood of an evolutionary model conditional on an alignment of genetic data. The ancestry of the sequences is represented by an ARG, which is estimated from the data with other model parameters. Importantly, ACG computes the full, Felsenstein likelihood of the ARG, not a pairwise or composite likelihood. Several strategies are used to speed computations, and ACG is roughly 100x faster than a similar, recombination-aware program. Conclusions Modeling the ancestry of the sequences with an ARG allows ACG to estimate the evolutionary history of recombining nucleotide sequences. ACG can accurately estimate the posterior distribution of population parameters such as the (scaled) population size and recombination rate, as well as many aspects of the recombinant history, including the positions of recombination breakpoints, the distribution of time to most recent common ancestor along the sequence, and the non-recombining trees at individual sites. Multiple substitution models and population size models are provided. ACG also provides a richly informative graphical interface that allows users to view the evolution of model parameters and likelihoods in real time.

1. Node height modifier (frequency 10.0) The node height modifier alters the distance of a single coalescent or recombination node from the tips. A single, non-root c-node or r-node is selected with equal probability, and the minimum height of its parent(s) and maximum height of its offspring is recorded. A new uniformly distributed random number between the two heights is generated, and the proposed node height is assigned to that value.
2. Root height modifier (frequency 2.0) This modifier proposes a new value for the height of the root node, and depends on an exogenously defined variable, say m. Given the current root height, say r, a new root height is proposed as a uniformly defined random variable in the interval (r-m/2, r+m/2). If the newly proposed root height is less than the maximum height of the offspring, the proposed height is reflected over this offspring height.
3. Breakpoint shift (frequency 1.0) This modifier alters the location of a single recombination breakpoint in a sliding-window fashion. First, a single r-node is selected with equal probability from among all r-nodes. Given some window size, say m, and the current position, say p, the new position is a uniformly distributed variable in (p-m/2, p+m/2). As with the root height modifier, proposals beyond the boundary (in this case, beyond the sites over which the ARG is defined) are reflected over the boundary. The window size m is changed during execution to maintain a proposal acceptance rate near 25%.
4. Breakpoint swap (frequency 1.0) This modifier swaps the two parents of a recombination node. First, a single r-node is selected with equal probability from among all r-nodes. Then, using the terminology described above, p 0 and p 1 are exchanged such that sites [n..r) 'lead to' p 1 , and sites in [r, m) 'lead to' p 0.

Local swap (frequency 2.0)
This modifier changes the branching structure of the ARG by moving one offspring or parent node to nearby branch. First, an non-tip node is selected with equal probability from among all non-tip nodes. If the node is a c-node, the offspring with the greatest height is selected to move. The offspring node is then transferred to the branch connecting the c-node to the other offspring, as in figure A, below. If the chosen node is an r-node, then the parent of the r-node with the smallest height is moved to the branch connecting the r-node to the other parent. No node heights are altered during any step. This procedure is similar to the local swap, but introduces more radical changes to the branching structure. In a manner similar to the local swap, an internal node is chosen uniformly from among all internal nodes. If the internal node is a c-node, then the offspring with the greatest height is chosen to move. If the internal node is an r-node, then the parent with the smallest height is chosen to move. Potential recipient branches for the moving node are then collected from among all branches that cross the height of the moving node (black dots, Figure B, below). A recipient location is chosen uniformly from among these potential locations, and the moving node is attached to this location. As in the local swap move, no node heights are altered. Figure B: Diagram of 'wide swap' proposal kernel. Top: a node is chosen to move by selecting an internal node in a manner similar to the local swap procedure. Middle: potential recipient locations are found on all branches that cross the height of the moving node. Bottom: a recipient location is chosen and the moving node is attached to it. 7. Add/ remove branch: (frequency 2.0) This kernel either adds or removes one branch, consisting of a recombination node and a coalescent node. Adding branch proposal: Heights for the new nodes are chosen in the following manner. First, a truncated exponential random variate is generated with mean r/2 and maximum r, where r is the height of the root node of the ARG. Second, a height for the coalescent node is chosen by generating another exponential random variate with rate r/2 (but no maximum), and adding to this the height of the first node. Next, potential branches on which the new nodes will reside are collected by examining all branches that cross the heights generated in the previous step.
If the height of the new c-node is greater than r, the c-node will be the new root and an additional branch is generated to connect the previous root to the new root. The new branch is inserted into the ARG at the chosen locations. Finally, a location for the new recombination breakpoint is chosen uniformly among all sites at which recombinations may occur. Branch removal: All 'removable' branches are collected in the following manner. A branch is defined as an r-node and a c-node, where the c-node is necessarily a parent of the r-node. A removable branch is branch whose removal will not create an ARG with only a single branch passing through any given time. A branch is selected at uniform from among all removable branches, and is then removed from the ARG.
Unlike other ARG proposal mechanisms, both the add and remove procedures generate nontrivial Hastings-Green ratios. The probability of a given add move is the following: where λ is half the height of the root node, x is the height of the new r-node, y is the height of the new c-node. The probability of a remove-branch move is the reciprocal of the number of removable branches, which is tabulated on each add and remove step. Finally, following the reversible-jump methodology described in Robert & Casella (2010), a Jacobian term of 2 is induced on the add move, and 1/2 on the reverse move.