ACG: rapid inference of population history from recombining nucleotide sequences
 Brendan D O'Fallon^{1}Email author
DOI: 10.1186/147121051440
© O'Fallon; licensee BioMed Central Ltd. 2013
Received: 23 March 2012
Accepted: 29 January 2013
Published: 5 February 2013
Abstract
Background
Reconstruction of population history from genetic data often requires Monte Carlo integration over the genealogy of the samples. Among tools that perform such computations, few are able to consider genetic histories including recombination events, precluding their use on most alignments of nuclear DNA. Explicit consideration of recombinations requires modeling the history of the sequences with an Ancestral Recombination Graph (ARG) in place of a simple tree, which presents significant computational challenges.
Results
ACG is an extensible desktop application that uses a Bayesian Markov chain Monte Carlo procedure to estimate the posterior likelihood of an evolutionary model conditional on an alignment of genetic data. The ancestry of the sequences is represented by an ARG, which is estimated from the data with other model parameters. Importantly, ACG computes the full, Felsenstein likelihood of the ARG, not a pairwise or composite likelihood. Several strategies are used to speed computations, and ACG is roughly 100x faster than a similar, recombinationaware program.
Conclusions
Modeling the ancestry of the sequences with an ARG allows ACG to estimate the evolutionary history of recombining nucleotide sequences. ACG can accurately estimate the posterior distribution of population parameters such as the (scaled) population size and recombination rate, as well as many aspects of the recombinant history, including the positions of recombination breakpoints, the distribution of time to most recent common ancestor along the sequence, and the nonrecombining trees at individual sites. Multiple substitution models and population size models are provided. ACG also provides a richly informative graphical interface that allows users to view the evolution of model parameters and likelihoods in real time.
Keywords
Coalescent ARG Ancestral recombination graph Bayesian inference RecombinationBackground
Reconstruction of population history from genetic data often requires computationally intensive Markov chain Monte Carlo strategies to estimate Bayesian posterior or likelihood surfaces (e.g.[1]). Tools that perform this task are sometimes called “genealogy samplers”[2] because they construct many quasiindependent samples of the genealogy describing the ancestry of the sequences. Genealogy samplers have become essential components of modern population genetic analysis, with the most popular tools, MrBayes[3, 4], BEAST[5], IMa[6], and LAMARC[7] accumulating some 10,000 citations over the last decade. While these tools have rapidly increased in sophistication, with the exception of LAMARC they share a common limitation: they cannot be used to accurately analyze sequences with recombination. This restriction means that analyses cannot be carried out on most alignments of nuclear data, and as a result typical investigations are limited to mitochondria and nonrecombining viruses. While LAMARC can be used on recombining sequences, analyses are impeded by slow performance and single runs may take several days or weeks to complete.
Several challenges face population genetics analysis with recombining data. When sequences recombine, their ancestry cannot be described by a simple tree or genealogy, and instead must be represented as an Ancestral Recombination Graph (ARG,[8]). All but the simplest of ARGs are difficult to visualize, and few resources are available for drawing, reading, and simulating ARGs. Additionally, the space of likely ARGs supported by a given alignment is often much larger than the space of trees, hence MCMC algorithms likely will take longer to converge. When satisfactory convergence is reached, few tools exist to extract meaningful information from ARGs.
Despite these challenges, the presence of recombination also facilitates some aspects of analysis. Because nearby sites may have partially independent histories more power may exist to infer population parameters. Similarly, evolutionary forces such as selection may create detectable genealogical features that would be obscured if all sites shared the same tree. For instance, a selective sweep may shorten the time to most recent common ancestor (TMRCA) in a given region, but such a shortening may be obscured if the region is completely linked to regions affected by different forces. Inference of complex demography may also be aided  for example, periods of admixture with other populations may cause some regions to have a relatively ancient TMRCA, but such features can only exist when recombinations allow some genomic regions to have an alternative history.
ACG (Analysis of reCombinant Genealogies) is a graphical desktop application that aims to overcome the challenges inherent in ARG inference, and to provide rapid and informative coalescent analysis of recombinant genetic data. It takes as input an alignment of homologous nucleotide sequences, and executes a Bayesian Markov chain Monte Carlo algorithm to infer the posterior distribution of parameters such as population size, recombination rate, transition to transversion ratio, as well as the history of the sequences represented by an ARG. Importantly, ACG computes the “full” likelihood of the ARG using a modified Felsenstein pruning algorithm[9], not an adhoc or composite likelihood. As with other genealogy samplers, the raw output of the program is a collection of parameters sampled from the Markov chain. ACG also contains many data collection tools and utilities designed to simplify the analysis of the sampled states.
Implementation
where D is the input alignment and M is an evolutionary model containing parameters that are estimated from the data. At minimum, the model includes a description of how nucleotide sequences change over time (for instance, the Felsenstein 1984 or TimuraNei 1993 model), a structure describing the ancestral relationships among the samples, and a function describing population size. Each of these submodels may in turn encapsulate one or more parameters that are estimated from the data. For instance, the TimuraNei 1993 model of nucleotide substitution involves two parameters that affect the transitiontotransversion ratio in addition to a vector describing stationary state nucleotide frequencies. As in LAMARC, the ancestry of the samples is represented by an Ancestral Recombination Graph (ARG), which is estimated from the data simultaneously with other model parameters.
To estimate the probability of the model given the data, ACG constructs and executes a Markov chain whose stationary state is the desired distribution (eq. 1). Generation of new states involves proposing a new value for a selected parameter, calculating the likelihood of the newly proposed state as well as the Hastings ratio associated with the proposal, and accepting or rejecting the state based on the MetropolisHastingsGreen criterion[10, 11]. A typical run involves repeating this procedure for some tens of millions of steps. States are sampled periodically and properties of model parameters are recorded by a variety of data collection utilities. If the chain has reached stationarity, the sampled parameters may be assumed to be correlated draws from the posterior density Pr{Model  Data}.
While the above scheme is similar to that used in other genealogy samplers, several aspects of the implementation are worthy of note. Most importantly, ACG implements data structures and MCMC proposal kernels that allow ARGs to be sampled from the data, where the probability of a particular ARG being sampled is proportional to its full likelihood under the data and some model of nucleotide substitution. Currently, the Felsenstein '84 (F84) and TimuraNei '93 (TN93) substitution models are supported. Seven different proposal kernels operate on ARGs, these include some previously described and two novel kernels. ARG proposal kernels are detailed in the Additional file1: Appendix A: ARG proposal kernels.
ARGs are a complex and rich source of information regarding the history of populations, and ACG provides several novel features that aid in interpreting the collection of ARGs sampled by the Markov chain. First, ACG by default tracks many of the bulk properties of ARGs, such as the number of recombination breakpoints and height of the deepest accessible nodes. In addition, ACG records the locations of all recombination breakpoints as well as the TMRCA across the sequence. ACG also provides utilities to examine the consensus tree at individual sites. These consensus trees are familiar, nonrecombining trees ancestral to a single site only. Such trees may be useful when the ancestry in a small region is of particular interest, or when several such regions are to be compared. Further, ACG provides a companion utility (the argutils tool) that is capable of examining a single ARG and collecting information from it, such as the positions and heights of all recombination breakpoints, a list of all of the marginal trees contained in the ARG, or a plot of the TMRCA across the length of the sequence subtended by the ARG.
Users may interact with ACG in several ways. ACG features a simple commandline interface suitable for batch processing, but also provides a rich graphical user interface (GUI). The GUI allows users to both construct an analysis by selecting parameters, proposal kernels, and model priors, as well as to observe selected parameters and likelihoods as they change in real time as the chain progresses. Observed parameters and likelihoods can be viewed in trace or histogram form, and allow for rapid assessment of MCMC characteristics and convergence. Input files may also be saved and reloaded from within the GUI, and saved input can be executed from the command line.
A primary design goal of ACG is efficient computational performance and several techniques are used to speed calculations. First, every MCMC state involves calculation of new likelihoods followed by acceptance of rejection of the proposed state. If the state is accepted, the entire proposed proposed state must be moved to the current state before the next step can be initiated. In contrast to some other algorithms, ACG does not perform a full copy of the data and instead uses a referenceswapping technique to move the information to the desired location. Because the state data may be quite large and must be updated with every MCMC step, reference swapping results in a significant performance increase compared to copying. A second optimization technique involves identification of all identical alignment columns and computation of the data likelihood only once for each unique column. While many implementations involve some degree of alignment column ‘aliasing’, ACG performs this aliasing at every node where sites coalesce, again substantially reducing the number of likelihood calculations performed. Finally, ACG tracks which ARG nodes are affected by various proposals, and recomputes likelihoods only for the nodes and sites ranges affected.
Results and discussion
Resolution of individual recombination breakpoints in space and time
To demonstrate some of the unique features of ACG, we present a small analysis based on simulation data. To begin, an ARG was simulated with 10 sequences of length 10,000 sites under the standard neutral coalescent model with the population size parameter θ = 2Nμ= 0.02 and recombination parameter ρ = 2Nr = 1.0, using the argutils package included with ACG (where N is population size, μ is the mutation rate, and r is the recombination rate). The resulting ARG contained a total of 13 recombination breakpoints. All marginal trees were extracted again using argutils, and nucleotide sequences were simulated along each tree using seqgen[12], using the F84 model of evolution. These sequences were then used as input to a standard ACG run. The run was conducted for 20,000,000 MCMC steps using a single chain, and completed in 53 minutes. Examination of parameter value traces suggested convergence in fewer than 1,000,000 MCMC steps for all parameters.
Empirical example
Sample identifier and population of origin for samples used in the empirical example
ID  Population 

NA07357  CEU (European ancestry, United States) 
NA18501  YRI (Yoruba, Idaban, Nigeria) 
NA18558  CHB (Han Chinese, Beijing) 
NA18940  JPT (Japan, Tokyo) 
NA19025  LWK (Luhya in Webuye, Kenya) 
NA19649  MXL (Mexican ancestry, United States) 
NA19670  MXL (Mexican ancestry, United States) 
NA20510  TSI (Toscan, Italy) 
NA20845  GIH (Gujarati Indian, India) 
NA20846  GIH (Gujarati Indian, India) 
NA20850  GIH (Gujarati Indian, India) 
NA21737  MKK (Masai, Kinyawa, Kenya) 
 10
Mb is too large a region for a single run of ACG, hence the region was divided into 50 Kb fragments, with 10 Kb overlap between adjacent fragments, and ACG was run on each fragment independently. Initially, runs were conducted for 50,000,000 MCMC steps using 4 chains in a Metropoliscoupled scheme. The first 50% of steps were discarded as burnin, and independence was assessed by comparing the means and standard deviations of the data likelihood trace between adjacent quartiles of the nonburnin portion of the run. Some 20% of chains did not reach convergence in the initial run, for these chains ACG was run again using the maximally likely ARG found during the initial run as the starting ARG for the new run. This procedure was repeated until all chains reached convergence. Typical run time was 35 hours on a 2.2 GHz Intel Xeon quadcore processor.
Figure4 also demonstrates that ACG can be used to assess recombination rate and the number of recombination breakpoints in a specified region. Currently, ACG assumes that recombination rate is constant across sites and over time, although it was estimated independently for each 50Kb fragment. This functionality is similar to that provided by tools such as LDHat[15], but is likely to be more accurate than compositelikelihood based methods for several reasons. First, ACG considers the “full” likelihood of the ARG, not a composite of pairwise likelihoods. Additionally, ACG employs more flexible models of nucleotide substitution that coestimate base frequencies and transition/transversion parameters along with other likelihood features. Finally, ACG provides not a point estimate of recombination rate but an estimate of the posterior distribution, allowing an appropriate characterization of the degree of uncertainty in the estimated parameter.
Performance
A primary design goal of ACG is high computational performance. To compare the performance of ACG to that of the only other recombination aware genealogy sampler, LAMARC[7], 10 data sets each including 20 sequences of length 50Kb were generated using the following procedure. First, ARGs were simulated using the argutils package included with ACG with θ=0.01 and ρ=1. ARGs were decomposed into marginal trees also using argutils, from which nucleotide sequences were generated with seqgen[12] using the F84 model of evolution. For each of the ten input alignments both ACG and LAMARC were run for exactly one hour of real time, and both tools estimated θ, ρ, the transition totransversion ratio, as well as the structure of the ARG. Log files of parameter values and the likelihood of the data conditional on the ARG were produced for all runs with sampling every 5000 steps from ACG and every 20 steps from LAMARC. The program Tracer (v 1.5) was then used to examine the log files and calculate the effective sample size (ESS). To assess convergence we examined the likelihood of the data conditional on the ARG (the “data likelihood”).
In several respects ACG significantly outperformed LAMARC. In raw number of MCMC states processed ACG was over 100fold faster than LAMARC, computing on average 1.8×10^{7} total states, while LAMARC computed 1.4×10^{5}. In terms of effective sample size (ESS) for the data likelihood, the mean for ACG was 1284 (range 1783122) indicating satisfactory convergence, while for LAMARC the mean was 23 (range 653). Similarly, ESSs for the scaled population size parameter θ were on average 1861 (range 1883964) for ACG and 19 (range 836) for LAMARC. In addition, ACG's memory requirements are typically modest, and 512MB per chain is sufficient for typical data sets. Finally, we note that these results used only a single chain. Because ACG can execute multiple heated chains simultaneously the performance margin over LAMARC is likely to be further increased when multiple CPU cores are available.
Validation
Another potential source of error stems from the accuracy of the likelihood computations, in particular the likelihood of the data conditional on the ARG (the “data likelihood”). To ensure that ACG computes the correct likelihood at every MCMC step, test classes were implemented that emit marginal trees and their associated likelihoods periodically during a run. Utility scripts were developed to read these files and input each tree into an external program capable of computing the likelihood of an alignment given a tree, in this case the DNAMLK tool distributed with the phylip (3.69) package[16]. When ACG functions as expected, the sum of the log likelihoods for each marginal tree should be equal to the full log likelihood of the ARG. This procedure has been conducted numerous times during development and has been used to identify and correct data likelihood calculation errors.
Conclusions
Bayesian genealogy samplers provide a means of inferring a wide variety of population parameters from genetic data. ACG builds upon techniques developed in earlier samplers, and significantly extends the range of input data that can be considered as well as the types of data that can be collected. Most importantly, ACG’s use of an ARG to represent the ancestry of the sequences enables the examination of alignments of nuclear DNA, opening new avenues of investigation for common alignments of sequence data (although the data must be properly phased). While ARGs present considerable inferential challenges compared to nonrecombining trees, ACG provides several utilities for ameliorating these difficulties and extracting useful information from the cloud of ARGs sampled. These utilities include estimation of TMRCA along the sequence, locations of recombination breakpoints in space and time, and construction of consensus trees at particular sites of interest. In addition to these tools, the argutils utility included with ACG provides a number of convenient functions, including breaking an ARG into marginal trees, extracting a single tree from a specific site, enumerating recombination positions from an ARG, and simulating neutral ARGs.
While ACG shares some features with LAMARC, including the use of an ARG to represent ancestral relationships, several important features distinguish the programs. As demonstrated above, ACG is roughly 100fold faster than LAMARC. Additionally, ACG can estimate the shapes of marginal trees at specific sites, the locations of recombination breakpoints along the sequence as well as in time, and the time to most recent common ancestor along the sequence. ACG can also import data from the. VCF file format commonly used in nextgeneration sequencing projects. LAMARC, in contrast, has several features ACG does not, most significantly the ability to model multiple populations and the migration rates between them.
Finally, ACG offers a convenient graphical interface that allows users to not only construct an analysis, but also to observe the evolution of various parameters and likelihoods in trace or histogram form in real time, allowing researchers to monitor many features of the analysis as it unfolds.
Availability and requirements
ACG is freely available for academic use and will operate on any platform with a Java Virtual Machine version 1.6 or higher installed.
Project name: ACG
Homepage:http://arup.utah.edu/acg
Operating systems: Platform independent
Programming language: Java (v 1.6)
Requirements: Java 1.6 or higher
License: Copyright 2012 Brendan O’Fallon, freely available for academic use
Abbreviation
 ARG:

Ancestral Recombination Graph.
Declarations
Acknowledgements
I would like to thank the members of the Felsenstein/Kuhner lab, especially Mary Kuhner, for help understanding the complexities of recombination. Funding was provided by NSF postdoctoral research fellowship DBI0906018, an NIH Genome Training Grant, and the ARUP Institute for Clinical and Experimental Pathology.
Authors’ Affiliations
References
 Yang Z, Rannala B: Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol Biol Evol 1997,14(7):717724. 10.1093/oxfordjournals.molbev.a025811View ArticlePubMedGoogle Scholar
 Kuhner M: Coalescent genealogy samplers: windows into population history. Trends Ecol Evol 2008,24(2):8693.View ArticlePubMedGoogle Scholar
 Huelsenbeck JP, Ronquist F: MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics 2001,17(8):754755. 10.1093/bioinformatics/17.8.754View ArticlePubMedGoogle Scholar
 Ronquist F, Huelsenbeck JP: MrBayes 3: phylogenetic inference under mixed models. Bioinformatics 2003,19(12):15721574. 10.1093/bioinformatics/btg180View ArticlePubMedGoogle Scholar
 Drummond AJ, Rambaut A: BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol 2007, 7: 214. 10.1186/147121487214PubMed CentralView ArticlePubMedGoogle Scholar
 Hey J, Nielsen R: Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc Natl Acad Sci USA 2007, 104: 27852790. 10.1073/pnas.0611164104PubMed CentralView ArticlePubMedGoogle Scholar
 Kuhner M: LAMARC 2.0: maximum likelihood and Bayesian estimation of population parameters. Bioinformatics 2006,22(6):768770. 10.1093/bioinformatics/btk051View ArticlePubMedGoogle Scholar
 Griffiths RC, Marjoram P: Ancestral inference from samples of DNA sequences with recombination. J Comp Biol 1996,3(4):479502. 10.1089/cmb.1996.3.479View ArticleGoogle Scholar
 Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 1981, 17: 368376. 10.1007/BF01734359View ArticlePubMedGoogle Scholar
 Green PJ: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 1995,82(4):711732. 10.1093/biomet/82.4.711View ArticleGoogle Scholar
 Metropolis N, Rosenbluth AW, Resonbluth MN, Teller AH, Teller E: Equations of state calculations by fast computing machines. J Chem Phys 1953,21(6):10871092. 10.1063/1.1699114View ArticleGoogle Scholar
 Rambaut A, Grassly N: SeqGen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics 1997,13(3):235238. 10.1093/bioinformatics/13.3.235View ArticleGoogle Scholar
 Sabeti PC, Reich DE, Higgins JM, Levine HZP, Richter DJ, et al.: Detecting recent positive selection in the human genome from haplotype structure. Nature 2002, 419: 832837. 10.1038/nature01140View ArticlePubMedGoogle Scholar
 Voight BF, Kudaravalli S, Wen X, Pritchard JK: A Map of recent positive selection in the human genome. PLoS Biol 2006,4(3):e72. 10.1371/journal.pbio.0040072PubMed CentralView ArticlePubMedGoogle Scholar
 McVean G, Awadalla P, Fearnhead P: A coalescentbased method for detecting and estimation recombination from gene sequences. Genetics 2002, 160: 12311241.PubMed CentralPubMedGoogle Scholar
 Felsenstein J: PHYLIP (phylogeny inference package) version 3.69. Distributed by the J. Felsenstein. Seattle: Department of Genetics, University ofWashington; 1993.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.