Legofit: estimating population history from genetic data

Background Our current understanding of archaic admixture in humans relies on statistical methods with large biases, whose magnitudes depend on the sizes and separation times of ancestral populations. To avoid these biases, it is necessary to estimate these parameters simultaneously with those describing admixture. Genetic estimates of population histories also confront problems of statistical identifiability: different models or different combinations of parameter values may fit the data equally well. To deal with this problem, we need methods of model selection and model averaging, which are lacking from most existing software. Results The Legofit software package allows simultaneous estimation of parameters describing admixture, and the sizes and separation times of ancestral populations. It includes facilities for data manipulation, estimation, analysis of residuals, model selection, and model averaging. Conclusions Legofit uses genetic data to study the history of a subdivided population. It is unaffected by recent history and can therefore focus on the deep history of population size, subdivision, and admixture. It outperforms several statistical methods that have been widely used to study population history and should be useful in any species for which DNA sequence data is available from several populations.

one statistic that range from 50 to 600%, depending on the separation time of Neanderthals and Denisovans. Petr et al. [11] show that similar bias in another statistic underlies an apparent (but artifactual) decline in the frequency of Neanderthal DNA in Europe during the past 45,000 years. To avoid these biases, one must simultaneously estimate the parameters that underlie them.
In addition to bias, there are also problems of statistical identifiability, which arise when several models fit the data equally well. Identifiability problems can lead us to prefer incorrect models of history, and they can make confidence intervals unrealistically narrow. Consequently, it is likely that some of the recent findings summarized above are incorrect.
The Legofit package [12,13] introduces methods that address these problems. It reduces bias by allowing simultaneous estimation of the parameters that introduce bias into competing estimators. It uses model selection and model averaging to cope with identifiability problems, and it uses residual analysis to diagnose misspecified models. This article will not attempt a comprehensive review of genetic methods for estimation of population history.
Instead, it will describe Legofit and compare it against several methods that are widely used in the study of archaic admixture.

Nucleotide site patterns
Legofit works with the frequencies of nucleotide site patterns [14,15], which are defined below. The first step in any analysis involves tabulating site pattern frequencies from data. Legofit provides tools that tabulate these frequencies from standard data formats and also from several forms of simulation output.
Site patterns are illustrated in Fig. 1. A nucleotide site exhibits the yn site pattern if random nucleotides drawn from populations Y and N carry the derived allele, but those drawn from other populations carry the ancestral allele. They represent the special case of the site frequency spectrum [16] in which the sample consists of one haploid genome per population.
Many different gene trees-even trees with different topologies-may contribute to any given site pattern. Nonetheless, let us begin with a particular gene tree, which is shown in Fig. 1. There we see a population network and, embedded within it, the gene tree (or gene genealogy) of one particular locus (nucleotide site). A mutation on the red branch would generate yn, whereas one on the blue branch would generate ynd. Mutations elsewhere would generate other site patterns. Let B i represent the length in generations of the branch generating site pattern i. For example, B yn is the length of the red branch Fig. 1 Population network with embedded gene tree. A mutation on the solid red branch would generate site pattern yn (shown in red at the base of the tree). One on the solid blue branch would generate ynd. "0" and "1" represent the ancestral and derived alleles. Key: X, Africa; Y, Eurasia; N, Neanderthal; D, Denisovan in Fig. 1 and B ynd is the length of the blue branch. The gene tree will vary from locus to locus, and in any given gene tree many of these lengths will be zero. For example, B xy = 0 in Fig. 1, because no single mutation on that gene tree could generate site pattern xy.
At a particular locus, and conditional on B i , the number of mutations on the branch generating pattern i is Poisson with mean uB i , where u is the mutation rate per nucleotide site per generation. We use the model of infinite sites [17], which assumes that u is small enough that we can ignore the possibility of multiple mutations on a given branch. To this standard of approximation, the unconditional probability of site pattern i on a random gene tree is uE[B i ], where the expectation is with respect to the coalescent process constrained by the network of populations.
Let I i represent the count of site pattern i across all sequenced nucleotide positions. Its expected value is where L is the number of nucleotide positions in the sequence. The probability that a particular polymorphic site exhibits pattern i is where is the set of site patterns under study.
In previous publications [10,18] we and others have derived analytical expressions for E[B i ] under particular models of history. This analytical approach becomes difficult as models grow in complexity. Legofit relies instead on computer simulations, which make it feasible to deal with complex models of history. In each iteration of the simulation, the coalescent algorithm builds a gene genealogy analogous to the one in Fig. 1 This approach simulates branch lengths but not mutations, and the simulations can be done in parallel. For a given level of accuracy, it is orders of magnitude faster than programs that simulate both mutation and recombination, as shown in the Additional file 1. This speed makes it possible to deal with the entire suite of site patterns and with complex models involving tens of populations. Nonetheless, this is still a computationally intensive approach. In a recent analysis [19], we studied nine different models. This took 10 days to do but would have taken 12 years without parallel processing. This 440-fold speed-up was possible because the calculations were parallelized not only across cores on each compute node, but also across nodes on the cluster at our local Center for High-Performance Computing. The legofit program parallelizes automatically across cores. Section 4 of the Additional file 1 describes methods for parallel processing on a cluster.
To validate our numerical approach to estimating probabilities, we compared it with theoretical results in models for which analytical theory is feasible [10]. We can also validate by comparing the expected values generated by our method to data simulated in other ways. This is done in Fig. 2, which shows that all three simulators generate distributions of site pattern frequencies that are centered around the expected values estimated by legofit. This verifies the reliability of our approach.

Models of history
A model of population history is specified in a file whose name ends with ".lgo. " This file specifies the population network and the location of genetic samples within it. It uses a flexible syntax to describe population histories of arbitrary complexity. Populations can separate, combine, exchange migrants, and change in size. Changes in population size occur in discrete steps, and episodes of gene flow are modeled as discrete events, but there is no limit on the number of steps or episodes of gene flow. A model with K samples generates 2 K − 2 site patterns. For example, 10 samples would generate 1022 site patterns, which would provide a rich basis for estimating parameters.
Parameters fall into three categories: (1) free parameters are estimated by legofit; (2) fixed parameters have values that do not change; and (3) constrained parameters are specified as known functions of one or more other parameters. Constrained parameters model relationships among variables. We use them below to reexpress free variables in terms of principal components.

Tabulating site patterns from data
The first stage of analysis involves tabulating site patterns from DNA sequence data. These data need not be phased, but they should be free of ascertainment bias. In the discussion above, I assumed that one haploid genome is sampled from each population. Real samples are larger, and a given nucleotide site may contribute to several site patterns. The contribution to a given site pattern is the probability that a sub-sample, consisting of one haploid genome drawn at random from the larger sample of each population, would exhibit this site pattern. For example, consider a model with three populations, X, Y, and N, and let p iX , p iY , and p iN represent derived allele frequencies at the ith polymorphic site in the samples from these populations. Then site pattern xy occurs at site i with probability , p. S131). Aggregating over sites, I xy = i z i summarizes the information in the data about this site pattern. In general, for the jth site pattern, the analogous summary is I j . In this formulation I j is no longer a count. It is the expected count in a random subsample of the full sample.
The Legofit package includes programs for tabulating site patterns from data and from several publicly-available programs for coalescent simulation: ms [20], msprime [21], and scrm [22].

Estimation
Legofit estimates parameters by maximizing the composite likelihood, where P j is as given in Eq. 1, is the set of site patterns under study, and θ is a vector of free parameters. This is not the full likelihood, because it ignores linkage disequilibrium and treats nucleotide sites as though they were independent. Legofit uses a numerical algorithm-differential evolution (DE, [23])-to maximize L. DE maintains a swarm of points, which are initially distributed widely across the Fig. 2 Deviation from expected values in 50 data sets generated by each of three simulation programs: ms [20], msprime [21], and scrm [22]. All simulations assume the same model of history, which is illustrated in Fig. 1 and described fully in the Additional file 1. Expected values were calculated with legosim. Blue circles show 50 simulated data sets parameter space. In each generation, these points mutate and recombine to form offspring, which then undergo selection to form the next generation. The objective functions of the points are evaluated in parallel, in separate threads of execution. This process involves several stages, beginning with an initial stage in which the objective function is evaluated with modest precision and progressing to a final stage, which typically uses two million simulation replicates per function evaluation. This provides much more precision than a sample of two million polymorphic nucleotide sites, because we are simulating branch lengths only-not mutation or recombination. (See the Additional file 1 for details).

Bootstrap confidence intervals
The Legofit package uses a bootstrap [24] to measure uncertainty. Because linked loci are not statistically independent, we cannot use an ordinary bootstrap. Instead, Legofit uses a moving-blocks bootstrap [25], which resamples blocks of nucleotides. By default, each block consists of 500 polymorphic nucleotide sites.
Bootstrap replicates approximate independent samples from the stochastic process that produced the original data. By applying legofit to many bootstrap replicates, we obtain an approximation of the sampling distribution of the estimates. This distribution is used to estimate confidence intervals.
Each bootstrap replicate is analyzed by a separate instance of the legofit program. These instances can operate in parallel, on separate nodes of a compute cluster. Legofit is thus parallel in two senses: within each node, legofit uses multiple threads to parallelize across the points maintained by the DE algorithm. It also uses multiple nodes to parallelize across bootstrap replicates.

Model selection
The study of population history requires that we choose among complex, non-nested models. Better fits can usually be achieved with more complex models, but this improvement may be illusory-the consequence of fitting noise rather than signal. Overfitting, as this is called, can produce incorrect inferences about population history [26]. We may report evidence of gene flow or of bottlenecks in population size where no such inference is warranted. Reliable inference requires that we protect against overfitting. This is not possible with the genetic methods currently used to study archaic admixture.
In other statistical contexts, such problems might be addressed via tools such as Akaike's information criterion (AIC, [27]), or the Bayesian information criterion (BIC, [28]), which penalize complex models in a principled way. These tools, however, require access to the full likelihood function, which is never available for genomescale data sets. Because of the size and complexity of the nuclear genome, all statistical methods simplify the problem in some way. Legofit uses composite likelihood, which ignores genetic linkage and treats nucleotide sites as though they were statistically independent. This produces unbiased estimates but does not allow us to use AIC or BIC.

Bootstrap estimate of predictive error (bepe)
Bepe is analogous to cross-validation, but uses bootstrap replicates instead of partitions of the data. The first step in the process uses legofit to fit a given model to each bootstrap replicate. These runs report the predicted frequency of each nucleotide site pattern. Legofit's "bepe" program then calculates the mean squared difference between these bootstrap-predicted frequencies and those in the real data and applies a small bias correction. The resulting estimate of predictive error compares favorably with cross-validation ( [24], sec. 17.6). It is convenient, because we need bootstraps anyway for confidence intervals.

Composite likelihood information criterion (clic)
Clic generalizes Akaike's information criterion (AIC, [27]) to the case of composite likelihood. Varin and Vidoni ( [30], p. 523) define an information criterion that is the negative of I have reversed the sign so that we can select models by minimizing (rather than maximizing) clic. In this expression, L is composite likelihood (Eq. 2), θ is the vector of parameters, C is a matrix whose ijth entry is the sampling covariance between the ith and jth parameters, and H is the expectation of the negative of the Hessian matrix, and "tr" represents the matrix trace. I estimate C from covariances across bootstrap or simulation replicates. H is a matrix of expectations of secondorder partial derivatives of ln L with respect to pairs of parameters. Rather than taking these expectations, I evaluate the derivatives at the maximum composite likelihood estimate,θ [31]. Within a small neighborhood nearθ , ln L can be approximated by a quadratic surface, (4) where α is the Y intercept, and β i and γ ij are regression coefficients.
I estimate α, β i , and γ ij by ordinary least squares, using points in the neighborhood of the estimate,θ . Then H is assembled using the second-order derivatives of ln L, as implied by Eq. 4. Finally, C and H are used with Eq. 3 to calculate clic.

Bootstrap model averaging (booma)
Below, we will consider three models whose bepe values are 2.17 × 10 −7 , 5.54 × 10 −7 , and 6.17 × 10 −5 . The first model has the smallest value and is therefore preferred. But the other values are also small. Are we justified in ignoring them? To answer this question, let us consider the problem of model averaging.
When no model is clearly superior, it is better to average across several than to choose just one [32]. Otherwise, confidence intervals are misleadingly narrow because they ignore uncertainty about the model itself. In model averaging, individual models are assigned weights as discussed below. Parameters are estimated as the weighted average of estimates from individual models. Most authors rely on information criteria to provide the weights [33]. One could use clic in this way, but I prefer bootstrap model averaging [32], which works with either bepe or clic.
This method is implemented by the Legofit program "booma. " Some model selection criterion (bepe or clic) is calculated separately for the real data and for each bootstrap replicate. (To calculate bepe for a bootstrap replicate, we pretend that the replicate is real data and the real data are a bootstrap replicate.) If there are 50 bootstrap replicates, this process gives us 51 values of the model selection criterion for each model. For each of these 51 cases, booma asks which model "wins, " i.e., which has the lowest value of the criterion. The weight of the ith model is the fraction of cases in which it is the winning model.
Using these weights, booma averages across models to obtain a model-averaged estimate of each parameter. If a parameter is present in only a subset of the models, the weights are re-normalized so that they sum to unity across this subset. This averaging is applied not only to the real data but also to each bootstrap replicate. This allows us to estimate confidence intervals for model-averaged estimators.
If one model is clearly superior, its weight will be unity and those of the other models will be zero. This provides a simple criterion for choosing one model over its alternatives. For the three models mentioned at the top of this section, the weights were 1, 0, and 0. This implies that the differences among the bepe values are large compared to those expected in repeated sampling from the stochastic process that generated the original data. We are therefore justified in rejecting all models but the first. This analysis is described in more detail below. Figure 3 illustrates a problem of statistical identifiability, which arises frequently not only with Legofit, but with all methods that estimate complex population histories. Each panel in the figure is a bivariate scatterplot comparing two parameters. Each point indicates the estimated values of the two parameters in one simulation replicate. In several panels, the points fall along straight lines, indicating that the parameters are tightly correlated. These associations represent ridges in the composite likelihood surface and imply that our statistical problem has fewer dimensions than parameters. This does not lead to incorrect inferences, but it does broaden the confidence intervals of the parameters involved.

Identifiability and principal components
These problems can be ameliorated by reducing the dimension of the parameter space. The Legofit package includes pclgo, a program that calculates principal components from the bootstrap replicates and then uses these to re-express the free variables in terms of principal components. Predictive error (as measured by bepe) can be improved by excluding principal components with small eigenvalues. This usually tightens confidence intervals.
By default, pclgo merely re-expresses the free variables in terms of the principal components, and there is no reduction in dimension. To reduce dimensionality, the user must specify a tolerance criterion. The command pclgo -tol 0.001 would include only those components that explain at least a fraction 0.001 of the variance. Different choices of this tolerance criterion constitute different models, and we can choose among them using bepe or clic, together with booma. [10] document pronounced biases in the statistics that underlie our current understanding of archaic admixture. These biases are profound if there are multiple sources of admixture. To check for such bias in legofit, I simulate data under the model in Fig. 1, which allows gene flow into Eurasia (Y ) not only from Neanderthals (N), but also from Denisovans (D). Details of this model and of all the analyses below can be found in the Additional file 1. Here, I summarize results. Figure 4 shows the true parameter values (red crosses) and sampling distributions (blue circles) estimated using legofit from 50 independent simulation replicates. I used pclgo to reduce dimensionality. This involves excluding dimensions that explain less than some arbitrarily-chosen fraction of the variance. I considered three models: one in terms of the original variables (without using pclgo), one using principal components with no reduction of dimension, and one excluding components that explain less than a fraction 0.001 of the variance. The weights of these three models are 0, 0.42, and 0.58 using bepe and 0, 0.12, and 0.88 using clic. Thus, pclgo seems to improve estimates, especially when some principal components are excluded. Figure 4 shows the bepe version of the model-averaged estimates.

Rogers and Bohlender
All of the sampling distributions enclose the true parameter values, and several are reassuringly narrow. Nonetheless, some bias is evident in the distributions of Neanderthal admixture (m N ) and Denisovan admixture (m D ). The mean estimates of these parameters are closer together than are the true parameter values. This is because Neanderthals and Denisovans are sister populations, and it is hard to tell them apart. We get a better estimate of total archaic admixture, m N + m D , than of the difference, m N − m D .
For comparison with legofit's estimates of the admixture fraction, Fig. 5 shows the behavior of three previouslypublished estimators [2,3] that have been used to study archaic admixture in humans. Nea and den work by comparing the frequencies with which derived alleles are shared by pairs of samples from different populations. Nea has also been called R Neandertal [2]. Rogers and Bohlender [10] show that these estimators have large biases, especially when (as in the present model) a population receives gene flow from more than one source. Thus, it is no surprise that nea and den exhibit large biases in Fig. 5. Indeed, the black triangles show that the observed bias is in good agreement with theoretical expectations.
Many studies have cited an estimate that about 6% of Papuan DNA derives from Denisovans. This result is due to Meyer et al. [3], who inferred it using TreeMix [34]. However, these authors suspected that the result was biased, because their analysis excluded Neanderthals ( [3], supp. note 12). The TreeMix results in Fig. 5 should avoid this problem, because Neanderthals are included along with Denisovans and moderns from Africa and Eurasia. TreeMix was able to detect a signal of gene flow from Neanderthals into Eurasians. As the figure shows, however, its estimate of the admixture fraction was profoundly biased. TreeMix was unable to detect gene flow from Denisovans into Eurasians. This episode of gene flow did not appear in the output from any of the simulation replicates. Instead, TreeMix reported evididence of gene flow in various parts of the tree. These episodes of gene flow were not consistent from replicate to replicate and did not exist in the simulation model.

A worked example
In Fig. 4, we had the advantage of working with the true model of history. This is never the case with real data. Let us therefore consider how the analysis might proceed if we did not know the true model in advance. We would start by examining site pattern frequencies, which are shown in Fig. 6. The most common patterns (apart from singletons) are xy and nd, reflecting the shared ancestry of populations X and Y and of N and D. Let us therefore fit a model with a tree of form ((X, Y ), (N, D)). This model is misspecified, because it omits gene flow. The residuals of this model are shown in Fig. 7 along with those of a correctly-specified model. The misspecified model generates many residuals that are far from zero, and these discrepancies provide clues about what is wrong with the model. For example, note that the misspecified model has positive residuals for yn and ynd but a negative residual for y. This suggests that we should add N → Y gene flow to the model, because such gene flow inflates the first two of these site patterns but deflates the third. Table 1 compares the two models and shows that the one with N → Y gene flow is unambiguously better than the one without gene flow. However, the residuals of this new model (not shown) still show discrepancies, which might lead us to consider adding D → Y gene flow to the model. Table 2 shows that this third model is unambiguously better than the one with only one episode of gene flow. The residuals (right panel of Fig. 7) show that this model provides a good description of the data. In this example, the correct model was identifiable because the alternate models could not fully account for the pattern in the data.
There are also less tractable identifiability problems. Let us consider two. Figure 8 shows a model that is like that in the simulations (Fig. 1) but has an additional episode of gene flow from a "superarchaic" population (S) into Denisovans (D), as suggested by Prüfer et al [8]. When the superarchaic admixture fraction is zero, this model reduces to that used in our simulations. As expected, legofit's estimate of this parameter was very close to zero in all simulation replicates, and all other parameters were also well estimated. Consequently, this model provides an excellent fit to the data, comparable to that in the right panel of Fig. 7. Nonetheless, I expected bepe and Fig. 7 Residuals from misspecified and correctly-specified models. Each circle represents one of the simulated data sets in Fig. 3. The misspecified model ignores the two episodes of gene flow seen in Fig. 1 clic to prefer the correct model because of its simplicity. Instead, bepe and clic gave appreciable weight to both models but preferred the more complex one, as shown in Table 3. This did not lead to incorrect inferences, because all parameters were well estimated. Table 4 illustrates another identifiability problem. It compares the standard model ( Fig. 1) with one in which the order of the two admixture events is reversed: D → Y admixture precedes N → Y admixture. This change has little effect on site pattern frequencies, and all parameters are well estimated. I expected bepe and clic to weight these models roughly equally. The table shows that they do give appreciable weight to both models but prefer the (incorrect) reversed model. In another experiment (not shown), using ms instead of msprime, bepe gave 94% of the weight to the true model. Bepe and clic both behave sensibly when dealing with models that are indistinguishable or nearly so. In such cases, they tend to give appreciable weight to several models. We cannot assume, however, that they will always prefer the correct model.

Discussion
There are two reasons for studying site patterns rather than the full site frequency spectrum, the first of which involves statistical power at deep time scales. As we look backwards into the past, large samples coalesce rapidly to small collections of ancestors. For this reason, although large samples are essential for recent history, their value is limited in the distant past. Furthermore, the random-haploid samples used by legofit provide an advantage: they insulate the analysis from recent population history. If we had sampled several haploid genomes from population X in Fig. 1, then our model would need parameters describing changes in the size of X since its separation from Y. With legofit, these parameters aren't needed, because no coalescent events can occur until X and Y merge into their ancestral population. Thus, site pattern frequencies reduce the parameter count without losing much power at deep time scales. They allow us to study the deep history of multiple populations.

Conclusions
The Legofit package provides computer programs for estimating population histories. It uses the frequencies of nucleotide site patterns to summarize genetic data. The package includes programs that tabulate these frequencies, calculate their expected values, and use them to estimate parameters describing population history. It includes facilities for model selection and model averaging. It uses principal components to reduce the complexity of high-dimensional models of history. Legofit outperforms several methods that have been widely used to study archaic admixture in humans and should be useful in any species for which DNA sequence data is available from several populations.

Availability and requirements
Project name Legofit Project home page https://github.com/alanrogers/legofit Operating system Linux and macOS