AD-LIBS: inferring ancestry across hybrid genomes using low-coverage sequence data

Background Inferring the ancestry of each region of admixed individuals’ genomes is useful in studies ranging from disease gene mapping to speciation genetics. Current methods require high-coverage genotype data and phased reference panels, and are therefore inappropriate for many data sets. We present a software application, AD-LIBS, that uses a hidden Markov model to infer ancestry across hybrid genomes without requiring variant calling or phasing. This approach is useful for non-model organisms and in cases of low-coverage data, such as ancient DNA. Results We demonstrate the utility of AD-LIBS with synthetic data. We then use AD-LIBS to infer ancestry in two published data sets: European human genomes with Neanderthal ancestry and brown bear genomes with polar bear ancestry. AD-LIBS correctly infers 87–91% of ancestry in simulations and produces ancestry maps that agree with published results and global ancestry estimates in humans. In brown bears, we find more polar bear ancestry than has been published previously, using both AD-LIBS and an existing software application for local ancestry inference, HAPMIX. We validate AD-LIBS polar bear ancestry maps by recovering a geographic signal within bears that mirrors what is seen in SNP data. Finally, we demonstrate that AD-LIBS is more effective than HAPMIX at inferring ancestry when preexisting phased reference data are unavailable and genomes are sequenced to low coverage. Conclusions AD-LIBS is an effective tool for ancestry inference that can be used even when few individuals are available for comparison or when genomes are sequenced to low coverage. AD-LIBS is therefore likely to be useful in studies of non-model or ancient organisms that lack large amounts of genomic DNA. AD-LIBS can therefore expand the range of studies in which admixture mapping is a viable tool. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1613-0) contains supplementary material, which is available to authorized users.

. While conceptually appealing, this formulation quickly becomes computationally intractable when there is a large population size N or number of generations since admixture g.
To improve computational efficiency, continuous approximations to the genetic drift problem, based on the mathematics of diffusion, have been proposed. Kimura [2] made an influential contribution, but his solution did not account for the so---called "absorbing" states of loss and fixation. We implemented the solution to the pure drift equation, without selection or migration, given by McKane and Waxman [3]. Given a population size N, a number of generations t, and an initial allele (or recombination) frequency of 1/2N, the probability density of allele frequency x in the t th generation is given by In our implementation, we terminate the infinite sums when newly added terms are lower than 10 ---20 and we divide the allele frequency spectrum into 500 bins (or 2N---1 bins, if 2N---1 < 500), plus one bin each for the probabilities of allele loss and fixation. For any given number of generations t, we use the above equations to compute probability density at the upper and lower limit of each bin, obtain probabilities from probability density via the trapezoid rule, and map the probability of each bin to the frequency value at its midpoint, giving a vector of probabilities V and a vector of allele frequencies F, each with 502 entries corresponding to bins indexed by b, where 0, 501 . We then iterate over each generation 0 ≤ < , and compute V and F for an allele that arose in generation at .
With these values, the transition probabilities between the three ancestry states can be computed per site as in Table 8. To transform these into transition probabilities between windows, each can be multiplied by the window size w, so that the transition probabilities between ancestry states are as follows: Since many transition probabilities depend upon the window size w, we risk obtaining sums of transition probabilities from individual states that are greater than 1 when the window size is large. We take two steps to prevent this: first, we cap both the skip probability s and the probability of ending the sequence 1 at maximum values of 0.05. In our experience, this number is high enough to still allow the model to detect skipped windows and to end sequences in the appropriate places. Second, given the other model parameters, we calculate the maximum possible window size that will allow the sum of transition probabilities out of each state to fall below 1. If the user chooses a window size that exceeds this threshold, the program notifies the user and exits.
Our implementation also considers some of the unique properties of the X chromosome [4].
Since there are fewer copies of the X chromosome than any autosome in a given population, the X chromosome only recombines approximately 2 3 as often as the autosomes. We therefore build a different model on chromosome sequences or genomic scaffolds specified to belong to the X chromosome: on these sequences, the r parameter is taken to be 2 3 of its default, autosomal value: e = 2 3 . Furthermore, the z parameter quantifies genetic drift and thus depends on the population size N, but the effective population size of the X chromosome is 3 4 of the autosomal value, again owing to there being fewer copies of X chromosomes than autosomes in circulation in a population. We therefore recompute z using the same technique described earlier, but with g = 3 4 , to give e , used in place If the organisms under study follow the ZW rather than the XY sex determination system, the same concept holds, except that the "X chromosome sequences" supplied to AD---LIBS should be the names of sequences or genomic scaffolds belonging to the Z chromosome, and the "males" specified to AD---LIBS should actually be females.

Emission probability distributions: expectation
For a description of emission "scores" used by AD---LIBS, see Methods. In this section, we describe the expected distributions of these scores used by AD---LIBS.
The expected distribution of IBS tracts between two haplotypes depends only on the parameter , or average nucleotide diversity per site between those two haplotypes. We therefore, as a first step, compute the average genome---wide nucleotide diversity per site within population A, referred to as i , within population B, referred to as j , and between the two populations, referred to as ij . Given that describes how often one expects to see a nucleotide difference, 1 is the expected length of a haplotype before a difference is observed. IBS tract lengths tend to follow an exponential distribution with as a parameter. Since our model considers samples of IBS tracts within genomic windows, however, we expect our mean IBS tract lengths to follow a normal distribution with a mean equal to the expected value and a standard deviation equal to the expected sample standard deviation, with the expected number of samples equal to the window size times : In AD---LIBS, all three emission probability distributions are modeled as normal distributions, due to successful performance on training data. We also reserve a specific value to be used as a "skip" score; distributions are set such that the skip score has zero probability under all three emission probability distributions. We also create an emission probability distribution for the three skip states that is only capable of emitting this value.
As with the transition probabilities, sequences belonging to the X chromosome present an edge case in which changes to the model are necessary. Since the effective population size of the X chromosome is 3 4 that of the autosomes, we multiply i , j , and ij by 3 4 before calculating the emission probability distribution parameters.
Whereas the need to keep transition probabilities below 1 sets an upper bound on window size, our expected emission probability distributions set a lower bound on window size. If  Difference from true f (rounded) Overall accuracy

C D
AD---LIBS accuracy on simulated data, using incorrect population parameters. Simulations here used the "single---pulse" admixture model (see Figure 1 A, C, and E) except where otherwise noted, with 10kb windows, which were automatically adjusted by AD---LIBS as necessary. Asterisks denote accuracy significantly lower (p < 0.001) than that obtained using correct parameters. A: AD---LIBS accuracy using different prior population size estimates (true N = 3000) and correct number of generations since admixture (1000). B: AD---LIBS accuracy using different estimates for the number of generations since admixture (true g = 1000) and correct population size (3000). C: AD---LIBS accuracy using prior estimates of polar bear admixture proportion that differed from the true value, rounded to the nearest 10%. D: Same as C, but using the "migration" admixture model (see Figure 1 B, D, and F), which produced a wider range of true admixture proportions.

Figure S2
Results from downsampling four ABC Islands brown bears, three Scandinavian brown bears, and four polar bears to 0.5x, 1x, 2x, 5x, and 10x coverage along the longest genomic scaffold, running HAPMIX and AD---LIBS on the four ABC Islands bears at each coverage depth, and comparing these runs to results obtained from running both programs on the full---coverage versions of the same individuals. Each line represents an individual ABC Islands bear and each color represents a specific low coverage/full coverage comparison. A and B assess homozygous polar bear (AA) calls, C and D assess heterozygous (AB) calls, and E and F assess homozygous brown bear (BB) calls. A, C, and E measure the percent of low--coverage calls that were "correct" according to the high---coverage runs, while B, D, and F measure the percent of the high---coverage runs' calls that were also detected by the low--coverage runs. In almost every case, AD---LIBS is more consistent with itself than other comparisons. We also note that low---coverage AD---LIBS inferences of homozygous polar and brown bear ancestry are more often correct, according to HAPMIX run at full coverage, than HAPMIX run at low coverage (A and E). AD---LIBS may, however, erroneously call more windows heterozygous than HAPMIX does (C), leading to its missing some windows of homozygous polar (B) and brown bear (F) ancestry.