Overall ARCHes method
Our approach begins with dividing the genome into a large number of small windows (e.g., 3–4 centimorgans each), such that, in a recently admixed individual, each of the maternal and paternal haplotypes in a given window are likely to each come from a single population. For each window, we construct a BEAGLE haplotype-cluster model [10] from a large, unlabeled training set of haplotypes. A BEAGLE haplotype-cluster model is a directed acyclic graph with haplotype represented as a path traversing the graph. Each node of the graph represents a cluster of haplotypes. A BEAGLE model is often interpreted as Markov model where the states are the nodes (Additional file 1: Fig. S1), and thus as an “arbitrary order Markov model” of SNPs along a haplotype. Using a reference panel of genotypes from individuals whose ancestry is known in each window, we then annotate each state in the haplotype models with the probability that genotype sequences from a given population belong to the haplotype cluster represented by the state (Fig. 3).
Given a new, potentially admixed genotype sequence \(x\), we assume that the ancestors of \(x\) all ultimately come from the \(K\) origin groups, and that \(x\) is admixed recently enough that haplotypes from each group are longer than the genomic windows mentioned above and thus much more likely to span an entire window than only part of one; that is, the size of the windows is chosen based on the expected age of admixture. We run a genome-wide hidden Markov model (HMM) whose hidden states are the true ancestry assignments (population label pairs) in each window. The emission probabilities are the probability distributions of diploid population assignments for each window arising from the annotated BEAGLE models, and the transition probabilities (the probability that the population assignment changes at any point along the genome) are learned through an expectation–maximization (E–M) algorithm. We assign diploid ancestry to each window and estimate the global assignment from the Viterbi path through this HMM. We also sample paths through the HMM to estimate the uncertainty of the assigned ancestry proportions.
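To make the decoding step concrete, the following is a minimal sketch of Viterbi decoding over per-window ancestry emission probabilities, written in Python with NumPy. The function and argument names are illustrative, not part of the ARCHes software, and the emission, transition, and stationary inputs are assumed to have been computed as described in the sections below.

```python
import numpy as np

def viterbi_ancestry_path(emission, transition, stationary):
    """Maximum-probability sequence of diploid ancestry states over windows.

    emission:   (W, S) array; emission[w, s] = P(t_w | p, q) for window w and
                unordered population pair s (Eq. 2 below).
    transition: (S, S) array of pair-to-pair probabilities (Eq. 3 below).
    stationary: length-S vector pi_{p,q} used for the first window.
    """
    W, S = emission.shape
    log_e = np.log(emission + 1e-300)
    log_t = np.log(transition + 1e-300)
    score = np.log(stationary + 1e-300) + log_e[0]
    back = np.zeros((W, S), dtype=int)
    for w in range(1, W):
        cand = score[:, None] + log_t     # cand[i, j]: best score ending in j via i
        back[w] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_e[w]
    path = [int(score.argmax())]          # trace back from the best final state
    for w in range(W - 1, 0, -1):
        path.append(int(back[w, path[-1]]))
    return path[::-1]
```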
We describe our detailed method in the following sections, and provide pseudocode in Additional file 1: Appendix S1.
Annotating haplotype cluster models
We follow Browning and Browning [10] in building haplotype cluster models. Briefly, we divide the genome into W partially overlapping windows with approximately the same number of SNPs. Within each window, we build a haplotype cluster model from a large, unlabeled set of training phased haplotypes. For simplicity, we restrict to biallelic variants, and code them as 0 and 1. Building this haplotype cluster model from a large, unlabeled set of individuals provides a “background” of haplotype diversity against which we can measure the informativeness of different haplotypes.
With a haplotype cluster model built for each window, we can then annotate populations using the haplotype cluster model. Recall that each path through a BEAGLE model corresponds to a realization of a haplotype, and each node at a given SNP represents a cluster of haplotypes that are similar near that SNP. For the genotypes of a reference individual in window w, \({x}_{w}\), we compute the probability that the individual’s two haplotypes pass through two specific nodes in the graph, u and v, at SNP d,
$${P}_{d}\left(u,v|{x}_{w}\right)=\frac{{P}_{d}\left({x}_{w},u,v\right)}{P\left({x}_{w}\right)}$$
where we compute \({P}_{d}\left(u,v|{x}_{w}\right)\) and \(P\left({x}_{w}\right)\) using a modification of the forward–backward algorithm for hidden Markov models, treating the node as a hidden state (see Additional file 1: Appendix S1 for pseudocode). In the following, we refer to the HMM used to analyze the BEAGLE models as the haplotype HMM, and to its properties as haplotype emission probabilities and haplotype probabilities. This contrasts with the ancestry HMM that we use to smooth ancestry estimates across the genome, described in the subsequent section.
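As an illustration of the haplotype HMM, the sketch below computes the node-pair posteriors \({P}_{d}\left(u,v|{x}_{w}\right)\) and \(P\left({x}_{w}\right)\) with a forward–backward pass over a simplified, levelled haplotype-cluster model in which each node at SNP d carries a single allele (an assumption we make here for brevity; in BEAGLE models alleles are carried on edges). All names are ours, not the published implementation.

```python
import numpy as np

def diploid_node_posteriors(geno, init, trans, alleles):
    """Forward-backward over node pairs (u, v) of a levelled haplotype model.

    geno:    length-D array of unphased genotypes (0/1/2, or -1 for missing).
    init:    distribution over nodes at the first SNP.
    trans:   list of D-1 matrices; trans[d][u, u'] = P(node u' at d+1 | u at d).
    alleles: list of D vectors; alleles[d][u] = allele (0 or 1) carried by node u.
    Returns [P_d(u, v | x_w) for d in 0..D-1] and P(x_w).
    """
    D = len(geno)

    def emit(d):
        dosage = alleles[d][:, None] + alleles[d][None, :]  # allele count of (u, v)
        if geno[d] < 0:                   # missing genotype: sum over all values
            return np.ones_like(dosage, dtype=float)
        return (dosage == geno[d]).astype(float)

    fwd = [init[:, None] * init[None, :] * emit(0)]         # forward pass
    for d in range(D - 1):
        fwd.append((trans[d].T @ fwd[d] @ trans[d]) * emit(d + 1))

    bwd = [None] * D                                        # backward pass
    bwd[-1] = np.ones_like(fwd[-1])
    for d in range(D - 2, -1, -1):
        bwd[d] = trans[d] @ (emit(d + 1) * bwd[d + 1]) @ trans[d].T

    px = fwd[-1].sum()                                      # P(x_w)
    return [f * b / px for f, b in zip(fwd, bwd)], px
```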
We then marginalize over one of the haplotypes of each diploid to create a haplotype posterior probability that a haplotype underlying the genotypes \({x}_{w}\) in window w passes through node u at SNP d,
$${P}_{d}\left(u|{x}_{w}\right)={\sum }_{v}{P}_{d}\left(u,v|{x}_{w}\right)$$
Finally, we annotate a node u by its average haplotype probability in a set of individuals belonging to a reference population p, \({R}_{p}=\{{x}_{i,p,w}: i=1,2,\dots ,{n}_{p}\}\), where \({n}_{p}\) is the total number of reference samples in population p. Then, we compute
$${P}_{d}\left(u|p\right)=\frac{1}{{n}_{p}}{\sum }_{i=1}^{{n}_{p}}{P}_{d}\left(u|{x}_{i,p,w}\right)$$
(1)
This equation gives us the probability that an individual drawn from population p will pass through node u at SNP d of the haplotype cluster model for window w.
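Given these per-individual node posteriors, the annotation of Eq. (1) amounts to marginalizing over the second haplotype and averaging over the reference individuals of each population. A minimal sketch, reusing the hypothetical diploid_node_posteriors() helper from above and assuming reference_panel maps each population label to its list of window-w genotype vectors:

```python
import numpy as np

def annotate_window(reference_panel, init, trans, alleles):
    """annot[p][d][u] approximates P_d(u | p) for every reference population p."""
    annot = {}
    for p, genotype_list in reference_panel.items():
        per_sample = []
        for geno in genotype_list:
            post, _ = diploid_node_posteriors(geno, init, trans, alleles)
            # marginalize over the second haplotype: P_d(u | x) = sum_v P_d(u, v | x)
            per_sample.append([m.sum(axis=1) for m in post])
        # Eq. (1): average over the n_p reference samples of population p
        annot[p] = [np.mean([s[d] for s in per_sample], axis=0)
                    for d in range(len(alleles))]
    return annot
```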
During the annotation process, we may choose to downsample the genotypes of the reference panel by setting some genotypes at random to ‘missing’ and annotating states of the model by summing over the possible genotypes at those locations. Doing this has the effect of annotating states that represent haplotypes that are similar to those of a reference genotype, but not exactly the same, and is intended to boost performance in reference panels that have few representative examples. We may use the same reference panel individual several times in the annotation process, with a different downsampled genotype each time.
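A sketch of this downsampling step, using the same convention as above that -1 marks a missing genotype; the resampling count and missingness rate shown are the settings reported in the ARCHes parameters section below.

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample(geno, missing_rate=0.2):
    """Randomly mask a fraction of genotypes as missing (-1); the haplotype
    forward-backward pass then sums over the possible genotypes at those sites."""
    geno = np.asarray(geno).copy()
    geno[rng.random(geno.shape) < missing_rate] = -1
    return geno

# each reference genotype can be reused many times with a fresh random mask,
# e.g. 100 downsampled copies per reference sample at 20% missingness
geno = np.array([0, 1, 2, 1, 0, 2])      # toy window genotypes
resampled = [downsample(geno) for _ in range(100)]
```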
Ancestry emission probabilities for test individuals in windows
With Eq. (1) in hand, we can compute the probability that a test individual’s genotypes in a given window w descend from a specific pair of populations. Letting t be the unphased genotype of our test individual, we first compute the probability of t given that the two haplotypes in window w belong to clusters u and v of the haplotype cluster model at SNP d,
$${P}_{d}\left({\mathbf{t}}_{w}|u,v\right)=\frac{{P}_{d}\left({t}_{w},u,v\right)}{{P}_{d}\left(u,v\right)},$$
where \({P}_{d}\left({\mathbf{t}}_{w},u,v\right)\) is computed using the haplotype forward–backward algorithm and \({P}_{d}\left(u,v\right)\) is obtained by multiplying the transition matrices of the haplotype cluster model up to SNP d (equivalent to running the haplotype forward algorithm up to SNP d with all haplotype emission probabilities set equal to 1).
We then want to know the probability that the individual’s two haplotypes come from populations p and q using the information around SNP d. We compute this quantity by first computing the probability that the individual’s haplotypes pass through nodes u and v at SNP d of window w given underlying populations p and q, averaging over the two equally likely assignments of node u to population p and node v to population q, or vice versa,
\({P}_{d}\left(u,v|p,q\right)=\frac{1}{2}\left({P}_{d}\left(u|p\right){P}_{d}\left(v|q\right)+{P}_{d}\left(u|q\right){P}_{d}\left(v|p\right)\right)\).
Note that this result is equivalent to assuming that the two haplotype clusters that make up a diploid sample are independent, and that the two populations that make up those haplotypes are also independent.
Now, we use the law of total probability to average over all haplotype clusters at SNP d, and compute the probability that the individual’s haplotype clusters at that point arise from populations p and q,
\({P}_{d}\left({\mathbf{t}}_{w}|p,q\right)={\sum }_{u,v}{P}_{d}\left({\mathbf{t}}_{w}|u,v\right){P}_{d}\left(u,v|p,q\right)\).
This probability weighs similarity to haplotypes in populations p and q more strongly for SNPs closest to SNP d in window w. Because we have no a priori knowledge of which part of a window is most informative about population membership, we finally compute our ancestry emission probability for a window by averaging the population probability over every SNP in the window,
$$P\left({\mathbf{t}}_{w}|p,q\right)=\frac{1}{D}{\sum }_{d}{P}_{d}\left({\mathbf{t}}_{w}|p,q\right)$$
(2)
where D is the total number of SNPs in window w. This process can then be repeated for every window in the genome to obtain the probability of the test individual’s genotype in each window, given that the two haplotypes arose from any pair of populations p and q.
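Putting the pieces together, the following sketch computes the window emission of Eq. (2) for every unordered population pair. It assumes the haplotype forward–backward outputs (post_uv, pair_prior, and px) and the annotations annot from the sketches above; the loop structure and names are ours.

```python
import numpy as np

def window_emission(post_uv, pair_prior, px, annot, pops):
    """Ancestry emission P(t_w | p, q) for every unordered population pair.

    post_uv:    list of D matrices P_d(u, v | t_w) from the haplotype HMM.
    pair_prior: list of D matrices P_d(u, v) (forward pass with emissions = 1).
    px:         P(t_w) from the haplotype forward pass.
    annot:      annot[p][d][u] = P_d(u | p) from Eq. (1).
    """
    D = len(post_uv)
    emission = {}
    for i, p in enumerate(pops):
        for q in pops[i:]:
            total = 0.0
            for d in range(D):
                # P_d(t_w | u, v) = P_d(t_w, u, v) / P_d(u, v)
                lik_uv = post_uv[d] * px / (pair_prior[d] + 1e-300)
                # P_d(u, v | p, q), symmetrized over which haplotype is which
                pq = 0.5 * (np.outer(annot[p][d], annot[q][d])
                            + np.outer(annot[q][d], annot[p][d]))
                total += np.sum(lik_uv * pq)   # law of total probability over (u, v)
            emission[(p, q)] = total / D       # average over SNPs, Eq. (2)
    return emission
```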
Smoothing ancestry estimates using a genome-wide ancestry hidden Markov model
In principle, the ancestry emission probabilities computed in the previous section could be used to compute maximum likelihood estimates of diploid local ancestry in each window, one at a time. However, doing so would result in highly noisy ancestry estimates. Instead, we share information across the genome using an additional layer of smoothing via a genome-wide hidden Markov model (Fig. 4). Moreover, because ancestry segments from recent admixture are expected to be longer than a single window, this model helps reduce false ancestry transitions.
If we wish to assign ancestry to \(K\) populations, the hidden states of our hidden Markov model are the \(\binom{K}{2}+K\) possible unphased ancestry pairs, \(\left(p,q\right)\), with ancestry emission probabilities for window w given by Eq. (2). Because we model unphased diploid ancestry, we define a population pair as unordered, i.e., \(\left(p,q\right)\) is the same ancestry assignment as \(\left(q,p\right)\). Our ancestry hidden Markov model assumes that between windows the ancestry of one of the two haplotypes can change with probability \(\tau\). The assumption that ancestry switches on only one of the two haplotypes within an individual at a time is both biologically realistic (assuming individuals are admixed relatively recently) and greatly reduces the complexity of the hidden Markov model. Thus, a change from \(\left(p,q\right)\) can occur only to a pair \(\left(p^{\prime},q^{\prime}\right)\) in which exactly one of \(p^{\prime}\) or \(q^{\prime}\) differs from \(p\) or \(q\). Each new ancestry pair is drawn with probability proportional to the stationary probability of that ancestry pair, \({\pi }_{p,q}\). In full, the transition probabilities are
$$P\left(p^{\prime},q^{\prime}|p,q\right)=\left\{\begin{array}{ll}1-\tau & \text{if}\ p^{\prime}=p,\ q^{\prime}=q\\ \tau \dfrac{{\pi }_{p^{\prime},q^{\prime}}}{{Z}_{p,q}} & \text{if}\ p^{\prime}\ne p,\ q^{\prime}=q\ \text{or}\ p^{\prime}=p,\ q^{\prime}\ne q\\ 0 & \text{otherwise}\end{array}\right.$$
(3)
where the normalizing constant \({Z}_{p,q}\) is given by summing \({\pi }_{p^{\prime},q^{\prime}}\) over all unphased ancestry pairs accessible from \(\left(p,q\right)\) in one switch.
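For illustration, the transition matrix of Eq. (3) over the \(\binom{K}{2}+K\) unordered ancestry pairs can be assembled as in the sketch below; the neighbour test (the two pairs share exactly one haplotype ancestry, counting multiplicity) and the function names are ours.

```python
import numpy as np
from collections import Counter
from itertools import combinations_with_replacement

def one_switch(a, b):
    """True if unordered pair b is reachable from a by changing the ancestry
    of exactly one of the two haplotypes."""
    return sum((Counter(a) & Counter(b)).values()) == 1

def transition_matrix(K, pi_pairs, tau):
    """pi_pairs: length-S vector of stationary pair probabilities, in the order
    of `states`; tau: per-window switch probability (Eq. 3)."""
    states = list(combinations_with_replacement(range(K), 2))   # S = K(K+1)/2
    S = len(states)
    T = np.zeros((S, S))
    for i, a in enumerate(states):
        neigh = [j for j, b in enumerate(states) if one_switch(a, b)]
        Z = sum(pi_pairs[j] for j in neigh)      # normalizing constant Z_{p,q}
        T[i, i] = 1.0 - tau
        for j in neigh:
            T[i, j] = tau * pi_pairs[j] / Z
    return states, T
```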
Between chromosomes, both ancestry pairs are allowed to change, and the ancestry at the start of each chromosome is drawn independently from that individual’s global distribution of ancestry pairs, \({\pi }_{p,q}\). For a more formal description of how changes between chromosomes are handled, see Additional file 1: Appendix S1.
We initialize the \({\pi }_{p,q}\) to a uniform distribution and \(\tau\) to some low value, and use a modified Baum-Welch algorithm to update \({\pi }_{p,q}\) and \(\tau\) (see Additional file 1: Appendix S1). Empirically, we observed a tendency to overfit by estimating a large \(\tau\) parameter, resulting in inference of a large number of different ancestries; thus we run a fixed number of update steps, rather than stopping at convergence.
Estimating ancestry proportions in individuals
In principle, the value \({\pi }_{p}={\sum }_{q}{\pi }_{p,q}\) could be used as an estimate of the admixture proportion from population \(p\) in an individual. However, we instead opt for a path-based approach that also allows us to obtain credible intervals of the ancestry proportions conditioned on the inferred parameters. Specifically, we provide a point estimate of global ancestry proportions by computing the maximum-probability path through the HMM using the Viterbi algorithm and computing the proportion of windows (weighted by their length) that are assigned to population \(p\). We then provide a credible interval by sampling paths from the posterior distribution over paths and computing the ancestry proportion for each sampled path in the same way as for the Viterbi path.
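A sketch of this path-based estimate: one path is drawn from the posterior by forward filtering and backward sampling, and each path is converted into length-weighted ancestry proportions (the Viterbi point estimate is handled analogously with the decoder sketched earlier). The number of draws mentioned in the closing comment is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_ancestry_path(emission, T, pi0):
    """Draw one state path from the posterior of the ancestry HMM.
    emission: (W, S); T: (S, S) transition matrix of Eq. (3); pi0: start dist."""
    W, S = emission.shape
    fwd = np.zeros((W, S))
    f = pi0 * emission[0]
    fwd[0] = f / f.sum()
    for w in range(1, W):                  # forward filtering
        f = (fwd[w - 1] @ T) * emission[w]
        fwd[w] = f / f.sum()
    path = np.zeros(W, dtype=int)          # backward sampling
    path[-1] = rng.choice(S, p=fwd[-1])
    for w in range(W - 2, -1, -1):
        p = fwd[w] * T[:, path[w + 1]]
        path[w] = rng.choice(S, p=p / p.sum())
    return path

def ancestry_proportions(path, states, window_lengths, K):
    """Length-weighted fraction of windows assigned to each population; each
    diploid state (p, q) contributes half its window weight to p and to q."""
    prop = np.zeros(K)
    for w, s in enumerate(path):
        p, q = states[s]
        prop[p] += 0.5 * window_lengths[w]
        prop[q] += 0.5 * window_lengths[w]
    return prop / prop.sum()

# repeating the sampling (e.g. a few hundred draws) and taking quantiles of the
# resulting proportions gives a credible interval around the Viterbi estimate
```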
Below we describe the experiments we performed to benchmark ARCHes against RFMix [9].
Reference panel and testing data
We build our reference panel using proprietary genotype data from candidates who explicitly provided prior consent to participate in this research project and whose family lineages all trace back to the same geographic region. All candidates were genotyped on Ancestry’s SNP array and passed through a quality control pipeline that removes samples with low genotype call rates, samples genetically related to each other, and samples that appear as outliers from their purported population of origin based on Principal Component Analysis. The reference panel contains 11,051 samples, representing ancestry from 32 global regions (Additional file 1: Table S1). We then use 1705 individuals from 15 populations in the 1000 Genomes [12] and HGDP [13] projects as testing data, limiting their SNP array data to the roughly 300,000 SNPs that overlap with Ancestry’s SNP array. Lists of populations and associated sample counts included in the reference panel and testing data are given in Additional file 1: Tables S1 and S2, respectively. We align populations that come from different data sources, in some cases combining populations. For example, we combined the ancestries assigned to ‘England, Wales, and Northwestern Europe’ and ‘Ireland & Scotland’ to represent ancestry for ‘Britain’, and the ancestries assigned to ‘Benin & Togo’ and ‘Nigeria’ to represent ancestry for ‘Yoruba’.
Simulation
We simulate 100 individuals with an admixture history similar to that of modern Latinos, with admixture 12 generations ago at 45% Native American, 50% European, and 5% African ancestry. We constructed 100 12-generation pedigrees and randomly selected founders from the reference panel in the ratio of 45% Native American (from the Maya and Peru regions), 50% European (from the France, Britain, Italy, Spain, and Finland regions), and 5% African (from the Yoruba region). We then simulate the DNA recombination process and obtain the genotypes of the final descendant of each pedigree, which are admixed at roughly 45% Native American, 50% European, and 5% African ancestry.
We simulate genomes of admixed individuals with ancestors from a pair of neighboring populations by constructing pedigrees in which 1000 Genomes and HGDP test examples serve as the two parents, four grandparents, eight great-grandparents, or 16 great-great-grandparents, and the admixed example we evaluate is the lone final descendant of that pedigree. The examples in this test set are, on average, 50%–50%, 25%–75%, 12.5%–87.5%, or 6.25%–93.75% admixed. We simulate 20 individuals for each of the 16 different pairings and 4 different levels of admixture, with half of them representing a minority admixture from one region and half representing a minority admixture from the other region.
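To illustrate the pedigree-based simulations, the simplified sketch below drops population labels down a pedigree on a single chromosome, with Poisson crossovers and no interference. The founder labels, map length, and region names are placeholders; the actual simulations use real reference panel or test genotypes and the HapMap genetic map.

```python
import numpy as np

rng = np.random.default_rng(2)

def meiosis(hap_a, hap_b, length_cM):
    """One meiosis: Poisson crossovers along the genetic map splice together the
    two parental haplotypes, stored as lists of (start_cM, end_cM, label)."""
    cuts = np.sort(rng.uniform(0, length_cM, rng.poisson(length_cM / 100.0)))
    bounds = np.concatenate(([0.0], cuts, [length_cM]))
    parents = [hap_a, hap_b] if rng.random() < 0.5 else [hap_b, hap_a]
    child = []
    for i, (lo, hi) in enumerate(zip(bounds[:-1], bounds[1:])):
        for start, end, label in parents[i % 2]:
            if end > lo and start < hi:              # segment overlaps this interval
                child.append((max(start, lo), min(end, hi), label))
    return child

def simulate_descendant(founder_labels, length_cM=250.0):
    """founder_labels holds one population label per founder haplotype
    (2**(g+1) labels for admixture g generations ago); each round of meioses
    halves the number of haplotypes until only the descendant's two remain."""
    haps = [[(0.0, length_cM, lab)] for lab in founder_labels]
    while len(haps) > 2:
        haps = [meiosis(haps[i], haps[i + 1], length_cM)
                for i in range(0, len(haps), 2)]
    return haps

# e.g. founders drawn with the Latino-like ratios used above (12 generations):
# labels = rng.choice(["NatAm", "EUR", "AFR"], size=2 ** 13, p=[0.45, 0.50, 0.05])
# descendant_haplotypes = simulate_descendant(list(labels))
```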
Since RFMix requires phased haplotypes for both query and reference individuals, we use Eagle [14] v2 with the HRC [15] reference panel to obtain phased haplotypes of the simulated individuals as well as of the individuals in the reference panel. ARCHes, in contrast, requires only unphased, diploid genotype sequences for both query and reference individuals.
RFMix parameters
We first used default parameters in RFMix v2.03-r0 (https://github.com/slowkoni/rfmix). We then performed a parameter sweep over the number of generations since admixture (the -G parameter), with values of 2, 4, 6, and 8, coupled with different window sizes (setting both the conditional random field window size and the random forest window size) of 0.2 cM, 0.5 cM, 100 SNPs (roughly 1 cM), and 300 SNPs (roughly 3 cM), on chromosome 1 of the simulated pairwise admixed individuals. We then selected the best-performing parameters, namely 4 generations since admixture and a window size of 0.2 cM, and ran RFMix on the whole genome of the simulated pairwise admixed individuals. For the simulated Latino individuals, we used 12 generations since admixture and a window size of 0.2 cM. For single-origin individuals, we used 2 generations since admixture and a window size of 0.2 cM. None of the RFMix runs used the E-M procedure or phase error correction.
Note that for both RFMix and ARCHes, we use the HapMap [16] genetic recombination map for GRCh37 to estimate recombination distance.
ARCHes parameters
We divide the genome into 3882 windows of 80 SNPs each, overlapping by 5 SNPs (with some adjustments made near chromosome boundaries). We build a haplotype model for each of these windows from a separate cohort of 50,000 haplotypes selected from the Ancestry database that are not already in the population reference panel. We phase these genotypes with Eagle [14], although we do not find that the particular phasing method, or even the diversity of this cohort, has a measurable impact on the accuracy of our approach. We tie small groups of 3–4 windows together by disallowing population assignment transitions within those groups, which lets us set the granularity at which we make local population assignments (there are 1001 such window groups) and has the benefit of increased computational efficiency. ARCHes's haplotype model annotation process is robust to missing data, which is handled by marginalizing over all possible genotypes. In fact, the annotations may benefit from intentionally downsampling reference panel genotypes so that variations on the reference haplotypes are considered as well; the amount of downsampling and the number of downsampled genotypes used for annotation are tunable parameters of the annotation process. In our experiments, we sample each reference panel genotype sequence 100 times, each time setting 20% of genotypes to missing, and annotate the 3882 haplotype models with them. This training process takes approximately 15 min to build each haplotype model and 15 min to annotate it, although it is parallelizable and need not be repeated regardless of the size of the test set. We set the initial \(\tau\) parameter to 0.01 and learn it using 10 iterations of the E-M approach described above. ARCHes assigns diploid local ancestry to the 1001 window groups of the genome, and the global ancestry estimates are summarized from these 1001 window groups.
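For reference, the settings described in this section can be collected into a plain configuration object; the key names below are ours and are not options of the ARCHes software.

```python
arches_settings = {
    "n_windows": 3882,                 # windows of 80 SNPs, overlapping by 5 SNPs
    "snps_per_window": 80,
    "window_overlap_snps": 5,
    "training_haplotypes": 50_000,     # unlabeled haplotypes used to build models
    "windows_per_group": (3, 4),       # transitions disallowed within a group
    "n_window_groups": 1001,
    "annotation_resamples": 100,       # downsampled copies per reference genotype
    "annotation_missing_rate": 0.20,
    "initial_tau": 0.01,
    "em_iterations": 10,
}
```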