Ancestry inference using reference labeled clusters of haplotypes

Background We present ARCHes, a fast and accurate haplotype-based approach for inferring an individual’s ancestry composition. Our approach works by modeling haplotype diversity from a large, admixed cohort of hundreds of thousands, then annotating those models with population information from reference panels of known ancestry. Results The running time of ARCHes does not depend on the size of a reference panel because training and testing are separate processes, and the inferred population-annotated haplotype models can be written to disk and reused to label large test sets in parallel (in our experiments, it averages less than one minute to assign ancestry from 32 populations using 10 CPU). We test ARCHes on public data from the 1000 Genomes Project and the Human Genome Diversity Project (HGDP) as well as simulated examples of known admixture. Conclusions Our results demonstrate that ARCHes outperforms RFMix at correctly assigning both global and local ancestry at finer population scales regardless of the amount of population admixture. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04350-x.

. Average estimated ancestry proportions for single-origin individuals from each testing population. In this matrix figure, each row represents single-origin individuals from the testing population. Each column represents each of the possible 30 populations that the single-origin individuals might be assigned to. Concordance of global ancestry assignments and diploid local ancestry assignments for simulated admixed individuals from 16 different pairings of 11 populations. Admixture level 2 means each of two parents belongs to different populations (50%-50% admixture), admixture level 4 means one of four grandparents belongs to one population and the other three belong to the other population (roughly 25%-75% admixture), admixture level 8 means one of eight great-grandparents belongs to the first population (12.5%-87.5%), and level 16 means one of sixteen great-great-grandparents belong to the first population (6.25%-93.75%).     Appendix S1 A Implementation of BEAGLE Haplotype Models The haplotype models we annotate and use to compute the ancestry HMM emission probabilities are BEA-GLE [BB07] haplotype models, but they must be computed once and written to disk, and di↵er from the BEAGLE implementation in the following ways.
First, the transition probabilities are based on the haplotype counts observed in the training set, but smoothed so that all possible haplotypes have a nonzero probability. Specificially, the transition probability from a haplotype cluster corresponding to allele a is P (a) = 1 2 + n a n a + nā where n a is the count of haplotypes in the cluster with allele a and nā is the count with the alternative allele, and is a user-specified weight (we set = 10 4 ).
Second, because we build large haplotype models from hundreds of thousands of training examples, and the diploid-HMMs are quadratically larger than the haplotype models, we discard a portion of the lowestprobability states in the diploid-HMM state space after each step in the forward procedure, in order to make the procedure more e cient. Specifically, after computing the set of states (i.e., possible pairs of haplotype clusters) for each SNP d, we sort states by forward probability and discard the least likely states, but no more than a small proportion ✏ (we set ✏ = 10 6 ) of the probability mass at level d.
Third, in the haplotype model building procedure, we decide when to merge two haplotype clusters based on slightly di↵erent criteria than BEAGLE. Let n x and n y be the respective number of haplotypes in haplotype clusters X and Y , and let n x (h) and n y (h) be the observed occurrences of a haplotype h in X and Y , respectively. The frequency of h in X and Y is estimated to bep ny , respectively. BEAGLE will not merge two clusters if We reformulate the inequality as a hypothesis test based on Welch's t-test [Wel47] which is, in the form of the inequalities above, where C is a pre-defined constant. The concern with Welch's t-test is that it is too confident in its estimation of variance when the frequency estimate is close to 0 or 1. To avoid this problem, we regularize the frequency estimate using a symmetric beta distribution as a prior. Thus, we replace thep estimates with their posteriors:p and rewrite (6) as We use ↵ = = 1 2 and C 2 = 20.

B Pseudocode for Diploid HMM Forward and Backward Procedures
Algorithm 1 Diploid HMM forward procedure for a sequence x of D diploid genotypes (values are all homozygous 0 or 1, heterozygous, or missing) and a model M of D + 1 levels. M has a start state S, a transition function t(u, a) that maps a haplotype model state u to the state at the next level associated with the allele a transition (a 2 0, 1), and a transition probabilty function ⇢(u, a) that maps a haplotype model state u to the transition probability associated with allele a.
The procedure populates f , where f (u 1 , u 2 ) is the forward likelihood of a diploid HMM state (u 1 , u 2 ). It also stores states of the diploid HMM that are consistent with the genome at each level and their outgoing transitions (and the probabilities associated with those transitions) to a data structure ↵ (so that the genotype need not be re-examined during the backward procedure).
The optional subroutine Trim removes the diploid HMM states in a set with the lowest f values. It is often possible to remove a large proportion of states and yet keep (e.g., ) 99.9999% of the likelihood mass contained in the set of SNPs. We use Trim only for reasons of e ciency. 1: procedure Diploid-Forward(x, w, Mw) 2: Let Dw be the number of SNPs in Mw 3: Let W(x, w) be the subsequence of genotypes in x that correspond to the SNPs in window w.

4:
Let S be the start state of model Mw 5: Let t and ⇢ be Mw's transition functions, mapping a state to a state and probability, respectively 6: Let ↵(d) be an initially empty data structure containing diploid HMM states at level d, 7: and the states they transition to with what probability 8: f (S, S)) 1 // both haplotypes must start in the haplotype model start state 9: Add state (S, S) to ↵(0) with no outgoing transitions (yet) 10: for d 2 0, 1, 2, ..., Dw 1 do

12:
Let P be an initially empty list of diploid HMM state transitions and their likelihoods 13: if W(x, w) d+1 is Homozygous 0 then 14: Add ((t(u 1 , 0), t(u 2 , 0)), ⇢(u 1 , 0) ⇥ ⇢(u 2 , 0)) to P 15: if W(x, w) d+1 is Homozygous 1 then 16: Add ((t(u 1 , 1), t(u 2 , 1)), ⇢(u 1 , 1) ⇥ ⇢(u 2 , 1)) to P 17: if W(x, w) d+1 is Heterozygous then // Consider both possibilities 18: Add ((t(u 1 , 0), t(u 2 , 1)), ⇢(u 1 , 0) ⇥ ⇢(u 2 , 1)) to P 19: Add ((t(u 1 , 1), t(u 2 , 0)), ⇢(u 1 , 1) ⇥ ⇢(u 2 , 0)) to P 20: if W(x, w) d+1 is Missing then // Consider all possibilities 21: Add ((t(u 1 , 0), t(u 2 , 0)), ⇢(u 1 , 0) ⇥ ⇢(u 2 , 0)) to P 22: Add ((t(u 1 , 0), t(u 2 , 1)), ⇢(u 1 , 0) ⇥ ⇢(u 2 , 1)) to P 23: Add ((t(u 1 , 1), t(u 2 , 0)), ⇢(u 1 , 1) ⇥ ⇢(u 2 , 0)) to P 24: Add ((t(u 1 , 1), t(u 2 , 1)), ⇢(u 1 , 1) ⇥ ⇢(u 2 , 1)) to P 25: for is the backward likelihood of a diploid HMM state (u 1 , u 2 ). D is the number of SNPs in the window associated with the haplotype model, ↵ is the set of diploid HMM states at each level and their probabilistic outgoing transitions as computed by Diploid-Forward. 1: procedure Diploid-Backward(D, ↵) 2: Initialize b(u 1 , u 2 ) 0 for all diploid HMM states (u 1 , u 2 ) 3: for d 2 D 1, D 2, ..., 2, 1, 0 do 4: for each diploid HMM state (u 1 , u 2 ) 2 ↵(d) do // (u 1 , u 2 ) is a source state Algorithm 3 Diploid HMM forward-backward procedure (see Diploid-Forward and Diploid-Backward). The procedure populates f and b, where f (u 1 , u 2 ) is the ("forward") likelihood that a path through the diploid HMM ends in state (u 1 , u 2 ) after emitting d alleles (where d is the level of u 1 and u 2 ) of a haplotype in the input genotype sequence x, and b(u 1 , u 2 ) is the likelihood of all paths from (u 1 , u 2 ) to the end state. The probability P d (u 1 , u 2 |x) that the haplotypes of genotype sequence x belongs to clusters u 1 and u 2 is calculated as The genome-wide ancestry HMM computes the likelihoods that a test instance's genotype sequence, t, in a genomic window w (denoted t w ) is explained by populations p and q for a set of populations and genomic windows. It is parameterized by ⇡ t and ⌧ t , 1 which are typically learned for a specific test instance. The ancestry HMM representing a K populations and a set of SNPs on multiple chromosomes has a single silent (non-emitting) state before the first, after the last, and in-between each chromosome, and a series of (K+1)⇥K 2 emitting states for each window of each chromosome, each corresponding to a population assignment (p, q) with 1  p  q  K. Figure 4 illustrates such a genome-wide HMM with K = 3 populations. Let the emitting state corresponding to window w and population assignment (p, q) be denoted S w,p,q . Its emission probability P (t w |p, q) is precomputed and fixed based on the genotype t w in window w. Let S c represent the silent state that preceeds the emitting states correpsonding to windows on chromosome c. Thus the start state of the HMM is S 1 (and if the HMM represents C chromosomes, the end state would be S C+1 ). Let C(c) map a chromosome number to the window that begins the chromosome. Then, our HMM transitions from silent states to emitting states S c ! S C(c),p,q , from emitting states in the last window of a chromosome to a silent state S C(c+1) 1,p,q ! S c+1 , and from emitting states to emitting states for windows w that are not the first or last in a chromosome S w,p,q ! S w+1,p 0 ,q 0 . A transition from S w,p,q ! S w+1,p 0 ,q 0 represents a change in population assignment between windows w and w + 1 if p 6 = p 0 or q 6 = q 0 .
The transition probabilities from a silent state to an emitting state S c ! S C(c),p,q is ⇡ t,(p,q) , where ⇡ t is a learned parameter vector over all possible assignments (p, q) (1  p  q  K) indicating a global assigment preference. The transition probability from a state in the last window of a chromosome to a silent state S C(c) 1,p,q ! S c is always 1, and transitions between emitting states on the same chromosome, from state 1 The subscript t may be dropped from these and other terms when there is only one test genotype instance in question.
(p, q) in window w to state (p 0 , q 0 ) (with p 0  q 0 ) in window w + 1, are as follows: (p 00 ,q 00 )|p 00 q 00 ,p=p 00 q=q 00 ⇡ t,(p 00 ,q 00 ) where ⌧ t is a parameter representing the probability of changing population assignment that enforces the bias against changing population assignments from window to window ( is the exclusive or operator). We initialize ⇡ t to a uniform distribution, and ⌧ t to a (typically low) initial value and learn ⇡ t and ⌧ t using expectation-maximization over a number of iterations (similar to the standard Baum-Welch algorithm [Rab89], except that ⇡ t and ⌧ t are tied to all state transition probabilities).
Let F t (s) be the forward probability, the sum probability of all paths through the Ancestry HMM (as opposed to the haplotype HMM used to calculate per-window emission probabilities) that start in the start state and end in state s (including the emission of state s) and B t (s) be the backward probability of all paths through the HMM that start in state s (excluding emission) and end in the end state. F and B are computed recursively as follows.
For the emitting states in the first window of a chromosome, for all p 0 and q 0 . When a window w is not the first window of a chromosome, where P (S w 1,p,q ! S w,p 0 ,q 0 |⇡ t , ⌧ t ) is given by (11). The forward probability of the silent state preceding chromosome c is Similarly, if there are C chromosomes in the model, For the last window on chromosome c, for all p and q. When window w is not the last window on a chromosome, Finally, for the silent state preceding chromosome c, After computing F t and B t , we compute the expectation for each ⇡ t,(p,q) as and reset each ⇡ t,(p,q) to the value that maximizes the likelihood of E(⇡ t,(p,q) ): We learn ⌧ t in a similar fashion, by updating it based on the expected number of transitions that do not change assignment, compared to all transitions. If there are C chromosomes,

D Computing the Viterbi Path
The Viterbi path is the single most likely path (relative to a genotype sequence t) through the genome-wide To compute V, we must first define M t , where M t (s) is the probability of the most likely path through the HMM that start in the start state and end in state s (including the emission of state s), analagous to the forward probability F t (s) AppendixC, but referring to the probability of the single most likely path instead of the sum probability of all paths.
For the emitting states in the first window of a chromosome, M t (S C(c),p 0 ,q 0 ) = M t (S c ) ⇥ ⇡ t,(p 0 ,q 0 ) ⇥ P (t C(c) |p 0 , q 0 ) for all p 0 and q 0 . When a window w is not the first window of a chromosome, M t (S w,p 0 ,q 0 ) = argmax 1pqK M t (S w 1,p,q ) ⇥ P (S w 1,p,q ! S w,p 0 ,q 0 |⇡ t , ⌧ t ) ⇥ P (t w |p 0 , q 0 ) And for a silent state that is not the start state, The Viterbi path V is then defined for windows that are the last window in a chromosome, c, as V t,C(c+1) 1 = argmax 1pqK M t (S C(c+1) 1 , p, q), and for all other windows as V t,w = argmax 1p 0 q 0 K P (S w,p,q ! S w+1,p 0 ,q 0 |⇡ t , ⌧ t ) ⇥ P (t w+1 |p 0 , q 0 ) ⇥ M t (S w+1,p 0 ,q 0 ).

E Computing Path Samples
Let choose be a operator that chooses an argument with a probability relative to an expression so that choose x2D f (x) returns x with probability f (x) P x 0 2D f (x 0 ) . Then a stochastic path Q for a genomic sequence t is defined over all windows 1  w  W as follows. For windows that are last in a chromosome, c, Q t,C(c+1) 1 = choose p,q F t (S C(c+1) 1,p,q ).