A fast leastsquares algorithm for population inference
 R Mitchell Parry^{1} and
 May D Wang^{1, 2, 3}Email author
DOI: 10.1186/147121051428
© Parry and Wang; licensee BioMed Central Ltd. 2013
Received: 15 March 2012
Accepted: 6 November 2012
Published: 23 January 2013
Abstract
Background
Population inference is an important problem in genetics used to remove population stratification in genomewide association studies and to detect migration patterns or shared ancestry. An individual’s genotype can be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those populations, P. The parameters, P and Q, of this binomial likelihood model can be inferred using slow sampling methods such as Markov Chain Monte Carlo methods or faster gradient based approaches such as sequential quadratic programming. This paper proposes a leastsquares simplification of the binomial likelihood model motivated by a Euclidean interpretation of the genotype feature space. This results in a faster algorithm that easily incorporates the degree of admixture within the sample of individuals and improves estimates without requiring trialanderror tuning.
Results
We show that the expected value of the leastsquares solution across all possible genotype datasets is equal to the true solution when part of the problem has been solved, and that the variance of the solution approaches zero as its size increases. The Leastsquares algorithm performs nearly as well as Admixture for these theoretical scenarios. We compare leastsquares, Admixture, and FRAPPE for a variety of problem sizes and difficulties. For particularly hard problems with a large number of populations, small number of samples, or greater degree of admixture, leastsquares performs better than the other methods. On simulated mixtures of real population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than Leastsquares. The leastsquares approach, however, performs within 1.5% of the Admixture error. On individual genotypes from the HapMap project, Admixture and leastsquares perform qualitatively similarly and within 1.2% of each other. Significantly, the leastsquares approach nearly always converges 1.5 to 6times faster.
Conclusions
The computational advantage of the leastsquares approach along with its good estimation performance warrants further research, especially for very large datasets. As problem sizes increase, the difference in estimation performance between all algorithms decreases. In addition, when prior information is known, the leastsquares approach easily incorporates the expected degree of admixture to improve the estimate.
Background
The inference of population structure from the genotypes of admixed individuals poses a significant problem in population genetics. For example, genome wide association studies (GWAS) compare the genetic makeup of different individuals in order to extract differences in the genome that may contribute to the development or suppression of disease. Of particular interest are single nucleotide polymorphisms (SNPs) that reveal genetic changes at a single nucleotide in the DNA chain. When a particular SNP variant is associated with a disease, this may indicate that the gene plays a role in the disease pathway, or that the gene was simply inherited from a population that is more (or less) predisposed to the disease. Determining the inherent population structure within a sample removes confounding factors before further analysis and reveals migration patterns and ancestry [1]. This paper deals with the problem of inferring the proportion of an individual’s genome originating from multiple ancestral populations and the allele frequencies in these ancestral populations from genotype data.
Methods for revealing population structure are divided into fast multivariate analysis techniques and slower discrete admixture models [2]. Fast multivariate techniques such as principal components analysis (PCA) [28] reveal subspaces in the genome where large differences between individuals are observed. For casecontrol studies, the largest differences commonly due to ancestry are removed to reduce false positives [4]. Although PCA provides a fast solution, it does not directly infer the variables of interest: the population allele frequencies and individual admixture proportions. On the other hand, discrete admixture models that estimate these variables typically require much more computation time. Following a recent trend toward faster gradientbased methods, we propose a faster simpler leastsquares algorithm for estimating both the population allele frequencies and individual admixture proportions.
Pritchard et al. [9] originally propose a discrete admixture likelihood model based on the random union of gametes for the purpose of population inference. In particular, their model assumes HardyWeinberg equilibrium within the ancestral populations (i.e., allele frequencies are constant) and linkage equilibrium between markers within each population (i.e., markers are independent). Each individual in the current sample is modeled as having some fraction of their genome originating from each of the ancestral populations. The goal of population inference is to estimate the ancestral population allele frequencies, P, and the admixture of each individual, Q, from the observed genotypes, G. If the population of origin for every allele, Z, is known, then the population allele frequencies and the admixture for each individual have a Dirichlet distribution. If, on the other hand, P and Q are known, the population of origin for each individual allele has a multinomial distribution. Pritchard et al. infer populations by alternately sampling Z from a multinomial distribution based on P and Q; and P and Q from Dirichlet distributions based on Z. Ideally, this Markov Chain Monte Carlo sampling method produces independent identically distributed samples (P,Q) from the posterior distribution P(P,QG). The inferred parameters are taken as the mean of the posterior. This algorithm is implemented in an opensource software tool called Structure[9].
The binomial likelihood model proposed by Pritchard et al. was originally used for datasets of tens or hundreds of loci. However, as datasets become larger, especially considering genomewide association studies with thousands or millions of loci, two problems emerge. For one, linkage disequilibrium introduces correlations between markers. Although Falush et al. [10] extended Structure to incorporate loose linkage between loci, larger datasets also pose a computational challenge that has not been met by these samplingbased approaches. This has led to a series of more efficient optimization algorithms for the same likelihood model with uncorrelated loci. This paper focuses on improving computational performance, leaving the treatment of correlated loci to future research.
Tang et al. [11] proposed a more efficient expectation maximization (EM) approach. Instead of randomly sampling from the posterior distribution, the FRAPPE EM algorithm [11] starts with a randomly initialized Z, then alternates between updating the values of P and Q for fixed Z, and maximizing the likelihood of Z for fixed P and Q. Their approach achieves similar accuracy to Structure and requires much less computation time. Wu et al. [12] specialized the EM algorithm in FRAPPE to accommodate the model without admixture, and generalized it to have different mixing proportions at each locus. However, these EM algorithms estimate an unnecessary and unobservable variable Z, something that more efficient algorithms could avoid.
Alexander et al. [13] proposed an even faster approach for inferring P and Q using the same binomial likelihood model but bypassing the unobservable variable Z. Their closesource software, Admixture, starts at a random feasible solution for P and Q and then alternates between maximizing the likelihood function with respect to P and then maximizing it with respect to Q. The likelihood is guaranteed not to decrease at each step eventually converging at a local maximum or saddle point. For a moderate problem of approximately 10000 loci, Admixture achieves comparable accuracy to Structure and requires only minutes to execute compared to hours for Structure[13].
Another feature of Structure’s binomial likelihood model is that it allowed the user to input prior knowledge about the degree of admixture. The prior distribution for Q takes the form of a Dirichlet distribution with a degree of admixture parameter, α, for every population. For α = 0, all of an individual’s alleles originate from the same ancestral population; for α > 0, individuals contain a mixture of alleles from different populations; for α = 1, every assignment of alleles to populations is equally likely (i.e., the noninformative prior); and for α → ∞, all individuals have equal contributions from every ancestral population. Alexander et al. replace the population degree of admixture parameter in Structure with two parameters, λ and γ, that when increased also decrease the level of admixture of the resulting individuals. However, the authors admit that tuning these parameters is nontrivial [14].
This paper contributes to population inference research by (1) proposing a novel leastsquares simplification of the binomial likelihood model that results in a faster algorithm, and (2) directly incorporating the prior parameter α that improves estimates without requiring trialanderror tuning. Specifically, we utilize a two block coordinate descent method [15] to alternately minimize the criterion for P and then for Q. We adapt a fast nonnegative leastsquares algorithm [16] to additionally include a sumtoone constraint for Q and an upperbound for P. We show that the expected value for the estimates of P (or Q) across all possible genotype datasets are equal to the true values when Q (or P) are known and that the variance of this estimate approaches zero as the problem size increases. Compared to Admixture, the leastsquares approach provides a slightly worse estimate of P or Q when the other is known. However, when estimating P and Q from only the genotype data, the leastsquares approach sometimes provides better estimates, particularly with a large number of populations, small number of samples, or more admixed individuals. The leastsquares approximation provides a simpler and faster algorithm, and we provide it as Matlab scripts on our website.
Results
First, we motivate a leastsquares simplification of the binomial likelihood model by deriving the expected value and covariance of the leastsquares estimate across all possible genotype matrices for partially solved problems. Second, we compare leastsquares to sequential quadratic programming (Admixture’s optimization algorithm) for these cases. Third, we compare Admixture, FRAPPE, and leastsquares using simulated datasets with a factorial design varying dataset properties in G. Fourth, we compare Admixture and leastsquares using real population allele frequencies from the HapMap Phase 3 project. Finally, we compare the results of applying Admixture and leastsquares to real data from the HapMap Phase 3 project where the true population structure is unknown.
Matrix notation
Genotype matrix  Population allele frequencies matrix  Individual admixture matrix 

$G=\left[\begin{array}{l}{g}_{11}\phantom{\rule{1em}{0ex}}{g}_{12}\phantom{\rule{1em}{0ex}}\cdots \phantom{\rule{1em}{0ex}}{g}_{1N}\\ {g}_{21}\phantom{\rule{1em}{0ex}}{g}_{22}\phantom{\rule{1em}{0ex}}\cdots \phantom{\rule{1em}{0ex}}{g}_{2N}\\ \phantom{\rule{0.5em}{0ex}}\vdots \phantom{\rule{2.5em}{0ex}}\vdots \phantom{\rule{2.5em}{0ex}}\ddots \phantom{\rule{1em}{0ex}}\vdots \\ {g}_{M1}\phantom{\rule{0.5em}{0ex}}{g}_{M2}\phantom{\rule{0.5em}{0ex}}\cdots \phantom{\rule{0.5em}{0ex}}{g}_{\mathit{\text{MN}}}\end{array}\phantom{\rule{4em}{0ex}}\right]$  $P=\left[\begin{array}{l}{p}_{11}\phantom{\rule{1em}{0ex}}{p}_{12}\phantom{\rule{1em}{0ex}}\cdots \phantom{\rule{1em}{0ex}}{p}_{1K}\\ {p}_{21}\phantom{\rule{1em}{0ex}}{p}_{22}\phantom{\rule{1em}{0ex}}\cdots \phantom{\rule{1em}{0ex}}{p}_{2K}\\ \phantom{\rule{0.5em}{0ex}}\vdots \phantom{\rule{2.5em}{0ex}}\vdots \phantom{\rule{2.5em}{0ex}}\ddots \phantom{\rule{1em}{0ex}}\vdots \\ {p}_{M1}\phantom{\rule{0.5em}{0ex}}{p}_{M2}\phantom{\rule{0.5em}{0ex}}\cdots \phantom{\rule{0.5em}{0ex}}{p}_{\mathit{\text{MK}}}\end{array}\phantom{\rule{4em}{0ex}}\right]$  $Q=\left[\begin{array}{l}{q}_{11}\phantom{\rule{1em}{0ex}}{q}_{12}\phantom{\rule{1em}{0ex}}\cdots \phantom{\rule{1em}{0ex}}{q}_{1N}\\ {q}_{21}\phantom{\rule{1em}{0ex}}{q}_{22}\phantom{\rule{1em}{0ex}}\cdots \phantom{\rule{1em}{0ex}}{q}_{2N}\\ \phantom{\rule{0.5em}{0ex}}\vdots \phantom{\rule{2.5em}{0ex}}\vdots \phantom{\rule{2.5em}{0ex}}\ddots \phantom{\rule{1em}{0ex}}\vdots \\ {q}_{M1}\phantom{\rule{0.5em}{0ex}}{q}_{M2}\phantom{\rule{0.5em}{0ex}}\cdots \phantom{\rule{0.5em}{0ex}}{q}_{\mathit{\text{KN}}}\end{array}\phantom{\rule{4em}{0ex}}\right]$ 
g_{ li } ∈ {0, 1, 2} : number of reference alleles at l th locus for i th individual.  $0\le {P}_{\mathit{lk}}\le 1$: percentage of reference alleles at l th locus in k th population.  ${q}_{\mathit{\text{ki}}}\ge 0,\underset{k=1}{\overset{K}{\Sigma}}{q}_{\mathit{\text{ki}}}=1$: fraction of i th individual’s genome originating from k th population. 
M = number of loci (markers)  $1\le l\le M$  
N = number of individuals  $1\le i\le N$  
K = number of populations  $1\le k\le K$ 
Empirical estimate and upper bound on total variance
The bound using the properties of the Dirichlet distribution in Equation 17 provides a bound of 0.01. As the number of samples increases, the difference between the bound and the asymptotic bound for the Dirichlet distributed q will approach zero.
Intuitively, the error in the leastsquares estimate for P and Q decreases as the number of individuals and the number of loci increases, respectively. Figure 1 supports this notion, suggesting that on very large problems for which the gradient based and expectation maximization algorithms were designed, the error in the leastsquares estimate approaches zero.
Comparing leastsquares approximation to binomial likelihood model
Given estimates of the population allele frequencies, early research focused on estimating the individual admixture [17]. We also note that the number of iterations and convergence properties confound the comparison of iterative algorithms. To avoid these problems and emulate a practical research scenario, we compare leastsquares to sequential quadratic programming (used in Admixture) when P or Q are known a priori. In this scenario, each algorithm converges in exactly one step making it possible to compare the underlying updates for P and Q independently. For N = 100, 1000, and 10000; and α = 0.1, 1, and 2; we consider a grid of twodimensional points for p, where p_{ i } = {0.05, 0.15, …, 0.95}. For each trial, we first generate a random Q such that every column is drawn from a Dirichlet distribution with shape parameter, α. Then, we randomly generate a genotype using Equation 11. We compute the leastsquares solution using Equation 27 and use Matlab’s builtin function ‘fmincon’ to minimize the negative of the loglikelihood in Equation 7, similar to Admixture’s approach. We repeat the process for 1000 trials and aggregate the results.
Root mean squared error in P for known Q and K= 2
RMSE (%)  N= 100  N= 1000  N= 10000  

α=0.1  α=1.0  α=2.0  α=0.1  α=1.0  α=2.0  α=0.1  α=1.0  α=2.0  
SQP  4.35  6.03  7.41  1.37  1.90  2.37  0.43  0.60  0.75 
LS  4.37  6.16  7.68  1.38  1.93  2.40  0.44  0.61  0.76 
Simulated experiments to compare leastsquares to Admixture and FRAPPE
Sources of variation in root mean squared error
ANOVA  Error variance for P  Error variance for Q  Time variance  

Factors and interactions  Sum squared error (×10^{2})  Percent  Sum squared error (×10^{4})  Percent  Sum squared error (×10^{4})  Percent 
K  59.0  8.2  44.0  3.9  58.7  3.2 
N  519.6  72.4  376.2  33.0  585.5  32.2 
Α  63.1  8.8  341.1  29.9  33.2  1.8 
Algorithm  0.1  0.0  1.7  0.1  266.3  14.6 
K × N  32.1  4.5  32.6  2.9  98.2  5.4 
K × α  9.0  1.3  8.2  0.7  4.4  0.2 
K × Algorithm  0.0  0.0  0.4  0.0  55.1  3.0 
N × α  29.1  4.1  282.6  24.8  58.8  3.2 
N × Algorithm  0.0  0.0  2.1  0.2  445.6  24.5 
Α × Algorithm  0.2  0.0  8.4  0.7  10.5  0.6 
Error  5.7  0.8  43.2  3.8  204.4  11.2 
Total  717.9  100.0  1140.4  100.0  1820.4  100.0 
Root mean squared error for Q
K  N  α  AD  LS  FRAPPE  Significance  LSα 

2  100  0.10  0.48  0.72  0.52  AD = FR < LS  0.64 
2  100  0.50  1.12  1.13  1.03  FR = AD = LS  1.18 
2  100  1.00  2.22  2.22  2.29  AD = LS = FR  2.22 
2  100  2.00  4.13  4.11  4.50  LS = AD = FR  3.84 
2  1000  0.10  0.57  0.97  0.63  AD < FR < LS  0.74 
2  1000  0.50  0.69  0.74  0.71  AD < FR < LS  0.74 
2  1000  1.00  0.86  0.91  1.00  AD < LS < FR  0.91 
2  1000  2.00  1.58  1.65  2.33  AD = LS < FR  0.93 
2  10000  0.10  0.59  1.03  0.61  AD < FR < LS  0.76 
2  10000  0.50  0.70  0.81  0.72  AD < FR < LS  0.73 
2  10000  1.00  0.74  0.77  0.79  AD < LS < FR  0.77 
2  10000  2.00  0.89  0.97  1.32  AD < LS < FR  0.96 
3  100  0.10  0.62  0.74  0.63  AD = FR < LS  0.66 
3  100  0.50  2.01  1.81  2.00  LS < FR = AD  1.91 
3  100  1.00  3.49  3.23  3.60  LS < AD = FR  3.23 
3  100  2.00  5.77  5.39  5.89  LS < AD = FR  5.00 
3  1000  0.10  0.68  1.15  0.73  AD < FR < LS  0.76 
3  1000  0.50  0.85  0.88  0.89  AD < LS = FR  0.93 
3  1000  1.00  1.18  1.17  1.35  LS = AD < FR  1.17 
3  1000  2.00  1.94  1.92  2.49  LS = AD < FR  1.20 
3  10000  0.10  0.74  1.26  0.76  AD < FR < LS  0.79 
3  10000  0.50  0.87  0.97  0.87  AD = FR < LS  0.87 
3  10000  1.00  0.89  0.92  0.95  AD < LS < FR  0.92 
3  10000  2.00  1.07  1.09  1.49  AD < LS < FR  1.09 
4  100  0.10  0.79  0.76  0.80  LS = AD = FR  0.77 
4  100  0.50  2.81  2.40  2.85  LS < AD = FR  2.56 
4  100  1.00  4.43  4.01  4.55  LS < AD = FR  4.01 
4  100  2.00  6.63  6.13  6.81  LS < AD = FR  5.65 
4  1000  0.10  0.73  1.17  0.74  AD = FR < LS  0.72 
4  1000  0.50  0.95  0.95  1.00  LS = AD < FR  1.07 
4  1000  1.00  1.34  1.32  1.47  LS = AD < FR  1.32 
4  1000  2.00  2.09  2.06  2.50  LS = AD < FR  1.32 
4  10000  0.10  0.84  1.33  0.84  AD = FR < LS  0.74 
4  10000  0.50  0.96  1.03  0.96  AD = FR < LS  0.95 
4  10000  1.00  0.97  0.99  1.03  AD < LS < FR  0.99 
4  10000  2.00  1.14  1.15  1.51  AD = LS < FR  1.15 
Root mean squared error for P
K  N  α  AD  LS  FRAPPE  Significance  LSα 

2  100  0.10  4.33  4.37  4.33  AD = FR = LS  4.36 
2  100  0.50  5.13  5.17  5.14  AD = FR = LS  5.17 
2  100  1.00  5.99  6.03  5.99  AD = FR = LS  6.03 
2  100  2.00  7.24  7.28  7.29  AD = LS = FR  7.25 
2  1000  0.10  1.37  1.42  1.38  AD < FR < LS  1.39 
2  1000  0.50  1.62  1.65  1.63  AD = FR < LS  1.65 
2  1000  1.00  1.90  1.93  1.92  AD < FR = LS  1.93 
2  1000  2.00  2.52  2.58  2.82  AD = LS < FR  2.38 
2  10000  0.10  0.46  0.57  0.46  AD < FR < LS  0.48 
2  10000  0.50  0.52  0.56  0.53  AD < FR < LS  0.52 
2  10000  1.00  0.60  0.61  0.62  AD < LS < FR  0.61 
2  10000  2.00  0.81  0.87  1.14  AD < LS < FR  0.92 
3  100  0.10  5.58  5.64  5.58  AD = FR = LS  5.62 
3  100  0.50  7.37  7.42  7.38  AD = FR = LS  7.42 
3  100  1.00  9.05  9.06  9.06  AD = FR = LS  9.06 
3  100  2.00  11.36  11.33  11.39  LS = AD = FR  11.30 
3  1000  0.10  1.78  1.87  1.78  AD = FR < LS  1.80 
3  1000  0.50  2.35  2.40  2.35  AD = FR < LS  2.39 
3  1000  1.00  2.97  3.00  3.01  AD < LS = FR  3.00 
3  1000  2.00  4.11  4.14  4.41  AD = LS < FR  3.89 
3  10000  0.10  0.61  0.82  0.62  AD < FR < LS  0.61 
3  10000  0.50  0.78  0.84  0.78  AD = FR < LS  0.76 
3  10000  1.00  0.93  0.95  0.98  AD < LS < FR  0.95 
3  10000  2.00  1.35  1.36  1.82  AD = LS < FR  1.49 
4  100  0.10  6.83  6.90  6.84  AD = FR = LS  6.87 
4  100  0.50  9.61  9.63  9.62  AD = FR = LS  9.62 
4  100  1.00  11.90  11.89  11.92  LS = AD = FR  11.89 
4  100  2.00  14.94  14.89  15.01  LS = AD = FR  14.89 
4  1000  0.10  2.16  2.28  2.16  AD = FR < LS  2.17 
4  1000  0.50  3.10  3.15  3.11  AD = FR < LS  3.15 
4  1000  1.00  4.04  4.06  4.08  AD < LS = FR  4.06 
4  1000  2.00  5.61  5.62  5.88  AD = LS < FR  5.36 
4  10000  0.10  0.76  1.02  0.77  AD = FR < LS  0.71 
4  10000  0.50  1.04  1.11  1.04  AD = FR < LS  1.01 
4  10000  1.00  1.28  1.30  1.33  AD < LS < FR  1.30 
4  10000  2.00  1.87  1.87  2.36  AD = LS < FR  2.06 
Computation time
K  N  Α  AD  LS  FRAPPE  Significance  LSα 

2  100  0.10  4.71  1.00  9.97  LS < AD < FR  0.77 
2  100  0.50  4.69  1.16  8.22  LS < AD < FR  1.12 
2  100  1.00  5.46  1.78  8.31  LS < AD < FR  1.77 
2  100  2.00  6.25  2.37  10.40  LS < AD < FR  2.55 
2  1000  0.10  43.37  11.87  136.88  LS < AD < FR  8.06 
2  1000  0.50  51.70  13.98  112.41  LS < AD < FR  12.34 
2  1000  1.00  62.00  24.43  118.90  LS < AD < FR  24.03 
2  1000  2.00  83.07  51.33  195.43  LS < AD < FR  48.43 
2  10000  0.10  447.68  142.14  1963.83  LS < AD < FR  93.61 
2  10000  0.50  570.12  209.39  1908.72  LS < AD < FR  157.44 
2  10000  1.00  687.88  352.24  2242.18  LS < AD < FR  349.51 
2  10000  2.00  1037.45  796.83  3762.70  LS < AD < FR  406.63 
3  100  0.10  6.10  1.84  15.29  LS < AD < FR  1.48 
3  100  0.50  6.42  2.05  15.75  LS < AD < FR  1.90 
3  100  1.00  7.19  2.71  16.78  LS < AD < FR  2.74 
3  100  2.00  9.00  4.01  19.80  LS < AD < FR  4.24 
3  1000  0.10  69.41  18.32  223.32  LS < AD < FR  12.53 
3  1000  0.50  78.73  24.10  264.85  LS < AD < FR  21.42 
3  1000  1.00  96.89  38.06  305.50  LS < AD < FR  36.63 
3  1000  2.00  121.45  60.79  355.51  LS < AD < FR  55.54 
3  10000  0.10  791.36  155.56  3256.83  LS < AD < FR  121.19 
3  10000  0.50  883.99  301.52  4251.68  LS < AD < FR  264.77 
3  10000  1.00  1175.25  617.80  5111.92  LS < AD < FR  578.42 
3  10000  2.00  1506.20  1404.27  7052.33  LS < AD < FR  901.56 
4  100  0.10  8.06  2.45  23.93  LS < AD < FR  2.00 
4  100  0.50  8.78  2.66  26.56  LS < AD < FR  2.72 
4  100  1.00  10.03  3.70  30.89  LS < AD < FR  3.43 
4  100  2.00  12.94  5.00  37.26  LS < AD < FR  4.86 
4  1000  0.10  81.72  17.32  386.11  LS < AD < FR  13.45 
4  1000  0.50  99.92  24.37  433.17  LS < AD < FR  22.68 
4  1000  1.00  117.71  36.94  508.49  LS < AD < FR  36.01 
4  1000  2.00  156.39  58.02  564.57  LS < AD < FR  57.62 
4  10000  0.10  879.95  229.06  5798.15  LS < AD < FR  176.27 
4  10000  0.50  1170.97  480.99  7051.69  LS < AD < FR  505.45 
4  10000  1.00  1555.90  1017.41  8108.08  LS < AD < FR  1051.81 
4  10000  2.00  2202.08  2538.54  10445.75  AD = LS < FR  1308.79 
Comparison on admixtures derived from the HapMap3 dataset
Simulation experiments (13) using realistic population allele frequencies from the HapMap phase 3 project
Simulation 1 q ~ Dir(1,1,1)  Simulation 2 q ~ Dir(.5,.5,.5)  Simulation 3 q ~ Dir(.1,.1,.1)  

Original 


 
Admixture 


 
Leastsquares (α=1) 


 
Leastsquares with α 


 
RMSE (%) ± Std. Dev.  Time (s.) ± Std. Dev.  RMSE (%) ± Std. Dev.  Time (s.) ± Std. Dev.  RMSE (%) ± Std. Dev.  Time (s.) ± Std. Dev.  
P  Q  P  Q  P  Q  
AD (ε=1e4)  2.50 ± 0.04  2.19 ± 0.11  105 ± 13  1.99 ± 0.02  1.44 ± 0.04  88 ± 9  1.54 ± 0.01  0.76 ± 0.02  86 ± 7 
AD (ε=1.4e3)  2.50 ± 0.04  2.19 ± 0.11  98 ± 13  1.99 ± 0.02  1.44 ± 0.04  87 ± 11  1.54 ± 0.01  0.76 ± 0.02  83 ± 9 
LS1 (ε=1.4e3)  2.51 ± 0.03  1.85 ± 0.07  51 ± 6  2.04 ± 0.02  1.43 ± 0.04  37 ± 8  1.63 ± 0.01  1.75 ± 0.05  27 ± 5 
LSα (ε=1.4e3)  2.51 ± 0.03  1.85 ± 0.07  54 ± 8  2.03 ± 0.02  1.53 ± 0.04  28 ± 4  1.57 ± 0.01  1.08 ± 0.02  15 ± 4 
Simulation experiments (46) using realistic population allele frequencies from the HapMap phase 3 project
Simulation 4 q ~ Dir(.2,.2,.05)  Simulation 5 q ~ Dir(.2,.2,.5)  Simulation 6 q ~ Dir(.05,.05,.01)  

Original 


 
Admixture 


 
Leastsquares (α=1) 


 
Leastsquares with α 


 
RMSE (%) Std. Dev.  Time (s.) ± Std. Dev.  RMSE (%) ± Std. Dev.  Time (s.) ± Std. Dev.  RMSE (%) ± Std. Dev.  Time (s.) ± Std. Dev.  
P  Q  P  Q  P  Q  
AD (ε=1e4)  2.01 ± 0.05  0.87 ± 0.02  94 ± 12  1.98 ± 0.03  1.16 ± 0.03  93 ± 17  1.96 ± 0.07  0.53 ± 0.02  91 ± 9 
AD (ε=1.4e3)  2.01 ± 0.05  0.87 ± 0.02  82 ± 5  1.98 ± 0.03  1.16 ± 0.03  86 ± 13  1.96 ± 0.07  0.53 ± 0.02  82 ± 7 
LS1 (ε=1.4e3)  2.09 ± 0.05  1.70 ± 0.05  31 ± 7  2.06 ± 0.03  1.60 ± 0.04  34 ± 5  2.04 ± 0.07  2.00 ± 0.04  27 ± 7 
LSα (ε=1.4e3)  2.05 ± 0.05  1.17 ± 0.03  17 ± 3  2.02 ± 0.04  1.34 ± 0.04  24 ± 4  1.99 ± 0.07  1.09 ± 0.03  14 ± 3 
The apparent advantage of Admixture involves individuals on the periphery of the unit simplex defining the space of Q. In Table 7, this corresponds to individuals on the boundary of the right triangle defined by the xaxis, yaxis, and y = 1  x diagonal line. For Simulation 1, the original Q contains very few individuals on the boundary, Admixture estimates far more on the boundary, and the leastsquares was closer to the ground truth. For Simulation 2  6, the ground truth contains more individuals on the boundary, Admixture correctly estimates these boundary points but the leastsquares algorithms predict fewer points on the boundary. Simulation 6 provides the most obvious example where Admixture estimates individuals exactly on the boundary and leastsquares contains a jumble of individuals near but not exactly on the line.
Real dataset from the HapMap phase 3 project
Discussion
This work contributes to the population inference literature by providing a novel simplification of the binomial likelihood model that improves the computational efficiency of discrete admixture inference. This approximation results in an inference algorithm based on minimizing the squared distance between the genotype matrix G and twice the product of the population allele frequencies and individual admixture proportions: 2PQ. This Euclidean distancebased interpretation aligns with previous results employing multivariate statistics. For example, researchers have found success using principal component analysis to reveal and remove stratification [24] or even to reveal clusters of individuals in subpopulations [57]. Recently, McVean [5] proposed a genealogical interpretation of principal component analysis and uses it to reveal information about migration, geographic isolation, and admixture. In particular, given two populations, individuals cluster along the first principal component. Admixture proportion is the fractional distance between the two population centers. However, these cluster centers must known or inferred in order to estimate ancestral population allele frequencies. The leastsquares approach infers these estimates efficiently and directly.
Typically, discrete admixture models employ a binomial likelihood function rather than a Euclidean distancebased one. Pritchard et al. detail one such model and use a slow sampling based approach to infer the admixed ancestral populations for individuals in a sample [9]. Recognizing the performance advantage of maximizing the likelihood rather than sampling the posterior, Tang et al. proposed an expectation maximization algorithm and Alexander et al. [13] proposed a sequential quadratic programming (SQP) approach using the same likelihood function [9]. We take this approach a step further by simplifying the model proposed by Pritchard et al. to introduce a leastsquares criterion. By justifying the leastsquares simplification, we connect the fast and practical multivariate statistical approaches to the theoretically grounded binomial likelihood model. We validate our approach on a variety of simulated and real datasets.
First, we show that if the true value of P (or Q) is known, the expected value of the least squares solution for Q (or P) across all possible genotype matrices is equal to the true value, and the variance of this estimate decreases with M (or N). In this bestcase scenario, we show that SQP provides a slightly better estimate than the leastsquares solution for a variety of problem sizes and difficulty. For more common scenarios where the algorithms must estimate P and Q using only the genotype information in G, we show that for particularly difficult problems with small N, large K, or large α, the leastsquares approach often performs better than its peers. For about onethird of the parameter sets, Admixture performs significantly better than leastsquares and FRAPPE but all algorithms approach zero error as N becomes very large. In addition, the error introduced by the choice of algorithms was relatively small compared to other characteristics of the experiment such as sample size, number of populations, and the degree of admixture in the sample. That is, improving accuracy has more to do with improving the dataset than with selecting the algorithm, suggesting that algorithm selection may depend on other criteria such as its speed. In nearly all cases, the leastsquares method computes its solution faster, typically a 1.5 to 5times faster. At the current problem size involving about 10000 loci, this speed improvement may justify the use of leastsquares algorithms. For a single point estimate, researchers may prefer a slightly more accurate algorithm at the cost of seconds or minutes. For researchers testing several values of K and α and using multiple runs to gauge the fitness of each parameter set, or those estimating standard errors [13], the speed improvement could be the difference between hours and days of computation. As the number of loci increase to hundreds of thousands or even millions, speed may be more important. The leastsquares approach offers an alternative simpler and faster algorithm for population inference that provides qualitatively similar results.
The key speed advantage of the leastsquares approach comes from a single nonnegative leastsquares update that minimizes a quadratic criterion for P and then for Q per iteration. Admixture, on the other hand, minimizes several quadratic criteria sequentially as it fits the true binomial model. Although the leastsquares algorithm completes each update in less time and is guaranteed to converge to a local minimum or straddle point, predicting the number of iterations to convergence presents a challenge. We provide empirical timing results and note that selecting a suitable stopping criterion for these iterative methods can change the timing and accuracy results. For comparison, we use the same stopping criterion with published thresholds for Admixture and FRAPPE[13], and a threshold of MN×10^{10} for leastsquares.
This work is motivated in part by the desire to analyze larger genotype datasets. In this paper, we focus on the computational challenges of analyzing very large numbers of markers and individuals. However, linkage disequilibrium introduces correlations between loci that cannot be avoided in very large datasets. Large datasets can be pruned to diminish the correlation between loci. For example, Alexander et al. prune the HapMap phase 3 dataset from millions of SNPs down to around 10000 to avoid correlations. In this study, we assume linkage equilibrium and therefore uncorrelated markers and limit our analysis to datasets less than about 10000 SNPs. Incorporating linkage disequilibrium in gradientbased optimizations of the binomial likelihood model remains an open problem.
Estimating the number of populations K from the admixed samples continues to pose a difficult challenge for clustering algorithms in general and population inference in particular. In practice, experiments can be designed to include individual samples that are expected to be distributed close to their ancestors. For example, Tang et al. [11] suggested using domain knowledge to collect an appropriate number of pseudoancestors that reveal allele frequencies of the ancestral populations. The number of groups considered provides a convenient starting point for K. Lacking domain knowledge, computational approaches can be used to try multiple reasonable values for K and evaluating their fitness. For example, Pritchard et al. [9] estimated the posterior distribution of K and select the most probable K. Another approach is to evaluate the consistency of inference for different values of K. If the same value of K leads to very different inferences of P and Q from different random starting points, the inference can be considered inconsistent. Brunet et al. [18] proposed this method of model selection called consensus clustering.
For realistic population allele frequencies, P, from the HapMap Phase 3 dataset and very little admixture in Q, Admixture provides better estimates of Q. The key advantage of Admixture appears to be for individuals containing nearly zero contribution from one or more inferred populations, whereas the leastsquares approach performs better when the individuals are wellmixed. Visually, both approaches reveal population structure. Using the two approaches to infer three ancestral populations from four HapMap Phase 3 sampling populations reveals qualitatively similar results.
We believe the computational advantage of the leastsquares approach along with its good estimation performance warrants further research especially for very large datasets. For example, we plan to adapt and apply the leastsquares approach to datasets utilizing microsatellite data rather than SNPs and consider the case of more than two alleles per locus. Researchers have incorporated geospatial information into samplingbased [19] and PCAbased [8] approaches. Multiple other extensions to samplingbased or PCAbased algorithms have yet to be incorporated into faster gradientbased approaches.
Conclusion
This paper explores the utility of a leastsquares approach for the inference of population structure in genotype datasets. Whereas previous Euclidean distancebased approaches received little theoretical justification, we show that a leastsquares approach is the result of a firstorder approximation of the negative loglikelihood function for the binomial generative model. In addition, we show that the error in this approximation approaches zero as the number of samples (individuals and loci) increases. We compare our algorithm to stateoftheart algorithms, Admixture and FRAPPE, for optimizing the binomial likelihood model, and show that our approach requires less time and performs comparably well. We provide both quantitative and visual comparisons that illustrate the advantage of Admixture at estimating individuals with little admixture, and show that our approach infers qualitatively similar results. Finally, we incorporate a degree of admixture parameter that improves estimates for known levels of admixture without requiring additional parameter tuning as is the case for Admixture.
Methods
where 0 ≤ q_{ ik } ≤ 1 represents the fraction of the i th individual’s genome originating from the k th population and for all i, ∑_{ k }q_{ ki } = 1. Table 1 summarizes the matrix notation we use.
Likelihood function
Bounded error for the leastsquares approach
Mean and total variance of the estimate of p
Mean and total variance of the estimate for q
Incorporating degree of admixture, α
Optimization algorithm
This process is repeated until the change in the criterion function is less than ε at which point we consider the algorithm to have converged. The Admixture algorithm suggests a threshold of ε = 1e4 but we have found that a larger threshold often suffices. Unless otherwise stated, we use a threshold that depends on the size of the problem: ε = MN×10^{10}, corresponding to 1e4 when M = 10000 and N = 100.
Leastsquares solution for P
However, some of the elements of P may be less than zero. In the active/passive set approach, if elements of P are negative, they are clamped at zero and added to the active set. The unconstrained solution is then applied to the remaining passive elements of P. If the solution happens to be nonnegative, the algorithm finishes. If not, negative elements are added to the active set and elements in the active set with a negative gradient (will decrease the criterion by increasing) are added back to the passive set. The process is repeated until the passive set is nonnegative and the active set contains only elements with a positive gradient at zero. We extend the approach of Van Benthem and Keenan to include an upper bound at one. Therefore, we maintain two active sets: those clamped at zero and those clamped at one and update both after the unconstrained optimization of the passive set at each iteration. We provide Matlab source code that implements this algorithm on our website.
Leastsquares solution for Q
As before, some elements of Q may be negative. In that case, we utilize the active set method to clamp elements of Q at zero and update active and passive sets at each iteration until convergence as described above. We adapt the Matlab script by Van Benthem and Keenan so that the unconstrained solution uses Equation 30 instead of the standard pseudoinverse and provide it on our website.
Simulated experiments to compare the proposed approach to Admixture and FRAPPE
We generate simulated genotype data for a variety of problems using M = 10000 markers, and varying N between 100, 1000, and 10000; K between 2, 3, and 4; and α between 0.1, 0.5, 1, and 2, for a total of 36 parameter sets. For each combination of N, K, and α, we generate the ground truth P from a uniform distribution, and Q from a Dirichlet distribution parameterized by α. Then, we draw a random genotype for each individual using the binomial distribution in Equation 11. We estimate P and Q using only the genotype information and the true number of populations, K. We repeat the experiment 50 times drawing new, P, Q, and G matrices each time. Finally, we record the performance of Admixture using the published tight convergence threshold of ε = 1e4[13] and a loose convergence threshold of ε = MN×10^{4}; the leastsquares algorithm using an uninformative prior (α = 1) and ε = MN×10^{4}, and the FRAPPE EM algorithm using the published threshold of ε = 1. For reference, we also include the leastsquares algorithm with informative prior (known α) with convergence threshold of ε = MN×10^{4}_{.} In all experiments, Admixture’s performances with the two convergence thresholds were nearly identical and we only report the results for ε = MN×10^{4}, resulting in shorter computation times. We used a fourway analysis of variance (ANOVA) with a fixed effects model to reveal which factors (including algorithm) contribute more or less to the estimation error and computation time.
Statistical significance of root mean squared error and computation time
For each combination of K, N, and α, we perform a KruskalWallis test to determine if Admixture, LeastSquares, and FRAPPE perform significantly differently at a Bonferroni adjusted significance level of 0.05/(36 parameter sets) = 0.0014. If there is no significant difference, we consider their performances equal. If there is a significant difference, we perform pairwise MannWhitney Utests to determine significant differences between specific algorithms. We use a Bonferroni adjusted significance level of 0.05/(36 parameter sets)/(3 pairwise comparisons) = 4.6e4. The ‘Summary’ columns contain the order of performance among the algorithms such that every algorithm to the left of a ‘<’ symbol performs better than every algorithm to the right. An ‘=’ symbol indicates that the adjacent algorithms do not perform significantly differently.
Comparison on admixtures derived from the HapMap3 dataset
Real dataset from the HapMap phase 3 project
In the original Admixture paper [13], the authors use Admixture to infer three hypothetical ancestral populations from four known populations in the HapMap Phase 3 dataset, including individuals with African ancestry in the American Southwest (ASW), individuals with Mexican ancestry in Los Angeles (MEX), and the same CEU CEU and YRI individuals from the previous example. We ran each algorithm 20 times on the dataset using a convergence threshold of ε = 1e4, recording the convergence times for each trial.
Declarations
Acknowledgements
This work was supported in part by grants from Microsoft Research, National Institutes of Health (Bioengineering Research Partnership R01CA108468, P20GM072069, Center for Cancer Nanotechnology Excellence U54CA119338, and 1RC2CA148265), and Georgia Cancer Coalition (Distinguished Cancer Scholar Award to Professor M. D. Wang).
Authors’ Affiliations
References
 Beaumont M, Barratt EM, Gottelli D, Kitchener AC, Daniels MJ, Pritchard JK, Bruford MW: Genetic diversity and introgression in the Scottish wildcat. Mol Ecol 2001, 10: 319336. 10.1046/j.1365294x.2001.01196.xView ArticlePubMedGoogle Scholar
 Novembre J, Ramachandran S: Perspectives on human population structure at the cusp of the sequencing era. Annu Rev Genomics Hum Genet 2011., 12:Google Scholar
 Menozzi P, Piazza A, CavalliSforza L: Synthetic maps of human gene frequencies in Europeans. Science 1978, 201: 786792. 10.1126/science.356262View ArticlePubMedGoogle Scholar
 Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genomewide association studies. Nat Genet 2006, 38: 904909. 10.1038/ng1847View ArticlePubMedGoogle Scholar
 McVean G: A genealogical interpretation of principal components analysis. PLoS Genet 2009, 5: e1000686. 10.1371/journal.pgen.1000686PubMed CentralView ArticlePubMedGoogle Scholar
 Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genet 2006, 2: e190. 10.1371/journal.pgen.0020190PubMed CentralView ArticlePubMedGoogle Scholar
 Lee C, Abdool A, Huang CH: PCAbased population structure inference with generic clustering algorithms. BMC Bioinforma 2009., 10:Google Scholar
 Novembre J, Stephens M: Interpreting principal component analyses of spatial population genetic variation. Nat Genet 2008, 40: 646649. 10.1038/ng.139PubMed CentralView ArticlePubMedGoogle Scholar
 Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics 2000, 155: 945959.PubMed CentralPubMedGoogle Scholar
 Falush D, Stephens M, Pritchard JK: Inference of population structure using multilocus genotype data linked loci and correlated allele frequencies. Genetics 2003, 164: 15671587.PubMed CentralPubMedGoogle Scholar
 Tang H, Peng J, Wang P, Risch NJ: Estimation of individual admixture: Analytical and study design considerations. Genet Epidemiol 2005, 28: 289301. 10.1002/gepi.20064View ArticlePubMedGoogle Scholar
 Wu B, Liu N, Zhao H: PSMIX: an R package for population structure inference via maximum likelihood method. BMC Bioinforma 2006, 7: 317. 10.1186/147121057317View ArticleGoogle Scholar
 Alexander DH, Novembre J, Lange K: Fast modelbased estimation of ancestry in unrelated individuals. Genome Res 2009, 19: 1655. 10.1101/gr.094052.109PubMed CentralView ArticlePubMedGoogle Scholar
 Alexander D, Lange K: Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinforma 2011, 12: 246. 10.1186/1471210512246View ArticleGoogle Scholar
 Kim H, Park H: Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM Journal in Matrix Analysis and Applications 2008, 30: 713730. 10.1137/07069239XView ArticleGoogle Scholar
 Van Benthem MH, Keenan MR: Fast algorithm for the solution of largescale nonnegativityconstrained least squares problems. J Chemom 2004, 18: 441450. 10.1002/cem.889View ArticleGoogle Scholar
 Hanis CL, Chakraborty R, Ferrell RE, Schull WJ: Individual admixture estimates: disease associations and individual risk of diabetes and gallbladder disease among Mexican Americans in Starr County, Texas. Am J Phys Anthropol 1986, 70: 433441. 10.1002/ajpa.1330700404View ArticlePubMedGoogle Scholar
 Brunet JP, Tamayo P, Golub TR, Mesirov JP: Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A 2004, 101: 4164. 10.1073/pnas.0308531101PubMed CentralView ArticlePubMedGoogle Scholar
 Guillot G, Estoup A, Mortier F, Cosson JF: A spatial statistical model for landscape genetics. Genetics 2005, 170: 12611280. 10.1534/genetics.104.033803PubMed CentralView ArticlePubMedGoogle Scholar
 Bertsekas DP: Nonlinear programming. Belmont, Mass.: Athena Scientific; 1995.Google Scholar
 Settle JJ, Drake NA: Linear mixing and the estimation of ground cover proportions. Int J Remote Sens 1993, 14: 11591177. 10.1080/01431169308904402View ArticleGoogle Scholar
 Altshuler D, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P, Gibbs RA, Belmont JW, Boudreau A, Leal SM: A haplotype map of the human genome. Nature 2005, 437: 12991320. 10.1038/nature04226View ArticleGoogle Scholar
 ADMIXTURE: fast ancestry estimation. [http://www.genetics.ucla.edu/software/admixture/download.html] []
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.