 Methodology article
 Open Access
Stochastic Lanczos estimation of genomic variance components for linear mixed-effects models
BMC Bioinformatics, volume 20, Article number: 411 (2019)
Abstract
Background
Linear mixed-effects models (LMM) are a leading method for conducting genome-wide association studies (GWAS) but require residual maximum likelihood (REML) estimation of variance components, which is computationally demanding. Previous work has reduced the computational burden of variance component estimation by replacing direct matrix operations with iterative and stochastic methods and by employing loose tolerances to limit the number of iterations in the REML optimization procedure. Here, we introduce two novel algorithms, stochastic Lanczos derivative-free REML (SLDF_REML) and Lanczos first-order Monte Carlo REML (L_FOMC_REML), that exploit problem structure via the principle of Krylov subspace shift-invariance to speed computation beyond existing methods. Both novel algorithms require only a single round of computation involving iterative matrix operations, after which their respective objectives can be repeatedly evaluated using vector operations. Further, in contrast to existing stochastic methods, SLDF_REML can exploit precomputed genomic relatedness matrices (GRMs), when available, to further speed computation.
Results
Results of numerical experiments are congruent with theory and demonstrate that interpreted-language implementations of both algorithms match or exceed existing compiled-language software packages in speed, accuracy, and flexibility.
Conclusions
Both the SLDF_REML and L_FOMC_REML algorithms outperform existing methods for REML estimation of variance components for LMM and are suitable for incorporation into existing GWAS LMM software implementations.
Background
Linear mixed-effects modeling (LMM) is a leading methodology employed in genome-wide association studies (GWAS) of complex traits in humans, offering the dual benefits of controlling for population stratification while permitting the inclusion of data from related individuals [1]. However, the implementation of LMM comes at the cost of increased computational burden relative to ordinary least-squares regression, particularly in performing residual maximum likelihood (REML) estimation of genomic variance components. Conventional REML algorithms require multiple \(\mathcal {O}\left (n^{3}\right)\) or \(\mathcal {O}\left (mn^{2}\right)\) matrix operations, where m and n are the numbers of markers and individuals, respectively, rendering them infeasible for large biobank-scale data sets. Further, common numerical methods for REML estimation rely on sparse matrix methods suitable for traditional LMM applications (e.g., pedigree data or experiments with repeated measures [2]) that are inapplicable to genomic variance component models, since these models involve dense relatedness matrices. As a result, the problem of increasing the computational efficiency of REML estimation of genomic variance components has generated considerable research activity [3–8].
In the case of the standard two variance component model (1), the estimation of which is the focus of the current research, previous efforts toward increasing computational efficiency fit into two primary categories: (1) reducing the number of cubic time complexity matrix operations needed to achieve convergence; and (2) substituting stochastic and iterative matrix operations for deterministic, direct methods to obtain procedures with quadratic time complexity. The first approach is embodied by the methods implemented in the FaST-LMM and GEMMA packages [3, 5, 6], which take advantage of the fact that the genomic relatedness matrix (GRM) and identity matrix comprising the covariance structure are simultaneously diagonalizable. As a result, after performing a single spectral decomposition of the GRM and a small number of matrix-vector multiplications, the REML criterion (3) and its gradient and Hessian can be repeatedly evaluated using only vector operations. The second approach is exemplified by the popular BOLT-LMM software [7, 8], which avoids all cubic operations by solving linear systems via the method of conjugate gradients (CG) and employing stochastic trace estimators in place of deterministic computations.
In the current research, we propose two algorithms, stochastic Lanczos derivative-free residual maximum likelihood (SLDF_REML; Algorithm 3) and Lanczos first-order Monte Carlo residual maximum likelihood (L_FOMC_REML; Algorithm 4), that combine features of both approaches (Fig. 1). Here, we translate the simultaneous diagonalizability of the heritable and non-heritable components of the covariance structure to stochastic and iterative methods via the principle of Krylov subspace shift-invariance. As a result, we need to compute the costliest portions of the objective function only once (via stochastic/iterative methods); all subsequent iterations of the REML optimization problem use only vector operations. We develop the theory underlying these methods and demonstrate their performance relative to previous methods via numerical experiments.
Results
Across 20 replications per condition for random subsamples of n=16,000 to 256,000 unrelated European-ancestry individuals, both SLDF_REML and L_FOMC_REML produced heritability estimates for height consistent with those generated by the GCTA software package (Figs. 4 and 5). For large samples, the novel algorithms achieved greater accuracy than either version of BOLT-LMM (e.g., for n=250,000, mean-squared error was 1.74 ×10^{−6} for BOLT-LMM v2.3.2 versus 1.24 ×10^{−7} for L_FOMC_REML). In particular, the time required per additional iteration after the initial overhead computations was low for the novel algorithms (e.g., \(\overline {t}\)=20.07 min for BOLT-LMM v2.3.2 versus 2.06 min for L_FOMC_REML; Table 2), enabling increased precision at minor cost. With respect to total timings, SLDF_REML dramatically outperformed all other methods when a precomputed GRM was available (Table 2 and Fig. 3), which we expect whenever the number of markers exceeds the sample size. Among methods taking genotype matrices as inputs, SLDF_REML and L_FOMC_REML performed similarly, whereas BOLT-LMM v2.3.2 converged more quickly than either in smaller samples (Fig. 3), though the differences for n=256,000 were relatively minor (e.g., \(\overline {t}=\)91.09 min for BOLT-LMM v2.3.2 versus 102.21 min for L_FOMC_REML; Table 2). The older version of BOLT-LMM, v2.1, performed significantly more slowly than any of the other implementations examined (e.g., average wall clock time was 177.95 min at n=256,000), demonstrating the importance of implementation optimization.
As the computations needed to compute the Lanczos decompositions in L_FOMC_REML and BOLT-LMM v2.3.2 are equivalent in time and memory complexity, we expect that an optimized compiled-language implementation of L_FOMC_REML would reduce the overhead computation time by a significant linear factor (≈3 for n=256,000, comparing the sum of the overhead time and a single objective function evaluation time for BOLT-LMM v2.3.2 to its total running time; Table 2). Consistent with theory, the wall clock times per objective function evaluation for the novel algorithms were trivial given the Lanczos decompositions (e.g., for n=256,000, \(\overline {t}\) = 2.06 versus 20.07 min for L_FOMC_REML and BOLT-LMM v2.3.2, respectively; Table 2 and Fig. 2).
Discussion
We have proposed stochastic algorithms for estimating the two variance component model (1), both of which theoretically offer substantial time savings relative to existing methods. Our methods capitalize on the principle of Krylov subspace shift-invariance to reduce the number of steps involving \(\mathcal {O} \left (n^{2}\right)\) or \(\mathcal {O} (mn)\) computations to one, whereas existing methods perform equivalent computations at each iteration of the REML optimization procedure. For large samples, when taking genotype matrices as inputs, our interpreted-language implementations of L_FOMC_REML and SLDF_REML [9] produced more accurate variance component estimates than the highly optimized, compiled BOLT-LMM implementations, while taking similar amounts of time. Thus, we expect comparably optimized implementations of the novel algorithms to compute high-accuracy REML estimates in close to the time required by BOLT-LMM v2.3.2 for a single objective function evaluation. Further, in contrast to the BOLT_LMM algorithm, which requires the genotype matrix, SLDF_REML can exploit precomputed GRMs to reduce the operation count by an \(\mathcal {O}(2m/n)\) factor (Table 1), which yields dramatic time savings when the number of markers greatly exceeds the number of individuals (Fig. 3). While GRM precomputation is itself \(\mathcal {O}\left (mn^{2}\right)\), it can be effectively and asynchronously parallelized across multiple compute nodes, substantially mitigating its computational burden (though we note that serial input/output constraints can interfere with efficient parallelization). On the other hand, as the L_FOMC_REML algorithm involves the computation of BLUPs of SNP effects, it is preferable to SLDF_REML when BLUP estimates are desired for prediction (as in [10]).
There are several limitations to the proposed approaches. First, SLDF_REML, which benefits from the ability to take GRMs as input, depends linearly on the number of included covariates, which might grow prohibitive in samples spanning numerous genotyping batches and ascertainment locations. L_FOMC_REML, by contrast, does not share this dependence but, as in BOLT_LMM, requires \(\mathcal {O}(mn)\) matrix multiplications for BLUP computation at each step of the REML optimization procedure, whereas SLDF_REML requires only vector operations at each step. Thus, though the options provided by the two novel algorithms increase researchers’ flexibility overall, the choice of whether to employ SLDF_REML versus L_FOMC_REML is problem-specific and necessitates greater researcher attention to resource allocation. For example, even when a precomputed GRM is available, it might be preferable to use L_FOMC_REML if BLUPs of latent SNP effects are desired. On the other hand, if a researcher intends to sequentially analyze a large number of phenotypes in a relatively small sample of individuals, it might prove most efficient to compute a GRM, despite the involved computational burden, in order to speed subsequent computations by supplying the GRM to the SLDF_REML algorithm. Second, neither algorithm mitigates the substantial \(\mathcal {O}(mn)\) or \(\mathcal {O}\left (n^{2}\right)\) memory complexity common to all algorithms for REML estimation of genomic variance components, requiring that researchers have access to high-memory compute nodes to work with large samples (though we note that neither of the novel algorithms substantially increases this burden either).
Finally, for the same reasons that the spectral decomposition-based direct methods implemented in the FaST-LMM and GEMMA packages [3, 5, 6] are restricted to the simple two component model (1) (i.e., whereas the GRM and identity matrix are simultaneously diagonalizable, the same does not hold for arbitrary collections of three or more symmetric positive semidefinite matrices), the shift-invariance property exploited by the proposed methods does not extend to multiple genomic variance components. Given that the two component model is insufficient for precise heritability estimation for many complex traits [11], our novel algorithms apply only to the particular, though common, tasks of variance component and BLUP estimation for LMM in association studies.
Despite these limitations, the proposed algorithms have clear advantages over existing methods in terms of flexibility, accuracy, and speed of computation. We provide both pseudocode and heavily annotated Python 3 implementations [9] to facilitate their incorporation into existing software packages. Further, though our algorithms are restricted to the two variance component model, they can be used to generate the inputs necessary for estimation of more complex models, such as the mixture model estimated via variational approximation in [7], and thus have applications to non-infinitesimal models. Finally, we suggest that the methods presented in our theoretical development, in particular stochastic trace estimation and stochastic Lanczos quadrature, are likely to find uses in REML estimation of other models of interest to researchers in genomics. In particular, we suggest the development of methods that exploit Krylov subspace shift-invariance to speed variance/covariance component estimation for multivariate phenotypes as a target for future research. Such models necessarily involve the computation or approximation of Hessian matrices, thereby introducing additional complexity relative to the univariate case considered above. However, the extension of fast cubic-complexity methods based on the spectral decomposition of the covariance matrix [3, 5] to the multivariate case [6] suggests the potential for multivariate analogues of the algorithms presented here.
Conclusions
The proposed algorithms, SLDF_REML and L_FOMC_REML, unify previous approaches to estimating the two variance component model (1) by exploiting the simultaneous diagonalizability of the covariance structure components while avoiding matrix operations with cubic time complexity. As a result, the most expensive operations need to be performed only once, as with the spectral decomposition performed in the FaST-LMM and GEMMA software packages [3, 5, 6], but these operations consist only of matrix-vector products, as in the BOLT-LMM software package [7, 8]. All but the first iteration of the REML optimization procedure require only vector operations, yielding increased speed and numerical precision relative to existing methods. Furthermore, the distinct strengths of the two methods allow a flexible approach depending on researcher goals: SLDF_REML can operate on precomputed GRMs when available, whereas L_FOMC_REML can generate BLUPs of latent SNP effects without added computational burden. We recommend these algorithms for incorporation into GWAS LMM implementations.
Method
We consider the two component genomic variance components model commonly employed in LMM association studies [1], which is of the form
where y is a measured phenotype, the c≪n columns of \(X\in \mathbb {R}^{n\times c}\) are covariates (including an intercept term) with corresponding fixed effects β, and \(Z\in \mathbb {R}^{n\times m}\) is a matrix of n individuals’ standardized genotypes at m loci. Without loss of generality, we assume that X has full column rank; in the case of numerical rank deficiency, we can simply replace X by the optimal full-rank approximation generated by its economy singular value decomposition or a rank-revealing QR decomposition. The latent genetic effects \(u\in \mathbb {R}^{m}\) and residuals \(e\in \mathbb {R}^{n}\) are random variables with distributions parametrized by the heritable and non-heritable variance components, \(\sigma _{g}^{2}\) and \(\sigma _{e}^{2}\), respectively. The REML criterion corresponds to the marginal likelihood of \(\sigma _{g}^{2},\sigma _{e}^{2}\vert K^{T}y\), where K^{T} projects to an (n−c)-dimensional subspace orthogonal to the covariate vectors such that the null space of K^{T} is exactly the column space of X [12]. In other words, \(K^{T}:\mathbb {R}^{n}\rightarrow \mathcal {S}\subset \mathbb {R}^{n-c}\) such that \(\mathbb {R}^{n}=\mathcal {S}\oplus \text {col}\ X\). The transformed random variable K^{T}y has the marginal distribution \(K^{T}y\sim \mathcal {MVN}\left (0,\sigma _{g}^{2}\frac {1}{m}K^{T}ZZ^{T}K+\sigma _{e}^{2}K^{T}K\right)\), which we reparametrize as \(K^{T}y\sim \mathcal {MVN}\left (0,\sigma _{g}^{2}K^{T}H_{\tau }K\right)\), where
Here, \(\frac {1}{m}ZZ^{T}\), which indicates the average covariance between individuals’ standardized genotypes, is often referred to as the genomic relatedness matrix (GRM). The REML criterion, or marginal log likelihood, can be expressed as a function of τ:
where P_{τ}=K(K^{T}H_{τ}K)^{−1}K^{T}, and, as implied by the REML firstorder (stationarity) conditions, \(\hat {\sigma }^{2}_{e}(\tau)\) is the expected residual variance component given τ and \(\hat {\sigma }^{2}_{g}(\tau)=\hat {\sigma }^{2}_{e}(\tau)/\tau \) [12, 13]. In practice, K is never explicitly formed.
Naïve procedures for maximizing the REML criterion require evaluating (3) or its derivatives at each iteration of the optimization procedure. Previous methods either reduce the number of necessary cubic time complexity operations to one by exploiting problem structure, or substitute quadratic time complexity iterative and stochastic matrix operations for direct computations (Fig. 1). Here, we unify these approaches via the principle of Krylov subspace shift invariance to achieve methods that only require a single iteration of quadratic time complexity operations.
In what follows, we first present a brief survey of the Lanczos process, its applications to families of shifted linear systems, and its use in constructing Gaussian quadratures for spectral matrix functions. We assume familiarity with the method of conjugate gradients, an iterative procedure for approximating solutions to symmetric positive-definite linear systems, and with Gaussian quadrature, a method for approximating the integral of a given function by a well-chosen weighted sum of its values; if unfamiliar, see [14] and [15], respectively. We present these methods toward the goal of efficiently evaluating the quadratic form and log-determinant terms appearing in the REML criterion (3). We then present the details of the SLDF_REML and L_FOMC_REML algorithms, both of which exploit problem structure via Lanczos process-based methods in order to speed computation. Finally, we derive expressions for the computational complexity of the present algorithms, which we confirm via numerical experiment.
Preliminaries
The notation in this section is selfcontained. Our presentation borrows from the literature extensively; further details on the (block) Lanczos procedure [14, 16], conjugate gradients for shifted linear systems [17, 18], stochastic trace estimation [19, 20], and stochastic Lanczos quadrature [21–23] are suggested in the bibliography.
Krylov subspaces
Consider a symmetric positive-definite matrix A and nonzero vector b. Define the m^{th} Krylov subspace as the span of the first m monomials in A applied to b; that is, \(\mathcal {K}_{m}(A,b)=\text {span}\, \left \{A^{k}b:k=0,\ldots,m-1\right \}\). Krylov subspaces are shift-invariant—i.e., for real numbers σ, we have \(\mathcal {K}_{m}(A,b)=\mathcal {K}_{m}(A+\sigma I,b)\).
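The shift-invariance property can be checked numerically. The following is a minimal numpy sketch (illustrative only, not the authors' implementation; `krylov_basis` is our own helper) that verifies \(\mathcal {K}_{m}(A,b)=\mathcal {K}_{m}(A+\sigma I,b)\) by comparing the orthogonal projectors onto the two subspaces:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 8, 4, 2.5

G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)          # symmetric positive-definite
b = rng.standard_normal(n)

def krylov_basis(M, v, m):
    """Orthonormal basis of K_m(M, v) = span{v, Mv, ..., M^{m-1} v}."""
    V = np.column_stack([np.linalg.matrix_power(M, k) @ v for k in range(m)])
    Q, _ = np.linalg.qr(V)
    return Q

Q1 = krylov_basis(A, b, m)
Q2 = krylov_basis(A + sigma * np.eye(n), b, m)

# Equal orthogonal projectors <=> equal subspaces.
same = np.allclose(Q1 @ Q1.T, Q2 @ Q2.T, atol=1e-6)
print(same)  # True
```

The shift σI only re-weights the monomial combinations spanning the subspace, so the subspace itself is unchanged; this is the structural fact the proposed algorithms exploit.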
The Lanczos procedure
The Lanczos procedure generates the decomposition AU_{m}=U_{m}T_{m}, where the columns u_{1},…,u_{m} of U_{m} form an orthonormal basis for \(\mathcal {K}_{m}(A,b)\) and the Jacobi matrices \(T_{m}\in \mathbb {R}^{m\times m}\) are symmetric tridiagonal. Choosing u_{1}=b/∥b∥, successive columns are uniquely determined by the sequence of Lanczos polynomials \(\{p_{k}\}_{k=1}^{m-1}\) such that each u_{k}=p_{k−1}(A)u_{1} and each p_{k} is the characteristic polynomial of the Jacobi matrix T_{k} consisting of the first k rows and columns of T_{m}. The Lanczos procedure is equivalent to the well-known method of conjugate gradients (CG) for solving the linear system Ax=b in that the m^{th}-step CG approximate solution x^{(m)} is obtained from the above decomposition using only vector operations (see Algorithm 1). The number of steps m prior to termination corresponds to the number of CG iterations needed to bound the norm of the residual below a specified tolerance: ∥Ax^{(m)}−b∥<ε. The rate of convergence depends on the spectral properties of A and can be controlled in terms of the spectral condition number κ(A). In the present application, the fact that all complex traits of interest generally have a nontrivial non-heritable component results in well-conditioned systems [7, 9].
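A compact sketch of the procedure and its CG connection follows (our own minimal numpy implementation with full reorthogonalization for stability, not the paper's code): it builds U and T and then recovers the m^{th} CG iterate as \(x^{(m)}=\Vert b\Vert\, U_{m}T_{m}^{-1}e_{1}\).

```python
import numpy as np

def lanczos(A, b, m):
    """m-step Lanczos process: returns U (orthonormal columns spanning
    K_m(A, b)) and the symmetric tridiagonal Jacobi matrix T."""
    n = b.shape[0]
    U = np.zeros((n, m))
    alpha, beta = np.zeros(m), np.zeros(m - 1)
    U[:, 0] = b / np.linalg.norm(b)
    for k in range(m):
        w = A @ U[:, k]
        if k > 0:
            w -= beta[k - 1] * U[:, k - 1]
        alpha[k] = U[:, k] @ w
        w -= alpha[k] * U[:, k]
        w -= U[:, :k + 1] @ (U[:, :k + 1].T @ w)  # full reorthogonalization
        if k < m - 1:
            beta[k] = np.linalg.norm(w)
            U[:, k + 1] = w / beta[k]
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return U, T

rng = np.random.default_rng(1)
n, m = 50, 20
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)          # well-conditioned SPD system
b = rng.standard_normal(n)
U, T = lanczos(A, b, m)

# CG connection: the m-th CG iterate is recovered from U and T alone.
x_m = np.linalg.norm(b) * (U @ np.linalg.solve(T, np.eye(m)[:, 0]))
print(np.allclose(U.T @ U, np.eye(m)))                         # orthonormal
print(np.linalg.norm(A @ x_m - b) / np.linalg.norm(b) < 1e-6)  # converged
```

Because the covariance matrices arising in the present application are well-conditioned, a modest number of steps m suffices, as in this toy example.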
Solving families of shifted linear systems
Having applied the Lanczos process to the seed system Ax=b, shift-invariance can be exploited to obtain the m^{th}-step CG approximate solution \(x_{\sigma }^{(m)}\) to the shifted linear system A_{σ}x_{σ}=(A+σI)x_{σ}=b using only vector operations [17]. It can be shown that any nonnegative shift σ≥0 improves the rate of convergence such that \(\left \Vert A_{\sigma }x_{\sigma }^{(m)}-b\right \Vert = \frac {\delta _{m}}{\delta _{m}+\sigma }\left \Vert Ax^{(m)}-b\right \Vert \), where δ_{m}>0 is the m^{th} diagonal element of the Lanczos Jacobi matrix corresponding to \(\mathcal {K}_{m}(A,b)\).
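The mechanism can be sketched as follows (a minimal numpy illustration under our own helper names, not the paper's L_Solve): one Lanczos pass on the seed system yields \(x_{\sigma }^{(m)}=\Vert b\Vert\, U_{m}(T_{m}+\sigma I)^{-1}e_{1}\) for every shift σ, at the cost of a small m×m solve per shift.

```python
import numpy as np

def lanczos(A, b, m):
    """m-step Lanczos with full reorthogonalization; returns (U, T)."""
    n = b.shape[0]
    U = np.zeros((n, m)); alpha = np.zeros(m); beta = np.zeros(m - 1)
    U[:, 0] = b / np.linalg.norm(b)
    for k in range(m):
        w = A @ U[:, k] - (beta[k - 1] * U[:, k - 1] if k else 0.0)
        alpha[k] = U[:, k] @ w
        w -= alpha[k] * U[:, k]
        w -= U[:, :k + 1] @ (U[:, :k + 1].T @ w)
        if k < m - 1:
            beta[k] = np.linalg.norm(w); U[:, k + 1] = w / beta[k]
    return U, np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)

rng = np.random.default_rng(2)
n, m = 80, 40
G = rng.standard_normal((n, n))
A = G @ G.T / n + np.eye(n)          # well-conditioned seed system
b = rng.standard_normal(n)

U, T = lanczos(A, b, m)              # the only pass touching A
e1 = np.zeros(m); e1[0] = 1.0

def solve_shifted(sigma):
    """Approximate (A + sigma*I)^{-1} b from the seed decomposition alone."""
    return np.linalg.norm(b) * (U @ np.linalg.solve(T + sigma * np.eye(m), e1))

residuals = [np.linalg.norm((A + s * np.eye(n)) @ solve_shifted(s) - b)
             for s in (0.0, 0.5, 2.0)]
print(all(r < 1e-6 * np.linalg.norm(b) for r in residuals))
```

No further matrix-vector products with A (or A_σ) are needed once U and T are in hand, which is the key cost saving carried over to the REML setting below.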
Lanczos polynomials and Gaussian quadrature
Additionally, the Lanczos polynomials comprise a sequence of orthogonal polynomials with respect to the spectral measure
where A=QΛQ^{T} is the spectral decomposition [21, 22]. Quadratic forms v^{T}f(A)v involving spectral functions f(A)=Qf(Λ)Q^{T}, e.g., for the matrix logarithm, \(v^{T}(\log A)v = \sum _{i=1}^{n} \left [\ln (\lambda _{i})\left (Q^{T}v\right)_{i}^{2}\right ]\), can be written as Riemann–Stieltjes integrals of the form
The Lanczos decomposition AU_{m}=U_{m}T_{m} generates the weights and nodes for an mpoint Gaussian quadrature approximating the above integral. Denoting the spectral decomposition of the j^{th} Jacobi matrix \(T_{j}=W_{j}D_{j}W_{j}^{T}\) for j=1,…,m, we approximate (4) as
where θ_{j,ℓ}={D_{j}}_{ℓ,ℓ} and \(\omega _{j,\ell }=\left \{e_{1}^{T}W_{j}\right \}_{\ell }\). As m here corresponds to the number of CG iterations needed to ensure that ∥Ax^{(m)}−v∥ is smaller than a specified tolerance, the tridiagonal Jacobi matrices are small and calculating their spectral decompositions is computationally trivial.
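The quadrature construction above can be checked numerically; the following is a minimal numpy sketch (our own code, not the authors'): an m-step Lanczos pass yields the Jacobi matrix T, whose eigenvalues and squared first eigenvector components serve as nodes and weights for approximating \(v^{T}(\log A)v\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 60, 25
G = rng.standard_normal((n, n))
A = G @ G.T / n + np.eye(n)          # well-conditioned SPD matrix
v = rng.standard_normal(n)
v /= np.linalg.norm(v)               # unit-norm vector for the quadratic form

# m-step Lanczos on (A, v), keeping the Jacobi matrix T.
U = np.zeros((n, m)); alpha = np.zeros(m); beta = np.zeros(m - 1)
U[:, 0] = v
for k in range(m):
    w = A @ U[:, k] - (beta[k - 1] * U[:, k - 1] if k else 0.0)
    alpha[k] = U[:, k] @ w
    w -= alpha[k] * U[:, k]
    w -= U[:, :k + 1] @ (U[:, :k + 1].T @ w)   # full reorthogonalization
    if k < m - 1:
        beta[k] = np.linalg.norm(w)
        U[:, k + 1] = w / beta[k]
T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)

# Nodes = eigenvalues of T; weights = squared first eigenvector components.
D, W = np.linalg.eigh(T)
quad = np.sum(W[0, :] ** 2 * np.log(D))

# Reference value of v^T log(A) v from the spectral decomposition of A.
lam, Q = np.linalg.eigh(A)
exact = (Q.T @ v) ** 2 @ np.log(lam)
print(abs(quad - exact) < 1e-8)  # True
```

The eigendecomposition of the small tridiagonal T is the only extra work beyond the Lanczos pass, consistent with the observation that it is computationally trivial.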
Stochastic Lanczos quadrature
Stochastic Lanczos quadrature (SLQ) combines the above quadrature formulation with Hutchinson-type stochastic trace estimators [21]. Such estimators approximate the trace of a matrix \(H\in \mathbb {R}^{n\times n}\) by a weighted sum of quadratic forms, \(\text {tr}(H)\approx \frac {n}{n_{\text {rand}}}\sum _{j=1}^{n_{\text {rand}}}v_{j}^{T}{Hv}_{j}\), for normalized, suitably distributed i.i.d. random probing vectors \(\{v_{j}\}_{j=1}^{n_{\text {rand}}}\) [19]. The SLQ approximation to the trace of a spectral function of a matrix, tr(f(A)), is then
Whereas the number of probing vectors n_{rand} is chosen a priori, the number of quadrature nodes m_{κ} corresponds to the number of conjugate gradient iterations needed to ensure \(\left \Vert A_{\sigma }x_{j \sigma }^{(m_{\kappa })}-v_{j}\right \Vert \) is less than a specified tolerance for each j=1,…,n_{rand}.
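The Hutchinson-type estimator on its own can be sketched in a few lines (an illustrative numpy example under our own variable names, with normalized Rademacher probes as used later in the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_rand = 200, 2000

G = rng.standard_normal((n, n))
H = G @ G.T + np.eye(n)

# Normalized Rademacher probing vectors: entries +-1/sqrt(n), unit norm.
V = rng.choice([-1.0, 1.0], size=(n, n_rand)) / np.sqrt(n)

# Hutchinson estimator: tr(H) ~= (n / n_rand) * sum_j v_j^T H v_j.
est = (n / n_rand) * np.einsum('ik,ik->', V, H @ V)

rel_err = abs(est - np.trace(H)) / np.trace(H)
print(rel_err < 0.05)
```

Each probe costs one matrix-vector product with H; in SLQ the quadratic forms \(v_{j}^{T}f(A)v_{j}\) are themselves approximated by the Lanczos quadrature of the previous section rather than formed directly.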
SLQ and shift invariance
For a fixed probing vector v_{j}, we can exploit the shift-invariance of \(\mathcal {K}_{m}(A,v_{j})\) to efficiently update the Gaussian quadrature generated by the corresponding Lanczos decomposition AU_{m}=U_{m}T_{m}. Again denoting the spectral decomposition of the Jacobi matrix \(T_{m}=W_{m}D_{m}W_{m}^{T}\), the Lanczos decomposition of the shifted system is simply \(A_{\sigma }U_{m}=U_{m}W_{m}(D_{m}+\sigma I_{m})W_{m}^{T}\). Thus, given the approximation (5) for tr(f(A)), we can efficiently compute an approximation of tr(f(A_{σ})) for any σ>0. In Algorithm 2 we implement a method for estimating tr(log(A_{σ})) in \(\mathcal {O}(n_{\text {rand}})\) operations given the spectral decompositions of the Jacobi matrices corresponding to \(\mathcal {K}_{m}(A,v_{j})\) for probing vectors \(\{v_{j}\}_{j=1}^{n_{\text {rand}}}\).
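The idea can be sketched end to end (a hedged numpy illustration of the same technique, with our own helper names rather than the paper's Algorithm 2): the Jacobi eigendecompositions are computed once, after which a shifted log-determinant estimate only requires shifting the quadrature nodes.

```python
import numpy as np

def jacobi_matrix(A, v, m):
    """Jacobi matrix of the m-step Lanczos process on K_m(A, v)."""
    n = v.shape[0]
    U = np.zeros((n, m)); alpha = np.zeros(m); beta = np.zeros(m - 1)
    U[:, 0] = v / np.linalg.norm(v)
    for k in range(m):
        w = A @ U[:, k] - (beta[k - 1] * U[:, k - 1] if k else 0.0)
        alpha[k] = U[:, k] @ w
        w -= alpha[k] * U[:, k]
        w -= U[:, :k + 1] @ (U[:, :k + 1].T @ w)   # reorthogonalize
        if k < m - 1:
            beta[k] = np.linalg.norm(w); U[:, k + 1] = w / beta[k]
    return np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)

rng = np.random.default_rng(5)
n, m, n_rand = 120, 30, 30
G = rng.standard_normal((n, n))
A0 = G @ G.T / n + np.eye(n)          # seed matrix, well-conditioned

# One-time cost: nodes and weights per probing vector.
nodes, weights = [], []
for _ in range(n_rand):
    v = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)   # normalized Rademacher
    D, W = np.linalg.eigh(jacobi_matrix(A0, v, m))
    nodes.append(D)
    weights.append(W[0, :] ** 2)

def slq_logdet(sigma):
    """Estimate log det(A0 + sigma*I); shifting A0 only shifts the nodes."""
    return (n / n_rand) * sum(w @ np.log(d + sigma)
                              for d, w in zip(nodes, weights))

exact = np.linalg.slogdet(A0 + 0.7 * np.eye(n))[1]
rel_err = abs(slq_logdet(0.7) - exact) / abs(exact)
print(rel_err < 0.1)
```

After the precomputation, each evaluation of `slq_logdet` costs only \(\mathcal{O}(n_{\text{rand}}\, m)\) scalar operations regardless of n, which is the property the REML algorithms below rely on.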
Block methods
For multiple right-hand sides B=[b_{1}⋯b_{c}], the Lanczos procedure can be generalized to the block Krylov subspace \(\mathcal {K}_{m}(A,B)=\bigotimes _{j=1}^{c}\mathcal {K}_{m}(A,b_{j})\), resulting in a collection of Lanczos decompositions AU_{j}=U_{j}T_{j} such that {U_{j}}_{1}=b_{j}/∥b_{j}∥ for j=1,…,c. This process is equivalent to block CG methods in that the Jacobi matrices can again be used to generate an approximate solution X^{(m)} to the matrix equation AX^{(m)}=B. We provide an implementation of the block Lanczos procedure in L_Seed [9], employing a conservative convergence criterion defined in terms of the 1→2 operator norm of the residual, \(\left \Vert AX^{(m)}-B\right \Vert _{1\rightarrow 2}=\max _{j}\left \Vert Ax^{(m)}_{j}-b_{j}\right \Vert _{2}\). Compared to performing c separate Lanczos procedures with respect to \(\{\mathcal {K}_{m}(A,b_{j})\}_{j=1}^{c}\), block Lanczos with respect to \(\mathcal {K}_{m}(A,B)\) produces the same result (for a fixed number of steps). However, block Lanczos employs BLAS-3 operations and is thus more performant, especially when implemented on top of parallelized linear algebra subroutines.
A derivativefree REML algorithm
We propose the stochastic Lanczos derivative-free residual maximum likelihood algorithm (SLDF_REML; Algorithm 3), a method for efficiently and repeatedly evaluating the REML criterion, which is then subject to a zeroth-order optimization scheme. To achieve this goal, we first identify the parameter space of interest with a family of shifted linear systems. We then develop a scheme for evaluating the quadratic form y^{T}P_{τ}y and log-determinant ln(det(K^{T}H_{τ}K)) terms in the REML criterion (3) that uses the previously discussed Lanczos methods to exploit this shifted structure. Specifically, after obtaining a collection of Lanczos decompositions, we can repeatedly solve the linear systems involved in the quadratic form term via Lanczos conjugate gradients and approximate the log-determinant term via stochastic Lanczos quadrature.
The parameter space as shifted linear systems
Given a range of possible values of the standardized genetic variance component, or heritability,
we set \(\tau _{0}=\left (1-h^{2}_{\max }\right)/h^{2}_{\max }\) and define \(H_{0}=H_{\tau _{0}}\), noting that, for all \(\tau \in \Theta =\left \{\left (1-h^{2}\right)/h^{2}: h^{2}\in \left [h^{2}_{\min },h^{2}_{\max }\right ]\right \}\), the spectral condition number of H_{τ} will be less than that of H_{0}, as the identity component of H_{τ} can only increase. Further, we have now identified the elements of our parameter space τ∈Θ with the family of shifted linear systems
For any vector v for which we have computed the Lanczos decomposition H_{0}U=UT with the first column of U equal to v/∥v∥, we can use Algorithm 1 to obtain the CG approximate solution \(x_{\sigma }\approx H_{\sigma }^{-1}v\) for all σ≥0 in O(n) operations.
The quadratic form
Directly evaluating the quadratic form
is computationally demanding and is typically avoided in direct estimation methods [12, 13]. Writing the complete QR decomposition of the covariate matrix \(X=[Q_{X}\vert Q_{X^{\perp }}]R\) allows us to define \(K^{T}=Q_{X^{\perp }}^{T}\), noting that substituting \(Q_{X^{\perp }}Q_{X^{\perp }}^{T}\) for K^{T} preserves the value of (7). \(Q_{X^{\perp }}Q_{X^{\perp }}^{T}\) is equivalent to the orthogonal projection operator \(S:v\mapsto v - Q_{X}Q_{X}^{T}v\), which admits an efficient implicit construction and is computed in O(nc^{2}) operations via the economy QR decomposition X=Q_{X}R_{X}. Then, re-expressing (7) as y^{T}S(SH_{τ}S)^{†}Sy, we can use the Lanczos process to construct an orthonormal basis and corresponding Jacobi matrix for the Krylov subspace \(\mathcal {K}({SH}_{0}S,Sy)\). We can then obtain the CG approximation of y^{T}S(SH_{τ}S)^{†}Sy using vector operations since, for any shift σ, we have y^{T}S(SH_{σ}S)^{†}Sy=y^{T}S(SH_{0}S+σI_{n})^{−1}Sy (see Lemma 1 in Additional file 1 for proof).
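The implicit projector S can be sketched in a few lines (a minimal numpy illustration with our own names, not the paper's implementation): one economy QR of X up front, then each application of S costs only two thin matrix-vector products.

```python
import numpy as np

rng = np.random.default_rng(6)
n, c = 100, 3
# Covariate matrix with an intercept column, as in the model above.
X = np.column_stack([np.ones(n), rng.standard_normal((n, c - 1))])
QX, _ = np.linalg.qr(X)          # economy QR: O(n c^2), performed once

def S(v):
    """Orthogonal projection onto the complement of col(X)."""
    return v - QX @ (QX.T @ v)   # O(nc) per application

y = rng.standard_normal(n)
ortho = np.allclose(X.T @ S(y), 0)   # residualized y is orthogonal to col(X)
idem = np.allclose(S(S(y)), S(y))    # S is idempotent, as a projector must be
print(ortho, idem)  # True True
```

Because S is applied implicitly, the n×n projector matrix is never formed, mirroring the remark above that K is never explicitly constructed.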
The log determinant
We use an equivalent formulation [12, 24] of the term ln(det(K^{T}H_{τ}K)), rewriting it as
The det(X^{T}X) term is constant with respect to τ and can be disregarded. For c≪n, \(\det \left (X^{T}H^{-1}_{\tau } X\right)\) is computationally trivial via direct methods given \(H^{-1}_{\tau } X\), which we can compute for all parameter values of interest in O(n) operations having first applied the block Lanczos process with respect to \(\mathcal {K}(H_{0},X)\). Computing the block Lanczos decomposition corresponding to \(\mathcal {K}(H_{0},X)\), which is performed only once, unfortunately scales with the number of covariates c, a disadvantage not shared by our second algorithm (Algorithm 4). The remaining term, ln(det(H_{τ})), is approximated by applying SLQ (Algorithm 2) to a special case of (5): we rewrite the log determinant as the trace of the matrix logarithm
where we have spectrally decomposed H_{0}=QΛQ^{T} for some τ_{0}≤τ with σ=τ−τ_{0}. We draw n_{rand} i.i.d. normalized Rademacher random vectors \(v_{1},\ldots,v_{n_{\text {rand}}}\), where each element of each vector v_{i} takes the value 1/∥v_{i}∥ or −1/∥v_{i}∥ with equal probability. The SLQ approximation of the log determinant for the seed system is
where the weights w_{i,ℓ} and nodes θ_{i,ℓ} are respectively derived by using the Lanczos process to construct orthonormal bases for \(\mathcal {K}(H_{0},v_{i})\) (in practice, we apply block Lanczos to \(\mathcal {K}(H_{0},(v_{1},\ldots,v_{n_{\text {rand}}}))\)) [21, 22].
The SLDF_REML algorithm
Stochastic Lanczos derivative-free residual maximum likelihood (SLDF_REML; Algorithm 3), conceptually similar to the derivative-free algorithm of Graser and colleagues [13], applies the previously introduced Lanczos methods to approximate the above reparametrization of the REML criterion. Shift-invariance is then exploited such that, with the exception of the initial Lanczos decompositions, the REML log-likelihood can be repeatedly evaluated using only vector operations. SLDF_REML takes as arguments a phenotype vector \(y\in \mathbb {R}^{n}\), a covariate matrix \(X\in \mathbb {R}^{n\times c}\), either the genetic relatedness matrix \(ZZ^{T}\in \mathbb {R}^{n\times n}\) or the standardized genotype matrix \(Z\in \mathbb {R}^{n\times m}\) (in which case the action of the GRM as a linear operator is coded implicitly as v↦Z(Z^{T}v)), and a range of possible standardized genomic variance component values \(\Theta = \left [h^{2}_{\text {min}},h^{2}_{\max }\right ]\), and generates a function REML_criterion\(:\Theta \rightarrow \mathbb {R}\) that efficiently computes the log-likelihood of \(\tau \mid K^{T}y\). This function is then subject to scalar optimization via Brent’s method, which is feasible given the low cost of evaluation and the low dimension of Θ. Hyperparameters include the number of probing vectors n_{rand} used for the SLQ approximation of the log determinant, as well as tolerances corresponding to the REML criterion, the parameter estimates, and the Lanczos residual norms. Convergence to a given tolerance on a sensible scale is ensured by optimizing with respect to the heritability h^{2}∈Θ⊆[0,1] and evaluating the REML criterion at τ=(1−h^{2})/h^{2}. The REML criterion can be repeatedly evaluated in \(\mathcal {O}(n)\) operations, making high accuracy computationally feasible.
A firstorder Monte Carlo REML algorithm
We additionally propose the Lanczos first-order Monte Carlo residual maximum likelihood algorithm (L_FOMC_REML; Algorithm 4), which also takes advantage of the shifted structure of the standard genomic variance components model to speed computation. We first present the related first-order algorithm implemented in the efficient and widely used BOLT-LMM software [7, 8], which we refer to as BOLT_LMM and of which the proposed L_FOMC_REML algorithm is a straightforward extension.
BOLT_LMM (first-order Monte Carlo REML)
The BOLT_LMM algorithm is based on the observation that at stationary points of the REML criterion (3), the firstorder REML conditions (i.e., ∇ℓ=0) imply that
where \(\tilde {u}\) and \(\tilde {e}\) are the best linear unbiased predictions (BLUPs) of the latent genetic effects and residuals, respectively [25]. The BLUPs are functions of τ given by
where we have defined \(\acute {H}_{\tau }= \frac {1}{m}SZZ^{T}S+\tau I_{n}\). The expectations (8) are approximated via the following stochastic procedure: Monte Carlo samples of the latent variables, \(\check {u}_{k}\stackrel {i.i.d}{\sim }\mathcal {MVN}(0,I_{m})\), \(\check {e}_{k}\stackrel {i.i.d}{\sim }\mathcal {MVN}(0,S)\) are used to generate samples of the projected phenotype vector
BLUPs are then computed as in (9), yielding the approximations
Using the above expressions, Loh et al. [7, 8] apply a zeroth-order root-finding algorithm to the quantity
noting that f_{r}=0 is a necessary condition (and, in practice, a sufficient condition) for (8). Using CG to approximate solutions to the linear systems involved in the BLUP computations results in an efficient REML estimation procedure involving \(\mathcal{O}(n\cdot m\cdot n_{\text{rand}})\) operations for well-conditioned covariance structures (i.e., for non-trivial non-heritable variance component values). As noted in [8], implicit preconditioning of H_{0} can be achieved by including the first few singular vectors of the genotype matrix (equivalently, eigenvectors of the GRM) as columns of the covariate matrix X.
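The Monte Carlo approximations above are instances of stochastic trace estimation [19, 20]: expectations of quadratic forms such as \(E\|\tilde{u}\|^{2}\) reduce to traces of implicit matrices, and for a symmetric matrix B and a random probe z with i.i.d. zero-mean unit-variance entries, E[zᵀBz]=tr(B). A minimal illustration with Rademacher probes, using an arbitrary random matrix in place of the implicit operators that arise in (8):

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_rand = 200, 500
M = rng.standard_normal((n, n))
B = M @ M.T / n  # symmetric PSD matrix standing in for an implicit operator

# Hutchinson estimator: for Rademacher probes z, E[z^T B z] = tr(B).
Z = rng.choice([-1.0, 1.0], size=(n, n_rand))
quad_forms = np.einsum("ij,ij->j", Z, B @ Z)  # z_k^T B z_k for each probe
estimate = quad_forms.mean()

exact = np.trace(B)
print(abs(estimate - exact) / exact < 0.05)
```

In practice only matrix-vector products with B are required, which is what makes the estimator attractive when B is available only as an implicit operator.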
The L_FOMC_REML algorithm
The BOLT_LMM algorithm described above involves solving n_{rand}+1 linear systems
at each iteration of the optimization scheme in order to compute BLUPs of the latent variables for the observed phenotype vector and each of the Monte Carlo samples. However, each iteration involves spectral shifts of the left-hand side of the form
As in the SLDF_REML algorithm, the underlying block Krylov subspace is invariant to these shifts (i.e., \(\mathcal {K}_{m}\left (\acute {H}_{\tau },Y\right)=\mathcal {K}_{m}\left (\acute {H}_{\tau }+\sigma I,Y\right)\), where \(Y=\left [y\vert \check {y}_{1}\vert \cdots \vert \check {y}_{n_{\text {rand}}}\right ]\)). Thus, having performed the Lanczos process for an initial parameter value τ_{0}, we can use L_Solve (Algorithm 1) to obtain the block CG approximate solution \(X_{\sigma }^{(m)}\approx \acute {H}_{\tau +\sigma }^{-1}Y\) in \(\mathcal{O}(n\cdot n_{\text{rand}})\) operations. We are thus able to avoid solving linear systems in all subsequent iterations, though the relatively small number of matrix-vector products involved in computing the BLUPs of the latent genetic effects at each step is unavoidable. The requirement of the genotype matrix for computing (9) prevents both L_FOMC_REML and BOLT_LMM from efficiently exploiting precomputed GRMs.
Comparison of methods
We compare theoretical and empirical properties of our proposed algorithms, SLDF_REML and L_FOMC_REML, to those of BOLT_LMM.
Computational complexity
In contrast to BOLT_LMM, the Lanczos-decomposition-based algorithms we have proposed need to perform the computationally demanding operations necessary to evaluate the REML criterion only once. As such, we differentiate between overhead computations, which occur once and do not depend on the number of iterations needed to achieve convergence, and per-iteration computations, which are repeated until the optimization process converges (Table 1 and Fig. 2).
The overhead computations of SLDF_REML are dominated by the need to construct bases for the n_{rand}+c+1 subspaces \(\mathcal {K}(H_{0},[\check {v}_{1},\ldots,\check {v}_{n_{\text {rand}}},x_{1},\ldots,x_{c},y])\), and are thus \(\mathcal {O}\left (n^{2}(n_{\text {rand}}+c)n_\kappa \right)\) when a precomputed GRM is available and \(\mathcal {O}(2m\cdot n(n_{\text {rand}}+c)n_\kappa)\) otherwise. Here, n_{κ} denotes the number of Lanczos iterations needed to achieve convergence at a prespecified tolerance and increases with \(h^{2}_{\max }\). Subsequent iterations are dominated by the cost of solving c+1 shifted linear systems via L_Solve and are thus \(\mathcal {O}(n\cdot c\cdot n_\kappa)\). The overhead computations in L_FOMC_REML are dominated by the Lanczos decompositions corresponding to the 2n_{rand}+1 seed systems, where the GRM is implicitly represented in terms of the standardized genotype matrix, and are thus \(\mathcal {O}(4m\cdot n\cdot n_{\text {rand}} \cdot n_\kappa)\). Operations of equivalent complexity are needed at every iteration of BOLT_LMM.
Numerical experiments
We compared wall clock times for genomic variance component estimation for height in nested random subsets of 16,000, 32,000, 64,000, 128,000, and 256,000 unrelated \((\hat {\pi }<.05)\) European-ancestry individuals from the widely used UK Biobank data set [26]. All analyses included 24 covariates comprising age, sex, and testing center and used hard-called genotypes from the 330,723 array SNPs remaining after enforcing a 1% minor allele frequency cutoff. We compared SLDF_REML, with and without a precomputed GRM, to L_FOMC_REML, which requires the genotype matrix. For the novel algorithms, absolute tolerances for the Lanczos iterations and the REML optimization procedure were set to 5e-5 and 1e-5, respectively. Additionally, we compared our interpreted Python 3.6 code to BOLT-LMM versions 2.1 and 2.3.3 (C++ code compiled against the Intel MKL and Boost libraries) [7, 8, 27, 28]. We ran each algorithm twenty times per condition, trimming away the two most extreme timings in each condition. Mirroring the default settings of the BOLT-LMM software, we set n_{rand}=15 across both of our proposed methods.
Novel algorithms were implemented in the Python v3.6.5 computing environment [9], using NumPy v1.14.3 and SciPy v1.1.0 compiled against the Intel Math Kernel Library v2018.0.2 [28–30]. Optimization was performed using SciPy’s implementation of Brent’s method, with convergence determined via an absolute tolerance on the standardized genomic variance component \(\hat {h}^{2}\). Timing results (Table 2 and Figs. 3 and 5) do not include the time required to read genotypes into memory or, when applicable, to compute GRMs, and reflect total running time on an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz with 32 physical cores and 1 terabyte of RAM. Timing experiments excluded methods with cubic time complexity, including GCTA, FaST-LMM, and GEMMA. Accuracy was assessed by comparing heritability estimates generated by the stochastic algorithms to those estimated via the direct, deterministic average-information Newton–Raphson algorithm as implemented in the GCTA software package v1.92.0b2 [4] (Figs. 4 and 5).
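The trimmed-timing protocol described above can be implemented in a few lines. The helper below is a hypothetical sketch, not the code used in the experiments; it adopts one reasonable reading of "most extreme" (the measurements farthest from the median) and times an arbitrary placeholder workload.

```python
import time
import numpy as np

def trimmed_mean_time(fn, reps=20, n_extreme=2):
    """Time fn() reps times, discard the n_extreme measurements farthest
    from the median, and return the mean of the rest, in seconds."""
    times = np.empty(reps)
    for i in range(reps):
        t0 = time.perf_counter()
        fn()
        times[i] = time.perf_counter() - t0
    # Keep the reps - n_extreme timings closest to the median.
    keep = np.argsort(np.abs(times - np.median(times)))[:reps - n_extreme]
    return float(times[keep].mean())

# Placeholder workload standing in for one REML estimation run.
t = trimmed_mean_time(lambda: np.linalg.norm(np.random.rand(500, 500)))
print(t > 0.0)
```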
Availability of data and materials
The UK Biobank data are available to qualified researchers via the UK Biobank Access Management System (https://bbams.ndph.ox.ac.uk/ams). The code used in the numerical experiments is available on GitHub (https://github.com/rborder/SL_REML).
Abbreviations
BLAS: Basic linear algebra subprograms
BLUP: Best linear unbiased prediction
CG: Conjugate gradients method
GCTA: Genome-wide complex trait analysis [4]
GRM: Genomic relatedness matrix
GWAS: Genome-wide association study
LMM: Linear mixed-effects model
REML: Residual maximum likelihood
SLQ: Stochastic Lanczos quadrature
References
1. Yang J, Zaitlen NA, Goddard ME, Visscher PM, Price AL. Advantages and Pitfalls in the Application of Mixed-Model Association Methods. Nat Genet. 2014; 46(2):100–6.
2. Bates D, Mächler M, Bolker B, Walker S. Fitting Linear Mixed-Effects Models Using lme4. 2014. arXiv preprint arXiv:1406.5823.
3. Zhou X, Stephens M. Genome-Wide Efficient Mixed-Model Analysis for Association Studies. Nat Genet. 2012; 44(7):821–4.
4. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: A Tool for Genome-Wide Complex Trait Analysis. Am J Hum Genet. 2011; 88(1):76–82.
5. Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D. FaST Linear Mixed Models for Genome-Wide Association Studies. Nat Methods. 2011; 8(10):833–5.
6. Zhou X, Stephens M. Efficient Multivariate Linear Mixed Model Algorithms for Genome-Wide Association Studies. Nat Methods. 2014; 11(4):407.
7. Loh PR, Tucker G, Bulik-Sullivan BK, Vilhjálmsson BJ, Finucane HK, Salem RM, et al. Efficient Bayesian Mixed-Model Analysis Increases Association Power in Large Cohorts. Nat Genet. 2015; 47(3):284–90.
8. Loh PR, Kichaev G, Gazal S, Schoech AP, Price AL. Mixed-Model Association for Biobank-Scale Datasets. Nat Genet. 2018; 50:906–8.
9. Border R. Stochastic Lanczos Likelihood Estimation of Genomic Variance Components. Appl Math Grad Theses Dissertations. 2018;120.
10. de los Campos G, Vazquez AI, Fernando R, Klimentidis YC, Sorensen D. Prediction of Complex Human Traits Using the Genomic Best Linear Unbiased Predictor. PLoS Genet. 2013; 9(7):e1003608.
11. Evans LM, Tahmasbi R, Vrieze SI, Abecasis GR, Das S, Gazal S, et al. Comparison of Methods That Use Whole Genome Data to Estimate the Heritability and Genetic Architecture of Complex Traits. Nat Genet. 2018; 50(5):737–45.
12. Searle SR, Casella G, McCulloch CE. Variance Components, vol. 391. United States: Wiley; 2009.
13. Graser HU, Smith SP, Tier B. A Derivative-Free Approach for Estimating Variance Components in Animal Models by Restricted Maximum Likelihood. J Anim Sci. 1987; 64(5):1362–70.
14. Björck A. Numerical Methods in Matrix Computations, vol. 59. Switzerland: Springer; 2015.
15. Atkinson KE. An Introduction to Numerical Analysis. United Kingdom: Wiley; 2008.
16. O’Leary DP. The Block Conjugate Gradient Algorithm and Related Methods. Linear Algebra Appl. 1980; 29:293–322.
17. Frommer A, Maass P. Fast CG-Based Methods for Tikhonov–Phillips Regularization. SIAM J Sci Comput. 1999; 20(5):1831–50.
18. Sogabe T. A Fast Numerical Method for Generalized Shifted Linear Systems with Complex Symmetric Matrices. Recent Dev Num Anal Num Comput Algoritm. 2010:13.
19. Hutchinson MF. A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines. Commun Stat Simul Comput. 1990; 19(2):433–50.
20. Avron H, Toledo S. Randomized Algorithms for Estimating the Trace of an Implicit Symmetric Positive Semi-Definite Matrix. J ACM. 2011; 58(2):8:1–8:34.
21. Golub GH, Meurant G. Matrices, Moments and Quadrature with Applications. Princeton: Princeton University Press; 2009.
22. Ubaru S, Chen J, Saad Y. Fast Estimation of Tr(f(A)) via Stochastic Lanczos Quadrature. SIAM J Matrix Anal Appl. 2017; 38(4):1075–99.
23. Chen J, Saad Y. A Posteriori Error Estimate for Computing Tr(f(A)) by Using the Lanczos Method. 2018. arXiv:1802.04928 [math].
24. Zhu S, Wathen AJ. Essential Formulae for Restricted Maximum Likelihood and Its Derivatives Associated with the Linear Mixed Models. 2018. arXiv:1805.05188 [stat].
25. McCulloch CE, Searle SR, Neuhaus JM. Generalized, Linear, and Mixed Models. Hoboken: Wiley; 2008.
26. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLoS Med. 2015; 12(3):e1001779.
27. Schling B. The Boost C++ Libraries. USA: XML Press; 2011.
28. Wang E, Zhang Q, Shen B, Zhang G, Lu X, Wu Q, et al. Intel Math Kernel Library. In: High-Performance Computing on the Intel Xeon Phi. New York: Springer; 2014. p. 167–88.
29. Oliphant T. NumPy: A Guide to NumPy. 2006.
30. Jones E, Oliphant T, Peterson P, et al. SciPy: Open Source Scientific Tools for Python. 2001.
31. Loh PR, Bhatia G, Gusev A, Finucane HK, Bulik-Sullivan BK, Pollack SJ, et al. Contrasting Genetic Architectures of Schizophrenia and Other Complex Diseases Using Fast Variance Components Analysis. Nat Genet. 2015; 47(12):1385–92.
Acknowledgements
The authors wish to thank UK Biobank participants. Additionally, the authors thank Matthew C. Keller and Luke M. Evans for their thoughtful comments and provision of computational resources. Publication of this article was funded by the University of Colorado Boulder Libraries Open Access Fund.
Funding
Richard Border was supported by a training grant from the National Institute of Mental Health (T32 MH016880) and by the Institute for Behavioral Genetics. Stephen Becker acknowledges funding by NSF grant DMS-1819251.
Author information
Affiliations
Contributions
RB wrote the manuscript, developed the algorithms, wrote the code used in numerical experiments, and analyzed the data. SB supervised the project and contributed to the development of the algorithms and the writing of the manuscript. Both authors read and approved the final manuscript.
Corresponding author
Correspondence to Richard Border.
Ethics declarations
Ethics approval and consent to participate
UK Biobank data collection procedures were approved by the UK Biobank Research Ethics Committee (reference 11/NW/0382).
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Proof of result used to efficiently compute the quadratic form (7). (PDF 125 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Keywords
 GWAS
 Linear mixedeffects models
 Variance components
 REML
 Conjugate gradients
 Stochastic trace estimation
 Stochastic Lanczos quadrature