 Methodology article
 Open access
 Published:
Reliable heritability estimation using sparse regularization in ultrahigh dimensional genomewide association studies
BMC Bioinformatics volume 20, Article number: 219 (2019)
Abstract
Background
Data from genomewide association studies (GWASs) have been used to estimate the heritability of human complex traits in recent years. Existing methods are based on the linear mixed model, with the assumption that the genetic effects are random variables, which is opposite to the fixed effect assumption embedded in the framework of quantitative genetics theory. Moreover, heritability estimators provided by existing methods may have large standard errors, which calls for the development of reliable and accurate methods to estimate heritability.
Results
In this paper, we first investigate the influences of the fixed and random effect assumption on heritability estimation, and prove that these two assumptions are equivalent under mild conditions in the theoretical aspect. Second, we propose a twostage strategy by first performing sparse regularization via crossvalidated elastic net, and then applying variance estimation methods to construct reliable heritability estimations. Results on both simulated data and real data show that our strategy achieves a considerable reduction in the standard error while reserving the accuracy.
Conclusions
The proposed strategy allows for a reliable and accurate heritability estimation using GWAS data. It shows the promising future that reliable estimations can still be obtained with even a relatively restricted sample size, and should be especially useful for largescale heritability analyses in the genomics era.
Background
Heritability measures how much the variation of a phenotypic trait in a population is caused by the genetic variation among individuals in that population. It has two specific types of definition: the broad sense and the narrow sense. The narrowsense heritability is of more importance in genetic applications, which is defined as the ratio of the additive genetic variance to the total phenotypic variance [1]. With the tremendous technological advances in genomewide association studies (GWASs) in the last few decades, hundreds of thousands of genetic markers for individuals have been discovered, usually single nucleotide polymorphisms (SNPs), aiming to explore the genetic architecture of human complex traits. Heritability based on GWASs, termed as the SNP heritability [2], has been serving as a more and more critical measure in this exploration, and can guide downstream analysis on more specific biological questions. Hereinafter, we consider the SNP heritability unless otherwise specified.
Traditional approaches to estimating narrowsense heritability are based on twin or pedigree studies, in which genetic variance can be estimated from phenotypic similarity between relatives; see, e.g., [1, 3] and references therein. But in practice, it is rather difficult to completely partition the genetic variance from the variance resulted from shared common environmental factors, as relatives often share similar genes and are more likely to be raised in similar environment [4]. In modern GWASs, designs based on a population sample of unrelated people help to overcome the confounding of genes and environment, with the SNP heritability being viewed as a lower bound for the narrowsense heritability. However, for most traits the declared highly significant SNPs fail to capture all the genetic variance; see, e.g., [5, 6]. This has been referred to as the “missing heritability” problem [7, 8]. To address this gap, researchers in [9] developed the software genomewide complex trait analysis (GCTA) to estimate the SNP heritability without the requirement that individual SNPs are significant, arriving at a higher lower bound for the narrowsense heritability [10]. Recently, computing tools such as BOLTREML [11], BayesR [12], and massively expedited genomewide heritability analysis (MEGHA) [13] have been developed to achieve a higher speed. These works make use of the linear mixed model (LMM) to consider all SNPs across the genomewide average, assuming that the genetic effects are random variables and the genotypes are fixed quantities.
However, from the framework of quantitative genetics theory, the effects of genetic markers on a trait are fixed quantities, and genetic variance stems from variation at quantitative trait locus (QTL) genotypes [1, 14]. What is the difference between the fixed and random effect assumption? Does it matter which assumption is used to estimate heritability? This motivates us to investigate the two assumptions in order to compare their influences on heritability estimation. Moreover, heritability estimators produced by GCTA and following tools may have large standard errors, which is especially the case in the field of imaging genetics, where the sample size cannot increase arbitrarily due to high costs; see, e.g., [15–17]. This stimulates the main focus of our work to construct reliable estimators for heritability with smaller standard errors in the ultrahigh dimensional scenario. The main contributions of this paper are as follows. First, we investigate the influences of the fixed and random effect assumption on heritability estimation, and prove that these two assumptions are equivalent under mild conditions in the theoretical aspect. Second, former GWASs have pointed out that the number of SNPs with nonzero effects that are associated with a given disease or a trait may be relatively small or moderate (e.g., ∼10^{3}), though the whole number of SNPs is usually very large (e.g., 10^{5}∼10^{6}) [18, 19]. In other words, not all SNPs are causal (strictly speaking, here “causal SNPs” just refer to SNPs with nonzero effects), or at least not all SNPs are in perfect linkage disequilibrium (LD) with QTL. In a statistical terminology, the underlying true model is sparse. Therefore, we make use of the underlying sparse structure of GWAS data, and propose a twostage strategy by first performing sparse regularization via crossvalidated elastic net and then applying certain variance estimation methods to construct reliable heritability estimations. Results from simulated data and real neuroanatomical data from the IMAGEN project show that our strategy can provide estimators with a considerable reduction in the standard error while retaining the accuracy. The results demonstrate the promising capability of our strategy for largescale heritability analyses in the genomics era, especially in the field of imaging genetics, where the sample size is usually limited nowadays.
Methods
We begin this section by first introducing some definitions and notations for future reference. For 0<q<+∞, the ℓ_{q} norm of a vector \(u\in \mathbb {R}^{n}\) is defined as \(\u\_{q}:=\left (\sum _{i=1}^{n} u_{i}^{q}\right)^{1/q}\). We say that u=0 if u_{i}=0 for all i=1,2,⋯,n. For m≥1, let \(\mathbb {I}_{m}\) stand for the m×m identity matrix. For a matrix \(W\in \mathbb {R}^{m\times n}\), we use W_{ij} (i=1,2,⋯,m,j=1,2,⋯,n) to denote its ijth entry, W_{i·} (i=1,2,⋯,m) to denote its ith row, and W_{·j} (j=1,2,⋯,n) to denote its jth column. For any index set M⊆{1,2,⋯,n}, we use u_{M} to denote the subvector containing the components of the vector u that are indexed by M, and W_{M} to denote the submatrix containing the columns of the matrix W that are indexed by M.
Model
In this paper, we consider the following sparse linear model to approximate the true underlying model in GWASs,
where \(y\in \mathbb {R}^{m}\) is a vector of observations, W is an m×n (m≪n) design matrix storing the SNP information, \(u^{*}\in \mathbb {R}^{n}\) is the unknown vector representing the SNP effects with s (s≤n) nonzero entries, and e is a vector of residual effects with \(e\sim \mathcal {N}(0,\sigma _{e}^{2}\mathbb {I}_{m})\). The true model is denoted as \(M_{0}:=\{j:u^{*}_{j}\neq 0\}\). Then the cardinality of the true model M_{0}=s represents the number of causal SNPs of a given trait. The sparsity level is defined as γ:=s/n, which may be high or low according to the trait studied. When there are other covariates (such as overall mean, sex and age) to be considered, we simply apply the method proposed in [20], which projects out the nuisance variables (covariates).
Then we state two assumptions regarding the model Eq. 1.
Fixed Effect Assumption. This assumption is consistent with the quantitative genetics paradigm. We now specify it in the sparse scenario as follows: (i). The rows of the design matrix W_{1·},W_{2·},⋯,W_{m·} are independent and identically distributed random vectors with mean \(\mathbb {E}(W_{1 \cdot })=\mathbf {0}\) and n×n positive definite covariance matrix Σ=Cov(W_{1·}); (ii). The residuals e_{1},e_{2},⋯,e_{m} are independent of the design matrix W; (iii). The vector u consists of fixed quantities with supp(u^{∗})=M_{0}. Here the assumed covariance structure of W_{i·} is used to characterize the correlations between the n SNPs.
Random Effect Assumption. Recently, researchers in [9, 10] made use of this assumption to solve the “missing heritability” problem. We also endow it with the sparse structure as follows: (i). \(\{u^{*}_{j}:j\in M_{0}\}\) are a set of independent and identically distributed Gaussian random variables with mean 0 and variance \(\sigma _{u}^{2}\); (ii). For any i∈{1,2,·,m} and j∈M_{0}, e_{i} is independent of \(u^{*}_{j}\); (iii). The design matrix W is made up with fixed entries.
We now describe W in detail under the context of genetics. Noting from the facts that in GWAS each SNP is regarded as a binomial random variable with two trials, and that the success probability is defined as “reference allele frequency”, the entries of the design matrix W can be formulated by another matrix Z in the following way:
where the matrix Z stores the original genetic information in a population. Concretely speaking, the genotype of each SNP is coded in this way: Z_{ij}=0 (resp. 1, resp. 2) if the genotype of the i^{th} individual at locus j is bb (resp. Bb, resp. BB), and p_{j} is the frequency of the reference allele at locus j. After being constructed as above, W is the standard genotype matrix with each column/row standardized to have zero mean and unit variance.
Then we are at the stage to define heritability under the two assumptions on the model Eq. 1. Recall the definition that heritability measures the fraction of variation of a given trait that can be explained by variation of genetic markers among individuals in a population. For the fixed effect assumption, let \(\tau ^{2}=u^{*T}\Sigma u^{*}=\left \\Sigma ^{1/2}u^{*}\right \_{2}^{2}\), which represents a measure of total genetic variance attributed to causal SNPs. With the residual variance \(\sigma _{e}^{2}=\text {Var}(e_{i})\), we can naturally define the heritability as the proportion of explained variance in the linear model Eq. 1:
For the random effect assumption, which has been investigated by many authors [9, 21], the heritability is defined as:
The following proposition tells us that Eq. 3 is equivalent to Eq. 4 under the assumption that the nonzero genetic effects \(\{u^{*}_{j}:j\in M_{0}\}\) are independently drawn from a prior distribution. Under this assumption, in order to guarantee that the total genetic variance τ^{2} is still a fixed quantity, we make a slight modification to take expectation over the distribution of u^{∗}, that is, \(\tau ^{2}=\mathbb {E}_{u^{*}}\left (u^{*\top }\Sigma u^{*}\right)\).
Proposition 1
Suppose that the nonzero genetic effects \(\{u^{*}_{j}:j\in M_{0}\}\) are independently drawn from a prior distribution with mean 0 and variance \({Var}(u_{j}^{*})=\sigma _{u}^{2}\), and that for any i∈{1,2,⋯,m}, \(W_{i \cdot }^{\top }\) and u^{∗} are independent. Then \(h_{{fixed}}^{*}=h_{{rand.}}^{*}\).
Proof
For any i∈{1,2,⋯,m} fixed, the total genetic variance attributed to causal SNPs is □
where the third equality is from the fixed effect assumption (i) that \(\mathbb {E}_{W_{i \cdot }}(W_{i \cdot })=\mathbf {0}\), and the last equality is from the definition of the conditional expectation. It then follows the assumptions that \(\{u^{*}_{j}:j\in M_{0}\}\) are independent, and \(W_{i \cdot }^{\top }\) and u^{∗} are independent that
By the assumption that for j∈M_{0}, \(\mathbb {E}\left (u^{*}_{j}\right)=0\) and \(\text {Var}(u^{*}_{j})=\sigma _{u}^{2}\), one has that \(\mathbb {E}_{u^{*}}\left [(u^{*}_{j})^{2}\right ]=\sigma _{u}^{2}\). Then substituting Eq. 6 into Eq. 5, we obtain that
Since {W_{ij}:j∈M_{0}} are a set of centralized and normalized random variables with zero mean and unit variance by Eq. 2, we have that \(\sum _{j\in M_{0}}\mathbb {E}\left (W_{ij}^{2}\right)=\sum _{j\in M_{0}}\text {Var}\left (W_{ij}\right)=s\), and finally arrive at that \(\tau ^{2}=s\sigma _{u}^{2}\). It then follows immediately from Eq. 3 and Eq. 4 that, \(h_{\text {fixed}}^{*}=h_{\text {rand.}}^{*}\). The proof is complete.
Implementation
In this subsection, we introduce our twostage strategy to estimate heritability, which consists of a sparse regularization step followed by a variance estimation step.
Before a detailed description of the strategy, let us assume that the sample size m is even for simplicity. Then the original data set (y,W) is randomly split into two disjoint data sets (y^{(1)},W^{(1)}) and (y^{(2)},W^{(2)}) with equal samples. Without loss of generality, the following sparse regularization step is performed on (y^{(1)},W^{(1)}) to reduce the model, while the variance estimation step is applied on (y^{(2)},W^{(2)}). In doing so, it is guaranteed that sparse regularization and variance estimation are performed on independent samples. We explain the reason for using independent samples at the end of this section.
Sparse regularization
Recall the model Eq. 1, which is a seriously illconditioned linear system with far fewer samples than variables (SNPs). Thus there exists no unique solution for the effect vector, and the problem of nonidentifiability appears. Fortunately, with the sparse assumptions mentioned above, the popular and practical regularization technique is applicable, which has been extensively studied for high dimensional linear models in the past decade; see, e.g., [22–24] and references therein.
Since in reality, one has no prior knowledge on the amount of each effect, the sparse regularization technique is required to be flexible to both small and large effects. In this paper, we adopt the elastic net [24] as our sparse regularization method. More precisely, we solve the following optimization problem:
where α∈(0,1] represents the weight of Lasso [23] versus ridge [25] regularization, and λ>0 is the regularization parameter providing a tradeoff between accuracy and sparsity.
Here the parameter α is used to adapt to different sparsity levels. For high sparsity level, it is chosen to approach 1, while for lower sparsity level, it is chosen to be smaller. Though the real genetic architecture of a given trait is generally unknown, some prior knowledge may be used to roughly determine the value of α. A suitable choice of λ is critical as its value might strongly affect the set of variables selected. We here proceed to use the kfold crossvalidation to reduce the influence of false variables, and choose suitable values for α and λ.
In practice, we fit the optimization problem Eq. 8 by implementing the MATLAB function “lasso” (https://www.mathworks.com/help/stats/lasso.html), which is designed for Lasso or elastic net regularization of linear models. Specifically, we first define a set Ω corresponding to the domain of α. Then for each α∈Ω fixed and a set of regularization parameters λ predefined, we perform 10fold crossvalidation and choose the smallest λ that is within one standard error of minimum prediction mean squared error (MSE), as is shown for instance in blue dashed line in Fig. 1a. Finally, we determine the value of α to be the one that meets the minimum prediction MSE across the set Ω. After the parameters α and λ have been determined, the selected model is denoted as \(\hat {M}=\{j:\hat {u}_{j}\neq 0\}\), and the number of selected variables is \(\hat {n}=\hat {M}\), where \(\hat {u}\) is the optimal solution to Eq. 8.
Variance estimation
Now some certain variance estimation methods are applied to \(\left (y^{(2)},W^{(2)}_{\hat {M}}\right)\). Recall that at this time, the sample size is only m/2.
For the fixed effect assumption, as \(m/2>\hat {n}\) cannot be guaranteed, the problem might still be high dimensional. There are three notable works [26–28] considering variance estimation in high dimensional linear regression, among which the latter two rely strongly on the sparsity assumption on the model while the first one does not. Since the number of causal SNPs might vary from moderate (e.g., 10^{2}∼10^{3}) to large (e.g., 10^{4}∼10^{5}), the method must be stable with respect to the sparsity level. Moreover, as it is realistic that different SNPs are usually not independent, the method should also be capable of handling the case where there exist correlations between the SNPs. Therefore, we choose to use the method proposed in [26, section 4.2], which is based on the methodofmoment and is applicable to the correlated case. Two estimators for τ^{2} and \(\sigma _{e}^{2}\) are constructed as follows:
where
Note that when \(m/2>\hat {n}\) and \(W_{\hat {M}}\) has full rank, these two estimators are quite similar to the estimators obtained by ordinary least squares. Thus we arrive at a plugin estimator for \(h_{\text {fixed}}^{*}\):
For the random effect assumption, we simply apply the widelyused software GCTA [9], which implements the maximum likelihood method, with \(\left (y^{(2)},W^{(2)}_{\hat {M}}\right)\) as the input to obtain estimators for variance components. Other tools such as BOLTREML [11] or MEGHA [13] are of course applicable. The final estimator for \(h_{\text {rand.}}^{*}\) is denoted as \(\hat {h}_{\text {rand.}}\).
Since the true heritability always belongs to (0,1), once \(\hat {h}_{\text {fixed}}\) or \(\hat {h}_{\text {rand.}}\) is smaller than 0 or larger than 1, it is constrained to a value equal to 0.0001 or 0.9999, respectively. Nevertheless, as is shown by numerical results in the next section, performing a sparse regularization step first can perfectly restrict the obtained estimators to lie in (0,1).
To understand the behavior of the heritability estimator produced by our twostage strategy, we make a decomposition of the bias and variance of the estimator. We only use \(\hat {h}_{\text {rand.}}\) here so as to simplify the illustration, and \(\hat {h}_{\text {fixed}}\) can also produce the same result. The corresponding result is displayed in Fig. 1b. Recall that λ is chosen to be the smallest one that is within one standard error of minimum MSE in section 2.2.1, as is shown in blue dashed line in Fig. 1a and b. We can see from Fig. 1b that when λ is too small and the selected model contains many redundant variables, though the heritability estimator is almost unbiased, its variance is large. Our choice of λ guarantees that the heritability estimator is not only almost unbiased but also with a smaller variance. The performance of our strategy will be demonstrated in detail in the next section.
Now let us turn to illustrate the reason for using independent samples in the proposed twostage strategy. Assume that we are in the case where there are 10 causal SNPs out of total 10000 SNPs. Then Fig. 1c plots the heritability estimators versus the regularization parameter λ which represents the model selection process. We only use \(\hat {h}_{\text {rand.}}\) here so as to simplify the illustration, and \(\hat {h}_{\text {fixed}}\) can also produce the same result. The training set is used to select the model, and then variance estimation is completed on the training set and the validation set, respectively. Therefore, estimators in the training set are obtained with nonindependent samples, and estimators in the validation set are obtained with independent samples. When the selected model contains too many redundant variables, its generalization ability is poor, and estimators produced by the training set are usually overestimated. As λ becomes larger, the selected model becomes more sparse, and the generalization ability of the selected model increases. Therefore, using samples independent of those used in model selection to estimate variance guarantees that even if the selected model is not sparse enough, the heritability won’t be overestimated. Otherwise, if model selection and variance estimation are done on the same sample set, the heritability is more likely to be overestimated. Hence, we suggest that model selection and variance estimation should be performed on independent samples to reduce overestimation.
Simulated data
The simulated genotype data are generated via the R package “echoseq” (https://github.com/hruffieux/echoseq) [29]. Specifically, the genotype matrix W is generated with correlated columns based on generally accepted principles of population genetics (Hardy–Weinberg equilibrium, linkage disequilibrium, and natural selection). The sparse effect vector \(u^{*}\in \mathbb {R}^{n}\) is generated by choosing s indices at random according to a \(\mathcal {N}(0,1/s\mathbb {I}_{s})\) distribution, with different s being chosen for given n. The noise vector e is set as Gaussian with mean 0 and covariance matrix \(\sigma _{e}^{2}\mathbb {I}_{m}\), with \(\sigma _{e}^{2}\) representing the noise level. This generation process ensures that the simulated data behave like real genotype data. The observations y are then obtained via the model Eq. 1. The true value of heritability is approximated by
We see in the following simulations that the sample standard error of the approximation \(\tilde {h}^{*}\) is so small that can be ignored.
Real data from the IMAGEN project
Brain imaging scans were obtained from a cohort of 2089 adolescents (14.5±0.4 years old, 51% females) from the IMAGEN project (http://imageneurope.com) using a standardised 3T, T1weighted gradient echo protocol in eight European centres [30]. Genotype data were obtained using the Illumina 610Quad and Illumina 660WQuad chips, and then preprocessed using PLINK 1.90 (https://www.coggenomics.org/plink2) [31]. We excluded SNPs that did not satisfy the following quality control criteria: genotype call rate ≥99%, minor allele frequency ≥1%, and HardyWeinberg equilibrium P≥1×10^{−6}. After quality control, we finally used 225139 SNPs across the 22 autosomes genotyped on 1765 participants.
Results
The purpose of this section is to carry out several experiments and demonstrate results on the heritability estimation problem for both simulated data and real data from the IMAGEN project. All experiments are performed in MATLAB R2014b and executed on a computer with the following configuration: Intel(R) Xeon(R) CPU E52630 v2, 12×2.60 GHz, 126 GB of RAM. The runtime for model selection is about 10 minutes and the required memory is about 8GB, with a data set including 1000 samples and 100000 SNPs, whose scale is close to that of real data. The following variance estimation step takes only a few seconds.
Simulations on the fixed and random effect assumptions
To compare the influences of the fixed and random effect assumptions, the estimators \(\hat {h}_{\text {fixed}}\) and \(\hat {h}_{\text {rand.}}\) as well as the approximated true heritability \(\tilde {h}^{*}\) are estimated under different noise levels \(\sigma _{e}^{2}\in \{4,1,0.25\}\) and under the case where all the SNPs have nonzero effects for simplicity, that is s=n and M_{0}={1,2,⋯,n}.
The corresponding boxplot is displayed in Fig. 2a. We can see from this figure that both the estimators \(\hat {h}_{\text {fixed}}\) and \(\hat {h}_{\text {rand.}}\) are almost unbiased, and that the approximation of the true heritability \(\tilde {h}^{*}\) behaves well with small deviation that can be ignored. Moreover, it is also demonstrated that the fixed and random effect assumptions produce similar estimators.
Simulations on the sample sizes and SNP sizes
To simplify our expositions, the following simulations are carried out under the case in which all the SNPs have nonzero effects, that is s=n and M_{0}={1,2,⋯,n}. The noise level \(\sigma _{e}^{2}\) is set equal to 1.
Firstly, we illustrate the performance of both estimators \(\hat {h}_{\text {fixed}}\) and \(\hat {h}_{\text {rand.}}\) under different sample sizes m∈{300,1000,3000}. The corresponding boxplot is displayed in Fig. 2b. We can see from this figure that the smaller the sample size, the larger the standard errors for both two estimators \(\hat {h}_{\text {fixed}}\) and \(\hat {h}_{\text {rand.}}\). In the case where the number of samples is relatively small, it is more likely to obtain many estimators reaching the boundaries 0 and 1, thus leading to estimations that are rather unreliable. Thus, when dealing with real GWAS data, the sample size should be as large as possible. This requirement can be easily satisfied for phenotypes like height and body mass index, while for phenotypes related to imaging genetics such as whole brain volume, it is not always the case. The lack of samples makes it a hard problem to estimate the heritability of these phenotypes.
Secondly, we illustrate the performance of both estimators \(\hat {h}_{\text {fixed}}\) and \(\hat {h}_{\text {rand.}}\) under different numbers of total SNPs n∈{1000,3000,10000}. The corresponding boxplot is displayed in Fig. 2c. We can see from this figure that the larger the number of SNPs, the larger the standard error of both estimators \(\hat {h}_{\text {fixed}}\) and \(\hat {h}_{\text {rand.}}\). This indicates that as the problem dimension gets larger, it becomes more difficult to obtain estimators with smaller standard errors. Thus in a typical GWAS, where the dimension is always thousands of hundreds while the number of samples cannot grow arbitrarily, the estimators should be treated carefully, since they may have large standard errors and lead to unreasonable results.
Simulations on sparsity
To elucidate the importance of sparsity, both estimators \(\hat {h}_{\text {fixed}}\) and \(\hat {h}_{\text {rand.}}\) are estimated under different numbers of the causal SNPs s∈{100,1000,10000}. We use the oracle estimators corresponding to the fixed and random effect assumptions for comparisons, whose values are calculated via \(\hat {h}_{\text {fixed}}\) and \(\hat {h}_{\text {rand.}}\), respectively, with the oracle \(\hat {M}=M_{0}\) known in advance. The noise level \(\sigma _{e}^{2}\) is set equal to 1.
The corresponding boxplot is displayed in Fig. 2d. It has been shown in [21] that, when there are many nonzero entries contained in the effect vector, the estimators can still be unbiased even though the model is misspecified. However, the standard errors of these estimators are so large that cannot be accepted, as is shown in the case where s=100,1000. On the other hand, we can see from the oracle estimators that when the sparsity of u^{∗} is taken into consideration, the corresponding standard errors have been greatly reduced, resulting in more reliable estimations. In practice, since the set of causal SNPs is usually unknown, it is necessary to approximate the sparsity pattern of the effect vector M_{0} as close as possible before variance estimation.
Simulations on the performance of the proposed strategy
To illustrate the performance of the proposed twostage strategy, both estimators \(\hat {h}_{\text {fixed}}\) and \(\hat {h}_{\text {rand.}}\) are estimated under different problem sizes. The oracle estimators are also used for comparisons with the oracle \(\hat {M}=M_{0}\) known in advance. The noise level \(\sigma _{e}^{2}\) is set equal to 1. The corresponding boxplots are displayed in Fig. 3.
We can see from Fig. 3 that, no matter in the highly sparse case or the more polygenic scenario, our twostage strategy improves the performance of these estimators in the sense that the corresponding standard errors have been reduced considerably compared to those obtained without considering the sparsity structure. Moreover, when the sparsity level of underlying model is high, as displayed in Fig. 3a and b, our strategy is so impressive that it produces estimators performing as well as the oracle estimators, especially under the random effect assumption. In addition, we find that when there exist correlations between the SNPs and the problem dimension n is high (e.g., n=100000), the performance of the estimator \(\hat {h}_{\text {fixed}}\) without considering the sparsity is somewhat undesirable in the sense that the standard error is too large to be acceptable, while the sparse regularization step reduces the standard error considerably. This result implies that our method is robust in the presence of correlations between the columns of W, and can be applied to the cases where LD exists.
Simulations on real data from the IMAGEN project
We apply our twostage strategy to estimate the heritability of height and the volume of neuroanatomical structures, specifically, the nucleus accumbens (Acc), amygdala (Amy), caudate nucleus (Ca), hippocampus (Hip), globus pallidus (Pa), putamen (Pu), and thalamus (Th).
As is widelyacknowledged that most human complex traits are generally polygenic and the corresponding heritability is largely captured by common SNPs [10, 32], the sparsity level cannot be too high in reality. Therefore, in the sparse regularization stage, we set the parameter α∈{3×10^{−5},10^{−4},3×10^{−4},10^{−3}} in Eq. 8. In the variance estimation stage, the heritability is estimated under the random effect assumption. The standard error of the estimated heritability is approximated using the delta method [33]. The final results are displayed in Tab. 1 with the original results displayed in Additional file 1: Table S1. As far as we know, the heritability of these phenotypes from the IMAGEN project has also been estimated in [17] using GCTA, so Tab. 1 also includes their results for comparison.
We can see from Tab. 1 that the heritability estimated by our twostage strategy is consistent with that reported in [17] on the same data set, with a considerably smaller standard error. This is especially the case for the volumes of Acc, Ca, Pa, and Th, where the corresponding standard error has been greatly reduced. In a word, our strategy can not only provide accurate estimations but also improve the reliability of the estimators in the sense that the standard error is reduced.
In addition to demonstrating the performance of our strategy, we analyse the heritability of average cortical thickness measures in 68 regions of interest (ROIs; 34 ROIs per hemisphere) defined by the DesikanKilliany atlas [34]. The corresponding results are shown in Additional file 1: Table S2. Many estimators obtained using GCTA reach the boundaries (i.e., 0.0001 or 0.9999), which is of course unreasonable, while our strategy overcomes this obstacle to some extent in the sense that most of the estimators are perfectly restricted to the boundary set, leading to more stable and reliable results.
Discussion
In this paper, we compared the fixed and random effect assumption in detail from both theoretical and practical aspects. In the theoretical aspect, we proved that the definitions of heritability are equivalent under mild conditions for both the fixed and random effect assumptions. In the practical aspect, our results demonstrated that both assumptions worked well, and produced similar estimators. However, when there exist correlations between the SNPs and the problem dimension n is high (e.g., n=100000), the performance of the estimator \(\hat {h}_{\text {fixed}}\) is quite undesirable. Therefore, we recommended that \(\hat {h}_{\text {rand.}}\) should be used in the real data analysis.
In modern GWASs, it has been pointed out in [18, 19, 35] that the sparsity structure usually exists in the ultrahigh dimensional genomic data. And our results on simulated data demonstrated that when the sparsity is considered, the standard errors of the heritability estimators had been greatly reduced (Fig. 2d). Therefore, it is quite necessary to take the sparsity structure into consideration and remove the redundant SNPs which are not related to the phenotype in heritability analyses. In practice, the set of causal SNPs is usually unknown, one needs to approximate the sparsity pattern as close as possible before variance estimation.
We proposed a twostage strategy by first performing sparse regularization using crossvalidated elastic net to select the model, and then applying certain variance estimation methods on the reduced model. Due to the fact that in the context of GWASs, there always exists a strong correlation between the explanatory variables (i.e., the SNPs) [36], attention is needed to the potential correlation structure between the SNPs when selecting the model. The elastic net [24] is especially powerful in the case where the pairwise correlations between variables may be high, and is more flexible to different sparsity levels. Moreover, the special structure of its regularization term, which is a linear combination of the Lasso [23] and the ridge [25] regression, enables one to simultaneously consider and balance two competing hypotheses that are usually used for explaining the underlying genetic architecture of human complex traits: common diseaserare variant hypothesis and common diseasecommon variant hypothesis [37], which address that, for some complex traits heritability may be explained by a small number of rare variants each with a large effect, while for other traits it may be explained by a large number of common variants with small effects. In a word, the elastic net can jointly balance the very sparse case and the more polygenic case.
Results from simulated data implied that our strategy produced estimators with considerably smaller standard errors than those obtained via methods without considering the sparsity (Fig. 3), leading to more reliable results for explanations. Moreover, we found that the performance of our strategy is more impressive when the sparsity level is high, in the sense that estimators obtained by our strategy behaves as well as the oracle estimators (Fig. 3a and b). This result points out a new prospect to analyse the complex genetic structure of some diseases that are caused by a few SNPs. Results from real data achieved estimations for the heritability of human height as well as the volumes of some neuroanatomical structures, which are consistent with former works [10, 17, 32] with smaller standard errors. In contemporary genomics, the sample size is usually limited due to physical or economical constraints, which is especially the case for brain imaging phenotypes. Therefore, our results show the promising future that reliable estimations can still be obtained with even a relatively restricted sample size.
While we are working on this paper, we became aware of an independent work [38]. Our contributions are substantially different from theirs, in that we perform variance estimation on a sample set independent of that used for model selection so as to avoid overestimation, while their variable selection and variance estimation steps are done on the same sample set. In addition, our sparse regularization technique is the elastic net, which is applicable in both the very sparse case and the more polygenic scenario, whereas they perform the variable selection through the sure independence screening approach followed by a Lasso criterion, resulting in a highly sparse model.
Conclusion
We have considered the potential sparse structure of GWAS data, and proposed a twostage strategy to produce reliable heritability estimations. Results on simulated data and real data demonstrate the promising future of our strategy for ultrahigh dimensional heritability analyses with even a relatively restricted sample size. Due to the fact that model selection consistency cannot be achieved unless certain strong conditions are satisfied (see, e.g., [39, 40]), the estimated heritability is actually the genetic variance attributed to the selected SNPs, and thus is indeed a lower bound for SNP heritability. Future directions of research may generalize our strategy to more precise models that can capture other underlying sophisticated structures of human complex traits, such as genegene and geneenvironment interactions, to provide better estimations for heritability. In addition, it would be interesting to use our strategy in gene discovery and prediction analyses of complex traits.
Abbreviations
 Acc:

Nucleus accumbens
 Amy:

Amygdala
 Ca:

Caudate nucleus
 GCTA:

Genomewide complex trait analysis
 GWAS:

Genomewide association study
 Hip:

Hippocampus
 LD:

Linkage disequilibrium
 LMM:

Linear mixed model
 MEGHA:

Massively expedited genomewide heritability analysis
 MSE:

Mean squared error
 Pa:

Globus pallidus
 Pu:

Putamen
 QTL:

Quantitative trait locus
 ROI:

Region of interest
 SNP:

Single nucleotide polymorphism
 Th:

Thalamus
References
Falconer DS. Introduction to Quantitative Genetics. Uttar Pradesh: Pearson Education India; 1975.
Speed D, Cai N, Johnson MR, Nejentsev S, Balding DJ, Consortium U. Reevaluation of SNP heritability in complex human traits. Nat Genet. 2017; 49(7):986.
Visscher PM, Medland SE, Ferreira MAR, Morley KI, Zhu G, Cornes BK, Montgomery GW, Martin NG. Assumptionfree estimation of heritability from genomewide identitybydescent sharing between full siblings. PLoS Genet. 2006; 2(3):41.
Vinkhuyzen AAE, Wray NR, Yang J, Goddard ME, Visscher PM. Estimation and partition of heritability in human populations using wholegenome analysis methods. Annu Rev Genet. 2013; 47:75–95.
Gudbjartsson DF, Walters GB, Thorleifsson G, Stefansson H, Halldorsson BV, Zusmanovich P, Sulem P, Thorlacius S, Gylfason A, Steinberg S, et al. Many sequence variants affecting diversity of adult human height. Nat Genet. 2008; 40(5):609–15.
Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M, Freathy RM, Perry JR, Stevens S, Hall AS, et al. Genomewide association analysis identifies 20 loci that influence adult height. Nat Genet. 2008; 40(5):575–83.
Maher B. Personal genomes: The case of the missing heritability. Nat News. 2008; 456(7218):18–21.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al. Finding the missing heritability of complex diseases. Nature. 2009; 461(7265):747–53.
Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: A tool for genomewide complex trait analysis. Am J Hum Genet. 2011; 88(1):76–82.
Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010; 42(7):565–9.
Loh PR, Bhatia G, Gusev A, Finucane HK, BulikSullivan BK, Pollack SJ, de Candia TR, Lee SH, Wray NR, Kendler KS, et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variancecomponents analysis. Nat Genet. 2015; 47(12):1385–92.
Moser G, Lee SH, Hayes BJ, Goddard ME, Wray NR, Visscher PM. Simultaneous discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model. PLoS Genet. 2015; 11(4):1004969.
Ge T, Nichols TE, Lee PH, Holmes AJ, Roffman JL, Buckner RL, Sabuncu MR, Smoller JW. Massively expedited genomewide heritability analysis (MEGHA). Proc Natl Acad Sci. 2015; 112(8):2479–84.
Lynch M, Walsh B. Genetics and Analysis of Quantitative Traits. Sunderland: Sinauer Associates; 1998.
Bryant C, Giovanello KS, Ibrahim JG, Chang J, Shen D, Peterson BS, Zhu HT. Mapping the genetic variation of regional brain volumes as explained by all common SNPs from the ADNI study. PLoS One. 2013; 8(8):71723.
Kumar SK, Feldman MW, Rehkopf DH, Tuljapurkar S. Limitations of GCTA as a solution to the missing heritability problem. Proc Natl Acad Sci. 2016; 113(1):61–70.
Toro R, Poline JB, Huguet G, Loth E, Frouin V, Banaschewski T, Barker GJ, Bokde A, Büchel C, Carvalho FM, et al. Genomic architecture of human neuroanatomical diversity. Mol Psychiatry. 2015; 20(8):1011–6.
Ripke S, Neale BM, Corvin A, Walters JTR, Farh KH, Holmans PA, Lee P, BulikSullivan B, Collier DA, Huang H, et al. Biological insights from 108 schizophreniaassociated genetic loci. Nature. 2014; 511(7510):421.
Stahl EA, Wegmann D, Trynka G, GutierrezAchury J, Do R, Voight BF, Kraft P, Chen R, Kallberg HJ, Kurreeman FAS. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nat Genet. 2012; 44(5):483–9.
Patterson HD, Thompson R. Recovery of interblock information when block sizes are unequal. Biometrika. 1971; 58(3):545–54.
Jiang JM, Li C, Paul D, Yang C, Zhao HY. On highdimensional misspecified mixed model analysis in genomewide association study. Ann Stat. 2016; 44(5):2127–60.
Hu YH, Li C, Meng KW, Qin J, Yang XQ. Group sparse optimization via ℓ _{p,q} regularization. J Mach Learn Res. 2017; 18(30):1–52.
Tibshirani R. Regression shrinkage and selection via the Lasso. J Royal Stat Soc B. 1996; 58(1):267–88.
Zou H, Hastie T. Regularization and variable selection via the elastic net. J Royal Stat Soc B. 2005; 67(2):301–20.
Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970; 12(1):55–67.
Dicker LH. Variance estimation in highdimensional linear models. Biometrika. 2014; 101(2):269–84.
Fan JQ, Guo SJ, Hao N. Variance estimation using refitted crossvalidation in ultrahigh dimensional regression. J Royal Stat Soc B. 2012; 74(1):37–65.
Sun TN, Zhang CH. Scaled sparse linear regression. Biometrika. 2012; 99(4):879–98.
Ruffieux H, Davison AC, Hager J, Irincheeva I. Efficient inference for genetic association studies with multiple outcomes. Biostatistics. 2017; 18(4):618–36.
Schumann G, Loth E, Banaschewski T, Barbot A, Barker G, Büchel C, Conrod PJ, Dalley JW, Flor H, Gallinat J. The IMAGEN study: reinforcementrelated behaviour in normal brain function and psychopathology. Mol Psychiatry. 2010; 15(12):1128–39.
Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ. Secondgeneration PLINK: Rising to the challenge of larger and richer datasets. Gigascience. 2015; 4(1):7.
Wood AR, Esko T, Yang J, Vedantam S, Pers TH, Gustafsson S, Chu AY, Estrada K, Luan J, Kutalik Z, et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat Genet. 2014; 46(11):1173–86.
Hohls T. Setting confidence limits to genetic parameters estimated by restricted maximum likelihood analysis of North Carolina design II experiments. Heredity. 1996; 77(5):476–87.
Desikan RS, Ségonne F, Fischl B, Quinn BT, Dickerson BC, Blacker D, Buckner RL, Dale AM, Maguire RP, Hyman BT. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage. 2006; 31(3):968.
Yazdani A, Boerwinkle E. Rare variants analysis using penalization methods for whole genome sequence data. BMC Bioinform. 2015; 16(1):405.
Botta V, Louppe G, Geurts P, Wehenkel L. Exploiting SNP correlations within random forest for genomewide association studies. PLoS One. 2014; 9(4):93379.
Gibson G. Rare and common variants: Twenty arguments. Nat Rev Genet. 2012; 13(2):135–45.
Bonnet A, LévyLeduc C, Gassiat E, Toro R, Bourgeron T. Improving heritability estimation by a variable selection approach in sparse high dimensional linear mixed models. J Royal Stat Soc C. 2018; 67(4):813–39.
Meinshausen N, Yu B. Lassotype recovery of sparse representations for highdimensional data. Ann Stat. 2009; 37(1):246–70.
Zhao P, Yu B. On model selection consistency of Lasso. J Mach Learn Res. 2006; 7(11):2541–63.
Acknowledgements
The authors thank the IMAGEN consortium for providing the data.
Funding
Tianzi Jiang was partially supported by the Natural Science Foundation of China (Grant Nos. 91432302, 31620103905), the Science Frontier Program of the Chinese Academy of Sciences (Grant No. QYZDJSSWSMC019), National Key R&D Program of China (Grant No. 2017YFA0105203), Beijing Municipal Science & Technology Commission (Grant Nos. Z161100000216152, Z161100000216139), the Guangdong Pearl River Talents Plan (2016ZT06S220). Chong Li was partially supported by the National Natural Science Foundation of China (Grant No. 11571308) and Zhejiang Provincial Natural Science Foundation of China (Grant Nos. LY18A010004, LY17A010021). Yue Cui was partially supported by the Natural Science Foundation of China (Grant No. 31771076). The funding bodies played no role in the design of the study, the analysis and interpretation of data or in the writing of the manuscript.
Availability of data and materials
The simulated data generated and analysed during the current study are available in the github repository https://github.com/allizwell2018/heritability_github. The real data analysed during this study are available from the IMAGEN project (http://imageneurope.com) on reasonable request.
Author information
Authors and Affiliations
Contributions
XL performed the theoretical analysis, designed, implemented and tested the methods, and wrote the paper; DW designed and implemented the methods, and wrote the paper; YC and BL wrote the paper; HW, GS, CL and TJ wrote the paper and supervised the research. All authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The phenotype and GWAS data are available from the IMAGEN project on reasonable request. The IMAGEN study has been approved by the Psychiatry, Nursing & Midwifery Research Ethics Subcommittee, King’s College London.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Additional estimation results using the proposed strategy. (PDF 242 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Li, X., Wu, D., Cui, Y. et al. Reliable heritability estimation using sparse regularization in ultrahigh dimensional genomewide association studies. BMC Bioinformatics 20, 219 (2019). https://doi.org/10.1186/s1285901927927
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901927927