 Research article
 Open Access
 Published:
Boosting heritability: estimating the genetic component of phenotypic variation with multiple sample splitting
BMC Bioinformatics volume 22, Article number: 164 (2021)
Abstract
Background
Heritability is a central measure in genetics quantifying how much of the variability observed in a trait is attributable to genetic differences. Existing methods for estimating heritability are most often based on randomeffect models, typically for computational reasons. The alternative of using a fixedeffect model has received much more limited attention in the literature.
Results
In this paper, we propose a generic strategy for heritability inference, termed as “boosting heritability”, by combining the advantageous features of different recent methods to produce an estimate of the heritability with a highdimensional linear model. Boosting heritability uses in particular a multiple sample splitting strategy which leads in general to a stable and accurate estimate. We use both simulated data and real antibiotic resistance data from a major human pathogen, Sptreptococcus pneumoniae, to demonstrate the attractive features of our inference strategy.
Conclusions
Boosting is shown to offer a reliable and practically useful tool for inference about heritability.
Introduction
Whereas genomewide association studies (GWAS) represent the primary tool for determining the genetic basis of a phenotype/trait of interest, quantifying the contribution of genetic factors to the variation of a phenotype plays in addition an important role in many studies. For this purpose, heritability is a crucial quantity [1, 2] and it is defined (in the narrowsense) as the proportion of the variance of a phenotype explained by the (additive) genetic factors.
Current studies of heritability in the literature have usually been carried out in the linear mixedeffect model framework [3, 4]. In this framework, the effect sizes of genetic markers, usually SNPs, are assumed to be independent and identical distributed random variables, and often the normal distribution (with 0mean) is used for computational reasons. The maximum likelihood and method of moments are the most widely used methods for heritability inference for this family of models [3,4,5,6,7,8].
Some comparisons of different methods for estimating heritability have been recently conducted, for example, in [6, 9,10,11]. However, these works compare the performance of different methods on different datasets without paying much attention to the actual model specification. Since heritability is a concept detailing the additive variance of a trait which is in a certain sense based on a statistical model, heritability estimation is consequently dependent on the specified model [12]. For example, as reported in [9], there is a sizeable difference in the estimated heritability of schizophrenia \({\hat{h}}^2_{SNP}\) that equals 0.56 according to [3] and only 0.23 according to [13]. These estimates have a very different interpretation also qualitatively and they disagree most likely because they are based on different statistical models of heritability.
In this paper, we focus on the highdimensional linear regression model with fixed effects, where no distributional assumption on the effect sizes is made. Although limited from the computational perspective due to the extremely highdimensional data in GWAS, highdimensional linear regression is a natural model for GWAS in modelling the wholegenome level contributions of genetic variation. The benefit of this model over the classical univariate approach in GWAS has been demonstrated for example in [14, 15]. The study of heritability estimation with fixedeffect models has been started relatively recently and it has not yet gained a widespread attention. A method of moments approach is proposed in [16], a convex optimization strategy is suggested in [17] through a singular value decomposition, maximum likelihood estimation is studied in [18], and some adaptive procedures have also been theoretically studied in [19]. However, to our knowledge, a systematic numerical comparison of these different methods for estimating heritability has not been made yet.
Some twostep procedures based on highdimensional regularized regression have been introduced in [11, 20] that provide an insight to obtain more reliable and stable estimates of heritability. In brevity, this approach is based on splitting the data into two subsets. In the first step, variable selection is employed through a sparsity inducing regularization on one subset to select the relevant covariates. In the second step, these selected covariates are used to estimate heritability from the other subset of data. The selection step is useful to consider only a subset of the covariates that contribute to the variability of the trait (the response). Moreover, splitting the sample is done to avoid doing variable selection and heritability estimation on the same data which can cause overestimate [20]. Although promising, this approach depends crucially on the particular partition used to split the data, which can lead to unstable estimates.
To achieve more reliable results, we propose to use a multiple sample splitting procedure so that different structures in the sample are presented in both selection and estimation steps with a sufficiently high probability [21, 22]. Based on this idea, we present a general framework called “boosting heritability” which allows a user to plugin their own favourite method of variable selection and/or heritability estimation. By repeating sample splitting, one can also obtain various estimates of the heritability and thus provide a meaningful interval of the estimated values.
To demonstrate our framework, we apply the procedure to bacterial GWAS for estimating the heritability of antibiotic resistant phenotypes. While there are numerous works concerning estimating heritability in human GWAS, the topic has not yet been considered widely in bacteria, for the only prominent example see [23]. This is partly because bacterial GWAS poses unique challenges compare to studies with human or animal DNA, stemming from more limited recombination and highly structured populations that result in substantial linkage disequilibrium across whole chromosomes.
The paper is structured as follows. In Section “Model and definition” we present the linear model that relates a trait with a genotype matrix, then narrowsense heritability is defined together with some discussion regarding the fixedeffect vs. randomeffect approach for estimation. In Section “Boosting heritability estimation”, we introduce our “boosting heritability” procedure. Results from a simulation study comparing the different methods as components of the framework presented in Section “Simulation studies” and the application to antibiotic resistance phenotypes are presented in Section 5. We conclude and discuss our results in the final section.
Model and definition
Notations: Here, we introduce the main notations used in the paper. The \(\ell _q\) norm \((0<q<+\infty )\) of a vector \(x \in {\mathbb {R}}^d\) is defined by \(\Vert x \Vert _q = (\sum _{i=1}^d x_i^q )^{1/q}\). For a matrix \(A\in {\mathbb {R}}^{n\times m}\), \(A_{i\cdot }\) denotes its ith row and \(A_{\cdot j}\) denotes its jth column. For any index set \(S \subseteq \{1,\ldots ,d\}\), \(x_S\) denotes the subvector of x containing only the components indexed by S, and \(A_S\) denotes the submatrix of A forming by columns of A indexed by S.
Model
Given a phenotype/trait y of n samples that is modelled as a linear combination of p genetic covariates \(X_{\cdot j}\) and an error term (environmental and unmeasured genetic effects)
where \(X_{i\cdot }\) are independent and identically distributed (i.i.d) with distribution \({\mathcal {N}}(0, \Sigma )\) and are independent of \(\varepsilon _{i} \sim {\mathcal {N}} (0, \sigma ^2_{\varepsilon })\).
Here we focus on the fixed effects encoded by \(\beta\) and assume that the genetic covariates X are random variables. Conversely, in the majority of works in the heritability literature assume that elements of \(\beta\) are considered as i.i.d. random variables following a Gaussian distribution i.e \(\beta _j \overset{i.i.d}{\sim } {\mathcal {N}} (0, \sigma ^2_{\beta })\), while the genetic covariates X are assumed fixed.
Heritability
Under the model (1), we have for the ith observation that
We are interested in estimating (the narrowsense) heritability for y defined as
Technically, heritability is a quantitative measure that expresses how much of the population variability present in a trait is due to genetic differences. Moreover, estimating heritability can assist in modelling the underlying genetic architecture. A heritability close to zero implies that environmental factors cause most of the variability of the trait. In contrast, a heritability close to 1 indicates that the variability of the trait is nearly exclusively caused by the differences in genetic factors.
As we have the relation
one can use \(\Vert y\Vert _2^2/n\) as an unbiased estimator for the denominator of the heritability. Further, one can rewrite (2) as
and use an estimate of the noisevariance \({\hat{\sigma }}_{\varepsilon }^2\) (see e.g. [24]) to estimate \(h^2\) rather than directly estimate the genetic variance \(\beta ^\top \Sigma \beta\) (which requires an estimate of the covariance matrix and the effect sizes).
However, it is worth noting that as a biproduct from GWAS analysis when using a multivariate regression approach, such as the Elastic net discussed below, one would already have the estimated effect sizes corresponding to the selected covariates. Using these effect sizes to estimate the heritability would bring insight on the heritability corresponding to the selected covariates and thus clearly provide useful ways to understand the genetic architecture of a trait.
Contrasting the fixed and random effects
In GWAS the true number of causal loci reported tend to be comparatively small compared with the number of putative genetic markers p, which is usually in the order of hundreds of thousands at minimum. Assume that the true effect size \(\beta\) has \(s\ll p\) nonzero entries. In the randomeffect model, a further assumption is made concerning these nonzero entries such that they are i.i.d Gaussian \({\mathcal {N}} (0, \sigma ^2_\beta )\). Under this random effect assumption, the heritability is defined [20, 25] as \(s \sigma ^2_\beta / (s \sigma ^2_\beta + \sigma _{\varepsilon }^2) .\)
However, when employing the randomeffect assumption, most methods do not use the sparsity constraint. This leads to the target heritability being estimating is \(p \sigma ^2_\beta / (p \sigma ^2_\beta + \sigma _{\varepsilon }^2)\) and the resulting estimate of heritability may thus be inaccurate. Moreover, the LD structure, an important concept that represents the correlation structure of the covariates, is not directly addressed in the formula of heritability in randomeffect model, which can make the estimate unjustifiable, e.g. see [8, 26]. Several attempts have been done recently to take into account the sparsity constraint within the randomeffect model and some promising results have been reported in [20, 25, 27].
Boosting heritability estimation
Related works and motivation
As the number of biomarkers can be very large, it is natural to first apply some variable selection or variable screening methods to remove the irrelevant variables from the actual heritability estimation phase. This kind of a postselection approach has been proposed in the literature, more specifically for the fixedeffect model [11, 20].
The HERRA method proposed in [11] is based on a screening method (e.g. as in [28]) to reduce the number of covariates below the sample size. Given the remaining covariates, the sample is randomly divided into two equally sized parts. A lassotype estimator is employed on the first subset to select a small number of important variables. After that, the least squares estimator is used on the second subset of data using only the selected covariates (from the lassotype estimator) to get an estimate of the noisevariance. The role of the first and second subsets are switched to obtain another estimate of the noisevariance. Finally, heritability is calculated as in the formula (3) where the noisevariance is the mean of the two estimated noisevariances.
Another “twostage” approach with samplesplitting has also been proposed in the paper [20]. The data is randomly split into two disjoint equal sample size. On one half of the data, they use a sparse regularization method based on Elastic net to first select the relevant variables. Then, on the other half of the data, they only use the selected variables to estimate the heritability through a method of moments based approach [16].
Both these approaches clearly suffer from some limitations. Firstly, when the number of covariates is very large, it is expensive to fit a sparse regularization directly as in the “twostage” approach described above. Using a screening method, as in HERRA, to reduce the dimension of the problem is thus a pragmatic approach for applications. However, as the true number of causal biomarkers is not known, as well as their LD structure is not given, reducing the number of variables below the sample size (as in HERRA) introduces another problem from the practical perspective. Secondly, it is clear that both of these approaches crucially depend on the particular sample splitting employed. One can avoid this dependence by performing the sample splitting and inference procedure many times (e.g. 100 times) and aggregating the corresponding results [21, 22]. This is to ensure that the different latent structures possibly residing in the sample are properly taken into account in both the selection and estimation steps.
The idea of aggregating different estimates to yield an estimate with improved statistical properties is the central feature of the generic boosting approach widely used in machine learning, such as AdaBoost [29]. The multiple sample splitting approach has previously been proposed in statistics community as in [21, 22], and successfully used in GWAS [30, 31].
Boosting heritability: multi sample splitting and aggregation of heritability
We propose a strategy that uses multiple sample splitting to estimate heritability, called Boosting heritability detailed in Algorithm 1.
It is noted that the initial step (Step 0) is a screening step that can use a simple measure of association, such as the sample correlation, to remove covariates that are only weakly correlated with the trait of interest. This step is similar to the one used in HERRA [11] and in [27], however, we do not propose to reduce the number of covariates below the actual sample size. This is motivated by the fact for real data we do not know the true number of causal variates as well as the correlation structure of the variables. If too many covariates are removed, this can have a detrimental effect on the subsequent steps in the estimation procedure. Moreover, the initial screening step can be seen as optional, and necessary only for situations where the high dimensionality of the covariate space makes regularized model fitting tedious or practically impossible for practical purposes.
The sample splitting performed in Step 1 is a useful method that can help to avoid overfitting when variable selection and subsequent estimation is considered [20, 22, 31]. Step 2 corresponds to a variable selection phase where we suggest to use Elastic Net as a default alternative, given its ability to deal with highly correlated covariates. Switching the roles of the data subsets help us to obtain a more stable estimate of the heritability. Note that by repeating sample splitting, various estimates of the heritability are obtained and thus provide a meaningful interval of the estimated values (for example see Fig. 1).
We note that the main cost for Boosting Heritability procedure is fitting a penalized regression (Step 2) for variable selection in the setting where \(p \gg n\). However, fast computation methods for penalized regression on large GWAS data have been recently proposed see e.g. [32]. Moreover, the B repetitions can be easily implemented in parallel. When the trait of interest is dichotomous, one can use the Robertson transformation [33] to transfer the heritabilty calculated on an observed scale (on 0 or 1) to a heritability on the liability scale. As we largely follow the approach presented in the HERRA method, the details for obtaining heritability for a binary trait can be found in [11] or in [34].
Plugin Lasso type estimators for heritability
From the formula of heritability (2), direct approaches to estimate heritability can be obtained using estimates of the effect sizes \(\beta\) and of the covariance matrix. By using a lasso type method, one can obtain the nonzero estimated effect sizes of the selected covariates, and one can also use these covariates to obtain an sample covariance matrix. More precisely, let \(S = \left\{ j : {\hat{\beta }} \ne 0 \right\}\) where \({\hat{\beta }}\) is an estimate from a lassotype method, we can calculate the heritability as in equation (2) with \({\hat{\Sigma }}_S = X_S X_S^\top /(n1)\),
The elastic net has been shown to be especially useful when the variables are dependent [35] (LD structure), which is often the case with genetic marker data and this feature is especially highlighted in bacterial genome data. The corresponding estimator is defined as
Here \(\ell (a ,b)\) is the negative loglikelihood for an observation e.g. for the linear Gaussian case it is \(\frac{1}{2}(a b)^2\) and for logistic regression it is \( a \cdot b + \log (1+e^{b})\). Elastic net is controlled by \(\alpha \in [0,1],\) that bridges the gap between lasso (\(\alpha =1\)) and ridge regression (\(\alpha =0\)). As the true genetic basis of a given trait is generally unknown as well as the LD structure is hard to estimate, we suggest to use a small value for \(\alpha\), e.g. 0.001. The tuning parameter \(\lambda >0\) controls the overall strength of the penalty and we use 10fold crossvalidation to choose suitable value for \(\lambda\). Elastic net approach is implemented in the software ’pyseer’ [36, 37] focusing on GWAS for bacterial data.
Simulation studies
We use a real data set of 616 Streptococcus pneumoniae genomes collected from Massachusetts, denoted MA data, to create semisynthetic datasets that incorporate levels of population structure and LD occurring in natural populations (see Fig. 2). The data are publicly available through the article [38]. After initial data filtering with standard population genomic procedures (using a minor allele frequency threshold and removing missing data), we obtain a genotype matrix of 603 samples with 89703 SNPs. Using this observed genotype matrix, we simulate the responses/phenotypes through the linear model defined in (1).
Availability of data and code: The R codes and data used in the numerical experiments are available at: https://github.com/tienmt/boostingher .
Experimental designs
We consider the following designs for choosing the causal SNPs (nonzero effect sizes):

sparse setting: 100 SNPs are randomly chosen.

polygenic setting: 5000 SNPs are randomly chosen.

Penicillin resistance like setting ( [39]): 100 SNPs are randomly chosen from 3 genes (pbpX,pbp1A,penA).
Given the SNPs, the regression coefficients \(\beta ^0\) are sampled from the normal distribution \({\mathcal {N}}(0,1)\). As the true covariance of the genotype matrix is not known, we need to renormalize the coefficient \(\beta ^0\) as \(\beta = \beta ^0 \sqrt{\sigma ^2_\varepsilon h^2/ (\beta ^{0\top } {\bar{\Sigma }} \beta ^0 ( 1h^2 )) }\) to assure that the true corresponding heritability is approximating our target. Here \(h^2\) is the target heritability and \({\bar{\Sigma }}\) is the sample covariance matrix of the genotype matrix and the noise variance is fixed as \(\sigma _{\varepsilon }^2 = 1\).
The target heritability is fixed as \(h^2 = 0.5\). We remind that as true covariance matrix of the genotype matrix is not known, one can only simulate phenotypes from model (1) that approximately target the considered heritability. Therefore, we propose to use the “oracle” estimator, denoted by h2aprx, that is calculated through the formula (3)
as a benchmark for comparison. As in simulations the true covariance matrix is not known in our setup, whereas the noise variance is given and thus this estimator provides a solid basis for approximating the true heritability. It is noted that the “h2aprx” estimator is based on the true simulated values and cannot be used with real data.
For each setup, we generate 30 simulation runs and report the mean and the standard deviation of heritability estimates for each method across the simulation runs. We compare Elastic net (Enet), HERRA and the boosting versions of HERRA denoted by “B_herra” and GCTA method. More specifically, GCTA [40] is a widely used method based on a linear mixed model and maximum (restricted) likelihood estimation. The number of repeated sample splitting is performed with \(B = 50\) times. The Enet is used with fixed parameter \(\alpha = 0.01\) and 10fold cross validation for choosing the tunning parameter \(\lambda\).
Results for estimating heritability
From the results in Table 1, it is clear that the “oracle” approximates well the target heritability in all designs. Generally, the boosting procedure tends to reduce the variability of the original underlying method it is used in conjunction with, see Tables 1, 2 and 3.
Elastic net underestimates the target, which can be explained by the downward bias known to influence the naive plugin lassotype approaches, such as the Elastic net. The effect is due to shrinkage of some of the coefficients corresponding to weak effect towards zero, while such weak effects may still be significant in terms of the total genetic trait variability. However, we would like to note that estimating heritability through Elastic net provides a good lower bound for the heritability, as indicated by the results.
On the other hand, HERRA and its boosting version return stable estimates. More specifically, with a proper choice of the screening step (Step 0) as in Table 2, HERRA and B_herra can lead to accurate estimates.This can be anticipated as this approach follows the spirit of the ’oracle’ estimator. More specifically, it aims at providing a consistent estimate of the noise variance and thus the corresponding heritability estimate would be also consistent and stable [11]. For this reason, the boosting HERRA will be our main focus method in real application in the next section.
In our simulations, GCTA generally did not perform well, most likely due to the sample size being too small for random effects based approaches such as GCTA. We note that, for unrelated individuals and common SNPs in human studies, GCTA method is recommended with at least 3160 unrelated samples, see [40]. In studies of bacterial phenotypes, it would be uncommon to have access to such large numbers of samples that are at least approximately unrelated.
The effect of multiple data splitting
Clearly, choosing the number of data splittings B is a crucial factor in practice. Here we exemplify that as B increases, the resulting estimated heritabilities concentrate around their mean, see Fig. 3. Thus, we suggest to use at least \(B \ge 30\) in practice and \(B = 100\) would be a reasonable choice, computational resources permitting.
The effect of the screening step
We further investigate the effect of reducing the covariates by using the screening step. Different scenarios for 100 randomly selected SNPs with target heritability \(h^2 = 0.5\) and \(\sigma _\epsilon ^2 = 1\) are examined, see Table 2. More precisely, we further consider 3 scenarios: remove 60% of the covariates, remove 90% of the covariates, and only retain top \(n+1\) covariates.
It is revealed that using the screening step to reduce the irrelevant covariates not only reduces the dimension of the data, but can also improve the heritability estimation, in particular for the scenario of removing 60% of the covariates. This fact has also been reported before in the linear mixed model approach in [27], where the authors show an improvement of the maximum likelihood estimation. However, if too many covariates are removed, heritability estimation can be inaccurate as in the scenario of keeping only top \(n+1\) covariates.
On the running time
The running times for default B_herra on MA data with the splitting step parallelized on 10 CPU cores was 2.335 mins. More specifically, the screening step took 5.25 secs of the total runtime. In the case of removing 60% covariates, the running time is significantly reduced to 1.319 mins. The R codes were run on Linux (Redhat 64bit) with R version 3.6.0 .
Simulation results using GCTA model
We further examine the performances of Enet, HERRA, B_herra and GCTA method when the phenotypes are instead simulated from the GCTA model. We remind that GCTA model is a random effect model that is different to the linear model (1) and thus we cannot use the ’h2aprx’ estimator. The settings for choosing the causal SNPs remain the same as before.
The results, Table 3, reveal that HERRA, B_herra yeild unbiased estimates in GCTA model. Although underestimate, Elastic net still provides a good lower bound for the true heritability. Once again, GCTA method underestimates the heritability as the sample is too small.
Heritability of antibiotic resistance in Maela data
To further illustrate the boosting based approach, we apply our procedure to Maela data which represent 3069 Streptococus pneumoniae genomes from an infant cohort study conducted in a refugee camp on the Thailand–Myanmar border [39, 41]. After some data filtering with standard population genomic procedures (using a minor allele frequency threshold and removing missing data), we obtain a genotype matrix with 121014 SNPs. We consider resistances to five different antibiotics as the phenotypes: chloramphenicol, erythromycin, tetracycline, penicillin and cotrimoxazole.
The heritability of the antibiotic resistance phenotype is expected to be high, meaning that the variability stems primarily from the observed genetic differences among these bacteria and that the SNPs available for this particular species/dataset and would include majority of the underlying causal mechanisms for resistance. However, despite that the bacterial isolates are related, it cannot be concluded that the reported estimates refer to total heritability, since unmeasured genetic factors are likely to contribute partially to the measured phenotypic variation. We use two different types of resistance phenotypes to investigate their heritability. First we use the binary phenotype corresponding to the labels ’R’ or ’S’ (stand for ’Sensitive’ or ’Resistant’) for each bacterial isolate in the cohort. Second, we use a continuous phenotype corresponding to the inhibition zone diameters measured in the lab. These inhibition zone diameters are in practice used to defined whether a sample is sensitive or resistant to an antibiotic. It is however worthwhile noting that the transformation from inhibition zone diameters to labelling a sample ’S’ vs ’R’ is nonlinear due to the way the inhibition mechanism dynamics in the bacterial culture.
We apply Enet, HERRA, boosting version of HERRA and GCTA method [40] to this data. The results are given in the Tables 4 and 5 for the two data types, respectively.
As a broad summary, heritabilities of these five antibiotic resistances are high, as expected, whether using binary or continuous phenotypes. However, we would like to note that the results for binary responses are on the observed scale (0/1 resistance status), as we are not able to transform them into the underlying threshold model, see [11]. The Elastic net method yields an important insight by providing a lower bound on the heritability of these antibiotic resistances. For continuous phenotypes, it is at least 51% for chloramphenicol, at least 73% for erythromycin, at least 73% for tetracycline, at least 80% for penicillin and at least 71% for cotrimoxazole.
Interestingly, B_herra yields consistent results with GCTA method. However, the result for heritability of penicillin by GCTA is lower than the one from Enet method while boosting HERRA is not, see Table 5.
Discussion and conclusions
In this paper, we provide a general framework ’boosting heritability’ for making inference about heritability. The main ingredient of ’boosting heritability’ is a multiple sample splitting strategy. This strategy allows one to employ a variable selection step to remove irrelevant covariates that do not contribute to the variability of a trait and thus produce a reliable estimate of heritability. Moreover, by repeating sample splitting many times, this strategy makes sure that different latent structures are taken into account in both selection and estimation steps.
Numerical comparisons of different methods together with our proposal for estimating heritability in linear (fixedeffect) model draw a systematic picture on the behaviour of the current approaches when focusing on an application to bacterial GWAS. The results on real data suggest that the observed variability of the five studied antibiotic resistances is mainly due to the variability in the observed genetic factors, while some unexplained variation still remains.
Succeeding in improving and stabilizing HERRA [11], “boosting heritability” framework still preserves its advantages that are able to deal with the dichotomous, timetoevent or ageatonset traits. Moreover, boosting heritability procedure is also applicable for randomeffect model where the heritability estimation step (Step 3 in Algorithm 1) is done by using a random effect method as in [20]. These would be possible new research directions for the future.
Furthermore, our boosting heritability procedure uses a simple aggregation to combine the estimates that is to use their arithmetic mean. Other types of aggregation, see e.g. [30, 31], could also be used and further examined in future works.
Availability of data and materials
The R codes and data used in the numerical experiments are available at: https://github.com/tienmt/boostingher.
References
 1.
Falconer DS. Introduction to quantitative genetics. Edinburgh, London: Oliver And Boyd; 1960.
 2.
Lynch M, Walsh B. Genetics and analysis of quantitative traits, vol. 1. MA: Sinauer Sunderland; 1998.
 3.
BulikSullivan BK, Loh PR, Finucane HK, Ripke S, Yang J, Patterson N, Daly MJ, Price AL, Neale BM, of the Psychiatric Genomics Consortium SWG et al. Ld score regression distinguishes confounding from polygenicity in genomewide association studies. Nature Genet 2015;47(3):291.
 4.
Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, et al. Common snps explain a large proportion of the heritability for human height. Nat Genet. 2010;42(7):565.
 5.
Golan D, Lander ES, Rosset S. Measuring missing heritability: inferring the contribution of common variants. Proc Nat Acad Sci. 2014;111(49):5272–81.
 6.
Zhou X. A unified framework for variance component estimation with summary statistics in genomewide association studies. Ann Appl Stat. 2017;11(4):2027.
 7.
Bonnet A. Heritability estimation in highdimensional mixed models: theory and applications. PhD thesis, Université ParisSaclay; 2016.
 8.
Speed D, Cai N, Johnson MR, Nejentsev S, Balding DJ, Consortium U, et al. Reevaluation of snp heritability in complex human traits. Nat Genet. 2017;49(7):986.
 9.
Evans LM, Tahmasbi R, Vrieze SI, Abecasis GR, Das S, Gazal S, Bjelland DW, Candia TR, Goddard ME, Neale BM, et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nat Genet. 2018;50(5):737.
 10.
Weissbrod O, Flint J, Rosset S. Estimating snpbased heritability and genetic correlation in casecontrol studies directly and with summary statistics. Am J Human Genet. 2018;103(1):89–99.
 11.
Gorfine M, Berndt SI, ChangClaude J, Hoffmeister M, Le Marchand L, Potter J, Slattery ML, Keret N, Peters U, Hsu L. Heritability estimation using a regularized regression approach (herra): applicable to continuous, dichotomous or ageatonset outcome. PLoS ONE. 2017;12(8):0181269.
 12.
Zaitlen N, Kraft P. Heritability in the genomewide association era. Hum Genet. 2012;131(10):1655–64.
 13.
Lee SH, Ripke S, Neale BM, Faraone SV, Purcell SM, Perlis RH, Mowry BJ, Thapar A, Goddard ME, Witte JS, et al. Genetic relationship between five psychiatric disorders estimated from genomewide snps. Nat Genet. 2013;45(9):984.
 14.
Wu TT, Chen YF, Hastie T, Sobel E, Lange K. Genomewide association analysis by lasso penalized logistic regression. Bioinformatics. 2009;25(6):714–21.
 15.
Brzyski D, Peterson CB, Sobczyk P, Candès EJ, Bogdan M, Sabatti C. Controlling the rate of gwas false discoveries. Genetics. 2017;205(1):61–75.
 16.
Dicker LH. Variance estimation in highdimensional linear models. Biometrika. 2014;101(2):269–84.
 17.
Janson L, Barber RF, Candes E. Eigenprism: inference for high dimensional signaltonoise ratios. J R Stat Soc: Ser B (Stat Methodol). 2017;79(4):1037–65.
 18.
Dicker LH, Erdogdu MA. Maximum likelihood for variance estimation in highdimensional linear models. In: Proceedings of the 19th international conference on artificial intelligence and statistics, PMLR 2016; 51:159167, 2016.
 19.
Verzelen N, Gassiat E, et al. Adaptive estimation of highdimensional signaltonoise ratios. Bernoulli. 2018;24(4B):3683–710.
 20.
Li X, Wu D, Cui Y, Liu B, Walter H, Schumann G, Li C, Jiang T. Reliable heritability estimation using sparse regularization in ultrahigh dimensional genomewide association studies. BMC Bioinform. 2019;20(1):219.
 21.
Meinshausen N, Meier L, Bühlmann P. Pvalues for highdimensional regression. J Am Stat Assoc. 2009;104(488):1671–81.
 22.
Fan J, Guo S, Hao N. Variance estimation using refitted crossvalidation in ultrahigh dimensional regression. J R Stat Soc: Ser B (Stat Methodol). 2012;74(1):37–65.
 23.
Lees JA, Croucher NJ, Goldblatt D, Nosten F, Parkhill J, Turner C, Turner P, Bentley SD. Genomewide identification of lineage and locus specific variation associated with pneumococcal carriage duration. Elife. 2017;6:26255.
 24.
Reid S, Tibshirani R, Friedman J. A study of error variance estimation in lasso regression. Statistica Sinica. 2016;26:35–67.
 25.
Bonnet A, Gassiat E, LévyLeduc C, et al. Heritability estimation in high dimensional sparse linear mixed models. Electron J Stat. 2015;9(2):2099–129.
 26.
Speed D, Balding DJ. Sumher better estimates the snp heritability of complex traits from summary statistics. Nat Genet. 2019;51(2):277.
 27.
Bonnet A, LévyLeduc C, Gassiat E, Toro R, Bourgeron T. Improving heritability estimation by a variable selection approach in sparse high dimensional linear mixed models. J Roy Stat Soc: Ser C (Appl Stat). 2018;67(4):813–39.
 28.
Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc: Ser B (Stat Methodol). 2008;70(5):849–911.
 29.
Freund Y, Schapire RE. Experiments with a new boosting algorithm. In: Proceedings of the thirteenth international conference on machine learning. ICML’96, 1996; pp. 148–156. http://dl.acm.org/citation.cfm?id=3091696.3091715.
 30.
Renaux C, Buzdugan L, Kalisch M, Bühlmann P. Hierarchical inference for genomewide association studies: a view on methodology with software. Comput Stat. 2020;35(1):1–40.
 31.
Buzdugan L, Kalisch M, Navarro A, Schunk D, Fehr E, Bühlmann P. Assessing statistical significance in multivariable genome wide association analysis. Bioinformatics. 2016;32(13):1990–2000.
 32.
Qian J, Tanigawa Y, Du W, Aguirre M, Chang C, Tibshirani R, Rivas MA, Hastie T. A fast and scalable framework for largescale and ultrahighdimensional sparse regression with application to the uk biobank. PLoS Genet. 2020;16(10):1009141.
 33.
Dempster ER, Lerner IM. Heritability of threshold characters. Genetics. 1950;35(2):212.
 34.
Lee SH, Wray NR, Goddard ME, Visscher PM. Estimating missing heritability for disease from genomewide association studies. Am J Human Genet. 2011;88(3):294–305.
 35.
Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc: Ser B (Stat Methodol). 2005;67(2):301–20.
 36.
Lees JA, Galardini M, Bentley SD, Weiser JN, Corander J. pyseer: a comprehensive tool for microbial pangenomewide association studies. Bioinformatics. 2018;34(24):4310–2.
 37.
Lees JA, Mai TT, Galardini M, Wheeler NE, Horsfield ST, Parkhill J, Corander J. Improved prediction of bacterial genotypephenotype associations using interpretable pangenomespanning regressions. Mbio. 2020;11(4).
 38.
Croucher NJ, Finkelstein JA, Pelton SI, Parkhill J, Bentley SD, Lipsitch M, Hanage WP. Population genomic datasets describing the postvaccine evolutionary epidemiology of streptococcus pneumoniae. Sci Data. 2015;2:150058.
 39.
Lees JA, Vehkala M, Välimäki N, Harris SR, Chewapreecha C, Croucher NJ, Marttinen P, Davies MR, Steer AC, Tong SY, et al. Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes. Nat Commun. 2016;7:12797.
 40.
Yang J, Lee SH, Goddard ME, Visscher PM. Gcta: a tool for genomewide complex trait analysis. Am J Hum Genet. 2011;88(1):76–82.
 41.
Chewapreecha C, Marttinen P, Croucher NJ, Salter SJ, Harris SR, Mather AE, Hanage WP, Goldblatt D, Nosten FH, Turner C, et al. Comprehensive identification of single nucleotide polymorphisms associated with betalactam resistance within pneumococcal mosaic genes. PLoS Genet. 2014;10(8):1004547.
Acknowledgements
The authors would like to thank the editor and two anonymous referees who kindly reviewed the earlier version of this manuscript and provided valuable suggestions and enlightening comments. T.T.M and J.C. would like to thank John A Lees for his useful discussion on GWAS and heritability.
Funding
This research was supported by the European Research Council grant no. 742158.
Author information
Affiliations
Contributions
Conceptualization: T.T.M. and J.C. . Formal analysis: T.T.M.. Data curation: P.T. . Methodology, T.T.M.. Writing: original draft, T.T.M.; review and editing, all authors. Funding acquisition: J.C. All the authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Mai, T.T., Turner, P. & Corander, J. Boosting heritability: estimating the genetic component of phenotypic variation with multiple sample splitting. BMC Bioinformatics 22, 164 (2021). https://doi.org/10.1186/s12859021040797
Received:
Accepted:
Published:
Keywords
 Antimicrobial resistance
 Boosting
 Heritability
 Linear model