Boosting heritability: estimating the genetic component of phenotypic variation with multiple sample splitting

Background Heritability is a central measure in genetics quantifying how much of the variability observed in a trait is attributable to genetic differences. Existing methods for estimating heritability are most often based on random-effect models, typically for computational reasons. The alternative of using a fixed-effect model has received much more limited attention in the literature. Results In this paper, we propose a generic strategy for heritability inference, termed as “boosting heritability”, by combining the advantageous features of different recent methods to produce an estimate of the heritability with a high-dimensional linear model. Boosting heritability uses in particular a multiple sample splitting strategy which leads in general to a stable and accurate estimate. We use both simulated data and real antibiotic resistance data from a major human pathogen, Sptreptococcus pneumoniae, to demonstrate the attractive features of our inference strategy. Conclusions Boosting is shown to offer a reliable and practically useful tool for inference about heritability.

reasons. The maximum likelihood and method of moments are the most widely used methods for heritability inference for this family of models [3][4][5][6][7][8].
Some comparisons of different methods for estimating heritability have been recently conducted, for example, in [6,[9][10][11]. However, these works compare the performance of different methods on different datasets without paying much attention to the actual model specification. Since heritability is a concept detailing the additive variance of a trait which is in a certain sense based on a statistical model, heritability estimation is consequently dependent on the specified model [12]. For example, as reported in [9], there is a sizeable difference in the estimated heritability of schizophrenia ĥ 2 SNP that equals 0.56 according to [3] and only 0.23 according to [13]. These estimates have a very different interpretation also qualitatively and they disagree most likely because they are based on different statistical models of heritability.
In this paper, we focus on the high-dimensional linear regression model with fixed effects, where no distributional assumption on the effect sizes is made. Although limited from the computational perspective due to the extremely high-dimensional data in GWAS, high-dimensional linear regression is a natural model for GWAS in modelling the whole-genome level contributions of genetic variation. The benefit of this model over the classical univariate approach in GWAS has been demonstrated for example in [14,15]. The study of heritability estimation with fixed-effect models has been started relatively recently and it has not yet gained a wide-spread attention. A method of moments approach is proposed in [16], a convex optimization strategy is suggested in [17] through a singular value decomposition, maximum likelihood estimation is studied in [18], and some adaptive procedures have also been theoretically studied in [19]. However, to our knowledge, a systematic numerical comparison of these different methods for estimating heritability has not been made yet.
Some two-step procedures based on high-dimensional regularized regression have been introduced in [11,20] that provide an insight to obtain more reliable and stable estimates of heritability. In brevity, this approach is based on splitting the data into two subsets. In the first step, variable selection is employed through a sparsity inducing regularization on one subset to select the relevant covariates. In the second step, these selected covariates are used to estimate heritability from the other subset of data. The selection step is useful to consider only a subset of the covariates that contribute to the variability of the trait (the response). Moreover, splitting the sample is done to avoid doing variable selection and heritability estimation on the same data which can cause overestimate [20]. Although promising, this approach depends crucially on the particular partition used to split the data, which can lead to unstable estimates.
To achieve more reliable results, we propose to use a multiple sample splitting procedure so that different structures in the sample are presented in both selection and estimation steps with a sufficiently high probability [21,22]. Based on this idea, we present a general framework called "boosting heritability" which allows a user to plug-in their own favourite method of variable selection and/or heritability estimation. By repeating sample splitting, one can also obtain various estimates of the heritability and thus provide a meaningful interval of the estimated values.
To demonstrate our framework, we apply the procedure to bacterial GWAS for estimating the heritability of antibiotic resistant phenotypes. While there are numerous works concerning estimating heritability in human GWAS, the topic has not yet been considered widely in bacteria, for the only prominent example see [23]. This is partly because bacterial GWAS poses unique challenges compare to studies with human or animal DNA, stemming from more limited recombination and highly structured populations that result in substantial linkage disequilibrium across whole chromosomes.
The paper is structured as follows. In Section "Model and definition" we present the linear model that relates a trait with a genotype matrix, then narrow-sense heritability is defined together with some discussion regarding the fixed-effect vs. random-effect approach for estimation. In Section "Boosting heritability estimation", we introduce our "boosting heritability" procedure. Results from a simulation study comparing the different methods as components of the framework presented in Section "Simulation studies" and the application to antibiotic resistance phenotypes are presented in Section 5. We conclude and discuss our results in the final section.

Model and definition
Notations: Here, we introduce the main notations used in the paper. The ℓ q norm For a matrix A ∈ R n×m , A i· denotes its i-th row and A ·j denotes its j-th column. For any index set S ⊆ {1, . . . , d} , x S denotes the subvector of x containing only the components indexed by S, and A S denotes the submatrix of A forming by columns of A indexed by S.

Model
Given a phenotype/trait y of n samples that is modelled as a linear combination of p genetic covariates X ·j and an error term (environmental and unmeasured genetic effects) where X i· are independent and identically distributed (i.i.d) with distribution N (0, �) and are independent of ε i ∼ N (0, σ 2 ε ). Here we focus on the fixed effects encoded by β and assume that the genetic covariates X are random variables. Conversely, in the majority of works in the heritability literature assume that elements of β are considered as i.i.d. random variables following a Gaussian distribution i.e β j i.i.d ∼ N (0, σ 2 β ) , while the genetic covariates X are assumed fixed.

Heritability
Under the model (1), we have for the i-th observation that We are interested in estimating (the narrow-sense) heritability for y defined as Technically, heritability is a quantitative measure that expresses how much of the population variability present in a trait is due to genetic differences. Moreover, estimating heritability can assist in modelling the underlying genetic architecture. A heritability (1) close to zero implies that environmental factors cause most of the variability of the trait. In contrast, a heritability close to 1 indicates that the variability of the trait is nearly exclusively caused by the differences in genetic factors.
As we have the relation one can use y 2 2 /n as an unbiased estimator for the denominator of the heritability. Further, one can re-write (2) as and use an estimate of the noise-variance σ 2 ε (see e.g. [24]) to estimate h 2 rather than directly estimate the genetic variance β ⊤ �β (which requires an estimate of the covariance matrix and the effect sizes).
However, it is worth noting that as a bi-product from GWAS analysis when using a multivariate regression approach, such as the Elastic net discussed below, one would already have the estimated effect sizes corresponding to the selected covariates. Using these effect sizes to estimate the heritability would bring insight on the heritability corresponding to the selected covariates and thus clearly provide useful ways to understand the genetic architecture of a trait.

Contrasting the fixed and random effects
In GWAS the true number of causal loci reported tend to be comparatively small compared with the number of putative genetic markers p, which is usually in the order of hundreds of thousands at minimum. Assume that the true effect size β has s ≪ p nonzero entries. In the random-effect model, a further assumption is made concerning these non-zero entries such that they are i.i.d Gaussian N (0, σ 2 β ) . Under this random effect assumption, the heritability is defined [20,25] as sσ 2 β /(sσ 2 β + σ 2 ε ). However, when employing the random-effect assumption, most methods do not use the sparsity constraint. This leads to the target heritability being estimating is pσ 2 β /(pσ 2 β + σ 2 ε ) and the resulting estimate of heritability may thus be inaccurate. Moreover, the LD structure, an important concept that represents the correlation structure of the covariates, is not directly addressed in the formula of heritability in random-effect model, which can make the estimate unjustifiable, e.g. see [8,26]. Several attempts have been done recently to take into account the sparsity constraint within the random-effect model and some promising results have been reported in [20,25,27].

Related works and motivation
As the number of biomarkers can be very large, it is natural to first apply some variable selection or variable screening methods to remove the irrelevant variables from the actual heritability estimation phase. This kind of a post-selection approach has been proposed in the literature, more specifically for the fixed-effect model [11,20].
The HERRA method proposed in [11] is based on a screening method (e.g. as in [28]) to reduce the number of covariates below the sample size. Given the remaining covariates, the sample is randomly divided into two equally sized parts. A lasso-type estimator is employed on the first subset to select a small number of important variables. After that, the least squares estimator is used on the second subset of data using only the selected covariates (from the lasso-type estimator) to get an estimate of the noisevariance. The role of the first and second subsets are switched to obtain another estimate of the noise-variance. Finally, heritability is calculated as in the formula (3) where the noise-variance is the mean of the two estimated noise-variances.
Another "two-stage" approach with sample-splitting has also been proposed in the paper [20]. The data is randomly split into two disjoint equal sample size. On one half of the data, they use a sparse regularization method based on Elastic net to first select the relevant variables. Then, on the other half of the data, they only use the selected variables to estimate the heritability through a method of moments based approach [16].
Both these approaches clearly suffer from some limitations. Firstly, when the number of covariates is very large, it is expensive to fit a sparse regularization directly as in the "two-stage" approach described above. Using a screening method, as in HERRA, to reduce the dimension of the problem is thus a pragmatic approach for applications. However, as the true number of causal biomarkers is not known, as well as their LD structure is not given, reducing the number of variables below the sample size (as in HERRA) introduces another problem from the practical perspective. Secondly, it is clear that both of these approaches crucially depend on the particular sample splitting employed. One can avoid this dependence by performing the sample splitting and inference procedure many times (e.g. 100 times) and aggregating the corresponding results [21,22]. This is to ensure that the different latent structures possibly residing in the sample are properly taken into account in both the selection and estimation steps.
The idea of aggregating different estimates to yield an estimate with improved statistical properties is the central feature of the generic boosting approach widely used in machine learning, such as AdaBoost [29]. The multiple sample splitting approach has previously been proposed in statistics community as in [21,22], and successfully used in GWAS [30,31].

Boosting heritability: multi sample splitting and aggregation of heritability
We propose a strategy that uses multiple sample splitting to estimate heritability, called Boosting heritability detailed in Algorithm 1.
It is noted that the initial step (Step 0) is a screening step that can use a simple measure of association, such as the sample correlation, to remove covariates that are only weakly correlated with the trait of interest. This step is similar to the one used in HERRA [11] and in [27], however, we do not propose to reduce the number of covariates below the actual sample size. This is motivated by the fact for real data we do not know the true number of causal variates as well as the correlation structure of the variables. If too many covariates are removed, this can have a detrimental effect on the subsequent steps in the estimation procedure. Moreover, the initial screening step can be seen as optional, and necessary only for situations where the high dimensionality of the covariate space makes regularized model fitting tedious or practically impossible for practical purposes.
The sample splitting performed in Step 1 is a useful method that can help to avoid overfitting when variable selection and subsequent estimation is considered [20,22,31].
Step 2 corresponds to a variable selection phase where we suggest to use Elastic Net as a default alternative, given its ability to deal with highly correlated covariates. Switching the roles of the data subsets help us to obtain a more stable estimate of the heritability. Note that by repeating sample splitting, various estimates of the heritability are obtained and thus provide a meaningful interval of the estimated values (for example see Fig. 1).
We note that the main cost for Boosting Heritability procedure is fitting a penalized regression (Step 2) for variable selection in the setting where p ≫ n . However, fast computation methods for penalized regression on large GWAS data have been recently proposed see e.g. [32]. Moreover, the B repetitions can be easily implemented in parallel. When the trait of interest is dichotomous, one can use the Robertson transformation [33] to transfer the heritabilty calculated on an observed scale (on 0 or 1) to a heritability on the liability scale. As we largely follow the approach presented in the HERRA method, the details for obtaining heritability for a binary trait can be found in [11] or in [34].

Plug-in Lasso type estimators for heritability
From the formula of heritability (2), direct approaches to estimate heritability can be obtained using estimates of the effect sizes β and of the covariance matrix. By using a lasso type method, one can obtain the non-zero estimated effect sizes of the selected covariates, and one can also use these covariates to obtain an sample covariance matrix. More precisely, let S = j :β � = 0 where β is an estimate from a lasso-type method, we can calculate the heritability as in equation (2) with � S = X S X ⊤ S /(n − 1), The elastic net has been shown to be especially useful when the variables are dependent [35] (LD structure), which is often the case with genetic marker data and this feature is especially highlighted in bacterial genome data. The corresponding estimator is defined as Here ℓ(a, b) is the negative log-likelihood for an observation e.g. for the linear Gaussian case it is 1 2 (a − b) 2 and for logistic regression it is −a · b + log(1 + e b ) . Elastic net is controlled by α ∈ [0, 1], that bridges the gap between lasso ( α = 1 ) and ridge regression ( α = 0 ). As the true genetic basis of a given trait is generally unknown as well as the LD structure is hard to estimate, we suggest to use a small value for α , e.g. 0.001. The tuning parameter > 0 controls the overall strength of the penalty and we use 10-fold crossvalidation to choose suitable value for . Elastic net approach is implemented in the software 'pyseer' [36,37] focusing on GWAS for bacterial data.

Simulation studies
We use a real data set of 616 Streptococcus pneumoniae genomes collected from Massachusetts, denoted MA data, to create semi-synthetic datasets that incorporate levels of population structure and LD occurring in natural populations (see Fig. 2). The data are publicly available through the article [38]. After initial data filtering with standard population genomic procedures (using a minor allele frequency threshold and removing missing data), we obtain a genotype matrix of 603 samples with 89703 SNPs. Using this observed genotype matrix, we simulate the responses/phenotypes through the linear model defined in (1).
Availability of data and code: The R codes and data used in the numerical experiments are available at: https:// github. com/ tienmt/ boost ingher .

Experimental designs
We consider the following designs for choosing the causal SNPs (non-zero effect sizes): • sparse setting: 100 SNPs are randomly chosen.
Given the SNPs, the regression coefficients β 0 are sampled from the normal distribution N (0, 1) . As the true covariance of the genotype matrix is not known, we need to re-normalize the coefficient β 0 as β = β 0 σ 2 ε h 2 /(β 0⊤� β 0 (1 − h 2 )) to assure that the true corresponding heritability is approximating our target. Here h 2 is the target heritability and ¯ is the sample covariance matrix of the genotype matrix and the noise variance is fixed as σ 2 ε = 1. The target heritability is fixed as h 2 = 0.5 . We remind that as true covariance matrix of the genotype matrix is not known, one can only simulate phenotypes from model (1) that approximately target the considered heritability. Therefore, we propose to use the "oracle" estimator, denoted by h2aprx, that is calculated through the formula (3)  Fig. 2 Sample covariance matrix of the first 100 SNPs covariates in the genotype matrix shows the complex dependence structure present in the S. pneumoniae data as a benchmark for comparison. As in simulations the true covariance matrix is not known in our setup, whereas the noise variance is given and thus this estimator provides a solid basis for approximating the true heritability. It is noted that the "h2aprx" estimator is based on the true simulated values and cannot be used with real data.
For each setup, we generate 30 simulation runs and report the mean and the standard deviation of heritability estimates for each method across the simulation runs. We compare Elastic net (Enet), HERRA and the boosting versions of HERRA denoted by "B_ herra" and GCTA method. More specifically, GCTA [40] is a widely used method based on a linear mixed model and maximum (restricted) likelihood estimation. The number of repeated sample splitting is performed with B = 50 times. The Enet is used with fixed parameter α = 0.01 and 10-fold cross validation for choosing the tunning parameter .

Results for estimating heritability
From the results in Table 1, it is clear that the "oracle" approximates well the target heritability in all designs. Generally, the boosting procedure tends to reduce the variability of the original underlying method it is used in conjunction with, see Tables 1, 2 and 3. Elastic net underestimates the target, which can be explained by the downward bias known to influence the naive plug-in lasso-type approaches, such as the Elastic net. The effect is due to shrinkage of some of the coefficients corresponding to weak effect towards zero, while such weak effects may still be significant in terms of the total genetic trait variability. However, we would like to note that estimating heritability through Elastic net provides a good lower bound for the heritability, as indicated by the results.
On the other hand, HERRA and its boosting version return stable estimates. More specifically, with a proper choice of the screening step (Step 0) as in Table 2, HERRA and B_herra can lead to accurate estimates.This can be anticipated as this approach follows the spirit of the 'oracle' estimator. More specifically, it aims at providing a consistent estimate of the noise variance and thus the corresponding heritability estimate would be also consistent and stable [11]. For this reason, the boosting HERRA will be our main focus method in real application in the next section.
In our simulations, GCTA generally did not perform well, most likely due to the sample size being too small for random effects based approaches such as GCTA. We note Var(y) , Table 1 Simulation results with MA data using linear model and the target heritability h 2 = 0.5 The mean and the standard deviation (in parentheses) of the estimated heritabilities between the simulation replicates are presented 100 causal SNPs, σ 2 ε = 1 5000 causal SNPs, σ 2 ε = 1 100 causal SNPs from 3 genes, σ 2 ε = 1 that, for unrelated individuals and common SNPs in human studies, GCTA method is recommended with at least 3160 unrelated samples, see [40]. In studies of bacterial phenotypes, it would be uncommon to have access to such large numbers of samples that are at least approximately unrelated.

The effect of multiple data splitting
Clearly, choosing the number of data splittings B is a crucial factor in practice. Here we exemplify that as B increases, the resulting estimated heritabilities concentrate around their mean, see Fig. 3. Thus, we suggest to use at least B ≥ 30 in practice and B = 100 would be a reasonable choice, computational resources permitting.  Table 2 Simulation results with MA data, 100 randomly selected SNPs, σ 2 ǫ = 1 and h 2 = 0.5 The mean and the standard deviation (in parentheses) of the estimated heritabilities between the simulation replicates are presented

The effect of the screening step
We further investigate the effect of reducing the covariates by using the screening step. Different scenarios for 100 randomly selected SNPs with target heritability h 2 = 0.5 and σ 2 ǫ = 1 are examined, see Table 2. More precisely, we further consider 3 scenarios: remove 60% of the covariates, remove 90% of the covariates, and only retain top n + 1 covariates.
It is revealed that using the screening step to reduce the irrelevant covariates not only reduces the dimension of the data, but can also improve the heritability estimation, in particular for the scenario of removing 60% of the covariates. This fact has also been reported before in the linear mixed model approach in [27], where the authors show an improvement of the maximum likelihood estimation. However, if too many covariates are removed, heritability estimation can be inaccurate as in the scenario of keeping only top n + 1 covariates.

On the running time
The running times for default B_herra on MA data with the splitting step parallelized on 10 CPU cores was 2.335 mins. More specifically, the screening step took 5.25 secs of the total runtime. In the case of removing 60% covariates, the running time is significantly reduced to 1.319 mins. The R codes were run on Linux (Redhat 64-bit) with R version 3.6.0 .

Simulation results using GCTA model
We further examine the performances of Enet, HERRA, B_herra and GCTA method when the phenotypes are instead simulated from the GCTA model. We remind that GCTA model is a random effect model that is different to the linear model (1) and thus we cannot use the 'h2aprx' estimator. The settings for choosing the causal SNPs remain the same as before.
The results, Table 3, reveal that HERRA, B_herra yeild unbiased estimates in GCTA model. Although underestimate, Elastic net still provides a good lower bound for the true heritability. Once again, GCTA method underestimates the heritability as the sample is too small.

Heritability of antibiotic resistance in Maela data
To further illustrate the boosting based approach, we apply our procedure to Maela data which represent 3069 Streptococus pneumoniae genomes from an infant cohort study conducted in a refugee camp on the Thailand-Myanmar border [39,41]. After some data filtering with standard population genomic procedures (using a minor allele frequency threshold and removing missing data), we obtain a genotype matrix with 121014 SNPs. We consider resistances to five different antibiotics as the phenotypes: chloramphenicol, erythromycin, tetracycline, penicillin and co-trimoxazole.
The heritability of the antibiotic resistance phenotype is expected to be high, meaning that the variability stems primarily from the observed genetic differences among these bacteria and that the SNPs available for this particular species/dataset and would include majority of the underlying causal mechanisms for resistance. However, despite that the bacterial isolates are related, it cannot be concluded that the reported estimates refer to total heritability, since unmeasured genetic factors are likely to contribute partially to the measured phenotypic variation. We use two different types of resistance phenotypes to investigate their heritability. First we use the binary phenotype corresponding to the labels 'R' or 'S' (stand for 'Sensitive' or 'Resistant') for each bacterial isolate in the cohort. Second, we use a continuous phenotype corresponding to the inhibition zone diameters measured in the lab. These inhibition zone diameters are in practice used to defined whether a sample is sensitive or resistant to an antibiotic. It is however worthwhile noting that the transformation from inhibition zone diameters to labelling a sample 'S' vs 'R' is nonlinear due to the way the inhibition mechanism dynamics in the bacterial culture.
We apply Enet, HERRA, boosting version of HERRA and GCTA method [40] to this data. The results are given in the Tables 4 and 5 for the two data types, respectively.
As a broad summary, heritabilities of these five antibiotic resistances are high, as expected, whether using binary or continuous phenotypes. However, we would like to note that the results for binary responses are on the observed scale (0/1 resistance status), as we are not able to transform them into the underlying threshold model, see [11]. The Elastic net method yields an important insight by providing a lower bound on the heritability of these antibiotic resistances. For continuous phenotypes, it is at least 51% for chloramphenicol, at least 73% for erythromycin, at least 73% for tetracycline, at least 80% for penicillin and at least 71% for co-trimoxazole.  Interestingly, B_herra yields consistent results with GCTA method. However, the result for heritability of penicillin by GCTA is lower than the one from Enet method while boosting HERRA is not, see Table 5.

Discussion and conclusions
In this paper, we provide a general framework 'boosting heritability' for making inference about heritability. The main ingredient of 'boosting heritability' is a multiple sample splitting strategy. This strategy allows one to employ a variable selection step to remove irrelevant covariates that do not contribute to the variability of a trait and thus produce a reliable estimate of heritability. Moreover, by repeating sample splitting many times, this strategy makes sure that different latent structures are taken into account in both selection and estimation steps.
Numerical comparisons of different methods together with our proposal for estimating heritability in linear (fixed-effect) model draw a systematic picture on the behaviour of the current approaches when focusing on an application to bacterial GWAS. The results on real data suggest that the observed variability of the five studied antibiotic resistances is mainly due to the variability in the observed genetic factors, while some unexplained variation still remains.
Succeeding in improving and stabilizing HERRA [11], "boosting heritability" framework still preserves its advantages that are able to deal with the dichotomous, timeto-event or age-at-onset traits. Moreover, boosting heritability procedure is also applicable for random-effect model where the heritability estimation step (Step 3 in Algorithm 1) is done by using a random effect method as in [20]. These would be possible new research directions for the future.
Furthermore, our boosting heritability procedure uses a simple aggregation to combine the estimates that is to use their arithmetic mean. Other types of aggregation, see e.g. [30,31], could also be used and further examined in future works.