Multivariate estimation of factor structures of complex traits using SNP-based genomic relationships

Background Heritability and genetic correlation can be estimated from genome-wide single-nucleotide polymorphism (SNP) data using various methods. We recently developed multivariate genomic-relatedness-based restricted maximum likelihood (MGREML) for statistically and computationally efficient estimation of SNP-based heritability (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h^2_{\text{SNP}}$$\end{document}hSNP2) and genetic correlation (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\rho _G$$\end{document}ρG) across many traits in large datasets. Here, we extend MGREML by allowing it to fit and perform tests on user-specified factor models, while preserving the low computational complexity. Results Using simulations, we show that MGREML yields consistent estimates and valid inferences for such factor models at low computational cost (e.g., for data on 50 traits and 20,000 individuals, a saturated model involving 50 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h^2_{\text{SNP}}$$\end{document}hSNP2’s, 1225 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\rho _G$$\end{document}ρG’s, and 50 fixed effects is estimated and compared to a restricted model in less than one hour on a single notebook with two 2.7 GHz cores and 16 GB of RAM). Using repeated measures of height and body mass index from the US Health and Retirement Study, we illustrate the ability of MGREML to estimate a factor model and test whether it fits the data better than a nested model. The MGREML tool, the simulation code, and an extensive tutorial are freely available at https://github.com/devlaming/mgreml/. Conclusion MGREML can now be used to estimate multivariate factor structures and perform inferences on such factor models at low computational cost. This new feature enables simple structural equation modeling using MGREML, allowing researchers to specify, estimate, and compare genetic factor models of their choosing using SNP data.

Given the initial genetic factor weight matrix, denoted by WG*, we rescale each of the T columns to have unit length, yielding WG. We have three main simulation settings. In Simulation Setting 1 and 2 we consider T = 10 traits and R = 100 runs, while in Simulation Setting 3 we consider T = 50 traits and R = 1 run. Across runs, we simulate different genetic datasets. Therefore, results for a given setting are completely independent across runs. In addition, within a given run, we draw new βf and εf for each simulation setting and for each distinct value of ρG considered in Simulation Setting 1, creating considerable independence across settings and across values of ρG within a given run.
For the data simulated here, we estimate a fully saturated model in each run. Supplementary Tables S1-S4 show the resulting h 2 estimates, ρG estimates, their standard errors (SEs), averaged across runs, and the standard deviation in ρG estimates across runs, for the four distinct values of ρG.
We observe that MGREML provides consistent estimates of the true genetic correlation structure, and that the estimates become more precise (i.e., have lower SEs) when interdependence across traits increases (i.e., higher |ρG|). SEs are in line with standard deviations, validating the delta method implemented in MGREML. In the range of setups considered here, only for ρG = 1.00 SEs are considerably larger than standard deviations. This finding means MGREML is slightly conservative in terms of SEs in extreme scenarios.

Simulation Setting 2
We divide the traits into two clusters of five traits, with random ρG within clusters and ρG = 0.00 between traits in different clusters. Again, for each trait h 2 = 50%. A typical run is shown in Figure 1 in the main text. We use the following steps to find an appropriate WG*: 1. Initialize WG* as 10 × 10 matrix of i.i.d. draws from N(0,1). 2. Set the elements in 5 × 5 lower-left block of WG* to zero. 3. Set the elements in 5 × 5 upper-right block of WG* to zero.
For the data simulated here, for each run we estimate both a nested model (first set of five genetic factors affecting Traits 1-5, second set of five genetic factors affecting Traits 6-10, and no environmental correlation) and a fully saturated model. Supplementary Table S5 shows the root-mean-square error (RMSE) calculated from the estimation error in ρG across all runs.
We observe that MGREML provides accurate estimates of the true genetic-correlation structure, even when the degrees of freedom in the model is larger than necessary (i.e., when a fully saturated model is estimated, while the data-generating process is not fully saturated). With respect to the ρG estimates that can be non-zero under the nested model, RMSE of those estimates is only marginally smaller than RMSE of the corresponding ρG estimates under the fully saturated model. Table S1. Average estimates over 100 simulation runs when ρG = −1/9 and h 2 = 50% for all generated traits. The h 2 estimates (with standard error between parentheses) are shown in the first column; The ρG estimates (with standard error between parentheses) are shown below the diagonal in the subsequent columns; The standard deviation of the ρG estimates across simulations runs is shown above the diagonal.  Supplementary Figure S1 shows the quantile-quantile (QQ) plots of the MGREML likelihoodratio test (LRT) statistics and corresponding −log10(p-values) across runs, when comparing the fit of the nested model to the fit of the fully saturated model in each run. Given the datagenerating process, the LRT statistics should follow a χ 2 distribution with 70 degrees of freedom. This expectation is confirmed by the QQ plots, because no appreciable deviation from the 45-degree line is observed.

Simulation Setting 3
To investigate runtime of MGREML when comparing factor models, we consider one additional simulation run in accordance with Simulation Setting 1, with ρG = 0.00, while considering T = 50 traits instead of ten traits. For this simulated data, we estimate a saturated main model and a nested model assuming one idiosyncratic genetic factor for each trait (i.e., a nested model in accordance with the true value ρG = 0.00), and compare the fit of these two models using a likelihood-ratio test. In both models, we include an intercept as a fixed-effect covariate that applies to each trait. Thus, there are 50 fixed effects in total in both models. We find that MGREML finishes this analysis in approximately 42 minutes on a single notebook with two 2.7 GHz cores and 16 GB of RAM.

Application
To illustrate the ability of MGREML to distinguish between realistic and unrealistic factor models, we employ data from the US Health and Retirements Study (HRS). The HRS is a longitudinal panel study that surveys a representative sample of approximately 20,000 individuals aged 51 years and older (and their spouses) in the United States of America [3].
To construct the GRM, we use the 2012 release of genetic data. There are 6,570 individuals with full information on the 10 traits (5 for height, 5 for BMI) in our model. In our analyses, we control for (standardized) birth year, birth year 2 , birth year 3 , sex, sex × birth year, sex × birth year 2 , and sex × birth year 3 . MGREML also adjusts for the first 20 principal components of the genomic-relatedness matrix to control for subtle forms of population stratification.
Supplementary Table S6 provides descriptive statistics of the HRS analysis sample. Due to the exclusion of individuals with a genetic relatedness larger than 0.025, the actual analysis sample comprises 6,425 individuals.
The resulting probability that the null (i.e., no genetic correlation) is correctly rejected is 21.8%. Conversely, the probability of a Type II error is 78.2%.