 Software
 Open Access
 Published:
Multivariate estimation of factor structures of complex traits using SNPbased genomic relationships
BMC Bioinformatics volume 23, Article number: 305 (2022)
Abstract
Background
Heritability and genetic correlation can be estimated from genomewide singlenucleotide polymorphism (SNP) data using various methods. We recently developed multivariate genomicrelatednessbased restricted maximum likelihood (MGREML) for statistically and computationally efficient estimation of SNPbased heritability (\(h^2_{\text{SNP}}\)) and genetic correlation (\(\rho _G\)) across many traits in large datasets. Here, we extend MGREML by allowing it to fit and perform tests on userspecified factor models, while preserving the low computational complexity.
Results
Using simulations, we show that MGREML yields consistent estimates and valid inferences for such factor models at low computational cost (e.g., for data on 50 traits and 20,000 individuals, a saturated model involving 50 \(h^2_{\text{SNP}}\)’s, 1225 \(\rho _G\)’s, and 50 fixed effects is estimated and compared to a restricted model in less than one hour on a single notebook with two 2.7 GHz cores and 16 GB of RAM). Using repeated measures of height and body mass index from the US Health and Retirement Study, we illustrate the ability of MGREML to estimate a factor model and test whether it fits the data better than a nested model. The MGREML tool, the simulation code, and an extensive tutorial are freely available at https://github.com/devlaming/mgreml/.
Conclusion
MGREML can now be used to estimate multivariate factor structures and perform inferences on such factor models at low computational cost. This new feature enables simple structural equation modeling using MGREML, allowing researchers to specify, estimate, and compare genetic factor models of their choosing using SNP data.
Background
Narrowsense heritability (\(h^2\)) quantifies the relative importance of additive genetic variance for a trait. Genetic correlation (\(\rho _G\)) reflects the shared genetic architecture between two traits. Genomicrelatednessbased restricted maximum likelihood (GREML) estimation [1,2,3,4] is widely used to estimate \(h^2\) and \(\rho _G\) using genomewide singlenucleotide polymorphism (SNP) data for unrelated individuals [5]. As the heritability captured by SNPs provides a reasonable lower bound for \(h^2\) [6, 7], the former is often referred to as SNPbased heritability and denoted by \(h^2_{\text{SNP}}\) [8].
We recently developed MGREML, a computationally and statistically efficient approach for multivariate GREML [9]. Importantly, MGREML resolves inconsistencies when combining bivariate estimates into a multivariate \(\rho _G\) matrix. By default, MGREML assumes the \(\rho _G\) matrix is shaped by a socalled saturated model [10], which can fit any conceivable proper correlation matrix.
Here, we derive an extension of the statistical framework of MGREML (1) to estimate userspecified genetic and environmental factor models (e.g., a model with just one genetic factor for all traits) and (2) to test whether the given factor model fits the data better than a nested model (also userspecified).
Whereas Genomic SEM [11], another method to estimate genetic factor models, relies on preexisting summary statistics from largescale genomewide association studies (GWAS) for all traits of interest, MGREML uses individuallevel data, giving users (1) more statistical power for a fixed sample size [9] and (2) more direct control over model specification and estimation (e.g., being able to control for an additional covariate in the MGREML analysis itself, rather than having to obtain a new set of GWAS results).
In short, we enable MGREML to estimate genetic and environmental factor structures using individuallevel data, and to test whether a given factor structure fits the data better than a nested model. We validate this approach using simulations and an empirical application.
Implementation
Model
Consider a set of N unrelated individuals for whom we observe T traits, k covariates, and M SNPs. Let \({\textbf{X}}\) denote the \(N \times k\) matrix of covariates, \({\textbf{G}}\) the \(N \times M\) matrix of standardized SNPs, and \({\textbf{Y}}\) the \(N \times T\) matrix of traits, for which column t corresponds to Trait t and is denoted by \({\textbf{y}}_t\). Furthermore, let \({\textbf{S}}_t\) denote a binary \(K_t \times k\) matrix indicating which of the k covariates in \({\textbf{X}}\) apply to Trait t. Now, the matrix of covariates for Trait t can be defined as \({\textbf{X}}_t = {\textbf{X}} {{\textbf{S}}}^{\top }_t\).
When applying univariate GREML as implemented in GCTA [1] to Trait t, the following linear mixed model (LMM) is estimated using restricted maximum likelihood (REML) [12]:
In this model, \(\varvec{\beta }_t\) is the \(K_t \times 1\) vector of fixed effects of the covariates that apply to Trait t and the \(N \times N\) matrix \({\textbf{A}}\) is the socalled genomicrelatedness matrix (GRM) reflecting the subtle genetic similarities between unrelated individuals.
Typically, the GRM is calculated as \({\textbf{A}} = M^{1} {\textbf{G}}{{\textbf{G}}}^{\top }\) using tools such as GCTA or PLINK [1, 13]. Calculation of the GRM requires \(O(N^2M)\) time. However, as this calculation can be massively parallelized, it places little practical limitation on either N or M.
By giving standardized SNPs the same weight, the preceding definition of the GRM makes tacit assumptions about the relationship between allele frequency and linkage disequilibrium on the one hand and SNP effect sizes on the other. Other tools, such as LDAK [14], can be used to construct a GRM that assigns different weights to the SNPs, thereby incorporating different assumptions about SNP effect sizes. Importantly, MGREML can use any valid GRM in binary format as input, irrespective of its precise definition and irrespective of whether it is calculated using PLINK, GCTA, or LDAK.
The parameters of interest in the univariate model are \(\sigma _{G_{tt}}\) and \(\sigma _{E_{tt}}\), where \(\sigma _{G_{tt}}\) denotes the additive genetic variance of Trait t captured by the available SNPs and \(\sigma _{E_{tt}}\) the remaining variance in Trait t. The latter quantity is sometimes referred to as the environmental variance, even though this name can be somewhat misguiding, since \(\sigma _{E_{tt}}\) simply reflects all variance in Trait t that is not tagged by the additive linear effects of the available SNPs and covariates [6]. In spite of the subtleties in its definition, we stick to the convention of calling this term the environmental variance.
In this model, \(h^2_{\text{SNP}}\) of Trait t can be defined as \(h^2_{\text{SNP}}(t) = \sigma _{G_{tt}} \left( \sigma _{G_{tt}} + \sigma _{E_{tt}} \right) ^{1}\). In essence, univariate GREML quantifies the degree to which genetic similarity between individuals, as tagged by the SNPs used to construct the GRM, maps to trait similarity.
Notice here that REML does not estimate \(\varvec{\beta }_t\) directly. Instead, REML controls for the fixedeffect covariates by considering socalled error contrasts [15, 16]. More specifically, REML estimation is equivalent to maximumlikelihood estimation applied to \({\textbf{K}}_t{\textbf{y}}_t\), where the rows of matrix \({\textbf{K}}_t\) form a basis of the left null space of \({\textbf{X}}_t\). However, once REML estimates of \(\sigma _{G_{tt}}\) and \(\sigma _{E_{tt}}\) are obtained, one can readily calculate the generalized least squares estimator of the fixed effects \(\varvec{\beta }_t\) [1, 9]. This option is implemented in both GCTA and MGREML.
The univariate LMM can be generalized to a multivariate LMM [17, 18], which can be used to jointly estimate genetic covariance and environmental covariance between Traits \(t=1,\ldots ,T\) and \(s=1,\ldots ,T\), denoted by \(\sigma _{G_{ts}}\) and \(\sigma _{E_{ts}}\) respectively. Using the same notation as seen in the original derivations of MGREML [9], this multivariate LMM can be written as follows:
where ‘\(\otimes\)’ denotes the Kronecker product. In this model, \({\textbf{V}}_G\) is the \(T \times T\) genetic variance matrix and \({\textbf{V}}_E\) the \(T \times T\) environmental variance matrix. Now, the genetic correlation between Traits t and s is defined as \(\rho _{G} (t,s) = \sigma _{G_{ts}} \left( \sigma _{G_{tt}} \sigma _{G_{ss}} \right) ^{0.5}\) [2], where \(\sigma _{G_{ts}}\) is element t, s from \({\textbf{V}}_G\).
Computational complexity
The variance matrix of the multivariate model (i.e., \({\textbf{V}}_G \otimes {\textbf{A}} + {\textbf{V}}_E \otimes {\textbf{I}}_{N}\)) is dense, rendering naïve REML estimation infeasible for large N and T, as mere evaluation of the loglikelihood function already requires \(O(N^3T^3)\) time. However, the time complexity can be drastically reduced by transforming the data using the eigenvalue decomposition (EVD) of the GRM [4, 9].
Let \({\textbf{Q}} \varvec{\Phi }{{\textbf{Q}}}^{\top }\) denote the EVD of \({\textbf{A}}\). Here, \({\textbf{Q}}\) denotes the matrix of eigenvectors and \(\varvec{\Phi }\) the diagonal matrix of eigenvalues. MGREML defines matrix \({\textbf{P}}\) as the \(n = NL\) columns from \({\textbf{Q}}\) that correspond to the eigenvalues that are not among the L largest, and \({\textbf{D}}\) as the diagonal matrix with corresponding eigenvalues, \(d_1, \ldots , d_n\).
Using this matrix \({\textbf{P}}\), MGREML transforms the data, and then reorders it such that (i) the variance matrix is block diagonal, enabling significant computational improvements, and (ii) the contribution of the L leading principal components from the genetic data to the variance matrix are eliminated, thus, correcting for population stratification [19] without introducing any additional fixedeffect covariates [9]. By default \(L=20\), causing MGREML to control for even quite subtle population stratification. Users can specify a different value for L using the ––adjustpcs option.
More specifically, the following model holds for \(Tn \times 1\) vector \({\textbf{y}} = \textrm{vec} ( {{\textbf{Y}}}^{\top } {\textbf{P}})\) (where vec() denotes the vectorization operator):
where \({\textbf{Z}} = ({{\textbf{Z}}}^{\top }_1 \; \cdots \; {{\textbf{Z}}}^{\top }_{n})^{\top }\), \({\textbf{Z}}_j = \left( {\textbf{I}}_T \otimes {{\textbf{x}}}^{\top }_j \right) {{\textbf{S}}}^{\top }\), \({{\textbf{x}}}^{\top }_j\) is \(1 \times k\) row j from \({{\textbf{P}}}^{\top } {\textbf{X}}\), and
Omitting the constant, the corresponding loglikelihood function is given by
where \({\textbf{M}} = {\textbf{V}}^{1}  {\textbf{V}}^{1} {\textbf{Z}} \left( {\textbf{Z}}^{\top } {\textbf{V}}^{1} {\textbf{Z}} \right) ^{1} {\textbf{Z}}^{\top } {\textbf{V}}^{1}\) [1]. This loglikelihood function depends on \(\textrm{log} \left {\textbf{V}} \right\) and quadratic forms of the type \(q = {{\textbf{w}}}^{\top } {\textbf{V}}^{1} {\textbf{w}}\). Importantly, \({\textbf{V}}\) is a highly sparse, blockdiagonal matrix, where diagonal block j equals \({\textbf{V}}_j = d_j {\textbf{V}}_G + {\textbf{V}}_E\), with \({\textbf{V}}_G\) and \({\textbf{V}}_E\) being functions of the parameter vector \(\varvec{\theta }\).
As a result of this blockdiagonal structure, these quadratic forms and logdeterminants can be written as a sum of n independent contributions, where each contribution comes from a \(T \times T\) block. MGREML can calculate the contribution of any given block in \(O(T^2)\) time. Concordantly, the loglikelihood function can be evaluated in \(O(NT^2)\) time. Similarly, the gradient (i.e., the vector of partial derivatives of the loglikelihood function with respect to \(\varvec{\theta }\)) can also be calculated in \(O(NT^2)\) time.
MGREML retains its computational efficiency in case there are a limited number of fixed effects covariates. However, if the number of covariates grows large, MGREML will get slower. Nevertheless, as MGREML controls for population stratification without having to introduce any fixed effects for that purpose, a limited number of fixedeffect covariates suffices in a typical empirical application.
The average information (AI) algorithm, a variation on Newton’s method [1, 20], is illsuited for MGREML estimation for large T, since that algorithm involves repeated calculation of the AI matrix, which requires \(O(NT^4)\) time per iteration for a saturated model [9]. Specifically, a saturated model has \(T(T1)\) free parameters. Thus, the AI matrix has \(T(T1) \times T(T1)\) elements, where each element involves a calculation requiring O(N) time, placing overall complexity at \(O(NT^4)\).
To avoid having to calculate the AI matrix in every iteration, MGREML instead uses a Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm [21] in combination with a goldensection line search to estimate \(\varvec{\theta }\). Importantly, a BFGS iteration has roughly the same computational complexity as a gradientdescent iteration yet a higher rate of convergence across iterations. Thus, a BFGS algorithm balances computational complexity per iteration and rate of convergence across iterations.
The BFGS algorithm is initialized such that the first iteration is equivalent to a gradientdescent iteration with goldensection search. Evaluations of the loglikelihood function and its gradient suffice for application of the goldensection search and BFGS algorithm, putting the overall time complexity of MGREML at \(O(NT^2)\) per iteration.
For large T, the BFGS algorithm can exhibit unstable behavior, in which case relevant quantities are reinitialized such that first next BFGS iteration is again equivalent to a gradientdescent iteration with goldensection search. If instability persists, MGREML switches to the AI algorithm for a single iteration. In our experience, such expensive ‘interventions’ are needed only sparingly and are effective in resolving numerical instabilities in MGREML estimation.
Once MGREML has converged, the variance matrix of \({\widehat{\varvec{\theta }}}\) is estimated using the AI matrix [20]. In addition, a delta method is used to obtain the standard error (SE) of \(h^2_{\text{SNP}}\) and \(\rho _G\) estimates. Although calculation of the AI matrix, as indicated, is expensive, this calculation only needs to be carried out once. Moreover, MGREML users can specify the ––nose option to forgo calculation of the AI matrix and SEs altogether after convergence of the BFGS algorithm.
Factor structures
By default, MGREML assumes a saturated model for both \({\textbf{V}}_G\) and \({\textbf{V}}_E\). An example of such a saturated model for \(T=3\) traits is shown in Fig. 1. Letting \(\gamma _{tf}\) (resp. \(\varepsilon _{tf}\)) denote the effect of genetic (environmental) Factor f on Trait t, the saturated model for \(T=3\) traits can be written as follows:
For T traits in general, a saturated model for \({\textbf{V}}_G\) (resp. \({\textbf{V}}_E\)) can be described in terms of a lower triangular matrix of free genetic (environmental) coefficients \({\textbf{C}}_G\) (\({\textbf{C}}_E\)) where \({\textbf{V}}_G = {\textbf{C}}_G {{\textbf{C}}}^{\top }_G\) (\({\textbf{V}}_E = {\textbf{C}}_E {{\textbf{C}}}^{\top }_E\)).
Here, we generalize this approach by allowing \({\textbf{C}}_G\) (resp. \({\textbf{C}}_E\)) to be a \(T \times F_G\) (\(T\times F_E\)) matrix of which a predefined subset of the \(TF_G\) (\(TF_E\)) elements are free, while the other elements are constrained to zero, reflecting an arbitrary factor model with \(F_G\) (\(F_E\)) genetic (environmental) factors. Both factor models need to satisfy standard identification requirements in structural equation modeling [22]. Under this framework, the implied genetic (resp. environmental) variance matrix \({\textbf{V}}_G\) (\({\textbf{V}}_E\)) is always at least positive (semi)definite. In other words, provided the userspecified model is identified, MGREML always yields valid correlation matrices.
MGREML users can specify a main model, comprising a genetic factor model and an environmental factor model. In case a user also specifies a nested model, MGREML performs a classical likelihoodratio test (LRT) [23], to infer whether the fit of the main factor model is significantly better than that of the nested model.
In total, users can specify at most four factor models: (1A) the main genetic factor model, (1B) the main environmental factor model, (2A) the nested genetic factor model, and (2B) the nested environmental factor model. For example, a user can specify a main genetic factor model where there is only one genetic factor for all traits and a nested genetic factor model, where the traits have no genetic variance at all (i.e., there is no genetic factor), while the environmental factor model is saturated both in the main model as well as the nested model.
A factor model specification for MGREML is effectively a binary \(T \times F\) matrix stored as a plain text file, where F denotes the number of factors. More specifically, in a given model, for \(f=1,\ldots ,F\) and \(t=1,\ldots ,T\), if Factor f has a free path coefficient to Trait t, element t, f of the binary matrix equals one and otherwise that element equals zero.
Let \(C_{G_A}\) (resp. \(C_{E_A}\)) denote the number of free coefficients in the main genetic (environmental) factor model and let \(C_{G_0}\) (resp. \(C_{E_0}\)) be defined analogously for the nested model. Finally, let \(\ell _{A}\) (resp. \(\ell _{0}\)) denote the loglikelihood of the main (nested) model. Now, the LRT statistic is calculated by MGREML as \(\text{LRT} = 2(\ell _{A}\ell _{0})\), which under standard maximum likelihood estimation (MLE) assumptions [24] and nestedness of the models is \(\chi ^2\left( (C_{G_A} + C_{E_A})  (C_{G_0} + C_{E_0}) \right)\) distributed.
An example of a genetic factor model that MGREML users can specify is shown in Table 1. The corresponding structural equation model for \({\textbf{V}}_G\) which MGREML fits under that specification is shown in Fig. 2. The environmental factors shaping \({\textbf{V}}_E\) are not shown here, for clarity of the figure. We use this genetic factor model in our empirical application. In this example, the first genetic factor captures the genetic signal shared between all height measurements, the second genetic factor captures the genetic signal shared between all measurements of body mass index (BMI), and the third factor captures the genetic overlap between height and BMI (i.e., the genetic correlation).
Results
Simulation study
To test the validity of MGREML estimates of genetic correlations and underlying factor structures, we generated 100 independent datasets with \(N=20,000\) individuals and \(T=10\) traits with SNPbased \(h^2 = 50\%\). In Simulation 1, we set \(\rho _G\) to the same value across all combinations of traits. In Simulation 2, we simulate two clusters of five traits by setting \(\rho _G\) to random values within clusters and to zero between clusters. In Simulation 3, we consider one additional dataset with \(N=20,000\) individuals and \(T=50\) traits with SNPbased \(h^2 = 50\%\) and \(\rho _G=0\). The simulation design is fully described in the Supplementary Information [see Additional File 1].
As MGREML estimation is a specific form of MLE, we expect MGREML to yield consistent estimates of the population parameters, provided standard MLE assumptions hold [24]. That is, as N increases, each parameter estimate converges to the true value. The results of Simulation 1 support the claim that MGREML yields consistent estimates of \(h^2_{\text{SNP}}\) and \(\rho _G\) across the full range of feasible values for \(\rho _G\) [see Additional File 1: Tables S1–S4]. The SEs of estimates also align with the standard deviations of estimates across the generated datasets. Estimates have lower SEs when interdependence across traits is higher (i.e., higher \(\vert \rho _G \vert\)).
The results of Simulation 2 show that MGREML also yields consistent estimates when the degrees of freedom in the model is larger than necessary [see Additional File 1: Table S5]. Estimates closely reflect the implied factor structure, as illustrated in Fig. 3 which shows MGREML estimation results for the first dataset. When comparing the fit of the appropriate factor model and the saturated model using an LRT, we find that resulting pvalues closely follow the correct theoretical distribution [see Additional File 1: Figure S1].
The results of Simulation 3 show that MGREML can readily estimate and compare factor models for \(T=50\) traits observed in \(N=20,000\) individuals, involving 50 fixed effects and including calculation of SEs, on a single notebook with two 2.7 GHz cores and 16 GB of RAM in less than one hour. In addition, on more powerful machines, MGREML estimation can handle at least up until \(T=200\) traits and \(N=20,000\) individuals [9].
Empirical application
To illustrate the ability of MGREML to estimate a factor model and test whether it fits the data better than a nested model, we use data on \(N=6,425\) unrelated individuals from the US Health and Retirement Study (HRS) [25], for whom we analyze repeated measures of height and BMI in five consecutive waves of data collection (Waves 7–11). The HRS is a longitudinal panel study that surveys a representative sample of approximately 20,000 individuals aged 51 years and older (and their spouses) in the United States. Further details (e.g., quality control filters and descriptive statistics) are provided in the Supplementary Information [see Additional File 1].
As a baseline model, we start by assuming height and BMI both have no genetic variance (Model I). Given that previous \(h^2_{\text{SNP}}\) estimates for height and BMI are considerably greater than zero [26, 27] (e.g., \({\widehat{h}}^2_{\text{SNP}} (\text{height}) = 43\%\) with \(\text{SE} = 2\%\) and \({\widehat{h}}^2_{\text{SNP}} (\text{BMI}) = 21\%\) with \(\text{SE} = 2\%\) [28]), we also consider an alternative model with one genetic factor for the height observations and one genetic factor for the BMI observations (Model II), which corresponds to the first two columns of the factor model shown in Table 1 labeled G\(_{height }\) and G\(_{BMI }\).
Although we expect Model II to have a far better fit than Model I, Model II still assumes there is no genetic correlation between height and BMI. Yet, there is ample evidence that height and BMI are genetically correlated traits [9, 29] (e.g., \({\widehat{\rho }}_G ( \text{height},\text{BMI}) = 0.14\) with \(\text{SE} = 0.04\) [9]). Therefore, we also consider a third model in which we introduce a shared genetic factor that affects both the height and BMI observations (Model III), accounting for the genetic overlap between these two traits. Model III corresponds to the factor model shown in Table 1 (where the shared factor is labeled G\(_{shared }\)) and equivalently in Fig. 2. In all three models, we assume a saturated environmental factor model.
With the HRS surveying a representative sample of individuals aged 51 years and older (and their spouses), it seems unlikely that the unique and the shared genetic architecture of height and BMI will drastically change for individuals in our analysis sample between the biennial waves of data collection. Therefore, we a priori believe Model III to be most suitable for the data. That is, we expect this to be the most parsimonious model that is able to capture both the unique and the shared genetic component of height and BMI across waves. At the same time, taking aforementioned estimates of \(h^2_{\text{SNP}}\) and \(\rho _G\) for height and BMI at face value, and using the online GCTAGREML power calculator [30], we find that the statistical power to detect \(\rho _G (\text{height},\text{BMI}) \ne 0\) in this sample is only 21.8%. Hence, Model II might not be rejected in favor of Model III simply due to lack of statistical power. Details of this power calculation are described in the Supplementary Information [see Additional File 1].
In the application to data on repeated measures of height and BMI, we first compare the fit of Model I and Model II. We find that Model II, as expected, fits the data better than Model I (LRT=72.03, degrees of freedom=10, pvalue=\(1.79 \times 10^{11}\)). Thus, the null model of no genetic variance is rejected in favor of a model in which (1) height has genetic variance, (2) BMI has genetic variance, yet (3) height and BMI have no genetic correlation. When we compare the fit of Model II and Model III, we do not find an improvement in fit (LRT=11.11, degrees of freedom=10, pvalue=0.349). Thus, the model without genetic correlation between height and BMI is not rejected in favor of a model with genetic overlap, in line with our power calculation.
Conclusion
Accurate estimates of genetic correlations and genetic factor structures across multiple traits help to understand their shared etiology and aid in finding likely causal relationships [29, 31]. As such, estimation and inference based on genetic and environmental factor models may contribute to the design of future genetic and functional studies.
Here, we derived a statistical framework (1) to model and estimate such factor models using individuallevel SNP data and (2) to test hypotheses regarding these factor models. Using simulations and an empirical application, we confirmed the validity of this statistical framework.
This framework is implemented in our freely available commandline tool MGREML, which has simple input options for this purpose. MGREML accepts userspecified genetic and environmental factor models as input, and performs estimation and inference based thereon. Even on a single machine, this tool can readily be applied to data on 20,000 individuals and 50 traits.
Availability of data and materials
Mandated data deposition All data used in this manuscript are freely available via publicaccess repositories. Specifically, we obtained access to the genetic data of the Health and Retirement Study (HRS) through the Database of Genotypes and Phenotypes (dbGaP): https://www.ncbi.nlm.nih.gov/gap/ (dbGaP application 3544). The RAND HRS data, produced by the RAND Center for the Study of Aging, containing the phenotype data, can be accessed via the HRS website: https://hrsdata.isr.umich.edu/dataproducts/rand/. Researchers who wish to link genotype and phenotype data from the HRS must apply for access via the HRS website: https://hrsdata.isr.umich.edu/dataproducts/geneticcrossreference/.
Software and code The MGREML tool and the code for the simulation study are freely available via the GitHub repository for this project:
· Project name: MGREML
· Project home page (including tutorial):https://github.com/devlaming/mgreml/
· Operating system: platform independent
· Programming language: Python 3.x.
· Other requirements: Python packages networkx, numpy, pandas, psutil, scipy, and tqdm
· License: GNU GPL v3
· Any restrictions to use by nonacademics: as stipulated by GNU GPL v3
Abbreviations
 AI:

Average information
 BFGS algorithm:

Broyden–Fletcher–Goldfarb–Shanno algorithm
 BMI:

Body mass index
 EVD:

Eigenvalue decomposition
 GCTA:

Genomewide complex trait analysis
 GREML:

Genomicrelatednessbased restricted maximum likelihood
 GRM:

Genomicrelatedness matrix
 GWAS:

Genomewide association study
 HRS:

Health and Retirement Study
 LMM:

Linear mixed model
 LRT:

Likelihoodratio test
 MGREML:

Multivariate genomicrelatednessbased restricted maximum likelihood
 MLE:

Maximum likelihood estimation
 RAM:

Randomaccess memory
 REML:

Restricted maximum likelihood
 SE:

Standard error
 SNP:

Singlenucleotide polymorphism
References
Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genomewide complex trait analysis. Am J Hum Genet. 2011;88:76–82.
Lee SH, Yang J, Goddard ME, Visscher PM, Wray NR. Estimation of pleiotropy between complex diseases using singlenucleotide polymorphismderived genomic relationships and restricted maximum likelihood. Bioinformatics. 2012;28:2540–2.
Benjamin DJ, Cesarini D, Chabris CF, Glaeser EL, Laibson DI, Gudnason V, et al. The promises and pitfalls of genoeconomics. Annu Rev Econ. 2012;4:627–62.
Lee SH, Van der Werf JHJ. MTG2: an efficient algorithm for multivariate linear mixed model analysis based on genomic information. Bioinformatics. 2016;32:1420–2.
Evans LM, Tahmasbi R, Vrieze SI, Abecasis GR, Das S, Gazal S, et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nat Genet. 2018;50:737–45.
Lee JJ, Chow CC. Conditions for the validity of SNPbased heritability estimation. Hum Genet. 2014;133:1011–22.
Witte JS, Visscher PM, Wray NR. The contribution of genetic variants to disease depends on the ruler. Nat Rev Genet. 2014;15:765–76.
Yang J, Zeng J, Goddard ME, Wray NR, Visscher PM. Concepts, estimation and interpretation of SNPbased heritability. Nat Genet. 2017;49:1304–10.
De Vlaming R, Slob EAW, Jansen PR, Dagher A, Koellinger PD, Groenen PJF, et al. Multivariate analysis reveals shared genetic architecture of brain morphology and human behavior. Commun Biol. 2021;4:1180.
Schumacker RE, Lomax RG. A beginner’s guide to structural equation modeling. 4th ed. New York: Routledge; 2016.
Grotzinger AD, Rhemtulla M, de Vlaming R, Ritchie SJ, Mallard TT, Hill WD, et al. Genomic structural equation modelling provides insights into the multivariate genetic architecture of complex traits. Nat Hum Behav. 2019;3:513–25.
Patterson HD, Thompson R. Recovery of interblock information when block sizes are unequal. Biometrika. 1971;58:545–54.
Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Secondgeneration PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4:s13742015.
Speed D, Hemani G, Johnson MR, Balding DJ. Improved heritability estimation from genomewide SNPs. Am J Hum Genet. 2012;91:1011–21.
Harville DA. Bayesian inference for variance components using only error contrasts. Biometrika. 1974;61:383–5.
Casella G, Searle SR. On a matrix identity useful in variance component estimation. Biometrics Unit Technical Reports. 1985; p. BU875M.
Meyer K. Maximum likelihood estimation of variance components for a multivariate mixed model with equal design matrices. Biometrics. 1985;41:153–65.
Lynch M, Walsh B. Genetics and analysis of quantitative traits. 1st ed. Sunderland: Sinauer; 1998.
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genomewide association studies. Nat Genet. 2006;38:904–9.
Gilmour AR, Thompson R, Cullis BR. Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics. 1995;51:1440–50.
Nocedal J, Wright SJ. Numerical optimization. 2nd ed. New York: Springer; 2006.
Bollen KA. Structural equations with latent variables. 1st ed. New York: Wiley; 1989.
Wilks SS. The largesample distribution of the likelihood ratio for testing composite hypotheses. Ann Math Stat. 1938;9:60–2.
Newey WK, McFadden D. Chapter 36: Large sample estimation and hypothesis testing. vol. 4 of Handbook of Econometrics. Elsevier; 1994. p. 2111–2245.
Sonnega A, Faul JD, Ofstedal MB, Langa KM, Phillips JWR, Weir DR. Cohort profile: the Health and Retirement Study (HRS). Int J Epidemiol. 2014;43:576–85.
Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42:565–9.
Yang J, Manolio TA, Pasquale LR, Boerwinkle E, Caporaso N, Cunningham JM, et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet. 2011;43:519–25.
De Vlaming R, Okbay A, Rietveld CA, Johannesson M, Magnusson PK, Uitterlinden AG, et al. MetaGWAS Accuracy and Power (MetaGAP) calculator shows that hiding heritability is partially due to imperfect genetic correlations across studies. PLoS Genet. 2017;13: e1006495.
BulikSullivan B, Finucane HK, Anttila V, Gusev A, Day FR, Loh PR, et al. An atlas of genetic correlations across human diseases and traits. Nat Genet. 2015;47:1236–41.
Visscher PM, Hemani G, Vinkhuyzen AA, Chen GB, Lee SH, Wray NR, et al. Statistical power to detect genetic (co) variance of complex traits using SNP data in unrelated samples. PLoS Genet. 2014;10: e1004269.
O’Connor LJ, Price AL. Distinguishing genetic correlation from causation across 52 diseases and complex traits. Nat Genet. 2018;50:1728–34.
Acknowledgements
The authors are grateful for computational resources provided by the Dutch national einfrastructure with support of the SURF cooperative (Project EINF607).
Funding
This work was supported by the European Research Council (Starting Grant 946647 GEPSI to CAR). EAWS is funded by the NIHR Cambridge BRC. The Health and Retirement Study (HRS) is sponsored by the National Institute on Aging (Grant NIA U01AG009740) and is conducted by the University of Michigan. These funding bodies did not play any role in the design of the study, in the collection, analysis, and interpretation of data, and in writing the manuscript.
Author information
Authors and Affiliations
Contributions
Conceptualization: all authors. Methodology: all authors. Software: RDV, EAWS. Formal analysis: RDV, EAWS, CAR. Data Curation: CAR. Writing—Original Draft: RDV, EAWS, CAR. Visualization: RDV, EAWS, CAR. Writing—Review and Editing: all authors. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The Health and Retirement Study (HRS) has been approved by the University of Michigan Health Sciences/Behavioral Sciences Institutional Review Board (IRB Protocol: HUM0006112). The research project in which this manuscript is embedded has been approved by the Erasmus Research Institute of Management Institutional Review Board (IRBE approval 201404). In accordance with the stated approval by the University of Michigan Health Sciences/Behavioral Sciences Institutional Review Board of the HRS, informed consent to participate has been obtained from the participants.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Supplementary Information.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
De Vlaming, R., Slob, E.A.W., Groenen, P.J.F. et al. Multivariate estimation of factor structures of complex traits using SNPbased genomic relationships. BMC Bioinformatics 23, 305 (2022). https://doi.org/10.1186/s12859022048353
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859022048353
Keywords
 SNP heritability
 Genetic correlation
 GREML
 Genetic factor model
 Genomic SEM