Expressing gene-based methods as a weighted combination of Z-scores
Consider a sample of n individuals with both genotype and phenotype data available in a genomic region (gene or pathway) containing M genetic variants (e.g. SNPs). For the \(i^{th}\) individual, let \(y_i\) denote the trait value, which is either quantitative or qualitative (1 for cases and 0 for controls), and let \(X_i = (x_{i1},\ldots ,x_{iM})^{'}\) denote the genotypic score for the considered region, where \(x_{im}\in \{0, 1, 2\}\) is the number of minor alleles the \(i^{th}\) individual carries at the \(m^{th}\) variant. \(x_{im}\) can also be the minor-allele count under dominant or recessive coding, or an imputed dosage. Although the formulas derived in the Methods section are based on genotypes with additive coding, the conclusions still hold when the genotypes are centered to mean 0.
The generalized linear model was used to model the relationship between the trait and the genetic variants in the considered region:
$$\begin{aligned} f(E(y_i|X_i))=\beta _0+\varvec{\beta _c}^{'}X_i \end{aligned}$$
where \(f(\cdot )\) is a monotone “link” function and the vector \(\varvec{\beta _c}\) is the parameter of interest, representing the fixed effects of the genetic variants. Testing the association between the genetic variants in the region and the trait is equivalent to testing the effect of the weighted combination of genetic variants \(g_i=\sum _{m=1}^Mw^0_mx_{im}\) on the trait. Under the generalized linear model, we propose to use the score test statistic [30] to test the null hypothesis \(H_0: \varvec{\beta _c}=0\). The score statistic can be expressed as follows:
$$\begin{aligned} \varvec{S}(w^0_1,\ldots ,w^0_M)=n\frac{\big (\sum _{i=1}^n(y_i-{\bar{y}})(g_i-{\bar{g}})\big )^2}{\sum _{i=1}^n(y_i-{\bar{y}})^2\sum _{i=1}^n(g_i-{\bar{g}})^2} \end{aligned}$$
The score test statistic \(\varvec{S}\) can be viewed as a function of the weight vector \(\varvec{W_0}=(w^0_1,\ldots ,w^0_M)^{'}\). Let \(Y=(y_1,\ldots ,y_n)^{'}\) and \(\varvec{X}=(X_1,\ldots ,X_n)^{'}\). Denote \(P=I_n-\frac{1}{n}1_n1_n^{'}\), where \(1_n\) represents a column vector of ones. P is an idempotent matrix; that is, \(P=P^{'}\) and \(PP=P\). Noting that \(g_i=X_i^{'}\varvec{W_0}\), we can rewrite the score test as:
$$\begin{aligned} \varvec{S}(w^0_1,\ldots ,w^0_M)&=n\frac{\varvec{W_0}^{'}\varvec{X}^{'}PYY^{'}P\varvec{X}\varvec{W_0}}{(Y^{'}PY)\,\varvec{W_0}^{'}\varvec{X}^{'}P\varvec{X}\varvec{W_0}} \end{aligned}$$
Detailed derivation of the aforementioned score statistic can be found in the supplementary materials (Additional file 1: Derivation of the score test). When real genotype and phenotype data are available, the score statistic can be maximized and extended to a General method to Test the effect of the Optimally Weighted combination of genetic variants in a gene (G-TOW) [30].
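The equivalence between the scalar and matrix forms of the score statistic can be checked numerically. The following is a minimal numpy sketch on simulated toy data (sample size, genotypes, trait, and weights are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 200, 5                                        # toy sample size and variant count
X = rng.integers(0, 3, size=(n, M)).astype(float)    # additive genotype codes 0/1/2
Y = rng.normal(size=n)                               # quantitative trait under the null
W0 = rng.uniform(0.5, 1.5, size=M)                   # an arbitrary weight vector w^0

# Scalar form: S = n * (sum_i (y_i - ybar)(g_i - gbar))^2
#                  / (sum_i (y_i - ybar)^2 * sum_i (g_i - gbar)^2)
g = X @ W0
num = np.sum((Y - Y.mean()) * (g - g.mean())) ** 2
den = np.sum((Y - Y.mean()) ** 2) * np.sum((g - g.mean()) ** 2)
S_scalar = n * num / den

# Matrix form with the centering matrix P = I - (1/n) 1 1'
P = np.eye(n) - np.ones((n, n)) / n
S_matrix = n * (W0 @ X.T @ P @ Y) ** 2 / ((Y @ P @ Y) * (W0 @ X.T @ P @ X @ W0))

assert np.isclose(S_scalar, S_matrix)
```

Both expressions agree to floating-point precision, since \(P\) centers both \(Y\) and \(g\).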
To test the association between a trait and a single genetic variant, a Z test is usually employed. We can use the Z test below to test the main effect of the \(m^{th}\) variant in the considered region on the trait: \(Z_m=\frac{Y^{'}PX_{m\cdot }}{\sigma \sqrt{X_{m\cdot }^{'}PX_{m\cdot }}}\), where \(\sigma =\sqrt{\frac{1}{n}Y^{'}PY}\) and \(X_{m\cdot }=(x_{1m},\ldots ,x_{nm})^{'}\) is the vector of genotypic scores at the \(m^{th}\) variant across the n individuals.
Denote the linkage disequilibrium (LD) matrix for the considered region as \(\varvec{R}=diag(\varvec{D})^{-1/2}\varvec{D} diag(\varvec{D})^{-1/2}\), where \(\varvec{D}=\varvec{X}^{'}P\varvec{X}\) and \(diag(\varvec{D})\) denotes the diagonal matrix of \(\varvec{D}\). When GWAS summary statistics such as the Z-scores and the LD matrix for genetic variants in the considered region are available, the score statistic can be written as:
$$\begin{aligned} \varvec{S}(w_1,\ldots ,w_M)&=\frac{\varvec{W}^{'}\varvec{Z}\varvec{Z}^{'}\varvec{W}}{\varvec{W}^{'}\varvec{R}\varvec{W}} \end{aligned}$$
(1)
where \(\varvec{Z}=(Z_1,\ldots ,Z_M)^{'}\) and \(\varvec{W}=(w_1,\ldots ,w_M)^{'}=diag(\varvec{D})^{1/2}\varvec{W_0}\) (see Additional file 1: Derivation of the score test). From equation (1), the score statistic \(\varvec{S}\) is equivalent to a linearly weighted test statistic based on Z-scores:
$$\begin{aligned} \varvec{L}(w_1,\ldots ,w_M)=\sum _{m=1}^Mw_mZ_m=\varvec{W}^{'}\varvec{Z} \end{aligned}$$
(2)
Under the null hypothesis, \({\varvec{Z}}\) follows a multivariate normal distribution with mean \({\varvec{0}}\) and covariance matrix \(\varvec{R}\) [31]. This demonstrates that testing the weighted combination of genetic variants in a considered region using the score test is equivalent to testing the weighted combination of Z-scores for those variants.
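The summary-statistic identity \(\varvec{S}=(\varvec{W}^{'}\varvec{Z})^2/(\varvec{W}^{'}\varvec{R}\varvec{W})\) with \(\varvec{W}=diag(\varvec{D})^{1/2}\varvec{W_0}\) can likewise be verified against the individual-level form. A numpy sketch with simulated toy data (all quantities invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, M = 300, 4
X = rng.integers(0, 3, size=(n, M)).astype(float)   # toy genotypes
Y = rng.normal(size=n)                              # toy trait
W0 = rng.uniform(0.5, 1.5, size=M)                  # toy weights w^0

P = np.eye(n) - np.ones((n, n)) / n                 # centering matrix
sigma = np.sqrt(Y @ P @ Y / n)
D = X.T @ P @ X
d = np.sqrt(np.diag(D))
Z = (X.T @ P @ Y) / (sigma * d)                     # per-variant Z-scores
R = D / np.outer(d, d)                              # LD (correlation) matrix

# Individual-level score statistic S(W0)
S_raw = n * (W0 @ X.T @ P @ Y) ** 2 / ((Y @ P @ Y) * (W0 @ D @ W0))

# Summary-level form with W = diag(D)^{1/2} W0
W = d * W0
S_sum = (W @ Z) ** 2 / (W @ R @ W)

assert np.isclose(S_raw, S_sum)
```

Only \({\varvec{Z}}\) and \(\varvec{R}\), not the raw genotypes, are needed for the right-hand side, which is what makes summary-data analysis possible.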
In the aforementioned weight function \(\varvec{W}=(w_1,\ldots ,w_M)^{'}\), the true value of each weight is unknown and must be determined biologically or empirically. Therefore, in real data analysis, we should specify reasonable values of the weights in advance for a gene-based test. If all or most of the genetic variants in the region have nearly equal effect sizes in the same direction of association, we set \(w_m=1\) for \(m=1,\ldots ,M\), and the test becomes the burden test \(\varvec{L}_B=\varvec{L}(1,\ldots ,1)\), which sums the association signals across all the variants and attains high power in this setting. If we believe that the causal genetic variants are subject to “purifying selection” and thus appear less frequently in the population than neutral variants, we set \(w_m=1/\sqrt{p_m(1-p_m)}\), where \(p_m\) denotes the MAF of the \(m^{th}\) variant, and obtain \(\varvec{L}_W=\varvec{L}(1/\sqrt{p_1(1-p_1)},\ldots ,1/\sqrt{p_M(1-p_M)})\), which is the weighted sum statistic (WSS) [12]. If the values of the weights \(\varvec{W}\) come from gene expression or functional annotation data, the test reduces to the PathSPU(1) test [28]. Under the null hypothesis, provided the chosen weight function \(\varvec{W}\) is not proportional to \({\varvec{Z}}\), \(\varvec{S}(w_1,\ldots ,w_M)\) follows a central chi-square distribution with 1 degree of freedom (\(\chi ^2_1\)) and \(\varvec{L}(w_1,\ldots ,w_M)\) follows a normal distribution with mean 0 and variance \(\varvec{W}^{'}\varvec{R}\varvec{W}\).
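Given summary-level \({\varvec{Z}}\) and \(\varvec{R}\), both fixed-weight tests reduce to a normal test on \(\varvec{W}^{'}\varvec{Z}\). A short sketch with hypothetical Z-scores, LD matrix, and MAFs (none taken from real data):

```python
import numpy as np
from scipy import stats

# Hypothetical summary data for M = 3 variants
Z = np.array([1.2, 0.8, 1.5])
R = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
maf = np.array([0.05, 0.20, 0.35])

def weighted_z_test(W, Z, R):
    """Two-sided p-value for L(W) = W'Z ~ N(0, W'RW) under the null."""
    L = W @ Z
    return 2 * stats.norm.sf(abs(L) / np.sqrt(W @ R @ W))

p_burden = weighted_z_test(np.ones(3), Z, R)                    # L_B: w_m = 1
p_wss = weighted_z_test(1 / np.sqrt(maf * (1 - maf)), Z, R)     # L_W: WSS weights

assert 0 < p_burden < 1 and 0 < p_wss < 1
```

Only the weight vector changes between the burden test and WSS; the null calibration is identical.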
As a function of \(\varvec{W}=(w_1,\ldots ,w_M)^{'}\), either the score test \(\varvec{S}(w_1,\ldots ,w_M)\) or the linear weighted test statistic \(\varvec{L}(w_1,\ldots ,w_M)\) can reach its maximum when we choose an appropriate weight \(\varvec{W}\). According to conclusions in Li and Lagakos [32], we have
$$\begin{aligned} \sup _{\varvec{W}}\{\varvec{S}(w_1,\ldots ,w_M)\}&= \sup _{\varvec{W}}\left\{ \frac{L(w_1,\ldots ,w_M)^2}{{\textrm{Var}}(L(w_1,\ldots ,w_M))}\right\} \\&=\sup _{\varvec{W}}\left\{ \frac{\varvec{W}^{'}\varvec{Z}\varvec{Z}^{'}\varvec{W}}{\varvec{W}^{'}\varvec{R}\varvec{W}}\right\} \\&=\varvec{Z}^{'}\varvec{R}^{-1}\varvec{Z} \end{aligned}$$
When \(\widehat{\varvec{W}}=\varvec{R}^{-1}\varvec{Z}\), the score test statistic \(\varvec{S}(w_1,\ldots ,w_M)\) reaches its maximum value. Given the asymptotic null distribution of \({\varvec{Z}}\) in Eq. (2), we define the score test
$$\begin{aligned} \varvec{S}_S=\widehat{\varvec{W}}^{'}\varvec{Z}=\varvec{Z}^{'}\varvec{R}^{-1}\varvec{Z} \end{aligned}$$
(3)
which follows a central chi-square distribution with M degrees of freedom (\(\chi _M^2\)). The appropriate weights can be obtained when the linear weighted test statistic reaches its maximum value [33]. Although \(S_S\) may not have high power when its degrees of freedom are large, it gives higher weights to the SNPs that are weakly correlated with other SNPs. When the correlation matrix \(\varvec{R}\) of \({\varvec{Z}}\) is a diagonal matrix \(\varvec{A}=diag(a_1,\ldots ,a_M)\) with \(0< a_i\le 1\), that is, \(\varvec{R}=\varvec{A}\), we have \(\widehat{\varvec{W}}=\varvec{A}^{-1}\varvec{Z}\), and the score test in Equation (1), which is equivalent to the linear weighted test in Equation (2), reaches its maximum value at this weight.
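The supremum property in the display above, \(\sup _{\varvec{W}}\varvec{S}(\varvec{W})=\varvec{Z}^{'}\varvec{R}^{-1}\varvec{Z}\), can be checked numerically: no weight vector should produce a larger ratio. A sketch with the same kind of hypothetical summary data as before:

```python
import numpy as np
from scipy import stats

# Hypothetical summary data for M = 3 variants
Z = np.array([1.2, 0.8, 1.5])
R = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

S_S = Z @ np.linalg.solve(R, Z)         # Z' R^{-1} Z
p_SS = stats.chi2.sf(S_S, df=len(Z))    # null distribution is chi-square with M df

# No random weight vector W beats the optimal value (Cauchy-Schwarz bound)
rng = np.random.default_rng(2)
for _ in range(100):
    W = rng.normal(size=3)
    assert (W @ Z) ** 2 / (W @ R @ W) <= S_S + 1e-9

assert 0 < p_SS < 1
```

Using `np.linalg.solve` instead of explicitly inverting \(\varvec{R}\) is the numerically preferable way to form \(\varvec{Z}^{'}\varvec{R}^{-1}\varvec{Z}\).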
To test the association between genetic variants in a considered region and a trait, Kwak and Pan [6] proposed a class of approaches called sum of powered score (SPU) tests, \(SPU(\gamma )=\sum _{m=1}^{M}{Z}_{m}^{\gamma }\) with \(\gamma =1,2,\ldots ,8,\infty\), along with a data-adaptive version (aSPU). The SPU test can also be viewed as a combination test with weight \(\varvec{W}=\varvec{Z}^{\gamma -1}\) (applied elementwise), and aSPU as a data-adaptive weighted combination test.
When the diagonal matrix \(\varvec{A}\) is the identity matrix, \(\varvec{A}={\varvec{I}}\), we denote the test in Equation (3) as \(S_Q=\varvec{Z}^{'}\varvec{Z}\), which is the same as the sum of squared score test (SSU) [34] and the variance component test [35]. Based on the asymptotic null distribution of \({\varvec{Z}}\), the test \(S_Q=\varvec{Z}^{'}\varvec{Z}\) follows a mixture of chi-square distributions under the null hypothesis: \(S_Q\sim \sum _{m=1}^M\lambda _m\chi ^2_1\), where \(\lambda _1,\ldots ,\lambda _M\) are the eigenvalues of \(\varvec{R}\). In particular, if we set the diagonal elements of \(\varvec{A}\) to the beta density with pre-specified shape parameters 1 and 25, evaluated at the corresponding sample MAFs, the score test reduces to the sequence kernel association test (SKAT) for rare variants [13]. If the diagonal elements of \(\varvec{A}\) come from a set of gene-expression-derived weights, the score test reduces to the PathSPU(2) test [28]. Naturally, both SKAT and PathSPU(2) follow a mixture of chi-square distributions under the null hypothesis. In this paper, we only consider GWAS summary data for common variants, so we set \(\varvec{A}\) to the identity matrix.
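The mixture-of-chi-square null of \(S_Q\) can be evaluated by drawing from \(\sum _m\lambda _m\chi ^2_1\) directly once the eigenvalues of \(\varvec{R}\) are known. A Monte Carlo sketch with hypothetical summary data (exact methods such as Davies' algorithm exist, but simulation keeps the sketch self-contained):

```python
import numpy as np

# Hypothetical summary data for M = 3 variants
Z = np.array([1.2, 0.8, 1.5])
R = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

S_Q = Z @ Z                              # sum of squared Z-scores
lam = np.linalg.eigvalsh(R)              # eigenvalues lambda_1, ..., lambda_M of R

# Null draws from sum_m lambda_m * chi2_1
rng = np.random.default_rng(3)
null = (lam * rng.chisquare(1, size=(100_000, len(lam)))).sum(axis=1)
p_SQ = (np.sum(null >= S_Q) + 1) / (len(null) + 1)

assert 0 < p_SQ < 1
```

Note that the eigenvalues sum to \(trace(\varvec{R})=M\), a quick sanity check on the decomposition.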
A new gene-based method
We have shown that most gene-based association tests can be expressed as a weighted combination of Z-scores. Thus, we can propose a new weighted combination method that exploits the good properties of different weights. The statistics \(L_B, L_W, S_S\), and \(S_Q\) represent four typical weighting schemes. To combine the strengths of \(L_B, L_W, S_S\), and \(S_Q\), we consider their weighted average:
$$\begin{aligned} L_{{\varvec{\rho }}}&=\rho _1(L_B)^2+\rho _2(L_W)^2+\rho _3S_S+\rho _4S_Q\\&=\varvec{Z}^{'}\varvec{A}\varvec{Z} \end{aligned}$$
where \(\varvec{A}=\rho _1\varvec{1}\varvec{1}^{'}+\rho _2\varvec{W}\varvec{W}^{'}+\rho _3\varvec{R}^{-1}+\rho _4{\varvec{I}}\), \(\varvec{1}\) denotes a column vector containing all 1s, \(\rho _1+\rho _2+\rho _3+\rho _4=1\), and \(0\le \rho _i\le 1\) for \(i=1,2,3,4\). Under the null hypothesis, for a given \(\varvec{\rho }\), \(L_{{\varvec{\rho }}}\) is a linear combination of independent central \(\chi _1^2\) random variables:
$$\begin{aligned} L_{{\varvec{\rho }}}\sim \sum _{i=1}^M\lambda _i\chi _1^2 \end{aligned}$$
where \(\chi _1^2\) denotes a central \(\chi ^2\) random variable with 1 degree of freedom and \(\lambda _i\) for \(i=1,\ldots ,M\) are the eigenvalues of \(\varvec{RA}\) [4]. We propose a novel method, OWC. For a set of values of \(\varvec{\rho }\), the OWC test uses the minimum p-value across the values of \(\varvec{\rho }\):
$$\begin{aligned} T=\min _{\varvec{\rho }}p_{L_{\varvec{\rho }}} \end{aligned}$$
(4)
where \(p_{L_{\varvec{\rho }}}\) is the estimated p-value of \(L_{\varvec{\rho }}\). T can be obtained by a simple grid search over candidate values of \(\varvec{\rho }=(\rho _1,\rho _2,\rho _3,\rho _4)\): we search over \(\rho _i\in \{0,0.1,0.2,\ldots ,0.9,1\}\) for \(i=1,2,3,4\), subject to \(\rho _1+\rho _2+\rho _3+\rho _4=1\), and take the minimum of the resulting p-values. In particular, if \(\rho _2=\rho _3=0\), \(L_{{\varvec{\rho }}}\) reduces to \(\rho _1(L_B)^2+(1-\rho _1)S_Q\), which is equivalent to the SKAT-O test [14].
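The grid search above can be sketched as follows. This is a simplified illustration, not the full OWC implementation: it uses a coarse grid (step 0.25 rather than 0.1), a small Monte Carlo sample for each \(L_{\varvec{\rho }}\) p-value, and hypothetical summary data.

```python
import numpy as np

def lrho_pvalue(Z, R, A, rng, n_null=2000):
    """Monte Carlo p-value of L_rho = Z'AZ: under the null it is distributed
    as sum_i lambda_i * chi2_1, with lambda_i the eigenvalues of R @ A."""
    lam = np.real(np.linalg.eigvals(R @ A))
    stat = Z @ A @ Z
    null = (lam * rng.chisquare(1, size=(n_null, len(lam)))).sum(axis=1)
    return (np.sum(null >= stat) + 1) / (n_null + 1)

def owc_T(Z, R, W_wss, grid, seed=0):
    """T = min over rho on the simplex grid (rho_1+...+rho_4 = 1) of the
    p-value of L_rho, with A = rho1*11' + rho2*WW' + rho3*R^{-1} + rho4*I."""
    rng = np.random.default_rng(seed)
    M = len(Z)
    Rinv = np.linalg.inv(R)
    best = 1.0
    for r1 in grid:
        for r2 in grid:
            for r3 in grid:
                r4 = 1.0 - r1 - r2 - r3
                if r4 < -1e-9:                  # outside the simplex
                    continue
                A = (r1 * np.ones((M, M)) + r2 * np.outer(W_wss, W_wss)
                     + r3 * Rinv + max(r4, 0.0) * np.eye(M))
                best = min(best, lrho_pvalue(Z, R, A, rng))
    return best

# Hypothetical summary data for M = 3 variants
Z = np.array([1.2, 0.8, 1.5])
R = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
maf = np.array([0.05, 0.20, 0.35])
T = owc_T(Z, R, 1 / np.sqrt(maf * (1 - maf)), grid=np.arange(0.0, 1.01, 0.25))
assert 0 < T <= 1
```

Because T is a minimum of correlated p-values, its own significance must be calibrated by the Monte Carlo procedure of the next subsection rather than read off directly.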
p-value estimation
Monte Carlo simulation is used to obtain the p-value of T in a single layer of simulations. Briefly, after obtaining \(\varvec{R}\), we first simulate null scores \({\varvec{Z}}^{(b)}\sim N(0, \varvec{R})\) for \(b=1,\ldots , B\). We then use the null scores to calculate the null test statistics \(T^{(b)}\) following the aforementioned procedure for each b; the p-value of the test is the proportion of null test statistics satisfying \(T^{(b)}\le T\) [36]. A larger B is needed to estimate a smaller p-value.
The vector \({\varvec{Z}}^{(b)}\) can be generated as follows [37]: we first generate a vector \(\varvec{L}\) with M elements, each independently drawn from a standard normal distribution with mean 0 and variance 1; that is, \(\varvec{L}\sim N({\varvec{0}}, {\varvec{I}})\). We then set \({\varvec{Z}}^{(b)}=\varvec{D}\varvec{L}\), where \(\varvec{D}\) is obtained from the Cholesky decomposition of \(\varvec{R}\) with \(\varvec{R}=\varvec{D}\varvec{D}^{'}\). Specifically, for the test statistic \(T({\varvec{Z}}, \varvec{R})\), viewed as a function of \({\varvec{Z}}\) and \(\varvec{R}\), we estimate its p-value as follows:
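The Cholesky-based sampler is a one-liner in numpy; the sketch below (with a hypothetical \(\varvec{R}\)) draws many null vectors at once and checks that their sample covariance recovers \(\varvec{R}\):

```python
import numpy as np

# Hypothetical LD matrix R for M = 3 variants
R = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
rng = np.random.default_rng(4)

D = np.linalg.cholesky(R)            # lower-triangular D with R = D D'
L = rng.normal(size=(3, 50_000))     # columns are i.i.d. N(0, I) vectors
Zb = D @ L                           # each column is one null draw Z^(b) ~ N(0, R)

# The sample covariance of the draws should approximate R
assert np.allclose(np.cov(Zb), R, atol=0.05)
```

Drawing all B vectors in one matrix multiplication is substantially faster than a Python loop over b.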
1. Generate independent \({\varvec{Z}}^{(b)}\sim N(0, \varvec{R})\) for \(b=1,\ldots ,B\).
2. Using the asymptotic distribution of \(L_{{\varvec{\rho }}}\) under the null hypothesis, calculate the test statistic T by searching across the range of \(\varvec{\rho }\) for \({\varvec{Z}}\) and for each \({\varvec{Z}}^{(b)}\).
3. Finally, the p-value for the T test, \(p_{T}\), is
$$\begin{aligned} p_{T}=\bigg [\sum _{b=1}^BI\big (T({\varvec{Z}}^{(b)}, \varvec{R})\le T({\varvec{Z}}, \varvec{R})\big )+1 \bigg ]/(B+1) \end{aligned}$$
(5)
where \(T({\varvec{Z}}, \varvec{R})\) is the value of the T test based on the observed data and \(T({\varvec{Z}}^{(b)}, \varvec{R})\) is the value of the T test based on the \(b^{th}\) sampled data.
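Equation (5) is a standard Monte Carlo p-value with a +1 correction in both numerator and denominator, which prevents a reported p-value of exactly zero. A minimal sketch (the helper name and the toy numbers are ours):

```python
import numpy as np

def mc_pvalue(T_obs, T_null):
    """Equation-(5)-style Monte Carlo p-value: smaller T (a minimum p-value)
    is more extreme, and (count + 1)/(B + 1) keeps the estimate above zero."""
    T_null = np.asarray(T_null)
    return (np.sum(T_null <= T_obs) + 1) / (len(T_null) + 1)

assert mc_pvalue(0.01, [0.5, 0.2, 0.05, 0.9]) == 1 / 5  # no null draw <= 0.01
assert mc_pvalue(0.30, [0.5, 0.2, 0.05, 0.9]) == 3 / 5  # two null draws <= 0.30
```

The smallest attainable estimate is \(1/(B+1)\), which is why a larger B is required to resolve smaller p-values.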
If the Z statistics are not provided in the summary data, we first transform the reported p-values into Z statistics using \(Z=\textrm{sign}(\beta )\Phi ^{-1}(1-p/2)\), where \(\Phi\) is the cumulative distribution function of the standard normal distribution. A similar procedure can then be used to obtain the p-value of the test T.
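This p-to-Z transform is a direct application of the normal quantile function; the sketch below (helper name and numbers are ours) checks the round trip from a Z-score to a two-sided p-value and back:

```python
import numpy as np
from scipy import stats

def p_to_z(p, beta):
    """Z = sign(beta) * Phi^{-1}(1 - p/2): recover a Z-score from a two-sided
    p-value and the sign of the reported effect estimate beta."""
    return np.sign(beta) * stats.norm.ppf(1 - p / 2)

# Round trip: Z = -1.96 gives a two-sided p near 0.05 and converts back
z = -1.96
p = 2 * stats.norm.sf(abs(z))
assert np.isclose(p_to_z(p, beta=-0.4), z)
```

The sign of \(\beta\) is essential: without it the direction of each variant's effect, and hence the value of any signed combination such as \(L_B\), would be lost.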
One limitation of Monte Carlo p-value estimation, such as the procedure above, is the computational burden, especially when a GWAS involves about twenty thousand genes and a small significance level is used to declare significant findings. We adopted a fast algorithm [26] to estimate p-values, which dramatically reduces the computational time by sacrificing precision of the p-value estimates for tests with large true p-values.
We first define the following parameters for the algorithm:
- \(B_{max}\) = maximum number of random samplings (e.g. \(10^6\))
- \(B_0\) = minimum number of random samplings (e.g. 10)
- \(p_0\) = a constant \(\times\) significance level (e.g. \(5 \times 10^{-6}\))
- M = multiplying increment for the number of random samplings (e.g. 10)
The fast algorithm works as follows:

- Step 0: Calculate the statistic T of OWC based on the observed data.
- Step 1: Set initial values: \(p_0=10^{-5}\), \(B_{max}=10^6\), \(B_0=10\), \(M=10\), \(B=B_0\).
- Step 2: Use Eqs. (4) and (5) with B samplings to estimate the p-value \({\hat{p}}\). Let \(B=B \times M\).
- Step 3: If \({\hat{p}}>p_0\) or \(B>B_{max}\), report \({\hat{p}}\) and stop; otherwise go to Step 2.
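The adaptive loop can be sketched generically as below. This is our simplified rendering of the scheme, not the reference implementation: `draw_null_T` stands in for the full null-statistic computation, and the uniform null in the check (where the true p-value equals the observed statistic) is purely illustrative.

```python
import numpy as np

def fast_mc_pvalue(T_obs, draw_null_T, p0=1e-5, B0=10, Bmax=10**6, mult=10, seed=0):
    """Adaptive Monte Carlo p-value: start with B0 null draws and multiply B
    by `mult` until either the estimate exceeds p0 (clearly non-significant,
    so low precision suffices) or B exceeds Bmax. `draw_null_T(B, rng)` must
    return B null draws of the statistic; smaller T counts as more extreme."""
    rng = np.random.default_rng(seed)
    B = B0
    while True:
        null = draw_null_T(B, rng)
        p_hat = (np.sum(null <= T_obs) + 1) / (B + 1)   # Eq. (5)
        B *= mult
        if p_hat > p0 or B > Bmax:
            return p_hat

# Toy check with a Uniform(0,1) null, where the true p-value equals T_obs
p_hat = fast_mc_pvalue(0.3, lambda B, rng: rng.uniform(size=B),
                       B0=1000, Bmax=10**4)
assert 0.2 < p_hat < 0.4
```

Most genes in a genome-wide scan have large p-values and exit after the first few iterations, so the expensive \(B_{max}\) samplings are spent only on the handful of genes that look significant.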