TS: a powerful truncated test to detect novel disease associated genes using publicly available gWAS summary data

Zhang, Jianjun; Guo, Xuan; Gonzales, Samantha; Yang, Jingjing; Wang, Xuexia

doi:10.1186/s12859-020-3511-0

Research Article
Open access
Published: 04 May 2020

TS: a powerful truncated test to detect novel disease associated genes using publicly available gWAS summary data

Jianjun Zhang¹,
Xuan Guo²,
Samantha Gonzales²,
Jingjing Yang³ &
…
Xuexia Wang ORCID: orcid.org/0000-0002-0613-2608¹

BMC Bioinformatics volume 21, Article number: 172 (2020) Cite this article

1967 Accesses
1 Citations
1 Altmetric
Metrics details

Abstract

Background

In the last decade, a large number of common variants underlying complex diseases have been identified through genome-wide association studies (GWASs). Summary data of the GWASs are freely and publicly available. The summary data is usually obtained through single marker analysis. Gene-based analysis offers a useful alternative and complement to single marker analysis. Results from gene level association tests can be more readily integrated with downstream functional and pathogenic investigations. Most existing gene-based methods fall into two categories: burden tests and quadratic tests. Burden tests are usually powerful when the directions of effects of causal variants are the same. However, they may suffer loss of statistical power when different directions of effects exist at the causal variants. The power of quadratic tests is not affected by the directions of effects but could be less powerful due to issues such as the large number of degree of freedoms. These drawbacks of existing gene based methods motivated us to develop a new powerful method to identify disease associated genes using existing GWAS summary data.

Methods and Results

In this paper, we propose a new truncated statistic method (TS) by utilizing a truncated method to find the genes that have a true contribution to the genetic association. Extensive simulation studies demonstrate that our proposed test outperforms other comparable tests. We applied TS and other comparable methods to the schizophrenia GWAS data and type 2 diabetes (T2D) GWAS meta-analysis summary data. TS identified more disease associated genes than comparable methods. Many of the significant genes identified by TS may have important mechanisms relevant to the associated traits. TS is implemented in C program TS, which is freely and publicly available online.

Conclusions

The proposed truncated statistic outperforms existing methods. It can be employed to detect novel traits associated genes using GWAS summary data.

Background

Even though genome-wide association studies (GWASs) have been remarkably successful in identifying a large number of genetic variants associated with complex traits and diseases, these identified variants can only explain a small to modest fraction of heritability [1]. Larger sample sizes and more powerful statistical tests are needed to boost power to identify novel genetic variants, especially for weakly associated variants with small effect sizes or low frequency variants. Due to various reasons, it is often difficult for researchers to obtain access to individual level data, and thus difficult to obtain a sufficient sample size to obtain reliable results. The increase in public availability of genome-wide association study (GWAS) summary statistics, e.g. minor allele frequency (MAF), estimated effect size, odds ratio, or p-values, for individual single nucleotide polymorphisms (SNPs) motivated us to develop novel powerful methods for analyzing GWAS summary data. Methods based on summary statistics can also be viewed as a complementary approach to the traditional single variant single trait association test.

Most popular gene based association tests (grouping SNPs together into a SNP set (e.g. a gene or surrounding of a gene) to test the joint effects of SNPs in the SNP set [2]) can often be viewed as a combination of summary statistics (e.g. Z statistics or p-values generated from GWAS). For example, the weighted sum statistic (WSS) [3] is a type of burden test statistic, which is used to jointly analyze a group of genetic variants in order to test for association in a considered region. The WSS can be viewed as a weighted sum of the summary statistic, where the summary statistic is the Z statistic. The sequence kernel association test (SKAT) [4] and the sum of squared score test (SSU) [5] are proposed to test for association between genetic variants and a single trait. Both tests can be viewed as the weighted sum of summary statistics, where the summary statistic is a score test. The SKAT-O statistic [6] is a linear combination of SKAT and the burden test. When a tuning parameter searching in a certain range, the SKAT-O can reach the maximized value of power. The burden test and SKAT can be considered as special cases of SKAT-O. The SKAT-O statistic can also be rewritten as a special combination of Z statistics [7]. In addition to the aforementioned gene based association tests, there are several other p-value based methods which are not based on Z statistics, such as the minimum p-value, a general gene-based p-value adaptive combination approach (GPA) [8] or the gene-based association test, which uses extended Simes procedure (GATES) [9].

Most of the existing gene-based methods can be viewed as either burden test methods or quadratic test methods. Burden test methods are usually powerful when the direction of effects of genetic variants in the considered region are the same. Quadratic test methods usually have reasonable power given a wide range of alternative hypotheses. Specifically, burden tests collapse the variants in a genomic region into a single burden variable by using weighted combination of variants, and then test the association between the trait and the single burden variable. Quadratic tests refer to tests with statistics of quadratic forms of the score vector. However, burden tests can suffer loss of statistical power when different directions of effects of causal variants exist, while quadratic tests could be less powerful due to other issues such as the large number of degree of freedoms. Thus, we aim to develop a new powerful method for further analyzing GWAS summary data. In the “Method” section of this paper, we propose a truncated statistic method (TS) to find approximate contributions of trait associated genes. As our method is based on the quadratic test method and uses a truncated statistic to find the most likely contributions of genetic variants, TS can overcome the shortcomings of both quadratic and burden tests.

To evaluate the performance of the proposed method, we conducted extensive simulation studies and real data analysis. We compared our method TS with five existing gene-based methods: a gene-based association test that uses an extended Simes procedure (GATES) [9], an adaptive sum of powered score tests (aSPU) [10], and three methods proposed by [11]: sum test (ST), squared sum test (S2T), and the adaptive test (AT). All methods are designed for single trait association studies. GATES adopts an extended Simes procedure to correct multiple testing issues while calculating the p-value quickly based on SNP summary statistics. The aSPU method estimates and selects the most powerful test among a class of so-called, sum of powered score (SPU) tests. ST is a type of burden test statistic [3], S2T is a type of SKAT statistic [4], and AT is equivalent to the SKAT-O statistic [6]. Simulation results demonstrate that our proposed method TS outperforms the five comparable methods. Results of our application to the schizophrenia GWAS summary data obtained from the Psychiatric Genomics Consortium (PGC), and fasting glucose GWAS meta-analysis summary data obtained from the UK Biobank component of the European DIAMANTE study, also indicate that our method performs better than existing methods.

Results

Simulations

We conducted extensive simulation studies to evaluate the performance of TS. We compared the type I error rates and power of TS with the five existing methods by following the simulation setting in [12]. Kwak et al. [10] have shown that the performance using any reference data from the same ancestry in estimating linkage disequilibrium (LD) among SNPs is mostly satisfactory with an estimated inflation factor close to 1. Therefore, the LD between SNPs in the simulation studies is estimated using the 1000 Genomes project [13].

Type I error

To evaluate the type I error rates of the proposed method, we simulate the test statistic Z from a multivariate normal distribution N(0,R), where R denotes the corresponding LD matrix of the gene EPB41 (erythrocyte membrane protein band 4.1), which is used in our simulation studies. Authors in [14] reported that gene EPB41 colocalizes with AMPA receptors at excitatory synapses and mediates the interaction of the AMPA receptors with the cytoskeleton. Existing study [15] has demonstrated brain region- and subunit-specific abnormalities in the expression of subunits of the AMPA subtype of glutamate receptors in schizophrenia. We consider four different significance levels: α=10⁻³,10⁻⁴,10⁻⁵, and 2.80×10⁻⁶. In the simulation studies, p-values of our method and aSPU are estimated by performing 10⁶ times permutations and type I error rates are calculated based on 10⁶ replicates. Table 1 shows the estimated type I error rates. From this table, we can see that the type I error rates of all of the methods are controlled well, which indicate that all the tests are valid.

Table 1 Estimated type I error rates for different test methods

Full size table

Power comparison

Using the same gene EPB41, we conducted extensive simulation studies to assess the power of the proposed method. We simulate 10⁴ summary statistics from N(A×△,R) where A denotes the signs of associations which are determined by the risk or protective effects of causal variants. △ denotes different settings and determines the effect sizes of causal variants. R is the corresponding LD matrix of the gene EPB41. We simulated three causal SNPs. The effects of the three causal SNPs are deternmined by randomly selecting three numbers either 1 or -1 for A. Then we set the effects of the rest of SNPs in the gene as 0. Table 2 shows the estimated power at 2.80×10⁻⁶ significance level under three combinations of A for three settings of △: a set of fixed values of △ and two randomly simulated △, one from uniform distribution, and the other from normal distribution. The proposed TS performs robustly across all scenarios and has the overall best performance compared to the other five test methods. As our proposed method TS belongs to quadratic tests and uses a truncated method to find the true contribution of genetic variants to the association, our method is robust to the direction of effects and weak effect sizes. Among the five comparable tests, even though S2T and GATES methods are unaffected by the directions of causal SNPs, these two methods will suffer a loss of power when the effect sizes are weak, which is indicated by the results of the first three settings (three different fixed values of △). Comparing the results of the last three settings from the same normal distribution of △, the burden test ST has the worst performance when there are differing directions of effects. Because the two adaptive methods, AT and aSPU, are considered to be a combination of the burden test and the quadratic test, these two methods will still be more or less affected by directions and other noises, despite their adaptive nature. The results of the first three settings from fixed effect size of △, and middle three settings from uniform distribution of △, verify this conclusion. TS maintains its power in all of the scenarios, indicating that our proposed test TS is robust to different directions of effects and weak effect sizes.

Table 2 Estimated power (%) under 2.8×10⁻⁶ significance level for different tests. Data are simulated from N(A×△,R). A has 3 nonzero elements with different signs

Full size table

Application to schizophrenia gWAS summary data set

We applied our method to a schizophrenia (SCZ) GWAS summary data, which was downloaded from the PGC website (see URL https://www.med.unc.edu/pgc/). The GWAS summary data was generated from a meta analysis with 36,989 cases and 113,075 controls, denoted as SCZ [16]. The GWAS summary data consists of the MAF, effect size estimate, odds ratio, and p-value for each SNP contained in a gene. We treat all SNPs that are located in or near a gene as a set to be analyzed for joint association and group all SNPs from 20 kb upstream of a gene to 20 kb downstream of a gene following [2].

In order to better illustrate our TS method and make fair comparisons, we first performed LD pruning for each SNP set by removing those SNPs that have pairwise LD r²>0.5 with other SNPs. We then removed genome-wide significant SNPs with p-values less than 5×10⁻⁸ and filtered out those SNPs with MAF <0.05. We set our genome-wide SNP set significance level as 2.8×10⁻⁶, which is the Bonferroni corrected significance level for the total number of tested SNP sets (17,415 genes). P-values of our method TS and aSPU are estimated by performing 10⁷ permutations, respectively.

We applied our method and the other comparable methods to the SCZ data of 150,064 individuals to identify SCZ-associated genes, then used genome-wide significant SNPs around these genes in the SCZ dataset as partial validation. Figure 1 shows the Venn diagram comparing the number of significant genes identified by our proposed method with the other comparable methods, aSPU, GATES, and Guo and Wu’s method-GW, which represents a super method where we aggregate the identified genes by S2T, ST, and AT. Specifically, TS identified 215 significant genes, aSPU identified 76 significant genes, GATES identified 73 genes, and GW identified 93 significant genes in total (Table 3). Among these 215 significant genes identified by TS, 43 genes (in total 50 unique genes containing significant SNPs in GWAS) contain the genome-wide significant SNPs (p-value <5×10⁻⁸) within 20 kb in the SCZ data, offering a significant validation of the identified gene. Clearly, our method identified more associated genes than the other three comparison methods.

Table 3 Comparison of the four methods using the PGC SCZ and UKB T2D GWAS summary data

Full size table

Our method performs better than other methods in terms of the number of significant genes identified. However, each test identified some unique genes missed by the other methods, highlighting that different tests may perform better in different scenarios. Overall, a total of 278 significant and unique genes were identified in the SCZ data across all tests. Supplementary Table S1 shows the significant genes, with associated p-value and minimum p-value, identified by TS, aSPU, GATES, or GW in SCZ, respectively.

Many of these significantly identified genes may have important mechanisms relevant to schizophrenia. Their biological implications in the etiology of schizophrenia are discussed as follows.

miRNA plays a role in gene expression regulation, thus the effect on miRNA expression naturally affects the expression of its target gene(s). Particularly, the presence of altered miRNA expression in brain and peripheral tissues has been implicated in the development of schizophrenia and other psychiatric disorders such as bipolar disorder (BD) and major depression (MD) [17–19]. Several miRNA encoding genes were identified as significant by all four methods (TS, aSPU, GATES, and GW), such as MIR4677, MIR6511A4, MIR4267, MIR4436B1, and MIR137HG. MIR137HG is of particular interest as it is instrumental in neurodevelopment and neuroplasticity [20]. Modulated expression of MIR137HG and MIR137 has been specifically shown to reduce grey matter content in key areas in the brain, which is characteristic of schizophrenia [21, 22]. However, TS identified several miRNA encoding genes that were missed by the other methods, such as MIR4256, MIR6756, MIR3652, MIR4700, and MIR624. Due to the growing evidence regarding miRNA involvement in psychiatric disorders, some of these genes may be relevant to schizophrenia etiology, and thus may warrant further research.

Voltage-dependent calcium ion channel dysfunction has a long history of being a plausible mechanism for schizophrenic pathology [23]. CACNA1C codes for a calcium ion channel subunit, and has been reported to be a target of miR-137 [19]. Interestingly, TS uniquely identified another calcium ion channel subunit encoding gene, CACNA2D3. Due to its similarity to CACNA1C and its high expression in the brain [24], CACNA2D3 may also prove to be a potent factor in schizophrenic development, despite not containing any genome-wide significant SNPs (most significant SNP p = 0.000586), which highlights our method’s robustness to weak effects.

Finally, TS uniquely identified GRM7 as significant. Glutamate receptor dysfunction has been long studied for its role in schizophrenia development [25, 26], and GRM7 in particular was recently investigated for its potential as a biomarker for risperidone response [27]. GRM7 was also previously identified by another proposed method, OWC [7], indicating TS is equivalently adapted for identifying weakly associated signals.

Application to t2D gWAS summary data sets

We also conducted a comprehensive analysis of fasting glucose GWAS summary data for type 2 diabetes (T2D) from the UK Biobank component of the European DIAMANTE study (denoted as UKB), which included over 440,000 individuals of European ancestry with 19,119 cases and 423,698 controls. The analysis for the dataset was restricted to HRC variants and was conducted using the UK Biobank Resource under Application Number 9161 (McCarthy). The GWAS summary data can be downloaded from http://www.type2diabetesgenetics.org/informational/data. Similar to the SCZ dataset, the UKB summary data also consists of MAF, effect size estimate, odds ratio, and p-value for each SNP in genes. We applied the same procedure used to filter and analyze the SCZ data on the UKB data. We used 0.05/17,495≈2.8×10⁻⁶ as the significance level and performed 10⁷ permutations for our proposed method TS and aSPU, respectively.

Figure 2 shows the Venn diagram comparing the number of significant genes identified by our proposed method compared with aSPU, GATES and GW. TS identified 217 significant genes, whereas aSPU, GATES, and GW identified 54, 40, and 57 significant genes, respectively (Table 3). Among these 217 significant genes identified by TS, 47 genes contain the genome-wide significant SNPs (p-value <5×10⁻⁸) within 20 kb in the UKB data. In total 78 genes in the T2D GWAS contain significant SNPs. That is, our TS method verified 60.26% of the entire genes containing significant SNPs. Based on the UKB analysis, we can further conclude that our TS method performed the best compared to the other methods in terms of the number of significant genes identified. The 233 significant and unique genes identified by all of the four methods are provided in Supplementary Table S2.

In the UKB dataset, ADIPOR2 was identified as significant by all methods except GW, which was just below significance (p = 2.6E −5). Adiponectin is an adipokine involved in metabolic control, and has insulin sensitizing effects. Expression of adiponectin is found to be down-regulated as insulin resistance is developing, and is also associated with anti-inflammatory and other protective effects [28]. ADIPOR2 codes for an adiponectin receptor, dysfunction of which affects adiponectin’s ability to exert its regulatory effects. Deletion of the gene entirely in mice models demonstrates ineffective ADIPOR2 function in development of type 2 diabetes [29], and recovery of receptor function has been considered for a potential target in preventing diabetic nephropathy in mice [30]. Also, variations in ADIPOR2 have previously been demonstrated to be associated with type 2 diabetes [31, 32]. Bountiful evidence of the effects of adiponectin and ADIPOR2 expression on T2D pathology serves to further highlight the validity of the truncated statistic method. T2D pathogenesis has an inflammatory component, which connects the inability of pancreatic beta cells to maintain sufficient insulin levels in the presence of developing resistance with various cellular stresses that either induces or is associated with an inflammatory respons [33]. In particular, activation of the JNK pathway is associated with a reduction in insulin gene expression and subsequent development of T2D [34]. Our method identified MAP4K5 as significant (most significant SNP p = 1.7E-8), which was missed by the other three methods. MAP4K5 encodes for a component in mitogen-activated protein kinase (MAPK) cascade, a signaling pathway which mediates activation of the JNK pathway via tumour necrosis factor α (TNF- α) [35, 36]. Additionally, a particular study identified key protective polymorphisms in MAP4K5 which may affect JNK activation and thus T2D development [37].

We further performed a verfication study for T2D using an independent T2D GWAS summary data obtained from the DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium (http://diagram-consortium.org/downloads.html). The stage 1 analyses of DIAGRAM comprised a total of 26,676 T2D cases and 132,532 control participants from 18 GWAS. In stage 1, in each study, logistic regression association analysis of T2D was performed with genotype dosage using an additive genetic model including covariates age, sex and principal components derived from the genetic data to account for population stratification. In fifteen of the repeated studies, researchers also adjusted for body mass index (BMI) [38]. The GWAS summary data consists of the MAF, effect size estimate, odds ratio, and p-value for each SNP contained in a gene. We applied the proposed TS method and five other methods to the DIAGRAM stage 1 GWAS summary data in order to verify T2D associated genes, which were identified using the GWAS summary data from the European DIAMANTE study (UKB). Table 4 shows that among 217 genes identified in UKB with the TS method, 32 genes were verified in the DIAGRAM data. The verification rate is 14.7%, which is greater than 8.7% of the GW method (S2T, ST, and AT combined) and 9.2% of the aSPU method.

Table 4 Verification study for UKB T2D using GWAS summary data obtained from DIAGRAM

Full size table

We analyzed genes uniquely identified by a method and showed how many of them included the significant SNPs in GWAS in Table 3. In the PGC SCZ data, the proposed TS method (8.4%) performs better than the GW (6.7%) and GATES (0%). Although the ratio 25% of sSPU is greater then 8.4% of TS, aSPU uniquely identified 4 genes and only 1 of them included the significate SNPs in GWAS. In the UKB T2D data, the proposed TS method (11%) performs better than the GW (0%) and aSPU (0%). The GATES uniquely identified 7 T2D associated genes, and only 1 of them included the significant SNPs in GWAS.

In summary, in both the SCZ and UKB summary data-sets, several genes which contain genome-wide significant SNPs were identified by all of the methods considered, including TS, further highlighting the validity of our proposed method. Additionally, TS identified genes in both datasets that were missed by the other methods, which demonstrates its ability to maintain power across a range of scenarios as well as overcome the limitations discussed previously.

Discussion

In this paper, we propose a novel gene-based genetic association test, the truncated statistic method (TS), where we use a truncated test to find the estimated contribution of genetic variants based on a quadratic statistic. The TS method can overcome the shortcomings of both burden tests and quadratic tests: different directions of effects for burden tests and the large number of other noises for quadratic tests. When our data satisfies some special conditions (see Appendix A), our method can reduce to the score test. If we focus on the summary data obtained from rare variants analysis, we can set the weight on each Z statistic of each SNP that has the beta distribution density function with pre-specified shape parameters 1 and 25 being in row evaluated at the corresponding sample MAF in the data, similar to the method of SKAT. Through simulation studies and real data analyses, we demonstrate that the proposed test TS, often outperforms other comparison methods such as ST, S2T, AT, aSPU, and GATES, which are some of the most popular methods based on summary data.

When we perform association tests based on a large number of genes in whole-genome sequencing studies, different disease models may exist: some of the models are likely to include many causal variants whose effects are in the same directions while other models may include few causal variants or causal variants whose effects are in different directions; or some of the models are likely to include many weakly associated variants while other models may include a few strongly associated variants. Because the true disease model is usually unknown, there is no uniformly most powerful test to detect single trait associated genes; an association test may perform well for one dataset, but not necessarily for another. For example, both schizophrenia and type 2 diabetes are representative of complex diseases with common but often weakly associated variants, in which many of these variants may be working in tandem to produce the disease. A robust, flexible method such as TS is needed to better elucidate these weakly associated variants so that their role in disease etiology can be understood further. Thus, the proposed TS can be an attractive tool for many situations, because it adapts to the underlying biological disease model by selecting the true contribution of genetic variants.

Our proposed method provides an alternative approach with extreme scalability and robust performance. This method only needs the publicly available GWAS summary statistics as input, without the need to access raw genotype and phenotype data. TS incorporates the inverse of the LD matrix. Thus, it can account for the LD information among SNPs. In order to guarantee that the LD matrix is invertible in real data analysis, we suggest to perform SNP pruning first [39] before applying TS in the GWAS summary data. We expect that researchers will be able to identify novel disease associated genes by employing the proposed method to analyze publicly available GWAS summary data and shed more light onto the underlying mechanisms of diseases. In our paper, our proposed method mainly focused on single trait GWAS summary data. However, it can be easily extended to multiple traits GWAS summary data. We have implemented the proposed method in an C code, which is publicly available online at github https://github.com/Jianjun-CN/c-code-for-TS.

Conclusions

We proposed a powerful gene based test TS. Simulation studies and real data analyses demontrated that TS outperformed existing methods. It can be employed to detect novel associated genes using GWAS summary data.

Method

Suppose we have GWAS summary data including MAF, estimated effect size, standard deviation, p-value, test statistic, for each SNP from a GWAS study. We aim to propose a novel gene based association test using the GWAS summary data. Assume M genetic variants in a considered region (a gene or a pathway) are associated with a phenotypic trait with effect sizes β=(β₁,…,β_M). Denote Z=(Z₁,…,Z_M) as a vector of test statistics used to test the genetic associations for the M genetic variants. The null hypothesis is H₀:β=0 and the alternative hypothesis is H₁:β≠0, which means that at least one of the elements of β does not equal to zero. For each pair of Z statistics, under the null hypothesis, we have:

$$ Cov(Z_{k},Z_{l})=r_{kl} $$

(1)

where r_kl denotes the LD between the k^th SNP and the l^th SNP [8]. We can assume that the p-values are calculated based on the Z-statistic, even though the p-values may be computed based on non-Z statistics. This assumption will not affect our conclusion about the proposed test. For p-values obtained from non-Z statistics (such as chi-square test, or t-test), we can transform the corresponding p-values into Z statistics by using Z_m=sign(β_m)Ψ⁻¹(1−p_m/2) for the m^th SNP when we know the p-value, where Ψ(·) is the cumulative distribution function (CDF) of a standard normal distribution and β_m denotes the corresponding estimated effect size contained in the GWAS summary results.

It is reasonable to assume that Z follows a multivariate normal distribution with mean 0 and correlation matrix R under the null hypothesis, where R denotes the estimated null correlation matrix computed based on the variant linkage disequilibrium (LD). We propose a test statistic that includes only Z statistics with a true contribution to the association of a genetic variant under the alternative hypothesis. Here, we adopt the truncated statistic method for combining statistical evidence, which has been suggested for such an analysis in literature [40, 41]. For a given τ>0, we let Z(τ) be the sub-vector of Z satisfying |Z_m|≥τ. That is, only the statistics in the vector Z with an absolute value greater than or equal to τ will be kept. Similarly, we let R(τ) be a sub-matrix of R representing the correlation matrix corresponding to Z(τ). Define the test statistic based on Z(τ) and R(τ) as:

$$ T_{\tau}=\boldsymbol{Z}(\tau)^{T}\boldsymbol{R}(\tau)^{-1}\boldsymbol{Z}(\tau) $$

(2)

When τ is large, T_τ can be undefined if |Z_m|<τ for all m. In this case, we define T_τ=0. Then, our test statistic is:

$$ S_{T}=\max_{\tau>0}T_{\tau} $$

(3)

Usually, we select the value τ from the second minimum value of Z to the maximum value of Z, because we have proven that the statistic T_τ containing more than one Z-statistic is greater than the statistic T_τ that contains only one Z-statistic (see Appendix A). That is, we only need to compute T_τ at most M−1 times to get the maximum value of S_T. The asymptotic distribution of S_T does not follow a standard distribution but can be evaluated by permutation methods.

We use the following steps to evaluate the distribution of S_T under the null hypothesis:

1)
Suppose we permute B times. For the b^th permutation, we obtain the Z statistics, Z^b, generated from the multivariate normal distribution N(0,R) where b=0 represents the original Z statistics.
2)
Scan all possible τ values, correspondingly we search from the second minimum value of Z^b to the maximum value of Z^b, then, we get the test statistic $S_{T}^{(b)}$ for the b^th step.
3)
Repeat B times permutations and then the p-value of S_T is given by:

$$P_{T}=\frac{\#\{b:S_{T}^{(b)}>S_{T}^{(0)}, \ b=1,\cdots,B\}}{B}$$

Comparison of methods

We evaluate the performance of the proposed method (TS) by comparing it with the five aforementioned methods: 1) adaptive sum of powered score tests (aSPU) [10]; 2) gene-based association test that uses extended Simes procedure (GATES) [9]; 3) three methods proposed by [11]: sum test (ST), squared sum test (S2T), and adaptive test (AT).

Please find the brief introduction about these methods and their notations as follows:

1)
Sum test (ST), $B=\sum _{m=1}^{M}Z_{m}$, a type of burden test statistic [3].
2)
Squared sum test (S2T), $Q=\sum _{m=1}^{M}Z_{m}^{2}$, a type of SKAT statistic [4] and equivalent to the weighted sum of squared score (SSU) test [42].
3)
Adaptive test (AT), T= minρ∈[0,1]P(Q_ρ), where Q_ρ=(1−ρ)Q+ρB² and P(Q_ρ) denotes the corresponding p-value.
1.
Adaptive sum of powered score tests (aSPU), aSPUs= minγ∈ΓP_SPUs(γ) where $SPUs(\gamma)=\sum _{m=1}^{M}Z_{m}^{\gamma }$.
4)
Gene-based association test that uses extended Simes procedure (GATES), $p_{GATES}=\min \big (\frac {m_{e}p_{(j)}}{m_{e(j)}}\big)$ where m_e is the effective number of independent p-values among the M SNPs, p_(j) is the j^th smallest p-value and m_e(j) is the effective number of independent p-values among the top j SNPs.

Because Z∼MVN(0,R), B=$1_{M}^{'}\boldsymbol {Z}\sim N(0,1_{M}^{'}\boldsymbol {R}1_{M})$, where 1_M denotes a column vector of length M where every element is equal to one. Therefore, $\frac {B^{2}}{1_{M}^{'}\boldsymbol {R}1_{M}}$ follows the χ₁ distribution. Under the null hypothesis, the statistic of the squared sum test Q=$\phantom {\dot {i}\!}\boldsymbol {Z}^{'}\boldsymbol {Z}$ has an asymptotic distribution as the weighted sum of independent χ₁ with weights being the eigenvalues of R. We compute the p-value of the T statistic of the adaptive test using an one-dimensional numerical integration, where we will search over ρ∈(0,0.01,0.04,0.09,0.16,0.25,0.5,1) following [43]. We can estimate the p-values of the three statistics proposed by [11] using the “sats" function in the “mkatr" package in R. We estimate the p-values of the adaptive sum of powered score tests (aSPU) using the Monte Carlo simulations, which can be obtained using the “aSPUs” function in the “aSPU” package in R. We use the “gates” function in the “COMBAT" package in R to calculate the p-value of the gene-based association test that uses extended Simes procedure (GATES).

Appendix

Assume that Z=(Z₁,⋯,Z_M) represents a vector of Z statistics following a multivariate normal distribution with mean 0 and covariance matrix R, where R is the LD matrix between SNPs. When we try to adapt the truncated statistic method for combining statistical evidence, we want to know whether the test statistic in equation (2) will increase or not when we add a new Z statistic.

If τ changes from a large value to a small value, the number of Z statistics added in the test statistic will increase as τ changes. Suppose we have selected n Z statistics, Z₁=(Z₁,⋯,Z_n)^T and have a corresponding covariance matrix Σ₁ for n<M. We can then write the test statistic as $\boldsymbol {Z}_{1}^{T}\Sigma _{1}^{-1}\boldsymbol {Z}_{1}$. Suppose we add one Z statistic, Z_n+1, to the test statistic, when τ gets smaller. The corresponding test Z statistic can be written as:

$$ (\boldsymbol{Z}_{1}^{T}, Z_{n+1}) \left(\begin{array}{ll} \Sigma_{1} & B \\ B^{T} & 1 \end{array} \right)^{-1} \left(\begin{array}{ll} \boldsymbol{Z}_{1} \\ Z_{n+1} \end{array} \right) $$

(4)

where B represents the correlation between the Z_n+1 statistic with the first n Z statistics. From the theory of block matrix inversion, we have:

$$ \left(\begin{aligned} \Sigma_{1} & B \\ B^{T} & 1 \end{aligned} \right)^{-1} = \left(\begin{aligned} \Sigma_{1}^{-1}+\Sigma_{1}^{-1}BB^{T}\Sigma_{1}^{-1}(1-B^{T}\Sigma_{1}^{-1}B)^{-1} &\quad -\Sigma_{1}^{-1}B(1-B^{T}\Sigma_{1}^{-1}B)^{-1} \\ -(1-B^{T}\Sigma_{1}^{-1}B)^{-1}B^{T}\Sigma_{1}^{-1} &\quad\quad(1-B^{T}\Sigma_{1}^{-1}B)^{-1} \end{aligned} \right) $$

(5)

Because $(1-B^{T}\Sigma _{1}^{-1}B)^{-1}$ is a constant (not a matrix), we define $C=(1-B^{T}\Sigma _{1}^{-1}B)^{-1}$ and rewrite the Eq. 5 as:

$$\begin{array}{*{20}l} \left(\begin{array}{ll} \Sigma_{1} & B \\ B^{T} & 1 \end{array} \right)^{-1} &= \left(\begin{array}{ll} \Sigma_{1}^{-1}+\Sigma_{1}^{-1}BB^{T}\Sigma_{1}^{-1}*C & -\Sigma_{1}^{-1}B*C \\ -C*B^{T}\Sigma_{1}^{-1} & C \end{array} \right)\\ &=C \left(\begin{array}{ll} \frac{1}{C}\Sigma_{1}^{-1}+\Sigma_{1}^{-1}BB^{T}\Sigma_{1}^{-1} & -\Sigma_{1}^{-1}B \\ -B^{T}\Sigma_{1}^{-1} & 1 \end{array} \right) \end{array} $$

Then, the truncated statistic (4) is equivalent to:

$$\begin{array}{*{20}l} &(\boldsymbol{Z}_{1}^{T}, Z_{n+1}) \left(\begin{array}{ll} \Sigma_{1} & B \\ B^{T} & 1 \end{array} \right)^{-1} \left(\begin{array}{l} \boldsymbol{Z}_{1} \\ Z_{n+1} \end{array} \right)\\ =&C (\boldsymbol{Z}_{1}^{T}, Z_{n+1}) \left(\begin{array}{ll} \frac{1}{C}\Sigma_{1}^{-1}+\Sigma_{1}^{-1}BB^{T}\Sigma_{1}^{-1} & -\Sigma_{1}^{-1}B \\ -B^{T}\Sigma_{1}^{-1} & 1 \end{array} \right) \left(\begin{array}{l} \boldsymbol{Z}_{1} \\ Z_{n+1} \end{array} \right)\\ =& \boldsymbol{Z}_{1}^{T}\Sigma_{1}^{-1}\boldsymbol{Z}_{1}+C\Big[ (\boldsymbol{Z}_{1}^{T}\Sigma_{1}^{-1}B)^{2}-2Z_{n+1}\boldsymbol{Z}_{1}^{T}\Sigma_{1}^{-1}B+Z_{n+1}^{2} \Big]\\ =&\boldsymbol{Z}_{1}^{T}\Sigma_{1}^{-1}\boldsymbol{Z}_{1}+C\Big(\boldsymbol{Z}_{1}^{T}\Sigma_{1}^{-1}B-Z_{n+1}\Big)^{2} \end{array} $$

If we want to know whether the test statistic will increase or not when we add a new Z statistic, we only need to check whether the condition $C=(1-B^{T}\Sigma _{1}^{-1}B)^{-1}$ is greater than zero or not. It means that when $B^{T}\Sigma _{1}^{-1}B<1$, the truncated statistics will increase, and will not increase otherwise. For the special case, n=1, we always have $B^{T}\Sigma _{1}^{-1}B<1$. It means the Z statistics contained in the truncated statistic will always be greater than 1. This is also in line with our thought, as we view our proposed method as a complementary approach to single trait single variant association tests. When these n+1 Z statistics are all independent, it means B=0. Thus, the truncated statistic degenerates to a summation of the square of these n+1 Z statistics, and our method becomes equivalent to the score test.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request. The proposed methods are implemented in an C code available at https://github.com/Jianjun-CN/c-code-for-TS

The original SCZ GWAS summary data can be downloaded at the PGC site https://www.med.unc.edu/pgc/. The original UKB GWAS summary data can be downloaded at http://www.type2diabetesgenetics.org/informational/data.

Abbreviations

GWASs:: Genome-wide association studies
TS:: Truncated statistic method
T2D:: Type 2 diabetes
MAF:: Minor allele frequency
SNPs:: Single nucleotide polymorphisms
WSS:: Weighted sum statistic
SKAT:: Sequence kernel association test
SSU:: Sum of squared score test
GATES:: Gene-based association test that uses extended Simes procedure
GPA:: Gene-based p-value adaptive combination approach
aSPU:: Adaptive sum of powered score tests
ST:: Sum test
S2T:: Squared sum test
AT:: Adaptive test
SPU:: Sum of powered score
PGC:: Psychiatric Genomics Consortium
LD:: Linkage disequilibrium
SCZ:: Schizophrenia
UKB:: UK Biobank component of the European DIAMANTE study

References

Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, Cho JH, et al.Finding the missing heritability of complex diseases. Nature. 2009; 461(7265):747.
Article CAS PubMed PubMed Central Google Scholar
Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, Lin X. Powerful SNP-set analysis for case-control genome-wide association studies. Am J Hum Genet. 2010; 86(6):929–42.
Article CAS PubMed PubMed Central Google Scholar
Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009; 5(2):e1000384.
Article PubMed PubMed Central CAS Google Scholar
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011; 89(1):82–93.
Article CAS PubMed PubMed Central Google Scholar
Pan W. Relationship between genomic distance-based regression and kernel machine regression for multi-marker association testing. Genet Epidemiol. 2011; 35(4):211–6.
PubMed PubMed Central Google Scholar
Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012; 13(4):762–75.
Article PubMed PubMed Central Google Scholar
Zhang J, Gonzales S, Liu J, Wang X. An optimally weighted combination method to detect novel disease associated genes using publicly available GWAS summary data. bioRxiv 709808. 2019.
Zhang J, Zhao Z, Guo X, Guo B, Wu B. Powerful statistical method to detect disease associated genes using publicly available GWAS summary data. bioRxiv 478321. 2018.
Li MX, Gui HS, Kwan JS, Sham PC. GATES: a rapid and powerful gene-based association test using extended Simes procedure. Am J Hum Genet. 2011; 88(3):283–93.
Article CAS PubMed PubMed Central Google Scholar
Kwak IY, Pan W. Adaptive gene-and pathway-trait association testing with GWAS summary statistics. Bioinformatics. 2015; 32(8):1178–84.
Article PubMed PubMed Central CAS Google Scholar
Guo B, Wu B. Statistical methods to detect novel genetic variants using publicly available GWAS summary data. Comput Biol Chem. 2018; 74:76–79.
Article CAS PubMed PubMed Central Google Scholar
Guo B, Wu B. Powerful and efficient SNP-set association tests across multiple phenotypes using GWAS summary data. Bioinformatics. 2018; 35(8):1366–72.
Article PubMed Central CAS Google Scholar
1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012; 491(7422):56.
Article CAS Google Scholar
Shen L, Liang F, Walensky LD, Huganir RL. Regulation of AMPA receptor GluR1 subunit surface expression by a 4.1 N-linked actin cytoskeletal association. J Neurosci. 2000; 20(21):7932–40.
Article CAS PubMed PubMed Central Google Scholar
Tucholski J, Simmons MS, Pinner AL, McMillan LD, Haroutunian V, Meador-Woodruff JH. N-linked glycosylation of cortical NMDA and kainate receptor subunits in schizophrenia. Neuroreport. 2013; 24(12):688.
Article CAS PubMed PubMed Central Google Scholar
Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014; 511(7510):421–7. https://doi.org/10.1038/nature13595.
Article CAS PubMed Central Google Scholar
Maffioletti E, Tardito D, Gennarelli M, Bocchio Chiavetto L. Micro spies from the brain to the periphery: New clues from studies on microRNAs in neuropsychiatric disorders. Front Cell Neurosci. 2014; 8. https://doi.org/10.3389/fncel.2014.00075.
Duan J, Shi J, Fiorentino A, Leites C, Chen X, Moy W, Gejman PV, et al.A Rare Functional Noncoding Variant at the GWAS-Implicated MIR137/MIR2682 Locus Might Confer Risk to Schizophrenia and Bipolar Disorder. Am J Hum Genet. 2014; 95(6):744–53.
Article CAS PubMed PubMed Central Google Scholar
Kwon E, Wang W, Tsai L-H. Validation of schizophrenia-associated genes CSMD1, C10orf26, CACNA1C and TCF4 as miR-137 targets. Mol Psychiatry. 2013; 18(1):11–12.
Article CAS PubMed Google Scholar
Mahmoudi E, Cairns MJ. MiR-137: An important player in neural development and neoplastic transformation. Mol Psychiatry. 2017; 22(1):44–55.
Article CAS PubMed Google Scholar
Wright C, Gupta CN, Chen J, Patel V, Calhoun VD, Ehrlich S, Turner JA, et al.Polymorphisms in MIR137HG and microRNA-137-regulated genes influence gray matter structure in schizophrenia. Transl Psychiatry. 2016; 6:e724.
Article CAS PubMed PubMed Central Google Scholar
Vita A, De Peri L, Deste G, Sacchetti E. Progressive loss of cortical gray matter in schizophrenia: A meta-analysis and meta-regression of longitudinal MRI studies. Transl Psychiatry. 2012; 2(11):e190.
Article CAS PubMed PubMed Central Google Scholar
Lidow MS. Calcium signaling dysfunction in schizophrenia: A unifying approach. Brain Res Rev. 2003; 43(1):70–84.
Article CAS PubMed Google Scholar
Sayers EW, Agarwala R, Bolton EE, Brister JR, Canese K, Clark K, Holmes JB, et al.Database resources of the national center for biotechnology information. Nucleic Acids Res. 2019; 47(Database issue):D23.
Article CAS PubMed Google Scholar
Ohtsuki T, Koga M, Ishiguro H, Horiuchi Y, Arai M, Niizato K, Arinami T. A polymorphism of the metabotropic glutamate receptor mGluR7 (GRM7) gene is associated with schizophrenia. Schizophr Res. 2008; 101(1):9–16.
Article PubMed Google Scholar
Li W, Ju K, Li Z, He K, Chen J, Wang Q, Shi Y. Significant association of GRM7 and GRM8 genes with schizophrenia and major depressive disorder in the Han Chinese population. Eur Neuropsychopharmacol J Eur Coll europsychopharmacol. 2016; 26(1):136–46.
Article CAS Google Scholar
Sacchetti E, Magri C, Minelli A, Valsecchi P, Traversa M, Calza S, Gennarelli M. The GRM7 gene, early response to risperidone, and schizophrenia: A genome-wide association study and a confirmatory pharmacogenetic analysis. Pharmacogenomics J. 2017; 17(2):146–54.
Article CAS PubMed Google Scholar
Lihn AS, Pedersen SB, Richelsen B. Adiponectin: action, regulation and association to insulin sensitivity. Obes Rev. 2005; 6(1):13–21.
Article CAS PubMed Google Scholar
Liu Y, Michael MD, Kash S, Bensch WR, Monia BP, Murray SF, Reifel-Miller A. Deficiency of Adiponectin Receptor 2 Reduces Diet-Induced Insulin Resistance but Promotes Type 2 Diabetes. Endocrinology. 2007; 148(2):683–92.
Article CAS PubMed Google Scholar
Park HS, Lim JH, Kim MY, Kim Y, Hong YA, Choi SR, Park CW, et al.Resveratrol increases AdipoR1 and AdipoR2 expression in type 2 diabetic nephropathy. J Transl Med. 2016; 14(1):176.
Article PubMed PubMed Central CAS Google Scholar
Vaxillaire M, Dechaume A, Vasseur-Delannoy V, Lahmidi S, Froguel P. Genetic analysis of ADIPOR1 and ADIPOR2 candidate polymorphisms for type 2 diabetes in the caucasian population. Diabetes. 2006; 55(3):856–61.
Article CAS PubMed Google Scholar
Damcott CM, Ott SH, Pollin TI, Reinhart LJ, Shuldiner A. Genetic variation in adiponectin receptor 1 and adiponectin receptor 2 is associated with type 2 diabetes in the old order amish. Diabetes. 2005; 54(7):2245–50.
Article CAS PubMed Google Scholar
Donath MY, Shoelson SE. Type 2 diabetes as an inflammatory disease. Nat Rev Immunol. 2011; 11(2):98–107.
Article CAS PubMed Google Scholar
Kaneto H, Xu G, Fujii N, Kim S, Bonner-Weir S, Weir GC. Involvement of c-Jun N-terminal Kinase in Oxidative Stress-mediated Suppression of Insulin Gene Expression. J Biol Chem. 2002; 277(33):30010–8.
Article CAS PubMed Google Scholar
Shi C-S, Leonardi A, Kyriakis J, Siebenlist U, Kehrl JH. TNF-Mediated Activation of the Stress-Activated Protein Kinase Pathway: TNF Receptor-Associated Factor 2 Recruits and Activates Germinal Center Kinase Related. J Immunol. 1999; 163(6):3279–85.
CAS PubMed Google Scholar
Shi C-S, Kehrl JH. Tumor Necrosis Factor (TNF)-induced Germinal Center Kinase-related (GCKR) and Stress-activated Protein Kinase (SAPK) Activation Depends upon the E2/E3 Complex Ubc13-Uev1A/TNF Receptor-associated Factor 2 (TRAF2). J Biol Chem. 2003; 278(17):15429–34.
Article CAS PubMed Google Scholar
Gu Y, Luo T, Yang J, Zhang D, Dai M, Jian W, Luo M, et al.The -822G/A polymorphism in the promoter region of the MAP4K5 gene is associated with reduced risk of type 2 diabetes in Chinese Hans from Shanghai. J Hum Genet. 2006; 51(7):605–10.
Article CAS PubMed Google Scholar
Scott RA, Scott LJ, Mägi R, Marullo L, Gaulton KJ, Kaakinen M, Jackson AU, et al.An expanded genome-wide association study of type 2 diabetes in Europeans. Diabetes. 2017; 66(11):2888–902.
Article CAS PubMed PubMed Central Google Scholar
Huang H, Chanda P, Alonso A, Bader JS, Arking DE. Gene-based tests of association. PLoS Genet. 2011; 7(7):e1002177.
Article CAS PubMed PubMed Central Google Scholar
Zaykin DV, Zhivotovsky LA, Westfall PH, Weir BS. Truncated product method for combining P-values. Genet Epidemiol. 2002; 22(2):170–85.
Article CAS PubMed Google Scholar
Li Y, Feng T, Zhu X. Detecting association with rare variants for common diseases using haplotype-based methods. Stat Interface. 2011; 4:273–284.
Article Google Scholar
Pan W. Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genet Epidemiol. 2009; 33(6):497–507.
Article PubMed PubMed Central Google Scholar
Wu B, Guan W, Pankow JS. On efficient and accurate calculation of significance p-values for sequence kernel association testing of variant set. Ann Hum Genet. 2016; 80(2):123–35.
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

A high-performance computing infrastructure at University of North Texas was used in obtaining results presented in this publication.

Funding

XW was supported by the University of North Texas Foundation which was contributed by Dr. Linda Truitt Creagh. The content is solely the responsibility of the authors and does not necessarily represent the views of the University of North Texas Foundation and Dr. Linda Truitt Creagh. JY was supported by the start-up funding from the Department of Human Genetics Emory University.

Author information

Authors and Affiliations

Department of Mathematics, University of North Texas, 1155 Union Circle #311430, Denton, 76203, TX, USA
Jianjun Zhang & Xuexia Wang
Department of Computer Science and Engineering, University of North Texas, Discovery Park 3940 N. Elm, Denton, 76203, TX, USA
Xuan Guo & Samantha Gonzales
Center for Computational and Quantitative Genetics, Department of Human Genetics School of Medicine, Emory University, Whitehead Biomedical Research Building, Suite 305K, Atlanta, 30322, GA, USA
Jingjing Yang

Authors

Jianjun Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Xuan Guo
View author publications
You can also search for this author in PubMed Google Scholar
Samantha Gonzales
View author publications
You can also search for this author in PubMed Google Scholar
Jingjing Yang
View author publications
You can also search for this author in PubMed Google Scholar
Xuexia Wang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

XW designed the study, supervised the project, developed the method, and wrote the manuscript. JZ developed and implemented the method, performed data analysis, and wrote the manuscript. XG implemented the method. SG generated and cleaned the real datasets. JY reviewed the study. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xuexia Wang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1

The TS and the other comparison methods are applied to the SCZ data.

Additional file 2

The TS and the other comparison methods are applied to the UKB data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Zhang, J., Guo, X., Gonzales, S. et al. TS: a powerful truncated test to detect novel disease associated genes using publicly available gWAS summary data. BMC Bioinformatics 21, 172 (2020). https://doi.org/10.1186/s12859-020-3511-0

Download citation

Received: 31 July 2019
Accepted: 23 April 2020
Published: 04 May 2020
DOI: https://doi.org/10.1186/s12859-020-3511-0

TS: a powerful truncated test to detect novel disease associated genes using publicly available gWAS summary data