GCORE-sib: An efficient gene-gene interaction tool for genome-wide association studies based on discordant sib pairs
- Pei-Yuan Sung^{1},
- Yi-Ting Wang^{1},
- Chao A. Hsiung^{2} and
- Ren-Hua Chung^{2}Email author
https://doi.org/10.1186/s12859-016-1145-z
© The Author(s). 2016
Received: 4 February 2016
Accepted: 1 July 2016
Published: 8 July 2016
Abstract
Background
A computationally efficient tool is required for a genome-wide gene-gene interaction analysis that tests an extremely large number of single-nucleotide polymorphism (SNP) interaction pairs in genome-wide association studies (GWAS). Current tools for GWAS interaction analysis are mainly developed for unrelated case-control samples. Relatively fewer tools for interaction analysis are available for complex disease studies with family-based design, and these tools tend to be computationally expensive.
Results
We developed a fast gene-gene interaction test, GCORE-sib, for discordant sib pairs and implemented the test into an efficient tool. We used simulations to demonstrate that the GCORE-sib has correct type I error rates and has comparable power to that of the regression-based interaction test. We also showed that the GCORE-sib can run more than 10 times faster than the regression-based test. Finally, the GCORE-sib was applied to a GWAS dataset with approximately 2,000 discordant sib pairs, and the GCORE-sib finished testing 19,368,078,382 pairs of SNPs within 6 days.
Conclusions
An efficient gene-gene interaction tool for discordant sib pairs was developed. It will be very useful for genome-wide gene-gene interaction analysis in GWAS using discordant sib pairs. The tool can be downloaded for free at http://gcore-sib.sourceforge.net.
Keywords
Genome-wide association study Gene-gene interaction Discordant sib pairBackground
Genome-wide association studies (GWAS) are a popular strategy to investigate the genetic structure of complex diseases by identifying the association between single nucleotide polymorphisms (SNPs) and complex disorders. GWAS analysis is mainly focused on testing the effects of individual SNPs on complex diseases; however, complex diseases are likely to result from the interactions among multiple genes. That is, the presence of specific alleles in different genes can significantly increase the risk of developing a particular disease, such as Alzheimer’s disease, type 1 diabetes, autism, and schizophrenia [1–4]. In fact, most of the significant SNPs identified by GWAS can only explain a small proportion of the heritability of a disease. The missing heritability may be explained by gene-gene interactions [5]. Hence, the development of statistical gene-gene interaction tests based on GWAS has become important.
A computationally efficient test is required for a genome-wide interaction analysis that tests an extremely large number of SNP-SNP interaction pairs in GWAS (e.g., approximately 5 × 10^{11} interaction tests for a GWAS with 1 million SNPs). Several approaches, which can finish genome-wide interaction tests in a reasonable time while still maintaining statistical power, have been developed for GWAS with unrelated case-control samples. Some examples for these approaches include SNPHarvester [6], SNPRuler [7], and BOOST [8]. These approaches typically employ a two-stage analysis strategy; in the first stage, a rapid algorithm is used to identify a promising subset of SNPs with potential interaction effects, and in the second stage, a commonly used test such as the test based on logistic regression is used to identify pairwise interactions from the subset of SNPs.
Current interaction tests for family-based studies are computationally intensive, which prevent the applications of the tests to genome-wide interaction analysis. For example, MDR-PDT [9] and PGMDR [10] are extended from the machine learning-based Multifactor Dimensionality Reduction (MDR) test [11], which involves intensive calculations such as cross-validations and permutations. Regression-based tests such as conditional logistic regression (CLR) and generalized estimating equations (GEE) [12] can also be used for testing interactions [13]; however, iterative algorithms such as the Newton-Raphson method are required to estimate the parameters. As many family-based GWAS have been conducted [14–17], it becomes important to develop a computationally efficient interaction test for family-based GWAS.
To overcome the computational challenges in current family-based interaction tests, here we developed an efficient gene-gene interaction test for discordant sib pairs (DSPs), the GCORE-sib, which takes into consideration the correlations in DSPs, and implemented the test into an efficient tool for family-based interaction analysis. The GCORE-sib is extended from the fast epistasis statistic implemented in PLINK [18], which is an odds ratio-based interaction test for case-control studies [19]. The log odds ratios, which measure the correlations between two SNPs, are first calculated for affected and unaffected siblings, and the difference in the log odds ratios is compared in the GCORE-sib statistic. Variance and covariance for the statistic were calculated based on appropriate theoretical models, and the distribution of the statistic was assumed to follow a standard normal distribution. Therefore, the statistic and its p-value were rapidly calculated. We used simulation studies to evaluate the type I error rates for the GCORE-sib test, and to compare the power of the test with that of GEE and MDR-PDT. The GCORE-sib software was implemented with POSIX threads (Pthreads), which allow for parallel computing of the SNP pairs. We compared the performance in terms of run time among the GCORE-sib, GEE, and MDR-PDT. Finally, a GWAS dataset was used to evaluate the run time of the GCORE-sib in the genome-wide scale.
Implementation
The GCORE-sib statistic
Counts of genotypes in the affected sibs in the k discordant sib pairs
AA | Aa | aa | |
---|---|---|---|
BB | n _{11} | n _{12} | n _{13} |
Bb | n _{21} | n _{22} | n _{23} |
bb | n _{31} | n _{32} | n _{33} |
Counts of genotypes in the unaffected sibs in the k discordant sib pairs
AA | Aa | aa | |
---|---|---|---|
BB | R _{11} − n _{11} | R _{12} − n _{12} | R _{13} − n _{13} |
Bb | R _{21} − n _{21} | R _{22} − n _{22} | R _{23} − n _{23} |
bb | R _{31} − n _{31} | R _{32} − n _{32} | R _{33} − n _{33} |
Counts of alleles in the affected sibs
Case | ||
---|---|---|
A | a | |
B | 4n _{11} + 2n _{12} + 2n _{21} + n _{22} | 4n _{13} + 2n _{12} + 2n _{23} + n _{22} |
b | 4n _{31} + 2n _{32} + 2n _{21} + n _{22} | 4n _{33} + 2n _{32} + 2n _{23} + n _{22} |
Counts of alleles in the unaffected sibs
Control | ||
---|---|---|
A | a | |
B | \( \begin{array}{l}4\left({R}_{11}-{n}_{11}\right)+2\left({R}_{12}-{n}_{12}\right)\\ {}+2\left({R}_{21}-{n}_{21}\right)+\left({R}_{22}-{n}_{22}\right)\end{array} \) | \( \begin{array}{l}4\left({R}_{13}-{n}_{13}\right)+2\left({R}_{12}-{n}_{12}\right)\\ {}+2\left({R}_{23}-{n}_{23}\right)+\left({R}_{22}-{n}_{22}\right)\end{array} \) |
b | \( \begin{array}{l}4\left({R}_{31}-{n}_{31}\right)+2\left({R}_{32}-{n}_{32}\right)\\ {}+2\left({R}_{21}-{n}_{21}\right)+\left({R}_{22}-{n}_{22}\right)\end{array} \) | \( \begin{array}{l}4\left({R}_{33}-{n}_{33}\right)+2\left({R}_{32}-{n}_{32}\right)\\ {}+2\left({R}_{23}-{n}_{23}\right)+\left({R}_{22}-{n}_{22}\right)\end{array} \) |
The calculation of \( \widehat{Var}\left( log\left({\widehat{OR}}_{case}\right)\right) \) is also shown in Additional file 1.
Simulations
For the type I error simulations, we first simulated no interaction effects and no main effects for two SNPs in the two genes. Different minor allele frequencies (MAFs) at the two SNPs (i.e., (0.2, 0.2; 0.3, 0.15)) and different numbers of DSPs (i.e., 250, 500, and 1000) were considered. We then considered the situation where main effects were present for the two SNPs under different levels of disease prevalence (i.e., 1 %, 5 %, and 10 %). We simulated one scenario where only one SNP had main effect (i.e., λ _{1} = log (2), λ _{2} = 0) and three scenarios where both SNPs had main effects (i.e., λ _{1} = log (1.3), λ _{2} = log (1.3); λ _{1} = log (1.5), λ _{2} = log (1.5); λ _{1} = log (2), λ _{2} = log (2)). In addition, to investigate whether the GCORE-sib is robust to the violation of the assumption of LE, we simulated two SNPs in different levels of LD (i.e., LD measures r ^{2} = 0.3, 0.5, and 0.8). All type I error rates in these scenarios were calculated based on 5,000 replicates of samples. Two significance levels (i.e., 0.05 and 0.01) were considered for the type I error calculations.
Parameter settings for the power simulations
Scenario | Parameters (NF, DM, MAF, ES)^{a} |
---|---|
Scen1 | NF: 250,500,1000; DM: Additive; MAF: (0.2,0.2); ES: log(2) |
Scen2 | NF: 500; DM: Additive, Dominant, Recessive; MAF: (0.2,0.2); ES: log(2) |
Scen3 | NF: 500; DM: Additive; MAF: (0.2,0.2),(0.3,0.15); ES: log(2) |
Scen4 | NF: 500; DM: Additive; MAF: (0.2,0.2); ES: log(2), log(2.25) |
Parallel computing
Although the calculation of the GCORE-sib statistic is fast for a SNP pair, performing a genome-wide interaction analysis by testing tens of billions of tests can still be very time consuming for the GCORE-sib. The GCORE-sib software was implemented with POSIX Threads (Pthreads) so that the calculations for SNP pairs can be performed in parallel on a multi-core computer. Moreover, the calculations can be performed for SNPs between a chromosome pair specified by the user. Therefore, the calculations can be distributed across different computers.
Performance comparison
We compared the performance of the GCORE-sib with GEE and MDR-PDT. Currently, GEE is mostly implemented in R, which is not comparable to the GCORE-sib and MDR-PDT implemented in C++. Alternatively, we used the interaction test based on a regular logistic regression implemented in PLINK. The logistic regression based on GEE usually first runs the regular logistic regression to obtain initial parameter estimates assuming all samples are independent, and more iterations are performed for the overall parameter estimates including the correlation matrix. Therefore, the logistic regression with GEE is expected to run longer than the regular logistic regression. A total of 1,000 DSPs were simulated using SeqSIMLA, and the performance was compared based on 1,000, 5,000, and 10,000 pairs of SNPs on a computer with Xeon 2.0 GHz CPU and 96 GB of RAM.
To evaluate the performance of the GCORE-sib for a genome-wide interaction analysis, we downloaded the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset for hypertension. The dataset consists of 1,952 unrelated cases for hypertension and 2,938 unrelated controls, and there are 457,710 SNPs in the data. We randomly matched cases and controls, which resulted in 1,952 case-control pairs. The case-control pairs were analyzed as DSPs in the GCORE-sib. Because the WTCCC study is not a family-based study, our analysis was used only for the performance evaluation for the GCORE-sib. We also downloaded the gene annotation file from the UCSC genome browser website. All possible pairs of SNP interactions between genes were tested in parallel with 20 cores by the GCORE-sib on the aforementioned computer.
Results
Type I error rates
Type I error rate simulations for two SNPs with MAFs of (0.2 and 0.2; 0.3 and 0.15) and with different numbers of DSPs at the significant levels of 0.05 and 0.01
MAF/number of DSPs | α = 0.05 | α = 0.01 |
---|---|---|
MAF 0.2, 0.2 | ||
250 DSPs | 0.0486 | 0.0092 |
500 DSPs | 0.0494 | 0.0086 |
1000 DSPs | 0.0500 | 0.0106 |
MAF 0.3, 0.15 | ||
250 DSPs | 0.0502 | 0.0114 |
500 DSPs | 0.0484 | 0.0106 |
1000 DSPs | 0.0510 | 0.0124 |
Type I error rates for different disease models, main effects, and disease prevalences
Disease prevalence | ||||
---|---|---|---|---|
Effect size | Disease model | 0.01 | 0.05 | 0.1 |
λ _{1} = log (2), λ _{2} = 0 | Additive | 0.0484 | 0.0534 | 0.0488 |
Dominant | 0.0544 | 0.0502 | 0.0560 | |
Recessive | 0.0494 | 0.0440 | 0.0454 | |
λ _{1} = log (1.3), λ _{2} = log (1.3) | Additive | 0.0476 | 0.0530 | 0.0458 |
Dominant | 0.0524 | 0.0570 ^{ a } | 0.0530 | |
Recessive | 0.0474 | 0.0532 | 0.0522 | |
λ _{1} = log (1.5), λ _{2} = log (1.5) | Additive | 0.0534 | 0.0500 | 0.0508 |
Dominant | 0.0674 | 0.0674 | 0.0612 | |
Recessive | 0.0518 | 0.0520 | 0.0498 | |
λ _{1} = log (2), λ _{2} = log (2) | Additive | 0.0768 | 0.0658 | 0.0602 |
Dominant | 0.1670 | 0.1268 | 0.0764 | |
Recessive | 0.0448 | 0.0514 | 0.0520 |
Power
Performance comparison
Performance comparison for the GCORE-sib, PLINK, and MDR-PDT
SNP pairs | GCORE-sib | PLINK | MDR-PDT |
---|---|---|---|
1,000 | 0.2 s | 3 s | 43 m 57 s |
5,000 | 1.4 s | 16 s | 3 h 29 m 32 s |
10,000 | 2 s | 26 s | 6 h 52 m 36 s |
Discussion
We developed an odds ratio-based gene-gene interaction test considering correlations in discordant sib pairs. The hypergeometric distribution for genotype counts was assumed in each discordant sib pair. Then the estimates of correlation within families can be calculated based on the model assumption. We demonstrated that the GCORE-sib showed appropriate type I error rates under the null hypothesis of no interaction, even in the presence of LD between SNPs, or when only one SNP showed main effect. Sharing the same property as the odds ratio-based test for case-control studies, the GCORE-sib maintains proper type I error rates when only one SNP has main effects. When the two SNPs both have strong main effects, type I error rates could be inflated for most of the interaction tests [23]. Therefore, in practice, significant results from interaction tests for two SNPs should be interpreted along with tests for main effects for the same two SNPs.
We also compared the power of the GCORE-sib with two alternative family-based interaction tests, GEE and MDR-PDT. Our simulation results suggested that the GCORE-sib and GEE had similar power, while MDR-PDT had the lowest power under most of the simulation scenarios. Under the assumption of HWE, alleles are independent in Tables 3 and 4, and the GCORE-sib tests the allelic correlations between the two SNPs based on the two tables. Hence, an additive model is implicitly assumed in the GCORE-sib. Moreover, genotypes for GEE were also coded based on an additive model in our simulations. As most of our power simulations were conducted under the additive model, it was not surprising to observe higher power for the GCORE-sib and GEE than the machine learning-based MDR-PDT.
GEE and MDR-PDT are not suitable for genome-wide interaction tests, due to the high computational burden. In contrast, the GCORE-sib is demonstrated to be able to perform a rapid test for each pair of SNP-SNP interactions. However, GEE is flexible of incorporating covariates in the test. Therefore, the GCORE-sib can be used as a complementary tool to GEE for analyzing DSPs. That is, the GCORE-sib can be used as a screening tool to identify candidate SNP pairs with interactions. Then GEE can be used to test for interactions for the candidate SNP pairs by incorporating appropriate covariates.
Conclusions
In conclusion, we have developed an efficient gene-gene interaction test for DSPs, which is suitable for genome-wide interaction analysis for SNP pairs in DSPs. We have implemented the method in C++, which can be downloaded for free at http://gcore-sib.sourceforge.net.
Declarations
Acknowledgements
We are grateful to the National Center for High-performance Computing in Taiwan for computer time and facilities. This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk.
Funding
This work was funded by grants from the National Health Research Institutes (PH-104-PP-10) and Ministry of Science and Technology (NSC 102-2221-E-400-001-MY2) in Taiwan.
Availability of data and material
The simulation scripts used in this study were deposited in the public repository GESDB (http://gesdb.nhri.org.tw) [24]. The WTCCC GWAS dataset is available from http://www.wtccc.org.uk.
Authors’ contributions
PYS, CAH and RHC developed the method and designed the simulation studies. PYS performed the simulation studies. PYS and YTW analyzed the data. PYS and RHC wrote the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Ethics approval and consent to participate
The analysis of WTCCC data in the present study was approved by the Institutional Review Board (IRB) of the National Health Research Institutes in Taiwan (IRB protocol # EC1020503-E). Written informed consent for participation in the WTCCC study was obtained from all participants.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Ma DQ, Whitehead PL, Menold MM, Martin ER, Ashley-Koch AE, Mei H, Ritchie MD, Delong GR, Abramson RK, Wright HH, et al. Identification of significant association and gene-gene interaction of GABA receptor subunit genes in autism. Am J Hum Genet. 2005;77(3):377–88.View ArticlePubMedPubMed CentralGoogle Scholar
- Ebbert MT, Ridge PG, Wilson AR, Sharp AR, Bailey M, Norton MC, Tschanz JT, Munger RG, Corcoran CD, Kauwe JS. Population-based analysis of Alzheimer’s disease risk alleles implicates genetic interactions. Biol Psychiatry. 2014;75(9):732–7.View ArticlePubMedGoogle Scholar
- Bjornvold M, Undlien DE, Joner G, Dahl-Jorgensen K, Njolstad PR, Akselsen HE, Gervin K, Ronningen KS, Stene LC. Joint effects of HLA, INS, PTPN22 and CTLA4 genes on the risk of type 1 diabetes. Diabetologia. 2008;51(4):589–96.View ArticlePubMedPubMed CentralGoogle Scholar
- Gasso P, Mas S, Alvarez S, Trias G, Bioque M, Oliveira C, Bernardo M, Lafuente A. Xenobiotic metabolizing and transporter genes: gene-gene interactions in schizophrenia and related disorders. Pharmacogenomics. 2010;11(12):1725–31.View ArticlePubMedGoogle Scholar
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53.View ArticlePubMedPubMed CentralGoogle Scholar
- Yang C, He Z, Wan X, Yang Q, Xue H, Yu W. SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics. 2009;25(4):504–11.View ArticlePubMedGoogle Scholar
- Wan X, Yang C, Yang Q, Xue H, Tang NL, Yu W. Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics. 2010;26(1):30–7.View ArticlePubMedGoogle Scholar
- Wan X, Yang C, Yang Q, Xue H, Fan X, Tang NL, Yu W. BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am J Hum Genet. 2010;87(3):325–40.View ArticlePubMedPubMed CentralGoogle Scholar
- Martin ER, Ritchie MD, Hahn L, Kang S, Moore JH. A novel method to identify gene-gene effects in nuclear families: the MDR-PDT. Genet Epidemiol. 2006;30(2):111–23.View ArticlePubMedGoogle Scholar
- Chen GB, Zhu J, Lou XY. A faster pedigree-based generalized multifactor dimensionality reduction method for detecting gene-gene interactions. Statistics and its interface. 2011;4(3):295–304.View ArticlePubMedPubMed CentralGoogle Scholar
- Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001;69(1):138–47.View ArticlePubMedPubMed CentralGoogle Scholar
- Zeger SL, Liang KY, Albert PS. Models for longitudinal data: a generalized estimating equation approach. Biometrics. 1988;44(4):1049–60.View ArticlePubMedGoogle Scholar
- Hancock DB, Martin ER, Li YJ, Scott WK. Methods for interaction analyses using family-based case-control data: conditional logistic regression versus generalized estimating equations. Genet Epidemiol. 2007;31(8):883–93.View ArticlePubMedGoogle Scholar
- Anney R, Klei L, Pinto D, Regan R, Conroy J, Magalhaes TR, Correia C, Abrahams BS, Sykes N, Pagnamenta AT, et al. A genome-wide scan for common alleles affecting risk for autism. Hum Mol Genet. 2010;19(20):4072–82.View ArticlePubMedPubMed CentralGoogle Scholar
- Wijsman EM, Pankratz ND, Choi Y, Rothstein JH, Faber KM, Cheng R, Lee JH, Bird TD, Bennett DA, Diaz-Arrastia R, et al. Genome-wide association of familial late-onset Alzheimer’s disease replicates BIN1 and CLU and nominates CUGBP2 in interaction with APOE. PLoS Genet. 2011;7(2):e1001308.View ArticlePubMedPubMed CentralGoogle Scholar
- Mattheisen M, Samuels JF, Wang Y, Greenberg BD, Fyer AJ, McCracken JT, Geller DA, Murphy DL, Knowles JA, Grados MA, et al. Genome-wide association study in obsessive-compulsive disorder: results from the OCGAS. Mol Psychiatry. 2015;20(3):337–44.View ArticlePubMedGoogle Scholar
- International Multiple Sclerosis Genetics C, Hafler DA, Compston A, Sawcer S, Lander ES, Daly MJ, De Jager PL, de Bakker PI, Gabriel SB, Mirel DB, et al. Risk alleles for multiple sclerosis identified by a genomewide study. N Engl J Med. 2007;357(9):851–62.View ArticleGoogle Scholar
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75.View ArticlePubMedPubMed CentralGoogle Scholar
- Ueki M, Cordell HJ. Improved statistics for genome-wide interaction analysis. PLoS Genet. 2012;8(4):e1002625.View ArticlePubMedPubMed CentralGoogle Scholar
- Spielman RS, Ewens WJ. A sibship test for linkage in the presence of association: the sib transmission/disequilibrium test. Am J Hum Genet. 1998;62(2):450–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Chung RH, Shih CC. SeqSIMLA: a sequence and phenotype simulation tool for complex disease studies. BMC bioinformatics. 2013;14:199.View ArticlePubMedPubMed CentralGoogle Scholar
- Su Z, Marchini J, Donnelly P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics. 2011;27(16):2304–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Hu JK, Wang X, Wang P. Testing gene-gene interactions in genome wide association studies. Genet Epidemiol. 2014;38(2):123–34.View ArticlePubMedPubMed CentralGoogle Scholar
- Yao PJ, Chung RH. GESDB: a platform of simulation resources for genetic epidemiology studies. Database: the journal of biological databases and curation 2016;2016. doi:https://doi.org/10.1093/database/baw082.