GCORE-sib: An efficient gene-gene interaction tool for genome-wide association studies based on discordant sib pairs

Background A computationally efficient tool is required for a genome-wide gene-gene interaction analysis that tests an extremely large number of single-nucleotide polymorphism (SNP) interaction pairs in genome-wide association studies (GWAS). Current tools for GWAS interaction analysis are mainly developed for unrelated case-control samples. Relatively fewer tools for interaction analysis are available for complex disease studies with family-based design, and these tools tend to be computationally expensive. Results We developed a fast gene-gene interaction test, GCORE-sib, for discordant sib pairs and implemented the test into an efficient tool. We used simulations to demonstrate that the GCORE-sib has correct type I error rates and has comparable power to that of the regression-based interaction test. We also showed that the GCORE-sib can run more than 10 times faster than the regression-based test. Finally, the GCORE-sib was applied to a GWAS dataset with approximately 2,000 discordant sib pairs, and the GCORE-sib finished testing 19,368,078,382 pairs of SNPs within 6 days. Conclusions An efficient gene-gene interaction tool for discordant sib pairs was developed. It will be very useful for genome-wide gene-gene interaction analysis in GWAS using discordant sib pairs. The tool can be downloaded for free at http://gcore-sib.sourceforge.net. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1145-z) contains supplementary material, which is available to authorized users.


Background
Genome-wide association studies (GWAS) are a popular strategy to investigate the genetic structure of complex diseases by identifying the association between single nucleotide polymorphisms (SNPs) and complex disorders. GWAS analysis is mainly focused on testing the effects of individual SNPs on complex diseases; however, complex diseases are likely to result from the interactions among multiple genes. That is, the presence of specific alleles in different genes can significantly increase the risk of developing a particular disease, such as Alzheimer's disease, type 1 diabetes, autism, and schizophrenia [1][2][3][4]. In fact, most of the significant SNPs identified by GWAS can only explain a small proportion of the heritability of a disease. The missing heritability may be explained by gene-gene interactions [5]. Hence, the development of statistical gene-gene interaction tests based on GWAS has become important.
A computationally efficient test is required for a genome-wide interaction analysis that tests an extremely large number of SNP-SNP interaction pairs in GWAS (e.g., approximately 5 × 10 11 interaction tests for a GWAS with 1 million SNPs). Several approaches, which can finish genome-wide interaction tests in a reasonable time while still maintaining statistical power, have been developed for GWAS with unrelated case-control samples. Some examples for these approaches include SNPHarvester [6], SNPRuler [7], and BOOST [8]. These approaches typically employ a two-stage analysis strategy; in the first stage, a rapid algorithm is used to identify a promising subset of SNPs with potential interaction effects, and in the second stage, a commonly used test such as the test based on logistic regression is used to identify pairwise interactions from the subset of SNPs.
Current interaction tests for family-based studies are computationally intensive, which prevent the applications of the tests to genome-wide interaction analysis. For example, MDR-PDT [9] and PGMDR [10] are extended from the machine learning-based Multifactor Dimensionality Reduction (MDR) test [11], which involves intensive calculations such as cross-validations and permutations. Regression-based tests such as conditional logistic regression (CLR) and generalized estimating equations (GEE) [12] can also be used for testing interactions [13]; however, iterative algorithms such as the Newton-Raphson method are required to estimate the parameters. As many familybased GWAS have been conducted [14][15][16][17], it becomes important to develop a computationally efficient interaction test for family-based GWAS.
To overcome the computational challenges in current family-based interaction tests, here we developed an efficient gene-gene interaction test for discordant sib pairs (DSPs), the GCORE-sib, which takes into consideration the correlations in DSPs, and implemented the test into an efficient tool for family-based interaction analysis. The GCORE-sib is extended from the fast epistasis statistic implemented in PLINK [18], which is an odds ratio-based interaction test for case-control studies [19]. The log odds ratios, which measure the correlations between two SNPs, are first calculated for affected and unaffected siblings, and the difference in the log odds ratios is compared in the GCORE-sib statistic. Variance and covariance for the statistic were calculated based on appropriate theoretical models, and the distribution of the statistic was assumed to follow a standard normal distribution. Therefore, the statistic and its p-value were rapidly calculated. We used simulation studies to evaluate the type I error rates for the GCORE-sib test, and to compare the power of the test with that of GEE and MDR-PDT. The GCORE-sib software was implemented with POSIX threads (Pthreads), which allow for parallel computing of the SNP pairs. We compared the performance in terms of run time among the GCORE-sib, GEE, and MDR-PDT. Finally, a GWAS dataset was used to evaluate the run time of the GCOREsib in the genome-wide scale.

Implementation
The GCORE-sib statistic The GCORE-sib statistic was developed from the PLINK interaction statistic [18] and is calculated based on the difference in log odds ratios between cases and controls in families. In the test, we considered discordant sib pair in each nuclear family (DSP; one affected and one unaffected sib). Affected and unaffected sibs are defined as cases and controls, respectively. Assume we have k independent discordant sib pairs. Let n ij be the number of affected sibs with genotypes i and j at the two SNPs M 1 and M 2 , where i = 1, 2, 3 (for genotypes AA, Aa, and aa, respectively) and j = 1, 2, 3 (for genotypes BB, Bb, and bb, respectively). Suppose that in the k discordant sib pairs, R ij is the number of sibs with genotypes i and j at the two SNPs (including the affected and unaffected sibs). Therefore, we can construct the genotype tables for the affected and unaffected sibs, as shown in Tables 1 and 2. Each cell count in Table 1 represents the total number of affected sibs with a specific genotype in all k discordant sib pairs. That is, n ij = ∑ s = 1 k n ij s for i = 1, 2, 3 and j = 1, 2, 3, where n ij s represents the number of affected sibs with genotypes i and j in s th discordant sib pair. Similar to S-TDT [20], we assumed that the random variables (N 11 s , N 12 s , …, N 33 s ) with the observed values of (n 11 s , n 12 s , …, n 33 s ) follow a multivariate hypergeometric distribution.
We followed the same procedure in PLINK to collapse the pair of 3 × 3 genotype tables into a pair of 2 × 2 tables for cases and controls as shown in Tables 3 and 4.  According to Tables 3 and 4, the odds ratios between SNPs M 1 and M 2 for cases and controls can be calculated as: Similar to the PLINK approach, under the assumptions of Hardy-Weinberg Equilibrium (HWE) and Linkage Equilibrium (LE) for the two SNPs, the GCORE-sib statistic for the gene-gene interaction test can be constructed based on a Z-score as follows. whereÔR case andÔR control are the sample estimators for OR case and OR control , respectively. The null hypothesis of the GCORE-sib test is that the two SNPs tested do not have interaction effects on the disease. Due to the correlation of genotypes between discordant sibs, the covariance between the two log odds ratios needs to be considered. Based on the multivariate hypergeometric distribution assumption, we can calculate the variance and covariance for the two odds ratios. The detailed derivation is shown in Additional file 1. Based on the derivation, the covariance is calculated as follows Therefore, the GCORE-sib statistic can be written as The calculation ofV ar logÔR case À Á À Á is also shown in Additional file 1.

Simulations
We used the Sequence and phenotype Simulator, Seq-SIMLA [21], to evaluate the type I error rates for the GCORE-sib and to compare the power of the GCORE-sib with other methods under different scenarios. SeqSIMLA requires of a population of sequences generated by other programs. Therefore, we downloaded the haplotypes for the Han Chinese population (CHB) in the HapMap3 project as a reference panel. Then we used the HAPGEN version 2 (HAPGEN2) [22] to produce simulated haplotypes based on the reference panel. HAPGEN2 can simulate haplotypes with similar LD structures and allele frequencies to that of the reference panel. We randomly selected two genes that were not linked as the simulated region and generated a total of 10,000 haplotypes in the two genes. Based on the 10,000 haplotypes, SeqSIMLA first simulated haplotypes in founders and assumed random mating to generate the offspring. We chose the logistic function as the penetrance function in SeqSIMLA: where X = (X 1 ,X 2 ) is a vector of genotype coding based on additive, dominant, or recessive model for the two disease SNPs; ρ is the parameter used to determine the disease prevalence; λ 1 and λ 2 represent the effect sizes of the main effects for the disease SNPs; and λ 3 determines the interaction effect for the two disease SNPs. For the type I error simulations, we first simulated no interaction effects and no main effects for two SNPs in the two genes. Different minor allele frequencies (MAFs) at the two SNPs (i.e., (0.2, 0.2; 0.3, 0.15)) and different numbers of DSPs (i.e., 250, 500, and 1000) were considered. We then considered the situation where main effects were present for the two SNPs under different levels of disease prevalence (i.e., 1 %, 5 %, and 10 %). We simulated one scenario where only one SNP had main effect (i.e., λ 1 = log (2), λ 2 = 0) and three scenarios where both SNPs had main effects (i.e., λ 1 = log (1.3), λ 2 = log (1.3); λ 1 = log (1.5), λ 2 = log (1.5); λ 1 = log (2), λ 2 = log (2)). In addition, to investigate whether the GCOREsib is robust to the violation of the assumption of LE, we simulated two SNPs in different levels of LD (i.e., LD measures r 2 = 0.3, 0.5, and 0.8). All type I error rates in these scenarios were calculated based on 5,000 replicates of samples. Two significance levels (i.e., 0.05 and 0.01) were considered for the type I error calculations.
For the power studies, we simulated interaction effects for two SNPs in the two genes. We compared the power of our method with the power for GEE and MDR-PDT under different scenarios. The "exchangeable" within cluster correlation structure was specified in GEE. The regression model based on GEE included individual terms for the two SNPs and one interaction term, where genotypes were coded as the minor allele counts. We considered different numbers of DSPs (i.e., 250, 500, and 1000), disease models (i.e., additive, dominant, and    (2); λ 1 = 0, λ 2 = 0, λ 3 = log (2.25)). The default settings were 500 DSPs, additive model, MAFs of 0.2 for the two SNPs, and effect size of (λ 1 = 0, λ 2 = 0, λ 3 = log (2)). We changed one parameter at a time for each of the scenarios. Table 5 shows the parameter values for each of the scenarios. The power was calculated with a significance level of 0.05 based on 1,000 replicates of samples.

Parallel computing
Although the calculation of the GCORE-sib statistic is fast for a SNP pair, performing a genome-wide interaction analysis by testing tens of billions of tests can still be very time consuming for the GCORE-sib. The GCORE-sib software was implemented with POSIX Threads (Pthreads) so that the calculations for SNP pairs can be performed in parallel on a multi-core computer. Moreover, the calculations can be performed for SNPs between a chromosome pair specified by the user. Therefore, the calculations can be distributed across different computers.

Performance comparison
We compared the performance of the GCORE-sib with GEE and MDR-PDT. Currently, GEE is mostly implemented in R, which is not comparable to the GCOREsib and MDR-PDT implemented in C++. Alternatively, we used the interaction test based on a regular logistic regression implemented in PLINK. The logistic regression based on GEE usually first runs the regular logistic regression to obtain initial parameter estimates assuming all samples are independent, and more iterations are performed for the overall parameter estimates including the correlation matrix. Therefore, the logistic regression with GEE is expected to run longer than the regular logistic regression. A total of 1,000 DSPs were simulated using SeqSIMLA, and the performance was compared based on 1,000, 5,000, and 10,000 pairs of SNPs on a computer with Xeon 2.0 GHz CPU and 96 GB of RAM.
To evaluate the performance of the GCORE-sib for a genome-wide interaction analysis, we downloaded the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset for hypertension. The dataset consists of 1,952 unrelated cases for hypertension and 2,938 unrelated controls, and there are 457,710 SNPs in the data. We randomly matched cases and controls, which resulted in 1,952 case-control pairs. The case-control pairs were analyzed as DSPs in the GCORE-sib. Because the WTCCC study is not a family-based study, our analysis was used only for the performance evaluation for the GCORE-sib. We also downloaded the gene annotation file from the UCSC genome browser website. All possible pairs of SNP interactions between genes were tested in parallel with 20 cores by the GCORE-sib on the aforementioned computer. Table 6 shows the type I error estimates for the GCORE-sib under different MAFs at the two SNPs and different numbers of DSPs at the significance levels of 0.05 and 0.01. The type I error rates were close to the nominal levels under all scenarios. Table 7 summarizes the results of the type I error rates in the presence of main effects. In the presence of only one main effect (i.e., λ 1 = log (2), λ 2 = 0), the type I error estimates were close to the 0.05 nominal level across different levels of disease prevalence and disease models. The type I error estimates were inflated by the large effect size (e.g., λ 1 = log (2), λ 2 = log (2)) when both SNPs had main effects. When there was LD between the two SNPs, the type I error rates were 0.046, 0.052, and 0.054 for LD r 2 of 0.3, 0.5, and 0.8, respectively, at the significance level of 0.05, and the type I error rates were 0.0082, 0.0088, and 0.0104 for LD r 2 of 0.3, 0.5, and 0.8, respectively, at the significance level of 0.01. Therefore, the GCORE-sib also   maintained proper type I error rates when the assumption of LE was violated.

Type I error rates
Power Figure 1 shows the power comparisons under different scenarios. In Scen1, the power for the GCORE-sib was similar to the power for GEE, while MDR-PDT had the lowest power. For all different methods, the power increased with the increase in the number of DSPs. In Scen2, for the additive and recessive models, similar power patterns were observed that the power for the GCORE-sib and GEE was similar, and the power for both tests was higher than the power for MDR-PDT. However, in the dominant model, GEE and MDR-PDT can have significantly higher power than that for the GCORE-sib. The GCORE-sib had the highest power in the additive model, followed by the dominant and recessive models, while the other tests showed the highest power in the dominant model, followed by the additive and recessive models. In Scen3, the power for the GCORE-sib and GEE in MAF of (0.2, 0.2) and MAF of (0.3, 015) was similar, while MDR-PDT showed higher power in MAF of (0.3, 0.15) than that in MAF of (0.2, 0.2). In the last scenario where the interaction effect size is increased to log(2.25), the power for the GCORE-sib was still close to the power for GEE, and the power for both tests was also higher than the power for MDR-PDT. In summary, the power for the GCORE-sib was generally similar to the power for GEE and the power for the GCORE-sib and GEE was generally higher than the power for MDR-PDT under the additive model in our simulations.

Discussion
We developed an odds ratio-based gene-gene interaction test considering correlations in discordant sib pairs. The hypergeometric distribution for genotype counts was assumed in each discordant sib pair. Then the estimates of correlation within families can be calculated based on the model assumption. We demonstrated that the GCORE-sib showed appropriate type I error rates under the null hypothesis of no interaction, even in the presence of LD between SNPs, or when only one SNP showed main effect. Sharing the same property as the odds ratio-based test for case-control studies, the GCORE-sib maintains proper type I error rates when only one SNP has main effects. When the two SNPs both have strong main effects, type I error rates could be inflated for most of the interaction tests [23]. Therefore, in practice, significant results from interaction tests for two SNPs should be interpreted along with tests for main effects for the same two SNPs.
We also compared the power of the GCORE-sib with two alternative family-based interaction tests, GEE and MDR-PDT. Our simulation results suggested that the GCORE-sib and GEE had similar power, while MDR-PDT had the lowest power under most of the simulation scenarios. Under the assumption of HWE, alleles are independent in Tables 3 and 4, and the GCORE-sib tests  the allelic correlations between the two SNPs based on the two tables. Hence, an additive model is implicitly assumed in the GCORE-sib. Moreover, genotypes for GEE were also coded based on an additive model in our simulations. As most of our power simulations were conducted under the additive model, it was not surprising to observe higher power for the GCORE-sib and GEE than the machine learning-based MDR-PDT.  Table 5. The error bars represent the 95 % confidence intervals for the power