The choice of null distributions for detecting genegene interactions in genomewide association studies
 Can Yang^{1}Email author,
 Xiang Wan^{1}Email author,
 Zengyou He^{2},
 Qiang Yang^{3},
 Hong Xue^{4} and
 Weichuan Yu^{1}Email author
https://doi.org/10.1186/1471210512S1S26
© Yang et al; licensee BioMed Central Ltd. 2011
Published: 15 February 2011
Abstract
Background
In genomewide association studies (GWAS), the number of singlenucleotide polymorphisms (SNPs) typically ranges between 500,000 and 1,000,000. Accordingly, detecting genegene interactions in GWAS is computationally challenging because it involves hundreds of billions of SNP pairs. Stagewise strategies are often used to overcome the computational difficulty. In the first stage, fast screening methods (e.g. Tuning ReliefF) are applied to reduce the whole SNP set to a small subset. In the second stage, sophisticated modeling methods (e.g., multifactordimensionality reduction (MDR)) are applied to the subset of SNPs to identify interesting interaction models and the corresponding interaction patterns. In the third stage, the significance of the identified interaction patterns is evaluated by hypothesis testing.
Results
In this paper, we show that this stagewise strategy could be problematic in controlling the false positive rate if the null distribution is not appropriately chosen. This is because screening and modeling may change the null distribution used in hypothesis testing. In our simulation study, we use some popular screening methods and the popular modeling method MDR as examples to show the effect of the inappropriate choice of null distributions. To choose appropriate null distributions, we suggest to use the permutation test or testing on the independent data set. We demonstrate their performance using synthetic data and a real genome wide data set from an Agedrelated Macular Degeneration (AMD) study.
Conclusions
The permutation test or testing on the independent data set can help choosing appropriate null distributions in hypothesis testing, which provides more reliable results in practice.
Keywords
Background
Singlenucleotide polymorphisms (SNPs) serve as markers for mapping diseaseassociated genetic variants. It has been well known that SNP profiles are associated with a variety of diseases [1]. Highthroughput genotyping technologies have been used to assay hundreds of thousands of SNPs in the human genome. Many singlelocus based methods [2] have been proposed and many susceptibility determinants have been identified [1]. However, these identified SNPs seem to be insufficient in explaining the genetic contributions to complex diseases [3]. Researchers start to suspect that the causality of many common disease are more related with genegene interactions rather than with single genetic variations [3, 4]. For many common complex diseases, some SNPs have shown little main effects while their interactions are significantly associated with disease traits [5–7]. Consequently, detecting genegene interactions is a topic of current interest in GWAS [4]. Many methods have recently been proposed to identify interaction patterns associated with diseases, including MDR [6], CPM [5], RPM [8], BGTA [9], SNPRuler [10], LASSO [11–13], HapForest [14], BOOST [15], PLINK [16], BEAM [17], SNPHarvester [18] and INTERSNP [19]. However, a key issue of applying most of these methods in GWAS is the computational burden [4]. For example, to find pairwise interactions from 500,000 SNPs, we need 1.25 × 10^{11} statistical tests in total. To address this issue, screening approaches [20] have been proposed. The whole process of detecting genegene interactions is then divided into three stages:

Screening: Evaluate the importance of each SNP and assign it a score. Those SNPs with scores lower than the given threshold are removed without further consideration. Often a small portion of SNPs remains. This stage is often accomplished by fast algorithms using heuristics to reduce the search space of the next stage. One example is the popular screening method Tuning ReliefF [21].

Modeling: Search for the best combination of SNPs in the remaining SNPs. The exhaustive search can be used in this stage because the number of remaining SNPs is small, e.g., the popular modeling methods MDR and CPM. During the search process, the importance of a SNP combination is often measured by its prediction accuracy (typically evaluated by crossvalidation). Thus, the best SNP combination and its corresponding interaction pattern can be identified in term of prediction accuracy.

Testing: Assess the significance of interaction patterns by hypothesis testing.
In this paper, we show through simulations that both screening and modeling may change the null distribution used in hypothesis testing. However, many methods (such as MDR combined with Tuning ReliefF [6, 21] and some other stagewise methods [24]) neglect the rightward shift of the original null distribution (as illustrated in Figure 1) caused by screening and modeling. They inevitably suffer from the higher false positive rate. We have also noticed that some methods [25–27] modify the test statistics, which causes the leftward shift of the original null distribution. If we still stick to the theoretical null distribution in hypothesis testing, we may produce conservative results (see the discussion section). To address this issue, we suggest to use the permutation test and testing on the independent data set. The permutation test uses the resampling method to estimate the changed null distribution for hypothesis testing. Testing on the independent data set can reserve the theoretical null distribution. Through simulation experiments and the experiment on a real genomewide data set from an AMD study, we demonstrate that the appropriate choice of null distributions leads to more reliable results.
Results
Simulation study of the inappropriate choice of null distributions
The huge number of SNPs in GWAS poses a heavy computational burden for detecting genegene interactions. The exhaustive search of all pairwise interactions and further using crossvalidation to evaluate them (e.g. MDR [6]) become impractical in GWAS. To make it computationally feasible, a screening method is applied to the whole data set to preselect a small subset of SNPs. Then the exhaustive search can be applied to identify the most likely diseaseassociated SNPs. At last, hypothesis testing is conducted on identified SNPs. The importance of hypothesis testing is briefly discussed in the discussion section. Here we use MDR and some efficient screening methods [21, 28–30] as examples to show that null distributions are affected by these methods Throughout this paper, we use the latest MDR software (MDR 2.0 beta 8.1) to perform all experiments. It also implements various screening methods, such as ReliefF, Tuning ReliefF, and SURFSTAR. We first show that the modeling process of MDR changes the null distribution in its search process. Then we show that MDR coupled with some screening methods further changes the null distribution.
The null distribution of MDR
A toy example illustrating how MDR collapses two 3 × 3 genotype tables to a 2 × 2 contingency table
Case  BB  Bb  bb  Control  BB  Bb  bb  MDR table  Case  Control 

AA  179  119  18  AA  199  126  15  
Aa  315  173  26  Aa  306  164  17  Lowrisk  399  443 
aa  101  59  10  aa  118  49  6  Highrisk  601  557 
We design two scenarios with three settings of dway interactions (d = 2, 3, 4), one showing the true null distributions of MDR (without search) and another showing the change of null distributions of MDR (with search). For each scenario of this experiment, we generate 500 null data sets, each of which contains 2,000 samples.
In the second scenario, we generate the data sets containing 20 SNPs for all three cases. These SNPs are generated in the same way as in the first scenario. In this scenario, MDR first searches for the best d*way interactions using 10fold crossvalidation and then conducts the testing. The results are shown in the lower panel of Figure 2. Compared with the first scenario, the null distributions change when MDR searches for the best d*way interactions (d* = 2, 3,4) in d = 20 dimensions. We see that the null distributions do not strictly follow χ^{2} distributions. We use χ^{2} distributions to approximate them, and obtain their degrees of freedom as df = 13.76, df = 29.49 and df = 64.64, respectively.
In summary, our result shows the following facts:
Therefore, hypothesis testing using as the null distribution in MDR will give many false positive results.
The null distribution for MDR combined with screening methods
MDR works well for small studies with 100 or less SNPs. In GWAS, it is not practical to use this exhaustive search method. Therefore, many screening methods have been proposed to reduce the number of SNPs before MDR is applied to detect d*way interactions. These screening methods include ReliefF [28, 29], Tuning ReliefF (TURF) [21], SURF [34], SURFSTAR [30]. The reader is referred to [20] for a recent review on these screening methods.
Simulation study of the suggested solutions
To see the effect of MDR using the independent data set in hypothesis testing, we similarly divide each null data set into three subsets: D^{(1)}, D^{(2)}, D^{(3)}. ReliefF is applied to D^{(1)} and then MDR is applied to D^{(2)}. At last, χ^{2} tests are conducted on D^{(3)} using the 2 × 2 contingency tables obtained by the MDR method. The result is shown in the lower panel of Figure 4. We can see that the obtained null distributions of MDR agree with the distributions shown in the upper panel of Figure 2. This result clearly shows that the null distributions are not changed by screening and modeling when testing on the independent data set.
Real data analysis using the suggested solutions: an experiment on the Agedrelated Macular Degeneration (AMD) data set
We use the AMD data from [37] as a real example. The AMD study genotyped 116,204 SNPs on 96 cases and 50 controls. After applying quality control to the AMD data set, we have 82,143 qualified SNPs. Two significant loci, rs380390 and rs1329428, were reported in [37] based on the allelic association test with degree of freedom df = 1.
The experiment result on the AMD data set
Model  SNP name  CV accuracy  CV consistency  χ^{2}value 

M _{1}  rs1329428  0.7246  10/10  27.1480 
M _{2}  rs1329428, rs9299597  0.7086  6/10  38.3007 
M _{3}  rs1535891, rs1329428, rs9299597  0.7833  3/10  55.4297 
M _{4}  rs1535891, rs2828151, rs404569, rs380390  0.7615  10/10  85.2449 
Third, we apply testing on the independent data set. The whole data set is partitioned into three groups (49, 49 and 48 individuals, respectively). ReilefF, MDR and hypothesis testing are sequentially applied to them, as described in the method section. Finally, no significant features are reported. This seems different from the result of the permutation test. The reason is that testing on the independent data set has a lower power than the permutation test. The number of samples of the AMD data set is very small (146 individuals). After a nearly equal partition of the data set, only 48 samples can be used in hypothesis testing. Thus no significant results can be detected (see more explanations in the simulation study section). Since the permutation test often has a higher power, the permutation test is preferred when the computational cost is affordable.
Discussion
The importance of hypothesis testing in feature assessment
It is important to note that hypothesis testing plays a key role in feature assessment [22, 23]. Model selection is a closely related topic, which aims to identify the best model in term of the prediction accuracy [22]. Analytical methods such as Akaike information criterion (AIC), Bayesian information criterion (BIC) can be applied for model selection. Efficient sample reuse methods such as crossvalidation and bootstrapping can be applied here as well. However, to statistically quantify the importance of selected features, feature assessment is preferred. Instead of considering prediction accuracy, feature assessment makes use of hypothesis testing to statistically assess the significance of features. To characterize the performance of hypothesis testing, different measures have been defined, e.g., the family wise error rate (FWER) and the false discovery rate (FDR) [35, 36]. The Bonferroni correction and the BenjaminiHochberg method [38] can be used for controlling FWER and FDR, respectively.
As pointed out by Efron [23], the choice of null distribution is critical in hypothesis testing. The empirical null distribution may not match the theoretical null distribution due to reasons such as inappropriate assumptions or correlation across features and samples. This paper shows that screening (e.g, ReliefF) and modeling (e.g, MDR) can also change the null distribution. If their effect is not taken into account in hypothesis testing, the resulting feature assessment will be unreliable.
Related work
We have shown that inappropriate choice of null distributions will give misleading results of hypothesis testing. One example is that MDR combined with Tuning ReliefF [6, 21] will give overoptimistic results. Alternatively, some other methods [25–27], which modify the test statistic but stick to the theoretical null distribution, may produce conservative results. For example, Machini et. al [25] proposed a twostage method for detecting interactions in genomewide scale. In the first stage, a singlelocusbased test was performed. Those SNPs with significant Pvalues (i.e., smaller than a certain threshold α) were selected. The selected set of SNPs was denoted as I_{1} (I_{1} ⊂ {1,…, L}). In the second stage, for each pair of SNPs l and m (l, m ∈ I_{1}, l ≠ m), the log likelihood ratio statistic R(l, m) was calculated for the full interaction model. They defined a new statistic R′(l, m) = R(l, m) – (k_{ l } + k_{ m }), where k_{ l } and k_{ m } were the singlelocus χ^{ 2 } values for SNPs l and m. The significance of this statistic was assessed against distribution, where df = 8 is the degrees of freedom of the full model fitted at the two SNPs. They showed that their method was conservative in term of the false positive rate. In fact, the modified statistic R′(l, m) is a shrunken version of R(l, m), but hypothesis testing is performed using the degree of freedom of R(l, m), where l, m ∈ {1,…, L}, l ≠ m. Thus, this method will be too conservative to detect interesting interactions.
Conclusion
GWAS have identified many genomic regions associated with complex diseases. However, some previously reported results are based on an inappropriate choice of null distributions, which will produce many false positive results. In this work, we have illustrated that both screening and modeling can change the null distribution used in hypothesis testing. This causes unreliable significance assessment. We have suggested two solutions to address this issue. One is to use the permutation test and another is to use the independent data set for testing. Both solutions can help to appropriately choose null distributions, while the permutation test has a higher power with more computational cost.
Method
The null distribution is changed after the screening step and the modeling step in the stagewise procedure. In this section, we suggest two solutions to address this issue. The first solution is to use the permutation test, which uses the resampling to generate the reference distribution for hypothesis testing. The second one is to use the independent data set for testing, which reserves the theoretical null distribution.
Suppose we have L SNPs and n samples for an association study. The whole data set D = [X; Y] is an (L + 1) × n matrix, where we use X to denote all SNPs with the lth column X_{ l } corresponding to the lth SNP, and use Y to denote the phenotype.
Permutation test
 1.
Compute the test statistic based on original data. Apply an efficient screening method to D and reduce it to D′, where D′ is a (d + 1) × n matrix collecting the top d features. Next, apply modeling methods such as MDR to D′ to identify the best model f(X*), where X* has d* features. Then calculate the test statistic of model f(X*) based on all samples, denoted as t_{ obs }.
 2.
Generate B independent vectors Y_{(1)},…, Y_{(B)} by randomly permuting the response variable Y. Evaluate the permuted statistic T_{(b)} of the same procedure in Step 1 corresponding to the permuted data set D_{(b)} = [X; Y_{(b)}], b = 1,…,B.
 3.
where I(·) is the indicator function.
In theory, a larger number of permutations will produce more accurate estimation. In practice, we typically use between 200 and 1000 permutations due to the high computational cost in the permutation test.
Testing on the independent data set
 1.
Partition the whole data set D into three subsets with nearly equal size: D^{(1)}, D^{(2)} and D^{(3)}.
 2.
Apply an efficient screening method to D^{(1)} and identify a subset of features, denoted as A_{1}. Let A_{1} denote the size of A_{1} and we have A_{1} ⊈ L.
 3.
Apply a modeling method to D^{(2)} by only involving the features in A_{1}. The identified features are collected in A_{2} and we have A_{2} ⊂ A_{1}.
 4.
Perform hypothesis test on the features in A_{2} using the data set D^{(3)}. The correction factor for multiple testing can be calculated based on A_{2}. The detected significant features are collected in A_{ 3 } and A_{3} ⊂ A_{2}. These features are finally used for genetic mapping.
This solution can be applied without sacrificing the running time. However, it has a requirement on the number of samples. A small sample size will degrade its performance. In that situation, the permutation test is a better choice.
Declarations
Acknowledgement
This work was partially supported with the grant GRF621707 from the Hong Kong Research Grant Council, the grants RPC06/07.EG09, RPC07/08.EG25 and RPC10EG04 from HKUST, the Natural Science Foundation of China under Grant No. 61003176, and the Fundamental Research Funds for the Central Universities of China (DUT10JR05 and DUT10ZD110).
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/12?issue=S1.
Authors’ Affiliations
References
 WTCCC: Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007, 447: 661–678. 10.1038/nature05911View ArticleGoogle Scholar
 Balding D: A tutorial on statistical methods for population association studies. Nature Reviews Genetics 2006, 7: 781–791. 10.1038/nrg1916View ArticlePubMedGoogle Scholar
 Eichler E, Flint J, Gibson G, Kong A, Leal S, Moore J, Nadeau J: Missing heritability and strategies for finding the underlying causes of complex disease. Nature Reviews Genetics 2010, 11(6):446–450. 10.1038/nrg2809PubMed CentralView ArticlePubMedGoogle Scholar
 Cordell H: Detecting genegene interactions that underlie human diseases. Nature Reviews Genetics 2009, 10: 392–404. 10.1038/nrg2579PubMed CentralView ArticlePubMedGoogle Scholar
 Nelson M, Kardia S, Ferrell R, Sing C: A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Research 2001, 11(3):458. 10.1101/gr.172901PubMed CentralView ArticlePubMedGoogle Scholar
 Ritchie M, Hahn L, Roodi N, Bailey L, Dupont W, Parl F, Moore J: Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer. Am J Hum Genet 2001, 69: 138–147. 10.1086/321276PubMed CentralView ArticlePubMedGoogle Scholar
 Phillips PC: Epistasisthe essential role of gene interactions in the structure and evolution of genetic systems. Nature Reviews Genetics 2008, 9(11):855–867. 10.1038/nrg2452PubMed CentralView ArticlePubMedGoogle Scholar
 Culverhouse R, Klein T, Shannon W: Detecting epistatic interactions contributing to quantitative traits. Genetic Epidemiology 2004, 27: 141–152. 10.1002/gepi.20006View ArticlePubMedGoogle Scholar
 Zheng T, Wang H, Lo S: Backward genotypetrait association (BGTA)  based dissection of complex traits in casecontrol design. Human Heredity 2006, 62: 196–212. 10.1159/000096995PubMed CentralView ArticlePubMedGoogle Scholar
 Wan X, Yang C, Yang Q, Xue H, Tang N, Yu W: Predictive rule inference for epistatic interaction detection in genomewide association studies. Bioinformatics 2010, 26: 30–37. 10.1093/bioinformatics/btp622View ArticlePubMedGoogle Scholar
 Tibshirani R: Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, series B 1996, 58: 267–288.Google Scholar
 Wu T, Chen Y, Hastie T, Sobel E, Lange K: Genomewide Association Analysis by Lasso Penalized Logistic Regression. Bioinformatics 2009, 25(6):714–721. 10.1093/bioinformatics/btp041PubMed CentralView ArticlePubMedGoogle Scholar
 Yang C, Wan X, Yang Q, Xue H, Yu W: Identifying main effects and epistatic interactions from largescale SNP data via adaptive group Lasso. BMC Bioinformatics 2010, 11(Suppl 1):S18. 10.1186/1471210511S1S18PubMed CentralView ArticlePubMedGoogle Scholar
 Chen X, Liu C, Zhang M, Zhang H: A forestbased approach to identifying gene and genegene interactions. Proceedings of the National Academy of Sciences of the United States of America 2007, 104(49):19199–19203. 10.1073/pnas.0709868104PubMed CentralView ArticlePubMedGoogle Scholar
 Wan X, Yang C, Yang Q, Xue H, Fan X, Tang N, Yu W: BOOST: A fast approach to detecting genegene interactions in genomewide casecontrol studies. Am J Hum Genet 2010, 87(3):325–340. 10.1016/j.ajhg.2010.07.021PubMed CentralView ArticlePubMedGoogle Scholar
 Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira M, Bender D, Maller J, Sklar P, De Bakker P, Daly M, et al.: PLINK: a tool set for wholegenome association and populationbased linkage analyses. Am J Hum Genet 2007, 81(3):559–575. 10.1086/519795PubMed CentralView ArticlePubMedGoogle Scholar
 Zhang Y, Liu J: Bayesian inference of epistatic interactions in casecontrol studies. Nature Genetics 2007, 39: 1167–1173. 10.1038/ng2110View ArticlePubMedGoogle Scholar
 Yang C, He Z, Wan X, Yang Q, Xue H, Yu W: SNPHarvester: a filteringbased approach for detecting epistatic interactions in genomewide association studies. Bioinformatics 2009, 25(4):504–511. 10.1093/bioinformatics/btn652View ArticlePubMedGoogle Scholar
 Herold C, Steffens M, Brockschmidt F, Baur M, Becker T: INTERSNP: genomewide interaction analysis guided by a priori information. Bioinformatics 2009, 25(24):3275–3281. 10.1093/bioinformatics/btp596View ArticlePubMedGoogle Scholar
 Moore J, Asselbergs F, Williams S: Bioinformatics challenges for genomewide association studies. Bioinformatics 2010, 26(4):445–455. 10.1093/bioinformatics/btp713PubMed CentralView ArticlePubMedGoogle Scholar
 Moore J, White B: Tuning ReliefF for genomewide genetic analysis. Lecture Notes in Computer Science, Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics 2007, 4447: 166–175. full_textView ArticleGoogle Scholar
 Hastie T, Tibshirani R, Friedman J: The Elements of Statistical learning: Data Mining, Inference, and Prediction. 2nd edition. NewYork: Springer; 2009.View ArticleGoogle Scholar
 Efron B: Largescale simultaneous hypothesis testing: The choice of a null hypothesis. Journal of the American Statistical Association 2004, 99(465):96–104. 10.1198/016214504000000089View ArticleGoogle Scholar
 Niu A, Zhang Z, Sha Q: Application of seventeen twolocus models in genomewide association studies by twostage strategy. BMC Proc 2009, 3(Suppl 7):S26. 10.1186/175365613s7s26PubMed CentralView ArticlePubMedGoogle Scholar
 Marchini J, Donnelly P, Cardon LR: Genomewide strategies for detecting multiple loci that influence complex diseases. Nature Genetics 2005, 37(4):413–417. 10.1038/ng1537View ArticlePubMedGoogle Scholar
 Evans D, Marchini J, Morris A, Cardon L: Twostage twolocus models in genomewide association. PLoS Genetics 2006, 2(9):e157. 10.1371/journal.pgen.0020157PubMed CentralView ArticlePubMedGoogle Scholar
 Med B: Optimal twostage strategy for detecting interacting genes in complex diseases. BMC Genetics 2006, 7: 39.Google Scholar
 Kira K, Rendell L: A practical approach to feature selection. Proceedings of the Ninth International Workshop on Machine learning 1992, 249–256.Google Scholar
 Wiskott L, Fellous J, Kruger N, Malsburg C: Estimating attributes: analysis and extension of relief. European Conference on Machine Learning 1994, 171–182.Google Scholar
 Greene C, Himmelstein D, Kiralis J, Moore J: The informative extremes: using both nearest and farthest individuals can improve Relief algorithms in the domain of human genetics. Lecture Notes in Computer Science, Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics 2010, 6023: 182–193. full_textView ArticleGoogle Scholar
 Park M, Hastie T: Penalized logistic regression for detecting gene interactions. Biostatistics 2008, 9: 30–50. 10.1093/biostatistics/kxm010View ArticlePubMedGoogle Scholar
 Hastie T, Tibshirani R: Generalized additive models. Chapman & Hall/CRC; 1990.Google Scholar
 Li W, Yang Y: Fractal Characterizations of MAX Statistical Distribution in Genetic Association Studies. Advances in Complex Systems (ACS) 2009, 12(04):513–531. 10.1142/S0219525909002349View ArticleGoogle Scholar
 Greene C, Penrod N, Kiralis J, Moore J: Spatially Uniform ReliefF (SURF) for computationallyefficient filtering of genegene interactions. BioData Mining 2009, 2: 5. 10.1186/1756038125PubMed CentralView ArticlePubMedGoogle Scholar
 Dudoit S, Laan M: Multiple Testing Procedures with Applications to Genomics. Springer; 2008.View ArticleGoogle Scholar
 Dudoit S, Shaffer J, Boldrick J: Multiple hypothesis yesting in microarray experiments. Statistical Science 2003, 18: 71–103. 10.1214/ss/1056397487View ArticleGoogle Scholar
 Klein R, Zeiss C, Chew E, Tsai J, Sackler R, Haynes C, Henning A, SanGiovanni J, Mane S, Mayne S, Bracken M, Ferris F, Ott J, Barnstable C, Hoh J: Complement factor H polymorphism in agerelated macular degeneration. Science 2005, 308: 385–389. 10.1126/science.1109557PubMed CentralView ArticlePubMedGoogle Scholar
 Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple teting. Journal of the Royal Statistical Society Series B 1995, 85: 289–300.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.