 Research
 Open Access
Identifying main effects and epistatic interactions from largescale SNP data via adaptive group Lasso
 Can Yang^{1}Email author,
 Xiang Wan^{1},
 Qiang Yang^{2},
 Hong Xue^{3} and
 Weichuan Yu^{1}Email author
https://doi.org/10.1186/1471210511S1S18
© Yang et al; licensee BioMed Central Ltd. 2010
 Published: 18 January 2010
Abstract
Background
Single nucleotide polymorphism (SNP) based association studies aim at identifying SNPs associated with phenotypes, for example, complex diseases. The associated SNPs may influence the disease risk individually (main effects) or behave jointly (epistatic interactions). For the analysis of high throughput data, the main difficulty is that the number of SNPs far exceeds the number of samples. This difficulty is amplified when identifying interactions.
Results
In this paper, we propose an Adaptive Group Lasso (AGL) model for largescale association studies. Our model enables us to analyze SNPs and their interactions simultaneously. We achieve this by introducing a sparsity constraint in our model based on the fact that only a small fraction of SNPs is diseaseassociated. In order to reduce the number of false positive findings, we develop an adaptive reweighting scheme to enhance sparsity. In addition, our method treats SNPs and their interactions as factors, and identifies them in a grouped manner. Thus, it is flexible to analyze various disease models, especially for interaction detection. However, due to the intensive computation when millions of interaction terms needs to be searched in the model fitting, our method needs to combined with some filtering methods when applied to genomewide data for detecting interactions.
Conclusion
By using a wide range of simulated datasets and a real dataset from WTCCC, we demonstrate the advantages of our method.
Keywords
 Single Nucleotide Polymorphism
 Markov Chain Monte Carlo
 Lasso
 Epistatic Interaction
 Sparsity Constraint
Background
Rapid Improvements of highthroughput genotyping technologies enable us to detect genetic variations with much finer resolution than before. In genomewide association (GWA) studies of complex diseases, a few thousands samples are collected and hundreds of thousands of single nucleotide polymorphisms (SNPs) have been genotyped for each sample [1].
Researchers have been investigating diseaseassociated gene mapping for decades and various approaches have been proposed. However, most of them have used a singleSNP based strategy, in which each SNP is analyzed individually (see [2] for a comprehensive review). Due to the sophisticated regulatory mechanism encoded in the human genome, it is widely agreed that complex traits are typically caused by multiple genetic variations. One type of genetic variation influences the traits individually. This is known as main effects. Another type of genetic variation is that SNPs may show little effect individually, but strong effects jointly. This is known as epistasis or multilocus interactions [3]. Therefore, multilocus based approaches are believed to have higher power than singlelocus based ones. Identifying epistatic interactions arises as an important problem in multilocus based approaches [4].
Recently, an increasing number of research has reported the presence of epistatic interactions in complex diseases, such as type2 diabetes [5]. In order to detect epistatic interaction, various computational and statistical methods have developed [4]. For example, Nelson et al. [6] proposed a combinatorial partitioning method (CPM) that enumerated multilocus genotypes and evaluated them with phenotypes. Culverhouse et al. [7] proposed a restricted partitioning method (RPM) to improve the efficiency of CPM. Millstein et al. [8] developed a testing framework when epistasis is present. Ritchie et al. [9] proposed a multifactordimensionality reduction (MDR) method that identified interactions based on classification accuracy through exhaustive search. Zhang and Liu [10] proposed a Bayesian epistasis association mapping (BEAM) method to address the issue of epistasis mapping in genomewide scale by using Markov Chain Monte Carlo (MCMC) method. In spite of their promising performance, most of these methods only show their successes in association studies on smallscale data sets.
From our view, detecting diseaseassociated SNPs and their interactions can be cast as the variable selection problem in the framework of regression analysis. Standard tools for variable assessment are the methods of multivariate regression. In traditional applications of multivariate regression, the number of variables is less than the number of samples. In the context of SNPbased disease association studies, however, the number of SNPs is far more than the number of samples, making it difficult or even impossible to directly apply standard multivariate regression methods.
It is widely agreed in GWA studies that only a small fraction of SNPs is diseaseassociated. In the multivariate regression framework, this implies that most regression coefficients should be zero. This motivates us to impose a sparsity constraint to the regression model. In addition, SNPs are biallelic markers (i.e., with allele A and a). Each SNP has only three genotypes: two homozygous genotypes (AA and aa) and one heterozygous genotype (Aa). Therefore, each SNP can be naturally treated as a threelevel factor and be coded with three dummy variables. Similarly, the interaction between two SNPs can be treated as a ninelevel factor. In order to encourage sparsity on factors (groups of variables) rather than a single dummy variable, we impose a group constraint on the set of dummy variables that represent a disease model (e.g., a single locus model or a twolocus model). Hence, we propose an Adaptive Group Lasso (AGL) method to identify main effects and epistatic interactions from largescale SNP data. Since Lasso [11] is well known for imposing a sparsity constraint at the variable level, we employ Group Lasso [12, 13] to impose the sparsity constraint at the factor level, and develop adaptive reweighting to enhance the sparsity and to reduce false positive finding.
Results and discussion
In this section, we evaluate the performance of our method using both simulated and real data. In simulation studies, we compare our method with some recent competitors under a wide range of epistatic models. For the real casecontrol study, we use the rheumatoid arthritis (RA) data set from the Wellcome Trust Case Control Consortium (WTCCC).
Simulation studies
In simulation studies, we mainly compare our method AGL with Lasso, BEAM and MDR.
 (1)
We do not presume particular types of main effects and interactions. Thus, our model is more flexible.
 (2)
We impose a sparsity constraint at the factor level instead of at the variable level.
 (3)
Our model includes all possible interactions and is able to identify interactions with weak main effects.
We also compare with BEAM [10] which arises as a powerful epistasis mapping method. Both methods share the concept of three SNP classes: unassociated SNPs, SNPs with main effects and SNPs with interactions. BEAM builds a Bayesian partition model based on these three classes. It is worth mentioning that there is only a single group of interacting SNPs in the BEAM model. To identify multiple interacting groups, BEAM implicitly makes use of MCMC to visit possible interactions. We explicitly allow multiple groups of interacting SNPs and impose additive effects between those groups. Comprehensive comparison studies between BEAM and other related methods have been carried out in [10].
Due to the limited space, the comparison with MDR is given in the supplementary. We further conduct null simulation to estimate the type I error rate of our method.
In the following experiments, we use fivefold crossvalidation in model fitting process. We use Bonferroni correction to adjust our pvalue and set the significance threshold as 0.3 in simulation studies. In the released version of BEAM, the threshold is set as 0.3. Thus, we choose the same threshold.
Comparison with Lasso
We conduct experiments under two scenarios.

Scenario 1: Identification of main effects.
Two epistasis models (left: M11; right: M12).
Model 11  AA  Aa  aa  Model 12  AA  Aa  aa 

BB  α  α  α  BB  α  α  α 
Bb  α  α(1 + θ)^{2}  α(1 + θ)^{3}  Bb  α  α(1 + θ)  α 
bb  α  α(1 + θ)^{3}  α(1 + θ)^{4}  bb  α  α  α(1 + θ) 

Scenario 2: Identification of interaction effects.
Four pure epistatisis models used in [18].
Model 13  h^{2}= 0.3, p_{ a }= 0.4, q_{ b }= 0.4  Model 14  h^{2}= 0.2, p_{ a }= 0.4, q_{ b }= 0.4  

AA  Aa  aa  AA  Aa  aa  
BB  0.077  0.689  0.417  BB  0.086  0.536  0.641 
Bb  0.763  0.150  0.491  Bb  0.677  0.275  0.096 
bb  0.196  0.657  0.247  bb  0.219  0.413  0.712 
Model 15  h^{2} = 0.1, p_{ a }= 0.4, q_{ b }= 0.4  Model 16  h^{2} = 0.05, p_{ a }= 0.4, q_{ b }= 0.4  
AA  Aa  aa  AA  Aa  aa  
BB  0.068  0.299  0.017  BB  0.005  0.179  0.251 
Bb  0.289  0.044  0.285  Bb  0.211  0.100  0.026 
bb  0.048  0.262  0.174  bb  0.156  0.098  0.156 
Generally speaking, if the underlying interaction could be well characterized by Lasso with a presumed model structure, e.g., additive model, then the statistical power of Lasso would be higher than that of AGL because Lasso uses less degree of freedom. However, since the underlying interaction is generally unknown and its possible pattern may cover a wide range of spectrum [19], AGL can serves as a valuable tool for discovering interactions in larger model space. Hence, AGL and Lasso may be complementary to each other in GWA studies.
On the other hand, however, it can also be seen that unassociated groups may enter the final model even after adaptive reweighting. Hence, the selected groups in the final model may not be associated with the phenotype. In this regard, the significance assessment is critically needed.
Comparison with BEAM
 1.
The ability of detecting epistatic interactions when the main effect is weak or even absent.
 2.
The ability of detecting multiple interactions.
Detecting epistatic interactions with weak main effect
A wide range of interaction models without marginal effects has been discussed in [19]. Here we consider the 40 pure epistatic models in [18] to compare the performance between AGL and BEAM. The details of these models are available in the supplementary document. The heritability h^{2}(see definition in [19]) of these 40 models ranges from 0.05 to 0.2 and the
MAF ranges from 0.2 to 0.4. We use 100 data sets for each disease model. There are 200 cases, 200 controls, and 1000 SNPs in each data set.
Our model includes all possible interactions (1000 × 999/2 interactions) in the model fitting process, so there is no chance to miss interesting interactions. The good performance of our model is due to the groupsparsity constraint: It identifies interactions in a grouped manner. This is very helpful to weaken the influence of noise SNPs.
The poor performance of BEAM is not due to the statistical testing power of the Bstatistics [10], but the sampling efficiency. We carefully examined the interactions in the disease models with h^{2} ≥ 0.1: Those interactions are very significant even after the Bonferroni correction of Bstatistics. Notice that BEAM runs 5 × 10^{6} MCMC iterations which are 10 times of pairwise exhaustive search in the simulation study. Thus, our conjecture is that MCMC might converge too slowly to find the ground truth (We provide some evidence in the supplementary to support our conjecture).
Detecting multiple interactions
Another disadvantage of BEAM is that it only allows a single interacting group in its Bayesian partition model. To compensate for this limitation, BEAM uses MCMC sampling strategy to visit possible interactions during model optimization. In contrast, our approach allows multiple interacting groups and imposes additive effects for these interactions. This flexibility enables us to have a higher power to identify multiple interactions. Due to the limited space, we show our comparison result of detecting multiple interactions in the supplementary.
Null simulation study
To validate the use of our pvalue and to estimate the typeI error, we conduct null simulation studies in two cases:

Case 1: We generate 100 null datasets. Each dataset contains 10 K SNPs and 1000 samples. All the SNPs are generated independently with MAF uniformly distributed in [0.05, 0.5]. In this case, the nominal typeI error rates should be 10, 20, 30 per one million SNPs for significance thresholds at 0.1, 0.2, 0.3.

Case 2: We use genomeSIMLA [20] to simulate the SNP data based on the marker information on the Affymetrix 500 K chip from human chromosome 1. Linkage disequilibrium (LD) exists among SNPs. We also generate 100 null datasets, each of which contains 38836 SNPs and 1000 samples. Due to LD pattern, the error rate should lower than the nominal error rate.
Null simulation: empirical typeI error from 100 simulated data sets.
Cases  Error rates per one million SNPs of different thresholds after Bonferroni correction  

threshold = 0.1  threshold = 0.2  threshold = 0.3  
Case 1  11  18  28 
Case 2  3.08  4.89  7.47 
Analysis of WTCCC data
Main effects
Some significant SNPs identified by our method on WTCCC RA data. These SNPs covers the regions which are moderate associated with RA as reported in [1].
SNPs  Location  Related Genes  Pvalue 

rs7539166  1p36  TMEM51  < 10^{30} 
rs384843  1p36  NBPF3  5.2 × 10^{9} 
rs7516721  1p36  GPR3  7.7 × 10^{14} 
rs4959053  6p21  PSORS1C1  < 10^{30} 
rs6457617  6p21  MHC region  1.3 × 10^{15} 
rs6973565  7q32  PLXNA4  1.2 × 10^{9} 
rs10250029  7q32  CHCHD3  < 10^{30} 
rs10751815  10p15  ADARB2  < 10^{30} 
rs4266996  10p15  unknown  < 10^{30} 
rs17147777  10p15  unknown  < 10^{30} 
rs8129909  21q22  IGSF5  < 10^{30} 
rs16999716  21q22  DSCAM  9.2 × 10^{9} 
rs4542939  21q22  DSCAM  8.3 × 10^{9} 
rs13047947  21q22  PDE9A  < 10^{30} 
rs140344  22q12  unkown  < 10^{30} 
rs5749509  22q12  SYN3  < 10^{30} 
rs6518796  22q12  SYN3  < 10^{30} 

WTCCC reports that SNP rs6457617 located at 6p21 shows a very strong association. Our experiment verifies this result.

We do not identify SNP rs6679677 located at 1p13 reported by WTCCC. Instead, we identify two SNPs rs7551793 and rs948620 near SNP rs6679677. The signals of these two SNPs are much stronger than rs6679677.

We also identify SNPs in the moderate association regions reported in [1]. We summarize the result in Table 4.
Interactions
Detecting interactions in genomewide scale is very challenging and multistage strategies are often explored. For example, MDR [9] usually is combined with TuRF [21] which serves as a filter to remove those noise SNPs. Currently, our method AGL can not be directly applied to genomewide scale SNP data since it is too computationally intensive to exhaustively search for all SNP pairs. As suggested in simulation study, our method keeps its statistical power when about 500,000 SNP pairs are considered in our model. Thus, the main difficulty is the computation burden of searching for all SNP pairs. Thus, a filtering method is necessary for our method.
Some significant SNP groups identified by our method in the candidate regions on WTCCC RA data.
SNP Groups  Location  Related Genes  Pvalue 

(rs4988822, rs3135392)  6p21.3  (HLADRA, HLADRA)  3.33 × 10^{15} 
(rs17429127, rs2157082)  6p21.3  (HLADRA, HLADRA)  < 10^{30} 
(rs1358169, rs6460831)  7p21.3  (THSD7A, THSD7A)  < 10^{30} 

For the region 6p21, we select a segment covering the SNPs reported in [22]. This segment contains 250 SNPs from rs3135366 to rs461338. We enumerate all possible interactions and include them in our model. Our method reports two interacting pairs: (rs4988822, rs3135392) and (rs17429127, rs2157082). These SNPs are related to gene HLADRA. The result in [23] reports that there is a strong association between RA and HLA gene family. Notice that the two SNPs rs4988822 and rs3135392 cannot be identified by univariate analysis due to their weak main effects (Their pvalue is at the level of 10^{3} given by univariate analysis). However, they do show a strong interaction. We also run BEAM on the selected segment (5 × 10^{6} MCMC). BEAM reports (rs9469220, rs3957146) and (rs3129872, rs9272723) as the two most significant interactions based on the Bstatistics. We carefully check these two pairs based on the logistic regression model. We find that the interaction of the SNP pair (rs9469220, rs3957146) is very weak using the standard χ^{2} test based on logistic regression models with df = 4, while the interaction of the SNP pair (rs3129872, rs9272723) is strong. We further explore the reason why our method does not report the interacting pair (rs3129872, rs9272723). We observe that SNP rs9268645, which is highly correlated (strong LD) with SNP rs3129872, enters our model as a main effect term. Consequently, the pair (rs3129872, rs9272723) does not enter the model as a interaction term. This shows that the SNP pair (rs3129872, rs9272723) should not be reported as an interacting pair.

For the region 7p21, we select a segment which covers the SNPs reported in [22]. The segment contains 250 SNPs from SNP rs1076224 to SNP rs1548882. We analyze this region and report one interacting pairs (rs1358169, rs6460831). The two SNPs rs1358169 and rs2526100 are related to gene THSD7A on chromosome 7, which has been reported to be associated with bone mineral density [24]. This shows plausible biological relevance. We also run BEAM on the segment. But it does not report any interaction.

Neither BEAM nor our method finds significant interactions in the region 6q23.
The definition of interactions is not consistent in literatures. For example, the interaction effect of two SNPs is define via logistic regression models of genotypes and their combinations [3], while it is also defined via models of haplotypes [25]. The interactions reported above are based on the definition of interaction of a SNP pair in [3]. We extend this definition such that we can simultaneously handle interactions of multiple SNP pairs. The details are given in the Method section. We realize that the interacting SNPs reported above are close to each other on the physical map. This type of interaction effects may be caused by the haplotype effect. Detecting interactions of genes in different genome regions by analyzing genomewide SNP data is still under investigation [4].
Conclusion
In this paper, we proposed an Adaptive Group Lasso method for largescale SNP data analysis. The novelty of our method is that it analyzes SNPs and their interactions simultaneously. It imposes a sparsity constraint at the group level and enables us to identify associated SNPs (especially for interacting SNPs) from largescale SNP data robustly. We show that our method outperforms its recent competitors in both simulation studies and real application.
The limitation of our method is that the interaction model can not be directly applied to genomewide scale SNP data analysis. The main difficulty comes from the computation burden of searching for all SNP pairs. There are two possible solutions to solve this issue. One solution is to make use of some filtering method to reduce the number of SNPs to a manageable size, for example, [22]. Another solution is incorporating biological information. Pathway information [26] provides a biological clue to narrow down the search range for interaction detection. We shall investigate this in our future work.
Methods
Our target is to identify diseaseassociated SNPs and their interactions. Researchers may have different understandings of epistatic interaction. To be clear, Our definition of interaction refers to the deviation from the additive model of multiple SNPs. In other words, the interaction effects refers to the phenotype variation that can be explained by joint effects of multiple SNPs but not by their main effects. This is consistent with the definition given in [3]. We achieve our goal through addressing the following key issues:

Model: How to model the relationship between genotype and phenotype?

Optimization: How to optimize the model structure?

Significance assessment: How to test the significance of the detected association?
In the following, we first present our model for identification of main effects and epistatic interactions. Then, we give an optimization algorithm for largescale SNP data analysis. Finally, we describe a method to assess the significance of the SNPs identified by our model.
The adaptive group Lasso model for identification of main effects
For convenience, we define an active set = {jβ_{ j }≠ 0}. We use an iterative algorithm to adaptively assign weights w_{ j }in (2) as given in Algorithm 1.
From the iterative algorithm, the penalty weight is adjusted according to its previous estimation. If the current estimation is small, i.e., SNP_{ j }is less likely to be associated with disease, then the penalty weight should increase to prevent SNP_{ j }from entering the model in the next iteration, and vice versa. The theoretical justification is given in the supplementary document.
 1.
Set the iteration count m = 0. Initially set = 1, j = 1, ⋯, L.
 2.
Solve problem (2) to obtain β( ) and the active set , where is determined by crossvalidation.
 3.
 (2)
Remove SNP _{ j }, when j ∉ .
 4.
If the active set does not change, stop; otherwise increment m and go to step 2.
The proposed model has following characteristics:

Flexibility: Due to the dummy variable representation, our model is flexible to analyze different main effects (additive, dominant, recessive, interference [17]) in a unified way without presuming one particular type of effect. This flexibility will be pronounced when identifying epistatic interactions.

Sparsity: The sparsity constraint comes from the fact that most SNPs are unassociated. Notice that imposes a constraint to select a group of dummy variables rather than a single one. The weight w_{ j }is assigned adaptively in Algorithm 1 to enhances the sparsity at the factor level. If without reweighting, too many noise SNPs would enter the model in GWA studies.
The adaptive group Lasso model for identification of epistatic interactions
Definition of interactions
Let L_{ M }and L_{ F }be the log likelihood of the main effect model and the full model, respectively. According to the likelihood ratio test, interaction effects are defined via the difference of the log likelihood of these two models, i.e., L_{ F } L_{ M }. Hence, interaction effects can be interpreted as departure from linear models naturally [4].
Modelling interactions
The model for identification of epistatic interactions is an extension of model (2): Interaction terms are further included in a grouped manner. We treat the combination of two SNPs as a 9level factor and use 9 dummy variables to code them. Let J_{1} be the index set of all SNPs:
J_{1} ≜ {1, 2,⋯, L}, and J_{2} be the index set of all pairwise interactions of L SNPs:
The proposed model structure has the following characteristics:

Our interaction model integrates the analysis of SNPs with main effects and interacting SNPs, and unassociated SNPs are prevented from entering the model by the sparsity constraint. Recall that interaction refers to the phenotype variation that can be explained by joint effects of multiple SNPs but not by their main effects, model (7) includes main effects to discourage the interaction terms to enter the model, when their effects can be mostly explained by the main effects. However, this may not be able to completely prevent spurious interactions from entering the model. The reason is that the interaction terms may explain more variances than their corresponding main effects. Hence, they are more likely to enter the final model even when they are penalized more heavily ( for j_{1} ∈ J_{1}, for j_{2} ∈ J_{2}). To overcome this difficulty, we resort to the statistical test of interaction effects.

By using dummy variables, our model is flexible to model various interactions, including all epistatic models described in [17]. Both main effect terms and interaction terms enter the model in a grouped manner. Thus, our model is insensitive to the noise occurring at one level of the genotype combinations G_{1} and G_{2}.

Our model imposes additive effects for these interactions. Simultaneous analysis of all interactions achieves under this model structure.
Optimization algorithm
In our Adaptive Reweighing Algorithm of Group Lasso (Algorithm 1), we need an algorithm to solve optimization problem (2) efficiently. We make use of the coordinate descent algorithm [13, 27]. The advantage of the coordinate descent algorithm is that it has a closedform solution of the least square problem when updating one group at a time. Therefore, it is suitable for largescale data analysis. For the loglikelihood of logistic regression, the iteratively reweighed least square algorithm (IRLS) is efficient: A quadratic approximation is formed to the loglikelihood based on current estimation and the least square problem is solved by the coordinate descent algorithm. This process is repeated until its convergence which is guaranteed [13]. The detail of the algorithm is given in the supplementary.
Statistical testing
The statistical testing is to determine whether the identified SNPs (i.e., the SNPs in the active set) are significantly associated with the disease
Significance tests of main effects
 1.
Refit the logistic regression model for the identified SNPs in and obtain the loglikelihood ℓ_{ full }.
 2.
Leaving SNP _{ j }∈ (j = 1, ..., s _{1}) out, fit the logistic regression models and obtain the loglikelihood ℓ_{\j}.
 3.
Obtain pvalue of SNP _{ j }using χ ^{2} test based on the value 2(ℓ_{ full } ℓ_{\j}) with the degree of freedom (df) discussed below.
 1.
 2.
 3.
The degree of freedom of the χ ^{2} test for SNP _{ j }is , where df _{ full }and df _{\j}are the effective degrees of freedom of the logistic regression models with all s _{1} SNPs and without SNP _{ j }, respectively.
It is worth mentioning that our pvalue is obtained after the selecting process in AGL fitting. The selecting process may affect the precision of pvalue estimation. We justify our pvalue by conducting null simulation later.
Significance tests of epistatic interactions
Our Tests of interactions is built upon the definition of interactions. The key point is that the main effects of the two SNPs should not be taken into account when testing their interaction effect. We conduct significance tests of epistatic interactions in the following way:
 1.
Refit the logistic regression model for the identified groups in and obtain the loglikelihood ℓ_{ full }.
 2.
For each interaction term in with index l (l = 1, ..., s _{2}), fit the logistic regression model with the main effect term replacing the interaction term and obtain the loglikelihood ℓ_{\l}.
 3.
Obtain pvalue of interaction term l using χ ^{2} tests based on the value 2(ℓ_{ full }ℓ_{\l}) with df = df _{ full } df _{\l}.
Declarations
Acknowledgements
This work was partially supported with the Grant GRF621707 from the Hong Kong Research Grant Council, grant RPC06/07.EG09, RPC07/08.EG25, and a postdoctoral fellowship award from the Hong Kong University of Science and Technology.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth AsiaPacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/11?issue=S1.
Authors’ Affiliations
References
 WTCCC: Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007, 447: 661–678. 10.1038/nature05911View ArticleGoogle Scholar
 Balding D: A tutorial on statistical methods for population association studies. Nature Reviews Genetics 2006, 7: 781–791. 10.1038/nrg1916View ArticlePubMedGoogle Scholar
 Cordell H: Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Human Molecular Genetics 2002, 11: 2463–2468. 10.1093/hmg/11.20.2463View ArticlePubMedGoogle Scholar
 Cordell HJ: Detecting genegene interactions that underlie human diseases. Nat Rev Genet 2009, 10: 392–404. 10.1038/nrg2579PubMed CentralView ArticlePubMedGoogle Scholar
 Cho Y, Ritchie M, Moore J, Park J, Lee KU, Shin H, Lee H, Park K: Multifactordimensionality reduction shows a twolocus interaction associated with Type 2 diabetes mellitus. Diabetologia 2004, 47: 549–554. 10.1007/s0012500313213View ArticlePubMedGoogle Scholar
 Nelson M, Kardia S, Ferrell R, Sing C: Combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Research 2001, 11: 458–470. 10.1101/gr.172901PubMed CentralView ArticlePubMedGoogle Scholar
 Culverhouse R, Klein T, Shannon W: Detecting epistatic interactions contributing to quantitative traits. Genetic Epidemiology 2004, 27: 141–152. 10.1002/gepi.20006View ArticlePubMedGoogle Scholar
 Millstein J, Conti D, Gilliland F, Gauderman W: A testing framework for identifying susceptibility genes in the presence of epistasis. Am J Hum Genet 2006, 78: 15–27. 10.1086/498850PubMed CentralView ArticlePubMedGoogle Scholar
 Ritchie M, Hahn L, Roodi N, Bailey L, Dupont W, Parl F, Moore J: Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer. Am J Hum Genet 2001, 69: 138–147. 10.1086/321276PubMed CentralView ArticlePubMedGoogle Scholar
 Zhang Y, Liu J: Bayesian inference of epistatic interactions in casecontrol studies. Nature Genetics 2007, 39: 1167–1173. 10.1038/ng2110View ArticlePubMedGoogle Scholar
 Tibshirani R: Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, series B 1996, 58: 267–288.Google Scholar
 Yuan M, Lin Y: Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B 2006, 68: 49–67. 10.1111/j.14679868.2005.00532.xView ArticleGoogle Scholar
 Meier L, Geer S, Buhlmann P: The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B 2008, 70: 53–71.View ArticleGoogle Scholar
 Hoggart C, Whittatker J, Iorio M, Balding D: Simultaneous Analysis of All SNPs in Genomewide and ReSequencing Association Studies. PLoS Genetics 2008, 4(7):e1000130. 10.1371/journal.pgen.1000130PubMed CentralView ArticlePubMedGoogle Scholar
 Wu T, Chen Y, Hastie T, Sobel E, Lange K: Genomewide Association Analysis by Lasso Penalized Logistic Regression. Bioinformatics 2009, 25(6):714–721. 10.1093/bioinformatics/btp041PubMed CentralView ArticlePubMedGoogle Scholar
 Marchini J, Donnelly P, Cardon LR: Genomewide strategies for detecting multiple loci that influence complex diseases. Nature Genetics 2005, 37(4):413–417. 10.1038/ng1537View ArticlePubMedGoogle Scholar
 Li W, Reich J: A Complete Enumeration and Classification of TwoLocus Disease Models. Human Heredity 2000, 50: 334–349. 10.1159/000022939View ArticlePubMedGoogle Scholar
 Velez D, White B, Motsinger A, Bush W, Ritchie M, Williams S, Moore J: A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genetic Epidemiology 2007, 31: 306–315. 10.1002/gepi.20211View ArticlePubMedGoogle Scholar
 Culverhouse R, Suarez B, Lin J, Reich T: A perspective on epistasis: limits of models displaying no main effect. Am J Hum Genet 2002, 70: 461–471. 10.1086/338759PubMed CentralView ArticlePubMedGoogle Scholar
 Dudek S, Motsinger A, Velez D, Williams S, Ritchie M: Data simulation software for wholegenome association and other studies in human genetics. Pacific Symposium on Biocomputing 2006.Google Scholar
 Moore J, White B: Tuning ReliefF for genomewide genetic analysis. Lecture Notes Computer Science 2007, 4447: 166–175. full_textView ArticleGoogle Scholar
 Yang C, He Z, Wan X, Yang Q, Xue H, Yu W: SNPHarvester: a filteringbased approach for detecting epistatic interactions in genomewide association studies. Bioinformatics 2009, 25(4):504–511. 10.1093/bioinformatics/btn652View ArticlePubMedGoogle Scholar
 Gregersen PK, Silver J, Winchester RJ: The shared epitope hypothesis: An approach to understanding the molecular genetics of susceptibility to rheumatoid arthritis. Arthritis Rheum 1987, 30: 1205–1213. 10.1002/art.1780301102View ArticlePubMedGoogle Scholar
 Mori S, Kou I, Sato H, Emi M, Ito H, Hosoi T, Ikegawa S: Association of genetic variations of genes encoding thrombospondin, type 1, domaincontaining 4 and 7A with low bone mineral density in Japanese women with osteoporosis. Journal of Human Genetics 2008, 53(8):694–697. 10.1007/s1003800803004View ArticlePubMedGoogle Scholar
 Zhao J, Jin L, Xiong M: Test for Interaction between Two Unlinked Loci. Am J Hum Genet 2006, 79(5):831–845. 10.1086/508571PubMed CentralView ArticlePubMedGoogle Scholar
 Wang K, Li M, Bucan M: PathwayBased Approaches for Analysis of Genomewide Association Studies. Am J Hum Genet 2007, 81: 1278–1283. 10.1086/522374PubMed CentralView ArticlePubMedGoogle Scholar
 Friedman J, Hastie T, Hofling H, Tibshirani R: Pathwise coordinate optimization. The Annals of Applied Statistics 2007, 1: 302–332. 10.1214/07AOAS131View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.