 Methodology Article
 Open Access
 Published:
EigenEpistasis for detecting genegene interactions
BMC Bioinformatics volume 18, Article number: 54 (2017)
Abstract
Background
A large amount of research has been devoted to the detection and investigation of epistatic interactions in genomewide association studies (GWASs). Most of the literature focuses on loworder interactions between singlenucleotide polymorphisms (SNPs) with significant main effects.
Results
In this paper we propose an original approach for detecting epistasis at the gene level, without systematically filtering on significant genes. We first compute interaction variables for each gene pair by finding its EigenEpistasis component, defined as the linear combination of Gene SNPs having the highest correlation with the phenotype. The selection of significant effects is done using a penalized regression method based on Group Lasso controlling the False Discovery Rate.
Conclusion
The method is tested against two recent alternative proposals from the literature using synthetic data, and shows good performances in different settings. We demonstrate the power of our approach by detecting new genegene interactions on three genomewide association studies.
Background
Genome Wide Association Studies (GWASs) look for genetic markers linked to a phenotype of interest. Typically, hundreds of thousands of single nucleotide polymorphisms (SNPs) are studied for a limited number of individuals using highdensity genotyping arrays. Usually the association between each SNP and the phenotype is tested using singlemarker methods. Multiple markers may also be considered, but these are typically selected using simple forwardselection methods. GWASs are a powerful tool for investigating the genetic architecture of complex diseases and have been successful in identifying hundreds of variants. However, they have been able to explain only a small proportion of the phenotypic variations expected from classical family studies [1]. A number of explanations for this missing heritability have been put forward. For example, it has been suggested that shared environments among relatives are not adequately taken into account. Another suggestion is that much larger numbers of variants with small effects remain to be identified. Rare variants, which are difficult to find using existing genotyping arrays [1], seem to be important causal factors, and so do structural variations. But complex diseases may also be caused, at least in part, by complex genetic structures with multiple interactions between markers (a phenomenon termed epistasis). Whereas in pedigree studies the genetic effect on phenotype is seen as part of the additive genetic variance, in GWASs it is seen as an unmeasured interaction between genes [2]. For example, Zuk et al. proposed a model that takes into account epistatic interaction in relation to Crohn’s disease [3]. They found that 80% of the missing heritability could be due to genetic interactions.
In recent years a number of methods for studying epistasis have been proposed and reported in various reviews [4–6]. They vary in terms of their data analysis (genomewide or filtering) and their statistical methodology (Bayesian, frequentist, machine learning or data mining). Most of them focus on singlelocus interactions, but considering interactions at the gene level can have several advantages. First, given that genes are the functional unit of the genome, results may be more biologically interpretable. Second, genetic effects are more easily detected when SNP effects are aggregated together. Third, genebased analysis simplifies the multiple testing problem by reducing the number of variables. Several genegene methods have been proposed. These are based on a summarizing step which is used to obtain information at the gene level. In more recent methods, filters or penalized models are used to make the method applicable to a large number of genes, while older methods are only applicable to two or a very limited number of genes. For the summarizing step, most methods resort to a principal components (PC) approach, but each method has its specific characteristics. We describe some of these below.
Chatterjee et al. harnessed Tukey’s onedegreeoffreedom method to investigate interaction between two genes [7]. Their method is based on the assumption that the SNPs included in each gene region act as surrogates for an underlying biological phenotype. The genotypic information for the gene region is extracted as a single component by a weighted sum of all SNPs. The weights are determined according to the SNP’s correlation with the trait. The product of the two sums is then introduced as the genegene interaction term into a logistic model, where marginal effects are represented by the respective sums. Building on this idea, Wang et al. compared two different interaction tests [8]. On the one hand, they used Principal Component Analysis (PCA) to summarize SNP information within a gene, and on the other hand they used Partial Least Squares (PLS) to extract components that summarize, first, the information among SNPs in a gene and, second, the correlation between SNPs and the outcome of interest. They then proposed an interaction test based on either the first PC or the first PLS component for each gene, and were able to show that the PCA and PLS methods often outperformed Tukey’s onedegreeoffreedom method. But it is worth noting that the main objective of these three methods was improving the detection of associations in the presence of genegene interactions, rather than identifying the interactions themselves. Other approaches based on principal component analysis have since been proposed for epistasis detection. Li et al. proposed selecting, as the gene representation, PCs that are able to explain at least 80% of the variation [9].
Genotypic data are characterized by the high correlation among markers resulting from socalled linkage disequilibrium (LD). Procedures that take LD information into account have been developed for epistasis detection. For example, He et al. proposed an approach using LD information to weight genotype scores which are then aggregated using principal components [10]. Rajapakse et al. developed a genebased test of interactions for casecontrol studies which compares LD patterns between cases and controls [11]. Using the same idea, Peng et al. used a canonical correlationbased Ustatistic model (CCU) to detect coassociation in casecontrol studies [12]. The idea is to test for two given genes the difference between canonical correlation coefficient computed by Canonical Correlation Analysis (CCA) among cases and among controls. Their work was subsequently extended to include kernel [13, 14].
However most of these methods can be applied only to a reduced number of genes. Computational constraints mean that it is not feasible to model all genegene interactions directly. One way of overcoming this is to reduce the genegene search space by eliminating unimportant genes, and to this end twostep procedures have been developed that first filter out specific genes or SNPs through a genomewide search before testing for interactions. One example of this is the modelbased kernel machine method (3GSPA) proposed by Li and Cui, which first performs a search for gene pairs contributing to the overall phenotypic variations [15]. Significant pairs are then tested for interaction effects. Another attractive alternative is offered by penalized regression methods that select a subset of important predictors out of a large number of potential predictors. These methods operate by shrinking the size of the coefficients. The coefficients of predictors with little or no apparent effect are force to be set to zero, reducing the effective degrees of freedom and in many cases making model selection possible. A few approaches using penalized models have been proposed. D’Angelo et al. combined principal component analysis and lasso penalized regression [16]. Wang et al. used a principal component analysis combined with an L1 penalty, with adaptive weights based on gene size, pathway support and effect size [17].
Here we propose a Group Lasso approach [18] that takes into account the group structure of each gene in order to detect epistasis. We introduce GeneGene EigenEpistasis (GGEE) as a new approach for computing the genegene interaction part of the model, and we compare GGEE with two different interaction variable modeling approaches inspired by previous proposals in the literature, namely PCA and PLS. An adaptive ridgecleaning approach [19] is then used in order to compute pvalues for each group.
In the next section, we detail each model and outline the design of the simulation studies performed to compare the performance of the different approaches. In the Results section, the findings of the simulation studies are shown, and we illustrate our approach on three real datasets relating to ankylosing spondylitis, thyroid carcinomas and inflammatory bowel disease. The different approaches and the results are discussed in the last section.
Methods
We consider n individuals where y=(y _{1},y _{2},…,y _{ n })^{T} denotes the vector of trait values. For each individual, genetic variants among G genes are considered. Each gene is described by a given number of SNPs p _{ g } where \(\sum _{g} p_{g}=p\). The SNP matrix \(\mathbf {X} \in \mathbb {R}^{n \times p}\) considers an additive coding scheme in which the genotype value of each SNP j from individual i is denoted X _{ ij }∈{1,2,3}. X _{ i } is a pdimensional vector of covariates for observation i and for j∈{1,…,p}. X ^{g} denotes the submatrix of X whose columns are the p _{ g } SNPs of gene g. A generalized linear model is generally assumed for GWAS, where the phenotype is considered as a random variable y _{ i } whose conditional expectation can be written as a function of the covariates X _{ i } and their interactions Z _{ i },
where
and Z _{ i } is the ith line of the matrix of interactions and γ a parameter vector of appropriate dimension. When the phenotype is binary (case control study), it is usual to assume a logistic model where g() is the logit and Y is assumed to follow a binomial distribution. Below we will consider only quantitative phenotypes using a classical linear model. In this case g() is the identity and the residuals are assumed to be Gaussian.
The main effect of each gene is modeled through the sum of the effects of all its SNPs. Concerning interaction effects, we compute new variables representing interactions between two specific genes and define as a group all the interaction variables related to a given pair of genes. The matrix of interaction is thus structured into G(G−1)/2 submatrices:
where Z ^{rs} describes the interactions between the two genes r and s. The parameter vector γ is accordingly structured into subvectors γ ^{rs}. We will now present and compare three different approaches for modeling genegene interactions.
Modeling genegene interactions
Let us consider two genes r and s described respectively by p _{ r } and p _{ s } SNPs. A possible interaction term describing the epistasis between the two genes is
We hereafter set \(\boldsymbol {W}^{rs}=\{ X_{ij}^{r} X_{ik}^{s} \}_{i=1\cdots n}^{j=1,\cdots,p_{r}; k=1,\cdots,p_{s}}\). In this case the submatrix of interactions is Z ^{rs}=W ^{rs} and \(\boldsymbol {\gamma }^{rs}=\{\gamma _{jk}^{rs}\}\) is a vector of size p _{ r } p _{ s }. The number of parameters in such a model is obviously too large to be reliably estimated. For this reason a number of papers in the literature consider reducing the dimension of γ.
In this paper we will consider three different methods for reducing the dimension reduction, namely Principal Component Analysis (PCA), Partial Least Squares (PLS), and our proposed GeneGene EigenEpistasis approach that we have termed GGEE.
Principal component analysis
Principal Component Analysis (PCA) can reduce the number of variables describing each gene r from p _{ r } to q _{ r }<p _{ r }. Considering gene r described by p _{ r } SNPs, we compute the matrix of the first q principal components
where U ^{r} is the matrix of the first q _{ r } principal axes. Using C ^{r} and C ^{s} instead of X ^{r} and X ^{s} in the computation of the interaction allows the number of parameters relative to each interaction to be controlled. This control is achieved by choosing the number of principal components q. The PCA model that we describe draws upon ideas in [20]. The interaction term takes the form
Relating this expression to the general form of the interaction term \(\boldsymbol {W}_{i}^{rs}\) described above, we can see that performing PCA prior to computing the interactions is a means of constraining the linear interaction term of Eq. 1.
The submatrix of interactions is \(\boldsymbol {Z}^{rs}=\{ C_{ij}^{r} C_{ik}^{s} \}_{i=1\cdots n}^{j=1, \cdots, q;k=1, \cdots, q }\), and \(\boldsymbol {\gamma }^{rs}=\{\gamma ^{rs}_{jk}\}\) is a vector of size q ^{2} describing the interaction between genes r and s. In particular, if a single principal component is chosen, there will be only one parameter to estimate per interaction.
Partial least squares
Wang et al. proposed an alternative method for integrating interactions using a PLS approach [8]. Let (X ^{r},X ^{s}) be the genotypic matrix for the given pair of genes (r,s). Their approach computes the components that maximize cov ^{2}(X ^{r} u,T v), with T=(y,X ^{s}) and (u,v) the weight vectors. The interaction of a couple of genes (r, s) is then represented by the first q components:
In this approach phenotypic information is retained when the interaction variables are constructed.
Genegene EigenEpistasis
We propose an original approach for modeling interactions. The general idea is to consider the interaction variable between the two genes r and s as a function f _{ u }(X ^{r},X ^{s}) parameterized by u. One way to estimate u is to maximize the correlation between the interaction function and the phenotype:
If we consider the function f to be linear, our problem becomes easily tractable and has only one solution. Setting
where \(\boldsymbol {W}^{rs}=\{ X_{ij}^{r} X_{ik}^{s} \}_{i=1\cdots n}^{j=1 \cdots, p_{r}; k=1,\cdots, p_{s}}\) and \(\boldsymbol {u} \in \mathbb {R}^{p_{r} p_{s}}\) we obtain the following problem:
The solution u is the eigenvector corresponding to the largest eigenvalue of the matrix W ^{rs} ^{T} y y ^{T} W ^{rs}, which is the vector W ^{rs} ^{T} y. The complexity of computing u is therefore in O(np _{ r } p _{ s }). We then use the projection of the matrix W ^{rs} on u as the interaction variable. The resulting EigenEpistasis vector Z is the linear combination of all the SNPSNP interactions being the most correlated with the phenotype. In its construction, GGEE has similarities with PLS. The main difference lies in the original design matrix. PLS searches for components that maximize cov ^{2}(X ^{r} u,y X ^{s} v), whereas GGEE retains the component that maximizes cov ^{2}(y,W ^{rs} u), with W ^{rs} the matrix of all pairwise interaction between the two genes r and s. Like PLS, GGEE takes phenotypic information into account in the construction of the interaction variables. Other methods as such as CCU [12] and the kernel versions of CCU [13, 14] that we referred to in the introduction also consider the phenotype in their construction, but these methods can be applied only to casecontrol problems.
Estimation of coefficients
We propose a Group Lasso approach [18] for estimating the parameters of linear or logistic (case control) regression. A group comprises either the SNPs of a given gene, or interaction terms relative to a given genepair interaction. In the particular case of linear regression, the model parameters are estimated by:
The parameter λ is selected by crossvalidation.
In order to improve estimation accuracy and to obtain pvalues for each of the selected groups, we use the adaptive ridge cleaning approach proposed by Bécu et al. [19]. This screen and clean procedure is a twostage method. The group lasso model is first fitted on half of the data. The coefficient of the candidate groups selected by the model are then introduced into a ridge regression model fitted on the second half of the data with a specific penalty that allows the group structure to be taken into account. For each group the significance of the regression coefficients is estimated using permutation tests.
Simulation design
To evaluate the performance of the proposed approach, we conducted two simulation studies, the first using simulated data and the second using a real dataset relating to ankylosing spondylitis. In each case we compared the proposed GGEE model to the two other interaction variable modeling approaches. The first simulation corresponds to a simplified context where all parameters were controlled and external interference limited, while the second simulation corresponds to a realistic context with a realistic pattern of minor allele frequency (MAF) and LD.
Design
Genotypes Our first (simplified) simulation study was adapted from the model used in [21] with an extension to control the MAF of each SNP. The n lines of the genotype matrix are an i.i.d. sample from a multivariate random vector \(\boldsymbol {X}_{i}\sim \mathcal {N}_{p} (\boldsymbol {0},\boldsymbol {\Sigma })\). The correlation matrix Σ is block diagonal, each block corresponding to a gene. Two variables belonging to the same gene are correlated at level ρ=0.8 while all other correlations are null. Each SNP (column of the genotype matrix) is randomly assigned an MAF p from a uniform distribution between 0.05 and 0.5. An MAF value of 0.2 is assigned to all causal SNPs. The genotype frequencies derived from the HardyWeinberg equation are then used to discretize X _{ ik } values to 0, 1 or 2. In practice, X _{ ik } is set to 1 if \(X_{ik}<q_{p^{2};N(0,1)}\phantom {\dot {i}\!}\), X _{ ik } is set to 3 if \(X_{ik}<q_{(1p)^{2};N(0,1)}\phantom {\dot {i}\!}\) and X _{ ik } is set to 2 otherwise.
In the second (realistic) simulation study using a real ankylosing spondylitis dataset, genes are randomly selected. The number of SNPs composing each genes varies according to the selection.
Phenotypes For both simulation studies, we generated phenotype vectors using two different schemes. Our first scheme corresponds to the model proposed by Wang et al. [17] (which, for the sake of brevity, we will refer to hereafter as the “Wang Pathway” model):
where \(\mathcal {C}\) and \(\mathcal {C}^{2}\) are respectively the set of causal SNPs and causal interactions, and ε _{ i } a random Gaussian variable. For each causal gene g, we consider two causal SNPs and a coefficient β _{ g } is assigned to the standardized sum of these causal SNPs. In the same way, for the interactions, all the causal SNPs from a causal pair (r,s) are pairwise multiplied and a coefficient γ _{ rs } is assigned to the standardized sum of the product.
Our second scheme for simulating phenotypes is based on the following model:
The difference with the first model concerns the simulation of the interaction effect. In the second model the interaction effect for a causal couple (r,s) is defined as the product of the first PCA component \(\boldsymbol {C}_{.1}^{r}\) of gene r and the first PCA component \(\boldsymbol {C}_{.1}^{s}\) of gene s.
In both models, β _{0} is set to 0, and ε _{ i } are generated independently from a \(\mathcal {N}(0,\sigma ^{2})\), with σ ^{2} determined from the coefficient of determination R ^{2} that calibrates the strength of the association. Both simulation models can be written as \(y_{i} = \boldsymbol {X}_{i}^{T} \boldsymbol {\beta } + \boldsymbol {Z}_{i}^{T} \boldsymbol {\gamma } + \epsilon _{i}\) where X the marginal effect genotype matrix and Z the interaction effect matrix.
Let us denote \(\boldsymbol {Q} \boldsymbol {\phi }= [\boldsymbol {X}, \boldsymbol {Z}] \left [\begin {array}{c} \boldsymbol {\beta } \\ \boldsymbol {\gamma } \end {array}\right ]\) and
We remark that:
Thus, replacing \(\hat {\text {var}}(\epsilon _{i})\) by σ ^{2}, and \(\hat {\text {cov}}(\epsilon _{i},\boldsymbol {Q_{i}\phi }\bar {y})\) by −σ ^{2}/n, we obtain \(R^{2} \approx \frac {\sum (\boldsymbol {Q_{i}\phi }  \bar {y})^{2}}{\sum (\boldsymbol {Q_{i}\phi }  \bar {y})^{2} + \mathrm {n}\sigma ^{2} 2\sigma ^{2}}\). This relation between R ^{2} and σ ^{2} gives us an expression for σ ^{2} that depends on R ^{2}, \(\sigma ^{2} = \frac {(R^{2}  1) \sum (\boldsymbol {Q_{i}\phi }  \bar {y})^{2}}{R^{2} (2n)}\).
We looked at how much of the coefficient of determination R ^{2} is explained by main effects, and how much is explained by interaction effects, in order to determine their respective roles in the model.
For a similar reason, when simulating phenotypes, Wang et al. [17] examined how much of partial R ^{2} was due to interaction effects. They selected coefficient values so that 30% of the partial R ^{2} was explained by interaction effects. Li and Cui [15] did not use the R ^{2} directly, but they simulated data assuming different proportions of interaction effects among the total genetic variance. In our study, once the phenotype y had been set for each simulated design matrix, we computed how much of the R ^{2} could be attributed to interaction and main effects as \(p_{I}=\frac {R^{2}_{I}}{R^{2}_{T}}\) and \(p_{M}=\frac {R^{2}_{M}}{R^{2}_{T}}\) respectively, with \(R^{2}_{I}\) the Rsquare value for the model containing only simulated interaction effects, \(R^{2}_{M}\) the Rsquare value where there were only simulated main effects, and \(R^{2}_{T}\) Rsquare value where there were both simulated main effects and simulated interaction effects.
Scenarios In the first (simplified) simulation study, genotypes are simulated as described in the design. We considered six genes, each composed of six SNPs and for 600 subjects. We define one causal interaction between genes and two causal genes with main effects, and the simulation takes place using two alternative simulation settings:

(1) one interaction and two main effects involving the same genes

(2) one interaction and two different main effects
For these two settings, different coefficients of determination, from 0.05 to 0.7, are considered and 1000 iterations are performed.
In the second (realistic) simulation study, genotypes come from a real dataset comprising 763 individuals. At each iteration we randomly select six genes of various size (from 1 to 1119 with a median of 2 SNPs) in the dataset. We consider the five following settings:

(1) one interaction and two main effects involving the same genes

(2) one interaction and two different main effects

(3) one interaction effect only

(4) two main effects only

(5) no effects
For each setting, coefficients of determination, from 0.1 to 0.4, are considered and 500 iterations are performed.
For both simulation studies, main effects and interaction effects are weighted with the same coefficient values (β _{ g }=γ _{ rs }=2,∀g,r,s). For each interaction, the power is estimated as the proportion of detected interactions over the total number of simulations.
Real data illustration
To illustrate our approach we applied the proposed method on three real datasets related to ankylosing spondylitis, thyroid carcinomas and inflammatory bowel disease.
The dataset regarding ankylosing spondylitis consists of the French subset of the large study of the International Genetics of Ankylosing Spondylitis (IGAS) study [22]. For this subset, unrelated cases were recruited through the Rheumatology clinic of Ambroise Paré Hospital (BoulogneBillancourt, France) or through the national selfhelp patients’ association: "Association Française des Spondylarthritiques". Populationmatched unrelated controls were obtained from the "Centre d’Etude du Polymorphisme Humain", or were recruited as healthy spouses of cases. The protocol was reviewed and approved by the Ethics committee of the Ambroise Paré hospital. All participants gave their informed consent to the study. The application on thyroid carcinomas was carried out on a public dataset that came from the study of LuzónToro et al. on identification of epistatic interactions in two different types of thyroid carcinomas [23]. Finally, we used the Wellcome Trust CaseControl Consortium genomewide association dataset to study Inflammatory Bowel Disease.
Results
Simulation studies
In the following, we will refer to the different simulation settings by using letters as described in Table 1.
Results from the simplified simulation study
Figure 1 shows results obtained for the two settings. The first column gives the estimated power to detect the gene interaction as a function of the R ^{2} values. The last two columns show heatmap matrices reflecting the proportion of significant values for each variable and each method over the 1000 simulations for different R ^{2} values.
In the first setting (Fig. 1 a, b), we consider genes 1 and 2, both having main and interaction effects. When the phenotype is simulated using the Wang Pathway model, the GGEE and PLS methods have a higher power to detect the interaction effect than PCA method, which tends to identify only the two main effects of the two genes (Fig. 1 a). Whereas for PCA and PLS the power is nondecreasing with R ^{2}, for GGEE we observe a Ushaped curve. For the smallest R ^{2} values, which correspond to the most difficult cases, the power of GGEE to detect the interaction tends to decrease. When R ^{2} values reach 0.4, GGEE’s power to detect the interaction starts to increase. The situation is different for the main effects, since GGEE’s power to detect these increases continuously with R ^{2} [see Additional file 1]. For PLS, the power to detect the interaction effect is continuously nondecreasing. Note, however, that for this method one of the two main effects (here gene 1) is detected to the detriment of the second, regardless of the value of R ^{2}. In the PCA phenotype simulation model (Fig. 1 b), GGEE has a higher power than the other methods to detect interaction effects while retaining a good specificity, whatever the value of R ^{2}. The reasonably high power of the PCA method can be explained by the similarity between the phenotype simulation model and the estimation model. It is worth noting that in this first setting, only a few variables are falsely significant, which reflects a good specificity for all methods (the worst being for the gene 3 × gene 4 interaction variable in the case of the Wang Pathway model and r ^{2}=0.1, where the false discovery rate is 0.068).
In the second setting (Fig. 1 c, d), genes 1 and 2 have only main effects, and genes 3 and 4 have only an interaction effect. When the phenotype is simulated using the Wang Pathway model, the interaction power of GGEE is uniformly higher than that of the other methods (Fig. 1 c). For all values of R ^{2}, PCA tends to detect false main effects for genes 3 and 4, but not to detect interaction effects. In the PCA phenotype simulation (Fig. 1 d), PCA has a good power to detect interaction effects, but once again these good performances can be explained by the similarity between the simulation model and the estimation model. The interaction power for GGEE is lower, but still good. With this model, only GGEE tends to attribute a false main effect to genes 3 and 4. In this second setting, whatever the phenotype simulation model, the power of the PLS method is almost null. PLS identifies only the first gene as having a main effect, while the effects of genes 3 and 4 are not detected, whether as main or as interaction effects. Moreover, PLS tends to attribute a false interaction effect between genes 1 and 2.
To evaluate the performances of the different methods in a more complex context, we also consider a setting where we simulate 25 genes with four causal interactions between genes, and two genes with causal main effects. In these simulations, interaction genes are different from main effect genes, and we only consider the case where R ^{2}=0.7. The results of this setting reflect the good performance of the GGEE method over PCA and PLS in detecting interaction in a context where further interactions and different main effects are simulated [see Additional file 2].
Results from the realistic simulation study
Figures 2 and 3 show results for the first three settings. Fig. 2 shows the power to detect gene interaction depending on the R ^{2}, and Fig. 3 shows heatmaps of significant effects when R ^{2}=0.2. In both figures, the upper row relates to phenotypes simulated using the Wang Pathway model, and the lower row to phenotypes simulated using the PCA model.
In the first setting (Fig. 2 a, b, Fig. 3a, b), the same two genes (genes 1 and 2) are simulated with main and interaction effects. In this setting GGEE has the best power to detect interaction effects for all R ^{2} values. The interaction power of PLS remains close to 0.3. The power of PCA depends on the phenotype simulation model. When the phenotype is simulated using the Wang Pathway model, the power is similar to PLS. When it is simulated using PCA it increases continuously, because of the similarity between the phenotype simulation model and the estimation model. Looking at the heatmaps (Fig. 3a, b) we can see that only a few variables are falsely significant. We also observe that unlike GGEE, PCA and PLS can detect the two simulated main effects, with a preference for gene 1 in the case of PLS. These results are obtained when R ^{2}=0.2, but other R ^{2} values give similar results [see Additional file 3].
In the second setting (Fig. 2 c, d, Fig. 3c, d), main effects are simulated for genes 1 and 2, and one interaction is simulated between genes 3 and 4. In this setting the power of PLS to detect interaction effects is almost null, while the respective powers of PCA and GGEE are different, according to which phenotype simulation model is used. Both methods have a higher power when the phenotype is simulated using the PCA model. Regarding the detection of main effects, the results are similar to the first setting, with GGEE less successful than PCA and PLS (Fig. 3c, d). But unlike in the first setting, here some variables are falsely significant. False detections among interaction variables are more pronounced for GGEE and concern genes that have been simulated to have only main effects. False detections among main effects are more pronounced for PCA and PLS when the phenotype is simulated using the Wang Pathway model and concern genes that have been simulated to have an interaction effect. Under the PCA phenotype model, false detections among main effects are more pronounced for PLS and GGEE when R ^{2} values are higher [see Additional file 3].
In the third setting, where only one interaction is simulated between genes 1 and 2, GGEE has a higher power to detect interaction than PLS and PCA when the phenotype is simulated using the Wang Pathway model (Fig. 2 e). The power of PCA is higher in the PCA phenotype simulation model because of its similarity to the estimation model, whereas the power of PLS is almost null (Fig. 2 f). In the Wang Pathway phenotype simulation model, PCA and PLS both falsely detect main effects. In the PCA phenotype simulation model, the false detections are made by PLS and GGEE (Fig. 3e, f). In all cases these false detections concern genes that are simulated to have an interaction effect.
Figure 4 shows the results for the fourth and fifth settings. The heatmap on the left corresponds to the fourth setting, where only two main effects are simulated. We remark that all methods successfully identify the main effect, PCA and PLS doing so with a higher power. False detections corresponding to the respective interaction effects are observed for GGEE, and to a lesser extent for PLS. The figure on the right corresponds to the fifth setting, where no specific effects are simulated and the result shows that all three methods perform well with very few false detections.
In all settings, estimating the coefficients with the group lasso is more computationally expensive than constructing the interaction variables. GGEE and PCA are quite similar in terms of computation time, whereas in some settings PLS has a slightly greater execution time than other methods. Note that the time required by GGEE for constructing the interaction variables varies according to the number of SNPs that constitute each gene [see Additional file 4].
Percentage of R ^{2} attributable to interaction and main effects respectively
Using each setting in both simulation studies, we determine the p _{ I } and p _{ M } average values that correspond to the proportion of the R ^{2} attributed to interaction and main effects, respectively. For most settings, the p _{ I } depends on the number of simulated effects. With one interaction and two main effects simulated the R ^{2} part attributable to interaction effects is around 33% (Table 2 (B, C, D), Table 3 (C, D)). For the setting with numerous effects [see Additional file 2], the average p _{ I } is 67% because we consider four interaction effects for only two main effects. Finally, as expected, when only interaction effects are simulated, the average p _{ I } is 100% (Table 3 (E, F)) and 0% when only main effects are simulated (Table 3 (OME)). However, the R ^{2} distribution between main and interaction effects is not distinguishable in the setting where the phenotype is simulated using the Wang Pathway model with the same main and interaction effects. The p _{ I } and p _{ M } values are all above 90% (Table 2 (A), Table 3 (A)). In the second simulation study, the R ^{2} distribution is also not well divided between main and interaction effects when the phenotype is simulated under the PCA model, though p _{ M } is still higher than p _{ I } (Table 3 (B)).
Real data illustrations
Ankylosing spondylitis
Ankylosing spondylitis (AS) is a common form of inflammatory arthritis predominantly affecting the spine and pelvis. It occurs with a prevalence of 0.1% to 1.4% depending on the considered population [24]. Genetic factors account for more than 90% of the risk of susceptibility to AS. Human leukocyte antigen (HLA) class I molecule HLA B27, belonging to the Major Histocompatibility Complex (MHC) region, was the first genetic risk factor identified as associated with ankylosing spondylitis in the 1970’s [25, 26] and remains the most important risk locus for this pathology. Despite the strong association only a small portion of HLAB27 carriers develop the disease. Furthermore, studies in families suggest that less than 50% of the overall genetic risk is due to HLAB27, which suggests that other genetic factors are involved [27]. A number of updated reviews on AS genetics, including genomewide association study (GWAS) results, identified new ankylosing spondylitissusceptibility genes outside of the MHC region [28, 29].
We applied all the methods described above to the AS dataset. The data contain 408 cases and 358 controls, and each individual was genotyped for 116, 513 SNPs with Immunochip technology. For each SNP we obtained detailed genetic information, such as gene affiliation, with the NCBI2R package [30] which annotates lists of SNPs with current information from NCBI. We considered only SNPs located within a single gene in order to form gene groups without overlap. We focused our analysis on a list of 29 genes previously identified as having a main effect in GWAS.
The three methods tested yield different results, and only the PLS and GGEE methods identify interactions. PCA detects only the main effect HLAB and identifies no interactions. PLS detects the main effect HLAB, but also identifies one interaction effect between the genes EOMES and BACH2. Our method GGEE does not detect any main effects, but it shows two significant interactions, the first between the genes HLAB and SULT1A1 and the second between IL23R and ERAP2.
Thyroid carcinomas
Thyroid cancers are thought to be related to a number of environmental and genetic predisposing factors and can be classified in various types and subtypes. Most association studies have focused on main effects but only a limited number of genes were identified. Recently, some papers focus on the detection of epistatic interactions [23, 31]. We applied our proposed approach on the two data sets used in LuzónToro et al. [23] regarding two rare tumours, sporadic medullary thyroid carcinoma (sMTC) and juvenile papillary thyroid carcinoma (jPTC). Affymetrix GenomeWide Human SNP 6.0 arrays were used to hybridized DNA. The data set related to sMTC contains 66 cases and the jPTC data set 30 cases. The same 125 healthy controls and 232, 607 SNPs were used for both studies. As for the ankylosing spondylitis dataset, we obtained gene affiliation for each SNP with the NCBI2R package and considered only SNPs located within a single gene. We focused the analysis of the sMTC data set on a list of 10 genes, 3 of these genes (CHFR, AC016582.2 and C8orf37) were chosen following the conclusions of LuzónToro et al., the others because they contained markers that were susceptible to be associated with the disease from univariate analysis. The analysis of the jPTC data set was realized on a list of 20 genes among them we can cite DIO, RP11648K4.2, LOXL1, DMGDH, PAX8 and STK17B from which epistatic interaction were already detected (even if the interaction between PAX8 and STK17B was identified in a study concerning papillary thyroid and not the juvenile form). The 14 others genes contained susceptible associated individual markers from univariate analysis. Regarding the sMTC study, GGEE identifies one interaction between genes NCK1 and TRIQK. PCA detects only one main effect for the gene TRIQK whereas none effects were identified with PLS. Concerning the jPTC data set, 3 interactions were identified by GGEE (NCAM1 and MNDA, MNDA and STK17B, LOC105370481 and STX3). PLS identifies 2 interactions (LOC105370236 and LOC105370481, LOC105370236 and PIKFYVE) and PCA detects only one main effect for the gene LOC105370481. We note that the effects detected with our approach concerned different genes that the ones identified in the presented previous studies (except for the gene STK17B). More analysese are needed to better understand these differences.
Inflammatory bowel disease
Although the etiology of Inflammatory Bowel Disease (IBD) is not completely understood, previous studies have underlined the contribution of an important genetic susceptibility. Recently, MartinezChamorro et al. [32] detected an epistatic interaction between the genes NOD2 and TLR4. We applied our approach to the Wellcome Trust CaseControl Consortium genomewide association dataset for Inflammatory Bowel Disease. The data contains 1949 case for 159 960 SNPs genotyped by Affymetrix. The control group was constituted of 1972 individuals from the Wellcome Trust CaseControl Consortium genomewide association dataset for hypertension. As for the two previous real data analysis, we obtained gene affiliation for each SNP with the NCBI2R package and considered only SNPs located within a single gene. The analysis was realized on a list of 22 genes that contain SNPs that are suspected to be associated with IBD from an univariate analysis. The two genes NOD2 and TLR4 were added to the list as they were previously detected as having an epistatic interaction. GGEE identifies one interaction between the genes LOC105376008 and CACNB2 whereas PCA detects 9 main effects (IL23R, PODN, ATG16L1, C5orf56, DNAH11, LOC105378282, HSD17B12, LINC00558, ADCY4) but none interaction. Finally PLS identifies 3 main effects for the genes IL23R, PODN and DNAH11 as well as 2 interactions the first one between the genes PODN and FCRLA, the second one between PVT1 and NOD2.
Discussion
The results obtained in both simulation studies point to a certain confusion between main and interaction effects. When simulated interaction and main effects involve different genes, the methods tend to detect as interaction effects the pairs of genes simulated to have main effects and, conversely, to detect as main effects the genes simulated to having interaction effects.
Overall, GGEE tends to detect more false interactions than false main effect whereas PLS and PCA tend to detect more false main effects though PLS tends to attribute a false interaction effect between genes 1 and 2. This type of confusion may explain the Ushaped power curve for GGEE observed in the first simulation study (Fig. 1 a). As the problem becomes harder, the genetic effects of both genes are preferentially assigned to the interaction effect, implying a better power to detect interaction where R ^{2} values are small. Finally, we remark that for GGEE false detections of main effects are more frequent when the PCA phenotype simulation model is used, whereas for the PLS and PCA methods, where the number of false detections for main effects is higher when the Wang Pathway phenotype simulation model is used.
Other observations regarding the power of the different methods can be made with these simulation results. PLS has more trouble than PCA and GGEE in detecting interaction effects, and has a tendency to detect the first main effect with a higher power than the second main effect when two main effects are simulated. For all methods, the power to detect interactions increases more slowly with respect to R ^{2} when simulations are performed using real data genotypes than with fully simulated genotypes, but we observe that in the first setting the curve representing the interaction power of GGEE is detached from the others, reflecting the superior performance of GGEE over PLS and PCA. Note that the power of GGEE to detect main effects is always less than that of PCA and PLS when R ^{2}<0.4 [see Additional file 1 and 5]. In short, GGEE performs better when detecting interactions than when detecting main effects.
Conclusions
In this paper we compared different approaches for modelling genegene epistasis in a penalized regression framework. Our primary concern was the detection of interaction effects, and for this purpose we defined a general model and tested different interaction terms. We focused our analysis at the gene scale and compared three ways to design the interaction term. Some methods were inspired by previous proposed approaches based on dimensional reduction methods including Principal Component Analysis (PCA) and Partial Least Square analysis (PLS). We additionally proposed a new interaction modeling approach that we called GeneGene EigenEpistasis (GGEE), where one interaction variable is built for each couple. The interaction variable was defined based on a criterion that maximizes the covariance between the phenotype and the pairwise SNP product matrix of the two genes. The interaction components were then introduced in a Group Lasso penalized regression model that takes the gene structure into account and is capable of handling a large number of genes simultaneously.
A power study of the different methods based on two different simulation schemes (simplified and realistic) provided us with a rich body of information. Across various papers in the literature we find comparisons of similar methods that use different phenotype simulation settings. In the present work we compared two simulation models. Our first model was from a previous study [17] that simulated the interaction component of each couple in an SNP pairwise product fashion. Our second model defined the interaction component as a pairwise product of representative variables of each gene. Overall the GGEE method performed well in detecting interactions in all the settings that were tested, although it was not always able to do so in the settings where main and interaction effects involved different genes. The power of the PCA method is highly dependent on the phenotype simulation model, because of the similarity between the second phenotype simulation model and the estimation model of the PCA method. The PLS method is characterized by a lack of power in detecting interactions. PLS performs well only when the related main effects are also present. When the simulated main and interaction effects do not concern the same genes, the detection capability of the PLS approach collapses dramatically.
For all methods we observed a confusion phenomenon when active genes are not simulated with both main and interaction effects. False detections of interactions concern genes that were simulated to have main effects, and false detections of main effects concern genes simulated to have interaction effects. This phenomenon reveals the difficulty that all methods encounter in clearly distinguishing the different types of effects. There are more false main detections when using methods such as PCA and PLS that are better at detecting main effects (except when the phenotype is simulated using the PCA model). As for interaction effects, the GGEE methods make more false interaction detections than PCA and PLS.
When genotypes are fully simulated in the simplified simulation study, the GGEE and PCA approaches performed better when the PCA phenotype simulation model was used, whereas the PLS method was not very sensitive to the choice of phenotype simulation model. Unlike PCA and PLS, GGEE is better at detecting interaction effects than at detecting main effects when simulations use a real data set. Since the simulation study using realistic data is meant to mimic real genotype data structure, we conclude that in a real context GGEE will be better at detecting interaction effects than main effects.
In comparison to SNPSNP interaction approaches, the genescale dimension of our proposed method means that considerably fewer interaction variables need to be considered within a genetic region. This reduction in problem size allows larger problems to be handled. Moreover, a penalized regression method allows a true multivariate approach over a larger number of genes. It also extends other proposed genescale approaches, such as that presented by Wang et al. [8]. The ability to handle a relatively large number of genes simultaneously makes the detection of interactions between different genetic regions possible. This might be useful as an initial step, prior to using SNPSNP interaction methods that may provide more accurate information.
As the GGEE method is not able yet to consider all human genes at the same time, it is necessary to specify a list of genes to be explored for potential interactions. Given that its power to detect main effects is low, for the detection of main effects it will be safer to use previously acquired knowledge of the genetic effects, or to use a preprocessing method. Another limitation of the method is gene size. Computing the gene EigenEpistasis vector for two genes of size p _{ r } and p _{ s } requires an n×(p _{ r } p _{ s }) matrix to be computed.
Prospects are improving the GGEE method’s performance by optimizing the computational cost and exploring new interaction functions to be plugged into the GGEE criterion.
Abbreviations
 3GSPA:

modelbased kernel machine method
 AS:

Ankylosing spondylitis
 CCA:

Canonical correlation analysis
 CCU:

Canonical correlationbased Ustatistic model
 GGEE:

Genegene EigenEpistasis
 GWAS:

Genomewide association studies
 HLA:

Human leukocyte antigen
 LD:

Linkage disequilibrium
 MAF:

Minor allele frequency
 MHC:

Major histocompatibility complex
 PC:

Principal components
 PCA:

Principal component analysis
 PLS:

Partial least squares
 SNP:

Singlenucleotide polymorphism
References
 1
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al.Finding the Missing Heritability of Complex Diseases. Nature. 2009; 461(7265):747–53.
 2
Haig D. Does Heritability Hide in Epistasis between Linked SNPs?. Eur J Hum Genet. 2011; 19(2):123.
 3
Zuk O, Hechter E, Sunyaev SR, Lander ES. The Mystery of Missing Heritability: Genetic Interactions Create Phantom Heritability. PNAS. 2012; 109(4):1193–8.
 4
Niel C, Sinoquet C, Dina C, Rocheleau G. A survey about methods dedicated to epistasis detection. Front Genet. 2015; 6:285.
 5
Wei WH, Hemani G, Haley CS. Detecting Epistasis in Human Complex Traits. Nat Rev Genet. 2014; 15(11):722–33.
 6
Steen KV. Travelling the World of GeneGene Interactions. Brief Bioinformatics. 2012; 13(1):1–19.
 7
Chatterjee N, Kalaylioglu Z, Moslehi R, Peters U, Wacholder S. Powerful Multilocus Tests of Genetic Association in the Presence of GeneGene and GeneEnvironment Interactions. Am J Hum Genet. 2006; 79(6):1002–16. Turkey’s 1df.
 8
Wang T, Ho G, Ye K, Strickler H, Elston RC. A Partial LeastSquare Approach for Modeling GeneGene and GeneEnvironment Interactions When Multiple Markers Are Genotyped. Genet Epidemiol. 2009; 33(1):PLS approach.
 9
Li J, Tang R, Biernacka JM, de Andrade M. Identification of GeneGene Interaction Using Principal Components. BMC Proceedings. 2009; 3(Suppl 7):S78. PC.
 10
He J, Wang K, Edmondson AC, Rader DJ, Li C, Li M. GeneBased Interaction Analysis by Incorporating External Linkage Disequilibrium Information. Eur J Hum Genet. 2011; 19(2):164–72. PC Framework.
 11
Rajapakse I, Perlman MD, Martin PJ, Hansen JA, Kooperberg C. Multivariate Detection of GeneGene Interactions. Genet Epidemiol. 2012; 36(6):622–30. CLD.
 12
Peng Q, Zhao J, Xue F. A GeneBased Method for Detecting Genegene CoAssociation in a Case–control Association Study. Eur J Hum Genet. 2010; 18(5):582–7. CCU.
 13
Larson NB, Jenkins GD, Larson MC, Vierkant RA, Sellers TA, Phelan CM, et al.Kernel Canonical Correlation Analysis for Assessing GeneGene Interactions and Application to Ovarian Cancer. Eur J Hum Genet. 2014; 22(1):126–31. KCCA.
 14
Yuan Z, Gao Q, He Y, Zhang X, Li F, Zhao J, et al. Detection for GeneGene CoAssociation via Kernel Canonical Correlation Analysis. BMC Genet. 2012; 13:83. KCCU.
 15
Li S, Cui Y. GeneCentric Gene–gene Interaction: A ModelBased Kernel Machine Method. Ann Appl Stat. 2012; 6(3):1134–61. 3GSPA.
 16
D’Angelo GM, Rao D, Gu CC. Combining Least Absolute Shrinkage and Selection Operator (LASSO) and PrincipalComponents Analysis for Detection of GeneGene Interactions in GenomeWide Association Studies. BMC Proc. 2009; 3(Suppl 7):PCALASSO.
 17
Wang X, Zhang D, Tzeng JY. PathwayGuided Identification of GeneGene Interactions. Ann Hum Genet. 2014; 78(6):Pathway guided.
 18
Yuan M, Lin Y. Model Selection and Estimation in Regression with Grouped Variables. J R Stat Soc Series B. 2006; 68:49–67.
 19
Bécu JM, Grandvalet Y, Ambroise C, Dalmasso C. Beyond support in twostage variable selection. Statistics and Computing. 2017; 27:169–179.
 20
Zhang F, Wagener D. An Approach to Incorporate Linkage Disequilibrium Structure into Genomic Association Analysis. J Genet Genomics. 2008; 35(6):381–385. PCLR.
 21
Wu TT, Chen YF, Hastie T, Sobel E, Lange K. GenomeWide Association Analysis by Lasso Penalized Logistic Regression. Bioinformatics. 2009; 25(6):714–21. Lasso penalized logistic regression.
 22
Cortes A, Hadler J, Pointon JP, Robinson PC, Karaderi T, Leo P, et al. Identification of Multiple Risk Variants for Ankylosing Spondylitis through HighDensity Genotyping of ImmuneRelated Loci. Nat Genet. 2013; 45(7):730–8.
 23
LuzónToro B, Bleda M, Navarro E, GarcíaAlonso L, RuizFerrer M, Medina I, et al.Identification of Epistatic Interactions through GenomeWide Association Studies in Sporadic Medullary and Juvenile Papillary Thyroid Carcinomas. BMC Med Genomics. 2015; 8(1):83.
 24
Sieper J, Braun J, Rudwaleit M, Boonen A, Zink A. Ankylosing Spondylitis: An Overview. Ann Rheum Dis. 2002; 61(Suppl 3). iii8.
 25
Schlosstein L, Terasaki PI, Bluestone R, Pearson CM. High Association of an HLA Antigen, W27, with Ankylosing Spondylitis. N Engl J Med. 1973; 288(14):704–6.
 26
Woodrow JC, Eastmond CJ. HLA B27 and the Genetics of Ankylosing Spondylitis. Ann Rheum Dis. 1978; 37(6):504–9.
 27
Thomas GP, Brown MA. Genetics and Genomics of Ankylosing Spondylitis. Immunol Rev. 2010; 233(1):162–80.
 28
Tsui FW, Tsui HW, Akram A, Haroon N, Inman RD. The genetic basis of ankylosing spondylitis: New insights into disease pathogenesis. Appl Clin Genet. 2014; 7:105–15.
 29
Reveille JD, Sims AM, Danoy P, Evans DM, Leo P, Pointon JJ, et al.GenomeWide Association Study of Ankylosing Spondylitis Identifies NonMHC Susceptibility Loci. Nat Genet. 2010; 42(2):123–7.
 30
Melville S, Melville MS. Package ‘NCBI2R’. 2012. Available online at: https://cran.rproject.org/src/contrib/Archive/NCBI2R/. Accessed 16 Jan 2017.
 31
Landa I, Boullosa C, IngladaPérez L, SastrePerona A, Pastor S, Velázquez A, et al.An Epistatic Interaction between the PAX8 and STK17B Genes in Papillary Thyroid Cancer Susceptibility. PLoS ONE. 2013; 8(9):e74765.
 32
MartinezChamorro A, Moreno A, GómezGarcía M, Cabello MJ, Martin J, LopezNevot MÁ. Epistatic Interaction between TLR4 and NOD2 in Patients with Crohn’s Disease: Relation with Risk and Phenotype in a Spanish Cohort. Immunobiology. 2016; 221(9):927–33.
Acknowledgements
None
Funding
Université d’Evry Val d’Essonne, CNRS 8071, ENSIIE, INRA.
Availability of data and material
Our proposed GGEE method has been implemented in an R package which is available on github: https://github.com/vstanislas/GGEE
Authors’ contributions
The three authors developed the approach and the design. VS conducted the simulation study, applied the approach to the real data and drafted the manuscript. CD and CA supervised the implementation process and critically read and edit the manuscript. The work of VS fulfills part of the requirements of her PhD. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable
Ethics approval and consent to participate
The dataset regarding ankylosing spondylitis consists of the French subset of the large study of the International Genetics of Ankylosing Spondylitis (IGAS) study [22]. For this subset, unrelated cases were recruited through the Rheumatology clinic of Ambroise Paré Hospital (BoulogneBillancourt, France) or through the national selfhelp patients’ association: "Association Française des Spondylarthritiques". Populationmatched unrelated controls were obtained from the "Centre d’Etude du Polymorphisme Humain", or were recruited as healthy spouses of cases. The protocol was reviewed and approved by the Ethics committee of the Ambroise Paré hospital. All participants gave their informed consent to the study.
The application on thyroid carcinomas was carried out on a public dataset that came from the study of LuzónToro et al. on identification of epistatic interactions in two different types of thyroid carcinomas [23]. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction. All subjects underwent peripheral blood extraction for genomic DNA isolation using MagNA Pure LC system (Roche, Indianapolis, IN) according to the manufacturer’s instructions. A written informed consent was obtained from all the participants for clinical and molecular genetic studies. The study was approved by the Ethics Committee for clinical research in the University Hospital Virgen del Rocío (Seville, Spain) and complies with the tenets of the declaration of Helsinki.
Finally, the study on Inflammatory Bowel Diseases makes use of data generated by the Wellcome Trust CaseControl Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk.
Author information
Affiliations
Corresponding author
Additional files
Additional file 1
Figure S1. Comparison of power to detect main effects in the simplified simulation study. Power under a simplified context. The figures show the power to detect main effects of the three methods depending on R ^{2}. (PDF 8 kb)
Additional file 2
Figure S2. Simulation on 25 genes with fully simulated data and various simulated effects. Discoveries for the setting with numerous effects. Heatmap of the ratio of the number of times where each variable was significant to the total number of simulations for R ^{2}=0.7 using the Wang Pathway model for the phenotype simulation with fully simulated data. We consider 25 genes with two main effects for genes 1 and 2, and four interaction effects between genes 3 and 4, genes 5 and 6, genes 7 and 8, and genes 9 and 10. (PDF 30 kb)
Additional file 3
Figure S3. Discoveries in the realistic simulation study with R ^{2}=0.4. Discoveries under a realistic context. Heatmaps of the ratio of the number of times where each variable was significant to the total number of simulations for R ^{2}=0.4. (PDF 9 kb)
Additional file 4
Figure S4. Comparison of execution time required to model interaction and to fit Group Lasso for the five first settings of the realistic simulation study. Execution time. Median of the execution time to model interaction and to fit Group Lasso for the five first settings of the realistic simulation study. (PDF 8 kb)
Additional file 5
Figure S5. Comparison of power to detect main effects in the realistic simulation study. Power under a realistic context. The figures show the power to detect main effects of the three methods depending on R ^{2}. (PDF 7 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Stanislas, V., Dalmasso, C. & Ambroise, C. EigenEpistasis for detecting genegene interactions. BMC Bioinformatics 18, 54 (2017). https://doi.org/10.1186/s1285901714880
Received:
Accepted:
Published:
Keywords
 Genomewide association study
 Genegene interactions
 Epistasis
 Group Lasso