Eigen-Epistasis for detecting gene-gene interactions

Background A large amount of research has been devoted to the detection and investigation of epistatic interactions in genome-wide association studies (GWASs). Most of the literature focuses on low-order interactions between single-nucleotide polymorphisms (SNPs) with significant main effects. Results In this paper we propose an original approach for detecting epistasis at the gene level, without systematically filtering on significant genes. We first compute interaction variables for each gene pair by finding its Eigen-Epistasis component, defined as the linear combination of Gene SNPs having the highest correlation with the phenotype. The selection of significant effects is done using a penalized regression method based on Group Lasso controlling the False Discovery Rate. Conclusion The method is tested against two recent alternative proposals from the literature using synthetic data, and shows good performances in different settings. We demonstrate the power of our approach by detecting new gene-gene interactions on three genome-wide association studies. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1488-0) contains supplementary material, which is available to authorized users.


Background
Genome Wide Association Studies (GWASs) look for genetic markers linked to a phenotype of interest. Typically, hundreds of thousands of single nucleotide polymorphisms (SNPs) are studied for a limited number of individuals using high-density genotyping arrays. Usually the association between each SNP and the phenotype is tested using single-marker methods. Multiple markers may also be considered, but these are typically selected using simple forward-selection methods. GWASs are a powerful tool for investigating the genetic architecture of complex diseases and have been successful in identifying hundreds of variants. However, they have been able to explain only a small proportion of the phenotypic variations expected from classical family studies [1]. A number of explanations for this missing heritability have been put forward. For example, it has been suggested that shared environments among relatives are not adequately taken into account. Another suggestion is that much larger numbers of variants with small effects remain to be identified. Rare variants, which are difficult to find using existing genotyping arrays [1], seem to be important causal factors, and so do structural variations. But complex diseases may also be caused, at least in part, by complex genetic structures with multiple interactions between markers (a phenomenon termed epistasis). Whereas in pedigree studies the genetic effect on phenotype is seen as part of the additive genetic variance, in GWASs it is seen as an unmeasured interaction between genes [2]. For example, Zuk et al. proposed a model that takes into account epistatic interaction in relation to Crohn's disease [3]. They found that 80% of the missing heritability could be due to genetic interactions.
In recent years a number of methods for studying epistasis have been proposed and reported in various reviews [4,5,6]. They vary in terms of their data analysis (genome-wide or filtering) and their statistical methodology (Bayesian, frequentist, machine learning or data mining). Most of them focus on singlelocus interactions, but considering interactions at the gene level can have several advantages. First, given that genes are the functional unit of the genome, results may be more biologically interpretable. Second, genetic effects are more easily detected when SNP effects are aggregated together. Third, gene-based analysis simplifies the multiple testing problem by reducing the number of variables. Several gene-gene methods have been proposed. These are based on a summarizing step which is used to obtain information at the gene level. In more recent methods, filters or penalized models are used to make the method applicable to a large number of genes, while older methods are only applicable to two or a very limited number of genes. For the summarizing step, most methods resort to a principal components (PC) approach, but each method has its specific characteristics. We describe some of these below.
Chatterjee et al. harnessed Tukey's one-degree-of-freedom method to investigate interaction between two genes [7]. Their method is based on the assumption that the SNPs included in each gene region act as surrogates for an underlying biological phenotype. The genotypic information for the gene region is extracted  [8]. On the one hand, they used Principal Component Analysis (PCA) to summarize SNP information within a gene, and on the other hand they used Partial Least Squares (PLS) to extract components that summarize, first, the information among SNPs in a gene and, second, the correlation between SNPs and the outcome of interest.
They then proposed an interaction test based on either the first PC or the first PLS component for each gene, and were able to show that the PCA and PLS methods often outperformed Tukey's one-degree-of-freedom method. But it is worth noting that the main objective of these three methods was improving the detection of associations in the presence of gene-gene interactions, rather than identifying the interactions themselves. Other approaches based on principal component analysis have since been proposed for epistasis detection. Li et al. proposed selecting, as the gene representation, PCs that are able to explain at least 80% of the variation [9]. Genotypic data are characterized by the high correlation among markers resulting from so-called linkage disequilibrium (LD). Procedures that take LD information into account have been developed for epistasis detection. For example, He et al. proposed an approach using LD information to weight genotype scores which are then aggregated using principal components [10]. Rajapakse et al. developed a gene-based test of interactions for case-control studies which compares LD patterns between cases and controls [11]. Using the same idea, Peng et al. used a canonical correlation-based U-statistic model (CCU) to detect co-association in case-control studies [12]. The idea is to test for two given genes the difference between canonical correlation coefficient computed by CCA among cases and among controls. Their work was subsequently extended to include kernel [13,14].
However most of these methods can be applied only to a reduced number of genes. Computational constraints mean that it is not feasible to model all genegene interactions directly. One way of overcoming this is to reduce the gene-gene search space by eliminating unimportant genes, and to this end two-step procedures have been developed that first filter out specific genes or SNPs through a genome-wide search before testing for interactions. One example of this is the model-based kernel machine method (3G-SPA) proposed by Li and Cui, which first performs a search for gene pairs contributing to the overall phenotypic variations [15]. Significant pairs are then tested for interaction effects. Another attractive alternative is offered by penalized regression methods that select a subset of important predictors out of a large number of potential predictors.
These methods operate by shrinking the size of the coefficients. The coefficients of predictors with little or no apparent effect are force to be set to zero, reducing the effective degrees of freedom and in many cases making model selection possible. A few approaches using penalized models have been proposed. D'Angelo et al. combined principal component analysis and lasso penalized regression [16].
Wang et al. used a principal component analysis combined with an L1 penalty, with adaptive weights based on gene size, pathway support and effect size [17].
Here we propose a Group Lasso approach [18] that takes into account the group structure of each gene in order to detect epistasis. We introduce Gene-Gene Eigen-Epistasis (G-GEE) as a new approach for computing the genegene interaction part of the model, and we compare G-GEE with two different interaction variable modeling approaches inspired by previous proposals in the literature, namely PCA and PLS. An adaptive ridge-cleaning approach [19] is then used in order to compute p-values for each group.
In the next section, we detail each model and outline the design of the simulation studies performed to compare the performance of the different approaches.
In the Results section, the findings of the simulation studies are shown, and we illustrate our approach on three real datasets relating to ankylosing spondylitis, thyroid carcinomas and inflammatory bowel disease. The different approaches and the results are discussed in the last section.

Methods
We consider n individuals where y = (y 1 , y 2 , ..., y n ) T denotes the vector of trait values. For each individual, genetic variants among G genes are considered. Each gene is described by a given number of SNPs p g where g p g = p. The SNP matrix X ∈ R n×p considers an additive coding scheme in which the genotype value of each SNP j from individual i is denoted X ij ∈ {1, 2, 3}. X i is a pdimensional vector of covariates for observation i and for j ∈ {1, . . . , p}. X g denotes the submatrix of X whose columns are the p g SNPs of gene g. A generalized linear model is generally assumed for GWAS, where the phenotype is considered as a random variable y i whose conditional expectation can be written as a function of the covariates X i and their interactions Z i , and Z i is the ith line of the matrix of interactions and γ a parameter vector of appropriate dimension. When the phenotype is binary (case control study), it is usual to assume a logistic model where g() is the logit and Y is assumed to follow a binomial distribution. Below we will consider only quantitative phenotypes using a classical linear model. In this case g() is the identity and the residuals are assumed to be Gaussian.
The main effect of each gene is modeled through the sum of the effects of all its SNPs. Concerning interaction effects, we compute new variables representing interactions between two specific genes and define as a group all the interaction variables related to a given pair of genes. The matrix of interaction is thus structured into G(G − 1)/2 submatrices: where Z rs describes the interactions between the two genes r and s. The parameter vector γ is accordingly structured into sub-vectors γ rs . We will now present and compare three different approaches for modeling gene-gene interactions.

Modeling Gene-Gene interactions
Let us consider two genes r and s described respectively by p r and p s SNPs. A possible interaction term describing the epistasis between the two genes is We hereafter set W rs = {X r ij X s ik } j=1,··· ,pr;k=1,··· ,ps i=1···n . In this case the submatrix of interactions is Z rs = W rs and γ rs = {γ rs jk } is a vector of size p r p s . The number of parameters in such a model is obviously too large to be reliably estimated. For this reason a number of papers in the literature consider reducing the dimension of γ.
In this paper we will consider three different methods for reducing the dimension reduction, namely Principal Component Analysis (PCA), Partial Least Squares (PLS), and our proposed Gene-Gene Eigen-Epistasis approach that we have termed G-GEE.

Principal Component Analysis
Principal Component Analysis (PCA) can reduce the number of variables describing each gene r from p r to q r < p r . Considering gene r described by p r SNPs, we compute the matrix of the first q principal components where U r is the matrix of the first q r principal axes. Using C r and C s instead of X r and X s in the computation of the interaction allows the number of parameters relative to each interaction to be controlled. This control is achieved by choosing the number of principal components q. The PCA model that we describe draws upon ideas in [20]. The interaction term takes the form Relating this expression to the general form of the interaction term W rs i described above, we can see that performing PCA prior to computing the interactions is a means of constraining the linear interaction term of Equation
In particular, if a single principal component is chosen, there will be only one parameter to estimate per interaction.

Partial Least Squares
Wang et al. proposed an alternative method for integrating interactions using a PLS approach [8]. Let (X r , X s ) be the genotypic matrix for the given pair of genes (r, s). Their approach computes the components that maximize cov 2 (X r u, T v), with T = (y, X s ) and (u, v) the weight vectors. The interaction of a couple of genes (r, s) is then represented by the first q components: In this approach phenotypic information is retained when the interaction variables are constructed.

Gene-Gene Eigen-Epistasis
We propose an original approach for modeling interactions. The general idea is to consider the interaction variable between the two genes r and s as a function f u (X r , X s ) parameterized by u. One way to estimate u is to maximize the correlation between the interaction function and the phenotype: cov 2 (y, f u (X r , X s )).
If we consider the function f to be linear, our problem becomes easily tractable and has only one solution. Setting where W rs = {X r ij X s ik } j=1··· ,pr;k=1,··· ,ps i=1···n and u ∈ R prps we obtain the following problem: The solution u is the eigenvector corresponding to the largest eigenvalue of the matrix W rsT yy T W rs , which is the vector W rsT y. The complexity of computing u is therefore in O(np r p s ). We then use the projection of the matrix W rs on u as the interaction variable. The resulting Eigen-Epistasis vector Z is the linear combination of all the SNP-SNP interactions being the most correlated with the phenotype. In its construction, G-GEE has similarities with PLS. The main difference lies in the original design matrix. PLS searches for components that maximize cov 2 (X r u, yX s v), whereas G-GEE retains the component that maximizes cov 2 (y, W rs u), with W rs the matrix of all pairwise interaction between the two genes r and s. Like PLS, G-GEE takes phenotypic information into account in the construction of the interaction variables. Other methods as such as CCU [12] and the kernel versions of CCU [13,14] that we referred to in the introduction also consider the phenotype in their construction, but these methods can be applied only to case-control problems.

Estimation of coefficients
We propose a Group Lasso approach [18] for estimating the parameters of linear or logistic (case control) regression. A group comprises either the SNPs of a given gene, or interaction terms relative to a given gene-pair interaction. In the particular case of linear regression, the model parameters are estimated by: The parameter λ is selected by cross-validation.
In order to improve estimation accuracy and to obtain p-values for each of the selected groups, we use the adaptive ridge cleaning approach proposed by Bécu et al. [19]. This screen and clean procedure is a two-stage method.
The group lasso model is first fitted on half of the data. The coefficient of the candidate groups selected by the model are then introduced into a ridge regression model fitted on the second half of the data with a specific penalty that allows the group structure to be taken into account. For each group the significance of the regression coefficients is estimated using permutation tests.

Simulation Design
To evaluate the performance of the proposed approach, we conducted two simulation studies, the first using simulated data and the second using a real dataset relating to ankylosing spondylitis. In each case we compared the proposed G-GEE model to the two other interaction variable modeling approaches. The first simulation corresponds to a simplified context where all parameters were controlled and external interference limited, while the second simulation corresponds to a realistic context with a realistic pattern of MAF and LD.

Design
Genotypes Our first (simplified) simulation study was adapted from the model used in [21] with an extension to control the minor allele frequency (MAF) of each SNP. The n lines of the genotype matrix are an i.i.d. sample from a multivariate random vector X i ∼ N p (0, Σ). The correlation matrix Σ is block diagonal, each block corresponding to a gene. Two variables belonging to the same gene are correlated at level ρ = 0.8 while all other correlations are null.
Each SNP (column of the genotype matrix) is randomly assigned an MAF p from a uniform distribution between 0.05 and 0.5. An MAF value of 0.2 is assigned to all causal SNPs. The genotype frequencies derived from the Hardy-Weinberg equation are then used to discretize X ik values to 1, 2 or 3. In practice, X ik is In the second (realistic) simulation study using a real ankylosing spondylitis dataset, genes are randomly selected. The number of SNPs composing each genes varies according to the selection.
Phenotypes For both simulation studies, we generated phenotype vectors using two different schemes. Our first scheme corresponds to the model proposed by Wang et al. [17] (which, for the sake of brevity, we will refer to hereafter as the "Wang Pathway" model): where C and C 2 are respectively the set of causal SNPs and causal interactions, and i a random Gaussian variable. For each causal gene g, we consider two causal SNPs and a coefficient β g is assigned to the standardized sum of these causal SNPs. In the same way, for the interactions, all the causal SNPs from a causal pair (r, s) are pairwise multiplied and a coefficient γ rs is assigned to the standardized sum of the product.
Our second scheme for simulating phenotypes is based on the following model: The difference with the first model concerns the simulation of the interaction effect. In the second model the interaction effect for a causal couple (r, s) is defined as the product of the first PCA component C r .1 of gene r and the first In both models, β 0 is set to 0, and i are generated independently from a N (0, σ 2 ), with σ 2 determined from the coefficient of determination R 2 that calibrates the strength of the association. Both simulation models can be written .
We remark that: Thus, replacingv ar( i ) by σ 2 , andĉ ov( i , Q i φ −ȳ) by −σ 2 /n, we obtain This relation between R 2 and σ 2 gives us an expression for σ 2 that depends on R 2 , We looked at how much of the coefficient of determination R 2 is explained by main effects, and how much is explained by interaction effects, in order to determine their respective roles in the model.
For a similar reason, when simulating phenotypes, Wang et al. [17] examined how much of partial R 2 was due to interaction effects. They selected coefficient values so that 30% of the partial R 2 was explained by interaction effects. Li and Cui [15] did not use the R 2 directly, but they simulated data assuming different proportions of interaction effects among the total genetic variance. In our study, once the phenotype y had been set for each simulated design matrix, we computed how much of the R 2 could be attributed to interaction and main effects as T respectively, with R 2 I the R-square value for the model containing only simulated interaction effects, R 2 M the R-square value where there were only simulated main effects, and R 2 T R-square value where there were both simulated main effects and simulated interaction effects.
Scenarios In the first (simplified) simulation study, genotypes are simulated as described in the design. We considered six genes, each composed of six SNPs and for 600 subjects. We define one causal interaction between genes and two causal genes with main effects, and the simulation takes place using two alternative simulation settings: • (1) one interaction and two main effects involving the same genes • (2) one interaction and two different main effects For these two settings, different coefficients of determination, from 0.05 to 0.7, are considered and 1000 iterations are performed.
In the second (realistic) simulation study, genotypes come from a real dataset comprising 763 individuals. At each iteration we randomly select six genes of various size (from 1 to 1119 with a median of 2 SNPs) in the dataset. We consider the five following settings: • (1) one interaction and two main effects involving the same genes

Real data illustration
To illustrate our approach we applied the proposed method on three real datasets related to ankylosing spondylitis, thyroid carcinomas and inflammatory bowel disease.
The dataset regarding ankylosing spondylitis consists of the French subset of the large study of the International Genetics of Ankylosing Spondylitis (IGAS) study [22]. For this subset, unrelated cases were recruited through the

Simulation studies
In the following, we will refer to the different simulation settings by using letters as described in Table 1.
Results from the simplified simulation study  In the first setting ( Figure 1(A,B)), we consider genes 1 and 2, both having main and interaction effects. When the phenotype is simulated using the Wang Pathway model, the G-GEE and PLS methods have a higher power to detect the interaction effect than PCA method, which tends to identify only the two main effects of the two genes ( Figure 1(A)). Whereas for PCA and PLS the power is nondecreasing with R 2 , for G-GEE we observe a U-shaped curve. For the smallest R 2 values, which correspond to the most difficult cases, the power of G-GEE to detect the interaction tends to decrease. When R 2 values reach 0.4, G-GEE's power to detect the interaction starts to increase. The situation is different for the main effects, since G-GEE's power to detect these increases continuously with R 2 [see Additional file 1]. For PLS, the power to detect the interaction effect is continuously nondecreasing. Note, however, that for this method one of the two main effects (here gene 1) is detected to the detriment of the second, regardless of the value of R 2 . In the PCA phenotype simulation model (Figure 1(B)), G-GEE has a higher power than the other methods to detect interaction effects while retaining a good specificity, whatever the value of R 2 . The reasonably high power of the PCA method can be explained by the similarity between the phenotype simulation model and the estimation model.
It is worth noting that in this first setting, only a few variables are falsely significant, which reflects a good specificity for all methods (the worst being for X.Genes4.Genes6 In the second setting (Figure 1(C,D)), genes 1 and 2 have only main effects, and genes 3 and 4 have only an interaction effect. When the phenotype is simulated using the Wang Pathway model, the interaction power of G-GEE is uniformly higher than that of the other methods (Figure 1(C)). For all values of R 2 , PCA tends to detect false main effects for genes 3 and 4, but not to detect interaction effects. In the PCA phenotype simulation (Figure 1(D)), PCA has a good power to detect interaction effects, but once again these good Results from the realistic simulation study Figures 2 and 3 show results for the first three settings. Figure 2 shows the power to detect gene interaction depending on the R 2 , and Figure 3 shows heatmaps of significant effects when R 2 = 0.2. In both figures, the upper row relates to phenotypes simulated using the Wang Pathway model, and the lower row to phenotypes simulated using the PCA model.  X.Gene4.Gene6 Gene4 Gene3 Gene2 Gene1 Figure 3: Discoveries under a realistic context Heatmaps of the ratio of the number of times where each variable was significant to the total number of simulations for R 2 = 0.2.The panels A, B, C, D, E and F refer to the different simulation settings described in Table 1. In the third setting, where only one interaction is simulated between genes 1 and 2, G-GEE has a higher power to detect interaction than PLS and PCA when the phenotype is simulated using the Wang Pathway model (Figure 2 E).
The power of PCA is higher in the PCA phenotype simulation model because of its similarity to the estimation model, whereas the power of PLS is almost null (Figure 2   Percentage of R 2 attributable to interaction and main effects respectively Table 2: Average percentage of R 2 attributable to interaction and main effects, by setting, in the first simulation study. Using each setting in both simulation studies, we determine the p I and p M average values that correspond to the proportion of the R 2 attributed to interaction and main effects, respectively. For most settings, the p I depends on the number of simulated effects. With one interaction and two main effects simulated the R 2 part attributable to interaction effects is around 33% ( Table 2 (B, C, D), Table 3 (C, D)). For the setting with numerous effects [see Additional file 2], the average p I is 67% because we consider four interaction effects for only two main effects. Finally, as expected, when only interaction effects are Table 3: Average percentage of R 2 attributable to interaction and main effects, by setting, in the second simulation study. simulated, the average p I is 100% (Table 3 (E, F)) and 0% when only main effects are simulated (Table 3 (OME)). However, the R 2 distribution between main and interaction effects is not distinguishable in the setting where the phenotype is simulated using the Wang Pathway model with the same main and interaction effects. The p I and p M values are all above 90% (Table 2 (A), Table 3 (A)). In the second simulation study, the R 2 distribution is also not well divided between main and interaction effects when the phenotype is simulated under the PCA model, though p M is still higher than p I (Table 3 (B)).

Conclusions
The results obtained in both simulation studies point to a certain confusion between main and interaction effects. When simulated interaction and main effects involve different genes, the methods tend to detect as interaction effects the pairs of genes simulated to have main effects and, conversely, to detect as main effects the genes simulated to having interaction effects.
Overall, G-GEE tends to detect more false interactions than false main effect whereas PLS and PCA tend to detect more false main effects though PLS tends to attribute a false interaction effect between genes 1 and 2. This type of confusion may explain the U-shaped power curve for G-GEE observed in the first simulation study (Figure 1(A)). As the problem becomes harder, the genetic effects of both genes are preferentially assigned to the interaction effect, implying a better power to detect interaction where R 2 values are small. Finally, we remark that for G-GEE false detections of main effects are more frequent when the PCA phenotype simulation model is used, whereas for the PLS and PCA methods, where the number of false detections for main effects is higher when the Wang Pathway phenotype simulation model is used.
Other observations regarding the power of the different methods can be made with these simulation results. PLS has more trouble than PCA and G-GEE in detecting interaction effects, and has a tendency to detect the first main effect with a higher power than the second main effect when two main effects are simulated. For all methods, the power to detect interactions increases more slowly with respect to R 2 when simulations are performed using real data genotypes than with fully simulated genotypes, but we observe that in the first setting the curve representing the interaction power of G-GEE is detached from the others, reflecting the superior performance of G-GEE over PLS and PCA.
Note that the power of G-GEE to detect main effects is always less than that of PCA and PLS when R 2 < 0.4 [see Additional file 1 and 5]. In short, G-GEE performs better when detecting interactions than when detecting main effects.

Ankylosing spondylitis
Ankylosing spondylitis (AS) is a common form of inflammatory arthritis predominantly affecting the spine and pelvis. It occurs with a prevalence of 0.1% to 1.4% depending on the considered population [24]. Genetic factors account for more than 90% of the risk of susceptibility to AS. Human leukocyte antigen (HLA) class I molecule HLA B27, belonging to the Major Histocompatibility Complex (MHC) region, was the first genetic risk factor identified as associated with ankylosing spondylitis in the 1970's [25,26] and remains the most important risk locus for this pathology. Despite the strong association only a small portion of HLA-B27 carriers develop the disease. Furthermore, studies in families suggest that less than 50% of the overall genetic risk is due to HLA-B27, which suggests that other genetic factors are involved [27]. A number of updated reviews on AS genetics, including genome-wide association study (GWAS) results, identified new ankylosing spondylitis-susceptibility genes outside of the MHC region [28,29].
We applied all the methods described above to the AS dataset. The data contain 408 cases and 358 controls, and each individual was genotyped for 116, 513 SNPs with Immunochip technology. For each SNP we obtained detailed genetic information, such as gene affiliation, with the NCBI2R package [30] which annotates lists of SNPs with current information from NCBI. We considered only SNPs located within a single gene in order to form gene groups without overlap. We focused our analysis on a list of 29 genes previously identified as having a main effect in GWAS.
The three methods tested yield different results, and only the PLS and G-GEE methods identify interactions. PCA detects only the main effect HLA-B and identifies no interactions. PLS detects the main effect HLA-B, but also identifies one interaction effect between the genes EOMES and BACH2. Our method G-GEE does not detect any main effects, but it shows two significant interactions, the first between the genes HLA-B and SULT1A1 and the second between IL23R and ERAP2.

Thyroid carcinomas
Thyroid cancers are thought to be related to a number of environmental and genetic predisposing factors and can be classified in various types and subtypes.
Most association studies have focused on main effects but only a limited number of genes were identified. Recently, some papers focus on the detection of epistatic interactions [31,23]. We applied our proposed approach on the two data sets used in Luzón-Toro el al. [23]

Discussion
In this paper we compared different approaches for modelling gene-gene epistasis in a penalized regression framework. Our primary concern was the detection of interaction effects, and for this purpose we defined a general model and tested different interaction terms. We focused our analysis at the gene scale and compared three ways to design the interaction term. Some methods were inspired by previous proposed approaches based on dimensional reduction methods including Principal Component Analysis (PCA) and Partial Least Square analysis (PLS). We additionally proposed a new interaction modeling approach that we called Gene-Gene Eigen-Epistasis (G-GEE), where one interaction variable is built for each couple. The interaction variable was defined based on a criterion that maximizes the covariance between the phenotype and the pairwise SNP product matrix of the two genes. The interaction components were then introduced in a Group Lasso penalized regression model that takes the gene structure into account and is capable of handling a large number of genes simultaneously.
A power study of the different methods based on two different simulation schemes (simplified and realistic) provided us with a rich body of information.
Across various papers in the literature we find comparisons of similar methods that use different phenotype simulation settings. In the present work we compared two simulation models. Our first model was from a previous study [17]  For all methods we observed a confusion phenomenon when active genes are not simulated with both main and interaction effects. False detections of interactions concern genes that were simulated to have main effects, and false detections of main effects concern genes simulated to have interaction effects.
This phenomenon reveals the difficulty that all methods encounter in clearly distinguishing the different types of effects. There are more false main detections when using methods such as PCA and PLS that are better at detecting main effects (except when the phenotype is simulated using the PCA model). As for interaction effects, the G-GEE methods make more false interaction detections than PCA and PLS.
When genotypes are fully simulated in the simplified simulation study, the G-GEE and PCA approaches performed better when the PCA phenotype simulation model was used, whereas the PLS method was not very sensitive to the choice of phenotype simulation model. Unlike PCA and PLS, G-GEE is better at detecting interaction effects than at detecting main effects when simulations use a real data set. Since the simulation study using realistic data is meant to mimic real genotype data structure, we conclude that in a real context G-GEE will be better at detecting interaction effects than main effects.
In comparison to SNP-SNP interaction approaches, the gene-scale dimension of our proposed method means that considerably fewer interaction variables need to be considered within a genetic region. This reduction in problem size allows larger problems to be handled. Moreover, a penalized regression method allows a true multivariate approach over a larger number of genes. It also extends other proposed gene-scale approaches, such as that presented by Wang et al. [8]. The ability to handle a relatively large number of genes simultaneously makes the detection of interactions between different genetic regions possible. This might be useful as an initial step, prior to using SNP-SNP interaction methods that may provide more accurate information.
As the G-GEE method is not able yet to consider all human genes at the same time, it is necessary to specify a list of genes to be explored for potential interactions. Given that its power to detect main effects is low, for the detection of main effects it will be safer to use previously acquired knowledge of the genetic effects, or to use a pre-processing method. Another limitation of the method is gene size. Computing the gene Eigen-Epistasis vector for two genes of size p r and p s requires an n × (p r p s ) matrix to be computed.
Prospects are improving the G-GEE method's performance by optimizing the computational cost and exploring new interaction functions to be plugged into the G-GEE criterion.
Availability of data and material: Our proposed G-GEE method has been implemented in an R package which is available on github: https://github.com/vstanislas/GGEE

List of abbreviations used
• 3G-SPA: model-based kernel machine method Genetics of Ankylosing Spondylitis (IGAS) study [22]. For this subset, unrelated cases were recruited through the Rheumatology clinic of Ambroise Paré Hospital (Boulogne-Billancourt, France) or through the national self-help patients' association: "Association Française des Spondylarthritiques". Population-matched unrelated controls were obtained from the "Centre d'Etude du Polymorphisme Humain", or were recruited as healthy spouses of cases. The protocol was reviewed and approved by the Ethics committee of the Ambroise Paré hospital.
All participants gave their informed consent to the study.  Additional file 5 - Figure S5 Comparison of power to detect main effects in the realistic simulation study. Gene3.Gene4 Gene5.Gene6 Gene7.Gene8 Gene9.Gene10 Figure S2: Discoveries for the setting with numerous effects Heatmap of the ratio of the number of times where each variable was significant to the total number of simulations for R 2 = 0.7 using the Wang Pathway model for the phenotype simulation with fully simulated data. We consider 25 genes with two main effects for genes 1 and 2, and four interaction effects between genes 3 and 4, genes 5 and 6, genes 7 and 8, and genes 9 and 10.