An efficient algorithm to perform multiple testing in epistasis screening
 François Van Lishout^{1, 2}Email author,
 Jestinah M Mahachie John^{1, 2},
 Elena S Gusareva^{1, 2},
 Victor Urrea^{6},
 Isabelle Cleynen^{5},
 Emilie Théâtre^{3, 4},
 Benoît Charloteaux^{3},
 Malu Luz Calle^{6},
 Louis Wehenkel^{1, 2} and
 Kristel Van Steen^{1, 2}Email author
DOI: 10.1186/1471210514138
© Van Lishout et al.; licensee BioMed Central Ltd. 2013
Received: 10 May 2012
Accepted: 12 April 2013
Published: 24 April 2013
Abstract
Background
Research in epistasis or genegene interaction detection for human complex traits has grown over the last few years. It has been marked by promising methodological developments, improved translation efforts of statistical epistasis to biological epistasis and attempts to integrate different omics information sources into the epistasis screening to enhance power. The quest for genegene interactions poses severe multipletesting problems. In this context, the maxT algorithm is one technique to control the falsepositive rate. However, the memory needed by this algorithm rises linearly with the amount of hypothesis tests. Genegene interaction studies will require a memory proportional to the squared number of SNPs. A genomewide epistasis search would therefore require terabytes of memory. Hence, cache problems are likely to occur, increasing the computation time. In this work we present a new version of maxT, requiring an amount of memory independent from the number of genetic effects to be investigated. This algorithm was implemented in C++ in our epistasis screening software MBMDR3.0.3. We evaluate the new implementation in terms of memory efficiency and speed using simulated data. The software is illustrated on reallife data for Crohn’s disease.
Results
In the case of a binary (affected/unaffected) trait, the parallel workflow of MBMDR3.0.3 analyzes all genegene interactions with a dataset of 100,000 SNPs typed on 1000 individuals within 4 days and 9 hours, using 999 permutations of the trait to assess statistical significance, on a cluster composed of 10 blades, containing each four QuadCore AMD Opteron(tm) Processor 2352 2.1 GHz. In the case of a continuous trait, a similar run takes 9 days. Our program found 14 SNPSNP interactions with a multipletesting corrected pvalue of less than 0.05 on reallife Crohn’s disease (CD) data.
Conclusions
Our software is the first implementation of the MBMDR methodology able to solve largescale SNPSNP interactions problems within a few days, without using much memory, while adequately controlling the type I error rates. A new implementation to reach genomewide epistasis screening is under construction. In the context of Crohn’s disease, MBMDR3.0.3 could identify epistasis involving regions that are well known in the field and could be explained from a biological point of view. This demonstrates the power of our software to find relevant phenotypegenotype higherorder associations.
Keywords
Epistasis Multiple testing maxT MBMDR GWA studies Crohn’s diseaseBackground
The complete sequence of the human genome has left scientists with rich and extensive information resources. The bloom of bioinformatics, and hence the wide availability of software, has improved the possibility to access and process genomic data. Genomewide association (GWA) studies, using a dense map of SNPs, have become one of the standard approaches for unraveling the basis of complex genetic diseases [1]. Despite their success, only a modest proportion of currently available heritability estimates can be explained by GWA studies discovered loci [2]. Commonly performed GWA studies usually oversimplify the underlying complex problem, in that usually no account is made for the existence of multiple “small”associations and nonSNP polymorphisms, nor epigenetic effects, genetic pathways, geneenvironment and genegene interactions [3, 4].
A lot of methods and software tools exist to perform largescale epistasis studies [5]. These Genomewide Association Interaction (GWAI) studies typically involve balancing between achieving sufficient power, reducing the computational burden and controlling type I error rates. Here, we present a new software tool to perform largescale epistasis studies, using the MBMDR methodology [69]. MBMDR is a nonparametric data mining method (no assumptions are made about genetic modes of inheritance) that is able to identify interaction effects for a variety of epistasis models in a powerful way. It is able to distinguish between multiple pure interaction effects and interaction effects induced by important main effects through efficient main effects correction strategies. Apart from identifying multiple sets of significant genegene interactions, MBMDR can also be used to highlight geneenvironment interactions in relation to a trait of interest, while efficiently controlling type I error rates. For now, the trait can either be expressed on a binary or continuous scale, or as a censored trait. Extensions to accommodate multivariate outcomes are underway. Here, we mainly focus on secondorder genegene interactions using biallelic genetic markers. However, our software can also handle multiallelic data and categorical environmental exposure variables, as will be shown in the implementation section. Our C++ software greatly enhances MBMDR’s first implementation as an Rpackage [10], both in terms of flexibility and efficiency.
Implementation
Input/Output
New implementation of maxT
In this section, we present Van Lishout’s implementation of maxT and prove that it requires a memory proportional to n (this is: O(n) memory), whereas the classical implementation of maxT requires O(m) memory. Here, m and n refer to the number of SNP pairs and the number of top pairs to retain in the output, respectively.
 1.
Compute the teststatistics for all pairs of SNPs (j=1,…,m) and sort them. The result is the Real Data vector of Figure 2 where T _{0,1}≥T _{0,2}≥…≥T _{0,m}.
 2.
Generate B random permutations of the trait column. For each permutation i=1,…,B, compute the teststatistics T _{i,j} for all pairs of SNPs (j=1,…,m) in the order defined by the Real Data vector. Force the monotonicity of the rows: for j=m−1,…,1 replace T _{i,j} by T _{i,j+1} if T _{i,j}<T _{i,j+1}.
 3.
For each pair of SNPs j=1,…,m compute the number a _{ j } of T _{i,j} values such that T _{i,j}≥T _{0,j}, for i=0,…,B.
 4.
Compute the pvalues using the equation ${p}_{j}=\frac{{a}_{j}}{B+1}$ for each pair of SNPs. Force the monotonicity of the pvalues: for j=1,…,m−1 replace p _{j+1} by p _{ j } if p _{j+1}<p _{ j }.
Note that the intuition behind the monotonicity enforcing procedure at step 2 is to correct the teststatistics that are obviously too pessimistic: the teststatistic of a pair P1 should not be lower than the teststatistic of a less significant pair P2. Replacing the teststatistic computed for P_{1} by the one computed for P_{2} is therefore a better estimation of the significance of P_{1}. The amount of falsenegative results would be higher without this procedure. Similarily, the purpose of the monotonicity enforcing procedure at step 4 is to correct pvalues that are obviously too optimistic: the pvalue of P_{2} should not be lower than the pvalue of P_{1}. Replacing the pvalue computed for P_{2} by the one computed for P_{1} is therefore a better estimation of the significance of P_{2}. The amount of falsepositive results would be higher without this final step.
From a memory point of view, it is best to implement the aforementioned algorithm in a slightly different way. Indeed, the current description implies all Permutation vectors of Figure 2 to be in memory at the same time. This requires O(B m) memory. In fact, a memory of O(m) can be achieved by adopting a different approach. The idea is that the a_{ j } values calculated at step 3, can already be calculated onthefly. A vector avalues of all a_{ j } values can be created from scratch and initialized with 1’s values. Indeed, note that at step 3 the original sample series counts as 1 of B+1 available samples to assess significance. For i=0,T_{0,j}≥T_{0,j} and hence a_{ j }=1,∀j=1,…,m. The elements of the avalues vector can then be updated at the end of each iteration i=1,…,B of step 2 by incrementing the a_{ j } values corresponding to the T_{i,j}≥T_{0,j} by one. In this way, all i^{ t h }T_{i,j} values can be removed from memory at the end of the i^{ t h } iteration since they are no longer of any use. Hence, only a single Permutation vector is stored instead of B vectors. In fact, applying step 4 to the avalues vector obtained at the end of this procedure readily leads to the final pvalues vector.
This proves that this algorithm requires O(m) memory. Obviously, if M denotes the number of SNPs, m is given by the formula m=M(M−1)/2. The memory usage of the classical implementation thus rises quadratically with the number of SNPs, whereas we will now see that our method is independent of it.
The monotonicity enforcing process executed at the end of step 2, implies that we need to calculate all T_{i,j} values, even if we are only interested in the first n pvalues. However, not all of these T_{i,j} values have to be stored in memory. For our purpose, only T_{i,j}(1≤j≤n) and M_{ i }, the maximum of the [T_{i,n+1},…,T_{i,m}] elements, are required. In other words, there is no need to explicitly propagate M_{ i } to position n+1. It suffices to compute M_{ i } and to replace T_{i,n} by M_{ i } if and only if M_{ i }>T_{i,n}. The monotonicity enforcement continues from positions n−1 through 1.
 1.
Compute the teststatistics for all pairs but store only the n highest ones. The result is a Real data vector where T _{0,1}≥T _{0,2}≥…≥T _{0,n}.
 2.
Initialize a vector a of size n with 1’s.
 3.Perform the following operations for i=1,…,B:
 (a)
Generate a random permutation of the trait column.
 (b)
Compute the teststatistics T _{i,1},…,T _{i,n} and store them in a Permutation _{ i } vector.
 (c)
Compute the maximum M _{ i } of the teststatistics values T _{i,n+1},…,T _{i,m}.
 (d)
Replace T _{i,n} by M _{ i } if T _{i,n}<M _{ i }.
 (e)
Force the monotonicity of the Permutation _{ i } vector: for j=n−1,…,1 replace T _{i,j} by T _{i,j+1} if T _{i,j}<T _{i,j+1}.
 (f)
For each j=1,…,n, if T _{i,j}≥T _{0,j} increment a _{ j } by one.
 (a)
 4.
Divide all values of vector a by B+1 to obtain the pvalues vector p. Force monotonicity as follows: for j=1,…,n−1, replace p _{j+1} by p _{ j } if p _{j+1}<p _{ j }.
Two remarks are in place:
First, the main idea of the Sorting by insertion algorithm [13] can be recycled to perform step 1 using O(n) memory. The Real Data vector is first initialized with the first n computed teststatistics and sorted using the quick sort algorithm [13]. Then, at each iteration, the next teststatistic is calculated and compared with the smallest value of the vector. If it is smaller or equal nothing has to be done. Otherwise, the smallest value is removed and the new one is inserted in order to preserve the sorting. This insertion requires $\frac{n}{2}$ operations on average. This method works particularly well on largescale problems, where m>>n. Intuitively, the probability of having to insert will decrease at each iteration and tend to zero because the Real Data vector will contain higher and higher values. This algorithm will take O(m) time on average, but could degenerate in O(n m), which is still linear.
Second, it should be noted that step 3(b) and 3(c) can be merged into a single step. The idea is to create first a hash table containing the indexes of the n best pairs, resolving collision by separate chaining [13]. The teststatistics T_{i,j} can then be computed in any convenient order. At each iteration, the hash table is used to decide (almost instantaneously) if the current value corresponds to one of the n best pairs or not, and perform either step 3(b) or step 3(c) accordingly.
Parallel workflow
 1.
Compute the teststatistics for all pairs on one machine and save the n highest ones into a file topfile.txt. This file should be saved at a location on which each machine has read access. It will contain the information of the Real Data vector of Figure 2 and have thus a size of only O(n).
 2.Split the computation of the permutations homogeneously between the Z machines. On each machine z=1…Z, perform the following operations:
 (a)
Read the file topfile.txt
 (b)
Initialize a vector p of size n with 0’s.
 (c)
Execute step 3 of Van Lishout’s maxT algorithm for each permutation assigned to z (using vector p instead of a).
 (d)
Save the p vector into a file permutation _{ z } .txt.
 (a)
 3.
When all machines have terminated their work, sum all vectors of the files permutation _{ 1 } .txt…permutation _{ Z } .txt to obtain a vector p. Perform step 4 of Van Lishout’s maxT algorithm on this vector.
However, the main feature that makes our software fast is not parallelization, but speed of the teststatistic computations. Indeed, we have seen that the maxT algorithm computes B×mT_{ i,j } values. Solving B=999 permutations with a dataset of M=100,000 SNPs, i.e. m=O(10^{ 10 }) pairs of SNPs, means thus O(10^{ 13 }) computations to perform. It is obvious that the computation of the teststatistic T_{ i,j } has to be very fast and that each improvement can have a dramatic influence on the final computing time. We show in the next section how we achieve this goal.
Teststatistic computation
This section presents the implementation of the computation of the T_{ i,j } values, capturing the degree of association between the j^{ th } pair of SNPs [SNP_{ lj },SNP_{ rj }] and the i^{ th } permutation of the trait Trait_{ i }. Let M+1 (n+1) be the number of possible values for SNP_{ lj } (SNP_{ rj }). In practice, most of the studies concern biallelic genetic markers and M=N=2. However, our program automatically detects the exact values of M and N, so that multiallelic variables are also covered. Furthermore, categorical environment variables can also be handled, as long as they are coded 0, 1,... M (N).
Since we are interested in solving largescale problems, we must realize that the part of the code that reads the dataset at the start of the program cannot store it in cache because of its size. Accessing to the trait and SNP values is thus slow and must be avoided as much as possible. For this reason, the three columns of interest (Trait_{ i }, SNP_{ lj } and SNP_{ rj }) will be passed by value and not by reference to the function. In this way, an explicit local copy of them will be performed, on which the function will be able to work faster.
 1.
Generation of the affectedsubjects and unaffectedsubjects matrices. These matrices are easily obtained by performing a loop over the subjects of the dataset: for a=1,…n, if c _{ a }=1 increment a cell of the affectedsubjects matrix, else a cell of the unaffectedsubjects matrix. The cell to be incremented depends on the genotype: g _{ alj } indicates which row of the matrix has to be incremented and g _{ arj } which column.
 2.
Generation of the HLOmatrix from the two matrices generated at step 1. The value of each R _{ mn } elements depends on a test for association between the trait and the genotype (SNP _{ lj } = m, SNP _{ rj } = n). This can be a χ ^{ 2 } test with one degree of freedom in the case of a binary trait, an Ftest in the case of a continuous trait, a logrank test in the case of survival data. However, the architecture of the software makes it easy to implement other test statistics that are appropriate for the data at hand. For binary traits, the implemented test statistic is defined by$\frac{{(\mathit{\text{ad}}\mathit{\text{bc}})}^{2}(a+b+c+d)}{(a+b)(c+d)(b+d)(a+c)}$, where a and b refer to the number of affected and unaffected subjects having the genotype (SNP _{ lj } = m, SNP _{ rj } = n) and c and d refer to the number of affected and unaffected subjects having a different genotype. This statistic follows a χ ^{ 2 } distribution. If we define N _{ A } and N _{ U } to be the total number of affected and unaffected subjects, those values are easy to compute: a = A _{ mn },b = U _{ mn },c = N _{ A } −A _{ mn } and d = N _{ U } −U _{ mn }. At this point, if either a+b or c+d is below a threshold that is a parameter of the program (default value 10) then the test is not performed at all, since it would not be statistically significant. In this case the value of R _{ mn } will be set to “O”, to indicate the absence of evidence that the subset of individuals with multilocus genotype (SNP _{ lj } = m, SNP _{ rj } = n) has neither a high nor a low risk for disease. Otherwise, the test is performed. When the computed χ ^{ 2 } value is not significant based on a liberal significance threshold of 0.1 (default value in the software), the value of R _{ mn } will be set to “O”, to indicate that we cannot reject the independence hypothesis. Otherwise, R _{ mn } will be set to either “H” if (ad−bc)>0, to indicate that the population whose genotype is (SNP _{ lj } = m, SNP _{ rj } = n) has a high risk of having the trait, or to “L” if (ad−bc)<0, to indicate a low risk for this event.
 3.
Computation of T _{ i,j } from the three matrices generated at step 1 and 2. It consists in performing two χ ^{ 2 } tests with one degree of freedom and returning the maximum of both. The first one tests association between the trait and the belonging to the “H” category versus the “L” or “O” category. The second one tests association between the trait and the belonging to the “L” category versus the “H” or “O” category. In the first (second) case, a and b are respectively the number of affected and unaffected subjects belonging to the “H” (“L”) category and c and d to the “L” (“H”) or “O” category. Computing this can be easily achieved by initializing a,b,c and d to zero, and for each R _{ mn } adding A _{ mn } to a and U _{ mn } to b if R _{ mn } = “H” (“L”) and A _{ mn } to c and U _{ mn } to d otherwise.
In summary, this paragraph shows that to make this methodology fast, reading the data of the subjects only once during step 1 to create the affectedsubjects and unaffectedsubjects matrices is a key. In this way, the test statistic computation function can quickly start to work on a very small part of memory that is in cache. The keys that make step 2 and 3 fast are respectively the fact that computing an R_{ mn } value does not require any loop and the fact that a single loop of nine iterations (in the biallelic case) allows to calculate all the numbers needed in the χ^{ 2 } formula.
Results and discussion
Here we present results for both simulated data and reallife data.
Simulated data
Twolocus penetrance table used to create the strong signal
b/b  b/B  B/B  

a/a  0  0.1  0 
a/A  0.1  0  0.1 
A/A  0  0.1  0 
Execution times of MBMDR3.0.3
SNPs  MBMDR3.0.3  MBMDR3.0.3  MBMDR3.0.3  MBMDR3.0.3 

sequential execution  sequential execution  parallel workflow  parallel workflow  
Binary trait  Continuous trait  Binary trait  Continuous trait  
100  45 sec  1 min 35 sec  <1sec  <1sec 
1,000  1 hour 16 minutes  2 hours 39 minutes  38 sec  1 min 17 sec 
10,000  5 days 13 hours  11 days 19 hours  1 hour 3 min  2 hours 14 min 
100,000  ≈ 1.5 year  ≈ 3 years  4 days 9 hours  ≈ 9 days 
Crohn’s disease data
We apply our software to reallife data on Crohn’s disease [17][18]. Here, Caucasian Crohn’s disease patients and healthy controls are genotyped using Illumina HumanHap. Quality control tests are performed on these data excluding SNPs and individuals with more than 5% missing genotypes. Individuals with mean heterozygosity outside the range of 31% to 38% are discarded. The gender of the individuals is predicted from the mean homozygosity on X markers and samples with contradiction between the estimated and the recorded gender are excluded. SNPs violating HardyWeinberg principle are discarded using aχ pvalue threshold of 10^{−4}. Related individuals are identified using pairwise IBS tests and discarded as well. The cleansing process give rise to a set of 1687 unrelated Caucasians (676 CD patients and 1011 healthy controls) and 311,192 SNPs.
For the purpose of this study, we use Biofilter.0.5.1 [19] as an additional data preparation step. It uses a knowledgedriven approach to prioritize genetic markers in genegene interaction screening while reducing the search space. In particular, Biofilter allows the explicit detection and modeling of interactions between a large set of SNPs based on biological information about genegene relationships and genedisease relationships. The knowledgebased support for the models is attributed by implication index, which is simply a number of data sources that provide evidence of genegene interaction or genedisease relationship, and is calculated by summing the number of data sources supporting each of the two genes and the connection between them (see [19] for more details). In practice, to make the prioritization procedure in Biofilter more focused on CD, we apply a list of candidate genes for CD (120 genes collected from the publications [18][20][23]) and 160 groups (collected basing on selective search in Biofilter using keywordscrohn, enteritis, inflam, autoimmune, immune, bowel, gastrointest, ileum, ileitis, intestine, lleocolic, diarrhea, stenosis and cytokine). Using this approach/analysis we ended up with 12,471 SNPs that we further analyze in MBMDR.
SNPSNP interactions having a multiple testing corrected pvalue < 0.05
First SNP  Second SNP  pvalue 

rs11209026  rs7573680  0.004 
rs11465804  rs7573680  0.017 
rs11209026  rs2064689  0.018 
rs11209026  rs6911639  0.021 
rs11209026  rs4766584  0.023 
rs11465804  rs2064689  0.025 
rs11465804  rs4766584  0.028 
rs11465804  rs6911639  0.029 
rs11465804  rs10849401  0.033 
rs11209026  rs296513  0.037 
rs1343151  rs2076756  0.04 
rs11209026  rs10849401  0.044 
rs11209026  rs7786745  0.048 
rs11209026  rs4655683  0.048 
Location of the SNPs involved in a significant SNPSNP interaction
SNP  Position  Gene 

rs11209026  chr1:67705958  IL23R 
rs11465804  chr1:67702526  IL23R 
rs7573680  chr2:240169077  HDAC4 
rs2064689  chr1:67653010  IL23R 
rs6911639  chr6:32978178  HLADOA 
rs4766584  chr12:109663581  ACACB 
rs296513  chr1:200906473  C1orf81 
rs10849401  chr12:6273238  intergenic 
rs7786745  chr7:18422684  intergenic 
rs4655683  chr1:67611613  intergenic 
rs1343151  chr1:67719129  IL23R 
rs2076756  chr16:50756881  NOD2 
Discussion
Several studies have suggested that different signals exist in IL23R, conferring risk or protection to Crohn’s disease. A study by Taylor et al[25], where they aimed to estimate the total contribution of the IL23R gene to CD risk using a haplotype approach, showed that the population attributable risk for these haplotypes was substantially larger than that estimated for the IL23R Arg381Gln variant alone. MBMDR3.0.3 identified several “epistatic” signals from pairs of SNPs located in the IL23R gene. It should be noted though that epistasis signals on SNPs in LD are considered to be nonsynergetic. The MBMDR discoveries on Crohn’s disease also seem to give us a new working hypothesis to expand on the current knowledge (histone deacetylation). Indeed, histone deacetylation results in a compact chromatin structure commonly associated with repressed gene transcription (epigenetic repression), and hereby plays a critical role in transcriptional regulation, cell cycle progression and developmental events. Although not known to physically interact directly, IL23R and HDAC4 could be linked trough MAPK1/STAT3 signaling: MAPK1 has been shown to associate with phosphorylate HDAC4 [26]. Protein phosphorylation regulates the corepressor activity of the deacetylase. MAPK1 also acts as an important activator of STAT3 (signal transducer and activator of transcription 3) which is an essential regulator of immunemediated inflammation. In addition, the IL23/IL23R pathway modulates STAT3 transcriptional activity, and recently it has been shown that CD8+ T cells from Arg381Gln IL23R carriers showed decreased STAT3 activation compared with WT CD8+ T cells [27]. It can thus be hypothesized that a balanced action between the HDAC1/MAPK1 and IL23/IL23R pathways, converging on STAT3 signaling, are important for CD pathogenesis. The fact that no significant SNP pairs remain, following an adjustment of the MBMDR screen for main effects (an observation that already emerged after interpreting Figure 5) seems to suggest that the significant results for the SNP pairs of Table 3 are mainly induced by important main effect players.
MBMDR3.0.3 can accommodate a variety of study designs and outcome types, can correct for important lower order effects and satisfactory deals with the computational burden induced by highlydimensional complex data. In order to upscale the applicability of the MBMDR methodology towards genomewide association interaction analyzes, the method was implemented in C++ and a new version of the maxT algorithm was incorporated. This version requires an amount of memory that is independent from the number of genetic effects to be investigated. We were able to further reduce the execution time, first by parallelizing the processes and second by optimizing the teststatistic function capturing the degree of association between a pair of SNPs and a trait. All of these features, available in MBMDR3.0.3, are promising in the light of GWAI studies. Alternative approaches to deal with execution time are proposed, for example GPU [28] and cloud computing [29]. Used in conjunction with MBMDR, those methods could lead to very fast software tool to solve GWAI studies problems.
Conclusions
In this paper we have presented the epistasis screening software MBMDR3.0.3. It is based on a new implementation of maxT. The main advantage of this improvement, is that it solves memory problems for any kind of analysis by becoming independent from the number of SNPs, without loss of power. We have also presented a fast implementation of a teststatistic function indicating the association between the trait and a pair of SNPs.
We have tested our program on simulated datasets of increasing size. The parallel workflow was tested on a cluster composed of 10 blades, containing each four QuadCore AMD Opteron(tm) Processor 2352 2.1 GHz and is able to analyze all pairwise genegene interactions with a dataset of 100,000 SNPs typed on 1000 individuals within 4 days and 9 hours, using 999 permutations of the trait to assess statistical significance.
Availability and requirements

Project name: MBMDR

Project home page: http://www.statgen.ulg.ac.be

Operating system(s): Mac OS X and Linux

Programming language: C++

Restrictions on use by nonacademics: no limitations
Declarations
Acknowledgements
This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific responsibility rests with its author(s). Their work was also supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence (Pattern Analysis, Statistical Modelling and Computational Learning), IST2007216886. FVL, LW and KVS also acknowledges support by Alma in Silico, funded by the European Commission and Walloon Region through the Interreg IV Program. For MC and VU, this work was partially supported by Grant MTM200806747C0202 from el Ministerio de Educación y Ciencia (Spain), Grant 050831 from La Marató de TV3 Foundation, Grant 2009SGR581 from AGAURGeneralitat de Catalunya. VU is the recipient of a predoctoral FPU fellowship award from the Spanish Ministry of Education (MEC). We would like to thank Tom Cattaert (former postdoc) for the interesting discussions and help on further improving the software.
Authors’ Affiliations
References
 Hardy J, Singleton A: Genomewide association studies and human disease. N Engl J Med. 2009, 360: 17591768. 10.1056/NEJMra0808700.PubMed CentralView ArticlePubMed
 Manolio TA, Collins FS, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH: Finding the missing heritability of complex diseases. Nature. 2009, 461 (7265): 747753. 10.1038/nature08494.PubMed CentralView ArticlePubMed
 Visscher PM, Brown MA, McCarthy MI, Yang J: Five years of GWAS discovery. Am Soc Hum Genet. 2012, 90: 724. 10.1016/j.ajhg.2011.11.029.View Article
 Zuk O, Hechter E, Sunyaev SR, Lander ES: The mystery of missing heritability: genetic interactions create phantom heritability. Proc Natl Acad Sci. 2012, 109 (4): 11931198. 10.1073/pnas.1119675109.PubMed CentralView ArticlePubMed
 Van Steen K: Traveling the world of genegene interactions. Brief Bioinform. 2011, 13: 119.View Article
 Calle ML, Urrea V, Malats N, Van Steen K: MBMDR: modelbased multifactor dimensionality reduction for detecting interactions in highdimensional genomic data. Tech. Rep. 24, Department of Systems Biology, Universitat de Vic, Vic,: Spain; 2008
 Calle ML, Urrea V, Vellalta G, Malats N, Van Steen K: Improving strategies for detecting genetic patterns of disease susceptibility in association studies. Stat Med. 2008, 27: 65326546. 10.1002/sim.3431.View ArticlePubMed
 Cattaert T, Calle ML, Dudek SM, Mahachie John JM, Van Lishout F, Urrea V, Ritchie MD, Van Steen K: Modelbased multifactor dimensionality reduction for detecting epistasis in casecontrol data in the presence of noise. Ann Hum Genet. 2011, 75: 7889. 10.1111/j.14691809.2010.00604.x.PubMed CentralView ArticlePubMed
 Mahachie John JM, Cattaert T, Van Lishout F, Gusareva E, Van Steen K: Lowerorder effects adjustment in quantitative traits modelbased multifactor dimensionality reduction. PLoS ONE. 2012, 7 (1): e2959410.1371/journal.pone.0029594. http://dx.doi.org/10.1371/journal.pone.0029594,PubMed CentralView ArticlePubMed
 Calle ML, Urrea V, Malats N, Van Steen K: mbmdr: an R package for exploring genegene interactions associated with binary or quantitative traits. Bioinformatics. 2010, 26 (17): 21982199. 10.1093/bioinformatics/btq352.View ArticlePubMed
 Ge Y, Dudoit S, Speed TP: Resamplingbased multiple testing for microarray data analysis. Tech. Rep. 633, Department of Statistics: University of California, Berkley; 2003
 Westfall PH, Young SS: Resamplingbase Multiple Testing. 1993, New York: Wiley
 Knuth D: The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. 1998, AddisonWesley: Reading
 Cattaert T, Urrea V, Naj AC, De Lobel L, De Wit V, Fu M, Mahachie John JM, Shen H, Calle ML, Ritchie MD: FAMMDR: A Flexible familybased multifactor dimensionality reduction technique to detect epistasis using related individuals. PLoS ONE. 2010, 5 (4): e1030410.1371/journal.pone.0010304. http://dx.doi.org/10.1371/journal.pone.0010304,PubMed CentralView ArticlePubMed
 Mahachie John JM, Van Lishout F, Van Steen K: Modelbased multifactor dimensionality reduction to detect epistasis for quantitative traits in the presence of errorfree and noisy data. Eur J Hum Genet. 2011, 19 (6): 696703. 10.1038/ejhg.2011.17.PubMed CentralView ArticlePubMed
 Ritchie MD, Hahn LW, Moore JH: Power of multifactor dimensionality reduction for detecting genegene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemil. 2003, 24 (2): 150157. 10.1002/gepi.10218.View Article
 Libioulle C, Louis E, Hansoul S, Sandor C, Farnir F, Franchimont D, Vermeire S, Dewit O, de Vos M, Dixon A: Novel Crohn disease locus identified by genomewide association maps to a gene desert on 5p13.1 and modulates expression of PTGER4. PLoS Genet. 2007, 3 (4): e5810.1371/journal.pgen.0030058.PubMed CentralView ArticlePubMed
 Barett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, Rioux JD, Brant SR, Silverberg MS, Taylor KD, Barmada MM: Genomewide association defines more than 30 distinct susceptibility loci for Crohn’s disease. Nat Genet. 2008, 40 (8): 955962. 10.1038/ng.175.View Article
 Bush WL, Dudek SM, Ritchie MD: Biofilter: a knowledgeintegration system for the multilocus analysis of genomewide association studies. Pacific Symposium on Biocomputing. 2009, 368379. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2859610/pdf/nihms186228.pdf]
 Raychaudhuri S, Plenge RM, Rossin E, Ng AC, Consortium IS, Purcell SM, Sklar P, Scolnick EM, Xavier RJ, Altshuler D: Identifying relationships among genomic disease regions: predicting genes at pathogenic SNP associations and rare deletions. PLoS Genet. 2009, 5 (9): 115.
 Franke A, McGovern DP, Barrett JC, Wang K, RadfordSmith G, Ahmad T, Lees CW, Balschun T, Lee J, Roberts R: Genomewide metaanalysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nat Genet. 2010, 42 (12): 11181126. 10.1038/ng.717.PubMed CentralView ArticlePubMed
 Kaser A, Zeissig S, Blumberg RS: Inflammatory bowel disease. Annu Rev Immunol. 2010, 28: 573621. 10.1146/annurevimmunol030409101225.PubMed CentralView ArticlePubMed
 Dalal SR, Kwon HK: The role of MicroRNA in inflammatory bowel disease. Gastroenterol Hepatol. 2010, 6: 714722.
 Watkinson J, Anastassiou D: Synergy disequilibrium plots: graphical visualization of pairwise synergies and redundancies of SNPs with respect to a phenotype. Bioinformatics. 2009, 25 (11): 14451446. 10.1093/bioinformatics/btp159.PubMed CentralView ArticlePubMed
 Taylor KD, Targn SR, Mei L, Ippoliti AF, McGovern D, Mengesha E, King L, Rotter JI: IL23R Haplotypes provide a large population attributable risk for Crohn’s disease. Inflamm Bowel Dis. 2008, 14 (9): 11851191. 10.1002/ibd.20478.PubMed CentralView ArticlePubMed
 Zhou X, Richon VM, Wang AH, Yang XJ, Rifkind RA, Marks PA: Histone deacetylase 4 associates with extracellular signalregulated kinases 1 and 2, and its cellular localization is regulated by oncogenic Ras. Proc Natl Acad Sci USA. 2000, 97: 1432914333. 10.1073/pnas.250494697.PubMed CentralView ArticlePubMed
 Sarin R, Wu X, Abraham C: Inflammatory disease protective R381Q IL23 receptor polymorphism results in decreased primary CD4+ and CD8+ human Tcell functional responses. Proc Natl Acad Sci USA. 2011
 SinnottArmstrong NA, Greene CS, Cancare F, Moore JH: Accelerating epistasis analysis in human genetics with consumer graphics hardware. BMC Res Notes. 2009, 2: 14910.1186/175605002149.PubMed CentralView ArticlePubMed
 Wang Z, Wang Y, Tan KL, Wong L, Agrawal D: CEO: a cloud epistasis computing model in GWAS. International Conference on Bioinformatics & Biomedicine; Hong Kong. 2010, [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5706522]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.