Efficient Bayesian approach for multilocus association mapping including genegene interactions
 Pekka Marttinen^{1}Email author and
 Jukka Corander^{2, 3}
https://doi.org/10.1186/1471210511443
© Marttinen and Corander; licensee BioMed Central Ltd. 2010
Received: 8 October 2009
Accepted: 2 September 2010
Published: 2 September 2010
Abstract
Background
Since the introduction of largescale genotyping methods that can be utilized in genomewide association (GWA) studies for deciphering complex diseases, statistical genetics has been posed with a tremendous challenge of how to most appropriately analyze such data. A plethora of advanced modelbased methods for genetic mapping of traits has been available for more than 10 years in animal and plant breeding. However, most such methods are computationally intractable in the context of genomewide studies. Therefore, it is hardly surprising that GWA analyses have in practice been dominated by simple statistical tests concerned with a single marker locus at a time, while the more advanced approaches have appeared only relatively recently in the biomedical and statistical literature.
Results
We introduce a novel Bayesian modeling framework for association mapping which enables the detection of multiple loci and their interactions that influence a dichotomous phenotype of interest. The method is shown to perform well in a simulation study when compared to widely used standard alternatives and its computational complexity is typically considerably smaller than that of a maximum likelihood based approach. We also discuss in detail the sensitivity of the Bayesian inferences with respect to the choice of prior distributions in the GWA context.
Conclusions
Our results show that the Bayesian model averaging approach which explicitly considers genegene interactions may improve the detection of disease associated genetic markers in two respects: first, by providing better estimates of the locations of the causal loci; second, by reducing the number of false positives. The benefits are most apparent when the interacting genes exhibit no main effects. However, our findings also illustrate that such an approach is somewhat sensitive to the prior distribution assigned on the model structure.
Keywords
Background
Given the hugely decreased economic costs of utilizing largescale singlenucleotidepolymorphism (SNP) genotyping to study the genetic architecture of a phenotype of interest, GWA analyses have become popular within many areas of molecular medicine. An excellent review [1] of the statistical challenges in GWA studies highlights the fact that no single approach has yet appeared which would comply to all immediate desiderata in this context, such as high power, reasonable control of spurious findings and relatively inexpensive computational effort. The plethora of different modeling and testing approaches for detecting single and multiple polymorphisms associated with a complex phenotype that has recently appeared in the literature demonstrates the urgent need for reliable and in practice applicable statistical methods in this context [2–9]. A variety of causal and graphical modeling ideas, as well as more standard regression modelling methods have been investigated in these works.
Some of the most challenging aspects of GWA analyses are how to sensibly handle the question of multiple model comparisons and how to identify influential genegene interactions since the number of putative model terms is astronomic and some interactions may involve polymorphisms that lack main effects, leading to reduced power with singlelocus tests. From a theoretical statistical perspective it could be expected that the Bayesian approach [10] would provide satisfactory answers to these issues due to its ability to combine information over many models with varying parametric dimensionality. Advantages of the Bayesian methods in genetic association studies have been discussed in a recent review [11]. However, a primary burden is then how to specify a sensible prior probability distribution for all putative association models, which is a complex task [1]. The fully Bayesian approach has been employed in the context of GWA studies using regression and graphical models; however, the published approaches have not explicitly considered genegene interactions due to the computational burden [4, 7].
In the current work we have aimed to address these challenges by developing an efficient Bayesian modeling approach that explicitly considers genegene interactions. Our work is partially inspired by the work of Marcini et al. where Bayesian singlelocus association tests were developed [5]. Furthermore, we have examined in parallel the Bayesian graphical modeling approach introduced in [4] and discuss the particular sensitivity of Bayesian inferences with respect to the choice of prior distributions in the GWA context. We show how the marginal likelihood score can be analytically derived for a variety of Bayesian genegene interaction models, which in turn enables the use of highly efficient nonstandard Monte Carlo model learning [12]. The advantages of such an approach compared to standard Markov chain Monte Carlo (MCMC) computation have been demonstrated for very highdimensional model learning problems [12–14]. Moreover, contrasted with the maximum likelihood estimation of comparable logistic regression models as in [2], our method is considerably faster, which is of primary importance given the astronomic number of models that can be examined for a single data set.
The present article is structured as follows. In the Methods section we introduce our Bayesian model and learning algorithm for multilocus association mapping, provide a brief overview of alternative methods, and describe a simulation study utilizing real genomewide SNP data as a basis for generating realistic levels of linkage and molecular variation. The results from the simulation study are presented in Results. In Discussion, we summarize the advantages and disadvantages of our approach and discuss some challenges encountered. Conclusions concisely summarizes the main points.
Methods
Bayesian multilocus association model
Consider a casecontrol study involving N individuals for which y_{ i } ∈ {0, 1} denotes the phenotypic status such that y_{ i } = 1 and y_{ i } = 0 correspond to the presence and absence of a disease, respectively, for i = 1, ..., N. Let Z_{ ij } ∈ {0, 1, 2}, i = 1, ..., N denote the observed genotype of individual i at SNP locus j, j = 1, ..., L Furthermore, let π_{ i } denote the probability of individual i carrying the disease, i.e. the event y_{ i } = 1. To simplify the notation in the model definitions introduced below, we will occasionally omit the index i from the disease probabilities when there is no difference between the individuals.
Each elementary model specifies unambiguously a partition S = {s_{1}, ..., s_{ d } } of the N individuals, i.e. the individuals are divided into d nonoverlapping and nonempty subsets or classes s_{1}, ..., s_{ d } associated with distinct disease probabilities. For instance, the elementary model M_{1} (j) specifies a partition with three classes, each corresponding to the individuals having a particular genotype at locus j (Z _{ ij } equals 0, 1 or 2). The following definition enables us to formally characterize our modellearning strategy.
where${p}_{{c}_{1}\cdots {c}_{n}}$are arbitrary probabilities.
are considered combined because they can be obtained through a combination involving either the elementary model itself, or no models at all. Second, the elementary models involved in a combination operation should not include any overlapping sets of loci. To demonstrate the necessity of such a restriction, consider the elementary models M_{1}(j) and M_{5}(j, k). According to the first model M_{1}(j) there exists a main effect on the disease probability when the genotype at locus j changes. However, this contradicts the second model M_{5}(j, k) according to which there is no such effect. The purpose of the restriction imposed on the combination operator is to prevent ambiguous model specifications in this respect. Third, the combined models are formed from main effects and twoway interactions, and do not explicitly represent higherorder interactions. For example, gene pairs AB and CD could be included, but not a triplet ABC. Nevertheless, in our approach the combination of the full models of types M_{1} and M_{3} is always permitted and it can represent even higherorder interactions; however, this comes with the expense of some redundant parameters. Indeed, the number of parameters (which is equal to the number of classes with differing disease probabilities as specified by the combined model) grows exponentially with respect to the number of loci involved in the model. The growth rate depends on whether we use full models or some lower dimensional submodels when formulating the combined model. For example, a sixlocus interaction model can have at maximum 9^{3} = 729 and at minimum 2^{3} = 8 parameters, depending on whether three models of type M_{3} or M_{5}, respectively, are combined. Fourth, we note that the set of elementary models considered here is not exhaustive and it would be straightforward to generalize the approach by including other types of models, for example those with recessive main effects. Here our particular emphasis is on the situation where some causal SNPs lacking main effects have a joint effect, as exemplified by models of type M_{5}. Such SNPs are expected to be most challenging to detect in practice when utilizing single SNP based statistical tests. Finally, we note that models M_{3} and M_{4}, although elementary, can also be obtained by combining models of type M_{1} or M_{2}, respectively. Our reason for including them as such into the pool of elementary models is to enhance the learning algorithm by making plausible twolocus combinations immediately available as building blocks for more complex models.
where n_{ cb } is the number of individuals i in class c with disease status y_{ i } = b and s_{ c }  is the total number of individuals in class c: Formula (10) specifies the standard marginal likelihood arising from the binomial likelihood under the conjugate beta distribution (see, e.g. [16]).
 1.If M _{1} and M _{2} are two distinct models such that M _{2} includes one more SNP than M _{1}, then$\frac{P({M}_{2})}{P({M}_{1})}=\xi $
 2.
The prior distribution is uniform over all combined models which include the same number of loci.
Thus, ξ can be interpreted as the penalty resulting from an increase in the number of SNPs in a model. In practice, it is reasonable to set ξ equal to a small value that depends inversely on the total number of investigated SNPs. This prevents the learning of overly complex models when the number of SNPs in a data set increases. The parameter ξ will be referred to as the structure parameter in the text. The values of ξ used in the analyses are specified in the Specification of the search parameters section.
Stochastic search for model learning
 1.
Calculate the posterior probability distribution over all elementary models in ${\mathcal{M}}_{e}$. In the previous section we derived analytical expressions for these probabilities.
 2.
Select a set ${\mathcal{M}}_{{e}_{1}}\subseteq {\mathcal{M}}_{e}$ of K elementary models corresponding to the K highest posterior probabilities, where K is a userspecified constant determining the accuracy of the approximate model averaging (see below). Further, include all singlelocus elementary models in ${\mathcal{M}}_{{e}_{1}}$.
 3.
Run a stochastic search in the space of combined models of ${\mathcal{M}}_{{e}_{1}}$. For the search we use a nonreversible MCMC algorithm [12]. Search operators are described in closer detail below.
 4.Let ${\mathcal{M}}^{\ast}$ denote the set of all combined models visited during the search algorithm in Step 3 and let ${\mathcal{M}}_{j}\subseteq {\mathcal{M}}^{\ast}$ denote the models in which j th locus is included. Let X_{ j } ∈ {0, 1} denote the indicator variable of the event that locus j is included in the association model. The posterior probability of X_{ j } = 1 then equals:$P({X}_{j}=1Data)=\frac{{\sum}_{M\in {\mathcal{M}}_{j}}P(M\phantom{\rule{0.5em}{0ex}}\phantom{\rule{0.5em}{0ex}}Data)}{{\sum}_{M\in {\mathcal{M}}^{*}}P(M\phantom{\rule{0.5em}{0ex}}\phantom{\rule{0.5em}{0ex}}Data)},$(12)
where P (M  Data) is the posterior probability of any particular association model M, for which the exact expression was derived in the previous section.
which is the logarithm of the posterior odds in favor of association.
 1.
Add a new randomly selected elementary model to the current model M. Remove from M all elementary models which conflict with the added elementary model.
 2.
Remove a randomly selected elementary model from M.
 3.
Switch an elementary model in M with another elementary model from ${\mathcal{M}}_{{e}_{1}}$. The operator simply considers jointly steps 2 and 1, in this order.
At each iteration these operators are used with the probabilities [0.5, 0.45, 0.05].
where ${\mathcal{M}}^{\ast}$ is the set of models visited during the stochastic search. This estimation strategy is made possible by the fact that P (Data  M) P (M) is available analytically for all models. For a discussion of the advantages of using the nonreversible sampler in a general Bayesian model learning context, see the original article or [17], where even milder conditions for the convergence of the algorithm were provided. The stochastic search method and all alternative methods described in the next section were implemented in Matlab.
Alternative methods for association mapping
A similar pvalue could alternatively be calculated using the MantelHaenszel test (see, e.g. [15]).
where P (Data  M_{ a } (j, k)) is the marginal likelihood based on the twolocus elementary model M_{ a } (j, k) (see Bayesian multilocus association model), and M_{0} is the null model according to which all individuals have the same disease probability. Thus, the probability of the data under the twolocus association model (18) is given by a mixture of three interaction models of different complexities, each having a prior weight of 1/3. (As an alternative to averaging over the three models, we also considered taking the maximum over the models; however, no notable changes in the results were detected.) Notice the difference between the Bayesian scores (13) and (17). In (13) the posterior probability of disease association for locus j is obtained by summing the posterior probabilities of all models in which j is included. In (17) the posterior probability of association for locus j is based on a single twolocus interaction model (which, however, is a mixture of three models) maximizing the probability of data. Analytical expression for the above Bayes factor is based on the derivations provided in the section Bayesian multilocus association model.
Simulated data sets
As the basis of our simulations we use data on 2131 real human subjects with approximately 500,000 SNPs from the autosomal and X chromosomes. The data belong to GenMets sample collected as part of the Health2000 study. Less than 0.1 percent of the observed SNPs were missing in the original data. For the purposes of the simulation study, we impute the missing values by drawing the missing alleles from the marginal allele distributions of the corresponding SNPs. Further details about the characteristics of the data can be found from previously published studies utilizing the data [20, 21]. We carry out experiments with two types of data sets: smaller data sets consisting of 460 SNPs are used in replicate experiments to investigate the average performance under various biological scenarios, and larger whole chromosome data sets consisting of approximately 8,700 SNPs are used as examples of the performance in a computationally more challenging scenario.
 1.
We specify genotype relative risk (GRR), minor allele frequency (MAF) for the causal SNPs, and a generative model for the disease status (see below).
 2.
We randomly select from among the 500,000 original loci four causal SNPs from different chromosomes whose empirical allele frequencies (calculated from all 2131 individuals) match closely (∓1%) the specified MAF. For each of the selected causal SNPs, we select the genomic area surrounding the causal SNP such that 20 closest flanking SNPs having MAF > 0.1 on both sides of the SNP are included in the area. The restriction to SNPs with MAF > 0.1 reflects the ascertainment bias, and is similar to the simulations in [4]. The true causal SNP is then excluded from the data for the corresponding area, which mimics the situation where causal variants are linked to the genotyped loci but their exact locations remain hidden in a study. This procedure yields 4 genomic intervals, each of length 40 SNPs, such that the causal SNPs are located in the centers of the intervals (but not included in the observed data). These intervals represent disease associated genomic areas. We note that the linkage between the causal SNP and the neighboring SNPs is not constant in the resulting data sets; however, the results are averaged over in this respect in the simulations.
 3.
We randomly select three intervals of length 100 SNPs from chromosomes not harboring any of the selected causal SNPs. These intervals represent genomic areas not associated with the disease.
 4.
We concatenate the genomic intervals to a multilocus genotype sequence of length 460 SNPs such that the intervals from Step 3 are inserted between the disease associated intervals from Step 2.
 5.
Based on the four causal SNPs and the specified disease model, we generate a disease status for each of the 2131 individuals in our real genotype data using their observed genotypes. We select the values of the parameters in the disease models to get a prevalence of 40 percent (for details see below). This leads to approximately 850 observed cases in a single data set.
 6.
We select randomly the controls from the remaining set of appr. 1280 individuals to obtain a data set with an equal number of cases and controls. The final data consist of the genotypes of the selected cases and controls at the 460 SNPs and their simulated disease statuses according to the generating model.
where β_{0} is baseline risk chosen such that the resulting prevalence meets the value specified in Step 5 above. According to the first generative model, the risk of having the disease increases multiplicatively (i.e. additively on the log scale) by (GRR  1) * 100% whenever any of the loci involved has at least one diseaserelated allele. According to the second model we have two pairs of interacting loci, (j_{1}, j_{2}) and (j_{3}, j_{4}), such that the risk of having the disease is increased by (GRR  1) * 100% if both loci in either pair have at least one risk allele, and the risk is multiplicative across the locus pairs. The third generative model requires simultaneously that ${Z}_{{j}_{1}}>0$, ${Z}_{{j}_{2}}=0$ and ${Z}_{{j}_{3}}>0$, before the risk increases. This increase is then multiplicative with the increase caused by j_{4} alone. The generative models will be referred to as "multiplicative", "threshold" and "triplet", respectively.
In the simulation setup, we use values: GRR = (1.3, 1.6, 2.0) and MAF = (0.05, 0.1, 0.2) for the causal SNPs. For each of these MAF values, the original data set includes more than 23,000 SNPs to choose from. The simulation setup thus leads to 27 different parameter settings (9 for each of multiplicative, threshold and triplet generative models) and 100 replicate data sets are generated for each setting.
In addition to the 2,700 data sets of size 460 SNPs generated in the way described above, we simulate two whole chromosome data sets which include approximately 8,700 SNPs each. These data sets are generated using the threshold model with values GRR = (1.6, 2.0) and MAF = 0.2. The disease associated genomic areas and disease statuses are generated exactly as before, except that chromosome 21 is excluded as a possible origin for any of the causal SNPs. The areas not associated with the disease are created by dividing all SNPs in the 21st chromosome into five intervals of approximately the same size, and the complete data sets are obtained by inserting the disease associated genomic areas between these intervals.
Specification of the search parameters
As the initial model for the search, we used the empty model which includes no SNPs. To fully specify the search algorithm, K and N_{ iter }must be set, where K is the number of elementary models whose combinations define the search space and N_{ iter }is the number of iterations in the stochastic search algorithm. These parameters were specified and the convergence of the search was monitored in the different simulations as follows:

In the analyses of the data sets with 460 SNPs we used K = 5, 000 and N_{ iter }= 200, 000. The convergence of the search algorithm is investigated by manually inspecting the marginal likelihood trace plots for approximately ten different data sets and the convergence was always reached within the first 20,000 iterations. The same values were used in the search for all the data sets.

In the analyses of the two whole chromosome data sets we used K = 50, 000 and N_{ iter }= 3, 500, 000. The convergence of the search was investigated by manually inspecting the trace plot after each 500, 000 iterations. For both data sets, the convergence was reached during the first set of 500,000 iterations. Further, the highest scoring model did not change after the first 500, 000 iterations in either of the analyses.
Although the likelihood (8) is based on a prospective model (see the section Bayesian multilocus association model), we can utilize the prior knowledge that the numbers of cases and controls in a data set are equal by selecting the hyperparameter α in the distribution (9) to reflect this information. This hyperparameter specifies the distribution of the disease probability parameters, and specifically, by selecting α > 1 we give more weight to disease probabilities close to 0.5. In the supplementary material of their article, Marchini et al. interpret the hyperparameter in terms of oddsratios, and conclude that hyperparameter values larger than unity are better in line with oddsratios one would expect from typical diseases than values less than unity [5]. If not stated otherwise, we used α equal to 3 in our analyses. However, we note that this choice is still fairly noninformative, and the results obtained were practically identical when α = 1 was used (see the next section). Furthermore, in the replicate simulations we used the value 1/460 for the structure parameter ξ in (11). In the investigation of the sensitivity of the inferences with respect to the priors, we considered also values α = 1, 3, 10 and ξ = 1/46, 1/460, 1/4600. In the whole chromosome analyses we used ξ = 1/1000.
Results
Simulations

Detection of causal areas improves with increasing GRR and MAF. Especially under the threshold and triplet simulations, if the conditions GRR ≥ 1.6 and MAF ≥ 0.1 are not satisfied, no method is clearly better than the baseline.

There is considerable variability between the curves for different data sets, as can be seen from the wide 95% intervals for the curves. This is expected as the data sets are based on subsets of real genotype data and may exhibit different levels of linkage between the causal and neighboring SNPs.

The performance of the methods is highest with multiplicative data sets and lowest with triplet data sets.

The mean curve for the BMA is consistently above the pvalue curve in triplet and threshold simulations. In multiplicative simulations the curve for the pvalue is above the curve for the BMA, except when MAF = 0.05 and GRR = 1.3 or 1.6. We note that when the signal gets stronger, even the marginal pvalue is able to identify interacting SNPs without main effects, because such SNPs show some effect also marginally when averaged over the other SNP.
For clarity, the ROC curves for the GxG pvalue and GxG BF methods are excluded from Figures 2, 3 and 4. Usually these curves reside between the pvalue and BMA curves (exact results not shown).

When AUC is considered, BMA is significantly better than any other method in triplet simulations, and, in threshold simulations, only the GxG BF method is competitive with the BMA method. On the other hand, in multiplicative simulations, especially when the signal is strongest (upper right corner), BMA gets lower AUC values than the alternatives.

When the location accuracy is considered, BMA has consistently better or approximately equal performance compared to the alternatives. In multiplicative data sets the preference for the BMA is strongest.
The most striking feature in the results is the fact that the BMA method is clearly inferior in terms of AUC when the data sets are generated according to the multiplicative model and the signal is strongest, and, at the same time, clearly superior to the other methods with multiplicative data sets when measured in terms of location accuracy. The relatively low performance of BMA in terms of AUC for this setting can be explained as follows. When the signal was strongest and the risk increased multiplicatively over the causal loci, it happened with some few data sets that BMA did not show any signal to one of the four causal loci. Consequently, many false positives needed to be included before all four disease associated areas could be appropriately detected. This is also visible in Figure 2, where, for example, in the panel corresponding to GRR = 2.0 and MAF = 0.2 the mean ROC curves for BMA and pvalue are overlapping up to three detected causal areas, but the 97.5th percentile deviates strongly from the mean curves at the level of four detected disease associated regions. The reason why models with four causal SNPs included sometimes get lower posterior probabilities than models with three causal SNPs can be explained by considering the generating multiplicative model. When the increase in risk is maximal, the risk is already considerably high after including any three SNPs in the model, and, consequently, only a minor increase in risk is left to be explained by the fourth causal SNP. Thus, the benefit from adding this particular SNP to the model will not always compensate the penalty resulting from the corresponding increase in the number of parameters. On the other hand, the location accuracy criterion only compares the location accuracy of the highest ranking SNP in the data set, and is therefore unaffected by this phenomenon.
SNP posterior odds summaries
Type  GRR  MAF  Max Score  Accuracy 

Multiplicative  1.3  0.05  [6.1,1.6]  0.57 
1.3  0.1  [4.6,7.0]  0.89  
1.3  0.2  [2.0,11.3]  0.96  
1.6  0.05  [6.0,6.8]  0.88  
1.6  0.1  [0.5,44.9]  0.96  
1.6  0.2  [0.6,71.9]  1.00  
2.0  0.05  [0.6,23.9]  0.95  
2.0  0.1  [1.2,77.6]  1.00  
2.0  0.2  [3.8,129.4]  0.99  
Threshold  1.3  0.05  [6.8,1.0]  0.38 
1.3  0.1  [6.4,0.6]  0.34  
1.3  0.2  [6.2,0.2]  0.55  
1.6  0.05  [7.0,1.3]  0.34  
1.6  0.1  [7.4,0.3]  0.38  
1.6  0.2  [3.4,16.3]  0.96  
2.0  0.05  [6.3,0.6]  0.40  
2.0  0.1  [6.1,8.1]  0.80  
2.0  0.2  [0.7,43.5]  1.00  
Triplet  1.3  0.05  [6.2,1.3]  0.28 
1.3  0.1  [6.3,1.1]  0.41  
1.3  0.2  [7.7,0.5]  0.47  
1.6  0.05  [6.9,0.6]  0.41  
1.6  0.1  [9.9,0.6]  0.59  
1.6  0.2  [5.4,4.6]  0.85  
2.0  0.05  [7.2,1.6]  0.43  
2.0  0.1  [5.5,11.7]  0.80  
2.0  0.2  [2.7,22.8]  0.95 
Time intensity of the methods
Obviously, calculation of the marginal pvalues is the optimal approach in terms of time intensity as the time required is linear with the number of SNPs L in a data set, whereas going through all gene pairs for obtaining the GxG pvalues takes a time proportional to $\left({}_{2}^{L}\right)$. The calculation of GxG BF scores takes approximately one third of the time required for the calculation of GxG pvalue scores and this ratio does not depend on the characteristics of simulated data sets. Most of the calculation time for GxG pvalues was consumed by the fitting of the logistic regression model, for which purpose we used glmfit from the Matlab statistics toolbox. The numerical fitting takes considerably longer time than the analytical evaluation of the GxG Bayes factors. Notice also that GxG Bayes factors are based on the average of three models as opposed to the single model in the GxG pvalue. The time consumed by the stochastic search algorithm for BMA depends on the number of elementary models K, whose combinations define the search space, and the number of iterations of the search algorithm. In our analyses of the chromosomewide data the calculation of all GxG Bayes factors took about 12 hours on a single desktop computer, while the stochastic search required only about 15 minutes when GRR = 1.6 and 50 minutes when GRR = 2.0. This difference in the times is a consequence of the fact that when the signal is strong the algorithm visits higherorder models more often and the evaluations of such models take longer.
Sensitivity to prior
Discussion
Rationale of the multilocus modeling
Our primary target was to develop a modelaveraging approach in which genegene interactions are explicitly considered. Another possibility for implementing this would be to consider enumeratively all singleSNP and GxG models as in our approach, while doing the model averaging over all these models. We considered this alternative at an initial stage of our method development; however, such an approach was discovered to generally suffer from a specific deficiency. To illustrate this, suppose for example that there exist two separate underlying interactions for a particular data set, i.e. corresponding to four causal SNPs in total. Then it is likely that the individual GxG models for each of these interactions are assigned high scores relative to the null model. Nevertheless, due to stochasticity of the genotype counts, it may also easily happen that one of the GxG models is assigned a clearly higher score than the other. Because this model includes only one of the interactions, and excludes the other, evidence against the other interaction is obtained. As a consequence of this, the interaction associated with the lower score will get a very low posterior weight when the averaging over all models is performed (for an illustration, compare panels b and c in Figure 10). To resolve this issue, it is necessary to have the ability to include both interactions simultaneously in a model. This insight from the initial investigations led us to propose our final approach based on the combined models. An additional benefit of the combined models is that they can represent even higherorder interactions, however, this comes with the expense of an increase in the number of redundant parameters, which is likely to reduce the applicability when the number of SNPs involved in the combination is large.
Summary of the results
To compare our novel approach with some standard approaches we carried out a comprehensive simulation study. The simulation study was particularly challenging as the causal variants were not included among observations in the simulated data sets. In the simulations, three types of causal SNPs were considered, 1) SNPs with main effects, 2) SNPs without main effects but with a pairwise interaction effect, and 3) SNPs included in a threeway interaction without pairwise effects. The results show that even if multilocus association findings may lack statistical significance under stringent criteria for the posterior odds score, the calculated relative scores still often correctly highlight the disease associated genomic areas.
The results concerning the detection rate versus false positive rate can be summarized as follows.

When the causal SNPs had main effects, the BMA did not provide improvement over the other methods. On the contrary, some signals detected by the other methods (pvalue, GxG pvalue, GxG BF) investigating SNPs or SNPpairs marginally went undetected in some of the simulations where the causal SNPs had main effects.

When causal SNPs with twoway interaction effects were considered, all methods considering GxG interactions (BMA, GxG pvalue, GxG BF) yielded more satisfactory results than the marginal pvalue.

When causal models with threeway interactions were considered the BMA showed better performance than any of the alternatives.
The final aspect in the list above suggests that when higherorder interactions are present in data, taking them into account in the model may improve their detection. However, we further note that as the effect sizes got larger, even the simplest model, the marginal pvalue, was able to identify most of the causal areas, even if the causal SNPs did not have main effects. The improved localization of the BMA method was most clearly seen when causal SNPs had main effects.
In general, the GxG Bayes factor seemed more competitive than the GxG pvalue when compared with the BMA method. This may be partly explained by the fact that the GxG pvalue handles only a single interaction model at a time, whereas the GxG BF considers an average of three different models.
Especially in the simulations with underlying threeway interactions, the model corresponding to the GxG pvalue may be too limited to appropriately fit the complex interaction model, and consequently, a model with more parameters might perform better. On the other hand, increasing the complexity of the model would decrease the performance of the GxG pvalue in the simpler simulation settings due to additional redundant parameters.
Issues in the Bayesian model averaging
The bottleneck in terms of computational complexity in our approach is the enumeration of all locus pairs once and evaluating the posterior probabilities of the corresponding elementary models. After this has been done, the stochastic search increases only modestly the total time required. The O(L^{2}) time complexity seems unavoidable for any method explicitly considering genegene interactions. Although the enumeration of all locus pairs is feasible with present day cluster computers even on the scale of GWA studies, a straightforward enhancement in terms of time intensity is to carry out a prescreening of SNPs using e.g. a marginal pvalue test with some liberal significance threshold, say, equal to 0.1. It is shown in [2] that such an approach leads to approximately equal power in identification of genegene interactions compared to the exhaustive enumeration of all gene pairs.
According to our experiences, the sensitivity with respect to the prior on the model structure may represent the primary obstacle to be appropriately handled when Bayesian model averaging type methods are applied to data sets with a very large number of SNPs. One solution provided by the Bayesian approach itself is that, instead of uniformly decreasing the value of the structure parameter ξ (penalty from adding another SNP to the model) as the number of SNPs in a data set increases, we could assign different ξ_{ i } parameters to different SNPs based on external knowledge about the particular loci [11]. For example, ξ_{ i } might be set equal to 0.01, 0.001 or 0.0001 depending on whether a SNP belongs to a gene whose function is expected to be related with the disease, any gene at all, or far away from any known gene, respectively. There does not seem to be a simple way for including continuous covariates in our model if one wishes to preserve the ability to analytically integrate out the model parameters, which constitutes the basis of efficient computation. On the other hand, including categorical covariates (such as sex, or age after some appropriate discretization) in our model is straightforward in principle, by treating them similarly as the observed genotypes. However, in practice the increase in the number of parameters may overwhelm the benefits. A possible solution might be to average over models such that the covariates are in turn either included or excluded. Finding an optimal way of doing this is subject to future research.
Conclusions
We have considered the problem of identifying disease associated marker loci when several SNPs have a joint effect on the disease probabilities. We have introduced a novel Bayesian model averaging approach, whose advantages include explicit consideration of the GxG interactions, ability to describe higherorder interactions, and the ability to evaluate the marginal likelihood analytically enabling efficient computation. Our approximate model averaging algorithm makes it possible to include GxG interactions in the analysis even with large data sets. Furthermore, it would be fairly straightforward to modify the algorithm for learning with other model families than the one considered in this article, for example regression models commonly utilized in genetic association studies. Such a generalization would require that the model parameters can be integrated out either analytically or approximately using e.g. the Laplace approximation (see the supplementary material of [5]).
To conclude, our simulations confirm that an appropriate approach to initializing a GWA screening is to investigate marginally each SNP, either by pvalues or corresponding Bayes factors (see, e.g. [11]), as this is the computationally most straightforward and fastest approach, and, in many cases, capable of finding the signals present in data. The relevance of the more complex modeling approaches including GxG interactions is that they may help to detect some causal SNPs which are not visible marginally. Thus, in our opinion, using different approaches sidebyside may provide a more detailed description of the data and aid in finding the missing heritability in complex diseases [23].
Declarations
Acknowledgements
The authors thank Markus Perola, MD, PhD, Institute of Molecular Medicine, Finland (FIMM), for the access to the genotype data on which the simulations were based.
This work was funded by a grant from the Academy of Finland to the Pubgensens project and ERC grant no. 239784.
Authors’ Affiliations
References
 Balding DJ: A tutorial on statistical methods for population association studies. Nature Reviews Genetics 2006, 7: 781–791. 10.1038/nrg1916View ArticlePubMedGoogle Scholar
 Marchini J, Donnelly P, Cardon LR: Genomewide strategies for detecting multiple loci that influence complex diseases. Nature Genetics 2005, 37: 413–417. 10.1038/ng1537View ArticlePubMedGoogle Scholar
 Koopenberg C, Ruczinski I: Indentifying interacting SNPs using Monte Carlo logic regression. Genetic Epidemiology 2005, 28: 157–170. 10.1002/gepi.20042View ArticleGoogle Scholar
 Verzilli CJ, Stallard N, Whittaker JC: Bayesian Graphical Models for Genomewide Association Studies. The American Journal of Human Genetics 2006, 79: 100–112. 10.1086/505313View ArticlePubMedGoogle Scholar
 Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint method for genomewide association studies by imputation of genotypes. Nature Genetics 2007, 39: 906–913. 10.1038/ng2088View ArticlePubMedGoogle Scholar
 Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB: Detection of gene × gene interactions in genomewide association studies of human population data. Human Heredity 2007, 63: 67–84. 10.1159/000099179View ArticlePubMedGoogle Scholar
 Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ: Simultaneous analysis of all SNPs in genomewide and resequencing association studies. PLoS Genetics 2008, 4: e10001130. 10.1371/journal.pgen.1000130View ArticleGoogle Scholar
 Foraita R, Bammann K, Pigeot I: Modeling genegene interactions using graphical chain models. Human Heredity 2008, 65: 47–56. 10.1159/000106061View ArticlePubMedGoogle Scholar
 Lee SH, van der Werf JHJ, Hayes BJ, Goddard ME, Visscher PM: Predicting unobserved phenotypes for complex traits from wholegenome SNP data. PLoS Genetics 2008, 4: e1000231. 10.1371/journal.pgen.1000231View ArticlePubMedPubMed CentralGoogle Scholar
 Bernardo JS, Smith AFM: Bayesian Theory. Chichester: Wiley; 1994.View ArticleGoogle Scholar
 Stephens M, Balding DJ: Bayesian statistical methods for genetic association studies. Nature Reviews Genetics 2009, 10: 681–690. 10.1038/nrg2615View ArticlePubMedGoogle Scholar
 Corander J, Gyllenberg M, Koski T: Bayesian model learning based on a parallel MCMC strategy. Statistics and Computing 2006, 16: 355–362. 10.1007/s112220069391yView ArticleGoogle Scholar
 Marttinen P, Corander J, Törönen P, Holm L: Bayesian search of functionally divergent protein subgroups and their function specific residues. Bioinformatics 2006, 22(20):2466–2474. 10.1093/bioinformatics/btl411View ArticlePubMedGoogle Scholar
 Marttinen P, Myllykangas S, Corander J: Bayesian clustering and feature selection for cancer tissue samples. BMC bioinformatics 2009, 10: 90. 10.1186/147121051090View ArticlePubMedPubMed CentralGoogle Scholar
 Clayton D, Hills M: Statistical models in epidemiology. Oxford: Oxford University Press; 1993.Google Scholar
 Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. 2nd edition. Boca Raton: Chapman & Hall/CRC; 2004.Google Scholar
 Corander J, Ekdahl M, Koski T: Parallel interacting MCMC for learning of topologies of graphical models. Data Mining and Knowledge Discovery 2008, 17: 431–456. 10.1007/s1061800800999View ArticleGoogle Scholar
 Schervish M: Theory of statistics. New York: Springer; 1995.View ArticleGoogle Scholar
 Kass R, Raftery AE: Bayes factors. Journal of American Statistical Association 1995, 90: 773–795. 10.2307/2291091View ArticleGoogle Scholar
 Jakkula E, Rehnström K, Varilo T, Pietiläinen O, Paunio T, Pedersen N, deFaire U, Järvelin M, Saharinen J, Freimer N, et al.: The genomewide patterns of variation expose significant substructure in a founder population. The American Journal of Human Genetics 2008, 83(6):787–794. 10.1016/j.ajhg.2008.11.005View ArticlePubMedGoogle Scholar
 Perttilä J, Merikanto K, Naukkarinen J, Surakka I, Martin N, Tanhuanpää K, Grimard V, Taskinen M, Thiele C, Salomaa V, et al.: OSBPL10, a novel candidate gene for high triglyceride trait in dyslipidemic Finnish subjects, regulates cellular lipid metabolism. Journal of Molecular Medicine 2009, 87: 825–35. 10.1007/s001090090490zView ArticlePubMedPubMed CentralGoogle Scholar
 Fawcett T: An introduction to ROC analysis. Pattern Recognition Letters 2006, 27: 861–874. 10.1016/j.patrec.2005.10.010View ArticleGoogle Scholar
 Manolio T, Collins F, Cox N, Goldstein D, Hindorff L, Hunter D, McCarthy M, Ramos E, Cardon L, Chakravarti A, et al.: Finding the missing heritability of complex diseases. Nature 2009, 461: 747–753. 10.1038/nature08494View ArticlePubMedPubMed CentralGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.