 Methodology article
 Open Access
 Published:
MegaSNPHunter: a learning approach to detect disease predisposition SNPs and high level interactions in genome wide association study
BMC Bioinformatics volume 10, Article number: 13 (2009)
Abstract
Background
The interactions of multiple single nucleotide polymorphisms (SNPs) are highly hypothesized to affect an individual's susceptibility to complex diseases. Although many works have been done to identify and quantify the importance of multiSNP interactions, few of them could handle the genome wide data due to the combinatorial explosive search space and the difficulty to statistically evaluate the highorder interactions given limited samples.
Results
Three comparative experiments are designed to evaluate the performance of MegaSNPHunter. The first experiment uses synthetic data generated on the basis of epistasis models. The second one uses a genome wide study on Parkinson disease (data acquired by using Illumina HumanHap300 SNP chips). The third one chooses the rheumatoid arthritis study from Wellcome Trust Case Control Consortium (WTCCC) using Affymetrix GeneChip 500K Mapping Array Set. MegaSNPHunter outperforms the best solution in this area and reports many potential interactions for the two real studies.
Conclusion
The experimental results on both synthetic data and two real data sets demonstrate that our proposed approach outperforms the best solution that is currently available in handling largescale SNP data both in terms of speed and in terms of detection of potential interactions that were not identified before. To our knowledge, MegaSNPHunter is the first approach that is capable of identifying the diseaseassociated SNP interactions from WTCCC studies and is promising for practical disease prognosis.
Background
Single nucleotide polymorphisms (SNPs) are single nucleotide variations of DNA base pairs. Researchers often use SNPs as genetic markers in disease studies. It has been well established in the field that SNP profiles characterize a variety of diseases. By investigating SNP profiles associated with a disease trait, researchers would be able to reveal relevant genes. However, in many complex diseases, SNPs have shown little penetrance individually; on the other hand, their interactions are suspected to possess stronger associations with complex diseases. Some SNPs, which have no direct impact on health, may be linked to nearby genes which do have effects. Researchers hypothesize that many common diseases in humans are not caused by one genetic variation within a single gene, but are determined by complex interactions among multiple genes. Since the sheer volume of data generated by SNP studies is difficult to be manually analyzed, an efficient computational model is required to detect or indicate which pattern is most likely associated with the disease. Then, it will just be a matter of time before physicians can screen individuals for susceptibility to a disease by analyzing their DNA samples for specific SNP patterns, and further design some experiments to target the genes that implicate the disease.
Recently, many methods have been proposed to identify SNP interaction patterns associated with diseases. To name a few studies, BEAM [1] designed a Bayesian marker partition model and used MCMC sampling strategy to estimate the model parameters; MDR [2] applied an exhaustive search model to evaluate all possible multiSNP interactions under some given thresholds; the penalized regression [3] used a variant of logistic regression model with quadratic penalization; CPM [4] used a combinatorial partitioning method for finding the interacted SNPs; RPM [5] extended CPM by using some heuristics to reduce the search space; Monte Carlo Logic Regression [6] combined the logic regression and MCMC in searching the SNP interactions; BGTA [7] proposed a screening algorithm to repeatedly evaluate a large number of randomly generated marker subsets. HapForest [8] used a forestbased approach to identifying haplotypehaplotype interactions. Although these methods perform well on small data sets, most of them (except BEAM) are unable to efficiently detect the multiSNP interactions in genome wide association study.
BEAM has successfully demonstrated its capability of handling large data sets using synthetic data. When the authors applied BEAM to an AMD (agedrelated macular degeneration) study [9], however, BEAM did not report any interactions. One possible reason is that the number of samples is not sufficient to detect the statistically significant interactions. Another possible reason is that BEAM treats local SNP interactions (haplotype effect) equally with global gene interactions during MCMC sampling, which could miss some critical haplotype effects in a genome wide association study because haplotype effects generally appear more frequently than global gene interactions.
Given a genome wide association study with thousands of SNPs and a limited number of samples, it is difficult to detect and evaluate the multiSNP interactions in a traditional statistic manner. The feasible solution is to first find a small set of relatively more relevant SNPs and then evaluate the interactions within it. This procedure was applied in HapForest [8] to infer the haplotypehaplotype interaction.
However, the typical feature selection models, which use univariate ranking on feature importance and arbitrary threshold to select relevant features, cannot be applied because they will filter out those SNPs that have weak marginal effects, while their joint behavior may significantly contribute to disease traits. In this paper, we introduce an alternative learning approach (MegaSNPHunter) to hierarchically rank the multiSNP interactions from local genomic regions to global genome. MegaSNPHunter takes casecontrol genotype data as input and produces a ranked list of multiSNP interactions. In particular, the whole genome is first partitioned into multiple short subgenomes and each subgenome covers the genomic area of possible haplotype effects in practical. For each subgenome, MegaSNPHunter builds a boosting tree classifier based on multiSNP interactions and measures the importance of SNPs one the basis of their contributions in the classifier. The method keeps relatively more important SNPs from all subgenomes and let them compete with each other in the same way at the next level. The competition terminates when the number of selected SNPs is less than the size of a subgenome. At the last step, MegaSNPHunter extracts and reports the valuable multiSNP interactions.
Results
The performance of MegaSNPHunter is evaluated through comparative studies with existing work. The goal of MegaSNPHunter is to discover the multiSNP interactions from genome wide studies. Among many recently proposed methods, BEAM is the best one which could handle the large scale data set and finish in a reasonable time. Therefore, we mainly compare our method with BEAM in this paper using synthetic data generated on the basis of epistasis models and the data sets from two real studies on complex diseases. In the experiments on two real studies, one uses a genome wide study on Parkinson disease (data acquired by using Illumina HumanHap300 SNP chips [10]). The other experiment chooses the rheumatoid arthritis study [11] from Wellcome Trust Case Control Consortium (WTCCC) using Affymetrix GeneChip 500K Mapping Array Set. In our experiments, a SNP marker can take one of the following four states: 0 (missing), 1 (coding for the homozygous reference), 2 (heterozygous), and 3 (homozygous variant). The class label is either 0 (control) or 1 (case).
Experiment on Simulation study
Simulation studies are developed to validate the performance of our approach in correctly determining the associated SNPs defined by an epistatic model. To make the fair comparison, we use the simulation program provided in BEAM package and follow the same procedure in [1] to generate the data based on two epistatic models (additive effect and multiplicative effect). For each model, we choose 12 settings (readers may refer [1] for details) and for each setting, we generate 30 data sets, and each data set includes 1000 SNPs and contains 2000 samples (1000 cases and 1000 controls). The performances of both MegaSNPHunter and BEAM are illustrated in Figure 1. In most settings, MegaSNPHunter performs the same or slightly better than BEAM.
Ideally, the results on the genome wide simulation would be more convincing but such a simulation is computationally expensive. In general, the goal of simulation study is to provide the evidence for validity of our approach. In practice, the real data is very complex and the SNP interactions in the real data may not match any epistatic model. Therefore, our approach does not assume any epistatic model. We believe the most effective criterion for judging the epistatic interaction is that the joint effect is much more significant than the marginal effects of individual SNPs. The next two experiments would show the effectiveness of our approach on the real data.
Experiment on Parkinson study
Parkinson disease is a chronic neurodegenerative disease with a cumulative prevalence of greater than 0.1 percent. The primary symptoms of Parkinson's disease include tremors, rigidity, slow movement, poor balance, and difficulty walking. In this experiment, we choose the study in [10] which provides around 396,000 genotypes in 541 samples. Both BEAM and MegaSNPHunter are tested on this data set. BEAM could not identify any interaction while our MegaSNPHunter selected 7 significant SNP interactions.
MegaSNPHunter is first run on each chromosome with 10 fold cross validation. Cross validation is a model evaluation method that estimates how well the model built from some training data is going to perform on unseen data. The 10 fold cross validation is conducted every time when the boosting tree classifier is built in the whole hierarchical procedure. In our test, the samples are randomly sampled into 10 subsets and each validation uses 9 subsets to train the model and the left one to test the performance. The output from every validation is a classifier and a list of ranked SNPs.
After 10 validations are finished, a post process is invoked to isolate those SNPs whose genotype association χ^{2}P values reach a critical value (default is 0.05), and those SNPs whose interaction's genotype association χ^{2}P values are above a critical value (default is 0.0025). The top ranked SNPs among the selected 302 SNPs are reported in Table 1 with genotype association χ^{2}P values. The selected interactions with genotype association χ^{2}P values are reported in Table 2. To handle the multiple test issue, we conduct an extra permutationbased test (chromosome level) on both single SNP and SNP interactions to correct P values.
We observe that among 12 SNPs involved in the selected interactions in Table 2, only three of them (rs 13032261, rs 7924316 and rs 2235616) have noticeable marginal effects in Table 1. For the other 9 SNPs, their joint effects are much more significant than the corresponding individual SNP effects. Figure 2 shows the genotype distribution of two SNPs (rs 7172832 and rs 906428) and the genotype distribution under the interaction. Figure 3 displays the same information for the interaction between rs 1505376 and rs 3861561. These figures clearly illustrate how the two weak SNPs significantly affect disease traits (the first interaction is not in this case because the marginal effect of rs 2235617 is already significant).
Experiment on rheumatoid arthritis study
The Wellcome Trust Case Control Consortium (WTCCC) is a collaboration of many British research groups. To date, the WTCCC has examined the genetic signals of seven common human diseases: rheumatoid arthritis, hypertension, Crohn's disease, coronary artery disease, bipolar disorder, and type 1 and type 2 diabetes. The rheumatoid arthritis study [11] contains around 500 K genotypes in 3503 samples (1999 cases and 1504 controls). We use the same procedure mentioned above to conduct the experiment. The top ranked SNPs among the selected 213 SNPs are reported in Table 3 with genotype association χ^{2}P values. The selected interactions with genotype association χ^{2}P values are reported in Table 4. The top interaction identified in MegaSNPHunter is between rs 4418931 and rs 4523817. Its genotype association χ^{2}P value is 6.83 * 10^{15}. The genotype distribution of cases and controls for these two SNPs and the distribution under their interaction are plotted in Figure 4.
Both rs 4418931 and rs 4523817 are located on the gene GPC 6, which is a member of the glypican gene family and encodes a product structurally related to GPC 4 [12]. In a latest study of rheumatoid arthritis [13], GPC 4 displays strong expression. The connection between our finding and previous work may imply a complex rheumatoid arthritis associated pattern. More evidences from biological aspect are under investigation. Again, BEAM could not report any significant interaction. The reason that BEAM could not report any interaction is partly because the data from the real studies are too complex to be formulated by one Bayesian marker partition model and the distribution assumptions in BEAM may not be true for the real data. The results from both experiments on real data sets empirically justify that our method performs better than BEAM with respect to finding SNP interactions in genome wide association studies.
Running time comparison
Another attracting point of our MegaSNPHunter is that it runs faster than BEAM. Suppose the number of SNPs in each subgenome is W, the number of SNPs is M, and the number of samples is N. Then the number of subgenomes is $\frac{M}{W}$ + 1. The time for training one boosting tree classifier using one subgenome is O(W · N · log(N)). Then the time for learning at the first level is O(M · N · log(N)). The expected number of SNPs at the second level is $\frac{M}{2}$, and $\frac{M}{{2}^{d1}}$ at the d_{ th }level. Then the time for the learning at the d_{ th }level is O($\frac{M}{{2}^{d1}}$ · N · log(N)). The total running time is O(M · (1 + $\frac{1}{2}$ + ⋯ + $\frac{1}{{2}^{2d1}}$) · N · log(N)) that is equivalent to O(M · N · log(N)). It approximates to 6.20 * 10^{9} for the rheumatoid arthritis study, which is much less than the complexity O(I * N) (around 3.5 * 10^{11}) of BEAM, where I is the number of iterations in MCMC sampling and is set to 10^{8}as default value for a data set with medium size (i.e. around 400, 000 SNPs). Theoretically, I is determined by O(M * N^{d}) with d denoting the number of interacting SNPs (i.e. interaction depth).
Discrimination ability on real data sets
As for the discrimination power of MegaSNPHunter, Table 5 and Table 6 report the prediction accuracies for both experiments on real data sets. They also report the prediction accuracies for each chromosome based on selected SNPs and the prediction accuracies from randomized tests for comparison. The randomized tests randomly select the same number of SNPs as our method has selected for each chromosome and the whole genome, and collect the prediction accuracies using 10fold CV. The reported accuracies for randomized tests are the averages of 50 runs. In both tables, we observe that the randomly selected SNPs from both real data sets can only achieve around 50% prediction accuracy on average. We realize that there are many false positives in selected SNPs because MegaSNPHunter can achieve good performance on every chromosome. How to reduce the false positive error is a challenging problem in genome wide association studies. Although our method does not directly address this issue, nevertheless our method is able to reduce the number of possibly diseaseassociated SNPs and rank those SNPs based on their relevances to the disease trait. Extra filters can be applied to remove false positives.
The parameter setting of MegaSNPHunter
There are four main parameters in the models, including the depth of trees, the threshold for selecting SNPs from trees, the subgenome size and the overlap between subgenome.

1.
The depth of trees indicates the depth of SNP interaction. Since most significant interactions are depth 2, so as long as the depth of trees is above 2, the results would not be changed. MegaSNPHunter uses 5 as default setting.

2.
The size of subgenome depends on the density of SNP data. Each subgenome should cover the genomic area of possible haplotype effects in practical. Before we start the experiment, we collect some statistics on how many SNPs are genotyped for one gene. This number will be used as the size of subgenome.

3.
The overlap between subgenomes is used to solve the boundary problem between genes. Half of the size of subgenome is the best choice. Both the size of subgenome and the overlap between subgenomes depend on the priori knowledge on epistatic interactions.

4.
The threshold for selecting SNPs from trees is a very critical parameter to the method. Our goal is to find interactions among SNPs with weak marginal effects. If the threshold is too stringent, then too many SNPs will be filtered out, while the loose threshold will allow too many SNPs to be selected. In our method, two strategies are applied to deal with this issue.

The first strategy is to select all SNPs involved in the classifier. This is usually used in the situation where most SNPs are clearly irrelevant with diseases. However, in the worst case, the classifier may use all SNPs in training. If too many SNPs are selected in the classifier, the second strategy will be applied.

The second strategy uses a threshold to select relevant SNPs. This threshold is the critical value of χ^{2} statistic. The default setting for single SNP is 0.05, 0.05*0.05 for a pair of interacted SNPs, and so on so forth.
Discussion
The advantages of MegaSNPHunter
The development of MegaSNPHunter was triggered by the limitations of existing works on finding high order SNP interactions from genome wide studies. Given a genome wide study containing thousands of markers, most existing methods either fail to report the statistically significant interactions due to the limited samples, or can not terminate in a reasonable time due to the explosive search space.
MegaSNPHunter addresses these issues by hierarchically reducing the number of relevant SNPs and then extracting the interactions. MegaSNPHunter displays many advantages over the existing methods:

the hierarchical learning strategy can extract both local SNP interactions and global gene interactions in an efficient manner without exhaustive enumeration;

MegaSNPHunter uses a classifier built on SNP interactions to rank the relevances of SNPs, which is superior to the univariate feature selection techniques on finding the SNPs with weak marginal effects but significant joint effects;

MegaSNPHunter is a non parametric method. It does not assume any prior distributions as required by many parametricstatistical methods;

MegaSNPHunter does not assume any particular epistasis models, which is very important for real studies because the models of SNP interactions are unknown and likely to be very complex. Our method only assumes that the further the distance between two SNPs, the less possibility they interact with each other.

MegaSNPHunter could be applied for discrimination, where we can use the selected SNPs to build a classifier for discriminating two or more classes of samples.
The limitations of MegaSNPHunter
The big advantage of MegaSNPHunter is to find the interactions between SNPs with weak marginal effects. To handle the high dimension of genome wide data, MegaSNPHunter partitions the whole genome into multiple short subgenomes and select the relative more important SNPs from each subgenome. If the interacted SNPs are not located in the same subgenome, MegaSNPHunter requires that their marginal effects must be above the medium of marginal effects of their resided subgenomes. We think this is a soft constraint because in reality, most SNPs in the genome do not contribute to any trait variation. If either of interacted SNPs only has trivial marginal effect, it would have little chance to survive and meet its counterpart in the next level.
In the real application, MegaSNPHunter could incorporate some search strategies proposed in [14] as a preprocess to reduce the search space. These search strategies first find diseaseassociated SNPs with noticeable marginal evidence. Then an exhaustive search procedure can be applied to find interactions among them. These strategies complements our method. We could start from using them to find interactions between SNPs with strong marginal effects and next run MegaSNPHunter to find interactions between SNPs with weak marginal effects.
Future Studies
There are several issues we need to address in the future work. Since our method assumes that the strength of interaction is inversely proportional to the distance of SNPs, most findings in our results are local effects. The interactions between SNPs far in distance have already drawn many researchers' attention. We plan to develop new methods to find the global SNP interactions. An efficient sampling strategy is one possible solution. Another critical issue is how to reduce false positives. We plan to incorporate the haplotype information and pathway information to help reduce the false positive error in future study.
Conclusion
In this paper, we propose a novel hierarchical learning algorithm (MegaSNPHunter) to find high order SNP interactions in genome wide association studies. We evaluate MegaSNPHunter through comparative studies on simulated data and the data sets from two real studies including a genome wide study on Parkinson disease [10] and the rheumatoid arthritis study from WTCCC [11]. In the simulation experiment, MegaSNPHunter displays the comparable performance while in the experiments on two real studies, BEAM could not report any interaction patterns but our MegaSNPHunter identifies many interactions among SNPs whose joint effects are more significant than the individual SNP effects. In summary, the hierarchical nature of our nonparametric learning scheme enables our new method to search for interaction patterns more efficiently than existing methods. In this sense, our method is a powerful tool for whole genome data analysis.
Methods
The goal of MegaSNPHunter is to find the remarkable multiSNP interactions from large genome data to explain the observed trait variation. To handle the high dimension of genome wide data, MegaSNPHunter adopts a hierarchial learning approach that first reduces the number of relevant SNPs into a small set and then extract the multiSNP interactions. In the process of finding relevant SNPs, the whole genome is first divided into multiple short subgenomes, and the next step is to rank the importance of SNPs by building a classifier with multiSNP interactions for each subgenome. The importance of SNPs in each classifier is measured by their contributions to the classification power. The flowchart of MegaSNPHunter is illustrated in Figure 5. In the following sections, the base learner for each subgenome is introduced first. Next, the hierarchical learning algorithm is described in details. At last, a new procedure different from bruteforce search is presented to extract the multiSNP interactions from tree classifiers.
Tree Boosting Classifier
There are many popular classification models in machine learning, which could be chosen as our base learner. Among them, classification and regression tree (CART) [15] is one of the best choices because the tree based learning model has a good interpretability of feature interaction. CART recursively generates a tree model by splitting the data using selected features. It uses the GINI index to determine how well the splitting rule separates samples contained in the parent node. Once the best split is found, CART repeats the splitting process for another child node, and continues recursively until further splitting is impossible. The interaction of features is represented as a path from the root node to the leaf node in the tree. However, the treebased model is usually not stable and often sensitive to the data distribution. To increase its discrimination power, one popular solution is to use boosting [16]. Boosting is considered as one of the most powerful learning procedure that theoretically could be used to boost any weak learner (even only slightly better than a random guess), and combine a set of weak learners into a strong learner. Among all boosting models, gradient boosting of regression tree [17] is considered as a highly robust and competitive method for feature selection. It shows excellent performance even when the number of features is large and the relationship between features and class is complex. The general gradient boosting procedure [17] is listed in Algorithm 1 (shown at the end of the paper). The basic idea is to compute a sequence of regression trees, where each successive tree is built for the prediction residuals of the preceding tree. To avoid the overfitting, the size of the trees is usually fixed to some pregiven threshold. L(Y, f(X)) in Algorithm 1 is the loss function to minimize. For a twoclass classification in boosting, the loss function is the negative binomial loglikelihood defined in [17] as
L(y, f) = ∑(y_{ i }f(x_{ i })  log(1 + exp(f(x_{ i })))), y ∈ {0,1}, (1)
where f(x) is defined as
The gradient of loss function L(Y, f(X)) is derived as
The output F of this procedure is a set of regression trees that are added together to perform the classification task.
Algorithm 1 General Gradient Boosting Procedure
Initialized F to be a constant.
for t = 0 to T do
Compute the negative gradient z_{ i }= $\frac{\partial}{\partial f({x}_{i})}$L(y_{ i }, f(x_{ i }))
Fit a regression tree T(x), predicting z_{ i }
Update F as F ← F + η T(x)
end for
MegaSNPHunter
MegaSNPHunter takes casecontrol genotype SNP data as input and produces a ranked list of multiSNP interactions. To find nontrivial multiSNPs interactions in the high dimension of genome wide data, a general approach would first evaluate each SNP individually and select some top ranked ones, and then extract the multiSNP interactions in the selected SNPs. This approach falls short at finding those significant interactions among SNPs with weak marginal effects because those SNPs have high probabilities to be filtered out in the first step. Taking multiSNP interactions into account in the selection stage provides a good solution to this issue. MegaSNPHunter employs a hierarchial learning strategy. In particular, the whole genome is first divided into multiple short subgenomes and a tree boosting classifier is built on each subgenome. The built classifier consists of a collection of regression trees, where each node represents one SNP and each path in the trees indicates a possible interaction of those SNPs on the path. Given a tree boosting classifier {T_{ j }}, the importance of each SNP is measured by its classification contribution to the classification power, which is defined as
where e_{ v }is the empirical error reduction by splitting on x_{ i }using SNP S_{ i }in tree T_{ j }[18]. The average of the relative influence of SNP S_{ i }across all the trees is used to measure its importance.
Using Equation 4, MegaSNPHunter could rank the importance of SNPs in each subgenome. A cutoff threshold can be used to choose the top ones. The selected SNPs from all subgenomes will first merge together and then compete with each other in the same way at the next level. By having all SNPs compete with each other in training classifiers, MegaSNPHunter reduces the large number of relevant SNPs into a very small set. For this small set of SNPs, the multiSNP interactions could be extracted and ranked even using the bruteforce search method like MDR. Nevertheless, one critical drawback of MDR lies in the places that the search depth, which is equivalent to the order of SNP interaction, has to be limited to some certain level in order to complete the search in a reasonable time. In MegaSNPHunter, we design a new procedure to extract the high orders of multiSNP interactions without exhaustive enumeration.
Interaction Extraction
Given a small set of SNPs, it is feasible to test all possible interactions using exhaustive search. However, the number of selected SNPs from a genome wide study may still make exhaustive search of high order interactions very time consuming. Concretely, the number of possible interaction for n SNPs with maximal depth d is ${C}_{n}^{2}+{C}_{n}^{3}+\cdots +{C}_{n}^{d}$. For example, 50 SNPs with maximal depth 5 would give rise to 2,369,885 possible SNP interactions, which would go much higher even with a small increase on the number of SNPs or the maximal depth of SNP interactions. Apparently, the bruteforce search method for extracting high orders of SNP interactions is not a good choice in MegaSNPHunter. In MegaSNPHunter, the built classifier is a collection of trees in which each path represents a possible interaction among SNPs on the path. For those SNP interactions making nontrivial contribution to the traits (case or control) of samples, it is very likely that they will be included in the boosting classifier. Therefore, we could first extract all possible paths from trees and then evaluate the interactions of SNPs on each path. Given K binary trees with maximal depth d, the number of paths from root nodes to leaf nodes is K * 2^{d1}. For each length d path from the root node to the leaf node, the number of possible subpaths with length at least 2 is $\frac{(d1)(d2)}{2}$. Then the total number of possible interactions in our procedure is K * 2^{d2}* (d  1) (d  2) which is far less than ${C}_{n}^{2}+{C}_{n}^{3}+\cdots +{C}_{n}^{d}$ (n is the number of SNPs) for a bruteforce search.
After extracting all possible SNP interactions from the classifier, we rank them using the Hstatistics proposed in [18]. For two given variables (x_{ j }, x_{ k }), the H statistic is defined as
where ${\widehat{F}}_{s}$({x_{ j }}_{j∈s}) estimates the partial dependence of the classifier F on {x_{ j }}_{j∈s}, which is defined as
The partial dependence ${\widehat{F}}_{s}$({x_{ j }}_{j∈s}) is equivalent to the marginal effect of {x_{ k }}_{k ∉ s}in classifier F. Therefore, H(x_{ j }, x_{ k }) measures the fraction of partial dependence ${\widehat{F}}_{jk}$(x_{ j }, x_{ k }) not captured by ${\widehat{F}}_{j}({x}_{j})+{\widehat{F}}_{k}({x}_{k})$. The Hstatistics of high order interactions are defined in the same way as in [18].
Algorithm
To summarize, we propose the hierarchical learning algorithm 2.
Algorithm 2 MegaSNPHunter Algorithm
Given:
D:the depth of interactions
W:the subgenome size
L:The overlap size
S:SNP Data [X, Y] with class label.
while numberOfSNPs(S) > W do
SelectedSNPs ← ∅
Separate S into S_{0}, S_{1},...,S_{ m }where sizeof(S_{ i }) = W (i <m), sizeof(S_{ i }∩ S_{ j }) = O and sizeof(S_{ m }) <W
for i = 0 to m do
F_{ i }← TreeBoostingClassfier(S_{ i }, T, D)
for SNP_{ j }∈ F_{ i }do
SelectedSNPs ← SelectedSNPs + {SNP_{ j }}
end for
end for
S ← S(SelectedSNPs)
end while
F ← TreeBoostingClassfier(S, T, D).
Extract all path P_{ i }from F.
Compute Hstatistic H(P_{ i }) and Rank P_{ i }.
Function T reeBoostingClassfier([X, Y], T, D)
F ← 0
for t = 0 to T do
e_{ i }= y_{ i }$\frac{1}{1+exp(F({x}_{i}))}$, i ∈ (1, n)
Fit a D depth regression tree ${\left\{{R}_{l}\right\}}_{L}^{1}=E({\left\{{R}_{l}\right\}}_{L}^{1}{e}_{i})$
${\lambda}_{l}=\frac{{\displaystyle \sum ({y}_{i}{\tilde{y}}_{i})}}{{\displaystyle \sum {\tilde{y}}_{i}(1{\tilde{y}}_{i})}}$ where ${\tilde{y}}_{i}=\frac{1}{1+exp(F({x}_{i}))}$
F ← F + η∑λ_{ l }1(x ∈ R_{ l })
end for
return F
References
 1.
Zhang Y, Liu JS: Bayesian inference of epistatic interactions in casecontrol studies. Nature Genetics. 2007, 39 (9): 11671173. 10.1038/ng2110.
 2.
Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69: 138147. 10.1086/321276.
 3.
Park MY, Hastie T: Penalized logistic regression for detecting gene interactions. Biostatistics. 2008, 9 (1): 3050.
 4.
Nelson MR, Kardia SL, Ferrell RE, Sing CF: A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Research. 2001, 11: 458470. 10.1101/gr.172901.
 5.
Culverhouse R, Klein T, Shannon W: Detecting epistatic interactions contributing to quantitative traits. Genetic Epidemiology. 2004, 27: 141152. 10.1002/gepi.20006.
 6.
Kooperberg C, Ruczinski I: Identifying interaction SNPs using Monte Carlo logic regression. Genetic Epidemiology. 2005, 28: 157170. 10.1002/gepi.20042.
 7.
Zheng T, Wang H, Lo SH: Backward genotypetrait association (BGTA) – based dissection of complex traits in casecontrol design. Hum Hered. 2006, 62: 196212. 10.1159/000096995.
 8.
Chen X, Liu CT, Zhang M, Zhang H: A forestbased approach to identifying gene and genegene interactions. PNAS. 2007, 104: 1919919203. 10.1073/pnas.0709868104.
 9.
Robert JK, Caroline Z, Emily YC, JenYue T, Richard SS, Chad H, Alice KH, John PS, Shrikant MM, Susan TM, Michael BB, Frederick LF, Jurg O, Colin B, Josephine H: Complement factor H polymorphism in agerelated macular degeneration. Science. 2005, 308: 385389. 10.1126/science.1109557.
 10.
Fung H, Scholz S, Matarin S, SimnSnchez S, Hernandez D, Britton A, Gibbs J, Langefeld C, Stiegert M, Schymick J: Genomewide genotyping in Parkinson's disease and neurologically normal controls: first stage analysis and public release of data. Lancet Neurol. 2006, 5 (11): 911916. 10.1016/S14744422(06)705786.
 11.
The Wellcome Trust Case Control Consortium: Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447: 661678. 10.1038/nature05911.
 12.
Stephenie PS, Beth LV, Scott S: GPC6, a Novel Member of the Glypican Gene Family, Encodes a Product Structurally Related to GPC4 and Is Colocalized withGPC5on Human Chromosome 13. Genomics. 1999, 57 (3): 455458. 10.1006/geno.1999.5793.
 13.
Patterson AM, Cartwright A, David G, Fitzgerald O, Bresnihan B, Ashton BA, Middleton1 J: Differential expression of syndecans and glypicans in chronically inflamed synovium. Ann Rheum Dis. 2008, 67: 592601. 10.1136/ard.2006.063875.
 14.
Marchini J, Donnelly P, Cardon LR: Genomewide strategies for detecting multiple loci that influence complex diseases. Nat Genet. 2005, 37: 413417. 10.1038/ng1537.
 15.
Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Tree. 1984, Chapman Hall, New York
 16.
Schapire RE: Theoretical views of boosting. Computational Learning Theory: Fourth European Conference, EuroCOLT. 1999, 110.
 17.
Friedman JH, Hastie T, Tibshirani R: Additive Logistic Regression: A statistical View of Boosting. Annals of Statistics. 2000, 28 (2): 337374. 10.1214/aos/1016218223.
 18.
Friedman JH, Popescu BE: Predictive Learning via Rule Ensembles, Technical Report. 2005
Acknowledgements
The work is supported by the Research Proposal Competition and the postdoctoral fellowship of Hong Kong University of Science and Technology, and Michael and Betty Kadoorie Cancer Genetics Research Programme II of the Kadoorie Charitable Foundation. We thank Y. Zhang for providing us the BEAM package and insightful discussions.
Author information
Affiliations
Corresponding author
Additional information
Authors' contributions
X.W. and C.Y. designed the models and simulation studies together. N.T. and W.Y. initialized the study and proposed the framework of multilevel learning. Q.Y. and H.X. direct the evaluation of methodologies. All authors contributed to the writing of the manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Wan, X., Yang, C., Yang, Q. et al. MegaSNPHunter: a learning approach to detect disease predisposition SNPs and high level interactions in genome wide association study. BMC Bioinformatics 10, 13 (2009). https://doi.org/10.1186/147121051013
Received:
Accepted:
Published:
Keywords
 Genome Wide Association Study
 Regression Tree
 Genome Wide Study
 Wellcome Trust Case Control Consortium
 Genome Wide Data