- Research
- Open Access
A random forest approach to the detection of epistatic interactions in case-control studies
- Rui Jiang^{1}Email author,
- Wanwan Tang^{1},
- Xuebing Wu^{1} and
- Wenhui Fu^{1}
https://doi.org/10.1186/1471-2105-10-S1-S65
© Jiang et al; licensee BioMed Central Ltd. 2009
- Published: 30 January 2009
Abstract
Background
The key roles of epistatic interactions between multiple genetic variants in the pathogenesis of complex diseases notwithstanding, the detection of such interactions remains a great challenge in genome-wide association studies. Although some existing multi-locus approaches have shown their successes in small-scale case-control data, the "combination explosion" course prohibits their applications to genome-wide analysis. It is therefore indispensable to develop new methods that are able to reduce the search space for epistatic interactions from an astronomic number of all possible combinations of genetic variants to a manageable set of candidates.
Results
We studied case-control data from the viewpoint of binary classification. More precisely, we treated single nucleotide polymorphism (SNP) markers as categorical features and adopted the random forest to discriminate cases against controls. On the basis of the gini importance given by the random forest, we designed a sliding window sequential forward feature selection (SWSFS) algorithm to select a small set of candidate SNPs that could minimize the classification error and then statistically tested up to three-way interactions of the candidates. We compared this approach with three existing methods on three simulated disease models and showed that our approach is comparable to, sometimes more powerful than, the other methods. We applied our approach to a genome-wide case-control dataset for Age-related Macular Degeneration (AMD) and successfully identified two SNPs that were reported to be associated with this disease.
Conclusion
Besides existing pure statistical approaches, we demonstrated the feasibility of incorporating machine learning methods into genome-wide case-control studies. The gini importance offers yet another measure for the associations between SNPs and complex diseases, thereby complementing existing statistical measures to facilitate the identification of epistatic interactions and the understanding of epistasis in the pathogenesis of complex diseases.
Keywords
- Random Forest
- Epistatic Interaction
- Single Nucleotide Polymorphism Marker
- Stepwise Logistic Regression
- Candidate SNPs
Background
Recent advances in human and medical genetics have made it widely accepted that Mendelian disorders that are caused by individual genetic variants are rare, whereas complex diseases that are speculated to be caused by multiple genetic variants, their interactive effects, and/or their interactions with environment factors are common [1, 2]. The interactive effect between two or more genetic variants is typically referred to as epistasis, which is supposed to contribute to complex diseases ubiquitously via the sophisticated regulatory mechanisms in the molecular level of human genetics [3]. Biomedical studies have also been confirming the existence of epistasis in many complex diseases, including myocardial infarction [4], breast cancer [5], hypertension [6], atrial fibrillation [7], diabetes mellitus type 2 [8], AIDS [9], and many others. The detection of epistatic interactions therefore plays a key role in the understanding of the pathogenesis of complex diseases.
Nevertheless, most statistical approaches that have demonstrated remarkable successes in the detection of genetic variants underlying Mendelian diseases have impaired explanatory power in the identification of epistatic interactions responsible for complex diseases [10]. For example, family-based linkage analysis that uses a transmission model to explain the pattern of inheritance of phenotypes and genotypes exhibited in a pedigree works well for Mendelian diseases, but it is ineffective when a single locus fails to explain a significant fraction of a disease [1, 2].
On the other hand, with the completion of human genome project, new opportunities and challenges have been presented for uncovering the genetic basis of complex diseases via genome-wide association studies [3, 11]. With the accumulation of well-phenotyped cases and carefully selected controls, as well as the emergence of high-throughput genotyping techniques, it becomes urgent to develop effective methods for the detection of epistatic interactions.
To embrace the opportunities, a number of multi-locus approaches have been developed. For example, Hoh et al proposed a trimming, weighting, and grouping approach that used the summation of statistics on the basis of single-locus marginal effects and the Hardy-Weinberg equilibrium (HWE) for hypothesis testing [12]. Nelson et al put forward a combinatorial partitioning method (CPM) that exhaustively searched for the combinatorial genotype group that has the most significant difference in the mean of the responding continuous phenotype [13]. Culverhouse et al modified CPM by ignoring partitions that combined individual genotypes with very different mean trait values [14]. Millstein et al developed a pre-screening strategy to reduce the number of tests with the use of a focused interaction testing framework (FITF) [15]. Chatterjee et al used Turkey's 1-degree-of-freedom model to detect interacting loci from different region [16]. Ritchie et al adopted an exhaustive search strategy to detect combinations of loci with the highest classification capability and named their method multifactor-dimensionality reduction MDR [5]. Zhang and Liu introduced a Bayesian epistasis association mapping (BEAM) method that integrated a Bayesian model with a Metropolis-Hasting algorithm to infer the probability that each locus was associated with the susceptibility of a disease [17]. They also proposed the use of a new B statistic instead of the standard χ^{2} statistic. Many machine learning methods have also been applied to this research field from the viewpoint of binary classification and feature selection [18–24].
The effectiveness of most of these methods for genome-wide case-control data has not yet been validated. Besides, many methods rely heavily on the exhaustive search for combinations of multiple loci. This strategy, though feasible when the number of candidate SNPs is small, can hardly be computationally practical for genome-wide association studies in which the number of candidate SNPs is typically very huge. For example, a study on Age-related Macular Degeneration (AMD) has genotyped more than 100 thousand single nucleotide polymorphism (SNP) markers for 96 patients and 50 unaffected people [25]. It has also become very common to genotype 400~500 thousand SNP markers for hundreds of cases and controls in recent genome-wide association studies [26, 27]. With such dense SNPs being genotyped, methods based on the exhaustive search are computationally impractical due to the vast number of combinations of SNPs. One of the main challenges for genome-wide association studies is therefore to design computational methods that are able to reduce the search space for epistatic interactions from an astronomic number of all possible combinations of SNPs to a manageable set of candidates.
In this paper, we study case-control data from the viewpoint of binary classification. Specifically, we treat cases as positive samples and controls as negative samples, and we use SNP markers as categorical features that have three possible values (i.e., the three genotype values at a locus). With this notion, we adopt the random forest [28] that has been widely used in bioinformatics fields such as the selection of genes [19, 20], the identification of gene-gene interactions [19, 22], and the detection of causative nonsynonymous SNPs [29, 30] as the classifier to discriminate cases against controls, with an emphasis on the contribution of each SNP to the classification performance. For this purpose, we first run a random forest with all SNPs to obtain the gini importance of each SNP and then use a sliding window sequential forward feature selection (SWSFS) algorithm to select a subset of SNPs that can minimize the classification error. Since this subset typically contains only a small number of SNPs (e.g., ~100), we are able to enumerate all possible k-way (k = 1, 2, 3) interactions of the candidate SNPs and test for statistical significance their associations with the disease risk.
The above approach, named epi Forest (detection of epi static interactions using random Forest), was compared with three existing methods (BEAM [17], the stepwise logistic regression [11], and the classical χ^{2} test) on three simulated disease models [11]. The results showed that epi Forest was comparable to, sometimes more powerful than, these methods. We further applied the proposed approach to a genome-wide case-control dataset for AMD that contains 116,204 SNPs genotyped for 96 cases and 50 controls [25] and selected a subset of 84 SNPs that can minimize the classification error. Further statistical tests successfully detected from these candidates two SNPs (rs380390 and rs1329428) that were reported to be associated with this disease.
Results
Principles of epi Forest
The classical approach to the detection of single-locus association fits a full logistic regression model with a parameter for each observed genotype and then tests the significance of the fitted model via a χ^{2} approximation of the likelihood ratio test statistic [11]. Alternatively, a χ^{2} test with up to 2 degrees of freedom can be directly applied to the contingency table that records the number of cases and controls for each genotype to test whether the distributions of SNPs for the case and control populations are significantly different. To ensure the overall type I error not exceeding a preset significance level α, a Bonferroni correction is typically applied by multiplying the p-values with the number of SNP markers L (or equivalently setting the significance level to α/L).
Similarly, in order to detect the epistatic interaction of two loci, a full logistic regression model with at most 9 parameters can be fitted and tested, and the p-values should be multiplied by L(L-1)/2 according to the Bonferroni correction [11]. Because the number of SNPs is typically huge (e.g., several hundred thousand) in genome-wide case-control studies, an exhaustive search for all possible combinations of SNPs is computationally impractical. To overcome this limitation, the stepwise logistic regression approach first selects a small fraction (ε, e.g., 10%) of loci according to the significance of their single-locus associations and then tests the interactions between the selected loci [11]. The determination of the fraction ε is, however, not guided. An approach that is able to automatically determine such a small set of good candidate markers is therefore preferred.
With this notion, in the first stage, we use an ensemble learning technique called random forest [28] with all SNPs to do the classification, while the objective is to obtain the contribution, measured by gini importance, of each SNP (see Methods). Then, a Sliding Window Sequential Forward feature Selection (SWSFS) algorithm that adds one SNP at a time from the most significant SNP to the least significant one is applied to greedily search for a small subset of SNPs that could minimize the classification error (see Methods). After this step, a small set of l (<<L, the total number of SNP markers) candidate SNPs that have the most significant contribution to the discrimination of cases against controls is selected.
In the second stage, a hierarchical procedure is adopted to declare the statistical significance that the candidate SNPs are associated with the disease risk. Let α be a preset significance level (e.g., 0.05). In the one-way tests, we apply a statistical test with the use of the B statistic proposed by Zhang and Liu (see [17] and Methods) to every candidate SNP and report all SNPs whose p-values are less than α after Bonferroni corrections for L tests. In the two-way tests, we apply the B statistic to all two-way interactions of the candidates, and report interactions whose p-values are less than α after Bonferroni corrections for L(L-1)/2 tests. In this procedure, if both SNPs in an interaction have already been reported in the one-way tests, we skip the test for their interaction; if one of the SNPs has already been reported in the one-way tests, we use a conditional B statistic for testing the interaction; if neither SNPs in an interaction has been reported in the one-way tests, we use the B statistic for testing the interaction. Similarly, in the three-way tests, we apply the B or conditional B statistics to all three-way interactions of the candidates, and report those with p-values less than α after Bonferroni corrections for L(L-1) (L-2)/6 tests.
Performance of epi Forest
In order to demonstrate the performance of epi Forest, we compared it with three existing methods, BEAM [17], the stepwise logistic regression [11], and the standard single-locus χ^{2} test, on three simulated disease models.
BEAM uses a Bayesian model with the Metropolis-Hasting algorithm to partition SNP markers into three groups: a group G_{0} containing markers unlinked to the disease, a group G_{1} including markers contributing independently to the disease, and a group G_{2} that is composed of markers jointly influencing the disease. After the partition step, candidate SNPs are further tested for significance with the use of the B statistic [17]. In BEAM, there are two prior probabilities need to be pre-determined: the probability that each marker belongs to G_{1} and that of G_{2}. In our studies, we set both priors to 0.001. The stepwise logistic regression first selects the most significant ε fraction of SNPs on the basis of their marginal effects, and then tests all two-way interactions of these SNPs using logistic regressions with likelihood ratio tests [11]. We use ε = 10% in our studies and further test all three-way interactions of the candidates besides the two-way interactions. The classical single-locus χ^{2} test (with at most 2 degrees of freedom) is used as a benchmark in this comparison.
We considered three disease models with different characteristics (see [11] and [17] for details). Briefly, model 1 contains two disease loci that contribute to the disease risk independently. Model 2 is similar to model 1, except that the disease risk is present only when both loci have at least one disease allele. Model 3 is a threshold model in which additional disease alleles at each locus do not further increase the disease risk. Assuming the disease prevalence to be 0.1 for all disease models, each model has three parameters associated: the marginal effect of each disease locus (λ), the minor allele frequencies (MAF) of both disease loci, and the strength of linkage disequilibrium (LD) between the unobserved disease locus and a genotyped locus (r^{2}) [31]. To enumerate all possible combinations of these parameters is impossible. We therefore selected only some typical values for each parameter. For λ, we set it to 0.3, 0.5, and 1.0 for model 1, 2, and 3, respectively. For MAF, we considered four values, 0.05, 0.1, 0.2, and 0.5, for each model. For r^{2}, we simulated for each model two values, 0.7 and 1.0. There were therefore 8 parameter settings for each disease model and a total of 24 comparisons in our simulation studies.
For each parameter setting of each model, we simulated 100 datasets, each of which contains 1,000 markers genotyped for 1,000 cases and 1,000 controls. The minor allele frequency for each non-disease marker is randomly chosen from a uniform [0.0. 0.5] distribution. The performance of a method on a specific parameter setting is measured by the power, defined as the fraction of simulated datasets in which all disease loci are identified at the significance level α = 0.05 after the Bonferroni correction.
Effectiveness of the SWSFS algorithm
The subset of candidate markers that are likely to be associated with the disease risk is screened out with the use of a sliding window sequential forward feature selection (SWSFS) algorithm, given the gini importance provided by the initial run of the random forest (see Methods). It is therefore necessary to see how many markers are typically selected by this algorithm.
In our simulation studies, the power of epi Forest is in general superior than that of the stepwise logistic regression, while the parameter ε (the fraction of candidate markers screened on the basis of their marginal effects) in the stepwise logistic regression is set to 10% (100 markers), which is generally greater than the number of candidates suggested by the SWSFS algorithm. These facts suggest that epi Forest can more precisely pinpoint the candidate SNPs that might be associated with the disease risk, and this procedure is fully automated.
The observations from Figures 2 and 3 suggest the feasibility of using machine learning methods to select a set of candidate markers that are likely to be associated with the disease risk, thereby reducing the search space for epistatic interactions from a large number of SNPs to a small number of selected candidates. Traditional approaches such as the stepwise logistic regression uses the marginal importance of individual markers as the criterion to select the subset of candidate for further exploration, and the size of the subset remains as a free parameter whose determination is not guided. With the use of epi Forest, however, the subset is automatically determined as the one that can minimize the classification error, therefore providing an automated initial screening. On the other hand, because the criterion used by epi Forest (gini importance) is intrinsically different from the p-value provided by likelihood ratio tests that is used in the stepwise logistic regression, it is possible that the gini importance can complement statistical criteria to achieve a better search for epistatic interactions. The results also demonstrate the power of the B statistic over the likelihood ratio test statistic and the χ^{2} statistic, because both epi Forest and BEAM are in general more powerful than the stepwise logistic regression and the χ^{2} test.
Application to AMD
In simulation studies on 1,000 SNPs, epi Forest is comparable to, sometimes more powerful than, three existing methods. Nevertheless, studies have shown that a number of 30,000 to 500,000 common SNPs may be required for genotyping in real genome-wide case-control studies [32, 33]. It is therefore necessary to show whether epi Forest is able to handle such large data in real genome-wide association studies.
For this purpose, we applied epi Forest to an Age-related Macular Degeneration (AMD) dataset [25], which contained 116,204 SNPs genotyped with 96 cases and 50 controls. As suggested in [25], we removed nonpolymorphic SNPs and those that significantly deviated from Hardy-Weinberg Equilibrium (HWE), and we removed all SNPs that had no reference SNP ID or had more than 5 missing genotypes to ensure the high quality of the remaining data. After the filtering, there remained 95,986 SNPs.
It is interesting to see from this figure that the two SNPs, rs380390 and rs1329428, that were reported to be associated with AMD in literature [25], have the largest gini importance among all SNPs. Specifically, the gini importance is 2.73 for rs380390 and 2.44 for rs1329428, while the Bonferroni-corrected p-values are 0.0043 and 0.14 for rs380390 and rs1329428, respectively, according to χ^{2} tests under the assumption of Hardy-Weinberg equilibrium [25]. These observations suggest that higher gini importance may indicate lower p-value.
Top 5 two-way and top 5 three-way interactions in AMD that have the smallest p-values (for the B statistics) before the Bonferroni correction.
SNP Interaction | p-value |
---|---|
(rs6104678, rs7863587) | 1.28 × 10^{-7} |
(rs3743175, rs1394608) | 3.06 × 10^{-7} |
(rs2828155, rs1394608) | 3.06 × 10^{-7} |
(rs4292478, rs1394608) | 7.29 × 10^{-7} |
(rs6104678, rs10512174) | 7.68 × 10^{-7} |
(rs2347060, rs3758141, rs7104698) | 5.57 × 10^{-9} |
(rs2347061, rs3758141, rs7104698) | 5.57 × 10^{-9} |
(rs2347060, rs10503640, rs7104698) | 6.91 × 10^{-9} |
(rs2347061, rs10503640, rs7104698) | 6.91 × 10^{-9} |
(rs2347060, rs1557753, rs7104698) | 1.07 × 10^{-8} |
Discussion
The development of epi Forest is motivated by the following two facts: (1) most existing approaches use pure statistical methods and (2) in the stepwise logistic regression, the selection for the subset of candidate SNPs is not guided. Accordingly, the main contribution of our approach includes: (1) the incorporation of the random forest into case-control studies and (2) the automated screening of the candidate SNPs for further statistical analysis.
The random forest has several advantages over other classifiers in the studies for case-control data. First, and of the most interest, the random forest can natively provide the gini importance that measures the contribution of individual features (SNPs) to the classification. We have also shown that the correlation between this importance measure and the p-value for the B statistic is strongly negative. In this sense, the gini importance may complement the p-value to offer yet another useful measure for the associations between SNPs and complex diseases. Second, the random forest needs no extra cross-validation for evaluating the classification performance, thereby greatly reducing the computational time. Third, as a classical ensemble learning method, the procedures for constructing decision trees in the forest are mutually independent, hence very suitable for large scale parallel computation or hardware acceleration.
The epi Forest framework can also be extended from the following directions. First, the gini importance can itself serve as a statistic and be used with the permutation test to directly offer a p-value. Second, the random forest has an experimental means of estimating interactions between two features. For each tree, features can be ranked on the basis of their gini decreases, and the absolute difference in the ranks of every two features can be calculated. Averaging this difference over all trees, one obtains a measure for the interaction of every pair of features [17]. It is therefore interesting to analyze the relationship between this measure and statistical measures such as the p-value.
Certainly, our approach is not intended to take the place of existing statistical methods for detecting epistatic interactions. Instead, we are interested in showing how machine learning approaches can complement statistical methods to facilitate the exploration of interactions between multiple SNPs, because epistasis plays such an important role in the pathogenesis of complex diseases, and the detection of epistasis still remains a great challenge and needs to be studied from different perspectives.
Conclusion
In this paper, we studied case-control data from the viewpoint of binary classification. We treated cases as positive samples and controls as negative samples, and used SNP markers as categorical features. We adopted random forest to discriminate cases against controls, while the focus was to obtain the gini importance to measure the contribution of each SNP to the classification performance. On the basis of this measure, a sliding window sequential forward feature selection (SWSFS) algorithm was proposed to automatically determine a subset of candidate SNPs that were most likely to be associated with the disease. A hierarchical procedure with the use of the B statistic was applied to declare statistical significance of up to three-way interactions within this set of candidates. This framework, including the random forest and the SWSFS algorithm for initial screening, and the hierarchical procedure for declaring statistical significance, was named epi Forest.
We compared the proposed approach with three existing methods, including BEAM [17], the stepwise logistic regression [11], and the χ^{2} test, on three simulated disease models [11]. The results showed that the power of epi Forest was comparable to, sometimes higher than, that of the other methods.
We further applied epi Forest to a real genome-wide case-control dataset of AMD. The SWSFS algorithm automatically selected a set of 84 SNP markers. It was interesting to see that the two SNPs (rs380390 and rs1329428) already reported as linked to this disease [25] had the highest gini importance. A strong negative correlation between the gini importance and the p-value for the B statistic was also observed.
Methods
Random forest
The random forest is an ensemble learning methodology originated by Leo Breiman (see [28] for details). The basic idea of ensemble learning is to boost the performance of a number of weak learners via a voting scheme, where a weak learner can be an individual decision tree, a single perceptron/sigmoid function, or other simple and fast classifiers. As for the random forest, its hallmarks mainly include (1) bootstrap resampling, (2) random feature selection, (3) full depth decision tree growing, and (4) Out-of-bag (OOB) error estimate.
Given a set of N binary labelled training samples, where x_{ i }(i = 1, 2,..., N) is a vector of predictor variables (features) and y_{ i }the response variable (class label), a random forest targets on generating a number of M decision trees from these samples. For each tree, the same number of N samples is randomly selected with replacement (bootstrap resampling) to form a new training set, and the samples not selected are called out-of-bag (OOB) samples. Using this new training set, a decision tree is grown to the largest extent possible without any pruning according to the CART methodology [34], while in the split of each node, only a small number of m randomly selected features instead of all predictor variables is considered (random feature selection). Repeating the creation of a decision tree M times, we have a number of M distinct decision trees, forming a randomly generated "forest."
Unlike most machine learning methods that need to resort to cross-validation for the estimation of classification error, the random forest can natively estimate an out-of-bag (OOB) error in the process of constructing the forest, and this estimate is claimed to be unbiased in many tests [28]. With the construction of a decision tree, each OOB sample is tested, and its (OOB) classification result is collected. Upon the finish of constructing the entire forest, OOB classification results for each sample are used to determine a decision for this sample via a majority voting rule. The fraction of decisions that disagree with the true class label is then the OOB error estimate.
These characteristics make the random forest suitable for handling large-scale samples with thousands of features and thus gaining a wide spectrum of applications in bioinformatics such as the selection of genes [19, 20], the identification of gene-gene interactions [19, 22], and the detection of causative nonsynonymous SNPs [29, 30]. In our studies, we use the "randomForest" package in R. The number of trees (M) and the number of features randomly selected in each node (m) are referred to as ntree and mtry (with the default value $\lfloor \sqrt{\#\left\{\text{SNPs}\right\}}\rfloor $) in this package, respectively. Detailed discussion about the effects of these parameters to the classification performance can be found in [20] and [28].
Gini importance
Suppose that η, a node of a decision tree T, contains a number of n samples, in which n_{0} are negative and n_{1} = n - n_{0} are positive. The relative frequencies of the negative and positive samples, f_{0} and f_{1}, respectively, can then be estimated as
f_{0} = n_{0}/n and f_{1} = n_{1}/n_{1}/n,
This formula can be generalized to account for three or more classes [34]. Now, for a split at this node that yields two sub-nodes η_{ l }and η_{ r }, the decrease of the gini impurity for this split is calculated as
ΔΦ(η) = Φ(η) - f_{ l }Φ(η_{ l }) - f_{ r }Φ(η_{ r }),
where T is the collection of all decision trees in the random forest.
The random forest provides another randomization mechanism to estimate the importance of individual features. When a decision tree is constructed, the correct classifications for the OOB samples can be counted. Now, for a feature v, randomly permute its values in the OOB samples and again count the correct classifications. The average of the difference in these two counts over all trees in a forest is then defined as the raw importance of the feature v.
It has been shown that the gini importance and the raw importance are very consistent [28], but the computation of the gini importance is much more economy. We therefore adopt the gini importance to measure the contribution of a SNP to the classification performance in our studies.
Sliding window sequential forward feature selection
The key step in epi Forest is to automatically determine a subset of candidate SNPs that are likely to be linked to the disease. We use a sliding window sequential forward feature selection (SWSFS) algorithm on the basis of the gini importance for this purpose.
Suppose that M = {m_{1},..., m_{ L }}. are a number of L markers, and through an initial run of a random forest, the gini importance for these markers has been obtained as G = {g_{1},..., g_{ L }}, where g_{ i }= GI(m_{ i }). Besides, the order of these markers on the basis of the importance is obtained as O = {o_{1},..., o_{ L }}, that is, for every (i, j) that satisfies 1 ≤ i <j ≤ L, ${g}_{{o}_{i}}\ge {g}_{{o}_{j}}$. In other words, ${m}_{{o}_{1}}$ is the most important marker, ${m}_{{o}_{2}}$ the second most important one, and so forth. A Naïve Greedy Sequential Forward feature Selection (NGSFS) algorithm can then be designed as follows:
- 1.
i := 1; k := 1;
- 2.
WHILE (i ≤ L) DO
- 3.
Error [i] := randomForest(M [o_{1},..., o_{ i }]);
- 4.
IF (Error [i] < Error [k])
- 5.
k := i;
- 6.
END IF
- 7.
END WHILE
- 8.
RETURN M [o_{1},..., o_{ k }];
Here, M [o_{1},..., o_{ i }] is the subset of the first i most important SNPs, M [o_{1},..., o_{ k }] the subset of selected SNPs, and k the number of selected SNPs.
This algorithm greedily searches for a subset of SNPs that can minimize the classification error instead of enumerating all 2^{ L }- 1 nonempty subsets of the L SNPs. However, when L is huge, even this algorithm is computationally intractable. In our studies, we find that the classification errors produced by the NGSFS have a "V" shape with many local minimum, and the true global minimum is typically produced when only a small number of the most important SNPs are used (see Figure 6 as an example). Also, the construction of a random forest with fewer features is faster. These observations motivate us to propose the following Sliding Window Sequential Forward feature Selection (SWSFS) algorithm:
- 1.
i =: 1; k =: L; w =: 20;
- 2.
WHILE (i ≤ L) DO
- 3.
Error [i] =: randomForest(M [o_{1},..., o_{ i }]);
- 4.
IF (i > w AND i - w = argmin_{i-w ≤ j ≤ i}{Error [j]})
- 5.
k =: i - w;
- 6
BREAK;
- 7.
END IF
- 8.
END WHILE
- 9.
RETURN M [o_{1},..., o_{ k }];
This algorithm greedily searches for the first subset of markers in which the left boundary should have the minimal classification error in a window of size w. The window size determines how robust the algorithm could be, and we simply set it to 20 in this paper.
Statistical tests
We adopt a hierarchical procedure on the basis of the B statistic [17] in the second stage of epi Forest to declare statistical significance of up to three-way interactions within the candidate SNPs that are selected by the SWSFS algorithm.
There are two motivations for using the B statistic: (1) it is more powerful than the standard χ^{2} statistic, and (2) the marginal effects of already reported individual SNPs or partial interactions can be handled via the use of a conditional B statistic. Here we briefly introduce the B staistic, while the detailed derivation should refer to [17].
Given a set Ω of k markers, we like to test the hypothesis H_{0}: SNPs in Ω are not associated with the disease versus H_{1}: SNPs in Ω are jointly linked to the disease. For this purpose, a B statistic is defined as
B_{Ω} = ln [P_{1}(D_{Ω}, U_{Ω})/P_{1}(D_{Ω}, U_{Ω})],
where D_{Ω} and U_{Ω} are the observed case and control data for the set of markers, respectively, and P_{1}(D_{Ω}, U_{Ω}) and P_{0}(D_{Ω}, U_{Ω}) are Bayesian factors (marginal probabilities of the data) under the alternative and the null hypotheses, respectively.
More specifically, P_{1} (D_{Ω}, U_{Ω}) is calculated as
P_{1} (D_{Ω}, U_{Ω}) = P_{join} (D_{Ω} [P_{ind} (U_{Ω}) + P_{join} (U_{Ω})],
and P_{0}(D_{Ω}, U_{Ω}) is calculated as
P_{0} (D_{Ω}, U_{Ω}) = P_{ind}(D_{Ω}, U_{Ω}) + P_{join}(D_{Ω}, U_{Ω}).
In the above formulae, α_{ j }and β_{ j }are pseudo-counts with default values 0.5 (see [17]), n_{ ij }the number of controls that have the j-th genotype at the i-th SNP, and n_{ j }the number of controls that have the j-th combinatory genotype. P_{join}(D_{Ω}), P_{ind}(D_{Ω}, U_{Ω}), and P_{join}(D_{Ω}, U_{Ω}) can be calculated in a similar way. These formulae are derivation from a Bayesian marker partition model in [17]. It has been shown that under the null hypothesis, 2B_{Ω} has asymptotically a shifted χ^{2} distribution with 3^{ k }- 1 degrees of freedom. A p-value can therefore be calculated using the χ^{2} distribution.
Give a subset ω of Ω, where the t markers in ω are linked to the disease through either individual and/or partial interactive effects. A conditional B statistic, B_{Ω|ω}, for the rest markers can then be calculated in a similar way as the B statistic. Furthermore, under the null hypothesis that the rest markers are unlinked to the disease, 2B_{Ω|ω}follows asymptotically a shifted χ^{2} distribution with 3^{ k }-3^{ t }degrees of freedom. A p-value can then be calculated accordingly.
Declarations
Acknowledgements
We thank Dr. Hoh J for providing us the AMD data-set. This study was supported by the Natural Science Foundation of China grants 60805010, 60805009, 60575014, the Hi-Tech Research and Development Program of China (863 project) grant 2006AA02Z325, the National Basic Research Program of China (973 project) grant 2004CB518605, and a starting up supporting plan at Tsinghua University.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
Authors’ Affiliations
References
- Glazier AM, Nadeau JH, Aitman TJ: Finding genes that underlie complex traits. Science. 2002, 298 (5602): 2345-2349. 10.1126/science.1076641.View ArticlePubMedGoogle Scholar
- Lander E, Kruglyak L: Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet. 1995, 11 (3): 241-247. 10.1038/ng1195-241.View ArticlePubMedGoogle Scholar
- Moore JH, Williams SM: New strategies for identifying gene-gene interactions in hypertension. Ann Med. 2002, 34 (2): 88-95. 10.1080/07853890252953473.View ArticlePubMedGoogle Scholar
- Tiret L, Bonnardeaux A, Poirier O, Ricard S, Marques-Vidal P, Evans A, Arveiler D, Luc G, Kee F, Ducimetiere P: Synergistic effects of angiotensin-converting enzyme and angiotensin-II type 1 receptor gene polymorphisms on risk of myocardial infarction. Lancet. 1994, 344 (8927): 910-913. 10.1016/S0140-6736(94)92268-3.View ArticlePubMedGoogle Scholar
- Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69 (1): 138-147. 10.1086/321276.PubMed CentralView ArticlePubMedGoogle Scholar
- Williams SM, Ritchie MD, Phillips JA, Dawson E, Prince M, Dzhura E, Willis A, Semenya A, Summar M, White BC: Multilocus analysis of hypertension: a hierarchical approach. Hum Hered. 2004, 57 (1): 28-38. 10.1159/000077387.View ArticlePubMedGoogle Scholar
- Tsai CT, Hwang JJ, Chiang FT, Wang YC, Tseng CD, Tseng YZ, Lin JL: Renin-angiotensin system gene polymorphisms and atrial fibrillation: A regression approach for the detection of gene-gene interactions in a large hospitalized population. Cardiology. 2008, 111 (1): 1-7. 10.1159/000113419.View ArticlePubMedGoogle Scholar
- Cho YM, Ritchie MD, Moore JH, Park JY, Lee KU, Shin HD, Lee HK, Park KS: Multifactor-dimensionality reduction shows a two-locus interaction associated with Type 2 diabetes mellitus. Diabetologia. 2004, 47 (3): 549-554. 10.1007/s00125-003-1319-x.View ArticlePubMedGoogle Scholar
- Martin MP, Qi Y, Gao X, Yamada E, Martin JN, Pereyra F, Colombo S, Brown EE, Shupert WL, Phair J: Innate partnership of HLA-B and KIR3DL1 subtypes against HIV-1. Nat Genet. 2007, 39 (6): 733-740. 10.1038/ng2035.PubMed CentralView ArticlePubMedGoogle Scholar
- Risch NJ: Searching for genetic determinants in the new millennium. Nature. 2000, 405 (6788): 847-856. 10.1038/35015718.View ArticlePubMedGoogle Scholar
- Marchini J, Donnelly P, Cardon LR: Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet. 2005, 37 (4): 413-417. 10.1038/ng1537.View ArticlePubMedGoogle Scholar
- Hoh J, Wille A, Ott J: Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res. 2001, 11 (12): 2115-2119. 10.1101/gr.204001.PubMed CentralView ArticlePubMedGoogle Scholar
- Nelson MR, Kardia SL, Ferrell RE, Sing CF: A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001, 11 (3): 458-470. 10.1101/gr.172901.PubMed CentralView ArticlePubMedGoogle Scholar
- Culverhouse R, Klein T, Shannon W: Detecting epistatic interactions contributing to quantitative traits. Genet Epidemiol. 2004, 27 (2): 141-152. 10.1002/gepi.20006.View ArticlePubMedGoogle Scholar
- Millstein J, Conti DV, Gilliland FD, Gauderman WJ: A testing framework for identifying susceptibility genes in the presence of epistasis. Am J Hum Genet. 2006, 78 (1): 15-27. 10.1086/498850.PubMed CentralView ArticlePubMedGoogle Scholar
- Chatterjee N, Kalaylioglu Z, Moslehi R, Peters U, Wacholder S: Powerful multilocus tests of genetic association in the presence of gene-gene and gene-environment interactions. Am J Hum Genet. 2006, 79 (6): 1002-1016. 10.1086/509704.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang Y, Liu JS: Bayesian inference of epistatic interactions in case-control studies. Nat Genet. 2007, 39 (9): 1167-1173. 10.1038/ng2110.View ArticlePubMedGoogle Scholar
- Bureau A, Dupuis J, Falls K, Lunetta KL, Hayward B, Keith TP, Van Eerdewegh P: Identifying SNPs predictive of phenotype using random forests. Genet Epidemiol. 2005, 28 (2): 171-182. 10.1002/gepi.20041.View ArticlePubMedGoogle Scholar
- Chen X, Liu CT, Zhang M, Zhang H: A forest-based approach to identifying gene and gene gene interactions. Proc Natl Acad Sci USA. 2007, 104 (49): 19199-19203. 10.1073/pnas.0709868104.PubMed CentralView ArticlePubMedGoogle Scholar
- Diaz-Uriarte R, Alvarez de Andres S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006, 7: 3-10.1186/1471-2105-7-3.PubMed CentralView ArticlePubMedGoogle Scholar
- Heidema AG, Boer JM, Nagelkerke N, Mariman EC, van der AD, Feskens EJ: The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet. 2006, 7: 23-10.1186/1471-2156-7-23.PubMed CentralView ArticlePubMedGoogle Scholar
- McKinney BA, Reif DM, Ritchie MD, Moore JH: Machine learning for detecting gene-gene interactions: a review. Appl Bioinformatics. 2006, 5 (2): 77-88. 10.2165/00822942-200605020-00002.PubMed CentralView ArticlePubMedGoogle Scholar
- Phuong TM, Lin Z, Altman RB: Choosing SNPs using feature selection. J Bioinform Comput Biol. 2006, 4 (2): 241-257. 10.1142/S0219720006001941.View ArticlePubMedGoogle Scholar
- Ye Y, Zhong X, Zhang H: A genome-wide tree- and forest-based association analysis of comorbidity of alcoholism and smoking. BMC Genet. 2005, 6 (Suppl 1): S135-10.1186/1471-2156-6-S1-S135.PubMed CentralView ArticlePubMedGoogle Scholar
- Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST: Complement factor H polymorphism in age-related macular degeneration. Science. 2005, 308 (5720): 385-389. 10.1126/science.1109557.PubMed CentralView ArticlePubMedGoogle Scholar
- Fung HC, Scholz S, Matarin M, Simon-Sanchez J, Hernandez D, Britton A, Gibbs JR, Langefeld C, Stiegert ML, Schymick J: Genome-wide genotyping in Parkinson's disease and neurologically normal controls: first stage analysis and public release of data. Lancet Neurol. 2006, 5 (11): 911-916. 10.1016/S1474-4422(06)70578-6.View ArticlePubMedGoogle Scholar
- Simon-Sanchez J, Scholz S, Fung HC, Matarin M, Hernandez D, Gibbs JR, Britton A, de Vrieze FW, Peckham E, Gwinn-Hardy K: Genome-wide SNP assay reveals structural genomic variation, extended homozygosity and cell-line induced alterations in normal individuals. Hum Mol Genet. 2007, 16 (1): 1-14. 10.1093/hmg/ddl436.View ArticlePubMedGoogle Scholar
- Breiman L: Random forest. Machine Learning. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.View ArticleGoogle Scholar
- Jiang R, Yang H, Sun F, Chen T: Searching for interpretable rules for disease mutations: a simulated annealing bump hunting strategy. BMC Bioinformatics. 2006, 7: 417-10.1186/1471-2105-7-417.PubMed CentralView ArticlePubMedGoogle Scholar
- Jiang R, Yang H, Zhou L, Kuo CC, Sun F, Chen T: Sequence-based prioritization of nonsynonymous single-nucleotide polymorphisms for the study of disease mutations. Am J Hum Genet. 2007, 81 (2): 346-360. 10.1086/519747.PubMed CentralView ArticlePubMedGoogle Scholar
- Pritchard JK, Przeworski M: Linkage disequilibrium in humans: models and data. Am J Hum Genet. 2001, 69 (1): 1-14. 10.1086/321275.PubMed CentralView ArticlePubMedGoogle Scholar
- Collins A, Lonjou C, Morton NE: Genetic epidemiology of single-nucleotide polymorphisms. Proc Natl Acad Sci USA. 1999, 96 (26): 15173-15177. 10.1073/pnas.96.26.15173.PubMed CentralView ArticlePubMedGoogle Scholar
- Kruglyak L: Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat Genet. 1999, 22 (2): 139-144. 10.1038/9642.View ArticlePubMedGoogle Scholar
- Duda RO, Hart PE, Stork DG: Pattern Classification (Second Edition). 2001, New York: John Wiley & Sons, IncGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.