A Markov blanket-based method for detecting causal SNPs in GWAS
BMC Bioinformatics volume 11, Article number: S5 (2010)
Abstract
Background
Detecting epistatic interactions associated with complex and common diseases can help to improve the prevention, diagnosis and treatment of these diseases. With the development of genome-wide association studies (GWAS), designing powerful and robust computational methods for identifying epistatic interactions associated with common diseases has become a great challenge for the bioinformatics community, because such studies must cope with both the large size of the genotyped data and the huge number of combinations of possible genetic factors. Most existing computational detection methods are based on the classification capacity of SNP sets, which may fail to identify SNP sets that are strongly associated with the diseases and may introduce many false positives. In addition, most methods are not suitable for genome-wide studies because of their computational complexity.
Results
We propose a new Markov Blanket-based method, DASSOMB (Detection of ASSOciations using Markov Blanket), to detect epistatic interactions in case-control GWAS. The Markov Blanket of a target variable T completely shields T from all other variables. Thus, we can guarantee that the SNP set detected by DASSOMB has a strong association with the disease and contains the fewest false positives. Furthermore, DASSOMB uses a heuristic search strategy based on the association between variables, which avoids the time-consuming training process of other machine-learning methods. We apply our algorithm to simulated datasets and to a real case-control dataset. We compare DASSOMB with other commonly used methods and show that it significantly outperforms them and is capable of finding SNPs strongly associated with diseases.
Conclusions
Our study shows that DASSOMB can identify a minimal set of causal SNPs associated with diseases, which contains fewer false positives than the sets found by other existing methods. Given the huge size of the genomic datasets produced by GWAS, this is critical for reducing the potential cost of biological experiments and for providing an efficient guideline for pathogenesis research.
Background
Compared with Mendelian disorders, which are monogenic and rare in the population, common complex diseases such as various cancers, diabetes and hypertension are conjectured to be caused by two types of interactions among multiple genetic factors: gene-gene interactions and gene-environment interactions [1]. Interactions between genes or single nucleotide polymorphisms (SNPs) in chromosomal regions are called epistasis [2–4]. Detecting epistasis associated with complex and common diseases has become an important issue in human genetics and can pave the way towards better prevention, diagnosis and treatment of these diseases.
Recent developments in high-throughput technologies have made it possible to produce huge amounts of genotype data and have contributed to genome-wide association studies (GWAS) [5–7]. Furthermore, the International HapMap Project has actively supported GWAS by analyzing the common patterns of DNA sequence variation in different populations [8, 9]. However, the number of SNPs in a case-control GWAS is typically more than 10 million, and traditional methods for detecting epistatic interactions, such as parametric regression, are not suited to identifying multiple disease-causing loci simultaneously among all possible combinations of SNPs in genome-wide case-control data. Therefore, designing robust and manageable methods to address this mathematical and computational problem presents a great challenge to scientists in bioinformatics.
So far, a number of statistical methods have been proposed to detect epistatic interactions. Among these, the most commonly used parametric method is logistic regression [10]. Logistic regression models the log-odds of a disease as a linear function of the SNPs (expressed as ternary variables), and an optimal logical SNP set associated with the disease status can be found by a simulated annealing algorithm [11]. When used for modelling high-order interactions, logistic regression produces many empty contingency-table cells, which often leads to very large standard errors in parameter estimation and therefore increases type I errors. Moreover, if the number of samples is small, a high-order interaction model has a large number of parameters and often overfits. To overcome these problems, Ritchie et al. proposed and developed the multifactor dimensionality reduction (MDR) method [12–15]. MDR first constructs a contingency table for every possible set of SNPs and then labels each cell of the table “high risk” or “low risk” based on the case/control ratio of the cell. Using these labels, MDR runs 10-fold cross-validation to select the SNP set with the smallest prediction error and/or the largest consistency. The merit of MDR compared with other statistical methods is that it is non-parametric and model-free. However, it has two fundamental limitations: it selects k-way interactions purely by prediction performance, and it employs an exhaustive search strategy to avoid local optima, which makes it impractical for large-scale datasets. Therefore, when applied to large-scale datasets, MDR requires a feature selection method such as ReliefF [16] as a filter to retain the top N SNPs, which can affect the performance of MDR significantly.
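The cell-labelling step of MDR described above can be illustrated with a short sketch. This is only a minimal illustration and not the MDR software: the function name `mdr_label_cells` and the `threshold` parameter (set here to 1.0, as would suit a balanced sample) are hypothetical, and the cross-validation step is omitted.

```python
from collections import Counter


def mdr_label_cells(genotype_rows, status, threshold=1.0):
    """Label each multi-locus genotype cell as "high risk" or "low risk".

    genotype_rows: one tuple of genotypes (0/1/2) per individual.
    status: 1 for a case, 0 for a control, in the same order.
    threshold: hypothetical cut-off on the case/control ratio of a cell.
    """
    cases = Counter()
    controls = Counter()
    for row, y in zip(genotype_rows, status):
        if y == 1:
            cases[tuple(row)] += 1
        else:
            controls[tuple(row)] += 1

    labels = {}
    for cell in set(cases) | set(controls):
        # A cell without controls is treated as having an infinite ratio.
        if controls[cell]:
            ratio = cases[cell] / controls[cell]
        else:
            ratio = float("inf")
        labels[cell] = "high risk" if ratio > threshold else "low risk"
    return labels


rows = [(0, 0), (0, 0), (0, 0), (1, 2), (1, 2), (1, 2)]
status = [1, 1, 0, 0, 0, 1]
labels = mdr_label_cells(rows, status)
```

In this toy input the cell (0, 0) holds two cases and one control and is labelled "high risk", while the cell (1, 2) holds one case and two controls and is labelled "low risk".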
Park and Hastie [17] detected gene-gene interactions using a forward stepwise method based on penalized logistic regression (stepPLR). However, regression methods are typically computationally expensive because of the time needed for parameter estimation. Although stepPLR adopts forward selection and penalization to choose the causal SNPs, it cannot overcome the essential limitations of regression. Recently, Zhang and Liu proposed the Bayesian epistasis association mapping (BEAM) method [18]. BEAM is a Bayesian marker partition model that uses Markov Chain Monte Carlo to reach an optimal marker partition with the highest posterior probability, and a new B statistic, instead of the conventional χ^{2} statistic, to check each marker or set of markers for significant association with the disease. Despite their success to some degree, statistical methods can only be applied to small-scale analyses because of their computational complexity.
Machine-learning methods are an alternative to statistical methods, since detecting epistatic interactions is closely related to the feature selection problem. Chen et al. proposed a support vector machine approach for detecting gene-gene interactions based on RFE (recursive feature elimination), RFA (recursive feature addition) and GA (genetic algorithm) feature selection methods [19]. Jiang et al. applied random forests, an ensemble learning technique, to the detection of epistatic interactions in case-control studies [20]. They first ranked SNPs by the Gini importance of each SNP in the random forests and then performed a greedy search for a small subset of SNPs that minimizes the classification error, using a Sliding Window Sequential Forward feature Selection (SWSFS) algorithm. The common limitation of machine learning-based methods is that they typically identify the SNP set that produces the highest classification accuracy, which is not necessarily the set with the strongest association with the disease. As a result, machine learning-based approaches tend to introduce many false positives, since including more SNPs increases classification accuracy.
In this paper, we propose a new Markov Blanket-based method, DASSOMB (Detection of ASSOciations using Markov Blanket), to detect epistatic interactions in case-control studies. The Markov Blanket is a minimal set of variables that completely shields the target variable from all other variables, based on the Markov condition. Thus we can guarantee that the SNP set detected by DASSOMB has a strong association with the disease and contains the fewest false positives. Furthermore, DASSOMB performs a heuristic search based on the association between variables, which avoids the time-consuming training process of SVMs and Random Forests.
We compare DASSOMB with four other commonly used methods (BEAM [18], SVM [19], MDR [12–15] and stepPLR [17]) on simulated datasets generated from three disease models [10, 18, 20]. The results show that DASSOMB significantly outperforms the other methods and is capable of finding SNPs strongly associated with diseases. As a genome-wide case-control dataset, we use the Age-related Macular Degeneration (AMD) dataset, which contains 116,204 SNPs genotyped for 96 cases and 50 controls [7]. DASSOMB finds the AMD-associated SNP rs380390 in its result SNP set, which demonstrates the power and scalability of DASSOMB.
Results and discussion
Epistatic models and simulation study
We first evaluate the proposed DASSOMB on simulated datasets, which are generated from three commonly used disease models developed elsewhere [10, 18]. The three disease models are shown in Table 1. Each cell of the table gives the disease odds for a genotype combination at two loci (A and B), where α is the baseline effect and θ is the genotypic effect. In model 1, the two disease loci contribute to the disease risk independently and produce additive effects. In model 2, the disease risk is elevated only when both loci have at least one disease allele. Model 3 is a threshold model and is similar to model 2, except that additional disease alleles at each locus do not further increase the disease risk.
To generate data, we need to determine three parameters associated with each model: the marginal effect of each disease locus (λ), the minor allele frequencies (MAF) of both disease loci, and the strength of linkage disequilibrium (LD) between the unobserved disease locus and a genotyped locus. LD is a non-random association of alleles at different loci and is quantified by the squared correlation coefficient r^{2} calculated from allele frequencies [21]. The prevalence of a disease is the proportion of the population that has the disease, and in this paper we assume a disease prevalence of 0.1 for all three disease models [10]. The marginal effect of each disease locus (λ) is determined by the baseline effect α and the genotypic effect θ in Table 1 together with the MAF of both disease loci. We therefore first fix λ, the disease prevalence and the MAF of both disease loci, and then numerically derive the model parameters θ and α. Based on θ and α, we calculate the conditional probability of each genotype combination given the disease status, which is needed for generating data [22]. In this paper, we set the parameters for each model as follows:

Model 1: λ = 0.3; r^{2} = 0.7, 1.0; MAF = 0.05, 0.1, 0.2, 0.5.
Model 2: λ = 0.3; r^{2} = 0.7, 1.0; MAF = 0.05, 0.1, 0.2, 0.5.
Model 3: λ = 0.6; r^{2} = 0.7, 1.0; MAF = 0.05, 0.1, 0.2, 0.5.
For each non-disease marker, we randomly choose its MAF from a uniform distribution on [0.0, 0.5]. For each parameter setting of each model, we generate 50 datasets, each containing 100 markers genotyped for 1,000 cases and 1,000 controls.
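As a rough illustration of how the non-disease markers could be drawn, the sketch below samples each marker's MAF uniformly from [0.0, 0.5] and then samples genotypes coded 0, 1 or 2 (the number of minor alleles). The function name `simulate_null_markers` is hypothetical, and the sketch assumes Hardy-Weinberg proportions and independence between markers, neither of which is specified in the text; the disease loci, which depend on the model parameters in Table 1, are not simulated here.

```python
import random


def simulate_null_markers(n_individuals, n_markers, rng):
    """Return one genotype column (list of 0/1/2) per non-disease marker.

    Assumption: genotypes follow Hardy-Weinberg proportions, i.e. each
    individual carries two independent draws of the minor allele.
    """
    columns = []
    for _ in range(n_markers):
        maf = rng.uniform(0.0, 0.5)  # MAF ~ Uniform[0.0, 0.5]
        column = [
            int(rng.random() < maf) + int(rng.random() < maf)
            for _ in range(n_individuals)
        ]
        columns.append(column)
    return columns


rng = random.Random(0)  # fixed seed so the sketch is repeatable
markers = simulate_null_markers(n_individuals=2000, n_markers=5, rng=rng)
```

Because such markers are drawn without reference to disease status, any of them reported by a detection method counts as a false positive.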
We compare the DASSOMB algorithm with four commonly used methods (BEAM, support vector machines, MDR and stepPLR) on the three simulated disease models. To measure the performance of each method we use power, defined as the proportion of simulated datasets in which exactly the two disease-associated markers are identified, without any false positives.
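The power criterion can be written down in a few lines. The sketch below is illustrative only: the function name `power` and the `max_false_positives` parameter are hypothetical, with the default of 0 corresponding to the strict definition given above and larger values allowing a relaxed criterion.

```python
def power(selected_sets, causal, max_false_positives=0):
    """Proportion of datasets in which all causal markers are selected
    with at most max_false_positives additional markers."""
    causal = set(causal)
    hits = 0
    for selected in selected_sets:
        selected = set(selected)
        false_positives = selected - causal
        if causal <= selected and len(false_positives) <= max_false_positives:
            hits += 1
    return hits / len(selected_sets)


# Made-up results for four datasets with causal markers A and B.
results = [{"A", "B"}, {"A", "B", "X"}, {"A"}, {"A", "B", "X", "Y", "Z"}]
strict = power(results, {"A", "B"})
relaxed = power(results, {"A", "B"}, max_false_positives=2)
```

With these made-up results, only the first dataset counts under the strict criterion (power 0.25), whereas the first two count when up to two false positives are tolerated (power 0.5).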
BEAM uses a Bayesian marker partition model to partition SNPs into three groups: group 0 contains markers unlinked to the disease, group 1 contains markers contributing independently to the disease, and group 2 contains markers that jointly influence the disease. After the partition step by MCMC, candidate SNPs or groups of SNPs are further filtered by the B statistic [18]. The BEAM software is downloaded from http://www.fas.harvard.edu/~junliu/BEAM.
For support vector machines, we use LIBSVM with an RBF kernel to detect gene-gene interactions [23]. A grid search is used to select optimal parameters. Instead of the exhaustive greedy search over SNPs used in [19], which is very time-consuming and infeasible for large-scale datasets, we adopt the search strategy used in [20]. First, we rank SNPs by the mutual information between each SNP and the disease status label, which is 0 for controls and 1 for cases. Then, we apply the sliding window sequential forward feature selection (SWSFS) algorithm of [20] to the ranked SNPs. The window size in the SWSFS algorithm determines how robust the algorithm is, and we set it to 20.
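The ranking step can be sketched as follows. This is a minimal illustration using empirical frequencies, with hypothetical function names (`mutual_information`, `rank_snps`) and made-up data; it does not cover the SWSFS step or the SVM training itself.

```python
import math
from collections import Counter


def mutual_information(snp, status):
    """Empirical mutual information (in bits) between a SNP and the status."""
    n = len(snp)
    joint = Counter(zip(snp, status))
    snp_counts = Counter(snp)
    status_counts = Counter(status)
    mi = 0.0
    for (g, y), count in joint.items():
        # p(g, y) * log2( p(g, y) / (p(g) * p(y)) ), written with counts
        mi += (count / n) * math.log2(count * n / (snp_counts[g] * status_counts[y]))
    return mi


def rank_snps(genotypes, status):
    """Return SNP names sorted by decreasing mutual information with status."""
    scores = {name: mutual_information(col, status) for name, col in genotypes.items()}
    return sorted(scores, key=scores.get, reverse=True)


status = [0, 0, 1, 1]  # 0 = control, 1 = case
genotypes = {
    "snp_b": [0, 1, 0, 1],  # unrelated to status
    "snp_a": [0, 0, 2, 2],  # perfectly aligned with status
}
ranking = rank_snps(genotypes, status)
```

Here `snp_a` carries one full bit of information about the status and `snp_b` carries none, so `snp_a` is ranked first.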
Since the MDR algorithm cannot be applied to a large dataset directly, we first select the top 10 SNPs with ReliefF [16], a commonly used feature selection algorithm. MDR then performs an exhaustive search for a model of no more than four SNPs that maximizes cross-validation consistency and prediction accuracy. When one model has the maximal cross-validation consistency and another has the maximal prediction accuracy, MDR follows statistical parsimony and selects the model with fewer SNPs.
For stepPLR, we download the R package from CRAN (ftp://200.17.202.1/CRAN/web/packages/stepPlr). stepPLR provides both forward and backward stepwise procedures for feature selection. We use both and set the regularization parameter λ for the L2 norm of the coefficients to its default value (10^{-4}).
The results on the simulated data are shown in Figure 1. Among the five methods, DASSOMB performs best and BEAM is second. Interestingly, BEAM tends to assign the two disease-associated markers to group 1, which means that BEAM considers the two disease SNPs to affect the disease independently. In most cases, the powers of both MDR and SVM are much smaller than those of DASSOMB and BEAM. For MDR, the poor performance may be due to the use of ReliefF to filter SNPs from a very high-dimensional set.
In some other studies, power is defined less strictly. For example, in [18, 20] power is defined as the proportion of 50 datasets in which all associated markers are identified at a significance threshold of 0.1 after Bonferroni correction. In other words, false positives are allowed in the final SNP sets. Accordingly, we also evaluate the methods with power defined as the proportion of simulated datasets in which the two disease-associated markers are identified with no more than two false positives. The results for the three models are shown in Table 2, with the average number of false positives listed in parentheses. Table 2 shows that DASSOMB again outperforms the other algorithms and finds SNP sets with fewer false positives. One difference from the strict definition of power is that, for MAF > 10%, SVM detects the two disease-associated markers in more datasets than BEAM, but at the cost of introducing more false positives.
Application to real data
On simulated data with 100 SNPs, DASSOMB performs better than the four other methods. A real genome-wide case-control association study, however, may require genotyping of 30,000–1,000,000 common SNPs. In this section, we show that the DASSOMB algorithm can also handle large-scale datasets from real genome-wide case-control studies. We consider an Age-related Macular Degeneration (AMD) dataset, which contains 116,204 SNPs genotyped for 96 cases and 50 controls [7]. AMD (OMIM 603075) [24] is a common genetic disease associated with progressive visual dysfunction in people over 70 in developed countries. A GWA study of this disease successfully identified two associated SNPs, rs380390 and rs1329428 (‘rs’ denotes a reference SNP ID assigned by dbSNP [25]), in a non-coding region of the gene for complement factor H (CFH), which is located on chromosome 1 in a region linked to AMD [7].
In the data preprocessing phase, we remove non-polymorphic SNPs and SNPs that deviate significantly from Hardy-Weinberg Equilibrium (HWE). We also remove all SNPs that have more than five missing genotypes. After filtering, 91,495 SNPs on the 22 autosomal chromosomes remain.
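The three filters can be sketched as below. The text does not state the HWE test or significance threshold used, so this sketch assumes a one-degree-of-freedom chi-square goodness-of-fit test and a hypothetical threshold `alpha`; the function names are likewise illustrative. Genotypes are coded 0, 1 or 2, with `None` for a missing call.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a chi-square goodness-of-fit test for HWE (1 df)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic: HWE cannot be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def keep_snp(genotypes, alpha=1e-3, max_missing=5):
    """Apply the missing-data, polymorphism and HWE filters to one SNP.

    alpha is an assumed threshold; max_missing=5 follows the text.
    """
    missing = sum(g is None for g in genotypes)
    if missing > max_missing:
        return False
    counts = [sum(g == k for g in genotypes) for k in (0, 1, 2)]
    called = sum(counts)
    alt_alleles = counts[1] + 2 * counts[2]
    if alt_alleles == 0 or alt_alleles == 2 * called:
        return False  # non-polymorphic
    return hwe_chi2_pvalue(*counts) >= alpha


balanced = [0] * 50 + [1] * 40 + [2] * 10     # close to HWE proportions
no_hets = [0] * 50 + [2] * 50                 # strong heterozygote deficit
```

For small samples an exact HWE test is often preferred to the chi-square approximation; the approximation is used here only to keep the sketch short.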
DASSOMB detects two associated SNPs. The first is rs380390, which was already found in [7] to be significantly associated with AMD. The other is rs1374431, which is located in a non-coding region between LOC644301 and KIAA1715 on chromosome 2q31 [26]. KIAA1715, alternatively called LNP (Lunapark), is reported in OMIM (OMIM 610236) and is usually found in adult brain regions. Although no evidence relating this gene to AMD has been reported in the literature, it may be a plausible candidate gene associated with AMD.
Conclusions
Detecting epistatic interactions associated with complex and common diseases has become an important issue in human genetics and can improve the prevention, diagnosis and treatment of those diseases. GWAS provide a huge amount of genome-wide data and therefore an unprecedented opportunity to identify causal genes or SNPs for some complex diseases. Traditional statistical methods, however, are not suitable for large datasets because of their computational complexity. Machine-learning approaches can be scaled to large datasets, but most existing machine-learning methods do not consider the complexity of genetic mechanisms and focus only on selecting the SNP sets that show the best classification capacity, which inevitably introduces many false positives.
In this paper, we use a Markov Blanket-based method, DASSOMB, to identify epistatic interactions. We compared DASSOMB with four other methods (BEAM, support vector machines, MDR and stepwise penalized logistic regression) on simulated datasets. Our results show that the DASSOMB algorithm outperforms the other methods in terms of power. It can identify a minimal set of SNPs associated with diseases, which contains fewer false positives. This is critical for reducing the potential cost of biological experiments and for providing an efficient guideline for pathogenesis research.
Methods
Markov Blanket
Bayesian networks are probabilistic graphical models that represent a joint probability distribution J over a set of random variables X_{1}, X_{2}, …, X_{n} by a directed acyclic graph (DAG) G and encode the Markov condition: each node is conditionally independent of its non-descendants given its parents [27]. In this case, the joint probability distribution J can be represented as

$$P(X_{1}, X_{2}, \ldots, X_{n}) = \prod_{i=1}^{n} P\big(X_{i} \mid Pa(X_{i})\big) \qquad (1)$$

where Pa(X_{i}) denotes the set of parents of X_{i} in G.
For three random variables X, Y and Z, if the probability distribution of X conditioned on both Y and Z is equal to the probability distribution of X conditioned only on Y, i.e., P(X | Y, Z) = P(X | Y), then X is conditionally independent of Z given Y. This conditional independence is written as $X \perp Z \mid Y$. Similarly, $X \not\perp Z \mid Y$ denotes conditional dependence.
Definition 1 (Faithfulness) A Bayesian network N and a joint probability distribution J are faithful to each other if and only if every conditional independence entailed by the DAG of N and the Markov Condition is also present in J [28].
We can define the Markov Blanket of a target variable T, MB(T), as a minimal set for which $T \perp X \mid MB(T)$ for all $X \in V \setminus (\{T\} \cup MB(T))$, where V is the variable set of the Bayesian network N. In other words, the Markov Blanket of T is a minimal set of variables that completely shields T from all other variables: every other variable is probabilistically independent of T conditioned on MB(T). An example of a Markov Blanket is shown in Figure 2. MB(T) is the set of gray-filled nodes {B, L, M, D, X}, and the variables S and U are independent of T conditioned on {B, L, M, D, X}.
Theorem 1. If Bayesian network N is faithful to its corresponding joint probability distribution J, then for every variable T, MB(T) is unique.
Given the definition of the Markov Blanket, the probability distribution of T is completely determined by the values of the variables in MB(T). The detection of Markov Blankets has therefore been applied to the optimal variable selection problem [29]. In addition, the Markov Blanket can be used for causal discovery, because MB(T) is the union of the direct causes of T (parents), the direct effects of T (children), and the other direct causes of those direct effects (spouses). Markov Blanket learning is thus suitable for detecting epistatic interactions in genome-wide case-control studies, e.g., to identify a minimal set of SNPs that may cause the disease for further experiments.
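When the DAG is known, the decomposition of MB(T) into parents, children and spouses can be read directly from the edge list, as the short sketch below shows. Figure 2 is not reproduced here, so the edges are a hypothetical DAG chosen only to be consistent with the example above (MB(T) = {B, L, M, D, X}, with S and U outside the blanket); the function name is likewise illustrative.

```python
def markov_blanket(edges, target):
    """Markov Blanket of target in a DAG given as (parent, child) pairs."""
    parents = {u for u, v in edges if v == target}
    children = {v for u, v in edges if u == target}
    # Spouses: other parents of the target's children.
    spouses = {u for u, v in edges if v in children and u != target}
    return parents | children | spouses


# Hypothetical DAG (not the actual Figure 2).
edges = [
    ("S", "B"),              # S influences T only through B
    ("B", "T"), ("L", "T"),  # parents of T
    ("T", "M"), ("T", "D"),  # children of T
    ("X", "D"),              # X is a spouse of T via the common child D
    ("M", "U"),              # U is a descendant reached only through M
]
blanket = markov_blanket(edges, "T")
```

In practice the DAG is unknown, which is why the blanket has to be inferred from data through conditional independence tests.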
G^{2} test
The G^{2} test is commonly used to test independence and conditional independence between two variables for discrete data as an alternative to the χ^{2} test, because G^{2} values are additive and can be applied to more complicated statistical designs [28, 30, 31]. The null hypothesis of the G^{2} test is that the two variables are independent.
Assume that we have a contingency table to record and analyze the joint distribution of two variables. The count in a particular cell of the contingency table, x_{ij}, is the value of a random variable from N samples with a multinomial distribution. Let x_{i+} represent the sum of the counts in all cells along the i th row, and x_{+j} denote the sum of the counts in all cells along the j th column. If the two variables are independent, as stated by the null hypothesis, the expected value of the random variable x_{ij} is:

$$E(x_{ij}) = \frac{x_{i+}\, x_{+j}}{N} \qquad (2)$$

We can compute the conditional independence from appropriate marginal distributions in a similar way. For instance, to determine whether the first variable is independent of the second conditioned on the third, we can calculate the expected value of a cell x_{ijk} as

$$E(x_{ijk}) = \frac{x_{i+k}\, x_{+jk}}{x_{++k}} \qquad (3)$$

where a + in a subscript denotes summation over that index.
For n cells in a contingency table, assume that the observed numbers are denoted by O_{1}, O_{2}, …, O_{n} and the corresponding expected numbers by E_{1}, E_{2}, …, E_{n}. Then G^{2} is given by

$$G^{2} = 2 \sum_{i=1}^{n} O_{i} \ln\!\left(\frac{O_{i}}{E_{i}}\right) \qquad (4)$$

which is asymptotically distributed as chi-square (χ^{2}) with the appropriate degrees of freedom. The degrees of freedom (df) for the G^{2} test between two variables A and B can be calculated as:

$$df = (Cat(A) - 1) \times (Cat(B) - 1) \qquad (5)$$

and the degrees of freedom (df) for the G^{2} test between A and B conditional on a third variable set C can be calculated as:

$$df = (Cat(A) - 1) \times (Cat(B) - 1) \times \prod_{i=1}^{n} Cat(C_{i}) \qquad (6)$$

where Cat(X) is the number of categories of the variable X and n is the number of variables in C. In (5) and (6) we assume that there are no empty cells in the contingency table. If there are empty cells, the degrees of freedom from (5) or (6) should be reduced by the number of empty cells.
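A compact way to see how the expected counts, the G^{2} statistic and the degrees of freedom fit together is the sketch below. It is a minimal illustration rather than the authors' implementation: the function name `g2_statistic` is hypothetical, categories are taken from the values observed in the data, and the reduction of the degrees of freedom for empty cells is not applied.

```python
import math
from collections import Counter


def g2_statistic(a, b, cond=None):
    """Return (G2, df) for testing independence of a and b given cond.

    a, b: equal-length sequences of category labels.
    cond: optional list of sequences, one per conditioning variable.
    """
    n = len(a)
    cond = cond or []
    # The stratum of a sample is the tuple of its conditioning values.
    strata = list(zip(*cond)) if cond else [()] * n

    cell = Counter(zip(a, b, strata))  # observed counts x_{ijk}
    row = Counter(zip(a, strata))      # x_{i+k}
    col = Counter(zip(b, strata))      # x_{+jk}
    total = Counter(strata)            # x_{++k}

    g2 = 0.0
    for (i, j, k), observed in cell.items():
        expected = row[(i, k)] * col[(j, k)] / total[k]
        g2 += 2.0 * observed * math.log(observed / expected)

    # Degrees of freedom without the empty-cell adjustment.
    df = (len(set(a)) - 1) * (len(set(b)) - 1)
    for c in cond:
        df *= len(set(c))
    return g2, df


g2_dep, df_dep = g2_statistic([0, 0, 1, 1], [0, 0, 1, 1])
g2_ind, df_ind = g2_statistic([0, 0, 1, 1], [0, 1, 0, 1])
g2_cond, df_cond = g2_statistic([0, 1, 0, 1], [0, 1, 0, 1], cond=[[0, 0, 1, 1]])
```

Only non-empty cells enter the sum, since a cell with an observed count of zero contributes nothing to G^{2}. A p-value can then be obtained from the chi-square survival function, for example with `scipy.stats.chi2.sf(g2, df)` if SciPy is available.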
As described next, the proposed DASSOMB uses G^{2} to test the association and independence between SNPs and disease status.
DASSOMB
We use a Markov Blanket-based algorithm, DASSOMB, to detect gene-gene interactions (Figure 3). Let T denote the disease status and V the set of all variables, containing T and all SNPs. DASSOMB has two types of phases: a forward phase and a backward phase. In each loop of the forward phase, the variable with the maximal G^{2} score conditioned on the current MB(T) is admitted into MB(T) if it is dependent on the target variable T. Each admission is followed by a backward phase that removes false positives by conducting conditional independence tests. When no more variables can be added to MB(T) in the forward phase, a final backward phase removes variables that do not belong to MB(T).
There are several methods for finding the Markov Blanket of a target T: the KS algorithm [32], the GS algorithm [33], IAMB [34], MMMB [35] and HITONMB [29]. Each has its own advantages and disadvantages. For example, IAMB is computationally efficient, but tends to include some false positives and is not sample-efficient. Compared with IAMB, DASSOMB adds a backward phase after each variable selected in the forward phase to remove false positives, which keeps MB(T) as small as possible and therefore improves sample efficiency. In addition, when conducting the conditional independence tests in the backward phase, it conditions on subsets S of MB(T) rather than on the whole remaining set MB(T) − {Y}. We require the subset S of MB(T) to be non-empty, excluding the empty set, because of the joint effect of sets of SNPs on the disease status. These two changes make the detected results more reliable.
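The interleaved forward and backward phases can be sketched as follows. Figure 3 is not reproduced here, so this is only an interpretation of the textual description, not the authors' code. The conditional independence test is passed in as a hypothetical callback `ci_test(x, target, cond)` returning a score and a dependence decision (in the method described above this would be a G^{2} test), and the sketch assumes that a variable removed in a backward phase is not reconsidered. The toy oracle at the end is made up purely to exercise the control flow.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            yield frozenset(subset)


def backward_phase(target, blanket, ci_test):
    """Remove members that are independent of target given some
    non-empty subset of the other blanket members."""
    for y in sorted(blanket):
        for s in nonempty_subsets(blanket - {y}):
            _, dependent = ci_test(y, target, s)
            if not dependent:
                blanket.discard(y)
                break


def markov_blanket_search(target, variables, ci_test):
    blanket = set()
    pool = set(variables) - {target}
    while pool:
        # Forward phase: score every candidate given the current blanket.
        scored = []
        for x in sorted(pool):
            score, dependent = ci_test(x, target, frozenset(blanket))
            if dependent:
                scored.append((score, x))
        if not scored:
            break
        _, best = max(scored)
        pool.discard(best)  # assumption: admitted variables leave the pool
        blanket.add(best)
        backward_phase(target, blanket, ci_test)
    backward_phase(target, blanket, ci_test)  # final backward phase
    return blanket


def toy_ci_test(x, target, cond):
    """Made-up oracle: A and B are true causes; C is associated with the
    target only through A; every other variable is noise."""
    if x in ("A", "B"):
        return {"A": 10.0, "B": 5.0}[x], True
    if x == "C":
        dependent = "A" not in cond
        return (12.0 if dependent else 0.0), dependent
    return 0.0, False


found = markov_blanket_search("T", ["A", "B", "C", "N"], toy_ci_test)
```

In this toy run C is admitted first because it has the highest marginal score, and it is removed by the backward phase as soon as A enters the blanket, which is the behaviour the interleaved backward phase is meant to provide. Enumerating all non-empty subsets is exponential in the blanket size, so the sketch is practical only while the blanket stays small.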
Abbreviations
GWAS: genome-wide association studies
DASSOMB: Detection of ASSOciations using Markov Blanket
SNP: single nucleotide polymorphism
MDR: multifactor dimensionality reduction
stepPLR: stepwise penalized logistic regression
BEAM: Bayesian epistasis association mapping
MCMC: Markov Chain Monte Carlo
RFE: recursive feature elimination
RFA: recursive feature addition
GA: genetic algorithm
SWSFS: Sliding Window Sequential Forward feature Selection
AMD: Age-related Macular Degeneration
MAF: minor allele frequency
LD: linkage disequilibrium
HWE: Hardy-Weinberg Equilibrium
DAG: directed acyclic graph
References
1. Antonarakis SE, Beckmann JS: Mendelian disorders deserve more attention. Nat Rev Genet 2006, 7: 277–282.
2. Cordell HJ: Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum Mol Genet 2002, 11: 2463–2468.
3. McKinney BA, Reif DM, Ritchie MD, Moore JH: Machine learning for detecting gene-gene interactions: a review. Appl Bioinformatics 2006, 5: 77–88.
4. Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB: Detection of gene x gene interactions in genome-wide association studies of human population data. Hum Hered 2007, 63: 67–84.
5. Easton DF, Pooley KA, Dunning AM, Pharoah PD, Thompson D, Ballinger DG, Struewing JP, Morrison J, Field H, Luben R, et al.: Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 2007, 447: 1087–1093.
6. Fellay J, Shianna KV, Ge D, Colombo S, Ledergerber B, Weale M, Zhang K, Gumbs C, Castagna A, Cossarizza A, et al.: A whole-genome association study of major determinants for host control of HIV-1. Science 2007, 317: 944–947.
7. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST, et al.: Complement factor H polymorphism in age-related macular degeneration. Science 2005, 308: 385–389.
8. The International HapMap Project. Nature 2003, 426: 789–796.
9. A haplotype map of the human genome. Nature 2005, 437: 1299–1320.
10. Marchini J, Donnelly P, Cardon LR: Genome-wide strategies for detecting multiple loci that influence complex diseases. Nature Genetics 2005, 37: 413–417.
11. Kooperberg C, Ruczinski I: Identifying interacting SNPs using Monte Carlo logic regression. Genetic Epidemiology 2005, 28: 157–170.
12. Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics (Oxford, England) 2003, 19: 376–382.
13. Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC: A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. Journal of Theoretical Biology 2006, 241: 252–261.
14. Ritchie MD, Hahn LW, Moore JH: Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genetic Epidemiology 2003, 24: 150–157.
15. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. American Journal of Human Genetics 2001, 69: 138–147.
16. Robnik-Šikonja M, Kononenko I: Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning 2003, 53: 23–69.
17. Park MY, Hastie T: Penalized logistic regression for detecting gene interactions. Biostatistics (Oxford, England) 2008, 9: 30–50.
18. Zhang Y, Liu JS: Bayesian inference of epistatic interactions in case-control studies. Nature Genetics 2007, 39: 1167–1173.
19. Chen SH, Sun J, Dimitrov L, Turner AR, Adams TS, Meyers DA, Chang BL, Zheng SL, Gronberg H, Xu J, Hsu FC: A support vector machine approach for detecting gene-gene interaction. Genetic Epidemiology 2008, 32: 152–167.
20. Jiang R, Tang W, Wu X, Fu W: A random forest approach to the detection of epistatic interactions in case-control studies. BMC Bioinformatics 2009, 10(Suppl 1): S65.
21. Pritchard JK, Przeworski M: Linkage disequilibrium in humans: models and data. American Journal of Human Genetics 2001, 69: 1–14.
22. Li J, Chen Y: Generating samples for association studies based on HapMap data. BMC Bioinformatics 2008, 9: 44.
23. Chang CC, Lin CJ: LIBSVM: A library for support vector machines. 2001.
24. Hamosh A, Scott AF, Amberger J, Bocchini C, Valle D, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2002, 30: 52–55.
25. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001, 29: 308–311.
26. Shriver MD, Mei R, Parra EJ, Sonpar V, Halder I, Tishkoff SA, Schurr TG, Zhadanov SI, Osipova LP, Brutsaert TD, et al.: Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation. Hum Genomics 2005, 2: 81–89.
27. Chen XW, Anantha G, Lin X: Improving Bayesian Network Structure Learning with Mutual Information-Based Node Ordering in the K2 Algorithm. IEEE Transactions on Knowledge and Data Engineering 2008, 20: 628–640.
28. Spirtes P, Glymour CN, Scheines R: Causation, prediction, and search. 2nd edition. MIT Press; 2000.
29. Aliferis CF, Tsamardinos I, Statnikov A: HITON: a novel Markov Blanket algorithm for optimal variable selection. AMIA Annual Symposium proceedings / AMIA Symposium 2003, 21–25.
30. Sokal RR, Rohlf FJ: Biometry: the principles and practice of statistics in biological research. 3rd edition. Freeman; 1995.
31. McDonald JH: Handbook of Biological Statistics. 2nd edition. Sparky House Publishing; 2009.
32. Koller D, Sahami M: Toward Optimal Feature Selection. In Proceedings of 13th conference on machine learning: 3–6 July 1996; Bari, Italy. Edited by: Lorenza Saitta. Morgan Kaufmann; 1996: 284–292.
33. Margaritis D, Thrun S: Bayesian Network Induction via Local Neighborhoods. In Proceedings of Neural Information Processing Systems 12: 29 Nov–4 Dec 1999; Denver. Edited by: Sara A. Solla, Todd K. Leen and Klaus-Robert Müller. MIT Press; 1999: 505–511.
34. Tsamardinos I, Aliferis C, Statnikov A, Statnikov E: Algorithms for Large Scale Markov Blanket Discovery. In Proceedings of the 16th International FLAIRS Conference: 11–15 May 2003; St. Augustine. Edited by: Doug Dankel. AAAI Press; 2003: 376–380.
35. Tsamardinos I, Aliferis C, Statnikov A: Time and Sample Efficient Discovery of Markov Blankets And Direct Causal Relations. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining: 24–27 August 2003; Washington, D.C. Edited by: Lise Getoor. ACM; 2003: 673–678.
Acknowledgements
This work is supported by the US National Science Foundation Award IIS0644366.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 3, 2010: Selected articles from the 2009 IEEE International Conference on Bioinformatics and Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/11?issue=S3.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
BH designed and implemented the DASSOMB method. BH and MP participated in testing the existing methods and analyzing experimental results. XWC conceived the study and designed the experiments. All authors helped in drafting the manuscript and approved the final manuscript.
Bing Han and Meeyoung Park contributed equally to this work.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Han, B., Park, M. & Chen, Xw. A Markov blanketbased method for detecting causal SNPs in GWAS. BMC Bioinformatics 11, S5 (2010). https://doi.org/10.1186/1471210511S3S5
Keywords
 Bayesian Network
 Minor Allele Frequency
 Epistatic Interaction
 Multifactor Dimensionality Reduction
 Causal SNPs