Detecting disease-associated genotype patterns
© Long et al; licensee BioMed Central Ltd. 2009
Published: 30 January 2009
In addition to single-locus (main) effects of disease variants, there is a growing consensus that gene-gene and gene-environment interactions may play important roles in disease etiology. However, for the very large numbers of genetic markers currently in use, it has proven difficult to develop suitable and efficient approaches for detecting effects other than main effects due to single variants.
We developed a method for jointly detecting disease-causing single-locus effects and gene-gene interactions. Our method is based on finding differences of genotype pattern frequencies between case and control individuals. Those single-nucleotide polymorphism markers with largest single-locus association test statistics are included in a pattern. For a logistic regression model comprising three disease variants exerting main and epistatic interaction effects, we demonstrate that our method is vastly superior to the traditional approach of looking for single-locus effects. In addition, our method is suitable for estimating the number of disease variants in a dataset. We successfully apply our approach to data on Parkinson Disease and heroin addiction.
Our approach is suitable and powerful for detecting disease susceptibility variants with potentially small main effects and strong interaction effects. It can be applied to large numbers of genetic markers.
It is now generally accepted that complex genetic traits such as diabetes and schizophrenia are under the influence of multiple interacting loci and environmental triggers, each with a possibly small effect. Thus, to overcome the limitations of traditional single-locus association analysis (looking for main effects of single marker loci), various methods have been developed to investigate the joint disease association of multiple markers. As an approximation to multivariate analysis , sums of single-locus association statistics  have proven to be efficient, powerful , and applicable to large numbers of markers. A more direct approach is to investigate whether sets of specific alleles or genotypes at different loci occur more frequently in case than control individuals [4, 5].
Haplotypes are sets of alleles, one each at different loci, on a chromosome. Each individual has two haplotypes but, because of unknown phase, it is generally not possible to identify the two specific haplotypes in a given individual. For this reason, we prefer to work with sets of geno types at different loci (diplotypes). To allow for possible interactions between any genes, we consider genotypes at different loci, wherever these occur in the genome, and refer to such sets of genotypes as genotype patterns.
The simplest situation to consider is that of two genetic markers. Throughout this paper we will focus on single nucleotide polymorphisms (SNPs). One pair of SNPs, each with 3 genotypes, comprises a maximum of 9 genotype patterns. A possible strategy is to investigate all pairs of SNPs to see for each pair whether the genotype pattern frequencies are different in case and control individuals . Several diseases have been described requiring a mutation at two different loci while occurrence of a mutation at only one locus does not lead to disease .
At least in experimental organisms, researchers have developed methods to specifically search for epistatic interactions between two loci, with the aim to detect networks of interacting loci . In human genetics, several approaches to detecting the joint effects of multiple susceptibility variants have been proposed [9–14]. However, many of these methods are applicable only to small numbers of SNPs and are not suitable for genome-wide analysis of 100,000s of markers. For this reason, two-step approaches are more promising [2, 15, 16]. Specifically, for two-locus disease models, a two-step approach has been proposed that initially selects SNPs based on significant single-locus tests, with logistic regression analysis of all possible main and interaction effects at the second stage . Here, we proceed in a similar manner but with important differences: initial selection of SNPs is also carried out based on their single-locus test results but irrespective of whether these are significant, and the number of these test SNPs to be carried forward for second-stage analysis may be varied from 1 up to any desired maximum number, m, limited only by computer resources. Instead of logistic regression analysis, we propose to test whether for a given number m of test SNPs the frequencies of genotype patterns is different in case and control individuals.
It should be noted that genotype pattern frequencies comprise main effects (at one or multiple loci) and epistatic interaction effects. Here our focus is on genotype patterns with an interest in finding pattern frequencies reflecting a high degree of epistatic interaction so that disease-associated SNPs may be found even though they contribute to disease more through interaction effects than direct main effects. Such genes have been found, for example, in barley , and so-called digenic traits are known in humans .
Pattern recognition approach
Consider a dataset with a certain number of case and control individuals, each genotyped at a number M of SNPs, where M might be equal to 1 million, say. We want to find disease-associated patterns of m SNPs, m <<M, each pattern consisting of a genotype at each of the m SNPs. The total number of subsets of m SNPs that can be formed from M SNPs is equal to M!/[m!(M - m)!], and each of these subsets contains a maximum of 3m genotype patterns. Thus, searching through all possible patterns of even a small number m of SNPs is an enormous task and generally too computationally intensive. One might consider picking a random sample of all possible genotype patterns and evaluate them for association with disease, that is, testing whether pattern frequencies are significantly different between case and control individuals. A more satisfactory solution seems to select patterns with certain desirable properties, for example, patterns formed of SNPs that have been suitably prioritized . In the absence of biological and sequence-based information, the statistically most relevant prioritization is based on the significance level (p-value) achieved in an association test. In other words, we select SNPs having the largest main effects. Presumably, epistatic interaction as a disease causing mechanism will also result in some main effects although these are not necessarily significant. Such main effects are at least predicted under logistic regression models (see below). We propose the following two-step approach.
(1) For each of a possibly very large number M of SNPs, an association test is carried out, for example, chi-square is computed for a 2 × 3 contingency table of genotypes, where the two rows correspond to cases and controls, and columns refer to the three SNP genotypes. The top m SNPs (with largest chi-square) are singled out for further analysis, where m may be guided by considerations such as potential importance for disease of the genes containing these SNPs. (2) For the selected m test SNPs, all genotype patterns are identified and their frequencies established. Those patterns with frequencies below 5% in cases and controls each are pooled into a "rare" class. For example, with m = 3, the total number of genotype patterns is 27. Some patterns may not occur in the given data and rare patterns are pooled into a single class so that only r patterns with appreciable frequency are observed. These data are now arranged in a 2 × r contingency table, for which chi-square and an associated significance level, pm, is computed.
We take pm as our experiment-wise test statistic (rather than chi-square, which may have different degrees of freedom depending on the number s of genotype patterns). Clearly, pm may not be taken at face value as a significance level – the type 1 error would be grossly inflated because pm is based on the ascertainment of the m best SNPs. Thus, we need to establish the significance level associated with our observed value of pm, that is, determine the probability of obtaining a value as small or smaller than the one observed without there being an association (null hypothesis). We do this in N randomization samples, that is, we randomly permute the labels "case" and "control" and leave everything else intact. In each randomization sample, the top m SNPs are determined, wherever they occur in the genome, and p'm is computed in analogy to pm in the observed dataset. A large number of randomization samples will then furnish an approximate null distribution for pm, from which the significance level is obtained as the proportion of randomization samples with p'm = pm. Unless noted otherwise, our calculations are carried out with N = 50,000 randomization samples. We compare our pattern test with the following conventional SNP-by-SNP approach.
For each of the m best SNPs we compute chi-square for a 2 × 3 contingency table, with rows corresponding to cases/controls and columns representing the three SNP genotypes. Significance levels associated with each of the m SNPs are again evaluated in randomization samples. That is, for each observed chi-square the proportion of randomization samples is determined in which a chi-square as large or larger than the one observed occurs. In other words, the reference (null) distribution is that of the largest chi-square .
Disease model and power calculations
To evaluate whether our pattern test does better than the single-locus test, we carry out power calculations by generating case and control individuals under a suitable disease inheritance model. Because of its flexibility, we apply a logistic regression model  as follows. We assume md = 3 disease loci, each with two equally frequent alleles A and B. At the i-th disease locus, we assign a code xi = -1 to genotypes AA and AB, and a code xi = +1 to genotype BB. Thus, genotype codes of -1 and +1 have associated respective probabilities of occurrence of 0.75 and 0.25. Then the logistic disease model is given by
log [f/(1 - f)] = a0 + a1(x1 + x2 + x3) + a2(x1x2 + x1x3 + x2x3) + a3x1x2x3 = s,
where f is the conditional probability of being affected given the genotype pattern specified by the xi's (f is often called penetrance) and the ai are coefficients representing main (a1) and interaction effects (a2, a3). For simplicity, all main effects are the same and all first-order (pairwise) interaction effects are the same, while a1 = a2 = a3 = 0 corresponds to the null hypothesis of no genetic effects. For any setting of ai > 0 (i > 0), the parameter a0 is chosen in such a way that the model predicts a disease prevalence of K = 0.05. For example, this prevalence is predicted by parameter values a1 = a2 = 0, a3 = 3, and a0 = -5.05. Note that such a model containing only a second-order interaction effect does induce some direct genotype effects at each of the three disease loci. Depending on the genotype pattern, these parameter settings lead to odds ratios (ORs) for disease ranging from 0.0034 through 3.12 and, at each single locus, lead to ORs of 0.58 and 1.71 for the respective genotype codes xi = -1 and xi = +1.
Initially, we only consider md = 3 SNPs in complete association with disease susceptibility variants. This way we directly compare power between the pattern and single-locus tests. Subsequently we also consider the effects of unassociated SNPs and how they degrade power.
Simulation with disease SNPs
Simulation with disease and unassociated SNPs
With md susceptibility SNPs and M unassociated SNPs (total of md + M SNPs), we need to define "success" in our simulations. In a given randomization sample, it is no longer sufficient that p'm = pm. In addition, we require that the particular subset of m = 3 ascertained SNPs with highest chi-square values comprise at least one disease SNP. An analogous rule applies to the single-locus test.
Choice of number Of SNPs for analysis
Analysis of Parkinson's Disease dataset
A dataset with approximately 540 case and control individuals and approximately 408,000 SNPs genome-wide is available for re-analysis . For our purpose, we chose to work with the 19,494 SNPs on chromosome 11 that passed our quality control measures; SNP rs10501570 on that chromosome was previously associated with Parkinson's Disease (PD) . We decided to work with a subset size of m = 3 SNPs. In chi-square tests for 2 × 3 tables (case/control phenotype versus three SNP genotypes), the three largest values occurred for SNPs rs12364577, rs1377470, and rs10501570. In 10,000 randomization samples, the largest chi-square had an associated significance level of 5/10,000 = 0.0005 (corrected for multiple testing) while the other two chi-square values had associated p-values of 0.0023 and 0.0030. Forming genotype patterns for these three SNPs, we found five patterns with frequencies of at least 0.05 in either case or control individuals; the remaining patterns were combined into a "rare" class. The chi-square for the resulting 2 × 6 table had an associated randomized significance level of 0.0064. Evidently, for this dataset the single-locus test was more effective than our pattern test (0.0005 < 0.0064), presumably because of strong single-locus effects.
Analysis of a dataset for heroin addiction
In 104 former severe heroin addicts and 101 control individuals (all Caucasians), a case-control association analysis was carried with 10,000 SNPs . Single-locus analysis furnished nonsignificant results: The largest chi-square in 2 × 3 tables of genotypes had an associated p-value of 0.15 (obtained in 10,000 randomization samples). On the other hand, evaluating genotype patterns for the three SNPs with individually smallest p-values revealed five common patterns, and the resulting 2 × 6 table of pattern frequencies had an associated experiment-wise significance level of 0.026. As reported in , a specific genotype pattern was associated with heroin addiction with an odds ratio of 6.25 while none of the component SNPs by themselves were significantly disease-associated. Interestingly, when we investigated genotype patterns comprising two, four, or five SNPs, the significance levels associated with the corresponding pattern tables were all in excess of 0.30. Thus, the specific set of three genetic variants identified seems to predispose to disease .
We report here a new multi-locus method for genetic case-control association analysis that jointly evaluates main and interaction effects of multiple genetic variants. In contrast to our Set Association method developed previously , it is not based on main effects of individuals SNPs only. Our method is able to find epistatic interaction effects although the test SNPs chosen for analysis are determined by SNP-specific association statistics. Clearly, if SNPs have weak individual effects then they are unlikely to be picked up by our approach. In an application to a study of former heroin addicts, our genotype pattern method clearly showed its advantages over standard single-locus methods.
We implemented our approach in a computer program, RandomPat, which is available on request and will soon be available for downloading from our website .
In its current implementation, our method has a shortcoming in that it cannot handle missing observations. Thus, for m SNPs it will ignore patterns that contain one or more missing genotypes.
A major advantage of our approach is that the number, m, of test SNPs can be chosen by the investigator. Varying m from 1 up to a suitable upper limit will show, which number of test SNPs is most significant in a given dataset. As demonstrated in the previous section, such an approach is expected to estimate the number of disease variants if they are all of equal strengths.
As we demonstrated, our method is clearly superior to the traditional approach of looking for single-locus effects. In addition, it is suitable for estimating the number of disease variants in a dataset. On the other hand, purely epistatic loci (no main effects by themselves) cannot be detected by our approach.
This study used data from the SNP Database at the NINDS Human Genetics Resource Center DNA and Cell Line Repository , as well as clinical data. The original genotyping was performed in the laboratories of Drs. Singleton and Hardy, (NIA, LNG), Bethesda, MD USA. We gratefully acknowledge support of China NSF grants no. 30700442 (QZ) and no. 30730057 (JO), and U.S. grant MH44292. One of us (JO) is a co-author on the heroin-addiction paper; the current work grew out of that collaboration, which we gratefully acknowledge.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
- Manly BFJ: Randomization, bootstrap, and Monte Carlo methods in biology. 2007, Boca Raton, FL: Chapman & Hall/CRC, 3Google Scholar
- Hoh J, Wille A, Ott J: Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res. 2001, 11 (12): 2115-2119. 10.1101/gr.204001.PubMed CentralView ArticlePubMedGoogle Scholar
- Kim S, Zhang K, Sun F: Detecting susceptibility genes in case-control studies using set association. BMC Genet. 2003, 4 (Suppl 1): S9-10.1186/1471-2156-4-S1-S9.PubMed CentralView ArticlePubMedGoogle Scholar
- Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69 (1): 138-147. 10.1086/321276. Epub 2001 Jun 2011.PubMed CentralView ArticlePubMedGoogle Scholar
- Moore JH, Ritchie MD: STUDENTJAMA. The challenges of whole-genome approaches to common diseases. Jama. 2004, 291 (13): 1642-1643. 10.1001/jama.291.13.1642.View ArticlePubMedGoogle Scholar
- Hoh J, Ott J: Mathematical multi-locus approaches to localizing complex human trait genes. Nat Rev Genet. 2003, 4 (9): 701-709. 10.1038/nrg1155.View ArticlePubMedGoogle Scholar
- Ming JE, Muenke M: Multiple hits during early embryonic development: digenic diseases and holoprosencephaly. Am J Hum Genet. 2002, 71 (5): 1017-1032. 10.1086/344412.PubMed CentralView ArticlePubMedGoogle Scholar
- Roguev A, Wiren M, Weissman JS, Krogan NJ: High-throughput genetic interaction mapping in the fission yeast Schizosaccharomyces pombe. Nat Methods. 2007, 4 (10): 861-866. 10.1038/nmeth1098.View ArticlePubMedGoogle Scholar
- Nelson MR, Kardia SL, Ferrell RE, Sing CF: A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001, 11 (3): 458-470. 10.1101/gr.172901.PubMed CentralView ArticlePubMedGoogle Scholar
- Zee RY, Hoh J, Cheng S, Reynolds R, Grow MA, Silbergleit A, Walker K, Steiner L, Zangenberg G, Fernandez-Ortiz A: Multi-locus interactions predict risk for post-PTCA restenosis: an approach to the genetic analysis of common complex disease. Pharmacogenomics J. 2002, 2 (3): 197-201. 10.1038/sj.tpj.6500101.View ArticlePubMedGoogle Scholar
- Ruczinski I, Kooperberg C, LeBlanc ML: Exploring interactions in high-dimensional genomicdata: an overview of LogicRegression, with applications. Journal of Multivariate Analysis. 2004, 90: 178-195. 10.1016/j.jmva.2004.02.010.View ArticleGoogle Scholar
- Kooperberg C, Ruczinski I: Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol. 2005, 28 (2): 157-170. 10.1002/gepi.20042.View ArticlePubMedGoogle Scholar
- Lauer MS, Alexe S, Pothier Snader CE, Blackstone EH, Ishwaran H, Hammer PL: Use of the logical analysis of data method for assessing long-term mortality risk after exercise electrocardiography. Circulation. 2002, 106 (6): 685-690.View ArticlePubMedGoogle Scholar
- Reddy A, Wang H, Yu H, Bonates TO, Gulabani V, Azok J, Hoehn G, Hammer PL, Baird AE, Li KC: Logical Analysis of Data (LAD) model for the early diagnosis of acute ischemic stroke. BMC Med Inform Decis Mak. 2008, 8: 30-10.1186/1472-6947-8-30.PubMed CentralView ArticlePubMedGoogle Scholar
- Hoh J, Wille A, Zee R, Cheng S, Reynolds R, Lindpaintner K, Ott J: Selecting SNPs in two-stage analysis of disease association data: a model-free approach. Ann Hum Genet. 2000, 64 (Pt 5): 413-417. 10.1046/j.1469-1809.2000.6450413.x.View ArticlePubMedGoogle Scholar
- Marchini J, Donnelly P, Cardon LR: Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet. 2005, 37 (4): 413-417. 10.1038/ng1537. Epub 2005 Mar 2027.View ArticlePubMedGoogle Scholar
- Xu S, Jia Z: Genomewide analysis of epistatic effects for quantitative traits in barley. Genetics. 2007, 175 (4): 1955-1963. 10.1534/genetics.106.066571.PubMed CentralView ArticlePubMedGoogle Scholar
- Jiang R, Yang H, Zhou L, Kuo CC, Sun F, Chen T: Sequence-based prioritization of nonsynonymous single-nucleotide polymorphisms for the study of disease mutations. Am J Hum Genet. 2007, 81 (2): 346-360. 10.1086/519747.PubMed CentralView ArticlePubMedGoogle Scholar
- Xu S: An empirical Bayes method for estimating epistatic effects of quantitative trait loci. Biometrics. 2007, 63 (2): 513-521. 10.1111/j.1541-0420.2006.00711.x.View ArticlePubMedGoogle Scholar
- Fung HC, Scholz S, Matarin M, Simon-Sanchez J, Hernandez D, Britton A, Gibbs JR, Langefeld C, Stiegert ML, Schymick J: Genome-wide genotyping in Parkinson's disease and neurologically normal controls: first stage analysis and public release of data. Lancet Neurol. 2006, 5 (11): 911-916. 10.1016/S1474-4422(06)70578-6.View ArticlePubMedGoogle Scholar
- Nielsen DA, Ji F, Yuferov V, Ho A, Chen A, Levran O, Ott J, Kreek MJ: Genotype patterns that contribute to increased risk for or protection from developing heroin addiction. Mol Psychiatry. 2008, 13 (4): 417-428. 10.1038/sj.mp.4002147. Epub 2008 Jan 2015.View ArticlePubMedGoogle Scholar
- Statistical Genetics Beijing. [http://www.genemapping.cn]
- Coriell NINDS Collection. [http://ccr.coriell.org/ninds]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.