GLOSSI: a method to assess the association of genetic loci-sets with complex diseases
© Chai et al; licensee BioMed Central Ltd. 2009
Received: 07 July 2008
Accepted: 03 April 2009
Published: 03 April 2009
The developments of high-throughput genotyping technologies, which enable the simultaneous genotyping of hundreds of thousands of single nucleotide polymorphisms (SNP) have the potential to increase the benefits of genetic epidemiology studies. Although the enhanced resolution of these platforms increases the chance of interrogating functional SNPs that are themselves causative or in linkage disequilibrium with causal SNPs, commonly used single SNP-association approaches suffer from serious multiple hypothesis testing problems and provide limited insights into combinations of loci that may contribute to complex diseases. Drawing inspiration from Gene Set Enrichment Analysis developed for gene expression data, we have developed a method, named GLOSSI (G ene-lo ci S et Analysi s), that integrates prior biological knowledge into the statistical analysis of genotyping data to test the association of a group of SNPs (loci-set) with complex disease phenotypes. The most significant loci-sets can be used to formulate hypotheses from a functional viewpoint that can be validated experimentally.
In a simulation study, GLOSSI showed sufficient power to detect loci-sets with less than 10% of SNPs having moderate-to-large effect sizes and intermediate minor allele frequency values. When applied to a biological dataset where no single SNP-association was found in a previous study, GLOSSI was able to identify several loci-sets that are significantly related to blood pressure response to an antihypertensive drug.
GLOSSI is valuable for association of SNPs at multiple genetic loci with complex disease phenotypes. In contrast to methods based on the Kolmogorov-Smirnov statistic, the approach is parametric and only utilizes information from within the interrogated loci-set. It properly accounts for dependency among SNPs and allows the testing of loci-sets of any size.
The genetic component of complex disorders such as hypertension, Parkinson's disease, cancer, and diabetes is believed to result from the compound effect of multiple DNA variations in different chromosomal regions. In this context, the paradigm of searching across the genome for univariate single nucleotide polymorphism (SNP) associations may not be the most appropriate or realistic strategy. A preferred approach would consider the effects of multiple SNPs jointly. Unstructured enumeration of all possible combinations of SNPs for association is computationally demanding, if not infeasible. Variable selection needs to be performed before testing such multi-locus effects due to the discrepancy between numbers of SNPs and sample size in a typical genome-wide association study. In the current work, we focused the proposed association analyses of SNPs belonging to genes that are biologically related. The criteria for grouping SNPs can be based on biological theory, expert opinion, or localization in genes that control the same functional process or are co-regulated. Such groups of SNPs will be referred to as loci-sets. We have developed a method called GLOSSI (G ene-lo ci S et Analysi s) to score loci-sets as a function of the significance level of the individual SNPs comprising each loci-set. In what follows, we will use the terms locus and SNP interchangeably.
The idea of directly scoring a predefined set of genetic features is not new. It has sparked considerable interest in the context of gene expression data analysis since the publication of the pioneering paper by Mootha et al. [1, 2]. These authors designated and implemented the Gene Set Enrichment Analysis (GSEA) approach to identify functionally related genes that display overall coordinated expression changes with respect to biological states or disease phenotypes. The annotated biological function is expected to be more relevant if the set is 'enriched' with genes showing good-to-moderate association signals as compared to the remaining genes.
Recently, Wang et al.  built on work of Subramanian et al.  and extended it to genotyping data. Since many SNPs can be assigned to the same gene, the authors used the best signal (biggest χ2-value) from each gene in their calculation. Similar to GSEA, enrichment of association signals was measured by using a modified Kolmogorov-Smirnov (KS) statistic and statistical significance determined through permutation testing. One drawback of the KS statistic is that it depends, in part, on the signals outside of the tested loci-set. Put another way, it assumes the 'real' causal SNPs are fully contained in a single relevant loci-set, if such a set exists. In practice, causal SNPs can probably span across multiple loci-sets, without accounting for the imperfect SNP classification that might arise, for instance, from the empirical definition of the boundary of a loci-set. Under these conditions, application of the KS statistic will result in the attenuation of the overall significance of the relevant loci-set. Another limitation pointed out by the authors is the need to carry out the computationally demanding permutation of sample labels, instead of the faster gene label permutation to properly assess the statistical significance of the KS statistic. When many loci-sets have to be tested, the computational challenge is increased since a larger number of permutations have to be performed so as to detect significant association with correction for multiple hypothesis testing.
The method we describe below addresses these two issues. GLOSSI scores loci-sets by an alternative strategy that only focuses on information from within a loci-set and allows the determination of significance level with relative computational ease.
Results and discussion
Fisher's combined probability test
A measure of statistical significance is first calculated between each of the J loci with a chosen binary phenotype (eg case versus control). Either allele or genotype frequencies can be used as the basis for testing a locus in terms of its ability to distinguish the two phenotypic classes under study. Various statistical approaches are appropriate for deriving the p-value, from contingency-table-based methods: Fisher's exact test, Pearson's χ2 test, or Cochran-Armitage trend test; to regression-based techniques: logistic analysis, probit analysis, or complementary-log-log analysis. These approaches could have widely different methodological assumptions, specifically on the way in which the phenotype depends on the loci (eg additive, recessive, dominant, or unconstrained). Because it is a common belief that the additive assumption is generally adequate for complex disorders, we opt for the Cochran-Armitage trend test in view of its statistical power. The trend test statistic is formulated in the Method Section. Henceforth, we denote p-value for the j- th locus by p j .
The null hypothesis of no association between y i and s ij implies that p j is distributed as a standard uniform random variable, taking values in the interval (0,1]. Furthermore, t j = -2log p j has a chi-square distribution with two degrees of freedom.
with in which Ω is the covariance matrix of t. Brown showed in his paper that the approximation works well in general except when t j 's are highly negatively correlated. Since correlations between t j s from two-sided tests can only be positive, the approximation should be adequate for most genetic association studies. Note that Ω is unknown and needs to be estimated. We chose to perform the estimation through shuffling the phenotype labels 100 times, though smaller number of permutations are often sufficient to attain a stable estimate for Ω. Details of the permutation scheme are deferred to the Method Section.
In order to objectively assess the potential of the proposed methodology, we conducted a simulation study. A web-based tool, namely HapSample , was used to generate case-control samples with genetically realistic genotypes. We restricted the simulation to a subset of SNPs interrogated by the Sentrix® HumanHap300 BeadChip , which consists mostly of tag SNPs. More explicitly, a total of 59,140 SNPs no more than 20 millions bases away from the end of each autosome were retained in the study. Of these, we filtered out 770 loci based on the following criteria: 737 are not in the HapMap phase I/II data ; 20 have a minor allele frequency (MAF) of zero according to the HapMap project (Utah population); and 13 lie near the edge of some chromosomes which possess linkage patterns (inferred from HapMap data) that are not compatible with the simulator recombination algorithm.
Parameter specification in the simulated examples
Number of causal SNPs
MAF† of causal SNPs
Whether causal SNPs were 'genotyped'
1, 5 or 20
1, 5 or 20
1, 5 or 20
1, 5, or 20
1, 5, or 20
1 or 5
1, 5, or 17‡
Estimated type I error rates for GLOSSI in the simulated examples
Total sample size
Nominal rate, α
Proportion of p-value <α
It is of interest to compare the performance of GLOSSI against the modified KS approach proposed by Wang et al. . To this end, eight additional non-overlapping regions (hypothetical genes) were randomly picked from each chromosome. We altered the extension of the newly selected regions to either double, equal or halve the size of the hypothetical genes in the loci-set. These were used to create a reference distribution in the modified KS test. Without loss of generality, we focused the comparison on the null and two alternative hypotheses with the use of our data set of 200 cases and 200 controls. Scenarios 17 and 18 were chosen here because their powers were near 80% in the case of GLOSSI.
Estimated type I error rates for the modified KS statistic
Type of permutation
Relative size of genes: out/in loci-set
Proportion of p-value < 0.05
Power of the modified KS statistic when two genes in the reference set consist of a causal SNP
Number of genes outside of loci-set
Relative size of genes: out/in loci-set
Proportion of p-value < 0.05
Antihypertensive response example
Loci-sets with unadjusted p-value no greater than 0.1% in the antihypertensive response example
No. relevant gene
TPO signaling pathway
Erk1/Erk2 Mapk signaling pathway
Sprouty regulation of tyrosine kinase signals
Multiple antiapoptotic pathways from IGF-1R signaling lead to bad phosphorylation
Transcription factor CREB and its extracellular signals
Growth hormone signaling pathway
PTEN dependent cell cycle arrest and apoptosis
Upregulated in acute rejection transplanted kidney biopsies
IL 3 signaling pathway
Trka receptor signaling pathway
IL-2 receptor beta chain in T cell activation
B cell antigen receptor
IL 4 receptor signaling in B lymphocytes
Calcium signaling by HBx of Hepatitis B virus
IGF-1 signaling pathway
Down regulated following Apc loss
Inhibition of cellular proliferation by gleevec
IL 6 signaling pathway
Insulin signaling pathway
Upregulated in fibroblasts following infection with human cytomegalovirus
Down regulated by both curcumin and sulindac in SW260 colon carcinoma cells
Upregulated by TPA in resistant HL-525 cells
Upregulated by UV-B light in epidermal keratinocytes
Upregulated in well functioning transplanted kidney biopsies
GLOSSI reported 26 loci-set with a q-value lower than 5% in Whites but no loci-set passed this cutoff in Blacks (Table 5). The size of the 26 loci-sets ranges from 16 to 300 SNPs. This result is quite encouraging since single SNP methods previously applied to the same datasets could not detect any SNP that was statistically significantly associated with DBP response to hydrochlorothiazide (unpublished results). Among the top ranking loci-sets of the populations, two were derived from the same gene expression experiment of kidney transplant biopsies . These loci-sets are 'upregulated in acute rejection transplanted kidney biopsies' (MsigDB ID = c2:834, p-value = 0.0003) for non-Hispanic Caucasians and 'upregulated in well functioning transplanted kidney biopsies' (MsigDB ID = c2:836, p-value = 0.0009) for African Americans. Although biological interpretation of the results is not straightforward, one can hypothesize that genes in those two loci-sets are related to kidney pathophysiology or normal physiology and, therefore, may be relevant to sodium excretion, blood pressure regulation, and DBP response to diuretic therapy. One could also speculate that the different physiological mechanisms indexed by these two loci-sets are consistent with known differences in diuretic response between Black and White individuals with hypertension.
Other loci-sets are less informative and harder to interpret. Inspection of their names suggests that several of the significant loci-sets in Whites could conceivably be involved in regulation of antihypertensive drug response. These include 'growth hormone signaling pathway' (MsigDB ID = c2:198), 'calcium signaling by HBx of Hepatitis B virus' (MsigDB ID = c2:569), and 'insulin signaling pathway' (MsigDB ID = c2:229). The relationship of some other loci-sets with DBP response to hydrochlorothiazide requires a more speculative interpretation. For example, a few of them appear to be related to cell growth regulation but with no obvious relationship to blood pressure. However, a possible connection could exist through mitogenic hormones that are often vasoconstrictive and antinatriuretic and, therefore, would elevate blood pressure (eg, angiotensin II). Conversely, vasodilating and natriuretic hormones that lower blood pressure are often anti-mitogenic (eg, atrial natriuretic peptide).
The GLOSSI methodology for scoring loci-sets (a priori defined groups of SNPs) overcomes limitations of commonly-used single SNP approaches. The origin of a loci-set facilitates the interpretation of statistical outputs, providing a biological understanding of the mechanisms that underlie diseases or other phenotypes of interest. In contrast to the approach of Wang et al. , the proposed procedure is parametric: it assumes that p-values from individual SNPs follow a standard uniform distribution under the null hypothesis of no association and infers statistical relevance of each loci-set against a χ2 distribution. Consequently it has the advantage of computational speed, demands measurements only of SNPs within the query loci-set, and imposes no constraint on the size of the set. Although we only focus on binary phenotypes in this communication, the technique is general and equally applicable to other kinds of outcomes or any types of genome-scale data. In particular, the locus-specific p-values could be generated by statistical methods equipped with the ability to control for the presence of covariates (eg age, gender, etc). Appropriate adjustment for additional covariates would allow more accurate estimation of the true genotype-phenotype effect. The performance of the proposed method was evaluated by using computer simulated data as well as data from an antihypertensive pharmacogenomic study. In the simulation study, GLOSSI yielded the anticipated type-I error rate when no SNP in the loci-set was related to the binary outcome. Also, it demonstrated sufficiently high power for detecting loci-sets in which a fair number of SNPs (< 10%) had moderate to large effect sizes and intermediate MAF values. In the real data example, the proposed method appears to have been able to identify novel loci-sets not previously known or suspected to be involved in blood pressure regulation or antihypertensive drug response.
The lack of firm biological interpretation in the antihypertensive response example underlines one of the limitations of our method. Although GLOSSI is capable of detecting relevant loci-sets as demonstrated in the simulation experiment, its usefulness depends directly on the definition and availability of loci-sets when applying it to biological data. The currently available functionally annotated loci-sets are biased toward groups of genes involved in cancers since most of them were derived from such disease studies but very few of them focus on blood pressure or kidney-related investigations. Undoubtedly, more annotated and curated loci-sets will be available over time, which in turn will increase the applicability of GLOSSI for a given disease phenotype. The definition of a loci-set itself can also be challenged. The current assignment of SNPs to a gene, according to fixed physical distance boundaries from that gene, might not be optimal, not even in principle, let alone given the uncertainty in determining the appropriate fixed distance.
It must be stressed that GLOSSI only accounts for the additive, independent effect of individual SNPs and, therefore, ignores possible biological interactions that might exist. The joint effect of SNPs within a loci-set can be captured using multivariate methods [19–21]. However, a fair comparison of multivariate models derived from various loci-sets is hard to achieve since it demands sample label permutation testing. More specifically, the statistical model needs to be rebuilt for every loci-set in each permutation, which quickly becomes impractical as the numbers of loci-sets and permutations increase. Other complications that might arise during the application of multivariate analysis include overfitting and model instability. To balance the need for joint effects modeling with computational time effectiveness, one can envision developing a hierarchical approach that first uses GLOSSI for rapid identification of significant loci-sets followed by more extensive multivariate modeling. This approach is currently being investigated in our group.
Cochran-Armitage trend test
with δ(.) signifies an indicator function taking value one if its argument is correct and zero otherwise. The null hypothesis of the test is no linear trend in the proportion of group memberships at each of the three SNP genotypes, i.e. the proportion of 0/1 class distinction is the same for all levels. The statistic follows a standard normal distribution under this null hypothesis. Hence the evidence of association (p-value) between y i and s ij can be inferred by comparing T j 2 to a χ2 distribution with 1 degree of freedom, i.e. this will be a two-sided test.
Estimating covariance matrix by permutation
The existence of local LD implies that t j , j = 1,..., J, are not independent. Their covariance matrix, Ω, under the null hypothesis can be estimated through the use of permutation as follows. Assume that the phenotype measurements are independently and identically distributed. The subscript of y1, y2,..., y I are first shuffled reiteratively. One could generate either all I! permissible permutations or just a random sample of them. Then recalculate t j for all loci over every permuted datasets. Each of the resulting set of t-values represents a joint observation from the sampling distribution of t = (t j ,..., t j )T that is consistent with the null hypothesis. Given enough permutations, the empirical covariance of the t-values from the above should approximate Ω. Note that because not all SNPs are assigned to loci-sets, it is more computationally efficient to perform the calculations only on the relevant loci.
This work was supported partly by HL 74735, HL 53330 and Mayo Foundation. We are grateful to the authors of HapSample, in particular Fred Wright, for their assistance in the simulation study.
- Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC-1 α -responsive genes involved in oxidative phosphorylation are coordinately down-regulated in human diabetes. Nat Genet 2003, 34: 267–273. 10.1038/ng1180View ArticlePubMedGoogle Scholar
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102: 15545–15550. 10.1073/pnas.0506580102PubMed CentralView ArticlePubMedGoogle Scholar
- Wang K, Li M, Bucan M: Pathway-based approaches for analysis of genome-wide association studies. Am J Hum Genet 2007, 81: 1278–1283. 10.1086/522374PubMed CentralView ArticlePubMedGoogle Scholar
- Fisher RA: Statistical methods for research workers. London: Oliver and Boyd; 1932.Google Scholar
- Little RC, Folks JL: Asymptotic optimality of Fisher's method of combining independent tests. J Am Stat Assoc 1971, 66: 802–806. 10.2307/2284230View ArticleGoogle Scholar
- Little RC, Folks JL: Asymptotic optimality of Fisher's method of combining independent tests II. J Am Stat Assoc 1973, 68: 193–194. 10.2307/2284167View ArticleGoogle Scholar
- Brown MB: A method for combining non-independent, one-sided tests of significance. Biometrics 1975, 31: 987–992. 10.2307/2529826View ArticleGoogle Scholar
- Wright FA, Huang H, Guan X, Gamiel K, Jeffries C, Barry WT, Pardo-Manuel F, Sullivan PF, Wihelmsen KC, Zou F: Simulating association studies: a data-based resampling method for candidate regions or whole genome scans. Bioinformatics 2007, 23: 2581–2588. 10.1093/bioinformatics/btm386View ArticlePubMedGoogle Scholar
- Illumina Inc[http://www.illumina.com]
- International HapMap Project[http://www.hapmap.org]
- Chapman AB, Schwartz GL, Boerwinkle E, Turner ST: Predictors of antihypertensive response to a standard dose of hydrochlorothiazide for essential hypertension. Kidney Int 2002, 61: 1047–1055. 10.1046/j.1523-1755.2002.00200.xView ArticlePubMedGoogle Scholar
- Affymetrix Inc[http://www.affymetrix.com]
- Cutler DJ, Zwick ME, Carrasquillo MM, Yohn CT, Tobin KP, Kashuk C, Mathews DJ, Shah NA, Eichler EE, Warrington JA, Chakravarti A: High-throughput variation detection and genotyping using microarrays. Genome Res 2001, 11: 1913–1925.PubMed CentralPubMedGoogle Scholar
- Turner ST, Bailey KR, Fridley BL, Chapman AB, Schwartz GL, Chai HS, Sicotte H, Kocher JPA, Rodin AS, Boerwinkle E: Large-scale genomic association analysis suggests a pharmacogenomic locus on chromosome 12 influencing antihypertensive response to thiazide diuretic. Hypertension 2008, 52: 359–365. 10.1161/HYPERTENSIONAHA.107.104273PubMed CentralView ArticlePubMedGoogle Scholar
- Molecular Signature Database – Broad Institute, Cambridge MA[http://www.broad.mit.edu/gsea/msigdb]
- Storey JD: direct approach to false discovery rates. J Roy Stat Soc B 2002, 64: 479–498. 10.1111/1467-9868.00346View ArticleGoogle Scholar
- Storey JD: The positive false discovery rate: a Bayesian interpretation and the q-value. Ann Stat 2003, 31: 2013–2035. 10.1214/aos/1074290335View ArticleGoogle Scholar
- Flechner SM, Kurian SM, Head SR, Sharp SM, Whisenant TC, Zhang J, Chismar JD, Horvath S, Mondala T, Gilmartin T, Cook DJ, Kay SA, Walker JR, Salomon DR: Kidney transplant rejection and tissue injury by gene profiling of biopsies and peripheral blood lymphocytes. Am J Transplant 2004, 4: 1475–1489. 10.1111/j.1600-6143.2004.00526.xPubMed CentralView ArticlePubMedGoogle Scholar
- Dinu V, Zhoa H, Miller PL: Integrating domain knowledge with statistical and data mining methods for high-density genomic SNP disease association analysis. J Biomed Inform 2007, 40: 750–760. 10.1016/j.jbi.2007.06.002View ArticlePubMedGoogle Scholar
- Lesnick TG, Papapetropoulos S, Mash DC, Ffrench-Mullen J, Shehadeh L, de Andrade M, Henley JR, Rocca WA, Ahlskog JE, Maragonore DM: A genomic pathway approach to a complex disease: axon guidance and Parkinson disease. PLoS Genet 2007, 3: e98. 10.1371/journal.pgen.0030098PubMed CentralView ArticlePubMedGoogle Scholar
- Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor dimensionality reduction reveals high-order interactions among estrogen metabolism genes in sporadic breast cancer. Am J Hum Genet 2001, 69: 138–147. 10.1086/321276PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.