Statistical significance for hierarchical clustering in genetic association and microarray expression studies
© Levenstien et al; licensee BioMed Central Ltd. 2003
Received: 30 July 2003
Accepted: 11 December 2003
Published: 11 December 2003
With the increasing amount of data generated in molecular genetics laboratories, it is often difficult to make sense of results because of the vast number of different outcomes or variables studied. Examples include expression levels for large numbers of genes and haplotypes at large numbers of loci. It is then natural to group observations into smaller numbers of classes that allow for an easier overview and interpretation of the data. This grouping is often carried out in multiple steps with the aid of hierarchical cluster analysis, each step leading to a smaller number of classes by combining similar observations or classes. At each step, either implicitly or explicitly, researchers tend to interpret results and eventually focus on that set of classes providing the "best" (most significant) result. While this approach makes sense, the overall statistical significance of the experiment must include the clustering process, which modifies the grouping structure of the data and often removes variation.
For hierarchically clustered data, we propose considering the strongest result or, equivalently, the smallest p-value as the experiment-wise statistic of interest and evaluating its significance level for a global assessment of statistical significance. We apply our approach to datasets from haplotype association and microarray expression studies where hierarchical clustering has been used.
In all of the cases we examine, we find that relying on one set of classes in the course of clustering leads to significance levels that are too small when compared with the significance level associated with an overall statistic that incorporates the process of clustering. In other words, relying on one step of clustering may furnish a formally significant result while the overall experiment is not significant.
Hierarchical clustering is an information theoretical method that sequentially merges samples based on the pair-wise similarity of a given measurement to form common groups until all samples are contained in a single group . The method has many applications and is widely used in the analysis of biological data. For example, researchers testing for association between haplotypes and disease have employed hierarchical clustering as a means to reduce a large number of haplotypes to a manageable number of haplotype classes with the aim to increase statistical power . The alleles present at multiple genetic marker loci across a given chromosome form a haplotype . With an increasing number of marker loci, the number of possible haplotypes grows exponentially so that many of these haplotypes tend to have low frequency. In comparisons of haplotype frequencies between case and control individuals, the corresponding contingency tables are often sparse and difficult to interpret. Hierarchical clustering then allows researchers to merge haplotypes into classes that are easier to handle. Variability within a haplotype class is generally considered unimportant (random noise) so that the researcher can focus on the "larger picture", that is, whether some of the haplotype classes exhibit a statistically significant difference in frequency between case and control individuals. Typically, the statistical significance (computed with exact tests ) for an initial, sparse contingency table is lower than for tables obtained by clustering the haplotype classes present in the initial table. Often the process of clustering incorrectly removes the variation within a class, and in these cases the increase in statistical significance is entirely due to the clustering process. Here we propose an analysis method that properly takes clustering into account. We achieve this by defining the strongest result or, equivalently, the smallest p-value, occurring in the course of clustering as the statistic of interest and computing its associated (experiment-wise) statistical significance.
Another example of hierarchical clustering is its application in microarray analyses [5–7]. Often clustering of arrays based on microarray expression data is utilized to distinguish tumor subclasses, which have clinical implications [8, 9]. In many of these studies involving microarray expression data from tumor specimens, researchers are interested in examining survival information for the subjects who contributed the samples and comparing the survival curves between groups formed by the hierarchical clustering procedure [10–13]. The methods developed in this paper will be applied to three previously published datasets in which hierarchical clustering has been employed. One of these datasets involves a haplotype association analysis while the other two datasets refer to survival analyses of groups of individuals determined by microarray expression measurements.
The problem of testing group differences sequentially is in the framework of multiple testing. Historically, both genetic association studies and microarray studies have been plagued with multiple testing problems. In the case of association studies, multiple testing occurs because researchers perform tests of association for large numbers of haplotypes, alleles, or genotypes across entire chromosomes or genomes . In the case of microarray data analysis, researchers sequentially test thousands of genes for differential expression. Testing at each different clustering step within a hierarchical structure also represents a form of multiple comparisons; therefore, the experiment-wise type I error is inflated. Various correction methods such as Bonferroni, step-up, and step-down have been employed to adjust for the multiplicity of testing . These procedures appear to work well only when the tests in the sequence are independent or weakly correlated. Since the tests within the hierarchy possess a nested structure, these procedures are inappropriate for our situation. As mentioned above, here we propose an alternative solution by defining a single test statistic, for which we evaluate the experiment-wise statistical significance.
Consider multiple steps in hierarchical clustering. For each of n steps of the hierarchy, we calculate our statistic of interest depending on the application. In the case of haplotype association tests, we compute the Pearson ?2  for a 2 × s contingency table (case/control individuals versus s haplotypes or haplotype classes) while in the case of survival analyses, we compute the log-rank statistic . We represent these statistics as a vector, , where X i represents the statistic obtained at the ith step in the clustering process. To make statistics from different steps comparable, we compute the significance level, p i , associated with X i and call this a local p-value. We approximate these local empirical significance levels via permutation analysis. These permutation methods involve randomly permuting labels for each individual as follows. For haplotype association tests, we permute the case/control labels [17, 18] while for survival analyses, we permute failure times and censorship statuses jointly. For each permutation of the dataset, we cluster the permuted samples as illustrated by the dendrogram and calculate a null statistic based on the permuted samples at each step in order to generate the null distribution for the statistic. We can represent the collection of null statistics calculated from each of m permutations of the data at each of n steps within the hierarchy as the matrix,
where the entry appearing in the ith row and the jth column, X ij , is the statistic of interest computed from the ith permutation of the data at the jth step in the hierarchy. At each step of the hierarchy, by comparing the statistic we computed from the data with the null statistics we computed from the m permutations, we calculate a local p-value, p j , as the proportion of permutation samples with a null statistic at least as large as the observed statistic. We represent the local p-values as the vector, .
Permutation (randomization) samples allow one to conveniently approximate the sampling distribution of test statistics under the null hypothesis (the "null distribution"). Ideally, permutation tests are based on the total of all permutations but in practice we usually can only collect a random sample from these permutations. The number m of permutation samples should be large enough to adequately represent the sample space of permutations. For the haplotype data (example 1), at each step we compared approximated p-values obtained with different values of m with exact p-values calculated with the aid of the statistical software package StatXact 5. For the first few steps in the hierarchy, values of m on the order of 10,000 were sufficient to provide p-values very close to the correct ones. However, at later steps, agreement was only obtained with m = 100,000, presumably because at early steps the total number of permutations is much smaller than at later steps. The calculations for the two survival analyses (examples 2 and 3) were also performed with m = 100,000.
In order to gain an empirical significance assessment for the entire experiment, we define a single statistic, that is, the smallest of the local p-values, min i (p i ) . To assess the empirical significance level (global p-value), p min , associated with this statistic, we generate the null distribution of min i (p i )from the matrix of null statistics, X null . In this matrix, we consider each row (replicate) in turn as observed data and evaluate these data based on the remaining m - 1 null data as described above for m null data. That is, for each of these "null observed" permutation samples a minimum p-value is obtained at whatever step it occurs. This leads to a set of m null values for min i (p i ). The proportion of these values at least as small as the observed min i (p i )represents the global significance level, p min , associated with our single experiment-wise statistic. Since this approach requires that the p-values be ordered, starting with the most significant, it could be considered a step-down p-value adjustment procedure similar to the procedure developed by Westfall and Young . If p min = 0.05 then we say that the experiment (at least one of the steps in the clustering process) is significant at the 5% level.
In addition, we are interested in assessing whether clustering has provided a benefit for the data analysis. In order to achieve this, we compare the statistical significance for the entire experiment involving tests at each step created by clustering with the statistical significance for an experiment where no clustering has been applied; that is, we compare the global p-value, p min , with the significance level, p0, of the statistic prior to clustering. (We are able to make this exact comparison for the haplotype association application; however, for the microarray expression studies, we must compare the global significance with the statistical significance at step 2 in the hierarchy since permuting the data at steps 0 and 1 creates a collection of null statistics possessing a very small variance.) Clustering is only beneficial when p min <p0 (or p2 for the microarray datasets) since the results are more significant when clustering is applied. It may well happen that the smallest p-value, min i (p i ), at one of the steps in the course of clustering is smaller than p0 (or p2 for the microarray datasets), but the clustering process is such that this smallest p-value has a high probability of occurring by chance. In that case, one will find that p min >p0 (or p2 for the microarray datasets).
Statistics of interest
As mentioned above, in the case of association studies between haplotypes and disease we employ the Pearson ?2 to test each step of the hierarchy for association . However, in the case of survival analyses, our statistic of interest is the log-rank statistic . It provides an overall comparison of the Kaplan-Meier survival curves for two or more groups of subjects. For r groups, the log-rank statistic asymptotically follows a ?2 distribution with r - 1 degrees of freedom under the null hypothesis of equality of survival curves.
To demonstrate our approach on real data, we re-analyze the following three previously published datasets.
Example 1 (haplotype data)
The first dataset consists of 52 statistically predicted haplotypes in 172 African-American study participants (137 case and 35 control individuals) . The aim of that case-control study was to test for association between haplotypes at 25 single nucleotide polymorphism (SNP) loci in the human µ opioid receptor gene (OPRM1) and substance dependence. The large number of haplotypes was difficult to interpret and appeared to create a situation with insufficient power to detect association. Thus, hierarchical clustering was applied to the 52 haplotypes. These were sequentially grouped according to the procedure CLUSTER (method = BAVERAGE, measure = SEUCLID) from the SPSS software package for Windows . For each step of the resulting dendrogram, the hierarchical clustering procedure designates which haplotypes are clustered to form haplotype classes. At each step of the hierarchy an association test was performed between haplotype classes and disease status. As the clustering progressed, the number of classes became smaller and smaller.
Example 2 (lung cancer data)
This dataset contains expression levels for 835 unique genes represented by 918 cDNA clones in tissues harvested from lung cancer patients and normal individuals . Specifically, expression levels are measured in 41 adenocarcinomas (ACs), 16 squamous cell carcinomas (SCCs), five large cell lung cancers (LCLCs), five small cell lung cancers (SCLCs), five normal lung samples, and one normal fetal lung sample. Based on the Complete Linkage method and Pearson's correlation coefficient as a measure of similarity in the CLUSTER software, hierarchical cluster analysis was performed to group the samples according to the degree of similarity present in the gene expression data. In the resulting dendrogram, the AC samples appeared in three distinct clusters. The aim of the study was to examine whether the groups of AC samples created by the hierarchical clustering procedure correlated with clinical outcomes of the AC patients, that is, whether the Kaplan-Meier survival curves differed for these groups .
Example 3 (lymphoma data)
The third dataset contains expression levels of cDNA clones from genes expressed in germinal center B-cells for 47 samples of diffuse large B-cell lymphoma (DLBCL) . Hierarchical clustering was performed with the CLUSTER program and the Pearson correlation coefficient as its similarity measure to group the samples by similarity of gene expression levels for all genes expressed in germinal center B-cells. The resulting dendrogram shows two main branches, one containing samples with expression patterns similar to those of germinal center B-cells and one containing samples with expression patterns similar to those of activated B-cells. To examine the clinical relevance of this subdivision of DLBCL, a Kaplan-Meier survival analysis for the two groups of patients was performed based on the dendrogram's penultimate clustering step .
In hierarchical clustering, evaluating the minimum local p-value in isolation, outside of the context of the larger hierarchical structure used to organize the data, can drastically affect the interpretation of test results. For example, even though the haplotype data show an apparently significant result with a minimum p-value of 0.0561, our analysis demonstrates that clustering the same data but without association between haplotypes and disease has a high chance of obtaining such a "significant" result. In fact, that chance is p min = 0.6918, which represents the actual significance level of the experiment. On the other hand, as example 2 shows, clustering can improve the significance of a result. The global significance level, p min , can be viewed as a version of min i (p i ) corrected for multiple testing within the hierarchy. In all three examples presented above, applying the Bonferroni correction to min i (p i ) provides an upper bound for p min in agreement with theory. In fact, it is interesting that for examples 2 and 3 our procedure which accounts for the correlated nature of the tests yields a p min value only slightly smaller than the Bonferroni correction applied to min i (p i ).
How can we explain that in some cases clustering is beneficial while in other cases it is not? Presumably, some datasets possess an underlying heterogeneity; that is, such datasets are composed of samples from multiple distinct populations. If the information used for clustering (haplotypes for example 1 and gene expression patterns for examples 2 and 3) is related to the information used to perform the statistical test (in our examples, proportions of cases to controls and survival times), hierarchical clustering will detect the heterogeneity. Otherwise, the clustering process is random, and any heterogeneity detected is artificial. Our approach allows one to distinguish between these two situations. If the clustering process is random because the information used for clustering and calculating the test statistic are unrelated (or because the dataset is homogeneous), a large p min will result indicating that any small local p-values probably occurred only by chance. Whereas if the clustering process is directed by a measurement strongly related to the test statistic, a small p min will result indicating that any heterogeneity found within the hierarchy is most likely real.
Often when hierarchical clustering is applied to a dataset, it is of interest to determine the true number of classes present. This situation commonly arises in the analysis of microarray data. For instance, as in examples 2 and 3, in the study of human cancers, researchers often utilize microarray expression data to cluster samples. From the hierarchical structure created by clustering, it may be of interest to distinguish the optimum number of tumor subclasses that are most clinically relevant. Several statistics-based methods have been utilized to estimate the true number of groups from such microarray expression datasets [23, 24]. However, such methods rely solely on the expression data itself. Alternatively, it may prove practical in such microarray expression studies to consider additional information available, such as survival data, on each sample for distinguishing clinically relevant subclasses. Employing our procedure of calculating the local p-values for a test statistic at multiple steps within the hierarchy and then selecting the step where the minimum of these p-values occurs as the basis for determining the true number of classes which exist for a given dataset may provide an advantage over existing methods. Of course, if such a method for determining the true number of classes is applied, the global p-value will provide an assessment of its significance. However, applying our procedure to some datasets, such as the data in example 3, results in determining a large number of true classes. In fact, the number of classes determined may be so large that the use of these expression-based tumor subclasses in clinical diagnosis may not provide a benefit. Therefore, in order to increase the practicality of our method, it may prove necessary to eliminate some of the lower steps in the hierarchy from eligibility for selecting the minimum local p-value and the calculation of its significance.
Besides determining subclasses for biological samples, hierarchical clustering is often employed in the context of microarray expression studies in order to identify groups of genes that are regulated in a similar manner. In these cases, the clustering is performed on the genes rather than on the samples. Our method relies on two sets of data – one for clustering and a second for the statistical test. Since the samples possess both expression data across genes and survival data, our method is applicable to hierarchies created by clustering on samples. However, genes only possess expression data across samples, and, consequently, our method is inappropriate for analyzing the significance of hierarchies created by clustering on genes.
Our approach may be viewed as a contribution to the problem of multiple testing. We address this problem by defining a single experiment-wise statistic whose associated empirical significance level represents the overall significance of the experiment. For the cases we have examined, the experiment refers to performing a test at each step in a hierarchy created by clustering. However, the meaning of experiment can be expanded to reflect other practices adopted by researchers. For example, researchers may apply several clustering algorithms involving various combinations of clustering methods and distance measures before finalizing their choice of clustering algorithm. Since this practice introduces an additional test at each step within each of the trial hierarchies, it compounds the effect of multiple testing. Additionally, in some situations researchers may be interested in testing for heterogeneity among groups with multiple measurements. For instance, when searching for clinically relevant subclasses of cancer, researchers may examine groups for differences in survival times, as well as, differences in physical characteristics of the tumor cells. Both sets of information may be clinically relevant; however, to correct for the additional testing, the meaning of experiment in calculating p min must be expanded to reflect the entire process employed by the researcher. Of course, it is possible that the process of hierarchical clustering forms medically relevant groups that do not display heterogeneity for any of the measurements collected. In this case, our strategy will not find these groups as the true grouping structure for the samples.
Several other methods addressing multiple comparison problems have been proposed and are in current use. In particular, as an alternative to the classical significance level, p, the False Discovery Rate (FDR) has become rather popular . However, it is important to keep in mind that p and FDR are not really comparable – p is the conditional probability of a significant test result given the null hypothesis is true (the expected proportion of false positive results among all "false" results, i.e., results obtained under the null hypothesis) while FDR is the conditional probability of the null hypothesis being true given a significant test result (the expected proportion of false positive results among all "positive" results, i.e., significant test results). Future research will have to determine which of these various approaches to eliminate the effects of multiple testing is most effective.
We are grateful to Dr. Margret Hoehe for providing the dataset used as example 1 and to Chad Haynes for programming support. This work was supported by NIH grants MH59492 and HG00008.
- Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction New York, Springer 2001.View ArticleGoogle Scholar
- Hoehe MR, Kopke K, Wendel B, Rohde K, Flachmeier C, Kidd KK, Berrettini WH, Church GM: Sequence variability and candidate gene analysis in complex disease: association of mu opioid receptor gene variation with substance dependence. Hum Mol Genet 2000, 9: 2895–2908. 10.1093/hmg/9.19.2895View ArticlePubMedGoogle Scholar
- Ott J: Analysis of Human Genetic Linkage 3 Edition Baltimore, The Johns Hopkins University Press 1999.Google Scholar
- Agresti A: An Introduction to Categorical Data Analysis. Wiley Series in Probability and Statistics NewYork, John Wiley and Sons 1996.Google Scholar
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863PubMed CentralView ArticlePubMedGoogle Scholar
- Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 1999, 96: 6745–6750. 10.1073/pnas.96.12.6745PubMed CentralView ArticlePubMedGoogle Scholar
- Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 2000, 11: 4241–4257.PubMed CentralView ArticlePubMedGoogle Scholar
- Chung CH, Bernard PS, Perou CM: Molecular portraits and the family tree of cancer. Nat Genet 2002, Suppl 32: 533–540. 10.1038/ng1038View ArticleGoogle Scholar
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531–537. 10.1126/science.286.5439.531View ArticlePubMedGoogle Scholar
- Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, Thorsen T, Quist H, Matese JC, Brown PO, Botstein D, Eystein Lonning P, Borresen-Dale AL: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci USA 2001, 98: 10869–10874. 10.1073/pnas.191367098PubMed CentralView ArticlePubMedGoogle Scholar
- Garber ME, Troyanskaya OG, Schluens K, Petersen S, Thaesler Z, Pacyna-Gengelbach M, van de Rijn M, Rosen GD, Perou CM, Whyte RI, Altman RB, Brown PO, Botstein D, Petersen I: Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci USA 2001, 98: 13784–13789. 10.1073/pnas.241500798PubMed CentralView ArticlePubMedGoogle Scholar
- Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J., Jr., Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Staudt LM: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403: 503–511. 10.1038/35000501View ArticlePubMedGoogle Scholar
- Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 2001, 98: 13790–13795. 10.1073/pnas.191502998PubMed CentralView ArticlePubMedGoogle Scholar
- Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science 1996, 273: 1516–1517.View ArticlePubMedGoogle Scholar
- Reiner A, Yekutieli D, Benjamini Y: Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003, 19: 368–375. 10.1093/bioinformatics/btf877View ArticlePubMedGoogle Scholar
- Kalbfleisch JD, Prentice RL: The Statistical Analysis of Failure Time Data New York, John Wiley and Sons 1980.Google Scholar
- Zhao JH, Curtis D, Sham PC: Model-free analysis and permutation tests for allelic associations. Hum Hered 2000, 50: 133–139. 10.1159/000022901View ArticlePubMedGoogle Scholar
- Zhao JH, Sham P: Faster haplotype frequency estimation using unrelated subjects. Hum Hered 2002, 53: 36–41. 10.1159/000048602View ArticlePubMedGoogle Scholar
- Hoh J, Wille A, Ott J: Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Research 2001, 11: 2115–2119. 10.1101/gr.204001PubMed CentralView ArticlePubMedGoogle Scholar
- Westfall PH, Young SS: Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment New York, John Wiley & Sons 1993.Google Scholar
- Lung Adenocarcinomas > Home[http://genome-www.stanford.edu/lung_cancer/adeno/index.shtml]
- Lymphoma/Leukemia Molecular Profiling Project (LLMPP)[http://llmpp.nih.gov/lymphoma/]
- Horimoto K, Toh H: Statistical estimation of cluster boundaries in gene expression profile data. Bioinformatics 2001, 17: 1143–1151. 10.1093/bioinformatics/17.12.1143View ArticlePubMedGoogle Scholar
- Dudoit S, Fridlyand J: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 2002, 3: RESEARCH0036. 10.1186/gb-2002-3-7-research0036PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.