CLEAN: CLustering Enrichment ANalysis
BMC Bioinformatics volume 10, Article number: 234 (2009)
Integration of biological knowledge encoded in various lists of functionally related genes has become one of the most important aspects of analyzing genome-wide functional genomics data. In the context of cluster analysis, functional coherence of clusters established through such analyses have been used to identify biologically meaningful clusters, compare clustering algorithms and identify biological pathways associated with the biological process under investigation.
We developed a computational framework for analytically and visually integrating knowledge-based functional categories with the cluster analysis of genomics data. The framework is based on the simple, conceptually appealing, and biologically interpretable gene-specific functional coherence score (CLEAN score). The score is derived by correlating the clustering structure as a whole with functional categories of interest. We directly demonstrate that integrating biological knowledge in this way improves the reproducibility of conclusions derived from cluster analysis. The CLEAN score differentiates between the levels of functional coherence for genes within the same cluster based on their membership in enriched functional categories. We show that this aspect results in higher reproducibility across independent datasets and produces more informative genes for distinguishing different sample types than the scores based on the traditional cluster-wide analysis. We also demonstrate the utility of the CLEAN framework in comparing clusterings produced by different algorithms. CLEAN was implemented as an add-on R package and can be downloaded at http://Clusteranalysis.org. The package integrates routines for calculating gene specific functional coherence scores and the open source interactive Java-based viewer Functional TreeView (FTreeView).
Our results indicate that using the gene-specific functional coherence score improves the reproducibility of the conclusions made about clusters of co-expressed genes over using the traditional cluster-wide scores. Using gene-specific coherence scores also simplifies the comparisons of clusterings produced by different clustering algorithms and provides a simple tool for selecting genes with a "functionally coherent" expression profile.
Identifying groups of co-expressed genes through cluster analysis has been successfully used to elucidate affected biological pathways and postulate transcriptional regulatory mechanisms [1, 2]. The integration of biological knowledge in such analyses has been most commonly facilitated by assessing the enrichment of clusters with genes from pre-defined functionally coherent gene lists ("functional categories"). The concept of "functionally related genes clustering together" has been established by ad-hoc visual examination of hierarchical clustering results and their enrichment by genes from the same functional category . The first assessment of statistical significance of such enrichments was performed by analyzing results of k-means clustering  using the hypergeometric distribution . Similar strategies have also been used in the analysis of lists of differentially expressed genes , gene lists constructed based on genome-wide Chromatin Immunoprecipitation (ChIP) [7, 8] and epigenomics experiments , as well as the general approach to integrate lists of genes derived by various experimental and knowledge-based procedures . Introducing biological knowledge through such post-hoc analysis has been important for interpreting results and separating reproducible, biologically meaningful gene clusters from clusters that may have resulted from random fluctuations in the data. For both of these objectives, reproducibility of conclusions made is of utmost importance.
The first two concept defining papers [3, 5] also highlight the dichotomy that exist in using hierarchical vs. partitioning clustering procedures to this days. Hierarchical procedures do not necessitate specifying the "right" number of clusters, a parameter generally unknown in advance whose estimation from the data leads to instability in clustering results . On the other hand, selecting "meaningful" clusters in a hierarchical clustering that can be then correlated with functional categories using the hypergeometric distribution is still mostly performed by ad-hoc visual inspection of related heatmaps. Algorithms for systematic testing of all possible clusters have also been developed [12–14], but results of such analyses are difficult to summarize. Postulating the "right" number of clusters or choosing "good" clusters in an ad-hoc fashion before correlating them with functional categories can result in poor reproducibility since a slightly different number of clusters or slightly different "good" clusters can result in a different interpretation of the results. This problem is akin to choosing the "optimal" cut-off criteria for selecting differentially expressed genes before performing similar functional analyses. It has been shown that results of such analyses are highly sensitive to changes in the cut-off used with different cut-offs yielding different conclusions . In the analysis of differentially expressed genes computational alternatives have been developed that do not require setting such thresholds [16–18], but they are generally not applicable in the knowledge-based assessment of clustering results.
A frequently encountered problem in analyzing genome-wide experimental data is to choose among results produced by different clustering algorithms. Criteria such as homogeneity and separation are relatively straightforward to compute but are mostly of theoretical interest. A more relevant criterion from a biological perspective is the overall functional coherence of resulting gene clusters. Most of the methods developed to date for this purpose require specification of the number of clusters [19, 20]. Comparing different methods at a fixed number of clusters is problematic as some methods might create a better clustering structure when more clusters are allowed and others could create better clusterings when few clusters are allowed. To circumvent this problem ROC curves have been used to assess false and true positive rates of co-clustered gene pairs using the functional categories as a gold standard[21, 22]. However, this same strategy lacked discriminative power when a large number of large functional categories, such as Gene Ontology (GO) terms, served as a gold standard and required again fixing the number of clusters [21, 23].
We developed an analytical framework and flexible computational infrastructure for integrating knowledge-based functional categories into the cluster analysis of gene expression data. The framework is based on the simple, conceptually appealing and biologically interpretable gene-specific functional coherence CLustering Enrichment ANalysis (CLEAN) score derived by correlating the clustering structure as a whole with functional categories of interest. The CLEAN score is gene-specific and it differentiates between the levels of functional coherence for genes within the same cluster. The statistical significance of coherence scores is established by comparing them to the empirical null-distribution obtained by randomly permuting gene identifiers. The corresponding computational infrastructure is based on an open-source R package for the data analysis and open-source Java viewer for visually integrating and analyzing expression data and associated knowledge-based functional categories.
We investigate the reproducibility of the findings based on the CLEAN scores, and demonstrate its utility in comparing the functional coherence of clusterings produced by different algorithms and in selecting genes with informative expression patterns. Being gene-specific, the CLEAN score facilitates easy comparisons of functional coherence of different hierarchical structures (e.g. generated by different clustering algorithms) and selection of genes based on functional coherence of their expression pattern without the need to fix the number of clusters. On the other hand, we demonstrate that differentiating between the levels of functional coherence for genes within the same cluster leads to significant improvements in reproducibility of findings across independent microarray datasets when compared to traditional cluster-wide analyses. Furthermore, genes selected based on the CLEAN score produced more precise sample groupings than genes selected using the cluster-wide score.
Given a hierarchical clustering of genes based on their expression profiles and a set of functional categories (e.g. Gene Ontologies), the CLustering Enrichment ANalysis (CLEAN) score for a gene is calculated as follows (Figure 1):
Fisher's exact test for enrichment is calculated for all functional categories containing the gene and for all possible clusters containing this gene. (Figure 1).
The CLEAN score is then computed as the maximum -log10(q-value) of enrichment tests across all pairs of clusters containing the gene and functional categories containing the gene (see methods for details).
The clustering-specific null-distribution of the CLEAN score is established by randomly permuting gene identifiers. Statistically significant scores are then used to facilitate selection of genes or gene clusters, as well as the assessment of functional coherence and the comparison of clustering results produced by different algorithms. The integrated clustering viewer/browser, Functional TreeView (FTreeView), is used for integrative browsing and visual display of gene clusters and associated functional categories.
When multiple category types are used, the joint CLEAN score is calculated as the maximum of CLEAN scores for each category type. Here we focus on three sets of functional categories: Gene Ontology (GO) categories , Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [25, 26], and a custom set of Co-regulation Groups (CG) based on the computational analysis of gene promoters and regulatory motif definitions in the Transfac database, version 12.1  (see methods).
All currently used algorithms assign statistical significance of functional enrichment to whole clusters instead of individual genes within the cluster. To compare the properties of the CLEAN score to currently used methods we define a cluster-wide CLEAN (cwCLEAN) score to serve as a surrogate for this traditional type of analysis. The cwCLEAN score is defined as the maximum of -log10(q-value) for all clusters containing the gene regardless of whether the enriched functional categories contain the gene or not (Figure 1). We analyzed several public microarray datasets to demonstrate the statistical properties and utility of the CLEAN framework, and to compare its performance to traditionally used approaches.
Comparing clustering results using the CLEAN score
The CLEAN score provides a tool to compare the functional coherence of clustering results produced by different clustering algorithms on a gene-by-gene basis without requiring a pre-defined number of gene clusters. We used four independent large-scale breast cancer gene expression datasets [28–31] to demonstrate utility of the CLEAN score in choosing the clustering structure with the highest functional coherence. For each dataset we compared the performance of three typical clustering algorithms: Context specific infinite mixture model (CSIMM) , Euclidian distance based and Pearson's correlation based hierarchical clustering. For all three algorithms, the hierarchical clustering was constructed using the average linkage principle and algorithms were applied to expression data with and without prior variance-rescaling (see methods). Clustering algorithms were used to cluster data from four independent breast cancer gene expression datasets with GEO accession numbers GSE1456 , GSE3494 , GSE7390 , and GSE11121 .
For each clustering algorithm the total number of genes (y-axis) with the CLEAN score higher than the given threshold was plotted against all possible threshold levels (x-axis). There are two conclusions that can be immediately made based on results in Figure 2. First, variance-based rescaling of the data significantly improved the functional coherence of resulting clustering. While the CSIMM model is capable of compensating this effect to some extent, the performance of both CSIMM and Euclidean distance based algorithms improved after data was re-scaled. Since the Pearson's correlation coefficient implicitly performs such re-scaling, there is little difference between its performance with and without re-scaling. After the data was re-scaled, all three algorithms perform almost identically indicating that the re-scaling is the key step in improving the functional coherence of the data. The second conclusion is that these non-trivial results are perfectly reproducible across four independent breast cancer datasets, which is an important indication about their applicability to other datasets of this type.
Reproducibility and the comparison with cluster-wide scores
We used the same four independent breast cancer gene expression datasets (GSE1456 , GSE3494 , GSE7390 , and GSE11121 ) and a study comparing tissue-specific gene expression patterns in mouse and human  to investigate reproducibility of the CLEAN scores. We first assessed the contribution of the functional data to the reproducibility of the clustering results by comparing the correlation between the CLEAN scores (Figure 3A) to correlation of pair-wise distances used to construct the hierarchical clustering of genes (Figure 3B) in the two datasets (GSE3494 and GSE7390). In this analysis pairwise distances are based on the Bayesian posterior pairwise probabilities (PPPs) produced by the CSIMM algorithm . Significantly increased correlation for the CLEAN sore (0.82 vs. 0.52 PPPs) indicated a significant increase in reproducibility of results in terms of functional coherence of the gene expression patterns over the simple clustering that does not incorporate an assessment of functional coherence. The heatmap of expression profiles for the genes with the highest CLEAN scores in both datasets (circled in the Figure 3A) showed a coherent pattern of expression within both datasets (Figure 3C) and all these genes are related to immune system, which is a commonly implicated functional group in the etiology of cancer in general.
Next, we performed a comprehensive study of the reproducibility of the CLEAN scores in four breast cancer datasets and five clustering algorithms described in the previous section (since Pearson's correlation implicitly re-scales data, only the Pearson's correlation clustering with re-scaled data was used in this case). The heatmap in Figure 4 represents the similarities of clusterings for different algorithms and different datasets in terms of the CLEAN and the cwCLEAN scores. Three groupings of clusterings-by-score type combinations clearly emerge: clusterings formed using Euclidean distances and un-scaled data, cwCLEAN scores for clusterings based on the re-scaled data and CSIMM algorithm using un-scaled data, and CLEAN scores based on the re-scaled data and CSIMM algorithm using un-scaled data.
The improvement in reproducibility was further assessed by analyzing differences in correlations between CLEAN and cwCLEAN scores of all 6 pairs of breast cancer datasets for three different clustering algorithms (Figure 5A). Since all differences are positive, this indicates that the correlation coefficient was higher for CLEAN scores in each of the 6 pairs for all three algorithms. The increased reproducibility was also evident in the analysis utilizing the statistical significance cut-offs established by randomizing the gene labels for each clustering separately. For each pair of datasets we constructed a 2-by-2 contingency table based on the statistical significance scores (as in Table 1), and calculated differences in the odds ratios and the statistical significance of overlaps between lists of statistically significant genes in different datasets for a given algorithm and functional coherence score (CLEAN or cwCLEAN) (Figure 5B). All differences in odds ratios were positive implicating again higher reproducibility of CLEAN scores. Similarly, differences in the statistical significances (-log10(p-values)) of the Fisher's Exact tests for the same contingency tables were also all positive implicating the higher reproducibility of CLEAN scores (Figure 5C).
We repeated a similar type of analysis for a mouse and human datasets profiling gene expression in different tissues (79 human and 61 mouse tissue types). . After matching human and mouse probes using HomoloGene identifiers  we obtained 10,287 common genes that were represent on both microarray platforms. We constructed CSIMM-based gene clusterings for both species and applied CLEAN separately for the human and mouse datasets using both GO and KEGG based functional categories. Statistically significant correlation between the genes with statistically significant scores, using the Fisher's exact test for two-by-two tables, was firmly established for both CLEAN and cwCLEAN scores (Table 1 and 2 respectively). However, the statistical significance and the strength of association was considerably higher for the CLEAN score (odds ratio = 3.82 and p-value = 4.4 × 10-207) than for the cwCLEAN score (odds ratio = 1.49 and p-value = 1.8 × 10-17).
Unsupervised selection of informative genes
Reproducibly identifying genes whose expression patterns can delineate biologically meaningful groups of samples has been an important problem in computational biomedicine. We focus on the situation when the identity of samples belonging to different groups or even the number of the groups is not known in advance. In this case, the informative genes have to be selected in an unsupervised fashion. By studying the problem of classifying samples from different tissue types in the integrated mouse-human dataset, we demonstrate the utility of using the CLEAN score to select informative genes. We first identify genes with statistically significant CLEAN scores in mouse and human tissue expression profiling datasets. Then we show that expression profiles of these genes facilitate better separation of samples from different tissue types than expression profiles of genes not having statistically significant CLEAN scores. Furthermore, we demonstrate that the improvements in precision are significantly larger when using the CLEAN score than when using the cluster-wide cwCLEAN scores.
We created a total of 6 different gene lists and assessed their abilities to distinguish different tissue types in the combined human-mouse expression dataset. Gene lists were as follows:
All 10,287 genes present in both microarray platforms.
Significant CLEAN score
Genes with significant CLEAN scores in both human and mouse datasets.
Non-significant CLEAN score
Genes with not significant CLEAN scores in both human and mouse datasets.
Significant cluster score
Genes with significant cwCLEAN scores in both human and mouse datasets.
Non-significant cluster score
Genes with not significant cwCLEAN scores in both human and mouse datasets.
Genes identified by applying the Cancer Outlier Profiler Analysis (COPA)  analysis. This COPA list of 2,668 genes was generated by performing COPA separately for human and mouse datasets, selecting the top 5,000 genes, and using the overlapping genes in the two datasets. This procedure was tuned to produce a number of genes that is similar to the number of genes with significant CLEAN scores.
Tissue samples were then clustered based on each of these gene lists using Euclidean distance, average linkage hierarchical clustering. Co-clustered pairs of samples derived from the same tissue type (regardless of whether they are human or mouse derived) were considered true positives, and co-clustered pairs of samples derived from different tissues were considered false positives. By cutting the hierarchical tree structure at all possible levels and each time recording the number of true and false positives we determined the receiver operating characteristic (ROC) for each of the gene lists. Since the number of positive pairs (232) is small compared to the number of negative pairs (38,828), we used the ratio of number of false positive pairs divided by the total number of positive pairs, as recently described , instead of the traditional false positive rates on the x-axis. Genes with significant CLEAN scores in both human and mouse tissue expression sets were significantly better in separating different tissue types than the genes with non-significant CLEAN scores (Figure 6A). Genes with significant cwCLEAN scores were marginally better in separating different tissue types than genes with non-significant cwCLEAN scores (Figure 6B), but the difference was considerably smaller than for the CLEAN score. Using COPA for selecting informative genes was completely ineffective as it did not show any improvement over using all genes (Figure 6C).
We developed an open-source R package  that performs Clustering Enrichment Analysis (CLEAN). Typically, the user will provide a gene expression data set and a clustering of the genes. The package is intended for hierarchical clusterings but can also accommodate non-hierarchical clusterings such as k-means . The package is compatible with a number of common input formats. GO and KEGG functional categories are derived from respective Bioconductor packages , and users can provide their own functional categories. The CLEAN package provides functions to compute the CLEAN score and generate output files to interactively display expression data together with gene and sample clusterings, and functional cluster annotation.
In addition, we extended the Java-based expression data viewing software TreeView  to interactively display functional cluster annotations and the cwCLEAN scores produced by the CLEAN R package. Figure 7 shows a screenshot of the new viewer we named Functional TreeView (FTreeview) displaying CLEAN results for the breast cancer dataset GSE3494 . Panel 1 displays the per-gene functional coherence scores for individual category types. The broader the red bars are the higher is the score. Green indicates statistically non-significant functional coherence scores. Guided by the display of the CLEAN scores, the user can choose a subset of genes by selecting a branch of the hierarchical gene clustering tree (panel 2). Functional cluster annotations generated by CLEAN for the selected group of genes displayed in panel 3. The interactive display of functional annotations is the major new feature of FTreeView, and it allows for seamless integration and browsing of functional categories associated with each cluster of genes. Such an integrated view of clustering results, expression patterns and the enriched functional categories, facilitates a straightforward interactive identification of functionally coherent patterns of expression. For example, the selected cluster of genes (panel 3) which we identified based on the overall high CLEAN scores (panel 1) is highly enriched for genes associate with immunity related Gene Ontology terms (FDR < 10-60) as well as two KEGG pathways, and putative targets of the Interferon Consensus Sequence-binding protein (ICSBP) transcription factor. FTreeView is available as a stand-alone or as a Web Start application from our server http://Clusteranalysis.org.
Integrating biological knowledge encoded in lists of functionally related genes into the analysis of genome-wide functional genomics data is an increasingly important aspect of analyzing genomics data. In the context of cluster analysis, such integration is necessary for selecting meaningful clusters of genes, and for the adequate biological interpretation of patterns defined by such clusters. We developed a computational framework for analytically and visually integrating knowledge-based functional categories with the cluster analysis of genomics data. The framework is based on the gene-specific functional coherence score derived by correlating the clustering structure as a whole with functional categories of interest. The statistical significance of coherence scores is established by comparing them to the empirical null-distribution obtained by randomly permuting gene identifiers.
We established the reproducibility of the CLEAN score across related gene expression datasets, and its utility in comparing the functional coherence of different clusterings and in unsupervised selection of genes that discriminate between biologically meaningful groups of samples. When compared to the commonly used cluster-wide assessment of functional coherence, the CLEAN score exhibits higher reproducibility across different microarray datasets. Genes selected based on the CLEAN score produced more precise sample groupings than genes selected using either cluster-wide score or by using COPA algorithm.
It is important notice that by using the CLEAN score instead of the traditional cluster-wide approach one cannot use the guilt-by-association principle  to hypothesize the function of non-annotated genes. Our analysis of the four breast cancer datasets yielded one obvious example of a relevant gene (FOXM1) with a high cwCLEAN score and the CLEAN score of zero in all four breast cancer datasets. FOXM1 is a proliferation-associated transcription factor  which has recently been clearly implicated to be an important regulator in the cell cycle progression . However, functional annotations for this gene (Gene Ontologies and KEGG) do not reflect these recent findings and consequently FOXM1 was not associated with the "cell cycle" cluster based on the CLEAN score (Figure 8).
One way to think about the differences between CLEAN and cwCLEAN score in terms of differences between assuming functional coherence based on co-clustering (guilt-by-association, cwCLEAN) vs. having some additional pre-existing evidence of functional relationship (CLEAN). Our results in this context implicate that in the case of breast cancer and tissue datasets the previous evidence of functional relationship is overall more reliable than guilt-by-associations relationships arising only from the cluster analysis on their own. It is possible that in some other situations, new functional relationships would dominate the existing one and the opposite would be the case. Calculating the difference between the two scores can quickly implicate novel functional relationships arising from the data analysis alone.
A systematically different approach to integrating the experimental data and prior knowledge is to incorporate the functional information into the clustering algorithm itself [42–46]. While conceptually appealing, such methods have a more limited applicability than the framework presented here and have not been widely used. Our framework follows the commonly utilized post-hoc integration approach in which cluster analysis is performed first using the experimental data and integration is achieved in the post-hoc analysis. The ability to validate the clusters produced by analyzing experimental data, and the transparency about how exactly the different types of information is utilized in constructing clusters is most likely the reason for the popularity and the wide usage of post-hoc approaches. When the functional knowledge is used in the process of constructing clusters, it can no longer be employed to provide the guidance in selecting biologically meaningful clusters.
We directly demonstrate that integrating prior biological knowledge encoding in the lists of functionally coherent genes improves the reproducibility of clustering results. We also demonstrate that our gene-specific functional coherence score, which differentiates between the levels of functional coherence for genes within the same cluster, shows higher reproducibility than the cluster-wide score. The CLEAN score also produced more informative genes for distinguishing different sample types than the cluster-wide score. This implicates that the gene-specificity of the CLEAN score is a fundamentally different and, at least in some circumstances, better approach for integrating biological knowledge with results of the cluster analysis than previously used cluster-based scores.
Data Preprocessing, Gene Selection and Clustering
Raw data files (Affymetrix HG-U133A CEL files) of four independent human breast cancer datasets (GEO expression series GSE1456 , GSE3494 , GSE7390 , and GSE11121 ) were downloaded from the public repository GEO . Each dataset was RMA-preprocessed  separately using the Entrez Gene-based custom CDF (version 10) from the Psychiatry/MBNI Microarray Lab at the University of Michigan ('Brainarray') . Preprocessed data files of a large-scale tissue expression data set  were also downloaded from the same repository. The tissues included both human (GEO dataset record GDS596) and mouse (GDS592). For genes with multiple probes per Entrez gene ID, in each case, the probeset with the highest median expression value per probeset was selected as the representative probeset for that gene. To match corresponding genes across species, the HomoloGene database  was used.
We applied a mild variation filter using Cancer Outlier Profiler Analysis (COPA, 95th percentile)  to select the top 10,000 genes to be clustered in each of the human breast cancer datasets (GSE1456, GSE3494, GSE7390, GSE11121). In each dataset expression values were centered by setting the median value of each gene to zero (subtracting the gene-specific medians) and clustering analyses were performed for each dataset and species independently using hierarchical clustering with three different similarity metrics or distance metrics, respectively:
Context-Specific Infinite Mixture Models (CSIMM) . For any given pair or genes, this Bayesian method estimates the posterior pairwise probability (PPP) of the genes being co-clustered. The resulting PPP matrix is used as the similarity measure for the hierarchical clustering algorithm.
Pearson Correlation of gene expression values as the similarity measure.
Euclidian Distance based on per-gene normalized expression values as the distance measure.
All three hierarchical clustering algorithms used Average Linkage.
Each clustering analysis was then repeated after further variance based re-scaling each dataset by dividing expression levels by their standard deviation for each gene and each datasets separately. When computing the Pearson correlation, expression values are implicitly divided by the standard deviation. Thus, this additional normalization step did not significantly affect Pearson's correlations.
Clustering Enrichment Analysis
Clustering Enrichment Analysis (CLEAN) is based on testing every possible cluster within a gene clustering for statistically significant enrichment of biological categories. A background gene list (e.g. all genes represented on the microarray) is given as well as a hierarchical clustering of some or all genes in the background list. The method was implemented as described in Algorithm 1.
Algorithm 1. Clustering Enrichment Analysis (CLEAN).
Define one or more sets of biological categories with sufficient representation in the background gene list
Determine all possible gene clusters within a given size range
For each gene cluster
3.1 For each functional category, determine the 2 × 2 contingency table and perform Fisher's Exact test
3.2 Compute q-values, that is the adjusted Fisher p-values, to account for multiple testing
3.3 Record categories the cluster is significantly enriched with and corresponding q-values
Compute the cluster-wide cwCLEAN score
4.1 Determine the minimum q-value for each gene cluster
4.2 Sort gene clusters by minimum q-value starting with the lowest (i.e. most significant) q-value.
4.3 Prune cluster supersets with less significant q-value to avoid 'spill-over' effect, i.e. remove gene clusters whose significant functional enrichment score is likely due to a single subtree.
4.4 For each gene, find the minimum q-value over all remaining clusters the gene is member of
4.5 For each gene, find the minimum q-value over all category sets (e.g. GO and KEGG)
4.6 The cwCLEAN is the resulting -log10-transformed minimum per-gene q-value.
Compute the gene-specific CLEAN score
5.1 For each gene, find the minimum q-value over all clusters and all categories the gene is member of.
5.2 For each gene, find the minimum q-value over all category sets (e.g. GO and KEGG)
5.3 The CLEAN is the resulting -log10-transformed minimum per-gene q-value.
Defining Functional Categories
A functional category is defined as a non-empty set of genes representing a biological concept such as "cell cycle", "immunological synapse" or "cytokine-cytokine receptor activation". The method is designed to accommodate any set of functional categories such that each category is comprised of a list of genes that has user-specified minimum overlap with the background gene list. Here, sets of categories were either downloaded from publically accessible databases such as Gene Ontology (GO) , Kyoto Encyclopedia of Genes and Genomes (KEGG) [25, 26], or were defined based on the Transfac database .
More specifically, functional categories based on GO and KEGG were downloaded as R packages [36, 37] while co-regulation based categories (CG) were derived computationally. Transcription factor and corresponding gene promoter data  and DNA sequence data  was downloaded. For each of the 304 human transcription factors with at least one position-weight matrix (PWM) in the Transfac version 12.1 database, a score was computed for each of the 24,190 genes, as to how likely they were to have a corresponding binding motif within 1 kbp of their transcriptional start site. The respective 750 top-scoring genes (or fewer if the score was not significant for at least 750 genes) were assigned to each transcription factor to define the respective functional categories.
For compatibility, all gene identifiers were converted to Entrez gene IDs, and matched across species where necessary using the HomoloGene database . Subsequent analyses were restricted to categories that had at least ten genes in common with the respective background gene list (e.g. the genes represented on the microarray platform).
Obtaining All Possible Gene Clusters
Given a hierarchical gene tree, a list of all possible gene clusters is obtained by recursively traversing the tree structure and at each node recording the list of corresponding genes. The size of clusters is limited to a user-specified range. Here, clusters smaller than 10 genes and larger than 1,000 were disregarded.
Determining significant functional enrichment
To determine whether a functional category is over-represented in a given gene cluster, i.e. more genes of the functional category are present in the cluster than expected by chance, a two-by-two contingency table (number of genes in the cluster and category, in the cluster and not in the category, etc.) is constructed and Fisher's Exact Test is performed. The procedure is repeated for each category within a category set (e.g. "GO", "KEGG") and q-values (i.e. adjusted p-values) are computed to control the false discovery rate (FDR) . Categories with a q-value not greater than a user-defined cutoff are considered significant. The default q-value cutoff is 0.1.
Procedure for Non-hierarchical Methods
The procedure can be extended to non-hierarchical methods such as k-means . For a fixed number k of clusters, a mutually exclusive set of gene clusters is already given and step 2 in Algorithm 1 is skipped. If the user specifies a range K of cluster numbers, step 2 in Algorithm 1 is preceded by Algorithm 2 which generates a hierarchical gene clustering as a means to average over multiple runs of the non-hierarchical clustering algorithm.
Algorithm 2. Averaging gene clusterings over n runs of a non-hierarchical clustering algorithm
For each k in K, run the non-hierarchical clustering algorithm
For each gene i
2.1 For each gene j ≠ i, count the number n ij of clusterings where i and j are in the same cluster
2.2 For each gene j ≠ i, compute p ij = n ij /n, where n is the size of K
Use the p ij as a similarity metric and average linkage as a summary method to generate a hierarchical clustering.
R package and FTreeView tool
An R package to perform CLEAN and an open-source Java tool to interactively display gene expression data, gene and sample clustering, gene annotation, and functional cluster annotation can be freely downloaded from the authors' web-site http://ClusterAnalysis.org.
Clustering Enrichment Analysis
Cancer Outlier Profiler Analysis
Context-Specific Infinite Mixture Models
cluster-wide CLEAN score
false discovery rate
Kyoto Encyclopedia of Genes and Genomes
posterior pairwise probability
Robust Multichip Average
Receiver Operating Characteristic.
Slonim DK: From patterns to pathways: gene expression data analysis comes of age. Nat Genet 2002, 32(Suppl):502–508. 10.1038/ng1033
Do JH, Choi DK: Clustering approaches to identifying gene expression patterns from DNA microarray data. Mol Cells 2008, 25: 279–288.
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863
MacQueen J: Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability 1965, 281–297.
Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet 1999, 22: 281–285. 10.1038/10343
Khatri P, Draghici S, Ostermeier GC, Krawetz SA: Profiling Gene Expression Using Onto-Express. Genomics 2002, 79: 266–270. 10.1006/geno.2002.6698
Wei CL, Wu Q, Vega VB, Chiu KP, Ng P, Zhang T, Shahab A, Yong HC, Fu Y, Weng Z, et al.: A global map of p53 transcription-factor binding sites in the human genome2. Cell 2006, 124: 207–219. 10.1016/j.cell.2005.10.043
Sartor MA, Schnekenburger M, Marlow JL, Reichard JF, Wang Y, Fan Y, Ma C, Karyala S, Halbleib D, Liu X, et al.: Genomewide Analysis of Aryl Hydrocarbon Receptor Binding Targets Reveals an Extensive Array of Gene Clusters that Control Morphogenic and Developmental Programs. Environ Health Perspect. 2009, 117(7):1139–1146.
Rakyan VK, Down TA, Thorne NP, Flicek P, Kulesha E, Graf S, Tomazou EM, Backdahl L, Johnson N, Herberth M, et al.: An integrated resource for genome-wide identification and analysis of human tissue-specific differentially methylated regions (tDMRs)9. Genome Res 2008, 18: 1518–1529. 10.1101/gr.077479.108
Tomlins SA, Mehra R, Rhodes DR, Cao X, Wang L, Dhanasekaran SM, Kalyana-Sundaram S, Wei JT, Rubin MA, Pienta KJ, et al.: Integrative molecular concept modeling of prostate cancer progression. Nat Genet 2007, 39: 41–51. 10.1038/ng1935
Medvedovic M, Sivaganesan S: Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics 2002, 18: 1194–1206. 10.1093/bioinformatics/18.9.1194
Toronen P: Selection of informative clusters from hierarchical cluster tree with gene classes. BMC Bioinformatics 2004, 5: 32. 10.1186/1471-2105-5-32
Buehler EC, Sachs JR, Shao K, Bagchi A, Ungar LH: The CRASSS plug-in for integrating annotation data with hierarchical clustering results. Bioinformatics 2004, 20: 3266–3269. 10.1093/bioinformatics/bth362
Varshavsky R, Horn D, Linial M: Global considerations in hierarchical clustering reveal meaningful patterns in data. PLoS ONE 2008, 3: e2247. 10.1371/journal.pone.0002247
Kuang-Hung P, Chih-Jian L, Stanley NC: Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America 2005, 102: 8961–8965. 10.1073/pnas.0502674102
Sartor MA, Leikauf GD, Medvedovic M: LRpath: a logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics 2009, 25: 211–217. 10.1093/bioinformatics/btn592
Newton MA, Quinatan FA, den Boon JA, Sengupta S, Ahlquist P: Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. The Annals of Applided Statistics 2007, 1: 85–106. 10.1214/07-AOAS104
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102: 15545–15550. 10.1073/pnas.0506580102
Yeung KY, Medvedovic M, Bumgarner RE: From co-expression to co-regulation: how many microarray experiments do we need? Genome Biol 2004, 5: R48. 10.1186/gb-2004-5-7-r48
Datta S, Datta S: Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinformatics 2006, 7: 397. 10.1186/1471-2105-7-397
Liu X, Jessen WJ, Sivaganesan S, Aronow BJ, Medvedovic M: Bayesian hierarchical model for transcriptional module discovery by jointly modeling gene expression and ChIP-chip data2. BMC Bioinformatics 2007, 8: 283. 10.1186/1471-2105-8-283
Liu X, Sivaganesan S, Yeung KY, Guo J, Bumgarner RE, Medvedovic M: Context-specific infinite mixtures for clustering gene expression profiles across diverse microarray dataset. Bioinformatics 2006, 22: 1737–1744. 10.1093/bioinformatics/btl184
Guo X, Liu R, Shriver CD, Hu H, Liebman MN: Assessing semantic similarity measures for the characterization of human regulatory pathways 2. Bioinformatics 2006, 22: 967–973. 10.1093/bioinformatics/btl042
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25: 25–29. 10.1038/75556
Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28: 27–30. 10.1093/nar/28.1.27
Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, et al.: KEGG for linking genomes to life and the environment. Nucleic Acids Res 2008, 36: D480-D484. 10.1093/nar/gkm882
Wingender E, Chen X, Fricke E, Geffers R, Hehl R, Liebich I, Krull M, Matys V, Michael H, Ohnhauser R, et al.: The TRANSFAC system on gene expression regulation. Nucleic Acids Res 2001, 29: 281–283. 10.1093/nar/29.1.281
Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, et al.: From The Cover: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. PNAS 2005, 102: 13550–13555. 10.1073/pnas.0506230102
Pawitan Y, Bjohle J, Amler L, Borg AL, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, et al.: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res 2005, 7: R953-R964. 10.1186/bcr1325
Schmidt M, Bohm D, von TC, Steiner E, Puhl A, Pilch H, Lehr HA, Hengstler JG, Kolbl H, Gehrmann M: The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res 2008, 68: 5405–5413. 10.1158/0008-5472.CAN-07-5206
Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, d'Assignies MS, et al.: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 2007, 13: 3207–3214. 10.1158/1078-0432.CCR-06-2765
Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, et al.: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA 2004, 101: 6062–6067. 10.1073/pnas.0400782101
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2008, 36: D13-D21. 10.1093/nar/gkm1000
Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, et al.: Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 2005, 310: 644–648. 10.1126/science.1117679
Johnson DS, Li W, Gordon DB, Bhattacharjee A, Curry B, Ghosh J, Brizuela L, Carroll JS, Brown M, Flicek P, et al.: Systematic evaluation of variability in ChIP-chip experiments using predefined DNA targets. Genome Res 2008, 18: 393–403. 10.1101/gr.7080508
R Development Core Team R: A Language and Environment for Statistical Computing, Vienna, Austria; 2008.
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5: R80. 10.1186/gb-2004-5-10-r80
Saldanha AJ: Java Treeview – extensible visualization of microarray data. Bioinformatics 2004, 20: 3246–3248. 10.1093/bioinformatics/bth349
Wolfe C, Kohane I, Butte A: Systematic survey reveals general applicability of "guilt-by-association" within gene coexpression networks. BMC Bioinformatics 2005, 6: 227. 10.1186/1471-2105-6-227
Wierstra I, Alves J: FOXM1, a typical proliferation-associated transcription factor. Biol Chem 2007, 388: 1257–1274. 10.1515/BC.2007.159
Fu Z, Malureanu L, Huang J, Wang W, Li H, van Deursen JM, Tindal DJ, Chen J: Plk1-dependent phosphorylation of FoxM1 regulates a transcriptional programme required for mitotic progression. Nat Cell Biol 2008, 10: 1076–1082. 10.1038/ncb1767
Dotan-Cohen D, Melkman AA, Kasif S: Hierarchical tree snipping: clustering guided by prior knowledge. Bioinformatics 2007, 23: 3335–3342. 10.1093/bioinformatics/btm526
Huang D, Wei P, Pan W: Combining gene annotations and gene expression data in model-based clustering: weighted method19. OMICS 2006, 10: 28–39. 10.1089/omi.2006.10.28
Huang D, Pan W: Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data20. Bioinformatics 2006, 22: 1259–1268. 10.1093/bioinformatics/btl065
Lee SI, Batzoglou S: Application of independent component analysis to microarrays. Genome Biol 2003, 4: R76. 10.1186/gb-2003-4-11-r76
Tan MP, Smith EN, Broach JR, Floudas CA: Microarray data mining: a novel optimization-based approach to uncover biologically coherent structures. BMC Bioinformatics 2008, 9: 268. 10.1186/1471-2105-9-268
Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19: 185–193. 10.1093/bioinformatics/19.2.185
Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, et al.: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucl Acids Res 2005, 33: e175. 10.1093/nar/gni179
Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, 35: D61-D65. 10.1093/nar/gkl842
Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society B 1995, 57: 289–300.
This research was supported by grants from the National Human Genome Research Institute (R01 HG003749), National Library of Medicine (R21 LM009662) and NIEHS Center for Environmental Genetics grant (P30 ES06096).
JF developed the methods and the R package, performed all analyses, interpreted results and drafted the manuscript. MM conceived the methodology and provided guidance in the development, design, analysis, interpretation of results, and drafting of the manuscript. VJ developed FTreeView and ZH constructed some of the functional categories used in the analysis.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Freudenberg, J.M., Joshi, V.K., Hu, Z. et al. CLEAN: CLustering Enrichment ANalysis. BMC Bioinformatics 10, 234 (2009). https://doi.org/10.1186/1471-2105-10-234