CLEAN: CLustering Enrichment ANalysis
© Freudenberg et al; licensee BioMed Central Ltd. 2009
Received: 19 January 2009
Accepted: 29 July 2009
Published: 29 July 2009
Integration of biological knowledge encoded in various lists of functionally related genes has become one of the most important aspects of analyzing genome-wide functional genomics data. In the context of cluster analysis, functional coherence of clusters established through such analyses have been used to identify biologically meaningful clusters, compare clustering algorithms and identify biological pathways associated with the biological process under investigation.
We developed a computational framework for analytically and visually integrating knowledge-based functional categories with the cluster analysis of genomics data. The framework is based on the simple, conceptually appealing, and biologically interpretable gene-specific functional coherence score (CLEAN score). The score is derived by correlating the clustering structure as a whole with functional categories of interest. We directly demonstrate that integrating biological knowledge in this way improves the reproducibility of conclusions derived from cluster analysis. The CLEAN score differentiates between the levels of functional coherence for genes within the same cluster based on their membership in enriched functional categories. We show that this aspect results in higher reproducibility across independent datasets and produces more informative genes for distinguishing different sample types than the scores based on the traditional cluster-wide analysis. We also demonstrate the utility of the CLEAN framework in comparing clusterings produced by different algorithms. CLEAN was implemented as an add-on R package and can be downloaded at http://Clusteranalysis.org. The package integrates routines for calculating gene specific functional coherence scores and the open source interactive Java-based viewer Functional TreeView (FTreeView).
Our results indicate that using the gene-specific functional coherence score improves the reproducibility of the conclusions made about clusters of co-expressed genes over using the traditional cluster-wide scores. Using gene-specific coherence scores also simplifies the comparisons of clusterings produced by different clustering algorithms and provides a simple tool for selecting genes with a "functionally coherent" expression profile.
Identifying groups of co-expressed genes through cluster analysis has been successfully used to elucidate affected biological pathways and postulate transcriptional regulatory mechanisms [1, 2]. The integration of biological knowledge in such analyses has been most commonly facilitated by assessing the enrichment of clusters with genes from pre-defined functionally coherent gene lists ("functional categories"). The concept of "functionally related genes clustering together" has been established by ad-hoc visual examination of hierarchical clustering results and their enrichment by genes from the same functional category . The first assessment of statistical significance of such enrichments was performed by analyzing results of k-means clustering  using the hypergeometric distribution . Similar strategies have also been used in the analysis of lists of differentially expressed genes , gene lists constructed based on genome-wide Chromatin Immunoprecipitation (ChIP) [7, 8] and epigenomics experiments , as well as the general approach to integrate lists of genes derived by various experimental and knowledge-based procedures . Introducing biological knowledge through such post-hoc analysis has been important for interpreting results and separating reproducible, biologically meaningful gene clusters from clusters that may have resulted from random fluctuations in the data. For both of these objectives, reproducibility of conclusions made is of utmost importance.
The first two concept defining papers [3, 5] also highlight the dichotomy that exist in using hierarchical vs. partitioning clustering procedures to this days. Hierarchical procedures do not necessitate specifying the "right" number of clusters, a parameter generally unknown in advance whose estimation from the data leads to instability in clustering results . On the other hand, selecting "meaningful" clusters in a hierarchical clustering that can be then correlated with functional categories using the hypergeometric distribution is still mostly performed by ad-hoc visual inspection of related heatmaps. Algorithms for systematic testing of all possible clusters have also been developed [12–14], but results of such analyses are difficult to summarize. Postulating the "right" number of clusters or choosing "good" clusters in an ad-hoc fashion before correlating them with functional categories can result in poor reproducibility since a slightly different number of clusters or slightly different "good" clusters can result in a different interpretation of the results. This problem is akin to choosing the "optimal" cut-off criteria for selecting differentially expressed genes before performing similar functional analyses. It has been shown that results of such analyses are highly sensitive to changes in the cut-off used with different cut-offs yielding different conclusions . In the analysis of differentially expressed genes computational alternatives have been developed that do not require setting such thresholds [16–18], but they are generally not applicable in the knowledge-based assessment of clustering results.
A frequently encountered problem in analyzing genome-wide experimental data is to choose among results produced by different clustering algorithms. Criteria such as homogeneity and separation are relatively straightforward to compute but are mostly of theoretical interest. A more relevant criterion from a biological perspective is the overall functional coherence of resulting gene clusters. Most of the methods developed to date for this purpose require specification of the number of clusters [19, 20]. Comparing different methods at a fixed number of clusters is problematic as some methods might create a better clustering structure when more clusters are allowed and others could create better clusterings when few clusters are allowed. To circumvent this problem ROC curves have been used to assess false and true positive rates of co-clustered gene pairs using the functional categories as a gold standard[21, 22]. However, this same strategy lacked discriminative power when a large number of large functional categories, such as Gene Ontology (GO) terms, served as a gold standard and required again fixing the number of clusters [21, 23].
We developed an analytical framework and flexible computational infrastructure for integrating knowledge-based functional categories into the cluster analysis of gene expression data. The framework is based on the simple, conceptually appealing and biologically interpretable gene-specific functional coherence CLustering Enrichment ANalysis (CLEAN) score derived by correlating the clustering structure as a whole with functional categories of interest. The CLEAN score is gene-specific and it differentiates between the levels of functional coherence for genes within the same cluster. The statistical significance of coherence scores is established by comparing them to the empirical null-distribution obtained by randomly permuting gene identifiers. The corresponding computational infrastructure is based on an open-source R package for the data analysis and open-source Java viewer for visually integrating and analyzing expression data and associated knowledge-based functional categories.
We investigate the reproducibility of the findings based on the CLEAN scores, and demonstrate its utility in comparing the functional coherence of clusterings produced by different algorithms and in selecting genes with informative expression patterns. Being gene-specific, the CLEAN score facilitates easy comparisons of functional coherence of different hierarchical structures (e.g. generated by different clustering algorithms) and selection of genes based on functional coherence of their expression pattern without the need to fix the number of clusters. On the other hand, we demonstrate that differentiating between the levels of functional coherence for genes within the same cluster leads to significant improvements in reproducibility of findings across independent microarray datasets when compared to traditional cluster-wide analyses. Furthermore, genes selected based on the CLEAN score produced more precise sample groupings than genes selected using the cluster-wide score.
Fisher's exact test for enrichment is calculated for all functional categories containing the gene and for all possible clusters containing this gene. (Figure 1).
The CLEAN score is then computed as the maximum -log10(q-value) of enrichment tests across all pairs of clusters containing the gene and functional categories containing the gene (see methods for details).
The clustering-specific null-distribution of the CLEAN score is established by randomly permuting gene identifiers. Statistically significant scores are then used to facilitate selection of genes or gene clusters, as well as the assessment of functional coherence and the comparison of clustering results produced by different algorithms. The integrated clustering viewer/browser, Functional TreeView (FTreeView), is used for integrative browsing and visual display of gene clusters and associated functional categories.
When multiple category types are used, the joint CLEAN score is calculated as the maximum of CLEAN scores for each category type. Here we focus on three sets of functional categories: Gene Ontology (GO) categories , Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [25, 26], and a custom set of Co-regulation Groups (CG) based on the computational analysis of gene promoters and regulatory motif definitions in the Transfac database, version 12.1  (see methods).
All currently used algorithms assign statistical significance of functional enrichment to whole clusters instead of individual genes within the cluster. To compare the properties of the CLEAN score to currently used methods we define a cluster-wide CLEAN (cwCLEAN) score to serve as a surrogate for this traditional type of analysis. The cwCLEAN score is defined as the maximum of -log10(q-value) for all clusters containing the gene regardless of whether the enriched functional categories contain the gene or not (Figure 1). We analyzed several public microarray datasets to demonstrate the statistical properties and utility of the CLEAN framework, and to compare its performance to traditionally used approaches.
Comparing clustering results using the CLEAN score
The CLEAN score provides a tool to compare the functional coherence of clustering results produced by different clustering algorithms on a gene-by-gene basis without requiring a pre-defined number of gene clusters. We used four independent large-scale breast cancer gene expression datasets [28–31] to demonstrate utility of the CLEAN score in choosing the clustering structure with the highest functional coherence. For each dataset we compared the performance of three typical clustering algorithms: Context specific infinite mixture model (CSIMM) , Euclidian distance based and Pearson's correlation based hierarchical clustering. For all three algorithms, the hierarchical clustering was constructed using the average linkage principle and algorithms were applied to expression data with and without prior variance-rescaling (see methods). Clustering algorithms were used to cluster data from four independent breast cancer gene expression datasets with GEO accession numbers GSE1456 , GSE3494 , GSE7390 , and GSE11121 .
Reproducibility and the comparison with cluster-wide scores
Contingency table of genes with significant and non-significant CLEAN score in human and mouse tissues.
Human CLEAN score
Significant (> 2.7)
Non-significant (< 2.7)
Mouse CLEAN score
Significant (> 3.2)
Non-significant (< 3.2)
Contingency table of genes with significant and non-significant cwCLEAN score in human and mouse tissues.
Human cwCLEAN score
Non-significant (< 2.7)
Mouse cwCLEAN score
Significant (> 3.2)
Non-significant (< 3.2)
Unsupervised selection of informative genes
Reproducibly identifying genes whose expression patterns can delineate biologically meaningful groups of samples has been an important problem in computational biomedicine. We focus on the situation when the identity of samples belonging to different groups or even the number of the groups is not known in advance. In this case, the informative genes have to be selected in an unsupervised fashion. By studying the problem of classifying samples from different tissue types in the integrated mouse-human dataset, we demonstrate the utility of using the CLEAN score to select informative genes. We first identify genes with statistically significant CLEAN scores in mouse and human tissue expression profiling datasets. Then we show that expression profiles of these genes facilitate better separation of samples from different tissue types than expression profiles of genes not having statistically significant CLEAN scores. Furthermore, we demonstrate that the improvements in precision are significantly larger when using the CLEAN score than when using the cluster-wide cwCLEAN scores.
We created a total of 6 different gene lists and assessed their abilities to distinguish different tissue types in the combined human-mouse expression dataset. Gene lists were as follows:
All 10,287 genes present in both microarray platforms.
Significant CLEAN score
Genes with significant CLEAN scores in both human and mouse datasets.
Non-significant CLEAN score
Genes with not significant CLEAN scores in both human and mouse datasets.
Significant cluster score
Genes with significant cwCLEAN scores in both human and mouse datasets.
Non-significant cluster score
Genes with not significant cwCLEAN scores in both human and mouse datasets.
Genes identified by applying the Cancer Outlier Profiler Analysis (COPA)  analysis. This COPA list of 2,668 genes was generated by performing COPA separately for human and mouse datasets, selecting the top 5,000 genes, and using the overlapping genes in the two datasets. This procedure was tuned to produce a number of genes that is similar to the number of genes with significant CLEAN scores.
We developed an open-source R package  that performs Clustering Enrichment Analysis (CLEAN). Typically, the user will provide a gene expression data set and a clustering of the genes. The package is intended for hierarchical clusterings but can also accommodate non-hierarchical clusterings such as k-means . The package is compatible with a number of common input formats. GO and KEGG functional categories are derived from respective Bioconductor packages , and users can provide their own functional categories. The CLEAN package provides functions to compute the CLEAN score and generate output files to interactively display expression data together with gene and sample clusterings, and functional cluster annotation.
Integrating biological knowledge encoded in lists of functionally related genes into the analysis of genome-wide functional genomics data is an increasingly important aspect of analyzing genomics data. In the context of cluster analysis, such integration is necessary for selecting meaningful clusters of genes, and for the adequate biological interpretation of patterns defined by such clusters. We developed a computational framework for analytically and visually integrating knowledge-based functional categories with the cluster analysis of genomics data. The framework is based on the gene-specific functional coherence score derived by correlating the clustering structure as a whole with functional categories of interest. The statistical significance of coherence scores is established by comparing them to the empirical null-distribution obtained by randomly permuting gene identifiers.
We established the reproducibility of the CLEAN score across related gene expression datasets, and its utility in comparing the functional coherence of different clusterings and in unsupervised selection of genes that discriminate between biologically meaningful groups of samples. When compared to the commonly used cluster-wide assessment of functional coherence, the CLEAN score exhibits higher reproducibility across different microarray datasets. Genes selected based on the CLEAN score produced more precise sample groupings than genes selected using either cluster-wide score or by using COPA algorithm.
One way to think about the differences between CLEAN and cwCLEAN score in terms of differences between assuming functional coherence based on co-clustering (guilt-by-association, cwCLEAN) vs. having some additional pre-existing evidence of functional relationship (CLEAN). Our results in this context implicate that in the case of breast cancer and tissue datasets the previous evidence of functional relationship is overall more reliable than guilt-by-associations relationships arising only from the cluster analysis on their own. It is possible that in some other situations, new functional relationships would dominate the existing one and the opposite would be the case. Calculating the difference between the two scores can quickly implicate novel functional relationships arising from the data analysis alone.
A systematically different approach to integrating the experimental data and prior knowledge is to incorporate the functional information into the clustering algorithm itself [42–46]. While conceptually appealing, such methods have a more limited applicability than the framework presented here and have not been widely used. Our framework follows the commonly utilized post-hoc integration approach in which cluster analysis is performed first using the experimental data and integration is achieved in the post-hoc analysis. The ability to validate the clusters produced by analyzing experimental data, and the transparency about how exactly the different types of information is utilized in constructing clusters is most likely the reason for the popularity and the wide usage of post-hoc approaches. When the functional knowledge is used in the process of constructing clusters, it can no longer be employed to provide the guidance in selecting biologically meaningful clusters.
We directly demonstrate that integrating prior biological knowledge encoding in the lists of functionally coherent genes improves the reproducibility of clustering results. We also demonstrate that our gene-specific functional coherence score, which differentiates between the levels of functional coherence for genes within the same cluster, shows higher reproducibility than the cluster-wide score. The CLEAN score also produced more informative genes for distinguishing different sample types than the cluster-wide score. This implicates that the gene-specificity of the CLEAN score is a fundamentally different and, at least in some circumstances, better approach for integrating biological knowledge with results of the cluster analysis than previously used cluster-based scores.
Data Preprocessing, Gene Selection and Clustering
Raw data files (Affymetrix HG-U133A CEL files) of four independent human breast cancer datasets (GEO expression series GSE1456 , GSE3494 , GSE7390 , and GSE11121 ) were downloaded from the public repository GEO . Each dataset was RMA-preprocessed  separately using the Entrez Gene-based custom CDF (version 10) from the Psychiatry/MBNI Microarray Lab at the University of Michigan ('Brainarray') . Preprocessed data files of a large-scale tissue expression data set  were also downloaded from the same repository. The tissues included both human (GEO dataset record GDS596) and mouse (GDS592). For genes with multiple probes per Entrez gene ID, in each case, the probeset with the highest median expression value per probeset was selected as the representative probeset for that gene. To match corresponding genes across species, the HomoloGene database  was used.
We applied a mild variation filter using Cancer Outlier Profiler Analysis (COPA, 95th percentile)  to select the top 10,000 genes to be clustered in each of the human breast cancer datasets (GSE1456, GSE3494, GSE7390, GSE11121). In each dataset expression values were centered by setting the median value of each gene to zero (subtracting the gene-specific medians) and clustering analyses were performed for each dataset and species independently using hierarchical clustering with three different similarity metrics or distance metrics, respectively:
Context-Specific Infinite Mixture Models (CSIMM) . For any given pair or genes, this Bayesian method estimates the posterior pairwise probability (PPP) of the genes being co-clustered. The resulting PPP matrix is used as the similarity measure for the hierarchical clustering algorithm.
Pearson Correlation of gene expression values as the similarity measure.
Euclidian Distance based on per-gene normalized expression values as the distance measure.
All three hierarchical clustering algorithms used Average Linkage.
Each clustering analysis was then repeated after further variance based re-scaling each dataset by dividing expression levels by their standard deviation for each gene and each datasets separately. When computing the Pearson correlation, expression values are implicitly divided by the standard deviation. Thus, this additional normalization step did not significantly affect Pearson's correlations.
Clustering Enrichment Analysis
Clustering Enrichment Analysis (CLEAN) is based on testing every possible cluster within a gene clustering for statistically significant enrichment of biological categories. A background gene list (e.g. all genes represented on the microarray) is given as well as a hierarchical clustering of some or all genes in the background list. The method was implemented as described in Algorithm 1.
Define one or more sets of biological categories with sufficient representation in the background gene list
Determine all possible gene clusters within a given size range
For each gene cluster
3.1 For each functional category, determine the 2 × 2 contingency table and perform Fisher's Exact test
3.2 Compute q-values, that is the adjusted Fisher p-values, to account for multiple testing
Compute the cluster-wide cwCLEAN score
4.1 Determine the minimum q-value for each gene cluster
4.2 Sort gene clusters by minimum q-value starting with the lowest (i.e. most significant) q-value.
4.3 Prune cluster supersets with less significant q-value to avoid 'spill-over' effect, i.e. remove gene clusters whose significant functional enrichment score is likely due to a single subtree.
4.4 For each gene, find the minimum q-value over all remaining clusters the gene is member of
4.5 For each gene, find the minimum q-value over all category sets (e.g. GO and KEGG)
Compute the gene-specific CLEAN score
5.1 For each gene, find the minimum q-value over all clusters and all categories the gene is member of.
5.2 For each gene, find the minimum q-value over all category sets (e.g. GO and KEGG)
5.3 The CLEAN is the resulting -log10-transformed minimum per-gene q-value.
Defining Functional Categories
A functional category is defined as a non-empty set of genes representing a biological concept such as "cell cycle", "immunological synapse" or "cytokine-cytokine receptor activation". The method is designed to accommodate any set of functional categories such that each category is comprised of a list of genes that has user-specified minimum overlap with the background gene list. Here, sets of categories were either downloaded from publically accessible databases such as Gene Ontology (GO) , Kyoto Encyclopedia of Genes and Genomes (KEGG) [25, 26], or were defined based on the Transfac database .
More specifically, functional categories based on GO and KEGG were downloaded as R packages [36, 37] while co-regulation based categories (CG) were derived computationally. Transcription factor and corresponding gene promoter data  and DNA sequence data  was downloaded. For each of the 304 human transcription factors with at least one position-weight matrix (PWM) in the Transfac version 12.1 database, a score was computed for each of the 24,190 genes, as to how likely they were to have a corresponding binding motif within 1 kbp of their transcriptional start site. The respective 750 top-scoring genes (or fewer if the score was not significant for at least 750 genes) were assigned to each transcription factor to define the respective functional categories.
For compatibility, all gene identifiers were converted to Entrez gene IDs, and matched across species where necessary using the HomoloGene database . Subsequent analyses were restricted to categories that had at least ten genes in common with the respective background gene list (e.g. the genes represented on the microarray platform).
Obtaining All Possible Gene Clusters
Given a hierarchical gene tree, a list of all possible gene clusters is obtained by recursively traversing the tree structure and at each node recording the list of corresponding genes. The size of clusters is limited to a user-specified range. Here, clusters smaller than 10 genes and larger than 1,000 were disregarded.
Determining significant functional enrichment
To determine whether a functional category is over-represented in a given gene cluster, i.e. more genes of the functional category are present in the cluster than expected by chance, a two-by-two contingency table (number of genes in the cluster and category, in the cluster and not in the category, etc.) is constructed and Fisher's Exact Test is performed. The procedure is repeated for each category within a category set (e.g. "GO", "KEGG") and q-values (i.e. adjusted p-values) are computed to control the false discovery rate (FDR) . Categories with a q-value not greater than a user-defined cutoff are considered significant. The default q-value cutoff is 0.1.
Procedure for Non-hierarchical Methods
The procedure can be extended to non-hierarchical methods such as k-means . For a fixed number k of clusters, a mutually exclusive set of gene clusters is already given and step 2 in Algorithm 1 is skipped. If the user specifies a range K of cluster numbers, step 2 in Algorithm 1 is preceded by Algorithm 2 which generates a hierarchical gene clustering as a means to average over multiple runs of the non-hierarchical clustering algorithm.
For each k in K, run the non-hierarchical clustering algorithm
For each gene i
2.1 For each gene j ≠ i, count the number n ij of clusterings where i and j are in the same cluster
Use the p ij as a similarity metric and average linkage as a summary method to generate a hierarchical clustering.
R package and FTreeView tool
An R package to perform CLEAN and an open-source Java tool to interactively display gene expression data, gene and sample clustering, gene annotation, and functional cluster annotation can be freely downloaded from the authors' web-site http://ClusterAnalysis.org.
Clustering Enrichment Analysis
Cancer Outlier Profiler Analysis
Context-Specific Infinite Mixture Models
cluster-wide CLEAN score
false discovery rate
Kyoto Encyclopedia of Genes and Genomes
posterior pairwise probability
Robust Multichip Average
Receiver Operating Characteristic.
This research was supported by grants from the National Human Genome Research Institute (R01 HG003749), National Library of Medicine (R21 LM009662) and NIEHS Center for Environmental Genetics grant (P30 ES06096).
- Slonim DK: From patterns to pathways: gene expression data analysis comes of age. Nat Genet 2002, 32(Suppl):502–508. 10.1038/ng1033View ArticlePubMedGoogle Scholar
- Do JH, Choi DK: Clustering approaches to identifying gene expression patterns from DNA microarray data. Mol Cells 2008, 25: 279–288.PubMedGoogle Scholar
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863PubMed CentralView ArticlePubMedGoogle Scholar
- MacQueen J: Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability 1965, 281–297.Google Scholar
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet 1999, 22: 281–285. 10.1038/10343View ArticlePubMedGoogle Scholar
- Khatri P, Draghici S, Ostermeier GC, Krawetz SA: Profiling Gene Expression Using Onto-Express. Genomics 2002, 79: 266–270. 10.1006/geno.2002.6698View ArticlePubMedGoogle Scholar
- Wei CL, Wu Q, Vega VB, Chiu KP, Ng P, Zhang T, Shahab A, Yong HC, Fu Y, Weng Z, et al.: A global map of p53 transcription-factor binding sites in the human genome2. Cell 2006, 124: 207–219. 10.1016/j.cell.2005.10.043View ArticlePubMedGoogle Scholar
- Sartor MA, Schnekenburger M, Marlow JL, Reichard JF, Wang Y, Fan Y, Ma C, Karyala S, Halbleib D, Liu X, et al.: Genomewide Analysis of Aryl Hydrocarbon Receptor Binding Targets Reveals an Extensive Array of Gene Clusters that Control Morphogenic and Developmental Programs. Environ Health Perspect. 2009, 117(7):1139–1146.PubMed CentralView ArticlePubMedGoogle Scholar
- Rakyan VK, Down TA, Thorne NP, Flicek P, Kulesha E, Graf S, Tomazou EM, Backdahl L, Johnson N, Herberth M, et al.: An integrated resource for genome-wide identification and analysis of human tissue-specific differentially methylated regions (tDMRs)9. Genome Res 2008, 18: 1518–1529. 10.1101/gr.077479.108PubMed CentralView ArticlePubMedGoogle Scholar
- Tomlins SA, Mehra R, Rhodes DR, Cao X, Wang L, Dhanasekaran SM, Kalyana-Sundaram S, Wei JT, Rubin MA, Pienta KJ, et al.: Integrative molecular concept modeling of prostate cancer progression. Nat Genet 2007, 39: 41–51. 10.1038/ng1935View ArticlePubMedGoogle Scholar
- Medvedovic M, Sivaganesan S: Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics 2002, 18: 1194–1206. 10.1093/bioinformatics/18.9.1194View ArticlePubMedGoogle Scholar
- Toronen P: Selection of informative clusters from hierarchical cluster tree with gene classes. BMC Bioinformatics 2004, 5: 32. 10.1186/1471-2105-5-32PubMed CentralView ArticlePubMedGoogle Scholar
- Buehler EC, Sachs JR, Shao K, Bagchi A, Ungar LH: The CRASSS plug-in for integrating annotation data with hierarchical clustering results. Bioinformatics 2004, 20: 3266–3269. 10.1093/bioinformatics/bth362View ArticlePubMedGoogle Scholar
- Varshavsky R, Horn D, Linial M: Global considerations in hierarchical clustering reveal meaningful patterns in data. PLoS ONE 2008, 3: e2247. 10.1371/journal.pone.0002247PubMed CentralView ArticlePubMedGoogle Scholar
- Kuang-Hung P, Chih-Jian L, Stanley NC: Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America 2005, 102: 8961–8965. 10.1073/pnas.0502674102View ArticleGoogle Scholar
- Sartor MA, Leikauf GD, Medvedovic M: LRpath: a logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics 2009, 25: 211–217. 10.1093/bioinformatics/btn592PubMed CentralView ArticlePubMedGoogle Scholar
- Newton MA, Quinatan FA, den Boon JA, Sengupta S, Ahlquist P: Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. The Annals of Applided Statistics 2007, 1: 85–106. 10.1214/07-AOAS104View ArticleGoogle Scholar
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102: 15545–15550. 10.1073/pnas.0506580102PubMed CentralView ArticlePubMedGoogle Scholar
- Yeung KY, Medvedovic M, Bumgarner RE: From co-expression to co-regulation: how many microarray experiments do we need? Genome Biol 2004, 5: R48. 10.1186/gb-2004-5-7-r48PubMed CentralView ArticlePubMedGoogle Scholar
- Datta S, Datta S: Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinformatics 2006, 7: 397. 10.1186/1471-2105-7-397PubMed CentralView ArticlePubMedGoogle Scholar
- Liu X, Jessen WJ, Sivaganesan S, Aronow BJ, Medvedovic M: Bayesian hierarchical model for transcriptional module discovery by jointly modeling gene expression and ChIP-chip data2. BMC Bioinformatics 2007, 8: 283. 10.1186/1471-2105-8-283PubMed CentralView ArticlePubMedGoogle Scholar
- Liu X, Sivaganesan S, Yeung KY, Guo J, Bumgarner RE, Medvedovic M: Context-specific infinite mixtures for clustering gene expression profiles across diverse microarray dataset. Bioinformatics 2006, 22: 1737–1744. 10.1093/bioinformatics/btl184PubMed CentralView ArticlePubMedGoogle Scholar
- Guo X, Liu R, Shriver CD, Hu H, Liebman MN: Assessing semantic similarity measures for the characterization of human regulatory pathways 2. Bioinformatics 2006, 22: 967–973. 10.1093/bioinformatics/btl042View ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25: 25–29. 10.1038/75556PubMed CentralView ArticlePubMedGoogle Scholar
- Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28: 27–30. 10.1093/nar/28.1.27PubMed CentralView ArticlePubMedGoogle Scholar
- Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, et al.: KEGG for linking genomes to life and the environment. Nucleic Acids Res 2008, 36: D480-D484. 10.1093/nar/gkm882PubMed CentralView ArticlePubMedGoogle Scholar
- Wingender E, Chen X, Fricke E, Geffers R, Hehl R, Liebich I, Krull M, Matys V, Michael H, Ohnhauser R, et al.: The TRANSFAC system on gene expression regulation. Nucleic Acids Res 2001, 29: 281–283. 10.1093/nar/29.1.281PubMed CentralView ArticlePubMedGoogle Scholar
- Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, et al.: From The Cover: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. PNAS 2005, 102: 13550–13555. 10.1073/pnas.0506230102PubMed CentralView ArticlePubMedGoogle Scholar
- Pawitan Y, Bjohle J, Amler L, Borg AL, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, et al.: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res 2005, 7: R953-R964. 10.1186/bcr1325PubMed CentralView ArticlePubMedGoogle Scholar
- Schmidt M, Bohm D, von TC, Steiner E, Puhl A, Pilch H, Lehr HA, Hengstler JG, Kolbl H, Gehrmann M: The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res 2008, 68: 5405–5413. 10.1158/0008-5472.CAN-07-5206View ArticlePubMedGoogle Scholar
- Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, d'Assignies MS, et al.: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 2007, 13: 3207–3214. 10.1158/1078-0432.CCR-06-2765View ArticlePubMedGoogle Scholar
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, et al.: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA 2004, 101: 6062–6067. 10.1073/pnas.0400782101PubMed CentralView ArticlePubMedGoogle Scholar
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2008, 36: D13-D21. 10.1093/nar/gkm1000PubMed CentralView ArticlePubMedGoogle Scholar
- Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, et al.: Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 2005, 310: 644–648. 10.1126/science.1117679View ArticlePubMedGoogle Scholar
- Johnson DS, Li W, Gordon DB, Bhattacharjee A, Curry B, Ghosh J, Brizuela L, Carroll JS, Brown M, Flicek P, et al.: Systematic evaluation of variability in ChIP-chip experiments using predefined DNA targets. Genome Res 2008, 18: 393–403. 10.1101/gr.7080508PubMed CentralView ArticlePubMedGoogle Scholar
- R Development Core Team R: A Language and Environment for Statistical Computing, Vienna, Austria; 2008.
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5: R80. 10.1186/gb-2004-5-10-r80PubMed CentralView ArticlePubMedGoogle Scholar
- Saldanha AJ: Java Treeview – extensible visualization of microarray data. Bioinformatics 2004, 20: 3246–3248. 10.1093/bioinformatics/bth349View ArticlePubMedGoogle Scholar
- Wolfe C, Kohane I, Butte A: Systematic survey reveals general applicability of "guilt-by-association" within gene coexpression networks. BMC Bioinformatics 2005, 6: 227. 10.1186/1471-2105-6-227PubMed CentralView ArticlePubMedGoogle Scholar
- Wierstra I, Alves J: FOXM1, a typical proliferation-associated transcription factor. Biol Chem 2007, 388: 1257–1274. 10.1515/BC.2007.159PubMedGoogle Scholar
- Fu Z, Malureanu L, Huang J, Wang W, Li H, van Deursen JM, Tindal DJ, Chen J: Plk1-dependent phosphorylation of FoxM1 regulates a transcriptional programme required for mitotic progression. Nat Cell Biol 2008, 10: 1076–1082. 10.1038/ncb1767PubMed CentralView ArticlePubMedGoogle Scholar
- Dotan-Cohen D, Melkman AA, Kasif S: Hierarchical tree snipping: clustering guided by prior knowledge. Bioinformatics 2007, 23: 3335–3342. 10.1093/bioinformatics/btm526View ArticlePubMedGoogle Scholar
- Huang D, Wei P, Pan W: Combining gene annotations and gene expression data in model-based clustering: weighted method19. OMICS 2006, 10: 28–39. 10.1089/omi.2006.10.28View ArticlePubMedGoogle Scholar
- Huang D, Pan W: Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data20. Bioinformatics 2006, 22: 1259–1268. 10.1093/bioinformatics/btl065View ArticlePubMedGoogle Scholar
- Lee SI, Batzoglou S: Application of independent component analysis to microarrays. Genome Biol 2003, 4: R76. 10.1186/gb-2003-4-11-r76PubMed CentralView ArticlePubMedGoogle Scholar
- Tan MP, Smith EN, Broach JR, Floudas CA: Microarray data mining: a novel optimization-based approach to uncover biologically coherent structures. BMC Bioinformatics 2008, 9: 268. 10.1186/1471-2105-9-268PubMed CentralView ArticlePubMedGoogle Scholar
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19: 185–193. 10.1093/bioinformatics/19.2.185View ArticlePubMedGoogle Scholar
- Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, et al.: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucl Acids Res 2005, 33: e175. 10.1093/nar/gni179PubMed CentralView ArticlePubMedGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, 35: D61-D65. 10.1093/nar/gkl842PubMed CentralView ArticlePubMedGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society B 1995, 57: 289–300.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.