DISCLOSE : DISsection of CLusters Obtained by SEries of transcriptome data using functional annotations and putative transcription factor binding sites
© Blom et al; licensee BioMed Central Ltd. 2008
Received: 25 April 2008
Accepted: 16 December 2008
Published: 16 December 2008
A typical step in the analysis of gene expression data is the determination of clusters of genes that exhibit similar expression patterns. Researchers are confronted with the seemingly arbitrary choice between numerous algorithms to perform cluster analysis.
We developed an exploratory application that benchmarks the results of clustering methods using functional annotations. In addition, our program integrates a de novo DNA motif discovery algorithm that identifies overrepresented DNA binding sites in the upstream DNA sequences of genes from the clusters; such sites are indicative of transcriptional control. The performance of our program was evaluated by comparing the original results of a time course experiment with the findings of our application.
DISCLOSE assists researchers in the prokaryotic research community in systematically evaluating the results of applying a range of clustering algorithms to transcriptome data. Different performance measures allow researchers to quickly and comprehensively determine the best-suited clustering approach for a given dataset.
DNA microarray technology is commonly used to study mRNA expression levels of genes under different experimental conditions. Clustering approaches are widely used in the analysis of gene expression data. The ability to identify groups of genes exhibiting similar expression patterns by clustering allows for detailed biological insights into global regulation of gene expression and cellular processes. Clustering methodology is considered a potent means to infer putative gene function [1, 2].
When analyzing transcriptome data, researchers are often faced with a choice between a wide variety of clustering methods and associated parameters. Applying different clustering algorithms to the same dataset places genes in different clusters and can therefore lead to different biological interpretations of the same dataset. Moreover, selecting the most appropriate clustering method and parameters depends heavily on the experience of the researcher and on the nature of the dataset analyzed.
Several studies have shown the relevance of applying external measures (i.e., using prior biological knowledge) to more objectively evaluate the results of clustering algorithms [3–6]. Central to this approach is the assumption that genes involved in similar biological processes are more likely to be co-transcribed. Therefore, selecting the clustering method whose clusters are most enriched for biological processes is considered a relevant starting point for the biological interpretation of a DNA microarray dataset [6–9].
Co-clustered genes may also represent a candidate set of coregulated genes, i.e., genes whose expression is regulated by the same transcription factor (TF). The discovery of putative regulatory motifs in cis-regulatory regions of genes that are part of the same cluster could therefore allow identification of new TF targets. Existing implementations that employ motif discovery on clusters obtained from DNA microarray data [7, 8, 11] leave the downstream analysis of the motifs to the researcher. More importantly, they give no feedback on how the chosen clustering algorithm and associated parameters affected the results, making it difficult to compare the effect of different clustering parameters or methods on the same dataset. Ideally, quantitative information concerning the functional and motif enrichments of the tested clusters should be provided after each clustering analysis. This information would then allow for a more objective selection of optimal clustering parameters based on biological criteria. Lastly, none of the available software packages is specifically suited for prokaryotic data analysis, since they do not support prokaryote-specific data sources (e.g., operons, specific genome annotations).
We have developed DISCLOSE, an application for prokaryotes that benchmarks clustering methods using biological annotations and the SCOPE DNA binding site detection algorithm. This algorithm allows the prediction of cis-regulatory motifs of genes that are part of the same cluster. In addition, further occurrences of the identified motifs are determined. Moreover, putative motifs are compared with known DNA binding sites, and a functional analysis is performed on the genes bearing the motif in their upstream region.
2 Program overview
The DISCLOSE application automatically scores the clusters of each clustering analysis according to different criteria. Based on one of these metrics, the researcher then decides which clustering method is most suitable for the dataset analyzed. Various metrics (see below) are available to assess the results of the clustering analysis. Each metric provides a unique measure to filter the results of a clustering analysis and can therefore be used to address different research questions; e.g., selection of a clustering analysis which yields a large number of overrepresented motifs or a clustering analysis which produces a large number of significantly overrepresented metabolic pathways. Based on the chosen clustering analysis, DISCLOSE provides an in-depth analysis of clustering results together with an intuitive visualization.
The DNA binding site detection algorithm uses operon information and the genomic sequence in FASTA format (see Fig. 1G). However, DISCLOSE also supports a single gene-based analysis if no operon information is available. Moreover, known binding site information can be used to evaluate the results of the DNA binding site detection algorithm.
2.2.1 Clustering of gene expression data
Two widely used clustering algorithms (K-means and Self Organizing Maps) from the TIGR Multiexperiment Viewer (MeV) package (http://www.tm4.org/mev.html) are implemented in DISCLOSE (Fig. 1B). Different parameter settings can be applied to a clustering analysis by specifying a parameter range and/or different similarity measures for each clustering approach, e.g., a K-means clustering with five to twenty clusters using Euclidean distance and Pearson correlation (Fig. 1C).
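The parameter range described above amounts to one clustering run per combination of cluster count and similarity measure. The following Python sketch only illustrates that enumeration; the function name, the dictionary keys and the measure labels are hypothetical and do not reflect the DISCLOSE or MeV interfaces.

```python
from itertools import product


def clustering_runs(algorithm, k_min, k_max, measures):
    """Enumerate one run configuration per (cluster count, measure) pair.

    All names here are illustrative assumptions, not DISCLOSE settings.
    """
    return [
        {"algorithm": algorithm, "clusters": k, "measure": measure}
        for k, measure in product(range(k_min, k_max + 1), measures)
    ]


# Example from the text: K-means with five to twenty clusters and two measures.
runs = clustering_runs("k-means", 5, 20, ["euclidean", "pearson"])
print(len(runs))  # 16 cluster counts x 2 measures = 32 runs
```

Each configuration would then be clustered and scored separately, which is why the multiple testing correction discussed later has to take the number of runs into account.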
2.2.2 Evaluation of external clustering applications
The two most used clustering algorithms have been implemented in DISCLOSE. Since many other clustering methods exist, we have chosen to allow the evaluation of results obtained from external clustering applications in DISCLOSE (Fig. 1D).
2.2.3 Evaluating clusters using biological knowledge
The identification of significantly enriched categories in a cluster of co-expressed genes enables users to focus on relevant biological phenomena. Assuming a normal distribution of the number of genes from the clusters for each functional category, one expects a difference in the proportion of genes for a category present in each cluster compared to the genes from a reference set (e.g., the remaining genes from the studied organism). To identify clusters that contain a significantly enriched number of genes from a certain functional class, the distribution of genes from a gene set (e.g., a cluster of genes) is compared to the genes in the reference set (e.g., the remainder of the genes in all other clusters).
A hypergeometric distribution test is used to calculate p-values for each functional category from each cluster. This p-value describes the probability of observing an enrichment of genes from a functional category in a cluster by chance (Fig. 1E). The number of false positives for the initial cluster evaluation (Fig. 1E) is controlled by a strict Bonferroni multiple testing correction (taking into account the clustering runs), while additional corrections are used upon detailed analysis of selected results.
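As an illustration of the test described above, the sketch below computes the upper tail of the hypergeometric distribution and applies a Bonferroni correction. It is a minimal example, not the DISCLOSE implementation: the function names are hypothetical, and how the number of tests is counted (for instance categories times clusters times clustering runs) is an assumption left to the caller.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, category_genes, cluster_size, hits):
    """P(X >= hits) when cluster_size genes are drawn without replacement
    from total_genes, of which category_genes belong to the category."""
    upper = min(category_genes, cluster_size)
    tail = sum(
        comb(category_genes, i)
        * comb(total_genes - category_genes, cluster_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(total_genes, cluster_size)


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value; n_tests is supplied by the caller
    (assumption: it covers every category, cluster and clustering run)."""
    return min(1.0, p_value * n_tests)


# Toy numbers: 10 genes, 4 in the category, a cluster of 3 with 2 hits.
p = hypergeom_enrichment_p(10, 4, 3, 2)  # (6*6 + 4*1) / 120 = 1/3
adjusted = bonferroni(p, n_tests=2)
```

With realistic genome sizes the same function applies unchanged, because Python integers do not overflow in the binomial coefficients.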
2.2.4 De novo identification of DNA binding sites
Clustering algorithms allow for the identification of groups of genes that exhibit similar expression patterns. This co-expression could be explained by transcriptional co-regulation. Identification of overrepresented DNA binding sites in genes of the same cluster is performed by the SCOPE method. This method utilizes three specialized algorithms: BEAM for non-degenerate motifs, PRISM for degenerate motifs and SPACER for bipartite motifs. Key aspects of SCOPE are high sensitivity and specificity for a broad range of motifs (i.e., perfect, degenerate and gapped motifs), the small number of parameters required for motif detection, and speed.
2.2.5 Characterization of putative motifs
For various model organisms, reference databases on transcriptional regulation have been created that summarize experimentally characterized transcription factors, their binding sites and the genes they regulate (e.g., DBTBS for Bacillus subtilis or RegulonDB for Escherichia coli). The binding sites derived from these databases can be incorporated in DISCLOSE, allowing for a comparison of known binding sites to putative DNA binding sites found by the SCOPE algorithm. This feature allows researchers to distinguish between known and unknown binding sites. In addition, aligned putative motif instances identified by the SCOPE algorithm are used to create position-specific scoring matrices. These matrices are subsequently used to score the upstream and coding DNA sequences from all genes of the studied organism. The prioritized results of this analysis allow researchers to identify additional genes that contain the putative motif but were not part of the original cluster.
Lastly, DISCLOSE attempts to functionally characterize motifs by identifying significantly enriched categories among the genes that contain the motif. This analysis differs from a standard functional enrichment analysis. Since the motif analysis only uses operon information, the enrichment of categories is calculated on operons instead of genes. This analysis yields a p-value which describes the probability of observing, by chance, an enrichment of operons belonging to a specific functional category among the operons bearing the motif. The results of this examination yield insights concerning the biological processes that could be controlled by the putative motif.
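To make the scoring step concrete, the sketch below builds a log-odds position-specific scoring matrix from aligned motif instances and scans a sequence for its best-scoring window. It is a simplified illustration under stated assumptions: a uniform background of 0.25 per base, a pseudocount of 1, forward strand only, and hypothetical function names. The scoring scheme actually used by DISCLOSE is not specified in the text.

```python
from math import log2

BASES = "ACGT"


def build_pssm(instances, pseudocount=1.0, background=0.25):
    """Log-odds matrix (one dict per motif position) from aligned instances.

    Assumptions: equal-length instances, uniform background frequencies.
    """
    width = len(instances[0])
    if any(len(seq) != width for seq in instances):
        raise ValueError("motif instances must be aligned to equal length")
    total = len(instances) + pseudocount * len(BASES)
    pssm = []
    for pos in range(width):
        column = [seq[pos] for seq in instances]
        pssm.append({
            base: log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pssm


def scan(sequence, pssm):
    """Return (best_score, offset) over all forward-strand windows,
    or None if the sequence is shorter than the motif."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(ch not in BASES for ch in window):
            continue  # skip windows with ambiguous bases such as N
        score = sum(column[ch] for column, ch in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


pssm = build_pssm(["TGACA", "TGACA", "TGATA"])
hit = scan("CCCTGACAGG", pssm)  # best window is TGACA at offset 3
```

Ranking all upstream and coding sequences of a genome by the best score returned from `scan` would give the kind of prioritized list described above.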
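The difference from a gene-based enrichment test can be sketched as follows: genes are first collapsed to their operons, and the hypergeometric test is then computed on operon counts. This is an illustrative sketch only. The rule that an operon belongs to a category if any of its genes is annotated with it is an assumption, and all function and variable names are hypothetical.

```python
from math import comb


def operon_level_counts(gene_to_operon, category_genes, motif_genes):
    """Collapse gene sets to operon sets and return
    (total operons, category operons, motif operons, overlap)."""
    all_operons = set(gene_to_operon.values())
    category_operons = {gene_to_operon[g] for g in category_genes if g in gene_to_operon}
    motif_operons = {gene_to_operon[g] for g in motif_genes if g in gene_to_operon}
    return (
        len(all_operons),
        len(category_operons),
        len(motif_operons),
        len(category_operons & motif_operons),
    )


def upper_tail(total, category, drawn, hits):
    """Hypergeometric P(X >= hits), here applied to operon counts."""
    return sum(
        comb(category, i) * comb(total - category, drawn - i)
        for i in range(hits, min(category, drawn) + 1)
    ) / comb(total, drawn)


# Toy genome: operon opA has three genes, so a gene-based test would count
# three hits where the operon-based test counts one.
gene_to_operon = {"a1": "opA", "a2": "opA", "a3": "opA",
                  "b1": "opB", "c1": "opC", "c2": "opC", "d1": "opD"}
counts = operon_level_counts(
    gene_to_operon,
    category_genes={"a1", "a2", "a3", "b1"},
    motif_genes={"a1", "a2", "a3", "c1", "c2"},
)
p_operon = upper_tail(*counts)  # counts == (4, 2, 2, 1), p = 5/6
```

Counting operons avoids inflating the significance of a category merely because a single motif-bearing operon contains many genes from that category.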
Number of significantly overrepresented functional categories from each annotation source (e.g., the total number of overrepresented metabolic pathways).
Total number of significantly overrepresented functional categories from all annotation sources (e.g., all overrepresented metabolic pathways, GO categories, etc.).
The number of clusters which are enriched for one or more functional categories.
The score of the most overrepresented DNA binding site.
The number of overrepresented DNA binding sites that exceed a predefined threshold.
Number of functional categories from one annotation source that were found overrepresented in gene members of a cluster that contain a certain motif in their upstream region.
Number of functional categories for all annotation sources that were found overrepresented in gene members of a cluster that contain a certain motif in their upstream region.
Several filtering options are available for each of the described metrics. Each metric provides a unique measure for users to select the highest scoring clustering results based on different criteria, e.g., a clustering that yields the highest number of overrepresented metabolic pathways or the most significantly overrepresented DNA binding sites. This allows researchers to select the optimal clustering run for their research question. Graphical representations of the results are available for different stages in the analysis, which will be discussed in the upcoming sections.
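The way a metric selects a clustering run can be sketched in a few lines. The data layout, the metric functions and the scores below are invented for illustration and are not DISCLOSE output; the motif score threshold of 15 mirrors the value used in the validation analysis of this article.

```python
def motifs_above(run, threshold=15.0):
    """Metric: number of overrepresented motifs scoring above a threshold."""
    return sum(
        1
        for cluster in run["clusters"]
        for score in cluster["motif_scores"]
        if score > threshold
    )


def enriched_clusters(run):
    """Metric: number of clusters enriched for at least one category."""
    return sum(1 for cluster in run["clusters"] if cluster["enriched_categories"])


def best_run(runs, metric):
    """Return the run that maximizes the chosen metric."""
    return max(runs, key=metric)


# Hypothetical results for two clustering runs.
example_runs = [
    {"name": "k=10, euclidean", "clusters": [
        {"motif_scores": [18.2, 9.1], "enriched_categories": ["GO-0006935"]},
        {"motif_scores": [12.0], "enriched_categories": ["UP-56"]},
    ]},
    {"name": "k=20, pearson", "clusters": [
        {"motif_scores": [16.5, 15.4], "enriched_categories": []},
        {"motif_scores": [21.0], "enriched_categories": ["COG-N"]},
        {"motif_scores": [3.0], "enriched_categories": []},
    ]},
]

by_motifs = best_run(example_runs, motifs_above)["name"]
by_enrichment = best_run(example_runs, enriched_clusters)["name"]
```

In this toy example the two metrics pick different runs, which reflects the point made above: the preferred clustering depends on the research question.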
2.3.1 Complete results analysis
In addition to saving the complete contents of the tabulated view to an HTML file, a robustness analysis is performed on all of the clustering runs. This analysis determines the frequency of occurrence for every functional category (e.g., a robustness frequency of 50% for a functional category indicates that it is significantly overrepresented in individual clusters from 50% of the clustering runs). Robust functional categories are a good starting point for an analysis since they occur in relatively large fractions of the clustering results. Less robust functional categories point to biological phenomena that would have been missed using more general clustering parameters.
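The robustness frequency defined above can be computed as the fraction of clustering runs in which a category is significant in at least one cluster. The sketch below assumes that each run is represented as a list of clusters and each cluster as the set of its significantly overrepresented category identifiers; this representation and the function name are hypothetical.

```python
from collections import Counter


def robustness(runs):
    """Fraction of runs in which each category is significantly
    overrepresented in at least one cluster."""
    counts = Counter()
    for clusters in runs:
        # A category counts once per run, however many clusters contain it.
        counts.update(set().union(*clusters))
    return {category: n / len(runs) for category, n in counts.items()}


# Four hypothetical clustering runs with two clusters each.
example = [
    [{"GO-0006935", "COG-N"}, {"UP-56"}],
    [{"GO-0006935", "COG-N"}, set()],
    [{"COG-N"}, {"COG-N"}],
    [set(), set()],
]
frequencies = robustness(example)
# GO-0006935: 2 of 4 runs, COG-N: 3 of 4 runs, UP-56: 1 of 4 runs

# Keep categories at or above a cutoff, e.g. 10%.
robust = {c: f for c, f in frequencies.items() if f >= 0.10}
```

Raising the cutoff narrows the list to the most stable categories, whereas a low cutoff retains categories that appear only under specific clustering parameters.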
2.3.2 Functional analysis for individual clustering runs
2.3.3 Visualization of putative DNA binding sites for individual clustering runs
Results and discussion
Table 1. Biological phenomena discussed in the original article
General stress response
Tricarboxylic acid cycle
Sigma D regulon (motility)
DNA replication and DNA repair functions
Sulfur amino acid metabolism
Fatty acid biosynthesis
Drug transporter activity
Table 2. Results of robustness analysis of DISCLOSE
In original study
GO-0006164 : purine nucleotide biosynthetic process
COG-F : Nucleotide transport and metabolism
GO-0003735 : structural constituent of ribosome
INT-SigB : general stress sigma factor
COG-J : Translation, ribosomal structure and biogenesis
PW-path-bsu00020 : Citrate cycle
GO-0003723 : RNA binding
COG-N : Cell motility and secretion
INT-SigG : late forespore-specific gene expression
PW-path-bsu00193 : ATP synthesis
UP-67 : Ligase
GO-0006935 : chemotaxis
UP-56 : Glycolysis
UP-29 : Cell division
PW-path-bsu00720 : Reductive carboxylate cycle
PW-path-bsu00240 : Pyrimidine metabolism
PW-path-bsu00970 : Aminoacyl-tRNA biosynthesis
COG-L : DNA replication, recombination and repair
UP-15 : Threonine biosynthesis
COG-G : Carbohydrate transport and metabolism
GO-0006520 : amino acid metabolic process
UP-124 : Sporulation
INT-SigK : late mother cell-specific gene expression
PW-path-bsu03070 : Type III secretion system
INT-PurR : negative regulation of the purine operons
GO-0015293 : symporter activity
UP-25 : Porphyrin biosynthesis
UP-179 : Folate biosynthesis
PW-path-bsu00190 : Oxidative phosphorylation
PW-path-bsu02060 : Phosphotransferase system (PTS)
COG-D : Cell division and chromosome partitioning
GO-0008360 : regulation of cell shape
PW-path-bsu00740 : Riboflavin metabolism
PW-path-bsu00030 : Pentose phosphate pathway
GO-0009086 : methionine biosynthetic process
UP-17 : Hydrogen ion transport
INT-SigE : early mother cell-specific gene expression
PW-path-bsu00920 : Sulfur metabolism
INT-SigA : RNA polymerase major sigma-43 factor
COG-O : Posttranslational modification, protein turnover, chaperones
PW-path-bsu00400 : Phenylalanine, tyrosine and tryptophan biosynthesis
PW-path-bsu00252 : Alanine and aspartate metabolism
GO-0009252 : peptidoglycan biosynthetic process
PW-path-bsu00260 : Glycine, serine and threonine metabolism
UP-84 : Fatty acid biosynthesis
GO-0000103 : sulfate assimilation
GO-0000105 : histidine biosynthetic process
PW-path-bsu00670 : One carbon pool by folate
Furthermore, DISCLOSE identified additional biological phenomena that were not discussed by the authors (Table 2). However, some categories that were described in the original study did not meet the 10% threshold that was used in our analysis. The original study did not identify these categories through a clustering-based analysis but through an analysis of the most highly expressed genes at each individual time point.
DNA binding site analysis of DISCLOSE
Discussion of the results
The results of our DISCLOSE analysis show that we were able to identify most of the overrepresented functional categories that are discussed in the original study. However, a number of functional categories were not recovered using our approach, because the original study identified them by analyzing differentially expressed genes at single time points. The single time point analysis is tedious and does not take into account the temporal properties of the dataset.
The interactive visualization module for overrepresented motifs in DISCLOSE facilitated the detection of putative motifs as well as the discovery of motifs that have been described in the literature.
Choosing a clustering method and associated parameters for a given DNA microarray dataset is a challenging task. Moreover, commonly used clustering algorithms lack the ability to annotate the clusters using functional information. This information is crucial to comprehend the underlying biology of the experiment. Here, we present DISCLOSE, an exploratory application that benchmarks clustering methods using functional annotations and a de novo motif discovery algorithm. DISCLOSE allows researchers to select the most appropriate clustering method and to visually inspect the clusters obtained for a given DNA microarray dataset. Our application quantitatively describes the most stable overrepresented functional categories in the clusters. This methodology allows for a more objective and complete interpretation of the dataset analyzed. Our application offers the following advantages over existing tools:
benchmarks clustering methods using enrichment analysis and motif discovery
supports K-means and SOM clustering algorithms
robustness analysis for functional categories
ready-to-use databases for over 600 prokaryotic organisms
functional enrichment analysis of putative motifs
identification of additional motif occurrences
matching of putative motifs with known motifs
interactive visualization of genomic context of known and putative motifs
stand-alone application (supports all major operating systems)
4.1 Software package
DISCLOSE was programmed as a standalone application in Java using the Eclipse framework (http://www.eclipse.org/) and it runs on all Java-supporting operating systems (Windows, Linux and Mac OSX). The graphical output can be viewed in any web browser that is able to process Scalable Vector Graphics.
DISCLOSE features the following annotation modules: i. Gene Ontology, ii. Metabolic pathways (KEGG), iii. COG classes, iv. Regulatory interactions, v. UniProt keywords, and vi. user-defined functional categories (Fig. 1F). The supplementary materials contain ready-to-use databases for over 600 prokaryotic organisms. The software package also contains a manual and an example analysis using a publicly available dataset.
Dataset used for validation
The dataset is part of a transcriptome analysis from a study on the growth transitions of Bacillus subtilis. Data from this experiment were obtained from the Gene Expression Omnibus database from NCBI (accession number: GSE6865). The authors of the original study applied a K-means clustering to reveal patterns of temporal gene expression. The optimum number of clusters was determined by principal-component analysis, and the clusters were ordered by the timing of expression. A detailed analysis based on individual time points was conducted with JProGO to identify overrepresented groups of functionally related genes. From this analysis, the authors selected several functional categories from a list of significantly overrepresented categories (see Table 1).
Our analysis was conducted using a K-means clustering with a range from 10 to 100 clusters for all correlation measures. For each cluster that was analyzed, the DNA binding site detection algorithm reported the 10 most overrepresented motifs. Finally, we analyzed the results of the clustering run that yielded the highest number of motifs with a score above 15, together with a robustness analysis with a 10% cutoff for all clusters.
A genome file for Bacillus subtilis was obtained from NCBI and supplemented with db_xref field information from an EMBL genome review file from EBI. COG information from a local whog file was loaded by using the organism abbreviation: bsu. Pathway information was obtained from KEGG. The latest Gene Ontology (GO) obo file was obtained from the Gene Ontology website. Functional categories based on UniProt keywords were imported, as well as information from the DBTBS database for the interaction annotation module.
4 Availability and requirements
Project name: DISCLOSE
Project homepage: http://bioinformatics.biol.rug.nl/standalone/disclose/
Supplementary information: http://bioinformatics.biol.rug.nl/standalone/disclose/
Operating systems: Microsoft Windows, Linux and Mac OSX
Programming language: Java
License: Freely available
Any restrictions on use by non-academics: No
This study was supported by a grant from the Netherlands Organisation for Scientific Research and industrial partners in the NWO-BMI project number 050.50.206 on Computational Genomics of Prokaryotes and by Center IOP Genomics. This work was in part supported by an EU program in FW6: Bacell Health, European Union Grant LSHG-CT-2004-503468.
- Quackenbush J: Computational analysis of microarray data. Nat Rev Genet 2001, 2(6):418–427. 10.1038/35076576
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95(25):14863–14868. 10.1073/pnas.95.25.14863
- Toronen P: Selection of informative clusters from hierarchical cluster tree with gene classes. BMC Bioinformatics 2004, 5: 32. 10.1186/1471-2105-5-32
- Gat-Viks I, Sharan R, Shamir R: Scoring clustering solutions by their biological relevance. Bioinformatics 2003, 19(18):2381–2389. 10.1093/bioinformatics/btg330
- Gibbons FD, Roth FP: Judging the quality of gene expression-based clustering methods using gene annotation. Genome Res 2002, 12(10):1574–1581. 10.1101/gr.397002
- Gasch AP, Eisen MB: Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biol 2002, 3(11):RESEARCH0059. 10.1186/gb-2002-3-11-research0059
- Shamir R, Maron-Katz A, Tanay A, Linhart C, Steinfeld I, Sharan R, Shiloh Y, Elkon R: EXPANDER-an integrative program suite for microarray data analysis. BMC Bioinformatics 2005, 6: 232. 10.1186/1471-2105-6-232
- Kim TM, Chung YJ, Rhyu MG, Jung MH: Inferring biological functions and associated transcriptional regulators using gene set expression coherence analysis. BMC Bioinformatics 2007, 8: 453. 10.1186/1471-2105-8-453
- Datta S, Datta S: Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinformatics 2006, 7: 397. 10.1186/1471-2105-7-397
- Jakt LM, Cao L, Cheah KS, Smith DK: Assessing clusters and motifs from gene expression data. Genome Res 2001, 11: 112–123. 10.1101/gr.148301
- Thijs G, Moreau Y, Smet FD, Mathys J, Lescot M, Rombauts S, Rouze P, Moor BD, Marchal K: INCLUSive: integrated clustering, upstream sequence retrieval and motif sampling. Bioinformatics 2002, 18(2):331–332. 10.1093/bioinformatics/18.2.331
- Chakravarty A, Carlson J, Khetani R, Gross R: A novel ensemble learning method for de novo computational identification of DNA binding sites. BMC Bioinformatics 2007, 8: 249. 10.1186/1471-2105-8-249
- Blom EJ, Bosman DWJ, van Hijum SAFT, Breitling R, Tijsma L, Silvis R, Roerdink JBTM, Kuipers OP: FIVA: Functional Information Viewer and Analyzer extracting biological knowledge from transcriptome data of prokaryotes. Bioinformatics 2007, 23(9):1161–1163. 10.1093/bioinformatics/btl658
- Makita Y, Nakao M, Ogasawara N, Nakai K: DBTBS: database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics. Nucleic Acids Res 2004, (32 Database):D75-D77. 10.1093/nar/gkh074
- Salgado H, Gama-Castro S, Peralta-Gil M, Díaz-Peredo E, Sánchez-Solano F, Santos-Zavaleta A, Martínez-Flores I, Jiménez-Jacinto V, Bonavides-Martínez C, Segura-Salazar J, Martínez-Antonio A, Collado-Vides J: RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res 2006, (34 Database):D394-D397. 10.1093/nar/gkj156
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res 2004, 14(6):1188–1190. 10.1101/gr.849004
- Keijser BJF, Beek AT, Rauwerda H, Schuren F, Montijn R, Spek H, Brul S: Analysis of temporal gene expression during Bacillus subtilis spore germination and outgrowth. J Bacteriol 2007, 189(9):3624–3634. 10.1128/JB.01736-06
- NCBI GEO[http://www.ncbi.nlm.nih.gov/geo/]
- Scheer M, Klawonn F, Münch R, Grote A, Hiller K, Choi C, Koch I, Schobert M, Härtig E, Klages U, Jahn D: JProGO: a novel tool for the functional interpretation of prokaryotic microarray data using Gene Ontology information. Nucleic Acids Res 2006, (34 Web Server):W510-W515. 10.1093/nar/gkl329
- EBI Genome Reviews[http://www.ebi.ac.uk/GenomeReviews/files/cellular/]
- COG WHOG[ftp://ftp.ncbi.nih.gov/pub/COG/COG/whog]
- KEGG Pathways[ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/uniprot_sprot_bacteria.dat.gz]
- Gene Ontology Obo File[http://www.geneontology.org/ontology/gene_ontology.obo]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.