EXPANDER – an integrative program suite for microarray data analysis
© Shamir et al; licensee BioMed Central Ltd. 2005
Received: 05 July 2005
Accepted: 21 September 2005
Published: 21 September 2005
Gene expression microarrays are a prominent experimental tool in functional genomics that has opened the opportunity for gaining a global, systems-level understanding of transcriptional networks. Experiments that apply this technology typically generate overwhelming volumes of data, unprecedented in biological research. The task of mining meaningful biological knowledge out of the raw data is therefore a major challenge in bioinformatics. Of special need are integrative packages that provide biologist users with an advanced yet easy-to-use set of algorithms, together covering the whole range of steps in microarray data analysis.
Here we present the EXPANDER 2.0 (EXPression ANalyzer and DisplayER) software package. EXPANDER 2.0 is an integrative package for the analysis of gene expression data, designed as a 'one-stop shop' tool that implements various data analysis algorithms ranging from the initial steps of normalization and filtering, through clustering and biclustering, to high-level functional enrichment analysis that points to biological processes that are active in the examined conditions, and to promoter cis-regulatory elements analysis that elucidates transcription factors that control the observed transcriptional response. EXPANDER is available with pre-compiled functional Gene Ontology (GO) and promoter sequence-derived data files for yeast, worm, fly, rat, mouse and human, supporting high-level analysis applied to data obtained from these six organisms.
EXPANDER's integrated capabilities and its built-in support of multiple organisms make it a very powerful tool for analysis of microarray data. The package is freely available for academic users at http://www.cs.tau.ac.il/~rshamir/expander
Gene expression microarrays are a prominent experimental tool in functional genomics. They have revolutionized biological research by providing genome-wide snapshots of transcriptional networks that are active in the cell. This opens the opportunity for gaining a global, systems-level understanding of cellular processes. Microarray platforms for measuring the expression levels of most or all genes of an organism are available for a variety of organisms ranging from yeast to human. Experiments that use this technology typically generate overwhelming volumes of data, unprecedented in biological research, which makes the task of mining meaningful biological knowledge out of the raw data a major challenge. Hence, exploitation of gene expression data is fully dependent on the availability of advanced data analysis and statistical tools. Many algorithms and software tools for analysis of microarray data have been developed in recent years, including sophisticated methods for signal extraction and array normalization [1, 2], clustering [3, 4], and statistical identification of over-represented functional categories and promoter motifs [6, 7]. At present, of special need are integrative software packages that provide users with a set of algorithms collectively covering the whole range of steps in microarray data analysis, thereby significantly speeding up the analysis flow and improving the researcher's ability to deduce meaningful biological conclusions from the overwhelming volume of recorded data. Here we present the EXPANDER program suite for gene expression data analysis.
EXPANDER (EXPression ANalyzer and DisplayER), initially developed as a clustering tool, has been redesigned as a 'one-stop shop' tool for analysis of the data. EXPANDER 2.0 integrates methods and algorithms that collectively cover different steps of the data analysis, ranging from the initial steps of normalization and filtering, through module detection by clustering and biclustering, to high-level analysis of functional enrichment and of promoter cis-regulatory elements. EXPANDER serves as the major platform in which we integrate various gene expression analysis algorithms that were developed in our lab, including CLICK for clustering, SAMBA for biclustering, PRIMA for promoter elements analysis, and TANGO for GO functional enrichment analysis (manuscript in preparation). In addition, EXPANDER implements various visualization utilities that accompany each of the analysis modules. Four basic design principles guided us in the implementation of the package. First, the analysis flow should be highly streamlined. Second, although some of the modules are based on highly complicated algorithms, their use should be kept simple and results should be presented in an intuitive manner. Third, data analysis is expected to be done iteratively, allowing users to examine different parameter settings and clustering algorithms; therefore, special effort was put into efficient implementation of the algorithms. Fourth, users should be freed from the burden of compiling annotation data required for the analysis. Therefore, EXPANDER not only implements the analysis algorithms, but also supplies users with all necessary annotation and sequence data.
EXPANDER is implemented in Java. Most of the algorithms it runs were implemented in C. EXPANDER versions for Windows and UNIX are freely available for academic users.
In this section we describe the main analysis modules implemented in EXPANDER, and present a case analysis that demonstrates the strength of this package in deriving biological conclusions out of massive gene expression datasets.
The goal of this pre-processing step is the removal of technical biases among the analyzed chips. Currently, the default normalization scheme applied by Affymetrix software is global scaling, which multiplies all intensities measured in a chip by a constant factor to bring the average/median intensity level in each chip to a predefined fixed level. However, several studies pointed out that global scaling is too naïve in many cases, and that more sophisticated normalization procedures accounting, e.g., for intensity-dependent bias, are required [11, 12]. We implemented in EXPANDER two such methods: non-linear regression and quantiles equalization as described in . Normalization of cDNA arrays requires intensity levels measured in both red and green channels. EXPANDER expects log ratios (Red/Green) as input when analyzing dual-channel data. Therefore, normalization schemes in EXPANDER are currently available only for one-channel datasets. Several novel normalization schemes are not yet implemented in EXPANDER (e.g., Variance Stabilizing Normalization (VSN) and Li-Wong invariant set normalization). Users can load EXPANDER with data that were normalized using external tools.
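To make the difference between the two kinds of scheme concrete, the following is a minimal Python sketch of global scaling and of one common formulation of quantile equalization (every chip is forced to share the same intensity distribution). The function names, the data layout (one list of intensities per chip) and the default target level are assumptions made for illustration; EXPANDER's own implementation may differ.

```python
from statistics import mean


def global_scaling(chips, target=100.0):
    """Multiply every intensity in a chip by one constant so that the
    chip's mean intensity equals `target` (an arbitrary example level)."""
    scaled = []
    for chip in chips:
        factor = target / mean(chip)
        scaled.append([value * factor for value in chip])
    return scaled


def quantile_normalize(chips):
    """Give all chips the same intensity distribution.

    `chips` is a list of equal-length lists, one list per chip and one
    value per probe. The k-th smallest value of every chip is replaced
    by the mean of the k-th smallest values across chips. Ties are
    broken by probe order (a simplification).
    """
    n_probes = len(chips[0])
    # For each chip, the probe indices sorted by increasing intensity.
    orders = [sorted(range(n_probes), key=chip.__getitem__) for chip in chips]
    # Reference distribution: mean across chips at each rank.
    reference = [
        mean(chip[order[rank]] for chip, order in zip(chips, orders))
        for rank in range(n_probes)
    ]
    normalized = [[0.0] * n_probes for _ in chips]
    for out, order in zip(normalized, orders):
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
    return normalized
```

For two chips with intensities [5.0, 2.0, 3.0] and [4.0, 1.0, 6.0], quantile_normalize returns [5.5, 1.5, 3.5] and [3.5, 1.5, 5.5]: the within-chip ranking of the probes is preserved, while both chips end up with the identical set of values.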
EXPANDER provides several commonly-used filtering options based on fold-change factors, minimal variation criteria, or choosing the n most variant genes, allowing the user to focus downstream analysis on the set of genes that show sufficient variation across the measured conditions.
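As an illustration of two of these options, the sketch below keeps genes whose variance exceeds a minimum and selects the n most variant genes. It is a minimal example under stated assumptions: the function names, the dictionary layout (gene name mapped to its expression profile) and the use of population variance as the variation measure are hypothetical choices, not EXPANDER's documented behavior.

```python
from statistics import pvariance


def filter_by_min_variance(expression, min_variance):
    """Keep genes whose variance across conditions is at least min_variance.

    `expression` maps a gene name to its list of expression values.
    """
    return {
        gene: profile
        for gene, profile in expression.items()
        if pvariance(profile) >= min_variance
    }


def top_variant_genes(expression, n):
    """Return the names of the n genes with the largest variance,
    ordered from most to least variant."""
    ranked = sorted(
        expression,
        key=lambda gene: pvariance(expression[gene]),
        reverse=True,
    )
    return ranked[:n]
```

Either filter reduces the dataset to genes that respond across the measured conditions, which is what the downstream clustering steps operate on.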
Functional enrichment analysis
After identifying the main co-expressed gene groups in the data (either by clustering or biclustering), one of the major challenges is to ascribe biological meaning to them. To assist the researcher in this task, EXPANDER contains a statistical analysis module that seeks specific functional categories that are significantly over-represented in the analyzed gene groups, with respect to a given background set of genes. In addition to pointing to possible biological roles for distinct gene sets, such analysis was demonstrated to be very helpful in assigning putative functional roles to uncharacterized genes [10, 18]. EXPANDER is provided with pre-compiled functional annotation files for six organisms: yeast (S. cerevisiae), worm (C. elegans), fly (D. melanogaster), rat (R. norvegicus), mouse (M. musculus) and human, releasing the user from the burden of compiling such annotation information. These annotation files, compiled based on data provided by the Gene Ontology (GO) consortium and the central databases for these organisms, associate genes with GO functional categories.
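Over-representation of a category in a gene group relative to a background set is commonly scored with a hypergeometric tail probability. The sketch below shows that generic test only; it is an assumption for illustration, since the statistical procedure used by TANGO is not described in this text, and it omits any correction for testing many categories. The function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(background_size, category_size,
                                group_size, overlap):
    """Probability of seeing at least `overlap` category genes in a group.

    The group is modeled as `group_size` genes drawn without replacement
    from a background of `background_size` genes, of which
    `category_size` belong to the functional category.
    """
    upper = min(category_size, group_size)
    total = comb(background_size, group_size)
    tail = sum(
        comb(category_size, k)
        * comb(background_size - category_size, group_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

A small p-value means that the observed overlap between the gene group and the category would rarely arise by chance if the group were a random sample of the background set.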
Cis-regulatory element analysis
Demonstration of EXPANDER's capabilities
To demonstrate the utility of the EXPANDER package, we applied it to a very large dataset published recently by Murray et al. This study recorded expression profiles in several human cell lines exposed to various stressful conditions. The authors integrated these data with a dataset in which expression profiles were measured throughout the progression of the cell cycle. The combined dataset contains expression data for 36,825 probes measured over 174 conditions. The analysis of such a complex dataset poses a daunting bioinformatics challenge. Murray et al. used the Cluster/TreeView tool to hierarchically cluster this dataset, and by visual inspection of the resulting tree defined the main clusters in the data. A second "adoption step" was then applied, in which each main cluster adopted genes whose expression pattern resembled the cluster's mean pattern. Overall, 23 clusters containing 1,245 distinct genes were reported. Biological meaning was assigned to the clusters by inspection of their expression profiles and of the genes they contain. No promoter analysis was reported.
As we noted above, when analyzing large datasets, biclustering becomes more appropriate than clustering. Therefore, we subjected this dataset to bicluster analysis using SAMBA. We first replaced missing entries with 0 (which corresponds to 'no change' in log-transformed data) and then scanned the dataset for probes whose expression was changed by at least 2-fold in at least 7 conditions. Some 10% of the clones (3,392) passed this filtering. We applied SAMBA to the union of these genes and the 1,245 genes analyzed by Murray et al. The union contains 3,892 genes. SAMBA identified 155 biclusters on this filtered dataset. (These biclusters can overlap, since genes can be assigned to several biclusters, but are not redundant: a pruning step removes highly overlapping biclusters.) The identified biclusters reveal the major expression patterns that underlie this intricate dataset. Next, we aimed to assign putative functional meaning to the biclusters, and to identify major TFs that regulate the transcriptional responses captured by them. To this end, we applied the TANGO and PRIMA modules (both were run with default parameters).
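The preprocessing described above (replacing missing entries with 0, then keeping probes that change at least 2-fold in at least 7 conditions) can be sketched as follows. This is an illustration only: it assumes that the values are log2 ratios, so that a 2-fold change corresponds to an absolute value of at least 1, and that missing entries are represented by None. The log base, the function names and the data layout are assumptions, not details given in the text.

```python
import math


def fill_missing(profile, fill=0.0):
    """Replace missing entries (None) with `fill`; 0 means 'no change'
    for log-transformed ratios."""
    return [fill if value is None else value for value in profile]


def passes_fold_filter(profile, min_fold=2.0, min_conditions=7):
    """True if the probe changes by at least `min_fold` (up or down)
    in at least `min_conditions` conditions.

    Assumes `profile` holds log2 ratios with no missing entries.
    """
    threshold = math.log2(min_fold)
    changed = sum(1 for value in profile if abs(value) >= threshold)
    return changed >= min_conditions


def filter_probes(dataset, min_fold=2.0, min_conditions=7):
    """Apply both steps to a dict that maps probe id to its profile and
    return only the probes that pass the filter."""
    kept = {}
    for probe, profile in dataset.items():
        filled = fill_missing(profile)
        if passes_fold_filter(filled, min_fold, min_conditions):
            kept[probe] = filled
    return kept
```

Because missing entries become 0, they never count toward the required number of changed conditions, so a probe with many missing measurements passes only if enough of its observed values show a 2-fold change.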
Major biclusters identified in the test case analysis of the human stress data set.
Columns: Num of Conditions; Num of Genes; Enriched GO (GO id, p-value); Enriched TF binding site signatures (TRANSFAC id, p-value)
- Enriched GO: DNA replication (GO:0006260, 5.3 × 10^-9). Enriched TF: E2F (M00918, 1.3 × 10^-7). Down-regulation of DNA replication genes in fibroblasts exposed to DDT or menadione.
- Enriched GO: Mitosis (GO:0007067, 9.3 × 10^-19). Enriched TF: NF-Y (M00287, 6.7 × 10^-22), IRF-7 (M00453, 9.5 × 10^-5). Down-regulation of mitotic genes in response to various stresses.
- Enriched GO: Mitosis (GO:0007067, 3.3 × 10^-10). Enriched TF: NF-Y (M00287, 3.4 × 10^-9). Down-regulation of mitotic genes in response to various stresses.
- Enriched GO: Carboxylic acid metabolism (GO:0019752, 3.4 × 10^-8). Genes activated in HeLa cells in response to tunicamycin and menadione.
- Enriched GO: Response to unfolded protein (GO:0006986, 1.2 × 10^-7). Genes activated in HeLa cells in response to heat shock.
- Enriched GO: Response to unfolded protein (GO:0006986, 7.3 × 10^-9). Enriched TF: AP-2alpha (M00469, 5.6 × 10^-4). Genes activated in K562 cells in response to heat shock.
- Enriched GO: Response to unfolded protein (GO:0006986, 1.5 × 10^-7). Genes that are activated by heat shock but repressed by crowding in HeLa cells.
- Enriched GO: Transcription corepressor (GO:0003714, 1.5 × 10^-6). Enriched TF: HIF-1 (M00797, 6.9 × 10^-4). Genes activated in fibroblasts in response to DDT.
- Genes activated in fibroblasts in response to oxidative stress (H2O2).
- Genes that are repressed by crowding in fibroblasts.
- Enriched TF: N-Myc (M00055, 2.7 × 10^-6). Genes that are repressed in both HeLa cells and fibroblasts.
- Enriched TF: AP-4 (M00005, 2.1 × 10^-4). Genes repressed in HeLa cells in response to various stresses.
- Enriched TF: NFkB (M00051, 7.1 × 10^-4). Genes activated in HeLa cells in response to DDT.
In several biclusters PRIMA identified significant enrichment for binding site signatures of TFs that are known to control the respective biological processes (e.g., over-representation of E2F binding site in bicluster #106, which is enriched for DNA replication genes; enrichment of NF-Y binding site in bicluster #40, which is enriched for mitotic genes). In other biclusters PRIMA suggests novel links between TFs and stress responses (e.g., over-representation of N-Myc binding site in bicluster #53, which contains genes that are repressed by different stresses).
Some of EXPANDER's salient advantages are evident from the above analysis. The biclustering module, which is unique to EXPANDER among packages for microarray data analysis, allows systematic detection of the major expression patterns in highly complex datasets. Biclusters provide higher-resolution gene groups, some encompassing many conditions but most covering relatively small subsets and thus focusing on specific phenomena. Functional enrichment and promoter analyses are done in a streamlined and integrated fashion, so most of the expert's effort can be devoted to biological interpretation. Last, analysis of microarray data requires experimentation with various filtering thresholds and algorithmic parameter settings; it is therefore important that the analysis modules have relatively short running times. EXPANDER was designed to meet this requirement. A full analysis iteration, which includes biclustering, functional enrichment and promoter analyses applied to the massive dataset used as an example above, takes about 15 minutes on a standard PC.
Comparison with other tools
Several integrative packages for the analysis of gene expression data are available, among them INCLUSive, Expression-Profiler, GEPAS, TIGR's Multiple Experiment Viewer, and ArrayPipe. EXPANDER has several advantages over extant packages. Some of the integrative packages are designed as web portals that provide links to independent programs, where, in some cases, the outputs are sent to the user by e-mail and not always in a format directly compatible with subsequent analysis steps; in EXPANDER, by contrast, the analysis flow is inherently streamlined and straightforward. In addition, EXPANDER's strength lies in the advanced algorithms it uniquely provides: CLICK for clustering, SAMBA for biclustering, TANGO for identification of GO enrichment, and PRIMA for the identification of enriched TF binding site signatures. The synergy that stems from the integration of these algorithms into one package gives EXPANDER very powerful analytical capabilities. Another feature that distinguishes EXPANDER is its built-in support for genome-wide analysis of data obtained from six major research organisms.
Designed as a 'one-stop shop' for gene expression data analysis, EXPANDER provides algorithms covering main analysis steps including (1) the initial process of normalization and filtering for removing biases and focusing downstream analysis on responding genes in the dataset; (2) clustering and biclustering to discover the main expression patterns in the data; (3) high-level functional enrichment analysis; and (4) promoter cis-element analysis to gain insights on the biological meaning of the identified expression patterns and to point to transcriptional regulators that underlie them. These integrated capabilities provided by EXPANDER and its built-in support of multiple organisms make it a very powerful tool for analysis of microarray data. Although some of the analysis modules implemented in EXPANDER are based on sophisticated algorithms, their execution remains simple and intuitive.
We will routinely post updated GO annotation and promoter fingerprint files for all the supported organisms on EXPANDER's website. EXPANDER's users will be notified of such updates. We will continue to maintain and expand EXPANDER to keep it an integrative suite that provides state-of-the-art algorithms and visualization utilities for analysis of microarray data. We will also expand the group of organisms supported by the package according to the availability of appropriate information and data.
Availability and requirements
Project name: EXPANDER
Project home page: http://www.cs.tau.ac.il/~rshamir/expander
Operating system(s): Windows, UNIX
Programming language: Java for the envelope and C for most of the algorithms.
Other requirements: Java 1.4 or higher
License: free for non-commercial users.
Any restrictions to use by non-academics: License needed.
EXPANDER: EXPression ANalyzer and DisplayER
SOM: Self Organizing Maps
CLICK: CLuster Identification via Connectivity Kernels
SAMBA: Statistical-Algorithmic Method for Bicluster Analysis
TANGO: Tool for ANalysis of GO enrichments
PRIMA: PRomoter Integration in Microarray Analysis
PWM: Position weight matrix
This study was supported in part by a research grant from the Ministry of Science and Technology, Israel. R. Elkon is a Joseph Sassoon Fellow. A. Tanay is supported in part by a scholarship in Complexity Science from the Yeshaia Horvitz Association. R. Sharan is supported by an Alon fellowship.
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185–193. 10.1093/bioinformatics/19.2.185
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249–264. 10.1093/biostatistics/4.2.249
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 1998, 95(25):14863–14868. 10.1073/pnas.95.25.14863
- Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 1999, 96(6):2907–2912. 10.1073/pnas.96.6.2907
- Al-Shahrour F, Diaz-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 2004, 20(4):578–580. 10.1093/bioinformatics/btg455
- Aerts S, Thijs G, Coessens B, Staes M, Moreau Y, De Moor B: Toucan: deciphering the cis-regulatory logic of coregulated genes. Nucleic Acids Res 2003, 31(6):1753–1764. 10.1093/nar/gkg268
- Elkon R, Linhart C, Sharan R, Shamir R, Shiloh Y: Genome-wide in silico identification of transcriptional regulators controlling the cell cycle in human cells. Genome Res 2003, 13(5):773–780. 10.1101/gr.947203
- Sharan R, Maron-Katz A, Shamir R: CLICK and EXPANDER: a system for clustering and visualizing gene expression data. Bioinformatics 2003, 19(14):1787–1799. 10.1093/bioinformatics/btg232
- Sharan R, Elkon R, Shamir R: Cluster analysis and its applications to gene expression data. Ernst Schering Res Found Workshop 2002, 83–108.
- Tanay A, Sharan R, Kupiec M, Shamir R: Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci U S A 2004, 101(9):2981–2986. 10.1073/pnas.0308661100
- Schuchhardt J, Beule D, Malik A, Wolski E, Eickhoff H, Lehrach H, Herzel H: Normalization strategies for cDNA microarrays. Nucleic Acids Res 2000, 28(10):E47. 10.1093/nar/28.10.e47
- Yang YH, Dudoit S, Luu P, Speed TP: Normalization for cDNA Microarray Data. Technical Report, Department of Statistics, University of California at Berkeley 2000.
- Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18 Suppl 1: S96–104.
- Kel AE, Kel-Margoulis OV, Farnham PJ, Bartley SM, Wingender E, Zhang MQ: Computer-assisted identification of cell cycle-related genes: new targets for E2F transcription factors. J Mol Biol 2001, 309(1):99–120. 10.1006/jmbi.2001.4650
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet 1999, 22(3):281–285. 10.1038/10343
- Sharan R, Shamir R: CLICK: a clustering algorithm with applications to gene expression analysis. Proc Int Conf Intell Syst Mol Biol 2000, 8: 307–316.
- Tanay A, Sharan R, Shamir R: Discovering statistically significant biclusters in gene expression data. Bioinformatics 2002, 18 Suppl 1: S136–44.
- Wu LF, Hughes TR, Davierwala AP, Robinson MD, Stoughton R, Altschuler SJ: Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat Genet 2002, 31(3):255–265. 10.1038/ng906
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25–29. 10.1038/75556
- Murray JI, Whitfield ML, Trinklein ND, Myers RM, Brown PO, Botstein D: Diverse and specific gene expression responses to stresses in cultured human cells. Mol Biol Cell 2004, 15(5):2361–2374. 10.1091/mbc.E03-11-0799
- Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, Botstein D: Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell 2002, 13(6):1977–2000. 10.1091/mbc.02-02-0030.
- Coessens B, Thijs G, Aerts S, Marchal K, De Smet F, Engelen K, Glenisson P, Moreau Y, Mathys J, De Moor B: INCLUSive: A web portal and service registry for microarray and regulatory sequence analysis. Nucleic Acids Res 2003, 31(13):3468–3470. 10.1093/nar/gkg615
- Herrero J, Vaquerizas JM, Al-Shahrour F, Conde L, Mateos A, Diaz-Uriarte JS, Dopazo J: New challenges in gene expression data analysis and the extended GEPAS. Nucleic Acids Res 2004, 32(Web Server issue):W485–91.
- Hokamp K, Roche FM, Acab M, Rousseau ME, Kuo B, Goode D, Aeschliman D, Bryan J, Babiuk LA, Hancock RE, Brinkman FS: ArrayPipe: a flexible processing pipeline for microarray data. Nucleic Acids Res 2004, 32(Web Server issue):W457–9.
- Christie KR, Weng S, Balakrishnan R, Costanzo MC, Dolinski K, Dwight SS, Engel SR, Feierbach B, Fisk DG, Hirschman JE, Hong EL, Issel-Tarver L, Nash R, Sethuraman A, Starr B, Theesfeld CL, Andrada R, Binkley G, Dong Q, Lane C, Schroeder M, Botstein D, Cherry JM: Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res 2004, 32(Database issue):D311–4. 10.1093/nar/gkh033
- Chen N, Harris TW, Antoshechkin I, Bastiani C, Bieri T, Blasiar D, Bradnam K, Canaran P, Chan J, Chen CK, Chen WJ, Cunningham F, Davis P, Kenny E, Kishore R, Lawson D, Lee R, Muller HM, Nakamura C, Pai S, Ozersky P, Petcherski A, Rogers A, Sabo A, Schwarz EM, Van Auken K, Wang Q, Durbin R, Spieth J, Sternberg PW, Stein LD: WormBase: a comprehensive data resource for Caenorhabditis biology and genomics. Nucleic Acids Res 2005, 33 (Database Issue): D383–9.
- Drysdale RA, Crosby MA, Gelbart W, Campbell K, Emmert D, Matthews B, Russo S, Schroeder A, Smutniak F, Zhang P, Zhou P, Zytkovicz M, Ashburner M, de Grey A, Foulger R, Millburn G, Sutherland D, Yamada C, Kaufman T, Matthews K, DeAngelo A, Cook RK, Gilbert D, Goodman J, Grumbling G, Sheth H, Strelets V, Rubin G, Gibson M, Harris N, Lewis S, Misra S, Shu SQ: FlyBase: genes and gene models. Nucleic Acids Res 2005, 33 (Database Issue): D390–5.
- Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 2005, 33 (Database Issue): D54–8.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.