Skip to main content
  • Research article
  • Open access
  • Published:

Gene set-based module discovery in the breast cancer transcriptome



Although microarray-based studies have revealed global view of gene expression in cancer cells, we still have little knowledge about regulatory mechanisms underlying the transcriptome. Several computational methods applied to yeast data have recently succeeded in identifying expression modules, which is defined as co-expressed gene sets under common regulatory mechanisms. However, such module discovery methods are not applied cancer transcriptome data.


In order to decode oncogenic regulatory programs in cancer cells, we developed a novel module discovery method termed EEM by extending a previously reported module discovery method, and applied it to breast cancer expression data. Starting from seed gene sets prepared based on cis-regulatory elements, ChIP-chip data, and gene locus information, EEM identified 10 principal expression modules in breast cancer based on their expression coherence. Moreover, EEM depicted their activity profiles, which predict regulatory programs in each subtypes of breast tumors. For example, our analysis revealed that the expression module regulated by the Polycomb repressive complex 2 (PRC2) is downregulated in triple negative breast cancers, suggesting similarity of transcriptional programs between stem cells and aggressive breast cancer cells. We also found that the activity of the PRC2 expression module is negatively correlated to the expression of EZH2, a component of PRC2 which belongs to the E2F expression module. E2F-driven EZH2 overexpression may be responsible for the repression of the PRC2 expression modules in triple negative tumors. Furthermore, our network analysis predicts regulatory circuits in breast cancer cells.


These results demonstrate that the gene set-based module discovery approach is a powerful tool to decode regulatory programs in cancer cells.


In the last decade, microarray technology has produced exploding amounts of cancer transcriptome data; especially, breast cancer transcriptome has been intensively profiled. Human breast tumors show diversity in their histology, prognosis, and responsiveness to treatments. The microarray technology has demonstrated that transcriptomic diversity underlies phenotypic diversity, and brought great progress in our molecular understanding of breast cancer [1]. However, compared with the increasing knowledge about the transcriptome, little is yet known about regulatory programs generating the transcriptomic diversity.

To decode gene regulatory programs controlling the breast cancer transcriptome, we searched for cis-regulatory motifs associated with tumor phenotypes in our previous study [2]. One of the limitations of this method is that it takes a supervised approach and requires sample information. In this study, we introduce an alternative method which focuses on expression modules and does not require sample information. An expression module is defined as a set of coexpressed genes controlled by a common regulatory mechanism. Although expression modules were originally visualized by clustering analysis of microarray data [3], methods based only on expression data are insufficient to reveal regulatory programs controlling such expression modules. Recently, approaches that combine expression data and cis-regulatory information have succeeded in identify gene regulatory programs of lower organisms like Saccharomyces cerevisiae [4, 5]. However, such module discovery approaches have rarely been applied to cancer transcriptome data, although a number of analyses based on prescribed sets of genes have also been performed in order to analyze oncogenic regulatory programs [6, 7].

Our new computational method termed EEM (Extraction of Expression Modules) is constructed for extracting expression modules in the cancer transcriptome. Our approach is based on an integrative method by Bar-Joseph et al. [5], which successfully identified yeast expression modules by integrating ChIP-chip and expression data. By combining with gene set-based approaches [6, 7], we extended their approach and made it applicable to cancer transcriptome data. Starting from seed gene sets predicted based on cis-regulatory elements, ChIP-chip data, and gene locus information, EEM statistically evaluates their functionality and refines them based on their expression coherence. We analyzed breast cancer microarray data by EEM, and find 10 expression modules in the breast cancer transcriptome. Our additional bioinformatics analysis validated the 10 expression modules and demonstrated their significance in the pathophysiology of breast cancer.


Methods Overview

The EEM algorithm discovers an expression module by combining two types of data: seed gene sets and expression profile data. A set of genes whose expressions are considered to be regulated by the same molecular mechanism could be predicted based on various types of data, and prepared as a seed gene set. EEM assesses functionality of the seed gene set based on expression coherence. If seed gene set functions as expression module, all genes in it are ideally expressed coherently. Although a functional seed gene set might include false positives, or non-functional module genes in the biological context of interest, at least a significant fraction of seed genes should behave coherently in the expression data. This assumption is verified by the observation that putatively functional gene sets often harbors a large cluster of genes which behave coherently (see Additional File 1). EEM extracts only such a coherently expressed gene subset, filtering out false positive or non-functional module genes. Taking a geometric approach, EEM searches for the largest subset with a minimum degree of coexpression (specified by radius parameter r). Concurrently, EEM statistically evaluates the size of the retrieved coherent subset using a Z score based on randomization tests. If the Z score is greater than the prespecified cutoff value, we conclude that the seed gene set includes a functional expression module and the coherent subset is extracted as an expression module. We observe that the expression modules extracted by EEM are more functionally enriched than seed gene sets. This observation would justify our refinement procedure (see Additional File 1).

Employing this EEM algorithm, we systematically searched for expression modules in the breast cancer transcriptome (Figure 1). In our search, a collection of seed gene sets are prepared based on cis-regulatory motifs, ChIP-chip data, and gene locus information. Since genes that possess a common cis-regulatory element in their promoters could be regulated by a common transcription factor (TF), we can predict an expression module based on the cis-regulatory motif. We searched human gene promoters and its mouse homolog promoters for 200 motif using PWMs obtained from the TRANSFAC and JASPAR databases [8, 9], and prepared 200 seed gene sets which include genes with common motifs in their promoters. We can also predict expression modules utilizing ChIP-chip data, which provide direct evidence of TF binding in the cis-regulatory regions. Published ChIP-chip results [1014] are collected to prepare seed gene sets. DNA copy number alteration is known to have a significant effect on the cancer transcriptome as well as transcriptional regulation [15]. Hence, we also regarded it as one of expression regulatory mechanisms in cancer cells. Genes residing in a chromosomal region which is subjected to copy number alteration could be expressed coherently, and viewed as an expression module; taking sliding window approach, we prepared seed gene sets which consist of genes residing on the same chromosomal region. For each of the prepared seed gene sets, we tested the presence of coherently expressed gene subsets in breast cancer microarray data [16]. If such coherently expressed genes exist, we then extracted them as an expression module. Furthermore, the average expression profile of the predicted expression module can be considered as its activity in each of tumor samples. The expression module activity profiles were then analyzed using ordinary methods applied to gene expression profile data like clustering and survival analysis.

Figure 1
figure 1

Schema of systematic search for expression modules. We prepared a collection of seed gene sets based on cis-regulatory motifs, ChIP-chip and gene locus information. We next statistically evaluated whether each seed gene set includes a significant large number of coherently expressed genes in expression profile data. If such a coherently expressed gene subset exists, we assumed it as an expression module, and obtained its averaged expression profile as an activity profile.

This approach is an extension of a module discovery method, GRAM, which is developed by Bar-Joseph et al. [5] for learning yeast expression modules from microarray and ChIP-chip data. In the first step, GRAM uses ChIP-chip data to find a small number of genes whose upstream regions are bound by common TFs with high confidence. In the second step, the microarray data is used to extract a coherently expressed subset of these genes. Finally, the resulting set is expanded by adding genes which are identified to be bound by the TFs with less strict criteria. Although GRAM and another similar method [17] require binding P values in ChIP-chip data, we relaxed this requirement by taking a gene set-based approach. Although gene set-based approaches are simpler than direct integration of ChIP-chip and expression data, they have shown substantial successes in cancer transcriptome analysis [6, 7]. From ChIP-chip data, we prepared gene sets as seed gene sets, by retrieving genes which have binding sites within specified cis-regulatory regions and with P values below a specified threshold. In addition to ChIP-chip data, our analysis utilized cis-regulatory motifs and locus information to generate gene sets, because available human ChIP-chip data are insufficient for comprehensive analysis and they also provide clear evidence of transcriptional regulation or genomic alteration. Although EEM also takes a module discovery approach similar to GRAM, there are some clearly different points. In contrast to GRAM, EEM starts from a sufficient number of genes that are predicted to be under common regulatory mechanism, and refines them to produce a final expression module utilizing expression profiles. In this process, EEM evaluates statistical significance of the identified expression module by measuring how many of module genes are coherently expressed in the expression data. This statistical evaluation based on the expression coherence is a novel feature which is not implemented by other module discovery methods.

EEM is also regard as one of gene set screening methods like Gene Set Enrichment Analysis (GSEA) [18]. GSEA screens for gene gets that have a significant bias in a ranked list according to their differential expressions between two sample groups, while our approach searches for significant gene sets based on their expression coherence. However, because GSEA takes a supervised approach which uses sample labels, it potentially fails to identify expression modules which do not correlate with sample labels. By contrast, EEM realizes an unsupervised analysis, which does not depend on sample information and can search for expression modules more globally.

As we mentioned above, our method finally produces activity profiles of expression modules. Because microarray data usually include expression profiles of thousands of genes, it is difficult to understand the raw data intuitively. On the other hand, since our activity profiles consist of those of a small number of expression modules, they provide concise description of the transcriptome that allows it to be understood more easily. This problem can also be solved using dimension reduction approach of gene expression data. Dimension reduction is originally addressed by a study utilizing singular value decomposition [19], and can be performed by many other methods [2023]. However, because most of them are based on purely mathematical framework, deduced components do not necessarily have biological meanings and are often difficult to understand biologically. By contrast, since expression modules deduced by EEM are derived from biologically meaningful seed gene sets, they can always be associated with molecular mechanisms. By extracting module modules as biologically meaningful components in expression data, EEM provides intuitively understandable views of transcriptomes.

EEM algorithm

Let E = {e1,...,e n } be a set of gene expression profiles such that each e i E is a vector e i = (ei 1,...,e im ) of values with e ij giving the expression of the i-th gene in the j-th condition (or sample). Each e i E then exists as a point in a continuous m-dimensional gene expression space S MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8NeXpfaaa@376D@ . Although the expression values can be obtained by any means, we may assume they are from gene expression microarray experiments. EEM operates on a subset E M E called the seed gene set (we describe below how such seed gene sets are obtained). For a given radius r and point x S MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8NeXpfaaa@376D@ , define

C x = {e i E M : d(e i , x) ≤ r}, (1)

where d is the Euclidean distance. We call C x the coherently expressed gene set (or simply coherent set), and the point x is called the center of C x . The objective of EEM is to find maximal sized coherent set C B (and corresponding center B) for the genes in E M . We remark that the center B may not necessarily correspond to any profiles in E M . We also call B the activity profile for genes in C B . As stated above, the distance measure we use to define a degree of co-expression between genes is Euclidean distance. In practice the expression profiles are normalized, so this is equivalent to measuring similarity using Pearson correlation. Our method is intended for large datasets (based on microarray expression profiles), and employs a heuristic modified from a previously proposed algorithm [5]. Similar geometric optimization problems arise in the context of clustering [24, 25]. EEM attempts to find an optimal center for E M in two stages. The first stage identifies a candidate center B1 from among the expression profiles in E M . For each e i E M , the set C e i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4qam0aaSbaaSqaaiabdwgaLnaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@2FF6@ is constructed (see Equation 1). The profile e i E M with maximal | C e i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4qam0aaSbaaSqaaiabdwgaLnaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@2FF6@ | is retained as B1. The second stage uses B1 to find an improved center. Let T E M denote the set containing the 9 profiles in E M closest to B1 along with B1 itself (i.e. |T| = 10). For each triple {t1, t2, t3} T, the mean profile t = (t1 + t2 + t3)/3 is constructed and C t is evaluated. The mean profile t that maximizes |C t | over all triples from T is retained and returned by EEM as the optimal center B along with the identity of genes in C B (see the Appendix section for a pseudocode for this optimization procedure).

EEM includes a critical step to estimate the statistical significance of the size of the coherent set, given the full set of expression profiles from the expression data set (recall that the procedure described above operates on a subset E M E defined by a seed gene set). This is accomplished by sampling subsets of size k uniformly at random from the full set E of expression profiles, where k = |E M |. The EEM optimization procedure (described above and summarized as in the Appendix section) is applied to each sampled subset to produce an empirical distribution for the sizes of coherent sets derived from E. The mean and standard deviation from this empirical distribution are used to obtain a Z score for |C B | corresponding to E M , and Z score threshold is used to determine whether a particular coherent set is significant. Our results are based on Z scores estimated using 500 randomly sampled subsets of expression profiles.

Preparation of Expression data

From GEO database, we downloaded Affymetrix GeneChip data of 252 breast tumor samples [16] (the accession number is GSE3494). Absolute expression values of a data set were converted to the logarithmic scale and normalized so that the mean is equal to 0 and the variance is equal to 1 in each sample. The Probe set IDs were converted to Ensembl gene IDs. In cases that one gene ID matches multiple probe set IDs, the probe set which shows the most variance among the samples was mapped to the gene. A variation filter was then applied to the data, and we obtained 5000 genes with the highest variance. The expression profiles of the 5000 genes were normalized across samples and subjected to the following analysis.

Preparation of seed gene sets

Preparation based on cis-regulatory motifs

We prepared promoter data of human genes and mouse genes from the Ensembl database (Release 44). Assuming TSSs (transcription start sites) as gene starts registered in Ensembl, a repeat-masked promoter sequence covering the 500 bp upstream and 100 bp downstream of the TSS for each gene was retrieved from the genome sequences.

As cis-regulatory motif data, we prepared PWMs (position weight matrices). The value f ib of a PWM represents the frequency of nucleotide base b at the i-th position in a motif. The frequencies of bases in each position are normalized so that ∑b{a, t, g, c}f ib = 1. If f ib = 0, we reassigned f ib = 0.001 to avoid errors in log calculations. We acquired a total of 601 PWMs, which consist of vertebrate 513 PWMs annotated as "good" in TRANSFAC 10.1 [8] and 88 PWMs from JASPAR core [9]. We then removed extremely simple or complex PWMs based on their information contents to produce a set of total 511 PWMs whose information contents range from 5 to 15. The information content R of a PWM is defined as follows:

R = 2 w i = 1 w H i , MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemOuaiLaeyypa0JaeGOmaiJaem4DaCNaeyOeI0YaaabCaeaacqWGibasdaWgaaWcbaGaemyAaKgabeaakiabcYcaSaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaem4DaChaniabggHiLdaaaa@3C40@

where w is the width of the motif, and H i is the information entropy at the i-th position defined by

H i = b = a , c , g , t f i b log 2 f i b . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemisaG0aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpcqGHsisldaaeqbqaaiabdAgaMnaaBaaaleaacqWGPbqAcqWGIbGyaeqaaOGagiiBaWMaei4Ba8Maei4zaC2aaSbaaSqaaiabikdaYaqabaGccqWGMbGzdaWgaaWcbaGaemyAaKMaemOyaigabeaakiabc6caUaWcbaGaemOyaiMaeyypa0JaemyyaeMaeiilaWIaem4yamMaeiilaWIaem4zaCMaeiilaWIaemiDaqhabeqdcqGHris5aaaa@4BC8@

Since this set includes highly redundant PWMs, they were subjected to clustering to reduce the redundancy. For clustering, the dissimilarity between two PWMs A and B was calculated based on the Kullback-Leibler divergence. At every alignment offset, the PWMs were extended using a column representing the uniform base frequency (f ib = 0.25 for all b) so that all position of two aligned motifs were matched. As for this alignment step, we followed a method used by Xie et al. [26] For every pair of the extended PWMs, A' and B', whose length are w', the dissimilarity DA'B'is calculated by:

D A B = i = 1 w b = a , t , g , c ( f i b A f i b B ) log f i b A f i b B . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiraq0aaSbaaSqaaiqbdgeabzaafaGafmOqaiKbauaaaeqaaOGaeyypa0ZaaabCaeaadaaeqbqaaiabcIcaOiabdAgaMnaaDaaaleaacqWGPbqAcqWGIbGyaeaacuWGbbqqgaqbaaaakiabgkHiTiabdAgaMnaaDaaaleaacqWGPbqAcqWGIbGyaeaacuWGcbGqgaqbaaaakiabcMcaPiGbcYgaSjabc+gaVjabcEgaNLqbaoaaleaabaGaemOzay2aa0baaeaacqWGPbqAcqWGIbGyaeaacuWGbbqqgaqbaaaaaeaacqWGMbGzdaqhaaqaaiabdMgaPjabdkgaIbqaaiqbdkeaczaafaaaaaaaaSqaaiabdkgaIjabg2da9iabdggaHjabcYcaSiabdsha0jabcYcaSiabdEgaNjabcYcaSiabdogaJbqab0GaeyyeIuoaaSqaaiabdMgaPjabg2da9iabigdaXaqaaiqbdEha3zaafaaaniabggHiLdGccqGGUaGlaaa@617E@

We assumed the lowest score of DA'B'as the dissimilarity between A and B, D AB . Note that D AB = D BA holds. Using the partition around medoids algorithm, the 511 PWMs are divided into 200 clusters. We used 200 medoids of the clusters in the following analyses.

To predict expression modules, we searched promoter sequences for TF binding motifs based on the log odds ratio L between a PWM and background base frequency f b b g MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemOzay2aa0baaSqaaiabdkgaIbqaaiabdkgaIjabdEgaNbaaaaa@3196@ . Using the STORM program [27], we calculated log odds ratio L s for every subsequence of each promoter s (including the complementary strand), whose length is equal to the width of the motif of interest, w:

L s = i = 1 w log f i b i f b i b g . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemitaW0aaSbaaSqaaiabdohaZbqabaGccqGH9aqpdaaeWbqaaiGbcYgaSjabc+gaVjabcEgaNbWcbaGaemyAaKMaeyypa0JaeGymaedabaGaem4DaChaniabggHiLdqcfa4aaSqaaeaacqWGMbGzdaWgaaqaaiabdMgaPjabdkgaInaaBaaabaGaemyAaKgabeaaaeqaaaqaaiabdAgaMnaaDaaabaGaemOyai2aaSbaaeaacqWGPbqAaeqaaaqaaiabdkgaIjabdEgaNbaaaaGccqGGUaGlaaa@4922@

In our analyses, f b b g MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemOzay2aa0baaSqaaiabdkgaIbqaaiabdkgaIjabdEgaNbaaaaa@3148@ is the base composition of each promoter, and the maximum of L s in a human promoter sequence was taken as the motif score Lhuman for the sequence. For human genes whose mouse homolog is registered in Ensembl, Lmouse was also calculated. Lhuman and Lmouse were then averaged to produce the final motif score L. For human genes that do not have any homologs, we used Lhuman as L. Among all genes analyzed, genes which score the 5% highest L were assumed as a seed gene set regulated by the motif. For each of the 200 PWMs, we performed this procedure to produce 200 seed gene sets.

Preparation based on ChIP-chip data

We obtained TF-bound gene sets identified in ChIP-chip experiments; CREB1, FOXA2, HINF1, HNF4, HNF6 and USF1 in hepatocyte [13], and NFκ B in U931 cells [11]. We assumed genes which are bound by EED and SUZ12, and trimetylated at histone H3 lysin-27 in ES cells [12] as a PRC2-bound gene set. A p53-bound gene set was obtained from [14]. For ER-bound genes, we analyzed ChIP-chip results [10], and retrieved genes which are bound by ER in their promoters (from the TSSs to 3 kbp upstream), 5'UTRs or first introns with P values of < 10-50.

Preparation based on locus information

A sliding window was employed to prepare seed gene sets consisting of genes that reside on the same chromosomal region. A 10 Mb window was slid along each chromosome at interval of 1 Mb. At each position, if the window includes more than or equal to 10 genes, we pooled them in the gene set library and obtained 2766 seed gene sets.

Setting of parameters

The radius parameter

To specify the radius parameter r in EEM, we converted the absolute distance to a relative distance for each expression data set. Expression spaces specified by different data sets have different dimensions and different densities of points. Therefore, instead of the absolute distance, we used a relative distance which practically acts as an equal measure for different data sets. To convert the absolute distance, dabsolute to such a relative distance, drelative, we define coherent set C x all MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4qam0aa0baaSqaaiabdIha4bqaaiabbggaHjabbYgaSjabbYgaSbaaaaa@3291@ for all e i E, similarly to Equation 1:

C x all = { e i E : d ( e i , x ) r } , MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4qam0aa0baaSqaaiabdIha4bqaaiabbggaHjabbYgaSjabbYgaSbaakiabg2da9iabcUha7jabdwgaLnaaBaaaleaacqWGPbqAaeqaaOGaeyicI4SaemyrauKaeiOoaOJaemizaqMaeiikaGIaemyzau2aaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqWG4baEcqGGPaqkcqGHKjYOcqWGYbGCcqGG9bqFcqGGSaalaaa@49A8@

where r is a given radius parameter and point x S MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8NeXpfaaa@376D@ . The maximal sized coherent set C B all MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4qam0aa0baaSqaaiabdkeacbqaaiabbggaHjabbYgaSjabbYgaSbaaaaa@3225@ can also be found based on the above-described algorithm for radius r = dabsolute. drelative is then defined as follows:

d relative = | C B all | | E | . MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemizaq2aaWbaaSqabeaacqqGYbGCcqqGLbqzcqqGSbaBcqqGHbqycqqG0baDcqqGPbqAcqqG2bGDcqqGLbqzaaGccqGH9aqpjuaGdaWcaaqaaiabcYha8jabdoeadnaaDaaabaGaemOqaieabaGaeeyyaeMaeeiBaWMaeeiBaWgaaiabcYha8bqaaiabcYha8jabdweafjabcYha8baakiabc6caUaaa@4885@

In our analysis, we assumed r = 0.05 in the relative distance. It should be should noted that we also tried to use r = 0.03 and 0.10 and observed that the identified expression modules are essentially the same; although the number of module genes increases as r increases, statistical significance, activity profiles and enriched GO terms were essentially unchanged. Exceptionally, we used a larger radius for the PRC2 expression module (see Additional File 1).

The threshold of Z scores

We assumed that a seed gene set includes a functional expression module, if its Z score is greater than a threshold. In this paper, we set 4.0 as the threshold. The reason why we set 4.0 as the threshold is that when we permuted gene labels in microarray data, no expression modules showed greater Z scores than 4.0. Therefore, we concluded that this threshold is sufficiently conservative and the resulting expression modules are expected to have high accuracy.

Evaluation of obtained expression modules

Gene ontology analysis

We evaluated the enrichment of GO categories in each identified expression module by using GO::TermFinder [28]. The GOA file we used was obtained from EBI. To predict biological function for each expression module, we also reported the GO category scoring the lowest P values as the most enriched GO.

Survival analysis

Kaplan-Meier survival curves were obtained for two patient groups with high or low activity of each identified expression module. The cutoff of the high and low groups was optimized to achieve the most significant P value in the Kaplan-Meier analysis with at least 20% patients at each group. Since the optimized P values in the Kaplan-Meier analysis overestimate the significance, we reported P values based on Cox regression analysis.

Network analysis

We evaluated enrichment of physically interacting gene pairs in the expression modules based on PPI data obtained from the Human Protein Reference Database (HPRD) [29]. To calculate a Z score for the number of interacting pairs in an expression module, we randomly sampled 500 gene sets with the same number of genes of the considered expression module. After finding expression modules having the significant number of PPIs, we constructed PPI subnetworks for elucidating molecular circuits. The PPI subnetworks were constructed by the interacting protein pairs and their first neighbor in PPI data. The network visualization was performed using CytoScape [30].

Results and discussion

Our systematic search identified 10 expression modules in the breast cancer transcriptome (Table 1). Based on cis-regulatory motifs, we identified expression modules regulated by E2F, NFY, RUNX, IRF, and ETS family TFs. Hereafter, we use the TF name to refer to the family, e.g. the E2F module will refer to the module associated with the E2F family of TFs. ChIP-chip data led us to identify expression modules regulated by the estrogen receptor (ER), Polycomb repressive complex 2 (PRC2), and NFκ B. In addition to these transcriptional modules, incorporation of locus information yielded two expression modules, which are located on the 17q12 and 8q24 locus. The reproducibility of these results was confirmed by analysis using other independent microarray data [31] (see Additional File 1).

Table 1 Motif associated with histological grades or prognosis identified based on independent datasets

Each expression module consists of dozens of genes. We performed Gene Ontology (GO) analysis to examine whether the obtained expression modules are enriched in genes involved in specific cellular activities (Table 1). The GO analysis showed that most of the expression modules deduced from cis-regulatory motifs and ChIP-chip data contain a significant number of genes sharing common GO terms, such as immune response and cell cycle. Thus, these transcriptional modules have the potential to function for specific cellular activities.

The EEM analysis predicted the activity profiles of the 10 expression modules; we performed hierarchical clustering analysis of them as performed for ordinary gene expression profiles (Figure 2). We found that clustering of tumor samples based only on these 10 expression modules succeeded in dividing samples into several subtypes that are consistent with clinical information and gene expression profiles. This observation suggests that a significant degree of diversity of the breast cancer phenotypes can be explained by only these 10 expression modules. In other words, this result demonstrates that the EEM analysis successfully reduced gene expression data of extremely high dimension to the 10 components. We also performed survival time analysis and found expression modules associated with prognosis (Figure 3, see below).

Figure 2
figure 2

Clustering analysis of expression module activity profiles in breast tumors. Activity profiles of 10 expression modules extracted from breast cancer expression data were analyzed by hierarchical clustering. Red indicates increased activity and blue indicates decreased activity. The upper color bars indicate clinical and gene expression information of each tumor sample; histological grades (G1: red, G2: yellow G3: blue), p53 status (wildtype: red, mutant: blue), expression subtypes, ER, PgR and nodal status (positive: red, negative: blue), ERBB2 and EZH2 expression (increased expression: red, decreased expression: blue). The expression subtypes are based on the five major branches in the clustering dendrogram of the gene expression profiles. In the upper dendrogram, red branches represent a sample cluster which is enriched for triple negative breast tumors.

Figure 3
figure 3

Survival time analysis of expression module activity profiles in breast tumors. Associations between survival time of patients and expression module activities were evaluated. Kaplan-Meier curves for two patient groups with different activities of each expression module were displayed using a color code (high survival rate: red, low survival rate: blue). P values are calculated for coefficients in Cox regression analysis.

The E2F and NFY expression modules show similar activity profiles, which are activated in high grade breast tumors and strongly correlated with poor prognosis. They also share common modules genes and appear to cooperatively regulate the cell cycle. Similar expression profiles shared by the RUNX ETS, IRF and NFκ B expression modules also suggest that these TFs regulate immune pathways cooperatively. These results are consistent with those of previous studies [3236]. As expected, the ER expression module was found to be the most critical determinant of tumor subtypes. Their activity profiles are strongly correlated with ER status, demonstrating the validity of our approach. The 17q12 and 8q24 expression modules are derived from known amplified regions [15]. The 17q12 expression module contains the ERBB2 gene, while the 8q24 expression module contains genes residing near the Myc locus. The 17q12 expression is an important determinant of tumor subtypes and survival time. Although the 8q24 expression modules are not clearly associated with any subtypes, its upregulation is related to poor prognosis.

Triple-negative breast cancers characterized by a lack of the ER, progesterone receptor (PgR), and ERBB2 expression have attracted special attention in breast cancer research. In addition to their aggressive phenotype, they lack the benefit of specific therapy that targets these genes and, therefore, are associated with short survival. The sample cluster enriched for triple negative cancers has characteristic expression module activity profiles; the E2F and NFY expression modules are upregulated, while the ER, PRC2 and 17q12 expression modules are downregulated. Among them, the PRC2 expression module is especially intriguing. PRC2 is an epigenetic gene silencer, which plays a critical role in the maintenance of stem cells. They have also been reported to be implicated in neoplastic development. The PRC2 expression module was derived from gene sets bound by EED and SUZ12 and trimetylated at histone H3 lysin-27 in ES cells [12]. Therefore, our observation suggests similarity of transcriptional programs in both stem cells and the aggressive breast cancer. Triple-negative breast cancers are known to have poorly differentiated phenotype histology, which might be maintained by a PRC2-directed regulatory program. Recently a drug targeting PRC2 is developed [37]; PRC2 could be a therapeutic molecular target in triple negative breast cancers. Furthermore, we found that EZH2, a component of PRC2, belongs to the E2F expression module [38], while its expression is inversely correlated with profiles of the PRC2 expression module. This finding suggests that E2F-driven EZH2 overexpression is important for repression of the PRC2 expression modules in triple negative tumors. It should be noted that another independent study employing bioinformatics has also recently shown that PRC2 target genes are downregulated in malignant breast tumors, supporting our finding [39].

Inspection of individual genes in each expression module provided insights into regulatory networks in breast tumors. For example, our result suggests that auto-regulatory designs are prevailing in mammalian transcriptional networks. The E2F expression module contains three E2F family genes: E2F1, E2F7, and E2F8. The ER expression module also harbors ER itself and its interacting co-factor, FOXA1 [40]. Furthermore, RUNX3, one of the RUNX family genes, belongs to the RUNX expression module (see Additional File 1).

To obtain further insights into regulatory networks, we performed network analysis using protein-protein interaction (PPI) data in the HPRD [29]. Previous studies have demonstrated a significant correlation between yeast PPI and transcriptional networks [41, 42]. This observation prompted us to examine whether human expression modules identified by EEM also tend to contain genes involved in the same protein complex. We calculated the Z score for the count of module gene pairs which interact directly (nearest neighbor pairs), and those which interact directly or via the next node of each (nearest or next-to-nearest neighbor pairs) (Table 1). This analysis showed that the transcriptional modules that were identified based on cis-regulatory motifs or ChIP-chip data harbor a statistically significant number of physically interacting pairs, revealing a tight coupling of transcriptional and PPI networks in mammalian cells.

Finally, by extracting expression module genes and interacting partners from the PPI data, we depicted molecular circuits in breast tumors based on multiple lines of evidence: PPI, expression coherence, cis-regulatory motifs and ChIP-chip data (Figure 4). These network views revealed that E2F regulates cell cycle hub genes, such as CDC2 and Cyclins, in cooperation with NFY. The transcriptional sub-network involving RUNX, ETS, IRF, and NFκ B regulates immune circuits involving various chemokines and chemokine receptors, some of which have been reported to be involved in tumor growth, invasion, and metastasis [43]. For example, a recent study showed that CCL5 can induce metastasis of breast tumors [44]. Consistent with a previous report [45], EEM predicted that ETS, IRF and NFκ B transcriptionally control CCL5, suggesting that these TFs are responsible for CCL5-matiated metastasis. Human breast tumors are histologically complex and contain a variety of cell types in addition to the carcinoma cells. Hence, the transcriptional programs controlling these immune circuits could operate not in the carcinoma cells themselves, but in the tumor microenvironment. Indeed, CCL5 was reported to be secreted from mesenchymal stem cells. Also, many other module genes are known to be specifically expressed in immune cells such as lymphocytes and macrophages. On the other hand, we could also identify the IRF expression module in the transcriptome of breast cancer cell lines (see Additional File 1). Thus, the IRF expression module may function in the carcinoma cells. Recently, global expression profiling of distinct cell population in breast tumors has been attempted [46]. We expect that application of EEM to such data will clearly show cell type specific regulatory programs.

Figure 4
figure 4

PPI and transcriptional sub-networks in breast tumors. (A) The E2F and NFY expression modules and an overlapping PPI sub-network. (B) The ETS, IRF, NFκ B, and Runx expression modules and an overlapping PPI sub-network. For clarity, we displayed nodes which have more than one links. Red, yellow and blue nodes denote transcriptional regulators, regulated genes, and their interacting partners, respectively. Red links denote transcriptional regulations predicted by our analysis, and blue links denote PPIs registered in the HPRD.

Previously, Segal et al. [47] also reported expression modules in cancer transcriptome. However, their method identifies functional modules based on relative up or down-regulation of module genes in each sample, which contrasts with our method that takes into account expression coherence across all samples. In this study, our method succeeded in reducing the expression profiles of thousands of genes to the activity profiles of 10 expression modules which explains a significant degree of diversity of the breast cancer phenotypes. It should be noted that the expression module activity profiles show some similarity with oncogenic pathway activity profiles depicted by Bild et al. [48]. Their Bayesian regression-based method learns signatures that can predict pathway activities of clinical tumor samples from microarray experiments using cell line models. By contrast, our method searches for coexpressed gene sets under common regulatory mechanisms using prescribed gene sets. Hence, these two methods are considered to be complementary to each other. In addition to clustering analysis, we applied survival analysis to the expression module activity profiles and succeeded in identifying expression modules associated with prognosis. While a number of studies have identified signature genes associated with prognosis in breast cancer [6], the result suggested that our approach is also useful to search for such signature genes. Based on EEM-deduced expression modules, we also predicted transcriptional regulatory networks in breast tumors. Although some previous studies have addressed reverse engineering problems of regulatory networks in cancer cells [49], they are only based on correlation or conditional independence of expression profiles. On the other hand, our method incorporates evidence of direct TF regulation. Collectively, we can say that EEM is a powerful module discovery method that provides various types of information essential for a deeper understanding of cancer transcriptomes.

As well as these notable advantages, our approach has several limitations. EEM assumes that module genes behave coherently across all samples. However, because gene regulatory programs are usually functional in specific contexts, it might be more appropriate to assume that module genes are assumed to behave coherently in only a subset of samples. Under the current assumption, we might fail to find tumor subtype-specific expression modules. It is also probable that different genes in a common module are controlled by different modes of a regulatory program (e.g., it is known that some TFs act as both activators and repressors, depending on target genes). Although current version of EEM cannot detect expression modules which show such complex patterns of expression profiles, future studies will improve the algorithm to overcome these limitation. Also, it should be noted that EEM uses the size of a coherent subset as an index of expression coherence. We use the arbitrary parameter r to specify the minimum degree of coexpression of coherent gene subsets. It is possible that EEM misses tightly coexpressed small modules or loosely coexpressed large modules, depending the values of r. In such a case, optimization of r based on Z scores will improve results. Recently, another gene set screening method based on a different index of expression coherence [50] was also reported. Comparison of different coherence indexes should be addressed in future studies.

To search for expression modules utilizing EEM, we prepared a collection of seed gene sets based on cis-regulatory motifs and ChIP-chip data. Therefore, comprehensiveness of our method depends on coverage of these data. Although a large number of motifs are already registered in databases, the quality and coverage seem to be incomplete. However, because several promising methods that are suitable for high-throughput determination of TF binding specificity have been devised [51, 52], more accurate and comprehensive data of regulatory motifs are expected to be available soon. Furthermore, instead of ChIP-chip, a more high-throughput and cost-effective alternative, the ChIP-Seq technique has recently emerged [53, 54]. It is expected that a great deal of TF binding site data will be produced by ChIP-Seq in the next decade. These increasing amounts of data will enable more global analysis in the near future. In this study, we assumed that each expression module is regulated by a single TF. However, combinatorial regulations by multiple TFs are known to be essential in mammalian regulatory networks. Combinatorial analysis will be enabled by constructing expression modules based on cis-regulatory information about multiple TFs. We will focus on this problem in future studies.


We apply a new gene-set based module discovery method, EEM, to breast cancer microarray data, and revealed 10 principal expression modules in the breast cancer transcriptome. The subsequent analyses of expression module activity profiles and predicted regulatory networks demonstrated their importance in the pathophysiology of breast cancer. We believe that our method will be a powerful tool to decode gene regulatory programs in cancer transcriptomes.


A pseudocode for the optimization procedure in EEM is as follows:

1: comment: E M is the set of expression profiles for a seed gene set (i.e. some subset of available genes).

2: comment: r is the given radius value.

3: C B 1 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4qam0aaSbaaSqaaiabdkeacnaaBaaameaacqaIXaqmaeqaaaWcbeaaaaa@2F45@

4: for all e i E M do

5:     C e i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4qam0aaSbaaSqaaiabdwgaLnaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@2FF6@ {e j E M : d(e j , e i ) <r}

6:    if | C e i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4qam0aaSbaaSqaaiabdwgaLnaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@2FF6@ | > | C B 1 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4qam0aaSbaaSqaaiabdkeacnaaBaaameaacqaIXaqmaeqaaaWcbeaaaaa@2F45@ | then

7:       B1 e i

8:        C B 1 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4qam0aaSbaaSqaaiabdkeacnaaBaaameaacqaIXaqmaeqaaaWcbeaaaaa@2F45@ C e i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4qam0aaSbaaSqaaiabdwgaLnaaBaaameaacqWGPbqAaeqaaaWcbeaaaaa@2FF6@

9:    end if

10: endfor

11: C B

12: T {B1 and the 9 profiles in C B 1 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4qam0aaSbaaSqaaiabdkeacnaaBaaameaacqaIXaqmaeqaaaWcbeaaaaa@2F45@ closest to B1}

13: for all triples {t1, t2, t3} T do

14:    t (t1, t2, t3)/3

15:    C t {e i E M : d(e i , t) <r}

16:    if |C t | > |C B | then

17:       B t

18:       G B C t

19:    end if

20: end for

21: return (B, C B )


  1. Sotiriou C, Piccart MJ: Taking gene-expression profiling to the clinic: when will molecular signatures become relevant to patient care?. Nat Rev Cancer. 2007, 7: 545-553. 10.1038/nrc2173.

    Article  CAS  PubMed  Google Scholar 

  2. Niida A, Smith AD, Imoto S, Tsutsumi S, Aburatani H, Zhang MQ, Akiyama T: Integrative bioinformatics analysis of transcriptional regulatory programs in breast cancer cells. BMC Bioinformatics. 9: 404-10.1186/1471-2105-9-404.

  3. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95: 14863-14868. 10.1073/pnas.95.25.14863.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  4. Beer MA, Tavazoie S: Predicting gene expression from sequence. Cell. 2004, 117: 185-198. 10.1016/S0092-8674(04)00304-6.

    Article  CAS  PubMed  Google Scholar 

  5. Bar-Joseph Z, Gerber GK, Lee TI, Rinaldi NJ, Yoo JY, Robert F, Gordon DB, Fraenkel E, Jaakkola TS, Young RA, Gifford DK: Computational discovery of gene modules and regulatory networks. Nat Biotechnol. 2003, 21: 1337-1342. 10.1038/nbt890.

    Article  CAS  PubMed  Google Scholar 

  6. Nevins JR, Potti A: Mining gene expression profiles: expression signatures as cancer phenotypes. Nat Rev Genet. 2007, 8: 601-609. 10.1038/nrg2137.

    Article  CAS  PubMed  Google Scholar 

  7. Segal E, Friedman N, Kaminski N, Regev A, Koller D: From signatures to models: understanding cancer using microarrays. Nat Genet 37. 2005, S38-S45. 10.1038/ng1561. Suppl

  8. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, Lewicki-Potapov B, Saxel H, Kel AE, Wingender E: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006, 34: D108-D110. 10.1093/nar/gkj143.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  9. Vlieghe D, Sandelin A, De Bleser PJ, Vleminckx K, Wasserman WW, van Roy F, Lenhard B: A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res. 2006, 34: D95-D97. 10.1093/nar/gkj115.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  10. Carroll JS, Meyer CA, Song J, Li W, Geistlinger TR, Eeckhoute J, Brodsky AS, Keeton EK, Fertuck KC, Hall GF, Wang Q, Bekiranov S, Sementchenko V, Fox EA, Silver PA, Gingeras TR, Liu XS, Brown M: Genome-wide analysis of estrogen receptor binding sites. Nat Genet. 2006, 38: 1289-1297. 10.1038/ng1901.

    Article  CAS  PubMed  Google Scholar 

  11. Schreiber J, Jenner RG, Murray HL, Gerber GK, Gifford DK, Young RA: Coordinated binding of NF-kappaB family members in the response of human cells to lipopolysaccharide. Proc Natl Acad Sci USA. 2006, 103: 5899-5904. 10.1073/pnas.0510996103.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  12. Lee TI, Jenner RG, Boyer LA, Guenther MG, Levine SS, Kumar RM, Chevalier B, Johnstone SE, Cole MF, Isono K, Koseki H, Fuchikami T, Abe K, Murray HL, Zucker JP, Yuan B, Bell GW, Herbolsheimer E, Hannett NM, Sun K, Odom DT, Otte AP, Volkert TL, Bartel DP, Melton DA, Gifford DK, Jaenisch R, Young RA: Control of developmental regulators by Polycomb in human embryonic stem cells. Cell. 2006, 125: 301-313. 10.1016/j.cell.2006.02.043.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  13. Odom DT, Dowell RD, Jacobsen ES, Nekludova L, Rolfe PA, Danford TW, Gifford DK, Fraenkel E, Bell GI, Young RA: Core transcriptional regulatory circuitry in human hepatocytes. Mol Syst Biol. 2006, 2: 0017-10.1038/msb4100059.

    Article  PubMed  Google Scholar 

  14. Wei CL, Wu Q, Vega VB, Chiu KP, Ng P, Zhang T, Shahab A, Yong HC, Fu Y, Weng Z, Liu J, Zhao XD, Chew JL, Lee YL, Kuznetsov VA, Sung WK, Miller LD, Lim B, Liu ET, Yu Q, Ng HH, Ruan Y: A global map of p53 transcription-factor binding sites in the human genome. Cell. 2006, 124: 207-219. 10.1016/j.cell.2005.10.043.

    Article  CAS  PubMed  Google Scholar 

  15. Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Borresen-Dale AL, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci USA. 2002, 99: 12963-12968. 10.1073/pnas.162471999.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  16. Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci USA. 2005, 102: 13550-13555. 10.1073/pnas.0506230102.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  17. Lemmens K, Dhollander T, De Bie T, Monsieurs P, Engelen K, Smets B, Winderickx J, De Moor B, Marchal K: Inferring transcriptional modules from ChIP-chip, motif and microarray data. Genome Biol. 2006, 7: R37-10.1186/gb-2006-7-5-r37.

    Article  PubMed Central  PubMed  Google Scholar 

  18. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102: 15545-15550. 10.1073/pnas.0506580102.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  19. Alter O, Brown PO, Botstein D: Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA. 2000, 97: 10101-10106. 10.1073/pnas.97.18.10101.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  20. Tamayo P, Scanfeld D, Ebert BL, Gillette MA, Roberts CW, Mesirov JP: Metagene projection for cross-platform, cross-species characterization of global transcriptional states. Proc Natl Acad Sci USA. 2007, 104: 5959-5964. 10.1073/pnas.0701068104.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  21. Moloshok TD, Klevecz RR, Grant JD, Manion FJ, Speier WF, Ochs MF: Application of Bayesian Decomposition for analysing microarray data. Bioinformatics. 2002, 18: 566-575. 10.1093/bioinformatics/18.4.566.

    Article  CAS  PubMed  Google Scholar 

  22. Dueck D, Morris QD, Frey BJ: Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Bioinformatics Suppl. 2005, 1: i144-i151. 10.1093/bioinformatics/bti1041.

    Article  Google Scholar 

  23. Kim PM, Tidor B: Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Res. 2003, 13: 1706-1718. 10.1101/gr.903503.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  24. Heyer LJ, Kruglyak S, Yooseph S: Exploring expression data: identification and analysis of coexpressed genes. Genome Res. 1999, 9: 1106-1115. 10.1101/gr.9.11.1106.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  25. De Smet F, Mathys J, Marchal K, Thijs G, De Moor B, Moreau Y: Adaptive quality-based clustering of gene expression profiles. Bioinformatics. 2002, 18: 735-746. 10.1093/bioinformatics/18.5.735.

    Article  CAS  PubMed  Google Scholar 

  26. Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature. 2005, 434: 338-345. 10.1038/nature03441.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  27. Schones DE, Smith AD, Zhang MQ: Statistical significance of cis-regulatory modules. BMC Bioinformatics. 2007, 8: 19-10.1186/1471-2105-8-19.

    Article  PubMed Central  PubMed  Google Scholar 

  28. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G: GO::TermFinder?open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics. 2004, 20: 3710-3715. 10.1093/bioinformatics/bth456.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  29. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, Ibarrola N, Deshpande N, Shanker K, Shivashankar HN, Rashmi BP, Ramya MA, Zhao Z, Chandrika KN, Padma N, Harsha HC, Yatish AJ, Kavitha MP, Menezes M, Choudhury DR, Suresh S, Ghosh N, Saravana R, Chandran S, Krishna S, Joy M, Anand SK, Madavan V, Joseph A, Wong GW, Schiemann WP, Constantinescu SN, Huang L, Khosravi-Far R, Steen H, Tewari M, Ghaffari S, Blobe GC, Dang CV, Garcia JG, Pevsner J, Jensen ON, Roepstorff P, Deshpande KS, Chinnaiyan AM, Hamosh A, Chakravarti A, Pandey A: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 2003, 13: 2363-2371. 10.1101/gr.1680803.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  30. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13: 2498-504. 10.1101/gr.1239303.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  31. Chin K, DeVries S, Fridlyand J, Spellman PT, Roydasgupta R, Kuo WL, Lapuk A, Neve RM, Qian Z, Ryder T, Chen F, Feiler H, Tokuyasu T, Kingsley C, Dairkee S, Meng Z, Chew K, Pinkel D, Jain A, Ljung BM, Esserman L, Albertson DG, Waldman FM, Gray JW: Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell. 2006, 10: 529-541. 10.1016/j.ccr.2006.10.009.

    Article  CAS  PubMed  Google Scholar 

  32. Zhu W, Giangrande PH, Nevins JR: E2Fs link the control of G1/S and G2/M transcription. EMBO J. 2004, 23: 4615-4626. 10.1038/sj.emboj.7600459.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  33. Sharrocks AD: The ETS-domain transcription factor family. Nat Rev Mol Cell Biol. 2001, 2: 827-837. 10.1038/35099076.

    Article  CAS  PubMed  Google Scholar 

  34. Blyth K, Cameron ER, Neil JC: The RUNX genes: gain or loss of function in cancer. Nat Rev Cancer. 2005, 5: 376-387. 10.1038/nrc1607.

    Article  CAS  PubMed  Google Scholar 

  35. Perkins ND: Integrating cell-signalling pathways with NF-kappaB and IKK function. Nat Rev Mol Cell Biol. 2007, 8: 49-62. 10.1038/nrm2083.

    Article  CAS  PubMed  Google Scholar 

  36. Honda K, Taniguchi T: IRFs: master regulators of signalling by Toll-like receptors and cytosolic pattern-recognition receptors. Nat Rev Immunol. 2006, 6: 644-658. 10.1038/nri1900.

    Article  CAS  PubMed  Google Scholar 

  37. Tan J, Yang X, Zhuang L, Jiang X, Chen W, Lee PL, Karuturi RK, Tan PB, Liu ET, Yu Q: Pharmacologic disruption of Polycomb-repressive complex 2-mediated gene repression selectively induces apoptosis in cancer cells. Genes Dev. 2007, 21: 1050-1063. 10.1101/gad.1524107.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  38. Bracken AP, Pasini D, Capra M, Prosperini E, Colli E, Helin K: EZH2 is downstream of the pRB-E2F pathway, essential for proliferation and amplified in cancer. EMBO J. 2003, 22: 5323-5335. 10.1093/emboj/cdg542.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  39. Ben-Porath I, Thomson MW, Carey VJ, Ge R, Bell GW, Regev A, Weinberg RA: An embryonic stem cell-like gene expression signature in poorly differentiated aggressive human tumors. Nature Genet. 2008, 40: 499-507. 10.1038/ng.127.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  40. Carroll JS, Liu XS, Brodsky AS, Li W, Meyer CA, Szary AJ, Eeckhoute J, Shao W, Hestermann EV, Geistlinger TR, Fox EA, Silver PA, Brown M: Chromosome-wide mapping of estrogen receptor binding reveals long-range regulation requiring the forkhead protein FoxA1. Cell. 2005, 122: 33-43. 10.1016/j.cell.2005.05.008.

    Article  CAS  PubMed  Google Scholar 

  41. Simonis N, Gonze D, Orsi C, van Helden J, Wodak SJ: Modularity of the transcriptional response of protein complexes in yeast. J Mol Biol. 2006, 363: 589-610. 10.1016/j.jmb.2006.06.024.

    Article  CAS  PubMed  Google Scholar 

  42. Manke T, Bringas R, Vingron M: Correlating protein-DNA and protein-protein interaction networks. J Mol Biol. 2003, 333: 75-85. 10.1016/j.jmb.2003.08.004.

    Article  CAS  PubMed  Google Scholar 

  43. Balkwill F: Cancer and the chemokine network. Nat Rev Cancer. 2004, 4: 540-550. 10.1038/nrc1388.

    Article  CAS  PubMed  Google Scholar 

  44. Karnoub AE, Dash AB, Vo AP, Sullivan A, Brooks MW, Bell GW, Richardson AL, Polyak K, Tubo R, Weinberg RA: Mesenchymal stem cells within tumour stroma promote breast cancer metastasis. Nature. 2007, 449: 557-563. 10.1038/nature06188.

    Article  CAS  PubMed  Google Scholar 

  45. Liu J, Ma X: Interferon regulatory factor 8 regulates RANTES gene transcription in cooperation with interferon regulatory factor-1, NF-kappaB, and PU.1. J Biol Chem. 2006, 281: 19188-19195. 10.1074/jbc.M602059200.

    Article  CAS  PubMed  Google Scholar 

  46. Allinen M, Beroukhim R, Cai L, Brennan C, Lahti-Domenici J, Huang H, Porter D, Hu M, Chin L, Richardson A, Schnitt S, Sellers WR, Polyak K: Molecular characterization of the tumor microenvironment in breast cancer. Cancer Cell. 2004, 6: 17-32. 10.1016/j.ccr.2004.06.010.

    Article  CAS  PubMed  Google Scholar 

  47. Segal E, Friedman N, Koller D, Regev A: A module map showing conditional activity of expression modules in cancer. Nat Genet. 2004, 36: 1090-1098. 10.1038/ng1434.

    Article  CAS  PubMed  Google Scholar 

  48. Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D, Joshi MB, Harpole D, Lancaster JM, Berchuck A, Olson JA, Marks JR, Dressman HK, West M, Nevins JR: Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature. 2006, 439: 353-357. 10.1038/nature04296.

    Article  CAS  PubMed  Google Scholar 

  49. Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A: Reverse engineering of regulatory networks in human B cells. Nat Genet. 2005, 37: 382-390. 10.1038/ng1532.

    Article  CAS  PubMed  Google Scholar 

  50. Kim TM, Chung YJ, Rhyu MG, Jung MH: Inferring biological functions and associated transcriptional regulators using gene set expression coherence analysis. BMC Bioinformatics. 2007, 8: 453-10.1186/1471-2105-8-453.

    Article  PubMed Central  PubMed  Google Scholar 

  51. Mukherjee S, Berger MF, Jona G, Wang XS, Muzzey D, Snyder M, Young RA, Bulyk ML: Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nat Genet. 2004, 36: 1331-1339. 10.1038/ng1473.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  52. Berger MF, Philippakis AA, Qureshi AM, He FS, Estep PW, Bulyk ML: Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat Biotechnol. 2006, 24: 1429-1435. 10.1038/nbt1246.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  53. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell. 2007, 129: 823-837. 10.1016/j.cell.2007.05.009.

    Article  CAS  PubMed  Google Scholar 

  54. Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007, 316: 1497-1502. 10.1126/science.1141319.

    Article  CAS  PubMed  Google Scholar 

Download references


We thank S. Mori, Y. Hoshida, and Y. Suzuki for their valuable comments and helpful suggestions on earlier version of this manuscript. We appreciate comments from the two anonymous referees that improve the quality of our article considerably. Computation time was provided by the Super Computer System, Human Genome Center, Institute of Medical Science, University of Tokyo. This work was supported by Grants-in-Aid for Scientific Research on Priority Areas.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Atsushi Niida.

Additional information

Authors' contributions

AN, TA, and HA designed research; AN performed research; AN, ADS and MQZ contributed new analytic tools; AN, TA, and SI wrote the paper; all authors read and approved the final manuscript.

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Niida, A., Smith, A.D., Imoto, S. et al. Gene set-based module discovery in the breast cancer transcriptome. BMC Bioinformatics 10, 71 (2009).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: