Research article | Open | Published:
Meta-analysis of breast cancer microarray studies in conjunction with conserved cis-elements suggest patterns for coordinate regulation
BMC Bioinformatics volume 9, Article number: 63 (2008)
Gene expression measurements from breast cancer (BrCa) tumors are established clinical predictive tools to identify tumor subtypes, identify patients showing poor/good prognosis, and identify patients likely to have disease recurrence. However, diverse breast cancer datasets in conjunction with diagnostic clinical arrays show little overlap in the sets of genes identified. One approach to identify a set of consistently dysregulated candidate genes in these tumors is to employ meta-analysis of multiple independent microarray datasets. This allows one to compare expression data from a diverse collection of breast tumor array datasets generated on either cDNA or oligonucleotide arrays.
We gathered expression data from 9 published microarray studies examining estrogen receptor positive (ER+) and estrogen receptor negative (ER-) BrCa tumor cases from the Oncomine database. We performed a meta-analysis and identified genes that were universally up or down regulated with respect to ER+ versus ER- tumor status. We surveyed both the proximal promoter and 3' untranslated regions (3'UTR) of our top-ranking genes in each expression group to test whether common sequence elements may contribute to the observed expression patterns. Utilizing a combination of known transcription factor binding sites (TFBS), evolutionarily conserved mammalian promoter and 3'UTR motifs, and microRNA (miRNA) seed sequences, we identified numerous motifs that were disproportionately represented between the two gene classes suggesting a common regulatory network for the observed gene expression patterns.
Some of the genes we identified distinguish key transcripts previously seen in array studies, while others are newly defined. Many of the genes identified as overexpressed in ER- tumors were previously identified as expression markers for neoplastic transformation in multiple human cancers. Moreover, our motif analysis identified a collection of specific cis-acting target sites which may collectively play a role in the differential gene expression patterns observed in ER+ versus ER- breast cancer tumors. Importantly, the gene sets and associated DNA motifs provide a starting point with which to explore the mechanistic basis for the observed expression patterns in breast tumors.
Variation in gene expression provides a quantifiable trait that has been employed to classify breast tumors [1–3]. However, it has long been known that the gene sets identified by independent laboratories fail to converge on a unified set of genes, thereby casting doubt on the biological implications of these profiles. Despite these differences, two prognostic tests have recently been approved in the United States for clinical management of disease [5, 6]. From a diagnostic perspective, developing a unified gene profile that predicts both risk of recurrence and therapeutic response in diverse disease subtypes would be clinically useful. These gene sets could also provide an understanding of the mechanistic basis of malignancy.
Meta-analysis has been used as a formal summarization method in the clinical cancer literature for many years [7–10]. Recently, some groups have applied meta-analysis to gene expression microarrays [11–13]. Meta-analysis refers to a broad class of models used for summarizing and synthesizing studies to estimate their overall effect. Rhodes et al. were among the first to demonstrate the usefulness of meta-analytic procedures on microarray data in prostate cancer. Since then, meta-analysis of microarrays has made many contributions to the oncology literature, including in breast cancer [13, 16, 17].
One of the central goals in gene expression experiments is to identify the common regulatory themes and cis-elements responsible for the observed patterns of gene expression. This has been most successfully performed for the yeast Saccharomyces cerevisiae, where new regulatory genes have been suggested. However, metazoan expression patterns tend to be more complicated. One approach has been to combine expression data of orthologous genes from diverse organisms to build co-expression networks. In Drosophila, gene networks have been proposed based upon the co-localization of TFBS with cis-regulatory modules (CRM). The availability of both mammalian and lower metazoan complete genome assemblies affords one the opportunity to identify phylogenetically conserved motifs in the array candidates. In addition to known TFBS, these phylogenetic motifs may identify important new cis-acting signals that modulate transcription (promoters) or transcript stability (3'UTRs) and may be key elements in the observed expression patterns. A systematic comparison of both known and phylogenetic cis-elements between two sets of differentially expressed genes can serve to implicate these elements as common modulators in the observed gene expression patterns.
Our method incorporates a meta-analysis model to rank genes into groups of over- and under-expressed gene sets, based upon their relative importance between independent array studies. Our analyses of gene expression patterns in ER+ and ER- breast tumors were performed across different array platforms on a diverse spectrum of patients. The two sets of genes showing the most disparate expression patterns between ER+ and ER- tumors provided an entry point with which to explore the possibility that specific sequence elements may be disproportionately represented in these two groups. We utilized known motifs in conjunction with comparative genomic resources to search for enriched DNA elements in both the proximal promoter and 3'UTR regions of these genes. Our findings suggest that the differential gene expression in ER+ vs. ER- tumors may, in some cases, be mediated by specific sequence elements in either the promoter or 3'UTR intervals. The motif distribution profiles between our gene sets identified both known and phylogenetically conserved elements that may play a role in these genes' co-expression.
Forty-six percent of unique probes among the studies mapped many-to-one to unique UniGene IDs. The mean and median numbers of probes per UniGene ID were 12.7 and 1, respectively. When we merged the 9 studies in Table 1 for the meta-analysis data set, we retained the expression values for all probe combinations in all studies, and this resulted in a multiplicative set of records in the database. Approximately 12% of the unique ESTs in the Oncomine database (Oncomine DB) did not correspond to a unique UniGene ID. These were dropped from the analysis data sets.
We focused our subsequent analyses on a select set of genes by taking medians across each UniGene ID's S/N statistics. A scatter plot of the S/N (x-axis) versus abs(S)/SD statistics (y-axis) appears in Figure 1. The distribution of the S/N values was bell-shaped with heavy tails. Our criteria for selecting genes were to take the most extreme 1% and 5% values in both tails. We found it instructive to consider the ratio S/SD on the y-axis of Figure 1, where SD is the standard deviation of the (C j ln p j ) addends of S. Large values of this ratio indicate those genes with consistently significant p-values across all of the studies that we considered. The numbers of UniGene IDs with S/N scores in the top 1% and 5% (S+ and S- combined) were 300 and 1804, respectively. The mean numbers of studies for genes present in the top 1% and top 5% classes were 2.94 and 3.18, respectively. We report both the top 1% and the top 5% lists for further screening as a crude way of managing false positives arising from bias correlated with each gene's relative ranking. Many of the genes in our top 1% upregulated list have previously been identified as overexpressed in ER+ breast tumors, most notably the two transcription factors ESR1 and GATA3. Our gene lists appear in Additional File 1.
We next compared our top 1% and 5% upregulated gene lists in ER+ and ER- tumors to the prognostic genes utilized in the 70-gene signature associated with Mammaprint® and in the 16-gene signature of the RT-PCR based OncoType Dx® test. Although the array data defining the 70-gene profile were one of the 9 input datasets for our meta-analysis, and two expression datasets used for the 16-gene signature were also employed in the analysis, we did not observe complete overlap in the genes identified. For the 70-gene signature, our top 1% dataset identified an overlap of one and four genes, respectively, that were upregulated in ER+ tumors versus ER- tumors. Only 14 and 5 genes overlapped in the top 5% dataset, respectively. For the 16-gene signature, one and two genes, respectively, from the top 1% gene sets were overexpressed in ER+ versus ER- tumors in our meta-analysis, while 4 and 5 genes, respectively, overlapped in the top 5% list. Differences in probes, arrays, and studies used in the meta-analyses may explain some of the differences between our gene lists and the gene lists from the two diagnostic tools. Additionally, we compared our gene lists to a previously identified universal profile that uses 69 genes overexpressed in a diverse spectrum of undifferentiated cancers to predict neoplastic transformation. Strikingly, only genes overexpressed in ER- tumors overlapped with this 69-gene signature. Four genes (CNAP1, CDC20, YBX1, and CENPA) overlapped in our top 1% list, while 23 genes overlapped in our top 5% list. These findings are in accord with the observation that ER- tumors are more highly undifferentiated than ER+ tumors and demonstrate more metastatic potential clinically [24, 25]. Collectively, these 23 genes may identify a set of candidate genes predictive of metastatic potential in ER- breast tumors.
Ingenuity Pathway Analyses
We considered the relationship of our top 1% genes in the ER+ and ER- groups using Ingenuity Pathway Analyses (IPA). Our objective in using Ingenuity was to characterize the functional role of our selected genes. IPA isolated genes for which it had documented associations, and created a series of networks based on the published literature. We were able to map 290 of the 300 genes comprising the sum of the 1% upregulated and 1% downregulated gene sets. From these networks, IPA queried its database of biological functions and scored each gene cluster with a p-value calculation. Table 2 shows the most common functions found among our most differentially expressed genes. Notably, our top 1% genes upregulated in ER- tumors contained 26 genes showing association to cancer, whereas only 7 of the genes upregulated in ER+ tumors were cancer-associated.
Promoter Motif Comparisons in Dysregulated Genes
We tested the hypothesis that there was a significant difference in the occurrence of each motif between our two classes of genes (ER+ overexpressed vs. ER- overexpressed) using a Fisher's Exact test. We adjusted for multiple testing by applying the Benjamini-Hochberg p-value correction . We counted the number of genes in each class which were overexpressed in ER+ tumors and contained a copy of each phylogenetic motif, and compared those to the number of genes overexpressed in ER- tumors. For genes harboring multiple copies of a motif we counted these elements as a single motif event. We independently performed tests for both the top 1% and 5% of our genes. Our initial query sets consisted of 123 condensed TRANSFAC motifs and a second analysis comprised 174 phylogenetically conserved mammalian promoter motifs as previously defined . Sixty-nine of the phylogenetic motifs map to known TFBS defined in the TRANSFAC DB v7.4 while 105 represented novel phylogenetically conserved elements.
We first examined whether any of 123 known TFBS were disproportionately represented in our ER+ and ER- gene sets. Abbreviated results appear in Table 3. While numerous motifs showed significance by Fisher's Exact testing (p < 0.05), only 2 survived multiple testing correction. The first motif, KTWGTTT, a binding site for the SRY1 transcription factor, was over-represented in the noncoding strand of the top 5% of ER+ upregulated genes. For ER+ overexpressed genes, 473 of 735 genes contained the site, while 423 of 766 ER- overexpressed genes contained the site (Benjamini-Hochberg corrected p = 0.042). The second site, ABWCAGGTRNR, a binding site for AREB6 (also called Transcription Factor 8, TCF8, or ZEB1), contains an embedded E-box motif and was over-represented in the top 1% of ER+ upregulated genes when both coding and noncoding strands were surveyed (adjusted p = 0.024). Twenty-five genes bore TCF8 sites in either strand amongst 138 ER+ upregulated genes, while only 6 genes contained the site amongst 147 ER- upregulated genes. The presence of TCF8 sites in roughly four times as many ER+ upregulated genes as ER- upregulated genes may point to an indirect mechanism for gene activation in ER+ breast tumors. TCF8 has been shown to be induced by estrogen, and it in turn activates a cascade of downstream genes. Additionally, the transcriptional repression of E-cadherin by TCF8 has been shown to lead to loss of the epithelial phenotype, suggesting a role for this TF in late-stage carcinogenesis. We note that although E-cadherin was not identified in our meta-analysis, 2 related genes, CDH3 and PCDH8, both of which lie in the top 5% of ER- overexpressed genes, may be responsive to repression by TCF8. The over-representation of TCF8 binding sites in both strands of our top 1% ER+ overexpressed genes suggests that TCF8 may act as a transcriptional activator for these genes yet act as a transcriptional repressor for ER- overexpressed genes.
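As a rough illustration of the per-motif test used in this section, the sketch below computes a two-sided Fisher's Exact p-value for one motif and applies the Benjamini-Hochberg adjustment to a family of p-values. It uses only the Python standard library; the function names are illustrative and are not the software used in the study. The single real input is the pair of TCF8 counts quoted above (25 of 138 versus 6 of 147); the other p-values passed to the adjustment are made-up placeholders, since the correction needs the whole family of tests.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are gene classes (e.g. ER+ vs. ER- overexpressed); columns are
    genes with / without at least one copy of the motif.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that x motif-bearing genes fall in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    p_obs = prob(a)
    # Sum every table that is no more probable than the observed one.
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-9)))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# TCF8 counts from the text: 25 of 138 ER+ genes vs. 6 of 147 ER- genes.
p_tcf8 = fisher_exact_two_sided(25, 138 - 25, 6, 147 - 6)
# The remaining p-values are hypothetical placeholders for other motifs.
adjusted = benjamini_hochberg([p_tcf8, 0.20, 0.50])
```

Because each gene contributes at most one count per motif, the table entries are numbers of genes rather than numbers of motif occurrences.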
In addition to known sites, we sought to identify potential new regulatory motifs by examining the coding and noncoding strands with 174 previously identified phylogenetic motifs in the top 1% and 5% of our S+ vs. S- genes. Eleven of these motifs represented palindromic sequences and were scanned in only the coding strand when both strands were analyzed. Again, while numerous motifs showed significance by Fisher's Exact testing (p < 0.05), only 1 survived multiple testing correction. Abbreviated results appear in Table 4. A single motif (CAGNYGKNAAA) showed a significant difference between the ER+ upregulated genes and the ER- upregulated genes when the noncoding strand was examined in our top 1% gene list. Nineteen of 138 ER+ overexpressed genes contained at least 1 copy of the motif, while only 3 of 147 ER- overexpressed genes contained the motif (adjusted p < 0.0373). This phylogenetic motif does not map to any known TFBS and represents a new target for exploration.
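The motif scan described above can be sketched as follows, assuming motifs are written in IUPAC ambiguity codes (as KTWGTTT and CAGNYGKNAAA are) and that each promoter is available as a plain DNA string. The helper names and the `promoters` dictionary are hypothetical. Here the supplied sequence stands in for one strand and the reverse complement of the motif is searched to cover the other, which is an assumption about orientation rather than the study's exact procedure.

```python
import re

# IUPAC nucleotide ambiguity codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence or IUPAC motif."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def motif_regex(motif):
    """Compile an IUPAC motif into a regular expression."""
    return re.compile("".join(IUPAC[ch] for ch in motif.upper()))


def is_palindromic(motif):
    """True if the motif equals its own reverse complement."""
    return reverse_complement(motif) == motif.upper()


def has_motif(promoter, motif, both_strands=True):
    """True if the promoter carries at least one copy of the motif."""
    seq = promoter.upper()
    if motif_regex(motif).search(seq):
        return True
    # A palindromic motif matches both strands at once, so scan it only once.
    if both_strands and not is_palindromic(motif):
        return motif_regex(reverse_complement(motif)).search(seq) is not None
    return False


def count_genes_with_motif(promoters, motif, both_strands=True):
    """Number of genes with >= 1 copy; multiple copies count as one event."""
    return sum(has_motif(s, motif, both_strands) for s in promoters.values())
```

Returning a per-gene yes/no rather than an occurrence count mirrors the rule that multiple copies in one gene are counted as a single motif event.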
Analysis of the 3'UTR
We next screened for regulatory elements in the 3'UTRs of our gene sets. Less is known about functional motifs in 3'UTRs than about functional motifs in promoter regions, but evolutionarily conserved motifs in 3'UTRs may, as in promoter regions, indicate regulatory sites. We therefore used a previously identified set of evolutionarily conserved 3'UTR motifs. Although the function of half of these motifs is unknown, the remaining half are A/T-rich elements believed to be involved in mRNA stability or are likely microRNA binding sites.
We used the same RefSeq IDs to harvest the annotated 3'UTRs of our gene sets as described in the Methods. Surprisingly, we observed a significant difference in the median 3'UTR lengths between our gene sets (Figure 2). The top 1% genes overexpressed in ER+ tumors had a median 3'UTR length of 0.9 kb, while genes overexpressed in ER- tumors had a median 3'UTR length of 0.61 kb. A similar trend was observed when we examined the top 5% gene sets. In this set, ER+ upregulated genes had a median 3'UTR length of 0.87 kb, while the ER- genes had a median length of 0.63 kb. MicroRNA target genes have longer 3'UTRs, whereas anti-targets have shorter 3'UTRs. Thus, the difference in 3'UTR length suggests a difference in miRNA targeting prevalence between the ER+ and ER- genes.
The most significant evolutionarily conserved motifs in the top 1% and top 5% genes (Table 5) correspond to potential miRNA target sites: YACTGCCR and WGCCTTA have seed complementarity to miR-34/miR-449 and miR-124, respectively. The miRNA seed region – nucleotides 2–8 from the 5' end – is the most important factor for miRNA target site recognition [32–34]. Fisher's Exact tests on the miRNA seed site occurrence counts, corrected for multiple testing, seemingly confirm that the ER+ genes are preferentially regulated by miRNAs, as all the significant seeds are overrepresented in the ER+ upregulated genes (Table 6). There is, however, a potential problem with using the Fisher's Exact test for the 3'UTR sets. If motif occurrences were random, we would expect the ER+ genes to have more motif occurrences than the ER- genes, as the ER+ genes have longer 3'UTRs. Thus, to determine whether there is a significant difference in miRNA regulation between the ER+ and ER- genes, we had to address whether the occurrences of miRNA seed sites in the two sets were significantly different from what we would expect by chance. We therefore ran a set of randomization experiments in which we compared the observed number of seed site occurrences in the ER+ and ER- genes' 3'UTRs with those in random gene sets that had 3'UTR lengths similar to the ER+ and ER- 3'UTRs (Table 6). We found that all of the seeds identified by significant Fisher's Exact tests do occur significantly more frequently in the ER+ 3'UTRs than in 3'UTRs from random gene sets. Moreover, these seeds also occur significantly less frequently in the ER- 3'UTRs than in random gene sets. Thus, it seems that whereas several miRNAs may coordinately regulate some of the ER+ genes, some of the ER- genes may collectively avoid being regulated by the same miRNAs.
Previous studies have identified several miRNAs that are aberrantly expressed in breast cancers [35, 36]. Together, the aberrantly expressed miRNAs in these studies mapped to 35 unique 6 mer seed sequences, of which three were among our ten most significant 6 mer motifs. The three corresponding miRNAs (miR-205, miR-21, and miR-203) are all overexpressed in breast cancers. None of the ten most significant 6 mer motifs are from miRNAs reported to be differentially expressed in ER+ and ER- tumors; the most significant 6 mer from such a miRNA is ranked 25th (Hochberg-adjusted Fisher's Exact p-value of 0.17), is significantly more abundant in ER+ genes than expected by chance, and is from miR-206, which is downregulated in ER+ tumors.
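One way such a length-controlled randomization could be set up is sketched below. This is not the authors' procedure: the 10% length tolerance, the empirical p-value with a +1 correction, the choice to count every (possibly overlapping) occurrence, and all function names are assumptions made for illustration. `background_utrs` is a hypothetical list of 3'UTR sequences from which random gene sets are drawn.

```python
import random


def count_sites(seq, site):
    """Count occurrences of site in seq, allowing overlaps."""
    count, start = 0, seq.find(site)
    while start != -1:
        count += 1
        start = seq.find(site, start + 1)
    return count


def length_matched_sample(target_lengths, background, rng, tolerance=0.1):
    """Draw one background 3'UTR of similar length for each target length."""
    sample = []
    for length in target_lengths:
        candidates = [s for s in background
                      if abs(len(s) - length) <= tolerance * length]
        # Fall back to the whole background if nothing is close in length.
        sample.append(rng.choice(candidates or background))
    return sample


def randomization_test(gene_utrs, background_utrs, site, n_iter=1000, seed=0):
    """Compare observed seed-site counts with length-matched random sets.

    Returns (observed count, empirical p for enrichment, empirical p for
    depletion).
    """
    rng = random.Random(seed)
    observed = sum(count_sites(s, site) for s in gene_utrs)
    lengths = [len(s) for s in gene_utrs]
    null = []
    for _ in range(n_iter):
        sample = length_matched_sample(lengths, background_utrs, rng)
        null.append(sum(count_sites(s, site) for s in sample))
    p_enriched = (1 + sum(x >= observed for x in null)) / (n_iter + 1)
    p_depleted = (1 + sum(x <= observed for x in null)) / (n_iter + 1)
    return observed, p_enriched, p_depleted
```

Under this setup, a seed that is enriched in the ER+ 3'UTRs would show a small enrichment p-value for the ER+ set, and a seed that the ER- genes avoid would show a small depletion p-value for the ER- set, independently of the difference in 3'UTR length.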
MicroRNAs are small (21–23 nucleotides) noncoding RNAs that recognize complementary target sequences in mRNAs and prompt either translational repression or RNA degradation. MicroRNAs play important roles in cancer. Iorio et al., for example, recently revealed that deregulation of multiple miRNAs can be correlated with pathogenic features such as estrogen or progesterone receptor status and tumor stage in breast cancers. In addition, shorter postoperative survival times for patients with lung tumors can be predicted by measuring the miRNA let-7. Thus miRNAs can be used both as classifiers of breast tumor type and as predictors of survival of lung cancer patients. MicroRNAs preferentially target 3'UTRs that have short sequences with perfect complementarity to nucleotides 2–7 (6 mer) or 2–8 (7 mer) in the miRNA's 5' region – the seed region [32–34]. As miRNA regulation may explain gene co-expression, we included the 6 mer and 7 mer seed sequences for all human miRNA sequences known at the time of the study. We note that not all known human miRNAs are highly evolutionarily conserved, and these seed sequences therefore supplement the miRNA-related evolutionarily conserved motifs.
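To make the 6 mer and 7 mer definitions concrete, the sketch below extracts the seed (nucleotides 2–7 or 2–8) from a mature miRNA sequence and derives the perfectly complementary site that would be searched for in a 3'UTR. The function names are illustrative. The let-7a sequence is included only as an example and is written as it is commonly listed for hsa-let-7a-5p; it should be checked against miRBase before being relied on.

```python
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed(mirna, end):
    """Nucleotides 2..end (1-based, inclusive) of a 5'->3' miRNA sequence."""
    return mirna.upper().replace("T", "U")[1:end]


def target_site(mirna, end):
    """DNA-alphabet 3'UTR site perfectly complementary to the seed.

    The site is the reverse complement of the seed, written 5'->3'.
    """
    return seed(mirna, end).translate(RNA_COMPLEMENT)[::-1].replace("U", "T")


# Example sequence (assumed, see lead-in): hsa-let-7a-5p.
LET7A = "UGAGGUAGUAGGUUGUAUAGUU"

six_mer_site = target_site(LET7A, 7)    # complement of nucleotides 2-7
seven_mer_site = target_site(LET7A, 8)  # complement of nucleotides 2-8
```

Because many miRNAs share a seed, a set of miRNAs typically collapses to a smaller set of unique seed sites, which is why the counts in this analysis are reported per seed rather than per miRNA.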
Since we identified sets of genes that demonstrated differential expression between ER+ and ER- tumors, we reasoned that some of these genes may contain common cis-regulatory motifs contributing to their co-regulation. We would predict that these sites may, in some cases, be disproportionately represented between genes upregulated in ER+ tumors and genes upregulated in ER- tumors, perhaps allowing one to identify genes sharing a common regulatory pathway. Computational tools exist to identify TFBS based upon over-representation of conserved motifs in datasets. Other approaches aim to identify transcription factors (TF) that bind to TFBS based on the relatedness of expression profiles between the TF and the target genes they are postulated to regulate. A combined approach utilizing expression measurements of tissue-specific gene sets in conjunction with orthologous TFs from human and mouse provides enhanced accuracy in predicting bona fide cis-regulatory elements. For the most part these searches are guided by biologically confirmed TFBS interactions identified in the TRANSFAC database; however, this approach may fail to identify motifs that are evolutionarily conserved amongst mammals but do not correspond to a known TFBS.
In addition to known sites that remained significant after multiple testing correction, many additional sites, and their associated transcription factors, warrant comment. A second important TFBS, CTTTGA, the binding site for lymphoid enhancer-binding factor 1 (LEF1), did not survive multiple testing correction: in the coding strand of the top 1% gene lists, 83 of 138 ER+ overexpressed genes contained ≥ 1 site versus 64 of 147 genes in the ER- gene set (Table 3). Nonetheless, there is strong biological evidence supporting the role of LEF1 in tumorigenesis. The LEF1 binding site CTTTGA is one of the primary binding sites in the Wnt signaling pathway, which regulates cell-cell adhesion and many morphogenetic events during mammary development and possibly cancer [42, 43]. Binding of Wnt proteins to frizzled protein prevents degradation of β-catenin, which subsequently translocates to the nucleus and binds transcription factors of the TCF/LEF family (this includes TCF8 discussed above and LEF1). Several tumors are known to have an altered β-catenin signaling pathway, including colorectal and lymphoblastic tumors. Mutations in the Wnt pathway genes can result in β-catenin stabilization and activation of LEF/TCF-induced transcription. Recent studies have demonstrated that sebaceous tumors harbor LEF1 mutations that interfere with the β-catenin-binding domain of LEF1 and with transcriptional activation. Common human carcinomas also carry mutations in the β-catenin-binding domain of LEF. Our data suggest that mutations (somatic or germline) in LEF1 or TCF8 binding sites in genes that inactivate Wnt signaling could contribute to breast tumorigenesis.
We did not find the estrogen receptor binding site (TGACCTTG) over-enriched in any of our analyses. This is not surprising, as our survey was confined to the immediate 2 kb promoter region. We point out that estrogen may be playing an indirect role on genes overexpressed in ER+ tumors via the activation of TFs such as TCF8, which in turn activate downstream targets. Additionally, it is possible that differences in ER binding sites do exist between our gene sets but that these sites reside much further upstream. Recent reports indicate that only two-thirds of ER TFBS can be localized to the proximal promoter region of RNA polymerase II genes. We also note that the E2F binding site (GCGCSAAA) consistently ranked amongst the top 5 motifs identified when screening the noncoding strand (Table 3; 4th highest scoring motif for the top 1% and 2nd highest scoring for the top 5%). In the noncoding strand of the top 1% gene sets, E2F sites were observed in 8 of 147 genes overexpressed in ER- tumors versus 0 of 138 genes overexpressed in ER+ tumors. Though the E2F site did not pass our multiple comparisons correction, published data support a role for these E2F sites in carcinogenesis. Prior efforts to identify a conditional regulatory program responsible for the coordinate regulation of sets of genes in multiple cancer types identified E2F as the lone TF universally overexpressed in multiple tumor types. The presence of E2F sites exclusively in genes overexpressed in ER- BrCa tumors suggests that E2F plays a major role in this tumor type and may activate some target genes involved in cell cycle control.
A caveat to our analyses is that in some cases the motif count alone may not be a good predictor, because a given motif may show positional bias relative to the transcriptional start site (TSS). For some TFs, positional bias is likely to play a role in function. For example, the motif TATAAATW (TATA binding protein recognition sequence) shows a strong bias toward a position 23 bp upstream of the TSS. This spatial restriction is likely due to necessary interactions with the basal transcriptional apparatus (RNA polymerase II). Thus, motif copies present around -23 are likely to be functional, while motifs distributed at other positions throughout the 2 kb upstream region would be predicted to be non-functional. Of our 174 phylogenetic motifs, only 32% (56 of 174) show positional bias, the majority of which are located within 100 bp of the TSS. The absence of any positional bias for the vast majority of motifs in genes demonstrating disparate motif frequencies suggests a possible position-independent role in contributing to the observed expression patterns. The lone phylogenetic motif showing significance, CAGNYGKNAAA, does not demonstrate positional bias.
A difficulty with any meta-analysis is that of study heterogeneity when one combines studies [51–53]. Meta-analyses on gene expression data are not immune from this criticism. There are many factors that influence a designation of ER+ and ER- status in breast tumors, including assay sensitivity and the scoring system used. The specific methods and assays for determining ER+ and ER- status are not available from Oncomine and we were unable to account for this factor in our results. Many have proposed statistical methods for quantifying the heterogeneity in a meta-analysis data set [54–56]. Since heterogeneity manifests in an inflation of inter-study variance, a meta-analysis with any degree of heterogeneity tends to bias the effect size toward the null hypothesis  and hence be conservative.
Our meta-analysis was designed to identify genes showing consistent differences in gene expression patterns between ER+ and ER- breast tumors. The target genes identified provide a unified set of genes obtained across multiple analyses, and their expression patterns may reflect the true biological complexity of breast tumors. A small 10-gene meta-analysis signature to predict ER status has recently been described. Three genes identified in that study (ESR1, GATA3, and SLC39A6) overlap with our top 1% ER+ upregulated genes. From our results, a more highly refined set of gene targets can potentially be explored that would prove useful in the development of improved biomarker assays for determining not only ER status but also prognosis. Importantly, the overlap of 23 genes from our top 5% list of genes upregulated in ER- tumors with a set of 69 genes demonstrating overexpression in more than 12 types of undifferentiated cancers via meta-profiling identifies genes universally activated in cancer. This list includes genes shown to be involved in the undifferentiated phenotype. They include the MELK kinase involved in mammalian embryogenesis, the apoptosis inhibitor BIRC5, and multiple genes implicated in cell cycle control (CCNA2, MCM6 and FOXM1).
By screening the proximal promoter and 3'UTR domains of our gene sets, we wanted to identify known TFBS, phylogenetically conserved motifs, and miRNA seed sequences that differ in prevalence between ER+ upregulated and ER- upregulated genes. For any given site, the disproportionate distribution between these gene sets may identify elements responsible for the co-regulation of groups of genes, and our analyses identified several significant elements in both the promoter and 3'UTR regions. Moreover, ER- genes had significantly shorter 3'UTRs than ER+ genes. Short 3'UTRs are common for miRNA anti-targets, which suggests that different mechanisms regulate groups of ER+ and ER- genes; that is, ER+ genes may be miRNA targets whereas ER- genes may be anti-targets. Consistent with this hypothesis, ER+ genes have significantly more putative miRNA target sites in common than expected by 3'UTR length alone, whereas ER- genes have significantly fewer putative miRNA target sites in common than expected by 3'UTR length alone. Anti-target genes are commonly involved in basic cellular processes, and in agreement with this, genes involved in the cell cycle are significantly overrepresented in the ER- genes (data not shown).
Clearly, our analysis is a starting point. An examination of larger sequence domains upstream of these target genes may suggest additional elements showing differences in target abundance between these gene sets. While our phylogenetic motifs were for the most part small (<20 nucleotides), larger sequence elements such as enhancers that function at extended distances from these genes are likely to also play a role in the observed expression patterns. The potential importance of promoter motifs in gene expression, and of common polymorphisms that reside within these sites, was highlighted by a recent survey of the promoter regions of nearly 200 genes in which 75% of the SNPs identified modify (either by gain or loss) putative TFBS. A survey of known polymorphisms (SNPs) from existing databases (dbSNP or HapMap) that reside within these motifs would also help establish the importance of these elements. It would be of keen interest to explore whether regulatory modules consisting of combinations of both known and phylogenetically conserved motifs exist within these gene sets. Approaches such as this have been described computationally for yeast, fly, mouse and humans. The recent use of comparative genomics tools spanning mammals as well as evolutionarily distant species such as pufferfish (Tetraodon sp.) to identify phylogenetically conserved enhancers may also enable the identification of additional sequence elements responsible for the coordinate expression patterns seen for some of our genes. Efforts such as these, in conjunction with genomewide chromatin immunoprecipitation (ChIP) studies of promoter regions, will provide a more comprehensive view of the key elements modulating the observed gene expression patterns.
Likewise, in the 3'UTR, genomewide efforts to map SNPs to miRNA target sites have revealed that many polymorphisms can either create new miRNA target sites or lead to their loss. Genome-wide searches in humans have identified cis polymorphisms in putative miRNA target sites that are likely contributors to phenotypic variation and may play a role in disease pathogenesis. Future analyses will reveal whether SNPs in phylogenetically conserved promoter and 3'UTR elements can influence breast cancer risk at the level of RNA transcription or stability.
We queried the Oncomine database for gene expression studies in breast cancer as of September 2005. Within Oncomine, a dataset is considered to be "Analyzed" when the data from the original study have been digitized and normalized into Oncomine's data mining system. At that time, there were 14 "Analyzed" studies with complete expression data in breast cancer. These "Analyzed" studies provided by Oncomine included normalized expression data per probe. Each probe's record included the probe's identification number (dependent on the array platform), the number of subjects, the mean expression values, and the p-value and q-value. Although each study measured a variety of clinical aspects of patients with disease (e.g., progesterone receptor status, distant lymph node metastases, disease-free survival, etc.), 9 studies considered expression patterns between ER+ and ER- tumors. As estrogen receptor status is a key factor in treatment decision-making, we elected to compare the expression of genes overexpressed in ER+ tumors with those of ER- tumors. These 9 studies are listed in Table 1 and represent 954 independent cases of breast cancer.
For our meta-analysis we collected all of the expression data and imported all data sets into JMP tables for merging. We considered Fisher's method for combining p-values as the basis of our meta-analysis statistic. Rhodes et al. also considered this approach in their meta-analysis of gene expression in prostate cancer. Equation 1 shows our modification of Fisher's statistic for our meta-analysis. Since we were not interested in the distributional properties of Fisher's statistic, we modified the statistic by incorporating the signum of the direction of the differential gene expression. We mapped probe identifiers to unique UniGene IDs and these were the addends for S in Equation 1. However, if a study did not have a probe corresponding to a given UniGene ID, that study did not contribute to the meta-analysis statistic. Given that m studies have expression p-values for a given UniGene ID p1,..., p m , the meta-analysis statistic S is defined as
$$S = -2 \sum_{j=1}^{m} C_j \ln p_j \qquad (1)$$
where C j = +1 if a given gene's expression is higher in estrogen receptor negative (ER-) versus estrogen receptor positive (ER+) tumors while C j = -1 if a given gene's expression is higher in ER+ versus ER- tumors in any given study j.
Our convention for expression resulted in large negative values of S implying overexpression of genes associated with ER+ breast cancers while conversely large positive values of S indicated genes overexpressed in ER- breast tumors. Values of S close to zero imply neither over- nor underexpression of the gene. Herein, we will refer to "upregulated" and "downregulated" genes as those genes overexpressed in ER+ tumors, versus genes overexpressed in ER- tumors, respectively.
To compensate for the possibility that large absolute values of S (either + or -) may be due to the contribution of very small p-values from just a few studies rather than consistently significant p-values from multiple studies, we normalized the S statistic by N, the number of studies in which a UniGene ID was present. The additional descriptive statistics that we considered for our meta-analysis included the number of studies that contained a probe for each UniGene ID, and the standard deviation (SD) of the (C j ln p j ) addends of S. These statistics were used for summarization and discovery and not for consideration of any inferential or asymptotic statistical properties of S. We focused our subsequent analyses on a select set of genes by taking medians across each UniGene ID's S/N statistics. We selected sets of ER+ and ER- genes for further study by arbitrarily defining cutoffs at the upper 1% and 5% tails of the S/N distributions and including all genes with those S/N values or greater. We will refer to these as the "top 1% gene lists" and "top 5% gene lists" below. The complete lists of the top 1% and 5% gene sets are in Additional File 1.
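The summary statistics described above can be illustrated with a minimal Python sketch. It is not the authors' JMP workflow. It assumes the signed statistic takes the form S = -2 Σ C j ln p j, which matches the stated sign convention (positive S for ER- genes), and it assumes one simple way of placing an upper-tail cutoff. The function names `signed_fisher` and `upper_tail` are hypothetical.

```python
import math
import statistics


def signed_fisher(pvalues, directions):
    """Signed Fisher-type summary for one UniGene ID.

    pvalues    -- per-study p-values (0 < p <= 1), one per study with a probe
    directions -- +1 if expression is higher in ER- tumors, -1 if higher in ER+
    Assumption: S = -2 * sum(C_j * ln p_j), so S > 0 points to ER- overexpression.
    """
    if not pvalues or len(pvalues) != len(directions):
        raise ValueError("need at least one study and one direction per p-value")
    addends = [c * math.log(p) for p, c in zip(pvalues, directions)]
    n = len(addends)
    s = -2.0 * sum(addends)
    # SD of the C_j * ln(p_j) addends; undefined when only one study contributes
    sd = statistics.stdev(addends) if n > 1 else float("nan")
    return {"S": s, "N": n, "S_over_N": s / n, "SD": sd}


def upper_tail(scores, fraction):
    """Return IDs whose score is at or above the cutoff for the upper tail.

    Assumption: the cutoff is the k-th largest score, k = ceil(fraction * n).
    For ER+ genes (negative S/N), pass the negated scores.
    """
    if not scores:
        return set()
    ranked = sorted(scores.values(), reverse=True)
    k = max(1, math.ceil(fraction * len(ranked)))
    cutoff = ranked[k - 1]
    return {gene for gene, value in scores.items() if value >= cutoff}
```

Dividing by N keeps a gene measured in only two studies from being ranked below an equally consistent gene measured in nine, which is the motivation given for the normalization.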
The difficulty of different gene annotation and naming conventions is well-known [68–71] and mandated that we select a common gene identifier. Since probes were dependent on the array platform used in each original study, it was necessary to collapse the probes into one common identifier prior to our meta-analyses. We chose the UniGene nomenclature as a common identifier across all microarray probe sets. UniGene identifiers were chosen because each UniGene ID may capture multiple expressed sequence tags (ESTs) on any given array. The lack of common probes or genes often occurs in array studies and is one possible explanation for the disparate gene sets identified between array studies.
We used the batch formatting of the GEPAS ID Converter hosted by the Bioinformatics Department at CIPF [74, 75]. Owing to the diversity of probe nomenclature present on these arrays, our imported IDs included GenBank Accession numbers, clone IDs/IMAGE tags, and Affymetrix IDs. If a study's probe ID did not map to a UniGene ID, no information was contributed to the meta-analyzed expression value. In studies containing multiple probes for a given UniGene ID, each expression value was retained; we neither collapsed nor statistically summarized expression values when multiple probes measured the same UniGene ID.
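The handling of unmapped and multiply mapped probes can be sketched as follows. This is an illustration only: the record layout, the function name `group_by_unigene`, and all identifiers in the usage note are hypothetical and do not reproduce the GEPAS output format.

```python
from collections import defaultdict


def group_by_unigene(records, probe_to_unigene):
    """Group per-probe results under their UniGene ID.

    records          -- iterable of (study, probe_id, p_value, direction)
    probe_to_unigene -- dict mapping probe IDs to UniGene IDs (assumed input)
    Probes without a mapping are dropped; multiple probes that map to the
    same UniGene ID are all retained rather than averaged.
    """
    grouped = defaultdict(list)
    for study, probe_id, p_value, direction in records:
        unigene = probe_to_unigene.get(probe_id)
        if unigene is None:
            continue  # unmapped probe contributes no information
        grouped[unigene].append((study, p_value, direction))
    return dict(grouped)
```

For example, two probes from one study that both map to a made-up ID such as `Hs.0001` yield two entries under that ID, while a probe absent from the mapping is skipped.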
To screen for known motifs in the promoters of our ER+ and ER- gene classes we used a previously defined collapsed set of motifs from the TRANSFAC database v7.4 whereby highly redundant motifs were eliminated using weight matrix similarity as described in Xie et al. Xie et al. also identified conserved mammalian phylogenetic motifs in the promoter and 3'UTR domains; these served as our reference motifs. MicroRNA 6-mer and 7-mer seed sequences corresponding to nucleotides 2–7 and 2–8 from the miRNA 5' end were from miRBase release 9.1. We obtained RefSeq accession numbers mapping to each UniGene ID cluster by passing UniGene IDs from the top 1% and 5% gene lists through the conversion tool DAVID. RefSeq transcripts that were redundant as either duplicates or subsequences of other entries were removed. This removed redundancies that may unduly bias our motif comparisons, yet retained the sequences of as many transcripts as possible. RefSeq genomic intervals containing promoters and 3'UTRs were harvested from genomic resources (UCSC Genome Browser, NCBI Build 36.1). For each sequence list and each motif, a custom Python script counted the number of sequences with one or more motif occurrences within the set, and a Fisher exact test evaluated the significance of over- or under-representation in the ER+ versus the ER- sets.
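The counting and testing step can be approximated by the sketch below. It is not the authors' script. It assumes motifs are consensus strings written with IUPAC degeneracy codes, scans only the strand supplied, and uses one common definition of the two-sided Fisher exact p-value (summing all tables no more probable than the observed one). All function names are hypothetical.

```python
import math
import re

# IUPAC nucleotide codes expanded to regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def count_hits(sequences, motif):
    """Number of sequences containing the motif at least once."""
    pattern = re.compile("".join(IUPAC[base] for base in motif.upper()))
    return sum(1 for seq in sequences if pattern.search(seq.upper()))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # hypergeometric probability of x hits in the first row
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def motif_enrichment(er_pos_seqs, er_neg_seqs, motif):
    """Hit counts in each set and the p-value for their difference."""
    hits_pos = count_hits(er_pos_seqs, motif)
    hits_neg = count_hits(er_neg_seqs, motif)
    p = fisher_exact_two_sided(
        hits_pos, len(er_pos_seqs) - hits_pos,
        hits_neg, len(er_neg_seqs) - hits_neg,
    )
    return hits_pos, hits_neg, p
```

Counting each sequence once, regardless of how many copies of a motif it carries, matches the "one or more motif occurrences" criterion and keeps the test a comparison of proportions between the two gene sets.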
For our promoter intervals some phylogenetic motifs represented sequences or subsequences of known TFBS while others were novel motifs having no known binding factors. For example, the phylogenetically conserved mammalian motif CAGGTG is a core subsequence for the E-box motif of helix-loop-helix TFs as well as the known binding site for the transcription factor MYC (SCACGTG). Alternatively, the phylogenetically conserved motif AGCYRWTTC does not represent any known TFBS. We limited our search to the proximal promoter space ranging from 2 kb 5' of the transcription start site (TSS) to 2 kb 3' downstream. If the translation start site was within 2 kb of the TSS, the shorter region was chosen so as to not overlap with the first coding exon. Collectively these promoter motifs ranged in length from 6–17 nucleotides. We separately screened the top one and five percent categories overexpressed in ER+ tumors (S-) and compared this to the same motif in genes overexpressed in the top one and five percent of ER- tumors (S+) respectively.
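The promoter window rule can be sketched as below, under stated assumptions: positions sit on a single genomic coordinate axis, the strand is given as "+" or "-", and the window is returned as plain boundary positions, with inclusive versus exclusive ends left to the caller. The function name `promoter_window` is hypothetical.

```python
def promoter_window(tss, translation_start, strand, flank=2000):
    """Return (start, end) bounds of the proximal promoter window.

    The window spans `flank` bases on either side of the TSS. If the
    translation start lies within `flank` bases downstream of the TSS,
    the downstream side is shortened so the window stops at it.
    """
    start = max(0, tss - flank)  # never run off the start of the chromosome
    end = tss + flank
    if strand == "+":
        # downstream is toward higher coordinates
        if tss <= translation_start < end:
            end = translation_start
    elif strand == "-":
        # downstream is toward lower coordinates
        if start < translation_start <= tss:
            start = translation_start
    else:
        raise ValueError("strand must be '+' or '-'")
    return start, end
```

For a plus-strand gene with the TSS at position 10000 and translation starting at 10500, the window would run from 8000 to 10500 instead of to 12000.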
Table 7 shows the results for our UniGene ID conversion, number of RefSeq mRNAs identified, and the final number of RefSeq mRNAs after subsequence filtering. Though our initial analysis returned more RefSeq mRNAs than input UniGene IDs, after subsequence filtering the yield of RefSeq mRNAs ranged from 77–98%. Collectively, we feel this represents a balanced collection of unique RefSeq IDs minimizing transcript redundancy yet faithfully representing the transcript diversity observed in our meta-analysis. For promoter analyses we surveyed both coding and non-coding strands as this provided a comprehensive survey of the motif distribution since earlier work suggests functional TFBS may be independent of strand orientation. Additionally, we elected to survey the entirety of the sequence space without filtering repeat elements as previous studies demonstrate that TFBS may reside in these elements. For palindromic motifs we only screened the coding strand in our promoter survey.
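The strand handling described above can be illustrated with a short sketch. It assumes IUPAC consensus motifs and treats a motif as palindromic when it equals its own reverse complement. Searching the coding strand for the reverse complement of a motif is used here as the equivalent of scanning the non-coding strand. The function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement of each IUPAC code (e.g. R = A/G pairs with Y = C/T)
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def reverse_complement(motif):
    """Reverse complement of an IUPAC consensus string."""
    return "".join(COMPLEMENT[base] for base in reversed(motif.upper()))


def is_palindromic(motif):
    """True when the motif reads the same on both strands."""
    return motif.upper() == reverse_complement(motif)


def has_motif(sequence, motif):
    """True if the motif occurs on either strand of the sequence.

    Palindromic motifs are searched on the given strand only, since a
    second search would find the same sites again.
    """
    patterns = [motif.upper()]
    if not is_palindromic(motif):
        patterns.append(reverse_complement(motif))
    seq = sequence.upper()
    return any(
        re.search("".join(IUPAC[base] for base in pattern), seq)
        for pattern in patterns
    )
```

Skipping the second strand for palindromic motifs such as CACGTG avoids finding the same sites twice.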
- ER+ and ER-:
estrogen receptor positive and negative breast cancer, respectively
- 3'UTR:
3' untranslated region
- TFBS:
transcription factor binding site
- ESTs:
expressed sequence tags
- TSS:
transcription start site
- Oncomine DB:
Oncomine database
- SD:
standard deviation
- S+ and S-:
lists of genes overexpressed in ER- tumors and ER+ tumors, respectively.
Perou CM, Jeffrey SS, van de Rijn M, Rees CA, Eisen MB, Ross DT, Pergamenschikov A, Williams CF, Zhu SX, Lee JC, et al.: Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci USA 1999, 96: 9212–9217. 10.1073/pnas.96.16.9212
Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, et al.: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci USA 2001, 98: 10869–10874. 10.1073/pnas.191367098
van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, et al.: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415: 530–536. 10.1038/415530a
Massague J: Sorting out breast-cancer gene signatures. N Engl J Med 2007, 356: 294–297. 10.1056/NEJMe068292
Fan C, Oh DS, Wessels L, Weigelt B, Nuyten DS, Nobel AB, van't Veer LJ, Perou CM: Concordance among gene-expression-based predictors for breast cancer. N Engl J Med 2006, 355: 560–569. 10.1056/NEJMoa052933
Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T, et al.: A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 2004, 351: 2817–2826. 10.1056/NEJMoa041588
Halvorsen KT, Burdick E, Colditz GA, Frazier HS, Mosteller F: Combining results from independent investigations: meta-analysis in clinical research. In Medical uses of statistics. 2nd edition. Edited by: Bailar JC, Mosteller F. Boston, MA: NEJM Books; 1992:413–426.
Robey RR, Dalebout SD: A tutorial on conducting meta-analyses of clinical outcome research. J Speech Lang Hear Res 1998, 41: 1227–1241.
Smith DD, Givens GH, Tweedie RL: Adjusting for publication and quality bias in Bayesian meta-analysis. In Meta-analysis in Medicine and Health Policy. Edited by: Stangl DK, Berry DA. New York: Marcel Dekker; 2000:277–304.
Egger M, Smith GD: Meta-Analysis. Potentials and promise. BMJ 1997, 315: 1371–1374.
Rhodes D, Barrette T, Rubin M, Ghosh D, Chinnaiyan A: Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res 2002, 62: 4427–4433.
Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci USA 2004, 101: 9309–9314. 10.1073/pnas.0401994101
Schneider J, Ruschhaupt M, Buness A, Asslaber M, Regitnig P, Zatloukal K, Schippinger W, Ploner F, Poustka A, Sultmann H: Identification and meta-analysis of a small gene expression signature for the diagnosis of estrogen receptor status in invasive ductal breast cancer. Int J Cancer 2006.
Rhodes DR, Barrette TR, Rubin MA, Ghosh D, Chinnaiyan AM: Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res 2002, 62: 4427–4433.
Lyman GH, Kuderer NM: Gene expression profile assays as predictors of recurrence-free survival in early-stage breast cancer: a metaanalysis. Clin Breast Cancer 2006, 7: 372–379.
Mehra R, Varambally S, Ding L, Shen R, Sabel MS, Ghosh D, Chinnaiyan AM, Kleer CG: Identification of GATA3 as a breast cancer prognostic marker by global gene expression meta-analysis. Cancer Res 2005, 65: 11259–11264. 10.1158/0008-5472.CAN-05-2495
Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N: Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet 2003, 34: 166–176.
Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science 2003, 302: 249–255. 10.1126/science.1087447
Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB: Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci USA 2002, 99: 757–762. 10.1073/pnas.231608898
Chang HY, Nuyten DS, Sneddon JB, Hastie T, Tibshirani R, Sorlie T, Dai H, He YD, van't Veer LJ, Bartelink H, et al.: Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proc Natl Acad Sci USA 2005, 102: 3738–3743. 10.1073/pnas.0409462102
Paik S, Tang G, Shak S, Kim C, Baker J, Kim W, Cronin M, Baehner FL, Watson D, Bryant J, et al.: Gene expression and benefit of chemotherapy in women with node-negative, estrogen receptor-positive breast cancer. J Clin Oncol 2006, 24: 3726–3734. 10.1200/JCO.2005.04.7985
Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci USA 2004, 101: 9309–9314. 10.1073/pnas.0401994101
Thompson EW, Paik S, Brunner N, Sommers CL, Zugmaier G, Clarke R, Shima TB, Torri J, Donahue S, Lippman ME, et al.: Association of increased basement membrane invasiveness with absence of estrogen receptor and expression of vimentin in human breast cancer cell lines. J Cell Physiol 1992, 150: 534–544. 10.1002/jcp.1041500314
Pichon MF, Broet P, Magdelenat H, Delarue JC, Spyratos F, Basuyau JP, Saez S, Rallet A, Courriere P, Millon R, Asselain B: Prognostic value of steroid receptors after long-term follow-up of 2257 operable breast cancers. Br J Cancer 1996, 73: 1545–1551.
Ingenuity Pathway Analysis [http://www.ingenuity.com]
Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc B 1995, 57: 289–300.
Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature 2005, 434: 338–345. 10.1038/nature03441
Dillner NB, Sanders MM: Transcriptional activation by the zinc-finger homeodomain protein delta EF1 in estrogen signaling cascades. DNA Cell Biol 2004, 23: 25–34. 10.1089/104454904322745907
Eger A, Aigner K, Sonderegger S, Dampier B, Oehler S, Schreiber M, Berx G, Cano A, Beug H, Foisner R: DeltaEF1 is a transcriptional repressor of E-cadherin and regulates epithelial plasticity in breast cancer cells. Oncogene 2005, 24: 2375–2385. 10.1038/sj.onc.1208429
Farh KK, Grimson A, Jan C, Lewis BP, Johnston WK, Lim LP, Burge CB, Bartel DP: The widespread impact of mammalian MicroRNAs on mRNA repression and evolution. Science 2005, 310: 1817–1821. 10.1126/science.1121158
Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB: Prediction of mammalian microRNA targets. Cell 2003, 115: 787–798. 10.1016/S0092-8674(03)01018-3
Brennecke J, Stark A, Russell RB, Cohen SM: Principles of microRNA-target recognition. PLoS Biol 2005, 3: e85. 10.1371/journal.pbio.0030085
Lewis BP, Burge CB, Bartel DP: Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 2005, 120: 15–20. 10.1016/j.cell.2004.12.035
Iorio MV, Ferracin M, Liu CG, Veronese A, Spizzo R, Sabbioni S, Magri E, Pedriali M, Fabbri M, Campiglio M, et al.: MicroRNA gene expression deregulation in human breast cancer. Cancer Res 2005, 65: 7065–7070. 10.1158/0008-5472.CAN-05-1783
Volinia S, Calin GA, Liu CG, Ambs S, Cimmino A, Petrocca F, Visone R, Iorio M, Roldo C, Ferracin M, et al.: A microRNA expression signature of human solid tumors defines cancer gene targets. Proc Natl Acad Sci USA 2006, 103: 2257–2261. 10.1073/pnas.0510565103
Takamizawa J, Konishi H, Yanagisawa K, Tomida S, Osada H, Endoh H, Harano T, Yatabe Y, Nagino M, Nimura Y, et al.: Reduced expression of the let-7 microRNAs in human lung cancers in association with shortened postoperative survival. Cancer Res 2004, 64: 3753–3756. 10.1158/0008-5472.CAN-04-0637
Ho Sui SJ, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP, Wasserman WW: oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res 2005, 33: 3154–3164. 10.1093/nar/gki624
Zhu Z, Pilpel Y, Church GM: Computational identification of transcription factor binding sites via a transcription-factor-centric clustering (TFCC) algorithm. J Mol Biol 2002, 318: 71–81. 10.1016/S0022-2836(02)00026-8
Huber BR, Bulyk ML: Meta-analysis discovery of tissue-specific DNA sequence motifs from mammalian gene expression data. BMC Bioinformatics 2006, 7: 229. 10.1186/1471-2105-7-229
Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al.: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 2006, 34: D108–110. 10.1093/nar/gkj143
Hatsell S, Rowlands T, Hiremath M, Cowin P: Beta-catenin and Tcfs in mammary development and cancer. J Mammary Gland Biol Neoplasia 2003, 8: 145–158. 10.1023/A:1025944723047
Morin PJ: beta-catenin signaling and cancer. Bioessays 1999, 21: 1021–1030. 10.1002/(SICI)1521-1878(199912)22:1<1021::AID-BIES6>3.0.CO;2-P
Waterman ML: Lymphoid enhancer factor/T cell factor expression in colorectal cancer. Cancer Metastasis Rev 2004, 23: 41–52. 10.1023/A:1025858928620
Takeda H, Lyle S, Lazar AJ, Zouboulis CC, Smyth I, Watt FM: Human sebaceous tumors harbor inactivating mutations in LEF1. Nat Med 2006, 12: 395–397. 10.1038/nm1386
Jeong EG, Lee SH, Yoo NJ, Lee SH: Mutational analysis of Wnt pathway gene LEF1 in common human carcinomas. Dig Liver Dis 2007, 39: 287–288. 10.1016/j.dld.2006.11.005
Carroll JS, Meyer CA, Song J, Li W, Geistlinger TR, Eeckhoute J, Brodsky AS, Keeton EK, Fertuck KC, Hall GF, et al.: Genome-wide analysis of estrogen receptor binding sites. Nat Genet 2006.
Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Barrette TR, Ghosh D, Chinnaiyan AM: Mining for regulatory programs in the cancer transcriptome. Nat Genet 2005, 37: 579–583. 10.1038/ng1578
Zhu W, Giangrande PH, Nevins JR: E2Fs link the control of G1/S and G2/M transcription. EMBO J 2004, 23: 4615–4626. 10.1038/sj.emboj.7600459
Berendzen KW, Stuber K, Harter K, Wanke D: Cis-motifs upstream of the transcription and translation initiation sites are effectively revealed by their positional disequilibrium in eukaryote genomes using frequency distribution curves. BMC Bioinformatics 2006, 7: 522. 10.1186/1471-2105-7-522
Thompson SG: Why sources of heterogeneity in meta-analysis should be investigated. BMJ 1994, 309: 1351–1355.
Thompson SG, Sharp SJ: Explaining heterogeneity in meta-analysis: a comparison of methods. Stat Med 1999, 18: 2693–2708. 10.1002/(SICI)1097-0258(19991030)18:20<2693::AID-SIM235>3.0.CO;2-V
Deville WL, Buntinx F, Bouter LM, Montori VM, de Vet HC, van der Windt DA, Bezemer PD: Conducting systematic reviews of diagnostic studies: didactic guidelines. BMC Med Res Methodol 2002, 2: 9. 10.1186/1471-2288-2-9
Song F, Sheldon TA, Sutton AJ, Abrams KR, Jones DR: Methods for exploring heterogeneity in meta-analysis. Eval Health Prof 2001, 24: 126–151.
Petitti DB: Approaches to heterogeneity in meta-analysis. Stat Med 2001, 20: 3625–3633. 10.1002/sim.1091
Givens GH, Smith DD, Tweedie RL: Bayesian data-augmented meta-analysis that accounts for publication bias issues exemplified in the passive smoking debate. Statistical Science 1997, 12: 221–250. 10.1214/ss/1030037958
Walter SD: Variation in baseline risk as an explanation of heterogeneity in meta-analysis. Stat Med 1997, 16: 2883–2900. 10.1002/(SICI)1097-0258(19971230)16:24<2883::AID-SIM825>3.0.CO;2-B
Stark A, Brennecke J, Bushati N, Russell RB, Cohen SM: Animal MicroRNAs confer robustness to gene expression and have a significant impact on 3'UTR evolution. Cell 2005, 123: 1133–1146. 10.1016/j.cell.2005.11.023
Sinnett D, Beaulieu P, Belanger H, Lefebvre JF, Langlois S, Theberge MC, Drouin S, Zotti C, Hudson TJ, Labuda D: Detection and characterization of DNA variants in the promoter regions of hundreds of human disease candidate genes. Genomics 2006, 87: 704–710. 10.1016/j.ygeno.2006.01.001
Kreiman G: Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes. Nucleic Acids Res 2004, 32: 2889–2900. 10.1093/nar/gkh614
Baroukh N, Ahituv N, Chang J, Shoukry M, Afzal V, Rubin EM, Pennacchio LA: Comparative genomic analysis reveals a distant liver enhancer upstream of the COUP-TFII gene. Mamm Genome 2005, 16: 91–95. 10.1007/s00335-004-2442-9
Chen K, Rajewsky N: Natural selection on human microRNA binding sites inferred from SNP data. Nat Genet 2006, 38: 1452–1456. 10.1038/ng1910
Saunders MA, Liang H, Li WH: Human polymorphism at microRNAs and microRNA target sites. Proc Natl Acad Sci USA 2007, 104: 3300–3305. 10.1073/pnas.0611347104
Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: ONCOMINE: a cancer microarray database and integrated data-mining platform. Neoplasia 2004, 6: 1–6.
Storey JD: The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics 2003, 31: 2013–2035. 10.1214/aos/1074290335
SAS Institute JMP. 5.1.2 edn. Cary, NC; 2004.
Fisher RA: Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd; 1932.
Adams MD, Soares MB, Kerlavage AR, Fields C, Venter JC: Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 1993, 4: 373–380. 10.1038/ng0893-373
Soares MB, Bonaldo MF, Jelene P, Su L, Lawton L, Efstratiadis A: Construction and characterization of a normalized cDNA library. Proc Natl Acad Sci USA 1994, 91: 9228–9232. 10.1073/pnas.91.20.9228
Lennon G, Auffray C, Polymeropoulos M, Soares MB: The I.M.A.G.E. Consortium: an integrated molecular analysis of genomes and their expression. Genomics 1996, 33: 151–152. 10.1006/geno.1996.0177
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL: GenBank. Nucleic Acids Res 2002, 30: 17–20. 10.1093/nar/30.1.17
Pontius JU, Wagner L, Schuler GD: UniGene: a unified view of the transcriptome. In The NCBI Handbook. Bethesda, MD: National Center for Biotechnology Information; 2003.
Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, et al.: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365: 671–679.
Herrero J, Al-Shahrour F, Diaz-Uriarte R, Mateos A, Vaquerizas JM, Santoyo J, Dopazo J: GEPAS: A web-based resource for microarray gene expression data analysis. Nucleic Acids Res 2003, 31: 3461–3467. 10.1093/nar/gkg591
Vaquerizas JM, Conde L, Yankilevich P, Cabezon A, Minguez P, Diaz-Uriarte R, Al-Shahrour F, Herrero J, Dopazo J: GEPAS, an experiment-oriented pipeline for the analysis of microarray gene expression data. Nucleic Acids Res 2005, 33: W616–620. 10.1093/nar/gki500
Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ: miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 2006, 34: D140–144. 10.1093/nar/gkj112
Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003, 4: P3. 10.1186/gb-2003-4-5-p3
Mantovani R: A survey of 178 NF-Y binding CCAAT boxes. Nucleic Acids Res 1998, 26: 1135–1143. 10.1093/nar/26.5.1135
Shankar R, Grover D, Brahmachari SK, Mukerji M: Evolution and distribution of RNA polymerase II regulatory sites from RNA polymerase III dependant mobile Alu elements. BMC Evol Biol 2004, 4: 37. 10.1186/1471-2148-4-37
Zhao H, Langerod A, Ji Y, Nowels KW, Nesland JM, Tibshirani R, Bukholm IK, Karesen R, Botstein D, Borresen-Dale AL, Jeffrey SS: Different gene expression patterns in invasive lobular and ductal carcinomas of the breast. Mol Biol Cell 2004, 15: 2523–2536. 10.1091/mbc.E03-11-0786
Sotiriou C, Neo SY, McShane LM, Korn EL, Long PM, Jazaeri A, Martiat P, Fox SB, Harris AL, Liu ET: Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc Natl Acad Sci USA 2003, 100: 10393–10398. 10.1073/pnas.1732912100
Ma XJ, Salunga R, Tuggle JT, Gaudet J, Enright E, McQuary P, Payette T, Pistone M, Stecker K, Zhang BM, et al.: Gene expression profiles of human breast cancer progression. Proc Natl Acad Sci USA 2003, 100: 5974–5979. 10.1073/pnas.0931261100
van de Vijver MJ, He YD, van't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, et al.: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002, 347: 1999–2009. 10.1056/NEJMoa021967
Gruvberger S, Ringner M, Chen Y, Panavally S, Saal LH, Borg A, Ferno M, Peterson C, Meltzer PS: Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res 2001, 61: 5979–5984.
West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA Jr, Marks JR, Nevins JR: Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci USA 2001, 98: 11462–11467. 10.1073/pnas.201162998
This study was supported by Susan G. Komen for the Cure Basic, Clinical and Translational Research Grant BCTR0504486 (GL). PS and OS were supported by the Norwegian Functional Genomics Program (FUGE) and the Leiv Eriksson program of the Norwegian Research Council.
DDS extracted data from Oncomine, performed the statistical analyses and drafted the manuscript. PS programmed the 3'UTR and promoter parsers and performed the randomization tests. OS conceived of the 3'UTR design and subsequent analysis. CL and GER extracted data from Oncomine and performed the Ingenuity analysis. CG edited the manuscript and contributed to the motif discovery portions. GL designed the project, provided overall guidance, edited and drafted the manuscript. All authors read and approved the final manuscript.