Volume 9 Supplement 9
From microarray to biology: an integrated experimental, statistical and in silico analysis of how the extracellular matrix modulates the phenotype of cancer cells
© Dozmorov et al; licensee BioMed Central Ltd. 2008
Published: 12 August 2008
A statistically robust and biologically-based approach for analysis of microarray data is described that integrates independent biological knowledge and data with a global F-test for finding genes of interest that minimizes the need for replicates when used for hypothesis generation. First, each microarray is normalized to its noise level around zero. The microarray dataset is then globally adjusted by robust linear regression. Second, genes of interest that capture significant responses to experimental conditions are selected by finding those that express significantly higher variance than those expressing only technical variability. Clustering expression data and identifying expression-independent properties of genes of interest including upstream transcriptional regulatory elements (TREs), ontologies and networks or pathways organizes the data into a biologically meaningful system. We demonstrate that when the number of genes of interest is inconveniently large, identifying a subset of "beacon genes" representing the largest changes will identify pathways or networks altered by biological manipulation. The entire dataset is then used to complete the picture outlined by the "beacon genes." This allow construction of a structured model of a system that can generate biologically testable hypotheses. We illustrate this approach by comparing cells cultured on plastic or an extracellular matrix which organizes a dataset of over 2,000 genes of interest from a genome wide scan of transcription. The resulting model was confirmed by comparing the predicted pattern of TREs with experimental determination of active transcription factors.
Microarrays are widely used to overview gene expression landscapes under different experimental conditions. Since their initial appearance microarrays developed into very dependable tools with good inter- and intra-platform reproducibility [1, 2]. Although numerous attempts to unify microarray analysis workflow were made, each manufacturer has its own methods for processing large quantities of data, and there is no general consensus as to the best means to analyze microarray data, and probably never will be. Each experimental situation is different, and different designs may be necessary for hypothesis testing as compared to hypothesis generation. With the former, one biological system is compared with another, and the significance of differences is statistically tested using a t-test, generally with the assumption that each biological sample is homogeneous. In such experiments statistical power becomes the driving consideration. In hypotheses generating experiments, a number of biological situations are compared, for example a series of different cell lines, a time course study or a dose-response study. The biological samples may not be homogeneous. Cost becomes a major consideration because the number of replicates needed to test hypotheses may make experiments prohibitive. Thus, there is a need for analytical approaches to use under hypothesis-generating conditions that are based on sound statistical principles but which nonetheless reduce the number of replicates needed to assemble at least a preliminary global picture of the effect of a particular biological situation on gene expression .
We present here a statistically robust approach for analyzing the changes in the transcriptome that is driven by the underlying biology. Previous work by I. Dozmorov showed that approaches based on separating variability in expression of genes into biological and technical sources provide an alternative means of identifying "genes of interest" for further analysis [4–7]. Under the assumption that in any experiment most genes do not change expression, the F-test is used to identify genes that express more variability than the overall technical variability of the system. This set of genes is referred to as "hypervariable genes," and has been assumed to reflect the relevant biological variables in the system. In this communication we have added a number of in silico tests based upon properties of these genes that do not depend upon expression. These additional analyses confirmed that at the level of transcriptional regulatory networks this approach does identify important genes that can then be assembled into networks of functions, transcriptional regulation and with previous knowledge. This represents a further extension of work published in our laboratory that included only in silico analyses [8, 9].
We applied this method to examining the effect of cancer remodeled extracellular matrix (crECM) on bladder papilloma-derived cell line (RT4) as they grow over the course of several days on a crECM after having been transferred from culture on plastic . Papillomas represent a very early premalignant change, and determining how a crECM can drive them toward malignancy could identify novel targets for therapy. Genes exhibiting major changes in expression introduced by crECM were selected and their functions examined. Two overlapping canonical pathways were identified as the main targets. Finally, transcription factors regulating the genes of interest were found and their validity proven by additional experimental method. Such integrative approach may reveal new roles of unknown genes , new drug targets , and lead to clinical tests .
Gene expression dynamics
Visualization of gene expression changes
This represents a very large set of genes of interest, and other than determining ontologies, such a large number is inconvenient to interpret. We therefore focused on the genes showing the largest changes, considering they could serve as "beacons" to draw our attention to the changes in underlying processes. Because the dynamics of gene expression exhibit mainly a change in state between cells growing on plastic vs. crECM, the set of hypervariable genes was filtered to identify the genes that are expressed below noise level on plastic and highly expressed on crECM ("off-on" genes) and vice versa ("on-off" genes) (Figure 4A). The two arrays of gene expression on plastic and nine arrays of timecourse on crECM were each averaged and genes with average expression level on plastic < 0 and on crECM > 1 (Log10 scale, average identifies geometric mean) were selected as "off-on" genes. The opposite criteria were applied to identify "on-off" genes. Using these stringent criteria, a total of 877 unique "off-on" genes were turned on by the crECM, whereas a total of 74 unique "on-off" genes were shut down by the crECM. The validity of this approach was tested in the next section.
Beside the main dynamic of state change genes a smaller number of genes showed change in level. Figure 3B shows the distribution of the ratio of expression of genes from cells grown on plastic to those grown on crECM. The standard deviation of this ratio is 0.2, and 3 standard deviations (0.6) corresponds to a 3-fold difference. A total of 241 genes were identified that were expressed at least 5 SD above background in cells grown on plastic and showed at least a 3-fold increase in expression. Only 67 genes showed the opposite pattern being highly expressed on plastic and decreasing 3-fold but still expressed above noise.
Gene ontology analysis and visualization
The ontologies of the genes of interest were examined using the Database for Annotation, Visualization and Integrated Discovery (DAVID, http://david.abcc.ncifcrf.gov/)  tool, which examines all the functions represented by each gene in a gene list and identifies groups that share ontologies. The over-represented ontologies form the basis for identifying functional processes represented in the change of state induced by a crECM. Several parameters can be adjusted to achieve a reasonable and comprehensive set of ontologies and associated genes. For the 877 "off-on" genes the following parameters were set: Similarity term overlap: 5; Similarity threshold: 0.5; Initial group membership: 5; Final group membership: 5; Multiple linkage threshold: 0.5, which is equal to the "Highest" stringency setting in DAVID. After examining the results provided by different stringencies, the above set was selected because the picture presented overall affinities without too many groups but provided sufficient detail to build a conceptual model of the effect of crECM on progressing urothelial cells.
Major functional groups overrepresented among state-change genes and corresponding overrepresented TREs. The groups are presented according to the order of significance identified by DAVID. Overrepresented TREs marked in bold are either "off-on" TFs or increased their level; regular – not present in Panomics set; italics – not present under either condition.
Number of genes
Major Gene Ontologies
Gene expression, RNA processing and protein synthesis
Translation initiation factors
v-Maf, SOX-9, FOXJ2, CP2, HFH-3, Elk-1, NRF-2, AREB6
RNA processing – ribosome biogenesis
NGFI-C, GR, HNF-4, YY1, Elk-1, NRF-2, v-Myb, NF-κB, TATA, c-Myc
Cell signaling proteins
GABA receptor/ion channels
Ion channels, K
Post-translational modification and regulatory control
CDP, CR3+HD, CRE-BP1, CCAAT
HNF-3β, CDP CR3+HD, E2, NF-κB, USF
Immune function associated with suppression of effector T-cells
Transmembrane immunoglobulin-like proteins
NF-κB, v-Maf, RSRFC4, FOXJ2, AP-1, Pax-4, USF, CDP, Brn-2
Transmembrane proteins of unknown significance
Examination of 241 level-change genes increased on crECM by medium stringency ontological analysis of these genes found five clusters of functions, three of which were similar to those for state change.
The 74 genes that were shut off and the 64 genes that were 3-fold down-regulated on crECM were less informative than were those that were turned on or up-regulated. At same stringency as was used for "off-on" genes, one over-represented cluster was identified in the genes that were shut off and consisted of 6 transcription factors sharing homeobox ontology. Decreasing the stringency to "medium" (the default for DAVID) increased the number of genes in the cluster to 10 but did not add clusters. Genes that were down-regulated at least 3-fold yielded two clusters under medium stringency.
The validity of the selection of "beacon genes" was tested by comparing the ontological clusters observed with the entire set of 2743 HV genes. A total of 17 clusters was seen, all of which were identified using the "beacon" genes. This demonstrates that the smaller data set of state- and level-change genes will identify all the processes seen in the larger set of HV genes.
Having preliminary understanding of functions represented by state change genes, they were probed for membership in canonical pathways by Ingenuity© Pathway Analysis (IPA, http://www.ingenuity.com/). IPA maps each gene identifier to its corresponding gene object in the Ingenuity© Pathways Knowledge Base, and generates multiple biological networks with associated ontologies from a list of focus genes, as well as general gene ontologies overrepresented. IPA canonical pathways analysis identified the most significant known biological pathways for a given set of genes. For identification of a significant canonical pathway or pre-defined network of genes it is only necessary to identify a single member as significant, hence the term "beacon" genes. The participation of other members of the network is checked manually against the entire data set to ensure they are expressed, or show a smaller change than the "beacon" genes threshold . Pathways or connections involving genes that are not expressed are deleted. This process is particularly helpful in cases where a large number of genes of interest has been discovered. The smaller, more tractable set of "beacon" genes are used to draw attention to processes, and all the details are filled in with the entire data set as is shown below.
Identification of potential transcriptional networks
The predictions of PAINT were tested using an independent experimental assay that measured the DNA binding activity of transcription factors using TranSignal Protein/DNA Combo Arrays. This array allows estimation of binding activity of 345 TFs. We found a number of up-regulated transcription factors on crECM. The bar diagram in Figure 6B compares the activity of selected transcription factors in cells grown on plastic vs. crECM. The majority of transcription factors that showed large changes in activity were also identified by PAINT as driving up- or down-regulated clusters of genes, the results are shown in Table 1 with TREs confirmed by Panomics array highlighted in bold. Depending on stringency, between 5 and 13 transcription factors were shut off and between 25 and 40 were activated by the crECM. Supplemental table S1 in Additional file 1 shows DNA binding activity of all transcription factors in cells grown on plastic or crECM. Most of the changes were "off-on," as was observed for the mRNAs of the downstream genes. Some transcription factors were active in cells growing under both conditions.
In this study we present a self-guided approach for analyzing a complex biological change by microarrays and illustrate its use to describe the complex change in gene expression that occur when papilloma cells are placed on a crECM. The flowchart of each step of this approach is shown in Figure 1. We also confirm the validity of the integrated approach by independent verification of the predictions of transcriptional regulatory networks. With a very complex biological system mobility more than 2000 genes identified as significantly varying, we show that the essential elements of the change in the large scale picture of the biology can be captured in a smaller subset of "beacon" genes. Analysis of this more concise set of "beacons" facilitates mapping the gene expression dynamics onto known processes . We wish to emphasize that the resulting biological picture does not derive solely from indentifying only a few key genes. This approach also requires that all members of a pathway be expressed, which is determined by comparing putative networks or canonical pathways against the entire dataset of expressed genes. Genes showing smaller changes than shown by the "beacons" are identified against the set of HV genes.
The approach also is statistically robust. Expression is judged against the uncertainty of the zero point, and the threshold can be selected either to minimize false negatives or false positives. The need for replicates, and therefore the cost of experiments, is minimized using a global F-test against the variance of the system as a whole with a p-value standard of 1/N that minimizes false positive identification of significant genes. The HV gene approach is best suited to providing an overall description for hypothesis generation with multiple biological variables as opposed to hypothesis-testing in a two-state system (e.g. treated and control).
In summary, this article demonstrates an approach to microarray analysis that organizes the findings into a biologically based model that should in turn, facilitate generation and testing of hypotheses because the analysis itself is structured around the properties of biological system. In this case, the findings suggest that G-protein signaling plays a major role in the modulation of phenotype by crECM, that the cells are differentiated and acquire specialized functions (e.g. immune function and transmembrane proteins) and that several transcription factors regulate the process.
The RT4 bladder transitional cell papilloma cell line (American Type Culture Collection, Manassas Virginia) was cultured on plastic and on cancer-remodeled ECM, Matrigel™ (crECM), and RNA was isolated as described previously using the RNAeasy kit (QIAGEN Inc., Valencia, CA) [8, 19].
Microarray data were obtained using a spotted array from cells cultured on plastic (two arrays) and across 9 days time course of growth on crECM, as previously described . Cy3 labeled cDNA was synthesized and hybridized onto glass arrays spotted with 22,464 long oligos (~70 mers) from the UniGene database of functionally known genes.
Expression data were normalized to the variability around the zero point, as described previously [5, 7]. Genes were considered to be expressed if their expression normalized to the S.D. of zero point exceeded 3.0 (p < 0.001). The arrays were then globally adjusted to each other by robust linear regression. Genes expressing higher variability than the technical variability of the system (hypervariable, or HV genes) were identified as described previously  and as shown in Figure 3B. Two thresholds were used. HV genes showed relative standard deviations > 3.8 SD above that of the population mode (p < 6 .7 × 10-5). This value is 1/N, where N is the number of expressed genes (~15,000). A more stringent criterion of > 5 SD (p < 2.87 × 10-7) was also used to identify a subset of very HV, or "beacon" genes. Their expression profiles were clustered by the Cluster 3.0 program http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm. Hierarchical clustering identified major patterns of gene expression changes; further clustering of selected "beacon" genes was done using K-means clustering, k = 3, which was selected as described below. Significantly over-represented gene ontologies were identified using the Database for Annotation, Visualization and Integrated Discovery (DAVID, http://david.abcc.ncifcrf.gov/) . Biologically relevant networks were assembled from identified clusters and groups of common genes by using Ingenuity Pathways Analysis (IPA, http://www.ingenuity.com/). Each gene identifier was mapped to its corresponding gene object in the Ingenuity Pathways Knowledge Base. Genes were not weighted by expression levels, and biological networks were built on this assumption. Analysis of common TREs shared by genes in each ontological cluster was performed by using the web-based program PAINT http://www.dbi.tju.edu/dbi/tools/paint/ against whole list of genes in microarray.
Additional assessment of the DNA binding activity of transcription factors using TranSignal Protein/DNA Combo Arrays (spin column version, # MA1215, Panomics, Redwood City, CA) was conducted according to the manufacturer's protocol. Briefly, cell nuclear extracts were incubated with biotin-labeled oligonucleotides that possess consensus DNA binding sites for 345 transcription factors. The protein bound probes were isolated by using a spin column and then hybridized to the DNA/protein array. After the DNA/protein array was washed, the array was incubated with detection solution and images of the chemiluminescent signal were captured using an Alpha Imager (Alpha Innotech, San Leandro, CA) and quantitated by using AlphaEase software and standardized against biotinylated DNA spots on the membrane. The results are linear between 0 and 100 units. In order to detect low expression transcription factor activity, two exposure times were used, 8 and 25 min. Values were adjusted for exposure so that all values were measured within the linear range of the assay. A histogram of expression data was plotted and was found to be bimodal, with one mode centered about zero. Background subtraction was performed by calculating the standard deviation of this distribution and subtracting 3 standard deviations above the mode from all expression values, approximately 6 units.
List of abbreviations used
complementary deoxyribonucleic acid
cancer-remodeled extracellular matrix
Database for Annotation, Visualization and Integrated Discovery
false discovery rate
messenger ribonucleic acid
Promoter Analysis and Interaction Network Toolset
transcription regulatory element
The authors gratefully acknowledge the technical assistance of Jean Coffman. This work was supported in part by a grant from the NIH to REH, R01 DK069808
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 9, 2008: Proceedings of the Fifth Annual MCBIOS Conference. Systems Biology: Bridging the Omics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S9
- Klebanov L, Yakovlev A: How high is the level of technical noise in microarray data? Biol Direct 2007, 2: 9.PubMed CentralView ArticlePubMedGoogle Scholar
- Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de LF, Kawasaki ES, Lee KY, et al.: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 2006, 24: 1151–1161.View ArticlePubMedGoogle Scholar
- Huang S: Gene expression profiling, genetic networks, and cellular states: an integrating concept for tumorigenesis and drug discovery. J Mol Med 1999, 77: 469–480.View ArticlePubMedGoogle Scholar
- Dozmorov I, Centola M: An associative analysis of gene expression array data. Bioinformatics 2003, 19: 204–211.View ArticlePubMedGoogle Scholar
- Dozmorov I, Knowlton N, Tang Y, Shields A, Pathipvanich P, Jarvis JN, Centola M: Hypervariable genes–experimental error or hidden dynamics. Nucleic Acids Res 2004, 32: e147.PubMed CentralView ArticlePubMedGoogle Scholar
- Dozmorov I, Knowlton N, Tang Y, Centola M: Statistical monitoring of weak spots for improvement of normalization and ratio estimates in microarrays. BMC Bioinformatics 2004, 5: 53.PubMed CentralView ArticlePubMedGoogle Scholar
- Knowlton N, Dozmorov IM, Centola M: Microarray Data Analysis Toolbox (MDAT): for normalization, adjustment and analysis of gene expression data. Bioinformatics 2004, 20: 3687–3690.View ArticlePubMedGoogle Scholar
- Dozmorov MG, Kyker KD, Saban R, Knowlton N, Dozmorov IM, Centola M, Hurst RE: Analysis of the Interaction of Extracellular Matrix and Phenotype of Bladder Cancer Cells. BMC Cancer 2006, 6: 12.PubMed CentralView ArticlePubMedGoogle Scholar
- Dozmorov MG, Kyker KD, Saban R, Shankar N, Baghdayan AS, Centola M, Hurst RE: Systems biology approach for mapping the response of human urothelial cells to infection by Enterococcus faecalis. BMC Bioinformatics 2007, 8: 52.View ArticleGoogle Scholar
- Hurst RE, Kyker KD, Bonner RB, Bowditch RG, Hemstreet GP: Matrix-Dependent Plasticity of the Malignant Phenotype of Bladder Cancer Cells. Anticancer Res 2003, 23: 3119–3128.PubMed CentralPubMedGoogle Scholar
- Smith BA, Kennedy WJ, Harnden P, Selby PJ, Trejdosiewicz LK, Southgate J: Identification of genes involved in human urothelial cell-matrix interactions: implications for the progression pathways of malignant urothelium. Cancer Res 2001, 61: 1678–1685.PubMedGoogle Scholar
- Fournier MV, Martin KJ, Kenny PA, Xhaja K, Bosch I, Yaswen P, Bissell MJ: Gene expression signature in organized and growth-arrested mammary acini predicts good outcome in breast cancer. Cancer Res 2006, 66: 7095–7102.PubMed CentralView ArticlePubMedGoogle Scholar
- Simon R: Development and evaluation of therapeutically relevant predictive classifiers using gene expression profiling. J Natl Cancer Inst 2006, 98: 1169–1171.View ArticlePubMedGoogle Scholar
- Saldanha AJ: Java Treeview–extensible visualization of microarray data. Bioinformatics 2004, 20: 3246–3248.View ArticlePubMedGoogle Scholar
- Juan HF, Huang HC: Bioinformatics: microarray data clustering and functional classification. Methods Mol Biol 2007, 382: 405–416.View ArticlePubMedGoogle Scholar
- Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003,4(5):P3.View ArticlePubMedGoogle Scholar
- Vadigepalli R, Chakravarthula p, Zak DE, Schwaber JS, Gonye GE: PAINT: A Promoter Analysis and Interaction Network Generation Tool for Gene Regulatory Network Identification. OMICS: A Journal of Integrative Biology 2003, 7: 235–252.View ArticlePubMedGoogle Scholar
- Boutros PC, Okey AB: Unsupervised pattern recognition: an introduction to the whys and wherefores of clustering microarray data. Brief Bioinform 2005, 6: 331–343.View ArticlePubMedGoogle Scholar
- Kyker KD, Culkin DJ, Hurst RE: A model for 3-dimensional growth of bladder cancers to study mechanisms of phenotypic expression. Urologic Oncology 2003, 21: 255–261.View ArticlePubMedGoogle Scholar
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.