Computational expression deconvolution in a complex mammalian organ

Background Microarray expression profiling has been widely used to identify differentially expressed genes in complex cellular systems. However, while such methods can be used to directly infer intracellular regulation within homogeneous cell populations, interpretation of in vivo gene expression data derived from complex organs composed of multiple cell types is more problematic. Specifically, observed changes in gene expression may be due either to changes in gene regulation within a given cell type or to changes in the relative abundance of expressing cell types. Consequently, bona fide changes in intrinsic gene regulation may be either mimicked or masked by changes in the relative proportion of different cell types. To date, few analytical approaches have addressed this problem. Results We have chosen to apply a computational method for deconvoluting gene expression profiles derived from intact tissues by using reference expression data for purified populations of the constituent cell types of the mammary gland. These data were used to estimate changes in the relative proportions of different cell types during murine mammary gland development and Ras-induced mammary tumorigenesis. These computational estimates of changing compartment sizes were then used to enrich lists of differentially expressed genes for transcripts that change as a function of intrinsic intracellular regulation rather than shifts in the relative abundance of expressing cell types. Using this approach, we have demonstrated that adjusting mammary gene expression profiles for changes in three principal compartments – epithelium, white adipose tissue, and brown adipose tissue – is sufficient both to reduce false-positive changes in gene expression due solely to changes in compartment sizes and to reduce false-negative changes by unmasking genuine alterations in gene expression that were otherwise obscured by changes in compartment sizes. Conclusion By adjusting gene expression values for changes in the sizes of cell type-specific compartments, this computational deconvolution method has the potential to increase both the sensitivity and specificity of differential gene expression experiments performed on complex tissues. Given the necessity for understanding complex biological processes such as development and carcinogenesis within the context of intact tissues, this approach offers substantial utility and should be broadly applicable to identifying gene expression changes in tissues composed of multiple cell types.


Background
High-throughput transcriptional profiling using DNA microarrays has enabled routine measurements of genome-wide regulatory changes in a variety of contexts. This technique has been applied to the analysis of gene expression within relatively homogeneous cellular populations as well tissues or tumors consisting of disparate cell types. Within metazoans, cells depend upon environmental signals for growth, differentiation, and survival. However, the interpretation of gene expression profiles obtained in metazoan organisms is complicated by their characteristically complex cellular environments. This results from the fact that microarray expression measurements from heterogeneous tissues with distinct cellular compartments reflect weighted averages of expression levels within different cellular populations. Therefore, observed changes in gene expression may result from bona fide changes in regulation within a given cellular compartment, or from changes in the abundance of an expressing compartment within the tissue as a whole. As a consequence, changes in compartment size may be mistaken for the intracellular regulation of gene expression; conversely, genuine regulation within a given cell type may not be detected due to changes in the abundance of cellular compartments that mask its contribution to the tissue as a whole.
To date, several approaches have been used to identify changes in gene expression that occur in different cellular compartments within tissues or tumors comprised of multiple cell types. Laser capture micro-dissection (LCM) [1][2][3] has been used to physically separate defined cell populations prior to gene expression analysis. A drawback of this approach, however, has been the difficulty in obtaining sufficient quantities of purified material to perform robust, reproducible genome-wide profiling. Other techniques for physical separation may also be used [4,5], however, it is often difficult to ensure that the separation process itself does not introduce substantial alterations in gene expression.
Recently, Lu et al. described a computational approach for estimating proportions of cells at specified points in the cell cycle within asynchronous cultures of yeast [6]. Application of this method to complex tissues in higher organisms, however, requires the identification of cell typespecific genes whose expression levels are not substantially affected by biological state or experimental perturbation.
The mammary gland contains two major cellular compartments -epithelial and stromal -that are themselves composed of multiple cell types. These include luminal, myoepithelial, and alveolar epithelial cells, endothelial cells, fibroblasts, white adipocytes, brown adipocytes, and other stromal cell types including multiple hematopoietic cell lineages. During mammary gland development as well as tumorigenesis, the proportions of these compartments change dramatically relative to each other. For example, the epithelial compartments proliferate rapidly and expand during both puberty (ductal elongation) and pregnancy (lobuloalveolar development). Conversely, large scale apoptosis of alveolar epithelial cells and remodeling of the extracellular matrix occurs following pup weaning, thereby returning the gland to a state superficially resembling that of the adult nulliparous animal [7]. In an analogous manner, during the process of tumorigenesis, oncogene activation expands the size of the epithelial compartment while inducing marked changes in other cellular compartments, such as the adipose and fibroblastic stroma and cells of the innate and adaptive immune system.
In this article, we describe and apply a novel extension of a computational deconvolution strategy observed patterns of differentially regulated gene expression. We further demonstrate the utility of this approach by using it to deconvolute expression changes that occur over the course of mammary gland development as well as in response to oncogenic Ras activation within the mammary gland. Predicted gene expression changes computed in this manner were confirmed experimentally and revealed statistically significant regulation of underlying functional pathways that were not detected by conventional gene expression analysis methods.

Identification of discriminant gene lists
We hypothesized that changes in the proportions of different cell types within a complex organ could be quantitatively assessed using panels of transcripts that were specifically expressed within each of the composite cellular compartments. In order to identify such transcripts, we selected highly enriched reference samples containing largely homogeneous cell populations and compared gene expression levels in these samples to those in samples representing other cell types. Mammary epithelial cells (MEC), brown adipose tissue (BAT), white adipose tissue (WAT), T cells (CD4+ and CD8+), B cells, plasma cells, macrophages, and fibroblasts were selected for thismodeling approach either because they represent abundant cell populations within the mammary gland or because they are known to play a role in mammary gland development and tumorigenesis.
Microarray expression data were used to generate cell type-specific gene lists through pairwise comparisons of expression between all samples as described in Methods. Only genes that showed significant enrichment within a given cell type compared to all other cell types were retained for further analysis. These lists were further refined using stepwise discriminant analysis to identify the optimal cell type-specific transcripts for classification purposes. The resulting gene lists for each cell or tissue type are shown in Table 1. Examination of the genes identified by this discriminant analysis revealed many that had previously been reported to be expressed in a cell typespecific manner. These included genes whose expression is specific for mammary epithelial cells (Krt2-8, Krt1-18, , fibroblasts (Fgf7), WAT (Retn), BAT (Ucp1, Dci), plasma cells (immunoglobulin genes), macrophages (Emr1), B cells (CD19, CD22, CD72, Blk), and CD8+ Tcells (CD8a, CD8b). These findings support the validity of this approach to identify cell type-specific genes.

Expression deconvolution during mammary gland development
Deconvoluting expression patterns in a complex tissue using cell type-specific gene lists is dependent on the extent to which the genes selected are expressed at a constant level on a per-cell basis over a range of biological conditions. Moreover, it is also necessary to retain a large enough gene list such that the overall estimate will be robust in the face of biological and technical sources of variation in gene expression measurements [8]. We reasoned that the expression profiles for highly regulated cell type-specific genes would deviate substantially from the average expression profile exhibited by genes within that particular cell type-specific gene list.
To exclude highly regulated, cell type-specific genes from consideration, mean and variance-normalized expression profiles across mammary gland development were compared for each cell type-specific gene list. Average or canonical gene expression profiles were calculated for each set of cell type-specific genes. Genes with a Pearson correlation coefficient less than 0.5 when compared to the average group expression profile were excluded from further analysis. For example, BAT-specific genes typically displayed high levels of expression in the mammary glands of male mice and female mice prior to puberty (2 wk G0P0). Expression levels progressively decreased thereafter, reaching a stable nadir during mid-late pregnancy, lactation, and early involution (Figure 1a). This pattern parallels the change in abundance of this cellular compartment during mammary gland development [7]. In contrast, expression of the BAT-specific gene, Cidea, was markedly upregulated in the mammary gland during late pregnancy and lactation compared to other BAT-specific genes and was therefore excluded as a marker for BAT in subsequent analysis (Figure 1a). An analogous approach was taken for each list of cell type-specific genes.
To determine whether this method permits the accurate calculation of the relative contribution of different cellu-lar compartments within the mammary gland, we next used these cell type-specific gene lists to estimate compartment sizes across thirteen stages of mammary gland development. We have previously shown that changes in the relative proportions of mammary epithelial cells, WAT, and BAT across mammary gland development substantially affect gene expression profiles observed in the mammary gland [7]. Because these represent the most abundant compartments within the mammary gland, marked changes in their relative sizes during mammary development would be predicted to result in numerous changes in gene expression when expression levels are measuredwithin the tissue as a whole.
Expression levels for cell type-specific genes representing MEC, BAT, and WAT were averaged across the three reference samples for each cell type. The resulting values for each gene were taken as its basal expression within its cognate tissue compartment. Mean and variance-normalization was first performed across all genes in that sample, and proportions of each cell type were estimated by obtaining solutions to linear equations of the form Ax = y, where A is an m × n matrix of expression values (m genes × n reference groups), y is a vector of m values in the test sample, and x is the vector of n values reflecting the estimated relative proportions of each cell type within the mixture. Solutions were estimated using simulated annealing [6]. A related approach, albeit using different methods for identifying genes to be used in the estimate, has been described by Lu et al. who termed the process of estimating cellular proportions "expression deconvolution" [6].
Expression deconvolution was first applied to the initial expression data sets derived from the MEC, WAT, and BAT purified cell populations. As expected, this algorithm correctly estimated the composition of each of the nine samples as consisting entirely of its appropriate corresponding cell type (Figure 1b). Next, expression deconvolution was performed on data sets representing triplicate measurements of 13 time points spanning mammary gland development ( Figure 1b). This analysis yielded estimates of the proportions of MEC, WAT, and BAT present in the mammary gland during each stage of development. These estimates closely paralleled changes in compartment sizes that were observed by histological analysis, thereby confirming the validity of this approach (Figure 2a and 2b). For example, in male and 2-week-old female mice, only a rudimentary epithelial tree is present in the mammary fat pad (Figure 2a and 2b). Consistent with this, expression deconvolution analysis estimated that the mammary gland at these stages is composed primarily of brown and white adipose tissue, with only a small amount (<5%) of epithelium ( Figure 1b). Also consistent with previous descriptions, the estimated proportion of BAT declines   Expression profiles of BAT-specific genes and estimated changes in cell compartment sizes during mammary gland development during puberty to reach a nadir in adult nulliparous mice (Figure 1b and [7]). Conversely, during this same period, ductal elongation and branching occur with extension of mammary epithelial ducts into the surrounding fat pad (Figure 2a and 2b). This process results in a substantial increase in the proportion of epithelial cells in the mammary gland between 2 wk and 10 wk of age, at which time ductal morphogenesis is largely complete. This increase in the epithelial content of the gland is accurately reflected in cell compartment estimates using expression deconvolution ( Figure 1b).
Similar to the expansion of the epithelial compartment during puberty, further increases in epithelial content occur during pregnancy due to the expansion of the alveolar epithelial compartment ( Thus, in aggregate, the calculated proportional composition of the mammary gland with respect to the contribution of each of these three cellular compartments across mammary development is consistent with previously described changes as well as direct visualization of these compartments in staged samples.

Effects of compartment size adjustment on the identification of regulated genes
A principal benefit of the ability to accurately estimate changes in compartment sizes is the possibility of distinguishing bona fide changes in gene expression within a compartment from apparent gene expression changes due solely to changes in compartment size. That is, since the overall expression level in the mammary gland for a gene whose expression is not regulated is equivalent to the sum of its expression levels within each cellular compartment, it should be possible to quantitatively predict the changes in that gene's apparent expression level that would result solely from specified changes in the abundance of cellular compartments within the gland.
To test the validity of this approach, we calculated expression levels for αand γ-tubulin across mammary development based on their expression within each compartment within the gland. We then compared these predicted expression levels to those observed across development. Briefly, the average expression levels observed for the tubulin genes within the reference MEC, WAT, and BAT samples were used in conjunction with the estimated proportions of the three compartments at each of the development stages to generate a predicted expression values for the tubulin genes ( Figure 3a). The resulting calculated expression levels were largely congruent with the observed values with correlation coefficients greater than 0.6.
To ascertain whether this expression deconvolution method could identify intrinsic gene expression changes, mammary glands harvested from late pregnant (day 18) mice were compared to those from 10-week-old nulliparous females. These two developmental stages have marked differences in physiology as well as in the abundance of epithelial and adipocyte compartments [10] and thereby provide an appropriate test for the adequacy of this approach. First, the predicted expression level for each gene was calculated based on its expression level in the MEC, WAT, and BAT compartments, coupled with changes in the sizes of the three compartments between these two stages of mammary gland development. This predicted gene expression level was then subtracted from the observed expression level, and the resulting values for these two developmental points were compared using the Statistical Analysis for Microarrays (SAM) algorithm [11] with a false discovery rate threshold (FDR) of <3%. Additional File 2 lists the results of SAM comparisons both before and after taking into account changes in compartment sizes and Table 2 summarizes the number of probesets whose regulation calls were altered. Table 3 lists selected genes that were considered to be differentially expressed either before or after adjusting for changes in cell compartment sizes.
The effects of taking compartment size changes into account was substantial as 20.2%, 35.2%, and 15.5% of up-, down-, and non-changing calls, respectively, were altered as a consequence of adjusting for changes in cell compartment sizes ( Table 2) that occur during mammary development. As shown in Table 2, 16% of genes called upregulated prior to signal adjustment were called nonchanging and 4.2% were called downregulated, after taking into account changes in cell compartment sizes. Thus, adjustment for cell compartment size alters change calls for 20.2% of genes initially called upregulated. Similarly, 35.2% of genes initially called downregulated were predicted to be either up-regulated (3%) or non-changing (32.2%) after taking into account changes in cell compartment sizes. These findings strongly suggest that apparent changes in expression for a substantial fraction of genes identified as differentially regulated using standard analytical approaches may actually reflect changes in cell Tubulin and cytokeratin 8 expression during mammary gland development compartment sizes that occur during mammary gland development rather than intrinsic gene regulation.
For example, examination of genes in Table 3 revealed that a number of adipocyte-specific genes, such as Pparg and Fabp4 (aP2), appeared to be significantly downregulated in the pregnant (d18) compared to the 10-week-old nulliparous gland prior to adjustment for estimated changes in cell compartment sizes. After adjusting for estimated changes in cell compartment sizes, however, these adipocyte-specific genes were no longer considered to change significantly between these two developmental time points. This suggests that the apparent down-regulation of multiple WAT-specific genes that occur during pregnancy is most likely a consequence of a decrease in the size of the adipose compartment that occurs at this stage (see Figure 1b, 2a, and 2b).
Figure 1b implies that the contribution of the adipocyte compartment to total mammary gland mRNA at d18 of pregnancy is less than half that of its contribution to the 10 wk nulliparous gland. As such, the apparent expression of genes that are expressed predominantly within the WAT compartment would be predicted to decrease from 10 wk nulliparous development to d18 of pregnancy. Consistent with this, the pre-adjustment level of leptin (Lep) expression appeared significantly downregulated from 10 wk nulliparous gland to d18 of pregnancy, whereas postadjustment signals indicated that its expression level did not differ significantly between these two developmental stages. In fact, we have previously shown by in situ hybrid-ization that Lep mRNA expression does not change substantially on a per-cell basis between puberty and early pregnancy [7], providing experimental confirmation of this computational result.
Conversely, a substantial fraction of genes (15.5%) that were not considered to be differentially regulated prior to adjustment for changes in cell compartment sizes were found to be either up-(5.2%) or downregulated (10.3%) following deconvolution (Table 2 and 3). This suggests that changes in cell compartment sizes that occur during mammary gland development mask bona fide changes in expression for a substantial number of genes. For example, lipoprotein lipase (Lpl), insulin-like growth factor 2 (Igf2), and prolactin-like protein E (Prlpe) were all identified as upregulated in late pregnancy compared with the 10 wk nulliparous gland only after taking into account the changes in cell compartment sizes. This finding is consistent with previous reports that the content and activity of Lpl, which is expressed in mammary adipocytes, increases within the fat pad during pregnancy [12] and early lactation [13]. Similarly, the adipocyte-secreted factors Igf2 and Prlpe were found to be upregulated only after adjusting for the decreasing sizes of the adipocyte compartment that occurs during pregnancy. This finding is consistent with their previously described role in promoting mammary gland proliferation and differentiation [14,15].
Several genes encoding transporter proteins (e.g., Abca1, Abcd2, Abcc9) appeared to be down-regulated in pregnancy prior to adjusting for changes in compartment Table 2: Alterations in change calls following adjustment for cell compartment size. Total probesets called upregulated (Up), downregulated (Down), or non-changing (NC) before ("Pre-adj") and after ("Post-adj") adjusting for estimated changes in cell compartment sizes. The percentages of probesets whose post-adjustment change calls were the same as or different than their preadjustment change calls are shown. sizes, but were predicted to be up-regulated following signal adjustment; these results are consistent with the preparation of the mammary gland for large-scale transport and secretion during lactation. Conversely, pro-apoptotic genes such as Bad, Bax, Bid, and mdm2 were found to be downregulated at d18 of pregnancy, consistent with the low levels of apoptosis observed during late pregnancy and lactation, only after adjusting for changes in compartment sizes ( Table 3). As such, the congruence of these results with our current understanding of mammary gland Notably, the expression of several epithelial genes, including Krt2-8, Krt1-18, and Krt1-19, did not appear to change between 10 wk nulliparous and d18 pregnant mice prior to adjusting for changes in compartment sizes, but were predicted to be significantly downregulated after taking the compartment size changes account. These results were surprising not only because developmental regulation of these epithelial markers within the mammary gland has not previously been reported, but also because two of these cytokeratin genes (Krt2-8 and Krt1-18) were themselves members of the epithelial-specific set of transcripts used to adjust gene expression for compartment size. Closer examination of these data revealed that a modest increase in apparent Krt2-8 and Krt1-18 expression levels in late pregnancy is offset by more substantial upregulation of other epithelium-specific genes, such as Ndrg1. To determine whether this prediction was accurate, Krt2-8 protein expression was assessed by immunofluorescence ( Figure 3b). This analysis suggested that Krt2-8 expression does indeed decrease on a per-cell basis within the murine mammary epithelium during late pregnancy. This, in turn, indicates that adjusted expression values can accurately identify differentially regulated genes whose unadjusted expression values within the mammary gland as a whole do not appear to change.

Effect of compartment size adjustment on GO analysis
To identify coherent changes in functional gene expression patterns between late-pregnant (d18) and 10 wk nulliparous mice, statistically significant associations between Gene Ontology (GO) categories and lists of upand down-regulated genes were identified using EASE before and after adjusting for estimated changes in cell compartment sizes [16]. Table 4 lists a subset of the significant associated terms in the Biological Process category and Additional File 3 lists the significant GO terms. This analysis revealed that GO terms relating to protein localization and protein transportation were significantly associated with up-regulated gene lists irrespective of adjustment for changes in cell compartment sizes. In contrast, while pre-adjustment gene lists suggested the downregulation of multiple metabolic pathways (fatty acid metabolism, lipid metabolism, and carboxylic acid metabolism), these associations were absent in the postadjustment gene list. This indicates that the initial associations observed were most likely a result of the diminished size of the WAT compartment during late pregnancy rather than bona fide regulation within an expressing compartment.

Deconvolution of oncogenic Ras activation
Having demonstrated the ability of expression deconvolution toidentify gene expression changes in the setting of marked changes inthe proportions of different cellular compartments, we wished to use this approach to examine gene expression changes that occur in the mammary gland as a consequence of oncogene Ras activation. Our laboratory has previously described the mammary-specific, doxycycline-inducible expression of oncogenes using bitransgenic mouse models [17][18][19][20]. For the present experiments, an activated Ras transgene was placed under the control of a tetracycline-dependent minimal promoter, and Ras was induced in the mammary gland for specified periods of time by administering animals 2 mg/ ml doxycycline in their drinking water. Mammary tissues were harvested from nine mice each at day 0 (d0), day 1 (d1), day 2 (d2), day 4 (d4), day 8 (d8), and day 14 (d14) of doxycycline treatment and oncogene induction. RNA prepared from these samples was used to generate three independent pooled samples, each consisting of RNA from three animals, and microarray transcriptional profiling was performed on Affymetrix MG-U74Av2 arrays.
As was performed for mammary development, tissue-specific gene lists from epithelial cells, WAT, and BAT were first used to estimate changes in the relative proportions of these three compartments following Ras induction (Figure 4a). Expression deconvolution predicted that Ras induction resulted in the dramatic expansion of the epithelial compartment with a corresponding loss of BAT and WAT (Figure 4a). Aanalysis of mammary whole mounts and hematoxylin and eosin (H&E)-stained sections confirmed that after 4 days of Ras induction, the mammary epithelial compartment occupied more than 70% of the mammary gland ( Figure 4b). As was the case for normal mammary gland development, calculated estimates of changes in cell compartment sizes derived from gene expression analysis (Figure 4a) closely paralleled morphological changes observed in the mammary gland induced following Rasactivation (Figure 4b).
We next adjusted gene expression levels to account for the calculated baseline expression that would be attributable to the estimated proportions of the three major cellular compartments (MEC, WAT, and BAT). Gene expression data were then re-analyzed to identify genes that were differentially regulated following 4 days of Ras activation. Additional File 4 lists gene expression change calls before and after the adjusting for estimated changes in cell compartment sizes. Table 2 summarizes the number of probesets with altered change calls, and Table 5 lists selected genes that were considered to be differentially expressed either before or after adjusting for changes in cell compartment sizes.
As observed for analyses of mammary development, adjustment for changes in cell compartment sizes identified genes whose apparent expression changes actually reflected changes in the sizes of their expressing cell compartment, as well as genes whose changes in expression between these two points had been masked by changes in cell compartment sizes. These effects were substantial as 32.4%, 58.3%, and 18.9% of up-, down-, and non-changing calls, respectively, were altered as a consequence of adjusting for changes in cell compartment sizes ( Table 2).
As was observed for the analysis of pregnancy-induced changes in gene expression in the mammary gland, the expression of adipocyte-specific genes such as Dci, Ucp3, Cepba, Lep, Pparg, Fsp27, and Fabp4 that appeared to be down-regulated prior to adjustment for cellular compartment sizes, were no longer considered to be changed after adjustment. This suggests that the apparent down-regulation of these genes was due not to Ras-induced repression of gene expression, but to the reduction in size of the adipocyte compartment that occurs as a consequence of the epithelial expansion triggered by Ras activation. Similarly, although Krt2-8 and Krt1-18 appeared upregulated in response to Ras activation prior to adjustment, these genes were not considered to be differentially expressed after deconvolution. This suggests that the increase in cytokeratin expression observed following 4 days of Ras activation principally reflects expansion of the epithelial compartment rather than intrinsic regulation of these genes within the epithelial compartment. Consistent with the deconvolution analysis, significant changes in per-cell expression of cytokeratin 8 protein levels were not observed following immunofluorescence analysis performed at these two time points (Figure 5a). Similar to cytokeratins, the pro-apoptotic genes Bax, Bid, and Usp3 no longer showed significant up-regulation after adjusting for expansion of the epithelial compartment. Consistent with this result, TUNEL staining revealed low levels of apoptosis that did not increase following Ras activation ( Figure 5b).
As was the case for our analysis of mammary gland development, multiple genes were identified that were predicted to be differentially expressed only after adjusting for changes in cell compartment sizes. Genes that were predicted to be up-regulated following expression deconvolution included Rras and Ctsb, whereas Mr1 was predicted to be down-regulated following adjustment. These predictions are consistent with previously published reports on the effect of Ras activation [21,22], providing  additional evidence that this approach can reliably adjust expression profiles for changes in compartment sizes.
Having adjusted Ras-induced changes in gene expression levels for changes in cell compartment sizes, we next analyzed lists of differentially regulated genes for statistical association with Gene Ontology annotation. Analysis of pre-adjusted gene lists revealed significant associations between down-regulated genes and multiple energyrelated pathways, including glycolysis, lipid metabolism, carbohydrate metabolism, fatty acid metabolism, and electron transport (Additional File 5 and Table 6). This result was surprising given the known ability of Ras to stimulate glycolysis [23] as well as the presumably large energy requirement for macromolecular synthesis necessary to sustain oncogene-induced epithelial proliferation. However, after adjusting for the decreased proportion of adipocytes in the mammary gland, these metabolic pathways were no longer significantly associated with downregulated genes ( Table 6). This suggests that the observed pre-adjustment association was principally a consequence of the reduction in size of the WAT compartment induced by Ras activation.
Finally, adjusting for changes in cell compartment sizes revealed several significant associations between up-regulated genes and GO categories that were not evident prior to adjustment. These included genes associated with the inflammatory response and integrin-mediated pathways ( Table 6). Ras-mediated activation of these pathways has previously been described [24].

Comprehensive compartment dynamics in the mammary gland
The above findings demonstrate th'at adjusting mammary gene expression profiles for changes in the size of three principal cellular compartments (MEC, WAT, and BAT) is sufficient both to eliminate false-positive changes in gene expression due to changes in cell compartment sizes and to reveal intrinsic changes in gene expression that are masked by changes of compartment sizes. Consequently, we wished to extend this method for estimating compartment size to include additional cell types for which expression profiling data were available from purified cell populations. These included macrophages, fibroblasts, Bcells, T-cells, and plasma cells. Estimates of relative changes in the size of each cell type compartment were derived as before for mammary gland development and inducible Ras activation ( Figure 6).
As shown in Figure 6a, the calculated proportions of hematopoietic cells, including B cells, plasma cells, CD4+ and CD8+ T cells, and macrophages, were found to increase during lactation or early postlactational involution. These findings are consistent with previous reports and with the putative roles of these cell types in antibody secretion into milk and the detection and clearance of apoptotic alveolar debris during postlactational involu- tion [24,25]. Additional increases in macrophage and CD8+ T cell abundance were predicted at the onset of puberty, and fibroblast abundance was estimated to decrease during pregnancy and subsequently increase during involution.

Cytokeratin 8 expression and apoptosis in the mammary gland following Ras activation
Estimated changes in macrophage and fibroblast cell populations in the mammary gland following Ras activation are shown in Figure 6b. The calculated increase in macrophages following Ras activation suggested macrophage infiltration into the mammary gland. The accuracy of this prediction was supported both by enzyme-linked immunosorbent assay (ELISA) of IL-1β, a known potent mediator of immune and inflammatory responses [26], and macrophage infiltration assays at d4 following Ras induction (manuscript in preparation).
Notably, while the mammary epithelial compartment constitutes the majority of the mammary gland after 2 days of Ras activation (Figure 4a and 4b), a progressive increase in the estimated proportion of fibroblasts was noted at d4, d8, and d14 (Figure 6b). We considered the possibility that the apparent increase in fibroblast-associated genes might reflect an epithelial-mesenchymal transition (EMT) resulting from Ras activation. To explore this possibility, we examined this data set for the expression of Snail, a potent activator of EMT [27,28]. Consistent with our prediction, both microarray gene expression data as well as Northern hybridization analysis showed dramatic up-regulation of Snail in response to Ras activation (Figure 7).

Discussion
While microarray expression profiling can be used to directly infer intracellular gene regulation within homogeneous cell populations, the complex mixture of cell types within tissues of higher organisms substantially hampers the interpretation of results from such experi- Table 6: Biological Processes regulated by Ras activation. Biological Process terms significantly associated with lists of genes that were detected as differentially expressed, both before ("Pre-adj") and after ("Post-adj") adjusting for estimated changes in cell compartment sizes in the mammary glands at d4 vs. d0 following Ras activation.
Estimated changes in the relative abundance of stromal cell types Figure 6 Estimated changes in the relative abundance of stromal cell types. Calculated relative abundance (mean ± SEM) of stromal cell types (a) during mammary gland development or (b) following oncogenic Ras activation. Expression levels shown are the average of normalized expression levels of the tissue-specific genes for each tissue/cell type listed in Table 1.
ments. Specifically, changes in gene expression observed within complex tissues may be due either to changes in gene regulation within a given cell type or to changes in the relative abundance of expressing cell types. Given the necessity of understanding complex biological processes such as development and carcinogenesis within the context of intact tissues, this problem is of substantial importance.
Previous attempts to address this problem have focused on purification or enrichment of defined cell types by physical methods, such as laser-capture microdissection or magnetic bead separation [1,2]. While these approaches have been successful, they are labor-intensive and become prohibitive when attempting to analyze large sample collections. As an alternative, we have chosen to apply a computational method for deconvoluting expression profiles derived from intact tissues by using reference data derived from purified populations of the constituent cell types. An analogous approach has previously been used by Lu et al. for analyzing cell populations during the yeast cell cycle [6]. However, the more challenging goal of deconvoluting gene expression within a complex metazoan tissue has not previously been reported.
As Lu et al. noted in their studies in yeast, a critical element of the deconvolution approach is the choice of purified cell populations that will adequately represent the repertoire of cell types present within the mixed population [6].
In the case of the mammary gland, identification and accurate representation of the multiple cellular compartments present is nontrivial. To attempt to model the major cell types in the murine mammary gland, we chose populations representing mammary epithelial cells, white adipocytes, brown adipocytes, fibroblasts, plasma cells, B cells, T cells (CD4+ and CD8+), and macrophages. Despite the relatively large number of cell types modeled, it is worth noting that even this represents an over-simplification since these cell types may be further subdivided based on lineage (luminal vs. alveolar vs. myoepithelial cells) or differentiation status (preadipocytes vs. mature adipocytes). Moreover, additional cellular compartments, such as those responsible for the mammary vasculature, were not included in the model. It is notable, therefore, that we were able to achieve robust results simply by restricting the deconvolution model to three major cellular components (MEC, WAT, and BAT). This suggests that fluctuations in the sizes of these three compartments account for a significant amount of the variation in gene expression observed during mammary gland development and carcinogenesis.
After identifying the relevant constituent cell types within a tissue, it is then necessary to identify robust sets of reference genes whose expression is specific to a given cell type and relatively unaffected by different physiological and experimental conditions. Cell type-specific genes have previously been identified using a variety of methods, including signal-to-noise calculations [29], linear discriminant analysis [30], and feature relevance approaches [31]. Additionally, a number of epithelial cell type-specific genes within the mammary gland have previously been identified using affinity-purified populations from normal and neoplastic mammary tissues [4]. Many of these mammary epithelial markers, including Krt2-8, Krt1-18, Krt1-19, and Stard10, were also identified in the current analysis. We now extend previous work by using these cell type-specific genes to provide quantitative estimates of compartment size and to adjust expression values by taking into account changes in compartment sizes.
In the analysis presented here, we have combined a signalto-noise approach (through T statistic rankings) with linear discriminant analysis in order to identify genes that are preferentially expressed within individual cell types.
Although cell type classification cannot be considered equivalent to the problem of estimating the proportional contributions of those cell types to a mixed population, we reasoned that the most discriminating genes for classification were also likely to provide a set of genes from which successful estimates of cell compartment size could be derived.

Upregulation of Snail by Ras activation
To attempt to identify genes that would provide useful markers for cell type abundance by identifying those with consistent expression under a variety of biological conditions, we analyzed the behavior of candidate genes in data sets spanning mammary gland development. Since mammary development encompasses a diverse group of biological processes including branching morphogenesis, alveolar differentiation, apoptosis, and extracellular matrix remodeling [10], it provides a useful test set for further gene selection. To identify outlier genes that were significantly regulated during development, we eliminated genes whose expression was not sufficiently correlated with the normalized mean profile of discriminating genes for specific cellular compartments. The fact that this overall approach selected a number of known marker genes specific for each tissue, including Krt2-8 and Krt1-18 for mammary epithelial cells, Ucp1 for BAT, Retn for BAT, and CD8a and CD8b for CD8+ cells, suggests that it was successful. Furthermore, the identification and elimination of genes (such as Cidea) with expression patterns that clearly differ from these known marker genes underscores the importance of this step. Optimal extension of this method to other organs and organisms will, in each case, require the collection of data from reference populations under a sufficiently diverse set of biological conditions.
The refined list of co-regulated marker genes and their reference expression values were then used to estimate the proportional contribution of each cell type to the mixture of mammary cell populations by using simulated annealing to estimate solutions to the corresponding linear equations. When this technique was applied to solve equations for the relative contributions of three major mammary cell types across mammary gland development, we achieved quantitative predictions of cellular compartment size changes that were strongly correlated with observed morphological changes. To further confirm the utility of this computational approach, we applied this method to a conditional transgenic mouse model in which oncogenic Ras was inducibly activated in the mammary epithelium. As before, estimated changes in compartment sizes derived from this model were consistent with those observed by morphological analysis of mammary whole mounts and stained sections.
Finally, we extended this expression deconvolution approach to include additional cell types, particularly those of the hematopoietic system. This approach predicted increases in several types of immune cells during lactation and postlactational involution, consistent with previous reports of their role in clearing apoptotic debris and with their persistent presence in the mammary gland following the cycle of pregnancy, lactation, and postlactational involution [24,25]. Interestingly, expression deconvolution also predicted an increase in macrophages and CD8+ cells at the onset of puberty. Although macrophages have previously been shown to be required for ductal elongation and to be recruited to puberty-specific structures termed terminal end buds [32], to our knowledge there has been no similar role described for T cells. Taken together, our results support the contention that the method described can reliably estimate dynamic changes in cellular compartments within a complex mammalian organ.

Deconvolution-adjusted expression analysis
While the ability to estimate compartment dynamics is useful in its own right, we were particularly interested in attempting to adjust gene expression values derived from whole-organ profiling in order to eliminate apparent changes in expression due solely to compartment dynamics. Furthermore, changes in expression due to changes in compartment sizes can offset genuine alterations in gene expression within particular cell types such that net expression may appear unchanged. Thus, adjusting expression values has the potential to reduce false-positive and false-negatives gene expression changes and to thereby increase both the sensitivity and specificity of differential gene expression experiments.
To adjust the expression signal associated with any given gene on the array, we utilized our estimates of compartment size in conjunction with mean expression values for that gene in each of the reference populations representing different cell types. This yielded a gene expression value that would be expected if its expression in other samples was solely determined by the composition of the sample with respect to each compartment in the absence of regulation within a compartment. The "predicted" value based on compartment sizes was then subtracted from the observed value and SAM was used to identify differentially expressed genes. Assuming that the expression value of a given gene in the reference samples provides a reasonable estimate of basal expression of that gene in the tissue, this comparison of "deconvolution-adjusted" expression values should reduce the number of genes identified as differentially expressed as a consequence of changes in compartment sizes.
When this approach was applied to the analysis of gene expression changes that occur during pregnancy and in response to Ras activation, the post-adjusted change calls of most genes were consistent with previous reports in the literature. For example, Abcg1, Abcg2, Csna, Csnb, and HIF1a have been previously shown to be upregulated during pregnancy [33], and Myc [34], Ccnd1, Cxcl1, Tgfb1, and Itgb2 are known to be induced by Ras activation [24]. The current approach retains the upregulated change calls of these genes.
Using this approach, we were also able to identify genes that appeared to be differentially expressed before-but not after-adjusting for changes in cell compartment sizes. These likely correspond to genes whose apparent expression levels change only as a consequence of compartment dynamics. Conversely, we were also able to identify genes whose expression appeared unchanged prior to adjustment, but which were found to be differentially expressed once changes in cell compartment sizes had been taken into account. These may represent genes whose bona fide regulation is masked by offsetting changes in compartment sizes. Finally, statistical association of post-adjustment gene expression lists with Gene Ontology (GO) revealed some biological processes that were masked by changes in compartment sizes. Several of the pathways significantly associated with post-adjustment gene expression changes have either been reported in the literature or confirmed experimentally by ourselves, such as the activation of inflammatory response and integrin-mediated pathways induced by Ras activation.
Despite the successful application of this approach to the adjustment of gene expression profiles, several caveats must be considered. First, any given adjusted gene expression profile is critically dependent upon the behavior of that particular gene in the relatively homogeneous and purified cell populations used as the reference samples. That is, adjustments are based on the assumption that "basal" gene expression can be estimated by the average gene expression value in the reference samples from the same organism that are of a given cell type. Thus, if the typical expression levels for a given gene in its relevant compartment in vivo is substantially higher than that measured in the purified population, a change due solely to changes in the size of that compartment will be undercompensated and expression of the gene may still appear to change. Thus, while we expect that this method will enrich for genes that are intrinsically regulated within cells, a certain number of "false positive" changes will undoubtedly remain.
Ultimately, the success of this enrichment will depend upon the purified populations used to provide the basis for subsequent estimates. In this regard, we anticipate that our results would be improved by deriving reference expression values from purified primary cell populations derived directly from the mammary gland. Although LCM-based analysis of large numbers of specimens may be difficult, a number of groups have successfully utilized this strategy to purify tumor populations [35,36]. LCM of a small number of reference samples may be ideal for isolating purified reference cell populations, and the resulting expression profiles can be used in conjunction with our computational approach for higher-throughput analysis of a large number of samples. While such an LCMbased method may provide improve estimates of compartment-based gene expression, however, the experimental validation of results obtained in the current study suggests that even non-ideal reference samples can substantially improve the detection of intrinsically regulated genes.
A second caveat that applies specifically to the initial estimates of compartment size is that differentiation of a given cell type may lead to what essentially constitutes a new cell type. Adequate modeling of such a cell type would be unlikely using the original reference populations. This problem may particularly complicate deconvolution efforts when a cellular process causes one cell type to express markers that are characteristic of another cell type from the reference set. This phenomenon may account for the substantial increase in the estimated abundance of fibroblasts observed in the mammary gland following Ras activation. Rather than an increase in bona fide fibroblasts, this may instead reflect the expression of mesenchymal markers in mammary epithelial cells due to an epithelial-mesenchymal transition, as suggested by the increase in Snail expression that we observed in response to Ras activation. The converse problem -multiple cell types (such as normal vs. neoplastic epithelial cells) appearing as a single type (epithelial) -should also be considered. However, our results suggest that this latter case is more likely to be interpretable using sensitive methods for detecting differential expression after correcting for other, more biologically distinct, cell populations.

Conclusion
Our extension of the expression deconvolution approach, in conjunction with deconvolution-adjusted expression analysis, demonstrates the ability to correct microarray data obtained from a complex, heterogeneous mammalian organ for changes in compartment sizes. Beyond identifying changes in tissue compartments, this approach permits the improved detection of differentially expressed genes. Finally, these adjusted differential expression estimates identify statistical associations with functional annotation that suggest novel aspects of mammary gland biology and carcinogenesis. Our findings indicate that this model of expression deconvolution provides a powerful tool for the study of complex cellular mixtures in higher organisms.
FVB/N (Taconic) mice were housed under barrier conditions with a 12 hr light/dark cycle and access to food and water ad libitum. Brown adipose tissue (BAT) was obtained from the interscapular fat pad of FVB mice. White adipose tissue (WAT) was obtained from sacral fat depots. Harvested tissues were snap frozen for RNA isolation.
Tissues used for microarray analyses of murine mammary gland development included independent, triplicate pools of mammary tissue from 10-week-old adult males as well as females at 12 time points of mammary gland development representing puberty, pregnancy, lactation, and postlactational involution [7]. All animal work described in this paper was carried out under humane conditions and has been approved by the University of Pennsylvania Laboratory Animal Resources committee.
Doxycycline-inducible systems that permit conditional gene expression in bitransgenic mice have been described [17]. The bitransgenic mouse line, MTB/TRAS, carries two transgenic constructs. The first, MMTV-rtTA (MTB), expresses the reverse tetracycline transactivator (rtTA) under the control of the murine mammary tumor virus (MMTV) promoter. The second, TetO-v-Ha-Ras (TRAS), expresses v-Ha-Ras under the control of the tetracyclinedependent minimal promoter. When MTB/TRAS mice are given 2 mg/ml doxcycline in drinking water, Ras is rapidly and specifically induced in the mammary gland [17]. The third, fourth, and fifth mammary glands were harvested from MTB/TRAS mice at days 0, 1, 2, 4, 8, and 14 of doxcycline treatment. At each time point, independent triplicate samples were prepared with each sample consisting of tissue pooled from 3 mice.

Mammary whole mounts and immunostaining
Mammary glands were mounted on glass slides, fixed in 4% neutral buffered paraformaldehyde overnight, and transferred to 70% ethanol. Whole mounts were stained with carmine alum as described [9].
Mammary glands embedded in OCT were sectioned at 8 μm and fixed for 10 min in 4% neutral buffered paraformaldehyde. After rinsing 3 times in PBS (10 minutes/ rinse), sections were treated in 0.5% Triton X-100 for 20 min. Sections were then rinsed 3 times in PBS and incubated in blocking buffer (PBS, 5% BSA, 0.3% Triton X-100, and 10% normal goat serum) for 1.5 hr at room temperature. Rat anti-mouse cytokeratin 8 (DSHB, the University of Iowa) was diluted in blocking buffer and incubated on sections overnight at 4°C. Sections were rinsed 3 times in PBS/0.3% Triton X-100 and stained with 1:1000 Alexa 567 conjugated goat anti-rat IgG (Molecular Probes) at room temperature for 2 hr. Stained sections were rinsed once in PBS/0.3% Triton X-100 and twice in PBS. Nuclei were counterstained with 1 μg/ml Hoechst 33258 dye, mounted in Fluoromount-G (SouthernBiotech) and visualized using a Leica DMRXE microscope. All images were captured using identical settings.

Microarray analysis
Total RNA from cell lines and tissues was isolated and purified as described [38]. cDNA was generated and biotinylated cRNA was hybridized to Affymetrix MG-U74Av2 oligonucleotide microarrays as described [7]. Data files and published CEL files representing gene expression profiles for other cell types were listed in Additional File 1.
For murine development, previously generated RNA pools [7] were re-analyzed on MG-U74Av2 arrays. All raw data were analyzed using Affymetrix GeneChip5.0 (MAS5) with default normalization to a target signal of 150.

Identification of tissue-specific genes
To identify cell type-specific genes, the two-tailed Student t-test was first used to compare all possible sample pairs representing different tissue and cell types. Transcripts were selected that were consistently expressed at levels within a tissue at least 5.0-fold higher compared to any other tissue or cell type analyzed at a significance of p < 0.0005. Transcripts were then excluded that were considered "Absent" by MAS5 or that had an MAS5 signal value <500 in samples from the tissue within which they were most highly expressed. Transcripts on the resulting list were ranked based on signal-to-noise ratio using the Zscore transformed data for each sample [29] where higher values reflect consistently higher expression in the given group compared to expression in all other cell types. For genes measured by multiple probesets, the probeset with the highest MAS5 signal was retained if probe sets were adjacent in rank; in all other cases, the highest-ranking probe set was used.
To eliminate poorly discriminating genes, the ranked gene list for each tissue and cell type was further filtered using the SAS procedure PROC STEPDISC with the backward method and default significance level (Version 9.1 of SAS System for Windows). The most discriminating genes were thereby selected that minimized the ratio of withingroup (consistently high or low expression with a group of tissue samples) sum of squares to total sum of squares for the model. Genes for each tissue type that passed the backward stepwise discriminant analysis were considered to represent the most specific genes for that tissue for subsequent analysis. A crude estimate of the efficiency of this step was obtained using the leave-one-out cross validation method in SAS PROC DISCRIM with the POOL and CROSSVALIDATE options and pooled covariance matrix when calculating the squared distances.

Computational estimation of proportional contributions from cell types
Expression values for cell-type-specific genes were obtained from each reference sample and averaged across samples obtained from the same cell or tissue type. These values were taken as estimates of basal expression for further work. All subsequent calculations were performed after first normalizing the expression in each sample (or averaged sample group) across all probesets such that the mean = 0 and the standard deviation = 1. Estimation of the relative proportions of each cell type was then performed by obtaining solutions to linear equations of the form Ax = y, where A is an m × n matrix of expression values (m genes × n reference groups), y is a vector of m values in the test sample, and x is the vector of n values reflecting the estimated relative proportions of each cell type within the mixture. Solutions for x were estimated using simulated annealing as described by Lu et al. [6] Deconvolution-adjusted expression analysis Given a sample for which an estimate of the relative proportions of composite cellular compartments has been obtained, gene expression was normalized by subtracting the expected contribution for each component. That is, the difference in expression that would have occurred in the absence of changes in compartment size was estimated using where g adj is an adjusted expression value, g orig is the observed expression value in the test sample, and x k and e k refer to estimated proportions and expression values in each of n reference cellular compartments, respectively. To identify genes for which observed expression changes were not due to changes in the abundance of its expressing cell types, Significance Analysis of Microarrays (SAM) [11] was used to compare sets of g adj from replicate measurements under various experimental conditions. Comparisons were performed with a False Discovery Rate (FDR) cutoff of 3%. A Java application that calculates adjusted gene expression based on estimated proportions of an arbitrary number of cellular components is available for public download [39].

Association with Gene Ontology (GO) annotation
Statistical associations between GO annotation and lists of differentially expressed genes were identified using EASE [16]. Multiple testing correction was performed using within-system bootstrapping, and a final cutoff of p < 0.05 was used to identify statistically significant associations.