Digital sorting of complex tissues for cell type-specific gene expression profiles
© Zhong et al.; licensee BioMed Central Ltd. 2013
Received: 13 September 2012
Accepted: 14 February 2013
Published: 7 March 2013
Skip to main content
© Zhong et al.; licensee BioMed Central Ltd. 2013
Received: 13 September 2012
Accepted: 14 February 2013
Published: 7 March 2013
Cellular heterogeneity is present in almost all gene expression profiles. However, transcriptome analysis of tissue specimens often ignores the cellular heterogeneity present in these samples. Standard deconvolution algorithms require prior knowledge of the cell type frequencies within a tissue or their in vitro expression profiles. Furthermore, these algorithms tend to report biased estimations.
Here, we describe a Digital Sorting Algorithm (DSA) for extracting cell-type specific gene expression profiles from mixed tissue samples that is unbiased and does not require prior knowledge of cell type frequencies.
The results suggest that DSA is a specific and sensitivity algorithm in gene expression profile deconvolution and will be useful in studying individual cell types of complex tissues.
Cellular heterogeneity is present in nearly all biological specimens. When the genome-wide transcriptional profile of heterogeneous samples is measured under different physiological states, any observed differences are strongly confounded by differences in cell type compositions between samples [1–3]. Recent studies suggest that the microenvironment of a tissue may change under different physiological states and can contribute to the etiology of diverse diseases [4–10]. Consequently, to fully understand gene expression differences associated with different physiological states, deconvolution of tissue expression into the component expression profiles of each cell type is critically needed.
Fluorescence Activated Cell Sorting (FACS), Laser Capture Micro-dissection (LCM) and Translating Ribosome Affinity Purification (TRAP) have been used to physically separate defined cell types before gene expression analysis [2, 11–13]. However, technical difficulties, such as limited availability of good surface markers, cell type specific promoters and transgenic models, have restricted the application of these techniques. Furthermore, the sorting process may introduce additional stress on cells and hence alter their gene expression profiles.
Unsupervised mixture models have been developed to solve the gene expression devolution problem. For example, a Latent Dirichlet Allocation (LDA) model trained with a variational expectation maximization framework was used to estimate the breast cancer cell gene expression profiles from heterogeneous tumor samples . An alternative approach  is to first reduce the observed mixture using standard dimension reduction algorithm, such as principal components analysis (PCA) or independent components analysis (ICA), find a minimum-volume polytope with k vertices that enclose the reduced data and then transform the reduced data back to the gene expression profiles. The success application of these unsupervised approaches will depend on the availability of large number of observation over a wide range of tissues [14, 16]. In addition, these algorithms do not use the biological knowledge on the cell type markers.
Several supervised and semi-supervised computational deconvolution algorithms have also been proposed to tackle this problem [17–21]. However, they require prior knowledge of either the cell type frequencies within a given tissue [19, 20], or the in vitro gene expression profiles of each component cell type [17, 18]. In reality, this information can be difficult to obtain and presents a major roadblock for these kinds of approaches. Our previous work  has proved that gene expression deconvolution should be done in linear space rather than log-transformed space, as is often used in microarray studies. Based on our previous findings, we propose a novel Digital Sorting Algorithm (DSA) that can deconvolve the expression of a tissue into the component profiles of each cell type using a set of marker genes that are highly expressed in each cell type.
Next, we examined whether DSA can accurately deconvolve the gene expression profile from mixed tissue samples into tissue specific expression profiles. Using the cell type frequencies we estimated using marker genes, we were able to accurately estimate the expression profiles of the liver, brain and lung cells that constitute the mixture (Figure 1b-d). The deconvolved expression was highly correlated with the true gene expression profile in each tissue type. The error measure was smaller for genes that are highly expressed, as would be predicted given that technical variations tend to have larger impact for genes that are expressed at low levels. The effect of the number of marker genes used in estimating the proportion of each cell type was also studied. We randomly sampled the marker genes from the TIGER list (Additional file 1: Table S1). In 100 repetitions, we plotted the correlation and mean absolute difference between the estimated and pure cell-specific expressions against the number of markers used. Our results demonstrate that DSA is robust to the number of marker genes and only requires several marker genes for accurate gene expression deconvolution (Additional file 2: Figure S1).
We next asked whether DSA can identify differentially expressed genes between different tissue types. To do this, we computed the gene expression fold change using the deconvolved gene expression profile, and then carried out a Receiver Operator Curve (ROC) analysis to assess DSA’s ability to detect changes more than two-fold between any tissue types. Our results demonstrate that DSA is highly specific and sensitive in identifying differentially expressed genes (Figure 1e-f).
In the benchmark data, liver, brain and lung were used to construct the mixtures. However, the expression differences between different cell types within a tissue sample are much smaller compared to the differences between liver, brain and lung. Hence, we tested whether our DSA algorithm works on real tissue samples composed of cell types with gene expression profiles that are more similar to each other.
Population specific expression analysis (PSEA)  is an algorithm that has the same input parameters as DSA. However, PSEA uses the marker gene information as normalization factors in the gene expression deconvolution analysis. Hence, the estimated gene expression profiles are not the absolute gene expression values, but are relative to the average of the marker genes for each cell type. In practice, the marker genes from different cells are not guaranteed to have the same expression level. This critical assumption of PSEA makes comparing results between different cell types biased towards the marker gene expression.
In conclusion, we have demonstrated the general feasibility of a Digital Sorting Algorithm (DSA) to obtain cell type specific gene expression profiles from a complex tissue. DSA represents a dramatic improvement over the conventional deconvolution approaches, which typically require prior knowledge of cell type frequencies or in vitro gene expression profiles for each cell type. By using cell type specific genes, DSA overcomes these limitations. We have also demonstrated that DSA is an unbiased estimation algorithm for signal reconstruction and deconvolution. Downstream analysis, such as differential gene analysis, will benefit from digital sorting and yield better results. Most important of all, we have demonstrated that DSA can be used to extract the expression profile of a specific cell type from a complex tissue. This will allow for investigation of the properties of specific cell types in mixed tissues in vivo. For example, we can obtain the gene expression profile of a particular cell type in the cancer microenvironment just from microarray data from the bulk tumor. In principle, the DSA framework could be applied to any unbiased high-throughput dataset, such as global DNA methylation array, next generation sequencing data, metabolic data, and proteomics. Partially, RNA-seq is a more accurate technology compared to microarray. The linearity assumption holds true in RNA-seq studies, hence we believe that our DSA framework can also be applied to RNA-seq data.
In many real-world applications, a small number of cell type specific markers are often available to molecular biologists since these markers are frequently used in biochemical assays. For example, the cancer stem cell markers are known for many types of tumors. The use of these markers in gene expression deconvolution greatly improved the performance and also enabled the application of this algorithm not only to cancer studies, but also to other biological studies involved with heterogeneous samples.
DSA is implemented in a single R package (https://github.com/zhandong/DSA). The package also includes sample data from liver, brain and lung benchmark data and the cell type specific markers.
Liver, brain and lung tissues derived from a single rat were homogenized, extracted, and mixed in 11 different proportions in triplicates. The gene expression profile of these mixed tissues were measured using Affymetrix array and can be obtained from GSE19830. Immune cell expression profiles were obtained from GSE11057 and GSE24759. Hodgkin’s lymphoma dataset was obtained from GSE17920. The tumor associated macrophage marker genes were obtained from GSE18404.
Six different immune cell lines that were used to construct references are available in the simulated blood samples. The weights were sampled randomly in decreasing order (Additional file 6: Table S3).
where S represents the source signal, W is the weight matrix for cell type frequencies, and O is the observation on tissue samples. In a typical gene expression profiling setting, O is often measured through microarray or RNA-seq. Both W and S are unknown and our goal is to estimate S. We approach this problem by first estimating W using cell type specific markers and then solve the linear model using estimated W.
When the number of observations on the mixed samples is greater the number of cell types involved that is p > k, we can solve the system of equations with k unknown parameters. Once is known, we can take into equation (3) and compute the cell type frequency matrix.
Input: Expression data on tissue samples and a set of gene symbols that is known to be highly expressed in a specific cell type.
Output: Expression profile for each of the cell types in a tissue.
Step I: If W is known, proceed to step II, else estimate W using X S and equation 3.
where O is the gene expression profile on tissue samples, S is the expression profile for pure cell types, W is the weight matrix estimated using the marker genes, and t 1 and t 2 is the maximum and minimum measurable gene expression level. R package ‘quadprog’ is used to solve the quadratic programming problem.
R package ‘ROCR’ from CRAN  was used to compute the ROC curve and the area under curve (AUC). Specifically, genes with more than 2 fold increase or decrease are included in the reference list as the positive set. Our goal is to assess the true positive rate and false positive rate in identifying these genes using our estimated gene expression profiles. A ratio between the estimated gene expression profiles of two different cell types is used to compute the ROC curve. A gene is classified into the positive set if the ratio of this gene is greater than a threshold t. A ROC curve is generated by varying t.
Five tumor associated macrophage (TAM) marker genes were selected by comparing the macrophage in mouse mammary tumor to the normal splenic macrophages. The percentage and expression profile of TAMs were estimated using DSA algorithm. A cox proportional hazard model was used and a significant association was identified with survival. Further, patients are dichotomized into two groups by comparing to the median percentage of TAMs. A log rank test was calculated on these data.
We thank Huda Y. Zoghbi, Juan Botas, Lewis A. Chodosh, Olivier Litcharge, and Tiemo Klisch for helpful insights and critical comments on this manuscript.
Financial support from the CIBR seed grant, Houston Bioinformatics Endowment, Sontag Foundation Distinguished Scientist Award and Childhood Brain Tumor Foundation are gratefully acknowledged. LMLC is a St. Baldrick’s Foundation Scholar.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.