Digital sorting of complex tissues for cell type-specific gene expression profiles
© Zhong et al.; licensee BioMed Central Ltd. 2013
Received: 13 September 2012
Accepted: 14 February 2013
Published: 7 March 2013
Cellular heterogeneity is present in almost all gene expression profiles. However, transcriptome analysis of tissue specimens often ignores the cellular heterogeneity present in these samples. Standard deconvolution algorithms require prior knowledge of the cell type frequencies within a tissue or their in vitro expression profiles. Furthermore, these algorithms tend to report biased estimations.
Here, we describe a Digital Sorting Algorithm (DSA) for extracting cell-type specific gene expression profiles from mixed tissue samples that is unbiased and does not require prior knowledge of cell type frequencies.
The results suggest that DSA is a specific and sensitivity algorithm in gene expression profile deconvolution and will be useful in studying individual cell types of complex tissues.
Cellular heterogeneity is present in nearly all biological specimens. When the genome-wide transcriptional profile of heterogeneous samples is measured under different physiological states, any observed differences are strongly confounded by differences in cell type compositions between samples[1-3]. Recent studies suggest that the microenvironment of a tissue may change under different physiological states and can contribute to the etiology of diverse diseases[4-10]. Consequently, to fully understand gene expression differences associated with different physiological states, deconvolution of tissue expression into the component expression profiles of each cell type is critically needed.
Fluorescence Activated Cell Sorting (FACS), Laser Capture Micro-dissection (LCM) and Translating Ribosome Affinity Purification (TRAP) have been used to physically separate defined cell types before gene expression analysis[2, 11-13]. However, technical difficulties, such as limited availability of good surface markers, cell type specific promoters and transgenic models, have restricted the application of these techniques. Furthermore, the sorting process may introduce additional stress on cells and hence alter their gene expression profiles.
Unsupervised mixture models have been developed to solve the gene expression devolution problem. For example, a Latent Dirichlet Allocation (LDA) model trained with a variational expectation maximization framework was used to estimate the breast cancer cell gene expression profiles from heterogeneous tumor samples. An alternative approach is to first reduce the observed mixture using standard dimension reduction algorithm, such as principal components analysis (PCA) or independent components analysis (ICA), find a minimum-volume polytope with k vertices that enclose the reduced data and then transform the reduced data back to the gene expression profiles. The success application of these unsupervised approaches will depend on the availability of large number of observation over a wide range of tissues[14, 16]. In addition, these algorithms do not use the biological knowledge on the cell type markers.
Several supervised and semi-supervised computational deconvolution algorithms have also been proposed to tackle this problem[17-21]. However, they require prior knowledge of either the cell type frequencies within a given tissue[19, 20], or the in vitro gene expression profiles of each component cell type[17, 18]. In reality, this information can be difficult to obtain and presents a major roadblock for these kinds of approaches. Our previous work has proved that gene expression deconvolution should be done in linear space rather than log-transformed space, as is often used in microarray studies. Based on our previous findings, we propose a novel Digital Sorting Algorithm (DSA) that can deconvolve the expression of a tissue into the component profiles of each cell type using a set of marker genes that are highly expressed in each cell type.
Results and discussion
Next, we examined whether DSA can accurately deconvolve the gene expression profile from mixed tissue samples into tissue specific expression profiles. Using the cell type frequencies we estimated using marker genes, we were able to accurately estimate the expression profiles of the liver, brain and lung cells that constitute the mixture (Figure 1b-d). The deconvolved expression was highly correlated with the true gene expression profile in each tissue type. The error measure was smaller for genes that are highly expressed, as would be predicted given that technical variations tend to have larger impact for genes that are expressed at low levels. The effect of the number of marker genes used in estimating the proportion of each cell type was also studied. We randomly sampled the marker genes from the TIGER list (Additional file1: Table S1). In 100 repetitions, we plotted the correlation and mean absolute difference between the estimated and pure cell-specific expressions against the number of markers used. Our results demonstrate that DSA is robust to the number of marker genes and only requires several marker genes for accurate gene expression deconvolution (Additional file2: Figure S1).
We next asked whether DSA can identify differentially expressed genes between different tissue types. To do this, we computed the gene expression fold change using the deconvolved gene expression profile, and then carried out a Receiver Operator Curve (ROC) analysis to assess DSA’s ability to detect changes more than two-fold between any tissue types. Our results demonstrate that DSA is highly specific and sensitive in identifying differentially expressed genes (Figure 1e-f).
In the benchmark data, liver, brain and lung were used to construct the mixtures. However, the expression differences between different cell types within a tissue sample are much smaller compared to the differences between liver, brain and lung. Hence, we tested whether our DSA algorithm works on real tissue samples composed of cell types with gene expression profiles that are more similar to each other.
Population specific expression analysis (PSEA) is an algorithm that has the same input parameters as DSA. However, PSEA uses the marker gene information as normalization factors in the gene expression deconvolution analysis. Hence, the estimated gene expression profiles are not the absolute gene expression values, but are relative to the average of the marker genes for each cell type. In practice, the marker genes from different cells are not guaranteed to have the same expression level. This critical assumption of PSEA makes comparing results between different cell types biased towards the marker gene expression.
In conclusion, we have demonstrated the general feasibility of a Digital Sorting Algorithm (DSA) to obtain cell type specific gene expression profiles from a complex tissue. DSA represents a dramatic improvement over the conventional deconvolution approaches, which typically require prior knowledge of cell type frequencies or in vitro gene expression profiles for each cell type. By using cell type specific genes, DSA overcomes these limitations. We have also demonstrated that DSA is an unbiased estimation algorithm for signal reconstruction and deconvolution. Downstream analysis, such as differential gene analysis, will benefit from digital sorting and yield better results. Most important of all, we have demonstrated that DSA can be used to extract the expression profile of a specific cell type from a complex tissue. This will allow for investigation of the properties of specific cell types in mixed tissues in vivo. For example, we can obtain the gene expression profile of a particular cell type in the cancer microenvironment just from microarray data from the bulk tumor. In principle, the DSA framework could be applied to any unbiased high-throughput dataset, such as global DNA methylation array, next generation sequencing data, metabolic data, and proteomics. Partially, RNA-seq is a more accurate technology compared to microarray. The linearity assumption holds true in RNA-seq studies, hence we believe that our DSA framework can also be applied to RNA-seq data.
In many real-world applications, a small number of cell type specific markers are often available to molecular biologists since these markers are frequently used in biochemical assays. For example, the cancer stem cell markers are known for many types of tumors. The use of these markers in gene expression deconvolution greatly improved the performance and also enabled the application of this algorithm not only to cancer studies, but also to other biological studies involved with heterogeneous samples.
DSA is implemented in a single R package (https://github.com/zhandong/DSA). The package also includes sample data from liver, brain and lung benchmark data and the cell type specific markers.
Liver, brain and lung tissues derived from a single rat were homogenized, extracted, and mixed in 11 different proportions in triplicates. The gene expression profile of these mixed tissues were measured using Affymetrix array and can be obtained from GSE19830. Immune cell expression profiles were obtained from GSE11057 and GSE24759. Hodgkin’s lymphoma dataset was obtained from GSE17920. The tumor associated macrophage marker genes were obtained from GSE18404.
Simulated immune cells
Six different immune cell lines that were used to construct references are available in the simulated blood samples. The weights were sampled randomly in decreasing order (Additional file6: Table S3).
Linear model on gene expression deconvolution
where S represents the source signal, W is the weight matrix for cell type frequencies, and O is the observation on tissue samples. In a typical gene expression profiling setting, O is often measured through microarray or RNA-seq. Both W and S are unknown and our goal is to estimate S. We approach this problem by first estimating W using cell type specific markers and then solve the linear model using estimated W.
Estimate cell type frequencies matrix W from marker genes
When the number of observations on the mixed samples is greater the number of cell types involved that is p > k, we can solve the system of equations with k unknown parameters. Once is known, we can take into equation (3) and compute the cell type frequency matrix.
Digital sorting on tissue samples
Input: Expression data on tissue samples and a set of gene symbols that is known to be highly expressed in a specific cell type.
Output: Expression profile for each of the cell types in a tissue.
Step I: If W is known, proceed to step II, else estimate W using X S and equation 3.
where O is the gene expression profile on tissue samples, S is the expression profile for pure cell types, W is the weight matrix estimated using the marker genes, and t 1 and t 2 is the maximum and minimum measurable gene expression level. R package ‘quadprog’ is used to solve the quadratic programming problem.
Receiver operator characteristic (ROC) analysis
R package ‘ROCR’ from CRAN was used to compute the ROC curve and the area under curve (AUC). Specifically, genes with more than 2 fold increase or decrease are included in the reference list as the positive set. Our goal is to assess the true positive rate and false positive rate in identifying these genes using our estimated gene expression profiles. A ratio between the estimated gene expression profiles of two different cell types is used to compute the ROC curve. A gene is classified into the positive set if the ratio of this gene is greater than a threshold t. A ROC curve is generated by varying t.
Five tumor associated macrophage (TAM) marker genes were selected by comparing the macrophage in mouse mammary tumor to the normal splenic macrophages. The percentage and expression profile of TAMs were estimated using DSA algorithm. A cox proportional hazard model was used and a significant association was identified with survival. Further, patients are dichotomized into two groups by comparing to the median percentage of TAMs. A log rank test was calculated on these data.
We thank Huda Y. Zoghbi, Juan Botas, Lewis A. Chodosh, Olivier Litcharge, and Tiemo Klisch for helpful insights and critical comments on this manuscript.
Financial support from the CIBR seed grant, Houston Bioinformatics Endowment, Sontag Foundation Distinguished Scientist Award and Childhood Brain Tumor Foundation are gratefully acknowledged. LMLC is a St. Baldrick’s Foundation Scholar.
- Cobb JP: Application of genome-wide expression analysis to human health and disease. Proc Natl Acad Sci 2005, 102: 4801-4806. 10.1073/pnas.0409768102PubMed CentralView ArticlePubMedGoogle Scholar
- Repsilber D, Kern S, Telaar A, Walzl G, Black GF, Selbig J, Parida SK, Kaufmann SH, Jacobsen M: Biomarker discovery in heterogeneous tissue samples -taking the in-silico deconfounding approach. BMC Bioinformatics 2010, 11: 27. 10.1186/1471-2105-11-27PubMed CentralView ArticlePubMedGoogle Scholar
- Palmer C, Diehn M, Alizadeh AA, Brown PO: BMC Genomics. 2006, 7: 115. 10.1186/1471-2164-7-115PubMed CentralView ArticlePubMedGoogle Scholar
- Hanahan D, Weinberg RA: Hallmarks of cancer: the next generation. Cell 2011, 144: 646-674. 10.1016/j.cell.2011.02.013View ArticlePubMedGoogle Scholar
- Yang F: Stromal expression of connective tissue growth factor promotes angiogenesis and prostate cancer tumorigenesis. Cancer Res 2005, 65: 8887-8895. 10.1158/0008-5472.CAN-05-1702View ArticlePubMedGoogle Scholar
- Tuxhorn JA, Ayala GE, Smith MJ, Smith VC, Dang TD, Rowley DR: Reactive stroma in human prostate cancer: induction of myofibroblast phenotype and extracellular matrix remodeling. Clin Cancer Res 2002, 8: 2912-2923.PubMedGoogle Scholar
- Joyce JA, Pollard JW: Microenvironmental regulation of metastasis. Nat Rev Cancer 2008, 9: 239-252. 10.1038/nrg2317PubMed CentralView ArticlePubMedGoogle Scholar
- Yu H, Kortylewski M, Pardoll D: Crosstalk between cancer and immune cells: role of STAT3 in the tumour microenvironment. Nat Rev Immunol 2007, 7: 41-51. 10.1038/nri1995View ArticlePubMedGoogle Scholar
- Salsman VS, Chow KKH, Shaffer DR, Kadikoy H, Li X-N, Gerken C, Perlaky L, Metelitsa LS, Gao X, Bhattacharjee M, Hirschi K, Heslop HE, Gottschalk S, Ahmed N: Crosstalk between medulloblastoma cells and endothelium triggers a strong chemotactic signal recruiting T lymphocytes to the tumor microenvironment. PLoS One 2011, 6: e20267. 10.1371/journal.pone.0020267PubMed CentralView ArticlePubMedGoogle Scholar
- Qian B-Z, Li J, Zhang H, Kitamura T, Zhang J, Campion LR, Kaiser EA, Snyder LA, Pollard JW: CCL2 recruits inflammatory monocytes to facilitate breast-tumour metastasis. Nature 2011, 475: 222-225. 10.1038/nature10138PubMed CentralView ArticlePubMedGoogle Scholar
- Pahler JC, Tazzyman S, Erez N, Chen Y-Y, Murdoch C, Nozawa H, Lewis CE, Hanahan D: Plasticity in tumor-promoting inflammation: impairment of macrophage recruitment evokes a compensatory neutrophil response. Neoplasia (New York, NY) 2008, 10: 329-340.View ArticleGoogle Scholar
- Okaty BW, Sugino K, Nelson SB: A quantitative comparison of cell-type-specific microarray gene expression profiling methods in the mouse brain. PLoS One 2011, 6: e16493. 10.1371/journal.pone.0016493PubMed CentralView ArticlePubMedGoogle Scholar
- Hoelzinger DB, Nakada M, Demuth T, Rosensteel T, Reavie LB, Berens ME: Autotaxin: a secreted autocrine/paracrine factor that promotes glioma invasion. J Neurooncol 2007, 86: 297-309.View ArticlePubMedGoogle Scholar
- Quon G, Morris Q: ISOLATE: a computational strategy for identifying the primary origin of cancers using high-throughput sequencing. Bioinformatics 2009, 25: 2882-2889. 10.1093/bioinformatics/btp378PubMed CentralView ArticlePubMedGoogle Scholar
- Schwartz R, Shackney SE: Applying unmixing to gene expression data for tumor phylogeny inference. BMC Bioinformatics 2010, 11: 42. 10.1186/1471-2105-11-42PubMed CentralView ArticlePubMedGoogle Scholar
- Qiao W, Quon G, Csaszar E, Yu M, Morris Q, Zandstra PW: PERT: a method for expression deconvolution of human blood samples from varied microenvironmental and developmental conditions. PLoS Comput Biol 2012, 8: e1002838. 10.1371/journal.pcbi.1002838PubMed CentralView ArticlePubMedGoogle Scholar
- Lu P: Expression deconvolution: a reinterpretation of DNA microarray data reveals dynamic changes in cell populations. Proc Natl Acad Sci 2003, 100: 10370-10375. 10.1073/pnas.1832361100PubMed CentralView ArticlePubMedGoogle Scholar
- Wang M, Master SR, Chodosh LA: BMC Bioinformatics. 2006, 7: 328. 10.1186/1471-2105-7-328PubMed CentralView ArticlePubMedGoogle Scholar
- Abbas AR, Wolslegel K, Seshasayee D, Modrusan Z, Clark HF: Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PLoS One 2009, 4: e6098. 10.1371/journal.pone.0006098PubMed CentralView ArticlePubMedGoogle Scholar
- Shen-Orr SS, Tibshirani R, Khatri P, Bodian DL, Staedtler F, Perry NM, Hastie T, Sarwal MM, Davis MM, Butte AJ: Cell type-specific gene expression differences in complex tissues. Nat Methods 2010, 7: 287-289. 10.1038/nmeth.1439PubMed CentralView ArticlePubMedGoogle Scholar
- Gaujoux R, Seoighe C: Semi-supervised nonnegative matrix factorization for gene expression deconvolution: a case study. Infect Genet Evol 2012, 12: 913-921. 10.1016/j.meegid.2011.08.014View ArticlePubMedGoogle Scholar
- Zhong Y, Liu Z: Gene expression deconvolution in linear space. Nat Methods 2012, 9: 8-9. author reply 9 author reply 9View ArticleGoogle Scholar
- Liu X, Yu X, Zack DJ, Zhu H, Qian J: TiGER: a database for tissue-specific gene expression and regulation. BMC Bioinformatics 2008, 9: 271. 10.1186/1471-2105-9-271PubMed CentralView ArticlePubMedGoogle Scholar
- Yanai I, Benjamin H, Shmoish M, Chalifa-Caspi V, Shklar M, Ophir R, Bar-Even A, Horn-Saban S, Safran M, Domany E, Lancet D, Shmueli O: Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics 2005, 21: 650-659. 10.1093/bioinformatics/bti042View ArticlePubMedGoogle Scholar
- Kelley J, de Bono B, Trowsdale J: IRIS: a database surveying known human immune system genes. Genomics 2005, 85: 503-511. 10.1016/j.ygeno.2005.01.009View ArticlePubMedGoogle Scholar
- Kuhn A, Thu D, Waldvogel HJ, Faull RLM, Luthi-Carter R: Population-specific expression analysis (PSEA) reveals molecular changes in diseased brain. Nat Methods 2011, 8: 945-947. 10.1038/nmeth.1710View ArticlePubMedGoogle Scholar
- Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics 2005, 21: 3940-3941. 10.1093/bioinformatics/bti623View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.