A mixture model for expression deconvolution from RNA-seq in heterogeneous tissues
© Li and Xie; licensee BioMed Central Ltd. 2013
Published: 10 April 2013
Skip to main content
© Li and Xie; licensee BioMed Central Ltd. 2013
Published: 10 April 2013
RNA-seq, a next-generation sequencing based method for transcriptome analysis, is rapidly emerging as the method of choice for comprehensive transcript abundance estimation. The accuracy of RNA-seq can be highly impacted by the purity of samples. A prominent, outstanding problem in RNA-seq is how to estimate transcript abundances in heterogeneous tissues, where a sample is composed of more than one cell type and the inhomogeneity can substantially confound the transcript abundance estimation of each individual cell type. Although experimental methods have been proposed to dissect multiple distinct cell types, computationally "deconvoluting" heterogeneous tissues provides an attractive alternative, since it keeps the tissue sample as well as the subsequent molecular content yield intact.
Here we propose a probabilistic model-based approach, Transcript Estimation from Mixed Tissue samples (TEMT), to estimate the transcript abundances of each cell type of interest from RNA-seq data of heterogeneous tissue samples. TEMT incorporates positional and sequence-specific biases, and its online EM algorithm only requires a runtime proportional to the data size and a small constant memory. We test the proposed method on both simulation data and recently released ENCODE data, and show that TEMT significantly outperforms current state-of-the-art methods that do not take tissue heterogeneity into account. Currently, TEMT only resolves the tissue heterogeneity resulting from two cell types, but it can be extended to handle tissue heterogeneity resulting from multi cell types. TEMT is written in python, and is freely available at https://github.com/uci-cbcl/TEMT.
The probabilistic model-based approach proposed here provides a new method for analyzing RNA-seq data from heterogeneous tissue samples. By applying the method to both simulation data and ENCODE data, we show that explicitly accounting for tissue heterogeneity can significantly improve the accuracy of transcript abundance estimation.
The rapidly advancing next-generation sequencing based transcriptome analysis tool, RNA-seq, provides a comprehensive and accurate method for analyzing the entire RNA components of the transcriptome . The efficiency and sensitivity of RNA-seq make it a primary method for detecting alternatively-spliced forms and estimating their abundances [2, 3]. However, estimating transcript abundances in heterogeneous tissues by RNA-seq remains an unsolved, outstanding problem because of the confounding effect from different cell types . Many tissue samples from native environments are heterogeneous. For example, tumor samples are usually composed of tumor cells and surrounding normal cells . Therefore, reads from an RNA-seq experiment of tumor samples will consist of contributions from both tumor and normal cells. Additionally, tumor tissues themselves are often heterogeneous, consisting of different subclones (e.g. breast cancer subtypes ), leading to even more complicated tissue environments.
Experimental methods have been proposed to address issues arising from contamination of different cell types, such as laser-capture microdissection , which allows dissection of morphologically distinguishable cell types. The mRNA content yield by this technology is consequently lowered, and needs to be compensated for, usually by molecular amplification. However, the nonlinearity induced by amplifying mRNA  has its own problems, and can make the expression profiles of distinct cell types less distinguishable, weakening the sensitivity of RNA-seq technology. Other experimental approaches, including cell purification and enrichment, are comparatively expensive and laborious . Therefore developing alternative in silico approaches to resolving the tissue heterogeneity problem, especially in cancer research, remains a major problem in RNA-seq analysis .
Research in computational approaches to resolving the tissue heterogeneity problem of different biotechnologies has a fairly long history [11–14]. The first attempt to computationally micro-dissect heterogeneous tissues for microarray expression data was based on a linear model , which estimated both cell-type proportion and gene expression level. Prior information regarding "marker genes", which are genes uniquely expressed in each cell-type, was incorporated into the linear model to identify distinct cell types. The linear model was extended with Bayesian prior densities of cell-type proportions , and a posterior sampling approach was then constructed for cell-type-specific expression profiling. A statistical testing method  was proposed for single nucleotide polymorphism (SNP) array based copy number alterations analysis from heterogeneous tissue samples. In this method, Bayesian differentiation between hemizygous deletion and homozygous deletion were used to infer the underlying normal cell proportion and copy number profiles of both normal cells and tumor cells. One common feature shared by these methods is that they all adopted probabilistic models, not only allowing prior information about different cell types to be smoothly incorporated into the models, but also taking advantages of the flexibility of probabilistic model to capture specific aspects of each data type.
To the best of our knowledge, no computational approaches have been proposed to resolve the tissue heterogeneity problem from RNA-seq data in a probabilistic fashion. Typically, researchers apply transcriptional profiling tools designed for homogeneous tissue samples directly to RNA-seq data from heterogeneous tissue samples. Subsequent estimation results are interpreted as transcriptional profiling of a particular single cell type of interest. Therefore, we ask whether it is possible to estimate trancriptive abundances of individual cell types from RNA-seq of heterogeneous tissues, by decoupling the contributions from multiple cell types. We propose a probabilistic model-based approach, Transcript Estimation from Mixed Tissue samples (TEMT) to address this question. Currently, TEMT requires two sets of single-end RNA-seq reads. One read set is from a heterogeneous tissue sample composed of two cell types, while the other is from a pure tissue sample composed of one of the two cell types. TEMT incorporates prior information of cell type proportion and can calculate probabilities of RNA-seq reads sampled from each cell type. Because TEMT implements an online EM algorithm , it has a time requirement proportional to the data size and a constant memory requirement. To further improve the estimation accuracy, TEMT also implements a bias module, which incorporates both positional bias [16–18] and sequence-specific bias [19, 20].
To assess the performance of TEMT, we analyzed a series of both simulation and real data from ENCODE , and compared the transcript relative abundances estimation from TEMT to those obtained from other methods that do not take the tissue heterogeneity into account. Our results show that explicitly accounting for tissue heterogeneity can significantly improve transcript abundance estimation accuracy.
In this section, we first introduce the generative mixture model of TEMT. Combined with cell type proportion as prior information, we propose a maximum a posteriori estimation approach for finding model parameters. Next, we explain how to incorporate a positional and sequence-specific bias module into the model. Finally, we introduce an online EM algorithm for parameter estimation, reducing the time complexity to be proportional to the data size and the space complexity to be constant.
We focus on transcript abundance estimation. Denote as a set of reference transcripts, which we assume is known and complete. Let denote the length of transcript t in the set with , where T is the total number of transcripts in the reference set. Suppose we are interested in transcriptome analysis in two cell types: a and b. Let and denote the relative transcript abundance of transcript t in cell type a and b, respectively, with . We assume are properly normalized such that and .
We assume RNA-seq reads are available in two samples: one consisting of cells of only type a, which we call the "pure sample", and the other consisting of cells of both type a and b with percentage from cell type a and from cell type b, which we call the "mixed sample." In the cancer transcriptome analysis, cell type a can represent normal cells as it is usually easy to obtain a pure tissue sample, while cell type b can represent tumor cells as most tumor tissue samples are contaminated by normal cells.
Denote the read set from the pure sample by and the read set from the mixed sample by . Our goal is to estimate the relative abundance of each transcript in the reference set T from the RNA-seq read data and in both cell type a and b
We first map reads to the reference transcript set and convert the raw read data into a corresponding alignment representation. Denote the alignment representation of the read set by = , where if read i from aligns to transcript t and 0 otherwise, and N p is the total number of reads in read set . The alignment representation is similarly defined for read set from the mixed sample. Note that one read might map to multiple transcripts due to alternative splicing, sequence similarity shared by homologous genes, or other reasons. As a result, the summation of over all transcripts may be bigger than 1 for some i. These "ambiguous reads" introduce a major source of uncertainty into transcript abundance estimation.
We assume the fragment length x has a normal distribution with mean µ and variance , and is the normal probability density function of. By renormalizing , we obtain the discrete distribution of all possible fragment lengths. The effective length is then the expectation of the number of positions a read can start within transcript t, based on the discrete distribution of fragment length.
for s = p or m.
As it brings computational convenience in the following learning step.
Several analysis have noticed the identifiability problem [12, 13] in estimating cell type specific expression in heterogeneous tissue samples. Ideally, if the proportion information for some cell types is missing, we can then pool these cell types as one type, making the expression of each individual cell type inside unidentifiable. Previously, prior constraints have been used to resolve the problem [12, 13]. In our model, the prior knowledge of cell type proportions is combined with the model likelihood, and we subsequently use maximum a posteriori (MAP) estimation to find the optimal parameters.
Both positional [16–18] and sequence-specific [19, 20] sequencing biases have been observed in next generation sequencing data. These biases mainly result from non-uniformly distributed cDNA fragments during the RNA-seq library preparation . Under positional bias, reads positioning is not uniformly distributed across the effective length of the target transcript, but preferentially distributed around either the 5' end or the 3' end of the target transcript. Under sequence-specific bias, the sequences near the two ends of the fragments affect their probability to be sequenced. To account for these non-uniformity effects during transcript abundance estimation, we incorporate the bias module of  into our model.
for s=p or m.
In which C is a normalizing constant and the equality holds only if the conditional probabilities , , are the true posterior distributions of latent variables .
The EM framework maximizes Equation (17) by iteratively applying the expectation step and the maximization step to update both the conditional probabilities , , and model parameters Θ until convergence. The expectation step of typical batch EM algorithm has to fetch all the data points into memory, and calculates the conditional probabilities based on the average of all the data points. While this batch method guarantee's the log-likelihood function to monotonically increase, it also induces inefficiency in both time and space complexity. Considering the high-throughput nature of next-generation sequencing technology as well as its huge data size, we implemented the EM algorithm in an online fashion  to both lower the memory requirement and boost the convergence rate.
The main difference between the batch EM and the online EM is in the E-step. The E-step of the online EM algorithm first calculates the conditional probabilities of only one new data point, and then updates the conditional probabilities of all the current data points by interpolating between the conditional probabilities of all the previous data points and the conditional probabilities of the new data point, with a forgetting factor σ controlling the convergence rate.
It is shown in  that with the constraint 0.5 < σ ≤ 1, the online EM algorithm is asymptotically equivalent to stochastic gradient ascent, and is guaranteed to converge to the maximum likelihood estimator, which is extended to the maximum a posteriori estimator in our model.
Specifically, the online EM updates in our model is given by
In Equation (18-20), we compute the conditional probabilities , , of just one new read i + 1 based on previous parameter estimation , , ; In Equation (21-23), we compute the new conditional probabilities average , , by interpolating between the previous conditional probabilities average , , and , , . n is the index of iteration step and i is the index of data points. σ is the forgetting factor which controls the convergence rate, with the constraint 0.5 < σ ≤ 1.
In the subsequent M-step, parameters , , , are updated according to new conditional probabilities average , , .
We aligned the raw read set from either simulation or the ENCODE data to a given transcript set using bowtie-0.12.7 . For each read, we allowed 2 mismatches and reported at most 10 candidate alignments.
The abundance of each transcript in terms of estimated counts was estimated via both TEMT and a control model. Estimated counts is defined as the estimated number of reads generated from the target transcript. In TEMT, the prior of each cell type proportion was set to the same as the proportion used in simulation and ENCODE data respectively, and β a , β b was set to 10 times the size of the read set . μ = 200; σ = 80 were used as the mean and standard deviation of the RNA-seq fragment length distribution. We chose eXpress-0.9.4  as the control model, as it is the state-of-the-art method for transcript abundance estimation and also utilizes an online-EM algorithm. Note that, to run TEMT, we need two read sets, in which one is for the pure sample and the other is for the mixed sample, as previously mentioned. In contrast, to run eXpress, we only need one read set from either the pure sample or the mixed sample. The forgetting factor for the on-line EM algorithms in both TEMT and eXpress was set to be σ = 0:85, and the error-model in eXpress was disabled for comparison.
To measure the model accuracy, we used the Error Fraction (EF) measure introduced by  to quantify the discrepancy between the model estimates and the ground truth estimates. The Error Fraction is defined as the fraction of transcripts for which the estimates are significantly different (percent error >10% in our case) from the ground truth.
To show the utility of TEMT, we first carried out a series of simulation studies. To obtain simulated read sets, we used FluxSimulator , a software for transcriptome and read generation by simulating the biochemical processes underlying the library preparation. FluxSimulator requires a reference transcript set to start the simulation process, so we manually downloaded 406 transcripts of 208 alternatively spliced genes in human from Alternative Splicing Structural Genomics Project (AS3D) , and used these 406 transcripts as the reference transcript set. We first simulated the transcript expression process twice producing two sets of relative transcript abundances, corresponding to cell type a and b respectively. Based on these two transcript abundance sets, we then simulated 6 pairs of 1 million 75-bp single-end read sets corresponding to six different cell type b proportions from 40% up to 90%. The relative transcript abundances of cell type a and b were kept the same throughout these simulations. For each paired read set, one read set is for the pure sample composed of only cell type a, whereas the other read set is for the mixed sample composed of both cell type a and b, mixed with the cell type b proportion. Within the mixed-sample read set, we also extracted the reads simulated purely from cell type b, which was used for control model eXpress.
We note that the estimates of cell type a from TEMT achieve roughly the same accuracy, compared with the estimates from eXpress based on the read set of the pure sample of cell type a. Also, this accuracy does not change significantly under the effect of different cell type b proportions. This is mainly due to the pure sample read set of cell type a within the input data for TEMT.
Next we analyzed the recently released ENCODE data. Due to the lack of RNA-seq data sampled from mixed tissue samples with known cell type proportions, we artificially generated the mixed-sample read sets by mixing reads obtained from two different cell types. Specifically, we chose two Tier 1 cell lines, GM12878 and K562, and treated them as cell type a and cell type b respectively. The corresponding single-end RNA-seq data of these two cell lines, GM78 1×75D A 1 (UCSC Accession: wgEncodeEH000125) and K562 1×75D A 1 (UCSC Accession: wgEncodeEH000126) from the Wold lab  at Caltech, were download from ENCODE (2012). The data downloaded from the same lab under similar protocols is intended to reduce the deviation resulting from experiments. We then randomly selected 10 million reads from GM12878 cells to form the read set of the pure sample, and 10 million reads from both GM12878 and K562 cells using different K562 cells proportions to form the read set of the mixed sample. Similar to the previous simulation study, we extracted the reads purely selected from K562 cells within the mixed sample, and used them for the eXpress control model. We studied 6 different K562 cells proportions from 40% to 90% in order to compare with the previous simulation study. 36908 human RefSeq  transcripts from UCSC known genes  were used as the transcript set for the ENCODE data.
We formulated our model under the assumption that the heterogeneous tissue is only composed of two cell types, but in reality, a heterogeneous tissue might be much more complicated, consisting of multiple cell types. To relax this constraint, our model needs to be further extended to analyze more complex cases in which each cell type may have its own subtypes, e.g. breast cancer subtypes, leading to a more sophisticated heterogeneous tissue environment. Further dissecting cell subtype heterogeneity is the next step in refining our model. Moving from two cell types to arbitrarily many cell types is of great interest, since it may substantially facilitate transcriptome study of heterogeneous tissues.
One critical component necessary to make our model work is the prior information of cell type b proportion, which is necessary to resolve the identifiability problem of mixed samples. In real experiments, precise prior information regarding cell type proportions may be unavailable. One solution in the context of our model is to down weight the effect of the prior by decreasing the parameter β a , β b , which adds more uncertainty to the cell mixture proportion. However, this approach may decrease the performance of the model as the uncertainty in cell mixture proportion cannot be distinguished from the uncertainty in transcript abundance estimation. This observation suggests another direction to further improving our model which is to solely estimate cell type b proportion without the prior information. To fulfill this requirement, the identifiability problem needs to be resolved as mentioned in section 2.3, which turns out to be comparatively hard for RNA-seq data. Unlike the heterozygous and homozygous deletions in , which can be utilized to differentiate between the SNP array data generated by normal cells and tumor cells, there are no such explicit differences between the reads generated by distinct cell types in RNA-seq data, thus making the generative mixture model unconstrained. The "marker genes" method proposed by , which tries to distinguish distinct cell types by utilizing genes uniquely expressed in each cell type, provides a future potential direction to extend the current model.
In this article, we propose a probabilistic model-based method TEMT to estimate transcript abundance of individual cell types based on RNA-seq data from heterogeneous tissue samples. TEMT utilizes prior information to distinguish reads generated by each cell type within the heterogeneous tissue sample. Positional and sequence-specific biases are also incorporated to improve estimation accuracy. TEMT is able to process large datasets as the online EM algorithm is adopted to guarantee a time complexity proportional to the data size and a constant space complexity. Our experiments on both simulated datasets and ENCODE data shows that explicitly accounting for tissue heterogeneity can significantly improve the accuracy of transcript abundance estimation.
We gratefully acknowledge helpful discussions with Jake Biesinger, Daniel Newkirk and Ali Mortazavi. The work is partly supported by National Institute of Health grant R01HG006870.
Publication of this article was supported by National Institute of Health grant R01HG006870.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 5, 2013: Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S5.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.