Volume 14 Supplement 5

## Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2013)

# A mixture model for expression deconvolution from RNA-seq in heterogeneous tissues

- Yi Li
^{1}and - Xiaohui Xie
^{1, 2, 3}Email author

**14(Suppl 5)**:S11

https://doi.org/10.1186/1471-2105-14-S5-S11

© Li and Xie; licensee BioMed Central Ltd. 2013

**Published: **10 April 2013

## Abstract

### Background

RNA-seq, a next-generation sequencing based method for transcriptome analysis, is rapidly emerging as the method of choice for comprehensive transcript abundance estimation. The accuracy of RNA-seq can be highly impacted by the purity of samples. A prominent, outstanding problem in RNA-seq is how to estimate transcript abundances in heterogeneous tissues, where a sample is composed of more than one cell type and the inhomogeneity can substantially confound the transcript abundance estimation of each individual cell type. Although experimental methods have been proposed to dissect multiple distinct cell types, computationally "deconvoluting" heterogeneous tissues provides an attractive alternative, since it keeps the tissue sample as well as the subsequent molecular content yield intact.

### Results

Here we propose a probabilistic model-based approach, Transcript Estimation from Mixed Tissue samples (TEMT), to estimate the transcript abundances of each cell type of interest from RNA-seq data of heterogeneous tissue samples. TEMT incorporates positional and sequence-specific biases, and its online EM algorithm only requires a runtime proportional to the data size and a small constant memory. We test the proposed method on both simulation data and recently released ENCODE data, and show that TEMT significantly outperforms current state-of-the-art methods that do not take tissue heterogeneity into account. Currently, TEMT only resolves the tissue heterogeneity resulting from two cell types, but it can be extended to handle tissue heterogeneity resulting from multi cell types. TEMT is written in python, and is freely available at https://github.com/uci-cbcl/TEMT.

### Conclusions

The probabilistic model-based approach proposed here provides a new method for analyzing RNA-seq data from heterogeneous tissue samples. By applying the method to both simulation data and ENCODE data, we show that explicitly accounting for tissue heterogeneity can significantly improve the accuracy of transcript abundance estimation.

## Background

The rapidly advancing next-generation sequencing based transcriptome analysis tool, RNA-seq, provides a comprehensive and accurate method for analyzing the entire RNA components of the transcriptome [1]. The efficiency and sensitivity of RNA-seq make it a primary method for detecting alternatively-spliced forms and estimating their abundances [2, 3]. However, estimating transcript abundances in heterogeneous tissues by RNA-seq remains an unsolved, outstanding problem because of the confounding effect from different cell types [4]. Many tissue samples from native environments are heterogeneous. For example, tumor samples are usually composed of tumor cells and surrounding normal cells [5]. Therefore, reads from an RNA-seq experiment of tumor samples will consist of contributions from both tumor and normal cells. Additionally, tumor tissues themselves are often heterogeneous, consisting of different subclones (e.g. breast cancer subtypes [6]), leading to even more complicated tissue environments.

Experimental methods have been proposed to address issues arising from contamination of different cell types, such as laser-capture microdissection [7], which allows dissection of morphologically distinguishable cell types. The mRNA content yield by this technology is consequently lowered, and needs to be compensated for, usually by molecular amplification. However, the nonlinearity induced by amplifying mRNA [8] has its own problems, and can make the expression profiles of distinct cell types less distinguishable, weakening the sensitivity of RNA-seq technology. Other experimental approaches, including cell purification and enrichment, are comparatively expensive and laborious [9]. Therefore developing alternative *in silico* approaches to resolving the tissue heterogeneity problem, especially in cancer research, remains a major problem in RNA-seq analysis [10].

Research in computational approaches to resolving the tissue heterogeneity problem of different biotechnologies has a fairly long history [11–14]. The first attempt to computationally micro-dissect heterogeneous tissues for microarray expression data was based on a linear model [11], which estimated both cell-type proportion and gene expression level. Prior information regarding "marker genes", which are genes uniquely expressed in each cell-type, was incorporated into the linear model to identify distinct cell types. The linear model was extended with Bayesian prior densities of cell-type proportions [13], and a posterior sampling approach was then constructed for cell-type-specific expression profiling. A statistical testing method [14] was proposed for single nucleotide polymorphism (SNP) array based copy number alterations analysis from heterogeneous tissue samples. In this method, Bayesian differentiation between hemizygous deletion and homozygous deletion were used to infer the underlying normal cell proportion and copy number profiles of both normal cells and tumor cells. One common feature shared by these methods is that they all adopted probabilistic models, not only allowing prior information about different cell types to be smoothly incorporated into the models, but also taking advantages of the flexibility of probabilistic model to capture specific aspects of each data type.

To the best of our knowledge, no computational approaches have been proposed to resolve the tissue heterogeneity problem from RNA-seq data in a probabilistic fashion. Typically, researchers apply transcriptional profiling tools designed for homogeneous tissue samples directly to RNA-seq data from heterogeneous tissue samples. Subsequent estimation results are interpreted as transcriptional profiling of a particular single cell type of interest. Therefore, we ask whether it is possible to estimate trancriptive abundances of individual cell types from RNA-seq of heterogeneous tissues, by decoupling the contributions from multiple cell types. We propose a probabilistic model-based approach, Transcript Estimation from Mixed Tissue samples (TEMT) to address this question. Currently, TEMT requires two sets of single-end RNA-seq reads. One read set is from a heterogeneous tissue sample composed of two cell types, while the other is from a pure tissue sample composed of one of the two cell types. TEMT incorporates prior information of cell type proportion and can calculate probabilities of RNA-seq reads sampled from each cell type. Because TEMT implements an online EM algorithm [15], it has a time requirement proportional to the data size and a constant memory requirement. To further improve the estimation accuracy, TEMT also implements a bias module, which incorporates both positional bias [16–18] and sequence-specific bias [19, 20].

To assess the performance of TEMT, we analyzed a series of both simulation and real data from ENCODE [21], and compared the transcript relative abundances estimation from TEMT to those obtained from other methods that do not take the tissue heterogeneity into account. Our results show that explicitly accounting for tissue heterogeneity can significantly improve transcript abundance estimation accuracy.

## Methods

In this section, we first introduce the generative mixture model of TEMT. Combined with cell type proportion as prior information, we propose a maximum a posteriori estimation approach for finding model parameters. Next, we explain how to incorporate a positional and sequence-specific bias module into the model. Finally, we introduce an online EM algorithm for parameter estimation, reducing the time complexity to be proportional to the data size and the space complexity to be constant.

### Model

#### Basic definition

We focus on transcript abundance estimation. Denote $\mathcal{T}$ as a set of reference transcripts, which we assume is known and complete. Let ${l}_{t}$ denote the length of transcript *t* in the set with $t=1,\cdots \phantom{\rule{0.3em}{0ex}},T$, where *T* is the total number of transcripts in the reference set. Suppose we are interested in transcriptome analysis in two cell types: *a* and *b*. Let ${\rho}_{t}^{a}$ and ${\rho}_{t}^{b}$ denote the relative transcript abundance of transcript *t* in cell type *a* and *b*, respectively, with $t=1,\cdots \phantom{\rule{0.3em}{0ex}},T$. We assume ${\left\{{\rho}_{t}^{a}\right\}}_{t=1}^{T}$ are ${\left\{{\rho}_{t}^{b}\right\}}_{t=1}^{T}$ properly normalized such that ${\sum}_{t=1}^{T}{\rho}_{t}^{a}=1$ and ${\sum}_{t=1}^{T}{\rho}_{t}^{b}=1$.

We assume RNA-seq reads are available in two samples: one consisting of cells of only type *a*, which we call the "pure sample", and the other consisting of cells of both type *a* and *b* with percentage ${\mathcal{T}}^{a}$ from cell type *a* and ${\mathcal{T}}^{b}$ from cell type *b*, which we call the "mixed sample." In the cancer transcriptome analysis, cell type *a* can represent normal cells as it is usually easy to obtain a pure tissue sample, while cell type *b* can represent tumor cells as most tumor tissue samples are contaminated by normal cells.

*a*, its relative transcript abundance ${\rho}_{t}^{p}$ is described by ${\rho}_{t}^{p}={\rho}_{t}^{a}$ for all

*t*. However, the relative abundance of transcript

*t*within the mixed sample is a weighted sum of the transcript abundance of both cell type

*a*and

*b*

Denote the read set from the pure sample by ${\mathcal{R}}^{p}$ and the read set from the mixed sample by${\mathcal{R}}^{m}$. Our goal is to estimate the relative abundance of each transcript in the reference set *T* from the RNA-seq read data ${\mathcal{R}}^{p}$ and ${\mathcal{R}}^{m}$ in both cell type *a* and *b*

### Alignment representation

We first map reads to the reference transcript set $\mathcal{T}$ and convert the raw read data into a corresponding alignment representation. Denote the alignment representation of the read set ${\mathcal{R}}^{p}$ by ${\mathcal{Y}}^{p}$ =$\left\{{y}_{i,t}^{p}|i=1,\phantom{\rule{2.77695pt}{0ex}}\cdots \phantom{\rule{0.3em}{0ex}},\phantom{\rule{2.77695pt}{0ex}}{N}^{p},\phantom{\rule{2.77695pt}{0ex}}t=1,\phantom{\rule{2.77695pt}{0ex}}\cdots \phantom{\rule{0.3em}{0ex}},\phantom{\rule{2.77695pt}{0ex}}T\right\}$, where ${y}_{i,t}^{p}=1$ if read i from ${\mathcal{R}}^{p}$ aligns to transcript *t* and 0 otherwise, and *N*^{
p
} is the total number of reads in read set ${\mathcal{R}}^{p}$. The alignment representation ${\mathcal{Y}}^{m}=\left\{{y}_{i,t}^{m}|i=1,\phantom{\rule{2.77695pt}{0ex}}\cdots \phantom{\rule{0.3em}{0ex}},\phantom{\rule{2.77695pt}{0ex}}{N}^{m},\phantom{\rule{2.77695pt}{0ex}}t=1,\phantom{\rule{2.77695pt}{0ex}}\cdots \phantom{\rule{0.3em}{0ex}},\phantom{\rule{2.77695pt}{0ex}}T\right\}$ is similarly defined for read set ${\mathcal{R}}^{m}$ from the mixed sample. Note that one read might map to multiple transcripts due to alternative splicing, sequence similarity shared by homologous genes, or other reasons. As a result, the summation of ${y}_{i,t}^{p}$ over all transcripts may be bigger than 1 for some *i*. These "ambiguous reads" introduce a major source of uncertainty into transcript abundance estimation.

### Generative model

*t*from the reference transcript set $\mathcal{T}$ according to its relative abundance and effective length, and then generates a read from a random location of the chosen transcript. Under this model, the probability of a read originating from transcript

*t*is

*s*being either

*p*for the pure sample or

*m*for the mixed sample. Here, ${\stackrel{\u0303}{l}}_{t}$ is the effective length of transcript

*t*, which quantifies the number of positions at which a read can start within transcript

*t*. Different methods have been proposed to model the effective length [19, 22]. In TEMT, the effective length is modeled with consideration to the length distribution of RNA-seq fragments [19]

We assume the fragment length *x* has a normal distribution with mean *µ* and variance ${\sigma}^{2}$, and $\varphi \left(x;\mu ,\phantom{\rule{2.77695pt}{0ex}}{\sigma}^{2}\right)$ is the normal probability density function of. By renormalizing $\varphi \left(x;\mu ,\phantom{\rule{2.77695pt}{0ex}}{\sigma}^{2}\right)$, we obtain the discrete distribution of all possible fragment lengths. The effective length ${\stackrel{\u0303}{l}}_{t}$ is then the expectation of the number of positions a read can start within transcript *t*, based on the discrete distribution of fragment length.

*i*as represented by its alignment map is

for *s* = *p* or *m*.

*t*as it is the parameter of pure sample, but unlike the linear form in Equation (1), ${\alpha}_{t}^{m}$ in terms of ${\alpha}_{t}^{a},{\alpha}_{t}^{b}$ is given as a nonlinear form

*±*0

*:*05. So we approximate ${\alpha}_{t}^{m}$ with the linear form

As it brings computational convenience in the following learning step.

### Maximum a posteriori estimation

Several analysis have noticed the identifiability problem [12, 13] in estimating cell type specific expression in heterogeneous tissue samples. Ideally, if the proportion information for some cell types is missing, we can then pool these cell types as one type, making the expression of each individual cell type inside unidentifiable. Previously, prior constraints have been used to resolve the problem [12, 13]. In our model, the prior knowledge of cell type proportions is combined with the model likelihood, and we subsequently use maximum a posteriori (MAP) estimation to find the optimal parameters.

*Beta*$\left({\beta}^{a},\phantom{\rule{2.77695pt}{0ex}}{\beta}^{b}\right)$ distribution as the prior for cell proportions of type

*a*and type

*b*. The parameter ${\beta}^{a},\phantom{\rule{0.3em}{0ex}}{\beta}^{b}$ quantify the location and sharpness of the prior. Practically, we found setting ${\beta}^{a},\phantom{\rule{0.3em}{0ex}}{\beta}^{b}$ 10 times as the data size gave a good convergence rate and accuracy. Combining the prior with the likelihood given in Equation (11), the posterior distribution of the model is proportional to

### Incorporating sequencing bias

Both positional [16–18] and sequence-specific [19, 20] sequencing biases have been observed in next generation sequencing data. These biases mainly result from non-uniformly distributed cDNA fragments during the RNA-seq library preparation [20]. Under positional bias, reads positioning is not uniformly distributed across the effective length of the target transcript, but preferentially distributed around either the 5' end or the 3' end of the target transcript. Under sequence-specific bias, the sequences near the two ends of the fragments affect their probability to be sequenced. To account for these non-uniformity effects during transcript abundance estimation, we incorporate the bias module of [19] into our model.

*i*from either read set ${\mathcal{R}}^{p}$ or ${\mathcal{R}}^{m}$, we denote ${b}_{i,\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}t}^{s}\in \left[0,\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{\stackrel{\u0303}{l}}_{t}\right]$ as the starting position of the alignment within transcript

*t*relative to the 5' end of the strand. We also denote ${\pi}_{i,\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}t}^{s}\in {\mathrm{\Sigma}}^{L}$, where $\mathrm{\Sigma}=\left\{A,\phantom{\rule{2.77695pt}{0ex}}C,\phantom{\rule{2.77695pt}{0ex}}G,\phantom{\rule{2.77695pt}{0ex}}T\right\}$, as the local sequence of transcript

*t*with length

*L*and centered at ${b}_{i,\phantom{\rule{0.3em}{0ex}}t}^{s}$.Then we define the bias weight ${w}_{i,\phantom{\rule{0.3em}{0ex}}t}^{s}$ as

for s=*p* or *m*.

*and*${\pi}_{i,\phantom{\rule{0.3em}{0ex}}t}^{s}$ under the bias model to the probability under the uniform model. If no bias exists, the weight ${w}_{i,\phantom{\rule{0.3em}{0ex}}t}^{s}$ reduces to 1. The bias re-weighted Equation (4) is then:

### Online EM algorithm for learning

*i*from read set ${\mathcal{R}}^{p}$ of pure sample, we denote the latent variable of the transcript alignment representation as ${\mathcal{Z}}_{i}^{p}=\left\{{z}_{i,t}^{p}|t=1,\phantom{\rule{2.77695pt}{0ex}}\cdots \phantom{\rule{0.3em}{0ex}},\phantom{\rule{2.77695pt}{0ex}}T\right\}$, where ${z}_{i,t}^{p}=1$if read

*i*aligns to transcript

*t*and 0 otherwise. But now $\sum _{t=1}^{T}}{z}_{i,t}^{p}=1$, which means only one ${z}_{i,t}^{p}=1$, indicating read

*i*is actually originating from transcript

*t*. Similarly, for each read

*i*from read set ${\mathcal{R}}^{m}$ of mixed sample, we denote the latent variable of the transcript alignment representation as ${\mathcal{Z}}_{i}^{m}=\left\{{z}_{i,t}^{ma},\phantom{\rule{2.77695pt}{0ex}}{z}_{i,t}^{mb}|t=1,\phantom{\rule{2.77695pt}{0ex}}\cdots \phantom{\rule{0.3em}{0ex}},\phantom{\rule{2.77695pt}{0ex}}T\right\}$, where ${z}_{i,t}^{ma}=1$ if read

*i*aligns to transcript

*t*and is originating from cell type

*a*within the mixed sample, and 0 otherwise. ${z}_{i,t}^{mb}=1$ or 0 is similar defined for cell type

*b*. Thus ${\sum}_{t=1}^{T}\left({z}_{i,\phantom{\rule{0.3em}{0ex}}t}^{ma}+{z}_{i,\phantom{\rule{0.3em}{0ex}}t}^{mb}\right)=1$ means read

*i*is actually originating from only one transcript, and either from cell type

*a*or

*b*within the mixed sample. We also define the auxiliary variable ${q}_{i,t}^{p}=\mathbb{P}\left({z}_{i,t}^{p}=1|\mathrm{\Theta},\phantom{\rule{2.77695pt}{0ex}}{\mathcal{Y}}^{p},\phantom{\rule{2.77695pt}{0ex}}{\mathcal{Y}}^{m}\right)$, ${q}_{i,\phantom{\rule{0.3em}{0ex}}t}^{ma}=\mathbb{P}\left({z}_{i,\phantom{\rule{0.3em}{0ex}}t}^{ma}=1|\mathrm{\Theta},\phantom{\rule{2.77695pt}{0ex}}{\mathcal{Y}}^{p},\phantom{\rule{2.77695pt}{0ex}}{\mathcal{Y}}^{m}\right)$ and ${q}_{i,\phantom{\rule{0.3em}{0ex}}t}^{mb}=\mathbb{P}\left({z}_{i,\phantom{\rule{0.3em}{0ex}}t}^{mb}=1|\mathrm{\Theta},\phantom{\rule{2.77695pt}{0ex}}{\mathcal{Y}}^{p},\phantom{\rule{2.77695pt}{0ex}}{\mathcal{Y}}^{m}\right)$ as the conditional probability weight of each latent variable ${z}_{i,\phantom{\rule{0.3em}{0ex}}t}^{p}=1$, ${z}_{i,\phantom{\rule{0.3em}{0ex}}t}^{ma}=1$ and ${z}_{i,\phantom{\rule{0.3em}{0ex}}t}^{mb}=1$ conditional on model parameters Θ and the observed read alignment representations ${\mathcal{Y}}^{p}{\mathcal{Y}}^{m}$. Then based on Jensen's inequality [24], the complete posterior distribution, which is also the lower bound of Equation (15) can be written as

In which *C* is a normalizing constant and the equality holds only if the conditional probabilities ${q}_{i,\phantom{\rule{0.3em}{0ex}}t}^{p}$, ${q}_{i,\phantom{\rule{0.3em}{0ex}}t}^{ma}$, ${q}_{i,\phantom{\rule{0.3em}{0ex}}t}^{mb}$ are the true posterior distributions of latent variables ${\left\{{\mathcal{Z}}_{i}^{p}\right\}}_{i=1}^{{N}^{p}},{\left\{{\mathcal{Z}}_{i}^{m}\right\}}_{i=1}^{{N}^{m}}$.

The EM framework maximizes Equation (17) by iteratively applying the expectation step and the maximization step to update both the conditional probabilities ${q}_{i,\phantom{\rule{0.3em}{0ex}}t}^{p}$, ${q}_{i,\phantom{\rule{0.3em}{0ex}}t}^{ma}$, ${q}_{i,\phantom{\rule{0.3em}{0ex}}t}^{mb}$ and model parameters Θ until convergence. The expectation step of typical batch EM algorithm has to fetch all the data points into memory, and calculates the conditional probabilities based on the average of all the data points. While this batch method guarantee's the log-likelihood function to monotonically increase, it also induces inefficiency in both time and space complexity. Considering the high-throughput nature of next-generation sequencing technology as well as its huge data size, we implemented the EM algorithm in an online fashion [15] to both lower the memory requirement and boost the convergence rate.

The main difference between the batch EM and the online EM is in the E-step. The E-step of the online EM algorithm first calculates the conditional probabilities of only one new data point, and then updates the conditional probabilities of all the current data points by interpolating between the conditional probabilities of all the previous data points and the conditional probabilities of the new data point, with a forgetting factor *σ* controlling the convergence rate.

It is shown in [15] that with the constraint 0.5 *< σ* ≤ 1, the online EM algorithm is asymptotically equivalent to stochastic gradient ascent, and is guaranteed to converge to the maximum likelihood estimator, which is extended to the maximum a posteriori estimator in our model.

Specifically, the online EM updates in our model is given by

### E-step

In Equation (18-20), we compute the conditional probabilities ${q}_{i+1,\phantom{\rule{0.3em}{0ex}}t}^{p}$, ${q}_{i+1,\phantom{\rule{0.3em}{0ex}}t}^{ma}$, ${q}_{i+1,\phantom{\rule{0.3em}{0ex}}t}^{mb}$ of just one new read *i* + 1 based on previous parameter estimation ${\left\{{\alpha}_{t}^{a\left(n\right)}\right\}}_{t\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}1}^{T}$, ${\left\{{\alpha}_{t}^{b\left(n\right)}\right\}}_{t\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}1}^{T}$, ${\tau}^{a\left(n\right)},{\tau}^{b\left(n\right)}$; In Equation (21-23), we compute the new conditional probabilities average ${q}_{*,\phantom{\rule{0.3em}{0ex}}t}^{p\left(n+1\right)}$, ${q}_{*,\phantom{\rule{0.3em}{0ex}}t}^{ma\left(n+1\right)}$, ${q}_{*,\phantom{\rule{0.3em}{0ex}}t}^{mb\left(n+1\right)}$ by interpolating between the previous conditional probabilities average ${q}_{*,\phantom{\rule{0.3em}{0ex}}t}^{p\left(n\right)}$, ${q}_{*,\phantom{\rule{0.3em}{0ex}}t}^{ma\left(n\right)}$, ${q}_{*,\phantom{\rule{0.3em}{0ex}}t}^{mb\left(n\right)}$ and ${q}_{i+1,\phantom{\rule{0.3em}{0ex}}t}^{p}$, ${q}_{i+1,\phantom{\rule{0.3em}{0ex}}t}^{ma}$, ${q}_{i+1,\phantom{\rule{0.3em}{0ex}}t}^{mb}$. *n* is the index of iteration step and *i* is the index of data points. *σ* is the forgetting factor which controls the convergence rate, with the constraint 0.5 *< σ* ≤ 1.

### M-step

In the subsequent M-step, parameters ${\left\{{\alpha}_{t}^{a\left(n+1\right)}\right\}}_{t\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}1}^{T}$, ${\left\{{\alpha}_{t}^{b\left(n+1\right)}\right\}}_{t\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}1}^{T}$, ${\tau}^{a\left(n+1\right)}$, ${\tau}^{b\left(n+1\right)}$ are updated according to new conditional probabilities average ${q}_{*,\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}t}^{p\left(n+1\right)}$, ${q}_{*,\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}t}^{ma\left(n+1\right)}$, ${q}_{*,\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}t}^{mb\left(n+1\right)}$.

## Results

Next we test the performance of the proposed method on both simulation data and the recently released ENCODE data [21]. For both datasets, we used the following three-step protocol and parameters to construct the analysis:

**1**. We aligned the raw read set from either simulation or the ENCODE data to a given transcript set using bowtie-0.12.7 [25]. For each read, we allowed 2 mismatches and reported at most 10 candidate alignments.

**2**. The abundance of each transcript in terms of estimated counts was estimated via both TEMT and a control model. Estimated counts is defined as the estimated number of reads generated from the target transcript. In TEMT, the prior of each cell type proportion was set to the same as the proportion used in simulation and ENCODE data respectively, and *β*^{
a
}, *β*^{
b
} was set to 10 times the size of the read set ${\mathcal{R}}^{m}$. *μ* = 200; *σ* = 80 were used as the mean and standard deviation of the RNA-seq fragment length distribution. We chose eXpress-0.9.4 [26] as the control model, as it is the state-of-the-art method for transcript abundance estimation and also utilizes an online-EM algorithm. Note that, to run TEMT, we need two read sets, in which one is for the pure sample and the other is for the mixed sample, as previously mentioned. In contrast, to run eXpress, we only need one read set from either the pure sample or the mixed sample. The forgetting factor for the on-line EM algorithms in both TEMT and eXpress was set to be *σ* = 0*:* 85, and the error-model in eXpress was disabled for comparison.

**3**. To measure the model accuracy, we used the Error Fraction (EF) measure introduced by [17] to quantify the discrepancy between the model estimates and the ground truth estimates. The Error Fraction is defined as the fraction of transcripts for which the estimates are significantly different (percent error *>* 10% in our case) from the ground truth.

### Simulation

#### Data preparation

To show the utility of TEMT, we first carried out a series of simulation studies. To obtain simulated read sets, we used FluxSimulator [27], a software for transcriptome and read generation by simulating the biochemical processes underlying the library preparation. FluxSimulator requires a reference transcript set to start the simulation process, so we manually downloaded 406 transcripts of 208 alternatively spliced genes in human from Alternative Splicing Structural Genomics Project (AS3D) [28], and used these 406 transcripts as the reference transcript set. We first simulated the transcript expression process twice producing two sets of relative transcript abundances, corresponding to cell type a and b respectively. Based on these two transcript abundance sets, we then simulated 6 pairs of 1 million 75-bp single-end read sets corresponding to six different cell type *b* proportions from 40% up to 90%. The relative transcript abundances of cell type *a* and *b* were kept the same throughout these simulations. For each paired read set, one read set is for the pure sample composed of only cell type *a*, whereas the other read set is for the mixed sample composed of both cell type *a* and *b*, mixed with the cell type *b* proportion. Within the mixed-sample read set, we also extracted the reads simulated purely from cell type *b*, which was used for control model eXpress.

#### Analysis

We note that the estimates of cell type *a* from TEMT achieve roughly the same accuracy, compared with the estimates from eXpress based on the read set of the pure sample of cell type *a*. Also, this accuracy does not change significantly under the effect of different cell type b proportions. This is mainly due to the pure sample read set of cell type *a* within the input data for TEMT.

*b*from TEMT is also shown in Figure 2, which shows that TEMT generally outperforms the direct estimation method. To the best of our knowledge, there are no computational tools similar to our model that can estimate the relative transcript abundances of cell type

*b*via RNA-seq data generated from mixed samples. Typically, computational methods are applied directly to the noisy data of mixed samples and results are interpreted as the estimates of cell type

*b*. To compare the estimates of cell type

*b*from TEMT with direct estimates using the current method, we applied the control model eXpress directly to the read set of the mixed sample. The estimated counts from eXpress were then compared with the true counts from another 1 million simulated read set purely of cell type

*b*, while keeping the same relative transcript abundance as the previous simulations. The corresponding Error Fractions are shown as the red line in Figure 2 regarding different cell type

*b*proportions. Although the accuracy of cell type

*b*estimates from TEMT is affected by different cell type

*b*proportions, it is generally better than the direct estimates. This can be further illustrated in Figure 3, which shows that the direct estimated counts of cell type

*b*from eXpress deviate more from the true counts as the cell type

*b*proportion decrease, while the estimates of TEMT have much reduced deviation. We notice that as the cell type

*b*proportion gradually decreases, the accuracy of the estimates of cell type

*b*from TEMT also decreases. This is the result of the contamination effect from the cell type

*a*within the mixed sample. A recent paper [4] also observed this similar phenomenon when studying copy number aberrations from heterogeneous tumor tissue.

### ENCODE data

#### Data preparation

Next we analyzed the recently released ENCODE data. Due to the lack of RNA-seq data sampled from mixed tissue samples with known cell type proportions, we artificially generated the mixed-sample read sets by mixing reads obtained from two different cell types. Specifically, we chose two Tier 1 cell lines, GM12878 and K562, and treated them as cell type *a* and cell type *b* respectively. The corresponding single-end RNA-seq data of these two cell lines, GM78 1×75D A 1 (UCSC Accession: wgEncodeEH000125) and K562 1×75D A 1 (UCSC Accession: wgEncodeEH000126) from the Wold lab [29] at Caltech, were download from ENCODE (2012). The data downloaded from the same lab under similar protocols is intended to reduce the deviation resulting from experiments. We then randomly selected 10 million reads from GM12878 cells to form the read set of the pure sample, and 10 million reads from both GM12878 and K562 cells using different K562 cells proportions to form the read set of the mixed sample. Similar to the previous simulation study, we extracted the reads purely selected from K562 cells within the mixed sample, and used them for the eXpress control model. We studied 6 different K562 cells proportions from 40% to 90% in order to compare with the previous simulation study. 36908 human RefSeq [30] transcripts from UCSC known genes [31] were used as the transcript set for the ENCODE data.

#### Analysis

*a*within the mixed sample observed in Figure 3 is also seen in the eXpress analysis of ENCODE data, while TEMT does not have this issue. Note that the measure of relative transcript abundances as shown in the red line of Figure 4 is no longer estimated counts, but reads per kilobase of transcript per million mapped reads (RPKM), as the total number of reads from K562 cells within the mixed sample is less than the total number of reads of the mixed sample, so that normalization is necessary for comparison. We notice TEMT underperforms direct estimates from eXpress when K562 cells proportion equals 90%. Possibly the contamination effect of GM12878 cells within the mixed sample is not severe enough at this point, as we can imagine the red line in Figure 4 will finally reach 0% Error Fraction when K562 cells proportion reaches 100%. On the other hand, since the estimates from eXpress based on the pure sample are considered the ground truth, the lower bound Error Fraction of K562 cells estimates from TEMT should be the same as the Error Fraction of GM12878 cells estimates, which is around 20% to 30% in Figure 4.

## Discussion

We formulated our model under the assumption that the heterogeneous tissue is only composed of two cell types, but in reality, a heterogeneous tissue might be much more complicated, consisting of multiple cell types. To relax this constraint, our model needs to be further extended to analyze more complex cases in which each cell type may have its own subtypes, e.g. breast cancer subtypes, leading to a more sophisticated heterogeneous tissue environment. Further dissecting cell subtype heterogeneity is the next step in refining our model. Moving from two cell types to arbitrarily many cell types is of great interest, since it may substantially facilitate transcriptome study of heterogeneous tissues.

One critical component necessary to make our model work is the prior information of cell type *b* proportion, which is necessary to resolve the identifiability problem of mixed samples. In real experiments, precise prior information regarding cell type proportions may be unavailable. One solution in the context of our model is to down weight the effect of the prior by decreasing the parameter *β*^{
a
}, *β*^{
b
}, which adds more uncertainty to the cell mixture proportion. However, this approach may decrease the performance of the model as the uncertainty in cell mixture proportion cannot be distinguished from the uncertainty in transcript abundance estimation. This observation suggests another direction to further improving our model which is to solely estimate cell type *b* proportion without the prior information. To fulfill this requirement, the identifiability problem needs to be resolved as mentioned in section 2.3, which turns out to be comparatively hard for RNA-seq data. Unlike the heterozygous and homozygous deletions in [14], which can be utilized to differentiate between the SNP array data generated by normal cells and tumor cells, there are no such explicit differences between the reads generated by distinct cell types in RNA-seq data, thus making the generative mixture model unconstrained. The "marker genes" method proposed by [11], which tries to distinguish distinct cell types by utilizing genes uniquely expressed in each cell type, provides a future potential direction to extend the current model.

## Conclusion

In this article, we propose a probabilistic model-based method TEMT to estimate transcript abundance of individual cell types based on RNA-seq data from heterogeneous tissue samples. TEMT utilizes prior information to distinguish reads generated by each cell type within the heterogeneous tissue sample. Positional and sequence-specific biases are also incorporated to improve estimation accuracy. TEMT is able to process large datasets as the online EM algorithm is adopted to guarantee a time complexity proportional to the data size and a constant space complexity. Our experiments on both simulated datasets and ENCODE data shows that explicitly accounting for tissue heterogeneity can significantly improve the accuracy of transcript abundance estimation.

## Declarations

### Acknowledgements

We gratefully acknowledge helpful discussions with Jake Biesinger, Daniel Newkirk and Ali Mortazavi. The work is partly supported by National Institute of Health grant R01HG006870.

**Declarations**

Publication of this article was supported by National Institute of Health grant R01HG006870.

This article has been published as part of *BMC Bioinformatics* Volume 14 Supplement 5, 2013: Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S5.

## Authors’ Affiliations

## References

- Marguerat S, Bähler J: RNA-seq: from technology to biology. Cellular and Molecular Life Sciences. 2010, 67 (4): 569-579. 10.1007/s00018-009-0180-6.PubMed CentralView ArticlePubMedGoogle Scholar
- Trapnell C, Williams B, Pertea G, Mortazavi A, Kwan G, Van Baren M, Salzberg S, Wold B, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology. 2010, 28 (5): 511-515. 10.1038/nbt.1621.PubMed CentralView ArticlePubMedGoogle Scholar
- Ren S, Peng Z, Mao J, Yu Y, Yin C, Gao X, Cui Z, Zhang J, Yi K, Xu W: RNA-seq analysis of prostate cancer in the Chinese population identifies recurrent gene fusions, cancer-associated long noncoding RNAs and aberrant alternative splicings. Cell Research. 2012, 22 (5): 806-821. 10.1038/cr.2012.30.PubMed CentralView ArticlePubMedGoogle Scholar
- Chan K, Jiang P, Zheng Y, Liao G, Sun H, Wong J, Siu S, Chan W, Chan S, Chan A: Cancer Genome Scanning in Plasma: Detection of Tumor-Associated Copy Number Aberrations, Single-Nucleotide Variants, and Tumoral Heterogeneity by Massively Parallel Sequencing. Clinical Chemistry. 2012Google Scholar
- Navin N, Kendall J, Troge J, Andrews P, Rodgers L, McIndoo J, Cook K, Stepansky A, Levy D, Esposito D: Tumour evolution inferred by single-cell sequencing. Nature. 2011, 472 (7341): 90-94. 10.1038/nature09807.PubMed CentralView ArticlePubMedGoogle Scholar
- MLL M, PIK3R1 P: Comprehensive molecular portraits of human breast tumours. 2012Google Scholar
- Emmert-Buck M, Bonner R, Smith P, Chuaqui R, Zhuang Z, Goldstein S, Weiss R, Liotta L: Laser capture microdissection. Science. 1996, 274 (5289): 998-1001. 10.1126/science.274.5289.998.View ArticlePubMedGoogle Scholar
- Otsuka Y, Ichikawa Y, Kunisaki C, Matsuda G, Akiyama H, Nomura M, Togo S, Hayashizaki Y, Shimada H: Correlating purity by microdissection with gene expression in gastric cancer tissue. Scandinavian Journal of Clinical & Laboratory Investigation. 2007, 67 (4): 367-379. 10.1080/00365510601046334.View ArticleGoogle Scholar
- Clarke R, Ressom H, Wang A, Xuan J, Liu M, Gehan E, Wang Y: The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nature Reviews Cancer. 2008, 8: 37-49. 10.1038/nrc2294.PubMed CentralView ArticlePubMedGoogle Scholar
- Meyerson M, Gabriel S, Getz G: Advances in understanding cancer genomes through second-generation sequencing. Nature Reviews Genetics. 2010, 11 (10): 685-696. 10.1038/nrg2841.View ArticlePubMedGoogle Scholar
- Venet D, Pecasse F, Maenhaut C, Bersini H: Separation of samples into their constituents using gene expression data. Bioinformatics. 2001, 17 (suppl 1): S279-S287. 10.1093/bioinformatics/17.suppl_1.S279.View ArticlePubMedGoogle Scholar
- Gusnanto A, Wood H, Pawitan Y, Rabbitts P, Berri S: Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics. 2012, 28: 40-47. 10.1093/bioinformatics/btr593.View ArticlePubMedGoogle Scholar
- Erkkilä T, Lehmusvaara S, Ruusuvuori P, Visakorpi T, Shmulevich I, Lähdesmäki H: Probabilistic analysis of gene expression measurements from heterogeneous tissues. Bioinformatics. 2010, 26 (20): 2571-2577. 10.1093/bioinformatics/btq406.PubMed CentralView ArticlePubMedGoogle Scholar
- Yu G, Zhang B, Bova G, Xu J, Wang Y: BACOM: in silico detection of genomic deletion types and correction of normal cell contamination in copy number data. Bioinformatics. 2011, 27 (11): 1473-1480. 10.1093/bioinformatics/btr183.PubMed CentralView ArticlePubMedGoogle Scholar
- Cappé O, Moulines E: Online EM algorithm for latent data models. Journal of the Royal Statistical Society. 2008Google Scholar
- Bohnert R, Rätsch G: rQuant. web: a tool for RNA-Seq-based transcript quantitation. Nucleic acids research. 2010, 38 (suppl 2): W348-W351.PubMed CentralView ArticlePubMedGoogle Scholar
- Li B, Ruotti V, Stewart R, Thomson J, Dewey C: RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010, 26 (4): 493-500. 10.1093/bioinformatics/btp692.PubMed CentralView ArticlePubMedGoogle Scholar
- Li J, Jiang H, Wong W: Method Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 2010, 11 (5): R25-View ArticleGoogle Scholar
- Roberts A, Trapnell C, Donaghey J, Rinn J, Pachter L: Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011, 12 (3): R22-10.1186/gb-2011-12-3-r22.PubMed CentralView ArticlePubMedGoogle Scholar
- Hansen K, Brenner S, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic acids research. 2010, 38 (12): e131-e131. 10.1093/nar/gkq224.PubMed CentralView ArticlePubMedGoogle Scholar
- Dunham L, Kunaje A, Aldred S, Collins P, Davies C, Doyle F, Epstein C, Frietze S, Harrow J, Khatun J, Kaul R, Lajoie B, Landt S, Lee B, Pauli F, Rosenbloom K, Sabo P, Safi A, Sanyal A, Shoresh N, Simon J, Song L, Trinklein N, Altshuler R, Birney E, Brown J, Cheng C, Djebali S, Dong X, Ernst J, Furey T: An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature. 2012Google Scholar
- Pachter L: Models for transcript quantification from RNA-Seq. arXiv preprint arXiv:1104.3889. 2011Google Scholar
- Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological). 1977, 1-38.Google Scholar
- Jensen J: Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica. 1906, 30: 175-193. 10.1007/BF02418571.View ArticleGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25.PubMed CentralView ArticlePubMedGoogle Scholar
- Roberts A, Pachter L: Streaming fragment assignment for real-time analysis of sequencing experiments. Nature Methods. 2012Google Scholar
- Sammeth M: The Flux Simulator. 2012, [http://sammeth.net/confluence/display/SIM/Home]Google Scholar
- AS3D: Alternative Splicing Structural Genomics Project. 2012, [http://www.as3d.org/]Google Scholar
- Mortazavi A, Williams B, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
- Pruitt K, Tatusova T, Maglott D: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research. 2007, 35 (suppl 1): D61-D65.PubMed CentralView ArticlePubMedGoogle Scholar
- Hsu F, Kent W, Clawson H, Kuhn R, Diekhans M, Haussler D: The UCSC known genes. Bioinformatics. 2006, 22 (9): 1036-1046. 10.1093/bioinformatics/btl048.View ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.