Differential expression analysis for paired RNA-seq data
© Chung et al.; licensee BioMed Central Ltd. 2013
Received: 16 October 2012
Accepted: 1 March 2013
Published: 27 March 2013
Skip to main content
© Chung et al.; licensee BioMed Central Ltd. 2013
Received: 16 October 2012
Accepted: 1 March 2013
Published: 27 March 2013
RNA-Seq technology measures the transcript abundance by generating sequence reads and counting their frequencies across different biological conditions. To identify differentially expressed genes between two conditions, it is important to consider the experimental design as well as the distributional property of the data. In many RNA-Seq studies, the expression data are obtained as multiple pairs, e.g., pre- vs. post-treatment samples from the same individual. We seek to incorporate paired structure into analysis.
We present a Bayesian hierarchical mixture model for RNA-Seq data to separately account for the variability within and between individuals from a paired data structure. The method assumes a Poisson distribution for the data mixed with a gamma distribution to account variability between pairs. The effect of differential expression is modeled by two-component mixture model. The performance of this approach is examined by simulated and real data.
In this setting, our proposed model provides higher sensitivity than existing methods to detect differential expression. Application to real RNA-Seq data demonstrates the usefulness of this method for detecting expression alteration for genes with low average expression levels or shorter transcript length.
Gene expression profiles are routinely collected to identify differentially expressed genes and pathways across various individuals and cellular states. Sequencing-based technologies offer more accurate quantification of expression levels compared to other technologies. Early sequence-based expression measured transcript abundance by counting short segments, known as tags, generated from the 3’ end of a transcript. Tag-based methods include the Serial Analysis of Gene Expression (SAGE, ), Cap Analysis of Gene Expression (CAGE), LongSAGE, and massively parallel signature sequencing (MPSS). The development of deep sequencing technology enables simultaneous sequencing of millions of molecules and has led to advanced approaches for expression measurement [2, 3]. Digital gene expression - tag profiling  adapted the tag-based approach for use with the ‘next-generation’ sequencing platform. RNA-Seq is an alternative approach, that is an application of ‘whole genome shotgun sequencing’. Briefly, it entails generating a cDNA library by random priming off of fragmented RNA. The cDNA library is then subject to next-generation sequencing to generate short nucleotide sequences (reads) that correspond to the ends of the cDNA fragments. RNA-Seq aims to measure the entire transcriptome and is preferable to microarrays and tag-based approaches since it provides more information such as alternative splicing and isoform-specific gene expression with very low background signal and a wider dynamic range of quantification . Moreover, recent experiments revealed that the RNA-Seq measures expression level with high accuracy and reproducibility [6–9].
Sequence-based approaches quantify gene expression as a ‘digital’ count and require modeling suitable for a count random variable. The Poisson distribution has been central in modelling expression data [10–12] and commonly applied to RNA-Seq data [6, 13]. In particular, Li et al. (2012) proposed a permutation-based approach to generate the null distribution . However, Poisson-based approaches may not take all the variations between biological samples into account. The Beta-Binomial hierarchical model [15, 16], overdispersed logistic , and overdispersed log-linear models  were proposed to capture extra variance for each gene separately. Negative Bionomial models have been proposed to estimate the overdispersion parameter by shrinkage estimation [19–21], mean-dependent local regression , or empirically derived prior distribution . Alternatively, beta-binomial  and Poisson mixture  models were proposed under the Bayesian modeling framework. Nonparametric method with resampling was also considered . These approaches generally assume that samples under two groups are obtained independently. Recently, some of these approaches have been extended to deal with multi-factor design structures [14, 16, 21, 22].
Many practical RNA sequencing studies collect data with a paired structure, where the global expression profiles are measured before and after a treatment is applied to the same individual. Appropriate modeling of such data requires taking this design structure as well as the distributional property of the data into account. The Poisson model has been used to test the effect of drugs when the observation occurs as paired data, such as predrug and postdrug counts . Lee  considered a mixture model to account for extra variance among individuals over the level that would be expected under the Poisson model. These approaches assume independence of the paired observation conditional on the individual mean. Bivariate Poisson or negative binomial distribution are alternative choices to model correlations between observations [29, 30].
In this paper, we propose a Bayesian hierarchical approach to modeling paired count data that separately accounts for the within and between individual variability from a paired data structure. Our work adopts the Poisson-Gamma mixture model  and utilizes a Bayesian approach to evaluate the expression difference. We note that the Bayesian models are widely utilized in microarray studies and have improved sensitivity to detect differential gene expression by sharing information among genes . Mixture models are also commonly used to model differential expression, where non-differentially expressed and differently expressed genes correspond to different mixture components. Various mixture model specifications have been considered in the literature. The gamma and log-normal distribution were used to model the expression levels [32, 33]. Smyth  assumed a point mass at zero for log scaled fold change for null genes and a normal distribution centered at zero for non-null genes. Lonnstedt et al.  and Gottardo et al.  proposed a mixture of two (null and non-null) or three normal (null, over, and under expression) distributions. Non-parametric approaches have also been utilized [31, 37]. Lewin et al.  discussed various choices of mixture component priors and model checking.
The rest of this manuscript is organized as follows. Data Section introduces the biological problem and data that motivated this study. Methods Section presents our parametric model and the Bayesian method to identify genes with differential expression levels. The performance of the proposed model is examined by Simulations. Two sets of simulation studies are conducted: (1) those based on the model assumption to investigate the accuracy of the proposed method on parameter estimation, and (2) those based on mimicking the motivating data set to examine the robustness of the proposed method. Finally, the proposed method is applied to real data with detailed discussion of the results and comparisons with other methods.
Qian et al. (Qian F. et al.: Identification of genes critical for resistance to infection by West Nile virus using RNA-Seq analysis, submitted) designed an RNA-Seq experiment to study human West Nile virus (WNV) infection. One objective of this study was to identify altered genes/transcripts from viral infection of primary human macrophages in comparison to uninfected samples. This study naturally has a paired design structure. A total of 10 healthy donors were recruited according to the guidelines of the human research protection program of Yale University and cells were isolated from fresh heparinized blood samples for infection with WNV (strain CT 2741, MOI=1, for 24 hours) as described previously . PolyA+ RNA was prepared from uninfected and WNV-infected primary macrophages, fragmented, and subjected to sequencing using the Illumina Genome Analyzer 2. Approximately 50 million quality filtered reads were obtained from each sample, and about 85% were mapped to the human transcriptome (hg19) with ENSEMBL transcript annotations (Release 57) using TOPHAT v.1.1.4 . Genes and transcript isoforms were scored for expression by a maximum likelihood based method implemented in Cufflinks v.0.9.3 . To analyze differential expression, the data were first converted from the FPKM unit (fragments per kilobase of exon per million fragments mapped) to the number of reads originated from each transcript isoform. The trimmed-mean method  was applied to further normalize the count expression values. The processed data contains transcript-level expression counts from a total of 20 samples consisting of 10 pairs of uninfected and virus infected samples. For differential expression analysis, we removed transcripts with less than 10 total counts across 10 uninfected samples or no observed count from 6 or more individuals in the uninfected conditions. After these steps, 37,111 transcripts were considered for data analysis.
This model allows us to obtain a simpler form of the predictive density, i.e., the λ g i ’s can be integrated out (see Appendix).
(Π 0, Π 1)∼ Dirichlet(1,1), i.e., Π 0 ∼ Uniform(0, 1).
Each α g and β g has a non-informative prior.
μ 1 has an improper prior.
Joint independency among all the parameters.
In this section, we describe the Gibbs sampling algorithm  that we use to iteratively sample model parameters from their conditional distributions given the other parameters and the observed data. First, we evaluate the conditional distribution of parameters (α g , β g ) characterizing the baseline expression distribution (λ gi ). These parameters are separately updated using the Metropolis-Hastings algorithm. For the latent state z g and expression level change χ g , the state z g is first proposed and then χ g is sampled given the state. Lewin et al.  discussed this type of move with various choices of the mixture distribution. Details of our updates on the pair of (χ g , z g ) are described in the Appendix. Mixing proportions (Π 0, Π 1) and hyper-parameters for the mixture distribution ( , , μ 1) are sampled from their conditional posterior distributions which can be derived in closed forms.
The method was implemented in R and is available at http://bioinformatics.med.yale.edu.
The first part of the simulation was conducted to examine the performance of the proposed approach when the data are generated under the model assumptions. For 10,000 genes and 10 individuals, we simulate expression counts both before and after treatment according to Equation 1. Library sizes are sampled uniformly from 7 to 18 millions and relative expected baseline expression λ g i are drawn from a Gamma distribution with shape 0.1 and rate 1,000. For simplicity, we consider a two-component log-normal mixture model for effect size. For the null genes (90%), the log-scaled effect is sampled from a normal distribution with a mean 0 and a standard deviation (σ 0) 0.1, whereas the log-effects are sampled from a normal distribution with mean (μ 1) of 1.5 and standard deviation (σ 1) of 0.5 for the non-null genes. For the simulation studies, the true library sizes are used for the parameter estimation.
Posterior means of the parameters in the model
In the second part of the simulation, we assume that the expression abundance is measured for 5,000 genes simultaneously before and after a given treatment. The number of individuals is set to be 10 for the relatively larger sample case (cases 1 and 4), 5 for the medium (cases 2 and 5), and 3 for the relatively smaller sample case (cases 3 and 6). The size of each library is randomly sampled from 1.8 to 3 million to have simulated count distribution compatible with the real data distribution. The infected set of the RNA-seq data (Data Section, Qian F. et al. for details) was used as the expected baseline count data to mimic the observed mean-specific dispersion. First, we sample 5,000 gene indices with replacement to get the expected baseline expression. Expression counts from the selected indices are summarized by a matrix where rows from this data matrix correspond to the selected genes in the original data matrix and columns correspond to individuals. Then, the relative expression (λ gi, i = 1, …, N , Equation 1) is computed proportional to the total counts in each sample.
Among 5,000 genes, the first 4,000 are assumed to have no change (z g = 0) and their log-fold changes, log(χ g ), are sampled from a normal distribution with a mean of 0 and a variance of 4 × 10-4. For the rest of non-null genes, we considered the following two scenarios. An empirical set-up (cases 1, 2, 3) utilizes nominal fold change from the uninfected data set. Cases 4, 5, and 6 consider a theoretical setup, where the log-scaled fold change is drawn from a normal distribution with a mean of zero and a variance of 1. We further filter out non-null genes whose true fold changes are less than 1.4.
Each case was repeated 100 times. We compare the performance of our approach with DESeq (version 1.8.3)  and edgeR (version 2.6.10) , two widely used methods for RNA-seq data for the purpose of identifying differentially expressed genes. These two methods assume a negative binomial distribution to explain the variance due to the replicate. DESeq utilizes a smoothing curve to compute the overdispersion as a function of the average expression level. An option ‘pooled-CR’ is used to estimate the overdispersion parameter . In edgeR, a common dispersion setting is used which assumes a consistent overdispersion across all the features and estimates the parameter using a common likelihood function. A paired design can be incorporated by utilizing generalized linear model. For each application, the true library sizes are used as the library size inputs.
Estimated posterior means and results for empirical simulation
We further considered a simulation scenario similar with the real data. As shown in the data application, the log-scaled fold change estimated from the data has larger variance under null component. We set the null component variance to be 0.35 and repeated the simulation 50 times. For features in the non-null group, log-fold change was sampled from a normal distribution with a mean of -0.45 and a variance of 4. Simulation was performed with the sample size of 10 (case 7) and the size of 5 (case 8). Averages of the parameter estimates for cases 7 and 8 are (-0.42, 0.35, 3.92, 0.20) and (-0.42, 0.35, 3.85, 0.21), respectively. Similarly with the cases 1 through 6, the estimated false discovery rate is examined (Figure 3) and performance of the proposed approach is compared with two existing methods (Figure 4).
Selected pathways from the functional analysis
Response to wonding
Response to molecule of lipopolysaccharide
Response to cytokine stimulus
Response to bacterium
Regulation of cytokine production
Positive regulation of cytokine production
positive regulation of multicellular organismal process
Regulation of apoptosis
Regulation of programmed cell death
Regulation of cell death
T cell activation
In this paper, we have presented a hierarchical mixture model for the identification of differential gene expression from RNA-Seq data motivated by a West Nile Virus study, which collected samples as multiple pairs, i.e. pre- vs. post-treatment for each individual. While such design is common in biological investigations, few existing methods analyze such data appropriately. With a hierarchical Bayesian mixture model coupled with inference through MCMC, our approach incorporates variability across genes, individuals, and treatment effects in the context of a paired experiment. Application to both simulated and real data demonstrates that our model and implementation is suitable for paired design, having distinct advantages compared to the existing methods.
Simulation study suggests that our Bayesian setting can have better power to detect differential gene expression. In the real data application, our proposed is able to identify transcripts with large treatment effects but low expression levels, whereas these transcripts were not inferred to be differentially expressed by other approaches. This is likely due to the more flexible and adaptable modeling of variance across individuals in our approach. Further examination of the characteristics of these top-ranked transcripts shows that the proportion of top-ranked transcripts in the short transcript group is consistent with the proportion in the long transcript group. On the other hand, the gene sets detected by the existing approaches show a bias towards longer transcripts, as has been noted in the literature before [48, 49]. Our model reduces this bias and as a result facilitates detection of some short-length differentially expressed transcripts that the other approaches miss.
We have assumed that the log-fold change arises from a mixture of two normal distributions. Under DE, the model allows the mean of log-fold change distribution not to be restricted at zero. By doing so, our proposed model can be applied to the data showing asymmetry between over and under expression. A normal distributional assumption is shown to be robust from simulation study under empirical fold change scenarios. Other possible choices for the null genes include a point mass at 0 , uniform distribution around 0, and a log-Gamma distribution with a mean 0. Similar distributional assumptions can be made for the non-null genes under the two-component mixture set-up. Alternatively, one can consider a mixture of three components consisting of equal, over, and under expression states. Further extension can be considered by allowing variation in the magnitude of expression change across individuals.
We use the non-informative prior distributions for the unknown model parameters specified in the Methods Section.
We infer the posterior distributions using the Gibbs sampling , which iteratively samples model paramters from the conditional distribution of each patermter given the other parameters. In this section, we describe the procedure for the posterior inference.
If the proposal is accepted, we replace the old α g with the new one. Otherwise, α g stays at the current value.
Similarly, θ - β g is the vector of parameters except β g . For the evaluation of the acceptance probability, updated value of α g in the Step 1 will be used.
Update (χ g , z g ) by utilizing generalized Metropolis-Hastings. Lewin et al.  pointed out that χ g and z g have to be jointly estimated since the supporting space of χ g depends on the choice of z g . For example, χ g is a point mass at one if z g = 0. To estimate a pair of (χ g , z g ), they proposed the state z g first and then updated χ g |z g . By utilizing this approach, we adopt the following steps to sample (χ g , z g ).
(Step 3-1) Generate from the Bernoulli distribution, with .
(Step 3-2) Then, is proposed from LogNormal(0, V g ) if . Otherwise, it is sampled from LogNormal(M g , V g ). The mean and variance of the log-normal proposal distribution are computed from the observed counts. First, we collect individuals whose pre- and post-treatment counts are non-zero for each gene, separately. Then, M g is computed as a median of for such individuals. The variance of these values can be used as V g , however, this estimate often gives an extreme value. In data analysis, we trim the estimates at 25th and 75th percentiles when the sample size is 10. For small sample case, the median of V g ’s is used as the proposal variance.
where , LN 0 is a probability density function for log-normal distribution with mean zero and variance . Similarly, LN 1 is a log-normal density centered at and variance .
where and .
Update the mixing proportions (Π 0, Π 1). We assume a Dirichilet prior for the mixture probabilities. Using Gibbs sampling scheme, these weight parameters are updated from Dir(1 + ♯(z g = 0), 1 + ♯(z g = s1)).
This work was supported in part by Grant GM59507 from NIH, 5T15LM007056-25 from PHS/DHHS, UL1 RR024139 from Yale CTSA grant, and awards from the NIH (HHS N272201100019C, AI 070343, AI 089992).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.