 Software
 Open access
 Published:
NPEBseq: nonparametric empirical bayesianbased procedure for differential expression analysis of RNAseq data
BMC Bioinformatics volume 14, Article number: 262 (2013)
Abstract
Background
RNAseq, a massive parallelsequencingbased transcriptome profiling method, provides digital data in the form of aligned sequence read counts. The comparative analyses of the data require appropriate statistical methods to estimate the differential expression of transcript variants across different cell/tissue types and disease conditions.
Results
We developed a novel nonparametric empirical Bayesianbased approach (NPEBseq) to model the RNAseq data. The prior distribution of the Bayesian model is empirically estimated from the data without any parametric assumption, and hence the method is “nonparametric” in nature. Based on this model, we proposed a method for detecting differentially expressed genes across different conditions. We also extended this method to detect differential usage of exons from RNAseq data. The evaluation of NPEBseq on both simulated and publicly available RNAseq datasets and comparison with three popular methods showed improved results for experiments with or without biological replicates.
Conclusions
NPEBseq can successfully detect differential expression between different conditions not only at gene level but also at exon level from RNAseq datasets. In addition, NPEBSeq performs significantly better than current methods and can be applied to genomewide RNAseq datasets. Sample datasets and R package are available at http://bioinformatics.wistar.upenn.edu/NPEBseq.
Background
The advent of massive parallel sequencing, popularly known as NextGeneration Sequencing (NGS), is allowing whole genomes and transcriptomes to be sequenced with extraordinary speed and accuracy, providing insights into the bewildering complexity of gene expression at both gene and isoform levels [1]. With decreasing sequencing cost per base, RNASeq approach has become a desirable method to get a complete view of the transcriptome and to identify differentially expressed rare transcripts and isoforms [2]. The RNAseq assay provides sensitive and accurate digital counts for the exon regions of expressed transcripts in a given sample. The count of short sequence reads for each exon region is the sum of read counts belonging to the overlapping exon region of different transcript isoforms that are expressed in the sample. Therefore, estimating the transcriptlevel expression from the collection of counts of short read sequences that map to exons (or exon slices) and exon junctions is a computationally challenging problem, which has been recently attempted by us and others, in programs such as IsoformEx [3], rSeq [4], Cufflinks [5], RSEM [6], BASIS [7], and GPSeq [8]. However, none of these methods showed good agreement with qRTPCR measurements, a gold standard in measuring differential RNA abundance between samples [3]. The statistical challenges in analyzing RNASeq data arise from many perspectives. While some sources of error are due to inherent problems with the technology, some are contributed at laboratory or experimental levels, leading to nonbiological or technical variation across samples. Therefore, there is a critical need for investigation of other statistical methods for normalization and differential expression analysis of RNAseq data across different conditions.
RNAseq experiments are now frequently employed for identifying genes and alternatively spliced gene isoforms that are differentially expressed across distinct tissue/cell types and disease conditions [9]. This amounts to comparing one condition, A, with another condition, B, and producing a ranked list of differentially expressed genes according to the statistical significance of observed expression difference or foldchange between A and B [10, 11]. Thus, proper normalization between samples is crucial before differential expression (DE) analysis and, to a certain degree, the two aspects are linked with each other. Normalization can be divided into withinsample normalization and betweensamples normalization [12]. DE analysis is the study of the difference in absolute gene expression levels between two conditions. However, similar to microarray technology, RNAseq is a relative abundance measure technology and does not allow for the measurement of absolute transcript abundance. This is because molecules are sampled proportionately from a large pool of cells and the initial number of cells and other technical factors are usually difficult to estimate or unknown. The standard procedure for computing the proportion of sequence reads that map to a gene relative to the total number of reads obtained in that RNAseq experiment and for comparing those proportions across different samples can lead to high false positive rate. For example, a common method for normalization is to divide the genewise read counts by corresponding gene length and the total number of mapped reads to the genome. Recent reports show that the latter method, based on the total count of mapped reads, is not a robust method [13] and several alternative methods have been proposed. For example, an empirical strategy that equates the overall expression levels of genes between samples under the assumption that the majority of them are not DE was proposed recently [10, 14, 15]. Alternatively, the widely used quartile normalization method in the microarray field was also adapted for betweensample normalization of RNAseq [16]. A recent review evaluated seven proposed normalization methods for the differential analysis of RNAseq data by using a varied group of real and simulated datasets involving different species and experimental designs [13]. They concluded that the methods proposed in DESeq [17] and edgeR [18] have the most relative satisfactory behaviour compared to the others.
Similarly, several tools have been developed for DE analysis of RNAseq data. The Poisson model has been successfully used to account for technical variations in RNAseq data [4]. When biological replicates are available, the negative binomial distribution is commonly used to model the overdispersion in the count data, such as DESeq and edgeR. There are also pure nonparametric methods, which do not assume any particular distribution for the data, e.g. NOISeq [11]. Approaches within a Bayesian framework for differential expression in RNASeq data have also been developed by many researchers, such as baySeq [19], GFOLD [20], ShrinkSeq [21] and EBSeq [22]. It is acknowledged that Bayesian approach can be used to obtain accurate and robust estimates by sharing information across all genes when sample size is small [23]. In baySeq, the genes are ranked according to the posterior probabilities of differential expression between conditions, using an empirical Bayes framework. To infer the posterior probability, the gene expression prior factors are integrated out by an approximation method [24] with mean and dispersion parameters empirically and iteratively estimated from the entire set of genes through a quasilikelihood method. GFOLD assumes uniform prior distribution (vague prior) of gene expression level for technical replicate model. For data with biological replicates, a hierarchical model with lognormal prior for the gene expression is used to account for the biological variation. The posterior distribution of fold change is obtained through sampling. ShrinkSeq, which also takes a Bayesian perspective, is presented in a framework of generalized linear model setting, which infers the DE coefficient in the GLM directly instead of inferring the gene expression level first. ShrinksSeq explores both parametric mixture prior and nonparametric prior for the DE coefficient and extends the INLAs (integrated nested Laplace approximation) method to infer the marginal posterior distribution under nonparametric prior and shows the superior performance of nonparametric prior than parametric prior. EBSeq, similar to baySeq, ranks the genes/isoforms by posterior probability of DE, but assumes a parametric form of the prior distribution for the gene/isoform expression with parameters estimated from the data by method of moments and EM. All the aforementioned methods do not provide the closeform of posterior distribution of fold change. Because sequencing of cDNA reads is basically a sampling procedure, it is important to note that a large number of genes are unseen in a typical RNAseq sample due to low expression or the limited depth of the experiment. For example, only approximately 0.0013% of the total number of available molecules in a RNA library are sampled in one lane of a typical Solexa/Illumina GAIIx RNAseq experiment [25]. Further, the fact that a small number of highly expressed genes consume a significant fraction of the total sequence reads can also influence the statistical inference procedures. These limitations affect the estimation of DE or differences in relative transcript distribution between samples. For almost all currently developed RNAseq DE methods, genes with low read counts are usually omitted from the analysis because of unreliable estimation. Another issue is that a zero read count in one condition leads to unrealistic estimation of fold change.
Here, we developed a novel method to model the RNAseq data and detect differentially expressed genes and exons across different conditions. To mitigate the biases caused by the nature of sampling and reliably estimate the expression levels of those unseen and lowly expressed genes, we adopted a previously developed Poisson mixture model to empirically estimate the prior distributions of read counts completely from the data [26]. We propose a nonparametric, empirical Bayesianbased approach to model the RNAseq data. We prepared five datasets, three simulated and two publicly available RNAseq datasets, for systematically evaluating the performance of the new method. Also, the novel method is compared with the other popular methods for RNAseq DE analysis, both using simulated and real RNAseq datasets.
Implementation
A few of the earlier RNAseq assessment studies have reported highly reproducible results with little technical variation [27, 28], suggesting that the inclusion of technical replicates in the experimental plan is usually not essential. Numerous RNASeq studies have used the Poisson model to perform testing for differential gene expression. The Poisson model assumes equality of mean and variance of read counts per gene across replicates. Therefore, pooling technical replicates together to give read counts for each biological replicate does not lead to loss of information. Thus, we first discuss one replicate per condition and then consider biological replicates.
Model for single replicate
Let γ be the expression level of one gene under one condition and x be the number of observed reads mapped to this gene. It is well known that x follows a binomial distribution and can be approximated well by a Poisson distribution with mean λ = γdl, where 1 is the gene length and d is the normalization constant reflecting the sequencing depth. Given a prior mixing distribution G (with probability density function g(λ)) on λ, the posterior distribution of λ is \mathrm{g}\left(\lambda \right)\frac{{\lambda}^{\mathrm{x}}{e}^{\lambda \mathrm{d}}}{\mathrm{x}!}/{\mathrm{h}}_{\mathrm{G}}\left(\mathrm{x}\right), where h_{G}(x) = ∫ λ^{x}/x ! e^{‒ λ}dG(λ) is a Gmixture of Poisson.
A gene is expressed if, and only if, x ≥ 1. Conditioning on x ≥ 1, x follows a Qmixture of zerotruncated Poisson h_{G}(x)/(1 − h_{G}(0)) or a mixture f_{Q}( x ) of truncated Poisson, where
Let n_{x} denote the number of genes with exactly x reads in the sample. The conditional nonparametric maximum likelihood estimator \widehat{\mathrm{Q}} for Q is \widehat{\mathrm{Q}}=\mathrm{argmax}{\displaystyle \sum}_{\mathrm{x}\ge 1}{\mathrm{n}}_{\mathrm{x}}{\mathrm{logf}}_{\mathrm{Q}}\left(\mathrm{x}\right), whose calculation is discussed in [29, 30] and the calculation details under the context of RNAseq are provided in the Additional file 1. There is a onetoone mapping between \widehat{\mathrm{G}} and \widehat{\mathrm{Q}} from equation (1). The posterior distribution of λ is then given by \mathrm{\lambda}\mathrm{x}~\widehat{\mathrm{g}}\left(\mathrm{\lambda}\right)\frac{{\mathrm{\lambda}}^{\mathrm{x}}{\mathrm{e}}^{\mathrm{\lambda}}}{\mathrm{x}!}/{\mathrm{h}}_{\widehat{\mathrm{G}}}\left(\mathrm{x}\right). An empirical Bayes estimator for λ is
Let λ_{A} and λ_{B} denote the read counts that represent the true expression level of a gene and G_{ A } and G_{ B } denote the corresponding prior distributions, under conditions A and B, respectively. However, as mentioned previously, since NGS is like sequencing a set of sampled reads from a pool of expressed sequences of gene, the read counts that are obtained, say x_{A} and x_{B}, denote the corresponding reads counts obtained in conditions A and B. The posterior distribution of log \left(\mathrm{d}\frac{{\mathrm{\lambda}}_{\mathrm{A}}{\mathrm{x}}_{\mathrm{A}}}{{\mathrm{\lambda}}_{\mathrm{B}}{\mathrm{x}}_{\mathrm{B}}}\right), which is log fold change of expression level of a gene, has a closedform formula and is easy to derive, because {\widehat{G}}_{A} and {\widehat{G}}_{B} follow probability distribution of discrete form.
The normalization constant d can be inferred from some previous available methods, for example the methods proposed in DESeq or PoissonSeq [31]. This can also be calculated based on the assumption that the expected values of logfold change of the majority of genes are zeros,
Thus, we rank the genes by the values of \mathrm{E}\left[\mathrm{log}\left(\frac{{\mathrm{\lambda}}_{\mathrm{A}}{\mathrm{x}}_{\mathrm{A}}}{{\mathrm{\lambda}}_{\mathrm{B}}{\mathrm{x}}_{\mathrm{B}}}\right)\right] first and then estimate d by using the genes falling in the (ε, 1 − ε) quantile of all those values. In this paper, we used ε=0.25. That is, we used half of the genes to estimate d.
NPEBseq tests the hypothesis that the difference in the gene expression level between conditions A and B is above a userdefined cutoff Δ, i.e., the probability that
The default value for Δ is log(2). We consider this as our own predefined pvalue. The false discovery rate is controlled with BenjaminiHochberg adjustment.
Model dealing with biological replicates
RNAseq datasets with large numbers of biological replicates are increasingly generated by many laboratories and consortia, for example, HapMap [32], ENCODE [33], and TCGA projects [34]. TCGA data consists of hundreds of RNAseq biological replicates for each cancer condition. Dealing with the large number of biological replicate data is challenging. Recent studies have found that while the Poisson model is appropriate for technical replicates of the same RNA samples, it can be a poor fit for biological replicates. Here we propose a hierarchical Bayesian model to account for the overdispersion in the read counts.
Let c denote the number of biological replicates for one condition and we assume that
where x_{ij} is the number of reads for gene i and replicate j; e_{ij} is the expression index; λ_{i} is the expression level of gene i under this condition A; θ is the scale parameter of Gamma distribution; and d_{j} is normalization constant for replicate j. The prior distribution G is inferred as before by using the sample that has the largest data depth under each condition.
Here we are interested in inferring the posterior distribution of fold change for each gene, in which the 1 will be cancelled out, so simply letting l=1 does not change the calculation. Let {\overrightarrow{\mathrm{x}}}_{\mathrm{i}}=\left({\mathrm{x}}_{\mathrm{i}1},{\mathrm{x}}_{\mathrm{i}2},\dots ,{\mathrm{x}}_{\mathrm{ic}}\right) and {\overrightarrow{\mathrm{e}}}_{\mathrm{i}}=\left({\mathrm{e}}_{\mathrm{i}1},{\mathrm{e}}_{\mathrm{i}2},\dots ,{\mathrm{e}}_{\mathrm{ic}}\right). Based on the aforementioned model, the joint posterior distribution of e_{ij}, λ_{i} is given by
It is known that marginal conditional {\overrightarrow{\mathrm{e}}}_{\mathrm{i}}\left({\overrightarrow{\mathrm{x}}}_{\mathrm{i}},{\mathrm{\lambda}}_{\mathrm{i}}\right) follows Gamma distribution and can be easily integrated out (further details are given in the supplementary methods sections). Thus, the log transformed marginal posterior distribution of λ_{i} is given by
The pvalues and FDR can be computed by equation (4).
Empirical Bayes methods have been used to estimate the degrees of overdispersion in the data. Based on our hierarchical model, we also propose an empirical Bayes method to estimate the dispersion parameter θ. It is known that the conditional variable x_{ij}λ_{i} follows negative binomial distribution (mixture of Poisson with Gamma prior) and the expected value and variance of it are given by
Although the marginal distribution of x_{ij} is unclear, the expected value and variance can be computed in the following ways:
So we estimate θ by,
Similar to the estimation of G, θ is also estimated by using the sample that has the largest data depth under each condition.
Differential exon usage analysis from RNAseq data
RNAseq also provides information for the study of alternative splicing. DE analysis of individual transcripts is essential in many comparative studies because of isoformlevel changes in gene expression between conditions [9]. Recently, two tools, Cufflink [35] and BitSeq [36], have been proposed to identify differential expression of transcripts by first estimating the expression of the transcripts. The expression or abundance estimates may contain significant correlated uncertainties that reduce the power for inference of differential expression [37]. Another tool, DEXSeq [37], proposed an exoncentric analysis to test for differential exon usage in RNAseq data based on a generalized linear model. The input of DEXSeq is a table that contains read counts for each exon of every gene in each sample. Note that one exon may be cut into two or more parts if its boundary is not the same in all transcripts. The basic unit for counting the number of reads overlapped is called “counting bin” in this manner, similar to the definition of exon slice used in IsoformEx algorithm [3].
DEXSeq tests if each counting bin is differentially used between conditions. Inspired by this, we propose a method to detect different exon usage based on our Bayesian hierarchal model. Assuming that a gene is expressed under two different conditions, A and B, let t_{Ak} and t_{Bk} denote the expression level of counting bin k of this gene and t_{Ak} and t_{Bk} denote the observed read counts overlapping with it. The differential exon detection method involves the following steps:

1.
The posterior distribution of t_{Ak}y_{Ak} and t_{Bk}y_{Bk} is derived based on our model applied to counting bin read count data.

2.
Test the fitness of the distribution against the null hypothesis: the proportion of the number of reads overlapping with a counting bin to that of all the reads overlapping with the gene does not change between conditions (same as in DEXSeq).

3.
Finally, define the pvalue as the probability of \left\mathrm{log}\frac{{\mathrm{t}}_{\mathrm{Ak}}{\mathrm{y}}_{\mathrm{Ak}}}{\mathrm{E}\left[{\mathrm{\lambda}}_{\mathrm{A}}{\mathrm{x}}_{\mathrm{A}}\right]}\mathrm{log}\frac{{\mathrm{t}}_{\mathrm{Bk}}{\mathrm{y}}_{\mathrm{Bk}}}{\mathrm{E}\left[{\mathrm{\lambda}}_{\mathrm{B}}{\mathrm{x}}_{\mathrm{B}}\right]}\right>\mathrm{\Delta}, where Δ denotes a userdefined cutoff that represents the extent of differential expression one wishes to identify between conditions. For example, if Δ is set as log(1/3), it will test if the exon inclusion level of an exon is less than 1/3 or more than a threefold change between conditions. Such an extreme switch of exon usage between conditions is a strong indicator of functional alterative splicing events.
Withincondition quantile normalization
For cases without replicates, the normalization constant can be computed by equation (3). With biological replicates, d_{j} can be computed by the method proposed in DESeq or the trimmed mean of logfolder change method [10]. Here we propose withincondition quantile normalization based on the assumption that the distributions of read count data within conditions are common. The samples in the same condition are first quantile normalized relative to the one that has the largest data depth and then equation (3) is applied to do the normalization between conditions. Quantile normalization of read count data for samples coming from different conditions might not be proper due to the fact that longer genes have more reads and they might be differentially expressed between conditions, which can put the gene expression level in a different scale within one sample. We compared this withincondition quantile normalization to the method proposed in DESeq for the simulated data and no significant difference was present (results are not shown here). For rest of this paper, we adopt the withincondition quantile normalization procedure.
Datasets for simulation
We generated three simulation datasets to evaluate the proposed method for identifying differentially expressed genes. Dataset1 is generated with biological replicates by assuming different priors across conditions, which means G_{ A } and G_{B} follow different statistical distributions. Dataset2 consists of no replicates and is generated by following the simulation scheme adopted by baySeq and edgeR. Dataset3 is generated by the same scheme as dataset2 but with biological replicates.
Results
NPEBseq method
NPEBSeq is a nonparametric empirical Bayesianbased approach to model the RNAseq data. The expression level of genes with low read counts is estimated by borrowing information from the gene expression in the whole sample. The nonparametric form of the prior distribution avoids any unrealistic assumption. The parametric assumption for the prior distribution is usually not fulfilled for the RNAseq read count data. The fact that there are many genes expressed at low levels in one sample is illustrated in Figure 1, which is generated based on one sample from Marioni’s RNAseq dataset [27]. This plot clearly shows that a large proportion of genes in a sample are expressed at low levels. These genes could have a high impact on the performance of statistical methods to identify differentially expressed genes. The fact that there are large numbers of genes/transcripts with low read counts and a small number of genes with a significantly high number of reads make any parametric assumption for the prior distribution unrealistic.
Performance of NBESeq on simulated datasets
To evaluate the proposed method for identifying differentially expressed genes, we first conducted a simulation study.
Simulation 1  Simulation with different priors between conditions (dataset1)
To generate data similar to those produced by real RNAseq experiments, we first applied the empirical Bayes method on publicly available RNAseq datasets, which were generated to compare liver and kidney transcriptomes [27]. The prior distributions of kidney and liver samples were first estimated and then the data was normalized based on the expected values. The corresponding dispersion parameter θ for each condition was also estimated.
Dataset1 consist of 20 independent simulations with seven samples each for two conditions. The library size of each sample is uniformly sampled from 300,000 to 900,000. Each sample was generated by a mixture of negative binomial model with both the prior distributions and dispersion parameters estimated from Marioni’s data. Each sample consists of 10,000 genes for computational efficiency.
We performed a comparative analysis of our method with four popular methods, DESeq, edgeR, baySeq and NOISeq, which are available as part of Bioconductor packages at http://bioconductor.org[38]. The edgeR implements two ways to estimate the dispersion parameter in its model, common dispersion and tagwise dispersion. Both of them are studied here. baySeq provides two choices of model (Poisson and negative binomial). We adopted the negative binomial model for dataset1. Both DESeq and edgeR provide pvalues for ranking the genes. baySeq provides log posterior likelihood ratio for ranking the differential expression of genes. In the case of NPEBseq, we rank the genes by pvalues as defined in equation (4). The purpose of this simulation is to compare the ability of these methods to rank the genes in order of differential expression. The true ranking order of the genes is based on the fold change of differential expression values between the two conditions.
We used the following criteria to compare the performance of different methods. Given a cutoff point τ (e.g. the number of genes declared significantly expressed), the efficiency of a statistical method is measured by p_{τ}, the expected percentage of the true first τ DE genes being correctly declared as the first τ DE genes. The average of estimated p_{τ} is calculated from the 20 replicates. The simulation results for dataset1 are shown in Figure 2. The proposed NPEBseq method outperformed other methods.
Simulation 2  simulation with the same priors between conditions (dataset2 and dataset3)
A simulation scheme similar to the one suggested by Robinson and Smyth [39] is applied here to generate dataset2 and dataset3. The library size of each sample was uniformly sampled from 300,000 to 900,000. The prior distribution of λ was assumed to be common between the two conditions and estimated from the liver RNAseq data of Marioni.
Dataset2 was generated by Poisson distribution and dataset3 by negative binomial distribution, with the dispersion parameter estimated from the liver data. The simulated data consists of 10,000 genes, and onetenth of those genes were set to be differentially expressed (between condition A and condition B) with λ_{A} = bλ_{B}. In order to produce both over and underexpression in our simulated data, 500 randomly selected genes were set to have b=4 and the remaining 500 genes were set to have b=1/4. Both dataset2 and dataset3 consist of 20 independent simulations. Dataset2 was generated without replicates. Similar to dataset1 seven samples per condition per simulation were generated for dataset3. The full ROC curves for dataset2 and dataset3 are shown in Figures 3 and 4, respectively. Based upon examination of these curves, the proposed NPEBseq method appears to perform better than the other methods. The partial ROC curves with false positive rate less than 0.2 are shown in Additional file 2: Figure S1 and Figure S2, which indicate that NPEBSeq performs as well as the other methods. To clearly show that NPEBSeq can robustly estimate fold change of genes with low read counts, the estimated fold change of 10 genes from one sample of dataset2 by NPEBseq along with DESeq and edgeR are shown in Table 1. For the cases with zero read count under one condition, DESeq always gives infinite estimation of fold change.
Real RNAseq data analysis
To further evaluate our method, we tested it on two published RNAseq datasets.
Real RNAseq data 1Comparison based on one MAQC dataset
We first applied NPEBSeq on the MicroArray quality control (MAQC) dataset [40, 41] and compared with DESeq, baySeq, and edgeR. MAQC datasets contain gene expression data from multiple platforms and are extensively used in evaluating different data processing methods. We downloaded the MAQC2 Illumina RNAseq data from http://www.ncbi.nlm.nih.gov/sra, which contains seven technical replicates of brain reference RNA samples and seven technical replicates of UHR RNA samples. Tophat [35] was used for tag alignment and counts for each gene were computed by means of HTSeq Python package (http://wwwhuber.embl.de/users/anders/HTSeq/), using the annotation of the Ensembl genes and only reads that mapped to exons.
As part of the original MAQC project, around 1,000 genes were also chosen to be assayed by Taqman qRTPCR. Those qRTPCR data were obtained from GEO database, which contains four technical replicates for each of the two samples. The qRTPCR data were used as a gold standard to benchmark the gene expression values by RNAseq. We analysed the qRTPCR data using the comparative C_{t} methods [42]. Finally, 407 genes were defined as DE and 119 genes were defined as nonDE. Given the fact that not all the genes were assayed by qRTPCR, we followed the same procedure that was applied in [15] to define the true positive and false positive rates. Given a “DE” or “nonDE” call from qRTPCR, define a true positive (TP) as the event that the test of interest calls a gene DE that qRTPCR called DE. A false positive (FP) event occurs when the test calls a gene DE that qRTPCR called nonDE. The true positive rate (TPR) is defined as
and the false positive rate (FPR) is defined as
Note that these are not the standard definitions of TPR and FPR.
qRTPCR data were annotated by RefSeq. The BioMart R package [43] was used to convert the RefSeq genes IDs for qRTPCR to Ensembl genes ids.
The ROC curves from all the compared methods are shown in Figure 5. Clearly, our proposed method has the best performance in terms of sensitivity and specificity.
Real RNAseq data 2Detecting differential usage of exons from RNAseq data
We also analysed the data by Brook et al. [44], where the effect of the RNAi knockdown of “pasilla” was studied by RNAseq in the Drosophila melanogaster cell line. The data was downloaded as part of DEXSeq package. The data consists of four control samples and three knockdown samples. The analysis at gene level by NPEBseq reported 107 differentially expressed genes, with nominal FDR control at 0.1 for the comparison of control and knockdown. To access the specificity of the NPEBseq method we performed incondition comparison by making use of the fact that there are four biological replicates in the control group. We applied NPEBseq for the comparison of two control samples versus the other two. NPEBseq reported zero differentially expressed genes with FDR control at 0.1, which indicates that NPEBseq has a very high specificity.
We then analysed Brook’s data at exon level. NPEBseq found differential exon usage for 2,370 counting bins at FDR 0.01 for betweencondition comparison and 225 counting bins for withincondition comparison. We also applied the newest version of DEXSeq on the exon data, which reported 120 counting bins as DE at FDR 0.1. We checked whether NPEBseq and DEXSeq could achieve comparable results by computing the percentage of DE called exons that are common in the two ranked lists of exons generated by both programs. The results are shown in Figure 6. For example, we found that 74 counting bins (exons) were common among the top 120 DE counting bins called by each approach. And, further examination revealed that among the top 120 DE counting bins identified by NPEBseq, 12 were defined as “untestable” by the DEXSeq method due to low read counts in those counting bins. Since the pvalue defined in NPEBseq is different from the regular pvalue, we didn’t expect these two approaches to report similar number of DE exons at the same FDR level.
Discussion
In this paper we developed a novel empirical Bayesianbased approach to model the RNAseq data. This method has been widely used in ecology to estimate species diversity [26]. The nonparametric form of the prior distribution of the Bayesian model is empirically estimated from the data. The expression level of genes with low read counts are estimated by borrowing information from the gene expression in the whole sample. For data with biological replicates, we developed a hierarchal Bayesian model to account for the overdispersion and proposed an empirical Bayesian method to estimate the dispersion parameter. We also extended the model to detect differential usage of exons from RNAseq datasets. The closedform formula of the posterior distribution makes the computation of any statistics very efficient. At the final step, we evaluated the performance of this method in detecting the differentially expressed genes by conducting simulation and real RNAseq data analysis.
There are many challenges still present in the processing and analyses of RNAseq data. For example, it has been empirically observed that quantification of expression depends on the length of the biological features under study (genes, transcripts, or exons), as longer features tend to have more significant statistics than shorter ones [45]. Also, it was recently shown that there exists a samplespecific guaninecytosine content (GCcontent) effect and the studies proposed normalization methods by GCstrata to remove such effects [46, 47]. Incorporating those factors into our model could further improve the performance.
Delineating the gene expression at an alternative transcriptlevel from RNAseq data is still a very challenging problem. Our recently published IsoformEx method [3], based on nonnegative least square, is aimed to estimate transcript abundance. In future enhancements to the proposed method we will integrate NPEBseq with IsoformEx to identify DE at isoformlevel.
Conclusions
NPEBseq can be applied to not only detect differential gene expressions from the RNAseq dataset with technical and biological replicates for both studied conditions, but also to detect differential usage of exons. It is robust, since it requires no limited assumptions to be made about the prior distribution of the data. NPEBseq also provides the closed form of posterior distribution of the fold change, which is useful for further analysis.
Availability and requirements
Project name: NPEBseq
Project home page: http://bioinformatics.wistar.upenn.edu/NPEBseq
Operating system and R version: The R package is platform independent and is compatible with all the versions of R same as or higher than 2.15.1.
Other requirements: No.
License: GPL (≥ 2)
Any restrictions to use: It is available for free download.
References
Cancer Genome Atlas N: Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012, 487 (7407): 330337. 10.1038/nature11252.
Pal S, Gupta R, Kim H, Wickramasinghe P, Baubet V, Showe LC, Dahmane N, Davuluri RV: Alternative transcription exceeds alternative splicing in generating the transcriptome diversity of cerebellar development. Genome Res. 2011, 21 (8): 12601272. 10.1101/gr.120535.111.
Kim H, Bi Y, Pal S, Gupta R, Davuluri RV: IsoformEx: isoform level gene expression estimation using weighted nonnegative least squares from mRNASeq data. BMC Bioinforma. 2011, 12: 30510.1186/1471210512305.
Jiang H, Wong WH: Statistical inferences for isoform expression in RNASeq. Bioinformatics. 2009, 25 (8): 10261032. 10.1093/bioinformatics/btp113.
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, Van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 28 (5): 511515.
Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN: RNASeq gene expression estimation with read mapping uncertainty. Bioinformatics. 26 (4): 493500.
Zheng S, Chen L: A hierarchical Bayesian model for comparing transcriptomes at the individual transcript isoform level. Nucleic Acids Res. 2009, 37 (10): e7510.1093/nar/gkp282.
Srivastava S, Chen L: A twoparameter generalized Poisson model to improve the analysis of RNAseq data. Nucleic Acids Res. 38 (17): e170
Pal S, Gupta R, Davuluri RV: Alternative transcription and alternative splicing in cancer. Pharmacol Ther. 2012, 136 (3): 283294. 10.1016/j.pharmthera.2012.08.005.
Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 2010, 11 (3): R2510.1186/gb2010113r25.
Tarazona S, GarciaAlcalde F, Dopazo J, Ferrer A, Conesa A: Differential expression in RNAseq: a matter of depth. Genome res. 2011, 21 (12): 22132223. 10.1101/gr.124321.111.
Oshlack A, Robinson MD, Young MD: From RNAseq reads to differential expression results. Genome biol. 2010, 11 (12): 22010.1186/gb20101112220.
Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, et al: A comprehensive evaluation of normalization methods for Illumina highthroughput RNA sequencing data analysis. Briefings Bioinf. 2012, Sep 17. [Epub ahead of print]
Kadota K, Nishiyama T, Shimizu K: A normalization strategy for comparing tag count data. Algorithms Mol Biol. 2012, 7 (1): 510.1186/1748718875.
Bullard JH, Purdom E, Hansen KD, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC Bioinf. 2010, 11: 9410.1186/147121051194.
Balwierz PJ, Carninci P, Daub CO, Kawai J, Hayashizaki Y, Van Belle W, Beisel C, Van Nimwegen E: Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data. Genome biol. 2009, 10 (7): R7910.1186/gb2009107r79.
Anders S, Huber W: Differential expression analysis for sequence count data. Genome biol. 2010, 11 (10): R10610.1186/gb20101110r106.
Robinson MD, McCarthy DJ, Smyth GK: edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26 (1): 139140. 10.1093/bioinformatics/btp616.
Hardcastle TJ, Kelly KA: baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinf. 2010, 11: 42210.1186/1471210511422.
Feng J, Meyer CA, Wang Q, Liu JS, Liu XS, Zhang Y: GFOLD: a generalized fold change for ranking differentially expressed genes from RNAseq data. Bioinformatics. 2012, 28 (21): 27822788. 10.1093/bioinformatics/bts515.
Van De Wiel MA, Leday GG, Pardo L, Rue H, Van DerVaart AW, Van Wieringen WN: Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics. 2013, 14 (1): 113128. 10.1093/biostatistics/kxs031.
Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, Haag JD, Gould MN, Stewart RM, Kendziorski C: EBSeq: an empirical Bayes hierarchical model for inference in RNAseq experiments. Bioinformatics. 2013, 29 (8): 10351043. 10.1093/bioinformatics/btt087.
Ji HK, Liu XS: Analyzing ‘omics data using hierarchical models. Nat Biotechnol. 2010, 28 (4): 337340. 10.1038/nbt.1619.
Evans M, Swartz T: Methods for approximating integrals in statistics with special emphasis on Bayesian integration problems. Stat Sci. 1995, 10 (3): 254272. 10.1214/ss/1177009938.
McIntyre LM, Lopiano KK, Morse AM, Amin V, Oberg AL, Young LJ, Nuzhdin SV: RNAseq: technical variability and sampling. BMC genomics. 2011, 12: 29310.1186/1471216412293.
Mao CX, Lindsay BG: Tests and diagnostics for heterogeneity in the species problem. Comput Stat Data An. 2003, 41 (34): 389398. 10.1016/S01679473(02)001640.
Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNAseq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome res. 2008, 18 (9): 15091517. 10.1101/gr.079558.108.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq. Nature methods. 2008, 5 (7): 621628. 10.1038/nmeth.1226.
Mao CX: Inference on the number of species through geometric lower bounds. J Am Stat Assoc. 2006, 101 (476): 16631670. 10.1198/016214506000000528.
Wang JPZ, Lindsay BG: A penalized nonparametric maximum likelihood approach to species richness estimation. J Am Stat Assoc. 2005, 100 (471): 942959. 10.1198/016214504000002005.
Li J, Witten DM, Johnstone IM, Tibshirani R: Normalization, testing, and false discovery rate estimation for RNAsequencing data. Biostatistics. 2012, 13 (3): 523538. 10.1093/biostatistics/kxr031.
Montgomery SB, Sammeth M, GutierrezArcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET: Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010, 464 (7289): 773777. 10.1038/nature08903.
Consortium EP: A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS biol. 2011, 9 (4): e100104610.1371/journal.pbio.1001046.
Hammerman PS, Hayes DN, Wilkerson MD, Schultz N, Bose R, Chu A, Collisson EA, Cope L, Creighton CJ, Cancer Genome Atlas Research N, et al: Comprehensive genomic characterization of squamous cell lung cancers. Nature. 2012, 489 (7417): 519525. 10.1038/nature11404.
Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L: Differential gene and transcript expression analysis of RNAseq experiments with TopHat and Cufflinks. Nat Protoc. 2012, 7 (3): 562578. 10.1038/nprot.2012.016.
Glaus P, Honkela A, Rattray M: Identifying differentially expressed transcripts from RNAseq data with biological variation. Bioinformatics. 2012, 28 (13): 17211728. 10.1093/bioinformatics/bts260.
Anders S, Reyes A, Huber W: Detecting differential usage of exons from RNAseq data. Genome res. 2012, 22 (10): 20082017. 10.1101/gr.133744.111.
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5 (10): R8010.1186/gb2004510r80.
Robinson MD, Smyth GK: Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007, 23 (21): 28812887. 10.1093/bioinformatics/btm453.
Canales RD, Luo Y, Willey JC, Austermiller B, Barbacioru CC, Boysen C, Hunkapiller K, Jensen RV, Knight CR, Lee KY, et al: Evaluation of DNA microarray results with quantitative gene expression platforms. Nat Biotechnol. 2006, 24 (9): 11151122. 10.1038/nbt1236.
Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, De Longueville F, Kawasaki ES, Lee KY, et al: The MicroArray Quality Control (MAQC) project shows inter and intraplatform reproducibility of gene expression measurements. Nat Biotechnol. 2006, 24 (9): 11511161. 10.1038/nbt1239.
Schmittgen TD, Livak KJ: Analyzing realtime PCR data by the comparative C(T) method. Nat Protoc. 2008, 3 (6): 11011108. 10.1038/nprot.2008.73.
Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W: BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005, 21 (16): 34393440. 10.1093/bioinformatics/bti525.
Brooks AN, Yang L, Duff MO, Hansen KD, Park JW, Dudoit S, Brenner SE, Graveley BR: Conservation of an RNA regulatory map between Drosophila and mammals. Genome res. 2011, 21 (2): 193202. 10.1101/gr.108662.110.
Oshlack A, Wakefield MJ: Transcript length bias in RNAseq data confounds systems biology. Biol Direct. 2009, 4: 1410.1186/17456150414.
Hansen KD, Irizarry RA, Wu Z: Removing technical variability in RNAseq data using conditional quantile normalization. Biostatistics. 2012, 13 (2): 204216. 10.1093/biostatistics/kxr054.
Risso D, Schwartz K, Sherlock G, Dudoit S: GCcontent normalization for RNASeq data. BMC bioinf. 2011, 12: 48010.1186/1471210512480.
Acknowledgements
The work in the Davuluri laboratory is supported by Commonwealth Universal Research Enhancement (CURE) Research Program, Department of Health, Pennsylvania. RD holds a Philadelphia Healthcare Trust Endowed Chair Position; Research reported in this publication was partially supported by the National Library Of Medicine of the National Institutes of Health under Award Number R01LM011297. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The use of resources in the Bioinformatics Shared Facility of Wistar Cancer Centre (grant # P30 CA010815) is gratefully acknowledged. We thank Dr. Sharmistha Pal and Dr. Murali Bashyam for reading through the manuscript.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
Both authors declare that they have no competing interests.
Authors’ contributions
YB and RVD conceived the initial approach. YB designed and implemented the methods and performed the analyses. YB and RVD wrote the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
12859_2013_6025_MOESM1_ESM.docx
Additional file 1: This file describes the procedure to derive the marginal posterior distribution of λ_{ i }and the procedure to infer the prior distribution of our proposed model.(DOCX 32 KB)
12859_2013_6025_MOESM2_ESM.docx
Additional file 2: Figure S1: Partial ROC curves based on simulated dataset2. Figure S2 Partial ROC curves based on simulated dataset3. (DOCX 69 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Bi, Y., Davuluri, R.V. NPEBseq: nonparametric empirical bayesianbased procedure for differential expression analysis of RNAseq data. BMC Bioinformatics 14, 262 (2013). https://doi.org/10.1186/1471210514262
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/1471210514262