BADGE: A novel Bayesian model for accurate abundance quantification and differential analysis of RNA-Seq data
© Gu et al.; licensee BioMed Central Ltd. 2014
Published: 10 September 2014
Recent advances in RNA sequencing (RNA-Seq) technology have offered unprecedented scope and resolution for transcriptome analysis. However, precise quantification of mRNA abundance and identification of differentially expressed genes are complicated due to biological and technical variations in RNA-Seq data.
We systematically study the variation in count data and dissect the sources of variation into between-sample variation and within-sample variation. A novel Bayesian framework is developed for joint estimate of gene level mRNA abundance and differential state, which models the intrinsic variability in RNA-Seq to improve the estimation. Specifically, a Poisson-Lognormal model is incorporated into the Bayesian framework to model within-sample variation; a Gamma-Gamma model is then used to model between-sample variation, which accounts for over-dispersion of read counts among multiple samples. Simulation studies, where sequencing counts are synthesized based on parameters learned from real datasets, have demonstrated the advantage of the proposed method in both quantification of mRNA abundance and identification of differentially expressed genes. Moreover, performance comparison on data from the Sequencing Quality Control (SEQC) Project with ERCC spike-in controls has shown that the proposed method outperforms existing RNA-Seq methods in differential analysis. Application on breast cancer dataset has further illustrated that the proposed Bayesian model can 'blindly' estimate sources of variation caused by sequencing biases.
We have developed a novel Bayesian hierarchical approach to investigate within-sample and between-sample variations in RNA-Seq data. Simulation and real data applications have validated desirable performance of the proposed method. The software package is available at http://www.cbil.ece.vt.edu/software.htm.
Next Generation Sequencing (NGS) technology has opened a new era for transcriptome analysis, which grants the ability to investigate novel biological problems, such as alternative splicing, differential isoforms, gene fusion, etc. By piling up millions of reads along the reference genome, RNA-Seq technology can obtain signals in a much larger dynamic range with much higher accuracy compared to traditional microarray based technologies. For RNA-Seq analysis, the most popular routine is to determine the expression of genes (abundance quantification) and to identify differentially expressed genes (DEGs).
Methods that quantify gene/isoform level expression mainly fall into two categories: Poisson count mode (e.g., Cufflinks , etc) and linear regression model (e.g., Isolasso , SLIDE , BASIC , etc.). The major challenge in accurate quantification of gene expression is that large systematic bias in sequencing counts has been observed due to multiple factors. In contrast to uniform assumption of read distribution, it has been reported that sequence counts show a variety of physical and chemical biases, including transcript length bias, GC-content bias, random hexamer priming bias, etc [5–7].
Differential analysis of RNA-Seq data has been focused on modeling variance among biological replicates or samples in the same phenotype group. EdgeR  is the first method that models the 'between-sample' variability by replacing the Poisson model with Negative Binomial model. Various statistical methods have been proposed to model the variance among samples in biological groups, aimed to improve overall fitting of count data or robustness against outliers [9–12].
Despite initial success to model uncertainties associated with sequencing counts from different aspects, there lacks a systematic effort to address variability in RNA-Seq data. We dissect the variance in sequencing counts along two dimensions: over-dispersion of read counts within the same sample (i.e., within-sample variation) and over-dispersion of read counts among individuals from the same biological group (i.e., between-sample variation). Within-sample variation typically leads to large variance of read counts among genomic loci (e.g., nucleotides or exons), which have similar expression level in the same sample. It is typically caused by technical artifacts such as uncorrected systematic bias and gene specific random effects. On the other hand, between-sample variation is mostly due to biological differences among samples under the same condition. To increase the accuracy of abundance estimation so as to improve DEG identification, immediate attention is needed to developing a unified model that takes care of both forms of variation in RNA-Seq data. We propose a computational method, namely Bayesian Analysis of Dispersed Gene Expression ('BADGE'), to model variability in RNA-Seq data. A full Bayesian model is employed to simultaneously account for within-sample variation and between-sample variation to improve inference. The proposed method has several novel contributions compared to existing methodologies: 1) a unified Bayesian causality model is developed for joint abundance estimation and DEG identification. The improved accuracy in profiling mRNA abundance can facilitate the identification of DEGs, which may in turn refine the parameter learning in abundance quantification. 2) A Poisson-Lognormal regression model is incorporated to model within-sample variation . Instead of dealing with multiples sources of technical bias and variation separately, the proposed method can 'blindly' detect over-dispersion pattern within the individual sample. 3) Gamma-Gamma model  is used to model between-sample variation, which accounts for over-dispersion of read counts among multiple samples. BADGE is a unified computational method that extensively models variability in RNA-seq data to improve abundance quantification and DEG identification.
Bayesian Analysis of Dispersed Gene Expression (BADGE)
We have developed a computational method, namely Bayesian Analysis of Dispersed Gene Expression (BADGE), to model extensive variability in RNA-Seq data. BADGE explicitly models both between-sample variation and within-sample variation to improve abundance quantification and DEG identification. In this paper, we only focus on the gene level analysis, while the concept can be straight-forwardly generalized for genes with multiple transcripts (isoforms).
where β g,j is the true expression level of gene g for sample j. x g,i is the length of the ith exon weighted by the library size of sample j. U g,i,j is the unknown within-sample variation parameter, which follows normal distribution with mean 0 and precision τ. 'Flat' prior is assigned for τ by setting its shape a = 1 and rate b = 0. Equations (1-4) are also known as the Poisson-Lognormal regression model with identity link function.
Estimate model parameters using Gibbs sampling
We pool all the samples from U g,i,j and get its estimate where denotes sampled U g,i,j at sample t. t b is the last sample when burn-in stops. T is the total number of Gibbs samples. will be passed to the Gamma-Gamma model to estimate parameters associated with DEG identification.
We use Gibbs sampling method [4, 13, 15, 16] to estimate the posterior distributions of individual parameters iteratively from their complex joint distribution. For parameters β, ν, τ, and λ, we use conjugate priors to sample from their conditional distributions with standard probability distributions (Gamma distribution). For parameters (U, α0 and α) that do not have conjugate priors, we use Metropolis-Hastings sampling to sample their posterior distributions. Please see more details about the implementation of Gibbs sampler in additional file 1.
Results and discussion
Over-dispersion of RNA-Seq counts in two dimensions: between-sample variation and within-sample variation
Generate simulation data based on real RNA-Seq datasets
Performance comparison for abundance estimation on simulation data
Performance comparison for differentially expressed gene (DEG) identification on simulation data
Performance comparison using AUC for DEG identification (highly differentially expressed genes)
Performance comparison using AUC for DEG identification (moderately differentially expressed genes)
Performance comparison using AUC for DEG identification (weakly differentially expressed genes)
We simulated RNA-Seq gene expression with three different scenarios: genes that were highly differentially expressed, moderately differentially expressed and weakly differentially expressed (see supplementary materials in additional file 1 for more information). From Tables 1, 2, 3, we see that the BADGE method consistently outperformed existing methods in different parameter settings (highlighted in bold). For highly differentially expressed genes (Table 1), the performance of the other methods degraded as we decreased τ, while BADGE was able to maintain good performance by employing a Poisson-Lognormal model to account for within-sample variability. For genes that were weakly differentially expressed (Table 3), BADGE achieved a maximum improvement of AUC up to 1.3, compared to the second best method EBSeq (Table 3, τ = 0.44, ν = 0.001).
Performance comparison for differentially expressed gene (DEG) identification on Sequencing Quality Control (SEQC) data
On the SEQC dataset, BADGE had the best performance among all five methods by achieving an AUC very close to 0.9. The second best method was DSS with an AUC about 0.85. DESeq (AUC = 0.7624) and edgeR (AUC = 0.7675) had very close performance, which was consistent with the previous results reported by Rapaport et al. . EBSeq, on the other hand, had the least favourable performance (AUC = 0.71) on this specific dataset and it failed to detect the most strongly differentially expressed genes: its sensitivity was less than 0.1 when its specificity was about 0.9. By close examination of the 'left' ROC curves of the five methods, we can further infer that BADGE should have significantly better precision than the competing methods, whose sensitivity went all the way up to 0.7 before any sacrifice in specificity.
'Blind' estimation of hidden heterogeneity in RNA-Seq data
Large variation in RNA-Seq data has become the major obstacle against accurate estimation of gene expression and DEG identification. Much effort has been made to model variation across biological replicates, while limited attention is paid to tackle extensive over-dispersion observed in sequencing counts. For short-read sequencing technologies (e.g., Illumina), multiple sources of systematic bias have been identified, including transcript length bias, GC-content bias, etc. However, in-depth investigation of real RNA-Seq datasets has revealed the following complications: 1) Sequencing bias not only changes from one gene to another, but also varies among samples (Figure 2(c) and (d)); 2) Gene expression is jointly influenced by multiple bias factors, which leads to large variation across the entire transcriptome (Figure 6). However, current research activities have been focused on addressing individual bias corrections, which lacks a unified effort to account for total variability in RNA-Seq data. Therefore, we propose the BADGE method to extensively model both within-sample variability (bias and random variance) and between-sample variability (biological variations among replicates or within the same phenotype group) to improve quality of inference.
The publication costs for this article were funded by National Institutes of Health (NIH) [CA149653].
This work is supported by National Institutes of Health (NIH) [CA149653, CA149147, CA164384, and NS29525-18A, in part].
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 9, 2014: Proceedings of the Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-Seq 2014). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S9.
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28 (5): 511-515.PubMed CentralView ArticlePubMedGoogle Scholar
- Li W, Feng J, Jiang T: IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly. J Comput Biol. 2011, 18 (11): 1693-1707.PubMed CentralView ArticlePubMedGoogle Scholar
- Li JJ, Jiang CR, Brown JB, Huang H, Bickel PJ: Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proc Natl Acad Sci USA. 2011, 108 (50): 19867-19872.PubMed CentralView ArticlePubMedGoogle Scholar
- Zheng S, Chen L: A hierarchical Bayesian model for comparing transcriptomes at the individual transcript isoform level. Nucleic Acids Res. 2009, 37 (10): e75-PubMed CentralView ArticlePubMedGoogle Scholar
- Wu Z, Wang X, Zhang X: Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq. Bioinformatics. 2011, 27 (4): 502-508.View ArticlePubMedGoogle Scholar
- Hansen KD, Irizarry RA, Wu Z: Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 2012, 13 (2): 204-216.PubMed CentralView ArticlePubMedGoogle Scholar
- Hansen KD, Brenner SE, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010, 38 (12): e131-PubMed CentralView ArticlePubMedGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26 (1): 139-140.PubMed CentralView ArticlePubMedGoogle Scholar
- Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11 (10): R106-PubMed CentralView ArticlePubMedGoogle Scholar
- Wu H, Wang C, Wu Z: A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics. 2013, 14 (2): 232-243.PubMed CentralView ArticlePubMedGoogle Scholar
- Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, Haag JD, Gould MN, Stewart RM, Kendziorski C: EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. 2013, 29 (8): 1035-1043.PubMed CentralView ArticlePubMedGoogle Scholar
- Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L: Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2013, 31 (1): 46-53.View ArticlePubMedGoogle Scholar
- Hu M, Zhu Y, Taylor JM, Liu JS, Qin ZS: Using Poisson mixed-effects model to quantify transcript-level gene expression in RNA-Seq. Bioinformatics. 2012, 28 (1): 63-68.PubMed CentralView ArticlePubMedGoogle Scholar
- Wei Z, Li H: A Markov random field model for network-based analysis of genomic data. Bioinformatics. 2007, 23 (12): 1537-1544.View ArticlePubMedGoogle Scholar
- Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993, 262 (5131): 208-214.View ArticlePubMedGoogle Scholar
- Sabatti C, James GM: Bayesian sparse hidden components analysis for transcription regulation networks. Bioinformatics. 2006, 22 (6): 739-746.View ArticlePubMedGoogle Scholar
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10 (1): 57-63.PubMed CentralView ArticlePubMedGoogle Scholar
- Comprehensive molecular portraits of human breast tumours. Nature. 2012, 490 (7418): 61-70.Google Scholar
- Cheung VG, Nayak RR, Wang IX, Elwyn S, Cousins SM, Morley M, Spielman RS: Polymorphic cis- and trans-regulation of human gene expression. PLoS Biol. 2010, 8 (9):Google Scholar
- Bottomly D, Walter NA, Hunter JE, Darakjian P, Kawane S, Buck KJ, Searles RP, Mooney M, McWeeney SK, Hitzemann R: Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays. PLoS One. 2011, 6 (3): e17820-PubMed CentralView ArticlePubMedGoogle Scholar
- Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Borresen-Dale AL: Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biol. 2011, 12 (1): R6-PubMed CentralView ArticlePubMedGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5 (7): 621-628.View ArticlePubMedGoogle Scholar
- Tarazona S, Furio-Tari P, Ferrer A, Conesa A: NOISeq: Exploratory analysis and differential expression for RNA-seq data. R package version 200. 2012Google Scholar
- Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 2013, 14 (6): 671-683.View ArticlePubMedGoogle Scholar
- Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D: Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013, 14 (9): R95-PubMed CentralView ArticlePubMedGoogle Scholar
- Benjamini Y, Speed TP: Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 2012, 40 (10): e72-PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.