DAFS: a dataadaptive flag method for RNAsequencing data to differentiate genes with low and high expression
 Nysia I George^{1} and
 ChingWei Chang^{1}Email author
DOI: 10.1186/147121051592
© George and Chang; licensee BioMed Central Ltd. 2014
Received: 30 August 2013
Accepted: 25 March 2014
Published: 31 March 2014
Abstract
Background
Nextgeneration sequencing (NGS) has advanced the application of highthroughput sequencing technologies in genetic and genomic variation analysis. Due to the large dynamic range of expression levels, RNAseq is more prone to detect transcripts with low expression. It is clear that genes with no mapped reads are not expressed; however, there is ongoing debate about the level of abundance that constitutes biologically meaningful expression. To date, there is no consensus on the definition of low expression. Since random variation is high in regions with low expression and distributions of transcript expression are affected by numerous experimental factors, methods to differentiate low and high expressed data in a sample are critical to interpreting classes of abundance levels in RNAseq data.
Results
A dataadaptive approach was developed to estimate the lower bound of high expression for RNAseq data. The KolmgorovSmirnov statistic and multivariate adaptive regression splines were used to determine the optimal cutoff value for separating transcripts with high and low expression. Results from the proposed method were compared to results obtained by estimating the theoretical cutoff of a fitted twocomponent mixture distribution. The robustness of the proposed method was demonstrated by analyzing different RNAseq datasets that varied by sequencing depth, species, scale of measurement, and empirical density shape.
Conclusions
The analysis of real and simulated data presented here illustrates the need to employ dataadaptive methodology in lieu of arbitrary cutoffs to distinguish low expressed RNAseq data from high expression. Our results also present the drawbacks of characterizing the data by a twocomponent mixture distribution when classes of gene expression are not well separated. The ability to ascertain stably expressed RNAseq data is essential in the filtering process of data analysis, and methodologies that consider the underlying data structure demonstrate superior performance in preserving most of the interpretable and meaningful data. The proposed algorithm for classifying low and high regions of transcript abundance promises widerange application in the continuing development of RNAseq analysis.
Keywords
RNAsequencing Low expression Dataadaptive Flag Mixture distributionBackground
Transcriptome analysis is integral in understanding the role of genetic and genomic variants in disease progression, classification, and diagnosis. In recent years, next generation sequencing (NGS) technologies have catapulted transcriptome profiling to new heights by providing greater precision of cellular RNA content. As a result, RNAseq is more sensitive to subtle changes in expression levels and typically includes many transcripts with low read count integers.
Researchers have taken interest in the low regions of a sample for several reasons, all with the primary intent of quantifying the level of expression that reflects important biological meaning. One of the major problems in RNAseq data analysis is that there is no consensus on what is considered low expression. Low counts have been referenced as less than 10 reads when summed across treatments [1, 2], less than 10 reads on average [3], less than 100 reads on average [4, 5], and less than 300 reads [6]. Additionally, countbased prefiltering methods have been adopted to exclude genes with minimal expression from differential testing. For example, Risso et al. [7] filtered out genes with an average count below 10. Robinson et al. [8] recommended removing genes that did not have at least 100 counts per million reads in at least two samples in the edgeR user’s guide. In the DESeq user’s guide, Anders [4] took a more aggressive approach and suggested removing up to 40% of genes that ranked lowest in regard to total count across all experimental samples. Thus, at present, whether low expressed transcripts are simply identified or deliberately omitted from analysis, methods of differentiating spurious RNAseq data from meaningful information have not been explicitly defined. Furthermore, it is not clear whether it is justifiable to extrapolate the aforementioned cutoffs to other RNAseq data.
A number of factors affect the distribution of read counts for a given study. Specifically, the manufacturer, library preparation and construction, alignment algorithm, gene length, sequencing depth, and experimental design all play a role in determining the number of reads that are mapped to a gene. Since each of these factors varies from study to study, it may be misleading to ignore properties of the sample distribution by applying an arbitrary cutoff to classify the low expressed region of a sample. To address the question of what should be considered the lower bound of functional gene expression, it is useful to consider the premise that genes can be categorized into a group of high expressed (HE), meaningful genes and a group of low expressed (LE), noninformative genes. In previous studies, the empirical distribution of a sample was used to identify classes of mRNA abundance levels [9]. This concept has also been utilized to differentiate functional expression states of microarray data [10–13]. In order to optimally separate expression classes, a priori knowledge of the expression distribution is useful. In many cases, the precise distribution of noise is unknown; however, characterizations of a global bimodal distribution and a normally distributed HE component have been reported in previous studies [10, 14–16].
Hebenstreit et al. [16] studied the global distribution of RNA sequences in mice Th2 cells. Their work demonstrated that log2transformed reads per kilobase per million (RPKM) could be separated into two classes of mRNA abundance. The researchers used the likelihood ratio test [17] to determine whether the data was best modeled as a mixture of n or n + 1 components, where 0 < n < 9. Mclust [18], an R package that uses an expectationmaximization (EM) algorithm, was used to compute maximum likelihood estimates of all parameters of the mixture distribution. For onedimensional data, Mclust evaluates a fitted model with equal variance terms and a fitted model with unequal variance and selects the model with higher Bayesian Information Criterion (BIC) [19]. However, the use of mixture modeling to identify mixture components of RNAseq data may not be appropriate for two reasons. First, it is difficult to capture the true number of mixture components. It is known that the EM algorithm performs well at estimating the parameters of a finite mixture model; however, when mixture distributions are unimodal and there is no clear separation between components, the EM algorithm commonly returns more components than what seems logical based on visual inspection of the data [20–22]. Second, when a twocomponent mixture model is forced to fit data that does appear to be bimodally distributed, the fitted model does not always approximate very well the observed empirical distribution.
Hebenstreit et al. [16] were able to characterize the mixture distribution of the LE and HE components and determine the peak of the LE region by using a Poisson distribution to estimate the proportion of undetected genes at each expression level. However, they did not discuss or provide an expression cutoff for separating the two overlapping regions. In this study, we propose DAFS, a dataadaptive method for identifying and subsequently flagging expressions in the LE region by estimating the lower bound of high expression in a given RNAseq sample. In light of the drawbacks of the mixture modeling approach, DAFS was constructed without imposing a finite mixture model on the data. Several real RNAseq datasets and simulated data are used to present our findings and demonstrate the robustness of our methodology.
Results
Analysis of bulk RNA positive controls generated from cultured HCT116 cells [23] allowed for an additional assessment of the ability of DAFS to separate known expressed genes into one mixture component and known unexpressed genes into a separate mixture component. In their analysis of singlecell whole transcriptome amplification, the authors identified 40 known expressed and unexpressed genes in HCT116 cells (Supplementary Table S2 in [23]). Fragments transformed by fragments per kilobase of exon per million mapped reads (FPKM) [24] were downloaded from the GEO database (GSE51254), and as specified in Wu et al. [23], were normalized according to median expression across all transcripts from a single cell and log2 transformed. The estimated DAFS cutoff of log2 medianadjusted FPKM data from RNA bulk was 0.78. At this cutoff all of the known expressed genes in the subset of 40 genes were mapped to the HE component and all of the known unexpressed genes were mapped to the LE component. The gene closest to the boundary of our cutoff for separating low and high expression was METTL3 (methyltransferase like 3) with a log2 medianadjusted FPKM value of 1.02. To assess the expression level of METTL3 in HCT116 cells, we examined two microarray datasets from the GEO database (GSE32323, GSE11618) and confirmed that METTL3 is expressed in HCT116 cells with low expression levels. Thus, to classify METTL3 in the HE component appears to be the appropriate decision based on biological evidence.
DAFS estimates of q_{ c } for each sample in Figure 2 demonstrate two additional important findings. Not only does the estimated quantile cutoff differ in each dataset, but the raw RNAseq count corresponding to the computed quantile also differs. We identified quantile cutoff values of 0.14 (Nagalakshmi et al. [25]), 0.42 (Wang et al. [26]), 0.36 (Mortazavi et al. [27]), and 0.33 (SEQC). These percentiles mapped to raw counts of 8, 22, 16, and 73, respectively. This indicates the necessity of a dataadaptive feature. As evidenced by the results, it would not be appropriate to apply an arbitrary cutoff based on percentiles of the data nor on raw counts without consideration of the individual data structure of the sample.
Additionally, we analyzed two different RNAseq datasets with technical replicates to assess the consistency of DAFS. Overall, DAFS demonstrated great consistency in estimating cutoffs from technical replicates. For SEQC, a q_{ c } value of 0.33 was computed for each replicate of sample A. Values of q_{ c } for replicates of sample B fluctuated between 0.24 and 0.25. In Hammer et al. [28], mRNAseq in rats was measured 2 weeks and 2 months after L5 spinal nerve ligation (SNL). The study included two technical replicates for each treatment condition and also included a control for each time point. DAFS returned q_{ c } values of 0.19 and 0.20 (control – 2 months); 0.20 and 0.16 (L5 SNL – 2 months); 0.23 and 0.24 (control – 2 weeks); 0.25 and 0.24 (L5 SNL – 2 weeks). Overall, DAFS showed small variability in estimating quantile cutoffs for technical replicated sequencing data. These results further demonstrate the impracticality of an arbitrary cutoff even within the same study. As evidenced by the data, it is reasonable to suspect that the separation of LE and HE genes is more homogeneous within replicates of the same experimental treatment. However, we may presume less agreement across biological replicates, treatments, and experiments.
Simulation
Performance evaluation of DAFS and Mclust on simulated data
Mean ± SE  

DAFS  Mclust  
Log2 Count  Sensitivity  0.9669 ± 0.0004  0.9529 ± 0.0015 
Specificity  0.8571 ± 0.0011  0.8683 ± 0.0021  
PPV  0.9235 ± 0.0005  0.9292 ± 0.0010  
NPV  0.9362 ± 0.0006  0.9173 ± 0.0022  
Log2 RPKM  Sensitivity  0.9847 ± 0.0002  0.9292 ± 0.0005 
Specificity  0.7528 ± 0.0025  0.9462 ± 0.0007  
PPV  0.9701 ± 0.0003  0.9929 ± 0.0001  
NPV  0.8623 ± 0.0016  0.6253 ± 0.0015 
Summary statistics of cutoff estimates from DAFS, Mclust, Sen _{ 0.85 } , Sen _{ 0.90 } , and Sen _{ 0.95 } on simulated data
Mean  P _{25}  Median  P _{97.5}  

DAFS  6.1961  5.8407  6.1725  6.5163  
Mclust  6.3460  5.9477  6.0390  7.3297  
Log2 Count  Sen_{0.85}  7.4362  7.3597  7.4375  7.5078 
Sen_{0.90}  7.0564  6.9687  7.0562  7.1524  
Sen_{0.95}  6.5013  6.3909  6.5025  6.6131  
DAFS  0.8736  1.2729  0.8023  0.4078  
Mclust  0.3630  0.1361  0.3710  0.6187  
Log2 RPKM  Sen_{0.85}  1.2391  1.1612  1.2359  1.3143 
Sen_{0.90}  0.7428  0.6551  0.7408  0.8341  
Sen_{0.95}  0.0575  0.0422  0.0598  0.1545 
For log2 transformed count data, which presented a clear bimodal distribution, the estimated cutoffs from DAFS and Mclust were similar. Average sensitivity for both methods exceeded 95%. On average, the theoretical cutoff estimated by Mclust was closer to the 95% sensitivity estimate. Although Mclust shows an improvement in specificity and DAFS does a better job of optimizing the sensitivity and NPV, the differences in performance measures were small. Analysis of the 95% confidence intervals for cutoff values demonstrated that both DAFS and Mclust were not significantly different from estimates obtained from achieving 95% sensitivity.
In evaluating cutoffs for log2 transformed RPKM data, DAFS returned an average sensitivity estimate that exceeded 98%. However, Mclust estimates fell between values obtained for achieving 90% and 95% sensitivity. Analysis of the 95% confidence intervals for cutoff values presented similar findings. The results indicated that DAFS captured a higher proportion of true high expression. As such, DAFS demonstrated superior performance in sensitivity and NPV, whereas PPV was comparable between Mclust and DAFS. The low specificity of DAFS is explained by the proportion of low expression. Since the low expressed region represented roughly 10% of the total sample size, a small number of misclassified observations will decrease specificity rapidly.
Based on the two different scenarios, the results suggest that DAFS is comparable to Mclust at maximizing sensitivity when the mixture components are well separated; however, DAFS is superior at providing a cutoff that maximizes sensitivity when the LE and HE components are not distinguishable. The simulation results clearly demonstrate how DAFS adapts to different data patterns. The pattern of log2 RPKM data is relatively closer to one normal distribution than the log2 raw counts. Therefore, DAFS adjusts and retains as many genes as possible when analyzing log2 RPKM data. Consequently, relative to the results obtained from simulating log2 raw counts, sensitivity and PPV are increased but specificity and NPV are decreased.
Discussion
This study was motivated by the need to provide a dataadaptive algorithm for separating RNAseq data with low and high expression, particularly when the distributions of expression abundance are not distinctly separated. In order to compute the optimal cutoff between low and high expression, our method relied heavily on the assumption that highexpressed data are normally distributed [16]. An advantage of using the KolmogorovSmirnov distance as a measure of agreement is that the method is easily modifiable if a different distributional assumption is required to characterize high expression. For any continuous distribution, the KS statistic tests the null hypothesis H_{0} : F(x) = F_{0}(x), for all x versus the alternative H_{1} : F(x) ≠ F_{0}(x), for some x. Thus, the reference distribution F_{0}(x), e.g. normal, lognormal, Student’s t, etc., is completely specifiable by the researcher. To our advantage, DAFS performed consistently well when the HE component departed from normality (as evidenced in log2 RPKM and FPKM data).
Careful consideration should be taken with regard to the number of observations used to model the distribution of KS statistics. The benefit of adding more data to characterize the underlying density should be balanced with the disadvantage of modeling added noise. If the predictor space is segmented too finely, then it is possible for multiple percentiles to map to the same KS statistic. Multiple manytoone mappings would make it difficult for the MARS algorithm to differentiate true variation from random noise. In the present study, incrementing p by step sizes of 0.01 or 0.005 provided a good balance between parsimonious and oversaturated input. Nevertheless, the increment selection is a variable in the proposed methodology that must be investigated by the researcher.
We were not remiss to consider the presence of more than two components of expression abundance. Since the Gaussian mixture model can well approximate the shape of any density [30], the number of Gaussian mixture components was estimated for multiple datasets. When Mclust was allowed to estimate the number of Gaussian mixture components, the algorithm often returned multiple mixture components. A similar finding was presented in the supplemental material presented of Hebenstreit et al. [16]. In their analysis of real datasets, values of AIC and BIC indicated that the data would be better fit by a k > 2 component Gaussian mixture. In our own analysis, many of the identified components were not separated enough to be heterogeneous populations. We employed a number of methods/packages to merge Gaussian mixture components (e.g. fpc [22], pdfcluster [31], REBMIX [32]) with no success. Nearly every method struggled by either distinguishing no separation or overly characterizing the distribution of abundance levels. The latter scenario was more pronounced in LE regions, where it seemed apparent that a number of mixture components were necessary to estimate nonGaussian density regions. It became clear that variability in the expression data made it difficult to ascertain whether homogeneous submixtures could be interpreted as a single component.
In our web search of the literature, the question of a cutoff for low expression in RNAseq was frequently asked. Some questions were motivated by a desire to quantify what is considered expressed in RNAseq. Others were motivated by a need to classify the level of measurement that could be trusted in assessing the significance of differential expression in lowexpressed regions, particularly since research shows that the precision of RNAseq data analysis improves as genes are more highly expressed. Whether transcripts with low expression are simply flagged or removed prior to testing via independent filtering, the work presented here provides a datadriven methodology for separating RNAseq expression into meaningful components. Providing an accurate separation of RNAseq data that is not based on ad hoc techniques or methodology that may be prone to modeldata misfit will facilitate interpreting the quality of sequencing reads and lead to improved power for differential detection of high expressed, reliable data.
Conclusions
In this study, we presented a method for classifying transcripts with low and high expression that promises widerange application. The robustness of DAFS was demonstrated by applying DAFS to a number of RNAseq data samples (real data examples and simulations) that varied by sequencing depth, species, normalization, and density shape.
Methods
Data
Several datasets were used to the test the performance of our methodology.
SEQC
Part III of the Microarray Quality Control was an FDAled, collaborate work to evaluate the technical performance of sequencing quality control (SEQC). In SEQC, reference RNA samples included the universal human reference RNA (UHHR) from Agilent/Stratagene (sample A) and the human brain total RNA (HBRR) from Life Technologies Corp. (sample B). External spikeins were added to both samples for purposes of testing validation and assessing accuracy. In order to minimize sources of technical variance in our study, we restricted our analysis to genelevel Illumina RNAseq data generated from one library preparation, processed on one lane, and obtained from one sequencing site. In total, our analysis included eight replicates of sample A and eight replicates of sample B.
ReCount datasets
To demonstrate the ability of DAFS to handle various empirical data structures, four additional datasets were downloaded from the ReCount webpage [33]. RNAseq from transcriptome analysis of yeast, humans, rats, and adult mice were obtained from Nagalakshmi et al. [25], Wang et al. [26], Hammer et al. [28], and Mortazavi et al. [27], respectively. All four studies were sequenced using Illumina/Solexa sequencing technology. Sequence reads were summarized into gene counts using Ensembl 61 annotation.
The dataadaptive flag method for RNAsequencing data (DAFS)
The algorithm for carrying out DAFS on a single sample is comprised of several steps. Since zero reads present no information, the first step consists of removing the zero counts from analysis. After removing the zero counts and transforming the data to log2 scale, the KolmogorovSmirnov statistic is applied to segmented quantiles of the empirical distribution. Finally, the multivariate adaptive regression splines function is fit to the distribution of KolmogorovSmirnov statistics and spline knots are used to determine the optimal cutoff for high expression.
The Kolmogorovsmirnov statistic
The KolmogorovSmirnov (KS) statistic is a quantitative measure of the maximum distance between the empirical distribution function of a sample and the cumulative distribution function of a reference distribution. In the present study, we rely on the KS statistic to determine a cutoff for data attributed to the HE region by assessing agreement between the empirical distribution of assumed HE genes and the reference distribution. RNAseq data after logarithmic transformation is approximately normally distributed. Thus, from a practical standpoint, we assumed a normal reference distribution for HE genes and used log2transformed data to compute the optimal cutoff.
where F_{ p } is the empirical distribution of the sample, Y_{ p }, and F is the cumulative distribution function of the assumed normal reference distribution. To best differentiate between LE and HE genes, the percentile cutoffs ranging from p_{0} to 0.50 in increments of either 0.01 or 0.005 (dependent upon the data structure) was considered. For each sample, p_{0} is the proportion of data aggregated at the minimum expression level e.g. the percentage of 0 values when analyzing log2transformed raw RNAseq. The decision to exclude minimum expression levels was motivated by a desire to eliminate a proportion of the noise produced by extreme low counts. The stopping value is set at 0.50 to allow for at least half the data to be used in analysis. This seems fitting to adequately describe the normal mixture component of HE genes. For each i^{ th } observation of {p}, a corresponding KS statistic is computed, denoted by D_{i,}, i = 1, 2, …, length of {p}.
Cutoff classification
The distribution of KS statistics, {D}, across percentiles of the sample is used to classify the differentiation between data with low and high expression. Ideally, the data should be separated where {D} achieves a local minimum. To estimate the slope change points in Kolmogorov distances, multivariate adaptive regression splines (MARS) [34] was used to model f : p → D.
1) The MARS algorithm
where M is the number of spline basis functions and B_{ m } and α_{ m } are the m^{ th } spline function and its regression coefficient, respectively. Points that signify a break in the sample space and describe distinct linear splines are of key importance. These knots pinpoint a change in the model function, which indicate critical points of marked changed in KS distances.
2) Optimal quantile cutoff, q_{ c }
where N is the number of observations and C(M) is the cost complexity function of M basis functions [38]. This criterion also provides an optimal balance between bias and variance.
The final MARS model includes the minimum number of necessary knots to capture the true model. Thus, the selection of knots is used to identify critical points along the range of KS statistics. The optimal quantile cutoff, q_{ c }, is determined by the quantile p such that f has a local minimum. In the presence of multiple local minima, q_{ c } = min{q_{ c }} i.e. the leftmost point indicating a decreasingtoincreasing change in slope. The MARS algorithm was performed using the ‘earth’ package in R.
R script to generate quantile cutoffs for a sample is provided in Additional file 1: Supplementary Materials.
Theoretical cutoff of Twocomponent normal mixture model
where π_{ k } > 0 ∀ k = 1, 2 is the mixing weight of the k^{ th } Gaussian distribution, $\sum _{\mathit{k}=1}^{2}{\pi}_{k}=1$, and ${\phi}_{k}\left(x{\mu}_{k},{\sigma}_{k}^{2}\right)$
where v = σ_{2}/σ_{1} when the variance terms are unequal. x_{evar} or x_{uvar} were used to determine the theoretical cutoff between expression abundance classes when twocomponent mixture modeling was used to fit the distribution of LE and HE regions.
Simulated data
Two simulation studies based on distributions of log2 transformed raw counts and log2 transformed RPKM data were proposed to demonstrate the ability of DAFS to differentiate LE/HE regions. For the first scenario, Mclust was used to fit the distribution of log2 raw counts of sample A from the SEQC study with no restrictions placed on the number of mixture components. The proportion, mean, and variance parameter estimates of the fitted 5 component mixture model were ${\widehat{\pi}}_{1}=\left(0.08,0.28,0.17,0.26,0.21\right)$, ${\widehat{\mu}}_{1}=\left(1.33,4.68,7.86,9.50,9.32\right)$, and ${\widehat{\sigma}}_{1}^{2}=\left(0.62,2.83,1.17,0.90,3.53\right)$. RPKM data from Wang et al. [39] and made available through GEO Accession viewer was used to derive the second simulation. Parameters estimates of the proportion, mean, and variance of log2 RPKM data were ${\widehat{\pi}}_{2}=\left(0.11,0.24,0.25,0.31,0.09\right)$, ${\widehat{\mu}}_{2}=\left(\u20121.80,1.30,3.40,4.20,6.60\right)$, and ${\widehat{\sigma}}_{2}^{2}=\left(1.80,1.80,1.80,1.80,1.80\right)$.
In the simulation of log2 raw counts (scenario 1), the first two components were treated as LE. In scenario 2, only the first component of log2 RPKM data was treated as LE. To evaluate DAFS’s performance with competing methodology, results of DAFS were compared to cutoffs determined by fixing values of sensitivity and by cutoffs determined by the point of intersection between the two theoretical distributions of a fitted twocomponent Mclust model. Several methods for selecting an optimal cutoff have been proposed based on specific underlying objectives e.g. costbenefit analysis or diagnostic test accuracy measures (sensitivity/specificity, predictive values, and diagnostic likelihood ratios). In this study, since the objective is to retain as many of the HE genes as possible, the decision criterion is based on setting sensitivity. Empirical cutoffs were computed for sensitivity values set to 0.85 (Sen_{0.85}), 0.90 (Sen_{0.90}), and 0.95 (Sen_{0.95}). The total number of genes was fixed at 5,000 and the simulation was repeated 500 times.
Availability of supporting data
R script to generate quantile cutoffs for a sample is provided in Additional file 1: Supplementary Materials.
Abbreviations
 DAFS:

DataAdaptive Flag Method for RNASequencing Data
 NGS:

Nextgeneration sequencing
 RNAseq:

whole transcriptome sequencing
 HE:

High expressed
 LE:

Low expressed
 RPKM:

Log2transformed reads per kilobase per million
 FPKM:

Fragments per kilobase per exon model per million mapped reads
 EM:

Expectationmaximization
 SNL:

Spinal nerve ligation
 PPV:

Positive predictive value
 NPV:

Negative predictive value
 KS:

KolmogorovSmirnov
 MARS:

Multivariate adaptive regression splines
 GCV:

Generalized crossvalidation.
Declarations
Acknowledgements
The views presented in this paper are those of the authors and do not necessarily represent those of the U.S. Food and Drug Administration.
Authors’ Affiliations
References
 Miller R, Wu G, Deshpande RR, Vieler A, Gärtner K, Li X, Moellering ER, Zäuner S, Cornish AJ, Liu B, Bullard B, Sears BB, Kuo MH, Hegg EL, ShacharHill Y, Shiu SH, Benning C: Changes in transcript abundance in Chlamydomonas reinhardtii following nitrogen deprivation predict diversion of metabolism. Plant Physiol. 2010, 154: 17371752. 10.1104/pp.110.165159.View ArticlePubMed CentralPubMedGoogle Scholar
 Gao L, Fang Z, Zhang K, Zhi D, Cui X: Length bias correction for RNAseq data in gene set analyses. Bioinformatics. 2010, 27 (5): 662669.View ArticleGoogle Scholar
 Chen Z, Liu J, Ng HKT, Nadarajah S, Kaufman HL, Yang JY, Deng Y: Statistical methods on detecting differentially expressed genes for RNAseq data. BMC Syst Biol. 2011, 5 (Suppl 3): S110.1186/175205095S3S1.View ArticleGoogle Scholar
 Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11 (10): R10610.1186/gb20101110r106.View ArticlePubMed CentralPubMedGoogle Scholar
 Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ, Taylor JM: Efficient experimental design and analysis strategies for the detection of differential expression using RNASequencing. BMC Genomics. 2012, 17 (13): 484View ArticleGoogle Scholar
 Cherbas L, Willingham A, Zhang D, Yang L, Zou Y, Eads BD, Carlson JW, Landolin JM, Kapranov P, Dumais J, Samsonova A, Choi JH, Roberts J, Davis CA, Tang H, van Baren MJ, Ghosh S, Dobin A, Bell K, Lin W, Langton L, Duff MO, Tenney AE, Zaleski C, Brent MR, Hoskins RA, Kaufman TC, Andrews J, Graveley BR, Perrimon N: The transcriptional diversity of 25 Drosophila cell lines. Genome Res. 2011, 21: 301314. 10.1101/gr.112961.110.View ArticlePubMed CentralPubMedGoogle Scholar
 Risso D, Schwartz K, Sherlock G, Dudoit S: GCContent normalization for RNAseq data. BMC Bioinforma. 2011, 12: 48010.1186/1471210512480.View ArticleGoogle Scholar
 Robinson M, McCarthy D, Chen Y, Smyth G: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26 (1): 139140. 10.1093/bioinformatics/btp616.View ArticlePubMed CentralPubMedGoogle Scholar
 Hastie ND, Bishop JO: Three abundance classes of messenger RNA in mouse tissues. Cell. 1976, 9: 761774. 10.1016/00928674(76)901392.View ArticlePubMedGoogle Scholar
 Hoyle DC, Rattray M, Jupp R, Brass A: Making sense of microarray data distributions. Bioinformatics. 2002, 18: 576584. 10.1093/bioinformatics/18.4.576.View ArticlePubMedGoogle Scholar
 Chang CW, Zou W, Chen JJ: A new method for gene identification in comparative genomic analysis. J Data Sci. 2008, 4: 415427.Google Scholar
 Ohtaki M, Otani K, Hiyama K, Kamei N, Satoh K, Hiyama E: A robust method for estimating gene expression states using Affymetrix microarray probe level data. BMC Bioinforma. 2010, 11: 18310.1186/1471210511183.View ArticleGoogle Scholar
 Hebenstreit D, Teichmann S: Analysis and simulation of gene expression profiles in pure and mixed cell populations. Phys Biol. 2011, 8 (3): 03501310.1088/14783975/8/3/035013.View ArticlePubMedGoogle Scholar
 Lu C, King RD: An investigation into the population abundance distribution of mRNAs, proteins, and metabolites in biological systems. Bioinformatics. 2009, 25: 20202027. 10.1093/bioinformatics/btp360.View ArticlePubMedGoogle Scholar
 Ramskold D, Wang ET, Burge CB, Sandberg R: An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput Biol. 2009, 5: e100059810.1371/journal.pcbi.1000598.View ArticlePubMed CentralPubMedGoogle Scholar
 Hebenstreit D, Fang M, Gu M, Charoensawan V, van Oudenaarden A, Teichmann S: RNA sequencing reveals two major classes of gene expression levels in metazoan cells. Mol Syst Biol. 2011, 7: 497View ArticlePubMed CentralPubMedGoogle Scholar
 Casella G, Berger RL: Statistical Inference. 2001, Pacific Grove, CA: Duxbury Press, 2Google Scholar
 Fraley C, Raftery AE: MCLUST Version 3 for R: Normal Mixture Modeling and ModelBased Clustering. 2006, Department of Statistics, University of WashingtonGoogle Scholar
 Schwarz G: Estimating the dimension of a model. Ann Stat. 1978, 6: 461464. 10.1214/aos/1176344136.View ArticleGoogle Scholar
 Biernacki C, Celeux G, Govaert G: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell. 2000, 22: 719725. 10.1109/34.865189.View ArticleGoogle Scholar
 Ray S, Lindsay BG: The topography of multivariate normal mixtures. Ann Stat. 2005, 33: 20422065. 10.1214/009053605000000417.View ArticleGoogle Scholar
 Hennig C: Methods for merging Gaussian mixture components. ADAC. 2010, 4 (1): 334. 10.1007/s1163401000583.View ArticleGoogle Scholar
 Wu AR, Neff NF, Kalisky T, Dalerba P, Treulein B, Rothenberg ME, Mburu FM, Mantalas GL, Sim S, Clarke MF, Quake SR: Quantitative assessment of singlecell RNAsequencing methods. Nat Methods. 2013, 11: 4146. 10.1038/nmeth.2694.View ArticlePubMed CentralPubMedGoogle Scholar
 Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28 (5): 511515. 10.1038/nbt.1621.View ArticlePubMed CentralPubMedGoogle Scholar
 Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional landscape of the yeast whole genome defined by RNA sequencing. Science. 2008, 320: 13441349. 10.1126/science.1158441.View ArticlePubMed CentralPubMedGoogle Scholar
 Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456 (7221): 470476. 10.1038/nature07509.View ArticlePubMed CentralPubMedGoogle Scholar
 Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Methods. 2008, 5: 621628. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
 Hammer P, Banck MS, Amberg R, Wang C, Petznick G, Luo S, Khrebtukova I, Schroth GP, Beyerlein P, Beutler AS: mRNAseq with agnostic splice site discovery for nervous system transcriptomics tested in chronic pain. Genome Res. 2010, 20 (6): 84760. 10.1101/gr.101204.109.View ArticlePubMed CentralPubMedGoogle Scholar
 Toung JM, Morley M, Li MY, Cheung VG: RNAsequence analysis of human Bcells. Genome Res. 2011, 21 (6): 991998. 10.1101/gr.116335.110.View ArticlePubMed CentralPubMedGoogle Scholar
 Fraley C, Raftery AE: Modelbased clustering, discriminant analysis, and density estimation. J Am Stat Assoc. 2002, 97: 611631. 10.1198/016214502760047131.View ArticleGoogle Scholar
 Menardi G, Azzalini A: An advancement in clustering via nonparametric density estimation. Stat Comput. 2013, doi:10.1007/s112220139400x. URL http://link.springer.com/10.1007/s112220139400x,Google Scholar
 Nagode M, Fajdiga M: The REBMIX algorithm for the univariate finite mixture estimation. Commun Stat Theory Methods. 2011, 40 (5): 876892. 10.1080/03610920903480890.View ArticleGoogle Scholar
 Frazee A, Langmead B, Leek J: Recount: a multiexperiment resource of analysisready RNAseq gene count datasets. BMC Bioinforma. 2011, 12: 44910.1186/1471210512449.View ArticleGoogle Scholar
 Friedman JH: Multivariate adaptive regression splines. Ann Stat. 1991, 19: 167. 10.1214/aos/1176347963.View ArticleGoogle Scholar
 Morgan JN, Sonquist JA: Problems in the analysis of survey data, and a proposal. J Am Stat Assoc. 1963, 58: 415435. 10.1080/01621459.1963.10500855.View ArticleGoogle Scholar
 Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and regression trees. 1984, Belmont, California: Wadsworth, Inc. PressGoogle Scholar
 Craven P, Wahba G: Smoothing noisy data with spline functions. Numer Math. 1979, 31: 377403.View ArticleGoogle Scholar
 Friedman JH, Silverman BW: Flexible parsimonious smoothing and additive modeling. Technometrics. 1989, 31: 339. 10.1080/00401706.1989.10488470.View ArticleGoogle Scholar
 Wang ET, Cody NA, Jog S, Biancolella M, Wang TT, Treacy DJ, Luo S, Schroth GP, Housman DE, Reddy S, Lécuyer E, Burge CB: Transcriptomewide regulation of PremRNA splicing and mRNA localization by muscleblind proteins. Cell. 2012, 150: 710724. 10.1016/j.cell.2012.06.041.View ArticlePubMed CentralPubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.