 Research article
 Open Access
 Published:
Evaluation of methods for differential expression analysis on multigroup RNAseq count data
BMC Bioinformatics volume 16, Article number: 360 (2015)
Abstract
Background
RNAseq is a powerful tool for measuring transcriptomes, especially for identifying differentially expressed genes or transcripts (DEGs) between sample groups. A number of methods have been developed for this task, and several evaluation studies have also been reported. However, those evaluations so far have been restricted to twogroup comparisons. Accumulations of comparative studies for multigroup data are also desired.
Methods
We compare 12 pipelines available in nine R packages for detecting differential expressions (DE) from multigroup RNAseq count data, focusing on threegroup data with or without replicates. We evaluate those pipelines on the basis of both simulation data and real count data.
Results
As a result, the pipelines in the TCC package performed comparably to or better than other pipelines under various simulation scenarios. TCC implements a multistep normalization strategy (called DEGES) that internally uses functions provided by other representative packages (edgeR, DESeq2, and so on). We found considerably different numbers of identified DEGs (18.5 ~ 45.7 % of all genes) among the pipelines for the same real dataset but similar distributions of the classified expression patterns. We also found that DE results can roughly be estimated by the hierarchical dendrogram of sample clustering for the raw count data.
Conclusion
We confirmed the DEGESbased pipelines implemented in TCC performed well in a threegroup comparison as well as a twogroup comparison. We recommend using the DEGESbased pipeline that internally uses edgeR (here called the EEEE pipeline) for count data with replicates (especially for small sample sizes). For data without replicates, the DEGESbased pipeline with DESeq2 (called SSSS) can be recommended.
Background
RNA sequencing (RNAseq) is a basic tool for measuring expressions of multiple genomic loci [1–5]. One important goal for RNAseq is to identify of differentially expressed genes (DEGs) under different conditions. Researchers typically start the differential expression (DE) analysis with a socalled “count matrix”, where each row indicates the gene (or exons or genomic loci), each column indicates the sample, and each cell indicates the number of reads mapped to the gene in the sample [5–9]. There are roughly four levels of resolution in current DE analysis: gene, transcript, exon, and baselevel. Examples of the DE methods for individual levels are (i) edgeR [10], DESeq [11], and TCC [12] for genelevel; (ii) Cuffdiff2 [13], IUTA [14], and SplicingCompass [15] for transcriptlevel; (iii) DEXSeq [16] and NPEBseq [17] for exonlevel; and (iv) DER Finder [18] for baselevel. Many methods can perform DE analysis for multiple levels (e.g., Cuffdiff2 can perform both gene and transcriptlevel analysis) and are provided as R/Bioconductor packages [19, 20].
Read counts across technical replicates derived from a single source fit to a Poisson distribution [3, 21]. For data on biological replicates (BRs) derived from different individuals, the genelevel counts well fit to an overdispersed Poisson distribution such as a negativebinomial (NB) model [10, 11, 22], betabinomial (BB) model [5, 23], PoissonTweedie model [6], and so on. In particular, the PoissonTweedie model well captures the biological variation (especially for zeroinflation and heavy tail behavior, for details see [6]) when many BRs are available. As an increase in sample size (i.e., the number of replicate samples) precedes an increase in sequencing depth (i.e., the number of sequenced reads) [24–26], a more complex model such as PoissonTweedie may be the first choice for count data with many BRs. However, as many replicates are still difficult to take due to sequencing cost and the small amount of the target RNA sample, RNAseq data with few BRs have mainly been stored. Two R packages based on the NB model (edgeR and DESeq) have been widely used as a common choice for DE analysis of RNAseq data with few BRs [9–11, 27].
In general, the DE analysis consists of two steps (data normalization X and DEG identification Y), and each R package has its own methods for the XY pipeline [12]. The aim of normalization is to make the normalized counts for nonDEGs similar between all samples [28]. The edgeR and DESeq manipulate the raw count data as input. They first calculate normalization factors (or size factors) for individual samples as X, then construct the model (i.e., estimate the parameters on the model in which the calculated normalization factors are used to rescale the raw counts), and calculate pvalues (i.e., perform the statistical test using the model) as Y. Previous studies have demonstrated that X has more impact than Y on the ranked gene list [8, 29, 30] and that two normalization methods implemented in the two packages (edgeR and DESeq) generally give satisfactory results [31]. While the normalization method provided in edgeR is termed TMM (trimmed mean of Mvalues) [32], we here call the default pipelines XY for edgeR and DESeq “edgeRedgeR (or EE)” and “DESeqDESeq (or DD)”, respectively.
We previously proposed a multistep normalization procedure called TbT [8]. TbT consists of three steps: X using TMM (step 1), Y using an empirical Bayesian method implemented in the baySeq package [22] (step 2), and X using TMM after elimination of the estimated DEGs (step 3) comprising the TMMbaySeqTMM normalization pipeline. The key concept is to alleviate the negative effect of potential DEGs before calculating the normalization factors in step 3. As mentioned previously [8], the DEG elimination strategy (called DEGES) can be repeated until the calculated normalization factors converge. The iterative TbT can be described as a TMM(baySeqTMM)_{ n } procedure. Accordingly, a generalized pipeline with the multistep normalization can be described as X(YX)_{ n }Y in which the X(YX)_{ n } with n > = 2 corresponds to the iterative DEGESbased normalization.
Our TCC package [12] implements the proposed pipeline X(YX)_{ n }Y. Recommendations are made from an extensive simulation analysis: (i) edgeR(edgeRedgeR)_{3}edgeR on twogroup RNAseq data with few replicates and (ii) DESeq(DESeqDESeq)_{3}DESeq on twogroup data without replicates [12]. However, similar to many other studies [24–28, 33], the performance evaluations were limited to a twogroup comparison. While many R packages as well as TCC can perform DE analysis on more complex experimental designs [5, 9, 22, 34–37], there have been few evaluation studies on RNAseq data with those designs, e.g., multigroup data. The current study aims to evaluate 12 pipelines available in nine R packages when analyzing multigroup RNAseq count data. Specifically, our primary interest is to investigate the effectiveness of the DEGESbased pipeline in TCC under such more complex designs. We report pipelines suitable for multigroup comparison.
Results and discussion
To investigate the performance of DE pipelines for a multigroup comparison, a total of 12 pipelines available in the nine packages were mainly evaluated in this study: TCC (ver. 1.7.15) [12], edgeR (ver. 3.8.5) [10], DESeq (ver. 1.18.0) [11], DESeq2 (ver. 1.6.3) [35], voom [38] in limma (ver. 3.22.1) [39], SAMseq [40] in samr (ver. 2.0), PoissonSeq (ver. 1.1.2) [41], baySeq (ver. 2.0.50) [22], and EBSeq (ver. 1.6.0) [42]. Note that TCC can perform several combinations for the DE pipeline X(YX)_{ n }Y with n = 3 as recommended [12]. We sometimes refer to this DEGESbased pipeline as XYXY with the fixed number of n for short. We basically confine individual methods (X and Y) in each pipeline to functions provided by the same packages (i.e., edgeR or DESeq2) for simplicity. For example, the edgeRrelated pipeline is “edgeR(edgeRedgeR)_{3}edgeR”, where X = TMM and Y = the DEG identification method, implemented in edgeR. Although we previously termed this pipeline “iDEGES/edgeRedgeR” [12], here we abbreviate it to EEEE for convenience. Similarly, the “DESeq(DESeqDESeq)_{3}DESeq” pipeline can be shortened to DDDD. This is because (1) users can select, for example, different DEG identification methods Y for steps 2 and 4 and (2) we will discuss some possible combinations such as DEDS for the “DESeq(edgeRDESeq)_{3}DESeq2” pipeline. In this sense, the DEGESbased pipeline can also be denoted as X(YX)_{ n }Z or XYXZ.
Following our previous studies [8, 12], we here demonstrate the performance of these pipelines mainly on the basis of same evaluation metric and simulation framework. We use the area under the ROC curve (AUC) as a main measure of comparison, which evaluates both sensitivity and specificity of the pipelines simultaneously [28, 43–48]. To perform the multigroup comparison as simply as possible, we focus here on the threegroup data (i.e., G1 vs. G2 vs. G3) with equal numbers of replicates (i.e., 1, 3, 6, and 9 replicates per group; Nrep = 1, 3, 6, and 9). The gene ranking was performed on the basis of an ANOVAlike pvalue or the derivatives, where a low pvalue for a gene indicates a high degree of DE in at least one of the groups compared. The simulation conditions are as follows: the total number of genes is 10,000 (Ngene = 10000), 5 or 25 % of the genes are DEGs (P_{DEG} = 5 or 25 %), the levels of DE are fourfold in individual groups, and the proportions of DEGs upregulated in individual groups (P_{G1}, P_{G2}, P_{G3}) are (1/3, 1/3, 1/3), (0.5, 0.3, 0.2), (0.5, 0.4, 0.1), (0.6, 0.2, 0.2), (0.6, 0.3, 0.1), (0.7, 0.2, 0.1), and (0.8, 0.1, 0.1).
Simulation data with replicates
We first assessed the performances of a total of 12 pipelines: three pipelines in TCC (i.e., EEEE, DDDD, and SSSS), edgeR, edgeR_robust, DESeq, DESeq2, voom, SAMseq, PoissonSeq, baySeq, and EBSeq. Table 1 lists the average AUC values of 100 trials between the ranked gene lists and the truth for various simulation conditions with Nrep = 3. Overall, the AUC values for the EEEE pipeline were the highest and similar across the seven different proportions of DEGs upregulated in individual groups (P_{G1}, P_{G2}, P_{G3}). The edgeR (i.e., the pipeline EE) performed the second best overall. EEEE and edgeR performed comparably under the unbiased proportion of DEGs in individual groups (1/3, 1/3, 1/3). This is quite reasonable because the EEEE can be viewed as an iterative edgeR pipeline and their theoretical performances are the same under the unbiased condition [12]. Similar to the relationship between EEEE and edgeR, the DDDD (or SSSS) can be viewed as an iterative DESeq (or DESeq2) pipeline. As expected, DDDD (or SSSS) consistently outperformed DESeq (or DESeq2) in all simulation conditions except for the unbiased situations.
We observed similar AUC values across the seven different proportions of DEGs for individual pipelines at P_{DEG} = 5 % (Table 1a). When a higher amount of DEGs was introduced (i.e., P_{DEG} = 25 %; Table 1b), the performances generally worsened as the degrees of biases increased (i.e., from left to right in Table 1). For example, the AUC values for voom under the unbiased (1/3, 1/3, 1/3) and most biased (0.8, 0.1, 0.1) proportions decreased from 87.08 to 84.56 %. We observed relatively poor performances for EBSeq and voom. This is consistent with a previous simulation study on twogroup data with a low number of BRs (Nrep = 2) [28]. A possible explanation of these results is that EBSeq was originally developed to detect DE isoforms (not DEGs) [41] and the large body of methodology in voom is for microarray data (not RNAseq count data) [38]. Our current evaluation focuses on the genelevel RNAseq count data and does not address the problem of such a detailed resolution of DE analysis. SAMseq and PoissonSeq performed stably across different proportions. This is probably because both methods are nonparametric ones that do not assume any particular distribution for the data and are generally robust against such biased situations. These methods, however, performed poorly overall. Additional file 1 is the R code for obtaining these results.
It should be noted that the relative performances for EBSeq tend to improve as the number of replicates per group increases (Nrep = 6 and 9; see Sheet 2 and 3 in Additional file 2). In particular, EBSeq consistently outperformed the others when Nrep = 9 and P_{DEG} = 5 %, suggesting that the DEGESbased pipeline based on EBSeq could produce a more accurate ranked gene list. However, as previously discussed for the DEGESbased pipeline based on baySeq [12], Bayesian methods (EBSeq and baySeq) generally require huge computation time (see Sheet 4 in Additional file 2). While the computation can be parallelized, the implementation of DEGES for EBSeq might be unfeasible.
Recall that the level of DE for DEGs was fourfold in this simulation framework and the shape of the distribution for introduced DEGs is the same as that of nonDEGs [8]. Although the simulation framework has been used [8; 12; 32], this may weaken the validity of the current simulation framework. To mitigate this concern, we performed simulations with different distributions of DE when introducing DEGs. In this simulation, the foldchanges for DEGs were randomly sampled from “1.2 + a gamma distribution with shape = 2.0 and scale = 0.5”, giving mean foldchange of 2.2 (=1.2 + 2.0 × 0.5). Similar to the results with a fixed level of DE (fourfold for all DEGs), EEEE performed the best overall (see Sheet 5 in Additional file 2). While a more extensive study with other simulation settings should still be performed, this trend suggests that different distributions of DE does not have much impact on the DE results. The functionality for generating the different distributions of DE in the “simulateReadCounts” function will be available in TCC ver. 1.9.3 or higher.
As mentioned above, TCC can perform various combinations for the DEGESbased DE pipeline X(YX)_{ n }Z or XYXZ, where Y and Z are the DEG identification methods and X is the normalization method. We investigated the effect of the individual methods (used as X, Y, and Z) by analyzing a total of 12 pipelines (eight DEGESbased pipelines and four nonDEGESbased pipelines). Table 2 shows the average AUC values for these pipelines. Note that the values in Tables 1 and 2 are comparable and that those for the four pipelines (EEEE, DDDD, EE, and DD; colored gray in Table 2) are provided in both tables. It is clear that choosing Z has more impact on the gene ranking accuracy than choosing Y and that using the DEG identification method provided in edgeR in both Y and Z can be recommended. In comparison with the two normalization methods in X in the eight DEGESbased pipelines, the method in DESeq (denoted as “D”) gave slightly higher AUC values than the TMM normalization method in edgeR (denoted as “E”). However, the superiority of DESeq in X was not observed when the four nonDEGESbased pipelines XZ were compared, where edgeR (i.e., the TMM normalization method) outperformed DESeq. In any case, the different choices in X have less impact than those in Y and Z. Additional file 3 is the R code for obtaining the results shown in Table 2.
Surprisingly, the best pipeline was DEDE, followed by EEEE and DDDE (Table 2b). The DEDE and DDDE pipelines consist of methods provided by different packages. For example, DEDE, the “DESeq(edgeRDESeq)_{3}edgeR” pipeline, consists of the normalization method in DESeq as X and the DEG identification method in edgeR as Y and Z. These results suggest that in some cases, the suitable choices of the best pipeline may slightly improve DE results. We should note that the current simulation data are generated by the “simulateReadCounts” function in TCC. This is simply because, to the best of our knowledge, TCC only provides the R function that can generate multigroup simulation count data. TCC simulates all counts using NB distributions, implying that this simulation framework advantageously acts on the classical R packages such as edgeR and DESeq. This is probably the main reason for poor performances of two recently published packages (edgeR_robust and DESeq2; see Table 1); those are the advanced versions for edgeR and DESeq, respectively, and are robust against count outliers such as abnormally high counts (for details, see [35, 36]). To the best of our knowledge, only one R package, compcodeR [48], can generate simulation count data with outliers, but it has been restricted to only twogroup comparisons so far. Extending the simulation framework of compcodeR to multigroup data may allow different pipelines to be compared more equally.
Simulation data without replicates
Unlike (multigroup) count data with replicates, there are few packages that can manipulate count data without replicates. These include TCC, edgeR, DESeq, DESeq2, and so on. We here evaluated a total of 20 pipelines (13 DEGESbased pipelines and seven nonDEGESbased pipelines). Table 3 shows the results for simulation data without replicates under P_{DEG} = 25 %. When three original nonDEGESbased pipelines XZ are compared, DESeq2 (i.e., SS) performed the best, followed by DESeq (DD) and edgeR (EE). This is completely different from the results in Table 2. When 13 DEGESbased pipelines XYXZ are compared, choosing Z for the DEGESbased pipeline has more impact on the gene ranking accuracy than choosing Y (similar to Table 2) and that using the DEG identification method provided in DESeq2 (i.e., S) can be recommended as Z. This result may possibly be explained by the removal of outliers that do not fit the distributional assumptions of the model [40]: DESeq2 [35] implements a functionality for detecting and removing outliers on the basis of Cook’s distance [49]. In the situation of count data without replicates, DEGs tend to be flagged as outliers: Cook’s distances are generally greater for DEGs than for nonDEGs. The negative effect of 25 % DEGs introduced in this simulation framework could therefore be weakened.
In addition to the model construction only with nonoutliers in the Z step, the DEGESbased normalization in the XYX step also slightly but reliably improves ranked gene lists. For example, the AUC values higher for SSSS than SS (i.e., DESeq2) are by virtue of the multistep normalization strategy originally proposed by Kadota et al. [8]. However, as also discussed in the TCC paper [12], DESeq and DESeq2 generally estimate false discovery rates (FDR) more conservatively than others [9]. Indeed, we observed that the numbers of potential DEGs satisfying 10 % FDR in step 2 (i.e., the Y step) in the SSSS pipeline were nearly zero (i.e., the estimated P_{DEG} values were 0 %) in all simulations, although the actual P_{DEG} values were 25 %. This is reasonable because any attempt to work without replicates will lead to conclusions of very limited reliability [12]. TCC employs a predefined floor P_{DEG} value (=5 %) to obtain certain differences between the DEGESbased approach SSSS and nonDEGESbased approach SS: at least 5 % of the topranked genes are not used when the normalization factors are calculated at step 3 in the XYX pipeline. As an estimated P_{DEG} value of x% tends to work better when simulation data with the same P_{DEG} value are analyzed, accurate estimation is the next important task. Additional file 4 is the R code for obtaining the results shown in Table 3.
Real data with replicates
In addition to the simulation study, we also analyzed a real RNAseq count dataset sequenced from the three species (i.e., the threegroup data): humans (HS), chimpanzees (PT), and rhesus macaques (RM) [50]. Briefly, Blekhman et al. studied expression levels of liver samples from three males (M1, M2, and M3) and three females (F1, F2, and F3) from each species, giving a total of six different individuals (i.e., six biological replicates) for each species. Since they performed duplicate experiments for each individual (i.e., two technical replicates), the publicly available raw count matrix consists of 20,689 genes × 36 samples (=3 species × 2 sexes × 3 biological replicates × 2 technical replicates). To correctly estimate the biological variation and make the assumed structure of input data, we summed and collapsed the count data of technical replicates, giving a reduced number of columns in the count matrix (i.e., 18 samples; three species × 2 sexes × 3 biological replicates). We here compared a total of 12 pipelines in light of the overall similarity of ranked gene lists, the number of shared DEGs satisfying an FDR threshold, and so on. To compare these pipelines as simply as possible, we regarded this dataset as a singlefactor experimental design of three species where each has six biological replicates (i.e., HS_rep16 vs. PT_rep16 vs. RM_rep16). The full R code for analyzing this dataset is provided in Additional file 5. The results of sample clustering applied to these raw and collapsed count datasets are given in Additional file 6 (as part of the results in Additional file 5).
Figure 1 shows the dendrogram of averagelinkage clustering for the 12 ranked gene lists. Seven pipelines located in the center (from SSSS to DD) show similar ranked gene lists. This is mainly because the seven pipelines from the four packages (TCC, edgeR, DESeq, and DESeq2) commonly employ a generalized linear model (GLM) framework. Indeed, the minimum value of Spearman’s correlation coefficients (r) among the seven pipelines was 0.9240. It is also noteworthy that ranked gene lists produced from TCC’s iterative strategies and the corresponding original noniterative strategies are particularly similar. For example, the r between EEEE from TCC and EE from edgeR was 0.9999, implying that these data are not extremely biased in light of the proportions of DEGs up and/or downregulated in individual groups (P_{G1}, P_{G2}, P_{G3}). That is, the proportions of DEGs in these data (P_{G1}, P_{G2}, P_{G3}) are rather closer to (1/3, 1/3, 1/3) than, for example, (0.8, 0.1, 0.1) or (0.0, 0.9, 0.1).
Note that the dendrogram shown in Fig. 1 does not necessarily indicate the superiority of the seven GLMbased pipelines over the others such as EBSeq and baySeq. For example, EBSeq employs an empirical Bayesian framework that returns the posterior probabilities for each of the five possible expression patterns (or models) to each gene. We here used the posterior probability obtained from the “nonDEG” pattern as a surrogate estimate for the adjusted pvalues and ranked genes in ascending order of the values. This is probably the main reason for EBSeq having lower similarity than the others. We also confirmed this trend with some simulation data. As shown in Sheet 2 in Additional file 2, EBSeq had the highest average AUC values in the simulation condition: P_{DEG} = 5 %, (0.5, 0.4, 0.1) for (P_{G1}, P_{G2}, P_{G3}), and Nrep = 9. A typical dendrogram of 12 ranked gene lists obtained from this simulation condition is given in Additional file 7. In this trial, while EBSeq and baySeq formed one of the two major clusters, those AUC values were not the top two: the ranks for EBSeq and baySeq were the 1st and 6th, respectively. These results indicate that the low similarities of ranked gene lists between Bayesian pipelines (such as EBSeq and baySeq) and the GLMbased pipelines do not matter.
We compared the numbers of DEGs obtained from individual pipelines and the overlaps between all pairs of pipelines (see Additional file 8). We found that different pipelines could produce considerably different numbers of DEGs. Indeed, the numbers widely ranged from 3832 (18.5 % of all genes; DESeq) to 9453 (45.7 %; SAMseq). This trend is consistent with that in a previous comparative study [28]. As expected from Fig. 1, we observed similar numbers of DEGs between the three DEGESbased pipelines (EEEE, DDDD, and SSSS) and the corresponding nonDEGESbased ones (EE, DD, and SS). The Jaccard coefficients, defined as “intersection/union” for two sets of DEGs, for the three pairs (EEEE vs. EE, DDDD vs. DD, and SSSS vs. SS) were topranked among a total of 66 possible pairs. For example, both EEEE in TCC and EE in edgeR reported the same numbers of DEGs (=7247). Of these, 7208 DEGs (99.46 %) were common, and the Jaccard coefficient was 7208 / 7286 = 0.9893 (see Additional file 8). The overall number of common genes across the 12 sets of DEGs was 2376. Since individual sets were identified under the 5 % FDR threshold, 95 % of the 2376 common DEGs can statistically be regarded as confident.
We next classified the expression patterns of the DEGs obtained from the 12 pipelines (Table 4). We here assigned individual DEGs to one of the ten possible patterns defined in baySeq [22]; this package returns one of these patterns to each gene. The background information for this data is shown in the “all_genes” row in Table 4. The “common” row indicates the percentages of individual expression patterns for the 2376 common DEGs. The remaining rows (from EEEE to EBSeq) show the distributions for each of the pipelines. It is reasonable that no DEGs identified by individual pipelines are assigned as a flat expression pattern (i.e., G1 = G2 = G3) for the HS (G1) vs. PT (G2) vs. RM (G3) comparison. We found that most DEGs were assigned preferably to one of four patterns (G1 > G2 > G3, G2 > G1 > G3, G3 > G1 > G2, and G3 > G2 > G1) and unpreferably to one of two patterns (G1 > G3 > G2 and G2 > G3 > G1). That is, up (or down) regulation in G1 for DEGs tends to coincide with G2 more than G3. This can also be seen in the results from sample clustering of the raw count data (see Additional file 6), implying that we can roughly predict the DE results such as those shown in Table 4 from the overall similarities of samples on the raw count data.
When comparing the distributions of patterns for DEGs between pipelines, we saw high similarities overall. If anything, baySeq showed a distribution relatively different from the others in light of the higher percentages for three patterns (G1 > G2 = G3, G2 > G1 = G3, and G3 > G1 = G2). This kind of classification can also be performed using EBSeq [42]. EBSeq defines a total of five possible patterns when comparing three groups: Pattern 1 for nonDEG (i.e., G1 = G2 = G3), Pattern 2 for differential expression (DE) in G3 (G1 = G2 < G3 and G1 = G2 > G3), Pattern 3 for DE in G2 (G2 > G1 = G3 and G2 < G1 = G3), Pattern 4 for DE in G1 (G1 > G2 = G3 and G1 < G2 = G3), and Pattern 5 for DE among all groups. Similar to baySeq, EBSeq also returns one of these patterns to each gene. The results of classification based on EBSeq are given in Additional file 9. Similar to the results from baySeq (Table 4), we observed that nearly half the DEGs were assigned to Pattern 2, where the expression patterns between G1 and G2 tend to be more similar than for G3. We also observed that the distribution for baySeq is relatively different from the others, e.g., lower percentages in Patterns 3 and 4 and a higher percentage in Pattern 5.
We finally assessed the reproducibility of ranked gene lists. Remember that the real dataset we analyzed here consists of three groups, each of which has six BRs (we denote this dataset as “rep16”). In addition to the original threegroup comparison with six replicates (i.e., HS_rep16 vs. PT_rep16 vs. RM_rep16), we also performed three threegroup comparisons by dividing the original dataset into three; individual subsets consist of two BRs for each group. For example, the first subset (say “rep12”) consists of a total of six samples for comparing HS_rep12, PT_rep12, and RM_rep12. Likewise, the third subset (“rep56”) is for comparing “HS_rep56 vs. PT_rep56 vs. RM_rep56”. After performing the DE analysis for the three subsets (i.e., rep12, rep34, and rep56), we obtained three ranked gene lists for these subsets. Accordingly, there are a total of four ranked gene lists (rep12, rep34, rep56, and rep16) for each pipeline. We evaluated the reproducibility of ranked gene lists (i) for each subset to the original dataset (i.e., rep16 vs. rep12, rep16 vs. rep34, and rep16 vs. rep56) and (ii) among the three subsets (i.e., rep12 vs. rep34 vs. rep56).
Figure 2 shows the numbers of common genes between the compared sets of topranked genes for individual pipelines: (a) for the top 100 and (b) for the top 1000. For example, there were 66 common genes when comparing the two sets (rep16 and rep56) of the 100 topranked genes obtained from the EEEE pipeline (see the leftmost blue bar in Fig. 2a). As shown in Table 1 and Additional file 2, the more BRs we use, the more accurate the ranked gene lists we can obtain. Accordingly, the evaluation based on the reproducibility of ranked gene lists is analogous to a performance comparison when the available count data have only two BRs. Overall, we see high reproducibility for three edgeRrelated pipelines (EEEE, EE, and edgeR_robust) and low reproducibility for two pipelines (SAMseq and EBSeq). This trend is consistent with the simulation results shown in Table 1 (i.e., threegroup data with three BRs) and previous simulation results for twogroup data with two BRs [28]. Although PoissonSeq showed the highest reproducibility when the 1000 topranked genes were evaluated (Fig. 2b), the performance seems unstable, especially on < 200 topranked genes. This is mainly due to low reproducibility of the ranked gene list for rep12 to the list for rep16. Although we saw a plausible outlying sample (RMM2 or RM_rep5) in the dendrogram of sample clustering for the raw count data, it would not have been related to the dissimilarity of ranked gene lists between rep12 and rep16. The percentages of overlapping/common genes (POGs) for any numbers of topranked genes are given in Additional file 10.
Effect of different choices for options
In general, there are multiple options for some functions, and different choices may result in different ranked gene lists. We investigated the effect of different choices for two representative pipelines (EE and DD) under one simulation condition (0.5, 0.4, 0.1) for (P_{G1}, P_{G2}, P_{G3}) shown in Table 1a. For EE, we evaluated a total of eight combinations, 4 method options (“TMM”, “RLE”, “upperquartile”, and “none”) in “calcNormFactors” function × 2 test options (“chisq” and “F”) in the “glmLRT” function, provided in edgeR. We observed quite similar performances between the two test options (“chisq” and “F”). The average AUC value when using the method = ”TMM” option was the highest (91.47 %), followed by “RLE” (91.46 %), “upperquartile” (91.40 %), and “none” (91.19 %). Since the best practice (i.e., using method = “TMM” and test = “chisq”) is the default in EE (i.e., edgeR), the choices should be left unchanged. Currently, TCC does not allow these options to be changed when performing EEEE that can be recommended for multigroup data with replicates.
We should note that DD does not follow the simple conclusion described above (i.e., the default is the best). We evaluated a total of 18 combinations, 3 method options (“pooled”, “pooledCR”, and “blind”) × 3 sharingMode options (“maximum”, “fitonly”, and “geneestonly”) × 2 fitType options (“parametric” and “local”), in “estimateDispersions” function provided in DESeq. While the average AUC value for the suggested combination in DESeq (i.e., method = “pooled”, sharingMode = “maximum”, and fitType = “parametric”) was 90.60 %, the highest value in the 18 interrogated combinations was 91.69 %. Surprisingly, the best performing combination (i.e., method = “blind”, sharingMode = “fitonly”, and fitType = “local”) did not include any suggested choice.
We found that using both method = “blind” and sharingMode = “fitonly” was especially important to obtain high AUC value when analyzing count data with replicates. Recall that the combination was the default (or suggested) choices for DESeq when analyzing count data without replicates. The high AUC values with the sharingMode = “fitonly” option can be explained by the nature of simulation data (see the ‘Simulation data with replicates’ subsection). In other words, (i) the sharingMode = “fitonly” option is advantageous when existence of count outliers is not assumed (see the “estimateDispersions” function manual in DESeq) and (ii) the simulation data generated by the “simulateReadCounts” function in TCC do not have count outliers. The method = “blind” option ignores the group labels (G1 or G2 or G3) and can compute a gene’s empirical dispersion value even if there is no BRs. As described in the manual, this method can lead to loss of power (i.e., low sensitivity). The use of method = “blind” for count data with replicates cannot practically be recommended in light of potential low sensitivity. Most importantly, DESeq (i.e., DD) is no longer recommended by the authors and DESeq2 (i.e., SS) is recommended instead regardless of the number of BRs [35].
We also investigated the effect of different choices for one pipeline SS under one simulation condition (0.5, 0.4, 0.1) for (P_{G1}, P_{G2}, P_{G3}) shown in Table 3. We evaluated a total of six combinations, 2 type options (“ratio” and “iterate”) in the “estimateSizeFactors” function × 3 fitType options (“parametric”, “local”, and “mean”) in “estimateDispersions” function, provided in DESeq2. Overall, we found that different choices for fitType options had more impact than those of type options. Moreover, the use of fitType = “parametric” had the highest AUC values (82.01 and 81.91 % with type = “ratio” and “iterate”, respectively), followed by the uses of “local” (81.53 and 81.31 %), and “mean” (76.02 and 75.84 %). Since the best practice (i.e., the use of type = “ratio” and fitType = “parametric”) is the default in SS (i.e., DESeq2), the choices should be left unchanged. Similar to the above described for EEEE, TCC does not allow these options to be changed when performing SSSS, which can be recommended for multigroup data without replicates. These results indicate that, as expected, suggested options should basically be used. The AUC values for these combinations are given in Additional file 11.
Conclusion
We evaluated 12 pipelines for DE analysis of multigroup RNAseq count data. Second to twogroup comparison, this experimental design has arguably been performed well in practice. To our knowledge, the current evaluation is the first comprehensive study on multigroup count data. Our main findings can be summarized as follows:
First, the idea of DEGES implemented in TCC can be applied to multigroup data. We confirmed that the AUC values for the three DEGESbased pipelines (EEEE, DDDD, and SSSS) were higher overall than the corresponding nonDEGESbased pipelines: EE (edgeR), DD (DESeq), and SS (DESeq2), respectively (Table 1).
Second, choosing DEG identification method Z in the DEGESbased pipeline XYXZ is critical for obtaining good DE results. For Z in the pipeline XYXZ, using E (the DEG identification method provided in edgeR; Table 2) and S (provided in DESeq2; Table 3) when analyzing threegroup data with and without replicates, respectively, gave higher AUC values than the others.
Third, to analyze threegroup data with replicates, we recommend using either DEDE or EEEE (Table 2). Both pipelines can easily be performed by using the TCC package. While DEDE showed the highest AUC values under the interrogated pipelines and simulation conditions, the difference between DEDE and the second best pipeline EEEE can practically be negligible. Since EEEE is the natural extension of a DEGESbased pipeline for edgeR, using EEEE would be the best practice. However, note that two Bayesian pipelines (baySeq and EBSeq) perform comparably to or better than the GLMbased pipelines (edgeR, DESeq, and DESeq2) when a number of replicates are available (Additional file 2). In particular, EBSeq consistently outperformed EEEE under some simulation conditions (Nrep = 9 and P_{DEG} = 5 %; Sheet 2 in Additional file 2), suggesting that the DEGESbased pipeline based on EBSeq could produce a more accurate ranked gene list. Although these Bayesian pipelines tend to come at the cost of a huge computation time, their implementation and evaluation are the next important tasks.
Fourth, to analyze threegroup data without replicates, we recommend using either EDES or SSSS (Table 3). Both pipelines can easily be performed by using the TCC package. While EDES showed the highest AUC values under the interrogated pipelines and simulation conditions, the difference between EDES and the second best pipeline SSSS can be practically negligible. Since SSSS is the natural extension of a DEGESbased pipeline for DESeq2, using SSSS would be the best practice. Note that our previous recommendation for analyzing twogroup data without replicates was to use DDDD and that this conclusion was obtained only by evaluating a total of eight competing pipelines (DD, DDDD, EDED, EbED, Db, DDDb, EDEb, and EbEb, where “b” denotes baySeq). We expect the DESeq2related pipelines (i.e., EDES and SSSS) would be recommended for analyzing twogroup data without replicates as an updated guideline. The comprehensive evaluation should, of course, be performed as one of the next tasks.
Fifth, the results of DE analysis (including existence or nonexistence of DEGs) can roughly be estimated by the hierarchical dendrogram of sample clustering for the raw count data (Table 4; Additional files 6, 8, and 9). The dendrogram of sample clustering tells us some useful information about the DE results. The real count data we used here have 18.5 ~ 45.7 % of DEGs at the 5 % FDR threshold (Additional file 8). In our experience, such results (i.e., existence of large numbers of DEGs) have frequently been obtained when individual groups (G1, G2, and G3) form distinct subclusters where each subcluster consists only of members in each group (Additional file 6). In other words, if members in each subcluster originate from plural groups, no or few DEGs would be obtained as the DE result for such indistinct data. Of course, it is critical to employ appropriate choices for the distance metric and filtering of low count data for obtaining a robust dendrogram. While we employed the default options (“1  Spearman correlation coefficient” as a distance and the use of unique expression patterns as an objective filtering) in the clustering function (“clusterSample”) provided in TCC, further evaluation should also be performed.
We speculate that the current recommendations made from the threegroup comparative study can be applied to data consisting of three or more groups. While our preliminary analysis for four and fivegroup simulation data has produced similar results to the current study, comprehensive evaluations are the next tasks.
Methods
All analyses were performed using R (ver. 3.2.0 prerelease) [19] and Bioconductor [20].
Simulation data
The threegroup simulation data analyzed here were produced using the “simulateReadCounts” function in TCC. The variance (V) of the NB distribution can generally be modeled as V = μ + Φμ ^{2}. The empirical distribution of read counts for producing the mean (μ) and dispersion (Φ) parameters of the NB model was obtained from Arabidopsis data (three BRs for both the treated and nontreated samples) in [51]. The output of the “simulateReadCounts” function is stored in the TCC class object with information about the simulation conditions and is therefore readytoanalyze.
Real data
The real count dataset (“suppTable1.xls”) was obtained from the supplementary website of [50]. The raw count matrix consisting of 20,689 genes × 36 samples (=3 species × 2 sexes × 3 BRs × 2 technical replicates) were collapsed by summing the data of technical replicates, giving a reduced number of columns in the count matrix (i.e., 18 samples; 3 species × 2 sexes × 3 BRs). The threegroup comparison of this dataset was performed by ignoring the sex differences (i.e., males or females). The relationship of sample names between the original and current study can be seen in Additional file 6.
Differential expression analysis using individual packages
Gene lists ranked in accordance with the level of DE are prerequired for calculating AUC values. The input data for DE analysis using all R packages are the raw count data where each row indicates the gene (or transcript), each column indicates the sample (or library), and each cell indicates the number of reads mapped to the gene in the sample. The versions of major R packages were TCC ver. 1.7.15, edgeR ver. 3.8.5, DESeq ver. 1.18.0, DESeq2 ver. 1.6.3, limma ver. 3.22.1, samr ver. 2.0, PoissonSeq ver. 1.1.2, baySeq ver. 2.0.50, and EBSeq ver. 1.6.0.
All the DEGESbased pipelines X(YX)_{ n }Z or XYXZ were performed using the TCC package. This pipeline includes EEEE, DEDE, EDEE, DDDE, EEED, DEDD, EDED, DDDD, SSSS, EEES, DEDS, EDES, and DDDS. Four other nonDEGESbased pipelines XZ (DE, ED, ES, and DS) were also performed using this package, since they were the hybrid ones originally implemented in different packages. These DEGESbased and nonDEGESbased pipelines were performed using two functions: “calcNormFactors” for X (and Y) and “estimateDE” for Z, in the TCC package. For the DEGESbased pipelines X(YX)_{ n }Z, the options for X, Y, and n in the “calcNormFactors’ function correspond to norm.method, test.method, and iteration, respectively. The E, D, and S for X correspond to norm.method = “tmm”, “deseq”, and “deseq2”, respectively. The E, D, and S for both Y and Z correspond to test.method = “edger”, “deseq”, and “deseq2”, respectively. For n in the DEGESbased pipelines X(YX)_{ n }Z, the iteration = 3 was used as recommended in [12]. For example, the DEDS pipeline was performed using the “calcNormFactors” function with norm.method = “deseq”, test.method = “edger”, and iteration = 3 options, followed by the “estimateDE” function with test.method = “deseq2” option (see Additional file 4). The nonDEGESbased pipelines XZ as X(YX)_{ 0 }Z were accomplished by applying iteration = FALSE. The genes were ranked in ascending order of the pvalues. The pvalue adjustment for the multipletesting problem was performed using the “p.adjust” function with method = “BH” option (BenjaminiHochberg FDR calculation).
The two functions in TCC internally use individual functions provided by one (or two) of the three other packages (edgeR, DESeq, and DESeq2) in accordance with the specific choices (i.e., “tmm”, “edger”, “deseq”, and “deseq2”) of options in TCC. The options used for individual functions in those three packages, internally used in TCC, are the same as those suggested in the original packages. Accordingly, three pipelines (i.e., EE, DD, and SS) as the default procedures in edgeR, DESeq, and DESeq2 can also be performed using TCC. For example, the SS pipeline in TCC can be performed using the “calcNormFactors” function with norm.method = “deseq2”, test.method = NULL, and iteration = FALSE options, followed by the “estimateDE” function with test.method = “deseq2” option (see Additional files 1 and 4). Although we did not employ TCC for the three pipelines in the current evaluation, researchers can easily learn what is done in TCC by comparing the corresponding original procedures described below.
Two pipelines, EE (the same as the default edgeR procedure) and edgeR_robust, were performed using the edgeR package. The EE pipeline for analyzing count data with replicates was performed using the following functions: “DGEList”, “calcNormFactors” with the method = “TMM” option, “estimateGLMCommonDisp” with method = “CoxReid” option, “estimateGLMTrendedDisp” with method = “auto” option, “estimateGLMTagwiseDisp”, “glmFit”, and “glmLRT” with test = “chisq” option. When analyzing count data without replicates, the “estimateGLMCommonDisp” function with three options (method = “deviance”, robust = TRUE, and subset = NULL) was used and two functions (“estimateGLMTrendedDisp” and “estimateGLMTagwiseDisp”) were not used, as suggested. The edgeR_robust method was performed using the following functions: “DGEList”, “calcNormFactors” with method = “TMM” option, “estimateGLMRobustDisp” with prior.df = 10, maxit = 6, and record = FALSE options, “glmFit”, and “glmLRT” with test = “chisq” option. The gene ranking and pvalue adjustment procedure were performed the same way as described above.
The pipeline DD was performed using the DESeq package. The DD for analyzing data with replicates was performed using the following functions: “newCountDataSet”, “estimateSizeFactors” with locfunc = median option, “estimateDispersions” with method = “pooled”, sharingMode = “maximum”, and fitType = “parametric” options, and “fitNbinomGLMs”. When analyzing data without replicates, the “estimateDispersions” function with the following options was used as suggested: method = “blind” and sharingMode = “fitonly”. The genes were ranked in ascending order of the pvalues. The pvalue adjustment for the multipletesting problem was performed using the “p.adjust” function with method = “BH” option (BenjaminiHochberg FDR calculation).
The pipeline SS in the DESeq2 package was performed using the following functions: “DESeqDataSetFromMatrix”, “estimateSizeFactors” with type = “ratio” option, “estimateDispersions” with fitType = “parametric” option, and “nbinomLRT” with modelMatrixType = “standard’ option. The genes were ranked in ascending order of the pvalues. Since this package provides adjusted pvalues, the number of DEGs satisfying the 5 % FDR threshold was obtained using the values.
The pipeline voom in the limma package was performed using the following functions: “DGEList”, “calcNormFactors” with method = “TMM” option in edgeR, “voom”, “lmFit”, “eBayes”, and “topTable”. The gene ranking was performed using the resultant pvalues. Since this package provides adjusted pvalues, the number of DEGs satisfying the 5 % FDR threshold was obtained using the values.
The pipeline SAMseq in the samr package was performed using the “SAMseq” function with the following options: nperms = 100, nresamp = 20, resp.type = “Multiclass”, and fdr.output = 1.0. Since this package only provides adjusted pvalues, the gene ranking was performed using the adjusted pvalues.
The pipeline PoissonSeq was performed by using the “PS.Main” function with npermu = 500 option. The gene ranking was performed using the resultant pvalues. Since this package provides adjusted pvalues, the number of DEGs satisfying the 5 % FDR threshold was obtained using the values.
The pipeline baySeq was performed using the following functions: “new”, “getLibsizes” with estimationType = “edgeR” option, “getPriors.NB” with samplesize = 5000 and estimation = “QL” options, “getLikelihoods” with pET = “BIC” option, and “topCounts”. Since this package only provides adjusted pvalues, the gene ranking was performed using the values. The ordering information in the output of the “topCounts” function was used for classifying the expression patterns of genes.
The pipeline EBSeq was performed using the following functions: “GetPatterns”, “MedianNorm”, “EBMultiTest” with three options (maxround = 5, Qtrm = 1.0, and QtrmCut = −1), and “GetMultiPP”. There are five expression patterns to consider when comparing threegroup data. The “EBMultiTest” function was performed with the consideration of all the five possible patterns. The posterior probability obtained from the “nonDEG” pattern was used as a surrogate estimate for the adjusted pvalues. The gene ranking was performed using the values. The MAP information in the output of the “GetMultiPP” function was used for classifying the expression patterns of genes.
Abbreviations
 AUC:

The area under the curve
 BB:

Betabinomial (distribution or model)
 BR:

Biological replicate
 DE:

Differential expression
 DEG:

Differentially expressed genes
 DEGES:

DEG elimination strategy
 FDR:

False discovery rate
 GLM:

Generalized linear model
 HS:

Homo sapiens
 NB:

Negativebinomial (distribution or model)
 POG:

Percentages of overlapping genes
 PT:

Pan troglodytes
 RM:

Rhesus macaques
 TMM:

Trimmed mean of M values (method)
 TbT:

the TMMbaySeqTMM pipeline
 TCC:

Tag Count Comparison
References
 1.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Methods. 2008;5(7):621–8.
 2.
Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008;321(5891):956–60.
 3.
Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNAseq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18(9):1509–17.
 4.
Wang Z, Gerstein M, Snyder M. RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.
 5.
Hardcastle TJ, Kelly KA. Empirical Bayesian analysis of paired highthroughput sequencing data with a betabinomial distribution. BMC Bioinformatics. 2013;14:135.
 6.
Esnaola M, Puig P, Gonzalez D, Castelo R, Gonzalez JR. A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNAseq experiments. BMC Bioinformatics. 2013;14:254.
 7.
Oshlack A, Robinson MD, Young MD. From RNAseq read to differential expression results. Genome Biol. 2010;11(12):220.
 8.
Kadota K, Nishiyama T, Shimizu K. A normalization strategy for comparing tag count data. Algorithms Mol Biol. 2012;7(1):5.
 9.
Ching T, Huang S, Garmire LX. Power analysis and sample size estimation for RNASeq differential expression. RNA. 2014;20(11):1684–96.
 10.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.
 11.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106.
 12.
Sun J, Nishiyama T, Shimizu K, Kadota K. TCC: an R package for comparing tag count data with robust normalization strategies. BMC Bioinformatics. 2013;14:219.
 13.
Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNAseq. Nat Biotechnol. 2013;31(1):46–53.
 14.
Niu L, Huang W, Umbach DM, Li L. IUTA: a tool for effectively detecting differential isoform usage from RNASeq data. BMC Genomics. 2014;15:862.
 15.
Aschoff M, HotzWagenblatt A, Glatting KH, Fischer M, Eils R, König R. SplicingCompass: differential splicing detection using RNAseq data. Bioinformatics. 2013;29(9):1141–8.
 16.
Anders S, Reyes A, Huber W. Detecting differential usage of exons from RNAseq data. Genome Res. 2012;22(10):2008–17.
 17.
Bi Y, Davuluri RV. NPEBseq: nonparametric empirical bayesianbased procedure for differential expression analysis of RNAseq data. BMC Bioinformatics. 2013;14:262.
 18.
Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT. Differential expression analysis of RNAseq data at singlebase resolution. Biostatistics. 2014;15(3):413–26.
 19.
R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2011.
 20.
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80.
 21.
Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC Bioinformatics. 2010;11:94.
 22.
Hardcastle TJ, Kelly KA. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010;11:422.
 23.
Zhou YH, Xia K, Wright FA. A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics. 2011;27(19):2672–8.
 24.
Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, et al. Comprehensive evaluation of differential gene expression analysis methods for RNAseq data. Genome Biol. 2013;14(9):R95.
 25.
Al Seesi S, Tiagueu Y, Zelikovsky A, Măndoiu II. Bootstrapbased differential gene expression analysis for RNASeq data with and without replicates. BMC Genomics. 2014;15 Suppl 8:S2.
 26.
Liu Y, Zhou J, White KP. RNAseq differential expression studies: more sequence or more replication? Bioinformatics. 2014;30(3):301–4.
 27.
Zhang ZH, Jhaveri DJ, Marshall VM, Bauer DC, Edson J, Narayanan RK, et al. A comparative study of techniques for differential expression analysis on RNASeq data. PLoS ONE. 2014;9(8), e103207.
 28.
Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNAseq data. BMC Bioinformatics. 2013;14:91.
 29.
Garmire LX, Subramaniam S. Evaluation of normalization methods in mammalian microRNASeq data. RNA. 2012;18:1279–88.
 30.
Luo H, Li J, Chia BK, Robson P, Nagarajan N. The importance of study design for detecting differentially abundant features in highthroughput experiments. Genome Biol. 2014;15(12):527.
 31.
Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, et al. A comprehensive evaluation of normalization methods for Illumina highthroughput RNA sequencing data analysis. Brief Bioinform. 2013;14(6):671–83.
 32.
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 2010;11:R25.
 33.
Seyednasrollah F, Laiho A, Elo LL. Comparison of software packages for detecting differential expression in RNAseq studies. Brief Bioinform. 2015;16(1):59–70.
 34.
McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNASeq experiments with respect to biological variation. Nucleic Acids Res. 2012;40(10):4288–97.
 35.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014;15(12):550.
 36.
Zhou X, Lindsay H, Robinson MD. Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Res. 2014;42(11), e91.
 37.
An J, Kim K, Chae H, Kim S. DegPack: a web package using a nonparametric and information theoretic algorithm to identify differentially expressed genes in multiclass RNAseq samples. Methods. 2014;69(3):306–14.
 38.
Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model analysis tools for RNAseq read counts. Genome Biol. 2014;15(2):R29.
 39.
Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article3.
 40.
Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNASeq data. Stat Methods Med Res. 2013;22(5):519–36.
 41.
Li J, Witten DM, Johnstone IM, Tibshirani R. Normalization, Testing and False Discovery Rate Estimation for RNAsequencing Data. Biostatistics. 2012;13(3):523–38.
 42.
Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, et al. EBSeq: an empirical Bayes hierarchical model for inference in RNAseq experiments. Bioinformatics. 2013;29(8):1035–43.
 43.
Kadota K, Nakai Y, Shimizu K. A weighted average difference method for detecting differentially expressed genes from microarray data. Algorithms Mol Biol. 2008;3:8.
 44.
Kadota K, Nakai Y, Shimizu K. Ranking differentially expressed genes from Affymetrix gene expression data: methods with reproducibility, sensitivity, and specificity. Algorithms Mol Biol. 2009;4:7.
 45.
Kadota K, Shimizu K. Evaluating methods for ranking differentially expressed genes applied to MicroArray Quality Control data. BMC Bioinformatics. 2011;12:227.
 46.
Farztdinov V, McDyer F. Distributional fold change test  a statistical approach for detecting differential expression in microarray experiments. Algorithms Mol Biol. 2012;7(1):29.
 47.
Dembélé D, Kastner P. Fold change rank ordering statistics: a new method for detecting differentially expressed genes. BMC Bioinformatics. 2014;15:14.
 48.
Soneson C. compcodeRan R package for benchmarking differential expression methods for RNAseq data. Bioinformatics. 2014;30(17):2517–8.
 49.
Cook RD. Detection of influential observation in linear regression. Technometrics. 1977;19:15–8.
 50.
Blekhman R, Marioni JC, Zumbo P, Stephens M, Gilad Y. Sexspecific and lineagespecific alternative splicing in primates. Genome Res. 2010;20(2):180–9.
 51.
Di Y, Schafer DW, Cumbie JS, Chang JH. The NBP negative binomial model for assessing differential gene expression from RNASeq. Stat Appl Genet Mol Biol. 2011;10:art24.
Acknowledgments
This work was supported by a grant from the Chinese Scholarship Council (CSC). This work was also supported by JSPS KAKENHI Grant Numbers 15K06919, 24500359, and 22128008.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
TM performed the analysis and drafted the manuscript. JS maintained the TCC package and provided the critical comments. KS supervised the critical discussion and refined the paper. KK refined the paper, confirmed the analysis results, and led this project. All the authors read and approved the final manuscript.
Min Tang and Jianqiang Sun contributed equally to this work.
Additional files
Additional file 1:
R code for obtaining simulation results with replicates (Table 1 ). After execution of this Rcode with default parameter settings, the average AUC values of 100 trials under the following conditions shown in Table 1 can be obtained: P_{DEG} = 5 %, (0.5, 0.4, 0.1) for (P_{G1}, P_{G2}, P_{G3}), and Nrep = 3. The results in Additional file 2 can also be obtained by changing the parameter Nrep to be 6 or 9. (R 12 kb)
Additional file 2:
Results for simulation data with replicates (mainly Nrep = 6 and 9). Average AUC values (%) of 100 trials are shown for a total of 12 pipelines for threegroup simulation data, where each group has six (Nrep = 6; Sheet 1) and nine (Nrep = 9; Sheet 2) BRs. Sheet 3: Average partial AUC values (%) of 20 trials with (1  specificity) < 0.1. Sheet 4: Average computation times (in seconds) of 20 trials. The times were measured on a Windows system (Windows 7 Professional, Intel(R) Core(TM) i52540 M CPU, 2.60 GHz, and 8 GB memory). Sheet 5: Average AUC values (%) of 20 trials where the foldchanges for DEGs were randomly sampled from “1.2 + a gamma distribution with shape = 2.0 and scale = 0.5”. Other legends are the same as in Table 1. (XLSX 24 kb)
Additional file 3:
R code for obtaining simulation results with replicates (Table 2 ). After execution of this Rcode with default parameter settings, the average AUC values of 100 trials under the following conditions shown in Table 2 can be obtained: P_{DEG} = 5 %, (0.5, 0.4, 0.1) for (P_{G1}, P_{G2}, P_{G3}), and Nrep = 3. (R 4 kb)
Additional file 4:
R code for obtaining simulation results without replicates (Table 3 ). After execution of this Rcode with default parameter settings, the average AUC values of 100 trials under the following conditions shown in Table 3 can be obtained: P_{DEG} = 25 %, (0.5, 0.4, 0.1) for (P_{G1}, P_{G2}, P_{G3}), and Nrep = 1. (R 8 kb)
Additional file 5:
R code for obtaining results of Blekhman’s count data. After execution of this Rcode, full results of real data analysis can be obtained. (R 21 kb)
Additional file 6:
Dendrogram of averagelinkage hierarchical clustering for the Blekhman’s count data. Results of sample clustering are shown: (a) a raw count dataset consisting of 36 samples, (b) a collapsed data consisting of 18 samples, and (c) the same data as (b) but with different sample labels. The clustering was performed using the “clusterSample” function with default options provided in TCC. (PPTX 62 kb)
Additional file 7:
Dendrogram of averagelinkage hierarchical clustering for 12 ranked gene lists. Twelve ranked gene lists used for constructing the dendrogram were obtained from the analysis of the simulation data under the following conditions: P_{DEG} = 5 %, (0.5, 0.4, 0.1) for (P_{G1}, P_{G2}, P_{G3}), and Nrep = 9. The clustering was performed using the “clusterSample” function with distances defined as (1  Spearman’s rank correlation coefficient). EBSeq showed the highest AUC values (=96.83 %) in this simulation trial, followed by EEEE (96.45 %), EE (96.42 %), DDDD (96.35 %), DD (96.31 %), baySeq (96.21 %), edgeR_robust (95.13 %), SS (94.54 %), SSSS (94.43 %), PoissonSeq (94.07 %), voom (92.70 %), and SAMseq (92.23 %). (PNG 6 kb)
Additional file 8:
Comparison of DEGs obtained from individual pipelines for the Blekhman’s count data. Sheet 1: Numbers of DEGs satisfying the 5 % FDR threshold and the overlaps between all pairs of pipelines are shown. The presentation method is the same as in table 1 in [28]: the numbers on the diagonal are highlighted in bold. Sheet 2: The corresponding Jaccard coefficients are shown. (XLSX 13 kb)
Additional file 9:
Classification of expression patterns for DEGs (based on EBSeq). EBSeq defines a total of five possible patterns (Patterns 1 ~ 5). DEGs (satisfying 5 % FDR threshold) identified by individual pipelines were assigned to one of the five possible patterns. (XLSX 11 kb)
Additional file 10:
Percentages of Overlapping Genes (POGs) between ranked gene lists for 12 pipelines. POG values for any numbers of topranked genes for individual pipelines are shown. Legends are basically the same as in Fig. 2. (PPTX 151 kb)
Additional file 11:
Average AUC values for simulation data with various options. Average AUC values of 100 trials are shown. The suggested (or default) options and the highest AUC values are in bold. Sheet 1: EE (edgeR), Sheet 2: DD (DESeq), Sheet 3: SS (DESeq2). (XLSX 12 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Tang, M., Sun, J., Shimizu, K. et al. Evaluation of methods for differential expression analysis on multigroup RNAseq count data. BMC Bioinformatics 16, 360 (2015). https://doi.org/10.1186/s1285901507947
Received:
Accepted:
Published:
Keywords
 Count Data
 Differential Expression
 Gene Ranking
 False Discovery Rate Threshold
 Rank Gene List