TCC: an R package for comparing tag count data with robust normalization strategies
 Jianqiang Sun^{1},
 Tomoaki Nishiyama^{2},
 Kentaro Shimizu^{1} and
 Koji Kadota^{1}Email author
DOI: 10.1186/1471210514219
© Sun et al.; licensee BioMed Central Ltd. 2013
Received: 10 January 2013
Accepted: 7 July 2013
Published: 9 July 2013
Abstract
Background
Differential expression analysis based on “nextgeneration” sequencing technologies is a fundamental means of studying RNA expression. We recently developed a multistep normalization method (called TbT) for twogroup RNAseq data with replicates and demonstrated that the statistical methods available in four R packages (edgeR, DESeq, baySeq, and NBPSeq) together with TbT can produce a wellranked gene list in which true differentially expressed genes (DEGs) are topranked and nonDEGs are bottom ranked. However, the advantages of the current TbT method come at the cost of a huge computation time. Moreover, the R packages did not have normalization methods based on such a multistep strategy.
Results
TCC (an acronym for Tag Count Comparison) is an R package that provides a series of functions for differential expression analysis of tag count data. The package incorporates multistep normalization methods, whose strategy is to remove potential DEGs before performing the data normalization. The normalization function based on this DEG elimination strategy (DEGES) includes (i) the original TbT method based on DEGES for twogroup data with or without replicates, (ii) much faster methods for twogroup data with or without replicates, and (iii) methods for multigroup comparison. TCC provides a simple unified interface to perform such analyses with combinations of functions provided by edgeR, DESeq, and baySeq. Additionally, a function for generating simulation data under various conditions and alternative DEGES procedures consisting of functions in the existing packages are provided. Bioinformatics scientists can use TCC to evaluate their methods, and biologists familiar with other R packages can easily learn what is done in TCC.
Conclusion
DEGES in TCC is essential for accurate normalization of tag count data, especially when up and downregulated DEGs in one of the samples are extremely biased in their number. TCC is useful for analyzing tag count data in various scenarios ranging from unbiased to extremely biased differential expression. TCC is available at http://www.iu.a.utokyo.ac.jp/~kadota/TCC/ and will appear in Bioconductor (http://bioconductor.org/) from ver. 2.13.
Background
Highthroughput sequencing (HTS), also known as nextgeneration sequencing (NGS), is widely used to identify biological features such as RNA transcript expression and histone modification to be quantified as tag count data by RNA sequencing (RNAseq) and chromatin immunoprecipitation sequencing (ChIPseq) analyses [1, 2]. In particular, differential expression analysis based on tag count data has become a fundamental task for identifying differentially expressed genes or transcripts (DEGs). Such countbased technology covers a wide range of gene expression level [36]. Several R [7] packages have been developed for this purpose [814].
In general, the procedure for identifying DEGs from tag count data consists of two steps: data normalization and identification of DEGs (or gene ranking), and each R package has its own methods for these steps. For example, the R package edgeR[8] uses a global scaling method, the trimmed mean of M values (TMM) method [15], in the data normalization step and an exact test for the negative binomial (NB) distribution [16] in the identification step. The estimated normalization factors are used within the statistical model for differential analysis and gene lists ranked in ascending order of pvalue (or the derivative) are produced. Naturally, a good normalization method combined with a DEG identification method, should produce wellranked gene lists in which true DEGs are top ranked and nonDEGs are bottom ranked according to the confidence or degree of differential expression (DE). Recent studies have demonstrated that the normalization method has more impact than the DEG identification method on the gene list ranking [17, 18].
Note that the normalization strategies employed by most R packages assume that there is an approximately balanced proportion of DEGs between the compared samples (i.e., unbiased DE) [19]. However, a loss of function of histone modification enzymes will lead to a biased distribution of DEGs between compared conditions in the corresponding ChIPseq analysis; i.e., there will be data with biased DE. As a result, methods assuming unbiased DE will not work well on data with biased DE. To normalize data that potentially has various scenarios (including unbiased and biased DE), we recently proposed a multistep normalization procedure called TbT [17]. TbT consists of three steps: data normalization using TMM [15] (step 1), DEG identification by using an empirical Bayesian method implemented in the baySeq package [9] (step 2), and data normalization using TMM [15] after eliminating the estimated DEGs (step 3) comprising the TMMbaySeqTMM pipeline. Different from conventional methods, our multistep normalization strategy can eliminate the negative effect of potential DEGs before the normalization in step 3.
While the threestep TbT normalization method performed well on simulated and real twogroup tag count data with replicates [17], it is practically possible to make different choices for the methods in each step. A more comprehensive study regarding better choices for the DEG elimination strategy (DEGES) is needed. To our knowledge, only the R package, TCC (from Tag Count Comparison), provides tools to perform multistep normalization methods based on DEGES. Our work presented here enables differential expression analysis of tag count data without having to worry much about biased distributions of DEGs.
Implementation
Preparations
Normalization
Normalization of twogroup count data with replicates
When obtaining normalization factors from count data with replicates, users can select a total of six combinations (two normalization methods × three DEG identification methods) coupled with an virtually arbitrary number of iterations (n = 0, 1, 2, …, 100) in our DEGESbased normalization pipeline (Figure 1b). Here, we describe two representative choices (DEGES/TbT and iDEGES/edgeR).
DEGES/TbT
In relation to the other DEGESbased methods, we will call the method “DEGES/TbT” for convenience. As mentioned in ref. [17], the multistep normalization can be repeated until the calculated normalization factors converge. The iterative version of the DEGES/TbT (iDEGES/TbT) can be described as a TMM(baySeqTMM)_{ n } pipeline with n > = 2. Accordingly, the TMM normalization method [15] and the DEGES/TbT can be described as pipelines with n = 0 and 1, respectively, and n can be specified by the iteration argument.
DEGES/edgeR
A major disadvantage of the TbT method is the long time it requires to calculate the normalization factors. This requirement is due to the empirical Bayesian method implemented in the baySeq package. To alleviate this problem, a choice of alternative methods should be provided for step 2. For instance, using the exact test [16] in edgeR in step 2 enables the DEGES normalization pipeline to be much faster and entirely composed of functions provided by the edgeR package. The threestep DEGES normalization pipeline (we will refer to this as the TMM(edgeRTMM)_{ n } pipeline with n = 1 or DEGES/edgeR, for convenience) can be performed by changing the test.method argument to “edger”.
The TbT pipeline automatically calculates the percentage of DEGs (P_{DEG}) by virtue of its use of baySeq. In contrast, a reasonable threshold for defining potential DEGs should also be provided when using the exact test in edgeR (or the NB test in DESeq). Here, we define the threshold as an arbitrary false discovery rate (FDR) with a floor value for P_{DEG}. The default is FDR < 0.1 (i.e., FDR = 0.1), and the default floor P_{DEG} value is 5% (i.e., floorPDEG = 0.05); different choices are possible. For example, in the case of the default settings, x% (x > 5%) of the topranked potential DEGs are eliminated in step 2 if the percentage (= x%) of genes satisfying FDR < 0.1 is over 5%.
The suggested number of iterations is determined from the results of iDEGES/TbT; it is a number at which the accuracies of DEG identifications corresponding to the calculated normalization factors converge [17]. The number of iterations for iDEGES/edgeR is determined in the same way (see the Results and discussion section).
Normalization of twogroup count data without replicates
Most R packages are designed primarily for analyzing data including biological replications because the biological variability has to be accurately estimated to avoid spurious DE calls [21]. In fact, the functions for the DEG identification method implemented in edgeR (i.e., the exact test; ver. 3.0.4) do not allow one to perform an analysis without replicates, even though the TMM normalization method in the package can be used regardless of whether the data has replicates or not. Although the edgeR manual provides users with some ideas on how to perform the DE analysis, it is practically difficult to customize the analysis with DEGES to data without replicates.
Normalization of multigroup count data with replicates
Many R packages (including edgeR, DESeq, and baySeq) support DE analysis for multigroup tag count data. TCC provides DEGESbased normalization methods for such data by virtue of the three packages that are internally used in TCC. Similar to the analysis of twogroup count data with replicates, users can select from a total of six combinations (two normalization methods × three DEG identification methods) and a virtually arbitrary number of iterations (n = 0, 1, 2, …, 100) when obtaining normalization factors from multigroup data with replicates.
Retrieving normalized data
The calculated normalization factors can be obtained from tcc$norm.factors. Similar functions are the calcNormFactors function in the edgeR package and the estimateSizeFactors function in the DESeq package, (Figure 1a). Note that the terminology used in DESeq (i.e., size factors) is different from that used in edgeR (i.e., effective library sizes) and ours. The effective library size in edgeR is calculated as the library size multiplied by the normalization factor. The size factors in the DESeq package are comparable to the normalized effective library sizes, wherein the summary statistics for the effective library sizes are adjusted to one. Our normalization factors, which can be obtained from tcc$norm.factors, have the same names as those in edgeR. Since biologists are often interested in such information [19], we provide the getNormalizedData function for retrieving the normalized data. The normalized data can directly be used as, for example, the input data of another package NOISeq[13].
Differential expression
A similar analysis based on DE methods in DESeq or baySeq can be performed by changing the test.method parameter to “deseq” or “bayseq”. The results of the DE analysis are stored in the TCC class object. The summary statistics for the topranked genes can be retrieved by using the getResult function. In general, the identified DEGs at FDR < 0.1 should be upregulated in either G1 or G2. The plot function generates an MA plot, where “M” indicates the logratio (i.e., M = log_{2}G2  log_{2}G1) and “A” indicates the log average read count (i.e., A = (log_{2}G2 + log_{2}G1)/2) based on the normalized count data.
Generation of simulation data
The simulation conditions for comparing two groups (G1 vs. G2) with biological replicates are as follows: (i) the number of genes is 1,000 (i.e., Ngene = 1000), (ii) the first 20% of genes are DEGs (PDEG = 0.2), (iii) the first 90% of the DEGs are upregulated in G1 and the remaining 10% are upregulated in G2 (DEG.assign = c(0.9, 0.1)), (iv) the levels of DE are fourfold in both groups (DEG.foldchange = c(4, 4)), and (v) there are a total of six samples (three biological replicates for G1 and three biological replicates for G2) (replicates = c(3, 3)). The empirical distribution of read counts is built from Arabidopsis data in NBPSeq[12].
The output of the simulateReadCounts function is stored in the TCC class object with information about the simulation conditions and is therefore readytoanalyze. This function can generate several kinds of simulation data, such as those for comparing four groups (Groups 14) with replicates and those for comparing two groups without replicates. See the vignette for details.
Results and discussion
Accurate data normalization is essential for obtaining wellranked gene lists from tag count data. Similar to other R packages such as edgeR, the TCC package has functionalities for DE analysis of tag count data. Of these functionalities, TCC provides multistep normalization methods (including DEGES/TbT [17]) that internally use the functions implemented in edgeR, DESeq, and baySeq. Here, we demonstrate that the DEGESbased normalization methods are more effective than the methods implemented in the other packages. All analyses were performed using R (ver. 2.15.2) and Bioconductor [20]. Execution times were measured on a Linux system (CentOS release 6.2 (Final), Intel® Xeon® E54617 (2.9 GHz) 24 CPU, and 512 GB memory). The versions of major R libraries were TCC ver. 1.1.99, edgeR ver. 3.0.4, DESeq ver. 1.10.1, and baySeq ver. 1.12.0.
Following our previous study [17], we here demonstrate the performance of these methods by using the same evaluation metric, simulation framework, and real experimental datasets. We use the area under the receiver operating characteristic (ROC) curve (i.e., AUC) as a means of comparison. The simulation conditions are as follows: 525% of the genes are DEGs (P_{DEG} = 525%), 5090% of the DEGs are upregulated in G1 compared to G2 (P_{G1} = 5090%), and the levels of DE are fourfold in both groups. The count dataset consists of 70,619 unique small RNAs (sRNAs) and a total of four Arabidopsis thaliana leaf samples (i.e., two wildtype [WT1 and WT2] and two RNAdependent RNA polymerase 6 (RDR6) knockout [KO1 and KO2] samples) [9]. This dataset was originally analyzed with baySeq. The data has 657 provisional true DE sRNAs (i.e., P_{DEG} = 657 / 70,619 = 0.93%), and all of the sRNAs can be regarded as upregulated in the wildtype (i.e., P_{G1} = 100%). This is because they uniquely match tasRNA, which is produced by RDR6, and they are downregulated in RDR6 mutants. In addition to the RDR6 knockout dataset, we also analyze four other experimental datasets (called “sultan” [4], “gilad” [22], “maqc” [23], and “katz.mouse” [24]) obtained from the ReCount database [25]. The four datasets are used for evaluating the DEGESbased methods aimed at twogroup count data with replicates.
Simulation data with replicates (sensitivity, specificity, and computation time)
Average AUC values for simulation data with replicates
DE method  edgeR  edgeR  edgeR  DESeq  DESeq  DESeq  baySeq  baySeq  baySeq 

P _{G1}  50%  70%  90%  50%  70%  90%  50%  70%  90% 
(a) TMM  
P_{DEG} = 5%  90.30  90.30  90.18  89.09  89.08  88.92  89.43  89.42  89.22 
P_{DEG} = 15%  90.30  90.12  89.63  89.15  88.91  88.26  89.54  89.28  88.66 
P_{DEG} = 25%  90.40  89.89  88.26  89.27  88.62  86.82  89.63  89.07  87.25 
(b) DEGES/TbT  
P_{DEG} = 5%  90.30  90.31  90.28  89.09  89.12  89.07  89.43  89.47  89.41 
P_{DEG} = 15%  90.30  90.25  90.26  89.15  89.07  89.01  89.54  89.48  89.49 
P_{DEG} = 25%  90.41  90.27  89.89  89.27  89.07  88.61  89.63  89.56  89.22 
(c) DEGES/edgeR  
P_{DEG} = 5%  90.29  90.32  90.28  89.09  89.13  89.08  89.43  89.48  89.41 
P_{DEG} = 15%  90.30  90.25  90.26  89.15  89.07  89.02  89.54  89.48  89.50 
P_{DEG} = 25%  90.40  90.28  89.91*  89.27  89.08  88.64  89.63  89.56  89.24 
(d) iDEGES/edgeR  
P_{DEG} = 5%  90.29  90.32  90.28  89.09  89.13  89.08  89.43  89.48  89.42 
P_{DEG} = 15%  90.30  90.25  90.29*  89.15  89.08  89.06  89.54  89.49  89.54 
P_{DEG} = 25%  90.40  90.30*  90.04*  89.27  89.11  88.80  89.63  89.60  89.42 
(e) iDEGES/TDT  
P_{DEG} = 5%  90.30  90.32  90.28  89.09  89.12  89.09  89.43  89.48  89.42 
P_{DEG} = 15%  90.30  90.23  90.20  89.15  89.05  88.93  89.54  89.45  89.40 
P_{DEG} = 25%  90.40  90.25  89.82  89.27  89.04  88.53  89.63  89.52  89.13 
(f) iDEGES/DESeq  
P_{DEG} = 5%  90.30  90.31  90.28  89.09  89.13  89.09  89.43  89.48  89.43 
P_{DEG} = 15%  90.30  90.24  90.21  89.15  89.05  88.95  89.54  89.45  89.42 
P_{DEG} = 25%  90.40  90.26  89.86  89.27  89.05  88.58  89.63  89.53  89.18 
DEGES/edgeR performed comparably to TbT, whereas iDEGES/edgeR outperformed the others, irrespective of the choice of DEG identification method (edgeR, DESeq, and baySeq) after normalization. That is, iDEGES/edgeR followed by any DEG identification method YYY (termed the iDEGES/edgeRYYY combination) performed the best among the six normalization methods. These results demonstrate that the alternative DEGES approaches (iDEGES/edgeR) implemented in TCC generally outperform the original DEGES approach (i.e., DEGES/TbT). In other words, the use of the exact test [16] implemented in edgeR is sufficient to determine the potential DEGs to be eliminated. Advantageous characteristics for the exact test also revealed themselves after performing any normalization method: Comparing the three DEG identification methods (edgeR, DESeq, and baySeq), we see that the XXXedgeR have the highest AUC values. Overall, iDEGES/edgeRedgeR performed the best. It should be noted that, however, the AUC values for the pipeline look very close to those of the original recommendation (i.e., DEGES/TbTedgeR) [17], e.g., 90.30% for iDEGES/edgeRedgeR and 90.27% for DEGES/TbTedgeR under one simulation condition of P_{DEG} = 25% and P_{G1} = 70%.
Average computation times for obtaining normalization factors
P _{G1}  50%  70%  90% 

(a) TMM  
P_{DEG} = 5%  0.13  0.18  0.22 
P_{DEG} = 15%  0.16  0.19  0.15 
P_{DEG} = 25%  0.17  0.14  0.15 
(b) DEGES/TbT  
P_{DEG} = 5%  1492.92  1423.55  1573.23 
P_{DEG} = 15%  1556.93  1510.17  1408.46 
P_{DEG} = 25%  1527.08  1483.09  1543.80 
(c) DEGES/edgeR  
P_{DEG} = 5%  3.04  3.08  3.24 
P_{DEG} = 15%  3.08  3.05  2.89 
P_{DEG} = 25%  2.99  2.92  3.05 
(d) iDEGES/edgeR  
P_{DEG} = 5%  8.80  8.95  9.53 
P_{DEG} = 15%  8.94  8.91  8.39 
P_{DEG} = 25%  8.75  8.39  8.92 
(e) iDEGES/TDT  
P_{DEG} = 5%  17.92  17.54  18.24 
P_{DEG} = 15%  19.85  19.45  18.74 
P_{DEG} = 25%  21.17  20.74  21.16 
(f) iDEGES/DESeq  
P_{DEG} = 5%  17.88  17.54  18.34 
P_{DEG} = 15%  19.96  19.43  18.72 
P_{DEG} = 25%  21.17  20.75  21.27 
Simulation data with replicates (effect of iteration)
Result of the iDEGES approach does not necessarily convergent
P _{G1}  50%  70%  90%  Total 

(a) iDEGES/TbT  
convergent  69  70  83  222 
cyclic  31  30  17  78 
(b) iDEGES/edgeR  
convergent  86  86  79  251 
cyclic  14  14  21  49 
(c) iDEGES/TDT  
convergent  82  81  84  247 
cyclic  18  19  16  53 
(d) iDEGES/DESeq  
convergent  99  97  100  296 
cyclic  1  3  0  4 
Of practical interest when using the iDEGES approach is the number of iterations required for obtaining a convergent result. We defined n as N_{C} if the potential DEGs estimated at the (N_{C} + 1)^{th} iteration are the same as those in the (N_{C})^{th} iteration. The distribution of N_{C} values for the four iDEGESbased methods are given in Additional file 2. In our trials, the maximum N_{C} value was 8 (see “Sheet 1” in Additional file 2). The distribution suggests that the iDEGES approach with n = 8 could be sufficient for obtaining convergent results under various simulation scenarios. However, the improvement had by iDEGES/edgeR with n > = 4 compared with that with n = 3 is actually negligible despite the requirement of additional computation time (see “Sheet 3” in Additional file 2). Therefore, the use of our iDEGES pipeline with n = 3 can be recommended for reducing useless computation time.
We observed that the number of iterations needed for obtaining convergent results (the Nc value) tends to increase when the degree of biased DE is high (P_{G1} > 50%): the average N_{C} values for the iDEGES/edgeR under the three conditions of P_{G1} = 50, 70, and 90% were 2.17, 3.14, and 3.60, respectively (see “Sheet 1” in Additional file 2). This is reasonable because, in the case of the simulation with P_{G1} > > 50%, a relatively large number of n in the TMM(edgeRTMM)_{ n } pipeline is theoretically needed for obtaining accurate normalization factors and because, in the case of the simulation with P_{G1} = 50%, the theoretical P_{G1} value obtained from the potential DEGs in any iDEGESbased pipeline at n = 1 is 50% (i.e., no need to apply the iDEGES approach to such data).
Now, let us briefly discuss the nonconvergent results. All the nonconvergent results showed cyclic characteristics, that is, there exists a same set of potential DEGs estimated in both the i^{th} and the (i + N_{P})^{th} iterations within a trial (N_{P} > = 2). We found that the most frequent N_{P} value was 2 (see “Sheet 1” in Additional file 2). For example, iDEGES/edgeR yielded 83.7% (41/49) nonconvergent results with N_{P} = 2. This result indicates that two different sets of potential DEGs are alternately eliminated in the i^{th} and (i + 1)^{th} iterations. Note that different normalization methods do not seem to produce consistent results (convergent (C) or nonconvergent (P)) within a trial. Under one simulation condition (P_{DEG} = 20% and P_{G1} = 90%), for example, iDEGES/edgeR and iDEGES/TbT got 21 and 17 nonconvergent results, respectively. Of these, only three trials (i.e., the 8th, 10th, and 61th) showed the same nonconvergent results (see “Sheet 2” in Additional file 2).
We confirmed that the P_{DEG} values (satisfying FDR < 0.1) originally estimated by three methods (iDEGES/edgeR, iDEGES/TDT, and iDEGES/DESeq) in every iteration in Table 3 were above the predefined floor value of 5%, indicating that the floor value of 5% for P_{DEG} has no effect on whether the results converge or not. The iterative method can be viewed as a discrete dynamical system since the number of state is finite and determined by the combination of genes as potential DEGs. Oscillations in the trajectory are common phenomena in such dynamical systems. Thus, a method based on an optimality criterion should be developed to select the best point in the cycle after detecting an oscillation. Nevertheless, we observed an overall improvement regarding nonconvergent results when more iterations were used.
Simulation data without replicates
TCC also provides DEGESbased methods for normalizing twogroup data without replicates. As described previously, the DEG identification method (i.e., the exact test) in edgeR does not allow for an analysis without replicates. Accordingly, we evaluated a total of eight XXXYYY combinations (XXX = DESeq, iDEGES/DESeq, iDEGES/TDT, and DEGES/TbT; YYY = DESeq and baySeq). Here, the DESeqDESeq combination indicates the original procedure in DESeq. Different from the results for the data with replicates, iDEGES/DESeq (and iDEGES/TDT) performed better than DEGES/TbT in this case (see “Sheet 1” in Additional file 3). The same trend can be seen in the accuracies of the estimated DEGs: the accuracies calculated from baySeq (i.e., DEGES/TbT) were clearly inferior to those from DESeq, irrespective of the choice of normalization method (see “Sheet 2” in Additional file 3). The advantageous characteristics of the NB test in DESeq become apparent when we compare the two DEG identification methods (DESeq and baySeq) for any normalization method XXX: the XXXDESeq combination outperforms the XXXbaySeq combination. These results indicate that the higher AUC values of the iDEGES/DESeqDESeq are primarily by virtue of a wellranked gene list produced from the NB test and that constructing a model based on bootstrap resampling employed in baySeq is difficult in a nonreplicate situation.
We observed that the numbers of potential DEGs satisfying FDR < 0.1 in the iDEGES/DESeq were nearly zero (i.e., the estimated P_{DEG} values were 0%) in all of the simulations, although the P_{DEG} values were 525%. This is reasonable because any attempt to work without replicates will lead to conclusions of very limited reliability: DESeq employs a conservative approach to prevent spurious DE calls. Accordingly, a predefined floor P_{DEG} value (= 5%) was used; i.e., 5% of the topranked genes were not used when calculating the normalization factors in the iDEGES/DESeq and iDEGES/TDT methods. These facts indicate that (i) iDEGES/DESeq performs almost as well as the original normalization method in DESeq when the floor P_{DEG} value is decreased from the default (= 5%), (ii) the methods perform equally well when the floor P_{DEG} value of 0% for iDEGES/DESeq is used, and (iii) iDEGES/DESeq (and iDEGES/TDT) with a floor value of x% tends to work better when analyzing simulation data with the same P_{DEG} value. Note that we set the floor P_{DEG} value to 5% in order to obtain certain differences between the iDEGESbased methods and the original DESeq, but we should consider the appropriate choice of the parameter in the future. Results of iterations in the iDEGES approach (i.e., convergent or nonconvergent) when analyzing twogroup data without replicates were given in Additional file 4.
Real data (ArabidopsisRDR6 knockout dataset)
AUC values for an RDR6 knockout dataset
DE method  edgeR  DESeq  baySeq 

TMM  68.36  61.09  60.21 
DEGES/TbT  66.70  61.40  60.09 
DEGES/edgeR  69.28  61.27  60.31 
iDEGES/edgeR  65.70  61.34  60.86 
iDEGES/TDT  65.76  60.84  60.19 
iDEGES/DESeq  64.88  60.45  59.20 
We observed that the compositions of the potential DE sRNAs were identical when the AUC values were the same (see, e.g., light blue line at n = 2 ~ 30). This was true even for cyclic results (e.g., the compositions obtained from iDEGES/edgeR at the 10th, 17th, and 24th iterations). The different AUC values, in turn, were due to the difference in the normalization factors at each iteration and the different compositions caused the next normalization factors to be different. We found that the relatively large difference in the AUC values among cycles for the nonconvergent results (approximately 3% difference) were due to (i) the paucity of variations among the 657 provisional true DE sRNAs and (ii) their low expression levels. Of the 657 sRNAs, there were only 143 unique patterns of counts across the four samples (i.e., WT1, WT2, KO1, and KO2). For example, there were 91 DE sRNAs that had one tag count only in WT1, i.e., (1, 0, 0, 0) and 200 DE sRNAs that had one count only in WT2, i.e., (0, 1, 0, 0). These two count patterns occupied 44.3% of the 657 DE sRNAs. Such lowcount DE sRNAs cannot be distinguished from other nonDE sRNAs if both patterns are identical. Indeed, we found that the positions for the lowcount DE sRNAs in the ranked gene list varied from iteration to iteration.
Similar to the paucity of variations regarding the count vectors for the 657 DE sRNAs, we found that out of a total of 70,619 sRNAs, there were only 2,535 unique patterns of sRNA counts. This indicates that many sRNAs displayed the same degrees of DE, and therefore, their ranks (or the AUC values calculated based on the ranks) will vary if slight changes are made to the calculated normalization factors. Indeed, we observed considerably different changes in the AUC values when a higher floor P_{DEG} value of 10% was used (Figure 2b); e.g., the results for iDEGES/TbT converged when a 5% floor was used, but they became cyclic when 10% was used (the red lines in the figure). We also found that iDEGES/ TbT, iDEGES/edgeR, and iDEGES/TDT performed about the same when a 10% floor was used. These results indicate that the large difference between AUC values in this dataset might be within the error range.
All results described above were obtained with the original raw count data, in accordance with the edgeR and baySeq design. However, we should note that the current procedure is different from one recommended in the TbT paper [17], where the data was scaled to counts per million (CPM) in each sample when the DEG identification method in baySeq (i.e., the empirical Bayesian method) was executed. This recommendation was derived from a comparison of uncertain AUC values between the raw count data and the CPM data of this dataset, and it is now questionable. Accordingly, all the procedures implemented in the current TCC are based on the original raw count data.
Real data (four ReCount datasets)
Lastly, let us show the results of four experimental datasets: three human RNAseq datasets (called “sultan” [4], “gilad” [22], and “maqc” [23]) and one mouse dataset (“katz.mouse” [24]). These datasets were obtained from the ReCount database [25]. Different from the RDR6 dataset and simulated datasets, we do not know the true DEGs for the four datasets, indicating that we cannot calculate the AUC values. We therefore investigated the effect of iteration regarding the potential DEGs to be eliminated (i.e., convergent or nonconvergent). Consistent with the RDR6 result, we observed several cyclic (P) characteristics for the ReCount datasets (see “Sheet 2” in Additional file 5).
Conclusion
The R package TCC provides users with a robust and accurate framework to perform DE analysis of tag count data. TCC has an improved data normalization step, compared with existing packages such as edgeR. While the other normalization strategies assume that there is an approximately balanced proportion of DEGs between compared samples (i.e., unbiased DE), our multistep DEGESbased normalization methods are designed to deal with various scenarios (including unbiased and biased DE situations): the internally used DEG identification method eliminates the effects of biases, if any, of potential DEGs. Our study demonstrated that the iDEGES/edgeRedgeR combination can be recommended for analyzing twogroup data with replicates (see Tables 1 and 2) and that the iDEGES/DESeqDESeq can be recommended for analyzing twogroup data without replicates (see Additional file 3). The success of these methods primarily owes to the high scalability of the normalization and DEG identification methods in the R packages used in TCC.
The functionality of TCC can be extended in many ways. First, the current study focuses on the analysis of twogroup data, but some users may want to utilize the DEGESbased methods for data consisting of two or more groups. As can be seen in the vignette, some prototypes of DEGESbased pipelines for analyzing data in three or four groups have already been implemented. Evaluations such as these and further improvement are our next tasks. Second, the current approach of TCC is somewhat reminiscent of a microarray analysis of the count matrix for genes. We know that (i) one or more isoforms can be transcribed in a same gene region, (ii) those transcripts may have distinct expression levels, and (iii) it could lead to a DE result at the transcript level but a nonDE result at the gene level. To prevent this possible problem and fully utilize the resolution of RNAseq data, advanced R packages (e.g., BitSeq[21] and DEXSeq[14]) have recently been developed. We believe that the use of those packages together with DEGES enables us to obtain a more reliable result, because the idea of DEGES can, of course, be applied to DE analysis at both genelevel and transcriptlevel resolutions.
Finally, the current DEGESbased normalization methods implemented in TCC only employ linear scaling normalization methods and statistical methods for identifying DEGs. This is because these scaling normalization methods do not change the shape of the original count distribution and the statistical methods for identifying DEGs in edgeR, DESeq, and baySeq assume the model. Technically speaking, we can construct many other pipelines consisting of, for example, a nonlinear normalization method (e.g., quantile normalization [26]) and a DEG identification method originally developed for microarray analysis (e.g., WAD [27]). As we learned from the microarrays, there are suitable combinations of normalization methods and DEG identification methods [28]: we speculate that a nonlinear normalization method would be incompatible with a statistical DE method. A critical evaluation of those pipelines will be of interest in the future and we will continue to offer uptodate guidelines.
Availability and requirements
Project name: TCC
Project home page: TCC is available at http://www.iu.a.utokyo.ac.jp/~kadota/TCC/ and will appear in Bioconductor (http://bioconductor.org/) from ver. 2.13.
Operating systems: Platform independent
Programming language: R
Other requirements: requires the edgeR, DESeq, baySeq, and ROC packages
License: GPL2
Any restrictions to use by nonacademics: None
Acknowledgments
The authors thank Dr. TJ Hardcastle for providing the dataset used in the baySeq study. This study was supported by grants (KAKENHI 24500359 to KK and 22128008 to TN) from the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT).
Abbreviations
 DE:

Differential expression
 DEG:

Differentially expressed gene
 DEGES:

DEG elimination strategy
 FDR:

False discovery rate
 CPM:

Counts per million (normalization)
 sRNA:

small RNA
 tasRNA:

TAS locusderived small RNA
 TMM:

Trimmed mean of M values (method)
 TbT:

the TMMbaySeqTMM pipeline
 TCC:

Tag count comparison.
Declarations
Authors’ Affiliations
References
 Schuster SC: Nextgeneration sequencing transforms today’s biology. Nat Methods. 2008, 5: 1618.View ArticlePubMedGoogle Scholar
 Mardis ER: The impact of nextgeneration sequencing technology on genetics. Trends Genet. 2008, 24 (3): 133141. 10.1016/j.tig.2007.12.007.View ArticlePubMedGoogle Scholar
 Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Methods. 2008, 5 (7): 621628. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
 Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O’Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo ML: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008, 321 (5891): 956960. 10.1126/science.1160342.View ArticlePubMedGoogle Scholar
 Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNAseq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18 (9): 15091517. 10.1101/gr.079558.108.PubMed CentralView ArticlePubMedGoogle Scholar
 Asmann YW, Klee EW, Thompson EA, Perez EA, Middha S, Oberg AL, Therneau TM, Smith DI, Poland GA, Wieben ED, Kocher JP: 3′ tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer. BMC Genomics. 2009, 10: 53110.1186/1471216410531.PubMed CentralView ArticlePubMedGoogle Scholar
 R Development Core Team: R: A Language and Environment For Statistical Computing. 2011, Vienna, Austria: R Foundation for Statistical computingGoogle Scholar
 Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26 (1): 139140. 10.1093/bioinformatics/btp616.PubMed CentralView ArticlePubMedGoogle Scholar
 Hardcastle TJ, Kelly KA: baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010, 11: 42210.1186/1471210511422.PubMed CentralView ArticlePubMedGoogle Scholar
 Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11: R10610.1186/gb20101110r106.PubMed CentralView ArticlePubMedGoogle Scholar
 Wang L, Feng Z, Wang X, Wang X, Zhang X: DEGseq: an R package for identifying differentially expressed genes from RNAseq data. Bioinformatics. 2010, 26 (1): 136138. 10.1093/bioinformatics/btp612.View ArticlePubMedGoogle Scholar
 Di Y, Schafer DW, Cumbie JS, Chang JH: The NBP negative binomial model for assessing differential gene expression from RNASeq. Stat Appl Genet Mol Biol. 2011, 10: art24Google Scholar
 Tarazona S, GarciaAlcalde F, Dopazo J, Ferrer A, Conesa A: Differential expression in RNAseq: a matter of depth. Genome Res. 2011, 21 (12): 22132223. 10.1101/gr.124321.111.PubMed CentralView ArticlePubMedGoogle Scholar
 Anders S, Reyes A, Huber W: Detecting differential usage of exons from RNAseq data. Genome Res. 2012, 22 (10): 20082017. 10.1101/gr.133744.111.PubMed CentralView ArticlePubMedGoogle Scholar
 Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 2010, 11: R2510.1186/gb2010113r25.PubMed CentralView ArticlePubMedGoogle Scholar
 Robinson MD, Smyth GK: Smallsample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008, 9: 321332.View ArticlePubMedGoogle Scholar
 Kadota K, Nishiyama T, Shimizu K: A normalization strategy for comparing tag count data. Algorithms Mol Biol. 2012, 7: 510.1186/1748718875.PubMed CentralView ArticlePubMedGoogle Scholar
 Garmire LX, Subramaniam S: Evaluation of normalization methods in mammalian microRNASeq data. RNA. 2012, 18: 12791288. 10.1261/rna.030916.111.PubMed CentralView ArticlePubMedGoogle Scholar
 Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloë D, Le Gall C, Schaëffer B, Le Crom S, Guedj M, Jaffrézic F, on behalf of The French StatOmique Consortium: A comprehensive evaluation of normalization methods for Illumina highthroughput RNA sequencing data analysis. Brief Bioinform. 2012, 10.1093/bib/bbs046.Google Scholar
 Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R8010.1186/gb2004510r80.PubMed CentralView ArticlePubMedGoogle Scholar
 Glaus P, Honkela A, Rattray M: Identifying differentially expressed transcripts from RNAseq data with biological variation. Bioinformatics. 2012, 28 (13): 17211728. 10.1093/bioinformatics/bts260.PubMed CentralView ArticlePubMedGoogle Scholar
 Blekhman R, Marioni JC, Zumbo P, Stephens M, Gilad Y: Sexspecific and lineagespecific alternative splicing in primates. Genome Res. 2010, 20 (2): 180189. 10.1101/gr.099226.109.PubMed CentralView ArticlePubMedGoogle Scholar
 Bullard JH, Purdom E, Hansen KD, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC Bioinformatics. 2010, 11: 9410.1186/147121051194.PubMed CentralView ArticlePubMedGoogle Scholar
 Katz Y, Wang ET, Airoldi EM, Burge CB: Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Methods. 2010, 7 (12): 10091015. 10.1038/nmeth.1528.PubMed CentralView ArticlePubMedGoogle Scholar
 Frazee AC, Langmead B, Leek JT: ReCount: A multiexpreriment resource of analysisready RNAseq gene count datasets. BMC Bioinformatics. 2011, 12: 44910.1186/1471210512449.PubMed CentralView ArticlePubMedGoogle Scholar
 Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19 (2): 185193. 10.1093/bioinformatics/19.2.185.View ArticlePubMedGoogle Scholar
 Kadota K, Nakai Y, Shimizu K: A weighted average difference method for detecting differentially expressed genes from microarray data. Algorithms Mol Biol. 2008, 3: 810.1186/1748718838.PubMed CentralView ArticlePubMedGoogle Scholar
 Kadota K, Nakai Y, Shimizu K: Ranking differentially expressed genes from Affymetrix gene expression data: methods with reproducibility, sensitivity, and specificity. Algorithms Mol Biol. 2009, 4: 710.1186/1748718847.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.