 Software
 Open access
 Published:
TCC: an R package for comparing tag count data with robust normalization strategies
BMC Bioinformatics volume 14, Article number: 219 (2013)
Abstract
Background
Differential expression analysis based on “nextgeneration” sequencing technologies is a fundamental means of studying RNA expression. We recently developed a multistep normalization method (called TbT) for twogroup RNAseq data with replicates and demonstrated that the statistical methods available in four R packages (edgeR, DESeq, baySeq, and NBPSeq) together with TbT can produce a wellranked gene list in which true differentially expressed genes (DEGs) are topranked and nonDEGs are bottom ranked. However, the advantages of the current TbT method come at the cost of a huge computation time. Moreover, the R packages did not have normalization methods based on such a multistep strategy.
Results
TCC (an acronym for Tag Count Comparison) is an R package that provides a series of functions for differential expression analysis of tag count data. The package incorporates multistep normalization methods, whose strategy is to remove potential DEGs before performing the data normalization. The normalization function based on this DEG elimination strategy (DEGES) includes (i) the original TbT method based on DEGES for twogroup data with or without replicates, (ii) much faster methods for twogroup data with or without replicates, and (iii) methods for multigroup comparison. TCC provides a simple unified interface to perform such analyses with combinations of functions provided by edgeR, DESeq, and baySeq. Additionally, a function for generating simulation data under various conditions and alternative DEGES procedures consisting of functions in the existing packages are provided. Bioinformatics scientists can use TCC to evaluate their methods, and biologists familiar with other R packages can easily learn what is done in TCC.
Conclusion
DEGES in TCC is essential for accurate normalization of tag count data, especially when up and downregulated DEGs in one of the samples are extremely biased in their number. TCC is useful for analyzing tag count data in various scenarios ranging from unbiased to extremely biased differential expression. TCC is available at http://www.iu.a.utokyo.ac.jp/~kadota/TCC/ and will appear in Bioconductor (http://bioconductor.org/) from ver. 2.13.
Background
Highthroughput sequencing (HTS), also known as nextgeneration sequencing (NGS), is widely used to identify biological features such as RNA transcript expression and histone modification to be quantified as tag count data by RNA sequencing (RNAseq) and chromatin immunoprecipitation sequencing (ChIPseq) analyses [1, 2]. In particular, differential expression analysis based on tag count data has become a fundamental task for identifying differentially expressed genes or transcripts (DEGs). Such countbased technology covers a wide range of gene expression level [36]. Several R [7] packages have been developed for this purpose [814].
In general, the procedure for identifying DEGs from tag count data consists of two steps: data normalization and identification of DEGs (or gene ranking), and each R package has its own methods for these steps. For example, the R package edgeR[8] uses a global scaling method, the trimmed mean of M values (TMM) method [15], in the data normalization step and an exact test for the negative binomial (NB) distribution [16] in the identification step. The estimated normalization factors are used within the statistical model for differential analysis and gene lists ranked in ascending order of pvalue (or the derivative) are produced. Naturally, a good normalization method combined with a DEG identification method, should produce wellranked gene lists in which true DEGs are top ranked and nonDEGs are bottom ranked according to the confidence or degree of differential expression (DE). Recent studies have demonstrated that the normalization method has more impact than the DEG identification method on the gene list ranking [17, 18].
Note that the normalization strategies employed by most R packages assume that there is an approximately balanced proportion of DEGs between the compared samples (i.e., unbiased DE) [19]. However, a loss of function of histone modification enzymes will lead to a biased distribution of DEGs between compared conditions in the corresponding ChIPseq analysis; i.e., there will be data with biased DE. As a result, methods assuming unbiased DE will not work well on data with biased DE. To normalize data that potentially has various scenarios (including unbiased and biased DE), we recently proposed a multistep normalization procedure called TbT [17]. TbT consists of three steps: data normalization using TMM [15] (step 1), DEG identification by using an empirical Bayesian method implemented in the baySeq package [9] (step 2), and data normalization using TMM [15] after eliminating the estimated DEGs (step 3) comprising the TMMbaySeqTMM pipeline. Different from conventional methods, our multistep normalization strategy can eliminate the negative effect of potential DEGs before the normalization in step 3.
While the threestep TbT normalization method performed well on simulated and real twogroup tag count data with replicates [17], it is practically possible to make different choices for the methods in each step. A more comprehensive study regarding better choices for the DEG elimination strategy (DEGES) is needed. To our knowledge, only the R package, TCC (from Tag Count Comparison), provides tools to perform multistep normalization methods based on DEGES. Our work presented here enables differential expression analysis of tag count data without having to worry much about biased distributions of DEGs.
Implementation
The TCC package was developed in the R statistical environment. This is because R is widely used and the main functionalities in TCC consist of combinations of functions from the existing R/Bioconductor [20] packages (i.e., edgeR[8], DESeq[10], and baySeq[9]). Since the main purpose (identification of DEGs from tag count data) of these three packages is essentially the same as that of TCC and many users may be experienced in their use, we will illustrate the main functionalities of TCC by contrasting them with the corresponding functions in those packages (see Figure 1). While TCC employs Object Oriented Programming design utilizing the R5 reference class, it has interface functions that do not change the object passed as the argument in order to be compatible with the semantics of the standard R environment. Detailed documentation for this package is provided as a vignette:
Preparations
Differential expression analysis between compared samples based on tag count data typically starts by preparing two objects: i) a count table matrix where each row indicates a gene, each column a sample, and each cell the number of counts (or reads) mapped to each gene in each sample and ii) a vector that indicates which group each sample belongs to. These data are stored in a TCC class object using the new function. Similar functions of other packages are the DGEList function in the edgeR package, the newCountDataSet function in the DESeq package, the new function in the baySeq package, and so on (see Figure 1a). Consider, for example, a matrix object hypoData consisting of 1,000 rows and six columns and a numeric vector group consisting of six elements, i.e., (1, 1, 1, 2, 2, 2). The first three samples in the matrix are from Group 1 (G1), and the others are from Group 2 (G2). The TCC class object is constructed as follows:
Normalization
Normalization of twogroup count data with replicates
When obtaining normalization factors from count data with replicates, users can select a total of six combinations (two normalization methods × three DEG identification methods) coupled with an virtually arbitrary number of iterations (n = 0, 1, 2, …, 100) in our DEGESbased normalization pipeline (Figure 1b). Here, we describe two representative choices (DEGES/TbT and iDEGES/edgeR).
DEGES/TbT
The TCC package provides robust normalization methods based on the DEGES recently proposed by Kadota et al. [17]. The original threestep normalization method (TbT) is performed by specifying the two major arguments (norm.method and test.method) as follows:
In relation to the other DEGESbased methods, we will call the method “DEGES/TbT” for convenience. As mentioned in ref. [17], the multistep normalization can be repeated until the calculated normalization factors converge. The iterative version of the DEGES/TbT (iDEGES/TbT) can be described as a TMM(baySeqTMM)_{ n } pipeline with n > = 2. Accordingly, the TMM normalization method [15] and the DEGES/TbT can be described as pipelines with n = 0 and 1, respectively, and n can be specified by the iteration argument.
DEGES/edgeR
A major disadvantage of the TbT method is the long time it requires to calculate the normalization factors. This requirement is due to the empirical Bayesian method implemented in the baySeq package. To alleviate this problem, a choice of alternative methods should be provided for step 2. For instance, using the exact test [16] in edgeR in step 2 enables the DEGES normalization pipeline to be much faster and entirely composed of functions provided by the edgeR package. The threestep DEGES normalization pipeline (we will refer to this as the TMM(edgeRTMM)_{ n } pipeline with n = 1 or DEGES/edgeR, for convenience) can be performed by changing the test.method argument to “edger”.
The TbT pipeline automatically calculates the percentage of DEGs (P_{DEG}) by virtue of its use of baySeq. In contrast, a reasonable threshold for defining potential DEGs should also be provided when using the exact test in edgeR (or the NB test in DESeq). Here, we define the threshold as an arbitrary false discovery rate (FDR) with a floor value for P_{DEG}. The default is FDR < 0.1 (i.e., FDR = 0.1), and the default floor P_{DEG} value is 5% (i.e., floorPDEG = 0.05); different choices are possible. For example, in the case of the default settings, x% (x > 5%) of the topranked potential DEGs are eliminated in step 2 if the percentage (= x%) of genes satisfying FDR < 0.1 is over 5%.
Although iDEGES/TbT would not be practical because of its long computation time, the iterative version of DEGES/edgeR (iDEGES/edgeR with n > = 2) is potentially superior to both noniterative DEGES/edgeR and DEGES/TbT. A suggested choice of n = 3, consisting of seven steps, can be performed by changing the iteration argument (i.e., iteration = 3) as follows:
The suggested number of iterations is determined from the results of iDEGES/TbT; it is a number at which the accuracies of DEG identifications corresponding to the calculated normalization factors converge [17]. The number of iterations for iDEGES/edgeR is determined in the same way (see the Results and discussion section).
Normalization of twogroup count data without replicates
Most R packages are designed primarily for analyzing data including biological replications because the biological variability has to be accurately estimated to avoid spurious DE calls [21]. In fact, the functions for the DEG identification method implemented in edgeR (i.e., the exact test; ver. 3.0.4) do not allow one to perform an analysis without replicates, even though the TMM normalization method in the package can be used regardless of whether the data has replicates or not. Although the edgeR manual provides users with some ideas on how to perform the DE analysis, it is practically difficult to customize the analysis with DEGES to data without replicates.
However, there are still cases in which we have to perform DE analysis of tag count data without replicates. When obtaining normalization factors from twogroup count data without replicates, users can select from a total of four combinations (two normalization methods × two DEG identification methods) and a virtually arbitrary number of iterations (n = 0, 1, 2, …, 100) in our DEGESbased normalization pipeline. That is, the calcNormFactors function with the norm.method = “deseq” or “tmm” and test.method = “deseq” or “bayseq” can be selected. A pipeline (iDEGES/DESeq; the DESeq(DESeqDESeq)_{ n } pipeline with n = 3) using functions in the DESeq package for analyzing twogroup count data without replicates can be performed by changing the two arguments (norm.method and test.method) as follows:
Normalization of multigroup count data with replicates
Many R packages (including edgeR, DESeq, and baySeq) support DE analysis for multigroup tag count data. TCC provides DEGESbased normalization methods for such data by virtue of the three packages that are internally used in TCC. Similar to the analysis of twogroup count data with replicates, users can select from a total of six combinations (two normalization methods × three DEG identification methods) and a virtually arbitrary number of iterations (n = 0, 1, 2, …, 100) when obtaining normalization factors from multigroup data with replicates.
Retrieving normalized data
The calculated normalization factors can be obtained from tcc$norm.factors. Similar functions are the calcNormFactors function in the edgeR package and the estimateSizeFactors function in the DESeq package, (Figure 1a). Note that the terminology used in DESeq (i.e., size factors) is different from that used in edgeR (i.e., effective library sizes) and ours. The effective library size in edgeR is calculated as the library size multiplied by the normalization factor. The size factors in the DESeq package are comparable to the normalized effective library sizes, wherein the summary statistics for the effective library sizes are adjusted to one. Our normalization factors, which can be obtained from tcc$norm.factors, have the same names as those in edgeR. Since biologists are often interested in such information [19], we provide the getNormalizedData function for retrieving the normalized data. The normalized data can directly be used as, for example, the input data of another package NOISeq[13].
Differential expression
The goal of the analysis would be to obtain a list of DEGs. To this end, we provide the estimateDE function. The function internally calls the corresponding functions implemented in three packages: exactTest in edgeR, nbinomTest in DESeq, and getLikelihoods.NB in baySeq (Figure 1). If the user wants to perform the DE method implemented in edgeR and to determine the genes having an FDR threshold of 10% as DEGs, one can do as follows:
A similar analysis based on DE methods in DESeq or baySeq can be performed by changing the test.method parameter to “deseq” or “bayseq”. The results of the DE analysis are stored in the TCC class object. The summary statistics for the topranked genes can be retrieved by using the getResult function. In general, the identified DEGs at FDR < 0.1 should be upregulated in either G1 or G2. The plot function generates an MA plot, where “M” indicates the logratio (i.e., M = log_{2}G2  log_{2}G1) and “A” indicates the log average read count (i.e., A = (log_{2}G2 + log_{2}G1)/2) based on the normalized count data.
Generation of simulation data
As demonstrated in our previous study [17], the DEGESbased normalization methods implemented in TCC theoretically outperform the other normalization methods when the numbers of DEGs between groups are biased. However, the validation of the biased DE in real data is very difficult in practice [19]. Thus, the performance of methods handling biased DE needs to be evaluated using simulation data. The simulateReadCounts function generates simulation data with various conditions. Currently, this function can generate simulation data analyzed in the TbT paper [17], thereby enabling further comparisons of our DEGESbased methods with methods developed by other researchers in the near future. For example, the hypoData object, a hypothetical count dataset in TCC, was generated by using this function with the following arguments:
The simulation conditions for comparing two groups (G1 vs. G2) with biological replicates are as follows: (i) the number of genes is 1,000 (i.e., Ngene = 1000), (ii) the first 20% of genes are DEGs (PDEG = 0.2), (iii) the first 90% of the DEGs are upregulated in G1 and the remaining 10% are upregulated in G2 (DEG.assign = c(0.9, 0.1)), (iv) the levels of DE are fourfold in both groups (DEG.foldchange = c(4, 4)), and (v) there are a total of six samples (three biological replicates for G1 and three biological replicates for G2) (replicates = c(3, 3)). The empirical distribution of read counts is built from Arabidopsis data in NBPSeq[12].
The output of the simulateReadCounts function is stored in the TCC class object with information about the simulation conditions and is therefore readytoanalyze. This function can generate several kinds of simulation data, such as those for comparing four groups (Groups 14) with replicates and those for comparing two groups without replicates. See the vignette for details.
Results and discussion
Accurate data normalization is essential for obtaining wellranked gene lists from tag count data. Similar to other R packages such as edgeR, the TCC package has functionalities for DE analysis of tag count data. Of these functionalities, TCC provides multistep normalization methods (including DEGES/TbT [17]) that internally use the functions implemented in edgeR, DESeq, and baySeq. Here, we demonstrate that the DEGESbased normalization methods are more effective than the methods implemented in the other packages. All analyses were performed using R (ver. 2.15.2) and Bioconductor [20]. Execution times were measured on a Linux system (CentOS release 6.2 (Final), Intel® Xeon® E54617 (2.9 GHz) 24 CPU, and 512 GB memory). The versions of major R libraries were TCC ver. 1.1.99, edgeR ver. 3.0.4, DESeq ver. 1.10.1, and baySeq ver. 1.12.0.
Following our previous study [17], we here demonstrate the performance of these methods by using the same evaluation metric, simulation framework, and real experimental datasets. We use the area under the receiver operating characteristic (ROC) curve (i.e., AUC) as a means of comparison. The simulation conditions are as follows: 525% of the genes are DEGs (P_{DEG} = 525%), 5090% of the DEGs are upregulated in G1 compared to G2 (P_{G1} = 5090%), and the levels of DE are fourfold in both groups. The count dataset consists of 70,619 unique small RNAs (sRNAs) and a total of four Arabidopsis thaliana leaf samples (i.e., two wildtype [WT1 and WT2] and two RNAdependent RNA polymerase 6 (RDR6) knockout [KO1 and KO2] samples) [9]. This dataset was originally analyzed with baySeq. The data has 657 provisional true DE sRNAs (i.e., P_{DEG} = 657 / 70,619 = 0.93%), and all of the sRNAs can be regarded as upregulated in the wildtype (i.e., P_{G1} = 100%). This is because they uniquely match tasRNA, which is produced by RDR6, and they are downregulated in RDR6 mutants. In addition to the RDR6 knockout dataset, we also analyze four other experimental datasets (called “sultan” [4], “gilad” [22], “maqc” [23], and “katz.mouse” [24]) obtained from the ReCount database [25]. The four datasets are used for evaluating the DEGESbased methods aimed at twogroup count data with replicates.
Simulation data with replicates (sensitivity, specificity, and computation time)
We assessed the performance of a total of six normalization methods: (a) TMM, (b) DEGES/TbT, (c) DEGES/edgeR, (d) iDEGES/edgeR, (e) iDEGES/TDT (i.e., the TMM(DESeqTMM)_{ n } pipeline), and (f) iDEGES/DESeq. The ranked gene lists were obtained using the individual DEG identification methods in three packages (edgeR, DESeq, and baySeq), together with normalization factors calculated from each normalization method. Accordingly, a total of 18 combinations (six normalization methods × three DEG identification methods) were evaluated. Table 1 shows the average AUC values of 100 trials between the ranked gene lists and the truth for various simulation conditions (P_{DEG} = 525% and P_{G1} = 5090%). While the n iterations for the three iDEGESbased methods roughly require an nfold computation time, the improvement due to increasing the number of iterations plateaued around n = 3 when performing iDEGES/TbT [17]. Therefore, we decided to show AUC values for the three iDEGESbased methods with n = 3.
DEGES/edgeR performed comparably to TbT, whereas iDEGES/edgeR outperformed the others, irrespective of the choice of DEG identification method (edgeR, DESeq, and baySeq) after normalization. That is, iDEGES/edgeR followed by any DEG identification method YYY (termed the iDEGES/edgeRYYY combination) performed the best among the six normalization methods. These results demonstrate that the alternative DEGES approaches (iDEGES/edgeR) implemented in TCC generally outperform the original DEGES approach (i.e., DEGES/TbT). In other words, the use of the exact test [16] implemented in edgeR is sufficient to determine the potential DEGs to be eliminated. Advantageous characteristics for the exact test also revealed themselves after performing any normalization method: Comparing the three DEG identification methods (edgeR, DESeq, and baySeq), we see that the XXXedgeR have the highest AUC values. Overall, iDEGES/edgeRedgeR performed the best. It should be noted that, however, the AUC values for the pipeline look very close to those of the original recommendation (i.e., DEGES/TbTedgeR) [17], e.g., 90.30% for iDEGES/edgeRedgeR and 90.27% for DEGES/TbTedgeR under one simulation condition of P_{DEG} = 25% and P_{G1} = 70%.
Although the AUC values for the individual combinations are under 0.0042% of the standard deviation (see “Sheet 1” in Additional file 1) and are statistically significant (p < 0.01; see Table 1), some researchers may think the current recommendation (iDEGES/edgeRedgeR) does not look compelling. We do not argue the fact that iDEGES/edgeR performs comparably to DEGES/TbT (or DEGES/edgeR), regarding the absolute AUC values (i.e., sensitivity and specificity). We rather want to emphasize that iDEGES/edgeR outperforms DEGES/TbT in terms of computation time (Table 2). This table shows that iDEGES/edgeR takes roughly threetimes longer than DEGES/edgeR but it is over 100 times faster than DEGES/TbT. In light of the absolute computation times (shorter than 10 seconds), iDEGES/edgeR finishes execution within one minute. Although the TMM normalization method has the shortest computation time, its AUC values are clearly lower than those of the others, especially when the P_{G1} value is displaced from 50%. An evaluation based on sensitivity and specificity should take precedence over one based on the computation time. These results show that iDEGES/edgeR is good for analyzing twogroup tag count data with replicates because of its sensitivity, specificity, and practical computation time.
Simulation data with replicates (effect of iteration)
Next, let us show the effect of iterations in the iDEGES approach to see whether the iteration can truly produce a convergent result or not. Table 3 summarizes the results under three simulation conditions (P_{G1} = 50, 70, and 90% with a fixed P_{DEG} value of 20%). Of a total of 300 trials, iDEGES/edgeR yielded 251 convergent and 49 nonconvergent (i.e., cyclic) results. We got similar results for the three other iDEGES methods (iDEGES/TbT, iDEGES/TDT, and iDEGES/DESeq), irrespective of the simulation conditions. These results clearly indicate that the iDEGESbased normalization methods do not always produce convergent normalization factors, contrary to previous expectations [17].
Of practical interest when using the iDEGES approach is the number of iterations required for obtaining a convergent result. We defined n as N_{C} if the potential DEGs estimated at the (N_{C} + 1)^{th} iteration are the same as those in the (N_{C})^{th} iteration. The distribution of N_{C} values for the four iDEGESbased methods are given in Additional file 2. In our trials, the maximum N_{C} value was 8 (see “Sheet 1” in Additional file 2). The distribution suggests that the iDEGES approach with n = 8 could be sufficient for obtaining convergent results under various simulation scenarios. However, the improvement had by iDEGES/edgeR with n > = 4 compared with that with n = 3 is actually negligible despite the requirement of additional computation time (see “Sheet 3” in Additional file 2). Therefore, the use of our iDEGES pipeline with n = 3 can be recommended for reducing useless computation time.
We observed that the number of iterations needed for obtaining convergent results (the Nc value) tends to increase when the degree of biased DE is high (P_{G1} > 50%): the average N_{C} values for the iDEGES/edgeR under the three conditions of P_{G1} = 50, 70, and 90% were 2.17, 3.14, and 3.60, respectively (see “Sheet 1” in Additional file 2). This is reasonable because, in the case of the simulation with P_{G1} > > 50%, a relatively large number of n in the TMM(edgeRTMM)_{ n } pipeline is theoretically needed for obtaining accurate normalization factors and because, in the case of the simulation with P_{G1} = 50%, the theoretical P_{G1} value obtained from the potential DEGs in any iDEGESbased pipeline at n = 1 is 50% (i.e., no need to apply the iDEGES approach to such data).
Now, let us briefly discuss the nonconvergent results. All the nonconvergent results showed cyclic characteristics, that is, there exists a same set of potential DEGs estimated in both the i^{th} and the (i + N_{P})^{th} iterations within a trial (N_{P} > = 2). We found that the most frequent N_{P} value was 2 (see “Sheet 1” in Additional file 2). For example, iDEGES/edgeR yielded 83.7% (41/49) nonconvergent results with N_{P} = 2. This result indicates that two different sets of potential DEGs are alternately eliminated in the i^{th} and (i + 1)^{th} iterations. Note that different normalization methods do not seem to produce consistent results (convergent (C) or nonconvergent (P)) within a trial. Under one simulation condition (P_{DEG} = 20% and P_{G1} = 90%), for example, iDEGES/edgeR and iDEGES/TbT got 21 and 17 nonconvergent results, respectively. Of these, only three trials (i.e., the 8th, 10th, and 61th) showed the same nonconvergent results (see “Sheet 2” in Additional file 2).
We confirmed that the P_{DEG} values (satisfying FDR < 0.1) originally estimated by three methods (iDEGES/edgeR, iDEGES/TDT, and iDEGES/DESeq) in every iteration in Table 3 were above the predefined floor value of 5%, indicating that the floor value of 5% for P_{DEG} has no effect on whether the results converge or not. The iterative method can be viewed as a discrete dynamical system since the number of state is finite and determined by the combination of genes as potential DEGs. Oscillations in the trajectory are common phenomena in such dynamical systems. Thus, a method based on an optimality criterion should be developed to select the best point in the cycle after detecting an oscillation. Nevertheless, we observed an overall improvement regarding nonconvergent results when more iterations were used.
Simulation data without replicates
TCC also provides DEGESbased methods for normalizing twogroup data without replicates. As described previously, the DEG identification method (i.e., the exact test) in edgeR does not allow for an analysis without replicates. Accordingly, we evaluated a total of eight XXXYYY combinations (XXX = DESeq, iDEGES/DESeq, iDEGES/TDT, and DEGES/TbT; YYY = DESeq and baySeq). Here, the DESeqDESeq combination indicates the original procedure in DESeq. Different from the results for the data with replicates, iDEGES/DESeq (and iDEGES/TDT) performed better than DEGES/TbT in this case (see “Sheet 1” in Additional file 3). The same trend can be seen in the accuracies of the estimated DEGs: the accuracies calculated from baySeq (i.e., DEGES/TbT) were clearly inferior to those from DESeq, irrespective of the choice of normalization method (see “Sheet 2” in Additional file 3). The advantageous characteristics of the NB test in DESeq become apparent when we compare the two DEG identification methods (DESeq and baySeq) for any normalization method XXX: the XXXDESeq combination outperforms the XXXbaySeq combination. These results indicate that the higher AUC values of the iDEGES/DESeqDESeq are primarily by virtue of a wellranked gene list produced from the NB test and that constructing a model based on bootstrap resampling employed in baySeq is difficult in a nonreplicate situation.
We observed that the numbers of potential DEGs satisfying FDR < 0.1 in the iDEGES/DESeq were nearly zero (i.e., the estimated P_{DEG} values were 0%) in all of the simulations, although the P_{DEG} values were 525%. This is reasonable because any attempt to work without replicates will lead to conclusions of very limited reliability: DESeq employs a conservative approach to prevent spurious DE calls. Accordingly, a predefined floor P_{DEG} value (= 5%) was used; i.e., 5% of the topranked genes were not used when calculating the normalization factors in the iDEGES/DESeq and iDEGES/TDT methods. These facts indicate that (i) iDEGES/DESeq performs almost as well as the original normalization method in DESeq when the floor P_{DEG} value is decreased from the default (= 5%), (ii) the methods perform equally well when the floor P_{DEG} value of 0% for iDEGES/DESeq is used, and (iii) iDEGES/DESeq (and iDEGES/TDT) with a floor value of x% tends to work better when analyzing simulation data with the same P_{DEG} value. Note that we set the floor P_{DEG} value to 5% in order to obtain certain differences between the iDEGESbased methods and the original DESeq, but we should consider the appropriate choice of the parameter in the future. Results of iterations in the iDEGES approach (i.e., convergent or nonconvergent) when analyzing twogroup data without replicates were given in Additional file 4.
Real data (ArabidopsisRDR6 knockout dataset)
Here, we describe the results for a real dataset with replicates, i.e., {WT1, WT2} vs. {KO1, KO2}. The experiment was originally reported in Ref. [9]. Following the evaluation scheme in Table 1, we calculated the AUC values by using the 18 combinations shown in Table 4. Consistent with the results shown in Table 1, we found that the XXXedgeR combination (i.e., the use of the DEG identification method in edgeR after any normalization method) was the best, and consequently, we recommend using this combination to analyze twogroup data with replicates. Although the iDEGES/edgeRedgeR combination performed the best on simulation data with replicates (see Table 1), iDEGES/edgeR did not distinguish itself among the six XXXedgeR combinations. Note also that the AUC value for iDEGES/edgeR with n = 3 (65.70%) was inferior to that for iDEGES/edgeR with n = 1 (i.e., DEGES/edgeR; 69.28%), suggesting that the iterative approach had a negative impact on this TMM(edgeRTMM)_{ n } pipeline.
To see the effect of iteration, we investigated the changes in the AUC values for the four XXXedgeR combinations (XXX = iDEGES/TbT, iDEGES/edgeR, iDEGES/TDT, and iDEGES/DESeq) when the default floor P_{DEG} value (= 5%) was used (Figure 2a). The figure shows that the values from the iDEGES/edgeR pipeline (black lines) has a cyclic characteristics with N_{P} = 7, and the cycle starts at the 7th iteration (i.e., N_{S} = 7). In contrast, the AUC values from the other pipelines had convergent characteristics: Nc = 3 for iDEGES/TbT, 2 for iDEGES/DESeq, and 5 for iDEGES/TDT. Interestingly, the P_{DEG} values estimated by iDEGES/edgeR, iDEGES/TDT, and iDEGES/DESeq were 0% in every iteration. Because of that, the floor P_{DEG} value (5% default) was employed instead of the original values. Meanwhile, the estimated P_{DEG} values for iDEGES/TbT were above 5% in every iteration (ranging from 6.09% to 6.18%). These results indicate that the floor value of 5% for P_{DEG} has no effect on whether the results converge or not.
We observed that the compositions of the potential DE sRNAs were identical when the AUC values were the same (see, e.g., light blue line at n = 2 ~ 30). This was true even for cyclic results (e.g., the compositions obtained from iDEGES/edgeR at the 10th, 17th, and 24th iterations). The different AUC values, in turn, were due to the difference in the normalization factors at each iteration and the different compositions caused the next normalization factors to be different. We found that the relatively large difference in the AUC values among cycles for the nonconvergent results (approximately 3% difference) were due to (i) the paucity of variations among the 657 provisional true DE sRNAs and (ii) their low expression levels. Of the 657 sRNAs, there were only 143 unique patterns of counts across the four samples (i.e., WT1, WT2, KO1, and KO2). For example, there were 91 DE sRNAs that had one tag count only in WT1, i.e., (1, 0, 0, 0) and 200 DE sRNAs that had one count only in WT2, i.e., (0, 1, 0, 0). These two count patterns occupied 44.3% of the 657 DE sRNAs. Such lowcount DE sRNAs cannot be distinguished from other nonDE sRNAs if both patterns are identical. Indeed, we found that the positions for the lowcount DE sRNAs in the ranked gene list varied from iteration to iteration.
Similar to the paucity of variations regarding the count vectors for the 657 DE sRNAs, we found that out of a total of 70,619 sRNAs, there were only 2,535 unique patterns of sRNA counts. This indicates that many sRNAs displayed the same degrees of DE, and therefore, their ranks (or the AUC values calculated based on the ranks) will vary if slight changes are made to the calculated normalization factors. Indeed, we observed considerably different changes in the AUC values when a higher floor P_{DEG} value of 10% was used (Figure 2b); e.g., the results for iDEGES/TbT converged when a 5% floor was used, but they became cyclic when 10% was used (the red lines in the figure). We also found that iDEGES/ TbT, iDEGES/edgeR, and iDEGES/TDT performed about the same when a 10% floor was used. These results indicate that the large difference between AUC values in this dataset might be within the error range.
All results described above were obtained with the original raw count data, in accordance with the edgeR and baySeq design. However, we should note that the current procedure is different from one recommended in the TbT paper [17], where the data was scaled to counts per million (CPM) in each sample when the DEG identification method in baySeq (i.e., the empirical Bayesian method) was executed. This recommendation was derived from a comparison of uncertain AUC values between the raw count data and the CPM data of this dataset, and it is now questionable. Accordingly, all the procedures implemented in the current TCC are based on the original raw count data.
Real data (four ReCount datasets)
Lastly, let us show the results of four experimental datasets: three human RNAseq datasets (called “sultan” [4], “gilad” [22], and “maqc” [23]) and one mouse dataset (“katz.mouse” [24]). These datasets were obtained from the ReCount database [25]. Different from the RDR6 dataset and simulated datasets, we do not know the true DEGs for the four datasets, indicating that we cannot calculate the AUC values. We therefore investigated the effect of iteration regarding the potential DEGs to be eliminated (i.e., convergent or nonconvergent). Consistent with the RDR6 result, we observed several cyclic (P) characteristics for the ReCount datasets (see “Sheet 2” in Additional file 5).
It is important to evaluate the degree of impact for the cyclic results. In the gilad dataset, for example, we observed cyclic results for two DEGES pipelines: iDEGES/edgeR (N_{S} = 3 and N_{P} = 3) and iDEGES/TbT (N_{S} = 3 and N_{P} = 2). Figure 3 shows the results of hierarchical clustering applied to a total of 26 ranked gene lists (13 XXXedgeR and 13 XXXDESeq combinations). Two distinct clusters can be seen: each cluster (i) consists of 13 gene lists with the same DEG identification methods after performing different normalization methods (i.e., the XXXedgeR cluster and the XXXDESeq cluster) and (ii) has five cyclic results (denoted as “(iteration number) followed by combination name”; e.g., “(n = 4)iDEGES/TbTedgeR”). It is clear that the five cyclic results are quite similar to the eight other results. We also got similar results for another dataset (see Additional file 6). These results suggest that cyclic results are not of concern in practice.
Conclusion
The R package TCC provides users with a robust and accurate framework to perform DE analysis of tag count data. TCC has an improved data normalization step, compared with existing packages such as edgeR. While the other normalization strategies assume that there is an approximately balanced proportion of DEGs between compared samples (i.e., unbiased DE), our multistep DEGESbased normalization methods are designed to deal with various scenarios (including unbiased and biased DE situations): the internally used DEG identification method eliminates the effects of biases, if any, of potential DEGs. Our study demonstrated that the iDEGES/edgeRedgeR combination can be recommended for analyzing twogroup data with replicates (see Tables 1 and 2) and that the iDEGES/DESeqDESeq can be recommended for analyzing twogroup data without replicates (see Additional file 3). The success of these methods primarily owes to the high scalability of the normalization and DEG identification methods in the R packages used in TCC.
The functionality of TCC can be extended in many ways. First, the current study focuses on the analysis of twogroup data, but some users may want to utilize the DEGESbased methods for data consisting of two or more groups. As can be seen in the vignette, some prototypes of DEGESbased pipelines for analyzing data in three or four groups have already been implemented. Evaluations such as these and further improvement are our next tasks. Second, the current approach of TCC is somewhat reminiscent of a microarray analysis of the count matrix for genes. We know that (i) one or more isoforms can be transcribed in a same gene region, (ii) those transcripts may have distinct expression levels, and (iii) it could lead to a DE result at the transcript level but a nonDE result at the gene level. To prevent this possible problem and fully utilize the resolution of RNAseq data, advanced R packages (e.g., BitSeq[21] and DEXSeq[14]) have recently been developed. We believe that the use of those packages together with DEGES enables us to obtain a more reliable result, because the idea of DEGES can, of course, be applied to DE analysis at both genelevel and transcriptlevel resolutions.
Finally, the current DEGESbased normalization methods implemented in TCC only employ linear scaling normalization methods and statistical methods for identifying DEGs. This is because these scaling normalization methods do not change the shape of the original count distribution and the statistical methods for identifying DEGs in edgeR, DESeq, and baySeq assume the model. Technically speaking, we can construct many other pipelines consisting of, for example, a nonlinear normalization method (e.g., quantile normalization [26]) and a DEG identification method originally developed for microarray analysis (e.g., WAD [27]). As we learned from the microarrays, there are suitable combinations of normalization methods and DEG identification methods [28]: we speculate that a nonlinear normalization method would be incompatible with a statistical DE method. A critical evaluation of those pipelines will be of interest in the future and we will continue to offer uptodate guidelines.
Availability and requirements
Project name: TCC
Project home page: TCC is available at http://www.iu.a.utokyo.ac.jp/~kadota/TCC/ and will appear in Bioconductor (http://bioconductor.org/) from ver. 2.13.
Operating systems: Platform independent
Programming language: R
Other requirements: requires the edgeR, DESeq, baySeq, and ROC packages
License: GPL2
Any restrictions to use by nonacademics: None
Acknowledgments
The authors thank Dr. TJ Hardcastle for providing the dataset used in the baySeq study. This study was supported by grants (KAKENHI 24500359 to KK and 22128008 to TN) from the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT).
Abbreviations
 DE:

Differential expression
 DEG:

Differentially expressed gene
 DEGES:

DEG elimination strategy
 FDR:

False discovery rate
 CPM:

Counts per million (normalization)
 sRNA:

small RNA
 tasRNA:

TAS locusderived small RNA
 TMM:

Trimmed mean of M values (method)
 TbT:

the TMMbaySeqTMM pipeline
 TCC:

Tag count comparison.
References
Schuster SC: Nextgeneration sequencing transforms today’s biology. Nat Methods. 2008, 5: 1618.
Mardis ER: The impact of nextgeneration sequencing technology on genetics. Trends Genet. 2008, 24 (3): 133141. 10.1016/j.tig.2007.12.007.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Methods. 2008, 5 (7): 621628. 10.1038/nmeth.1226.
Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O’Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo ML: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008, 321 (5891): 956960. 10.1126/science.1160342.
Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNAseq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18 (9): 15091517. 10.1101/gr.079558.108.
Asmann YW, Klee EW, Thompson EA, Perez EA, Middha S, Oberg AL, Therneau TM, Smith DI, Poland GA, Wieben ED, Kocher JP: 3′ tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer. BMC Genomics. 2009, 10: 53110.1186/1471216410531.
R Development Core Team: R: A Language and Environment For Statistical Computing. 2011, Vienna, Austria: R Foundation for Statistical computing
Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26 (1): 139140. 10.1093/bioinformatics/btp616.
Hardcastle TJ, Kelly KA: baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010, 11: 42210.1186/1471210511422.
Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11: R10610.1186/gb20101110r106.
Wang L, Feng Z, Wang X, Wang X, Zhang X: DEGseq: an R package for identifying differentially expressed genes from RNAseq data. Bioinformatics. 2010, 26 (1): 136138. 10.1093/bioinformatics/btp612.
Di Y, Schafer DW, Cumbie JS, Chang JH: The NBP negative binomial model for assessing differential gene expression from RNASeq. Stat Appl Genet Mol Biol. 2011, 10: art24
Tarazona S, GarciaAlcalde F, Dopazo J, Ferrer A, Conesa A: Differential expression in RNAseq: a matter of depth. Genome Res. 2011, 21 (12): 22132223. 10.1101/gr.124321.111.
Anders S, Reyes A, Huber W: Detecting differential usage of exons from RNAseq data. Genome Res. 2012, 22 (10): 20082017. 10.1101/gr.133744.111.
Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 2010, 11: R2510.1186/gb2010113r25.
Robinson MD, Smyth GK: Smallsample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008, 9: 321332.
Kadota K, Nishiyama T, Shimizu K: A normalization strategy for comparing tag count data. Algorithms Mol Biol. 2012, 7: 510.1186/1748718875.
Garmire LX, Subramaniam S: Evaluation of normalization methods in mammalian microRNASeq data. RNA. 2012, 18: 12791288. 10.1261/rna.030916.111.
Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloë D, Le Gall C, Schaëffer B, Le Crom S, Guedj M, Jaffrézic F, on behalf of The French StatOmique Consortium: A comprehensive evaluation of normalization methods for Illumina highthroughput RNA sequencing data analysis. Brief Bioinform. 2012, 10.1093/bib/bbs046.
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R8010.1186/gb2004510r80.
Glaus P, Honkela A, Rattray M: Identifying differentially expressed transcripts from RNAseq data with biological variation. Bioinformatics. 2012, 28 (13): 17211728. 10.1093/bioinformatics/bts260.
Blekhman R, Marioni JC, Zumbo P, Stephens M, Gilad Y: Sexspecific and lineagespecific alternative splicing in primates. Genome Res. 2010, 20 (2): 180189. 10.1101/gr.099226.109.
Bullard JH, Purdom E, Hansen KD, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC Bioinformatics. 2010, 11: 9410.1186/147121051194.
Katz Y, Wang ET, Airoldi EM, Burge CB: Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Methods. 2010, 7 (12): 10091015. 10.1038/nmeth.1528.
Frazee AC, Langmead B, Leek JT: ReCount: A multiexpreriment resource of analysisready RNAseq gene count datasets. BMC Bioinformatics. 2011, 12: 44910.1186/1471210512449.
Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19 (2): 185193. 10.1093/bioinformatics/19.2.185.
Kadota K, Nakai Y, Shimizu K: A weighted average difference method for detecting differentially expressed genes from microarray data. Algorithms Mol Biol. 2008, 3: 810.1186/1748718838.
Kadota K, Nakai Y, Shimizu K: Ranking differentially expressed genes from Affymetrix gene expression data: methods with reproducibility, sensitivity, and specificity. Algorithms Mol Biol. 2009, 4: 710.1186/1748718847.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JS improved the TCC package, performed the analysis, and drafted the paper. TN drafted, maintained the TCC package, and refined the paper. KS supervised the critical discussion and refined the paper. KK drafted the R scripts, refined the paper, and conducted this project. All the authors read and approved the final manuscript.
Electronic supplementary material
12859_2013_5974_MOESM1_ESM.xls
Additional file 1: Additional results for simulation data with replicates. Sheet 1: Standard deviations of AUC values in Table 1 are shown. Legends are the same as given in Table 1. Sheet 2: Estimated values for P_{DEG} and accuracies of the potential DEGs are shown. Averages of 100 trials are shown. Values for (a) TMM do not exist because it does not estimate potential DEGs. Following the original TbT study, the P_{DEG} value for DEGES/TbT was directly obtained from the posterior probability output of baySeq (i.e., a floor value for P_{DEG} was set to 0%; floorPDEG = 0.00) and the accuracy was calculated on the basis of the estimated DEGs. The estimated P_{DEG} values for the other DEGESbased methods were calculated as relative numbers of genes satisfying FDR < 0.1 (i.e., FDR = 0.1). The accuracy was calculated on the basis of the potential DEGs if the P_{DEG} value was over 5% of the predefined floor P_{DEG} value (i.e., floorPDEG = 0.05) and, otherwise, the accuracy of the 5% of topranked genes was calculated as the surrogate DEGs. (XLS 34 KB)
12859_2013_5974_MOESM2_ESM.xls
Additional file 2: Effect of iterations for simulation data with replicates. Details of the results in Table 3 are shown. Sheet 1: The N_{C} and N_{P} values are shown: (a) iDEGES/TbT, (b) iDEGES/edgeR, (c) iDEGES/TDT, and (d) iDEGES/DESeq. Sheet 2: Raw results (convergent (C) or nonconvergent (P)) for each of the 100 trials are shown. The number to the right of the C (or P) indicates the N_{C} (or N_{P}) value. Sheet 3: Average AUC values (%) for the iDEGES/edgeRedgeR. AUC values with n = 16 for (a) all trials, (b) trials only having convergent results, and (c) trials only having nonconvergent results are shown. (XLS 78 KB)
12859_2013_5974_MOESM3_ESM.xls
Additional file 3: Results for simulation data without replicates. Results of four normalization methods (XXX = (a) DESeq, (b) iDEGES/DESeq, (c) iDEGES/TDT, and (d) DEGES/TbT) under each simulation condition are shown. Sheet 1: Average AUC values (%) of 100 trials and the standard deviations from the XXXDESeq and the XXXbaySeq combinations are shown. Legends are the same as in Table 1. Sheet 2: Estimated values for P_{DEG} and accuracies of the potential DEGs are shown. Averages of 100 trials are shown. Values for (a) DESeq do not exist because it does not estimate potential DEGs. Other legends are the same as in Additional file 1. Sheet 3: Average computation times (in seconds) of 100 trials for the four normalization methods are shown. (XLS 37 KB)
12859_2013_5974_MOESM4_ESM.xls
Additional file 4: Effect of iterations for simulation data without replicates. Sheet 1: Summary of convergent or nonconvergent (cyclic) results of 100 trials for three iDEGESbased normalization methods: (a) iDEGES/DESeq, (b) iDEGES/TDT, and (c) iDEGES/TbT. Sheet 2: Raw results (convergent (C) or nonconvergent (P)) for each of the 100 trials are shown. Sheet 3: Average AUC values (%) for the iDEGES/DESeqDESeq. AUC values with n = 16 for (a) all trials, (b) trials only having convergent results, and (c) trials only having nonconvergent results are shown. The characteristics of N_{P} = 2 for nonconvergent results can be seen. (XLS 45 KB)
12859_2013_5974_MOESM5_ESM.xls
Additional file 5: Effect of iterations for real data. Sheet 1: Basic information for the four ReCount datasets as well as the RDR6 dataset are shown. The dataset names except for the RDR6 one are the same as those provided in the ReCount database. The numbers in the “nonzero counts” column indicate the numbers of genes (or transcripts or rows) having nonzero counts for at least one of the compared groups. The numbers in the “unique patterns” column indicate the numbers of unique patterns among the genes having nonzero counts. Sheet 2: Results (convergent (C) or nonconvergent (P)) of four iDEGESbased normalization methods when using floor P_{DEG} value settings of (a) 5% and (b) 10% are shown. The number to the right of the C (or P) indicates the N_{C} (or N_{P}) value. The analysis were performed using nonzero count data. Note that the results of iDEGES/edgeR for the maqc dataset were not shown because the method cannot be applied to data without replicates. (XLS 32 KB)
12859_2013_5974_MOESM6_ESM.ppt
Additional file 6: Dendrogram of averagelinkage hierarchical clustering for a katz.mouse dataset. Legends are basically the same as those in Figure 3. Gene lists having iteration numbers on the right side correspond to the results of the iDEGES/TbTYYY combinations: the iDEGES/TbT pipeline for this dataset converged after the fifth iteration (i.e., N_{C} = 5; see “Sheet 2” in Additional file 5). It can be seen that the ranked gene lists obtained from the same combination with different iteration numbers (i.e., the five lists obtained from the iDEGES/TbTYYY combination with n = 1 ~ 5) are quite similar to the lists obtained from the other XXXYYY combinations if YYY is the same. (PPT 108 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Sun, J., Nishiyama, T., Shimizu, K. et al. TCC: an R package for comparing tag count data with robust normalization strategies. BMC Bioinformatics 14, 219 (2013). https://doi.org/10.1186/1471210514219
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/1471210514219