 Methodology article
 Open Access
 Published:
Comparative evaluation of gene set analysis approaches for RNASeq data
BMC Bioinformatics volume 15, Article number: 397 (2014)
Abstract
Background
Over the last few years transcriptome sequencing (RNASeq) has almost completely taken over microarrays for highthroughput studies of gene expression. Currently, the most popular use of RNASeq is to identify genes which are differentially expressed between two or more conditions. Despite the importance of Gene Set Analysis (GSA) in the interpretation of the results from RNASeq experiments, the limitations of GSA methods developed for microarrays in the context of RNASeq data are not well understood.
Results
We provide a thorough evaluation of popular multivariate and genelevel selfcontained GSA approaches on simulated and real RNASeq data. The multivariate approach employs multivariate nonparametric tests combined with popular normalizations for RNASeq data. The genelevel approach utilizes univariate tests designed for the analysis of RNASeq data to find genespecific Pvalues and combines them into a pathway Pvalue using classical statistical techniques. Our results demonstrate that the Type I error rate and the power of multivariate tests depend only on the test statistics and are insensitive to the different normalizations. In general standard multivariate GSA tests detect pathways that do not have any bias in terms of pathways size, percentage of differentially expressed genes, or average gene length in a pathway. In contrast the Type I error rate and the power of genelevel GSA tests are heavily affected by the methods for combining Pvalues, and all aforementioned biases are present in detected pathways.
Conclusions
Our result emphasizes the importance of using selfcontained nonparametric multivariate tests for detecting differentially expressed pathways for RNASeq data and warns against applying genelevel GSA tests, especially because of their high level of Type I error rates for both, simulated and real data.
Background
Over the last few years transcriptome deep sequencing (RNASeq) has almost completely taken over microarrays for highthroughput studies of gene expression. In contrast to microarrays, RNASeq technology quantifies expression in counts of transcript reads mapped to a genomic region [1],[2]. These read counts are integer numbers ranging from zero to millions. This is why approaches that were developed for the analysis of microarray data are generally not applicable to the analysis of RNASeq data: microarray approaches model the gene expression by continuous distributions. The most common use of RNASeq has been identifying genes that are differentially expressed (DE) between two or more conditions. Typically, gene counts are modeled using Poisson or Negative Binomial (NB) distribution, and several commonly used software packages such as edgeR [3], DESeq [4], and SamSeq [5] adapted for RNASeq, are freely available. Recently it was suggested to transform RNASeq count data prior to the analysis and apply normalbased microarraylike statistical methods, e.g. the limma pipeline [6] to RNASeq data [7].
Similarly, a decade ago, the focus of microarrays data analysis was also on finding DE genes. The methods for microarray data were dominated by univariate twosample statistical tests for finding DE genes. However, it was quickly recognized that (1) biologically relevant genes with small changes in expression are almost always absent in the list of statistically significant DE genes, detected using twosample tests with the correction for multiple testing [8], and (2) because genes do not work in isolation, statistical tests need to account for the multivariate nature of expression changes [9],[10]. To address the shortcomings of genelevel analyses, conceptually new approaches were suggested which operated with gene sets, i.e. treating a gene set as an expression unit. Importantly, differentially expressed gene sets (such as biological pathways) incorporate existing biological knowledge into the analysis, thus providing more explanatory power than a long list of seemingly unrelated genes [9]. To date many methodologies for testing differential expression of gene sets have been suggested and are collectively named Gene Set Analysis (GSA) approaches [10][13].
GSA approaches can be either competitive or selfcontained. Competitive approaches compare a gene set against its complement that contains all genes except the genes in the set, and selfcontained approaches test whether a gene set is differentially expressed between two phenotypes [14],[15]. Another technique that incorporates biological knowledge into the analysis, that requires a list of preselected DE genes to proceed, is the gene set overrepresentation analysis. Here, a set of preselected significantly DE genes is tested for overrepresentation in annotated gene sets such as Gene Ontology (GO) categories or Kyoto Encyclopedia of genes and genomes (KEGG) using standard statistical tests for enrichment [16]. A shortcoming of the overrepresentation approach is that it still requires a preselected gene list and genes with small changes may not be accounted for [10].
The first competitive GSA test for microarray data analysis (Gene Set Enrichment Analysis, GSEA [8]) was developed a decade ago, and in the last decade pathways analysis for microarray data has become a method of choice for explaining the biology underlying the experimental results [10],[17],[18]. One would expect there to be plenty of GSA approaches suitable for RNASeq data analysis, yet welltested and justified methods are scarce. The first approach, adapting GSA for RNASeq data, was suggested by Young and colleagues [19]. They developed GOseq, a GO categories overrepresentation analysis that accounts for the overdetection of GO categories enriched with long and highly expressed genes in RNASeq data. Next, a nonparametric competitive GSA approach named GSVA (Gene Set Variation Analysis) has demonstrated highly correlated results between microarrays and RNASeq sets of samples of lymphoblastoids, cell lines which have been profiled by both technologies [20]. Shortly after, Wang and Cairns [21] suggested SeqGSEA, an adaptation of GSEA to RNASeq data. All of the aforementioned approaches are not without inherent biases: GOSeq results depend on the methods selected for finding DE genes [19], and competitive approaches (in particular GSEA) are influenced by the filtering of the data and can even increase their power by the addition of unrelated data and noise [22].
The discussion about the possibility of using selfcontained genelevel tests for GSA for microarrays data was ongoing for a long time: such tests are straightforward and can be easily designed [11]. Some authors (e.g. [23],[24]) were recommending to use genelevel tests for GSA. At the same time, because these tests are not truly multivariate and have much lower power compared to multivariate approaches, some authors [18] were advising against the application of genelevel tests for GSA. In a recent publication genelevel tests were claimed to be the first method of choice for GSA of RNASeq data [25]. In the simulation study expression data (reads) were taken from a multivariate normal distribution [25]. Because reads are integer numbers and are usually modeled using Poisson or Negative Binomial distribution, the simulation results of the study [25] may be inconclusive.
Thus far, except for genelevel GSA tests [25], the power and Type I error rates of selfcontained approaches were not examined in the context of RNASeq data. Here we study the performance of several selfcontained GSA approaches – multivariate and genelevel – for finding differentially expressed pathways in RNASeq data. The goals of our study are to: 1) describe several nonparametric multivariate GSA approaches developed for microarray data [18],[26] that do not have distributional assumptions and are readily applicable to RNASeq data given proper normalization; 2) evaluate the performance of the four most commonly used RNASeq normalization approaches in combination with the aforementioned nonparametric multivariate GSA; 3) describe how univariate tests specifically designed for finding DE genes in RNASeq data can be extended to genelevel GSA tests by using procedures for combining genes Pvalues into a pathway Pvalue (Fisher’s combining probabilities Method (FM) [27], Stouffer’s Method (SM) [28] and the soft thresholding Gamma Method (GM) [25]); 4) evaluate the performance of the three most commonly used univariate tests for the analysis of RNASeq data (edgeR, DESeq, and eBayes) in combination with approaches for combining genes Pvalues into a pathway Pvalue; and 5) provide comparative power and Type I error rate analyses for multivariate and genelevel GSA tests.
In addition we evaluate whether nonparametric multivariate GSA approaches with different normalizations as well as genelevel GSA tests are prone to different types of selection biases. We check all GSA approaches for overdetection of pathways enriched with long genes. This bias was shown to exist in gene set overrepresentation analysis [19], but it is currently unknown whether it exists in GSA approaches. We also check whether GSA approaches overdetect pathways with small (large) number of genes and small (large) percentage of differentially expressed genes. In conclusion, we provide some recommendations for employing selfcontained GSA approaches given RNASeq data.
In what follows we briefly describe several multivariate nonparametric tests [18],[26]. We also consider the multivariate ROAST test [29] designed for microarray data but, given proper normalization, also applicable to RNASeq. Then we discuss approaches for combining Pvalues from univariate tests, such as edgeR, DESeq, and eBayes, specifically designed for the analysis of differential gene expression using RNASeq data sets into a pathway Pvalue. Approaches for RNASeq data normalizations together with a brief description of biological and simulated data used for testing purposes are presented in the end of this section.
Methods
Hypothesis testing
Statistically speaking the problem of finding differentially expressed pathways is a hypothesis testing problem. Consider two different phenotypes with n _{1} samples of measurements of p genes for the first and n _{2} samples of measurement of the same p genes for the second phenotype. Let the two pdimensional random vectors of measurements X = (X _{1},…, X _{n1)} and Y = (Y _{1},…, Y _{n2}) be independent and identically distributed with the distribution functions F, G, mean vectors μ _{ x }, μ _{ y } and p × p covariance matrices S _{ x } , S _{ y }. We consider the problem of testing the general hypothesis H _{ 0 }: F = G against an alternative H _{1}: F ≠ G, or a restricted hypothesis H _{ 0 }: μ _{ x } = μ _{ y } against an alternative H _{1}: μ _{ x } ≠ μ _{ y }, depending on the test statistic.
Multivariate tests
We adopted the multivariate generalization of the WaldWolfowitz (WW) and KolmogorovSmirnov (KS) tests [18] as suggested by Friedman and Rafsky [26]. These two tests were not used before in the context of pathway analysis with RNASeq data. The multivariate generalization is based on the minimum spanning tree (MST) of the complete network (graph) generated from gene expression data.
For an edgeweighted graph G(V,E) where V is the set of vertices and E is the set of edges, the MST is defined as the acyclic subset T ⊆ E that connects all vertices in V and whose total length ∑_{i,j ∈ T} d(v _{ i }, v _{ j }) is minimal. For the pdimensional observations X and Y, an edgeweighted complete graph can be constructed with N nodes and N(N1)/2 edge weights estimated by the Euclidean (or any other) distance measure between pairs of points in R ^{p}. The MST of such graph connects all N nodes (vertices) that are close in R ^{p} with N1 nodes.
For a univariate twosample test (p = 1), the KS test begins by sorting the N = n _{1} + n _{2} observations in ascending order. Then, observations are ranked and the quantity d _{ i } = r _{ i }/n _{1} − s _{ i }/n _{2} is calculated where r _{ i } (s _{ i }) is the number of observations in X (Y) ranked lower than i, 1 ≤ i ≤ N. The test statistic is the maximal absolute difference D = max _{ i }d _{ i }, and H _{ 0 }: μ _{ x } = μ _{ y } is rejected for large D. The multivariate generalization of the KS test ranks multivariate observations based on their MST to obtain the strong relation between observations differences in ranks and their distances in R ^{p}. The MST is rooted at a node with the largest geodesic distance, and then the nodes are ranked in the high directed preorder (HDP) traversal of the tree [26]. Then, the test statistic D is found for the ranked nodes. The null distribution of D is estimated using samples label permutations, and H _{ 0 }: μ _{ x } = μ _{ y } is rejected for a large observed D [26].
For a univariate twosample test (p = 1), the WW test begins by sorting the N = n _{1} + n _{2} observations in ascending order. Then, each observation is replaced by its phenotype label (X or Y), and the number of runs (R) is calculated where R is a consecutive sequence of identical labels. In the multivariate generalization of the WW test, all edges of MST incident between nodes belonging to different phenotype labels (X and Y) are removed, and the number of the remaining disjoint subtrees (R) is calculated. The permutation distribution of the standardized number of subtrees is asymptotically normal, and H _{ 0 }: μ _{ x } = μ _{ y } is rejected for a small number of subtrees [26].
We consider two other multivariate test statistics based on their high power and popularity. Nstatistic [30],[31] tests the most general hypothesis H _{ 0 }: F = G against a twosided alternative H _{1}: F ≠ G:
Here we consider only L(X, Y) = ∥ X − Y ∥, the Euclidian distance in R ^{p}.
In the context of microarray data, a parametric multivariate rotation gene set test (ROAST) became popular for the selfcontained GSA approaches [29]. ROAST uses the framework of linear models and tests whether, for all genes in a set, a particular contrast of the coefficients is nonzero [29]. It accounts for correlations between genes and has the flexibility of using different alternative hypotheses, testing whether the direction of changes for a gene in a set is up, down or mixed (up or down) [29]. For all comparisons implemented here the mixed hypothesis was selected. Applying ROAST to RNASeq data requires count normalization first. The VOOM normalization [7] was proposed specifically for this purpose where log counts, normalized for sequence depth, are used. In addition to counts normalization, VOOM calculates associated precision weights which can be incorporated into the linear modeling process within ROAST to eliminate the meanvariance trend in the normalized counts [7]. Considering that this feature is suited specifically for ROAST, we apply VOOM normalization with ROAST and do not apply any other normalization (except normalizing for gene length, see below).
Combining Pvalues obtained using univariate tests for RNASeq
One way of designing a GSA test is to combine univariate statistics for individual genes [11],[18]; we refer to this technique as ‘genelevel GSA’ in what follows. There are two popular univariate tests specifically designed for RNASeq data that rely on Negative Binomial model for read counts: edgeR [3] and DESeq [4]. Empirical Bayes method (eBayes [6]) correctly identifies hypervariable genes in the context of microarray data and, when adapted for RNASeq data through VOOM normalization [7], should be a powerful approach. Thus, in our comparative power analysis of genelevel GSA approaches, we include the following univariate tests: edgeR, DESeq, and eBayes. It should be noted that RNASeq counts are normalized for each test based on its recommended normalization method only.
The key question in designing a genelevel GSA test is how to combine statistics (Pvalues) from individual genes into a single gene set score (Pvalue). The problem of combining Pvalues has been recognized and studied for a long time (Fisher’s combining probabilities test [27]). Many methods for combining Pvalues are available and can usually be expressed in a form of T = ∑_{ i }H(p _{ i }), where Pvales are transformed by a function H [32]. In particular, Fisher’s method (FM) uses H(p _{ i }) = − 2ln(p _{ i }) and Stouffer’s method (SM) uses H to be the inverse normal distribution function [28].
Gamma Method (GM) is based on summing the transformed genelevel Pvalues using an inverse gamma cumulative distribution function ${G}_{w,1}^{1}$ where w is the shape parameter, i.e. the combined test statistic is given by $T={\sum}_{i}{G}_{w,1}^{1}\left(1{p}_{i}\right)$ [33]. The shape parameter w controls the amount of emphasis given to genelevel Pvalues below a particular threshold. This feature is imposed by any transformation function H and is referred to as soft truncation threshold (STT) [33]. It is useful when there is pronounced heterogeneity in effects. The STT is controlled by w such that $w={G}_{w,1}^{1}\left(1STT\right)$. When w is large, GM becomes equivalent to the inverse normal Stouffer’s method which has STT = 0.5, and when it is 1 it becomes equivalent to Fisher’s method with STT = 1/e. Fridley et al. examined the performance of GM with various STT values and reported that STT values between 0.01 and 0.36 tend to give the best power [25]. For our study we chose w = 0.0137 that gives STT = 0.5. (For more detailed description of the methods for combining Pvalues see Additional file 1).
Approaches to normalize RNASeq data before applying multivariate tests
Similar to microarray data [34],[35], RNASeq data should be properly normalized before any further statistical tests can be applied. Raw counts are neither directly comparable between genes within one sample, nor between samples for the same gene. The counts of each gene are expected to be proportional to both gene abundance and gene length because longer genes produce more reads in the sequencing process. The counts will also vary between samples as a result of differences in the total number of mapped counts per sample (library size or sequencing depth). The first normalization for RNASeq data, ‘reads per kilobase per million’ (RPKM), was suggested by Mortazavi et al. [36] and was supposed to guard against overdetection of longer and more highly expressed genes. Recently, it was found that RPKM tends to identify weakly expressed genes as differentially expressed [37] and is not able to remove the length bias properly [19],[37]. Oshlack and Wakefield [38] have demonstrated that the ttest power has a dependency on the square root of gene length even after RPKM normalization. While RPKM remains very popular, a number of other normalizations were suggested [4],[39][41]. We employed three frequently used RNASeq normalization strategies to examine the performance of multivariate tests: the read per kilobase per million (RPKM) [36], the quantilequantile normalization (QQN) [40], and the trimmed mean of Mvalues (TMM) [39]. Since both QQN and TMM ignore gene lengths, they are followed by RPKM to account for withinsample differences (see Additional file 1).
Instead of searching for better normalization, an alternative way of analyzing RNASeq data is to find a count data transformation such that all approaches developed for microarray data will become applicable [7]. It was shown that log counts, normalized for sequence depth, serve perfectly for this purpose when finding DE genes (VOOM [7] function in the limma package [6]). Since VOOM achieves betweensamples normalization only, we followed it with RPKM normalization to account for gene length differences. VOOM returns normalized data in a log scale, so, before applying the RPKM normalization, the data were backtransformed to a linear scale.
Importantly, none of these normalizations (except RPKM for GO analysis [19]) have been tested in the context of GSA approaches. Here we provide the comparative power analysis of multivariate GSA approaches relying on the four aforementioned normalizations.
Sample permutation
The null distribution of the test statistics used for the WW, KS, and Nstatistic tests are estimated using sample permutations where sample phenotype labels (X and Y) are permuted randomly and the test statistic is calculated many times. To get reasonable estimates here this process was repeated 1,000 times. The empirical estimate of a Pvalue for a gene set is then taken as the proportion of permutations yielding a test statistic more extreme than the observed one from the original gene set. The same procedure was employed to compute the combined Pvalue P _{ c } for a gene set after genelevel Pvalues are transformed and combined. This is necessary due to the lack of independence between genes which renders the parametric approach inaccurate.
Biological data and pathways
We analyzed the subset of the data from Pickrell et al. [42], the sequenced RNA from lymphoblastoid cell lines (LCL) in 69 Nigerian individuals. We selected 58 unrelated individuals (parents), 29 males and 29 females. Pickrell et al. [42] dataset (the ‘Nigerian dataset’ in what follows) is attractive because there are two natural sets of True Positives: genes that are escaping Xchromosome inactivation and are therefore overexpressed in females (XiE), and genes that are located on malespecific region of Y chromosome and are therefore overexpressed in males (msY). The dataset also contains a natural set of False Positives: all Xlinked genes that are not escaping inactivation (Xi, 387 genes after filtering). See Additional file 1 for more details.
Gene counts were obtained by detecting the overlaps between mapped short reads and the list of genomic ranges (of exons) under each gene using the Bioconductor GenomicRanges package (version 1.12.5). Short reads, which have nonunique mappings, were discarded. After filtering, the resulting count matrix had a total of 13,191 annotated genes and 58 samples. The normalized counts were transformed to logscale using the function log_{2}(1 + Y _{ ij }) to further reduce the effects of outliers. (For more detailed description of the Pickrell et al. [42] data preprocessing steps see Additional file 1).
Except Xi, msY, and XiE other gene sets were taken from the C2 pathways set of the molecular signature database (MSigDB) [43]. These gene sets were curated from online databases, biomedical literature, and knowledge of domain experts. Genes not present in the filtered dataset were discarded, and only pathways with the number of genes (p) in the range of 10 ≤ p ≤ 500 were included. The resulted dataset comprised 12,051 genes and 4,020 pathways. One C2 pathway, DISTECHE_ESCAPED_FROM_X_INACTIVATION (DEX), contains 13 Xlinked genes found in our filtered dataset that were reported to escape inactivation [44]. While we can’t be sure if the other C2 pathways are differentially expressed between males and females, we expect that at least the three aforementioned pathways (msY, XiE and DEX) should be, and the Xi pathway should not be detected by any GSA test. Additional file 2 provides lists of all the genes and their descriptions in msY, XiE, DEX and Xi gene sets.
Simulation of RNASeq counts
We model the count for a gene i in sample j by a random variable Y _{ ij } from Negative Binomial (NB) distribution Y _{ ij } ~ NB(mean = μ _{ ij }, var = μ _{ ij }(1 + μ _{ ij }φ _{ ij })) = NB(μ _{ ij }, φ _{ ij }), where μ _{ ij } and φ _{ ij } are respectively the mean count and dispersion parameter of gene i in sample j. For each gene in a gene set, a vector of mean counts, dispersion, and gene length information (μ _{ i }, φ _{ i }, L _{ i }), is randomly selected from a pool of vectors derived from the processed Nigerian dataset (see Additional file 1). The dispersion parameter for each gene was estimated using the Bioconductor package edgeR (version 3.4.2) by the empirical Bayes method [45]. Counts, normalized using different approaches, were transformed to logscale using the transformation function log_{2}(1 + Y _{ ij }) to further reduce the effects of outliers. Additional file 3: Figure S2 and S3 show the density and histogram plots for the original counts and NB simulated counts before and after different normalizations. The simulated counts match the original counts reasonably well.
To evaluate the tests performance as accurately as possible, simulation experiments should mimic real expression data as closely as possible. In a real biological setting, not all genes in a gene set are differentially expressed, and the fold changes of genes between different phenotypes can vary. Therefore, we introduced two parameters: γ, the percentage of genes truly differentially expressed in a gene set; and FC, the amount of fold change in gene counts between two phenotypes. These parameters are expected to influence the power of different tests on a different scale. For the γ parameter, we consider γ∈{1/8, 1/4, 1/2}, and for the parameter FC, the values span the range from 1.2 to 3. Using simulations we assess the detection power for all tests by testing the hypothesis H _{ 0 }: μ _{ x } = μ _{ y } (or H _{ 0 }: FC = 1) against an alternative H _{1}: μ _{ x } ≠ μ _{ y } (or H _{1}: FC ≠ 1).
We simulated two datasets of equal sample size, N/2 (N = 20 and N = 40) forming 1,000 nonoverlapping gene sets, each constructed from p random realizations of NB distribution. These two datasets represent two biological conditions with different outcomes. For a gene set in one phenotype, we generate p random realizations of NB distribution with parameters (μ _{ i }, φ _{ i }). For the same gene set in the second phenotype, we generate NB realizations with parameters (FC μ _{ i }, φ _{ i }) when i ≤ γp represents DE genes and NB realizations with parameters (μ _{ i }, φ _{ i }) when i > γp represents nonDE genes. Two cases were considered in our simulations: when the number of genes in a gene set is relatively small (p = 16) or when the number is relatively large (p = 100). To avoid having all the DE genes upregulated for all generated gene sets in one phenotype, we swapped the generated counts for half of the DE genes (1 ≤ i ≤ γp/2) between the two phenotypes. Hence, now in each generated gene set, half of the DE genes are upregulated and half are downregulated between the two phenotypes. This will also avoid the problem of having large differences in total counts per sample between the two phenotypes.
To estimate the Type I error rates for all tests using simulated count data, we set FC and γ to 1 and simulated two datasets of equal sample size, N/2 (N∈{20,40,60}) from 1,000 gene sets, each constructed form p random realizations of Negative Binomial distribution with parameters (μ _{ i }, φ _{ i }) where p∈{16,60,100}. Then, we estimate the proportion of gene sets that reject H _{ 0 }: μ _{ x } = μ _{ y } (or H _{ 0 }: FC = 1) among the 1,000 generated sets.
Results
Simulation study
Type I error rate
Table 1 presents the estimates of the attained significant levels for the multivariate tests with different normalizations. As expected as the sample size N increases, the Type I error rates decrease. When the sample size is small (N = 20), Nstatistic with VOOM normalization gives the most conservative Type I error rate, followed by ROAST (for p = 16, 60). This can be explained by VOOM’s ability to model the meanvariance relationship of count data for small N. But when the sample size is larger, TMM almost always gives more conservative estimates than VOOM (except when N = 60, p = 100). WW seems to be the most liberal among multivariate tests, followed by KS. For every test the Type I error rate is virtually unaffected when the number of genes in a pathway (p) increases.
We next consider Type I error rates for genelevel GSA tests that use univariate RNASeq specific tests (edgeR, DESeq and eBayes) and employ different methods for combining Pvalues (FM, SM and GM with STT = 0.05). To better understand the functional relationship between the transformed and the original Pvalues we applied the transformation functions H (used by FM, SM and GM with STT = 0.05) to a range of Pvalues (Pvalue is changing from 10^{−5} to 1 with the step of 10^{−5}, Figure 1).
Figure 1 shows interesting biases that are introduced by different transformations (FM, SM and GM). First, GM is only sensitive to the extremely small Pvalues and virtually ignores all the others. In practice it means that gene sets with a large number of genes will be called DE by tests with GM more frequently than gene sets with a small number of genes. This is expected because, by pure chance alone, gene sets with a large number of genes have higher probability to contain genes with extremely small Pvalues, and GM ignores all the others. Second, FM accounts not only for the extremely small Pvalues, but also for generally small Pvalues, as well as large Pvalues. Therefore, tests with FM would call a gene set DE if and only if most of the genes in a gene set have small Pvalues. Gene sets with a large number of genes will be called DE by tests with FM less frequently than gene sets with a small number of genes, because, again, by pure chance alone, gene sets with a large number of genes have higher probability to contain genes with large Pvalues and large Pvalues affect the FM score (Figure 1). Third, unlike FM and GM, SM maps Pvalues less than 0.5 and greater than 0.5 to positive and negative values with magnitudes depending on the deviation from 0.5 (Figure 1). As a result tests with SM would call a gene set DE if and only if all genes in a set have small Pvalues. Similar to tests with FM, tests with SM are expected to call DE gene sets with a small number of genes.
The simulation results clearly demonstrate that the Type I error rates are influenced by the aforementioned biases introduced by different transformation functions (FM, SM and GM). As expected, for all genelevel GSA approaches that use univariate tests and different transformation functions to combine Pvalues, tests with GM show the highest Type I error, followed by tests with FM and SM (Table 2, Figure 1). Also, for any Pvalues combining method, edgeR shows the highest Type I error, followed by DESeq and eBayes respectively. In addition, with GM transformation, when the number of genes in a gene set (p) increases, especially for edgeR and DESeq, the Type I error rate becomes extremely high.
The power to detect shift alternatives
Figure 2 presents the power estimates for the Nstatistic, WW and KS multivariate tests with different normalizations and ROAST with only VOOM followed by RPKM normalization (see Section Multivariate tests), when H _{1}: μ _{ x } ≠ μ _{ y } is true (N = 20, p = 16). It appears that ROAST outperforms all the other approaches followed respectively by the Nstatistic, WW, and KS. Different normalizations do not affect the tests’ power at all (Figure 2). When N = 20 and p = 100 (Additional file 3: Figure S3), N = 40 and p = 16 (Additional file 3: Figure S4), N = 40 and p = 100 (Additional file 3: Figure S5) the results are similar, but the power to detect even small fold changes is higher for all tests.
Figure 3 presents the power estimates for genelevel GSA approaches that use univariate tests (edgeR, DESeq, and eBayes) and employ different methods for combining Pvalues (FM, SM, and GM with STT = 0.05) when H _{1} is true (N = 20, p = 16). When the percentage of truly differentially expressed genes is small (γ = 1/8), all three tests that apply GM have slightly higher power than those tests with FM, while the power of tests with SM is much smaller. When γ increases (from the top to the bottom on each panel of Figure 3) the difference between tests with GM and tests with FM diminishes, and the power of tests with SM becomes very close to the power of tests with FM and GM. The results when N = 20 and p = 100 (Additional file 3: Figure S6), N = 40 and p = 16 (Additional file 3: Figure S7) and N = 40 and p = 100 (Additional file 3: Figure S8) are similar, but the power to detect even small fold changes is higher for all tests. Comparing the performance of the three univariate tests under each Pvalue combining method shows that edgeR has slightly higher power than DESeq and eBayes, with both FM and GM, while eBayes has slightly higher power than edgeR and DESeq with SM (Additional file 3: Figure S9). Additional file 3: Figure S10 (N = 20 and p = 100), Additional file 3: Figure S11 (N = 40 and p = 16), and Additional file 3: Figure S12 (N = 40 and p = 100) demonstrate a similar pattern with even more insignificant differences.
To summarize, Figures 2 and 3 demonstrate, that when a gene set has only a few differentially expressed genes (γ = 1/8), edgeR (with GM or FM) has a higher power to detect very small fold changes than the other multivariate and genelevel GSA methods. However, when γ = 1/4 and γ = 1/2, ROAST has the same power as edgeR with GM or FM. It should be noted that the higher power of edgeR with GM or FM is caused by the higher Type I error of edgeR with GM or FM (Table 2 and see below).
The analysis of the Nigerian dataset
Type I error rate
To estimate how different tests control the Type I error rate for the real data, we performed intracondition comparisons using only male samples from the Nigerian dataset. The male samples were randomly distributed over two groups, and GSA was conducted using all tests over C2 pathways from the MSigDB [43] database. There should be no gene sets differentially expressed between these two groups. The Type I error rate was averaged over 100 sample permutations (Table 3). For multivariate tests, ROAST has the lowest average Type I error rate, followed by Nstatistic, KS and WW. Similar to the simulated data when the sample size is large, for real data TMM and QQN normalizations have lower average Type I errors than RPKM and VOOM.
Interestingly, for genelevel GSA tests with different Pvalues transformations (FM, SM, GM), the Type I error rate estimates on real data mimic exactly the Type I error rate estimates on simulated data (Tables 1 and 2). All three tests (edgeR, DESeq, and eBayes) that apply GM show the highest Type I error followed by tests with FM and SM respectively. Under each Pvalue’s combining method, edgeR has the highest Type I error rate, followed by DESeq and eBayes.
The Type I error rate estimates on real and simulated data are perfectly correlated for genelevel GSA tests. For real data and multivariate tests, TMM and QQN normalizations lead to the more conservative Type I error rate estimates.
Detected pathways
While, for real data, the Type I error rate of different GSA approaches can be directly evaluated by using two subsets from the same group, there is no straightforward and unbiased way to evaluate their power. We selected the Nigerian dataset [42] because it contains two sets of True Positives: genes that are escaping Xchromosome inactivation and are therefore overexpressed in females (XiE), and genes that are located on malespecific region of Y chromosome and are therefore overexpressed in males (msY). All tests detect msY, XiE, and DEX (C2 pathway, containing Xlinked genes escaping inactivation) with high significance. All tests fail to detect Xi (all Xlinked genes that are not escaping inactivation) except for the univariate tests with GM, because univariate tests with GM have the highest Type I error rate (see Additional file 1: Table S3).
Except for pathways containing genderspecific genes, there is no set of pathways that are guaranteed to be differentially expressed between male and female samples. We therefore decided to examine the entire set of C2 pathways with the goal to quantitatively characterize different methods based on: (1) a number of detected pathways at the different significance levels; (2) the average number of genes in detected pathways; (3) the average length of genes in detected pathways; and (4) the percentage of differentially expressed genes in detected pathways. This information will clarify whether there are methods that are: (1) overlay liberal (detect too many pathways that are not shared with the majority of the other approaches); (2) biased in terms of the number of genes in detected pathways; (3) biased in terms of the average gene length in detected pathways; or (4) detecting only pathways with small (large) number of differentially expressed genes.
Among multivariate tests WW is the most liberal (343 with RPKM, 352 with QQN, 333 with TMM and 348 with VOOM). KS is the next most liberal (267 with RPKM, 292 with QNN, 278 with TMM and 271 with VOOM). Nstatistic is more conservative than both WW and KS (241 with RPKM, 254 with QNN, 252 with TMM and 245 with VOOM). ROAST is the most conservative among multivariate tests (199 pathways). Methods with QQN normalization detect slightly more pathways as compared to the same method with other normalizations.
Univariate tests with GM detect by far the highest number of pathways (603 with edgeR, 565 with DESeq, and 465 with eBayes). Tests with SM are the most conservative among methods (63 with edgeR, 56 with DESeq, and 77 with eBayes), followed by FM (151 with edgeR, 162 with DESeq, and 146 with eBayes). These observations are in agreement with the Type I error rate estimates for univariate tests with different approaches for combining Pvalues (Tables 2 and 3).
The Venn diagrams in Figure 4 show the common pathways detected (α = 0.05) by multivariate tests with different normalizations (except ROAST which uses VOOM followed by RPKM only) and univariate tests with different Pvalues combining approaches. Nstatistic detects more common pathways with ROAST than WW and KS and also has more common pathways across different normalizations (Figure 4a). Both WW and KS have much more unique pathways detected by one normalization method than Nstatistic (Figure 4b,c). When α = 0.001 only highly significant pathways are detected, consequently, WW and KS now show similar common groups with ROAST (Additional file 3: Figure S13).
Univariate tests with different Pvalues combining methods have very small overlap between pathways detected by different approaches (Figure 4d,e,f), as compared to multivariate tests. The overlap between pathways detected by different univariate tests with the same Pvalue combining method (Figure 4g,h,i) is larger than the overlap between pathways detected by the same univariate test with different methods for combining Pvalues (Figure 4d,e,f). This demonstrates that the Pvalue combining method is the more important factor than the test itself in detecting DE pathways.
Figure 5 shows the number of genes, the percentage of DE genes, and the average gene length in detected pathways for all tests (α = 0.05). The DE genes in each pathway were found using eBayes. Figure 5 confirms the presence of biases in genelevel tests for GSA that are introduced by different Pvalue combining approaches. Tests with SM and FM favor pathways with a small number of genes and require larger percentages of DE genes in order to detect a pathway. On the contrary tests with GM favor pathways with a large number of genes and require less percentage of DE genes to detect a pathway.
To test whether there are methods that detect pathways with average gene lengths significantly different than the average gene length of all 4020 C2 pathways, Wilcoxon’s nonparametric test was applied. All univariate tests with GM, eBayes with SM, and WW with any normalization, detect pathways with longer genes than the average. The deviation found in eBayes with SM can be attributed to a small number of detected pathways with a small number of genes, which doesn’t allow accurate estimation of the average gene length per detected pathways. On the other hand, tests with GM detect a large number of pathways with a large number of genes, and it also makes the estimate of the average gene length per detected pathway biased.
Discussion
Here we have presented a comparative power and Type I error rate analyses for selfcontained GSA approaches that could be possibly used for RNASeq data. In contrast to microarrays, RNASeq data consists of discrete counts, therefore GSA approaches developed for microarrays are not directly applicable to RNASeq. We have evaluated and compared three multivariate nonparametric approaches (Nstatistic, KolmogorovSmirnov, and WaldWolfowitz tests) in combination with four different normalizations (RPKM, TMM, QNN, and VOOM), ROAST, [29] and genelevel GSA methods that use univariate RNASeq specific tests (edgeR, DESeq, and eBayes) and employed different methods for combining Pvalues (FM, SM, and GM). In sum we analyzed the performance of twentytwo combinations of tests, including normalization and Pvalue combining methods in the analysis of RNASeq data. All approaches were evaluated on simulated and real data, and their significance was evaluated from sample permutations.
We found that for simulated data the Type I error rate and the power of different multivariate approaches in combination with four different normalizations were virtually unaffected by different normalizations. It should be noted that the Type I error rate was only slightly (in the range of 0.01 for the same multivariate test) affected by the normalization used, while the power was not affected at all. Expectedly, both measures were seriously affected by different test statistics. The bestperforming approach, in terms of the smallest Type I error rate and the largest power, when the percentage of truly DE expressed genes in a pathway (γ) and a fold change (FC) were small, was ROAST [29], closely followed by Nstatistic. Multivariate nonparametric WaldWolfowitz and KolmogorovSmirnov had the smallest power and the largest Type I error rates with all normalizations. The Type I error rate estimates on real data reproduced the trends observed on simulated data. Again, ROAST was the most conservative approach among multivariate tests, and different normalizations didn’t affect the Type I error rates as much as the different test statistics. All of the tests were able to detect genderspecific pathways (msY, XiE and DEX) as differentially expressed between male and female samples with high significance. Xi was not detected by any test.
We also examined the entire set of C2 pathways to quantitatively characterize different methods. The analysis of all C2 pathways confirmed that ROAST is the most conservative among multivariate tests, having the least amount of DE pathways detected. Similarly to the simulated data, ROAST was closely followed by Nstatistic. Again, for real data, only multivariate test statistics and not normalizations influenced the results to a measurable extent. Thus, on simulated and real data, in terms of the Type I error rate and power, ROAST and Nstatistic outperformed all other tests, independently of the normalization used. We did not find any evidence of bias for multivariate tests in terms of the number of genes, or the percentage of DE genes in detected pathways with any type of normalization. Surprisingly, among all multivariate tests, multivariate nonparametric WaldWolfowitz with any normalization detected pathways with longer genes than the average. It might be related to the fact that WW was the most liberal test among all multivariate tests considered.
For the simulated data, genelevel tests for GSA were heavily dependent on the method used for combining Pvalues, and the differences in power and Type I error rate between univariate tests with the same approach for combining Pvalues were much smaller than the differences when the same test, but different combining Pvalues approaches, were applied. When the percentage of truly differentially expressed genes (γ) and fold changes were small, all three tests (edgeR, DESeq, and eBayes) with GM outperformed tests with FM and SM. This difference disappeared when γ increased.
For genelevel tests for GSA, it appeared that trends in Type I error rates, estimated from real data, were again similar to the trends in simulated data. All genelevel tests for GSA detected genderspecific pathways, but, in addition, all tests with GM detected the Xi pathway that should not be detected. For genelevel tests for GSA, the analysis of all C2 pathways shows that all of them (except tests with GM) have very small overlap between pathways detected by different approaches as compared to multivariate tests. The overlap between pathways detected by different univariate tests with the same method for combining Pvalues was larger than the overlap between pathways detected by the same univariate test with different methods for combining Pvalues, but still in an order of magnitude smaller than for multivariate tests (excluding tests with GM, see below). This indicates that, first, the Pvalues combining method is the leading factor in detecting DE pathways using genelevel tests for GSA, and, second, for real data they have less power than multivariate approaches in an order of magnitude.
The analysis of C2 pathways on the Nigerian data confirmed our expectations, which were formed by the analysis of the functional dependencies between the original and transformed Pvalues for different Pvalues combining methods (Figure 1). All tests with GM exclusively detected pathways with a large number of genes and a small percentage of DE genes as compared to the other approaches. All tests with SM exclusively detected pathways with a small number of genes and a large percentage of DE genes as compared to the other approaches. The Type I error rate, the number of genes and the percentage of DE genes necessary to detect a pathway for all tests with FM, were exactly inbetween GM and SM: smaller than for all tests with GM and larger than for all tests with SM. In agreement tests with GM and eBayes with SM all detected pathways with longer genes than the average (Wilcoxon’s test, Figure 5).
The results from simulated and real data show that genelevel tests for GSA with GM have the highest Type I error rates and the highest power. In addition all tests with GM had the highest number of genes and the smallest percentage of truly DE genes in detected C2 pathways. These observations indicate that the gain in power for tests with GM is caused by the gain in false positives. Tests with SM had the smallest power and the smallest Type I error rates, while the results for tests with FM were intermediate.
It should be noted that recently edgeR with GM was found to outperform many other approaches for GSA in terms of power and Type I error rate and was recommended for RNASeq data analysis [25]. Indeed, we observed that edgeR with GM had the highest power among all the other approaches. In a recent publication edgeR with GM was suggested to be the first method of choice for GSA of RNASeq data [25]. In contrast to this result, our study showed that for simulated and real data edgeR with GM has the highest Type I error rate among all the other tests for GSA. We hypothesize that the difference between the two studies stems from the way the data were simulated. In our study we used the Negative Binomial model, which is used in edgeR for finding DE genes. In [25] the multivariate normal distribution with fixed correlation structure was used, but, surprisingly, edgeR was used for finding DE genes. Therefore, in the latter case, the distributional assumption of the method (edgeR) was not met, which could have led to the bias in the estimation of the Type I error rates. However, all simulations are only crude approximations of biological reality. To estimate the Type I error rates on the real data, we performed intracondition comparisons using only male samples from the Nigerian dataset: there should not be gene sets differentially expressed between these two groups. Again, edgeR with GM had the highest Type I error rate for real data among all other tests, confirming that in contrast with [25] results, edgeR with GM has inadequate control of the Type I error rate.
Conclusions
Overall, for the selfcontained category of GSA, multivariate GSA tests are insensitive to different normalizations and have better control of Type I error rates and higher power as compared to genelevel GSA tests, both on simulated and real data. In addition, while standard gene set overrepresentation analysis shows as overrepresented categories with longer genes [19], standard multivariate GSA tests (except WW) with different normalizations do not have any biases in terms of the pathway size, the percentage of DE genes, or the average gene length in a pathway. The opposite is true for all genelevel GSA tests. Thus, our study argues against the use of genelevel tests for GSA whether with Fisher’s combining probabilities Method [27], or Stouffer’s Method [28], or the soft thresholding Gamma Method [25], and emphasize the importance of using nonparametric multivariate tests for detecting DE pathways for RNASeq data.
Availability of software
Software implementing the multivariate generalizations of the KolmogorovSmirnov and WaldWolfowitz tests in R was released within the GSAR package in version 3.0 of Bioconductor (http://www.bioconductor.org/packages/release/bioc/html/GSAR.html).
Additional files
Abbreviations
 DE:

Differentially expressed
 GSA:

Gene set analysis
 GSEA:

Gene set enrichment analysis
 WW:

WaldWolfowitz
 KS:

KolmogorovSmirnov
 ROAST:

Rotation gene set test
 FC:

Fold change
 FM:

Fisher’s method
 GM:

Gamma method
 SM:

Stouffer’s method
 STT:

Soft truncation threshold
 RPKM:

Reads per kilobase per million
 QQN:

Quantilequantile normalization
 TMM:

Trimmed mean of Mvalues
 NB:

Negative bionomial
 msY:

male specific genes of chromosome Y
 XiE:

Chromosome X genes inactivation escaping
 DEX:

Disteche escaped from chromosome X inactivation
 MSigDB:

Molecular signature database
 GO:

Gene Ontology
 KEGG:

Kyoto Encyclopedia of genes and genomes
References
 1.
Core LJ, Waterfall JJ, Lis JT: Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science. 2008, 322 (5909): 18451848. 10.1126/science.1162228.
 2.
Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bahler J: Dynamic repertoire of a eukaryotic transcriptome surveyed at singlenucleotide resolution. Nature. 2008, 453 (7199): 12391243. 10.1038/nature07002.
 3.
Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26 (1): 139140. 10.1093/bioinformatics/btp616.
 4.
Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11 (10): R10610.1186/gb20101110r106.
 5.
Li J, Tibshirani R: Finding consistent patterns: A nonparametric approach for identifying differential expression in RNASeq data. Stat Methods Med Res. 2013, 22 (5): 519536. 10.1177/0962280211428386.
 6.
Smyth G: Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions using R and Bioconductor. Edited by: Smyth G, Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. 2005, Springer, New York, 397420. 10.1007/0387293620_23.
 7.
Law CW, Chen Y, Shi W, Smyth GK: Voom: precision weights unlock linear model analysis tools for RNAseq read counts. Genome Biol. 2014, 15 (2): R2910.1186/gb2014152r29.
 8.
Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC1alpharesponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003, 34 (3): 267273. 10.1038/ng1180.
 9.
Glazko GV, EmmertStreib F: Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics. 2009, 25 (18): 23482354. 10.1093/bioinformatics/btp406.
 10.
EmmertStreib F, Glazko GV: Pathway analysis of expression data: deciphering functional building blocks of complex diseases. PLoS Comput Biol. 2011, 7 (5): e100205310.1371/journal.pcbi.1002053.
 11.
Ackermann M, Strimmer K: A general modular framework for gene set enrichment analysis. BMC Bioinformatics. 2009, 10 (1): 4710.1186/147121051047.
 12.
da Huang W, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37 (1): 113. 10.1093/nar/gkn923.
 13.
Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Geneset analysis and reduction. Brief Bioinform. 2009, 10 (1): 2434. 10.1093/bib/bbn042.
 14.
Goeman JJ, Buhlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007, 23 (8): 980987. 10.1093/bioinformatics/btm051.
 15.
Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci USA. 2005, 102 (38): 1354413549. 10.1073/pnas.0506577102.
 16.
da Huang W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4 (1): 4457. 10.1038/nprot.2008.211.
 17.
Khatri P, Sirota M, Butte AJ: Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012, 8 (2): e100237510.1371/journal.pcbi.1002375.
 18.
Rahmatallah Y, EmmertStreib F, Glazko G: Gene set analysis for selfcontained tests: complex null and specific alternative hypotheses. Bioinformatics. 2012, 28 (23): 30733080. 10.1093/bioinformatics/bts579.
 19.
Young MD, Wakefield MJ, Smyth GK, Oshlack A: Gene ontology analysis for RNAseq: accounting for selection bias. Genome Biol. 2010, 11 (2): R1410.1186/gb2010112r14.
 20.
Hanzelmann S, Castelo R, Guinney J: GSVA: gene set variation analysis for microarray and RNAseq data. BMC Bioinformatics. 2013, 14: 710.1186/14712105147.
 21.
Wang X, Cairns MJ: Gene set enrichment analysis of RNASeq data: integrating differential expression and splicing. BMC Bioinformatics. 2013, 14 (Suppl 5): S1610.1186/1471210514S5S16.
 22.
Tripathi S, Glazko GV, EmmertStreib F: Ensuring the statistical soundness of competitive gene set approaches: gene filtering and genomescale coverage are essential. Nucleic Acids Res. 2013, 41 (7): e8210.1093/nar/gkt054.
 23.
Varemo L, Nielsen J, Nookaew I: Enriching the gene set analysis of genomewide data by incorporating directionality of gene expression and combining statistical hypotheses and methods. Nucleic Acids Res. 2013, 41 (8): 43784391. 10.1093/nar/gkt111.
 24.
Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Improving gene set analysis of microarray data by SAMGS. BMC Bioinformatics. 2007, 8: 24210.1186/147121058242.
 25.
Fridley BL, Jenkins GD, Grill DE, Kennedy RB, Poland GA, Oberg AL: Soft truncation thresholding for gene set analysis of RNAseq data: application to a vaccine study. Sci Rep. 2013, 3: 289810.1038/srep02898.
 26.
Friedman JH, Rafsky C: Multivariate Generalizations of the WaldWolfowitz and Smirnov TwoSample Tests. Ann Stat. 1979, 7 (4): 697717. 10.1214/aos/1176344722.
 27.
Fisher R: Statistical methods for research workers. 1932, Oliver and Boyd, Edinburgh, Scotland
 28.
Stouffer S, DeVinney L, Suchmen E: The American Soldier: Adjustment during army life., vol. 1. 1949, Princeton University Press, Princeton, US
 29.
Wu D, Lim E, Vaillant F, AsselinLabat ML, Visvader JE, Smyth GK: ROAST: rotation gene set tests for complex microarray experiments. Bioinformatics. 2010, 26 (17): 21762182. 10.1093/bioinformatics/btq401.
 30.
Baringhaus L, Franz C: On a new multivariate twosample test. J Multivariate Anal. 2004, 88: 190206. 10.1016/S0047259X(03)000794.
 31.
Klebanov L, Glazko G, Salzman P, Yakovlev A, Xiao Y: A multivariate extension of the gene set enrichment analysis. J Bioinform Comput Biol. 2007, 5 (5): 11391153. 10.1142/S0219720007003041.
 32.
Zaykin DV: Optimally weighted Ztest is a powerful method for combining probabilities in metaanalysis. J Evol Biol. 2011, 24 (8): 18361841. 10.1111/j.14209101.2011.02297.x.
 33.
Zaykin DV, Zhivotovsky LA, Czika W, Shao S, Wolfinger RD: Combining pvalues in largescale genomics experiments. Pharm Stat. 2007, 6 (3): 217226. 10.1002/pst.304.
 34.
Quackenbush J: Microarray data normalization and transformation. Nat Genet. 2002, 32 (Suppl): 496501. 10.1038/ng1032.
 35.
Irizarry RA, Hobbs B, Collin F, BeazerBarclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4 (2): 249264. 10.1093/biostatistics/4.2.249.
 36.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Methods. 2008, 5 (7): 621628. 10.1038/nmeth.1226.
 37.
Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloë D, Le Gall C, Schaëffer B, Le Crom S, Guedj M, Jaffrézic F: A comprehensive evaluation of normalization methods for Illumina highthroughput RNA sequencing data analysis. Brief Bioinform. 2012, 14 (6): 671683. 10.1093/bib/bbs046.
 38.
Oshlack A, Wakefield MJ: Transcript length bias in RNAseq data confounds systems biology. Biol Direct. 2009, 4: 1410.1186/17456150414.
 39.
Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 2010, 11 (3): R2510.1186/gb2010113r25.
 40.
Bullard JH, Purdom E, Hansen KD, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC Bioinformatics. 2010, 11: 9410.1186/147121051194.
 41.
Hansen KD, Irizarry RA, Wu Z: Removing technical variability in RNAseq data using conditional quantile normalization. Biostatistics. 2012, 13 (2): 204216. 10.1093/biostatistics/kxr054.
 42.
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK: Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010, 464 (7289): 768772. 10.1038/nature08872.
 43.
Liberzon A, Subramanian A, Pinchback R, Thorvaldsdottir H, Tamayo P, Mesirov JP: Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011, 27 (12): 17391740. 10.1093/bioinformatics/btr260.
 44.
Disteche CM, Filippova GN, Tsuchiya KD: Escape from X inactivation. Cytogenet Genome Res. 2002, 99 (1–4): 3643. 10.1159/000071572.
 45.
Robinson MD, Smyth GK: Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007, 23 (21): 28812887. 10.1093/bioinformatics/btm453.
Acknowledgments
We are grateful to anonymous reviewers whose comments have helped us improve the manuscript. We thank our colleague Nia Indelicato for editorial help. Support has been provided in part by the National Center for Advancing Translational Science award UL1TR000039 and the IDeA Networks of Biomedical Research Excellence (INBRE) grant P20RR16460. FES is supported by the EPSRC (EP/H048871/1). Largescale computer simulations were implemented using the High Performance Computing (HPC) resources at the UALR Computational Research Center supported by the following grants: National Science Foundation grants CRI CNS‐0855248, EPS‐0701890, MRI CNS‐0619069, and OISE‐0729792.
Author information
Additional information
Competing interests
The authors declare that they do not have any competing interests.
Authors’ contributions
GG and YR designed the study, performed the analysis and wrote the manuscript. FES contributed to the comparative power analysis and manuscript writing. All authors read and approved the final manuscript.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Cite this article
Rahmatallah, Y., EmmertStreib, F. & Glazko, G. Comparative evaluation of gene set analysis approaches for RNASeq data. BMC Bioinformatics 15, 397 (2014). https://doi.org/10.1186/s1285901403978
Received:
Accepted:
Published:
Keywords
 Minimum Span Tree
 Differentially Express
 Differentially Express Gene
 Univariate Test
 Negative Binomial