Down-weighting overlapping genes improves gene set analysis

Background The identification of gene sets that are significantly impacted in a given condition based on microarray data is a crucial step in current life science research. Most gene set analysis methods treat genes equally, regardless how specific they are to a given gene set. Results In this work we propose a new gene set analysis method that computes a gene set score as the mean of absolute values of weighted moderated gene t-scores. The gene weights are designed to emphasize the genes appearing in few gene sets, versus genes that appear in many gene sets. We demonstrate the usefulness of the method when analyzing gene sets that correspond to the KEGG pathways, and hence we called our method Pathway Analysis with Down-weighting of Overlapping Genes (PADOG). Unlike most gene set analysis methods which are validated through the analysis of 2-3 data sets followed by a human interpretation of the results, the validation employed here uses 24 different data sets and a completely objective assessment scheme that makes minimal assumptions and eliminates the need for possibly biased human assessments of the analysis results. Conclusions PADOG significantly improves gene set ranking and boosts sensitivity of analysis using information already available in the gene expression profiles and the collection of gene sets to be analyzed. The advantages of PADOG over other existing approaches are shown to be stable to changes in the database of gene sets to be analyzed. PADOG was implemented as an R package available at: http://bioinformaticsprb.med.wayne.edu/PADOG/or http://www.bioconductor.org.

Since the beginning of the microarray-based expression profiling experiments, researchers were interested in finding common "themes" among the genes identified as differentially expressed between two conditions. For instance the identification of Gene Ontology (GO) terms enriched in differentially expressed genes was used as early as 1999 [1], but became widespread only after the first on-line GO analysis tools were made available [2,3]. As biological annotations started to include descriptions of gene interactions in the form of pathways (found in resources such as KEGG [4], BioCarta www.biocarta.com, and Reactome [5]), the identification of the pathways involved in various conditions has emerged as a ubiquitous bioinformatics task.
In general, biological pathways can be divided into gene signaling pathways, and metabolic pathways. Gene signaling pathways are graphs that use nodes to represent genes, or gene products, and edges to represent signals http://www.biomedcentral.com/1471-2105/13/136 that go from one gene to another. Metabolic pathways are graphs that use nodes to represent biochemical compounds, and edges to describe biochemical reactions that involve such compounds. Since biochemical reactions are usually carried out by enzymes which are coded for by genes, in a metabolic pathway genes are associated with edges rather than nodes. Ideally, a comprehensive pathway analysis method would be able to take into consideration all aspects of the phenomena described by a pathway. These aspects would include the position and role of each gene in a pathway, the types of signals between genes, the efficiency with which a signal travels from one gene to another, or the efficiency with which a certain reaction is carried out, rate limiting conditions, etc. Such methods have been proposed for both signaling pathways [6][7][8][9], and metabolic pathways [8,10], but no method is currently available to analyze both types of pathways taking into consideration all the information available. Hence, even though they do not use all information available, methods that treat the pathways as simple gene sets are still popular because they can be applied equally well to signaling pathways, metabolic pathways, GO terms, as well as arbitrary sets of genes.
Two of the most popular such methods are the Gene Set Enrichment Analysis (GSEA) [11] and the Gene Set Analysis (GSA) [12]. These methods belong to the functional class scoring category of gene set analysis methods [13,14]. For a simple two group experiment (e.g. disease vs. normal), both GSEA and GSA start with computing a t-statistic for each gene measured on the array. Then, a score is computed for each gene set using the t-scores of all genes in the gene set. The significance of the gene set scores is determined by using permutations of the samples. Both approaches treat the genes in the gene set equally.
In this work, we propose the Pathway Analysis with Down-weighting of Overlapping Genes (PADOG) which is a general gene set analysis method. The method gives more weight to genes that are gene set-specific, than to genes which can be found in multiple gene sets. This is similar to the approach commonly used in information retrieval (e.g. web search engines) that decreases the importance of words that appear in many documents (e.g. "and", "or", etc.) in favor of words that are highly specific to given documents, the latter type being considered to carry more information about the informational content of the document. Similarly, in our approach, if the differential expression affects genes that are highly specific to a given pathway (e.g. huntingtin to Hungtington's disease), it is more likely that the respective pathway is truly relevant in that condition.
The process of down-weighting popular genes does not affect one's ability to find a gene set to be significant whenever the gene set is composed mostly of ubiquitous genes, but rather increase the contrast between gene sets that overlap by reducing the contribution of the overlapping genes into the gene set scores. As a simple example, with PADOG, a gene set A having 20 out of 50 genes differentially expressed, that appear only in gene set A, will be found more significant than another gene set B of same size that has also 20 differentially expressed genes but which appear in other gene sets as well. Both GSEA and GSA would find the two gene sets equally significant.
Analysis methods that do not treat all genes equally were previously proposed for pathway analysis in an overrepresentation context [6,7], or in a functional class scoring context [8], yet none specifically exploit the frequency of occurrence of genes across the pathways. Moreover, unlike GSA, PADOG does not rely on ordinary t-scores to derive gene set scores but uses moderated t-statistics [15] instead. A similar idea to use non-ordinary t-scores in the gene set scores computation was illustrated first in [16] by using SAM statistics [17] in conjunction with GSEA. Moreover, unlike GSA, PADOG summarizes the gene scores into a gene set score using the mean of absolute values instead of the maxmean statistic.
The sensitivity of gene set analysis methods (i.e. their ability to produce significant p-values for gene sets that are truly relevant to a phenotype), as well their ability to rank the relevant gene sets near the top, is typically assessed using a few data sets, by asking domain experts to make informed guesses about which gene sets are relevant to each condition/dataset. Relevance is determined using the expert's knowledge and/or literature citations supporting the link between certain gene sets and the condition under the study [6,7,11,18]. The problem is that almost any gene set analysis result will be supported by some references which makes an unbiased and objective comparison of various analysis methods practically impossible. In this study, we used a different approach in which we make fewer assumptions, and use an order of magnitude more data sets (24 sets). The type of gene sets considered in our validation were KEGG biological pathways. Each of the 24 microarray data sets that we used (see Table 1) involved a particular disease for which there is an associated pathway in the KEGG database [19], e.g. Alzheimer's disease, Colorectal cancer, Asthma, etc. We refer to these as the target pathways, and we, very conservatively, consider them to be the only ones certain to be relevant for their respective conditions. Since the target pathways for all 24 datasets belong to the non-metabolic pathways category, we can restrict the analysis only to KEGG non-metabolic pathways. Analyzing all metabolic and non-metabolic pathways brings an additional challenge to the analysis methods because the assumed relevant pathway for a given condition (dataset) is now to be found among a larger pool of pathways. The gene set analysis methods were compared in terms of their ability http://www.biomedcentral.com/1471-2105/13/136 Each data set comes from tissues affected by a specific disease. The KEGG pathway describing that disease is henceforth considered to be the target pathway. The analysis methods were compared in terms of their ability to rank the target pathway as high as possible in the analysis of each data set.
to produce significant p-values for these target pathways and rank them near the top.

Existing methods
The two methods we chose to compare PADOG against are the Gene Set Enrichment Analysis (GSEA) [11] and the Gene Set Analysis (GSA) [12]. Briefly, GSEA works as follows. Let GS i denote the i th gene set, where i = 1..N GS . For each gene j on the array, GSEA computes a t-statistic z j for the differential expression of the gene between the disease group and the control group. A gene set score S(GS i ) is computed similar to a signed version of the Kolmogorov-Smirnov statistic between the values z j , j ∈ GS i and their complement (genes measured on the array but not belonging to the gene set). The class labels of the arrays are permuted and the significance of the gene set score is assessed by determining the null distribution of the gene set score.
The Gene Set Analysis (GSA) [12] differs from GSEA in two ways. Firstly, the gene set summary statistic used is the maxmean statistic, defined as: where the (+) and (−) signs identify the positive and negative t-scores respectively, and n represents the number of genes in the gene set. Secondly, GSA differs from GSEA by re-standardizing the gene set scores by taking into account scores from sets formed by random selection of genes.
Permutations of class labels are then used to infer the significance of the standardized gene set scores. The need for re-standardization is justified by the fact that, given that the genes are correlated (they tend to have either high or low t-scores simultaneously), the gene set score computed with the true class labels will be systematically larger than with permuted class labels and, hence, the significance of all gene sets will be overstated. http://www.biomedcentral.com/1471-2105/13/136

Pathway Analysis with Down-weighting of Overlapping Genes (PADOG)
Let GS i with i = 1..N GS be the collection of gene sets to be analyzed, each containing N(GS i ) genes, and G be the set of all genes measured on the array that can be mapped to at least one gene set to be analyzed. Then let T g be the value of a moderated t-score [15] of the gene g between the two conditions of interest with g ∈ G. The moderated t-scores are similar to ordinary t-scores, except that their standard errors have been moderated across genes, i.e., shrunk towards a common value using a Bayesian model [15]. The moderated t-scores are expected to be more reliable than ordinary t-scores because the shrinkage of the gene standard deviations will prevent large t-scores to occur only due to small gene standard deviations.
Moreover, let f (g) be the frequency of gene g across all gene sets to be analyzed. Here f (g) can take values from 1 to N GS since a gene can be either specific to a gene set by appearing only in that gene set, or it is present in all gene sets, respectively. We want to weight the t-scores of the genes with a function of their frequency in such a way that the most frequently appearing gene gets a weight of w = 1.0, while gene set specific genes get double weight (w = 2.0). We chose a monotonically decreasing function to relate the gene weight w(g) to the gene frequency f (g) so that it is bounded between 1.0 and 2.0 and drops faster with increasing frequency values: For illustration purposes, the distribution of gene frequencies across all 143 KEGG non-metabolic pathways (treated here as gene sets), as well as the dependency of gene weights on gene frequency, is shown in Figure 1. For each gene set we compute a score as: The formula above describes the gene set scores as the mean across all genes in the gene set of the weighted absolute moderated t-scores. The gene set scores obtained with the formula above are first standardized using a row randomization approach described in [12] to yield S 0 (GS i ). The row randomization consists of subtracting the mean and dividing by the standard deviation of gene set scores that could be obtained by randomly selecting sets of genes with the same size as the current gene set. Given that our gene set summarization function Eq. 2 is essentially a mean (of absolute weighted moderated tscores) both the row standardization mean and standard deviation can be inferred from the mean and standard deviation of |T (g)| · w(g) values of all genes on the array, as the central limit theorem would suggest, and hence no actual permutations are needed. More specifically, the row randomization mean for gene set GS i will given by the mean (of absolute weighted moderated t-scores) of all genes on the array, and the row randomization standard deviation can be calculated as the standard deviation of |T (g)| · w(g) values of all genes on the array divided by √ N(GS i ). A second standardization is applied by subtracting the mean and dividing to the standard deviation of S 0 (GS i ) scores across all N GS gene sets to obtain the observed standardized scores, S * 0 (GS i ). The probability P PADOG (GS i ) to observe such a large or larger standardized score is determined by permuting N ite = 1000 times the array/samples labels: where I is a function that returns 1 when the argument is true and 0 otherwise, and S * ite (GS i ) represents the standardized score obtained with the ite-th permutation of the samples for gene set GS i . Assessing the sensitivity and gene set ranking capability using real data To assess the sensitivity and ranking capability of the gene set analysis methods discussed in this paper, we identified in the Gene Expression Omnibus (GEO) [38], 24 microarray data sets each involving a particular disease. For each such disease, we considered the KEGG pathway that describes the biological phenomena taking place in that disease as the target pathway. For instance, the Alzheimer's disease pathway is the target pathway for all Alzheimer data sets, etc. Table 1 shows the details about these 24 datasets. For most diseases considered, there are several associated data sets in this collection. The gene set analysis methods were compared in terms of their ability to produce low p-values, and rank at the top these target pathways (one in each data set). A schematic representation of the benchmark system used to assess the performance of each gene set analysis method is shown in Figure 2. There were three categories of statistics computed to compare the performance of the gene set analysis methods considered in this study: 1. Statistics that describe the distribution of the 24 target pathway's p-values, including the geometric mean and median (the lower the better), and the percentage of target pathways with nominal p < 0.05 (the higher the better). This later statistic is an estimate of the sensitivity of a given analysis method.
The percentage of target pathways with False Discovery Rate [39] corrected p-values (called q-values) less than 0.05 is also given.
2. Statistics that describe the distribution of the 24 target pathways ranks, including mean and median (the lower the better). The rank of a target pathway, having the i th smallest p-value amongst all N GS pathways analyzed for a given dataset, will be equal to i/N GS · 100. 3. Statistics that allow to determine if a given pathway analysis method produces better rankings than a reference method, chosen to be GSA since it was the best among the two published methods that we tested. A simple method to test that the ranks produced by a given method for the 24 target pathways are smaller (better) than the reference method would be to use a one-tailed paired Wilcoxon test, the pairing being at data set level. However, the Wilcoxon test assumes that the different ranks are independent between the 24 datasets, yet this is may not be the case because some ranks are obtained for the same pathway in up to 4 datasets (see Table 1).
Another approach that we used to analyze the ranks while accounting for the eventual lack of independence among them was to fit a linear mixedeffects model. The dependent variable in this model were the rank values, while the explanatory variables were the analysis method (factor with two levels, with the reference level being GSA) and the dataset ID (to reflect that the ranks are paired at the dataset level), while the random effects were the pathway IDs. Both the coefficient, and one-tailed p-value that a given analysis method produces better (smaller) ranks than the reference method were reported. Note that the gene set analysis methods could have been compared also in analyzing gene ontology terms instead of pathways, however, choosing one GO term most relevant for each dataset would have been more subjective.
Assessing the sensitivity of gene set analysis using simulated data A sensitivity comparison between GSEA, GSA and PADOG using simulated data was performed as in [12], but further expanded to also allow for overlap between gene sets. Expression data for 1000 genes and 100 samples (50 in each condition) is generated from a random normal distribution N(0, 1). A number of 50 gene sets of size 20 were created, with the expression levels of some of the genes in gene set 1 (GS 1 ) being artificially altered to make only this gene set relevant to the phenotype. Expression levels of genes in GS 1 were changed according to the following 5 scenarios by varying the amount of change, the number of genes that change in the gene set, as well as the proportion of up-to down-regulated genes: A number of 50 data sets were generated for each of the six scenarios above. Orthogonal on the different scenarios we considered three analysis setups that could influence the results of PADOG but not GSA and GSEA, according to whether or not the genes in GS 1 are allowed to be present in other gene set as well (e.g. (GS 50 )). In the first setup I), GS 1 did not overlap with other gene sets as in [12], II) All genes designed to be DE in GS 1 were included also in GS 50 , and III) All non-DE genes of GS 1 were included in GS 50 . With setup I) we are basically interested in assessing if the gene set summarization function of PADOG (mean or absolute values) combined with the moderated t-scores compares favorably to GSA and GSEA, because in the absence of overlap, the genes of GS 1 will have the same weight (w = 1.0). When the DE genes in GS 1 appear also in other gene sets but the non-DE do not (setup II), PADOG is expected to give higher p-values to GS 1 compared to the situation when there is no overlap. This is because the weight of the DE genes in this case will be lower than the weight on non-DE genes. In contrary, if the genes that are non-DE in GS 1 overlap but the DE genes are specific to GS 1 (setup III) then PADOG is expected to produce smaller p-values for GS 1 because the DE genes will have more weight and also larger t-scores.

Assessing the specificity of gene set analysis
To test the ability of the gene set analysis methods to not reject the null hypothesis when it is true, i.e. their specificity, we conducted two simulation studies.

Simulation of the null hypothesis by sample labels permutation
In the first simulation study all the 24 data sets were considered, but their array/samples class labels were permuted at random before analysis so that the correlation structure between genes is preserved. In 100 different trials, we computed several of the statistics described above, including the median of target pathways p-values, median ranks, and the percentage of pathways with p < 0.05. The average of these statistics over the 100 trials are reported.
The purpose of this simulation was two-fold. First, it allows us to determine if the target pathways-based benchmark works, i.e. if the ranking results are worse for all methods when the labels are permuted compared to when the true class labels are used. Second, it allows us to estimate the false positive rate (1-specificity) of each gene set analysis method and compare it with the level expected under the null hypothesis. All analysis methods were run on the same 100 permutations of the original class labels of each of the 24 data sets to eliminate any differences introduced by random chance. The number of internal iterations used by each analysis method was N ite = 500.

Simulation of the null hypothesis by generating random data
At the suggestion of one of the reviewers, a second type of simulation was performed to determine the false positives rate of gene set analysis methods by generating random data from a normal distribution with mean 0 and standard deviation of 1, N(0, 1). For each of the 24 real datasets, 50 fake replicas were created by maintaining the actual sample size and number of genes but generating data at random, for a total of 1200 simulated datasets. The structure of the gene sets was preserved as defined by the 229 KEGG metabolic and non-metabolic pathways, therefore maintaining a meaningful overlap between the different genes in the gene sets. The fraction of all significant pathways (false positive rate) at different α thresholds was determined.

Data Analysis
For all 24 datasets shown in Table 1 which were available from the Gene Expression Omnibus (GEO), the http://www.biomedcentral.com/1471-2105/13/136 analysis was performed consistently by: a) removing outlier arrays (if necessary), b) log transforming the data and normalizing it, c) performing a moderated ttest between groups and computing probes/probesets pvalues, d) resolving duplicate probes/probesets to Entrez ID mappings by keeping the probe/probeset with smallest p-value for each unique gene and, e) filtering out all genes that could not be mapped on any of the pathways. The normalization of datasets obtained on Affymetrix arrays was performed using the RMA algorithm [40] implemented in the affy [41] package of Bioconductor [42], while normalization of of datasets run on Illumina arrays were normalized using the quantile normalization algorithm [43] implemented in the preprocessCore of Bioconductor. The package limma [44] was used to compute a moderated two-sample paired or unpaired t-score depending on the particular design of each experiment.
The GSEA analysis was performed using the R implementation available freely at www.broadinstitute.org/ gsea/index.jsp, while the GSA analysis was performed using the GSA R package [45]. PADOG was implemented in R as well, together with the validation benchmark system comparing the methods. All methods were run using 1,000 iterations to estimate the pathway p-values shown in Tables 2, 3, 4 and 5, while 500 iterations were used in the specificity analysis results shown in Table 6 and 7.
The set of 229 metabolic and non-metabolic pathways and their genes were obtained from the KEGG.db annotation package [46] of Bioconductor [42]. The split between metabolic and non-metabolic pathways was done based on KEGG's classification.
All analyses were run under the R statistical language and environment [47] version 2.14 and using other infrastructure packages available in Bioconductor version 2.9.

Sensitivity and rank analysis using real data
We compared the PADOG method proposed here with two existing methods (GSA and GSEA). The analysis was performed on i) 143 non-metabolic pathways (which included all target pathways) and ii) 229 metabolic and non-metabolic KEGG pathways. The criteria used in the comparison between these methods were the sensitivity, the ranking, as well as the specificity of the gene set analysis methods considered. Table 2 shows the summary of gene set analysis results for the three different methods based on the panel of 24 datasets described in Table 1 when analyzing only KEGG non-metabolic pathways.
PADOG compared favorably to both GSA and GSEA in terms of median and geometric mean p-values of the target pathways (which are expected to be relevant). Eight (33.3%) of the 24 target pathways were found to be significant (with a p-value less than 0.05) with PADOG, but only three did so with GSA (12.5%), and none with GSEA. PADOG was the only method to identify one (4.2%) of the 24 target pathways as significant after adjusting for multiple testing. In terms of the rank that each target pathway received in its data set (sorting pathways by p-values), PADOG produced significantly better (lower) rank values compared to GSA, as evaluated by both a paired Wilcoxon test (p = 0.0007), and a linear mixed-effects model (p = 0.0008). This later test accounts for the fact that the same disease pathway is the target pathway in up to 4 data sets (see Table 1). PADOG improves (reduces) the rank of target pathways by 7.2 rank units compared to GSA, which in turn is better than GSEA by 13.7 units. In other words, on average across the 24 data sets, the target pathways are ranked by PADOG approximately 7 rank units better than GSA, and approximately 21 rank units better than GSEA. The paired difference in ranks for the target pathways   The   The table shows statistics computed from nominal and adjusted p-values, and ranks of the 24 target pathways only, including geometric mean, median and percentages of pathways significant at 0.05 level based on nominal and adjusted p-values (q-values). The results of comparing the ranks of each method against noMnoW method, using a paired Wilcoxon test and a linear mixed-effects model, are included. The best value for each criterion is shown in bold. PADOG is compared against simpler approaches that i) use gene weights but regular rather than moderated t-scores (noM), ii) use moderated t-scores but no gene weights (noW) and iii) use neither moderated t-scores nor gene weights (noMnoW).
between pathway analysis methods and the GSA method, chosen as reference, are also shown using box plots in Figure 3.
To determine the robustness of PADOG with respect to changes in the collection of gene sets to be analyzed changes, we have run the same comparison shown in Table 2, on the entire set of 229 KEGG human pathways (metabolic and non-metabolic). An increase in the number of gene sets to be analyzed for a fixed gene expression dataset, is expected to impact the various methods in http://www.biomedcentral.com/1471-2105/13/136 The medians of target pathways p-values and ranks, as well as the fraction of target pathways with p < 0.05 were computed in 100 simulation trials. At each trial the class labels of the samples in each of 24 real datasets were permuted before analysis. The averages statistics over the 100 trials are shown. The fraction of all pathways significant at α = 0.05 and α = 0.01 were computed after applying the three analysis methods on 1200 datasets having expression data generated from a random normal N(0,1) distribution. The collection of gene sets used in the analysis was defined by the 229 KEGG non-metabolic and metabolic pathways. different ways. With PADOG, when there are more gene sets and, hence, more genes to be analyzed, the moderated t-scores of genes in all gene sets are expected to change because the shrinkage of standard deviations in the tscores is based on a larger pool of genes [15]. Secondly, the exact weights assigned to genes in PADOG depend on the number of gene sets in which they appear so these gene weights also change when the collection of gene sets to be analyzed changes. Table 3 and Figure 4 show that PADOG performed favorably compared to the other methods, and that the gains in terms or ranking and sensitivity are robust to changes in the collection of gene sets to be analyzed. Moreover, unlike any other method tested, PADOG identified one (4.2%) of the 24 target pathways as significant after adjusting for multiple testing.

Sensitivity analysis using simulated data
The result of the sensitivity analysis based on 50 simulated data sets in each of the 5 different scenarios are given in Table 4. These results show that when all genes designed to be differentially expressed (DE) in GS 1 are changing in the same direction (scenario 1), GSA and GSEA have an advantage over PADOG while the opposite is true in all remaining 4 scenarios. These results can be understood by considering the fact that GSA and GSEA statistics are designed to find such cases when all the genes in the gene set change in the same direction while PADOG's summary statistic is more flexible to accommodate cases when the changes occur in both directions. When the overlap favors the DE genes in GS 1 (Setup III), that is, when its DE genes are specific to this gene set while its non DE-genes are not specific to the gene set, the performance of PADOG increase in all scenarios 1 through 5, as compared to the absence of overlap. However, even when the overlap is not favorable to GS 1 (setup I), that is, when all its non-DE genes are specific to this gene set, PADOG still performes better than GSA and GSEA under scenarios 2 through 5.

Sources of improvement in PADOG
The use of gene weights is the main source of improvement with PADOG in terms of ranking and power. This is shown in Figure 5 and Table 5 in which PADOG is The lower the p-values, ranks and ranks differences, the better method. PADOG is compared against simpler approaches that i) use gene weights but regular rather than moderated t-scores (noM), ii) use moderated t-scores but no gene weights (noW) and iii) use neither moderated t-scores nor gene weights (noMnoW).
compared with simpler alternative methods that i) use gene weights but regular rather than moderated t-scores (noM), ii) use moderated t-scores but no gene weights (noW ) and iii) use neither moderated t-scores nor gene weights (noMnoW ). As it can be seen in Figure 5 left panel both methods that do not use weights (noW and noM-noW) give higher (worse) p-values for the target pathways than the two other methods that use weights (PADOG and noM). Also as, shown in Table 5, the use of moderated t-scores alone (noW) does not improve the raking compared to the reference (noMnoW) (mean rank is 22.3 vs 22.5 respectively). Although the use of weights (noM) improves the ranking significantly compared to the reference method (noMnoW), the improvement is higher in the presence of the moderated t-scores.

Specificity analysis of gene set analysis methods
Two simulation studies were performed to determine whether the improved sensitivity of the PADOG method, i.e. producing lower p-values for the target pathways, comes at the expense of reduced specificity (increased false positive rate). Table 6 shows three of the same statistics introduced in Table 2 (median p-values, median ranks, and percentage of pathways with p < 0.05) except that their average was taken over 100 trials in which the class labels of the arrays in all 24 datasets were randomly permuted before the analysis. The percentage of target pathways with p < 0.05 is now the false positive rate (FP) because using random class labels models the null hypothesis in which expression levels are dissociated from the studied phenotypes, yet the gene-gene correlations are preserved. Under these circumstances, any pathways that are reported as significant by any method are, in fact, false positives. Table 6 shows that, under the null hypothesis, the average median p-values, median ranks and fraction of pathways with p < 0.05 across the 100 random permutations are 0.49, 48.9% and 0.049, respectively for PADOG and similar values are obtained for GSA and GSEA. This is expected since when class labels are permuted, the p-values of the target pathways should be uniformly distributed between between 0 and 1 (expected mean 0.5), and rank values should be uniformly distributed between 1/N GS · 100 = 0.44 and 100 (expected mean 50.22) where N GS is the number of gene sets analyzed. Table 6 also shows that the average median p-values and median ranks are much above (worse) than the level they had when true class labels were used in the analysis (see Table 2). This is the case for all analysis methods. These results prove that: i) the target pathways were indeed in average relevant to their respective phenotypes, ii) the benchmark system was sound, and iii) both the novel, as well as the existing methods were correctly deployed.
An additional simulation in which 1200 datasets were generated by drawing random values from a normal distribution has yielded similar results as the previous simulation. In this case the false positives rate was estimated as the fraction of all pathways across all 1200 datasets with a p-value less than a given threshold α. The estimated false positive rates of all three methods were very close to the expected α levels as shown in Table 7. This again confirms that PADOG is not expected to find significant gene sets more often than expected by chance regardless if gene are correlated (as in the simulation above) or not (this simulation).

Specificity analysis of the set of target pathways
In response to the suggestion of one of the reviewers, we aimed at determining how specific the target pathways were to their respective conditions. Given that the phenotype in 16 out of the 24 datasets used in our sensitivity assessment benchmark study is a form of cancer, we determined if the target pathway for each of these cancer types, in average, is found to be more significant than other general pathways typically associated with cancer such as Apoptosis, Cell cycle, Pathways in cancer, and RNA polymerase. Table 8 shows that in average on the 16 cancer datasets PADOG shows the strongest evidence (smallest p-values and rank statistics) for association between the pehnotype of the dataset and KEGG's disease specific pathway for the phenotype (target pathway). The target pathway was preferred by all three methods to any other generic cancer related pathway that we have included in this comparison, based on median p-values and, by PADOG and GSA based on median ranks as well. The Pathways in cancer gene set came in a close second for both PADOG and GSA. While for Apoptosis and Cell cycle the median p-values and ranks were around 25% for all methods, for the RNA polymerase pathways these values were above 0.5. This analysis provides evidence that the target pathways we chose were indeed specific for their respective phenotypes.

Conclusions
The original contribution of this paper is two-fold. Firstly, this paper introduces the idea of gene weighting in gene set analysis on the basis of gene frequency across the gene sets to analyzed. The reasoning behind this type of gene weighting is that whenever a gene belongs to multiple gene sets, that particular gene is less useful in prioritizing among those gene sets. Conversely, the differential expression of a gene that is present only on a single gene set/pathway represents a stronger evidence that the given gene set/pathway is impacted in the given condition. A second original contribution is the validation procedure deployed here. The classical approach involves analyzing a handful of selected data sets and discussing the results in the light of the existing literature. This is subjective and makes the comparison of various methods practically impossible. The validation proposed here involves the analysis of a large number of data sets (24 in this case) that can be objectively associated with a target gene set/pathway. This objective association is based on the fact that the samples analyzed are collected from tissues affected by the target disease (e.g. in the analysis of colorectal cancer samples, the colorectal cancer pathway is chosen as the target pathway, etc.). This approach allows a comparison of analysis methods in terms of sensitivity and ranking. Such a comparison is: a) objective, b) reproducible, and c) independent of the accuracy and thoroughness of a literature search. Using this approach, we have shown that PADOG is able to identify the target pathways as significant more frequently and rank them consistently higher than two of the best existing methods for the analysis of gene sets based on high-throughput gene expression data.