Down-weighting overlapping genes improves gene set analysis
BMC Bioinformatics volume 13, Article number: 136 (2012)
The identification of gene sets that are significantly impacted in a given condition based on microarray data is a crucial step in current life science research. Most gene set analysis methods treat genes equally, regardless how specific they are to a given gene set.
In this work we propose a new gene set analysis method that computes a gene set score as the mean of absolute values of weighted moderated gene t-scores. The gene weights are designed to emphasize the genes appearing in few gene sets, versus genes that appear in many gene sets. We demonstrate the usefulness of the method when analyzing gene sets that correspond to the KEGG pathways, and hence we called our method P athway A nalysis with D own-weighting of O verlapping G enes (PADOG). Unlike most gene set analysis methods which are validated through the analysis of 2-3 data sets followed by a human interpretation of the results, the validation employed here uses 24 different data sets and a completely objective assessment scheme that makes minimal assumptions and eliminates the need for possibly biased human assessments of the analysis results.
PADOG significantly improves gene set ranking and boosts sensitivity of analysis using information already available in the gene expression profiles and the collection of gene sets to be analyzed. The advantages of PADOG over other existing approaches are shown to be stable to changes in the database of gene sets to be analyzed. PADOG was implemented as an R package available at: http://bioinformaticsprb.med.wayne.edu/PADOG/or http://www.bioconductor.org.
Microarray-based gene expression profiling experiments, which are routine today, allow researchers to identify, for instance, genes differentially expressed (DE) between diseased and normal patient samples or genes that change in expression over time during a treatment. Unfortunately, the steady increase in the amount of data generated in the past decade from such experiments was not paralleled by the evolution of analytical methods used to extract knowledge from such datasets and, therefore, there is a gap between our ability to measure gene expression data and to extract workable knowledge from it.
Since the beginning of the microarray-based expression profiling experiments, researchers were interested in finding common “themes” among the genes identified as differentially expressed between two conditions. For instance the identification of Gene Ontology (GO) terms enriched in differentially expressed genes was used as early as 1999 , but became widespread only after the first on-line GO analysis tools were made available [2, 3]. As biological annotations started to include descriptions of gene interactions in the form of pathways (found in resources such as KEGG , BioCarta http://www.biocarta.com, and Reactome ), the identification of the pathways involved in various conditions has emerged as a ubiquitous bioinformatics task.
In general, biological pathways can be divided into gene signaling pathways, and metabolic pathways. Gene signaling pathways are graphs that use nodes to represent genes, or gene products, and edges to represent signals that go from one gene to another. Metabolic pathways are graphs that use nodes to represent biochemical compounds, and edges to describe biochemical reactions that involve such compounds. Since biochemical reactions are usually carried out by enzymes which are coded for by genes, in a metabolic pathway genes are associated with edges rather than nodes. Ideally, a comprehensive pathway analysis method would be able to take into consideration all aspects of the phenomena described by a pathway. These aspects would include the position and role of each gene in a pathway, the types of signals between genes, the efficiency with which a signal travels from one gene to another, or the efficiency with which a certain reaction is carried out, rate limiting conditions, etc. Such methods have been proposed for both signaling pathways [6–9], and metabolic pathways [8, 10], but no method is currently available to analyze both types of pathways taking into consideration all the information available. Hence, even though they do not use all information available, methods that treat the pathways as simple gene sets are still popular because they can be applied equally well to signaling pathways, metabolic pathways, GO terms, as well as arbitrary sets of genes.
Two of the most popular such methods are the Gene Set Enrichment Analysis (GSEA)  and the Gene Set Analysis (GSA) . These methods belong to the functional class scoring category of gene set analysis methods [13, 14]. For a simple two group experiment (e.g. disease vs. normal), both GSEA and GSA start with computing a t-statistic for each gene measured on the array. Then, a score is computed for each gene set using the t-scores of all genes in the gene set. The significance of the gene set scores is determined by using permutations of the samples. Both approaches treat the genes in the gene set equally.
In this work, we propose the P athway A nalysis with D own-weighting of O verlapping G enes (PADOG) which is a general gene set analysis method. The method gives more weight to genes that are gene set-specific, than to genes which can be found in multiple gene sets. This is similar to the approach commonly used in information retrieval (e.g. web search engines) that decreases the importance of words that appear in many documents (e.g. “and”, “or”, etc.) in favor of words that are highly specific to given documents, the latter type being considered to carry more information about the informational content of the document. Similarly, in our approach, if the differential expression affects genes that are highly specific to a given pathway (e.g. huntingtin to Hungtington’s disease), it is more likely that the respective pathway is truly relevant in that condition.
The process of down-weighting popular genes does not affect one’s ability to find a gene set to be significant whenever the gene set is composed mostly of ubiquitous genes, but rather increase the contrast between gene sets that overlap by reducing the contribution of the overlapping genes into the gene set scores. As a simple example, with PADOG, a gene set A having 20 out of 50 genes differentially expressed, that appear only in gene set A, will be found more significant than another gene set B of same size that has also 20 differentially expressed genes but which appear in other gene sets as well. Both GSEA and GSA would find the two gene sets equally significant.
Analysis methods that do not treat all genes equally were previously proposed for pathway analysis in an over-representation context [6, 7], or in a functional class scoring context , yet none specifically exploit the frequency of occurrence of genes across the pathways. Moreover, unlike GSA, PADOG does not rely on ordinary t-scores to derive gene set scores but uses moderated t-statistics  instead. A similar idea to use non-ordinary t-scores in the gene set scores computation was illustrated first in  by using SAM statistics  in conjunction with GSEA. Moreover, unlike GSA, PADOG summarizes the gene scores into a gene set score using the mean of absolute values instead of the maxmean statistic.
The sensitivity of gene set analysis methods (i.e. their ability to produce significant p-values for gene sets that are truly relevant to a phenotype), as well their ability to rank the relevant gene sets near the top, is typically assessed using a few data sets, by asking domain experts to make informed guesses about which gene sets are relevant to each condition/dataset. Relevance is determined using the expert’s knowledge and/or literature citations supporting the link between certain gene sets and the condition under the study [6, 7, 11, 18]. The problem is that almost any gene set analysis result will be supported by some references which makes an unbiased and objective comparison of various analysis methods practically impossible. In this study, we used a different approach in which we make fewer assumptions, and use an order of magnitude more data sets (24 sets). The type of gene sets considered in our validation were KEGG biological pathways. Each of the 24 microarray data sets that we used (see Table 1) involved a particular disease for which there is an associated pathway in the KEGG database , e.g. Alzheimer’s disease Colorectal cancer Asthma, etc. We refer to these as the target pathways, and we, very conservatively, consider them to be the only ones certain to be relevant for their respective conditions. Since the target pathways for all 24 datasets belong to the non-metabolic pathways category, we can restrict the analysis only to KEGG non-metabolic pathways. Analyzing all metabolic and non-metabolic pathways brings an additional challenge to the analysis methods because the assumed relevant pathway for a given condition (dataset) is now to be found among a larger pool of pathways. The gene set analysis methods were compared in terms of their ability to produce significant p-values for these target pathways and rank them near the top.
The two methods we chose to compare PADOG against are the Gene Set Enrichment Analysis (GSEA)  and the Gene Set Analysis (GSA) . Briefly, GSEA works as follows. Let GS i denote the ith gene set, where i=1.. N GS . For each gene j on the array, GSEA computes a t-statistic z j for the differential expression of the gene between the disease group and the control group. A gene set score S(GS i ) is computed similar to a signed version of the Kolmogorov-Smirnov statistic between the values z j j∈G S i and their complement (genes measured on the array but not belonging to the gene set). The class labels of the arrays are permuted and the significance of the gene set score is assessed by determining the null distribution of the gene set score.
The Gene Set Analysis (GSA)  differs from GSEA in two ways. Firstly, the gene set summary statistic used is the maxmean statistic, defined as:
where the (+) and (-) signs identify the positive and negative t-scores respectively, and n represents the number of genes in the gene set. Secondly, GSA differs from GSEA by re-standardizing the gene set scores by taking into account scores from sets formed by random selection of genes. Permutations of class labels are then used to infer the significance of the standardized gene set scores. The need for re-standardization is justified by the fact that, given that the genes are correlated (they tend to have either high or low t-scores simultaneously), the gene set score computed with the true class labels will be systematically larger than with permuted class labels and, hence, the significance of all gene sets will be overstated.
Pathway Analysis with Down-weighting of Overlapping Genes (PADOG)
Let GS i with i=1.. N GS be the collection of gene sets to be analyzed, each containing N(GS i ) genes, and G be the set of all genes measured on the array that can be mapped to at least one gene set to be analyzed. Then let be the value of a moderated t-score  of the gene g between the two conditions of interest with g∈G. The moderated t-scores are similar to ordinary t-scores, except that their standard errors have been moderated across genes, i.e., shrunk towards a common value using a Bayesian model . The moderated t-scores are expected to be more reliable than ordinary t-scores because the shrinkage of the gene standard deviations will prevent large t-scores to occur only due to small gene standard deviations.
Moreover, let f (g) be the frequency of gene g across all gene sets to be analyzed. Here f (g) can take values from 1 to N GS since a gene can be either specific to a gene set by appearing only in that gene set, or it is present in all gene sets, respectively. We want to weight the t-scores of the genes with a function of their frequency in such a way that the most frequently appearing gene gets a weight of w=1.0, while gene set specific genes get double weight (w=2.0). We chose a monotonically decreasing function to relate the gene weight w(g) to the gene frequency f (g) so that it is bounded between 1.0 and 2.0 and drops faster with increasing frequency values:
For illustration purposes, the distribution of gene frequencies across all 143 KEGG non-metabolic pathways (treated here as gene sets), as well as the dependency of gene weights on gene frequency, is shown in Figure 1. For each gene set we compute a score as:
The formula above describes the gene set scores as the mean across all genes in the gene set of the weighted absolute moderated t-scores. The gene set scores obtained with the formula above are first standardized using a row randomization approach described in  to yield S’0(GS i ). The row randomization consists of subtracting the mean and dividing by the standard deviation of gene set scores that could be obtained by randomly selecting sets of genes with the same size as the current gene set. Given that our gene set summarization function Eq. 2 is essentially a mean (of absolute weighted moderated t-scores) both the row standardization mean and standard deviation can be inferred from the mean and standard deviation of values of all genes on the array, as the central limit theorem would suggest, and hence no actual permutations are needed. More specifically, the row randomization mean for gene set GS i will given by the mean (of absolute weighted moderated t-scores) of all genes on the array, and the row randomization standard deviation can be calculated as the standard deviation of values of all genes on the array divided by . A second standardization is applied by subtracting the mean and dividing to the standard deviation of S’0(GS i ) scores across all N GS gene sets to obtain the observed standardized scores, . The probability P PADOG (GS i ) to observe such a large or larger standardized score is determined by permuting N ite =1000 times the array/samples labels:
where I is a function that returns 1 when the argument is true and 0 otherwise, and represents the standardized score obtained with the ite-th permutation of the samples for gene set GS i .
Assessing the sensitivity and gene set ranking capability using real data
To assess the sensitivity and ranking capability of the gene set analysis methods discussed in this paper, we identified in the Gene Expression Omnibus (GEO) , 24 microarray data sets each involving a particular disease. For each such disease, we considered the KEGG pathway that describes the biological phenomena taking place in that disease as the target pathway. For instance, the Alzheimer’s disease pathway is the target pathway for all Alzheimer data sets, etc. Table 1 shows the details about these 24 datasets. For most diseases considered, there are several associated data sets in this collection. The gene set analysis methods were compared in terms of their ability to produce low p-values, and rank at the top these target pathways (one in each data set). A schematic representation of the benchmark system used to assess the performance of each gene set analysis method is shown in Figure 2. There were three categories of statistics computed to compare the performance of the gene set analysis methods considered in this study:
Statistics that describe the distribution of the 24 target pathway’s p -values, including the geometric mean and median (the lower the better), and the percentage of target pathways with nominal p<0.05 (the higher the better). This later statistic is an estimate of the sensitivity of a given analysis method. The percentage of target pathways with False Discovery Rate  corrected p-values (called q-values) less than 0.05 is also given.
Statistics that describe the distribution of the 24 target pathways ranks, including mean and median (the lower the better). The rank of a target pathway, having the ithsmallest p-value amongst all N GS pathways analyzed for a given dataset, will be equal to i/ N GS ·100.
Statistics that allow to determine if a given pathway analysis method produces better rankings than a reference method, chosen to be GSA since it was the best among the two published methods that we tested. A simple method to test that the ranks produced by a given method for the 24 target pathways are smaller (better) than the reference method would be to use a one-tailed paired Wilcoxon test, the pairing being at data set level. However, the Wilcoxon test assumes that the different ranks are independent between the 24 datasets, yet this is may not be the case because some ranks are obtained for the same pathway in up to 4 datasets (see Table 1). Another approach that we used to analyze the ranks while accounting for the eventual lack of independence among them was to fit a linear mixed-effects model. The dependent variable in this model were the rank values, while the explanatory variables were the analysis method (factor with two levels, with the reference level being GSA) and the dataset ID (to reflect that the ranks are paired at the dataset level), while the random effects were the pathway IDs. Both the coefficient, and one-tailed p-value that a given analysis method produces better (smaller) ranks than the reference method were reported.
Note that the gene set analysis methods could have been compared also in analyzing gene ontology terms instead of pathways, however, choosing one GO term most relevant for each dataset would have been more subjective.
Assessing the sensitivity of gene set analysis using simulated data
A sensitivity comparison between GSEA, GSA and PADOG using simulated data was performed as in , but further expanded to also allow for overlap between gene sets. Expression data for 1000 genes and 100 samples (50 in each condition) is generated from a random normal distribution N(0,1). A number of 50 gene sets of size 20 were created, with the expression levels of some of the genes in gene set 1 (GS1) being artificially altered to make only this gene set relevant to the phenotype. Expression levels of genes in GS1 were changed according to the following 5 scenarios by varying the amount of change, the number of genes that change in the gene set, as well as the proportion of up- to down-regulated genes:
Level of the first 15 genes of GS1 was increased by 0.3 units in group 2.
Level of the first 10 genes of GS1 was increased by 0.3 units and the level of the next 5 genes was decreased by 0.3 units in group 2.
Level of the first 8 genes of GS1 was increased by 0.3 units and the level of the next 7 genes was decreased by 0.3 units in group 2.
Level of the first 7 genes of GS1 was increased by 0.4 units and the level of the next 3 genes was decreased by 0.4 units in group 2.
Level of the first 5 genes of GS1 was increased by 0.4 units and the level of the next 5 genes was decreased by 0.4 units in group 2.
A number of 50 data sets were generated for each of the six scenarios above. Orthogonal on the different scenarios we considered three analysis setups that could influence the results of PADOG but not GSA and GSEA, according to whether or not the genes in GS1 are allowed to be present in other gene set as well (e.g. (GS50)). In the first setup I), GS1 did not overlap with other gene sets as in , II) All genes designed to be DE in GS1 were included also in GS50, and III) All non-DE genes of GS1 were included in GS50. With setup I) we are basically interested in assessing if the gene set summarization function of PADOG (mean or absolute values) combined with the moderated t-scores compares favorably to GSA and GSEA, because in the absence of overlap, the genes of GS1 will have the same weight (w=1.0). When the DE genes in GS1 appear also in other gene sets but the non-DE do not (setup II), PADOG is expected to give higher p-values to GS1 compared to the situation when there is no overlap. This is because the weight of the DE genes in this case will be lower than the weight on non-DE genes. In contrary, if the genes that are non-DE in GS1 overlap but the DE genes are specific to GS1 (setup III) then PADOG is expected to produce smaller p-values for GS1 because the DE genes will have more weight and also larger t-scores.
Assessing the specificity of gene set analysis
To test the ability of the gene set analysis methods to not reject the null hypothesis when it is true, i.e. their specificity, we conducted two simulation studies.
Simulation of the null hypothesis by sample labels permutation
In the first simulation study all the 24 data sets were considered, but their array/samples class labels were permuted at random before analysis so that the correlation structure between genes is preserved. In 100 different trials, we computed several of the statistics described above, including the median of target pathways p-values, median ranks, and the percentage of pathways with p<0.05. The average of these statistics over the 100 trials are reported.
The purpose of this simulation was two-fold. First, it allows us to determine if the target pathways-based benchmark works, i.e. if the ranking results are worse for all methods when the labels are permuted compared to when the true class labels are used. Second, it allows us to estimate the false positive rate (1-specificity) of each gene set analysis method and compare it with the level expected under the null hypothesis. All analysis methods were run on the same 100 permutations of the original class labels of each of the 24 data sets to eliminate any differences introduced by random chance. The number of internal iterations used by each analysis method was N ite =500.
Simulation of the null hypothesis by generating random data
At the suggestion of one of the reviewers, a second type of simulation was performed to determine the false positives rate of gene set analysis methods by generating random data from a normal distribution with mean 0 and standard deviation of 1, N(0,1). For each of the 24 real datasets, 50 fake replicas were created by maintaining the actual sample size and number of genes but generating data at random, for a total of 1200 simulated datasets. The structure of the gene sets was preserved as defined by the 229 KEGG metabolic and non-metabolic pathways, therefore maintaining a meaningful overlap between the different genes in the gene sets. The fraction of all significant pathways (false positive rate) at different α thresholds was determined.
For all 24 datasets shown in Table 1 which were available from the Gene Expression Omnibus (GEO), the analysis was performed consistently by: a) removing outlier arrays (if necessary), b) log transforming the data and normalizing it, c) performing a moderated t-test between groups and computing probes/probesets p-values, d) resolving duplicate probes/probesets to Entrez ID mappings by keeping the probe/probeset with smallest p-value for each unique gene and, e) filtering out all genes that could not be mapped on any of the pathways. The normalization of datasets obtained on Affymetrix arrays was performed using the RMA algorithm  implemented in the affy package of Bioconductor, while normalization of of datasets run on Illumina arrays were normalized using the quantile normalization algorithm  implemented in the preprocessCore of Bioconductor. The package limma was used to compute a moderated two-sample paired or unpaired t-score depending on the particular design of each experiment.
The GSEA analysis was performed using the R implementation available freely at http://www.broadinstitute.org/gsea/index.jsp, while the GSA analysis was performed using the GSA R package . PADOG was implemented in R as well, together with the validation benchmark system comparing the methods. All methods were run using 1,000 iterations to estimate the pathway p-values shown in Tables 234 and 5, while 500 iterations were used in the specificity analysis results shown in Table 6 and 7.
The set of 229 metabolic and non-metabolic pathways and their genes were obtained from the KEGG.db annotation package  of Bioconductor . The split between metabolic and non-metabolic pathways was done based on KEGG’s classification.
All analyses were run under the R statistical language and environment  version 2.14 and using other infrastructure packages available in Bioconductor version 2.9.
Results and discussion
Sensitivity and rank analysis using real data
We compared the PADOG method proposed here with two existing methods (GSA and GSEA). The analysis was performed on i) 143 non-metabolic pathways (which included all target pathways) and ii) 229 metabolic and non-metabolic KEGG pathways. The criteria used in the comparison between these methods were the sensitivity, the ranking, as well as the specificity of the gene set analysis methods considered. Table 2 shows the summary of gene set analysis results for the three different methods based on the panel of 24 datasets described in Table 1 when analyzing only KEGG non-metabolic pathways.
PADOG compared favorably to both GSA and GSEA in terms of median and geometric mean p-values of the target pathways (which are expected to be relevant). Eight (33.3%) of the 24 target pathways were found to be significant (with a p-value less than 0.05) with PADOG, but only three did so with GSA (12.5%), and none with GSEA. PADOG was the only method to identify one (4.2%) of the 24 target pathways as significant after adjusting for multiple testing. In terms of the rank that each target pathway received in its data set (sorting pathways by p-values), PADOG produced significantly better (lower) rank values compared to GSA, as evaluated by both a paired Wilcoxon test (p=0.0007), and a linear mixed-effects model (p=0.0008). This later test accounts for the fact that the same disease pathway is the target pathway in up to 4 data sets (see Table 1). PADOG improves (reduces) the rank of target pathways by 7.2 rank units compared to GSA, which in turn is better than GSEA by 13.7 units. In other words, on average across the 24 data sets, the target pathways are ranked by PADOG approximately 7 rank units better than GSA, and approximately 21 rank units better than GSEA. The paired difference in ranks for the target pathways between pathway analysis methods and the GSA method, chosen as reference, are also shown using box plots in Figure 3.
To determine the robustness of PADOG with respect to changes in the collection of gene sets to be analyzed changes, we have run the same comparison shown in Table 2, on the entire set of 229 KEGG human pathways (metabolic and non-metabolic). An increase in the number of gene sets to be analyzed for a fixed gene expression dataset, is expected to impact the various methods in different ways. With PADOG, when there are more gene sets and, hence, more genes to be analyzed, the moderated t-scores of genes in all gene sets are expected to change because the shrinkage of standard deviations in the t-scores is based on a larger pool of genes . Secondly, the exact weights assigned to genes in PADOG depend on the number of gene sets in which they appear so these gene weights also change when the collection of gene sets to be analyzed changes. Table 3 and Figure 4 show that PADOG performed favorably compared to the other methods, and that the gains in terms or ranking and sensitivity are robust to changes in the collection of gene sets to be analyzed. Moreover, unlike any other method tested, PADOG identified one (4.2%) of the 24 target pathways as significant after adjusting for multiple testing.
Sensitivity analysis using simulated data
The result of the sensitivity analysis based on 50 simulated data sets in each of the 5 different scenarios are given in Table 4. These results show that when all genes designed to be differentially expressed (DE) in GS1 are changing in the same direction (scenario 1), GSA and GSEA have an advantage over PADOG while the opposite is true in all remaining 4 scenarios. These results can be understood by considering the fact that GSA and GSEA statistics are designed to find such cases when all the genes in the gene set change in the same direction while PADOG’s summary statistic is more flexible to accommodate cases when the changes occur in both directions. When the overlap favors the DE genes in GS1 (Setup III), that is, when its DE genes are specific to this gene set while its non DE-genes are not specific to the gene set, the performance of PADOG increase in all scenarios 1 through 5, as compared to the absence of overlap. However, even when the overlap is not favorable to GS1 (setup I), that is, when all its non-DE genes are specific to this gene set, PADOG still performes better than GSA and GSEA under scenarios 2 through 5.
Sources of improvement in PADOG
The use of gene weights is the main source of improvement with PADOG in terms of ranking and power. This is shown in Figure 5 and Table 5 in which PADOG is compared with simpler alternative methods that i) use gene weights but regular rather than moderated t-scores (noM), ii) use moderated t-scores but no gene weights (noW ) and iii) use neither moderated t-scores nor gene weights (noMnoW ). As it can be seen in Figure 5 left panel both methods that do not use weights (noW and noMnoW) give higher (worse) p-values for the target pathways than the two other methods that use weights (PADOG and noM). Also as, shown in Table 5, the use of moderated t-scores alone (noW) does not improve the raking compared to the reference (noMnoW) (mean rank is 22.3 vs 22.5 respectively). Although the use of weights (noM) improves the ranking significantly compared to the reference method (noMnoW), the improvement is higher in the presence of the moderated t-scores.
Specificity analysis of gene set analysis methods
Two simulation studies were performed to determine whether the improved sensitivity of the PADOG method, i.e. producing lower p-values for the target pathways, comes at the expense of reduced specificity (increased false positive rate). Table 6 shows three of the same statistics introduced in Table 2 (median p-values, median ranks, and percentage of pathways with p<0.05) except that their average was taken over 100 trials in which the class labels of the arrays in all 24 datasets were randomly permuted before the analysis. The percentage of target pathways with p<0.05 is now the false positive rate (FP) because using random class labels models the null hypothesis in which expression levels are dissociated from the studied phenotypes, yet the gene-gene correlations are preserved. Under these circumstances, any pathways that are reported as significant by any method are, in fact, false positives.
Table 6 shows that, under the null hypothesis, the average median p-values, median ranks and fraction of pathways with p<0.05 across the 100 random permutations are 0.49, 48.9% and 0.049, respectively for PADOG and similar values are obtained for GSA and GSEA. This is expected since when class labels are permuted, the p-values of the target pathways should be uniformly distributed between between 0 and 1 (expected mean 0.5), and rank values should be uniformly distributed between 1/ N GS ·100=0.44 and 100 (expected mean 50.22) where N GS is the number of gene sets analyzed. Table 6 also shows that the average median p-values and median ranks are much above (worse) than the level they had when true class labels were used in the analysis (see Table 2). This is the case for all analysis methods. These results prove that: i) the target pathways were indeed in average relevant to their respective phenotypes, ii) the benchmark system was sound, and iii) both the novel, as well as the existing methods were correctly deployed.
An additional simulation in which 1200 datasets were generated by drawing random values from a normal distribution has yielded similar results as the previous simulation. In this case the false positives rate was estimated as the fraction of all pathways across all 1200 datasets with a p-value less than a given threshold α. The estimated false positive rates of all three methods were very close to the expected α levels as shown in Table 7. This again confirms that PADOG is not expected to find significant gene sets more often than expected by chance regardless if gene are correlated (as in the simulation above) or not (this simulation).
Specificity analysis of the set of target pathways
In response to the suggestion of one of the reviewers, we aimed at determining how specific the target pathways were to their respective conditions. Given that the phenotype in 16 out of the 24 datasets used in our sensitivity assessment benchmark study is a form of cancer, we determined if the target pathway for each of these cancer types, in average, is found to be more significant than other general pathways typically associated with cancer such as Apoptosis, Cell cycle, Pathways in cancer, and RNA polymerase. Table 8 shows that in average on the 16 cancer datasets PADOG shows the strongest evidence (smallest p-values and rank statistics) for association between the pehnotype of the dataset and KEGG’s disease specific pathway for the phenotype (target pathway). The target pathway was preferred by all three methods to any other generic cancer related pathway that we have included in this comparison, based on median p-values and, by PADOG and GSA based on median ranks as well. The Pathways in cancer gene set came in a close second for both PADOG and GSA. While for Apoptosis and Cell cycle the median p-values and ranks were around 25% for all methods, for the RNA polymerase pathways these values were above 0.5. This analysis provides evidence that the target pathways we chose were indeed specific for their respective phenotypes.
The original contribution of this paper is two-fold. Firstly, this paper introduces the idea of gene weighting in gene set analysis on the basis of gene frequency across the gene sets to analyzed. The reasoning behind this type of gene weighting is that whenever a gene belongs to multiple gene sets, that particular gene is less useful in prioritizing among those gene sets. Conversely, the differential expression of a gene that is present only on a single gene set/pathway represents a stronger evidence that the given gene set/pathway is impacted in the given condition. A second original contribution is the validation procedure deployed here. The classical approach involves analyzing a handful of selected data sets and discussing the results in the light of the existing literature. This is subjective and makes the comparison of various methods practically impossible. The validation proposed here involves the analysis of a large number of data sets (24 in this case) that can be objectively associated with a target gene set/pathway. This objective association is based on the fact that the samples analyzed are collected from tissues affected by the target disease (e.g. in the analysis of colorectal cancer samples, the colorectal cancer pathway is chosen as the target pathway, etc.). This approach allows a comparison of analysis methods in terms of sensitivity and ranking. Such a comparison is: a) objective, b) reproducible, and c) independent of the accuracy and thoroughness of a literature search. Using this approach, we have shown that PADOG is able to identify the target pathways as significant more frequently and rank them consistently higher than two of the best existing methods for the analysis of gene sets based on high-throughput gene expression data.
ALT designed and implemented the research and drafted the manuscript. ALT, SD, and RR evaluated the research and improved the manuscript. GB collected the datasets and performed data pre-processing. All authors read and approved the final manuscript.
Tavazoie S, Hughes JD, Campbell MJ, Cho RJ: Church GM: Systematic determination of genetic network architecture. Nat Genet 1999, 22: 281–285. 10.1038/10343
Khatri P, Drăghici S, Ostermeier GC, Krawetz SA: Profiling Gene Expression Using Onto-Express. Genomics 2002, 79(2):266–270. 10.1006/geno.2002.6698
Drăghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Global functional profiling of gene expression. Genomics 2003, 81(2):98–104. 10.1016/S0888-7543(02)00021-6
Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, et al: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 1999, 27: 29–34. 10.1093/nar/27.1.29
Joshi-Tope G, Gillespie M, Vasrik I, D’Eustachio P, Schmidt E, de Bone B, Jassal B, Gopinath GR, Wu GR, Matthews L, Lewis S, Birney E, Stein L: Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005, 33(Database issue):D428–432.
Draghici S, Khatri P, Tarca AL, Amin K, Done A, Voichita C, Georgescu C, Romero R: A systems biology approach for pathway level analysis. Genome Research 2007, 17(10):1537–1545. 10.1101/gr.6202607
Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim JS, Kim CJ, Kusanovic JP: Romero R: A novel signaling pathway impact analysis. Bioinformatics 2009, 25: 75–82. 10.1093/bioinformatics/btn577
Thomas R, Gohlke JM, Stopper GF, Parham FM, Portier CJ: Choosing the right path: enhancement of biologically relevant sets of genes or proteins using pathway structure. Genome Biol 2009, 10(4):R44. 10.1186/gb-2009-10-4-r44
Massa MS, Chiogna M, Romualdi C: Gene set analysis exploiting the topology of a pathway. BMC Syst Biol 2010, 4: 121.
Rahnenführer J, Domingues FS, Maydt J, Lengauer T: Calculating the Statistical Significance of Changes in Pathway Activity From Gene Expression Data. Statistical Applications in Genetics and Molecular Biology 2004, 3: Article16.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceeding of The National Academy of Sciences of the USA 2005, 102(43):15545–15550. 10.1073/pnas.0506580102
Efron B, Tibshirani R: On testing the significance of sets of genes. Annals of Applied Statistics 2006, 1: 107–129.
Goeman JJ, Buhlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007, 23: 980–987. 10.1093/bioinformatics/btm051
Ackermann M, Strimmer K: A general modular framework for gene set enrichment analysis. BMC Bioinformatics 2009, 10: 47. 10.1186/1471-2105-10-47
Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical applications in genetics and molecular biology 2004, 3: Article3.
Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics 2007, 8: 242. 10.1186/1471-2105-8-242
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci 2001, 98(9):5116–5121. 10.1073/pnas.091062498
Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E: Using the Gene Ontology for Microarray Data Mining: A Comparison of Methods and Application to Age Effects in Human Prefrontal Cortex. Neurochem Res 2004, 29(6):1213–1222.
Kanehisa M, Goto S, Kawashima S, Okunom Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32(Database isuue):277–280.
Blalock EM, Geddes JW, Chen KC, Porter NM, Markesbery WR, Landfield PW: Incipient Alzheimer’s disease: microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc Natl Acad Sci U.S.A 2004, 101: 2173–2178. 10.1073/pnas.0308512100
Liang WS, Dunckley T, Beach TG, Grover A, Mastroeni D, Walker DG, Caselli RJ, Kukull WA, McKeel D, Morris JC, Hulette C, Schmechel D, Alexander GE, Reiman EM, Rogers J, Stephan DA: Gene expression profiles in anatomically and functionally distinct regions of the normal aged human brain. Physiol Genomics 2007, 28: 311–322.
Zheng B, Liao Z, Locascio JJ, Lesniak KA, Roderick SS, Watt ML, Eklund AC, Zhang-James Y, Kim PD, Hauser MA, Grunblatt E, Moran LB, Mandel SA, Riederer P, Miller RM, Federoff HJ, Wullner U, Papapetropoulos S, Youdim MB, Cantuti-Castelvetri I, Young AB, Vance JM, Davis RL, Hedreen JC, Adler CH, Beach TG, Graeber MB, Middleton FA, Rochet JC, Scherzer CR: PGC-1?, a potential therapeutic target for early intervention in Parkinson’s disease. Sci Transl Med 2010, 2(52):52ra73. 10.1126/scitranslmed.3001059
Zhang Y, James M, Middleton FA, Davis RL: Transcriptional analysis of multiple brain regions in Parkinson’s disease supports the involvement of specific protein processing, energy metabolism, and signaling pathways and suggests novel disease mechanisms. Am J Med Genet B Neuropsychiatr Genet 2005, 137B: 5–16. 10.1002/ajmg.b.30195
Runne H, Kuhn A, Wild EJ, Pratyaksha W, Kristiansen M, Isaacs JD, Regulier E, Delorenzi M, Tabrizi SJ, Luthi-Carter R: Analysis of potential transcriptomic biomarkers for Huntington’s disease in peripheral blood. Proc Natl Acad Sci U.S.A 2007, 104: 14424–14429. 10.1073/pnas.0703652104
Hong Y, Ho KS, Eu KW, Cheah PY: A susceptibility gene set for early onset colorectal cancer that integrates diverse signaling pathways: implication for tumorigenesis. Clin Cancer Res 2007, 13: 1107–1114. 10.1158/1078-0432.CCR-06-1633
Sabates-Bellver J, Van der Flier LG, de Palo M, Cattaneo E, Maake C, Rehrauer H, Laczko E, Kurowski MA, Bujnicki JM, Menigatti M, Luz J, Ranalli TV, Gomes V, Pastorelli A, Faggiani R, Anti M, Jiricny J, Clevers H, Marra G: Transcriptome profile of human colorectal adenomas. Mol Cancer Res 2007, 5: 1263–1275. 10.1158/1541-7786.MCR-07-0267
Hong Y, Downey T, Eu KW, Koh PK, Cheah PY, Koh PK, Cheah PY: A ’metastasis-prone’ signature for early-stage mismatch-repair proficient sporadic colorectal cancer patients and its implications for possible therapeutics. Clin Exp Metastasis 2010, 27: 83–90. 10.1007/s10585-010-9305-4
Wang Y, Roche O, Yan MS, Finak G, Evans AJ, Metcalf JL, Hast BE, Hanna SC, Wondergem B, Furge KA, Irwin MS, Kim WY, Teh BT, Grinstein S, Park M, Marsden PA, Ohh M: Regulation of endocytosis via the oxygen-sensing pathway. Nat Med 2009, 15: 319–324. 10.1038/nm.1922
Lenburg ME, Liou LS, Gerry NP, Frampton GM, Cohen HT, Christman MF: Previously unidentified changes in renal cell carcinoma gene expression identified by parametric analysis of microarray data. BMC Cancer 2003, 3: 31. 10.1186/1471-2407-3-31
Badea L, Herlea V, Dima SO, Dumitrascu T, Popescu I: Combined gene expression analysis of whole-tissue and microdissected pancreatic ductal adenocarcinoma identifies genes specifically overexpressed in tumor epithelia. Hepatogastroenterology 2008, 55: 2016–2027.
Pei H, Li L, Fridley BL, Jenkins GD, Kalari KR, Lingle W, Petersen G, Lou Z, Wang L: FKBP51 affects cancer cell response to chemotherapy by negatively regulating Akt. Cancer Cell 2009, 16: 259–266. 10.1016/j.ccr.2009.07.016
Wallace TA, Prueitt RL, Yi M, Howe TM, Gillespie JW, Yfantis HG, Stephens RM, Caporaso NE, Loffredo CA, Ambs S: Tumor immunobiological differences in prostate cancer between African-American and European-American men. Cancer Res 2008, 68: 927–936. 10.1158/0008-5472.CAN-07-2608
He H, Jazdzewski K, Li W, Liyanarachchi S, Nagy R, Volinia S, Calin GA, Liu CG, Franssila K, Suster S, Kloos RT, Croce CM, de la Chapelle A: The role of microRNA genes in papillary thyroid carcinoma. Proc Natl Acad Sci U.S.A 2005, 102: 19075–19080. 10.1073/pnas.0509603102
Stirewalt DL, Meshinchi S, Kopecky KJ, Fan W, Pogosova-Agadjanyan EL, Engel JH, Cronk MR, Dorcy KS, McQuary AR, Hockenbery D, Wood B, Heimfeld S, Radich JP: Identification of genes with abnormal expression changes in acute myeloid leukemia. Genes Chromosomes Cancer 2008, 47: 8–20. 10.1002/gcc.20500
Sanchez-Palencia A, Gomez-Morales M, Gomez-Capilla JA, Pedraza V, Boyero L, Rosell R, Farez-Vidal ME: Gene expression profiling reveals novel biomarkers in nonsmall cell lung cancer. Int J Cancer 2010, 129(2):355–364.
Hou J, Aerts J, den Hamer, van Ijcken, den Bakker, Riegman P, van der Leest, van der Spek, Foekens JA, Hoogsteden HC, Grosveld F, Philipsen S: Gene expression-based classification of non-small cell lung carcinomas and survival prediction. PLoS ONE 2010, 5: e10312. 10.1371/journal.pone.0010312
Barth AS, Kuner R, Buness A, Ruschhaupt M, Merk S, Zwermann L, Kaab S, Kreuzer E, Steinbeck G, Mansmann U, Poustka A, Nabauer M, Sultmann H: Identification of a common gene expression signature in dilated cardiomyopathy across independent microarray studies. J Am Coll Cardiol 2006, 48: 1610–1617.
Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30: 207–210. 10.1093/nar/30.1.207
Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of The Royal Statistical Society B 1995, 57: 289–300.
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics 2003, 4(2):249–264. 10.1093/biostatistics/4.2.249
Gautier L, Cope L, Bolstad BM, Irizarry RA: affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20(3):307–315. 10.1093/bioinformatics/btg405
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. 10.1186/gb-2004-5-10-r80
Bolstad BM, Irizarry RA, Astrand M, Speed TP: A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on variance and bias. Bioinformatics 2003, 19(2):185–193. 10.1093/bioinformatics/19.2.185
Smyth GK: Limma: linear models for microarray data. Springer, New York; 2005.
Efron B, Tibshirani R: GSA: Gene set analysis. 2010. . [R package version 1.03] http://CRAN.R-project.org/package=GSA . [R package version 1.03]
Carlson M, Falcon S, Pages H, Li N: KEGG.db: A set of annotation maps for KEGG. [R package version 2.5.0] [R package version 2.5.0]
R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2011. . [ISBN 3–900051–07–0] http://www.R-project.org . [ISBN 3-900051-07-0]
This research was supported, in part, by the Intramural Research Program of the National Institute of Child Health and Human Development, NIH/DHHS. SD was also supported by the following grants: NIH RO1 RDK089167-01 and NSF DBI-0965741 (to S.D.), and by the Robert J. Sokol Endowment in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.
The authors of the manuscript do not hold or intend to apply for a patent. However, one or more authors may file an employee invention disclosure form to notify their employer that the work leading to the manuscript may be patentable.
About this article
Cite this article
Tarca, A.L., Draghici, S., Bhatti, G. et al. Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics 13, 136 (2012). https://doi.org/10.1186/1471-2105-13-136