Down-weighting overlapping genes improves gene set analysis
© Tarca et al.; licensee BioMed Central Ltd. 2012
Received: 14 February 2012
Accepted: 18 May 2012
Published: 19 June 2012
Skip to main content
© Tarca et al.; licensee BioMed Central Ltd. 2012
Received: 14 February 2012
Accepted: 18 May 2012
Published: 19 June 2012
The identification of gene sets that are significantly impacted in a given condition based on microarray data is a crucial step in current life science research. Most gene set analysis methods treat genes equally, regardless how specific they are to a given gene set.
In this work we propose a new gene set analysis method that computes a gene set score as the mean of absolute values of weighted moderated gene t-scores. The gene weights are designed to emphasize the genes appearing in few gene sets, versus genes that appear in many gene sets. We demonstrate the usefulness of the method when analyzing gene sets that correspond to the KEGG pathways, and hence we called our method P athway A nalysis with D own-weighting of O verlapping G enes (PADOG). Unlike most gene set analysis methods which are validated through the analysis of 2-3 data sets followed by a human interpretation of the results, the validation employed here uses 24 different data sets and a completely objective assessment scheme that makes minimal assumptions and eliminates the need for possibly biased human assessments of the analysis results.
PADOG significantly improves gene set ranking and boosts sensitivity of analysis using information already available in the gene expression profiles and the collection of gene sets to be analyzed. The advantages of PADOG over other existing approaches are shown to be stable to changes in the database of gene sets to be analyzed. PADOG was implemented as an R package available at: http://bioinformaticsprb.med.wayne.edu/PADOG/or http://www.bioconductor.org.
Microarray-based gene expression profiling experiments, which are routine today, allow researchers to identify, for instance, genes differentially expressed (DE) between diseased and normal patient samples or genes that change in expression over time during a treatment. Unfortunately, the steady increase in the amount of data generated in the past decade from such experiments was not paralleled by the evolution of analytical methods used to extract knowledge from such datasets and, therefore, there is a gap between our ability to measure gene expression data and to extract workable knowledge from it.
Since the beginning of the microarray-based expression profiling experiments, researchers were interested in finding common “themes” among the genes identified as differentially expressed between two conditions. For instance the identification of Gene Ontology (GO) terms enriched in differentially expressed genes was used as early as 1999 , but became widespread only after the first on-line GO analysis tools were made available [2, 3]. As biological annotations started to include descriptions of gene interactions in the form of pathways (found in resources such as KEGG , BioCarta http://www.biocarta.com, and Reactome ), the identification of the pathways involved in various conditions has emerged as a ubiquitous bioinformatics task.
In general, biological pathways can be divided into gene signaling pathways, and metabolic pathways. Gene signaling pathways are graphs that use nodes to represent genes, or gene products, and edges to represent signals that go from one gene to another. Metabolic pathways are graphs that use nodes to represent biochemical compounds, and edges to describe biochemical reactions that involve such compounds. Since biochemical reactions are usually carried out by enzymes which are coded for by genes, in a metabolic pathway genes are associated with edges rather than nodes. Ideally, a comprehensive pathway analysis method would be able to take into consideration all aspects of the phenomena described by a pathway. These aspects would include the position and role of each gene in a pathway, the types of signals between genes, the efficiency with which a signal travels from one gene to another, or the efficiency with which a certain reaction is carried out, rate limiting conditions, etc. Such methods have been proposed for both signaling pathways [6–9], and metabolic pathways [8, 10], but no method is currently available to analyze both types of pathways taking into consideration all the information available. Hence, even though they do not use all information available, methods that treat the pathways as simple gene sets are still popular because they can be applied equally well to signaling pathways, metabolic pathways, GO terms, as well as arbitrary sets of genes.
Two of the most popular such methods are the Gene Set Enrichment Analysis (GSEA)  and the Gene Set Analysis (GSA) . These methods belong to the functional class scoring category of gene set analysis methods [13, 14]. For a simple two group experiment (e.g. disease vs. normal), both GSEA and GSA start with computing a t-statistic for each gene measured on the array. Then, a score is computed for each gene set using the t-scores of all genes in the gene set. The significance of the gene set scores is determined by using permutations of the samples. Both approaches treat the genes in the gene set equally.
In this work, we propose the P athway A nalysis with D own-weighting of O verlapping G enes (PADOG) which is a general gene set analysis method. The method gives more weight to genes that are gene set-specific, than to genes which can be found in multiple gene sets. This is similar to the approach commonly used in information retrieval (e.g. web search engines) that decreases the importance of words that appear in many documents (e.g. “and”, “or”, etc.) in favor of words that are highly specific to given documents, the latter type being considered to carry more information about the informational content of the document. Similarly, in our approach, if the differential expression affects genes that are highly specific to a given pathway (e.g. huntingtin to Hungtington’s disease), it is more likely that the respective pathway is truly relevant in that condition.
The process of down-weighting popular genes does not affect one’s ability to find a gene set to be significant whenever the gene set is composed mostly of ubiquitous genes, but rather increase the contrast between gene sets that overlap by reducing the contribution of the overlapping genes into the gene set scores. As a simple example, with PADOG, a gene set A having 20 out of 50 genes differentially expressed, that appear only in gene set A, will be found more significant than another gene set B of same size that has also 20 differentially expressed genes but which appear in other gene sets as well. Both GSEA and GSA would find the two gene sets equally significant.
Analysis methods that do not treat all genes equally were previously proposed for pathway analysis in an over-representation context [6, 7], or in a functional class scoring context , yet none specifically exploit the frequency of occurrence of genes across the pathways. Moreover, unlike GSA, PADOG does not rely on ordinary t-scores to derive gene set scores but uses moderated t-statistics  instead. A similar idea to use non-ordinary t-scores in the gene set scores computation was illustrated first in  by using SAM statistics  in conjunction with GSEA. Moreover, unlike GSA, PADOG summarizes the gene scores into a gene set score using the mean of absolute values instead of the maxmean statistic.
The 24 data sets used to assess the proposed gene set analysis method
Brain, Entorhinal Cortex
Brain, Primary visual cortex
Postmortem brain putamen
Acute myeloid leukemia
Blood, Bone marrow
Non-Small Cell Lung Cancer
Non-Small Cell Lung Cancer
The two methods we chose to compare PADOG against are the Gene Set Enrichment Analysis (GSEA)  and the Gene Set Analysis (GSA) . Briefly, GSEA works as follows. Let GS i denote the i th gene set, where i=1.. N GS . For each gene j on the array, GSEA computes a t-statistic z j for the differential expression of the gene between the disease group and the control group. A gene set score S(GS i ) is computed similar to a signed version of the Kolmogorov-Smirnov statistic between the values z j j∈G S i and their complement (genes measured on the array but not belonging to the gene set). The class labels of the arrays are permuted and the significance of the gene set score is assessed by determining the null distribution of the gene set score.
where the (+) and (-) signs identify the positive and negative t-scores respectively, and n represents the number of genes in the gene set. Secondly, GSA differs from GSEA by re-standardizing the gene set scores by taking into account scores from sets formed by random selection of genes. Permutations of class labels are then used to infer the significance of the standardized gene set scores. The need for re-standardization is justified by the fact that, given that the genes are correlated (they tend to have either high or low t-scores simultaneously), the gene set score computed with the true class labels will be systematically larger than with permuted class labels and, hence, the significance of all gene sets will be overstated.
Let GS i with i=1.. N GS be the collection of gene sets to be analyzed, each containing N(GS i ) genes, and G be the set of all genes measured on the array that can be mapped to at least one gene set to be analyzed. Then let be the value of a moderated t-score  of the gene g between the two conditions of interest with g∈G. The moderated t-scores are similar to ordinary t-scores, except that their standard errors have been moderated across genes, i.e., shrunk towards a common value using a Bayesian model . The moderated t-scores are expected to be more reliable than ordinary t-scores because the shrinkage of the gene standard deviations will prevent large t-scores to occur only due to small gene standard deviations.
where I is a function that returns 1 when the argument is true and 0 otherwise, and represents the standardized score obtained with the ite-th permutation of the samples for gene set GS i .
Statistics that describe the distribution of the 24 target pathway’s p -values, including the geometric mean and median (the lower the better), and the percentage of target pathways with nominal p<0.05 (the higher the better). This later statistic is an estimate of the sensitivity of a given analysis method. The percentage of target pathways with False Discovery Rate  corrected p-values (called q-values) less than 0.05 is also given.
Statistics that describe the distribution of the 24 target pathways ranks, including mean and median (the lower the better). The rank of a target pathway, having the i th smallest p-value amongst all N GS pathways analyzed for a given dataset, will be equal to i/ N GS ·100.
Statistics that allow to determine if a given pathway analysis method produces better rankings than a reference method, chosen to be GSA since it was the best among the two published methods that we tested. A simple method to test that the ranks produced by a given method for the 24 target pathways are smaller (better) than the reference method would be to use a one-tailed paired Wilcoxon test, the pairing being at data set level. However, the Wilcoxon test assumes that the different ranks are independent between the 24 datasets, yet this is may not be the case because some ranks are obtained for the same pathway in up to 4 datasets (see Table 1). Another approach that we used to analyze the ranks while accounting for the eventual lack of independence among them was to fit a linear mixed-effects model. The dependent variable in this model were the rank values, while the explanatory variables were the analysis method (factor with two levels, with the reference level being GSA) and the dataset ID (to reflect that the ranks are paired at the dataset level), while the random effects were the pathway IDs. Both the coefficient, and one-tailed p-value that a given analysis method produces better (smaller) ranks than the reference method were reported.
Note that the gene set analysis methods could have been compared also in analyzing gene ontology terms instead of pathways, however, choosing one GO term most relevant for each dataset would have been more subjective.
Level of the first 15 genes of GS 1 was increased by 0.3 units in group 2.
Level of the first 10 genes of GS 1 was increased by 0.3 units and the level of the next 5 genes was decreased by 0.3 units in group 2.
Level of the first 8 genes of GS 1 was increased by 0.3 units and the level of the next 7 genes was decreased by 0.3 units in group 2.
Level of the first 7 genes of GS 1 was increased by 0.4 units and the level of the next 3 genes was decreased by 0.4 units in group 2.
Level of the first 5 genes of GS 1 was increased by 0.4 units and the level of the next 5 genes was decreased by 0.4 units in group 2.
A number of 50 data sets were generated for each of the six scenarios above. Orthogonal on the different scenarios we considered three analysis setups that could influence the results of PADOG but not GSA and GSEA, according to whether or not the genes in GS 1 are allowed to be present in other gene set as well (e.g. (GS 50)). In the first setup I), GS 1 did not overlap with other gene sets as in , II) All genes designed to be DE in GS 1 were included also in GS 50, and III) All non-DE genes of GS 1 were included in GS 50. With setup I) we are basically interested in assessing if the gene set summarization function of PADOG (mean or absolute values) combined with the moderated t-scores compares favorably to GSA and GSEA, because in the absence of overlap, the genes of GS 1 will have the same weight (w=1.0). When the DE genes in GS 1 appear also in other gene sets but the non-DE do not (setup II), PADOG is expected to give higher p-values to GS 1 compared to the situation when there is no overlap. This is because the weight of the DE genes in this case will be lower than the weight on non-DE genes. In contrary, if the genes that are non-DE in GS 1 overlap but the DE genes are specific to GS 1 (setup III) then PADOG is expected to produce smaller p-values for GS 1 because the DE genes will have more weight and also larger t-scores.
To test the ability of the gene set analysis methods to not reject the null hypothesis when it is true, i.e. their specificity, we conducted two simulation studies.
In the first simulation study all the 24 data sets were considered, but their array/samples class labels were permuted at random before analysis so that the correlation structure between genes is preserved. In 100 different trials, we computed several of the statistics described above, including the median of target pathways p-values, median ranks, and the percentage of pathways with p<0.05. The average of these statistics over the 100 trials are reported.
The purpose of this simulation was two-fold. First, it allows us to determine if the target pathways-based benchmark works, i.e. if the ranking results are worse for all methods when the labels are permuted compared to when the true class labels are used. Second, it allows us to estimate the false positive rate (1-specificity) of each gene set analysis method and compare it with the level expected under the null hypothesis. All analysis methods were run on the same 100 permutations of the original class labels of each of the 24 data sets to eliminate any differences introduced by random chance. The number of internal iterations used by each analysis method was N ite =500.
At the suggestion of one of the reviewers, a second type of simulation was performed to determine the false positives rate of gene set analysis methods by generating random data from a normal distribution with mean 0 and standard deviation of 1, N(0,1). For each of the 24 real datasets, 50 fake replicas were created by maintaining the actual sample size and number of genes but generating data at random, for a total of 1200 simulated datasets. The structure of the gene sets was preserved as defined by the 229 KEGG metabolic and non-metabolic pathways, therefore maintaining a meaningful overlap between the different genes in the gene sets. The fraction of all significant pathways (false positive rate) at different αthresholds was determined.
For all 24 datasets shown in Table 1 which were available from the Gene Expression Omnibus (GEO), the analysis was performed consistently by: a) removing outlier arrays (if necessary), b) log transforming the data and normalizing it, c) performing a moderated t-test between groups and computing probes/probesets p-values, d) resolving duplicate probes/probesets to Entrez ID mappings by keeping the probe/probeset with smallest p-value for each unique gene and, e) filtering out all genes that could not be mapped on any of the pathways. The normalization of datasets obtained on Affymetrix arrays was performed using the RMA algorithm  implemented in the affy package of Bioconductor, while normalization of of datasets run on Illumina arrays were normalized using the quantile normalization algorithm  implemented in the preprocessCore of Bioconductor. The package limma was used to compute a moderated two-sample paired or unpaired t-score depending on the particular design of each experiment.
Comparison between gene set analysis methods in terms of sensitivity and pathway ranking when analyzing 143 KEGG non-metabolic pathways
p geometric mean
Comparison between pathway analysis methods in terms of sensitivity and pathway ranking when analyzing 229 KEGG metabolic and non-metabolic pathways
p geometric mean
A sensitivity analysis using simulated data in the absence and presence of overlap between gene sets
PADOG Setup I
PADOG Setup II
PADOG Setup III
Determining the contribution of gene weighting and moderated t-scores in PADOG when analyzing 229 KEGG metabolic and non-metabolic pathways
Comparing gene set analysis methods performance under the null hypothesis simulated by class labels permutation
False positive rates when null hypothesis is simulated by generating random expression data
The set of 229 metabolic and non-metabolic pathways and their genes were obtained from the KEGG.db annotation package  of Bioconductor . The split between metabolic and non-metabolic pathways was done based on KEGG’s classification.
All analyses were run under the R statistical language and environment  version 2.14 and using other infrastructure packages available in Bioconductor version 2.9.
We compared the PADOG method proposed here with two existing methods (GSA and GSEA). The analysis was performed on i) 143 non-metabolic pathways (which included all target pathways) and ii) 229 metabolic and non-metabolic KEGG pathways. The criteria used in the comparison between these methods were the sensitivity, the ranking, as well as the specificity of the gene set analysis methods considered. Table 2 shows the summary of gene set analysis results for the three different methods based on the panel of 24 datasets described in Table 1 when analyzing only KEGG non-metabolic pathways.
The result of the sensitivity analysis based on 50 simulated data sets in each of the 5 different scenarios are given in Table 4. These results show that when all genes designed to be differentially expressed (DE) in GS 1 are changing in the same direction (scenario 1), GSA and GSEA have an advantage over PADOG while the opposite is true in all remaining 4 scenarios. These results can be understood by considering the fact that GSA and GSEA statistics are designed to find such cases when all the genes in the gene set change in the same direction while PADOG’s summary statistic is more flexible to accommodate cases when the changes occur in both directions. When the overlap favors the DE genes in GS 1 (Setup III), that is, when its DE genes are specific to this gene set while its non DE-genes are not specific to the gene set, the performance of PADOG increase in all scenarios 1 through 5, as compared to the absence of overlap. However, even when the overlap is not favorable to GS 1 (setup I), that is, when all its non-DE genes are specific to this gene set, PADOG still performes better than GSA and GSEA under scenarios 2 through 5.
Two simulation studies were performed to determine whether the improved sensitivity of the PADOG method, i.e. producing lower p-values for the target pathways, comes at the expense of reduced specificity (increased false positive rate). Table 6 shows three of the same statistics introduced in Table 2 (median p-values, median ranks, and percentage of pathways with p<0.05) except that their average was taken over 100 trials in which the class labels of the arrays in all 24 datasets were randomly permuted before the analysis. The percentage of target pathways with p<0.05 is now the false positive rate (FP) because using random class labels models the null hypothesis in which expression levels are dissociated from the studied phenotypes, yet the gene-gene correlations are preserved. Under these circumstances, any pathways that are reported as significant by any method are, in fact, false positives.
Table 6 shows that, under the null hypothesis, the average median p-values, median ranks and fraction of pathways with p<0.05 across the 100 random permutations are 0.49, 48.9% and 0.049, respectively for PADOG and similar values are obtained for GSA and GSEA. This is expected since when class labels are permuted, the p-values of the target pathways should be uniformly distributed between between 0 and 1 (expected mean 0.5), and rank values should be uniformly distributed between 1/ N GS ·100=0.44 and 100 (expected mean 50.22) where N GS is the number of gene sets analyzed. Table 6 also shows that the average median p-values and median ranks are much above (worse) than the level they had when true class labels were used in the analysis (see Table 2). This is the case for all analysis methods. These results prove that: i) the target pathways were indeed in average relevant to their respective phenotypes, ii) the benchmark system was sound, and iii) both the novel, as well as the existing methods were correctly deployed.
An additional simulation in which 1200 datasets were generated by drawing random values from a normal distribution has yielded similar results as the previous simulation. In this case the false positives rate was estimated as the fraction of all pathways across all 1200 datasets with a p-value less than a given threshold α. The estimated false positive rates of all three methods were very close to the expected α levels as shown in Table 7. This again confirms that PADOG is not expected to find significant gene sets more often than expected by chance regardless if gene are correlated (as in the simulation above) or not (this simulation).
A specificity analysis of the target pathways on 16 cancer data sets
Pathways in cancer
Pathways in cancer
The original contribution of this paper is two-fold. Firstly, this paper introduces the idea of gene weighting in gene set analysis on the basis of gene frequency across the gene sets to analyzed. The reasoning behind this type of gene weighting is that whenever a gene belongs to multiple gene sets, that particular gene is less useful in prioritizing among those gene sets. Conversely, the differential expression of a gene that is present only on a single gene set/pathway represents a stronger evidence that the given gene set/pathway is impacted in the given condition. A second original contribution is the validation procedure deployed here. The classical approach involves analyzing a handful of selected data sets and discussing the results in the light of the existing literature. This is subjective and makes the comparison of various methods practically impossible. The validation proposed here involves the analysis of a large number of data sets (24 in this case) that can be objectively associated with a target gene set/pathway. This objective association is based on the fact that the samples analyzed are collected from tissues affected by the target disease (e.g. in the analysis of colorectal cancer samples, the colorectal cancer pathway is chosen as the target pathway, etc.). This approach allows a comparison of analysis methods in terms of sensitivity and ranking. Such a comparison is: a) objective, b) reproducible, and c) independent of the accuracy and thoroughness of a literature search. Using this approach, we have shown that PADOG is able to identify the target pathways as significant more frequently and rank them consistently higher than two of the best existing methods for the analysis of gene sets based on high-throughput gene expression data.
This research was supported, in part, by the Intramural Research Program of the National Institute of Child Health and Human Development, NIH/DHHS. SD was also supported by the following grants: NIH RO1 RDK089167-01 and NSF DBI-0965741 (to S.D.), and by the Robert J. Sokol Endowment in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.