A Bayesian variable selection procedure to rank overlapping gene sets
© Skarman et al.; licensee BioMed Central Ltd. 2012
Received: 22 September 2011
Accepted: 11 April 2012
Published: 3 May 2012
Skip to main content
© Skarman et al.; licensee BioMed Central Ltd. 2012
Received: 22 September 2011
Accepted: 11 April 2012
Published: 3 May 2012
Genome-wide expression profiling using microarrays or sequence-based technologies allows us to identify genes and genetic pathways whose expression patterns influence complex traits. Different methods to prioritize gene sets, such as the genes in a given molecular pathway, have been described. In many cases, these methods test one gene set at a time, and therefore do not consider overlaps among the pathways. Here, we present a Bayesian variable selection method to prioritize gene sets that overcomes this limitation by considering all gene sets simultaneously. We applied Bayesian variable selection to differential expression to prioritize the molecular and genetic pathways involved in the responses to Escherichia coli infection in Danish Holstein cows.
We used a Bayesian variable selection method to prioritize Kyoto Encyclopedia of Genes and Genomes pathways. We used our data to study how the variable selection method was affected by overlaps among the pathways. In addition, we compared our approach to another that ignores the overlaps, and studied the differences in the prioritization. The variable selection method was robust to a change in prior probability and stable given a limited number of observations.
Bayesian variable selection is a useful way to prioritize gene sets while considering their overlaps. Ignoring the overlaps gives different and possibly misleading results. Additional procedures may be needed in cases of highly overlapping pathways that are hard to prioritize.
Genome-wide expression profiling using microarrays or sequence-based technologies allows us to identify genes and genetic pathways whose expression patterns influence complex traits. It is likely that many phenotypic differences are manifested by small but consistent expression changes in a set of genes (e.g. biological pathways, complexes or modules). Therefore statistical methods have been developed to capture changes in the expression of pre-defined sets of genes. These gene set approaches are complementary to analyses at the single-gene level and represent powerful tools to dissect the complex changes in gene expression that underlie phenotypic traits .
Gene set approaches are based on sets of genes typically defined based on prior biological knowledge, such as genes belonging to the same molecular pathway (Kyoto Encyclopedia of Genes and Genomes (KEGG))  or genes encoding proteins with similar functions (e.g., Gene Ontology (GO)) . One potential problem is that genes can exist in many gene sets and the level of overlap can be substantial. This is ignored in statistical analyses where the gene sets are analyzed individually, which can cause difficulty in interpreting the results, because one cannot determine from the available data which one of the gene sets is more responsible for the effect. In the most extreme case, there could be one or more that are identical. In these cases it is impossible to determine which is responsible for the effect .
One powerful gene set approach is Gene Set Enrichment Analysis (GSEA) [1, 5]. GSEA aggregates the per-gene statistics within a gene set, thus making it possible to detect situations where all the genes in a predefined set change in a small but coordinated way. GSEA can be implemented in a manner similar to a linear regression modeling approach that consists of three components: the incidence matrix linking genes to the gene set; the per-gene statistic vector, e.g., the t-statistic, and a per-set summing function. In this way, a large number of gene sets and overlapping gene sets can be viewed as a linear regression with a large number of highly collinear regression variables. This is a typical combinatorial and model selection problem. One way of handling this problem has been to use linear modeling to select a model . This could, for example, be achieved using Akaike’s Information Criterion (AIC) . However, the optimizations according to this criterion should in principle compare all 2 p possible models (where P is the number of gene sets). This becomes computationally demanding as the number of gene sets increases.
Another way to consider the overlaps is the TopGO approach . This takes into account the hierarchical structure of the gene sets among GO terms. However, this method is limited to overlaps that appear because of this hierarchical structure. The DAVID gene classification tool is another method that takes account of overlaps [8, 9]. In this tool, the more highly overlapping gene sets are organized as groups . This method analyzes these groups of gene sets instead of the individual gene sets, and makes it possible to score the groups according to the scores of the member gene sets.
In this study we present a gene set approach based on the Bayesian variable selection method, known as Stochastic Search Variable Selection (SSVS) . The Markov chains underlying such Bayesian methods are efficient at handling combinatorial problems such as model selection. This approach can deal with a large number of gene sets and considers the overlaps among the gene sets. Instead of first finding the gene sets with significant effects and thereafter assessing their overlaps and correcting for the correlations among them, this approach should accomplish this process in just one step. The focus now is to investigate how the variable selection procedure analyzes the overlaps among gene sets and how this affects the prioritization of gene sets. Here, we demonstrate the use of this novel gene set approach in a genome-wide expression study of Escherichia coli-induced mastitis (udder infection) in dairy cattle.
The gene set approach presented in this study is based on a linear model to identify and prioritize KEGG pathways whose expression levels are associated with bovine mastitis.
KEGG pathways were used as gene sets. These pathways are a collection of high-quality molecular interaction and reaction networks representing the current knowledge of many important biological processes. The use of KEGG pathways as gene sets illustrates situations where the number of gene sets is relatively large and where the level of overlap among the gene sets is substantial. We used 196 KEGG pathways and 3130 bovine Entrez gene identifiers. Only KEGG pathways containing more than four genes were included. The number of genes in a set ranged from five to 793 (‘Metabolic Pathways’). The number of occurrences of the same gene across pathways ranged from one to 41. The KEGG pathways were taken from the KEGG database using the genome-wide annotations of bovine Entrez gene identifiers.
Scaling up the analysis to higher dimensions and to cases with increased overlap, as when GO is used to define gene sets, was not expected to be problematic. The current analysis uses little memory and computing time, and both will scale-up linearly in these Bayesian implementations. With higher overlaps among gene sets, worse mixing for gene effects among sets can be expected, probably requiring some increase in the Markov chain Monte Carlo chain length to obtain accurate estimates.
where z is the per-gene statistic (e.g., t-statistic), which is a measure of the association between the individual genes and the trait phenotype; μ is the general mean; X is an incidence matrix linking genes to the gene set and the per-gene statistic z. The residuals e are assumed to be independent and identically distributed according to e ~ N (0, Iσ 2). The elements of the incidence matrix have a non-zero value if the gene belongs to the gene set and zero otherwise. To account for direction of the expression changes we used a −1 for genes that are down regulated and a 1 for those that are up regulated. Each row of the incidence matrix corresponds to a gene and each column to a gene set. For this study, the full incidence matrix had 3130 rows (corresponding to the total number of genes) and 196 columns (corresponding to the total number of KEGG pathways) ( Additional file 1); β is the regression coefficient that is the summary statistic for each pathway.
Analysis of Variance (ANOVA) can be used to identify gene sets that explain a larger proportion of the variance in z, using the least squares technique. This can be achieved by fitting one gene set at a time and ignoring the overlap among gene sets . To account for the overlap it is necessary to fit multiple gene sets simultaneously. The total number of models possible to create from this is 2 p , where p is the number of gene sets. This could become computationally challenging for model selection based on comparisons of all possible models such as, for example, AIC . Thus ANOVA-based testing of one gene set at a time was used as the reference method in comparison to the Bayesian variable selection method described in detail below. The test was corrected for multiple testing using the false discovery rate method of Benjamini and Yekutieli .
The model uses a small prior probability for γ i to be 1, and is chosen to be a small number while is conditioned to be larger than and is estimated from the data, which has the effect that most regression coefficients β i are (very) small. γ i = 1 indicates that KEGG pathway i is present in the model and γ i = 0 indicates that KEGG pathway i is absent from the model. The details of the Bayesian analysis are given in Additional file 2.
Both the dimension problem and the problem of comparing all 2 p models are countered by the use of Gibbs sampling . The Gibbs sampler generates sequences of Markov chain Monte Carlo samples of the latent variables that converge rapidly to the posterior distributions of the latent variables. The Gibbs sampler also generates a sequence of β values and residual standard deviations σ as well as the latent variables γ. These variables are dependent on each other. From the posterior probabilities of the indicator variables Bayes Factors were computed as the posterior odds divided by the prior odds for including a predictor in the model .
In this way, the computational complexity would be where r is the number of iterations of the simulation, n is the total number of genes in all the KEGG pathways, and m is the number of KEGG pathways. It is clear that the number of possible models and the dimensional problem do not drastically affect the computation time. This is mainly because computationally demanding matrix multiplications are avoided. The combinatorial problem is also reduced, because the Markov chain converges to the posterior distribution of the model probabilities.
Ultimately, this method would be reasonably fast but still able to account for overlaps among the gene sets. The identification of gene sets is based on the average t-statistic for the genes in the set, and therefore there is no principle relationship between the size of the gene set and the chance of selecting a gene set for the model. The variable selection method was implemented in the software package iBay (http://www.bayz.biz) .
We demonstrated our approach using data from a genome-wide expression study of mastitis in dairy cattle . The aim was to identify the global changes in mammary gland gene expression associated with bovine E. coli-induced mastitis during the acute and chronic stage of the infection in early lactating dairy cows. Sixteen healthy Danish Holstein-Friesian cows were challenged intra-mammarily with E. coli 4 to 6 weeks after parturition. Udder tissue biopsies were collected ante-mortem from dairy cows during the acute (24 h) and the chronic (192 h) stages of the E. coli infection. Further experimental details can be found in the original publication .
Gene expression was measured using the Bovine Genome array (Affymetrix, Santa Clara, CA). The array contained 24128 probe sets that represented 11030 Entrez genes. The Bovine Genome Array annotation available from the NetAffx™ Analysis Centre (Bovine.na29.annot.csv) was used as well as bovine.db (version 2.3.5) in Bioconductor . In total, 3130 Entrez genes were assigned to KEGG pathways using the package org.Bt.eg.db (version 2.3.5).
The primary gene expression data were analyzed using R (version 2.10.1) . Normalization of the expression values for the udder was performed using the Robust Multi-array Average algorithm implemented in the Affy package (version 1.24.2) . Differential expression of individual genes was computed using linear modeling and empirical Bayes methods, which were implemented in the R package Limma (version 3.2.2) . The linear models allowed for changes in the time-points. The time-points were 24 h and 192 h. The contrast used was 24 h versus 192 h post-infection. A modified t-value was computed for each gene targeted by a probe. This was the per-gene summary statistic used as the response variable z in the linear model described above.
The complete data are shown in Additional file 3.
We compared the results from the Bayesian variable selection procedure to those of an ANOVA approach that ignores the overlaps among gene sets. The KEGG pathways highly ranked by the ANOVA approach are shown in Figure 2. From this heat-map it appears that the ANOVA approach ranks highly overlapping pathways together. Another way of investigating this tendency is that if the overlapping gene sets tend to be ranked similarly, the differences among the ranking scores would be negatively correlated to the relative overlaps. If the gene sets that are highly overlapping do not tend to have a similar score, the correlation would be zero or close to zero. For the ANOVA method, the p-values were used as scores and the Spearman correlation was −0.34. For the variable selection method, the Spearman correlation was −0.0063. It appears that the ANOVA approach tends to rank overlapping gene sets similarly while the Bayesian variable selection approach does not.
One illustrative example is the ‘Focal Adhesion’ pathway that contains 152 genes. It was considered highly significant by the ANOVA approach. Using the Bayesian variable selection procedure, however, it had a very low posterior probability (0.013) of being included in the model. Therefore, it is unlikely to play an important role in the acute phase response in the udder. The discrepancy can be explained because it has a large overlap (50 genes) with the ‘Extracellular Matrix (ECM)-receptor Interaction’ pathway, which is highly ranked by the variable selection method. In total, ‘ECM-receptor Interaction’ contains 63 genes. By the Bayesian method, the ‘Focal Adhesion’ pathway does not contribute greatly once the ‘ECM-receptor Interaction’ pathway is taken into account. The overlap between ‘Focal Adhesion’ and ‘Extracellular Matrix (ECM)-receptor Interaction’ is shown in a Venn diagram in Additional file 4.
When the genes are highly overlapping in two pathways it can be difficult for the Bayesian variable selection procedure to select one of them. Consider for example the following two simple pathways: A- > B- > C and B- > C- > D. These pathways overlap by genes B and C, and it would be hard to distinguish the more influential pathway if genes B and C were highly expressed. In the variable selection procedure, one of the pathways would be included in the model interchangeably. This is reflected in the posterior correlation for the latent variables. If the two pathways were interchangeably included in the model the correlation would be negative. We searched for examples of pathways with a high overlap and a negative posterior correlation. For both prior probabilities for the latent variable, the posterior correlations were centered on zero with the majority (95–98%) of the posterior samples being between −0.1 and 0.1.
When using a prior probability of 0.05 of including a pathway in the model, there were only two examples of low posterior correlation below −0.1 and a large overlap: ‘Tight Junction’ and ‘Leukocyte Transendothelial Migration’. The correlation was −0.15 and the relative overlap was 0.43. This was not an example of a pair of pathways that were hard to discriminate, because ‘Tight Junction’ had a high posterior probability (0.94) and ‘Leukocyte Transendothelial Migration’ had a low posterior probability (0.013).
When using a prior probability of including the pathways in the model of 0.4, the two pathways ‘Huntington’s Disease’ and ‘Parkinson’s Disease’ were noteworthy. These two pathways are both involved in neurodegenerative diseases and have a negative posterior correlation of −0.37. Their relative overlap was 0.81. Neither pathway had a much higher posterior than prior probability of being included in the model. However, both were highly ranked by the ANOVA method. This provides an example of two pathways that are hard to discriminate by the Bayesian variable selection method. On the other hand, it also illustrates a feature of the variable selection procedure that permits a deeper insight into biologically relevant expression patterns in the data.
30 top-ranked KEGG pathways using a prior probability of 0.05
Number of genes
Posterior probability of being included in the model
Variance per gene (10^-3)
Complement and coagulation cascades
RIG-I-like receptor signaling pathway
Cell adhesion molecules (CAMs)
SNARE interactions in vesicular transport
Ubiquitin mediated proteolysis
Neuroactive ligand-receptor interaction
PPAR signaling pathway
MAPK signaling pathway
Fc gamma R-mediated phagocytosis
Insulin signaling pathway
Notch signaling pathway
Cytokine-cytokine receptor interaction
Chemokine signaling pathway
Chronic myeloid leukemia
Pathways in cancer
Basal transcription factors
Circadian rhythm - mammal
As shown in Table 1, a potential limitation is that it was not possible to discriminate among the 24 most highly ranked gene sets in terms of significance using the posterior probability of being included in the model or the odds ratio between the prior and posterior probabilities. To distinguish them, it may be helpful to study the variance explained per gene by each of the sets. ‘ABC Transporters’ and ‘Lysosome’ were the two most highly ranked gene sets using this way of ranking. However, neither of these was identified in a hypergeometric gene set enrichment study on gene sets defined by KEGG pathways taking only the differentially expressed genes .
Our gene set approach uses a moderated t-statistic, the per-gene summary statistic, as the response variable. The summary statistics were computed for each of the 3130 genes linked to KEGG pathways, but were based on a limited number of observations—only eight animals in each treatment group in this study. We assessed the influence of a limited number of observations on the robustness of the high-ranking pathways by randomly selecting a subset of the animals, computing the moderated t-statistics, and running the variable selection procedure. We repeated this 120 times, eliminating the data from two animals at a time, and recorded the rankings of the pathways in each round. Among the 30 highest-ranking pathways in each round, 26 pathways appeared in all 120 runs. Although these results suggest that our approach is robust in cases where there is a limited number of observations, we suggest that this type of analysis should be performed for each data-set.
Bayesian variable selection can prioritize gene sets while also considering the overlaps among them. This can be performed for a large number of genes without overwhelmingly demanding computation. The selection method tends to select one or a few pathways among the highly overlapping pathways. It also makes it possible to determine which pairs of overlapping pathways are harder to prioritize. This can be achieved by studying the latent variables computed in the variable selection.
Our results show that the ANOVA approach can give misleading results by not considering the pathway overlaps. The Bayesian variable selection method gave similar results to the ANOVA method, but it was able to highlight one or a few among the highly overlapping genes.
Cases that would prove difficult for the Bayesian variable selection method include when there are very highly overlapping genes and the overlapping genes are differentially expressed. In such cases, we suggest that the posterior correlations among highly overlapping pathways should be examined to determine whether they are negative and have a high absolute value.
Akaike’s information criterion
Analysis of variance
Gene set enrichment analysis
Kyoto encyclopedia of genes and genomes
Stochastic search variable selection.
The work was funded by the Cutting Edge Genomics for Sustainable Animal Breeding (SABRE) research project (contract no. FOOD-CT-2006-016250) and Quantomics a collaborative project under the 7th Framework Programme (contract no. KBBE-2A-222664).Wewould like to acknowledge Rebecca S Devon for her revision of themanuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.