Integrative set enrichment testing for multiple omics platforms
- Laila M Poisson^{1},
- Jeremy M Taylor^{2} and
- Debashis Ghosh^{3}Email author
https://doi.org/10.1186/1471-2105-12-459
© Poisson et al; licensee BioMed Central Ltd. 2011
Received: 5 September 2011
Accepted: 25 November 2011
Published: 25 November 2011
Abstract
Background
Enrichment testing assesses the overall evidence of differential expression behavior of the elements within a defined set. When we have measured many molecular aspects, e.g. gene expression, metabolites, proteins, it is desirable to assess their differential tendencies jointly across platforms using an integrated set enrichment test. In this work we explore the properties of several methods for performing a combined enrichment test using gene expression and metabolomics as the motivating platforms.
Results
Using two simulation models we explored the properties of several enrichment methods including two novel methods: the logistic regression 2-degree of freedom Wald test and the 2-dimensional permutation p-value for the sum-of-squared statistics test. In relation to their univariate counterparts we find that the joint tests can improve our ability to detect results that are marginal univariately. We also find that joint tests improve the ranking of associated pathways compared to their univariate counterparts. However, there is a risk of Type I error inflation with some methods and self-contained methods lose specificity when the sets are not representative of underlying association.
Conclusions
In this work we show that consideration of data from multiple platforms, in conjunction with summarization via a priori pathway information, leads to increased power in detection of genomic associations with phenotypes.
Background
In biomedical studies we are often interested in comparing two groups of samples on a collection of measured variables that are possibly associated with group status. When the explanatory variables of interest are measurements from a high-throughput molecular assay, thousands of comparisons may be performed. The resulting list of differential genes from a gene expression array can be unwieldy with hundreds of entries. For metabolomics the number of molecules measured is reduced by at least an order of magnitude compared to gene expression assays [1, 2], but the list of differential molecules can still be lengthy with respect to the number of leads that can be feasibly followed. Given this, researchers are often interested in grouping these lists of differentially expressed molecules into sets with common functionality. The area of enrichment testing looks at an a priori defined set, such as from KEGG (Kyoto Encyclopedia of Genes and Genomes) or GO (Gene Ontology) [3], and asks if the number of differentially expressed elements in the set is remarkable; either more or less than expected.
Enrichment tests work by assessing the overall evidence of differential expression behavior of the elements (e.g. genes, metabolites) within the set. The pitfall with a smaller list, such as with metabolomic analysis, is that the sets of interest may not be well-represented for testing. For instance, 67% enrichment sounds impressive unless there are only 3 molecules measured. In this case it may not be statistically significant, and it is also not clear if it is biologically interesting. When we have measured many molecular aspects, e.g. gene expression, metabolites, proteins, it is reasonable to assess their differential tendencies jointly across platforms. Integration of omics technologies has been beneficial in other areas resulting in more interpretable results (e.g., [4]) and more meaningful associations [5] than when the platforms are assessed separately. In an effort to translate this success to the area of set enrichment we explore joint set enrichment tests that incorporate the multiple platforms into a single test.
We begin with an overview of enrichment testing theory and detail the methods of interest. Using simulation we explore the properties of these methods and make comparisons and recommendations. A metabolomics dataset [1] and matched gene expression dataset are used to apply the top methods to real data. For clarity, the following exposition is presented with respect to genes and metabolites, but the results should readily apply to any high-throughput molecular measures that (1) are expected to be related within the cell or tissue, (2) can be assessed for association between clinical groups, and (3) can be assigned into a priori sets. Further details, as well as the R code for the simulations presented here is available as additional material (Additional Files 1 and 2) so that the reader may explore specific scenarios of interest.
Enrichment Testing
General schema for a pathway enrichment test
Differential gene (D) | Non-differential gene (D') | Total | |
---|---|---|---|
In the set (S) | g _{ SD } | g _{ SD' } | g _{ S } |
Not in the set (S') | g _{ S'D } | g _{ S'D' } | g _{ S' } |
Total | g _{ D } | g _{ D' } | g |
Competitive Tests
For a set of genes, S, a competitive test assesses whether the amount of differential expression differs from that of its complement S'. The competitive null hypothesis, ${H}_{0}^{comp}$, then assumes that genes within the set S show the same amount of association with the phenotype as those in set S' [6]. In this way each gene set competes against its complementary set of measured genes.
A popular competitive set enrichment test is the Fisher's Exact test run on Table 1. Independence of the columns and rows is assessed and a statistically significant result that rejects the null hypothesis implies that the rate of differentially expressed genes is associated with set status. As it is a two-sided test, a detected association may be due to enrichment or depletion of differential genes.
The chief complaint against competitive enrichment tests is the relative enrichment estimation of a set S can differ depending upon the gene sets used for the reference set S' [8]. Though viewed as a limitation, critics concede that relative estimation is useful when there is a large number of genes that are differentially expressed.
Most competive tests rely on gene-resampling to generate the null distribution [6, 7]. Empirical estimates of the null hypothesis are generated by randomly sampling g_{ S }genes from S ∪ S' and repeating the test on these randomly generated sets. Arguments against gene-resampling methods are three-fold. First, lack of independence between genes is contrary to the assumption of exchangability in gene resampling. Second, the null distribtution is based on the selection of a new gene set not a new sample set. Finally, the sample size is dependent upon the number of genes, which is often substantially larger than the number of samples. Logistic regression was introduced by [9] as an alternative test to the Fisher's Exact test which does not require dichotomization of the genes by differential ability. The model proposed is logit(Pr(G_{ j }∈ S)) = γ_{0} +γx, for $x=-lo{g}_{10}\left({p}_{j}^{G}\right)$ where ${p}_{j}^{G}$ is the p-value from the per-gene test of differential expression for gene j. The test of ${H}_{0}^{LR}:\gamma =0$ can be obtained from standard statistical software using a 1-degree of freedom Wald test where rejection of ${H}_{0}^{LR}$ indicates enrichment or depletion of the set.
Self-contained Tests
In contrast to competitive tests, self-contained tests do not utilize S' in the assessment of S. Specifically, only the first row of Table 1 is considered. The self-contained null hypothesis, ${H}_{0}^{sc}$, assumes that the gene set S does not contain any genes whose expression levels are associated with the phenotype of interest [6]. A binomial test of proportions based on g_{ SD } ~ binomial(g_{ S }, α), where α is an expected rate of differential genes (e.g., α = 0.05) is an example of a self-contained test. Since only the set of interest S is considered a self contained test is not relative. Thus it reduces to a test of differential expression for a single gene and expands to a global test of differential expression when the entire array is the set of interest. Arguments against self-contained tests focus on the strong null hypothesis in relation to its biological interpretation wherein a single differentially expressed gene may be able to give enough evidence to reject the null hypothesis.
Self-contained tests primarily utilize subject-resampling methods to determine the null distribution of the test statistic [7]. Subject-resampling assumes that the subjects are independent and that under the null hypothesis the sample labels are randomly assigned. In contrast to gene-resampling, subject-resampling (1) follows the experimental design of most studies by assuming that the subjects are independent realizations of the study population, (2) retains the between-gene correlation structure, (3) is reflective of the number of subjects, and (4) provides a p-value that is generalizable to experiments with new subjects under study. However, subject-resampling tests are limited by their small sample size and the null hypothesis being tested by subject sampling may be difficult to state.
An example of a self-contained test that utilizes the strength of differential expression per element without dichotomization is the sum of squared test statistics. Begining with the per-element test statistics of differential ability, ${\underset{-}{T}}^{G}=\left({T}_{1}^{G},{T}_{2}^{G},\dots {T}_{g}^{G}\right)$, we simply sum all squared test statistics within set S, ${W}_{S}^{G}={\sum}_{j\in S}{\left({T}_{j}^{G}\right)}^{2}$ for j = 1,..., g genes [10]. Significance of ${W}_{S}^{G}$ is determined by generating the null distribution using permutation of sample labels to form null datasets.
Methods
Approach: Joint Assessment of Enrichment
In this work we are interested in tests of enrichment that incorporate both the gene expression and metabolite information. We begin with per-gene and per-metabolite tests of differential expression. We define two vectors of test statistics, specifically two-sample t-tests, ${\underset{-}{T}}^{G}=\left({T}_{1}^{G},{T}_{2}^{G},\dots ,{T}_{g}^{G}\right)$ and ${\underset{-}{T}}^{M}=\left({T}_{1}^{M},{T}_{2}^{M},\dots ,{T}_{m}^{M}\right)$, and two corresponding vectors of p-values, ${\underset{-}{P}}^{G}=\left({P}_{1}^{G},{P}_{2}^{G},\dots ,{P}_{g}^{G}\right)$ and ${\underset{-}{P}}^{M}=\left({P}_{1}^{M},{P}_{2}^{M},\dots ,{P}_{m}^{M}\right)$, for g genes and m metabolites, respectively.
Concatenation of Lists
Univariate tests can be employed directly as a means of joint assessment by concatenating the per-gene and per-metabolite test statistics to form a single vector of data, e.g. $\underset{-}{T}=\left({\underset{-}{T}}^{G},{\underset{-}{T}}^{M}\right)$. Care must be taken that the joined vectors are made comparable before concatenation. For instance, two-sample t-test statistics are not comparable if they are from different sample sizes as they have different degrees of freedom. P-values, being comparable by design, will resolve this problem, however, they lose directionality, which may be of interest, and empirical p-values may lack precision. Additionally, concatenation of the lists may lead to bias favoring the larger dataset [11].
P-value multiplication
To arrive at a single statistic from two tests we can use a p-value combining method such as Fisher's method [12]. Here we assume that $-2\times \left(lo{g}_{e}{P}_{S}^{M}+lo{g}_{e}{P}_{S}^{G}\right)~{\mathcal{X}}_{4}^{2}$ for the enrichment p-values for metabolites and genes in set S. The ${\mathcal{X}}^{2}$ distribution is not technically correct when the tests being summed are dependent, which is likely the case here. However, since this is a commonly used method we wished to incude it in the comparison.
Logistic regression analysis with 2-df Wald test
We propose a multivariate extension of the competitive logistic regression test of [9]. First the genes and metabolites are modelled separately using the absolute value of the per-element t-statistic as the measure of differential ability. Thus for set S we fit models (1) $logit\left(Pr\left({G}_{j}\in S\right)\right)={\gamma}_{0}+\gamma \left|{T}_{j}^{G}\right|$ and (2) $logit\left(Pr\left({M}_{k}\in S\right)\right)={\mu}_{0}+\mu \left|{T}_{k}^{M}\right|$. (We use the absolute t-statistic, instead of -log_{10}(p), because it is more stable in the bootstrap resampling described below.)
We next construct a joint test of ${H}_{02}^{LR}:\gamma =0,\mu =0$ using a two degree-of-freedom Wald test, EV^{-1}E^{ T }, where $E=\left[\widehat{\gamma},\widehat{\mu}\right]$, and V is an estimated variance-covariance matrix for γ and μ. Estimates of the variance of γ and μ are available from the univariate models but the covariance term is not easily obtained. We attempted to estimate this covariance measure through bootstrap simulation but we find that there is near-zero correlation between the parameters $\widehat{\gamma}$ and $\widehat{\mu}$ even in sets where the genes and metabolites were simulated to be correlated. The loss of correlation is due to row-resampling used for this bootstrap estimation, however, subject sampling severly underestimates the parameter variance compared to the variance estimates obtained from the univariate models. For more discussion of this bootstrap estimation, we refer the reader to Additional File 1.
Given this we estimate V as a diagonal matrix with ${\sigma}_{\gamma}^{2}=var\left(\widehat{\gamma}\right)$ and ${{\sigma}_{\mu}}^{2}=var\left(\widehat{\mu}\right)$ obtained from the univarite models. This reduces our test statistic to the sum of the two one-degree-of-freedom tests which can be written as ${U}_{S}^{LR}=E{V}^{-1}{E}^{T}={\widehat{\gamma}}^{2}{\sigma}_{\gamma}^{-2}+{\widehat{\mu}}^{2}{\sigma}_{\mu}^{-2}$, for set S, where $E=\left[\widehat{\gamma},\widehat{\mu}\right]$, and $V=diag\left({\sigma}_{\gamma}^{2},{\sigma}_{\mu}^{2}\right)$. We assume that ${U}_{S}^{LR}~{\mathcal{X}}_{2}^{2}$ under the null hypothesis ${H}_{02}^{LR}:\gamma =0,\mu =0$.
Sum of squared statistics with 2-dimensional permutation test
We propose a multivariate extension of the self-contained sum of squared statistics [10]. First, we obtain the enrichment test statistic for each of the metabolites and the genes in set S, $\left({W}_{S}^{G},{W}_{S}^{M}\right)$. Next, we assess this observed statistic pair against the distribution of null estimates pairs ${\left({\stackrel{\u0303}{W}}_{S}^{G},{\stackrel{\u0303}{W}}_{S}^{M}\right)}_{h},h=1,\dots ,H$. Marginally, ${P}_{S}^{G}={H}^{-1}{\sum}_{h=1}^{H}I\left({\left({\stackrel{\u0303}{W}}_{S}^{G}\right)}_{h}\ge {\widehat{W}}_{S}^{G}\right)$ for genes and ${P}_{S}^{M}={H}^{-1}{\sum}_{h=1}^{H}I\left({\left({\stackrel{\u0303}{W}}_{S}^{M}\right)}_{h}\ge {\widehat{W}}_{S}^{M}\right)$ for metabolites. We then calculate the Mahalanobis distance from the observed statistics $\left({\widehat{W}}_{S}^{G},{\widehat{W}}_{S}^{M}\right)$ to the centroid of the cloud of permutation statistic pairs.
Simulation Models
We use simulation to assess the properties of the joint enrichment tests described above. The two simulation models used are presented in the following. Further details and simulation code are available online as additional material (Additional Files 1 and 2).
Simulation I: Disjoint Set Simulation
In this simulation model we assume that the genes and metabolites can be separated into fifty equally-sized disjoint sets. That is each gene and metabolite is included in only one set. The correlation structure is the same for each set but no correlation is assumed between sets. Additionally, ten sets are simulated to have association with disease and the level of enrichment is consistent across these sets. This simple model with homogeneous sets allows us to explore specific hypotheses about the properties of the methods.
Define Y_{ ij }as the gene expression measurement for sample i and gene j. Likewise let Z_{ i'k }be the metabolite intensity measure for sample i' and metabolite k. The gene expression measures and metabolite measures need not arise from the same subjects, but it is possible for some or all samples to be matched by subject, i.e. i = i'. Define Q_{ i }and Q_{i'}as the case-control status for samples i and i', respectively, where they take values 1 for case and 0 for control.
Here d_{1}, d_{0}, c_{1}, and c_{0} are fixed values that can be set in the simulation. To simulate set enrichment we assign d_{1} >d_{0} and c_{1} >c_{0}. Interestingly, as d_{1} → d_{0} (or as c_{1} → c_{0}) the effect of being in the set diminishes under the competitive definition of enrichment. However, tests of the self-contained null hypothesis will not be affected provided that d_{1} and c_{1} are still sufficiently large.
This additive model allows for a non-zero global mean expression (intensity) level through α (θ). It assumes a mean expression (intensity) level per gene (metabolite) as defined by β_{ j }(φ_{ k }), which is modified for case samples by ω_{ j } (η_{ k }) according to the distributions D(I_{ S }g_{ jS })_{ j }(C(I_{ S }m_{ kS })_{ k }). We allow ${\rho}_{MG}=Corr\left({e}_{{Y}_{ij}},{e}_{{Z}_{{i}^{\prime}k}}\right)>0$ in order to simulate matched samples (i.e. i = i'). We also allow correlation between genes ${\rho}_{GG}=Corr\left({e}_{{Y}_{ij}},{e}_{{Y}_{i{j}^{\prime}}}\right)$ and between metabolites ${\rho}_{MM}=Corr\left({e}_{{Z}_{{i}^{\prime}k}},{e}_{{Z}_{{i}^{\prime}{k}^{\prime}}}\right)$. In this simulation model these correlations are limited to genes and metabolites within the same set, thereby reducing the complexity of the simulated data structure.
Y_{ i }and Z_{ i' }are drawn from a multivariate normal distribution, $({\underset{\xaf}{Y}}_{i}^{\prime},{\underset{\xaf}{Z}}_{i}^{\prime}~MVN((\beta ,\varphi ),{\underset{\xaf}{\underset{\xaf}{\sigma}}}_{YZ})$, where β_{ j }and φ_{ k }are drawn from normal distributions with zero mean and variances 4σ^{2} with ${\sigma}^{2}~{\mathcal{X}}_{4}^{-2}$ are drawn for each of g genes and m metabolites. The covariance matrix ${\underset{\xaf}{\underset{\xaf}{\sigma}}}_{YZ}$ uses the element-wise variances used for β_{ j }and φ_{ k }for the diagonal variance entires and constant covariances defined to retain the desired between-element correlations for the off-diagonal entries. The modifiers are drawn from ω_{ j }~ Unif([-2.5, -0.5] ∪ [0.5, 2.5]) and η_{ k }~ Unif([-1.5, -0.5] U [0.5, 1.5]) and added to the multivariate normal results according to the g_{ jS }and m_{ kS }indicators. The smaller range on η_{ k }creates weaker intensity differences for the metabolites; this reflects our real experience with this platform and provides another dimension for comparison.
For this simulation, we assume that the number of samples is the same for cases and controls, with N_{ sample }∈ (30,100). We allow the correlations to vary: ρ_{ YY }= ρ_{ ZZ }∈ (0.2, 0.6) and ρ_{ YZ }∈ (0.10, 0.25) where ρ_{ GG }= ρ_{ MM }>ρ_{ MG }. We consider gene sets with 20 measurements, i.e. ${N}_{{G}_{S}}=20,\phantom{\rule{2.77695pt}{0ex}}{N}_{{M}_{S}}=4$, and metabolite sets with ${N}_{{M}_{S}}\in \left(4,20\right)$. The enrichment levels (d_{1}, d_{0}) and (c_{1}, c_{0}) are allowed to vary with (d_{1}, d_{0}) = (c_{1}, c_{0}) ∈ [(0.5,0), (0.25,0), (0.10, 0), (0.25, 0.05), (0.05,0.05), (0.10,0.10), (0, 0)] with the last three pairs representing null models for the competitive tests, the last being null for the self-contained test.
Simulation II: Heterogeneous Set Simulation
This simulation model generates the same number of genes and metabolites in total and per-set as for the disjoint simulation above. The simulation model of Equation 1 is used as the basis of the data generation. However, m_{ kS }, g_{ jS }, D_{ j }, and C_{ k }are fixed to construct clusters of genes and metabolites with varying levels of association with disease. We also allow the sets to overlap and to be non-homogeneous in correlation structure and enrichment. This style of simulation was used by [10] in their review of various single-platform enrichment tests. Here we can assess how well the methods are able to detect various set types in a non-homogeneous setting. Null sets are also included providing a reference for competitive tests and to allow estimation of false discovery rates.
Structure of the multivariate distributions in the disjoint pathway simulation
Distribution (h) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
% Diff. Genes | 0 | 100 | 100 | 100 | 100 | 100 | 100 | 0 | 0 | 0 |
% Diff. Metab. | 0 | 100 | 100 | 100 | 0 | 0 | 0 | 100 | 100 | 100 |
ρ_{ YY }, ρ_{ ZZ } | 0 | 0 | r _{1} | r _{1} | 0 | r _{1} | r _{1} | 0 | r _{1} | r _{1} |
ρ _{ YZ } | 0 | 0 | 0 | r _{2} | 0 | 0 | r _{2} | 0 | 0 | r _{2} |
Set definitions for the disjoint pathway simulation
π | h | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
0.5 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | |
0.25 | 19 | 20 | 21 | 22 | 23 | 24 | - | - | - |
Using the simulation models above we explored the behavior the Fisher's exact test, logistic regression, and sum of squared statistics test when applied jointly to two data sets. We combined the data via concatenation, the Fisher's method of combining p-values, and the multivariate extensions of logistic regression and the sum of squared statistics.
Results and Discussion
Heterogeneous set simulation results
Let us first consider some results from the heterogeneous set simulations which provide an overview of the behavior of the methods. For each simulation scenario, 100 datasets were generated and tested. In Figure 2 we depict the frequency with each set, from 1 - 29, was determined to be significant at α = 0.05 for each test considered. The average rate of false positives is computed across the 41 null sets per test.
When we turn our attention to the multivariate methods (red diamonds) we see a similar pattern to the results from p-values combined by Fisher's method (black stars). For the competitive tests these methods tend to follow the gene expression data results. There is some improvement in the metabolite only sets (i.e., sets 16-18). For the logistic regression test they also show a moderate effect, between that of the gene and metabolite only tests for sets 19-24 (panel B). All methods perform maximally in the self-contained sum-of-squared statistics test and increased power is provided to sets 19-21 (panel C).
The sum-of-squared statistics test (panel C) continues to perform maximally for all tests. However, these self-contained tests identify all of the random sets (25-29) as enriched. Given that 12% of the genes and metabolites are simulated to be differential in the full dataset these five sets will have 12% enrichment on average. These sets are not detected by the competitive tests (panels A and B) because of their use of relative estimation.
To ensure that our power gains are real we must also look at the error rate of these tests. We consider the Type I error rates across the 41 null sets in 100 simulations. In both scenarios, given in Figures 2 and 3, there are high error rates when the data sets are concatenated and the Fisher's exact test is used (0.297 and 0.408, respectively). Likewise this occurs for the logistic regression model (0.0527 and 0.285, respectively). This error comes from a high rate of depletion calls, that is identification of sets that have fewer significant genes than expected. This problem is greatest when the set size increases. When there are 40 elements per set in the concatenated list, zero differential elements is significantly smaller than the 12% expected by random selection.
The p-values combined by Fisher's method also show inflated error rates for the competitive tests when ${N}_{{M}_{s}}=20$; 0.132, 0.0771. This again is a symptom of detecting depleted sets. Recall that in these competitive tests the sample size for the test is based on the number of elements. The larger set size may offer stronger depletion results that are then amplified by the joining of the two tests.
We do not observe error inflation in the sum-of-squares statistic methods. Firstly, the p-value is calculated for a one-sided test. Thus, as currently defined, the sum of squares test cannot detect depletion. Second, the p-values are determined by permutation so there is a limit on the level of precision for the p-values which thus limits the precision of the p-value as combined by summation in the Fisher's method.
Disjoint set simulation results
Now that we have the general pattern of operating characteristics for each of these methods let us explore some specific hypotheses using the disjoint simulations. Recall that in these simulations we generate data for 50 disjoint sets, of which 10 are designed to be enriched. The correlation structure is homogeneous in that each set has the same structure. However, there is no correlation simulated between sets.
Here we use a different metric to assess the results of the methods. Specifically, we ask, if we were to choose the top ten sets by ranking p-values, would we select the 10 associated sets? Instead of looking at frequencies of being in the top 10 we consider the sum of the ranks for the 10 associated sets. When the 10 associated sets form the top 10 sets selected the sum of the ranks is $R={\sum}_{x=1}^{10}x=55$. When there is no association between the set and disease then the 10 sets of interest should have a sum of the ranks with range (55,455) and E(R) = 255. Boxplots are availble in Additional File 1 for graphical representation of the results presented here.
Under the null model of no enrichment, that is d_{1} = d_{0} = c_{1} = c_{0} = 0, the rank sum of the associated sets fall nicely around E(R) = 255. Under the null model of uniform enrichment, that is d_{1} = d_{0} = c_{1} = c_{0} = δ we also see that the rank sum of the associated sets matches E(R) = 255 when δ = 0.05 and when δ = 0.10. We next assume that on average 25% of the elements in the associated sets are differential, that is d_{1} = c_{1} = 0.25 and d_{0} = c_{0} = 0. This results in an overall 5% rate of differential elements within the datasets. The sum-of-squares statistic R achieves nearly perfect rank sums for all tests when ${N}_{{M}_{s}}=20$ and only falter for the metabolite only tests when ${N}_{{M}_{s}}=4$ (mean ± sd: 126.5 ± 38.3). In the competitive tests there is improvement in R when any of the joint tests are used compared to the univariate gene or metabolite tests. For example, the 2-degree of freedom test has mean score 85.3 (±24.8 sd, where ${N}_{{M}_{s}}=4$) compared to the gene-only or metabolite-only logistic tests with means 101.1 ± 25.6 and 166.7 ± 38.9, respectively. The correlation in this scenario was 0.20 within the sets. If we increase the correlation to 0.60 we find that R increases in the competitive tests, specifically the logistic tests become 119.3 ± 25.6 for the joint test, 138.3 ± 28.8 for genes only, and 185.8 ± 42.0 for metabolites only (${N}_{{M}_{s}}=4$). This loss of power is possibly due to loss of information attributed to the dependent measurements. Thankfully such high correlations are not likely to be present in real applications (e.g., [14]). Considering a scenario in which few elements are differential, we reduce the parameters d_{1} = c_{1} = 0.1 with d_{0} = c_{0} = 0. Since d_{1} and c_{1} are probabilities we expect that on average 10% of the elements of the associated sets are differential and we focus only on scenarios where ${N}_{{M}_{s}}=20$. The overall enrichment is 2% on average so the competitive tests still perform better than if the sets were randomly assigned. It may be the case that as few as one, or none of the elements are simulated to be differential. Due to these low counts we see an increase in R for the sum-of-squared statistics (e.g. mean ± sd: 66.3 ± 16.3 for the 2-df test).
Finally, we consider a scenario with noise in the null sets, that is d_{1} = c_{1} = 0.25 and d_{0} = c_{0} = 0.05. It is in this scenario that the sum-of-squared statistic begins to falter (e.g. mean ± sd: 159.8 ± 15.8 for the 2-df test). In fact we see that, beyond an increase in R, under this scenario the joint enrichment test performs more poorly than the univariate tests of the gene or metabolites alone (140.2 ± 16.8 and 114.0 ± 18.7, respectively). It is not surprising that this self-contained test performs poorly as this non-specific behavior is a criticism of self-contained method. It is curious, however, that the joint methods appear to fare worse in this situation.
Application to prostate metabolomics and transcriptomic data
As a final application of the above methods, we utilize the metabolomic data of Sreekumar et al. (2009) [1] and the gene expression data from the same samples (GEO:GSE8511). We consider the comparison between localized tumor (n = 12) and benign tissues (n = 16) and use the Kyoto Encyclopedia of Genes and Genomes (KEGG, version 50, April 2009) to determine the set mapping. Of 518 well measured metabolites, 147 metabolites that can be mapped to the KEGG sets. Of the >40,000 gene probes measured on the Agilent Whole Human Genome microarray, 2169 genes that can be mapped to KEGG. There are 98 KEGG sets in which at least one gene and one metabolite are measured.
The Fisher's exact test methods behaved similarly to the logistic regression tests shown in Figure 4. The sum-of-squared statistics tests were overly liberal identifying over 90% of the pathways as enriched. This implies that there was a high rate of differential elements throughout the datasets. Such background noise makes a competitive test the preferred choice of enrichment test. Additionally this may suggest that the KEGG pathway maps, as applied, may not accurately capture the co-regulation in the data.
Conclusions
In this work we have considered set enrichment testing methods for the joint analysis of transcriptomic and metabolomic datasets. We consider several procedures including two novel methods: the logistic regression 2-degree of freedom Wald test and the 2-dimensional permutation p-value for the sum-of-squared statistics test. Through a novel simulation model and design we explored the properties of these tests in relation to their univariate counterparts. We find that the joint tests can improve our ability to detect results that are marginal univariately, as evidenced in Figures 2 and 3. We also find that joint tests improve the ranking of associated pathways compared to their univariate counterparts.
The various joint methods performed similarly for most simulations. The concatenation of datasets and the Fisher's method of combining p-values had inflated error in the competitive test. For the logistic regression test, the 2-df wald test currently peforms similarly to the Fisher's method for combining the two p-values, due to the assumption that ρ_{ γμ }= 0. Non-zero correlation would have provided a weighted sum in the 2-df test. Though we were not estimating ρ_{ γμ }to be non-zero, the slightly inflated error rate of the Fisher's method test, suggests that the independence assumption may not always hold and dependent methods should be considered [15]. In future work we will continue to explore if and when correlation may be a contributing factor or if there are other methods for combining the tests in a weighted fashion either at the level of the test statistic or p-value. One of the more attractive features of the 2-df Wald test and 2-dimensional permutation test is that they can easily be extended to n-dimensions. This will allow for the incorporation of multiple omics platforms such as proteomics, genomics, or gene copy number. The data concatenation and sum of p-values methods can also be extended, but this may compound their potential error.
Declarations
Acknowledgements
DG was supported in part by R01-GM72007 and by the Huck Institute for Life Sciences.
Authors’ Affiliations
References
- Sreekumar A, Poisson L, Rajendiran T, Khan A, Cao Q, Yu J, Laxman B, Mehra R, Lonigro R, Li Y, Nyati M, Ahsan A, Kalyana-Sundaram S, Han B, Cao X, Byun J, Omenn G, Ghosh D, Pennathur S, Alexander D, Berger A, Shuster J, Wei J, Varambally S, Beecher C, Chinnaiyan A: Metabolomic profiles delineate potential role for sarcosine in prostate cancer progression. Nature 2009, 457(7231):910–914. 10.1038/nature07762PubMed CentralView ArticlePubMedGoogle Scholar
- Weckwerth W: Integration of metabolomics and proteomics in molecular plant physiology and coping with the complexity by data-dimensionality reduction. Physiologia Plantarum 2008, 132: 176–189. 10.1111/j.1399-3054.2007.01011.xView ArticlePubMedGoogle Scholar
- Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 2000, 28: 27–30. 10.1093/nar/28.1.27PubMed CentralView ArticlePubMedGoogle Scholar
- Network TCGAR: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 255: 1061–1068.Google Scholar
- Jones S, Zhang X, Parsons DW, Lin JC, Leary RJ, Angenendt P, Mankoo P, Carter H, Kamiyama H, Jimeno A, Hong SM, Fu B, Lin MT, Calhoun ES, Kamiyama M, Walter K, Nikolskaya T, Nikolsky Y, Hartigan J, Smith DR, Hidalgo M, Leach SD, Klein AP, Jaffee EM, Goggins M, Maitra A, Iacobuzio-Donahue C, Eshleman JR, Kern SE, Hruban RH, Karchin R, Papadopoulos N, Parmigiani G, Vogelstein B, Velculescu VE, Kinzler KW: Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science 2008, 321: 1801–1806. 10.1126/science.1164368PubMed CentralView ArticlePubMedGoogle Scholar
- Goeman J, Bühlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007, 23(8):980–987. 10.1093/bioinformatics/btm051View ArticlePubMedGoogle Scholar
- Nam D, Kim SY: Gene-set approach for expression pattern analysis. Briefings in Bioinformatics 2008, 9: 189–197. 10.1093/bib/bbn001View ArticlePubMedGoogle Scholar
- Allison DB, Cui X, Page GP, Sabripour M: Gene-set approach for expression pattern analysis. Nature Reviews Genetics 2006, 7: 55–65. 10.1038/nrg1749View ArticlePubMedGoogle Scholar
- Sartor MA, Leikauf GD, Medvedovic M: LRpath: a logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics 2009, 25: 211–217. 10.1093/bioinformatics/btn592PubMed CentralView ArticlePubMedGoogle Scholar
- Ackermann M, Strimmer K: A general modular framework for gene set enrichment analysis. BMC Bioinformatics 2009, 10: 47. 10.1186/1471-2105-10-47PubMed CentralView ArticlePubMedGoogle Scholar
- Spicker JS, Brunak S, Frederiksen KS, Toft H: Integration of clinical chemistry, expression, and metabolite data leads to better toxicological class separation. Toxicol Sci 2008, 102: 444–454.View ArticlePubMedGoogle Scholar
- Fisher RA: Questions and answers #14. The American Statistician 1948, 2: 30–31. 10.2307/2681650Google Scholar
- De Maesschalck R, Jouan-Rimbaud D, Massart DL: The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems 2000, 50: 1–18. 10.1016/S0169-7439(99)00047-7View ArticleGoogle Scholar
- Carrari F, Baxter C, Usadel B, Urbanczyk-Wochniak E, Zanor MI, Nunes-Nesi A, Nikiforova V, Centero D, Ratzka A, Pauly M, Sweetlove LJ, Fernie AR: Integrated analysis of metabolite and transcript levels reveals the metabolic shifts that underlie tomato fruit development and highlight regulatory aspects of metabolic network behavior. Plant Physiol 2006, 142: 1380–1396. 10.1104/pp.106.088534PubMed CentralView ArticlePubMedGoogle Scholar
- Kost J, McDermott M: Combining dependent P-values. Statistics and Probability Letters 2002, 60: 183–190. 10.1016/S0167-7152(02)00310-3View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.