Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models
- Dawei Liu^{1}Email author,
- Debashis Ghosh^{2} and
- Xihong Lin^{3}
https://doi.org/10.1186/1471-2105-9-292
© Liu et al; licensee BioMed Central Ltd. 2008
Received: 19 February 2008
Accepted: 24 June 2008
Published: 24 June 2008
Abstract
Background
Growing interest on biological pathways has called for new statistical methods for modeling and testing a genetic pathway effect on a health outcome. The fact that genes within a pathway tend to interact with each other and relate to the outcome in a complicated way makes nonparametric methods more desirable. The kernel machine method provides a convenient, powerful and unified method for multi-dimensional parametric and nonparametric modeling of the pathway effect.
Results
In this paper we propose a logistic kernel machine regression model for binary outcomes. This model relates the disease risk to covariates parametrically, and to genes within a genetic pathway parametrically or nonparametrically using kernel machines. The nonparametric genetic pathway effect allows for possible interactions among the genes within the same pathway and a complicated relationship of the genetic pathway and the outcome. We show that kernel machine estimation of the model components can be formulated using a logistic mixed model. Estimation hence can proceed within a mixed model framework using standard statistical software. A score test based on a Gaussian process approximation is developed to test for the genetic pathway effect. The methods are illustrated using a prostate cancer data set and evaluated using simulations. An extension to continuous and discrete outcomes using generalized kernel machine models and its connection with generalized linear mixed models is discussed.
Conclusion
Logistic kernel machine regression and its extension generalized kernel machine regression provide a novel and flexible statistical tool for modeling pathway effects on discrete and continuous outcomes. Their close connection to mixed models and attractive performance make them have promising wide applications in bioinformatics and other biomedical areas.
Background
The rapid progress in gene expression array technology in the past decade has greatly facilitated our understanding of the genetic aspect of various diseases. Knowledge-based approaches, such as gene set or pathway analysis, have become increasingly popular. In such gene sets/pathways, groups of genes act in concert to accomplish tasks related to a cellular process and the resulting genetic pathway effects may manifest themselves through phenotypic changes, such as occurrence of disease. Thus it is potentially more meaningful to study the overall effect of a group of genes rather than a single gene, as single-gene analysis may miss important effects on pathways and difficult to reproduce from studies to studies [1]. Researchers have made significant progress in identifying metabolic or signaling pathways based on expression array data [2, 3]. Meanwhile, new tools for identification of pathways, such as GenMAPP [4], Pathway Processor [5], MAPPFinder [6], have made pathway data more widely available. However, It is a challenging task to model the pathway data and test for a potentially complex pathway effect on a disease outcome.
One way to model pathway data is through the linear model approach, where the pathway effect is represented by a linear combination of individual gene effects. This approach has several limitations. Activities of genes within a pathway are often complicated, thus a linear model is often insufficient to capture the relationship between these genes. Furthermore, genes within a pathway tend to interact with each other. Such interactions are not taken into account by the linear model approach.
In this paper we propose a nonparametric approach, the kernel machine regression, to model a pathway effect. The kernel machine method, with the support vector machine (SVM) as a most popular example, has emerged in the last decade as a powerful machine learning technique in high-dimensional settings [7, 8]. This method provides a flexible way to model linear and nonlinear effects of variables and gene-gene interactions, unifies the model building procedure in both one- and multi-dimensional settings, and shows attractive performance compared to other nonparametric methods such as splines.
Liu et al. [9] proposed a kernel machine-based regression model for continuous outcomes. In this paper, we propose a logistic kernel machine regression model for binary outcomes, where covariate effects are modeled parametrically and the genetic pathway effect is modeled parametrically or nonparametrically using the kernel machine method. A main contribution of this paper is to establish a connection between logistic kernel machine regression and the logistic mixed model. We show that the kernel machine estimator of the genetic pathway effect can be obtained from the estimator of the random effects in the corresponding logistic mixed model. This connection provides a convenient vehicle to connect the powerful kernel machine method with the popular mixed model method in the statistical literature. This mixed model connection also provides an unified framework for statistical inference for model parameters, including the regression coefficients, the nonparametric genetic pathway function, and the regularization and kernel parameters. Based on the proposed logistic kernel machine regression model, we develop a new test for the nonlinear pathway effect on disease risk. An appealing feature of the proposed test is that it performs well without the need to correctly specify the functional form of the effects of each gene or their interactions. This feature has a significant practical implication when analyzing genetic pathway data, as the true relationship between the pathway and the disease outcome is often unknown. We extend the results to generalized kernel machine regression for a class of continuous and discrete outcomes and discuss its connection with generalized linear mixed models [10].
Recently, Wei and Li [11] proposed a nonparametric pathway-based regression (NPR) to model pathway data. NPR is a pathway-based gradient boosting procedure, where the base learner is usually a regression or classification tree. It provides a flexible approach in modeling pathways and interactions among genes within a pathway. Michalowski et al. [12] proposed a Bayesian Belief Network approach for pathway data. Neither method is likelihood-based. Thus parameter estimation and inference cannot be casted within a unified likelihood framework. It is hence difficult to estimate and quantify the overall pathway effect on disease risk and assess its statistical uncertainty. Secondly, a primary interest in this paper is to test for the statistical significance of the overall pathway effect on the risk of a disease. Both NPR and Bayesian belief network do not provide such a statistical test for the pathway effect. For example, NPR uses an importance score to rank the relative importance of each pathway. It lacks formal inferential procedure for assessing the statistical significance of a pathway. Further, when considering a single pathway, the importance score loses its meaning in assessing the importance of a pathway. Our method, on the other hand, is based on penalized likelihood, and estimation and inference can be conducted in a systematic manner within the likelihood framework. We also propose a formal statistical test for the significance of a pathway effect on the risk of a disease.
Goeman et al. [13] proposed a linear mixed model to relate the pathway effect with a continuous outcome. They modeled the pathway effect using a linear function with each gene entering into the model as a regressor. They assumed the regression coefficients of the gene as random from a common distribution with mean 0 and an unknown variance. The pathway effect can then be tested through a variance component test for random effects. Our approach is different from theirs in the following aspects. First, we model the pathway effect by allowing for a nonparametric model rather than a parametric one. As we commented earlier, the highly complicated nature of activities of genes within a pathway makes the linear model assumption untenable. Secondly, the kernel function used in kernel machine regression usually contains unknown tuning parameters. The parameter is present under the alternative hypothesis but disappears under null hypothesis. This makes tests as proposed in [13, 14] not applicable. Our proposed test, on the other hand, works quite well under this scenario. Third, Goeman et al. [14] extended their linear model results to discrete outcomes using basis functions. A key advantage of the kernel machine approach over this basis approach for modeling multi-gene effects is that one does not need to specify bases explicitly, which is often difficult for high-dimensional data especially when interactions are modeled.
Results
Analysis of prostate cancer data
In this section, we apply the proposed logistic kernel machine regression model (3) as described in the Methods section to the analysis of a prostate cancer data set. The data came from the Michigan prostate cancer study [15]. This study involved 81 patients with 22 diagnosed as non-cancerous and 59 diagnosed with local or advanced prostate cancer. Besides the clinical and demographic covariates such as age, cDNA microarray gene expressions were also available for each patient. The early results of Dhanasekaran et al. [15] indicate that certain functional genetic pathways seemed dys-regulated in prostate cancer relative to non-cancerous tissues. We are interested in studying how a genetic pathway is related to the prostate cancer risk, controlling for the covariates. We focus in this analysis on the cell growth pathway, which contains 5 genes. The pathway we describe was annotated by the investigator (A. Chinnaiyan) and is simply used to illustrate the methodology. Of course, one could take the pathways stored in commercial databases such as Ingenuity Pathway Analysis (IPA) and use the proposed methodology based on those gene sets.
The outcome was the binary prostate cancer status and the covariate includes age. Since the functional relationship between the cell growth pathway and the prostate cancer risk is unknown, the kernel machine method provides a convenient and flexible framework for the evaluation of the pathway effect on the prostate cancer risk. Specifically, we consider the following semiparametric logistic modellogit(P(y = 1)) = β_{0} + β_{1}age + h(gene_{1}, ..., gene_{5}),
Analysis of prostate cancer data.
Covariate | Estimate | S.E. | P-value |
---|---|---|---|
Intercept | 0.9893 | 2.7552 | 0.7205 |
Age | -0.0140 | 0.0425 | 0.7430 |
τ | 4.7362 | 3.6190 | |
ρ | 1.9093 | 0.6603 | |
Score test for the genetic pathway effect H_{0}: h(z) = 0: | |||
Test | P-value | ||
KM | < 0.0001 | ||
GT | 0.0661 |
The test for the cell growth pathway effect on the prostate cancer status H_{0}: h(z) = 0 vs H_{1}:h(z) ≠ 0 was conducted using the proposed score test as described in the Methods section. For the purpose of comparison, we also conducted the global test proposed by Goeman et al. [13] that assumed a linear pathway effect. Note that our test allows a nonlinear pathway effect and gene-gene interactions. Table 1 gives the p-values for both tests. The p-value of our test suggests that cell growth pathway has a highly significant effect on the disease status, while the test from Goeman et al. [13] indicates only marginal significance of the growth pathway effect.
Simulation Study for the Parameter Estimates
We conducted a simulation study to evaluate the performance of the parameter estimates of the proposed logistic kernel machine regression by using the logistic mixed model formulation. We considered the following modellogit(P(y_{ i }= 1)) = x_{ i }+ h(z_{i 1}, ⋯, z_{ ip }),
where the true regression coefficient β = 1. We consider p = 5 and set h(z_{1}, ..., z_{5}) = 2{sin(z_{1}) – ${z}_{2}^{2}$ + z_{1} exp(-z_{3}) -sin(z_{2}) cos(z_{3}) + ${z}_{4}^{2}$ + sin(z_{4}) cos(z_{1}) + ${z}_{5}^{2}$ + z_{3}z_{5}}. To allow x_{ i }and (z_{i 1}, ⋯, z_{ ip }) to be correlated, x_{ i }was generated as x_{ i }= sin(z_{i 1}) + 2u_{ i }, where u_{ i }and z_{ ij }(j = 1, ⋯, p) follow independent Uniform(-0.5, 0.5). The Gaussian kernel was used throughout the simulation. All simulations ran 300 times. Settings 1, 2, and 3 correspond to sample size n = 100, 200, and 300, respectively.
Simulation results on estimation.
Model Parameter Estimates | Reg of h on $\widehat{h}$ | |||||||
---|---|---|---|---|---|---|---|---|
setting | true # z | used # z | n | β | ρ | Intercept | Slope | R ^{2} |
1 | 5 | 5 | 100 | 1.10 | 71.50^{ a }(estimated) | -0.06 | 1.06 | 0.82 |
1.14 | 1.00 (fixed) | -0.28 | 1.48 | 0.79 | ||||
1.08 | 20.00 (fixed) | -0.08 | 1.15 | 0.84 | ||||
2 | 5 | 5 | 200 | 0.99 | 90.03 (estimated) | 0.01 | 1.04 | 0.87 |
1.05 | 1.00 (fixed) | -0.01 | 1.13 | 0.84 | ||||
0.96 | 20.00 (fixed) | -0.00 | 1.07 | 0.87 | ||||
3 | 5 | 5 | 300 | 0.98 | 111.76 (estimated) | -0.01 | 1.04 | 0.90 |
1.03 | 1.00 (fixed) | -0.02 | 1.10 | 0.87 | ||||
0.97 | 20.00 (fixed) | -0.01 | 1.06 | 0.90 |
Simulation results on standard errors.
Standard Errors of $\widehat{\beta}$ | ||||||
---|---|---|---|---|---|---|
true | used | Empirical | Model-based | |||
setting | # z | # z | n | SE | SE | ρ |
1 | 5 | 5 | 100 | 0.49 | 0.48 | 71.50 (estimated) |
0.45 | 0.47 | 1.00 (fixed) | ||||
0.48 | 0.47 | 20.00 (fixed) | ||||
2 | 5 | 5 | 200 | 0.32 | 0.32 | 90.03 (estimated) |
0.32 | 0.32 | 1.00 (fixed) | ||||
0.33 | 0.32 | 20.00 (fixed) | ||||
3 | 5 | 5 | 300 | 0.26 | 0.26 | 111.76 (estimated) |
0.25 | 0.26 | 1.00 (fixed) | ||||
0.26 | 0.26 | 20.00 (fixed) |
Simulation Study of the Score Test for the Pathway Effect
We next conducted a simulation study to evaluate the performance of the proposed variance component score test for the pathway effect H_{0}: h(·) = 0 vs H_{1}: h(·) ≠ 0. In order to compare the performance of our test with the linearity-based global test proposed by Goeman et al. [13], both tests were applied to each simulated data set. Nonlinear and linear functions of h(z) were both considered. For the nonlinear pathway effect, the true model is logit(y) = x +ah(z), where h(z) = 2(z_{1} - z_{2})^{2} + z_{2}z_{3} + 3 sin(2z_{3})z_{4} + ${z}_{5}^{2}$ + 2 cos(z_{4})z_{5}. For the linear pathway effect, the true model is logit(y) = x + ah(z), where h(z) = 2z_{1} + 3z_{2}+z_{3} + 2z_{4}+z_{5}. All z's were generated from the standard normal distribution, and a = 0, 0.2, 0.4, 0.6, 0.8. To allow x and (z_{i 1}, ⋯, z_{ ip }) to be correlated, x was generated as x = z_{1}+e/2 with e being independent of z_{1} and following N (0, 1). We studied the size of the test by generating data under a = 0, and studied the power by increasing a. The sample size was 100. For the size calculations, the number of simulations was 2000; whereas for the power calculations, the number of runs was 1000. Based on the discussions in Section "Test for the genetic Pathway Effect", the bound of ρ is set up by interval $[{\mathrm{min}}_{i\ne j}{\displaystyle {\sum}_{l=1}^{5}{({z}_{il}-{z}_{jl})}^{2}/5,10{\mathrm{max}}_{i\ne j}}{\displaystyle {\sum}_{l=1}^{5}{({z}_{il}-{z}_{jl})}^{2}}]$, and the interval is divided by 500 equally spaced grid points. All simulations were conducted using R 2.5.0, and the package "globaltest" v4.6.0 was used for the test proposed by Goeman et al. [13] as a comparison.
Simulation results on score test.
Size | Power | ||||
---|---|---|---|---|---|
h(z) | Method | a = 0 | a = 0.2 | a = 0.4 | a = 0.8 |
Nonlinear | KM | 0.054 | 0.142 | 0.896 | 1.000 |
GT | 0.068 | 0.098 | 0.110 | 0.156 | |
Linear | KM | 0.055 | 0.265 | 0.896 | 1.000 |
GT | 0.065 | 0.302 | 0.900 | 1.000 |
Conclusions and Discussion
In this paper, we developed a logistic kernel machine regression model for binary outcomes, where the covariate effects are modeled parametrically and the genetic pathway effect is modeled nonparametrically using the kernel machine method. This method provides an attractive way to model the pathway effect, without the need to make strong parametric assumptions on individual gene effects or their interactions. Our model also allows for parametric pathway effects if a parametric kernel, such as the first-degree polynomial kernel, is used.
A key result of this paper is that we have established a close connection between the generalized kernel machine regression and generalized linear mixed models, and show that the kernel machine estimators of regression coefficients and the nonparametric multi-dimensional pathway effect can be easily obtained from the corresponding generalized linear mixed models using PQL. The mixed model connection provides a unified framework for estimation and inference and can be easily implemented in existing software, such as SAS PROC GLIMMIX or R GLMMPQL. The mixed model connection also makes it possible to test for the overall pathway effect through the proposed variance component test. A key advantage of the proposed score test for the pathway effect is that it does not require an explicit functional specification of individual gene effects and gene-gene interactions. This feature is of practical significance as the pathway effect is often complex. Our simulation study shows the proposed test performs well for moderate sample size. It has similar power to the linearity-based pathway test of Goeman et al. [13] when the true effect is linear, but much higher power when the true effect is nonlinear.
We have considered in this paper a single pathway. One could generalize the proposed semiparametric model to incorporate multiple pathways by fitting an additive model:logit(P(y = 1)) = x^{ T }β+ h_{1}(z_{1}) + ⋯ + h_{ m }(z_{ m }),
where z_{ j }(j = 1, ⋯, m) denotes a p_{ j }× 1 vector of genes in the j th pathway and hj(·) denotes the nonparametric function associated with the j th genetic pathway.
Machine learning is a powerful tool in advancing bioinformatics research. Our effort helps to build a bridge between kernel machine methods and traditional statistical models. This connection will undoubtedly provide a new and convenient tool for the bioinformatics community and opens a door for future research.
Methods
The Logistic Kernel Machine Model
where μ_{ i }= P (y_{ i }= 1| x_{ i }, z_{ i }), β is a q × 1 vector of regression coefficients, and h(z_{ i }) is an unknown centered smooth function.
In model (3), covariate effects are modeled parametrically, while the multi-dimensional genetic pathway effect is modeled parametrically or nonparametrically. A nonparametric specification for h(·) reflects our limited knowledge of genetic functional forms. Note that h(·) = 0 means that genes in the pathway have no association with the disease risk. If h(z) = γ_{1}z_{1} + ⋯ + γ_{ p }z_{ p }, model (3) becomes the linear model considered by Goeman et al. [13].
In nonparametric modeling, such as smoothing splines, the unknown function is usually assumed to lie in a certain function space. For the kernel machine method, this function space, denoted by ${\mathcal{H}}_{K}$, is generated by a given positive definite kernel function K(·, ·). The mathematical properties of ${\mathcal{H}}_{K}$ imply that any unknown function h(z) in ${\mathcal{H}}_{K}$ can be written as a linear combination of the given kernel function K(·, ·) evaluated at each sample point. Two popular kernel functions are the d th polynomial kernel $K({z}_{1},{z}_{2})={({z}_{1}^{T}{z}_{2}+\rho )}^{d}$ and the Gaussian Kernel K(z_{1}, z_{2}) = exp{-|| z_{1} – z_{2}||^{2}/ρ^{2}}, where $\left|\right|{z}_{1}-{z}_{2}|{|}^{2}={\displaystyle {\sum}_{k=1}^{p}{({z}_{1k}-{z}_{2k})}^{2}}$ and ρ is an unknown parameter. The first and second degree polynomial kernels (d = 1, 2) correspond to assuming h(·) to be linear and quadratic in z's, respectively. The choice of a kernel function determines which function space one would like to use to approximate h(z). The unknown parameter of a kernel function plays a critical role in function approximation. It is a challenging problem to optimally estimate it from data. In the machine learning literature, this parameter is usually pre-fixed at some values based on some ad-hoc methods. In this paper, we show that we can optimally estimate it from data based on a mixed model framework.
The Estimation Procedure
where λ is a regularization parameter that controls the tradeoff between goodness of fit and complexity of the model. When λ = 0, it fits a saturated model, and when λ = ∞, the model reduces to a simple logistic model logit (μ_{ i }) = ${x}_{i}^{T}\beta $. Note that there are two tuning parameters in the above likelihood function, the regularization parameter λ and kernel parameter ρ Intuitively, λ controls the magnitude of the unknown function while ρ mainly governs the smoothness property of the function.
where k_{ i }= {K(z_{ i }, z_{1}), ..., K(z_{ i }, z_{ n })}^{ T }and α= (α_{1}, ⋯, α_{ n })^{ T }, an n × 1 vector of unknown parameters.
where K= K(ρ) is an n × n matrix whose (i, i')th element is K(z_{ i }, z_{ i }') and often depends on an unknown parameter ρ.
where ${\tilde{y}}^{(k)}=X{\beta}^{(k)}+K{\alpha}^{(k)}+{D}^{{(k)}^{-1}}(y-{\mu}^{(k)})$, τ = 1/λ, h^{(k)}= Kα^{(k)}, and ${D}^{(k)}=\text{Diag}\{{\mu}_{i}^{(k)}(1-{\mu}_{i}^{(k)})\}$. The estimators $\widehat{\beta}$ and $\widehat{h}$ at convergence are the kernel machine estimators that maximize (6).
The Connection of Logistic Kernel Machine Regression to Logistic Mixed Models
where β is a q × 1 vector of fixed effects, and h= (h_{1}, ..., h_{ n }) is a n × 1 vector of subject-specific random effects following h ~N{0, τ K(ρ)}, and the covariance matrix K(ρ) is the n × n kernel matrix as defined in previous section.
Setting τ = 1/λ and h= Kα, one can easily see that equations (6) and (9) are identical. It follows that the logistic kernel machine estimators $\widehat{\beta}$ and $\widehat{h}$ can be obtained by fitting the logistic mixed model (8) using PQL. In fact, examination of the kernel machine normal equations (7) shows that they are identical to the normal equations obtained from the PQL (9) [10], where $\tilde{y}$ in (7) is in fact the PQL working vector and D is the PQL working weight matrix.
where $\tilde{y}$ is the working vector defined below equation (7), h is a random effect vector following N{0, τ K(ρ)}, and ε ~ N(0, D). The covariance of $\widehat{\beta}$ is estimated by (X^{ T }V^{-1}X)^{-1}, and the covariance of $\widehat{h}$ is estimated by $\widehat{\tau}K-\widehat{\tau}KPK$, where P= V^{-1} - V^{-1}X(X^{ T }V^{-1}X)^{-1}X^{ T }V^{-1} and V= V($\widehat{\delta}$). The covariance of $\widehat{\delta}$ can be obtained as the inverse of the expected information matrix calculated using the second derivative of (10) with respect to δ. The square roots of the diagonal elements of the estimated covariance matrices give the standard errors of $\widehat{\beta}$, $\widehat{h}$, and δ. The above discussion shows that we can easily fit the logistic kernel machine regression using the existing PQL-based mixed model software, such as SAS GLIMMIX and R GLMMPQL.
Test for the Genetic Pathway Effect
It is of significant practical interest to test the overall genetic pathway effect H_{0}: h(z) = 0. Assuming h(z) ∈ ${\mathcal{H}}_{K}$, one can easily see from the logistic mixed model representation (8) that H_{0}: h(z) = 0 vs H_{1}: h(z) ≠ 0 is equivalent to testing the variance component τ as H_{0}: τ = 0 vs H_{1}: τ > 0. Note that the null hypothesis places τ on the boundary of the parameter space. Since the kernel matrix K is not block diagonal, unlike the standard case considered by Self and Liang [17], the likelihood ratio for H_{0}: τ = 0 does not follow a mixture of ${\chi}_{0}^{2}$ and ${\chi}_{1}^{2}$ distribution. We consider instead a score test in this paper.
When conducting statistical tests for pathways, two types of tests could be formulated. The first is called the competitive test and the second the self-contained test [18]. The competitive test compares an interested gene set to all the other genes on a gene chip. An example of the competitive test is the gene set enrichment analysis (GSEA) [1], where an enrichment score of a gene set is defined and a permutation test is used to test for the significance of the gene set based on the enrichment score. The self-contained test compares the gene set to an internal standard which does not involve any genes outside the gene set considered. In other words, the self-contained test examines the null hypothesis that a pathway has no effect on the outcome versus the alternative hypothesis that the pathway has an effect. The variance component test of [13] for the linear pathway effect is a self-contained test. Goeman and Bühlmann [18] pointed out that the self-contained test has a higher power than a competitive test and that its statistical formulation is also consistent for both single gene tests and gene set tests, and the statistical sampling properties of the competitive test can be difficult to interpret.
Our pathway effect hypothesis H_{0}: h(z) = 0 vs H_{1}: h(z) ≠ 0 is a self-contained hypothesis. We propose in this paper a self-contained test for the pathway effect by developing a kernel machine variance component score test for H_{0}: τ = 0 vs H_{0}:τ > 0. The proposed test allows for both linear and nonlinear pathway effects and includes the tests by Goeman et al. [13, 14] as a special case. A key advantage of our kernel-based test is that we do not need to explicitly specify the basis functions for h(·), which is often difficult for modeling the joint effects of multiple genes, and we all let the data to estimate the best curvature of h(·).
Zhang and Lin [19] proposed a score test for H_{0}: τ = 0 to compare a polynomial model with a smoothing spline. Goeman et al. [14] also proposed a global test against a high dimensional alternative under the empirical Bayesian framework. The variance-covariance matrix used in these tests do not involve any unknown parameters. However, the kernel function K(·, ·) in a kernel machine model usually depends on some unknown parameter ρ. One can easily see from the mixed model representation (8) that under H_{0}: τ = 0, the kernel matrix K disappears. This makes the parameter ρ inestimable under the null hypothesis and therefore renders the above tests inapplicable.
Davies [20, 21] studied the problem of a parameter disappearing under H_{0} and proposed a score test by treating the score statistic as a Gaussian process indexed by the nuisance parameter and then obtaining an upper bound to approximate the p-value of the score test. We adopt this line of approaches for our proposed score test.
where ${\widehat{\beta}}_{0}$ is the MLE of β under H_{0}: τ = 0, ${\widehat{\mu}}_{0}=logi{t}^{-1}(X{\widehat{\beta}}_{0})$, μ_{ Q }= tr{P_{0}K(ρ)}, ${\sigma}_{Q}^{2}$ = 2tr{P_{0}K(ρ)P_{0}K(ρ)}, and P_{0} = D_{0} – D_{0}X(X^{ T }D_{0}X)^{-1} X^{ T }D_{0}, where ${D}_{0}=\text{diag}\{{\widehat{\mu}}_{i0}(1-{\mu}_{i0})\}$. Note that under H_{0}: τ = 0, model (3) reduces to the simple logistic model logit (μ_{ i }) = ${x}_{i}^{T}\beta $. Hence the ${\widehat{\beta}}_{0}$ is the MLE of β under this null logistic model.
If the Gaussian kernel is used, then an arbitrary nonlinear pathway effect is implicitly assumed. Our proposed test, which is derived to test for any nonlinear effect, is therefore more powerful than tests based on a parametric assumption. We show in Appendix A.1 that when ρ becomes large in the Gaussian kernel, our test statistic reduces asymptotically to the one based on linearity assumption of genetic effects. Hence our test includes linear model based test as a special case. From (11) it is also clear that our test is invariant to the relative scaling of the kernel function K(·, ·).
where Φ (·) is the normal cumulative distribution function, M is the maximum of S(ρ) over the range of ρ, W = | S(ρ_{1}) – S(L)| + | S(ρ_{2}) – S(ρ_{1}) | + ... + | S(U) – S(ρ_{ m }) |, L and U are the lower and upper bound of ρ respectively and ρ_{ l }, l = 1, ..., m are the m grid points between L and U. Davies [20] points out that this bound is sharp. For the Gaussian kernel, we suggest to set the bound of ρ as L = 0.1 ${\mathrm{min}}_{i\ne j}{\displaystyle {\sum}_{l=1}^{p}{({z}_{il}-{z}_{jl})}^{2}}$ and U = 100 ${\mathrm{max}}_{i\ne j}{\displaystyle {\sum}_{l=1}^{p}{({z}_{il}-{z}_{jl})}^{2}}$. For justifications, see Appendix A.2.
Extension to generalized kernel machine model
For simplicity, we focus in this paper on logistic regression for binary outcomes. The proposed semiparametric model (3) can be easily extended to other types of continuous and discrete outcomes, such as normal, count, skewed data, whose distributions are in the exponential family [23]. In this section, we briefly discuss how to generalize our estimation and testing procedures for binary outcomes to other data types within the generalized kernel machine framework and discuss its fitting using generalized linear mixed models.
where K is an n × n matrix whose (i, j)th element is K(z_{ i }, z_{ j }).
where τ = 1/λ, and h= (h_{1} ..., h_{ n }) is an n × n random vector with distribution N {0, τ K(ρ)}. The same PQL statistical software, such as SAS PROC GLIMMIX and R GLMMPQL, can be used to fit this model and obtain the kernel machine estimators of β and h(·).
The score test (11) also has a straightforward extension. The only change is that the elements in matrix D in (11) be replaced by appropriate variance function var(y_{ i }) under the assumed parametric distribution of y_{ i }.
Appendix
A.1 Proof of the relationship of the proposed score test and that of Goeman, et al [13] under the linearity assumption
We show in this section when the scale parameter is large, the proposed nonparametric variance component test for the pathway effect using the Gaussian kernel reduces to the linearity-based global test of Goeman et al. [13].
This proves the approximate relation (19).
A.2 Calculations of the lower and upper bounds of ρ
Although in theory ρ could take any positive values up to infinity, for computational purpose we would require ρ to be bounded. For the proposed test statistic (11), its value in fact only depends on a finite range of ρ values. We describe why this is the case and how to find this range. For a given data set, the proof in Appendix A.1 shows that when is sufficiently large, the quantity 0.5ρQ_{ τ }(${\widehat{\beta}}_{0}$, ρ) converges to ${S}_{0}={(\tilde{y}-{\widehat{\mu}}_{0})}^{T}R(\tilde{y}-{\widehat{\mu}}_{0})$, which is free of ρ.
These arguments suggest that for numerical evaluation, it is not necessary to consider all ρ values up to infinity. Instead, a moderately large enough value would suffice. Now the question comes down to how to decide on appropriate upper and lower bounds for ρ. The proof in Appendix A.1 requires ${\mathrm{max}}_{i\ne j}{\displaystyle {\sum}_{l=1}^{p}{({z}_{il}-{z}_{jl})}^{2}}/\rho $ be close to 0. Let C_{1} be some large positive number such that 1/C_{1} ≈ 0. If we take the upper bound of ρ to be C_{1} ${\mathrm{max}}_{i\ne j}{\displaystyle {\sum}_{l=1}^{p}{({z}_{il}-{z}_{jl})}^{2}}$, then ${\mathrm{max}}_{i\ne j}{\displaystyle {\sum}_{l=1}^{p}{({z}_{il}-{z}_{jl})}^{2}}/\rho $ would be close to 0. In practice we suggest taking C_{1} = 100, which would give good approximation. Using a similar idea, we can find a lower bound for ρ. It is clear that when ${\mathrm{min}}_{i\ne j}{\displaystyle {\sum}_{l=1}^{p}{({z}_{il}-{z}_{jl})}^{2}}$/ρ → ∲ any non-diagonal element of K(ρ) will be 0 and the kernel matrix reduces to an identity matrix. Hence, if we pick a small enough number C_{2} such that 1/C_{2} → ∲, we can effectively set the lower bound of ρ to be C_{2} ${\mathrm{min}}_{i\ne j}{\displaystyle {\sum}_{l=1}^{p}{({z}_{il}-{z}_{jl})}^{2}}$. In practice we suggest take C_{2} = 0.1, which yields a good approximation.
A.3 derivation of normal equation (5)
where D= Diag{μ_{ i }(1 – μ_{ i })}. The Newton-Raphson iteration states that the parameter value at the (k + 1)^{th} iteration can be updated by the following relationshipδ^{(k+1)}= δ^{(k)}– (H^{(k)})^{-1} q^{(k)},
where δ= (β^{ T }, α^{ T })^{ T }. Substitute (20) and (21) into (22), we arrive at normal equation (7).
Availability
Our algorithm is available at http://www.biostat.harvard.edu/~xlin.
Declarations
Acknowledgements
Liu and Lin's research was supported by a grant from the National Cancer Institute (R37 CA-76404). Ghosh's research was supported by a grant from the National Institute of Health (R01 GM-72007). We thank A. Chinnaiyan for providing the prostate cancer data. We also thank the two anonymous referees for their comments which helped improve the manuscript.
Authors’ Affiliations
References
- Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 2005, 102: 15545–15550. 10.1073/pnas.0506580102View ArticleGoogle Scholar
- Eisenberg D, Graeber TG: Bioinformatic identification of potential autorine signaling loops in cancers from gene expression profiles. Nature Genetics 2001, 29: 295–300. 10.1038/ng718View ArticlePubMedGoogle Scholar
- Raponi M, Belly R, Karp J, Lancet J, Atkins D, Wang Y: Microarray analysis reveals genetic pathways modulated by tipifarnib in acute myeloid leukemia. BMC Cancer 2004, 4: 56. 10.1186/1471-2407-4-56PubMed CentralView ArticlePubMedGoogle Scholar
- Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR: GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nature Genetics 2002, 31: 19–20. 10.1038/ng0502-19View ArticlePubMedGoogle Scholar
- Grosu P, Twonsend JP, Hartl DL, Cavalieri D: Pathway Processor: A tool for integrating whole-genome expression results into metabolic networks. Genome Research 2002, 12: 1121–1126. 10.1101/gr.226602PubMed CentralView ArticlePubMedGoogle Scholar
- Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR: MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biology 2003, 4: R7. 10.1186/gb-2003-4-1-r7PubMed CentralView ArticlePubMedGoogle Scholar
- Vapnik V: Statistical Learning Theory. New York: Wiley; 1998.Google Scholar
- Schölkopf B, Smola A: Learning with Kernels. Cambridge, Massachusetts: MIT press; 2002.Google Scholar
- Liu D, Lin X, Ghosh D: Semiparametric regression of multi-dimensional genetic pathway data: least squares kernel machines and linear mixed models. Biometrics 2007, 63(4):1079–1088.PubMed CentralView ArticlePubMedGoogle Scholar
- Breslow N, Clayton D: Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 1993, 88: 9–25. 10.2307/2290687Google Scholar
- Wei Z, Li H: Nonparametric pathway-based regression models for analysis of genomic data. Biostatistics 2007, 8(2):265–284. 10.1093/biostatistics/kxl007View ArticlePubMedGoogle Scholar
- Sprague R, Ed: Proceedings of the 39th Annual Hawaii International Conference on System Sciences. Los Alamitos: IEEE; 2006. [CD ROM version]Google Scholar
- Goeman JJ, Geer SA, de Kort F, van Houwelingen HC: A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004, 20: 93–99. 10.1093/bioinformatics/btg382View ArticlePubMedGoogle Scholar
- Goeman JJ, Geer SA, van Houwelingen HC: Testing against a high dimensional alternative. Journal of the Royal Statistical Society: Series B 2006, 68: 477–493. 10.1111/j.1467-9868.2006.00551.xView ArticleGoogle Scholar
- Dhanasekaran S, Barrette T, Ghosh D, Shah R, Varambally S, Kurachi K, Pienta K, Rubin M, Chinnaiyan A: Delineation of prognostic biomarkers in prostate cancer. Nature 2001, 412(6849):822–6. 10.1038/35090585View ArticlePubMedGoogle Scholar
- Kimeldorf G, Wahba G: Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications 1970, 33: 82–95. 10.1016/0022-247X(71)90184-3View ArticleGoogle Scholar
- Self SG, Liang KY: Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under non-standard conditions. Journal of the American Statistical Association 1987, 82: 605–610. 10.2307/2289471View ArticleGoogle Scholar
- Goeman JJ, Bühlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007, 23: 980–987. 10.1093/bioinformatics/btm051View ArticlePubMedGoogle Scholar
- Zhang D, Lin X: Hypothesis testing in semiparametric additive mixed models. Biostatistics 2002, 4: 57–74. 10.1093/biostatistics/4.1.57View ArticleGoogle Scholar
- Davies R: Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 1977, 64: 247–254. 10.2307/2335690View ArticleGoogle Scholar
- Davies R: Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 1987, 74: 33–43.Google Scholar
- le Cessie S, van Houwelingen J: Goodness of fit tests for generalized linear models based on random effect models. Biometrics 1995, 51(2):600–614. 10.2307/2532948View ArticlePubMedGoogle Scholar
- McCullagh P, Nelder J: Generalized Linear Models. New York: Chapman & Hall; 1989.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.