Problem formulation
We consider a binary classification problem. Suppose there are n training samples and let y = [y_1, y_2, ..., y_n]' denote the class labels, where y_i = 1 indicates that sample i is in class "I" and y_i = -1 indicates that it is in the other class (i.e., not class "I"), for i = 1, 2, ..., n. For each sample, p genes are investigated and we define the gene expression matrix X as the n × p matrix
X = [x_ij], i = 1, 2, ..., n, j = 1, 2, ..., p, (1)
where x_ij denotes the expression level of the j-th gene in the i-th sample.
The data matrix X should usually be normalized for each gene (each column of X). To handle the gene selection problem, we further define the gene-selection vector γ as
γ = [γ_1, γ_2, ..., γ_p]', where γ_j = 1 if the j-th gene is selected and γ_j = 0 otherwise. (2)
X_γ is defined as the gene expression matrix corresponding to the genes selected according to the gene-selection vector γ, i.e.,
X_γ = [x_(i_1), x_(i_2), ..., x_(i_q)], (3)
where x_(i) denotes the i-th column of X and i_1 < i_2 < ... < i_q are the indices of the non-zero elements of γ; in other words, the j-th column of X_γ is the i-th column of X, where i is the index of the j-th non-zero element of γ. In formula (3), q genes are selected out of the total p genes, and q << p in a typical gene selection problem. Formulating the problem in a regression setting, we introduce n latent variables z_1, z_2, ..., z_n such that
z_i = g(x_γi) + b + e_i = t_i + b + e_i, and y_i = 1 if z_i > 0, y_i = -1 otherwise, for i = 1, 2, ..., n, (4)
where x_γi denotes the i-th row of the matrix X_γ; e_i is the independent noise term, assumed to be Gaussian with zero mean and variance σ²; b is the intercept term; and the link function g(·) is assumed to be chosen from a class of real-valued functions whose outputs form a Gaussian process. In vector form, we define z = [z_1, z_2, ..., z_n]', t = [t_1, t_2, ..., t_n]' and e = [e_1, e_2, ..., e_n]'. Note that if g(·) is restricted to a linear function and σ² is fixed at 1, model (4) reduces to a setting very similar to linear probit regression.
Kernel-Imbedded Gaussian Processes (KIGPs)
In general, a continuous stochastic process is a collection of random variables, each of which takes real values according to a probability distribution. If we consider the outputs of a learning function g(·), where g is chosen according to some distribution D defined over a class of real-valued functions, then the collection of these outputs is also a stochastic process, and the distribution D represents our prior belief about the learning function.
A Gaussian process is a continuous stochastic process such that the marginal distribution of any finite subset of its outputs is a zero-mean Gaussian distribution. In this paper, as defined in formula (4), t_i = g(x_γi), where x_γi = [x_γ,i1, x_γ,i2, ..., x_γ,iq], i = 1, 2, ..., n; and in the formula, we assume
P_g~D([g(x_γ1), g(x_γ2), ..., g(x_γn)] = [t_1, t_2, ..., t_n]) ∝ exp(-t'K^{-1}t/2), where K_ij = K(x_γi, x_γj), i, j = 1, 2, ..., n. (5)
In (5), K(x_γi, x_γj) is a function defined in the observation space that conceptually represents the inner product of the sample vectors x_γi and x_γj in the feature space, ⟨Ψ(x_γi), Ψ(x_γj)⟩ (where Ψ(·) is the mapping from the observation space to the feature space). K is the kernel matrix built from the Mercer kernel K(·,·). Formula (5) encodes our prior belief about the learning model, and the kernel function K(·,·) uniquely determines the properties of the learning functions. Some of the most commonly used kernel functions are:
Linear kernel: K(x_γi, x_γj) = ⟨x_γi, x_γj⟩ (6a)
Polynomial kernel: K(x_γi, x_γj) = (⟨x_γi, x_γj⟩ + 1)^d, where d = 1, 2, ... is the degree parameter. (6b)
Gaussian kernel: K(x_γi, x_γj) = exp(-‖x_γi - x_γj‖²/(2r²)), where r > 0 is the width parameter. (6c)
In (6a) and (6b), the term ⟨x_γi, x_γj⟩ denotes the inner product of the vectors x_γi and x_γj. When the linear kernel is used, the feature space is the same as the observation space. In this paper, we refer to the linear kernel as the LK, to the polynomial kernel with degree d as the PK(d), and to the Gaussian kernel with width r as the GK(r). We primarily focus on the KIGP method with the Gaussian kernel and the polynomial kernel, and discuss them in parallel.
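To make the kernel definitions in (6a)-(6c) concrete, the following is a small illustrative sketch in Python/NumPy (not part of the original method); the function names and the 1/(2r²) scaling assumed for the Gaussian kernel are our own choices.

```python
import numpy as np

def linear_kernel(X):
    """LK, Eq. (6a): Gram matrix of pairwise inner products of the rows of X (n x q)."""
    return X @ X.T

def polynomial_kernel(X, d=2):
    """PK(d), Eq. (6b): (<x_i, x_j> + 1)^d."""
    return (X @ X.T + 1.0) ** d

def gaussian_kernel(X, r=1.0):
    """GK(r), Eq. (6c): exp(-||x_i - x_j||^2 / (2 r^2)) (assumed scaling convention)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * r ** 2))

# Example: Gram matrices for 5 samples over q = 3 selected genes
X_gamma = np.random.default_rng(0).standard_normal((5, 3))
K_lin = linear_kernel(X_gamma)
K_poly = polynomial_kernel(X_gamma, d=2)
K_gauss = gaussian_kernel(X_gamma, r=0.8)
```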
In model (4), we have the latent vector z = t + e + b·1_n, where e ~ N(0, σ²I_n), I_n denotes the n × n identity matrix, 1_n denotes the n × 1 vector with all elements equal to 1, and N(·,·) denotes the multivariate normal distribution. Hence,
P(z|t) ∝ exp(-(z - t - b·1_n)'Ω^{-1}(z - t - b·1_n)/2), where Ω = σ²I_n. (7)
With the Bayes rule, we have
P(t̃, t | x̃_γ, X_γ, z) ∝ P(z | t) P(t̃, t | x̃_γ, X_γ), (8)
where x̃_γ is the new predictor associated with the given gene-selection vector γ and t̃ is the posterior output (without the intercept b) with respect to x̃_γ, given the matrix X_γ and the latent output z. With a kernel such as those defined in (6), and assuming the intercept b and the noise variance σ² are both given, plugging (5) and (7) into (8) and integrating out t, the marginal distribution of t̃ given x̃_γ, X_γ and z is the following Gaussian distribution [26]:
t̃ | x̃_γ, X_γ, z ~ N(f(x̃_γ, X_γ, z), V(x̃_γ, X_γ, z)), where
f(x̃_γ, X_γ, z) = (z - b·1_n)'(K_γ + σ²I)^{-1} k_γ,
V(x̃_γ, X_γ, z) = K(x̃_γ, x̃_γ) - k_γ'(K_γ + σ²I)^{-1} k_γ,
K_γ,ij = K(x_γi, x_γj), k_γ,i = K(x̃_γ, x_γi), i, j = 1, 2, ..., n. (9)
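As an illustration of Eq. (9), the sketch below computes the predictive mean f and variance V for a single test point under a Gaussian kernel; the helper `gk` and the 1/(2r²) scaling are our assumptions, not the authors' code.

```python
import numpy as np

def gk(A, B, r=1.0):
    """Gaussian-kernel Gram matrix between the rows of A and B (assumed 2r^2 scaling)."""
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * r ** 2))

def gp_predict(X_gamma, z, b, sigma2, x_new, r=1.0):
    """Predictive mean f and variance V of t~ given x~_gamma, X_gamma and z, as in Eq. (9)."""
    n = X_gamma.shape[0]
    A = gk(X_gamma, X_gamma, r) + sigma2 * np.eye(n)     # K_gamma + sigma^2 I
    k = gk(X_gamma, x_new[None, :], r).ravel()           # k_gamma
    f = (z - b) @ np.linalg.solve(A, k)                  # (z - b 1_n)'(K_gamma + sigma^2 I)^{-1} k_gamma
    V = gk(x_new[None, :], x_new[None, :], r)[0, 0] - k @ np.linalg.solve(A, k)
    return f, V

# Example usage on random data
rng = np.random.default_rng(1)
Xg, z = rng.standard_normal((6, 2)), rng.standard_normal(6)
f, V = gp_predict(Xg, z, b=0.0, sigma2=1.0, x_new=rng.standard_normal(2))
```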
Supervised Microarray Data Analysis using KIGP
Prior specification
(1) The γ_j are assumed to be a priori independent for all j, and
Pr(γ_j = 1) = π_j, for j = 1, 2, ..., p, (10)
where the prior probability π_j reflects prior knowledge of the importance of the j-th gene.
(2) A non-informative prior is applied for the intercept b:
P(b) ∝ 1. (11a)
This is not a proper probability density function (PDF), but it leads to a proper posterior PDF.
(3) An inverted gamma (IG) distribution is used as the prior for the noise variance σ². Specifically, we assume
σ² ~ IG(1, 1). (11b)
(4) For the width of a Gaussian kernel (i.e., a scaling parameter), an inverted gamma distribution is also a reasonable prior. To preclude values of r that are too small or too large (which would make the system numerically unstable), we apply IG(1, 1) as the prior for r², that is,
r² ~ IG(1, 1). (11c)
(5) For the degree of a polynomial kernel, we assume a uniform distribution. In this paper, we only consider the PK(1) and the PK(2) to avoid the issue of overfitting for most practical cases. Therefore, we have P(d = 1) = P(d = 2) = 0.5.
(6) We assume that γ and b are a priori independent from each other, that is P(γ, b) = P(γ)P(b).
Bayesian inference for model parameters
Based on model (4), the label y depends only on z; therefore, all other model parameters are conditionally independent of y given z. For convenience, we drop the notation of the training set X in the following derivation, and we also drop y when z is given. We further assume that the kernel type is given and denote the associated kernel parameter(s) by θ.
(I) Sampling from γ | z, b, σ², θ
Here, we drop the notation of the given parameters b, σ² and θ. With the model described in (2), (5) and (7), we have
P(z | γ, b, σ², θ) ∝ |Ω + K_γ|^{-1/2} exp(-(z - b·1_n)'(Ω + K_γ)^{-1}(z - b·1_n)/2), where
K_γ,ij = K(x_γi, x_γj), i, j = 1, 2, ..., n; Ω = σ²I_n; I_n and 1_n are defined in (7). (12)
The detailed derivation for (12) is provided in the Appendix. After inserting the prior given by (10), we have
P(γ | z) ∝ P(z | γ) ∏_{j=1}^{p} π_j^{γ_j}(1 - π_j)^{1-γ_j}. (13)
In practice, rather than sampling γ as a whole vector, we sample it component-wise from
Pr(γ_j = 1 | γ_{-j}, z) = π_j P(z | γ_j = 1, γ_{-j}) / [π_j P(z | γ_j = 1, γ_{-j}) + (1 - π_j) P(z | γ_j = 0, γ_{-j})], j = 1, 2, ..., p, (14)
where γ_{-j} denotes the components of γ other than γ_j.
In both (13) and (14), K_γ is defined in (12).
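A hedged sketch of the component-wise update in (14): for each gene j, the marginal likelihood (12) is evaluated with γ_j = 1 and γ_j = 0, weighted by the prior π_j from (10), and γ_j is drawn from the resulting Bernoulli distribution. The helper `build_K` and all names are our own; this is an illustration, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal(z, b, sigma2, K_gamma):
    """log P(z | gamma, b, sigma^2) = log N(z; b 1_n, K_gamma + sigma^2 I_n), Eq. (12)."""
    n = len(z)
    return multivariate_normal.logpdf(z, mean=b * np.ones(n), cov=K_gamma + sigma2 * np.eye(n))

def sample_gamma(z, b, sigma2, gamma, X, pi, build_K, rng):
    """One component-wise Gibbs sweep over the gene-selection vector, Eq. (14)."""
    gamma = gamma.copy()
    for j in range(len(gamma)):
        logp = np.empty(2)
        for val in (0, 1):
            gamma[j] = val
            prior = pi[j] if val == 1 else 1.0 - pi[j]
            logp[val] = log_marginal(z, b, sigma2, build_K(X, gamma)) + np.log(prior)
        p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))     # Pr(gamma_j = 1 | the rest)
        gamma[j] = int(rng.random() < p1)
    return gamma

# Example: build_K uses a linear kernel on the currently selected columns
build_K = lambda X, g: X[:, g.astype(bool)] @ X[:, g.astype(bool)].T
```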
(II) Sampling from t | γ, b, z, σ², θ
As shown by Eq. (A6) in the Appendix, the conditional distribution P(t|z, b) is Gaussian:
t | z, b ~ N((I_n - Ω(Ω + K_γ)^{-1})(z - b·1_n), Ω - Ω(Ω + K_γ)^{-1}Ω),
where K_γ and Ω are defined in Eq. (12). (15)
We thus can draw t given z accordingly.
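A minimal sketch of step (15), drawing t from its Gaussian conditional given z and b; `K_gamma` is the kernel matrix of the currently selected genes, and the small jitter term is our own numerical safeguard.

```python
import numpy as np

def sample_t(z, b, sigma2, K_gamma, rng):
    """Draw t ~ N((I - Omega(Omega + K)^{-1})(z - b 1_n), Omega - Omega(Omega + K)^{-1}Omega), Eq. (15)."""
    n = len(z)
    Omega = sigma2 * np.eye(n)
    S = np.linalg.solve(Omega + K_gamma, Omega)          # equals Omega (Omega + K_gamma)^{-1} since Omega = sigma^2 I
    mean = (np.eye(n) - S) @ (z - b)
    cov = Omega - Omega @ S
    cov = 0.5 * (cov + cov.T) + 1e-10 * np.eye(n)        # symmetrize + jitter for numerical stability
    return rng.multivariate_normal(mean, cov)
```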
(III) Sampling from z | t, b, σ², y
Given the class label vector y, the conditional distribution of z given t is a truncated Gaussian distribution, and we have the following formula for i = 1,2,..., n:
z_i | t_i, b, σ², y_i = 1 ~ N(t_i + b, σ²) truncated at the left by 0,
z_i | t_i, b, σ², y_i = -1 ~ N(t_i + b, σ²) truncated at the right by 0. (16)
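A sketch of step (16) using SciPy's truncated normal; the paper uses Robert's accept-reject sampler in the actual implementation, so this is only a convenient stand-in.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_z(t, b, sigma2, y, rng=None):
    """Draw z_i from N(t_i + b, sigma^2) truncated to z_i > 0 if y_i = 1 and z_i < 0 if y_i = -1."""
    mu = t + b
    sd = np.sqrt(sigma2)
    # truncnorm takes its bounds in standard-deviation units relative to loc
    lower = np.where(y == 1, (0.0 - mu) / sd, -np.inf)
    upper = np.where(y == 1, np.inf, (0.0 - mu) / sd)
    return truncnorm.rvs(lower, upper, loc=mu, scale=sd, random_state=rng)
```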
(IV) Sampling from b | z, t, σ²
When z and t are both given, this is a simple ordinary linear regression setting with only an intercept term. Under the non-informative prior assumption given by (11a), it yields
b | z, t, σ² ~ N(μ, σ²/n), where μ = (1/n) Σ_{i=1}^{n} (z_i - t_i). (17a)
(V) Sampling from σ² | z, t, b
With IG(α, β), α > 0, β > 0, as the prior, the conditional posterior distribution for σ2 is also an inverted gamma distribution. That is
σ² | z, t, b ~ IG(α + n/2, β + ns²/2), where s² = (1/n) Σ_{i=1}^{n} (z_i - t_i - b)². (17b)
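For completeness, a minimal sketch of the two conjugate updates (17a) and (17b), with μ and s² taken as the sample means given above (our reading of the formulas):

```python
import numpy as np

def sample_b(z, t, sigma2, rng):
    """Eq. (17a): b | z, t, sigma^2 ~ N(mean(z - t), sigma^2 / n)."""
    n = len(z)
    return rng.normal(np.mean(z - t), np.sqrt(sigma2 / n))

def sample_sigma2(z, t, b, rng, alpha=1.0, beta=1.0):
    """Eq. (17b): sigma^2 | z, t, b ~ IG(alpha + n/2, beta + n s^2 / 2)."""
    n = len(z)
    s2 = np.mean((z - t - b) ** 2)
    # draw from IG(shape, scale) as 1 / Gamma(shape, scale=1/scale)
    return 1.0 / rng.gamma(alpha + n / 2.0, 1.0 / (beta + n * s2 / 2.0))
```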
Kernel parameters tuning
One of the major advantages of kernel-induced learning methods is that one can explore the non-linearity feature of the underlying model for a given learning problem by applying different kernels. It is therefore necessary to discuss the issue of kernel parameter tuning. With the KIGP framework constructed above, this turns out to be rather straightforward.
As in the last section, we denote the kernel parameter(s) by θ, which can be either a scalar (e.g., the width parameter of a GK or the degree parameter of a PK) or a vector. For algorithmic convenience, we work with the logarithm of the conditional likelihood of the parameter θ:
L(θ) = log P(z | γ, b, σ², θ) = -(1/2) log|K_γ(θ) + σ²I_n| - (1/2)(z - b·1_n)'(K_γ(θ) + σ²I_n)^{-1}(z - b·1_n) + const. (18)
With a proper prior distribution for θ, P(θ), we have:
P(θ | z, γ, b, σ²) ∝ exp(L(θ)) · P(θ), (19)
where L(θ) is defined in (18). In this paper, we specifically focus on two kernel types: the polynomial kernel and the Gaussian kernel, as defined in (6b) and (6c), respectively. For a GK, with the prior for the width parameter given in the "Prior specification" subsection, the resulting posterior distribution given by (19) is non-regular, so we apply the Metropolis-Hastings algorithm (details can be found in [31]) to draw the samples. For a PK, we simply calculate the likelihood for each d by (18) and sample d accordingly. Sometimes, one may need the gradient of L(θ) with respect to θ (assume θ = [θ_1, ..., θ_J]') when adopting other plausible algorithms:
∂L(θ)/∂θ_i = (1/2)(z - b·1_n)'Ω_γ(θ)^{-1}(∂K_γ(θ)/∂θ_i)Ω_γ(θ)^{-1}(z - b·1_n) - (1/2) tr(Ω_γ(θ)^{-1} ∂K_γ(θ)/∂θ_i),
where Ω_γ(θ) = K_γ(θ) + σ²I_n, i = 1, ..., J. (20)
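As an illustration of the Metropolis-Hastings step used for the GK width, the sketch below performs a log-scale random-walk update of r², targeting (19) with L(θ) from (18) and the IG(1,1) prior (11c). The proposal, its step size, and the kernel scaling are all our own choices.

```python
import numpy as np

def log_L(r2, z, b, sigma2, X_gamma):
    """L(theta) in (18): log N(z; b 1_n, K_gamma(r) + sigma^2 I_n) up to an additive constant."""
    n = len(z)
    sq = np.sum(X_gamma ** 2, 1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X_gamma @ X_gamma.T
    K = np.exp(-np.maximum(d2, 0.0) / (2.0 * r2))        # assumed GK parameterization
    A = K + sigma2 * np.eye(n)
    resid = z - b
    _, logdet = np.linalg.slogdet(A)
    return -0.5 * logdet - 0.5 * resid @ np.linalg.solve(A, resid)

def log_prior_ig11(r2):
    """log density of IG(1, 1) up to a constant: -2 log(r^2) - 1/r^2."""
    return -2.0 * np.log(r2) - 1.0 / r2

def mh_update_r2(r2, z, b, sigma2, X_gamma, rng, step=0.2):
    """One Metropolis-Hastings update of r^2 with a log-normal random-walk proposal."""
    prop = r2 * np.exp(step * rng.standard_normal())
    # the log(prop) - log(r2) terms are the Jacobian of proposing on the log scale
    log_ratio = (log_L(prop, z, b, sigma2, X_gamma) + log_prior_ig11(prop) + np.log(prop)
                 - log_L(r2, z, b, sigma2, X_gamma) - log_prior_ig11(r2) - np.log(r2))
    return prop if np.log(rng.random()) < log_ratio else r2
```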
Theoretically, the proposed KIGP with the linear kernel performs very closely to most other classical linear methods. As the width parameter of a Gaussian kernel increases (to values increasingly larger than 1), within a reasonable range, the KIGP with such a GK performs fairly closely to the KIGP with a linear kernel. On the contrary, as the width decreases (to values increasingly smaller than 1), the behavior of the KIGP in the observation space becomes highly non-linear. As the degree of the polynomial kernel of a KIGP increases, the non-linearity of the KIGP also increases; when the degree is equal to 1, the only difference between the PK(1) and the linear kernel is a constant. In short, within a kernel class, different values of the kernel parameter represent different feature spaces. For certain kernel parameter values, the performance of the KIGP with a GK or a PK will be close to that of the KIGP with a linear kernel, or of a classical linear model in general. Therefore, the posterior distribution of the kernel parameter provides clues about what kind of feature space is more appropriate for the target problem given the training samples.
Proposed Gibbs sampler
With the derivation above and a given kernel type, we propose our Gibbs sampling algorithm as follows:
1. Start with proper initial values [γ[0], b[0], t[0], z[0], σ²[0], θ[0]]; then set i = 1.
2. Sample z[i] from z | t[i-1], b[i-1], σ²[i-1] via (16).
3. Sample t[i] from t | γ[i-1], b[i-1], z[i], σ²[i-1], θ[i-1] via (15).
4. Sample b[i] from b | z[i], t[i], σ²[i-1] via (17a).
5. Sample σ²[i] from σ² | z[i], t[i], b[i] via (17b).
6. Sample γ[i] from γ | z[i], b[i], σ²[i], θ[i-1] via (14) component-wise.
7. Sample θ[i] from θ | z[i], b[i], σ²[i], γ[i].
8. Set i = i + 1 and go back to step 2 until the required number of iterations is reached.
9. Stop. (21)
In the above procedure, the kernel parameter θ denotes the degree parameter d of a polynomial kernel or the width parameter r of a Gaussian kernel. In step 2, we follow the optimal exponential accept-reject algorithm suggested by Robert [32] to draw from a truncated Gaussian distribution. After a suitable burn-in period, we obtain posterior samples of [z[i], t[i], b[i], σ²[i], γ[i], θ[i]] at the i-th iteration with the procedure described in (21). The core calculation of the proposed Gibbs sampler is the inversion of the matrix K_γ + σ²I. Since the kernel matrix K_γ is symmetric and non-negative definite, K_γ + σ²I is symmetric and positive definite. Therefore, the algorithm is theoretically robust and the Cholesky decomposition can be applied in the numerical computation, as sketched below. The total computational complexity of the proposed Gibbs sampler per iteration is O(pn³).
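Since K_γ + σ²I is symmetric positive definite, each iteration can replace the explicit matrix inverse with a Cholesky factorization; a small sketch with SciPy (names are ours):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_spd(K_gamma, sigma2, B):
    """Compute (K_gamma + sigma^2 I)^{-1} B stably via the Cholesky decomposition."""
    n = K_gamma.shape[0]
    c_and_lower = cho_factor(K_gamma + sigma2 * np.eye(n), lower=True)
    return cho_solve(c_and_lower, B)
```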
Overall algorithm
In Fig. 1, we summarize the general framework of the proposed KIGP method. The box bounded by the dotted lines represents the KIGP learning algorithm. A kernel type is assumed to be given a priori. The algorithm has a cascading structure composed of three consecutive phases: the "kernel parameter fitting phase", the "gene selection phase" and the "prediction phase". Although in the Bayesian sense one could keep all the parameters in the proposed Gibbs sampler throughout all three phases, for practicality we suggest fixing the kernel parameter(s) after the "kernel parameter fitting phase" and fixing the gene-selection vector after the "gene selection phase". Very often, we are only interested in the area around the peak of the posterior PDF (or probability mass function, PMF) of a parameter, especially for the kernel parameter(s) and the gene-selection vector. This strategy leads to a much faster convergence of the proposed Gibbs sampler, as long as the posterior PDF or PMF of the kernel parameter(s) is unimodal. For each of the three phases, a proper number of iterations needs to be discarded as its burn-in period. Dynamic monitoring strategies for tracking the convergence of an MCMC simulation can be used (e.g., [31]).
A practical issue needs to be addressed here. It is better to fix the variance parameter σ² at a proper constant during the "kernel parameter fitting phase" and the "gene selection phase", because this helps the proposed algorithm to be more numerically stable and to converge faster. For all the simulations in this paper, as in a regular probit regression model, we set σ² equal to 1 (step 5 in (21)) in the first two phases and only include it in the Gibbs sampler in the "prediction phase". More details of each phase are described below.
Kernel Parameter Fitting Phase
In the kernel parameter fitting phase, our primary interest is to find appropriate value(s) for the kernel parameter(s) of the given kernel type. In this study, we focus on two kernel types, the polynomial kernel and the Gaussian kernel. Given the training set X and y, we first include all model parameters (except σ²), the gene-selection vector and the kernel parameter in the simulation of the algorithm given by (21). After convergence, the samples obtained from (21) within each iteration are drawn from the joint posterior distribution of all the parameters. For a PK, since the degree parameter is discrete, we simply take the degree value with the highest posterior probability. For a GK, we compute a histogram of the sampled values of the width parameter with a proper number of bins, smooth the histogram bars with a Gaussian smoother (similar to Gaussian kernel density estimation), and take the center of the bin with the highest smoothed count as the best-fitted value of the width parameter.
Gene Selection Phase
After the "kernel parameter fitting phase", we fix the kernel parameter(s) at the fitted value(s) and then continue to run the proposed Gibbs sampler. In this subsection, we present an empirical approach to determining whether a gene is potentially significant based on the posterior samples and a given threshold.
Efron [29] thoroughly discussed an empirical Bayes approach for estimating the null hypothesis based on large-scale simultaneous t-tests. In this paper, we essentially follow the key concept therein to assess whether or not a gene is of significant importance for the given classification problem. We first define a statistic named the "Normalized Log-Frequency" (NLF) to measure the relative potential significance of a gene. Denoting by F_j the frequency with which the j-th gene appears in the posterior samples, the NLF is defined as
NLF_j = (log F_j - m)/s, for j = 1, 2, ..., p, (22)
where m and s are the mean and the standard deviation of {log F_1, log F_2, ..., log F_p}.
In practice, if F_j is 0, we simply set it to 1/2 divided by the total number of iterations. Our use of the NLF as the key statistic is based on the fact that the logarithm of a gamma distribution can be well approximated by a normal distribution, while a gamma distribution is empirically a proper distribution for the appearance frequency of any gene from a homogeneous group in the posterior samples.
Suppose that the p NLF values fall into two classes, "insignificant" or "significant", corresponding to whether or not NLF_j, for j = 1, 2, ..., p, is generated according to the null hypothesis, with prior probabilities Pb0 and Pb1 = 1 - Pb0 for the two classes, respectively; and that NLF_j has the conditional prior density either f0(NLF) or f1(NLF), depending on its class. That is,
Pb0 = Pr{Insignificant}, Pr(NLF | Insignificant) = f0(NLF),
Pb1 = Pr{Significant}, Pr(NLF | Significant) = f1(NLF). (23)
The marginal distribution of NLF_j is thus
Pr(NLF) = f0(NLF)·Pb0 + f1(NLF)·Pb1 = f(NLF). (24)
By Bayes' formula, the posterior probability of the "insignificant" class given the NLF is therefore
Pr(Insignificant | NLF) = f0(NLF)·Pb0/f(NLF). (25)
Following [29], we further define the local "false discovery rate" (fdr) by
fdr(NLF) = f0(NLF)/f(NLF). (26)
Since in a typical microarray study Pb0 is generally very close to 1 (say Pb0 > 0.99), fdr(NLF) is a fairly precise estimator of the posterior probability of the null hypothesis (the insignificant class) given the statistic NLF. With fdr(NLF), we can decide whether or not a target gene is "significant" at a given confidence level. For all the examples, we report all the genes with fdr smaller than 0.05.
To calculate fdr(NLF), one needs to estimate f(NLF) and to choose f0(NLF) properly. To estimate f(NLF), one can resort to the ensemble of the NLF values, {NLF_j, j = 1, 2, ..., p}. We divide the target range of NLF into M equal-length bins with the center of bin i at x_i, for i = 1, 2, ..., M. A heuristic choice of M is the maximum NLF value multiplied by 10 and rounded up. We then calculate the histogram of the NLF values over these bins and fit a Gaussian smoother. The smoothed output divided by the product of the bin width and the number of genes (i.e., p) gives a proper estimate of f(NLF) at the center of each bin.
The more critical part is the choice of the density of NLF under the null hypothesis, i.e., f0(NLF). The basic assumption we impose here is that the statistic NLF follows a normal distribution under the null hypothesis. Since Pb1 is much smaller than Pb0 (say Pb0 > 0.99) in most real microarray analysis problems, it is safe to choose the standard normal (zero mean, unit variance) as f0(NLF), based on definition (22). Throughout this paper, we always choose the standard normal as the density of NLF under the null hypothesis. (In case Pb0 is not close to 1, e.g., Pb0 < 0.99, more elaborate schemes are needed; an easy approach can be found in [29].) After both f(NLF) and f0(NLF) are obtained, the local fdr for each gene can be calculated by (26). Based on the local fdr, one can select the "significant" class of genes and fix the gene-selection vector at a given confidence level thereafter. A sketch of this computation is given below.
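The following is a sketch of the local-fdr computation described above: the gene frequencies are turned into NLF values (using the standardized log-frequency reading of Eq. (22)), f(NLF) is estimated from a Gaussian-smoothed histogram, f0 is the standard normal, and fdr = f0/f. The smoothing bandwidth and the zero-frequency guard are our own choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.stats import norm

def local_fdr(freq, n_iter, fdr_cut=0.05):
    """Return the NLF values, the per-gene local fdr, and the indices of 'significant' genes."""
    F = np.where(freq == 0, 0.5 / n_iter, freq)          # guard against zero frequencies
    logF = np.log(F)
    nlf = (logF - logF.mean()) / logF.std()              # standardized log-frequency (our reading of Eq. (22))
    M = max(int(np.ceil(nlf.max() * 10)), 10)            # heuristic number of bins
    counts, edges = np.histogram(nlf, bins=M)
    centers = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    f_hat = gaussian_filter1d(counts.astype(float), sigma=1.0) / (width * len(nlf))
    f0 = norm.pdf(centers)                               # null density: standard normal
    fdr_bins = np.clip(f0 / np.maximum(f_hat, 1e-12), 0.0, 1.0)
    idx = np.clip(np.digitize(nlf, edges) - 1, 0, M - 1) # map each gene's NLF to its bin
    fdr = fdr_bins[idx]
    return nlf, fdr, np.where(fdr < fdr_cut)[0]
```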
Prediction Phase
After the "gene selection phase", both the kernel parameter(s) and the gene-selection vector have been fixed. We continue to run the proposed Gibbs sampler (21) and the computational complexity of the Gibbs sampler dramatically decreases to O(n3). After a new proper burn-in period, we can draw samples of z, b and σ2 within each iteration in the "prediction phase". Following (9), the posterior PDF for the output given the testing data in the l-th iteration is Gaussian:
t̃[l] | x̃_γ, X_γ, z[l], b[l], σ²[l] ~ N(f(x̃_γ, X_γ, z[l], b[l], σ²[l]), V(x̃_γ, X_γ, z[l], b[l], σ²[l])) = N(f[l], V[l]),
where f[l] = (z[l] - b[l]·1_n)'(K_γ + σ²[l]I_n)^{-1} k_γ, V[l] = K(x̃_γ, x̃_γ) - k_γ'(K_γ + σ²[l]I_n)^{-1} k_γ,
K_γ,ij = K(x_γi, x_γj), k_γ,i = K(x̃_γ, x_γi), for i, j = 1, ..., n; l = 1, ..., L. (27a)
Then, the predictive probability of the output label ỹ given x̃_γ can be estimated by Monte Carlo integration:
P(ỹ = 1 | X, y, x̃_γ) ≈ (1/L) Σ_{l=1}^{L} Φ((f[l] + b[l])/(V[l] + σ²[l])^{1/2}), (27b)
where Φ(·) denotes the standard normal cumulative distribution function and P(ỹ = -1 | X, y, x̃_γ) = 1 - P(ỹ = 1 | X, y, x̃_γ).
Kernel type competition
Another important issue that needs to be addressed is how to properly select a kernel type. If an independent set of testing samples is available, one approach is to calculate a predictive fit measure such as the misclassification rate (MR) or the average predictive probability (APP) of the true class label. If the number of available testing samples is sufficiently large, this approach is very reliable.
Assuming that there are M testing samples {(x̃_1, ỹ_1), ..., (x̃_M, ỹ_M)}, where x̃_i denotes the microarray data and ỹ_i is its class label for i = 1, 2, ..., M, the MR for the testing set can be estimated by
MR ≈ (1/M) Σ_{i=1}^{M} I(P(ỹ_i | X, y, x̃_i, K) < 0.5), (28a)
where I(·) is the indicator function.
The smaller the MR of a kernel type, the better its general performance should be. If the number of available testing samples is small, the APP of the true class label is a more consistent measure. Throughout this paper, we always use APP to mean the APP of the true class label, defined as
APP = (1/M) Σ_{i=1}^{M} P(ỹ_i | X, y, x̃_i, K). (28b)
In both (28a) and (28b), the probability P(ỹ_i | X, y, x̃_i, K) is evaluated by (27a) and (27b). Obviously, a better model should have a higher APP. The APP usually provides a less biased predictive fit measure when the number of testing samples is limited.
After running the simulations under each candidate kernel type, one can simply choose the kernel type with the smallest MR or with the largest APP on the testing set. However, independent testing samples are not always available. To use the predictive fit approach, one may then resort to a rigorous cross-validation (CV) procedure. Sometimes, a "leave-one-out" cross-validation (LOOCV) is appropriate: one treats one of the training samples as the testing sample, applies the proposed KIGP (including all three phases) to the remaining n-1 samples, and obtains the predictive measure for this sample. This is done for each training sample, and the average of the predictive measures gives a consistent evaluation of the target kernel type.
A less biased approach is to use multiple independent 3-fold CVs. For each round of CV, one first randomly partitions the training set into three sets with a balanced ratio of class labels. Then, for each of the three sets, one treats it as the testing set, applies the KIGP (including all three phases) to the remaining two sets as the training set, and obtains the predictive fit measure for this testing set. After running this procedure for all three sets, one obtains the predictive measure for all available samples in this round. One performs multiple rounds of independent 3-fold CVs (with different random partitions), and the average of the predictive measure over the whole set delivers a nearly unbiased assessment of the given kernel type, as sketched below.
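A sketch of one round of the balanced 3-fold CV described above; scikit-learn's StratifiedKFold is used as a convenience for the class-balanced random partition, and `fit_and_predict` stands in for a full KIGP run (all names are our assumptions).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def three_fold_round(X, y, fit_and_predict, seed=0):
    """fit_and_predict(X_tr, y_tr, X_te, y_te) should return the predictive probability
    of the true label for each test sample; the round's APP is their average."""
    app = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        app[test_idx] = fit_and_predict(X[train_idx], y[train_idx], X[test_idx], y[test_idx])
    return app.mean()

# Averaging three_fold_round over several seeds gives the multi-round estimate described above.
```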
The predictive fit approach through multiple 3-fold CVs works very well. Throughout this study, we always use it to select the proper kernel type for a given problem when an independent testing set is not available. Owing to the nature of MCMC-based methods, however, the KIGP method is extremely computationally intensive, especially when the number of candidate kernel types is large. A more integrative implementation of kernel or model selection, such as one based on a reversible jump MCMC approach, would help further streamline the current KIGP.