Kernel-imbedded Gaussian processes for disease classification using microarray gene expression data
Xin Zhao^{1} and Leo Wang-Kit Cheung^{2,3}
DOI: 10.1186/1471-2105-8-67
© Zhao and Cheung; licensee BioMed Central Ltd. 2007
Received: 16 June 2006
Accepted: 28 February 2007
Published: 28 February 2007
Abstract
Background
Designing appropriate machine learning methods for identifying genes that have significant discriminating power for disease outcomes has become increasingly important for our understanding of diseases at the genomic level. Although many machine learning methods have been developed and applied to microarray gene expression data analysis, the majority are based on linear models, which are not necessarily appropriate for the underlying connection between the target disease and its associated explanatory genes. Linear model based methods are also more prone to bringing in false positive significant features. Furthermore, linear model based algorithms often involve calculating the inverse of a matrix that may be singular when the number of potentially important genes is relatively large, which leads to numerical instability. To overcome these limitations, a few nonlinear methods have recently been introduced to the area. Many of the existing nonlinear methods, however, leave a couple of critical problems, the model selection problem and the model parameter tuning problem, unsolved or even untouched. In general, a unified framework that allows model parameters of both linear and nonlinear models to be easily tuned is always preferred in real-world applications. Kernel-induced learning methods form a class of approaches with promising potential to achieve this goal.
Results
A hierarchical statistical model named kernel-imbedded Gaussian process (KIGP) is developed under a unified Bayesian framework for binary disease classification problems using microarray gene expression data. In particular, based on a probit regression setting, an adaptive algorithm with a cascading structure is designed to find the appropriate kernel, to discover the potentially significant genes, and to make the optimal class prediction accordingly. A Gibbs sampler is built as the core of the algorithm to make Bayesian inferences. Simulation studies showed that, even without any knowledge of the underlying generative model, the KIGP performed very close to the theoretical Bayesian bound not only in the case with a linear Bayesian classifier but also in the case with a very nonlinear Bayesian classifier. This sheds light on its broader usability for microarray data analysis problems, especially those for which linear methods work awkwardly. The KIGP was also applied to four published microarray datasets, and the results showed that it performed better than, or at least as well as, any of the referenced state-of-the-art methods in all of these cases.
Conclusion
Mathematically built on the kernel-induced feature space concept under a Bayesian framework, the KIGP method presented in this paper provides a unified machine learning approach to explore both the linear and the possibly nonlinear underlying relationship between the target features of a given binary disease classification problem and the related explanatory gene expression data. More importantly, it incorporates model parameter tuning into the framework, and the model selection problem is addressed in the form of selecting a proper kernel type. The KIGP method also gives Bayesian probabilistic predictions for disease classification. These properties and features are beneficial to most real-world applications. The algorithm is naturally robust in numerical computation. The simulation studies and the published data studies demonstrated that the proposed KIGP performs satisfactorily and consistently.
Background
DNA microarray technology provides researchers with a high-throughput means to measure expression levels for thousands of genes in a single experiment. Careful analyses of microarray gene expression data can help us better understand human health and disease, with important implications for basic science as well as pharmaceutical and clinical research. Existing methodologies for microarray gene expression data analysis, such as those introduced in [1–3] and [4], have demonstrated their usefulness for a variety of class discovery or class prediction problems in biomedical applications. In a microarray study, we typically face the problem of analyzing thousands of genes from a relatively small number of available samples, which creates a very high likelihood of finding many "false positives" with conventional statistical methods. Properly selecting the group of genes that are significantly related to a target disease is therefore one of the key challenges in microarray data analysis.
The gene selection problem can basically be viewed as a variable selection problem associated with linear regression models. An incomplete list of classical variable selection methods/criteria includes the C_{ p } criterion [5] (the ratio of the error sum of squares for the model with p variables to the error mean square of the full model, adjusted with a penalty for the number of variables), the Akaike Information Criterion or AIC [6], and the Bayesian Information Criterion or BIC [7]. George and Foster [8] later suggested that these criteria correspond to a hierarchical Bayesian variable selection procedure under a particular class of priors. Following a similar setting with a slightly different prior specification, Yuan and Lin [9] provided another approach to this problem and showed that their algorithm was significantly faster and could potentially be used even when the predictor dimension is larger than the training sample size. Although both of these algorithms have been shown to enhance selection performance compared to the classical criteria such as C_{ p }, AIC or BIC, they share a common disadvantage: even after the hyperparameters are estimated, the variable selection criterion must be evaluated on each candidate model for optimality. The number of candidate models usually grows at an exponential rate with the number of variables, whereas the typical number of investigated genes in a microarray data analysis problem is in the thousands. This motivates the development of Markov Chain Monte Carlo (MCMC) algorithms under a Bayesian framework to attack the problem. One of the most widely used MCMC algorithms is the Gibbs sampler. For the microarray analysis problem, Lee et al. [10] suggested a Bayesian model based on a linear probit regression setting and proposed a Gibbs sampler to solve it.
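To make the cost of the classical criteria concrete: scoring every subset of p variables by AIC or BIC means fitting 2^p models, already 1024 models at p = 10 and hopeless at microarray scale. A minimal sketch in Python (the toy data and variable names are ours, not the cited authors' implementations):

```python
import itertools
import numpy as np

def aic_bic(X, y, subset):
    """OLS fit on a column subset; return (AIC, BIC) under Gaussian errors."""
    n = len(y)
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    k = len(subset) + 1                       # parameters incl. intercept
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + np.log(n) * k
    return aic, bic

# Exhaustive search over all subsets of p variables: 2**p candidate models.
rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = 2 * X[:, 3] - X[:, 7] + 0.1 * rng.standard_normal(n)   # true genes: 3 and 7

best = min(
    (s for r in range(p + 1) for s in itertools.combinations(range(p), r)),
    key=lambda s: aic_bic(X, y, s)[1],        # select the subset by BIC
)
print(best)
```

Even this toy search fits over a thousand models; at p in the thousands the enumeration is infeasible, which is exactly what motivates the MCMC approach described above.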
An extension of this method based on a multinomial probit regression setting has also been proposed [11]. Similarly, Zhou et al. ([12, 13]) developed another Bayesian approach, built upon a linear logistic regression model, for the gene selection problem.
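The probit Gibbs samplers cited above rest on the Albert-Chib data-augmentation idea: draw a truncated-normal latent variable for each sample, then draw the regression coefficients from a Gaussian. A minimal sketch of that core loop (our own simplification, with a flat prior, a fixed small design, and no gene selection; it is not the algorithm of [10] or of this paper):

```python
import numpy as np
from scipy.stats import norm

def probit_gibbs(X, y, n_iter=1000, burn=200, seed=0):
    """Minimal Albert-Chib Gibbs sampler for probit regression.
    y is coded +1/-1; flat prior on beta; returns the posterior mean of beta."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta, draws = np.zeros(p), []
    for it in range(n_iter):
        mu = X @ beta
        u = rng.uniform(size=n)
        c = norm.cdf(-mu)                         # P(z_i <= 0 | mu_i)
        # Step 1: latent z_i ~ N(mu_i, 1) truncated to the side given by y_i
        z = mu + np.where(y == 1,
                          norm.ppf(c + u * (1 - c)),   # truncated to z_i > 0
                          norm.ppf(u * c))             # truncated to z_i < 0
        # Step 2: beta | z ~ N((X'X)^-1 X'z, (X'X)^-1)
        beta = rng.multivariate_normal(XtX_inv @ X.T @ z, XtX_inv)
        if it >= burn:
            draws.append(beta)
    return np.mean(draws, axis=0)

# Toy check: recover the sign and rough scale of a known slope coefficient.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.standard_normal(200)])
y = np.where(X @ np.array([0.0, 2.0]) + rng.standard_normal(200) > 0, 1, -1)
beta_hat = probit_gibbs(X, y)
```

The samplers in [10–13] wrap this same augmentation inside additional steps for the gene-selection indicators and hyperparameters.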
The linear model based methods mentioned above have demonstrated various levels of effectiveness in finding sets of significant genes in a wide range of real microarray experiments. However, they all share some common limitations: first, and most importantly, a linear model is not necessarily a good approximation of the underlying physical model; second, linear model based methods are more likely to bring in false positives; third, the computations of these algorithms usually involve calculating the inverse of a matrix that may be singular when the number of potentially important genes is relatively large. To overcome these disadvantages, Zhou et al. [14] introduced a nonlinear term into the basic linear probit regression model and applied a bootstrapping procedure to enlarge the sample size, adopting a sequential Monte Carlo technique for the numerical Bayesian computation. Other models have also been developed for tumor classification with gene expression profiling. For instance, based on the simple nearest centroid classifier and a shrinking strategy, Tibshirani et al. [15] offered the so-called "nearest shrunken centroids" (also known as "Prediction Analysis for Microarrays" or PAM) algorithm. By combining two ensemble schemes, bagging and boosting, Dettling [16] introduced "BagBoosting" as an enhanced version of the regular boosting algorithm. Both of these methods were shown to be very effective when applied to several published datasets.
Kernel-induced machine learning is one of the most promising approaches for exploring the potential nonlinearity of a given classification or regression problem through the feature space concept. For example, kernel-induced support vector machines (SVMs) have been successfully applied to a number of learning tasks and are generally accepted as one of the state-of-the-art learning methods. Theoretically, Lin et al. ([17, 18]) proved that an SVM with an appropriately chosen kernel and model parameters can approach the Bayesian bound of a given problem when the training sample size is large enough. For the gene-selection problem, Guyon et al. [19] proposed "Recursive Feature Elimination" (RFE) to rank the genes with respect to a provided SVM, so that SVMs can be utilized for microarray data analysis. RFE was shown to be very effective with a linear kernel. However, when the number of genes is large (in the hundreds), RFE does not function well with a nonlinear kernel, which limits the applications of SVMs to the analysis of microarray data. Zhu and Hastie ([20, 21]) later proposed a framework called kernel logistic regression and suggested the "Import Vector Machine" to solve it; however, they also chose RFE as the strategy for selecting the significant genes.
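With a linear kernel, RFE amounts to repeatedly training the SVM and discarding the genes whose weights have the smallest magnitude. A hedged sketch using scikit-learn (the toy expression matrix, the informative gene indices, and the shift size are ours):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

# Toy expression matrix: 40 samples x 200 genes; genes 23 and 57 informative.
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 200))
y = np.repeat([1, -1], 20)
X[:20, [23, 57]] += 2.0          # shift the informative genes for class +1
X[20:, [23, 57]] -= 2.0          # and the other way for class -1

# RFE with a linear-kernel SVM: repeatedly drop the 10% of genes with the
# smallest weight magnitude |w_j| until the requested number remains.
rfe = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=2, step=0.1)
rfe.fit(X, y)
print(np.flatnonzero(rfe.support_))   # indices of the surviving genes
```

The ranking relies on the weight vector `coef_`, which only exists for a linear kernel; with a nonlinear kernel there is no per-gene weight to rank by, which is one way to see the limitation noted above.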
As Bayesian probability theory can help construct a unified framework for modeling data and facilitate the tuning of the involved parameters and/or hyperparameters, developing a proper Bayesian probabilistic model is usually beneficial for a machine learning method. MacKay [22] introduced a probabilistic evidence framework as a Bayesian learning paradigm for neural networks. Given the close relationship between neural network methods and kernel-induced learning methods, Kwok [23] and Gestel et al. [24] developed Bayesian frameworks for SVMs and least squares support vector machines (LS-SVMs) respectively, guided by the principle of the evidence framework. Neal [25] also showed that, as the number of hidden units in a Bayesian neural network increases, the prior over the network output converges to a Gaussian process (GP) if independent Gaussian distributions are used as the priors for the network weights and biases. LS-SVMs are conceptually close to SVMs, except that they use equality constraints instead of inequality constraints and a squared error penalty function. Solving an LS-SVM therefore only involves solving a set of linear equations; although this loses the sparseness featured in an SVM, it makes an LS-SVM much easier to implement online. If we consider the similarity between the mapping from input nodes/data to hidden units in a neural network and the mapping from input data to the feature space conceptually embedded in an LS-SVM, it is not surprising that, under the Gaussian noise assumption, the mean of the posterior prediction made by a GP coincides with the optimal decision function of an LS-SVM, while the GP offers a more approachable probabilistic model. This fact motivated us to develop a new Bayesian learning method, named kernel-imbedded Gaussian process (KIGP), for microarray gene expression data analysis based on Gaussian process theory.
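The GP/LS-SVM connection mentioned above can be seen numerically: with a Gaussian kernel matrix K and noise variance σ², the GP posterior predictive mean K_*(K + σ²I)^{-1}y is exactly the bias-free LS-SVM decision function with ridge parameter 1/γ = σ². A small sketch (the toy XOR-style data, kernel width and σ² value are our own choices):

```python
import numpy as np

def gauss_kernel(A, B, width=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

# Four tight XOR-style clusters: a deliberately nonlinear two-gene problem.
X = np.array([[1, 1], [1, 1.2], [-1, -1], [-1, -1.2],    # class +1
              [1, -1], [1.2, -1], [-1, 1], [-1, 1.2]])   # class -1
y = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)

sigma2 = 0.1                 # GP noise variance; plays the role of 1/gamma
K = gauss_kernel(X, X)
alpha = np.linalg.solve(K + sigma2 * np.eye(len(y)), y)

# GP posterior predictive mean at two new points; classify by its sign.
# The same alpha solves the bias-free LS-SVM system, so the decision
# values of the two methods coincide.
Xnew = np.array([[0.9, 0.9], [-0.9, 0.9]])
f = gauss_kernel(Xnew, X) @ alpha
```

The point near the (1, 1) cluster gets a positive predictive mean and the point near the (-1, 1) cluster a negative one, so the sign rule classifies both correctly despite the linear inseparability of the clusters.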
The rest of this paper is organized as follows: the "Results" section shows the results of applying the proposed KIGP method to simulated datasets as well as real published microarray datasets; the conclusions and directions for further research are summarized in the "Discussions and Conclusions" section; and the "Methods" section provides the mathematical content of the methodology, followed by a detailed description of the algorithm.
Results
Some terms and acronyms defined in the "Methods" section are used in this section. They include "gene-selection vector (γ)", "linear kernel (LK)", "polynomial kernel (PK)", "Gaussian kernel (GK)", "Normalized Log-Frequency (NLF)", "false discovery rate (fdr)", "kernel parameter fitting phase", "gene selection phase", "prediction phase", "misclassification rate (MR)", "average predictive probability (APP)", "leave-one-out cross-validation (LOOCV)" and "3-fold cross-validation (3-fold CV)". One can refer to the "Methods" section for the details.
Simulation studies
Example 1
This example was designed to illustrate all the key concepts, elements and procedures of the KIGP framework introduced in the "Methods" section. It consists of two cases: in the first, the Bayesian classifier of the underlying generative model is linear; in the second, the Bayesian classifier takes a very nonlinear form. We set the number of significant/explanatory genes to two so that we can better display graphically the Bayesian classifier and the relative performance of the KIGP method. In both cases, the number of training samples is twenty: ten generated from the class "1" and ten from the class "-1". The number of testing samples is 5000. For each sample, the number of investigated genes is 200, and the indices of the two underlying explanatory genes were preset as [23, 57]. For each case, we independently generated 10 sets of training samples from the generative model and ran the simulation on each of them.
(a) Case with a Linear Bayesian Classifier
In this linear case, the two preset significant genes were generated from the bivariate Gaussian distribution $N(\left[\begin{array}{c}1\\ 1\end{array}\right],\left[\begin{array}{cc}1& 1\\ 1& 2\end{array}\right])$ for the class "1" and from the bivariate Gaussian distribution $N(\left[\begin{array}{c}-1\\ -1\end{array}\right],\left[\begin{array}{cc}1& 1\\ 1& 2\end{array}\right])$ for the class "-1". Each insignificant gene was independently generated from the standard normal distribution N(0,1). The prior probabilities of the class "1" and the class "-1" were equal. With this generative model (two Gaussians sharing a common covariance), the Bayesian classifier for the two classes is a linear combination of the two prescribed significant genes.
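The generative model above can be sketched as follows (a simulation of our own; we assume class means ±(1, 1)ᵀ with the shared covariance [[1, 1], [1, 2]], which is what makes the Bayes classifier linear):

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, n_genes = 10, 200
mean = np.array([1.0, 1.0])
cov = np.array([[1.0, 1.0], [1.0, 2.0]])   # shared covariance (assumed)
sig = [23, 57]                              # indices of the two true genes

def simulate(label):
    """One sample: informative genes from the class Gaussian, rest N(0,1)."""
    x = rng.standard_normal(n_genes)
    x[sig] = rng.multivariate_normal(label * mean, cov)
    return x

X = np.vstack([simulate(+1) for _ in range(n_per_class)]
              + [simulate(-1) for _ in range(n_per_class)])
y = np.repeat([1, -1], n_per_class)
```

With 198 pure-noise genes against 20 samples, this reproduces the small-n, large-p setting in which false positives are the main hazard.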
The KIGP method with a PK, a GK, and an LK was applied to each of the 10 training sets respectively. The prior probability of γ_{ j } = 1 for all j in the Gibbs sampling simulations was set at 0.01. For all the Gibbs sampling simulations in this example, we ran 5000 iterations in both the "kernel parameter fitting phase" and the "gene selection phase" and treated the first 1000 iterations as the burn-in period. In the "prediction phase", we ran 2000 iterations and treated the first 500 iterations as the burn-in period. The threshold for "fdr" in the "gene selection phase" was set at 0.05.
For all of the 10 simulated training sets, when a PK was the kernel type for the KIGP method, the algorithm chose the PK(1) after the "kernel parameter fitting phase" and found both prescribed significant genes at the end of the "gene selection phase" (i.e. with no "false negative"). However, the KIGP with the PK(1) produced one "false positive" gene in 2 of the 10 sets. In the prediction phase, the average testing MR for the 8 sets that correctly found the 2 preset significant genes with no "false positive" was 0.018, very close to the Bayesian bound (i.e. 0.013). The average testing MR for the 2 sets with one "false positive" was significantly worse, at 0.107. The average testing MR for all 10 sets was 0.036.
The results of the simulation studies with an LK were very similar to those with the PK(1). In all the simulations, the KIGP found the 2 preset significant genes (i.e. no "false negative"), but in 2 of the 10 sets the algorithm also produced one "false positive", exactly the same outcome as with the PK(1). The average testing MR over the 10 sets with an LK was 0.037, almost the same as that with a PK.
For the simulation studies with a GK, the algorithm perfectly found the 2 prescribed significant genes and nothing else in 6 of the 10 sets (i.e. no "false negative" and no "false positive"). In 3 other sets, the KIGP identified the 2 prescribed significant genes plus one "false positive". In the remaining set, the KIGP found only one of the two prescribed genes (i.e. one "false negative") together with one "false positive". The mean and the standard deviation of the fitted width of the GK over these 10 simulations were 1.95 and 0.31 respectively. The average testing MR for the 10 simulations with a GK was 0.104. Based on the testing MR measure, we should therefore use the KIGP with either a polynomial kernel or a linear kernel for any further analysis of this problem.
In the majority of the simulations, the KIGPs found the two preset significant genes in this linear case, and they all performed very close to the Bayesian bound whenever the two preset genes were perfectly found. Since the KIGP with the PK(1) gave the best average testing MR, we would use it for any further analysis.
(b) Case with a Nonlinear Bayesian Classifier
For the 10 simulated training sets, when a PK was the kernel type for the KIGP method, the algorithm chose the PK(1) for 5 sets and the PK(2) for the other 5 sets after the "kernel parameter fitting phase". Only in 2 of the 5 sets did the KIGP with the PK(2) perfectly find the two prescribed genes as the only significant genes. The average testing MR over these 10 sets was very poor. However, for the two sets that correctly found the two preset significant genes, the testing MRs were both fairly close to the Bayesian bound.
The results of the simulations with a GK were much better. For all of the 10 sets, the KIGP successfully found the 2 preset significant genes (i.e. no "false negative"), although it also produced one "false positive" in 2 of the sets. The mean and the standard deviation of the fitted width of the GK over these 10 sets were 0.71 and 0.08 respectively. In the "prediction phase", the average testing MR was 0.065 for the 8 sets that correctly found the 2 preset significant genes, very close to the Bayesian bound (i.e. 0.055). The average testing MR was 0.171 for the 2 sets with one "false positive", and 0.086 over all 10 sets.
As an illustration, we depict the results from applying the KIGP to one of the training sets in which both a PK and a GK worked well. The procedure, all simulation settings and the figure legends were the same as described in the linear case. We first applied a KIGP with a GK to the training set. The mode of the posterior PDF of the width parameter was found at around 0.81 after the "kernel parameter fitting phase" (Fig. 2c). With the GK(0.81), a cutoff value of 3.68 for the NLF was obtained at the end of the "gene selection phase". Based on the NLF statistic, the two prescribed genes were successfully retrieved (Fig. 4c) and the KIGP performed well with MR = 0.063 (Fig. 4d), very close to the Bayesian bound. For the simulation with a PK, the posterior probability masses of the degree parameter were Prob(d = 1) = 0.229 and Prob(d = 2) = 0.771. The NLF plot for each gene and the corresponding NLF cutoff line are displayed in Fig. 4e. The two prescribed genes were discovered, and the KIGP with the PK(2) also performed very well with MR = 0.060 (Fig. 4f), again very close to the Bayesian bound.
We tried to apply the regular SVM with RFE to this example as we did in the linear case, but SVM/RFE failed to work with an LK, with a GK (of any width), and with a PK. The key problem is probably the large dimension (i.e. 200) of this example. Comparing the KIGP method to SVM/RFE in this nonlinear case, besides the beneficial properties of the KIGP already observed in the linear case, the KIGP particularly shows its better adaptability to nonlinear problems. In summary, owing to the nonlinear setting of this case, no linear method was applicable, and the regular SVM/RFE approach did not work either. By contrast, in terms of the testing MR measure, the KIGP with a GK performed very close to the Bayesian bound. The KIGP with a PK appears less robust and consistent than the KIGP with a GK for nonlinear problems in general.
As a side note, it is worth pointing out that the posterior PDF of the width parameter seems to disclose something about the nature of a dataset when one applies the KIGP with a GK. For instance, we observed that if the underlying Bayesian classifier can be well approximated by a linear function, the mode (peak) of the PDF of the width parameter moves significantly to the right of the value 1 (Fig. 2a); whereas if the Bayesian classifier is very nonlinear, it moves to the left of the value 1 (Fig. 2c).
Example 2
We designed this example to further demonstrate the effectiveness of the proposed KIGP method when the number of investigated genes is large, especially for a problem with a very nonlinear Bayesian classifier. A total of 1000 genes were simulated in a microarray experiment, 10 of which were preset as the significant genes with indices [64, 237, 243, 449, 512, 573, 783, 818, 890, 961]. These 10 significant genes were generated from the mixture Gaussian distribution with equal probability on N(1_{10}, I_{10}·0.1) and N(-1_{10}, I_{10}·0.1) for the class "1", and from the Gaussian distribution N(0_{10}, I_{10}·0.1) for the class "-1", where 1_{10} and 0_{10} denote vectors of 10 "1" and 10 "0" elements respectively and I_{10} is the 10 × 10 identity matrix. The prior probabilities of the two classes were equal. The remaining insignificant genes were independently generated from the standard normal distribution N(0,1). As in the first example, the number of training samples is 20, 10 generated from the class "1" and the other 10 from the class "-1"; the number of testing samples is 5000; and we independently generated 10 sets of training samples from the model and ran the simulation on each of them.
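Under this mixture model each significant gene has mean zero in both classes, so no single gene, and no linear combination of genes, separates the classes; only the spread differs. A sketch of the generator (our own code, assuming the class "1" mixture centers are ±1_{10}):

```python
import numpy as np

rng = np.random.default_rng(0)
sig = [64, 237, 243, 449, 512, 573, 783, 818, 890, 961]  # true gene indices
n_genes, k = 1000, len(sig)

def sample(label):
    x = rng.standard_normal(n_genes)          # insignificant genes ~ N(0,1)
    if label == 1:                            # mixture of N(+1, 0.1I), N(-1, 0.1I)
        center = rng.choice([1.0, -1.0]) * np.ones(k)
    else:                                     # class "-1": N(0, 0.1I)
        center = np.zeros(k)
    x[sig] = center + np.sqrt(0.1) * rng.standard_normal(k)
    return x

X = np.vstack([sample(1) for _ in range(10)] + [sample(-1) for _ in range(10)])
y = np.repeat([1, -1], 10)
```

Because class "1" samples sit near ±1 on the significant genes while class "-1" samples sit near 0, a Gaussian-kernel classifier can separate them, whereas any linear decision function sees two classes with identical means.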
The procedure for this example is the same as in the nonlinear case of the first example. The prior probability of γ_{ j } = 1 was set at 0.01. For both the "kernel parameter fitting phase" and the "gene selection phase", we ran 20000 iterations and treated the first 10000 as the burn-in period; for the "prediction phase", we ran 5000 iterations and treated the first 1000 as the burn-in period.
For the 10 simulated training sets, when a PK was the kernel type for the KIGP method, the algorithm chose the PK(2) in 7 out of the 10 sets. Only in 2 of these 7 sets did the algorithm find all 10 significant genes. With a GK, in contrast, all 10 prescribed genes were found in every one of the 10 sets, with one "false positive" brought into the significant group in a single set. The testing error was almost zero, extremely close to the Bayesian bound.
Real data studies
Table 1. Summary of the real datasets studied in this paper

Dataset  Publication  p (genes)  n (training)  M (testing)  Response

Leukemia  Golub et al. (1999) [1]  7129  38  34  ALL/AML
SRBCT  Khan et al. (2001) [27]  2308  35  12  EWS/NB
Breast Cancer  Hedenfalk et al. (2001) [28]  3226  22  0  BRCA1/BRCA2 or sporadic
Colon  Alon et al. (1999) [30]  2000  62  0  Tumor/Normal tissue
Acute leukemia data
The leukemia dataset was originally published by Golub et al. [1], in which bone marrow or peripheral blood samples were taken from 72 patients with either acute myeloid leukemia (AML) or acute lymphoblastic leukemia (ALL). The data was divided into two independent sets: a training set and a testing set. The training set consists of 38 samples, of which 27 are ALL and 11 are AML. The testing set consists of 34 samples, of which 20 are ALL and 14 are AML. This dataset contains expression levels for 7129 human genes produced by Affymetrix high-density oligonucleotide microarrays. The scores in the dataset represent the intensity of gene expression after rescaling. By using a weighted voting scheme, Golub et al. made predictions for all 34 testing samples, 5 of which were reported as misclassified.
The KIGP with a GK, a PK, and an LK was applied to the training dataset (including all investigated genes) respectively. The prior parameter π_{ j } for all j was uniformly set at 0.001. In both the "kernel parameter fitting phase" and the "gene selection phase", we ran 30000 iterations and treated the first 15000 iterations as the burn-in period; in the "prediction phase", we ran 5000 iterations and treated the first 1000 iterations as the burn-in period.
Table 2. Summary of the results from applying the proposed KIGP to the leukemia dataset

Performance Measure  Test  CV (3-fold)  LOOCV (fixed genes)  

PK  GK  LK  PK  GK  LK  PK  GK  LK  
ERR #  2/34  1/34  1/34  2/38  1/38  1/38  0/38  0/38  0/38 
APP  0.858  0.835  0.923  0.844  0.819  0.875  0.995  0.928  0.993 
Table 3. Summary of the genes found by applying the KIGP with the LK to the leukemia dataset
Index  NLF  Accession #  Gene Description 

4847  11.47  X95735  Zyxin 
3320  10.36  U50136  Leukotriene C4 synthase (LTC4S) gene 
2020  9.79  M55150  FAH Fumarylacetoacetate 
5039  9.63  Y12670  LEPR Leptin receptor 
1834  9.22  M23197  CD33 CD33 antigen (differentiation antigen) 
4499*  6.79  X70297  CHRNA7 Cholinergic receptor, nicotinic, alpha polypeptide 7 
1745  6.46  M16038  LYN V-yes-1 Yamaguchi sarcoma viral related oncogene homolog 
3847  5.32  U82759  GB DEF = Homeodomain protein HoxA9 mRNA 
4196  5.21  X17042  PRG1 Proteoglycan 1, secretory granule 
1779*  5.08  M19507  MPO Myeloperoxidase 
6539  4.98  X85116  Epb72 gene exon 1 
6376  4.80  M83652  PFC Properdin P factor, complement 
3258  4.73  U46751  Phosphotyrosine independent ligand p62 for the Lck SH2 domain mRNA 
2111  4.64  M62762  ATP6C Vacuolar H+ ATPase proton channel subunit 
1882  4.64  M27891  CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage) 
1829*  4.59  M22960  PPGB Protective protein for beta-galactosidase (galactosialidosis) 
1249  4.49  L08246  INDUCED MYELOID LEUKEMIA CELL DIFFERENTIATION PROTEIN MCL1 
2121  4.41  M63138  CTSD Cathepsin D (lysosomal aspartyl protease) 
2288  4.28  M84526  DF D component of complement (adipsin) 
1924*  4.28  M31158  PRKAR2B Protein kinase, cAMP-dependent, regulatory, type II, beta 
In Table 2, the KIGP with an LK gave the best testing performance: only 1 error. Many publications (e.g. [10, 12] and [21]) reported the same testing error for this dataset. Only Zhou et al. [14] reported 0 testing errors. However, based on the results of [14], their testing APP was only 0.83, much worse than that of the KIGP with an LK (testing APP = 0.923). We suspect that the one testing sample misclassified by the KIGP/LK may have been phenotyped incorrectly.
Small round bluecell tumor (SRBCT) data
The SRBCT data was originally published by Khan et al. [27]. The tumor types include the Ewing family of tumors (EWS), rhabdomyosarcoma (RMS), neuroblastoma (NB) and non-Hodgkin lymphoma (NHL). The dataset of the four tumor types is composed of 2308 genes and 63 samples, with 25 blinded testing samples also available. In this study, we focused on only two classes, EWS and NB. Thus, there are only 35 training samples (23 EWS and 12 NB) and 12 testing samples (6 EWS and 6 NB).
Table 4. Summary of the results from applying the proposed KIGP to the SRBCT dataset

Performance Measure  Test  CV (3-fold)  LOOCV (fixed genes)  

ANN  PK  GK  LK  PK  GK  LK  PK  GK  LK  
ERR #  0/12  0/12  0/12  0/12  0/35  2/35  0/35  0/35  0/35  0/35 
APP  0.923  0.945  0.781  0.865  0.875  0.794  0.823  0.998  0.909  0.997 
Table 5. Summary of the genes found by applying the KIGP with the PK(1) to the SRBCT dataset
Index  NLF  Image ID  Gene Description 

255  11.36  325182  cadherin 2, N-cadherin (neuronal) 
976*  10.88  786084  chromobox homolog 1 (Drosophila HP1 beta) 
1389  10.19  770394  Fc fragment of IgG, receptor, transporter, alpha 
742  9.28  812105  transmembrane protein 
2144  8.12  308231  Homo sapiens incomplete cDNA for a mutated allele of a myosin class I, myh1c 
823*  7.53  134748  glycine cleavage system protein H (aminomethyl carrier) 
2050  6.61  295985  ESTs 
842*  6.08  810057  cold shock domain protein A 
545  5.27  1435862  antigen identified by monoclonal antibodies 12E7, F21 and O13 
867  5.22  784593  ESTs 
481  5.15  825411  N-acetylglucosamine receptor 1 (thyroid) 
1662  4.82  377048  Homo sapiens incomplete cDNA for a mutated allele of a myosin class I, myh1c 
1601  4.81  629896  microtubule-associated protein 1B 
437*  4.42  448386  
1700*  4.20  796475  ESTs, Moderately similar to skeletal muscle LIM-protein FHL3 [H. sapiens] 
Similar to the leukemia data case, the APP of the rigorous 3-fold CV is very consistent with that of the independent testing, while the "loose" LOOCV is rather biased. We also found that the KIGP with the PK(1) outperformed the Artificial Neural Network (ANN, [27]) method in terms of APP (both methods gave 0 testing errors).
Breast cancer data
The hereditary breast cancer data used in this example was published by Hedenfalk et al. [28], in which cDNA microarrays were used in conjunction with classification algorithms to show the feasibility of using differences in global gene expression profiles to separate BRCA1 from BRCA2/sporadic tumors. Twenty-two breast cancer tumors were examined: 7 with BRCA1, 8 with BRCA2 and 7 considered sporadic. 3226 genes were investigated for each sample. We labeled the samples with BRCA1 as the class "1" and the others as the class "-1".
Table 6. Summary of the results from applying the KIGP to the breast cancer dataset

Performance Measure  CV (3-fold)  CV (3-fold, fixed genes)  LOOCV (fixed genes)  

PK  GK  LK  PK  GK  LK  PK  GK  LK  
ERR #  4/22  3/22  5/22  0/22  0/22  0/22  0/22  0/22  0/22 
APP  0.685  0.739  0.662  0.855  0.878  0.929  0.903  0.889  0.995 
Table 7. Summary of the genes found by applying the KIGP with the GK(3.19) to the breast cancer dataset
Index  NLF  Clone ID  Gene Description 

1999  4.44  247818  ESTs 
2734  4.21  46019  minichromosome maintenance deficient (S. cerevisiae) 7 
1851*  4.06  293977  ESTs 
585  3.89  293104  phytanoyl-CoA hydroxylase (Refsum disease) 
2423  3.85  26082  very low density lipoprotein receptor 
1443  3.85  566887  chromobox homolog 3 (Drosophila HP1 gamma) 
2893*  3.81  32790  mutS (E. coli) homolog 2 (colon cancer, non-polyposis type 1) 
1068  3.81  840702  SELENOPHOSPHATE SYNTHETASE; Human selenium donor protein 
1008  3.81  897781  keratin 8 
It is not surprising that the general performance of the KIGP with an LK or the PK(1) was poor, since there is an unusual local peak on the left side of the posterior PDF of the width parameter r (Fig. 9c). Such a local peak usually implies the existence of nonlinearity in the data for the given problem. A fairly logical reason for this phenomenon can be found in [29], in which Efron showed that the empirical null of this dataset was significantly different from its theoretical null based on a large-scale simultaneous two-sample t-test, and argued that this was probably because the experimental methodology used in the original paper had induced substantial correlations among the various microarrays.
This example is also a good case to show the "gene-selection bias" problem. In Table 6, with the significant genes selected by training on all the available samples, the performance of the KIGP with an LK under the "loose" 3-fold CV was much better than that of the KIGP with a GK. However, under the rigorous 3-fold CV, the KIGP with an LK gave very poor predictive performance, while the KIGP with a GK still worked reasonably well.
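The "gene-selection bias" at work here is easy to reproduce on synthetic data: on pure noise, selecting genes on all samples before cross-validation yields an optimistic error estimate, while selecting inside each training fold stays near chance. A self-contained sketch (our own toy selector and nearest-centroid classifier, not the KIGP):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, keep = 30, 2000, 10
X = rng.standard_normal((n, p))            # pure noise: no real signal at all
y = np.repeat([1, -1], n // 2)

def top_genes(Xtr, ytr):
    """Rank genes by absolute two-class mean difference; keep the best."""
    diff = Xtr[ytr == 1].mean(0) - Xtr[ytr == -1].mean(0)
    return np.argsort(-np.abs(diff))[:keep]

def centroid_predict(Xtr, ytr, Xte):
    c1, c0 = Xtr[ytr == 1].mean(0), Xtr[ytr == -1].mean(0)
    d = ((Xte - c1) ** 2).sum(1) - ((Xte - c0) ** 2).sum(1)
    return np.where(d < 0, 1, -1)

def cv_error(select_inside):
    folds = np.arange(n) % 3               # interleaved 3-fold split
    errs = []
    for f in range(3):
        tr, te = folds != f, folds == f
        # "loose": genes chosen on ALL samples, test fold included
        g = top_genes(X[tr], y[tr]) if select_inside else top_genes(X, y)
        pred = centroid_predict(X[tr][:, g], y[tr], X[te][:, g])
        errs.append((pred != y[te]).mean())
    return float(np.mean(errs))

loose, rigorous = cv_error(False), cv_error(True)
# On noise, rigorous CV hovers near chance (0.5); loose CV looks far better.
```

The gap between the two numbers is exactly the bias being discussed: the "loose" estimate has already peeked at the test fold through the gene-selection step.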
Colon data
This dataset was originally published by Alon et al. [30], and Dettling [16] has reported the performance of many state-of-the-art learning methods applied to it. We applied the KIGP method to this dataset to allow a more side-by-side performance comparison with these methods. Dettling [16] pre-filtered genes based on the Wilcoxon test statistic and ran all simulations within a 200-gene pool; based on the reported procedure, however, this should not introduce much gene-selection bias. The dataset therefore forms a good benchmark for comparing different microarray data analysis methods.
Summary of the performance comparison on applying different classifiers to the colon dataset

Classifier | MR
KIGP(PK) | 0.166
KIGP(GK) | 0.129
KIGP(LK) | 0.198
BagBoost | 0.161
Boosting | 0.191
RF | 0.149
SVM | 0.151
PAM | 0.119
DLDA | 0.129
kNN | 0.164
Summary of the genes found by applying the KIGP with GK(2.38) to the colon dataset

Index | NLF
377 | 6.54
493 | 6.25
249 | 5.48
267 | 4.51
245 | 4.34
765 | 4.29
513 | 3.94
14 | 3.88
Another interesting finding of this experiment is that, based on the results of the "loose" CV, the KIGP/LK performed better than the KIGP/GK on this dataset. However, with multiple rigorous 3-fold CVs, the KIGP/GK turned out to be the more reliable kernel type for this problem. When we examined the heat map of the significant gene set identified by the KIGP/GK (Table 9), we found that a few samples, notably #18, #20, #45, #49 and #56, differ significantly from the other samples in their labeled class, yet are very consistent with the samples of the opposite class. In fact, these samples were almost always misclassified by the KIGP in the multiple rigorous 3-fold CV tests. We therefore suspect that these samples were mistakenly phenotyped, and that this is probably why none of the other learning methods referenced in Table 8 performs well on this colon dataset. It also supports the observation that a KIGP/GK is less sensitive to mislabeled training samples than a KIGP/LK.
Discussions and Conclusion
This work was motivated by the data analysis challenges posed by microarray gene expression experiments and by the mathematical appeal of the kernel-imbedding approach, namely its ability to solve a nonlinear classification problem in the feature space rather than in the observation space. We have presented a unified supervised learning model named kernel-imbedded Gaussian process (KIGP) under a hierarchical Bayesian framework, designed specifically for automatic learning and profiling of microarray gene expression patterns. In the simulated examples, without any knowledge of the underlying generative model, the proposed KIGP method performed very close to the Bayesian bound not only in the linear case but also in the nonlinear case.
With a probit regression setting and the introduction of latent variables, the KIGP model was formulated for a binary disease classification problem. An algorithm with a cascading structure was proposed to solve it, with a Gibbs sampler built as the mechanical core for the Bayesian inferences. Given a kernel type, such as a Gaussian kernel or a polynomial kernel, and the training data as input, the algorithm outputs the fitted parameter(s) of the kernel type and a set of significant genes; it also offers a probabilistic class prediction for each sample. The proposed KIGP can explore not only the linear but also the potentially nonlinear relationship between the target disease and its associated explanatory genes. Compared to the regular SVM (a very popular kernel-induced learning method), the proposed KIGP has two advantages. First, the probabilistic class prediction from the KIGP can be insightful for borderline cases in real-world applications. Second, the KIGP method implements a specific procedure for tuning the kernel parameter(s) (such as the width of a Gaussian kernel or the degree of a polynomial kernel) and the model parameters (such as the variance of the noise term); tuning parameters has always been a key issue for nonlinear parametric learning methods. The results of the simulated examples show that the KIGP significantly outperformed the regular SVM with RFE as a gene-selection strategy in a nonlinear case, while also providing more useful information, such as the posterior PDFs of the parameters, for further prediction and analysis. Computationally, the KIGP also proved to be robust, making it very amenable to a Gibbs sampling system. Both the simulated examples and the real data studies have demonstrated the effectiveness of the proposed method.
There are still a few interesting problems left for future research. For example, although the KIGP in this study is developed only for binary classification, it can easily be extended to multi-class classification based on a multinomial probit regression setting. Other problems are not only challenging but also critical. First, the kernel type competition problem remains a tough issue. The predictive fit measure method discussed in the "Methods" section is simple to formulate, but it may be problematic when no independent testing set is available and/or there are many candidate kernel types. We are currently addressing this by implementing a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm as a simultaneous, integrative approach to kernel type selection within the KIGP framework. Another important problem concerns the independent prior assumption on the elements of the gene-selection vector γ and the component-wise strategy used to sample it. Although this will eventually converge by MCMC theory, convergence may take a very long time if the true underlying explanatory genes are highly correlated with each other. A proper kernel-induced clustering algorithm under a suitable generative model would therefore be helpful in this regard. Furthermore, if a more appropriate prior for γ can be found, the dependency between genes can be incorporated into the framework simply by sampling γ block-wise rather than component-wise, which would dramatically speed up convergence.
Interestingly, building a kernel based on the features of the given data and the classification problem is the ideal way to take full advantage of a kernel-induced learning algorithm. For example, if an appropriate generative model is available for the given dataset, a class of kernels named "natural kernels" is applicable in this context. This problem and the pre-clustering problem mentioned above seemingly share many fundamental elements; however, further investigation is beyond the scope of this paper.
Methods
Problem formulation
We consider a binary classification problem. Suppose there are n training samples and let y = [y_{1}, y_{2},..., y_{ n }]' denote the class labels, where y_{ i }= 1 indicates that sample i is in class "I" and y_{ i }= -1 indicates that it is in the other class (i.e. not class "I"), for i = 1, 2,..., n. For each sample, there are p genes being investigated, and we define the gene expression matrix X as the n × p matrix whose (i, j)th entry is the expression level of the j th gene in the i th sample.
The data matrix X should usually be normalized for each gene (each column of X). In order to handle the gene selection problem, we further define the gene-selection vector γ as the p × 1 binary vector with γ_{ j }= 1 if the j th gene is selected and γ_{ j }= 0 otherwise, for j = 1, 2,..., p.
X_{ γ }is defined as the gene expression matrix corresponding to the genes selected according to the gene-selection vector γ; that is, the j th column of X_{ γ }is the i th column of the matrix X, where i is the index of the j th nonzero element of γ. In formula (3), q genes are selected out of the total p genes, with q ≪ p in a typical gene selection problem. Formulating the problem in a regression setting, we introduce n latent variables z_{1}, z_{2},..., z_{ n }, such that
z_{ i }= g(x_{ γi }) + b + e_{ i }= t_{ i }+ b + e_{ i }, and y_{ i }= 1 if z_{ i }> 0, y_{ i }= -1 otherwise, (4)
where x_{ γi }denotes the i th row of the matrix X_{ γ }; e_{ i }represents the independent noise term, assumed to be Gaussian distributed with zero mean and variance σ^{2}; b is the intercept term; and the link function g(·) is assumed to be chosen from a class of real-valued functions whose output is a Gaussian process. In vector form, we define z = [z_{1}, z_{2},..., z_{ n }]', t = [t_{1}, t_{2},..., t_{ n }]' and e = [e_{1}, e_{2},..., e_{ n }]'. Note that if g(·) is restricted to a linear function and σ^{2} is fixed at 1, model (4) is very similar to a linear probit regression setting.
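To make the latent-variable construction of model (4) concrete, the following minimal sketch generates labels from the probit mechanism. It is illustrative only: the GP output t is replaced by a plain normal draw (a real KIGP would draw t from the GP prior of Eq. (5)), and all numbers are our own choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 8
t = rng.normal(size=n)             # stand-in for t_i = g(x_gamma_i)
b = 0.3                            # intercept term
e = rng.normal(scale=1.0, size=n)  # e_i ~ N(0, sigma^2) with sigma^2 = 1
z = t + b + e                      # latent variables of model (4)
y = np.where(z > 0, 1, -1)         # observed class labels
```

The sign of the latent z_i fully determines the label y_i, which is exactly what the truncated-normal updates of Eq. (16) later exploit.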
KernelImbedded Gaussian Processes (KIGPs)
In general, a continuous stochastic process is a collection of random variables, each of which takes real values according to a probability distribution. If we consider the outputs of a learning function g(·), where g is chosen according to some distribution D defined over a class of real-valued functions, then the collection of these outputs is also a stochastic process, and the distribution D represents our prior belief over the learning functions.
A Gaussian process is a continuous stochastic process such that the marginal distribution of any finite subset of its outputs is a zero-mean Gaussian distribution. In this paper, as defined in formula (4), t_{ i }= g(x_{ γi }), where x_{ γi }= [x_{γ,i 1}, x_{γ,i 2},..., x_{ γ,iq }], i = 1,2,..., n; and we assume
P_{g~D}([g(x_{ γ1 }), g(x_{ γ2 }),..., g(x_{ γn })] = [t_{1}, t_{2},..., t_{ n }]) ∝ $\mathrm{exp}(-\frac{1}{2}t\text{'}{K}^{-1}t)$, where
K_{ ij }= K(x_{ γi }, x_{ γj }), i, j = 1,2,..., n. (5)
In (5), K(x_{ γi }, x_{ γj }) is a function defined in the observation space; it conceptually represents the inner product of the sample vectors x_{ γi }and x_{ γj }in the feature space, ⟨Ψ(x_{ γi }), Ψ(x_{ γj })⟩, where Ψ(·) is the mapping from the observation space to the feature space. K is a kernel matrix, called the Mercer kernel. Formula (5) expresses our prior belief about the learning model, and the kernel function K(·,·) uniquely determines the properties of the learning functions. Some of the most commonly used kernel functions include:
Linear kernel: K(x_{ γ i }, x_{ γ j }) = ⟨x_{ γ i }, x_{ γ j }⟩ (6a)
Polynomial kernel: K(x_{ γ i }, x_{ γ j }) = (⟨x_{ γ i }, x_{ γ j }⟩ + 1)^{ d }, where d = 1,2,.. is the degree parameter. (6b)
Gaussian kernel: K(x_{ γ i }, x_{ γ j }) = $\mathrm{exp}(-\frac{{\Vert {x}_{\gamma i}-{x}_{\gamma j}\Vert}^{2}}{2{r}^{2}})$, where r > 0 is the width parameter. (6c)
In (6a) and (6b), the term ⟨x_{ γ i }, x_{ γ j }⟩ represents the inner product of the vectors x_{ γi }and x_{ γj }. When one uses the linear kernel, the feature space is the same as the observation space. In this paper, we refer to the linear kernel as the LK, the polynomial kernel with degree "d" as the PK(d), and the Gaussian kernel with width "r" as the GK(r). We primarily focus on the KIGP method with the Gaussian kernel and the polynomial kernel, and discuss them in parallel.
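As a concrete reference, the kernels (6a)-(6c) and the kernel matrix of Eq. (5) can be sketched as follows; the function names are ours, not from the paper.

```python
import numpy as np

def linear_kernel(xi, xj):
    # LK, Eq. (6a): plain inner product in the observation space
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, d=2):
    # PK(d), Eq. (6b): (<xi, xj> + 1)^d
    return (np.dot(xi, xj) + 1.0) ** d

def gaussian_kernel(xi, xj, r=1.0):
    # GK(r), Eq. (6c): exp(-||xi - xj||^2 / (2 r^2))
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * r ** 2))

def kernel_matrix(X_gamma, kfun, **kw):
    # Builds the n x n Mercer kernel matrix K of Eq. (5)
    # from the rows of the selected-gene matrix X_gamma.
    n = X_gamma.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kfun(X_gamma[i], X_gamma[j], **kw)
    return K
```

For instance, `kernel_matrix(X_gamma, gaussian_kernel, r=2.38)` would reproduce the GK(2.38) setting used for the colon dataset.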
In model (4), we have the latent vector z = t + e + b 1_{ n }, where e ~ N(0, σ^{2}I_{ n }), I_{ n }denotes the n × n identity matrix, and 1_{ n }denotes the n × 1 vector with all elements equal to 1; N(·,·) denotes the multivariate normal distribution. Hence,
P(z|t) ∝ exp($-\frac{1}{2}$(z - t - b 1_{ n })'Ω^{-1}(z - t - b 1_{ n })), where Ω = σ^{2}I_{ n }. (7)
With the Bayes rule, we have
where $\tilde{x}$_{ γ }is the new predictor associated with the given gene-selection vector γ and $\tilde{t}$ is the posterior output (without the intercept b) with respect to $\tilde{x}$_{ γ }, given the matrix X_{ γ }and the latent output z. With a kernel such as defined by (6), and assuming the intercept b and the noise variance σ^{2} are both given, plugging (5) and (7) into (8) and integrating out t, the marginal distribution of $\tilde{t}$ given $\tilde{x}$_{ γ }, X_{ γ }and z is Gaussian, as follows [26]:
f($\tilde{x}$_{ γ }, X_{ γ }, z) = (z - b 1_{ n })' (K_{ γ }+ σ^{2}I)^{-1} k_{ γ }, V($\tilde{x}$_{ γ }, X_{ γ }, z) = K($\tilde{x}$_{ γ }, $\tilde{x}$_{ γ }) - k_{ γ }'(K_{ γ }+ σ^{2}I)^{-1}k_{ γ },
K_{ γ,ij }= K(x_{ γ i }, x_{ γ j }), k_{ γi }= K($\tilde{x}$_{ γ }, x_{ γi }), i, j = 1,2,..., n. (9)
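The predictive mean and variance of Eq. (9) can be computed via a Cholesky factorization, which is numerically safe because K_γ + σ²I is symmetric positive definite. The sketch below is our own; the argument names are hypothetical.

```python
import numpy as np

def gp_predict(K, k_star, k_ss, z, b, sigma2):
    """Predictive mean/variance of t~ per Eq. (9).

    K:      n x n kernel matrix K_gamma on the training rows of X_gamma
    k_star: length-n vector with entries K(x~, x_gamma_i)
    k_ss:   scalar K(x~, x~)
    """
    n = K.shape[0]
    A = K + sigma2 * np.eye(n)   # K_gamma + sigma^2 I, symmetric PD
    L = np.linalg.cholesky(A)
    # A^{-1}(z - b 1_n) and L^{-1} k_star via triangular solves
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, z - b))
    v = np.linalg.solve(L, k_star)
    mean = k_star @ alpha        # (z - b 1_n)' A^{-1} k_gamma
    var = k_ss - v @ v           # K(x~, x~) - k' A^{-1} k
    return mean, var
```

Using triangular solves instead of forming the explicit inverse keeps the computation stable even when K_γ is nearly singular, which matches the robustness claim made for the Gibbs sampler below.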
Supervised Microarray Data Analysis using KIGP
Prior specification
(1) γ_{ j }is assumed to be a priori independent for all j, and
Pr(γ_{ j }= 1) = π_{ j }, for j = 1, 2,..., p, (10)
where the prior probability π_{ j }reflects prior knowledge of the importance of the j th gene.
(2) A noninformative prior is applied for the intercept b:
P(b) ∝ 1. (11a)
This is not a proper probability distribution function (PDF), but it leads to a proper posterior PDF.
(3) The inverted gamma (IG) distribution is applied as the prior for the variance of noise σ^{2}. Specifically, we assume:
P(σ^{2}) ~ IG(1,1) (11b)
(4) For the width of a Gaussian kernel (i.e. a scaling parameter), an inverted gamma distribution is also a reasonable choice of prior. To preclude values of r that are too small or too large (which would make the system numerically unstable), we apply IG(1,1) as the prior for r^{2}, that is
P(r^{2}) ~ IG(1,1) (11c)
(5) For the degree of a polynomial kernel, we assume a uniform distribution. In this paper, we only consider the PK(1) and the PK(2) to avoid the issue of overfitting for most practical cases. Therefore, we have P(d = 1) = P(d = 2) = 0.5.
(6) We assume that γ and b are a priori independent from each other, that is P(γ, b) = P(γ)P(b).
Bayesian inferences for model parameters
Based on model (4), the label y depends only on z; therefore, all other model parameters are conditionally independent of y given z. For convenience, we drop the notation of the training set X in the following derivation, and drop y as well when z is given. We also assume the kernel type is given and denote the associated kernel parameter by θ.
(I) Sampling from γ | z, b, σ^{2}, θ
Here, we drop the notation of the given parameters b, σ^{2} and θ. With the model described in (2), (5) and (7), we have
K_{γ,ij}= K(x_{ γi }, x_{ γj }), i, j = 1,2,..., n; Ω = σ^{2}I_{ n }; I_{ n }and 1_{ n }are defined in (7). (12)
The detailed derivation for (12) is provided in the Appendix. After inserting the prior given by (10), we have
In practice, rather than sampling γ as a vector, we sample it component-wise from
In both (13) and (14), K_{ γ }is defined in (12).
(II) Sampling from t | γ, b, z, σ^{2}, θ
As shown by Eq. (A6) in the Appendix, the conditional distribution P(t|z, b) is Gaussian:
t|z, b ~ N((I_{ n }- Ω(Ω + K_{ γ })^{-1})(z - b 1_{ n }), Ω - Ω(Ω + K_{ γ })^{-1}Ω),
where K_{ γ }and Ω are defined in Eq. (12). (15)
We thus can draw t given z accordingly.
(III) Sampling from z | t, b, σ^{2}, y
Given the class label vector y, the conditional distribution of z given t is a truncated Gaussian distribution, and we have the following formula for i = 1,2,..., n:
z_{ i }| t_{ i }, b, σ^{2}, y_{ i }= 1 ∝ N(t_{ i }+ b, σ^{2}) truncated at the left by 0,
z_{ i }| t_{ i }, b, σ^{2}, y_{ i }= -1 ∝ N(t_{ i }+ b, σ^{2}) truncated at the right by 0. (16)
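The truncated-normal draw of Eq. (16) can be sketched as follows. The paper uses Robert's optimal exponential accept-reject scheme [32]; here a plain rejection sampler stands in, which is adequate when the truncation point is not deep in the tail. Function names are ours.

```python
import numpy as np

def sample_truncated_normal(mean, sd, lower=None, upper=None, rng=None):
    """Draw one value from N(mean, sd^2) restricted to (lower, upper)
    by naive rejection (a stand-in for Robert's accept-reject scheme)."""
    rng = np.random.default_rng() if rng is None else rng
    while True:
        x = rng.normal(mean, sd)
        if (lower is None or x > lower) and (upper is None or x < upper):
            return x

def sample_z(t, b, sigma2, y, rng=None):
    """Sample the latent z_i given t_i, b, sigma^2 and label y_i, Eq. (16)."""
    rng = np.random.default_rng() if rng is None else rng
    sd = np.sqrt(sigma2)
    z = np.empty(len(t))
    for i, (ti, yi) in enumerate(zip(t, y)):
        if yi == 1:    # truncated at the left by 0
            z[i] = sample_truncated_normal(ti + b, sd, lower=0.0, rng=rng)
        else:          # y_i = -1: truncated at the right by 0
            z[i] = sample_truncated_normal(ti + b, sd, upper=0.0, rng=rng)
    return z
```

In production one would replace the naive rejection loop with Robert's exponential proposal, which stays efficient even when t_i + b lies far on the wrong side of zero.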
(IV) Sampling from b | z, t, σ^{2}
When z and t are both given, this is a simple ordinary linear regression setting with only an intercept term. Under the noninformative prior assumption given by (11a), it yields
b | z, t, σ^{2} ~ N(μ, σ^{2}/n), where $\mu =\frac{1}{n}{\displaystyle \sum _{i=1}^{n}({z}_{i}-{t}_{i})}$. (17a)
(V) Sampling from σ^{2} | z, t, b
With IG(α, β), α > 0, β > 0, as the prior, the conditional posterior distribution for σ^{2} is also an inverted gamma distribution. That is
σ^{2} | z, t, b ~ IG(α + n/2, β + ns^{2}/2), where ${s}^{2}=\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{({z}_{i}-{t}_{i}-b)}^{2}}$. (17b)
Kernel parameters tuning
One of the major advantages of kernelinduced learning methods is that one can explore the nonlinearity feature of the underlying model for a given learning problem by applying different kernels. It is therefore necessary to discuss the issue of kernel parameter tuning. With the KIGP framework constructed above, this turns out to be rather straightforward.
As in the last section, we denote the kernel parameter(s) by θ, which can be either a scalar (e.g. the width parameter of a GK or the degree parameter of a PK) or a vector. For algorithmic convenience, we work with the logarithm of the conditional likelihood of the parameter θ:
With a proper prior distribution for θ, P(θ), we have:
P(θ | z, γ, b, σ^{2}) ∝ exp(L(θ)) * P(θ), (19)
where L(θ) is defined in (18). In this paper, we focus on two kernel types: the polynomial kernel and the Gaussian kernel, as defined in (6b) and (6c) respectively. For a GK, with the prior for the width parameter given in the "Prior specification" subsection, the resulting posterior distribution given by (19) is non-regular, so we apply the Metropolis-Hastings algorithm (details can be found in [31]) to draw the sample. For a PK, we simply calculate the likelihood for each d via (18) and sample d accordingly. Sometimes one may need the gradient of L(θ) with respect to θ (assume θ = [θ_{1},..., θ_{ J }]') when adopting other plausible algorithms:
where Ω_{ γ }(θ) = K_{ γ }(θ) + σ^{2}I_{ n }, i = 1,..., J. (20)
Theoretically, the proposed KIGP with the linear kernel performs very close to most other classical linear methods. As the width parameter of a Gaussian kernel increases above 1, within a reasonable range, the KIGP with such a GK performs fairly close to the KIGP with a linear kernel. Conversely, as the width decreases below 1, the behavior of the KIGP in the observation space becomes highly nonlinear. As the degree of a polynomial kernel increases, the nonlinearity of the KIGP also increases; when the degree equals 1, the only difference between the PK(1) and the linear kernel is a constant. In short, within a kernel class, different values of the kernel parameter represent different feature spaces. For certain kernel parameter values, the performance of the KIGP with a GK or a PK will be close to that of the KIGP with a linear kernel, or of a classical linear model in general. Therefore, the posterior distribution of the kernel parameter provides clues about what kind of feature space is more appropriate for the target problem with the given training samples.
Proposed Gibbs sampler
1. Start with proper initial values [γ^{[0]}, b^{[0]}, t^{[0]}, z^{[0]}, σ^{2[0]}, θ^{[0]}]; then set i = 1.
2. Sample z^{[i]} from z | t^{[i-1]}, b^{[i-1]}, σ^{2[i-1]} via (16).
3. Sample t^{[i]} from t | γ^{[i-1]}, b^{[i-1]}, z^{[i]}, σ^{2[i-1]}, θ^{[i-1]} via (15).
4. Sample b^{[i]} from b | z^{[i]}, t^{[i]}, σ^{2[i-1]} via (17a).
5. Sample σ^{2[i]} from σ^{2} | z^{[i]}, t^{[i]}, b^{[i]} via (17b).
6. Sample γ^{[i]} component-wise from γ | z^{[i]}, b^{[i]}, σ^{2[i]}, θ^{[i-1]} via (14).
7. Sample θ^{[i]} from θ | z^{[i]}, b^{[i]}, σ^{2[i]}, γ^{[i]}.
8. Set i = i + 1 and go back to step 2 until the required number of iterations is reached.
9. Stop. (21)
In the above procedure, the kernel parameter θ denotes the degree parameter "d" of a polynomial kernel or the width parameter "r" of a Gaussian kernel. In step 2, we follow the optimal exponential accept-reject algorithm suggested by Robert [32] to draw from a truncated Gaussian distribution. After a suitable burn-in period, we can obtain posterior samples of [z^{[i]}, t^{[i]}, b^{[i]}, σ^{2[i]}, γ^{[i]}, θ^{[i]}] at the i th iteration with the procedure described in (21). The core calculation of the proposed Gibbs sampler involves inverting the matrix K_{ γ }+ σ^{2}I. Since the kernel matrix K_{ γ }is symmetric and nonnegative definite, K_{ γ }+ σ^{2}I is symmetric and positive definite; the algorithm is therefore theoretically robust, and the Cholesky decomposition can be applied in the numerical computation. The total computational complexity of the proposed Gibbs sampler within each iteration is O(pn^{3}).
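Steps 2-5 of procedure (21) can be sketched compactly as below. This is a partial sketch under simplifying assumptions: the gene-selection vector (step 6) and the kernel parameter (step 7) are held fixed, the truncated draws use naive rejection instead of Robert's accept-reject scheme, and the function name is ours.

```python
import numpy as np

def gibbs_core(K, y, n_iter=200, rng=None):
    """Steps 2-5 of procedure (21) with gamma and theta held fixed;
    K is the kernel matrix on the selected genes, y the +/-1 labels."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y)
    n = len(y)
    t, b, sigma2 = np.zeros(n), 0.0, 1.0
    z = np.where(y == 1, 0.5, -0.5)              # crude initialization
    for _ in range(n_iter):
        # Step 2: z | t, b, sigma^2, y -- truncated normals, Eq. (16)
        for i in range(n):
            while True:
                zi = rng.normal(t[i] + b, np.sqrt(sigma2))
                if (y[i] == 1 and zi > 0) or (y[i] == -1 and zi < 0):
                    z[i] = zi
                    break
        # Step 3: t | z, b -- Gaussian of Eq. (15)
        Om = sigma2 * np.eye(n)
        S = np.linalg.inv(Om + K)
        mean_t = (np.eye(n) - Om @ S) @ (z - b)
        cov_t = Om - Om @ S @ Om + 1e-10 * np.eye(n)   # jitter for stability
        t = rng.multivariate_normal(mean_t, cov_t)
        # Step 4: b | z, t, sigma^2 -- Eq. (17a)
        b = rng.normal(np.mean(z - t), np.sqrt(sigma2 / n))
        # Step 5: sigma^2 | z, t, b -- IG posterior, Eq. (17b), IG(1,1) prior
        s2 = np.mean((z - t - b) ** 2)
        sigma2 = 1.0 / rng.gamma(1.0 + n / 2.0, 1.0 / (1.0 + n * s2 / 2.0))
    return t, b, sigma2, z
```

The inverted-gamma draw uses the standard relation that if σ² ~ IG(α, β) then 1/σ² ~ Gamma(α, rate β). Recall that in the first two phases of the actual algorithm, σ² is held at 1 rather than sampled.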
Overall algorithm
In Fig. 1, we summarize the general framework of the proposed KIGP method. The box bounded by the dotted lines represents the KIGP learning algorithm. A kernel type is assumed to be given a priori. The algorithm has a cascading structure composed of three consecutive phases: the "kernel parameter fitting phase", the "gene selection phase" and the "prediction phase". Although in the Bayesian sense one could keep all the parameters in the proposed Gibbs sampler through all three phases, for practicality we suggest fixing the kernel parameter(s) after the "kernel parameter fitting phase" and fixing the gene-selection vector after the "gene selection phase". Very often, we are only interested in the area around the peak of the posterior PDF (or probability mass function, PMF) of a parameter, especially for the kernel parameter(s) and the gene-selection vector. This strategy leads to much faster convergence of the proposed Gibbs sampler as long as the posterior PDF or PMF of the kernel parameter(s) is unimodal. For all three phases, a proper number of iterations must be discarded as the burn-in period. Dynamic monitoring strategies for tracking the convergence of an MCMC simulation can be used (e.g. [31]).
A practical issue needs to be addressed here. It is better to fix the variance parameter σ^{2} at a proper constant during the "kernel parameter fitting phase" and the "gene selection phase", because this helps the proposed algorithm to be more numerically stable and to converge faster. For all the simulations in this paper, as in a regular probit regression model, we set σ^{2} equal to 1 (step 5 in (21)) in the first two phases and only include it in the Gibbs sampler in the "prediction phase". More details of each phase are described as follows.
Kernel Parameter Fitting Phase
In the kernel parameter fitting phase, our primary interest is to find the appropriate value(s) for the kernel parameter(s) of the given kernel type. In this study, we focus on two kernel types, the polynomial kernel and the Gaussian kernel. With the training set X and y, we first include all model parameters (except σ^{2}), the gene-selection vector and the kernel parameter in the simulation of the algorithm given by (21). After convergence, the samples obtained from (21) within each iteration are drawn from the joint posterior distribution of all the parameters. For a PK, since the degree parameter is discrete, we simply take the degree value with the highest posterior probability. For a GK, we compute a histogram of the sampled values of the width parameter with a proper number of bins, smooth the histogram bars with a Gaussian smoother (similar to Gaussian kernel density estimation), and take the center of the bin with the highest smoothed count as the best fitted value of the width parameter.
Gene Selection Phase
After the "kernel parameter fitting phase", we fix the kernel parameter(s) at the fitted value(s) and then continue to run the proposed Gibbs sampler. In this subsection, we present an empirical approach to determining whether a gene is potentially significant based on the posterior samples and a given threshold.
Efron [29] thoroughly discussed an empirical Bayes approach for estimating the null hypothesis based on a large-scale simultaneous t-test. In this paper, we essentially follow the key concept therein to assess whether or not a gene is of significant importance for the given classification problem. We first define a statistic named the "Normalized Log-Frequency" (NLF) to measure the relative potential significance of a gene. Denoting by F_{ j }the frequency with which the j th gene appears in the posterior samples, the NLF is defined as:
In practice, if F_{ j }is 0, we simply set it to 1/2 divided by the total number of iterations. Our use of the NLF as the key statistic rests on the fact that the logarithm of a gamma distribution can be well approximated by a normal distribution, while a gamma distribution is empirically a proper distribution for the appearance frequency, in the posterior samples, of any gene from a homogeneous group.
Suppose that the p NLF values fall into two classes, "insignificant" and "significant", according to whether or not NLF_{ j }, for j = 1, 2,..., p, is generated under the null hypothesis, with prior probabilities Pb_{0} and Pb_{1} = 1 - Pb_{0} for the two classes respectively; and that NLF_{ j }has the conditional prior density f_{0}(NLF) or f_{1}(NLF) depending on its class. That is,
Pb_{0} = Pr{Insignificant}, Pr(NLF | Insignificant) = f_{0}(NLF)
Pb_{1} = Pr{Significant}, Pr(NLF | Significant) = f_{1}(NLF) (23)
The marginal distribution for NLF_{ j }is thus
Pr(NLF) = f_{0}(NLF)*Pb_{0} + f_{1}(NLF) * Pb_{1} = f (NLF) (24)
By using the Bayes' formula, the posterior probability for "insignificant" class given the NLF therefore yields
Pr(Insignificant | NLF) = f_{0}(NLF) * Pb_{0}/f(NLF) (25)
Following [29], we further define the local "false discovery rate" (fdr) by
fdr(NLF) = f_{0}(NLF)/f(NLF) (26)
Since in a typical microarray study Pb_{0} is generally very close to 1 (say Pb_{0} > 0.99), fdr(NLF) is a fairly precise estimator of the posterior probability of the null hypothesis (the insignificant class) given the statistic NLF. With fdr(NLF), we can decide whether or not a target gene is "significant" at some confidence level. For all the examples, we report all genes with fdr smaller than 0.05.
To calculate fdr(NLF), one needs to estimate f(NLF) and to choose f_{0}(NLF) properly. To estimate f(NLF), one can resort to the ensemble of NLF values, {NLF_{ j }, j = 1, 2,..., p}. We divide the target range of NLF into M equal-length bins with the center of each bin at x_{ i }for i = 1,2,..., M. A heuristic choice of M is the maximum NLF value multiplied by 10 and rounded up. We then compute the histogram of the NLF set over these bins and fit a Gaussian smoother; the output, divided by the product of the bin width and the number of genes (i.e. p), gives a proper estimate of f(NLF) at the center of each bin.
The more critical part is the choice of the density of NLF under the null hypothesis, i.e. f_{0}(NLF). The basic assumption we impose here is that the statistic NLF under the null hypothesis follows a normal distribution. Since Pb_{1} is much smaller than Pb_{0} (say Pb_{0} > 0.99) in most real microarray analysis problems, it is very safe to choose the standard normal (zero mean, unit variance) as f_{0}(NLF), based on the definition (22). Throughout this paper, we always choose the standard normal as the density of NLF under the null hypothesis. (In case Pb_{0} ≤ 0.99, a more elaborate scheme is needed; an easy approach can be found in [29].) After both f(NLF) and f_{0}(NLF) are obtained, the local fdr for each gene can be calculated by (26). Based on the local fdr, one can select the "significant" class of genes at a given confidence level and fix the gene-selection vector thereafter.
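The local-fdr recipe of Eqs. (23)-(26) can be sketched as follows. The helper name is ours, and a simple Nadaraya-Watson-style Gaussian smoother of the histogram counts stands in for the smoother described in the text; the standard normal serves as the null density f_0.

```python
import numpy as np

def local_fdr(nlf, n_bins=None):
    """Per-gene local fdr of Eq. (26): f is a smoothed histogram density
    of the observed NLF values; f0 is the standard normal null."""
    nlf = np.asarray(nlf, dtype=float)
    p = len(nlf)
    if n_bins is None:
        n_bins = max(int(np.ceil(nlf.max() * 10)), 10)  # heuristic M from the text
    counts, edges = np.histogram(nlf, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    # Gaussian smoother over the histogram bars
    h = 2.0 * width
    W = np.exp(-0.5 * ((centers[:, None] - centers[None, :]) / h) ** 2)
    smoothed = (W @ counts) / W.sum(axis=1)
    f_hat = smoothed / (width * p)                      # estimate of f(NLF)
    f0 = np.exp(-0.5 * centers ** 2) / np.sqrt(2 * np.pi)   # standard normal null
    fdr = np.clip(f0 / np.maximum(f_hat, 1e-12), 0.0, 1.0)
    # map each gene's NLF to the fdr value of its bin
    idx = np.clip(np.digitize(nlf, edges) - 1, 0, n_bins - 1)
    return fdr[idx]
```

Genes with `local_fdr(nlf) < 0.05` would then form the "significant" class at the 5% level, as reported for all the examples.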
Prediction Phase
After the "gene selection phase", both the kernel parameter(s) and the gene-selection vector have been fixed. We continue to run the proposed Gibbs sampler (21), whose computational complexity dramatically decreases to O(n^{3}). After a new burn-in period, we draw samples of z, b and σ^{2} within each iteration of the "prediction phase". Following (9), the posterior PDF of the output $\tilde{t}$ given the testing data $\tilde{x}$ at the l th iteration is Gaussian:
where f^{[l]}= (z^{[l]}- b^{[l]}1_{ n })' (K_{ γ }+ σ^{2[l]}I_{ n })^{-1} k_{ γ }, V^{[l]}= K($\tilde{x}$_{ γ }, $\tilde{x}$_{ γ }) - k_{ γ }'(K_{ γ }+ σ^{2[l]}I_{ n })^{-1}k_{ γ },
K_{ γ,ij }= K(x_{γ,i}, x_{γ,j}), k_{ γ,i }= K($\tilde{x}$_{ γ }, x_{ γ,i }), for i, j = 1,..., n; l = 1,..., L. (27a)
Then, the predictive probability for the output label $\tilde{y}$ given $\tilde{x}$ can be estimated by using the Monte Carlo integration:
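Under model (4), the latent z̃ at draw l is Gaussian with mean f^[l] + b^[l] and variance V^[l] + σ^{2[l]}, so the Monte Carlo estimate of (27b) plausibly reduces to averaging probit probabilities over the posterior draws. The sketch below rests on that assumption (the function name is ours).

```python
import math

def predictive_prob(draws):
    """Monte Carlo estimate of P(y~ = 1 | x~). Each draw is a tuple
    (f, V, b, sigma2): the predictive mean and variance from Eq. (27a)
    plus the intercept and noise variance of that Gibbs iteration.
    Averages Phi((f + b) / sqrt(V + sigma2)) over the L draws."""
    total = 0.0
    for f, V, b, sigma2 in draws:
        zscore = (f + b) / math.sqrt(V + sigma2)
        total += 0.5 * (1.0 + math.erf(zscore / math.sqrt(2.0)))  # normal CDF
    return total / len(draws)
```

A probability near 0.5 flags a borderline case, which is exactly the kind of information the Discussion credits the KIGP with providing over a hard-margin SVM.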
Kernel type competition
Another important issue that needs to be addressed is how to properly select a kernel type. If an independent set of testing samples is available, one approach is to calculate a predictive fit measure such as the misclassification rate (MR) or the average predictive probability (APP) of the true class label. If the number of available testing samples is sufficiently large, this approach is very reliable.
Assuming that there are M testing samples {($\tilde{x}$_{1}, $\stackrel{\u2322}{y}$_{1}),...,($\tilde{x}$_{ M }, $\stackrel{\u2322}{y}$_{ M })}, where $\tilde{x}$_{ i }denotes the microarray data and $\stackrel{\u2322}{y}$_{ i }is its class label for i = 1,2,..., M, the MR for the testing set can be estimated by
The smaller the MR of a kernel type, the better its general performance should be. If the number of available testing samples is small, the APP of the true class label is a more consistent measure. Throughout this paper, APP always refers to the APP of the true class label, defined as:
In both (28a) and (28b), the probability P($\tilde{y}$_{ i }| X, y, $\tilde{x}$_{ i }, K) is evaluated by (27a) and (27b). Obviously, a better model should have a higher APP. The APP usually provides a less biased predictive fit measure when the number of testing samples is limited.
After running the simulations under each candidate kernel type, one can simply choose the kernel type with the smallest MR or the largest APP on the testing set. However, independent testing samples are not always available. To use the predictive fit approach in that case, one may resort to a rigorous cross-validation (CV) procedure. Sometimes a "leave-one-out" cross-validation (LOOCV) is appropriate: one treats one training sample as the testing sample, applies the proposed KIGP (including all three phases) to the remaining n-1 samples, and obtains the predictive measure for that sample. Repeating this procedure for each training sample and averaging the predictive measures gives a consistent evaluation of the target kernel type.
A less biased approach is to use multiple independent 3-fold CVs. For each round of CV, one first randomly partitions the training set into three sets with a balanced ratio of the class labels. For each of the three sets, one treats it as the testing set, applies the KIGP (including all three phases) to the other two sets as the training set, and obtains the predictive fit measure for this testing set. After running this procedure for all three sets, one has the predictive measure for all available samples in this round. Running multiple rounds of independent 3-fold CVs (with different random partitionings) and averaging the predictive measure over the whole set delivers a nearly unbiased assessment of the given kernel type.
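The class-balanced random partition used in each CV round can be sketched as follows; the function name is ours.

```python
import numpy as np

def balanced_kfold(y, k=3, rng=None):
    """Randomly partition sample indices into k folds with a balanced
    class-label ratio, as in each round of the rigorous 3-fold CV."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y)
    folds = [[] for _ in range(k)]
    # shuffle each class separately, then deal its members round-robin
    for label in np.unique(y):
        idx = rng.permutation(np.where(y == label)[0])
        for j, i in enumerate(idx):
            folds[j % k].append(int(i))
    return [np.array(sorted(f)) for f in folds]
```

Shuffling within each class before dealing round-robin guarantees that every fold carries (as nearly as possible) the same class proportions as the full training set, which keeps the per-fold predictive fit measures comparable.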
The predictive fit approach through multiple 3-fold CVs works very well, and throughout this study we use it to select the proper kernel type for a given problem whenever an independent testing set is not available. Owing to the nature of MCMC-based methods, however, the KIGP is extremely computationally intensive, especially when the number of candidate kernel types is large. A more integrative implementation of kernel or model selection, such as one making use of a reversible jump MCMC approach, would help further streamline the current KIGP.
Appendix: Inference for P(t | z, b, K_{ γ })
First, for convenience, we drop the notation of b and K_{ γ }in the following derivations. Under a KIGP model, we have
t ~ N(0, K_{ γ }), z|t ~ N(t + b 1_{ n }, Ω), where K_{ γ }, 1_{ n }and Ω are defined in Eq. (12). (A1)
The joint distribution of z and t is still Gaussian, which can be formulated as:
where μ_{ t }= (Ω^{1} + ${K}_{\gamma}^{1}$)^{1}Ω^{1}(z  b l_{ n }). (A2)
In principle, if z and t form a joint Gaussian distribution, both the marginal distribution of z and the conditional distribution of t given z are also Gaussian. Making use of the following equation from [33]:
(A + C)^{1} = A^{1}  A^{1}(A^{1} + C^{1})^{1}A^{1}, (A3)
it consequently yields
and
where μ_{ t }= (I_{ n } Ω(Ω + K_{ γ })^{1}(z  b l_{ n }). (A5)
z ~ N(b l_{ n }, K_{ γ }+ Ω)
Or strictly,
tz ~ N((I_{ n } Ω(Ω + K_{ γ })^{1})(z  b l_{ n }), Ω  Ω(Ω + K_{ γ })^{1}Ω) (A6)
NOTE: The matrix Ω  Ω(Ω + K_{ γ })^{1}Ω is nonnegative definite.
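The equivalences used above can be checked numerically: identity (A3), the equality of the two forms of the conditional covariance in (A2) and (A6), the equality of the two forms of the conditional mean operator in (A2) and (A5), and the definiteness claim in the note. The following Python sketch verifies all of them on random positive-definite stand-ins for K_{ γ } and Ω.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Random symmetric positive-definite stand-ins for K_gamma and Omega.
A = rng.normal(size=(n, n)); K = A @ A.T + n * np.eye(n)
B = rng.normal(size=(n, n)); Om = B @ B.T + n * np.eye(n)

inv = np.linalg.inv

# (A3): (A + C)^{-1} = A^{-1} - A^{-1}(A^{-1} + C^{-1})^{-1}A^{-1},
# applied with A = Omega, C = K_gamma.
lhs = inv(Om + K)
rhs = inv(Om) - inv(Om) @ inv(inv(Om) + inv(K)) @ inv(Om)
assert np.allclose(lhs, rhs)

# Conditional covariance of t | z in its two equivalent forms,
# (A2) precision form vs. (A6) explicit form.
cov1 = inv(inv(Om) + inv(K))
cov2 = Om - Om @ inv(Om + K) @ Om
assert np.allclose(cov1, cov2)

# Conditional mean operator in its two equivalent forms,
# (A2) vs. (A5); both act on (z - b*l_n).
m1 = inv(inv(Om) + inv(K)) @ inv(Om)
m2 = np.eye(n) - Om @ inv(Om + K)
assert np.allclose(m1, m2)

# The conditional covariance is (here strictly) positive definite,
# consistent with the nonnegative-definiteness note.
assert np.all(np.linalg.eigvalsh(cov2) > 0)
```

Since cov1 is the inverse of a sum of inverses of positive-definite matrices, it is itself positive definite, which is why the eigenvalue check passes strictly here; with a singular Ω the matrix in (A6) would be only nonnegative definite.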
Abbreviations
KIGP: kernel-imbedded Gaussian process
AIC: Akaike information criterion
BIC: Bayesian information criterion
MCMC: Markov chain Monte Carlo
PAM: prediction analysis for microarrays
SVM: support vector machine
RFE: recursive feature elimination
LS-SVM: least squares support vector machine
GP: Gaussian process
LK: linear kernel
PK: polynomial kernel
PK(d): polynomial kernel with degree "d"
GK: Gaussian kernel
GK(r): Gaussian kernel with width "r"
NLF: normalized log-frequency
fdr: false discovery rate
MR: misclassification rate
APP: average predictive probability
LOOCV: leave-one-out cross-validation
3-fold CV: 3-fold cross-validation
PDF: probability density function
AML: acute myeloid leukemia
ALL: acute lymphoblastic leukemia
SRBCT: small round blue-cell tumor
EWS: Ewing family of tumors
RMS: rhabdomyosarcoma
NB: neuroblastoma
NHL: non-Hodgkin lymphoma
ANN: artificial neural network
RJMCMC: reversible jump Markov chain Monte Carlo
IG: inverted gamma
PMF: probability mass function
RF: random forests
DLDA: diagonal linear discriminant analysis
Declarations
Acknowledgements
This work was partially supported by the Loyola University Medical Center Research Development Funds, and the SUN Microsystems Academic Equipment Grant for Bioinformatics awarded to LWKC. We would like to thank Dr. Will Gersch at the Department of Information and Computer Sciences, University of Hawaii, for his helpful comments and suggestions.
Authors’ Affiliations
References
1. Golub TR, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J, Caligiuri M, Bloomfield C, Lander E: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286:531–537. doi:10.1126/science.286.5439.531
2. Golub TR: Genome-wide views of cancer. N Engl J Med 2001, 344:601–602. doi:10.1056/NEJM200102223440809
3. Ramaswamy S, Golub TR: DNA microarrays in clinical oncology. Journal of Clinical Oncology 2002, 20:1932–1941.
4. Tamayo P, Ramaswamy S: Cancer genomics and molecular pattern recognition. In Expression Profiling of Human Tumors: Diagnostic and Research Applications. Edited by Ladanyi M, Gerald W. Humana Press; 2003.
5. Mallows CL: Some comments on C_p. Technometrics 1973, 15:661–676. doi:10.2307/1267380
6. Akaike H: Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory. Edited by Petrov BN, Csaki F. 1973:267–281.
7. Schwarz G: Estimating the dimension of a model. Ann Statist 1978, 6:461–464.
8. George EI, Foster DP: Calibration and empirical Bayes variable selection. Biometrika 2000, 87:731–747. doi:10.1093/biomet/87.4.731
9. Yuan M, Lin Y: Efficient empirical Bayes variable selection and estimation in linear models. J Amer Statist Assoc 2005, 100:1215–1225. doi:10.1198/016214505000000367
10. Lee KE, Sha N, Dougherty ER, Vannucci M, Mallick BK: Gene selection: a Bayesian variable selection approach. Bioinformatics 2003, 19(1):90–97. doi:10.1093/bioinformatics/19.1.90
11. Zhou X, Wang X, Dougherty ER: Gene prediction using multinomial probit regression with Bayesian gene selection. EURASIP Journal on Applied Signal Processing 2004, 1:115–124. doi:10.1155/S1110865704309157
12. Zhou X, Liu K, Wong STC: Cancer classification and prediction using logistic regression with Bayesian gene selection. Journal of Biomedical Informatics 2004, 37:249–259. doi:10.1016/j.jbi.2004.07.009
13. Zhou X, Wang X, Dougherty ER: Gene selection using logistic regressions based on AIC, BIC and MDL criteria. New Mathematics and Natural Computation 2005, 1:129–145. doi:10.1142/S179300570500007X
14. Zhou X, Wang X, Dougherty ER: A Bayesian approach to nonlinear probit gene selection and classification. Journal of the Franklin Institute 2004, 341:137–156. doi:10.1016/j.jfranklin.2003.12.010
15. Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002, 99:6567–6572. doi:10.1073/pnas.082099299
16. Dettling M: BagBoosting for tumor classification with gene expression data. Bioinformatics 2004, 20(18):3583–3593. doi:10.1093/bioinformatics/bth447
17. Lin Y, Wahba G, Zhang H, Lee Y: Statistical properties and adaptive tuning of support vector machines. Machine Learning 2002, 48:115–136. doi:10.1023/A:1013951620650
18. Lin Y: Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery 2002, 6:259–275. doi:10.1023/A:1015469627679
19. Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Machine Learning 2002, 46:389–422. doi:10.1023/A:1012487302797
20. Zhu J, Hastie T: Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics 2005, 14(1):185–205. doi:10.1198/106186005X25619
21. Zhu J, Hastie T: Classification of gene microarrays by penalized logistic regression. Biostatistics 2004, 5(3):427–443. doi:10.1093/biostatistics/kxg046
22. MacKay DJC: The evidence framework applied to classification networks. Neural Computation 1992, 4(5):720–736.
23. Kwok JT: The evidence framework applied to support vector machines. IEEE Trans on Neural Networks 2000, 11:1162–1173. doi:10.1109/72.870047
24. Gestel TV, Suykens JAK, Lanckriet G, Lambrechts A, Moor BD, Vandewalle J: Bayesian framework for least-squares support vector machine classifiers, Gaussian processes, and kernel Fisher discriminant analysis. Neural Computation 2002, 14(5):1115–1147. doi:10.1162/089976602753633411
25. Neal RM: Bayesian Learning for Neural Networks. New York: Springer-Verlag; 1996.
26. Cristianini N, Shawe-Taylor J: An Introduction to Support Vector Machines. Cambridge University Press; 2000.
27. Khan J, Wei J, Ringner M, Saal L, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu C: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 2001, 7(6):673–679. doi:10.1038/89044
28. Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Raffeld M: Gene expression profiles in hereditary breast cancer. N Engl J Med 2001, 344:539–548. doi:10.1056/NEJM200102223440801
29. Efron B: Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J Amer Statist Assoc 2004, 99:96–104. doi:10.1198/016214504000000089
30. Alon U, Barkai N, Notterman D, Gish K, Mack S, Levine J: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 1999, 96:6745–6750. doi:10.1073/pnas.96.12.6745
31. Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. 2nd edition. Chapman & Hall/CRC.
32. Robert C: Simulation of truncated normal variables. Statistics and Computing 1995, 5:121–125. doi:10.1007/BF00143942
33. Barnett S: Matrix Methods for Engineers and Scientists. McGraw-Hill; 1979.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.