 Research article
 Open Access
Kernel-based distance metric learning for microarray data classification
BMC Bioinformatics volume 7, Article number: 299 (2006)
Abstract
Background
The most fundamental task using gene expression data in clinical oncology is to classify tissue samples according to their gene expression levels. Compared with traditional pattern classification, gene expression-based data classification is typically characterized by high dimensionality and small sample size, which makes the task quite challenging.
Results
In this paper, we present a modified K-nearest-neighbor (KNN) scheme for cancer classification using microarray data, based on learning an adaptive distance metric in the data space. The distance metric, derived from the optimization of a data-dependent kernel, can substantially increase the class separability of the data and, consequently, lead to a significant improvement in the performance of the KNN classifier. Extensive experiments show that the performance of the proposed kernel-based KNN scheme is competitive with that of sophisticated classifiers such as support vector machines (SVMs) and uncorrelated linear discriminant analysis (ULDA) in classifying gene expression data.
Conclusion
A novel distance metric is developed and incorporated into the KNN scheme for cancer classification. This metric can substantially increase the class separability of the data in the feature space and, hence, lead to a significant improvement in the performance of the KNN classifier.
Background
DNA microarray technology is designed to measure the expression levels of tens of thousands of genes simultaneously. As an important application of this technology, gene expression data are used to determine and predict the state of tissue samples, which has been shown to be very helpful in clinical oncology. The most fundamental task using gene expression data in clinical oncology is to classify tissue samples according to their gene expression levels. In combination with pattern classification techniques, gene expression data can provide a more reliable means of diagnosing and predicting various types of cancer than traditional clinical methods.
Compared with traditional pattern classification, gene expression-based data classification is typically characterized by high dimensionality and small sample size, which makes the task quite challenging. In the literature, a number of methods have been applied or developed to classify microarray data [1–6]. These methods include K-nearest-neighbor (KNN), boosting, linear discriminant analysis (LDA), and support vector machines (SVM). We briefly review some of these approaches here.
K-Nearest-Neighbor (KNN)
The KNN method is a simple, yet useful approach to data classification. The error rate of KNN has been proven to be asymptotically at most twice the Bayes error rate [7]. However, its performance deteriorates dramatically when the input data set has a relatively low local relevance [8]. The most important factor affecting the performance of KNN is the distance metric, so it is desirable to adopt a distance metric appropriate to the data. In practice, the Euclidean distance is usually used.
Diagonal Linear Discriminant Analysis (DLDA)
DLDA is the simplest case of the maximum likelihood discriminant rule, in which the class densities are supposed to have the same diagonal covariance matrix. In the special case of binary classification, the DLDA scheme can be viewed as the "weighted voting scheme" proposed by Golub et al. in [3]. The major advantage of the DLDA algorithm lies in its computational efficiency.
Linear Discriminant Analysis (LDA)
The classical LDA method aims to find the most discriminatory projection directions of the input data and classifies the data in the projected space. A major problem in employing the classical LDA algorithm for classifying gene expression data is that the so-called scatter matrices are always singular, owing to the high dimensionality and relatively small sample size. The singularity makes the classical LDA algorithm inapplicable. In areas such as face recognition and text classification, principal component analysis (PCA) is introduced as a preprocessing step to reduce the dimensionality of the input data. However, since the projection criterion of PCA is essentially different from that of LDA, some discriminatory information is inevitably lost in the PCA step. A recent development in LDA is generalized discriminant analysis [9, 10], in which a more delicate matrix technique, namely the generalized singular value decomposition (GSVD), is used to extend the classical LDA to a more general version.
Support Vector Machines (SVM)
SVM has been recognized as one of the most powerful classifiers in various applications of pattern classification. For binary classification, SVM searches for a hyperplane that separates the two classes of data with the maximum margin. It has been shown that support vector machines perform well in many areas of computational biology [11–13]. In the experimental part of this paper, we follow [14] to implement the SVM algorithm.
Generally speaking, because of the high dimensionality and small sample size, linear classifiers such as linear discriminant analysis (LDA) and support vector machines (SVM) with linear kernels are favored. However, based on some benchmark tests, researchers have shown that nonlinear classifiers are capable of exploiting the nonlinear discriminatory information in microarray data and usually produce more precise classification results [15, 16]. This is especially true when more patient samples are available or the data dimension is substantially reduced, since in these cases the linear separability of the microarray data could be considerably degraded.
Among the general algorithms of pattern classification, K-nearest-neighbor (KNN) is a simple yet useful one. However, in practice, the performance of the KNN algorithm is often inferior to that of sophisticated approaches such as SVM and generalized linear discriminant analysis (GLDA) [9, 10]. Since the distance metric is of great importance for the KNN scheme, an attractive way to improve the performance of KNN is to adopt a distance metric that adapts to the input data better than the Euclidean distance. In this paper, we propose to learn the adaptive distance metric by optimizing a data-dependent kernel. Experimental results show that, compared with the ordinary Euclidean distance-based KNN scheme, our kernel-based KNN algorithm, denoted KerNN, consistently achieves significant improvement in classifying gene expression data. Moreover, the performance of the KerNN classifier is shown to be competitive with, if not better than, that of sophisticated classifiers, e.g., SVM and uncorrelated linear discriminant analysis (ULDA) [10], in classifying microarray data.
Results
We conducted extensive experiments to compare the performance of our KerNN scheme with that of commonly used classification algorithms, i.e., KNN, DLDA [3], ULDA [10], and SVM. Ten publicly available microarray data sets were chosen to test the algorithms. The basic information about these data sets is summarized below. Each data set is first normalized to zero mean and unit variance in every feature direction and then randomly partitioned into two disjoint subsets with equal numbers of samples, one used as the training data and the other as the test data. We consider only the Gaussian kernel function in the proposed and SVM algorithms.

1. ALL-AML Leukemia Data: This data set, taken from the website [17], contains 72 samples of human acute leukemia. 47 samples belong to acute lymphoblastic leukemia (ALL), and the other 25 to acute myeloid leukemia (AML). Each sample presents the expression levels of 7129 genes. For detailed information, one can refer to [3].
2. ALL-MLL-AML Leukemia Data: This leukemia microarray data set is available on the website [17]. It includes 72 human leukemia samples: 24 belong to acute lymphoblastic leukemia (ALL), 20 to mixed lineage leukemia (MLL), a subset of human acute leukemia with a chromosomal translocation, and 28 to acute myelogenous leukemia (AML). Each sample gives the expression levels of 12582 genes. Further information about this data set can be found in [21].
3. Embryonal Tumors of the Central Nervous System (CNS): This data set, available at the website [17], contains 60 patient samples, of which 21 are survivors of a treatment and 39 are failures. There are 7129 genes in the data set. One can refer to [22] for more information about this data set.
4. Breast Cancer Data: The data are available on the website [18]. The expression matrix monitors 7129 genes in 49 breast tumor samples. There are two response variables, describing the estrogen receptor (ER) status and the lymph nodal (LN) status, respectively. For the ER status, 25 samples are ER+, whereas the remaining 24 samples are ER−. For the LN variable, there are 25 positive samples and 24 negative samples. Detailed information about this data set can be found in [6].
5. Colon Tumor Data: This data set is adopted from the website [17]. The data contain 62 samples collected from colon-cancer patients. Among them, 40 samples are from tumors, and 22 normal biopsies are from healthy parts of the colons of the same patients. 2000 genes were selected to measure their expression levels. One can refer to [23].
6. Lung Cancer Data: This data set is taken from the website [17]. It contains 181 tissue samples, which are classified into two classes: malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA). Each sample is described by 12533 genes. More information about this data set can be found in [24].
7. Lymphoma Data: The data are available on the website [19]. This data set contains 77 tissue samples, of which 58 are diffuse large B-cell lymphomas (DLBCL) and the remaining 19 are follicular lymphomas (FL). Each sample is represented by the expression levels of 7129 genes. Detailed information about this data set can be found in [25].
8. Ovarian Cancer Data: This data set, available on the website [17], is used to distinguish ovarian cancer from non-cancer. It contains 253 samples, and each sample has 15154 features. More details can be found in [26].
9. Prostate Cancer Data: This data set, adopted from the website [19], contains the expression levels of 12600 genes for 52 prostate tumor samples and 50 normal prostate samples. One can refer to [4] for the details about this data set.
10. Subtypes of Acute Lymphoblastic Leukemia: This data set, available on the website [20], contains 6 subtypes of pediatric acute lymphoblastic leukemia, corresponding to six diagnostic groups: BCR-ABL, E2A-PBX1, MLL, T-ALL, TEL-AML1, and Hyperdiploid>50. Each sample contains 12625 genes.
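The preprocessing applied to every data set (per-gene normalization followed by a random split into two equal halves) can be sketched as follows. This is a minimal illustration under our own assumptions: the function names are hypothetical, the toy matrix is random rather than real expression data, and constant genes are left at zero to avoid division by zero.

```python
import numpy as np

def standardize(X):
    """Scale every gene (column) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # assumption: constant genes are left at zero
    return (X - mean) / std

def random_half_split(n_samples, rng):
    """Return two disjoint index arrays of (nearly) equal size."""
    perm = rng.permutation(n_samples)
    half = n_samples // 2
    return perm[:half], perm[half:]

rng = np.random.default_rng(0)
# Toy stand-in for an expression matrix: 72 samples, 50 genes.
X = rng.normal(loc=5.0, scale=2.0, size=(72, 50))
Z = standardize(X)
train_idx, test_idx = random_half_split(len(Z), rng)
```

Repeating the split with fresh random permutations gives the repeated train/test experiments over which error rates are averaged.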
Comparisons in terms of the best results
For each data set, we chose the N_{f} most discriminatory genes, where N_{f} = 10, 20, 40, 60, 80, 100, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000, respectively; repeated the experiment 100 times at each value of N_{f}; and then calculated the average test error rates and their standard deviations over the 100 experiments. Table 1 lists the best result, i.e., the smallest average test error rate, of each algorithm. It can be seen that the proposed KerNN algorithm achieves the best result (shown in bold face) on four data sets. On the other data sets, the performance of the KerNN algorithm is still competitive with, if not better than, that of the SVM and ULDA schemes.
In Table 1, if we assign a score 1 to the best result, 2 to the next best result, ..., and so on, then, the global performance of a classifier can be roughly evaluated in terms of the average score. We show the average scores of the five classifiers in Table 1. It can be seen that the proposed KerNN scheme achieves the lowest score among the five classifiers.
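The scoring rule just described can be written down in a few lines. The sketch below is illustrative only: the classifier names and error rates are made up and are not taken from Table 1, and the handling of ties (equal errors share the better rank) is our assumption, since the text does not specify it.

```python
def rank_scores(errors):
    """Map classifier name -> rank (1 = lowest error); ties share the better rank."""
    ordered = sorted(errors.values())
    return {name: ordered.index(err) + 1 for name, err in errors.items()}

def average_scores(table):
    """Average each classifier's rank over all data sets in the table."""
    names = list(next(iter(table.values())).keys())
    per_dataset = [rank_scores(row) for row in table.values()]
    return {n: sum(r[n] for r in per_dataset) / len(per_dataset) for n in names}

# Made-up error rates (percent), for illustration only.
toy_table = {
    "dataset_1": {"clf_a": 6.0, "clf_b": 4.0, "clf_c": 2.0},
    "dataset_2": {"clf_a": 20.0, "clf_b": 14.0, "clf_c": 16.0},
}
avg = average_scores(toy_table)
```

A lower average score indicates a classifier that ranks near the top more consistently across data sets.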
Comparisons under different gene numbers
To investigate the stability of the five classification algorithms, we compared their performance when different numbers of genes were selected. The experimental results are shown in Fig. 1 for the ALL-AML data, Fig. 2 for the Colon data, and Fig. 3 for the Prostate data, where the horizontal axis is the number of selected genes and the vertical axis is the average test error rate of the classifiers over 100 experiments. Fig. 1 (a), Fig. 2 (a), and Fig. 3 (a) illustrate the results when a relatively small number of features is chosen (from 10 to 100), whereas Fig. 1 (b), Fig. 2 (b), and Fig. 3 (b) show the corresponding results when more genes are chosen (from 200 to 2000). It can be seen that the proposed KerNN scheme performs favorably in most cases. Compared with the ULDA scheme, which always performs poorly for small feature sizes, and the DLDA algorithm, whose performance usually degrades for relatively large feature sizes, our KerNN algorithm is more stable across different feature numbers. More importantly, compared with the ordinary KNN classifier, the kernel optimization-based KNN classifier always gains significant improvements, which implies that the kernel optimization induces a distance metric that adapts to the gene expression data better than the Euclidean metric in the data space.
Discussion
Parameter tuning
In the experiments, for KNN, ULDA, and the proposed algorithm, the final classification is done via the K-nearest-neighbor algorithm with K = 3. For the KNN, ULDA, and DLDA algorithms, the only parameter is the number of selected genes N_{f}. For SVM, in addition to the gene number, two parameters, γ in the Gaussian kernel function and the regularization constant C, need to be set in advance. The KerNN algorithm has more parameters. To avoid the intensive computation of tuning the gene number by cross validation, we chose the N_{f} most discriminatory genes for each of N_{f} = 10, 20, 40, 60, 80, 100, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000, and report the best performance of each method in Table 1. For our kernel optimization method, the initial learning rate η_{0} and the total iteration number N are always set to 0.01 and 1000, respectively. Furthermore, for the sake of computational simplicity, we empirically set the two Gaussian parameters in the proposed method as $\gamma_0=\frac{10^{-5}}{\sqrt{N_f}}$ and $\gamma_1=\frac{10^{-2}}{\sqrt{N_f}}$, rather than tuning them by cross validation. These may not be the optimal settings for γ_{0} and γ_{1}, but they avoid high computational complexity; even better results could be expected if the two parameters were chosen by cross validation. Therefore, for the KerNN method, only one parameter needs to be tuned: σ_{ε}, the standard deviation of the disturbance added to the data in Eq. (10). For the SVM, two parameters are tuned by cross validation.
In the experiments, we employed the leave-one-out technique on the training data to choose these parameters. We followed [14] to implement the SVM algorithm, in which the parameter C is chosen from {1.0e+00, 1.0e+01, 1.0e+02, 1.0e+03, 1.0e+04, 1.0e+05, 1.0e+06, 1.0e+07} and γ from {1.0e-07, 5.0e-07, 1.0e-06, 5.0e-06, 1.0e-05, 5.0e-05, 1.0e-04, 5.0e-04, 1.0e-03, 5.0e-03, 1.0e-02} using leave-one-out cross validation. For our KerNN algorithm, the parameter σ_{ε} is selected from {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. Note that only the training samples were used for setting the parameters; the test samples are independent of this process.
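The leave-one-out selection described above can be sketched generically as below. The grids follow the sets listed in the text, with the γ exponents read as negative. The helper names are hypothetical, and the demonstration tunes K of a plain Euclidean KNN only because that is easy to run; in the paper it is C, γ and σ_{ε} that are selected this way, with K fixed at 3.

```python
from collections import Counter

C_GRID = [10.0 ** e for e in range(0, 8)]              # 1.0e+00 ... 1.0e+07
GAMMA_GRID = sorted(m * 10.0 ** e for e in range(-7, -2) for m in (1.0, 5.0)) + [1.0e-02]
SIGMA_GRID = [i / 10 for i in range(11)]               # 0, 0.1, ..., 1.0

def loo_error(predict, params, X, y):
    """Leave-one-out error: hold out each sample once and train on the rest."""
    wrong = 0
    for i in range(len(X)):
        X_tr = X[:i] + X[i + 1:]
        y_tr = y[:i] + y[i + 1:]
        wrong += predict(params, X_tr, y_tr, X[i]) != y[i]
    return wrong / len(X)

def select_by_loo(predict, grid, X, y):
    """Return the grid entry with the smallest leave-one-out error (first wins ties)."""
    return min(grid, key=lambda p: loo_error(predict, p, X, y))

def knn_predict(k, X_tr, y_tr, x):
    """Toy stand-in classifier: Euclidean KNN with majority vote."""
    def sq_dist(j):
        return sum((a - b) ** 2 for a, b in zip(X_tr[j], x))
    nearest = sorted(range(len(X_tr)), key=sq_dist)[:k]
    return Counter(y_tr[j] for j in nearest).most_common(1)[0][0]

# Tiny one-dimensional toy training set with two well-separated classes.
X_toy = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y_toy = [0, 0, 0, 1, 1, 1]
best_k = select_by_loo(knn_predict, [5, 3], X_toy, y_toy)
```

Because each held-out sample is drawn from the training set only, the test samples never influence the selected parameter.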
Gene selection
In this paper, we employ the BW ratio used in [2, 10] to select genes. This ratio is essentially a Fisher discriminant measure. Given a gene j, the ratio on gene j is calculated as

$$BW(j)=\frac{\sum_{k=1}^{p}m_k\left(\overline{x}_k(j)-\overline{x}(j)\right)^2}{\sum_{k=1}^{p}\sum_{i\in C_k}\left(x_i(j)-\overline{x}_k(j)\right)^2}$$

where x_{i}(j) is the expression level of gene j in the ith training sample, C_{k} denotes the index set of the kth class (k = 1, 2, ..., p), m_{k} is the number of samples in C_{k} ($\sum_{k=1}^{p}m_k=m$), and $\overline{x}_k(j)$ and $\overline{x}(j)$ represent the average expression levels on gene j across the kth class and across all training samples, respectively.
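A minimal NumPy sketch of this selection step is given below, assuming the BW ratio is the usual between-class over within-class sum of squares computed gene by gene. The function names and the toy matrix are ours; genes with zero within-class variation would need separate handling.

```python
import numpy as np

def bw_ratio(X, y):
    """Between-class to within-class sum-of-squares ratio for every gene (column)."""
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xk = X[y == label]
        class_mean = Xk.mean(axis=0)
        between += len(Xk) * (class_mean - overall) ** 2
        within += ((Xk - class_mean) ** 2).sum(axis=0)
    return between / within

def select_genes(X, y, n_genes):
    """Indices of the n_genes columns with the largest BW ratio."""
    return np.argsort(bw_ratio(X, y))[::-1][:n_genes]

# Toy data: 4 samples, 2 genes; gene 0 separates the classes far better than gene 1.
X = np.array([[0.0, 1.0], [0.2, 3.0], [1.0, 2.0], [1.2, 4.0]])
y = np.array([0, 0, 1, 1])
ratios = bw_ratio(X, y)
top = select_genes(X, y, 1)
```

Only the training samples should be passed to `select_genes`, so that the test data play no part in gene selection.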
Gene selection usually has a strong impact on the performance of various classifiers, owing to the correlation between genes. Our experiments show that the impact can be considered in two respects: 1) with different numbers of genes, the performance of a classifier can be remarkably different. For example, the ULDA method usually works quite well when a large number of genes is used, but performs poorly when the gene number is small. By contrast, the DLDA classifier often reaches its best performance with a small number of features. 2) With different numbers of genes, the model parameters, especially for the nonlinear methods, need to be set differently to achieve better results.
The effect of the disturbed resampling
Because of the lack of sufficient training samples, kernel optimization-based classification may overfit when classifying gene expression data. To alleviate the possible overfitting, a strategy of disturbed resampling, as shown in Eq. (10), was adopted. In this section, we demonstrate that this strategy can effectively reduce the overfitting.
When there is a relatively large number of samples, the kernel optimization-based KNN classifier without disturbed resampling, denoted by KerNN0, usually works well on both the training and test data. Fig. 4 illustrates the performance of KNN, KerNN0, and KerNN on both the training and test data of the Prostate data set, which includes 102 samples. It can be seen that, compared with the KNN algorithm, both the KerNN0 and KerNN methods gain significant improvements, not only on the training data but also on the test data. However, when the sample size is relatively small, the KerNN0 algorithm may lead to serious overfitting. We choose the Breast-ER data set, which contains only 49 samples, to demonstrate this. Fig. 5 (a) shows the average error rates of the KNN, KerNN0, and KerNN algorithms on the training data, and Fig. 5 (b) presents the corresponding results on the test data. It can be seen that, although KerNN0 works quite well on the training data, its performance degrades remarkably on the test data. By contrast, no overfitting occurred for the KerNN scheme.
Conclusion
In this paper, a novel distance metric is developed and incorporated into a KNN scheme for cancer classification. This metric, derived from the optimization of a data-dependent kernel, can substantially increase the class separability of the data in the feature space and, hence, lead to a significant improvement in the performance of the KNN classifier. Furthermore, in combination with a disturbed resampling strategy, the kernel optimization-based K-nearest-neighbor scheme can achieve performance competitive with the fine-tuned SVM and the uncorrelated linear discriminant analysis (ULDA) scheme in classifying gene expression data. Experimental results show that the proposed scheme performs with more stability than the ULDA scheme, which works poorly in the case of small feature size, and the DLDA scheme, whose performance usually degrades in the case of a relatively large feature size.
Methods
0.1 Data-dependent kernel model
In this paper, we employ a special kernel function model, called the data-dependent kernel model, as the objective kernel to be optimized. Apparently, there is no benefit at all if we simply use a common kernel such as the Gaussian kernel or the polynomial kernel in the KNN scheme, since the distance ranking in the Hilbert space derived from the kernel function is the same as that in the input data space. However, when we adopt the data-dependent kernel, especially after the kernel is optimized, the distance metric can be appropriately modified so that the local relevance of the data is significantly improved.
Let {x_{i}, ζ_{i}} (i = 1, 2, ..., m) be m d-dimensional training samples of the given gene expression data, where the ζ_{i} represent the class labels of the samples. We define the data-dependent kernel as
k(x, y) = q(x)q(y)k_{0}(x, y) (1)
where x, y ∈ R^{d}, k_{0}(x, y), called the basic kernel, is an ordinary kernel such as a Gaussian or a polynomial kernel function, and q(·), the factor function, takes the form

$$q(x)=\alpha_0+\sum_{i=1}^{l}\alpha_i k_1(x,a_i)$$

in which $k_1(x,a_i)=e^{-\gamma_1\|x-a_i\|^2}$, the α_{i}'s are the combination coefficients, and the a_{i}'s (i = 1, 2, ..., l) denote the local centers of the training data.
Let the kernel matrices corresponding to k(x, y) and k_{0}(x, y) be K and K_{0}, respectively. Then K = [q(x_{i})q(x_{j})k_{0}(x_{i}, x_{j})]_{m × m} = QK_{0}Q, where Q is a diagonal matrix whose diagonal elements are q(x_{1}), q(x_{2}), ..., q(x_{m}). Denoting the vectors (q(x_{1}), q(x_{2}), ..., q(x_{m}))^{T} and (α_{0}, α_{1}, α_{2}, ..., α_{l})^{T} by q and α, respectively, we have q = K_{1}α, where K_{1} is the m × (l + 1) matrix

$$K_1=\begin{pmatrix}1 & k_1(x_1,a_1) & \cdots & k_1(x_1,a_l)\\ 1 & k_1(x_2,a_1) & \cdots & k_1(x_2,a_l)\\ \vdots & \vdots & \ddots & \vdots\\ 1 & k_1(x_m,a_1) & \cdots & k_1(x_m,a_l)\end{pmatrix}$$
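To make the matrix form concrete, the following sketch builds K_{0}, K_{1}, q and K = QK_{0}Q with NumPy. It assumes Gaussian basic and factor kernels; the function names are ours, and using the first few training points as local centers is purely an assumption for illustration, since the choice of centers is not specified here.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Matrix of exp(-gamma * ||a - b||^2) for all rows a of A and b of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def build_K1(X, centers, gamma1):
    """m x (l + 1) matrix: a column of ones followed by k1(x_i, a_j)."""
    return np.hstack([np.ones((len(X), 1)), gaussian_kernel(X, centers, gamma1)])

def data_dependent_kernel(K0, K1, alpha):
    """Return K = Q K0 Q and q, where Q = diag(q) and q = K1 @ alpha."""
    q = K1 @ alpha
    return q[:, None] * K0 * q[None, :], q

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # toy data: 6 samples, 4 features
centers = X[:3]                      # assumption: three training points act as local centers
K0 = gaussian_kernel(X, X, gamma=0.1)
K1 = build_K1(X, centers, gamma1=0.5)
alpha = np.array([1.0, 0.0, 0.0, 0.0])   # q(x) = 1 everywhere, so K equals K0
K, q = data_dependent_kernel(K0, K1, alpha)
```

With α = (1, 0, ..., 0)^{T} the data-dependent kernel reduces to the basic kernel, which is why this vector is a natural starting point for the optimization.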
0.2 Kernel optimization for binary-class data
We optimize the data-dependent kernel in Eq. (1) by optimizing the combination coefficient vector α, aiming to increase the class separability of the data in the feature space. A Fisher scalar measuring the class separability of the training data in the feature space is adopted as the criterion for our kernel optimization:

$$J=\frac{\mathrm{tr}(S_b)}{\mathrm{tr}(S_w)}$$

where S_{b} represents the "between-class scatter matrix" and S_{w} the "within-class scatter matrix".
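Suppose that the training data are grouped according to their class labels, i.e., the first m_{1} data belong to one class, and the remaining m_{2} data belong to the other class (m_{1} + m_{2} = m). Then the basic kernel matrix K_{0} can be partitioned as

$$K_0=\begin{pmatrix}K_{11}^{0} & K_{12}^{0}\\ K_{21}^{0} & K_{22}^{0}\end{pmatrix}$$

where the sizes of the submatrices $K_{11}^{0}$, $K_{12}^{0}$, $K_{21}^{0}$, and $K_{22}^{0}$ are m_{1} × m_{1}, m_{1} × m_{2}, m_{2} × m_{1}, and m_{2} × m_{2}, respectively. A close relation between the class separability measure J and the kernel matrices can be established [27]:

$$J(\alpha)=\frac{q^{T}B_0q}{q^{T}W_0q}=\frac{\alpha^{T}M_0\alpha}{\alpha^{T}N_0\alpha}$$

where M_{0} = $K_1^{T}$B_{0}K_{1}, N_{0} = $K_1^{T}$W_{0}K_{1}, in which

$$B_0=\begin{pmatrix}\frac{1}{m_1}K_{11}^{0} & 0\\ 0 & \frac{1}{m_2}K_{22}^{0}\end{pmatrix}-\frac{1}{m}K_0,\qquad W_0=\mathrm{diag}\left(k_{11}^{0},k_{22}^{0},\ldots,k_{mm}^{0}\right)-\begin{pmatrix}\frac{1}{m_1}K_{11}^{0} & 0\\ 0 & \frac{1}{m_2}K_{22}^{0}\end{pmatrix}$$

and $k_{ii}^{0}=k_0(x_i,x_i)$ denotes the ith diagonal entry of K_{0}.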
To avoid using the eigenvector solution, an updating algorithm based on the standard gradient approach is developed. This algorithm is summarized below, in which the learning rate η(n) is adopted in a gradually decreasing form as

$$\eta(n)=\eta_0\left(1-\frac{n}{N}\right)$$

where η_{0} represents an initial learning rate and N the total number of iterations.

1. Group the data according to their class labels. Calculate K_{0} and K_{1} first, then B_{0} and W_{0}, and then M_{0} and N_{0}.
2. Initialize α^{(0)} by the vector (1, 0, ..., 0)^{T}, and set n = 0.
3. Calculate q = K_{1}α^{(n)}, J_{1} = q^{T}B_{0}q, J_{2} = q^{T}W_{0}q, and J = J_{1}/J_{2}.
4. Update α^{(n)}:

$$\alpha^{(n+1)}=\alpha^{(n)}+\eta(n)\left(\frac{1}{J_2}M_0-\frac{J}{J_2}N_0\right)\alpha^{(n)}$$

and normalize α^{(n+1)} so that ‖α^{(n+1)}‖ = 1.
5. If n reaches a prespecified number N, stop. Otherwise, set n = n + 1 and go to step 3.
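The five steps above can be sketched in NumPy as follows. This is a sketch under stated assumptions rather than the authors' code: the block structure of B_{0} and W_{0}, the linearly decaying learning rate, and the gradient step proportional to (M_{0} − J N_{0})α / J_{2} follow the formulation of [27] as we read it, and the function names, toy data and choice of local centers are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Matrix of exp(-gamma * ||a - b||^2) for all rows a of A and b of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fisher_matrices(K0, m1):
    """B0 and W0 for data sorted so that the first m1 samples form one class."""
    m = len(K0)
    block = np.zeros_like(K0)
    block[:m1, :m1] = K0[:m1, :m1] / m1
    block[m1:, m1:] = K0[m1:, m1:] / (m - m1)
    B0 = block - K0 / m
    W0 = np.diag(np.diag(K0)) - block
    return B0, W0

def optimize_alpha(K0, K1, m1, eta0=0.01, n_iter=1000):
    """Gradient ascent on J(alpha) = (alpha' M0 alpha) / (alpha' N0 alpha)."""
    B0, W0 = fisher_matrices(K0, m1)                 # step 1
    M0, N0 = K1.T @ B0 @ K1, K1.T @ W0 @ K1
    alpha = np.zeros(K1.shape[1])                    # step 2
    alpha[0] = 1.0
    history = []
    for n in range(n_iter):
        q = K1 @ alpha                               # step 3
        J1, J2 = q @ B0 @ q, q @ W0 @ q
        J = J1 / J2
        history.append(J)
        eta = eta0 * (1.0 - n / n_iter)              # decaying learning rate
        alpha = alpha + eta * (M0 / J2 - J * N0 / J2) @ alpha   # step 4
        alpha /= np.linalg.norm(alpha)
    return alpha, history

rng = np.random.default_rng(0)
# Toy binary data: 10 samples per class, 5 features, class 2 shifted by 1.5.
X = np.vstack([rng.normal(0.0, 1.0, (10, 5)), rng.normal(1.5, 1.0, (10, 5))])
K0 = gaussian_kernel(X, X, 0.05)
centers = X[[0, 5, 10, 15]]          # assumption: a few training points as local centers
K1 = np.hstack([np.ones((20, 1)), gaussian_kernel(X, centers, 0.1)])
alpha, history = optimize_alpha(K0, K1, m1=10, n_iter=200)
```

Since J is a ratio of two quadratic forms in α, rescaling α leaves J unchanged, so the normalization in step 4 only keeps the iterates numerically bounded.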
0.3 Kernel optimization for multi-class data
In the case of multi-class data, we decompose the problem of kernel optimization into a series of binary-class kernel optimizations.
Let (x_{i}, ζ_{i}) ∈ R^{d} × ζ (i = 1, 2, ..., m) be the training data set containing p classes, that is, ζ = {1, 2, ..., p}. We assume the data to be grouped in order, that is, the first m_{1} data belong to the first class, the next m_{2} data belong to the second class, and so on, where $\sum_{i=1}^{p}m_i=m$. Then, the kernel matrix can be written as

$$K=\begin{pmatrix}K_{11} & K_{12} & \cdots & K_{1p}\\ K_{21} & K_{22} & \cdots & K_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ K_{p1} & K_{p2} & \cdots & K_{pp}\end{pmatrix}$$

where the submatrix K_{ij} is of size m_{i} × m_{j}, and K_{ii} represents the kernel matrix corresponding to the data in the ith class. The class separability of the ith and jth classes, denoted by J^{ij} (i, j = 1, 2, ..., p, i ≠ j), is calculated as

$$J^{ij}=\frac{\mathbf{1}^{T}B^{ij}\mathbf{1}}{\mathbf{1}^{T}W^{ij}\mathbf{1}}$$

where $\mathbf{1}$ is the all-ones vector of length m_{i} + m_{j}, and the between-class and within-class kernel scatter matrices B^{ij} and W^{ij} are defined as

$$B^{ij}=\begin{pmatrix}\frac{1}{m_i}K_{ii} & 0\\ 0 & \frac{1}{m_j}K_{jj}\end{pmatrix}-\frac{1}{m_i+m_j}\begin{pmatrix}K_{ii} & K_{ij}\\ K_{ji} & K_{jj}\end{pmatrix},\qquad W^{ij}=D^{ij}-\begin{pmatrix}\frac{1}{m_i}K_{ii} & 0\\ 0 & \frac{1}{m_j}K_{jj}\end{pmatrix}$$

in which D^{ij} denotes a diagonal matrix whose diagonal elements are composed of the diagonal entries of the matrices K_{ii} and K_{jj}. We also denote the between-class and within-class kernel matrices corresponding to the basic kernel by $B_0^{ij}$ and $W_0^{ij}$, respectively.
In each iteration of the updating algorithm, we first find the class pair (u, v) that corresponds to the minimum J^{ij} in the current step; the value of α is then updated so as to increase the class separability J^{uv} of the uth and vth classes. In other words, the objective of the kernel optimization becomes

$$\max_{\alpha}\;\min_{i\neq j}J^{ij}(\alpha)$$

It is easy to modify the kernel optimization algorithm from the binary-class case to the multi-class case. The detailed kernel optimization algorithm for multi-class data is summarized below, where Γ_{ij} denotes the union of the data index sets of the ith and jth classes, and q(Γ_{ij}) and K_{1}(Γ_{ij},:) represent submatrix extraction as in MATLAB.

1. Group the data according to their class labels. Calculate K_{0} and K_{1}.
2. Initialize α^{(0)} by the vector (1, 0, ..., 0)^{T}, and set n = 0.
3. Calculate q = K_{1}α^{(n)}, $J_1^{ij}$ = q(Γ_{ij})^{T}$B_0^{ij}$q(Γ_{ij}), $J_2^{ij}$ = q(Γ_{ij})^{T}$W_0^{ij}$q(Γ_{ij}), and $J^{ij}=J_1^{ij}/J_2^{ij}$, where i, j = 1, 2, ..., p, and i ≠ j.
4. Find $(u,v)=\arg\min_{i\neq j}J^{ij}(\alpha)$, and calculate $M_0^{uv}$ = K_{1}(Γ_{uv},:)^{T}$B_0^{uv}$K_{1}(Γ_{uv},:) and $N_0^{uv}$ = K_{1}(Γ_{uv},:)^{T}$W_0^{uv}$K_{1}(Γ_{uv},:).
5. Update α^{(n)}:

$$\alpha^{(n+1)}=\alpha^{(n)}+\eta(n)\left(\frac{1}{J_2^{uv}}M_0^{uv}-\frac{J^{uv}}{J_2^{uv}}N_0^{uv}\right)\alpha^{(n)}$$

and normalize α^{(n+1)} so that ‖α^{(n+1)}‖ = 1.
6. If n reaches a prespecified number N, stop. Otherwise, set n = n + 1 and go to step 3.
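A compact sketch of the multi-class procedure is given below. It rests on the same assumptions as a binary-class gradient update (pairwise matrices built from the basic kernel, a linearly decaying learning rate), uses unordered class pairs because J^{ij} = J^{ji}, and works from index arrays so the data need not be physically sorted. All function names and the toy data are hypothetical.

```python
import numpy as np
from itertools import combinations

def gaussian_kernel(A, B, gamma):
    """Matrix of exp(-gamma * ||a - b||^2) for all rows a of A and b of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def pair_matrices(K0, idx_i, idx_j):
    """Index set Gamma_ij with the pairwise matrices B0^{ij} and W0^{ij}."""
    idx = np.concatenate([idx_i, idx_j])
    K = K0[np.ix_(idx, idx)]
    mi, mj = len(idx_i), len(idx_j)
    block = np.zeros_like(K)
    block[:mi, :mi] = K[:mi, :mi] / mi
    block[mi:, mi:] = K[mi:, mi:] / mj
    return idx, block - K / (mi + mj), np.diag(np.diag(K)) - block

def optimize_alpha_multiclass(K0, K1, y, eta0=0.01, n_iter=1000):
    """At every step, improve the separability of the currently worst class pair."""
    pairs = {}
    for a, b in combinations(np.unique(y), 2):
        pairs[(a, b)] = pair_matrices(K0, np.flatnonzero(y == a), np.flatnonzero(y == b))
    alpha = np.zeros(K1.shape[1])
    alpha[0] = 1.0
    worst_J = []
    for n in range(n_iter):
        q = K1 @ alpha
        stats = {}
        for key, (idx, B, W) in pairs.items():
            J2 = q[idx] @ W @ q[idx]
            stats[key] = ((q[idx] @ B @ q[idx]) / J2, J2)
        key = min(stats, key=lambda k: stats[k][0])      # (u, v) = arg min J^{ij}
        J, J2 = stats[key]
        worst_J.append(J)
        idx, B, W = pairs[key]
        M, N = K1[idx].T @ B @ K1[idx], K1[idx].T @ W @ K1[idx]
        eta = eta0 * (1.0 - n / n_iter)
        alpha = alpha + eta * ((M - J * N) @ alpha) / J2
        alpha /= np.linalg.norm(alpha)
    return alpha, worst_J

rng = np.random.default_rng(1)
# Toy data: 3 classes of 8 samples each, 4 features, class means 0, 1.5 and 3.
X = np.vstack([rng.normal(c, 1.0, (8, 4)) for c in (0.0, 1.5, 3.0)])
y = np.repeat([0, 1, 2], 8)
K0 = gaussian_kernel(X, X, 0.05)
K1 = np.hstack([np.ones((24, 1)), gaussian_kernel(X, X[[0, 8, 16]], 0.1)])
alpha, worst_J = optimize_alpha_multiclass(K0, K1, y, n_iter=100)
```

Tracking `worst_J` over the iterations shows how the least separable class pair evolves, which is the quantity the max-min objective targets.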
0.4 KNN classification using the optimized kernel distance metric
Given two samples x, y ∈ R^{d}, the inner product in the feature space is defined as <x, y> = k(x, y); therefore, the derived distance can be calculated as
d(x, y) = <x, x> + <y, y> - 2<x, y> = k(x, x) + k(y, y) - 2k(x, y).
Using our data-dependent kernel model, the distance can be expressed as
d(x, y) = q^{2}(x) + q^{2}(y) - 2q(x)q(y)k_{0}(x, y) = [q(x) - q(y)]^{2} + 2q(x)q(y)(1 - k_{0}(x, y))
where we assume that the basic kernel function satisfies k_{0}(x, x) = 1, as the Gaussian function does.
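The distance and the final voting step can be sketched as below, assuming a basic kernel with k_{0}(x, x) = 1. The function names, the q values and the constant kernel matrix are made up for illustration; in practice q comes from K_{1}α with the optimized α, evaluated on both training and test samples.

```python
import numpy as np
from collections import Counter

def kernel_distance(q_x, q_y, k0_xy):
    """d(x, y) = q(x)^2 + q(y)^2 - 2 q(x) q(y) k0(x, y), assuming k0(x, x) = 1."""
    return q_x ** 2 + q_y ** 2 - 2.0 * q_x * q_y * k0_xy

def knn_predict(D, train_labels, k=3):
    """D[i, j] is the distance from test sample i to training sample j."""
    nearest = np.argsort(D, axis=1)[:, :k]
    return [Counter(train_labels[j] for j in row).most_common(1)[0][0] for row in nearest]

# Made-up factor values for 4 training and 2 test samples.
q_train = np.array([1.0, 1.0, 2.0, 2.0])
labels = ["A", "A", "B", "B"]
q_test = np.array([1.1, 1.9])
K0_test_train = np.full((2, 4), 0.9)     # made-up basic-kernel values k0(test, train)
D = kernel_distance(q_test[:, None], q_train[None, :], K0_test_train)
pred = knn_predict(D, labels, k=3)
```

The second form of the distance shows its two ingredients: a term driven by the difference of the factor function, and the basic-kernel distance rescaled by q(x)q(y).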
Since the kernel optimization scheme increases the class separability of the data in the feature space, the performances of kernel machines should be improved. However, for the classification of gene expression data, due to the small size of training samples, the kernel optimization, which performs on training data, may cause overfitting, which means the algorithm may work very well on the training data, but worse on the test data. To handle this problem, we adopted a disturbed resampling strategy to increase the sample size of the training data.
Suppose that {x_{i}, ζ_{i}} (i = 1, 2, ..., m) are the training data. We construct a new set of training data {y_{i}, ξ_{i}} (i = 1, 2, ..., 3m), where

$$y_i=\begin{cases}x_i, & 1\le i\le m\\ x_r+\epsilon, & m<i\le 3m\end{cases}\qquad(10)$$

in which x_{r} is a sample randomly selected from {x_{i}} with replacement and ε denotes a normal random disturbance, that is, ε ~ N(0, $\sigma_{\epsilon}^{2}$). The class labels are determined as

$$\xi_i=\begin{cases}\zeta_i, & 1\le i\le m\\ \zeta_r, & m<i\le 3m\end{cases}$$

Because of the very high dimensionality and the small number of patient samples, the training data are sparsely distributed in the high-dimensional Euclidean space. It is reasonable to assume that points near a training datum have the same class characteristic as that datum. Experimentally, using the technique of disturbed resampling (Eq. (10)), we can effectively reduce the possible overfitting and computational instability, which are mainly caused by the lack of sufficient training samples for the gene expression data.
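A minimal sketch of the disturbed resampling step follows. It assumes that the disturbance is drawn independently for every feature with standard deviation σ_{ε} and that each noisy copy inherits the label of the sample it was drawn from; the function name and toy data are ours.

```python
import numpy as np

def disturbed_resample(X, y, sigma, rng):
    """Return 3m samples: the m originals plus 2m noisy copies drawn with replacement."""
    m = len(X)
    picks = rng.integers(0, m, size=2 * m)            # indices r, sampled with replacement
    noise = rng.normal(0.0, sigma, size=(2 * m, X.shape[1]))
    X_new = np.vstack([X, X[picks] + noise])
    y_new = np.concatenate([y, y[picks]])             # a noisy copy keeps its source label
    return X_new, y_new

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                           # toy training set: 5 samples, 3 features
y = np.array([0, 0, 1, 1, 1])
X_aug, y_aug = disturbed_resample(X, y, sigma=0.3, rng=rng)
```

With `sigma=0` the augmented set contains only repeated originals, which corresponds to the first value in the σ_{ε} search grid.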
Availability
The core source codes of our algorithms are available at http://www.ittc.ku.edu/~xwchen/BMCbioinformatics/kernel/
Abbreviations
KNN: K-nearest-neighbor
SVM: support vector machine
DLDA: diagonal linear discriminant analysis
ULDA: uncorrelated linear discriminant analysis
KerNN: kernel optimization-based KNN
ALL: acute lymphoblastic leukemia
AML: acute myeloid leukemia
MLL: mixed lineage leukemia
CNS: embryonal tumor of central nervous system
References
 1.
BenDor A, Bruhn L, Friedman N, Nachman I, Schummer M, Yakhini Z: Tissue classification with gene expression profiles. J Computational Biology 2000, 7: 559–584. 10.1089/106652700750050943
 2.
Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination method for the classification of tumor using gene expression data. J Am Statistical Assoc 2002, 97: 77–87. 10.1198/016214502753479248
 3.
Golub TR, Slonim DK, Tamayo P, Huard C, Gassenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531–537. 10.1126/science.286.5439.531
 4.
Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR: Gene expression correlations of clinical prostate cancer behavior. Cancer Cell 2004, 1: 203–209. 10.1016/S15356108(02)000302
 5.
van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Martoon MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 419: 530–536. 10.1038/415530a
 6.
West B, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA Jr, Marks JR, Nevins JR: Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci USA 2001, 98: 11462–11467. 10.1073/pnas.201162998
 7.
Duda RO, Hart PE, Stork DG: Pattern Classification. 2nd edition. A WileyInterscience Publication; 2000.
 8.
Friedman JH: Flexible metric nearest neighbor classification. In Technical report. Dept. of Statistics, Stanford University; 1994.
 9.
Howland P, Park H: Generalizing discriminant dnalysis using the generalized singular value decomposition. IEEE Trans. on Pattern Analysis and Machine Intelligence 2004, 26: 995–1006. 10.1109/TPAMI.2004.46
 10.
Ye J, Li T, Xiong T, Janardan R: Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM Trans. on Computational Biology and Bioinformatics 2004, 1: 181–190. 10.1109/TCBB.2004.45
 11.
Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16: 906–914. 10.1093/bioinformatics/16.10.906
 12.
Jaakkola T, Diekhans M, Haussler D: Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology. Edited by: Menlo Park, CA. AAAI Press;
 13.
Zien A, Rätsch G, Mika S, Schölkopf B, Lemmen C, smola A, Lengauer T, Müller K: Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 2000, 16: 799–807. 10.1093/bioinformatics/16.9.799
 14.
Cawley GC: MATLAB support vector machine toolbox.University of East Anglia, School of Information Systems, Norwich, Norfolk, U.K. NR4 7TJ; 2000. [http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox]
 15.
Pochet N, Smet FD, Suykens JAK, Moor BLRD: Systematic benchmarking of microarray data classification: Assessing the role of nonlinearity and dimensionality reduction. Bioinformatics 2004, 20: 3185–3195. 10.1093/bioinformatics/bth383
 16.
Natsoulis G, Ghaoui LE, Lanckriet GRG, Tolley AM, Leroy F, Dunlea S, Eynon BP, Pearson CI, Tugendreich S, Jarnagin K: Classification of a large microarray data set: Algorithm comparison and analysis of drug signatures. Genome Res 2005, 15: 724–736. 10.1101/gr.2807605
 17.
Biomedical Data Analysis[http://sdmc.lit.org.sg/GEDatasets/]
 18.
Center for Applied Genomics and Technology[http://mgm.duke.edu/genome/dna_micro/work/]
 19.
Cancer Program Data Sets[http://www.broad.mit.edu/cgibin/cancer/datasets.cgi]
 20.
St. Jude Research [http://www.stjuderesearch.org/data/]
 21.
Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ: MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 2001, 30: 41–47. 10.1038/ng765
 22.
Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JYH, Goumnerova LC, Black PM, Lau C, Allen JC, Zagzag D, Olson JM, Curran T, Wetmore C, Biegel JA, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, Mesirov JP, Lander ES, Golub TR: Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature 2002, 415: 436–442.
 23.
Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays. Proc Natl Acad Sci USA 1999, 96: 6745–6750. 10.1073/pnas.96.12.6745
 24.
Gordon GJ, Jenson RV, Hsiao LL, Gullans SR, Blumenstock JE, Ramaswamy S, Richards WG, Sugarbaker DJ, Bueno R: Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research 2002, 62: 4936–4967.
 25.
Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RCT, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, Ray TS, Koval MA, Last KW, Norton A, Lister TA, Mesirov J, Neuberg DS, Lander ES, Aster JC, Golub TR: Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 2002, 8: 68–74. 10.1038/nm0102-68
 26.
Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA: Use of proteomic patterns in serum to identify ovarian cancer. The Lancet 2002, 359: 572–577. 10.1016/S0140-6736(02)07746-2
 27.
Xiong H, Swamy MNS, Ahmad MO: Optimizing the data-dependent kernel in the empirical feature space. IEEE Trans. on Neural Networks 2005, 16: 460–474. 10.1109/TNN.2004.841784
 28.
Ruiz A, Lopez-de-Teruel PE: Nonlinear kernel-based statistical pattern analysis. IEEE Trans. on Neural Networks 2001, 12: 16–32. 10.1109/72.896793
 29.
Baudat G, Anouar F: Generalized discriminant analysis using a kernel approach. Neural Computation 2000, 12: 2385–2404. 10.1162/089976600300014980
 30.
Cristianini N, Shawe-Taylor J: An Introduction to Support Vector Machines. Cambridge Univ. Press, Cambridge, UK; 2000.
 31.
Amari S, Wu S: Improving support vector machine classifiers by modifying kernel functions. Neural Networks 1999, 12: 783–789. 10.1016/S0893-6080(99)00032-5
 32.
Müller KR, Mika S, Rätsch G, Tsuda K, Schölkopf B: An introduction to kernel-based learning algorithms. IEEE Trans. on Neural Networks 2001, 12: 181–201. 10.1109/72.914517
 33.
Pekalska E, Paclik P, Duin RPW: A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research 2001, 2: 175–211. 10.1162/15324430260185592
 34.
Roth V, Steinhage V: Nonlinear discriminant analysis using kernel functions. In Advances in Neural Information Processing Systems 12. Edited by: Solla SA, Leen TK, Müller KR. Cambridge, MA: MIT Press; 2000:568–574.
 35.
Schölkopf B, Mika S, Burges CJC, Knirsch P, Müller KR, Rätsch G, Smola AJ: Input space versus feature space in kernel-based methods. IEEE Trans. on Neural Networks 1999, 10: 1000–1017. 10.1109/72.788641
 36.
Graf ABA, Smola AJ, Borer S: Classification in a normalized feature space using support vector machines. IEEE Trans. on Neural Networks 2003, 14: 597–605. 10.1109/TNN.2003.811708
Acknowledgements
This investigation was based upon work supported by the National Science Foundation under Grant No. EPS0236913, by matching support from the State of Kansas through Kansas Technology Enterprise Corporation, and by the University of Kansas General Research Fund allocations #2301770003 and #2301478003.
Additional information
Authors' contributions
HX and XWC conceived the study. HX designed and implemented the algorithms, and drafted the manuscript. XWC coordinated the study, participated in the algorithm design, and helped draft the manuscript.
About this article
Cite this article
Xiong, H., Chen, X. Kernel-based distance metric learning for microarray data classification. BMC Bioinformatics 7, 299 (2006). https://doi.org/10.1186/1471-2105-7-299
Keywords
 Support Vector Machine
 Acute Lymphoblastic Leukemia
 Malignant Pleural Mesothelioma
 Mixed Lineage Leukemia
 Class Separability