
Kernel-based distance metric learning for microarray data classification

Abstract

Background

The most fundamental task using gene expression data in clinical oncology is to classify tissue samples according to their gene expression levels. Compared with traditional pattern classifications, gene expression-based data classification is typically characterized by high dimensionality and small sample size, which make the task quite challenging.

Results

In this paper, we present a modified K-nearest-neighbor (KNN) scheme, which is based on learning an adaptive distance metric in the data space, for cancer classification using microarray data. The distance metric, derived from the procedure of a data-dependent kernel optimization, can substantially increase the class separability of the data and, consequently, lead to a significant improvement in the performance of the KNN classifier. Intensive experiments show that the performance of the proposed kernel-based KNN scheme is competitive with those of sophisticated classifiers such as support vector machines (SVMs) and the uncorrelated linear discriminant analysis (ULDA) in classifying gene expression data.

Conclusion

A novel distance metric is developed and incorporated into the KNN scheme for cancer classification. This metric can substantially increase the class separability of the data in the feature space and, hence, lead to a significant improvement in the performance of the KNN classifier.

Background

DNA microarray technology is designed to measure the expression levels of tens of thousands of genes simultaneously. As an important application of this technology, gene expression data are used to determine and predict the state of tissue samples, which has been shown to be very helpful in clinical oncology. The most fundamental task using gene expression data in clinical oncology is to classify tissue samples according to their gene expression levels. In combination with pattern classification techniques, gene expression data can provide a more reliable means of diagnosing and predicting various types of cancer than traditional clinical methods.

Compared with traditional pattern classification, gene expression-based data classification is typically characterized by high dimensionality and small sample size, which make the task quite challenging. In the literature, a number of methods have been applied or developed to classify microarray data [1–6], including K-nearest-neighbor (KNN), boosting, linear discriminant analysis (LDA), and support vector machines (SVM). We briefly review some of these approaches here.

K-Nearest-Neighbor (KNN)

The KNN method is a simple, yet useful approach to data classification. The error rate of KNN has been proven to be asymptotically at most twice the Bayes error rate [7]. However, its performance deteriorates dramatically when the input data set has relatively low local relevance [8]. The most important factor affecting the performance of KNN is the distance metric, so it is desirable to adopt an appropriate distance metric for the KNN algorithm. In practice, the Euclidean distance is usually used.

Diagonal Linear Discriminant Analysis (DLDA)

DLDA is the simplest case of the maximum likelihood discriminant rule, in which the class densities are assumed to have the same diagonal covariance matrix. In the special case of binary classification, the DLDA scheme can be viewed as the "weighted voting scheme" proposed by Golub et al. in [3]. The major advantage of the DLDA algorithm lies in its computational efficiency.

Linear Discriminant Analysis (LDA)

The classical LDA method aims to find the most discriminatory projection directions of the input data and classifies the data in the projected space. A major problem in employing the classical LDA algorithm for classifying gene expression data is that the so-called scatter matrices are always singular, due to the high dimensionality and relatively small sample size. The singularity makes the classical LDA algorithm inapplicable. In areas such as face recognition and text classification, the principal component analysis (PCA) technique is introduced as a preprocessing procedure to reduce the dimensionality of the input data. However, since the projection criterion of PCA is essentially different from that of LDA, losing discriminatory information in the PCA step is inevitable. A recent development in LDA is the generalized discriminant analysis [9, 10], in which a more delicate matrix technique, the generalized singular value decomposition (GSVD), is used to extend the classical LDA to a more general version.

Support Vector Machines (SVM)

SVM is recognized as one of the most powerful classifiers in various applications of pattern classification. For binary classification, SVM searches for a hyperplane that separates the two classes of data with the maximum margin. It has been shown that support vector machines perform well in many areas of computational biology [11–13]. In the experimental part of this paper, we follow [14] to implement the SVM algorithm.

Generally speaking, due to the high dimensionality and small sample size, linear classifiers such as linear discriminant analysis (LDA) and support vector machines (SVM) with linear kernels are often favored. However, based on some benchmark tests, researchers have shown that nonlinear classifiers are capable of exploiting the nonlinear discriminatory information in microarray data and usually produce more precise classification results [15, 16]. This is especially true when more patient samples are available or the data dimension is substantially reduced, since, in these cases, the linear separability of the microarray data could be considerably degraded.

Among the general algorithms of pattern classification, K-nearest-neighbor (KNN) is a simple yet useful one. However, in practice, the performance of the KNN algorithm is often inferior to those of sophisticated approaches such as SVM and generalized linear discriminant analysis (GLDA) [9, 10]. Since the distance metric is of great importance for the KNN scheme, an attractive way to improve the performance of KNN is to adopt a distance metric more adaptive to the input data than the Euclidean distance. In this paper, we propose to learn the adaptive distance metric via optimizing a data-dependent kernel. Experimental results show that, compared with the ordinary Euclidean distance-based KNN scheme, our kernel-based KNN algorithm, denoted KerNN, always achieves a significant improvement in classifying gene expression data. Moreover, the performance of the KerNN classifier is shown to be competitive with, if not better than, those of sophisticated classifiers, e.g., SVM and the uncorrelated linear discriminant analysis (ULDA) [10], in classifying microarray data.

Results

We conducted intensive experiments to compare the performance of our KerNN scheme with those of commonly used classification algorithms, i.e., KNN, DLDA [3], ULDA [10], and SVM. Ten publicly available microarray data sets were chosen to test our algorithms; their basic information is summarized below. Each data set is first normalized to zero mean and unit variance in every feature direction, and then randomly partitioned into two disjoint subsets with equal numbers of samples, one used as the training data and the other as the test data (a short preprocessing sketch follows the data set list). We consider only the Gaussian kernel function in both the proposed algorithm and SVM.

  1. ALL-AML Leukemia Data: This data set, taken from the website [17], contains 72 samples of human acute leukemia: 47 samples belong to acute lymphoblastic leukemia (ALL) and the remaining 25 to acute myeloid leukemia (AML). Each sample gives the expression levels of 7129 genes. For detailed information, one can refer to [3].

  2. ALL-MLL-AML Leukemia Data: This leukemia microarray data set is available on the website [17]. It includes 72 human leukemia samples: 24 belong to acute lymphoblastic leukemia (ALL), 20 to mixed lineage leukemia (MLL), a subset of human acute leukemia with a chromosomal translocation, and 28 to acute myelogenous leukemia (AML). Each sample gives the expression levels of 12582 genes. Further information about this data set can be found in [21].

  3. Embryonal Tumors of the Central Nervous System (CNS): This data set, available at the website [17], contains 60 patient samples: 21 are survivors of treatment and 39 are treatment failures. There are 7129 genes in the data set. One can refer to [22] for more information about this data set.

  4. Breast Cancer Data: The data are available on the website [18]. The expression matrix monitors 7129 genes in 49 breast tumor samples. There are two response variables, describing the estrogen receptor (ER) status and the lymph node (LN) status, respectively. For the ER status, 25 samples are ER+ and the remaining 24 are ER-; for the LN status, there are 25 positive samples and 24 negative samples. Detailed information about this data set can be found in [6].

  5. Colon Tumor Data: This data set is taken from the website [17]. It contains 62 samples collected from colon-cancer patients: 40 samples are from tumors, and 22 are normal biopsies taken from healthy parts of the colons of the same patients. The expression levels of 2000 selected genes are measured. One can refer to [23] for details.

  6. Lung Cancer Data: This data set is taken from the website [17]. It contains 181 tissue samples, which are classified into two classes: malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA). Each sample is described by 12533 genes. More information about this data set can be found in [24].

  7. Lymphoma Data: The data are available on the website [19]. This data set contains 77 tissue samples: 58 are diffuse large B-cell lymphomas (DLBCL) and the remaining 19 are follicular lymphomas (FL). Each sample is represented by the expression levels of 7129 genes. Detailed information about this data set can be found in [25].

  8. Ovarian Cancer Data: This data set, available on the website [17], is used to distinguish ovarian cancer from non-cancer. It contains 253 samples, and each sample has 15154 features. More details can be found in [26].

  9. Prostate Cancer Data: This data set, taken from the website [19], contains the gene expression levels of 12600 genes for 52 prostate tumor samples and 50 normal prostate samples. One can refer to [4] for details about this data set.

  10. Subtypes of Acute Lymphoblastic Leukemia: This data set, available on the website [20], contains 6 subtypes of pediatric acute lymphoblastic leukemia, corresponding to six diagnostic groups: BCR-ABL, E2A-PBX1, MLL, T-ALL, TEL-AML1, and Hyperdiploid>50. Each sample contains 12625 genes.
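To make the preprocessing described above concrete, the following minimal sketch (in Python with numpy; the function and variable names are ours, not from the paper) standardizes every gene to zero mean and unit variance and then makes a random 50/50 training/test split.

```python
import numpy as np

def standardize_and_split(X, y, seed=0):
    """Scale every gene to zero mean and unit variance, then split the samples
    into two random, equally sized subsets (training and test).

    X: (samples, genes) numpy array; y: (samples,) numpy array of labels."""
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)   # per-gene standardization
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    tr, te = idx[:half], idx[half:]
    return X[tr], y[tr], X[te], y[te]
```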

Comparisons in terms of the best results

For each data set, we chose the N_f most discriminatory genes, where N_f = 10, 20, 40, 60, 80, 100, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000, respectively; repeated the experiment 100 times at each value of N_f; and then calculated the average test error rates and their standard deviations over the 100 experiments. Table 1 lists the best results, i.e., the smallest average test error rates, of the different algorithms. It can be seen that the proposed KerNN algorithm achieves the best results (shown in bold face) on four data sets. On the other data sets, the performance of the KerNN algorithm is still competitive with, if not better than, those of the SVM and ULDA schemes.

Table 1 Comparison of the classifiers in terms of the best results. The comparison of all the classifiers in terms of the best results of the average test error rates (%). For each data set, we chose the N_f most discriminatory genes, where N_f = 10, 20, 40, 60, 80, 100, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000, respectively; repeated the experiment 100 times at each value of N_f; and then calculated the average test error rates and their standard deviations over the 100 experiments. In the comparison, we assign a classifier a score of 1 if it achieves the best result on a data set, 2 if it achieves the next best result, and so on. The average score roughly evaluates the global performance of a classifier on these data sets.

In Table 1, if we assign a score of 1 to the best result, 2 to the next best result, and so on, then the global performance of a classifier can be roughly evaluated in terms of the average score. We show the average scores of the five classifiers in Table 1. It can be seen that the proposed KerNN scheme achieves the lowest average score among the five classifiers.

Comparisons under different gene numbers

To investigate the stability of the five classification algorithms, we compared their performances when different numbers of genes were selected. The experimental results are shown in Fig. 1 for the ALL-AML data, Fig. 2 for the Colon data, and Fig. 3 for the Prostate data, where the horizontal axis is the number of selected genes and the vertical axis is the average test error rate of the classifiers over 100 experiments. Fig. 1 (a), Fig. 2 (a), and Fig. 3 (a) illustrate the results when a relatively small number of features is chosen (from 10 to 100), while Fig. 1 (b), Fig. 2 (b), and Fig. 3 (b) show the corresponding results when more genes are chosen (from 200 to 2000). It can be seen that the proposed KerNN scheme performs favorably in most cases. Compared with the ULDA scheme, which always performs poorly for small feature sizes, and the DLDA algorithm, whose performance usually degrades for relatively large feature sizes, our KerNN algorithm works more stably across different feature numbers. More importantly, compared with the ordinary KNN classifier, the kernel optimization-based KNN classifier always gains significant improvements, which implies that the procedure of kernel optimization induces a distance metric that adapts better to the gene expression data than the Euclidean metric.

Figure 1: Test error rates for the ALL-AML data set. Stability comparison on the ALL-AML data. The average test error rate (%) as a function of the number of selected genes.

Figure 2: Test error rates for the Colon data set. Stability comparison on the Colon data. The average test error rate (%) as a function of the number of selected genes.

Figure 3: Test error rates for the Prostate data set. Stability comparison on the Prostate data. The average test error rate (%) as a function of the number of selected genes.

Discussion

Parameter tuning

In the experiments, for KNN, ULDA, and the proposed algorithm, the final classification is done via the K-nearest-neighbor rule with K = 3. For the KNN, ULDA, and DLDA algorithms, the only parameter is the number of selected genes N_f. For SVM, in addition to the gene number, two parameters, the γ in the Gaussian kernel function and the regularization constant C, need to be set in advance. The KerNN algorithm has more parameters. To avoid the intensive computation of tuning all of them by cross validation, we chose the N_f most discriminatory genes, where N_f = 10, 20, 40, 60, 80, 100, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000, and report the best performance for each method in Table 1. For our kernel optimization method, the initial learning rate η_0 and the total iteration number N are always set to 0.01 and 1000, respectively. Furthermore, for the sake of computational simplicity, we empirically set the two Gaussian parameters in the proposed method to $\gamma_0 = 10^{-5}/\sqrt{N_f}$ and $\gamma_1 = 10^{-2}/\sqrt{N_f}$, rather than tuning them by cross validation. These may not be the optimal settings for γ_0 and γ_1, but they avoid high computational complexity; even better results could be expected if they were chosen by cross validation. Therefore, for the KerNN method, only one parameter, σ_ε, the standard deviation of the disturbance added to the data in Eq. (10), needs to be tuned. For the SVM, two parameters are tuned by cross validation.

In the experiments, we employed the leave-one-out technique on the training data to choose these parameters. We followed [14] to implement the SVM algorithm, in which the parameter C is chosen from {1.0e+00, 1.0e+01, 1.0e+02, 1.0e+03, 1.0e+04, 1.0e+05, 1.0e+06, 1.0e+07} and γ from {1.0e-07, 5.0e-07, 1.0e-06, 5.0e-06, 1.0e-05, 5.0e-05, 1.0e-04, 5.0e-04, 1.0e-03, 5.0e-03, 1.0e-02} using leave-one-out cross validation. For our KerNN algorithm, the parameter σ_ε is selected from {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. Note that only the training samples were used for setting the parameters; the test samples are independent of this process.
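A minimal sketch of this leave-one-out selection is given below, assuming numpy arrays and a generic `fit_predict` callable standing in for whichever classifier is being tuned; `kernn_fit_predict` in the usage comment is a hypothetical helper, not a function from the paper.

```python
import numpy as np

def loo_error(fit_predict, X, y, **params):
    """Leave-one-out error rate on the training set.

    fit_predict(X_train, y_train, X_test, **params) -> predicted labels."""
    m = len(y)
    wrong = 0
    for i in range(m):
        mask = np.arange(m) != i                      # hold out sample i
        pred = fit_predict(X[mask], y[mask], X[i:i + 1], **params)
        wrong += int(pred[0] != y[i])
    return wrong / m

# e.g., choosing sigma_eps for KerNN from the grid used in the paper:
# grid = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
# best = min(grid, key=lambda s: loo_error(kernn_fit_predict, Xtr, ytr, sigma_eps=s))
```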

Gene selection

In this paper, we employ the BW ratio used in [2, 10] to select genes. This ratio is essentially a Fisher discriminant measure. Given a gene j, the ratio is calculated as

$$g(j) = \frac{\sum_{k=1}^{p} m_k \left(\bar{x}_k(j) - \bar{x}(j)\right)^2}{\sum_{k=1}^{p} \sum_{i \in C_k} \left(x_i(j) - \bar{x}_k(j)\right)^2}$$

where C_k denotes the index set of the k-th class (k = 1, 2, ..., p), m_k is the number of samples in C_k ($\sum_{k=1}^{p} m_k = m$), and $\bar{x}_k(j)$ and $\bar{x}(j)$ represent the average expression levels of gene j across the k-th class and across all training samples, respectively.
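A minimal numpy sketch of this BW-ratio gene ranking (assuming an m × d expression matrix `X` and an integer label vector `y`; the names are ours) could look as follows.

```python
import numpy as np

def bw_ratio(X, y):
    """Between-to-within sum-of-squares ratio g(j) for each gene j.

    X: (m, d) expression matrix, y: (m,) class labels.
    Larger values indicate more discriminatory genes."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        between += len(Xc) * (class_mean - overall_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    return between / (within + 1e-12)

# select the N_f top-ranked genes:
# top_genes = np.argsort(-bw_ratio(X_train, y_train))[:N_f]
```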

Gene selection usually has a strong impact on the performance of the various classifiers, due to the correlation between genes. Our experiments show that this impact manifests in two ways: 1) with different numbers of genes, the performance of a classifier can be remarkably different. For example, the ULDA method usually works quite well when a large number of genes is used, but performs poorly when the gene number is small; conversely, the DLDA classifier often reaches its best performance with a small number of features. 2) with different numbers of genes, the model parameters, especially for the nonlinear methods, need to be set differently to achieve better results.

The effect of the disturbed resampling

Due to the lack of enough training samples, kernel optimization-based classification may lead to overfitting when classifying gene expression data. To alleviate the possible overfitting, a strategy of disturbed resampling, as shown in Eq. (10), was adopted. In this section, we demonstrate that using this strategy, the overfitting can be effectively reduced.

When there is a relatively large number of samples, the kernel optimization-based KNN classifier without the strategy of disturbed resampling, denoted by KerNN0, usually works well on both the training and test data. Fig. 4 illustrates the performances of KNN, KerNN0, and KerNN on both the training and test data of the Prostate data set, which includes 102 samples. It can be seen that, compared with the KNN algorithm, both KerNN0 and KerNN gain significant improvements, not only on the training data, but also on the test data. However, when the sample size is relatively small, the KerNN0 algorithm may lead to serious overfitting. We choose the Breast-ER data set, which contains only 49 samples, to demonstrate this argument. Fig. 5 (a) shows the average error rates of the KNN, KerNN0, and KerNN algorithms on the training data, and Fig. 5 (b) presents the corresponding results on the test data. It can be seen that, although KerNN0 works quite well on the training data, its performance degrades remarkably on the test data. In contrast, no overfitting is observed for the KerNN scheme.

Figure 4: The effect of the disturbed resampling on Prostate. The effect of adopting the technique of disturbed resampling on a relatively large data set, Prostate, which contains 102 samples. (a) Results on the training data. (b) Results on the test data.

Figure 5: The effect of the disturbed resampling on Breast-ER. The effect of adopting the technique of disturbed resampling on a relatively small data set, Breast-ER, which contains only 49 samples. (a) Results on the training data. (b) Results on the test data.

Conclusion

In this paper, a novel distance metric is developed and incorporated into a KNN scheme for cancer classification. This metric, derived from the procedure of a data-dependent kernel optimization, can substantially increase the class separability of the data in the feature space and, hence, lead to a significant improvement in the performance of the KNN classifier. Furthermore, in combination with a disturbed resampling strategy, the kernel optimization-based K-nearest-neighbor scheme achieves performance competitive with the finely tuned SVM and the uncorrelated linear discriminant analysis (ULDA) schemes in classifying gene expression data. Experimental results show that the proposed scheme performs more stably than the ULDA scheme, which works poorly in the case of small feature sizes, and the DLDA scheme, whose performance usually degrades for relatively large feature sizes.

Methods

0.1 Data-dependent kernel model

In this paper, we employ a special kernel function model, called the data-dependent kernel model, as the objective kernel to be optimized. There is no benefit at all if we simply use a common kernel, such as the Gaussian kernel or the polynomial kernel, in the KNN scheme, since the distance ranking in the Hilbert space derived from such a kernel function is the same as that in the input data space. However, when we adopt the data-dependent kernel, especially after the kernel is optimized, the distance metric can be appropriately modified so that the local relevance of the data is significantly improved.

Let {x_i, ζ_i} (i = 1, 2, ..., m) be the m d-dimensional training samples of the given gene expression data, where ζ_i represents the class label of the i-th sample. We define the data-dependent kernel as

$$k(x, y) = q(x)\, q(y)\, k_0(x, y) \qquad (1)$$

where x, y ∈ R^d; k_0(x, y), called the basic kernel, is an ordinary kernel such as a Gaussian or polynomial kernel function; and q(·), the factor function, takes the form

$$q(x) = \alpha_0 + \sum_{i=1}^{l} \alpha_i\, k_1(x, a_i) \qquad (2)$$

in which $k_1(x, a_i) = e^{-\gamma_1 \|x - a_i\|^2}$, the $\alpha_i$ are the combination coefficients, and the $a_i$ denote the local centers of the training data.

Let the kernel matrices corresponding to k(x, y) and k_0(x, y) be K and K_0, respectively. Obviously, K = [q(x_i) q(x_j) k_0(x_i, x_j)]_{m×m} = Q K_0 Q, where Q is a diagonal matrix whose diagonal elements are q(x_1), q(x_2), ..., q(x_m). Denoting the vectors (q(x_1), q(x_2), ..., q(x_m))^T and (α_0, α_1, α_2, ..., α_l)^T by q and α, respectively, we have q = K_1 α, where K_1 is the m × (l + 1) matrix

$$K_1 = \begin{pmatrix} 1 & k_1(x_1, a_1) & \cdots & k_1(x_1, a_l) \\ 1 & k_1(x_2, a_1) & \cdots & k_1(x_2, a_l) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & k_1(x_m, a_1) & \cdots & k_1(x_m, a_l) \end{pmatrix} \qquad (3)$$
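The data-dependent kernel of Eqs. (1)–(3) can be assembled directly from these definitions. The sketch below assumes a Gaussian basic kernel and a set of local centers `centers` (e.g., a subset of the training points); the function names are ours, not from the paper.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2), evaluated for all pairs of rows."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def data_dependent_kernel(X, centers, alpha, gamma0, gamma1):
    """K = Q K0 Q with q = K1 @ alpha, following Eqs. (1)-(3)."""
    K0 = gaussian_kernel(X, X, gamma0)                      # basic kernel matrix
    K1 = np.hstack([np.ones((len(X), 1)),
                    gaussian_kernel(X, centers, gamma1)])   # m x (l+1), Eq. (3)
    q = K1 @ alpha                                          # factor values q(x_i)
    return q[:, None] * K0 * q[None, :], q                  # Q K0 Q and q
```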

0.2 Kernel optimization for binary-class data

We optimize the data-dependent kernel in Eq. (1) by optimizing the combination coefficient vector α, aiming to increase the class separability of the data in the feature space. A Fisher scalar measuring the class separability of the training data in the feature space is adopted as the criterion for our kernel optimization:

$$J = \frac{\mathrm{tr}(S_b)}{\mathrm{tr}(S_w)} \qquad (4)$$

where S_b represents the "between-class scatter matrix" and S_w the "within-class scatter matrix".

Suppose that the training data are grouped according to their class labels, i.e., the first m_1 samples belong to one class and the remaining m_2 samples to the other (m_1 + m_2 = m). Then the basic kernel matrix K_0 can be partitioned as

$$K_0 = \begin{pmatrix} K_{11}^0 & K_{12}^0 \\ K_{21}^0 & K_{22}^0 \end{pmatrix} \qquad (5)$$

where the submatrices $K_{11}^0$, $K_{12}^0$, $K_{21}^0$, and $K_{22}^0$ are of size m_1 × m_1, m_1 × m_2, m_2 × m_1, and m_2 × m_2, respectively. A close relation between the class separability measure J and the kernel matrices can be established [27]:

$$J(\alpha) = \frac{\alpha^T M_0\, \alpha}{\alpha^T N_0\, \alpha} \qquad (6)$$

where $M_0 = K_1^T B_0 K_1$ and $N_0 = K_1^T W_0 K_1$, in which

$$B_0 = \begin{pmatrix} \frac{1}{m_1} K_{11}^0 & 0 \\ 0 & \frac{1}{m_2} K_{22}^0 \end{pmatrix} - \frac{1}{m} K_0$$
$$W_0 = \mathrm{diag}\left(k_{11}^0,\, k_{22}^0,\, \ldots,\, k_{mm}^0\right) - \begin{pmatrix} \frac{1}{m_1} K_{11}^0 & 0 \\ 0 & \frac{1}{m_2} K_{22}^0 \end{pmatrix}$$
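For illustration, the following sketch builds B_0 and W_0 from a basic kernel matrix whose samples are ordered class 1 then class 2, and evaluates J(α); it is a minimal numpy rendering of these definitions and of Eq. (6), with names of our own choosing.

```python
import numpy as np

def scatter_matrices(K0, m1):
    """B0 and W0 for binary data ordered class 1 (m1 samples) then class 2."""
    m = K0.shape[0]
    m2 = m - m1
    block = np.zeros_like(K0)
    block[:m1, :m1] = K0[:m1, :m1] / m1          # (1/m1) K0_11
    block[m1:, m1:] = K0[m1:, m1:] / m2          # (1/m2) K0_22
    B0 = block - K0 / m
    W0 = np.diag(np.diag(K0)) - block
    return B0, W0

def separability(alpha, K1, B0, W0):
    """J(alpha) of Eq. (6); equals q^T B0 q / q^T W0 q with q = K1 alpha."""
    q = K1 @ alpha
    return (q @ B0 @ q) / (q @ W0 @ q)
```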

To avoid computing an eigenvector solution, an updating algorithm based on the standard gradient approach is developed. This algorithm is summarized below (a short code sketch follows the listed steps), in which the learning rate η(n) is decreased gradually as

$$\eta(n) = \eta_0 \left(1 - \frac{n}{N}\right) \qquad (7)$$

where η0 represents an initial learning rate.

  1. Group the data according to their class labels. Calculate K_0 and K_1 first, then B_0 and W_0, and then M_0 and N_0.

  2. Initialize α^(0) with the vector (1, 0, ..., 0)^T, and set n = 0.

  3. Calculate q = K_1 α^(n), J_1 = q^T B_0 q, J_2 = q^T W_0 q, and J = J_1/J_2.

  4. Update α^(n):

$$\alpha^{(n+1)} = \alpha^{(n)} + \eta(n) \left(\frac{1}{J_2} M_0 - \frac{J}{J_2} N_0\right) \alpha^{(n)}$$

and normalize α^(n+1) so that ||α^(n+1)|| = 1.

  5. If n reaches the pre-specified number N, stop. Otherwise, set n = n + 1 and go to step 3.
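A compact sketch of this update loop is given below, using the update direction written in step 4 and the learning-rate schedule of Eq. (7); it assumes the matrices K_1, B_0, and W_0 built as above, and the names are ours.

```python
import numpy as np

def optimize_alpha(K1, B0, W0, eta0=0.01, n_iter=1000):
    """Gradient updates of the coefficient vector alpha (steps 1-5 above)."""
    M0 = K1.T @ B0 @ K1
    N0 = K1.T @ W0 @ K1
    alpha = np.zeros(K1.shape[1])
    alpha[0] = 1.0                                # alpha(0) = (1, 0, ..., 0)^T
    for n in range(n_iter):
        eta = eta0 * (1.0 - n / n_iter)           # eta(n) = eta0 (1 - n/N), Eq. (7)
        q = K1 @ alpha
        J1, J2 = q @ B0 @ q, q @ W0 @ q
        J = J1 / J2
        alpha = alpha + eta * ((M0 - J * N0) @ alpha) / J2   # step-4 direction
        alpha /= np.linalg.norm(alpha)            # keep ||alpha|| = 1
    return alpha
```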

0.3 Kernel optimization for multi-class data

In the case of multi-class data, we decompose the problem of kernel optimization into a series of binary-class kernel optimizations.

Let (x_i, ζ_i) ∈ R^d × ζ (i = 1, 2, ..., m) be the training data set containing p classes, that is, ζ = {1, 2, ..., p}. We assume the data are grouped in order: the first m_1 samples belong to the first class, the next m_2 to the second class, and so on, where $\sum_{i=1}^{p} m_i = m$. The kernel matrix can then be written as

$$K = \begin{pmatrix} K_{11} & K_{12} & \cdots & K_{1p} \\ K_{21} & K_{22} & \cdots & K_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ K_{p1} & K_{p2} & \cdots & K_{pp} \end{pmatrix} \qquad (8)$$

where the submatrix $K_{ij}$ is of size m_i × m_j, and $K_{ii}$ is the kernel matrix corresponding to the data in the i-th class. The class separability of the i-th and j-th classes, denoted by $J^{ij}$ (i, j = 1, 2, ..., p, i ≠ j), is calculated as

$$J^{ij}(\alpha) = \frac{J_1^{ij}}{J_2^{ij}} = \frac{\mathbf{1}_{m_i+m_j}^T\, B^{ij}\, \mathbf{1}_{m_i+m_j}}{\mathbf{1}_{m_i+m_j}^T\, W^{ij}\, \mathbf{1}_{m_i+m_j}} \qquad (9)$$

where $\mathbf{1}_{m_i+m_j}$ denotes the all-ones vector of length m_i + m_j, and the between-class and within-class kernel scatter matrices $B^{ij}$ and $W^{ij}$ are defined as

$$B^{ij} = \begin{pmatrix} \frac{1}{m_i} K_{ii} & 0 \\ 0 & \frac{1}{m_j} K_{jj} \end{pmatrix} - \frac{1}{m_i + m_j} \begin{pmatrix} K_{ii} & K_{ij} \\ K_{ji} & K_{jj} \end{pmatrix}$$
$$W^{ij} = D^{ij} - \begin{pmatrix} \frac{1}{m_i} K_{ii} & 0 \\ 0 & \frac{1}{m_j} K_{jj} \end{pmatrix}$$

in which $D^{ij}$ denotes a diagonal matrix whose diagonal elements are the diagonal entries of $K_{ii}$ and $K_{jj}$. We denote the between-class and within-class kernel scatter matrices corresponding to the basic kernel by $B_0^{ij}$ and $W_0^{ij}$, respectively.

In each iteration of the updating algorithm, we first find the class pair (u, v) with the minimum $J^{ij}$ at the current step; the value of α is then updated so that the class separability $J^{uv}$ of the u-th and v-th classes is maximized. In other words, the objective of the kernel optimization becomes maximizing $J^{uv}(\alpha)$ with respect to α.

It is straightforward to modify the kernel optimization algorithm from the binary-class case to the multi-class case. The detailed kernel optimization algorithm for a multi-class data set is summarized below (a sketch of a single iteration follows the listed steps), where $\Gamma_{ij}$ denotes the union of the data index sets of the i-th and j-th classes, and $q(\Gamma_{ij})$ and $K_1(\Gamma_{ij},:)$ represent submatrix extraction as in MATLAB.

  1. Group the data according to their class labels. Calculate K_0 and K_1.

  2. Initialize α^(0) with the vector (1, 0, ..., 0)^T, and set n = 0.

  3. Calculate q = K_1 α^(n), $J_1^{ij} = q(\Gamma_{ij})^T B_0^{ij}\, q(\Gamma_{ij})$, $J_2^{ij} = q(\Gamma_{ij})^T W_0^{ij}\, q(\Gamma_{ij})$, and $J^{ij} = J_1^{ij}/J_2^{ij}$, for i, j = 1, 2, ..., p with i ≠ j.

  4. Find $(u, v) = \arg\min_{ij} J^{ij}(\alpha)$, and calculate $M_0^{uv} = K_1(\Gamma_{uv},:)^T B_0^{uv}\, K_1(\Gamma_{uv},:)$ and $N_0^{uv} = K_1(\Gamma_{uv},:)^T W_0^{uv}\, K_1(\Gamma_{uv},:)$.

  5. Update α^(n):

$$\alpha^{(n+1)} = \alpha^{(n)} + \eta(n) \left(\frac{1}{J_2^{uv}} M_0^{uv} - \frac{J^{uv}}{J_2^{uv}} N_0^{uv}\right) \alpha^{(n)}$$

and normalize α^(n+1) so that ||α^(n+1)|| = 1.

  6. If n reaches a pre-specified number N, stop. Otherwise, set n = n + 1 and go to step 3.
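The sketch below illustrates one iteration of this multi-class scheme, assuming the pairwise basic-kernel scatter matrices $B_0^{ij}$ and $W_0^{ij}$ have been precomputed and stored in dictionaries keyed by class pairs; the data structures and names are our own illustration, not code from the paper.

```python
import numpy as np

def multiclass_step(alpha, K1, groups, B0_pair, W0_pair, eta):
    """One iteration (steps 3-5 above): locate the least separable class pair
    (u, v) and update alpha to increase J^{uv}.

    groups[i]       : index array of the samples in class i
    B0_pair[(i, j)] : between-class scatter matrix of the (i, j) pair (basic kernel)
    W0_pair[(i, j)] : within-class scatter matrix of the (i, j) pair (basic kernel)"""
    q = K1 @ alpha

    def J(pair):
        idx = np.concatenate([groups[pair[0]], groups[pair[1]]])
        qs = q[idx]
        return (qs @ B0_pair[pair] @ qs) / (qs @ W0_pair[pair] @ qs)

    u, v = min(B0_pair, key=J)                    # worst-separated pair (u, v)
    idx = np.concatenate([groups[u], groups[v]])
    qs = q[idx]
    J1 = qs @ B0_pair[(u, v)] @ qs
    J2 = qs @ W0_pair[(u, v)] @ qs
    M0 = K1[idx].T @ B0_pair[(u, v)] @ K1[idx]
    N0 = K1[idx].T @ W0_pair[(u, v)] @ K1[idx]
    alpha = alpha + eta * ((M0 - (J1 / J2) * N0) @ alpha) / J2   # step-5 update
    return alpha / np.linalg.norm(alpha)
```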

0.4 KNN classification using the optimized kernel distance metric

Given two samples x, y ∈ R^d, the kernel defines an inner product in the feature space, ⟨x, y⟩ = k(x, y); therefore, the derived (squared) distance can be calculated as

$$d(x, y) = \langle x, x\rangle + \langle y, y\rangle - 2\langle x, y\rangle = k(x, x) + k(y, y) - 2k(x, y).$$

Using our data-dependent kernel model, the distance can be expressed as

$$d(x, y) = q^2(x) + q^2(y) - 2q(x)q(y)k_0(x, y) = \left[q(x) - q(y)\right]^2 + 2q(x)q(y)\left(1 - k_0(x, y)\right)$$

where we assume that the basic kernel function satisfies k_0(x, x) = 1, as the Gaussian kernel does.
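Putting the pieces together, the following sketch evaluates the optimized-kernel distance and uses it in a 3-nearest-neighbor vote; here `q_x` and `q_train` are the factor values q(·) obtained from Eq. (2) with the learned α, and all names are ours.

```python
import numpy as np
from collections import Counter

def kernel_distance(x, y, q_x, q_y, gamma0):
    """Squared distance induced by the optimized kernel:
       [q(x) - q(y)]^2 + 2 q(x) q(y) (1 - k0(x, y)), with a Gaussian basic kernel."""
    k0 = np.exp(-gamma0 * np.sum((x - y) ** 2))
    return (q_x - q_y) ** 2 + 2.0 * q_x * q_y * (1.0 - k0)

def knn_predict(x, q_x, X_train, q_train, y_train, gamma0, k=3):
    """K-nearest-neighbor vote (K = 3) under the learned metric."""
    d = np.array([kernel_distance(x, xi, q_x, qi, gamma0)
                  for xi, qi in zip(X_train, q_train)])
    nearest = np.argsort(d)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```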

Since the kernel optimization scheme increases the class separability of the data in the feature space, the performance of kernel machines should be improved. However, for the classification of gene expression data, due to the small number of training samples, the kernel optimization, which is performed on the training data, may cause overfitting: the algorithm may work very well on the training data but much worse on the test data. To handle this problem, we adopt a disturbed resampling strategy to increase the sample size of the training data.

Suppose that {x_i, ζ_i} (i = 1, 2, ..., m) are the training data; we construct a new set of training data {y_i, ξ_i} (i = 1, 2, ..., 3m), where

$$y_i = \begin{cases} x_i & \text{if } 1 \le i \le m \\ x_r + \varepsilon & \text{if } i > m \end{cases} \qquad (10)$$

in which x_r is a sample randomly selected from {x_i} with replacement and ε denotes a normally distributed random disturbance, ε ~ N(0, $\sigma_\varepsilon^2$). The class labels are determined as

$$\xi_i = \begin{cases} \zeta_i & \text{if } 1 \le i \le m \\ \zeta_r & \text{if } i > m \end{cases}$$

Due to the very high dimensionality and the small number of patient samples, the training data are sparsely distributed in the high-dimensional Euclidean space, and it is reasonable to assume that points near a training sample share its class label. Experimentally, the technique of disturbed resampling (Eq. (10)) effectively reduces the overfitting and computational instability that are mainly caused by the lack of sufficient training samples in gene expression data.
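A minimal sketch of the disturbed resampling of Eq. (10), assuming numpy arrays `X` (m × d) and `y` (length m); the names are ours.

```python
import numpy as np

def disturbed_resample(X, y, sigma_eps, seed=0):
    """Build the augmented training set of Eq. (10): keep the m original samples
    and append 2m samples drawn with replacement, each perturbed by N(0, sigma_eps^2)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    r = rng.integers(0, m, size=2 * m)            # indices x_r, sampled with replacement
    noise = rng.normal(0.0, sigma_eps, size=(2 * m, d))
    X_aug = np.vstack([X, X[r] + noise])
    y_aug = np.concatenate([y, y[r]])             # labels follow the resampled points
    return X_aug, y_aug
```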

Availability

The core source codes of our algorithms are available at http://www.ittc.ku.edu/~xwchen/BMCbioinformatics/kernel/

Abbreviations

KNN:

K-nearest-Neighbor

SVM:

support vector machine

DLDA:

diagonal linear discriminant analysis

ULDA:

uncorrelated linear discriminant analysis

KerNN:

kernel optimization-based KNN

ALL:

acute lymphoblastic leukemia

AML:

acute myeloid leukemia

MLL:

mixed lineage leukemia

CNS:

embryonal tumor of central nervous system

References

  1. Ben-Dor A, Bruhn L, Friedman N, Nachman I, Schummer M, Yakhini Z: Tissue classification with gene expression profiles. J Computational Biology 2000, 7: 559–584. 10.1089/106652700750050943


  2. Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination method for the classification of tumor using gene expression data. J Am Statistical Assoc 2002, 97: 77–87. 10.1198/016214502753479248


  3. Golub TR, Slonim DK, Tamayo P, Huard C, Gassenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531–537. 10.1126/science.286.5439.531


  4. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR: Gene expression correlations of clinical prostate cancer behavior. Cancer Cell 2004, 1: 203–209. 10.1016/S1535-6108(02)00030-2


  5. van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Martoon MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415: 530–536. 10.1038/415530a


  6. West B, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA Jr, Marks JR, Nevins JR: Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci USA 2001, 98: 11462–11467. 10.1073/pnas.201162998


  7. Duda RO, Hart PE, Stork DG: Pattern Classification. 2nd edition. A Wiley-Interscience Publication; 2000.


  8. Friedman JH: Flexible metric nearest neighbor classification. In Technical report. Dept. of Statistics, Stanford University; 1994.


  9. Howland P, Park H: Generalizing discriminant analysis using the generalized singular value decomposition. IEEE Trans. on Pattern Analysis and Machine Intelligence 2004, 26: 995–1006. 10.1109/TPAMI.2004.46


  10. Ye J, Li T, Xiong T, Janardan R: Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM Trans. on Computational Biology and Bioinformatics 2004, 1: 181–190. 10.1109/TCBB.2004.45


  11. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16: 906–914. 10.1093/bioinformatics/16.10.906


  12. Jaakkola T, Diekhans M, Haussler D: Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology. Menlo Park, CA: AAAI Press;

  13. Zien A, Rätsch G, Mika S, Schölkopf B, Lemmen C, smola A, Lengauer T, Müller K: Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 2000, 16: 799–807. 10.1093/bioinformatics/16.9.799


  14. Cawley GC: MATLAB support vector machine toolbox.University of East Anglia, School of Information Systems, Norwich, Norfolk, U.K. NR4 7TJ; 2000. [http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox]


  15. Pochet N, Smet FD, Suykens JAK, Moor BLRD: Systematic benchmarking of microarray data classification: Assessing the role of non-linearity and dimensionality reduction. Bioinformatics 2004, 20: 3185–3195. 10.1093/bioinformatics/bth383


  16. Natsoulis G, Ghaoui LE, Lanckriet GRG, Tolley AM, Leroy F, Dunlea S, Eynon BP, Pearson CI, Tugendreich S, Jarnagin K: Classification of a large microarray data set: Algorithm comparison and analysis of drug signatures. Genome Res 2005, 15: 724–736. 10.1101/gr.2807605


  17. Bio-medical Data Analysis[http://sdmc.lit.org.sg/GEDatasets/]

  18. Center for Applied Genomics and Technology[http://mgm.duke.edu/genome/dna_micro/work/]

  19. Cancer Program Data Sets[http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi]

  20. St. Jude Research[http://www.stjuderesearch.org/data/]

  21. Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ: MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 2001, 30: 41–47. 10.1038/ng765


  22. Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JYH, Goumnerova LC, Black PM, Lau C, Allen JC, Zagzag D, Olson JM, Curran T, Wetmore C, Biegel JA, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, Mesirov JP, Lander ES, Golub TR: Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature 2002, 415: 436–442.


  23. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays. Proc Natl Acad Sci USA 1999, 96: 6745–6750. 10.1073/pnas.96.12.6745


  24. Gordon GJ, Jenson RV, Hsiao L-L, Gullans SR, Blumenstock JE, Ramaswamy S, Richards WG, Sugarbaker DJ, Bueno R: Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research 2002, 62: 4936–4967.


  25. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RCT, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, Ray TS, Koval MA, Last KW, Norton A, Lister TA, Mesirov J, Neuberg DS, Lander ES, Aster JC, Golub TR: Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 2002, 8: 68–74. 10.1038/nm0102-68


  26. Petricoin EF, Ardekanl AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA: Use of proteomic patterns in serum to identify ovarian cancer. The Lancet 2002, 359: 572–577. 10.1016/S0140-6736(02)07746-2


  27. Xiong H, Swamy MNS, Ahmad MO: Optimizing the data-dependent kernel in the empirical feature space. IEEE Trans. on Neural Networks 2005, 16: 460–474. 10.1109/TNN.2004.841784


  28. Ruiz A, Lopez-de Teruel PE: Nonlinear kernel-based statistical pattern analysis. IEEE Trans. on Neural Networks 2001, 12: 16–32. 10.1109/72.896793


  29. Baudat G, Anouar F: Generalized discriminant analysis using a kernel approach. Neural Computation 2000, 12: 2385–2404. 10.1162/089976600300014980


  30. Cristianini N, Shawe-Taylor J: An Introduction to Support Vector Machines. Cambridge Univ. Press, Cambridge, UK; 2000.


  31. Amari S, Wu S: Improving support vector machine classifiers by modifying kernel functions. Neural Networks 1999, 12: 783–789. 10.1016/S0893-6080(99)00032-5


  32. Müller K-R, Mika S, Rätsch G, Tsuda K, Scholkopf B: An introduction to kernel-based learning algorithms. IEEE Trans. on Neural Networks 2001, 12: 181–201. 10.1109/72.914517


  33. Pekalska E, Paclik P, Duin RPW: A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research 2001, 2: 175–211. 10.1162/15324430260185592


  34. Roth V, Steinhage V: Nonlinear discriminant analysis using kernel functions. In Advance in Neural Information Processing Systems 12. Edited by: Solla SA, Leen TK, Muller K-R. Cambridge, MA:MIT Press; 2000:568–574.


  35. Scholkopf B, Mika S, Burges CJC, Knirsch P, Muller K-R, Ratsch G, Smola AJ: Input space versus feature space in kernel-based methods. IEEE Trans. on Neural Networks 1999, 10: 1000–1017. 10.1109/72.788641


  36. Graf ABA, Smola AJ, Borer S: Classification in a normalized feature space using support vector machines. IEEE Trans. on Neural Networks 2003, 14: 597–605. 10.1109/TNN.2003.811708



Acknowledgements

This investigation was based upon work supported by the National Science Foundation under Grant No. EPS-0236913, by matching support from the State of Kansas through Kansas Technology Enterprise Corporation, and by the University of Kansas General Research Fund allocations #2301770-003 and #2301478-003.

Author information



Corresponding author

Correspondence to Xue-wen Chen.

Additional information

Authors' contributions

HX and XWC conceived the study. HX designed and implemented the algorithms, and drafted the manuscript. XWC coordinated the study, participated in the algorithm design, and helped draft the manuscript.


Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Xiong, H., Chen, Xw. Kernel-based distance metric learning for microarray data classification. BMC Bioinformatics 7, 299 (2006). https://doi.org/10.1186/1471-2105-7-299
