 Research article
 Open access
 Published:
Gene selection algorithms for microarray data based on least squares support vector machine
BMC Bioinformatics volumeÂ 7, ArticleÂ number:Â 95 (2006)
Abstract
Background
In discriminant analysis of microarray data, usually a small number of samples are expressed by a large number of genes. It is not only difficult but also unnecessary to conduct the discriminant analysis with all the genes. Hence, gene selection is usually performed to select important genes.
Results
A gene selection method searches for an optimal or near optimal subset of genes with respect to a given evaluation criterion. In this paper, we propose a new evaluation criterion, named the leaveoneout calculation (LOOC, A list of abbreviations appears just above the list of references) measure. A gene selection method, named leaveoneout calculation sequential forward selection (LOOCSFS) algorithm, is then presented by combining the LOOC measure with the sequential forward selection scheme. Further, a novel gene selection algorithm, the gradientbased leaveoneout gene selection (GLGS) algorithm, is also proposed. Both of the gene selection algorithms originate from an efficient and exact calculation of the leaveoneout crossvalidation error of the least squares support vector machine (LSSVM). The proposed approaches are applied to two microarray datasets and compared to other wellknown gene selection methods using codes available from the second author.
Conclusion
The proposed gene selection approaches can provide gene subsets leading to more accurate classification results, while their computational complexity is comparable to the existing methods. The GLGS algorithm can also better scale to datasets with a very large number of genes.
Background
Recently, discriminant analysis of microarray data has been widely used to assist diagnosis [1â€“3]. Given some microarray data characterized by a large number of genes' expressions, a typical discriminant analysis constructs a classifier based on the given data to distinguish between different disease types. In practice, a gene selection procedure to select the most informative genes from the whole gene set is usually employed. There are several reasons for performing gene selection. First, the cost of clinical diagnosis can be reduced with gene selection since it is much cheaper to focus on only the expressions of a few genes for diagnosis instead of the whole gene set. Second, many of the genes in the whole gene set are redundant. Although the training error of a classifier on the given data will decrease as more and more genes are included, the generalization error when classifying new data eventually will increase. A preceding gene selection procedure can remove the redundant genes, reduce storage requirement and computational complexity of the following discriminant analysis, and possibly reduce the generalization error. Finally, gene selection can provide a more compact gene set, which can help understand the functions of particular genes and plan the diagnosis process.
From the perspective of pattern recognition, the gene selection problem is a special case of the feature selection problem. Given a set of training data represented by a set of features, a typical feature selection method aims to select a feature subset leading to a low generalization error (i.e. low error in future classification). It searches for an optimal or near optimal subset of features with respect to a given criterion, and thus consists of two basic components: an evaluation criterion and a search scheme. Feature selection methods can generally be categorized into three major groups: marginal filters, wrappers and embedded methods [4]. Marginal filter approaches are usually mentioned as individual feature ranking methods. They evaluate a feature based on its marginal contribution to the class discrimination without considering its interactions with other features. The selection procedure is independent of the classification procedure because a classifier is not built when evaluating a feature. Some comparative studies on the criteria employed in a marginal filter method can be found in [5]. In a wrapper method, usually a classifier is built and employed as the evaluation criterion. One such example is to use the training or crossvalidation error of the classifier on the training data as the evaluation criterion. Because the finally selected feature subset has the highest value on the criterion, the feature selection procedure is closely related to the decision mechanism of the classifier and therefore the wrapper methods are expected to generate better feature subsets for classification than the marginal filter methods. If the criterion is derived from the intrinsic properties of a classifier, the corresponding feature selection method will be categorized as an embedded approach [6]. For example, in the SVM Recursive Feature Elimination (SVMRFE) algorithm, a support vector machine is trained first. then the features corresponding to the smallest weights in the vector normal to the optimal hyperplane is sequentially eliminated [7]. Nevertheless, wrapper and embedded methods are often closely related to each other.
Because the marginal filter methods evaluate features separately and there is no other search scheme for them, we mainly discuss the search schemes for the wrapper and embedded methods. In a wrapper or embedded feature selection algorithm, if the whole feature subset space is explicitly (such as using an exhaustive search) or implicitly searched (such as using the branchandbound scheme [8]), it is guaranteed to discover the optimal feature subset with respect to the evaluation criterion. However, an exhaustive search or the branchandbound scheme is computationally prohibitive except for small problems. Therefore, one usually searches a part of the whole feature subset space, which is more practicable but provide no optimality guarantees [9]. The sequential forward selection, sequential floating forward selection, sequential backward elimination, sequential floating backward elimination and so on [10] belong to this class. The sequential forward scheme starts from an empty set, and sequentially includes a new feature into the feature subset so that the largest improvement on the evaluation criterion can be achieved. Once a feature is selected, it will not be removed from the subset. Differently, the sequential floating forward selection scheme contains two steps. First, the feature leading to the largest improvement is included, then the scheme backtracks the search path and removes some previously selected features if improvement can be achieved by doing so. Sequential backward elimination scheme sequentially remove features from the whole feature set until an optimal feature subset is remained. And the sequential floating backward elimination allows including previously removed features to the current feature subset. Hence, the floating schemes cover a larger portion of all the possible feature subsets, while it is more time consuming [10]. Recently, genetic algorithms (GA) have also been employed as the search schemes [11â€“13]. Compared with the traditional search schemes, GAs provide a more flexible search procedure, the feature subset space is searched in parallel and multiple feature subsets instead of a single subset are evaluated simultaneously to avoid being trapped in a local optimum. GAs are generally even more time consuming than the floating schemes, although it can cover more feature subsets.
In the context of microarray data analysis, many of the methodologies discussed above have been used. In addition to those marginal filter methods using tstatistics, Fisher's ratio and information gain, different evaluation criteria were proposed for wrapper and embedded methods, such as the SVMbased criteria [14] and the LS bound measure, which is based on a lower bound of the leaveoneout crossvalidation error of the least squares support vector machine (LSSVM) [15]. They can be combined with any kinds of search scheme. Several GAbased algorithms are also available in the literature of gene selection [12, 13, 16]. Two issues should be considered when assessing these gene selection methods: the generalization error that can be achieved on the selected gene subset and the time requirement of the selection procedure. Specifically, a good feature selection method should contain following characteristics: The evaluation criterion can guarantee low generalization error, computational cost for a single evaluation is low, the search scheme requires a small number of evaluations while can still search a large portion of the whole feature subset space in order to include the optimal ones. Therefore, among all the methods discussed above, the marginal filter methods are the most efficient, but the selected feature subsets are usually suboptimal. The wrapper/embedded methods using an exhaustive search are the most time consuming, but optimality can be guaranteed. All the other methods lie between these two cases, providing a tradeoff between optimality and computation cost. The wrapper methods using sequential selection schemes are more efficient than the wrapper methods using GA based algorithms, but are more likely to select a suboptimal gene subset. Furthermore, in addition to finding an optimal gene subset for classification, identifying important genes is another goal of gene selection. Identifying important genes is essentially different from finding a single optimal gene subset. For the microarray data, a classifier may be able to achieve the lowest generalization error on many different gene subsets, and all of them consist of important genes. Knowing these different gene subsets can help gain more insight into the functions of genes.
In the present study, we first propose an evaluation criterion called leaveoneout calculation (LOOC) measure for gene selection. The LOOC measure is derived from an exact and efficient calculation of the leaveoneout crossvalidation error (LOOE) of LSSVM. By combining the LOOC measure with the sequential forward selection scheme, we proposed the leaveoneout calculation sequential forward selection (LOOCSFS) gene selection algorithm. Moreover, we also present a novel gene selection algorithm, named gradientbased leaveoneout gene selection (GLGS) algorithm. Employing none of the traditional search schemes, it combines a variant of the LOOC measure with the gradient descent optimization and the principal component analysis (PCA). Performance of the proposed methods is evaluated experimentally on two microarray datasets.
Results
Datasets
In this section, we present performance of the proposed gene selection algorithms, i.e. the LOOCSFS and the GLGS algorithms on two public domain datasets.
Hepatocellular carcinoma dataset
This dataset comprises information of 60 patients with hepatocellular, with oligonucleotide microarrays representing 7129 gene expression levels [2].
Glioma dataset
All the 50 samples of the Glioma dataset [3] are expressed by 12625 genes. Twenty eight of the samples are glioblastomas and the other 22 are anaplastic oligodendrogliomas.
Experimental setup
We acquire the two datasets directly from [17]. We further standardize the data to zero mean and unit standard deviation for each gene. The experiments and results are based on the preprocessed data and are implemented in the Matlab environment on a computer with 3 GHz P4 CPU and 1024 MB RAM.
There are two objectives for the experiments. One is to evaluate the performance of LOOCSFS and GLGS algorithms, and compare them with other gene selection algorithms. The other goal of the experiments is to identify important genes of the two datasets.
For the first objective, we compare our leaveoneout calculation sequential forward selection (LOOCSFS) and gradientbased leaveoneout gene selection (GLGS) algorithms with other five gene selection algorithms. First, although it usually selects a suboptimal gene subset for classification, a marginal filter method using Fisher's ratio is employed to provide a baseline for the comparison on generalization errors. Fisher's ratio is a criterion that evaluates how well a single gene is correlated with the separation between classes. For every gene the Fisher's ratio is defined as f\frac{{({\mathrm{\xce\xbc}}_{1}\xe2\u02c6\u2019{\mathrm{\xce\xbc}}_{2})}^{2}}{{\mathrm{\xcf\u0192}}_{1}^{2}+{\mathrm{\xcf\u0192}}_{2}^{2}}, where Î¼_{1}, Î¼_{2}, Ïƒ_{1}, Ïƒ_{2} denote the means and standard deviations of two classes. Further, our methods are compared with the Mahalanobis class separability measure (MAHSFS) [8] and the LS bound measure combined with sequential forward selection scheme (LSSFS) [15], SVMRFE [7], the LS bound measure with sequential floating forward selection scheme (LSSFFS) [15]. The comparisons are conducted based on the generalization error achieved on the selected gene subset and the time requirement of the selection procedure. In two previous works, Ambroise and McLachlan and Simon et al. demonstrated that crossvalidation or bootstrap samples should be kept external to a gene selection algorithm [18, 19]. Ambroise and McLachlan [18] also assessed several techniques for estimating the generalization error. They showed that the external 10fold crossvalidation error and the external B.632+ error are the two most unbiased estimators of the generalization error. Since crossvalidation is claimed to have a relatively higher variance for small sample size problems [20], the external B.632+ error appears to be the best choice. Hence we use the external B.632+ technique [21] to compare the generalization error achieved on the selected gene subsets. The B.632+ technique employs the bagging [22] procedure to generate different training and testing sets (which are called bootstrap samples [21]) from the original data. The algorithms are applied to the bootstrap samples as well as the original data. Specifically, we employ 200 replicates of balanced bootstrap samples to reduce variance of the B.632+ error, i.e. each sample in the original dataset is restricted to appear exactly 200 times in total in all the 200 balanced bootstrap samples. A standard SVM is employed as the final classifier for all seven gene selection methods. All the compared algorithms will terminate if a predefined number of genes are selected. We set this number as 100. Furthermore, we conduct another experiment to study the computational complexity and scalability of the seven gene selection algorithms. The required computational time of the algorithms are studied with respect to the number of genes to be selected (t) and the size of the whole gene set (d). This experiment is conducted on the Hepatocellular Carcinoma dataset, but similar scenario can be easily shown on the Glioma dataset.
The second objective of identifying important genes is essentially different from selecting a single gene subset for classification. Some researchers claimed that the gene selection procedure should not be applied only once on the original data, but should be run repeatedly on different subsets of the training data [13, 23â€“25]. Since we apply our algorithms to 200 bootstrap samples for the first objective, we can actually achieve 200 different gene subsets for these bootstrap samples. By looking at the frequency of genes appearing in all the 200 gene subsets, we can find some insights on the genes that are important for classification. Therefore, for both LOOCSFS and GLGS algorithms, we identify the top 20 genes that are most frequently selected based on the 200 bootstrap samples.
Results
Figures 1 and 2 present the external B.632+ errors achieved on the genes selected by the seven gene selection algorithms. It can be observed that the GLGS algorithm generally achieves the lowest external B.632+ error among the compared methods on both datasets. The LOOCSFS algorithm does not perform as well as the GLGS algorithm. As shown in the figures, LOOCSFS is consistently superior to the marginal filter method, LSSFS. It also outperforms the MAHSFS and SVMRFE on the Hepatocellular Carcinoma datasets, and the results are mixed on the Glioma dataset. Furthermore, although a gene selection algorithm employing the sequential forward selection scheme is expected to be inferior to the methods employing the sequential floating forward selection scheme (because the sequential forward selection scheme searches a smaller portion of the feature subset space than the sequential floating forward selection scheme), it also outperforms the LSSFFS on both datasets.
From the perspective of computational complexity, the scalability of a gene selection algorithm should also be considered when evaluating it. The required computational time of the algorithms are plotted with respect to the number of genes to be selected (t) and the size of the whole gene set (d) in Figures 3 and 4. As shown in the two figures, the marginal filter method is always the most efficient one among all the approaches. The computational costs of LSSFS, MAHSFS, LOOCSFS and LSSFFS all increase significantly when d or t increases. In Fig. 3, LSSFFS is the most time consuming since the sequentially floating forward selection scheme is employed. MAHSFS also requires expensive computation. The LSSFS and the LOOCSFS can be calculated more efficiently than LSSFFS and MAHSFS. The LOOCSFS requires slightly more time than the LSSFS. The computational time of SVMRFE and GLGS do not change significantly with t, and SVMRFE is more time consuming than GLGS. In Figure 4, computational costs of all methods except GLGS increase significantly with d, with the LSSFFS being the most time consuming and the other four are comparable. Hence the GLGS algorithm can better scale to microarray data with large number of genes as well as the problems that require selecting a large number of genes from the original gene set.
Hepatocellular carcinoma dataset
The 20 genes most frequently selected by LOOCSFS and GLGS are listed in Table 1 and 2 respectively. Some comments about the selected genes are worthy of mention. Genes M59465, X75042, Y10032, L08895, AB000409, L11695, X15341 and L76927 are frequently selected by both algorithms. Among them, M59465, X75042, Y10032 and L08895 are also used to construct an SVM classifier in the original work [2]. The Y10032 and M59465 are claimed as greatly downregulated in hepatocellular carcinoma with venous invasion and the levels of Y10032 transcript are altered in hepatoma cells in response to osmotic changes or cell volume changes [2]. In addition, LOOCSFS also frequently select gene X00274, which is downregulated in hepatocellular carcinomas with early intrahepatic recurrence and this downregulation might permit tumour cells to escape from host immune surveillance [26].
Glioma dataset
In their original work, Nutt et al. identified 77 important genes that were used to construct a 20gene kNN classifier in a leaveoneout crossvalidation procedure [3]. Tables 3 and 4 present the 20 most frequently selected genes by LOOCSFS and GLGS. We found from Tables 3 and 4 that the genes L39874 (Please note that according to the original work, L39874630_at and L39784631_g_at are two different features obtained by applying the same probe set to different regions of a same gene), AB007960 and D29643 may be very important since they are both frequently selected in the original work as well as by our two algorithms. In addition, the gene Y00815 is also identified by both LOOCSFS and GLGS algorithm. Hence, it may also be important.
Another interesting observation on the two datasets is that a gene is usually less frequently selected by GLGS than by LOOCSFS, which indicates that the 200 gene subsets selected by GLGS are more different from one another (more diverse) than the gene subsets selected by LOOCSFS. Since the training sets for the 200 bootstrap samples are different, the optimal gene subsets for classification are also expected to be different. Hence, this observation is consistent with the fact that GLGS can select gene subsets leading to lower generalization performance as seen in figures 1 and 2.
Discussion
In practice, choosing a gene selection algorithm for discriminant analysis usually depends on the problem involved. According to the presented experimental results, if only one gene subset that leads to the lowest generalization error is needed, the GLGS is more appealing. However, in some cases we have to trade some accuracy for efficiency. Among the seven discussed gene selection methods, the marginal filter approach is the most efficient, but it generates the highest generalization error. If d<1000 and t<100, the GLGS algorithm is more time consuming than the approaches employing the sequential forward selection schemes. In this case, if one wants to obtain the solution faster and also achieve higher accuracy than a marginal filter method, the LOOCSFS and SVMRFE may be more suitable than the GLGS. If the number of samples is relatively larger than the number of genes, which is quite unlikely for microarray data, the MAHSFS may be a better choice. Finally, the GLGS algorithm appears to be a good choice for small number of samples with large d and t, which is true for most microarraybased gene selection scenarios.
If the LOOCSFS or GLGS is chosen to carry out gene selection, one needs to define the number of the genes to be selected. In our experiments, we set this number as 100 for both datasets, of course 100 may not be the optimal number for achieving lowest generalization error on all microarray datasets and in practice one may need to estimate the optimal one for different datasets. Although our aim in this paper is not to investigate this point thoroughly, some suggestions can be found in previous studies [14]. One approach is to terminate the selection procedure when a given criterion does not improve significantly when more genes are incorporated. This strategy can be easily included in the program so that the algorithm can terminate automatically, and the criterion can be the LOOC measure, or simply the crossvalidation error. Another approach is selecting a sufficiently large number of genes. Then by looking at the curve of generalization error, a human expert can determine the optimal number of genes to be selected so that the generalization error does not decrease significantly when more genes are selected. Based on the last approach and the experimental results presented in figures 1 and 2, we are further able to recommend some choices of the number of selected genes for the discussed two datasets. Our recommendation is based on the results of GLGS algorithm since it achieves the lowest external b.632+ error on the two datasets. For the Hepatocellular Carcinoma dataset, after selecting 40 features, including more genes can only result in no more than 1% reduction in the external b.632+ error. If 40 is a relatively large number for a gene subset, we can select 20 genes. It can be observed from figure 1 that the error increases significantly when the number of selected genes is less than 20. Therefore, we recommend 20 and 40 as two choices for the Hepatocellular Carcinoma dataset. Similarly, we can find from figure 2 that 15 and 30 can be good choices for the glioma dataset.
To identify important genes and study the possible interactions between them, one may need to select a number of different gene subsets that can all solve the classification problem with similar high accuracy. For this purpose, we feel that not only different training data (as in [23]), but also different selection algorithms should also be employed to render a comprehensive exploration of the useful genes. In this case, our methods can be used together with many other approaches. In the context of machine learning, this approach is referred as an ensembling, which has also been used to the gene selection problems. It should be noted that generally any algorithm that selects a single gene subset could be used as a component of such an ensemble system. Therefore, our work can be viewed as providing new choices to build an ensemble system.
Conclusion
In this study, we have proposed two gene selection algorithms, the LOOCSFS and the GLGS algorithms based on an efficient and exact calculation of the leaveoneout crossvalidation error of LSSVM. The GLGS algorithm is different from traditional gene selection algorithms for that it solves the involved optimization problem in a much lower dimensional space, thus significantly reduces the computational cost of the selection procedure, while still GLGS selects genes from the original gene set. As both LOOCSFS and GLGS algorithms are derived from the exact calculation of leaveoneout crossvalidation error, they are promising to select gene subsets leading to low generalization error. Experimental results show that the GLGS is also more efficient than traditional algorithms when the microarray data are represented by a large number of genes or a large number of genes to be selected from the whole gene set. Furthermore, our algorithms can be easily incorporated into more sophisticated ensemble systems to enhance overall gene selection performance.
Methods
Least square support vector machines
Belonging to the large family of socalled kernel methods [27], the least squares support vector machine (LSSVM) [28, 29] is a modification of the standard support vector machine. Suppose we are given n training sample pairs {x_{ i }, y_{ i }} where x_{ i }is a ddimensional column vector representing the i th sample, and y_{ i }is the class label of x_{ i }, which is either +1 or 1. The LSSVM employs a set of mapping functions Ã– to map the data into a reproducing kernel Hilbert space (RKHS), and performs classification in it. Using the kernel function k(x_{ i }, x_{ j }) = x_{ i }^{T}x_{ j }, the linear decision boundary of the LSSVM can be formulated as
w^{T}x+b = 0 (1)
where w = [w1, w2, ..., wn]^{T} and b is a scalar. w and b can be obtained by solving the optimization problem:
s.t.y_{ i }(w^{T}x_{ i }+b) = 1e_{ i } (3)
where e_{ i }denotes regression error for sample x_{ i }, e = [e_{1}, e_{2}, ..., e_{ n }] and Î³ is a given positive constant introduced to adjust the compromise between generalization and training errors. After introducing Lagrangian multipliers, the optimization problem can be converted to the following linear system:
where Y = [y_{1}, y_{2}, ..., y_{ n }] ^{T}, Ã™ = {y_{ i }y_{ j }x_{ i }^{T}x j}, \stackrel{\xe2\u2020\u2019}{1} = [1, 1, ..., 1]^{T}, a = [Î±_{1}, Î±_{2}, ..., Î±_{ n }]^{T} and I is the identity matrix. Similar to the standard SVM, given a testing sample x, the discriminant function of the LSSVM takes the form:
and the sign of f (x) is taken as the class label of x.
The main difference between the standard SVM and the LSSVM is that for the standard SVM the equality constraints in Eq. (3) are replaced by inequality constraints, thus SVM involves solving a quadratic programming (QP) problem, which requires more expensive computation than solving a linear system. On the other hand, according to an empirical study [29], the LSSVM is capable of achieving comparable performance as the standard SVM on many realworld problems. It has also achieved satisfactory classification accuracy on microarray data [17].
The LOOC gene selection criterion
As mentioned before, a good evaluation criterion for feature selection should guarantee low generalization error and can be computed efficiently. Although the B.632+ technique has been proven to be the best estimator of generalization error [18], computing B.632+ error for every candidate gene subset is computationally too costly. Therefore, the B.632+ error is seldom employed as the evaluation criterion during the gene selection process. On the other hand, as it is proven that the leaveoneout crossvalidation error (LOOE) is an almost unbiased estimator of the generalization error [30], and it can be easily computed as we will demonstrate in Eq. (9), it is acceptable to use LOOE as the evaluation criterion for feature selection.
Basically,, the direct calculation of the LOOE requires repeating the whole training procedure for n times, where n is the number of training samples. This is still time consuming. To simplify the calculation of LOOE for the standard SVM, several approaches have been discussed [31, 32]. These approaches generally require training the SVM only once with the whole training set. Many of them are later extended as evaluation criteria for feature selection [14]. But, determined by the nature of SVM, all these approaches involve solving the QP problem, which still requires expensive computation. In the context of LSSVM, Cawley and Talbot in [33] and Van Gestel et al. in [34] showed that the leaveoneout error of an LSSVM can be efficiently and exactly evaluated. This approach is then successfully implemented in the LSSVMlab toolbox [35]. Since the LSSVM can be implemented more efficiently, we focus on the LSSVM in this paper. First of all, an alternative efficient calculation of the LOOE of the LSSVM is presented as our starting point based on the Lemma below:
Lemma 1
Given n training samples, let w_{ i }and b_{ i }denote the w and b achieved by training the LSSVM after sample x_{ i }is removed, and denote the testing result of sample x_{ i }in the leaveoneout procedure as:
Then Eq. (8) holds:
y_{ i }f^{i} (x) = 1Î±_{ i }/(H^{1})_{ ii } (8)
where H=\left[\begin{array}{cc}K+{\mathrm{\xce\xb3}}^{\xe2\u02c6\u20191}I& \stackrel{\xe2\u2020\u2019}{1}\\ {\stackrel{\xe2\u2020\u2019}{1}}^{T}& 0\end{array}\right], K = {x_{ i }^{T}x_{ j }} is the kernel matrix, and (H^{1})_{ ii }denotes the i th diagonal element of the matrix H^{1} (The proof of Lemma 1 is available in Additional file 1). An exact calculation of the LOOE of LSSVM can be derived from Eq. (8) as:
It should be noted that although they take different forms, Eq. (9) and the works presented in [33] and [34] generally share equivalent performance and property. Eq. (9) itself can be directly employed as the evaluation criterion for gene selection. But for a microarray dataset, which usually contains only a small number of samples, it is very likely that many candidate feature subsets may provide the same LOOE. Hence, based on Eq. (8), we propose the C bound as the supplementary criterion of Eq. (9):
where (x)_{} = min(0, x). Eq. (10) is motivated by the following consideration: A sample x_{ i }is misclassified in the leaveoneout procedure if yif^{i} (x_{ i }) is negative, and absolute value of yif^{i} (x_{ i }) indicates how close this sample is to the decision boundary. Therefore, for those samples misclassified in the LOO procedure (i.e. yif^{i} (x_{ i }) is negative), a small absolute value of the term yif^{i} (x_{ i }) is more preferable. Because a small value means that the sample is close to the decision boundary and might be classified correctly with a few more training data, while a large absolute value indicates that it is difficult to classify even if we are given more training data. By combining Eq. (9) with Eq. (10), we can now obtain the LOOC measure for the gene selection problem. The optimal gene subset for dicriminant analysis is the one leading to the smallest LOOE, if the same LOOE can be achieved on several gene subsets, the one with the largest value of the C bound is preferred (note that the C bound only requires negligible additional computation since the term 1Î±_{ i }/(H^{1})_{ ii }has been computed when calculating the LOOE). The advantage of the LOOC measure is that it is derived from the leaveoneout procedure and therefore is expected to be an accurate estimator of the generalization error. Further, LOOC measure can be calculated by training the LSSVM with the whole training set only once, which requires solving a linear system and is much easier than solving a QP problem. Hence, the LOOC measure can be calculated even more efficiently than those SVMbased criteria [3, 7]. By combining the LOOC measure with the sequential forward selection scheme, we propose the LOOCSFS gene selection algorithm, which is described in Figure 5.
The gradientbased leaveoneout gene selection algorithm
In addition to the sequential forward selection, sequential floating forward selection, sequential backward elimination and sequential floating backward elimination search schemes, a possible alternative search scheme is the gradient descent method. Using gradient descent is not a totally new idea in the literature of the standard SVM. Chapelle et al. [32] employed the gradient descent approach to choose parameters for the standard SVM, they also suggested using the same framework to address feature selection problem. But the resultant algorithm requires using gradient descent to repeatedly solve an optimization problem, whose dimensionality is the same as the total number of genes. As the number of genes is usually huge in microarray data, this framework will be very time consuming for gene selection problems. Considering the specific properties of microarray data, we propose in this subsection a novel gene selection algorithm, named gradientbased leaveoneout gene selection (GLGS) algorithm. It is also based on the exact calculation of the LOOE of LSSVM, and employs a gradient descent approach to optimize the evaluation criterion.
As we would like to use a gradient approach to optimize the evaluation criterion, the criterion must be differentiable, whereas both Eq. (9) and Eq. (10) are not. Hence, to obtain a differentiable criterion, the logistic LOOC (LLOOC) measure is proposed by modifying Eq. (9) as:
The logistic function 1/(1+exp(x)) is commonly used to transfer output of an SVMtype classifier into a specific region [36]. Different from the C bound, the LLOOC measure ranges between (0,1). Hence it can be viewed as a probability that represents the generalization error of a classifier and can be useful for possible postprocessing procedure. More precisely, large positive value of the term 1Î±_{ i }/(H^{1})_{ ii }generally corresponds to a small LOOE and LLOOC. Therefore, the genes can be evaluated by minimizing the LLOOC measure.
In the present study, to design an LSSVM based gene selection algorithm, we introduce a diagonal matrix V, whose diagonal elements are scaling factors {v_{1}, v_{2}, ..., v_{ d }}, into the kernel matrix K to modify k(x_{ i }, x_{ j }) = x_{ i }^{T}x_{ j }as k(x_{ i }, x_{ j }) = x_{ i }^{T}Vx_{ j }. Consequently, the LLOOC can be viewed as a function of these scaling factors, and we can optimize the scaling factors by solving the ddimensional minimization problem below:
Problem 1
Given a dbyd diagonal matrix V, whose diagonal elements are the scaling factors {v_{1}, v_{2}, ..., v_{ d }}, minimize the LLOOC measure with respect to V.
For the optimized scaling factors, a smaller absolute value indicates that the corresponding gene is less important for achieving the minimal LLOOC measure and thereby low generalization error. Hence, genes can be selected according to the absolute value of the scaling factors.
Given the abovedescribed problem, we can observe that the LLOOC measure is differentiable, and the partial derivative of it with respect to a scaling factor v_{ k }can be calculated by (Detailed derivations are available in Additional file 1):
where M=\left[\begin{array}{cc}\mathrm{\xce\copyright}+{\mathrm{\xce\xb3}}^{\xe2\u02c6\u20191}I& Y\\ {Y}^{T}& 0\end{array}\right]. Let x_{ i }= [x_{ i1 }, x_{ i2 }, ..., x_{ id }]^{T}, then:
Therefore, we can solve the minimization problem by using a gradient descent approach. However, d is usually very large for microarray data, which means the Problem 1 is a high dimensional optimization problem in our case. As we have mentioned, the gradient descent approach takes a long time to converge for high dimensional optimization problem. To overcome this situation, in the GLGS algorithm, the scaling factors are not introduced into the original data directly. Instead, we first apply a principal component analysis (PCA) procedure to the microarray data to resolve the high dimensionality problem, and scaling factors are then introduced into the transformed data and optimized. In pattern recognition field, PCA is a commonly used approach for dimensionality reduction. Denoting the original high dimensional data by a dbyn matrix X, PCA first computes a transformation matrix T, and then transforms X to a low dimensional space by X_{ low }= TX, where X_{ low }denotes the transformed data and is a d_{ low }byn matrix. Each feature of the transformed data is actually a linear combination of the features of original data (we refer the features of transformed data and original data as features and genes respectively), and most information of the original data can be preserved by setting the value of d_{ low }no larger than min(d, n) (specifically, we recommend to use d = n for the presented GLGS algorithm). In case of the microarray data analysis, because the number of samples is usually very small while the number of genes is huge, PCA can reduce dimensionality of the data significantly, which typically equals the number of samples. By this means, we only need to solve an optimization problem whose dimensionality is the number of samples, thereby reducing the computational cost.
After optimizing a d_{ low }dimensional vector v_{ low }of scaling factors in the transformed space, the scaling factors of the original genes, which are called pseudo scaling factors as they are not truly optimized, can be estimated based on three considerations: First of all, absolute values of the scaling factors of the transformed features indicate the importance of the transformed features for achieving the minimal LLOOC measure. Second, absolute values of the elements of T reveal how important the corresponding genes are for constructing the transformed data. Finally, correlation between genes plays an important part in gene selection problems. Hence it is usually expected that a set of uncorrelated genes are likely to be more informative. As a result, the pseudo scaling factors for the original genes can be estimated as:
v = RâŒŠabs(T^{T})âŒ‹ [abs(v_{ low })] (14)
where R denotes the dbyd correlation coefficient matrix of the original gene set, v = [v1, v2, ..., vd] ^{T} and abs(T^{T}) is the matrix whose elements take the absolute value of the elements of T^{T}. The term âŒŠabs(T^{T})âŒ‹[abs(v_{ low })] evaluates the genes' contribution for constructing the lower dimensional space. However, if two or more important genes are very similar to each other, they will have correspondingly large pseudo scaling factors while including all of them may not reduce the generalization error. In this case, we only need to select one of them as a representation and avoid selecting similar genes subsequently. Since the more a specific gene correlated to other genes, the better it can be viewed as a representation of many other genes, we combine the correlation matrix R with the term âŒŠabs(T^{T})âŒ‹[abs(v_{ low })] in the final selection procedure. Given the scaling factors estimated for the original genes, we select genes sequentially from the original gene set based on the pseudo scaling factors and the evc criterion:
evc = (1Î²)v_{ k } (15)
where Î² is the largest correlation coefficient between k^{th} gene and one of the already selected genes. Although large value of v_{ k }means the gene is possibly informative, large value of Î² indicates that the k th gene is highly correlated with at least one already selected gene. Hence, the term 1Î² is introduced to control similarity between the selected genes. In each stage of the selection procedure, the gene with respect to the largest evc is the most desirable one and will be selected. The whole GLGS algorithm is described in Figure 6.
According to the definitions in [4], the GLGS algorithm can be categorized as an embedded method. Hence, it is more time consuming than a marginal filter method. The GLGS differs from previous wrapper and embedded approaches because it optimizes the evaluation criterion derived in a supervised manner in a transformed space with significantly reduced dimensions instead of the original space, while it selects genes from the original gene set based on results of the optimization. One main advantage of the GLGS over the other gene selection algorithms is scaling well to high dimensional data. In a gene selection algorithm, the evaluation criterion is computed repeatedly to assess candidate gene subsets. Hence, the computational cost of a gene selection algorithm is determined not only by computational complexity of the evaluation criterion, but also by the number of required evaluations. Although we have experimentally shown that GLGS can better scale to high dimensions and large number of selected genes, it is worth analyzing the computational complexity issue quantitatively. If the microarray data contain d genes and t of them are to be selected, the sequential forward selection and the sequential backward elimination schemes require (2dt+1)t/2 and (2dt1)(dt)/2 evaluations respectively. The sequential floating selection/elimination scheme requires more evaluations than the former two schemes and GAs generally requires even more evaluations than the floating schemes. As d and t increase, the number of evaluations required by these schemes will increase significantly. Since microarray data usually contain thousands of genes, all the traditional schemes are highly time consuming for the gene selection problem even if we employ an evaluation criterion that is easy to compute. In contrast to these traditional methods, because the computational complexity of Eq. (11) and Eq. (12) is mainly determined by the number of samples rather than the size of the whole gene set or the number of genes to be selected, time requirement of the GLGS algorithm will not increase much when d or t increases, as can be observed in Figure 3 and Figure 4. By performing minimization of Eq. (11) in the lower dimensional PC space, it requires much less evaluations for high dimensional data than the sequential forward selection, sequential floating forward selection, sequential backward elimination and sequential floating backward elimination schemes. For example, if 50 genes are to be selected from 5000, then the LOOCSFS algorithm requires 123775 evaluations, and SVMRFE requires solving the QP problem for 4950 times because of its specific mechanism. For GLGS algorithm, the computational complexity is dominated by the PCA and gradient descent procedure. Generally, the gradient descent procedure can converge within 300 iterations. As the computational complexity of Eq. (11) and Eq. (12) is approximately two times of Eq. (9) and Eq. (10), the time requirement of the gradient descent procedure in this case is comparable to 600 evaluations of the LOOCSFS method, and less than 600 evaluations of SVMRFE. The computational cost of the PCA procedure is a bit difficult to estimate, but our experimental results show that it can almost be neglected when d and t are large. Finally, as Eq. (11) is derived from the exact calculation of the LOOE of LSSVM, GLGS also can select gene subsets leading to a low generalization error.
All the related proofs/derivations, the data used in the experiments and the programs of the LOOCSFS and GLGS algorithms are provided in the additional files.
Abbreviations
 GA:

Genetic Algorithm
 GLGS:

Gradientbased Leaveoneout Gene Selection
 LLOOC:

Logistic LOOC
 LOOC:

LeaveOneOut Calculation (measure)
 LOOCSFS:

Gene selection method using LOOC measure and SFS scheme
 LOOE:

LeaveOneOut crossvalidation ErrorLSSFS Gene selection method using LS bound measure and SFS scheme
 LSSFFS:

Gene selection method using LS bound measure and SFFS scheme
 LSSVM:

Least Squares Support Vector MachineMAHSFS Gene selection method using Mahalanobis measure and SFS scheme
 PCA:

Principal Component Analysis
 RFE:

Recursive Feature Elimination (sequential backward elimination)
 SFS:

Sequential Forward Selection scheme
 SFFS:

Sequential Floating Forward Selection scheme
 SVMRFE:

SVMbased Recursive Feature Elimination algorithm
References
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531â€“537.
Iizuka N, Oka M, YamadaOkabe H, Nishida M, Maeda Y, Mori N, Takao T, Tamesa T, Tangoku A, Tabuchi H, Hamada K, Nakayama H, Ishitsuka H, Miyamoto T, Hirabayashi A, Uchimura S, Hamamoto Y: Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection. The Lancet 2003, 361: 923â€“929.
Nutt CL, Mani DR, Bentensky RA, Tamayo P, Cairncross JG, Ladd C, Pohl U, Hartmann C, McLaughlin ME, Batchelor TT, Black PM, Von Deimling A, Pomeroy SL, Golub TR, Louis DN: Gene expressionbased classification of malignant gliomas correlates better with survival than histological classification. Cancer Research 2003, 63: 1602â€“1607.
Kohavi R, John GH: Wrappers for feature subset selection. Artificial Intelligence 1997, 97: 273â€“324.
Cho SB: Exploring features and classifiers to classify gene expression profiles of acute leukaemia. International Journal of Pattern Recognition and Artificial Intelligence 2002, 16: 831â€“844.
Blum AL, Langley P: Selection of relevant features and examples in machine learning. Artificial Intelligence 1997, 97: 245â€“271.
Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Machine Learning 2002, 46: 389â€“422.
Devijver P, Kittler J: Pattern Recognition: A Statistical Approach. London: Prentice Hall; 1982.
Tsamardinos I, Aliferis CF: Towards principled feature selection: relevance, filters and wrappers. In Ninth International Workshop on Artificial Intelligence and Statistics. Key West, Florida, USA; 2003.
Webb AR: Statistical Pattern Recognition. London: Wiley; 2002.
Raymer ML, Punch WF, Goodman ED, Kuhn LA, Jain AK: Dimensionality reduction using genetic algorithms. IEEE Transactions Evolutionary Computation 2000, 4: 164â€“171.
Li L, Jiang W, Li X, Moser KL, Guo Z, Du L, Wang Q, Topol EJ, Wang Q, Rao S: A robust hybrid between genetic algorithm and support vector machine for extracting an optimal feature gene subset. Genomics 2005, 85: 16â€“23.
JirapechUmpai T, Aitken S: Feature selection and classification for microarray data analysis: evolutionary methods for identifying predictive genes. BMC Bioinformatics 2005, 6: 148.
Rakotomamonjy A: Variable selection using SVMbased criteria. Journal of Machine Learning Research 2003, 3: 1357â€“1370.
Zhou X, Mao KZ: LS bound based gene selection for DNA microarray data. Bioinformatics 2005, 21: 1559â€“1564.
Li L, Darden TA, Weinberg CR, Levine AJ, Pedersen LG: Gene assessment and sample classification for gene expression data using a genetic algorithm/Knearest neighbor method. Computational Chemistry High Throughput Screen 2001, 4: 727â€“739.
Pochet N, De Smet F, Suykens JAK, De Moor BLR: Systematic benchmarking of microarray data classification: assessing the role of nonlinearity and dimensionality reduction. Bioinformatics 2004, 20: 3185â€“3195.
Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene expression data. Proc Natl Acad Sci USA 2002, 99: 6562â€“6566.
Simon R, Radmacher MD, Dobbin K, McShane LM: Pitfalls in the Use of DNA Microarray Data for Diagnostic and Prognostic Classification. Journal of National Cancer Institute 2003, 95: 14â€“18.
BragaNeto UM, Dougherty ER: Is crossvalidation valid for smallsample microarray classification? Bioinformatics 2004, 20: 374â€“380.
Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer; 2001.
Brieman L: Bagging predictors. Machine Learning 1996, 24: 123â€“140.
Li X, Rao S, Wang Y, Gong B: Gene mining: a novel and powerful ensemble decision approach to hunting for disease genes using microarray expression profiling. Nucleic Acids Research 2004, 32: 2685â€“2694.
Liu XX, Krishnan A, Mondry A: An entropybased gene selection method for cancer classification using microarray data. BMC Bioinformatics 2005, 6: 76.
Li MF, FuLiu CS: Evaluation of gene importance in microarray data based upon probability of selection. BMC Bioinformatics 2005, 6: 67.
Cabrera T, RuizCabello F, Garrido F: Biological implication of HLADR expression in tumours. Scandinavian Journal of Immunology 1995, 41: 398â€“406.
SchÃ¶lkopf B, Smola AJ: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and beyond. Cambridge, MA: MIT Press; 2001.
Suykens JAK, Vandewalle J: Least squares support vector machine classifiers. Neural Processing Letters 1999, 9(3):293â€“300.
Suykens JAK, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J: Least Squares Support Vector Machines. Singapore: World Scientific; 2002.
Luntz A, Brailovsky V: On estimation of characters obtained in statistical procedure of recognition (in Russian). Technicheskaya Kibernatica 1969., 3:
Vapnik V, Chapelle O: Bounds on error expectation for support vector machines. Neural Computation 2000, 12: 2013â€“2036.
Chapelle O, Vapnik V, Bousquet O, Mukherjee S: Choosing multiple parameters for support vector machines. Machine Learning 2002, 46: 131â€“159.
Cawley GC, Talbot NLC: Fast exact leaveoneout crossvalidation of sparse least squares support vector machines. Neural Networks 2004, 17: 1467â€“1475.
Van Gestel T, Baesens B, Suykens J, Espinoza M, Baestaens D, Vanthienen J, De Moor B: Bankruptcy Prediction with Least Squares Support Vector Machine Classifiers. In Proc of the International Conference on Computational Intelligence for Financial Engineering (CIFER'03). Hong Kong, China; 2003:1â€“8.
Pelckmans K, Suykens J: LSSVMlab toolbox.[http://www.esat.kuleuven.ac.be/sista/lssvmlab/]
Platt J: Probabilities for support vector machines. In Advances of Large Margin Classifiers. Edited by: Smola A, Bartlett P, SchÃ¶lkopf B, Schuurmans D. Cambridge, MA: MIT Press; 2000.
Acknowledgements
The authors acknowledge the financial support offered by the A*Star (Agency for Science, Technology and Research) under the grant # 052 101 0020 to conduct this research.
Author information
Authors and Affiliations
Corresponding author
Additional information
Authors' contributions
KT developed the LOOC criteria, coded all procedures and conducted experiments. PNS proposed the usage of correlation in the GLGS procedure. XY suggested the usage of gradient descent. All three authors participated in the preparation of the manuscript.
Electronic supplementary material
12859_2005_834_MOESM1_ESM.pdf
Additional File 1: The proof of Lemma 1 and derivations of Eq. (12). Additional file descriptions text (including details of how to view the file, if it is in a nonstandard format). (PDF 100 KB)
Authorsâ€™ original submitted files for images
Below are the links to the authorsâ€™ original submitted files for images.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Tang, E.K., Suganthan, P. & Yao, X. Gene selection algorithms for microarray data based on least squares support vector machine. BMC Bioinformatics 7, 95 (2006). https://doi.org/10.1186/14712105795
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/14712105795