Dimension reduction with redundant gene elimination for tumor classification

Background Analysis of gene expression data for tumor classification is an important application of bioinformatics methods. However, gene expression data from DNA microarray experiments are hard to analyse with commonly used classifiers, because the data sets contain only a few observations but thousands of measured genes. Dimension reduction is often used to handle such high-dimensional problems, but it is hampered by the large number of redundant features in microarray data sets. Results Dimension reduction is performed by combining feature extraction with redundant gene elimination for tumor classification. A novel metric of redundancy based on DIScriminative Contribution (DISC) is proposed, which estimates feature similarity by explicitly building a linear classifier on each gene. Compared with the standard linear correlation metric, DISC takes the label information into account and directly estimates the redundancy of the discriminative ability of two given features. Based on the DISC metric, a novel algorithm named REDISC (Redundancy Elimination based on Discriminative Contribution) is proposed, which eliminates redundant genes before feature extraction and improves the performance of dimension reduction. Experimental results on two microarray data sets show that REDISC effectively and reliably improves the generalization performance of dimension reduction and hence of the classifier used. Conclusion For tumor classification, dimension reduction with redundant gene elimination before feature extraction performs better than feature extraction alone, and supervised redundant gene elimination is superior to commonly used unsupervised methods such as those based on linear correlation coefficients.


Background
DNA microarray experiments are used to collect information from tissue and cell samples regarding gene expression differences for tumor diagnosis [1,2]. The output of a microarray experiment is summarized as an n × p data matrix, where n is the number of tissue or cell samples and p is the number of genes (features). Here, p is always much larger than n, which hurts the generalization performance of most classification methods. To overcome this problem, we either select a small subset of interesting genes (gene selection, feature selection) or construct K new components summarizing the original data as well as possible, with K < p (feature extraction).
Gene selection has been studied extensively in the last few years. The most commonly used procedures compute a score for each gene individually and select the genes with the best scores. Gene selection procedures output a list of relevant genes which can be experimentally analyzed by biologists. This approach is often called univariate gene selection; its advantages are simplicity and interpretability. However, interactions and correlations between genes are ignored during gene selection, although they are of great interest in systems biology. Furthermore, gene selection often fails to pick relevant genes, because the scores assigned to correlated genes are too similar, and none of the genes is strongly preferred over another.
Feature extraction is an alternative to gene selection for overcoming the curse of dimensionality. Unlike gene selection, feature extraction projects the whole data into a low-dimensional space and constructs new dimensions (components) by analyzing the statistical relationships hidden in the data set. Although feature extraction is often criticized for its lack of interpretability, the new components often give good information or hints about the data's intrinsic structure. Researchers have developed different feature extraction methods for applications in bioinformatics and computational biology [3][4][5], which are generally divided into two groups, unsupervised and supervised. Among the various methods, Principal Component Analysis (PCA), an unsupervised method, and Partial Least Squares (PLS), a supervised method, are widely used [5].
Gene selection and feature extraction algorithms have complementary advantages and disadvantages. Feature extraction algorithms thrive on correlation among features but fail to remove irrelevant and redundant features from a set of complex features. Feature selection algorithms fail when all the features are correlated but do well with informative features. It would therefore be interesting to combine gene selection and feature extraction into a general model. In practice, the simplest way is to apply a preliminary gene selection procedure before feature extraction.
Microarray data are characterized by a huge number of genes and few examples, and it is believed that there exist many redundant genes among the full gene set [6]. Preserving the most discriminative genes while removing irrelevant and redundant genes remains an open issue. In this paper, we propose a novel metric of redundancy which can effectively eliminate redundant genes before feature extraction. By measuring the discriminative ability of each gene and the pair-wise complementarity, the new method removes redundant genes that contribute little discriminative ability. We also compare our method with a commonly used redundant gene reduction method based on linear correlation. Experiments on two real microarray data sets demonstrate the good performance of our method.
Some notation used in this work is clarified here. Expression levels of p genes in n microarray samples are collected in an n × p data matrix $X = (x_{ij})$, 1 ≤ i ≤ n, 1 ≤ j ≤ p, in which the entry $x_{ij}$ is the expression level of the jth gene in the ith microarray sample. As we only consider binary classification problems, the labels of the n microarray samples are collected in the vector y. When the ith sample belongs to class one, the element $y_i$ is 1; otherwise it is -1. The matrix $S_X$ denotes the p × p covariance matrix of the gene expressions.
Besides, || • || denotes the length of a vector, $X^T$ represents the transpose of X, and $X^{-1}$ represents the inverse matrix of X. The matrices X and y used in the following are assumed to be centered to zero mean by each column.

Results
According to the framework proposed in this paper, dimension reduction is performed by combining redundant gene elimination with feature extraction, and a classifier is then applied to the extracted features. The proposed algorithm REDISC (Redundancy Elimination based on DIScriminative Contribution) is compared with the commonly used algorithm RELIC (Redundancy Elimination based on LInear Correlation) for redundant gene elimination on two microarray data sets, Colon and Leukemia, where the threshold δ in REDISC and RELIC is varied from 0.1 to 0.9. Feature extraction is performed by principal component analysis (PCA) and partial least squares (PLS). The classifier is a linear support vector machine (SVM) with C = 1.
Statistical results on the number of genes remaining after performing REDISC and RELIC are shown in Figure 1; detailed results are also listed in Tables 1-4. Comparative results of BACC obtained by SVM on the new feature sets produced by PCA or PLS after performing REDISC and RELIC are illustrated in Figure 2 and Figure 3. Detailed results of Sensitivity, Specificity, BACC, Precision, PPV, NPV and Correction on Colon and Leukemia are shown in Tables 1-4, where the results are averaged over ten runs.
The results in Figures 1-3  5. REDISC and RELIC with different threshold values produce different results; no single threshold is optimal for all the data sets.

Discussion
The experimental results support our assumption that redundant features hurt the performance of feature extraction and classification. Other considerations on the above results are listed below:
1. The results confirm that there exist many redundant genes in microarray data and that it is necessary to perform redundant gene elimination. Usually, there are four types of features in a data set: I, strongly relevant features; II, weakly relevant but non-redundant features; III, weakly relevant and redundant features; and IV, irrelevant features. Types I and II are the essential features in the data set, and types III and IV should be removed [7]. Previous work shows that III and IV should be removed for classifiers; in this paper, we show that they should also be removed for feature extraction methods such as PCA and PLS.
2. REDISC obtains better results with fewer features than RELIC, which shows that REDISC is better than RELIC at selecting relevant features and eliminating redundant ones. Proper redundant feature elimination helps improve the performance of feature extraction and classification. Simply reducing redundant genes by linear correlation is not always beneficial, because linear correlation ignores the label information in the data set and therefore does not estimate redundancy properly. REDISC takes label information into account for redundant gene elimination, and may thus be viewed as a supervised method. Since the final step is classification, supervised redundant gene elimination is better than an unsupervised method such as RELIC.
3. The performance of dimension reduction improves when redundant genes are properly eliminated. The improvement for PLS is much more dramatic than that for PCA. A possible reason is that redundant genes obstruct supervised methods more noticeably, since supervised methods often build more precise models than unsupervised ones.

Conclusion
Dimension reduction is widely used in bioinformatics and related fields to overcome the curse of dimensionality. However, the large number of redundant genes in microarray data often hampers the application of dimension reduction. Preliminary redundant gene elimination before feature extraction is an interesting issue that has often been neglected.
In this paper, a novel metric of redundancy based on Discriminative Contribution (DISC) is proposed, which directly estimates the similarity between two features by explicitly building a linear classifier on each gene. The REDISC algorithm (Redundancy Elimination based on Discriminative Contribution) is also proposed. REDISC is compared with a commonly used algorithm, RELIC (Redundancy Elimination based on Linear Correlation), on two real microarray data sets. Experimental results demonstrate the necessity of preliminary redundant gene elimination before feature extraction for tumor classification and the superiority of REDISC over RELIC. This work is an attempt to propose a general framework for dimension reduction in tumor classification by combining redundant gene elimination and feature extraction. More investigation needs to be done in the future on the efficiency of fusing feature selection with feature extraction.
Figure 1. The number of selected genes by performing REDISC and RELIC with different parameters.

A framework of dimension reduction
In this paper, we propose a novel framework for dimension reduction that combines redundant feature elimination with feature extraction to improve classification performance. The framework is illustrated in Figure 4: dimension reduction is applied to the microarray data before classification, and it consists of redundant gene elimination followed by feature extraction. The redundant gene elimination algorithms used before feature extraction in this paper actually remove irrelevant features and redundant features at the same time. We omit a separate irrelevant gene elimination step because irrelevant genes are few in the gene data set and are not the focus of this paper.
Figure 2. Comparative results of BACC scores by using different algorithms on the Colon data set.
Figure 3. Comparative results of BACC scores by using different algorithms on the Leukemia data set.
Redundant gene elimination is the critical part of the framework. We propose a novel algorithm based on discriminative ability to improve on the commonly used linear correlation, which is described in detail in the following subsections. Feature extraction is performed by two methods, one supervised (partial least squares) and one unsupervised (principal component analysis), which are briefly introduced in a following subsection. A support vector machine is used as the classifier.

Redundant gene elimination
As redundant features contribute nothing to classification, we consider eliminating them before feature extraction, which has the following benefits:
1. Eliminating redundant features improves classification accuracy. In general, original microarray data sets contain many irrelevant and redundant genes, which hurt the performance of feature extraction. In practice, biologists often expect noise to be reduced, at least to some extent, during the feature extraction stage. If some redundant genes are removed beforehand, the performance of feature extraction may be improved.
2. Preliminary feature selection facilitates the application of feature extraction. Compared with modeling on the original data directly, feature extraction on preliminarily gene-selected data requires much less computation and memory. Memory is a particular concern: most feature extraction methods are impractical for high-dimensional data because they require all the data to be loaded into RAM at once. However, any additional gene selection procedure brings some extra computation, so the computational complexity of preliminary feature selection must not be too high.
3. Preliminary feature selection improves the interpretability of the components. The meaning of the components produced by feature extraction is always difficult to interpret. Biologists often analyze the relation between extracted components and original features through the coefficients, but this is obscured by the large number of genes. Reducing the number of original features is obviously helpful when the components need to be related to the original genes manually.

The previous metrics
Discriminative ability (predictive ability) is a general notion which can be measured in various ways and used to select significant features for classification. Many effective metrics have been proposed, such as the t-statistic, information gain, the χ² statistic and the odds ratio [8,9]. Filter feature selection methods sort features by their discriminative ability scores and retain some top-ranked features as essential for classification.
However, the t-statistic and most other discriminative ability measures are based on individual features and do not consider the redundancy between two features. Two features with the same rank scores may be redundant to each other when they are completely correlated, or complementary to each other when they are nearly independent.
For the task of feature selection, we want to eliminate the redundant features and retain only the interactive ones. But there exist many redundant features in the top-ranked feature set produced by filter methods. The redundant features increase the dimensionality and contribute little to the final classification. In order to eliminate redundant features, a metric needs to estimate the redundancy directly.
Figure 4. The novel framework of dimension reduction.
Notions of feature redundancy are normally expressed in terms of feature correlation. It is widely accepted that two features are redundant to each other if their values are completely correlated. In practice, however, it may not be so straightforward to determine feature redundancy when a feature is correlated with a set of features. A widely used approach is to approximate the redundancy of a feature set by considering pair-wise feature redundancy.

The linear correlation metric
For linear cases, the most well known pair-wise redundancy metric is the linear correlation coefficient. Given a pair of features (x, y), the definition of the linear correlation coefficient Cor(x, y) is:
$$\mathrm{Cor}(x, y) = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}}$$
where $\bar{x}$ and $\bar{y}$ are the means of x and y respectively. The value of Cor(x, y) lies between -1 and 1. If x and y are completely correlated, Cor(x, y) takes the value of 1 or -1; if x and y are independent, Cor(x, y) is zero. It is a symmetrical metric.
The linear correlation coefficient has the advantages of efficiency and simplicity, but it is not suitable for redundant feature elimination when classification is the final target, since it does not use any label information. For example, two highly correlated features whose values differ only slightly, but whose small differences carry critical discriminative ability, may be considered a redundant pair. Removing either of them will decrease classification accuracy. Guyon et al. have also pointed out that high correlation (or anti-correlation) of variables does not mean absence of variable complementarity [8]. The problem with the linear correlation coefficient is that it measures the similarity of the numerical values of two features, not the similarity of their discriminative ability.
The ideal feature set should have both great discriminative ability and little feature redundancy, and redundancy cannot be obtained by estimating the properties of the features separately. A more elaborate measure of redundancy is required to estimate the differences in discriminative ability between two features.
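The concern about correlation ignoring label information can be illustrated with a small sketch. The toy feature values and the unweighted nearest-centroid rule below are assumptions made for illustration only; they are not taken from the data sets or the classifier used in this paper.

```python
import math

def cor(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length feature vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def centroid_accuracy(values, labels):
    """Training accuracy of a nearest-centroid rule on a single feature.
    Assumption: plain, unweighted distance to each class mean."""
    pos = [v for v, t in zip(values, labels) if t == 1]
    neg = [v for v, t in zip(values, labels) if t == -1]
    mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    preds = [1 if abs(v - mu_pos) <= abs(v - mu_neg) else -1 for v in values]
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)

# Hypothetical toy data: the two features differ only in samples 3 and 4.
labels = [1, 1, 1, -1, -1, -1]
f1 = [5.0, 5.2, 3.9, 4.1, 3.0, 2.8]
f2 = [5.0, 5.2, 4.1, 3.9, 3.0, 2.8]

r = cor(f1, f2)
acc1 = centroid_accuracy(f1, labels)
acc2 = centroid_accuracy(f2, labels)
print(round(r, 3), round(acc1, 3), round(acc2, 3))
```

In this toy case the two features are almost perfectly correlated, yet the first separates four of the six samples while the second separates all six, so a purely correlation-based filter could discard the more useful one.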

The proposed novel metric
In order to measure the similarity of the discriminative ability of two features, discriminative ability needs to be defined more precisely. That is to say, we want to know which examples can be correctly classified by a given feature and which cannot. With such a metric, it is possible to compare the discriminative ability of two features through the corresponding correctly classified examples.
In the field of text classification, Training Accuracy on Single Feature (TASF) has been shown to be an effective metric of discriminative ability [9]. It builds a classifier for each feature, and the corresponding training accuracy is used as the discriminative score.
Various classifiers can be used to calculate TASF; for simplicity, we consider a linear learner here. Given a feature z, the classification function is defined in terms of the feature mean and the sample size $n_1$ of class one and the feature mean and the sample size $n_2$ of class two. This is a weighted centroid-based classifier, which assigns each example to the class whose centroid is closer in weighted distance. The computational complexity of this classifier is O(n).
Feeding the whole training set back into each classifier, we can estimate the training accuracy of the classifier built on each feature, which is used to represent the discriminative ability of that feature. The higher the training accuracy, the greater the discriminative ability. Since only one feature is used to build the classifier, only a part of the training examples can be correctly separated in most cases. The value of TASF ranges from 0 to 1. A feature is considered irrelevant if its TASF value is no greater than 0.5.
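A minimal sketch of the TASF score is given below. It assumes a plain nearest-centroid rule with unweighted distances and ties assigned to class one; the weighting by class size used in the classifier described above is not reproduced, and the function names and toy values are hypothetical.

```python
def centroid_predict(values, labels):
    """Predict +1 / -1 for every sample from one feature by nearest class centroid."""
    pos = [v for v, t in zip(values, labels) if t == 1]
    neg = [v for v, t in zip(values, labels) if t == -1]
    mu_pos = sum(pos) / len(pos)
    mu_neg = sum(neg) / len(neg)
    # Assumption: unweighted distance to each centroid; ties go to class +1.
    return [1 if abs(v - mu_pos) <= abs(v - mu_neg) else -1 for v in values]

def tasf(values, labels):
    """Training Accuracy on Single Feature: fraction of training samples
    that the single-feature classifier labels correctly."""
    preds = centroid_predict(values, labels)
    correct = sum(p == t for p, t in zip(preds, labels))
    return correct / len(labels)

# Hypothetical expression values of one gene across six samples.
gene = [2.1, 1.9, 2.4, 0.3, 0.5, 2.0]
labels = [1, 1, 1, -1, -1, -1]
score = tasf(gene, labels)
print(score)
```

Each call makes a constant number of passes over the n samples, which is consistent with the O(n) cost stated above.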
Based on TASF, we propose a novel metric of feature redundancy. Given two features $z_1$ and $z_2$, two classifiers $C_1$ and $C_2$ can be constructed. When the whole training set is fed to the classifiers, each of $C_1$ and $C_2$ correctly classifies a subset of the samples. The differences between the correctly classified examples are used to estimate the similarity of discriminative abilities. We record the concrete classification results as in Table 5, where a + b + c + d equals the size of the training set n. Here a counts the examples correctly classified by both classifiers, b those correctly classified only by $C_1$, c those correctly classified only by $C_2$, and d those misclassified by both. The values (a + b)/n and (a + c)/n are the training accuracies of $C_1$ and $C_2$ respectively. The score a + d measures the similarity of the features, and the score b + c measures the dissimilarity. When b + c = 0, the two features $z_1$ and $z_2$ have exactly the same discriminative ability.
Our feature elimination problem then becomes whether the contribution of an additional feature to a given feature is significant. The additional feature is considered redundant if its contribution is tiny. We therefore propose a novel metric of Redundancy based on DIScriminative Contribution (DISC). The DISC of $z_1$ and $z_2$, which estimates the redundancy of $z_2$ with respect to $z_1$, is defined as follows:
$$\mathrm{DISC}(z_1, z_2) = \frac{d}{c + d}$$
The pair-wise DISC metric is asymmetrical, and its computational complexity is O(n).
It is clear that c + d is the number of examples which cannot be discriminated by $C_1$, and c is the number of those which can be correctly classified through the collaboration of $C_1$ and $C_2$. So the proportion c/(c + d) is the discriminative contribution of $C_2$ to $C_1$, and the value d/(c + d) is the DISC metric of redundancy, which varies from 0 to 1. When the DISC score is 1, the discriminative ability of $C_2$ is covered by that of $C_1$, and $z_2$ is completely redundant to $z_1$. When the DISC value is 0, all training examples can be correctly classified by the union of $C_2$ and $C_1$, and we consider $z_2$ complementary to $z_1$.
DISC is linear in two respects: the classifier is linear, and the cross discriminative abilities are counted in a linear way. Microarray problems meet this assumption, since most microarray data sets are binary classification problems in which each gene has an equal position in performing classification.
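The ratio d/(c + d) can be computed directly from the per-sample correctness of the two single-feature classifiers. The sketch below is illustrative: the function name and the toy correctness flags are hypothetical, and the handling of the case c + d = 0, which the text does not specify, is an assumption.

```python
def disc(correct1, correct2):
    """DISC(z1, z2) = d / (c + d), from per-sample correctness flags of C1 and C2."""
    # c: samples misclassified by C1 but correctly classified by C2.
    c = sum((not r1) and r2 for r1, r2 in zip(correct1, correct2))
    # d: samples misclassified by both classifiers.
    d = sum((not r1) and (not r2) for r1, r2 in zip(correct1, correct2))
    if c + d == 0:
        # Assumption: if C1 makes no training errors, z2 cannot contribute anything,
        # so it is treated as fully redundant.
        return 1.0
    return d / (c + d)

# Hypothetical correctness flags of two classifiers on six training samples.
c1_flags = [True, True, True, True, False, False]
c2_flags = [True, False, False, True, True, False]

forward = disc(c1_flags, c2_flags)   # redundancy of z2 with respect to z1
backward = disc(c2_flags, c1_flags)  # redundancy of z1 with respect to z2
print(forward, backward)
```

The two directions give different values here, which reflects the asymmetry of the pair-wise metric.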

The proposed redundant gene elimination algorithms
The REDISC algorithm
Based on the DISC redundancy metric, we propose the REDISC algorithm (Redundancy Elimination based on Discriminative Contribution), which eliminates redundant features using the pair-wise DISC scores. REDISC is illustrated in Figure 5. Its basic idea is as follows. First, REDISC filters out trivial features, which have no discriminative ability on their own, using the TASF score threshold of 0.5. Then the features are ordered by their TASF scores. As we usually want to retain the more discriminative of two redundant features, REDISC tries to preserve the features with the top TASF scores. REDISC uses two nested iterations to eliminate redundant features whose discriminative ability is covered by any higher-ranked feature. The computational complexity of REDISC is $O(np^2)$.
Figure 5. The REDISC algorithm.
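The steps described above might be sketched as follows. This is not the pseudo-code of Figure 5: the unweighted nearest-centroid classifier, the rule that a feature is dropped when its DISC score against an already retained feature reaches the threshold δ, the comparison against retained features only, and all names and toy data are assumptions for illustration.

```python
def correct_flags(values, labels):
    """Per-sample correctness of an (unweighted) nearest-centroid rule on one feature."""
    pos = [v for v, t in zip(values, labels) if t == 1]
    neg = [v for v, t in zip(values, labels) if t == -1]
    mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    preds = [1 if abs(v - mu_pos) <= abs(v - mu_neg) else -1 for v in values]
    return [p == t for p, t in zip(preds, labels)]

def disc(flags1, flags2):
    """d / (c + d); assumed to be 1.0 when the first classifier makes no errors."""
    c = sum((not a) and b for a, b in zip(flags1, flags2))
    d = sum((not a) and (not b) for a, b in zip(flags1, flags2))
    return 1.0 if c + d == 0 else d / (c + d)

def redisc(columns, labels, delta):
    """Return indices of retained genes; `columns` holds one value list per gene."""
    flags = [correct_flags(col, labels) for col in columns]
    tasf = [sum(f) / len(f) for f in flags]
    # Step 1: filter out trivial features (TASF no greater than 0.5).
    candidates = [j for j in range(len(columns)) if tasf[j] > 0.5]
    # Step 2: rank by TASF, best first (stable sort keeps index order on ties).
    candidates.sort(key=lambda j: -tasf[j])
    # Step 3: keep a feature only if no higher-ranked retained feature covers it.
    kept = []
    for j in candidates:
        if all(disc(flags[i], flags[j]) < delta for i in kept):
            kept.append(j)
    return kept

# Hypothetical toy data: g1 is a rescaled copy of g0, g2 is complementary, g3 is trivial.
labels = [1, 1, 1, -1, -1, -1]
g0 = [2.1, 1.9, 2.4, 0.3, 0.5, 2.0]
g1 = [4.2, 3.8, 4.8, 0.6, 1.0, 4.0]
g2 = [1.0, 0.2, 1.2, 0.1, 1.1, -1.0]
g3 = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
selected = redisc([g0, g1, g2, g3], labels, delta=0.5)
print(selected)
```

In this toy run the rescaled copy and the trivial feature are removed, while the lower-scoring but complementary feature is retained.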

The RELIC algorithm
In order to compare our method with commonly used redundant feature elimination methods, we present the RELIC algorithm (Redundancy Elimination based on Linear Correlation) [10], which filters out redundant features using the pair-wise linear correlation. A threshold is needed to control how many features are eliminated. RELIC is given in Figure 6; its computational complexity is also $O(np^2)$.
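A correlation-based filter of this kind could look roughly like the sketch below. It is not the pseudo-code of Figure 6: the visiting order of the features, the use of the absolute correlation, the treatment of constant features and all names are assumptions.

```python
import math

def cor(x, y):
    """Linear correlation coefficient; constant features are treated as uncorrelated."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: avoid division by zero for constant features
    return sxy / math.sqrt(sxx * syy)

def relic(columns, delta):
    """Return indices of retained features; `columns` holds one value list per feature."""
    kept = []
    # Assumption: features are visited in their given order.
    for j, col in enumerate(columns):
        # Drop the feature if it is strongly correlated with any retained feature.
        if all(abs(cor(columns[i], col)) < delta for i in kept):
            kept.append(j)
    return kept

# Hypothetical toy features.
cols = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],    # nearly a multiple of the first feature
    [1.0, -1.0, 1.0, -1.0],  # weakly anti-correlated with the first feature
]
print(relic(cols, delta=0.9), relic(cols, delta=0.4))
```

Unlike the DISC-based filter, this procedure never looks at the class labels, so a lower threshold simply removes more features regardless of their discriminative ability.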

Feature extraction techniques
Principal component analysis
Principal component analysis (PCA) is a well-known method of feature extraction [11]. The basic idea of PCA is to reduce the dimensionality of a data set while retaining as much as possible of the variation in the original predictor variables. This is achieved by transforming the p original variables X = [$x_1$, $x_2$, ..., The maximum number of components K is determined by the number of nonzero eigenvalues, which is the rank of $S_X$, and K ≤ min(n, p). In practice, however, the maximum value of K is not necessary. Some tail components, which have tiny eigenvalues and represent little of the variance of the original data, often need to be dropped. The threshold of K is often determined by cross-validation or by the proportion of explained variance [11]. The computational cost of PCA, determined by the number of original predictor variables p and the number of samples n, is of the order of min($np^2 + p^3$, $pn^2 + n^3$). In other words, the cost is $O(pn^2 + n^3)$ when p > n.
Figure 6. The RELIC algorithm.
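As an illustration of how a principal component relates to the covariance matrix, the sketch below estimates the first component by power iteration. This is only one simple way to obtain the leading eigenvector and is an assumption of this example, not the decomposition used in the experiments; the function name and toy data are hypothetical.

```python
import math

def first_principal_component(rows, iters=200):
    """Leading principal direction and scores of the data in `rows` (n samples x p features)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    X = [[r[j] - means[j] for j in range(p)] for r in rows]  # center each column
    # Assumption: a fixed, non-uniform start vector; power iteration fails in the rare
    # case that the start vector is orthogonal to the leading eigenvector.
    w = [1.0 / (j + 1) for j in range(p)]
    norm = math.sqrt(sum(a * a for a in w))
    w = [a / norm for a in w]
    for _ in range(iters):
        # v = X^T (X w) is proportional to S_X w, without forming the p x p matrix.
        t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
        v = [sum(X[i][j] * t[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm == 0:
            break
        w = [a / norm for a in v]
    scores = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
    return w, scores

# Hypothetical toy data lying exactly on the line through direction (1, 2).
w, scores = first_principal_component([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
print(w, scores)
```

Further components would be obtained by removing the found direction from the data and repeating, up to the rank bound K ≤ min(n, p) mentioned above.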

Partial Least Squares
Partial Least Squares (PLS) was first developed as an algorithm for performing matrix decompositions, and was then introduced as a multivariate regression tool in the context of chemometrics [12,13]. In recent years, PLS has also been found to be an effective feature extraction technique for tumor discrimination [14,15].
The underlying assumption of PLS is that the observed data is generated by a system or process which is driven by a small number of latent (not directly observed or measured) features. Therefore, PLS aims at finding uncorrelated linear transformations (latent components) of the original predictor features which have high covariance with the response features. Based on these latent components, PLS simultaneously predicts the response features y (the regression task) and reconstructs the original matrix X (the data modeling task).
The objective of constructing components in PLS is to maximize the covariance between the response variable y and the original predictor variables X, subject to the constraint $w_i^T S_X w_j = 0$, ∀ 1 ≤ i < j. The central task of PLS is to obtain the vectors of optimal weights $w_i$ (i = 1, ..., K) that form a small number of components, whereas PCA is an "unsupervised" method that utilizes the X data only.
Like PCA, PLS reduces the complexity of microarray data analysis by constructing a small number of gene components, which can be used to replace the large number of original gene expression measures. Moreover, obtained by maximizing the covariance between the components and the response variable, the PLS components are generally more predictive of the response variable than the principal components.
PLS is computationally efficient, with a cost of only O(npK); that is, the number of calculations required by PLS is linear in n and in p. It is thus much faster than PCA, since K is always less than n.
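The way PLS components are built from the covariance with the response can be sketched with a basic single-response (PLS1) procedure with deflation. The particular NIPALS-style variant, the function name and the toy data are assumptions of this example; X and y are taken to be centered, as assumed throughout the paper.

```python
import math

def pls1_components(X, y, K):
    """Extract up to K PLS weight vectors and score vectors from centered X (n x p) and y."""
    n, p = len(X), len(X[0])
    X = [row[:] for row in X]  # work on copies, since deflation modifies the data
    y = y[:]
    weights, scores = [], []
    for _ in range(K):
        # Weight vector proportional to X^T y: the direction of maximal covariance with y.
        w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0:
            break
        w = [a / norm for a in w]
        t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]  # component scores
        tt = sum(a * a for a in t)
        # Deflation: remove the part of X and y explained by this component.
        load = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(y[i] * t[i] for i in range(n)) / tt
        for i in range(n):
            for j in range(p):
                X[i][j] -= t[i] * load[j]
            y[i] -= q * t[i]
        weights.append(w)
        scores.append(t)
    return weights, scores

# Hypothetical centered toy data: four samples, three genes, two balanced classes.
X = [[1.0, 2.0, 0.5], [-1.0, -1.0, 0.5], [2.0, -1.0, -1.5], [-2.0, 0.0, 0.5]]
y = [1.0, 1.0, -1.0, -1.0]
W, T = pls1_components(X, y, K=2)
print(W[0], T[0])
```

Each component needs only a few passes over the n × p matrix, in line with the O(npK) cost stated above, and the deflation step makes successive score vectors mutually orthogonal.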
Feature extraction methods extract components, which are linear or non-linear transformations of the original genes, to represent the original data. Although the new subspace is effective for data analysis, no original gene is excluded during the process, which often obstructs the interpretation of the components. Eliminating redundant genes before dimension reduction is one way to address this problem.

Classifier - Support Vector Machines
Support vector machines (SVM), proposed by Vapnik and his co-workers in the 1990s, have developed quickly during the last decade [16] and have been successfully applied to biological data mining [17], drug discovery [18,19], etc. Denoting the training sample as $S = \{(x, y)\} \subseteq \{\mathbb{R}^n \times \{-1, 1\}\}^{\ell}$, the SVM discriminant hyperplane can be written as
$$y = \mathrm{sgn}(\langle w, x \rangle + b)$$
where w is a weight vector and b is a bias. According to the generalization bound in statistical learning theory [20], we need to minimize the following objective function for a 2-norm soft margin version of SVM:
$$\min_{w, b} \; \langle w, w \rangle + C \sum_{i=1}^{\ell} \xi_i^2 \quad \text{subject to} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \; i = 1, \ldots, \ell$$
in which the slack variable $\xi_i$ is introduced when the problem is infeasible. The constant C > 0 is a penalty parameter; a larger C corresponds to assigning a larger penalty to errors.
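To make the roles of w, b, the slack variables and C concrete, the sketch below evaluates a linear decision function and a 2-norm soft margin objective for a given hyperplane. The objective is assumed to take the usual form ⟨w, w⟩ + C Σ ξ_i², with ξ_i the amount by which sample i violates the unit margin; the function names and toy samples are hypothetical, and no training procedure is shown.

```python
def decision_value(w, b, x):
    """Signed score <w, x> + b of a sample x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def predict(w, b, x):
    """Class label given by the sign of the decision value (ties go to +1)."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def soft_margin_objective(w, b, samples, C=1.0):
    """Assumed 2-norm soft margin objective: <w, w> + C * sum of squared slacks."""
    slack_sq = 0.0
    for x, y in samples:
        # Slack is the amount by which y * (<w, x> + b) falls short of 1.
        xi = max(0.0, 1.0 - y * decision_value(w, b, x))
        slack_sq += xi ** 2
    return sum(wj * wj for wj in w) + C * slack_sq

# Hypothetical two-dimensional samples with labels in {-1, 1}.
samples = [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([-1.0, 0.0], -1), ([0.5, 1.0], -1)]
value = soft_margin_objective([1.0, 0.0], 0.0, samples, C=1.0)
print(value)
```

Increasing C in this expression raises the cost of every margin violation, which is the sense in which a larger C penalizes errors more heavily.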

Data Sets
Two microarray data sets used in our study are listed in Table 6. They are briefly described below, and the corresponding C4.5 format versions are available at [21]. We do not use the original split provided by the authors; instead, we merge each data set before using it.
Colon. Affymetrix oligonucleotide arrays were used to monitor the expression of over 6,500 human genes in samples of 40 tumor and 22 normal colon tissues. The expression levels of the 2,000 genes with the highest minimal intensity across the 62 tissues were used in the analysis [2].
Leukemia. The acute leukemia data set was published in [1]; it consists of 72 bone marrow samples, 47 ALL and 25 AML. The gene expression intensities were obtained from Affymetrix high-density oligonucleotide microarrays containing probes for 7,129 genes.

Experimental settings
We use a stratified 10-fold cross-validation procedure, in which each data set is first merged and then split into ten subsets of approximately equal size. Each subset is used as a test set once, and the remaining subsets are combined and used as the training set. Within each cross-validation fold, the gene expression data is standardized: the expressions of the training set are transformed to zero mean and unit standard deviation across samples, and the test set is transformed using the means and standard deviations of the corresponding training set. The 10-fold cross-validation is repeated ten times, because the 10 × 10 cross-validation measurement is more reliable than the randomized resampling test strategy and leave-one-out cross-validation, owing to the correlations between the test and training sets; a detailed discussion can be found in [22].
The linear Support Vector Machine (SVM) with C = 1 is used as the classifier, which is trained on the training set to predict the label of test samples. Figure 7 contains pseudo-code to describe the complete 10 × 10 cross-validation measurement procedure.
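The splitting and standardization steps described above can be sketched as follows. This is not the pseudo-code of Figure 7: the shuffling scheme, the use of the sample standard deviation, the handling of constant genes and all function names are assumptions.

```python
import math
import random

def stratified_folds(labels, k, seed):
    """Split sample indices into k folds with class proportions kept as equal as possible."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for cls in sorted(set(labels)):
        idx = [i for i, t in enumerate(labels) if t == cls]
        rng.shuffle(idx)
        # Deal the samples of this class round-robin, continuing where the last class stopped.
        for pos, i in enumerate(idx):
            folds[(pos + offset) % k].append(i)
        offset += len(idx)
    return folds

def standardize(train_rows, test_rows):
    """Scale both sets with the training means and standard deviations only."""
    n, p = len(train_rows), len(train_rows[0])
    means = [sum(r[j] for r in train_rows) / n for j in range(p)]
    stds = []
    for j in range(p):
        var = sum((r[j] - means[j]) ** 2 for r in train_rows) / (n - 1)
        # Assumption: constant genes keep scale 1.0 to avoid division by zero.
        stds.append(math.sqrt(var) or 1.0)
    def scale(rows):
        return [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]
    return scale(train_rows), scale(test_rows)

def repeated_cv_splits(labels, k=10, repeats=10):
    """Yield (train indices, test indices) for every fold of every repetition."""
    for rep in range(repeats):
        for test_idx in stratified_folds(labels, k, seed=rep):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            yield train_idx, test_idx

# Class sizes of the Colon data set: 40 tumor and 22 normal samples.
labels = [1] * 40 + [-1] * 22
folds = stratified_folds(labels, k=10, seed=0)
print([len(f) for f in folds])
```

Standardizing with training statistics only keeps information from the test samples out of the model, and redundant gene elimination and feature extraction would likewise be fitted on the training part of each split.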
In order to precisely characterize the performance of different learning methods, we define several performance measures below (see [23]). Here TP, TN, FP, and FN, stand for the number of true positive, true negative, false positive, and false negative samples, respectively.
Sensitivity is defined as $\frac{TP}{TP + FN}$ and is also known as Recall.
Specificity is defined as $\frac{TN}{TN + FP}$.
BACC (Balanced Accuracy) is defined as $\frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$, which is the average of sensitivity and specificity.
Precision is defined as $\frac{TP}{TP + FP}$.
PPV (Positive Predictive Value) is defined as $\frac{TP}{TP + FP}$.
NPV (Negative Predictive Value) is defined as $\frac{TN}{TN + FN}$.
Correction is defined as $\frac{TP + TN}{TP + TN + FP + FN}$ and measures the overall percentage of samples correctly classified.
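The measures above follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the function names, the convention of returning 0.0 for an empty denominator and the example counts are assumptions, not results from this paper.

```python
def ratio(num, den):
    # Assumption: report 0.0 when a denominator is empty.
    return num / den if den else 0.0

def performance_measures(tp, tn, fp, fn):
    """Compute the performance measures from confusion-matrix counts."""
    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "BACC": (sensitivity + specificity) / 2,
        "Precision": ratio(tp, tp + fp),
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
        "Correction": ratio(tp + tn, tp + tn + fp + fn),
    }

# Hypothetical counts for a 62-sample data set.
m = performance_measures(tp=35, tn=18, fp=4, fn=5)
for name, value in m.items():
    print(name, round(value, 4))
```

BACC is the more informative summary when the classes are unbalanced, since a classifier that always predicts the majority class can still reach a high Correction.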

Competing interests
The authors declare that they have no competing interests.
Figure 7. Experimental procedure for comparing different algorithms.