Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data

Background: Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification, while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied to microarray datasets are either rank-based, and hence do not take into account correlations between genes, or wrapper-based, which requires high computational cost and often yields difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy.

Results: We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter to strike the balance between relevance and redundancy, providing our techniques with the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets.

Conclusion: For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.


Background
The aim of feature selection is to form, from all available features in a dataset, a relatively small subset of features capable of producing the optimal classification accuracy. This subset is called the predictor set. The following are past and current stances on the use of feature selection for multiclass tissue classification:
• Feature selection does not aid in improving classification accuracy [1,2], at least not as much as the type of classifier used.
• Feature selection is often rank-based, and is implemented mainly with the intention of merely reducing cost/complexity of subsequent computations (since the transformed dataset is smaller), rather than also finding the feature subset which best explains the dataset [1,3].
• Studies proposing feature selection techniques with sophistication above that of rank-based techniques resort to an evaluation procedure which is prone to giving overly optimistic estimates of accuracy, but has the advantage of costing less computationally than procedures which yield more realistic estimates of accuracy [4][5][6][7].
In short, there are three ways in which feature selection has been, and still is regarded for multiclass microarray datasets: 1) should not be considered at all, 2) as simple rank-based methods for dataset truncation, and finally, 3) as more complicated methods with sound theoretical foundation, but with doubtful empirical results.
A feature selection technique is made of two components: the predictor set scoring method (which evaluates the goodness of a candidate predictor set) and the search method (which searches the gene subset space for the predictor set based on the scoring method). The technique becomes wrapper-based when classifiers are invoked in the predictor set scoring method. Otherwise, the technique is filter-based. Filter-based techniques, which are the focus of this study, have several advantages over wrapper-based techniques: 1) Filter-based techniques cost less computationally than wrapper-based techniques. 2) Filter-based techniques are not classifier-specific. 3) More importantly, unlike the typical 'black-box' trait of wrapper-based techniques, filter-based techniques provide a clear picture of why a certain feature subset is chosen as the predictor set, through the use of scoring methods in which inherent characteristic(s) of the predictor set (and not its prediction ability) are optimized. The last advantage is particularly crucial, since the predictor set scoring method in a filter-based technique can explain the prediction ability of the predictor set, whereas in a wrapper-based technique, the score of goodness of the predictor set is its prediction ability itself; hence the term 'black-box'.
An important principle behind most filter-based feature selection techniques can be summarized by the following statement: A good predictor set should contain features highly correlated with the target class distinction, and yet uncorrelated with each other [8]. The predictor set attribute referred to in the first part of this statement, 'relevance', is the backbone of simple rank-based feature selection techniques. The aspect alluded to in the second part, 'redundancy', refers to pairwise relationships between all pairs of genes in the predictor set.
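As a minimal illustration of this principle, the sketch below scores the two predictor set attributes using absolute Pearson correlation as a stand-in correlation measure. The function names, and the use of integer-coded class labels for relevance, are illustrative assumptions only; they are not the paper's actual V S and R S definitions, which are given in the Methods section.

```python
import numpy as np

def relevance(X, y):
    """Per-gene relevance: |Pearson correlation| of each gene's expression
    with the (integer-coded) class labels. X: (samples, genes) matrix."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / den)

def redundancy(X, subset):
    """Mean |pairwise correlation| over all gene pairs in `subset`."""
    if len(subset) < 2:
        return 0.0
    C = np.abs(np.corrcoef(X[:, subset], rowvar=False))
    n = len(subset)
    # average over the off-diagonal (pairwise) entries only
    return (C.sum() - n) / (n * (n - 1))
```

A good predictor set under this principle has high per-gene relevance but low redundancy among its members.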
Previous studies [4,8] have based their feature selection techniques on the concept of relevance and redundancy having equal importance in the formation of a good predictor set. We call the predictor set scoring methods used in such correlation-based feature selection techniques equal-priorities scoring methods. On the other hand, Guyon and Elisseeff (2003) [9] demonstrated using a 2-class problem that seemingly redundant features may improve the discriminant power of the predictor set instead, although it remains to be seen how this scales up to multiclass domains with thousands of features. A study was implemented on the effect of varying the importance of redundancy in predictor set evaluation in [10]. However, due to its use of a relevance score that is inapplicable to multiclass problems, the study was limited to only binary classification.
From here, we can state the three levels of filter-based feature selection for multiclass tumor classification as follows: 1) no selection, 2) select based on relevance alone, and finally, 3) select based on relevance and redundancy. Thus, currently, relevance and redundancy are the two existing criteria which have ever been used in predictor set scoring methods for multiclass tumor classification.

Contributions of this study
We propose to go one step further, by introducing a third criterion: the relative importance placed between relevance and redundancy. We call this criterion the degree of differential prioritization (DDP). DDP compels the search method to prioritize the optimization of one of the two criteria (of relevance or redundancy) at the cost of the optimization of the other. Unlike other existing correlation-based techniques, our proposed feature selection techniques do not take for granted that the optimizations of both elements of relevance and redundancy are to have equal priorities in the search for the predictor set [11].
Having introduced the element of differential prioritization, we stress the importance of applying a more appropriate evaluation procedure which gives more realistic estimates of accuracy than the internal cross validation (ICV) procedure used in several feature selection studies for gene expression data [3][4][5]. This is done by evaluating our feature selection techniques using the F-splits evaluation procedure.
In this paper, we investigate the efficacy of two DDP-based predictor set scoring methods on nine multiclass microarray datasets. The two methods are differentiated by the measure of correlations between genes used in each. The first method is termed the antiredundancy-based W A,S scoring method; the measure of antiredundancy, U S , is used as the measure of correlations between genes in the W A,S scoring method. In the second method, called the redundancy-based W R,S scoring method, the measure of redundancy, R S , is used as the measure of correlations between genes. The DDP parameters for the W A,S and the W R,S scoring methods are denoted as α and ρ respectively. A larger DDP means more emphasis on optimizing relevance, V S , and less emphasis on minimizing correlations between genes in the predictor set. Conversely, a smaller DDP indicates more emphasis on minimizing correlations between members of the predictor set and less emphasis on optimizing its relevance.
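The role of the DDP can be sketched as follows, assuming (as an illustration only) that the score combines relevance V S and antiredundancy U S as a power-weighted product; the exact forms of the W A,S and W R,S scoring methods are given as equations 7 and 8 in the Methods section, and the forward search shown here is one plausible search method, not necessarily the paper's.

```python
import numpy as np

def w_as(V, U, alpha):
    """Illustrative DDP score: larger alpha emphasizes relevance V,
    smaller alpha emphasizes antiredundancy U. alpha = 0.5 gives an
    equal-priorities method; alpha = 1 reduces to rank-based selection."""
    return (V ** alpha) * (U ** (1.0 - alpha))

def greedy_ddp_search(V, corr, alpha, P):
    """Forward selection: seed with the most relevant gene, then add the
    gene maximizing w_as of the candidate set at each step. corr is a
    gene-gene |correlation| matrix; a set's antiredundancy is taken as
    the mean of (1 - |corr|) over its gene pairs."""
    chosen = [int(np.argmax(V))]
    while len(chosen) < P:
        best_g, best_s = None, -np.inf
        for g in range(len(V)):
            if g in chosen:
                continue
            cand = chosen + [g]
            pairs = [(a, b) for i, a in enumerate(cand) for b in cand[i + 1:]]
            U = np.mean([1.0 - corr[a][b] for a, b in pairs])
            s = w_as(np.mean(V[cand]), U, alpha)
            if s > best_s:
                best_g, best_s = g, s
        chosen.append(best_g)
    return chosen
```

Note how the same relevance scores and correlations yield different predictor sets at different DDP values: at alpha = 1 the second-most-relevant gene is chosen even if it is nearly a duplicate of the first, while at smaller alpha a less relevant but uncorrelated gene is preferred.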
The main contribution of this study is to show that a degree of freedom in adjusting the priorities between maximizing relevance and minimizing redundancy is necessary to produce the best classification performance (i.e. equal-priorities techniques might not yield the optimal predictor set). A secondary contribution is to determine which one of the two measures investigated in this study is the better measure of correlations between genes in the predictor set ('predictor genes').

Results
Nine multiclass microarray datasets are used as benchmark datasets ( Table 1). The Brown (BRN) dataset, first analyzed by Munagala et al. (2004) [12,13], includes 15 broad cancer types. We also analyzed another version of this dataset, denoted as BRN14, where one class (skin tissue) is excluded due to its small sample size.
The GCM dataset [2,14] contains 14 tumor classes. For the NCI60 dataset [15,16], only 8 tumor classes are analyzed; the 2 samples of the prostate class are excluded due to small class size.
The PDL dataset [17,18] consists of 6 classes, each class representing a diagnostic group of childhood leukemia. The SRBC dataset [19,20] consists of 4 subtypes of small, round, blue cell tumors (SRBCTs). In the 5-class lung dataset [21,22], 4 classes are subtypes of lung cancer; the fifth class consists of normal samples.
Except for the BRN, BRN14 and SRBC datasets (which are only available as preprocessed in their originating studies), datasets are preprocessed and normalized based on the procedures recommended in [27] for Affymetrix and cDNA microarray data. For the GCM dataset, the original ratio of training set size to test set size used in [2] is maintained to enable comparison with previous studies; for all other datasets we employ the standard 2:1 split ratio.
Different values of α and ρ, ranging from 0.1 up to 1 in equal intervals of 0.1, are tested. Predictor sets ranging from size P = 2 to P = P max are formed in each split. The number of splits, F, is set to 10 in this study. The choice of the value of P max is based on previous studies on feature selection in tumor classification such as [1], where it is observed that there is no significant change in accuracy at values of P beyond 150. Therefore, for datasets with a larger number of classes (K > 6), we set P max to 150, while for datasets with 6 or fewer classes, the value of P max is decreased in proportion to K ( Table 1). The rationale is that datasets with a smaller number of classes need fewer predictor genes to differentiate samples from different classes: it is easier to distinguish among, say, 3 classes than to tell apart samples from 15 different classes (hence a smaller predictor set suffices for the former, whereas a larger number of predictor genes is necessary for the latter).
Two feature selection experiments were run on each dataset: one using the W A,S scoring method and the other, the W R,S scoring method. The DAGSVM classifier is used in evaluating the performance of all resulting predictor sets from both experiments. The DAGSVM is an all-pairs SVMbased multi-classifier which uses substantially less training time compared to either the standard algorithm or Max Wins, and has been shown to produce accuracy comparable to both of these algorithms [28].
Two parameters will be used to evaluate the performance of the W A,S and the W R,S scoring methods. The first is the best estimate of accuracy. This is simply taken as the largest among the accuracies obtained from Figure 1 at all tested values of α or ρ at P = P max . By taking the accuracy at a fixed value of P (at P max ), we further exclude the possibility of leaking information from the test set into the training process of forming the predictor set in each split. If a draw occurs in terms of the estimate of accuracy, we take the average of the values of α or ρ giving the largest accuracies as the optimal α or ρ.
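The selection rule just described, including the tie-breaking step, can be sketched directly (the function name is illustrative):

```python
def best_estimate(acc_by_ddp):
    """acc_by_ddp maps each tested DDP value (alpha or rho) to the
    estimated accuracy at P = P_max. Returns (best accuracy, optimal DDP);
    on a draw, the tied DDP values are averaged, as described above."""
    best = max(acc_by_ddp.values())
    tied = sorted(d for d, a in acc_by_ddp.items() if a == best)
    return best, sum(tied) / len(tied)
```

For example, if α = 0.4 and α = 0.6 tie for the best accuracy, the optimal α is reported as 0.5.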
For multiclass classification problems, merely attaining a good estimate of overall accuracy does not represent excellent classification performance. There is also the need to ensure that samples from all classes are predicted at an equally good rate of accuracy. This is especially true when class sizes are greatly unequal among the classes, which is often the case for multiclass microarray datasets. A predictor set may achieve a high estimate of overall accuracy simply by frequently predicting test samples as belonging to one of the classes with a large class size. The end result is that samples belonging to certain classes are correctly predicted most of the time, while samples from other classes are wrongly classified at a high rate.
This calls for the second parameter, the range of class accuracies, in evaluating the performance of the predictor set scoring methods. For each class, class accuracy denotes the ratio of correctly classified samples of that class to the class size in the test set. Each class accuracy is computed from the DDP value which produces the best estimate of accuracy at P = P max in the first place. The range of class accuracies is the difference between the best class accuracy and the worst class accuracy among the K class accuracies in a K-class dataset. In an ideal situation, overall accuracy being exactly 1, each class accuracy is 1, so the perfect range of class accuracies is 0. Hence, the lower the range of class accuracies, the better the classification performance.
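The range of class accuracies, as defined above, can be computed from predicted and true test labels as follows (a straightforward sketch; the function name is illustrative):

```python
import numpy as np

def class_accuracy_range(y_true, y_pred, n_classes):
    """Per-class accuracy = correctly classified test samples of a class
    divided by that class's size in the test set. The range is the best
    minus the worst class accuracy; 0 is perfect, lower is better."""
    accs = [np.mean(y_pred[y_true == k] == k) for k in range(n_classes)]
    return max(accs) - min(accs)
```

A classifier that sacrifices a small class to inflate overall accuracy is penalized by a large range, even when the overall accuracy looks good.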

Best estimate of accuracy
Overall, the W A,S scoring method outperforms the W R,S scoring method by giving better accuracy in six out of nine datasets ( Table 2). Only in the remaining three datasets (GCM, NCI60 and SRBC) does the W R,S scoring method give the same accuracy as the W A,S scoring method. Figure 2 shows how the estimate of accuracy at P = P max varies against the corresponding value of α and ρ for the W A,S and the W R,S scoring methods respectively.

Range of class accuracies
The W A,S scoring method gives better performance than the W R,S scoring method by yielding smaller range of class accuracies for five datasets: GCM, NCI60, PDL, MLL and AML/ALL datasets ( Table 3). The W R,S scoring method turns out lower range of class accuracies for only two datasets: the lung and BRN14 datasets. For the remaining two datasets (BRN and SRBC), both methods yield the same performance. Figure 3 shows how the range of class accuracies at P = P max varies against corresponding value of α and ρ for the W A,S and the W R,S scoring methods respectively.
Comparing the W A,S and the W R,S scoring methods
By taking the rightmost columns of Tables 2 and 3, we assign the overall superior method for each of the nine datasets in Table 4. At P = P max , the W A,S scoring method is superior to the W R,S scoring method for six out of nine datasets (15-class BRN, 14-class GCM, 8-class NCI60, 6-class PDL, 3-class MLL and 3-class AML/ALL). Four of these six datasets contain a large number of classes (more than 5 classes). The overall superior method is undecided for two datasets (BRN14 and lung), while for the SRBC dataset, both methods produce equal performance in terms of both best estimate of accuracy and range of class accuracies.
Both methods were briefly compared in a previous work [29] using only one dataset (the GCM dataset). Here we apply the binomial test recommended in [30] for comparing classifiers, in order to compare both methods at all values of P ranging from 2 to P max . For each predictor set size P = 2,3,...,P max , we identify the DDP (α or ρ) value which gives the best accuracy (averaged across the F splits) for each scoring method. In each split, a classifier is constructed using the P-gene predictor set obtained at this optimal DDP value from each scoring method. We then compare the two resulting classifiers using the test sets of all F splits.
Out of these P max -1 comparisons, for each dataset we record the number of times, A, we reject the null hypothesis that both scoring methods are equal, in favor of the hypothesis that the W A,S scoring method is better than the W R,S scoring method at the 0.05 significance level ( Table 5). The outcome of the comparisons does not seem impressive until we take into account the fact that the number of times, B, we reject the null hypothesis in favor of the hypothesis that the W A,S scoring method is worse than the W R,S scoring method is 0 for all datasets. Moreover, we observe a strong correlation between the training set size ( Table 1) and A: the larger the training set, the higher the frequency at which the null hypothesis can be rejected in favor of the W A,S scoring method as the superior method ( Figure 4). Therefore, we believe that with a sufficiently large training set, the W A,S scoring method could be shown even more conclusively to be the superior method.
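One common form of the binomial test for comparing two classifiers is the sign test over the test samples on which they disagree, sketched below; the exact formulation recommended in [30] may differ in detail, and the function name is illustrative.

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """One-sided binomial (sign) test: wins_a = test samples only method A
    classified correctly, wins_b = samples only method B classified
    correctly. Under the null hypothesis that both methods are equal,
    each disagreement favors A with probability 0.5; the p-value is
    P(X >= wins_a) for X ~ Binomial(wins_a + wins_b, 0.5)."""
    n = wins_a + wins_b
    return sum(comb(n, k) for k in range(wins_a, n + 1)) / 2 ** n
```

The null hypothesis is rejected in favor of method A whenever this p-value falls below the 0.05 significance level.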
To reinforce the results from the binomial test, we further conduct the Wilcoxon signed rank test [31] on accuracies from both methods obtained at the best DDP for each predictor set size P = 2,3,...,P max . The rightmost column of Table 5 contains the number of times, C, the right-sided p-value is below the significance level of 0.05. The right-sided p-value in this case is the probability that T 0 is greater than or equal to the Wilcoxon signed rank test statistic, T, which is computed as

T = Σ_P sign(θ P -η P ) × Rank(P), with the sum taken over P = 2, 3, ..., P max ,

where θ P is the number of test samples the W A,S scoring method predicts correctly at predictor set size P over all F splits, η P is the number of test samples the W R,S scoring method predicts correctly at predictor set size P over all F splits, and Rank(P) is the rank of |θ P -η P |. T 0 is computed in the same way as T, except that the expression sign(θ P -η P ) is replaced with an independent random sign. This means that the right-sided p-value represents the probability that the W A,S scoring method is superior to the W R,S scoring method only by random chance, given the available observations of the classification performance of the two scoring methods. In other words, a right-sided p-value near zero (less than the 0.05 significance level) indicates a high likelihood that the W A,S scoring method is indeed significantly superior to the W R,S scoring method.

The left-sided p-value represents the probability that T 0 is less than or equal to T. Supporting the results from the binomial test, the number of times, D, the left-sided p-value is below the significance level of 0.05 is also 0 for all nine datasets. Moreover, as in the case of A, C is also proportionate to the training set size, M t , of the corresponding datasets ( Figure 4). Indeed, as shown in Figure 4, the ratio of C to P max -1 has a stronger correlation to M t than the ratio of A to P max -1.
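The test statistic and its random-sign null distribution can be sketched as below, with the right-sided p-value taken in the conventional direction P(T 0 ≥ T), so that it is small when the W A,S scoring method is superior. The Monte Carlo simulation of T 0 is an assumption for illustration; standard implementations use exact tables or a normal approximation instead.

```python
import random

def wilcoxon_T(theta, eta):
    """T = sum over P of sign(theta_P - eta_P) * Rank(P), where Rank(P)
    is the rank of |theta_P - eta_P| (smallest difference gets rank 1)."""
    diffs = [t - e for t, e in zip(theta, eta)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    rank = {i: r + 1 for r, i in enumerate(order)}
    sign = lambda d: (d > 0) - (d < 0)
    return sum(sign(d) * rank[i] for i, d in enumerate(diffs))

def right_sided_p(theta, eta, n_null=20000, seed=0):
    """Monte Carlo estimate of P(T0 >= T), where T0 replaces
    sign(theta_P - eta_P) with an independent random sign. Since the
    ranks do not depend on the signs, T0 is simulated by attaching
    random signs to the ranks 1..n."""
    rng = random.Random(seed)
    T = wilcoxon_T(theta, eta)
    n = len(theta)
    hits = 0
    for _ in range(n_null):
        T0 = sum(r if rng.random() < 0.5 else -r for r in range(1, n + 1))
        if T0 >= T:
            hits += 1
    return hits / n_null
```

When one method predicts more test samples correctly at every predictor set size, T reaches its maximum and the p-value collapses toward 1/2^n.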

Comparisons to other studies
Detailed comparisons to previously reported results will only be made for the four datasets with the largest K. Two of them, the GCM and NCI60 datasets, have been extensively analyzed in previous studies and have been known to consistently produce low realistic estimates of accuracy (<90%) [1,27,32]. Since the W A,S scoring method has been shown to outperform the W R,S scoring method, we shall compare results from the W A,S scoring method against results from other studies.
For the GCM dataset, an accuracy of 80.2% is achievable with our W A,S scoring method using a 150-gene predictor set when the value of α is set to 0.4. This is a significant improvement over the 78% accuracy obtained, using all 16000 available genes, in the original analysis of the same dataset [2]. However, strict comparison cannot be made against this 78% accuracy of [2] or the 81.5% accuracy (using 84 genes) achieved in [32], since the evaluation procedure in both studies is based on a single (the original) split of the dataset. We can make a more appropriate comparison, however, against a comprehensive study on various rank-based feature selection techniques [1], which uses external 4-fold cross validation to evaluate classification performance. In [1], the best accuracy for the GCM dataset is 63.3%, obtained when no feature selection is applied prior to classification!

For the NCI60 dataset, using the W A,S scoring method, the best 10-splits accuracy of 68% occurs at α = 0.1. This is only marginally better than the best accuracy obtained from the two studies employing evaluation procedures similar to ours [1,27]. In [27], the best averaged accuracy is around 67% (using 200 genes selected by the BSS/WSS rank-based feature selection technique), whereas the study by Li et al. (2004) [1] gives similar performance, with an accuracy of 66.7% achieved using the sum minority rank-based feature selection technique with the same number of genes as our predictor set, 150. A more recent study on a wrapper-based feature selection technique [7] reported a LOOCV (leave-one-out cross validation) accuracy of 76.2% for the NCI60 dataset. Unfortunately, no separate test set was formed to validate the efficacy of the 30-gene predictor set which achieved this accuracy.
The classification of samples from the NCI60 dataset is well-known for being a difficult problem. Major culprits include the small class sizes and the heterogeneity of some of the classes (breast and non-small-cell lung cancer) [27].
Using the current interval size between the DDP values (0.1), our improvement of accuracy for this dataset is small. However, we hope that further refinement to our feature selection technique (through the development of a method to predict the suitable range of the DDP for the dataset of interest, which will enable us to test for both smaller range and interval size for the DDP) will bring about a significantly better accuracy.
The discriminative margin clustering method used on the BRN dataset in [12] is geared towards discovering subclasses in broad histological types, yet it manages to yield good class accuracy (70 to 90%) for four tumor classes (kidney, lung, ovary and soft tissue). Not surprisingly, the class accuracy produced by our W A,S scoring method for each of these classes ranges from 90 to 100%. For this dataset, we obtain a 94.3% accuracy using a 150-gene predictor set found at α = 0.6. This is better than the results reported in [6], where an 81.23% LOOCV accuracy on the BRN dataset is achieved using a wrapper-based GA/SVM feature selection technique. However, if the LOOCV accuracy itself is used as the GA fitness function, as is the case in [6], an external test set should have been set aside to evaluate the performance of the technique; it is the accuracy on this test set that provides a realistic estimate of accuracy for the feature selection technique (more details in The F-splits evaluation procedure sub-section under the Methods section). Again, as in the aforementioned study on the NCI60 dataset [7], no such evaluation procedure was implemented in [6].
In [33] the authors eliminated the skin tissue samples from the 15-class BRN dataset (thus forming the 14-class BRN14 dataset), possibly, as in our case, due to small class size (3 samples). In that study, the nearest shrunken centroid classifier yields a 10-splits accuracy of 93.5% using 4327 genes for the BRN14 dataset. This is slightly lower than the 94.7% accuracy from the W A,S scoring method (at α = 0.33). More importantly, we use a much smaller predictor set (150 genes) to achieve the better accuracy.

Therefore, compared to previous studies which used realistic evaluation procedures similar to the F-splits evaluation procedure, rather than the potentially overly optimistic ICV procedure, our W A,S scoring method has produced better classification performance on highly multiclass datasets such as the BRN, BRN14, GCM and NCI60 datasets.

The importance of the DDP
In both scoring methods, it is worth noting from Figures 2 and 3, and Table 2, that the best classification performance (i.e., highest accuracy and lowest range of class accuracies) is not always achieved at values of the DDP where the technique becomes an equal-priorities scoring method (α or ρ equals 0.5) or rank-based (α or ρ equals 1). Therefore, without varying the DDP so that it takes values other than 0.5 or 1, the optimal classification performance would not have been achievable for most of the datasets. For datasets where the optimal value of the DDP happens to be exactly 0.5 (for instance, the PDL dataset, where the value of α giving the best estimate of accuracy is 0.5), some characteristic(s) of the dataset dictate that the optimal value of the DDP should be 0.5.
In [11], we have hypothesized that for a given scoring method, the value of the DDP leading to the best estimate of accuracy is dataset-specific. Successfully predicting such optimal value of the DDP for a dataset gives us savings in terms of computational cost and time (we will not have to run feature selection for the full domain of the DDP from 0 to 1). Linking the optimal value of the DDP to dataset characteristic(s) is the first step towards successful prediction of the optimal value of the DDP for any future untested datasets.
Since our feature selection technique does not explicitly predict the best P from the range of [2, P max ], in order to determine the value of the DDP most likely to produce the optimal accuracy, we use a parameter called size-averaged accuracy, which is computed as follows. For all predictor sets found using a particular value of the DDP, we plot the estimate of accuracy obtained from the procedure outlined in Figure 1 against the value of P of the corresponding predictor set ( Figure 5). The size-averaged accuracy for that value of the DDP is the area under the curve in Figure 5 divided by the number of predictor sets, (P max -1). The value of α or ρ associated with the highest size-averaged accuracy is deemed the empirical estimate of α* or ρ* (the empirical optimal value of the DDP). If there is a tie in terms of the highest size-averaged accuracy between different values of α or ρ, the empirical estimate of α* or ρ* is taken as the average of those values of α or ρ [34].

The overall trend in Figure 6 implies that as K increases, in order to achieve the optimal classification performance, the emphasis on maximizing antiredundancy (for the W A,S scoring method) or minimizing redundancy (for the W R,S scoring method) needs to be increased at the cost of the emphasis on maximizing relevance. Conversely, maximizing antiredundancy (or minimizing redundancy) becomes less important as K decreases, thereby supporting the assertion in [9] that redundancy does not hinder the discriminant power of the predictor set when K is 2. The α* -K plot follows this trend more closely than the ρ* -K plot.
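The size-averaged accuracy computation described above can be sketched as follows; trapezoidal integration is an assumption here, since the integration rule for the area under the curve is not specified in the text, and the function name is illustrative.

```python
def size_averaged_accuracy(acc_by_P):
    """acc_by_P[P] = estimated accuracy of the P-gene predictor set for
    P = 2..P_max at one DDP value. Size-averaged accuracy = area under
    the accuracy vs. P curve divided by the number of predictor sets,
    P_max - 1 (trapezoidal area is assumed)."""
    sizes = sorted(acc_by_P)
    # trapezoidal area under the accuracy-vs-P curve (unit spacing in P)
    area = sum((acc_by_P[sizes[i]] + acc_by_P[sizes[i + 1]]) / 2.0
               for i in range(len(sizes) - 1))
    return area / len(sizes)  # len(sizes) = P_max - 1 for P = 2..P_max
```

The DDP value whose accuracy-vs-P curve encloses the most area then becomes the empirical estimate of α* or ρ*.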
Since the measure of antiredundancy, U S , and the measure of redundancy, R S , play increasingly important roles compared to relevance, V S , as K increases, the better performance of the W A,S scoring method compared to the W R,S scoring method for the majority of datasets with larger K (> 5) must be due to the superiority of the measure of antiredundancy, U S , over the measure of redundancy, R S , in measuring correlations between predictor genes.
The statement above can be substantiated by comparing the corresponding value of α* to ρ* for each of these datasets in Figure 6. The value of α* is always less than the value of ρ* for all datasets. The role of α or ρ is such that the smaller the value of α or ρ, the more the emphasis on maximizing U S or minimizing R S , respectively (equations 7 and 8). This implies that U S is more useful than R S as a criterion in finding the optimal predictor set for datasets with large K. Moreover, we observe from Figure 2 that the estimate of accuracy from the W R,S scoring method at small ρ is much lower than accuracy from the W A,S scoring method at small α, again underscoring the reliability of U S over R S in finding the optimal predictor set.

Most frequently selected genes
Since the W A,S scoring method has been shown to outperform the W R,S scoring method, we perform the analysis of the most frequently selected genes using the optimal predictor sets found from the W A,S scoring method. The optimal predictor sets consist of the P max -gene predictor set obtained in each split of training and test sets using the value of α which gives the best estimate of accuracy ( Table 2). The most frequently selected genes are identified by surveying the optimal predictor sets for genes that are selected 10 times out of 10 splits of training and test sets. We then rank these genes based on their split-averaged position in the predictor set; a higher rank is assigned to genes with consistently high positions in the predictor set in each split. The biological significance of the top 25 genes is then examined.

For the BRN and BRN14 datasets, we first identify which of the 25 genes have been found to be markers for specific tumor types against normal tissues in the originating study by Munagala et al. [12]. We discover that out of the 25 genes, 10 and 8 genes are included in lists of genes which differentiate specific tumor types from normal tissues in the case of the BRN (Table 6) and BRN14 (Table 7) datasets respectively. These lists are available at the website [13] of the authors of [12].
The second group comprises genes which are known to either promote or inhibit tumor in general (for example, genes #9 and #19 in Table 6; and genes #12 and #15 in Table 7). Our results suggest that these genes are expressed variably among different tumor types as well as between tumor tissues and normal tissues.
The third group contains genes which are tissue-specific (highly expressed in certain tissue relative to other parts of the body). Examples are genes #22, #24 and #25 in Table  6; and genes #22, #23 and #25 in Table 7. This is expected, as the classification problem involves distinguishing among different broad tumor types, each of which originates from a distinct tissue type.
The fourth group is made of unknown sequences and genes with either still-unidentified function (genes #3 and #23 in Table 6; genes #4 and #13 in Table 7) or general housekeeping roles such as production of normal proteins and gene regulation (genes #14 and #18 in Table 6; genes #7 and #21 in Table 7). In other words, these are genes that ostensibly play no role in influencing the development of tumor in general or specific tumor types. However, the identification of these genes as predictor genes for multiclass tumor classification points to the possible cascade effect of these genes in development of specific tumor types, especially in case of gene #18 in Table 6 (also gene #21 in Table 7), which is involved in regulating the expression of other genes.
For the GCM dataset, we likewise first compare our top 25 genes to the marker genes identified in the originating study by Ramaswamy et al. [2]. The authors of [2] have provided a list of OVA (one-vs.-all) marker genes for each tumor type at the paper website [49]. Each list contains the top 1000 genes which distinguish a specific tumor type from all other 13 tumor types in the GCM dataset. These genes are ranked based on their significance, which is computed using the permutation test elucidated in [25]. Eighteen of our top 25 genes appear in these lists (Table 8).
Of the remaining 7 genes, only 1 gene is not listed in any of the lists of the top 1000 OVA marker genes of [2]. This is gene #14 in Table 8, which belongs to the second of the four groups of genes defined previously. The other 6 genes belong to the first of the four groups of genes (markers or repressors of specific types of tumor) [50-56].

Figure 5. Area under the accuracy-predictor set size curve.

Conclusion
For the majority of the benchmark datasets, using the optimal value of the degree of differential prioritization gives an accuracy higher than the accuracies obtainable using the equal-priorities scoring method (α or ρ fixed at 0.5) or the rank-based technique (α or ρ fixed at 1). Therefore, instead of limiting ourselves to a fixed universal set of priorities for relevance and antiredundancy/redundancy (α or ρ fixed at 0.5 or 1) for all datasets, a suitable range of α or ρ should be chosen based on the characteristics of the dataset of interest in order to achieve the optimal accuracy. Furthermore, the study demonstrates the advantages of using the measure of antiredundancy over the measure of redundancy for measuring gene correlations, especially for datasets with a large number of classes. Based on the criteria of best estimate of accuracy and range of class accuracies, the antiredundancy-based predictor set scoring method performs better than the redundancy-based predictor set scoring method for the majority of the benchmark datasets. It is also the superior method of the two in four of the datasets with the largest numbers of classes: the BRN, PDL, GCM and NCI60 datasets, the last two of which remain the most difficult datasets to work with in the area of tissue classification.
Finally, a large portion of the genes most frequently selected into the optimal predictor sets has been identified by the originating studies on the corresponding datasets as marker genes of specific tumor types. The majority of the most frequently selected genes have also been found by other studies to be involved in either the development or the suppression of specific tumor types. These findings confirm the practical value of our feature selection technique for the analysis of gene expression data.

Terminology and objective
For gene expression datasets, the terms gene and feature may be used interchangeably. The training set upon which feature selection is to be implemented, T, consists of N genes and M_t training samples. Sample j is represented by a vector x_j containing the expression values of the N genes, x_j = [x_{1,j}, ..., x_{N,j}]^T, and a scalar y_j representing the class the sample belongs to. The target class vector y is defined as [y_1, ..., y_{M_t}], with y_j ∈ [1, K] in a K-class dataset. From the total of N genes, the objective is to form the subset of genes, called the predictor set S, which gives the optimal classification accuracy.
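The notation above maps directly onto array shapes; a minimal sketch with synthetic values (the variable names follow the section's notation, the numbers are illustrative only):

```python
import numpy as np

# Illustrative shapes only; names follow the section's notation (values are synthetic).
rng = np.random.default_rng(0)

N, M_t, K = 1000, 40, 4            # genes, training samples, classes
T = rng.normal(size=(N, M_t))      # training set: row i = gene i, column j = sample j
y = rng.integers(1, K + 1, M_t)    # target class vector, y_j in [1, K]

x_j = T[:, 0]                      # sample j as an N-vector [x_{1,j}, ..., x_{N,j}]^T
```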

Two predictor set scoring methods
A score of goodness incorporating both the elements of maximum relevance and minimum redundancy ensures that the predictor set should possess maximal power in discriminating among different classes (maximum relevance), while at the same time containing features with minimal correlation to each other (minimum redundancy).
The relevance of a predictor set S, denoted V_S, is the average of the score of relevance, F(i), over all members of S, as recommended in [4]:

$V_S = \frac{1}{|S|} \sum_{i \in S} F(i)$

Figure 6. Relationship between the optimal value of the DDP and the number of classes in the dataset: optimal value of the DDP plotted against the number of classes in the dataset for the W_A,S scoring method (left) and the W_R,S scoring method (right) for all benchmark datasets.

For the score of relevance F(i), the BSS/WSS ratio, first used in [27] for multiclass tumor classification, is a modification of the F-ratio statistic for
one-way ANOVA (Analysis of Variance). It indicates a gene's ability to discriminate among samples belonging to the K different classes.
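The BSS/WSS ratio for gene i is the ratio of its between-class to within-class sums of squares; a minimal numpy sketch (the function name and array layout are ours, not the paper's):

```python
import numpy as np

def bss_wss(expr, y):
    """BSS/WSS ratio per gene.

    expr: (N, M) array, rows = genes, columns = samples.
    y:    (M,) integer class labels.
    Returns an (N,) array of relevance scores F(i) (higher = more relevant).
    """
    overall = expr.mean(axis=1, keepdims=True)     # per-gene grand mean
    bss = np.zeros(expr.shape[0])
    wss = np.zeros(expr.shape[0])
    for k in np.unique(y):
        cols = expr[:, y == k]                     # samples of class k
        mu_k = cols.mean(axis=1, keepdims=True)    # per-gene class mean
        bss += cols.shape[1] * ((mu_k - overall) ** 2).ravel()
        wss += ((cols - mu_k) ** 2).sum(axis=1)
    return bss / wss
```

A gene whose class means differ widely relative to its within-class scatter receives a large score, matching the one-way ANOVA interpretation above.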
Table 6. Top 25 genes ranked from the most frequently selected genes for the BRN dataset (fragment):
…
LGALS4 (lectin, galactoside-binding, soluble, 4; galectin 4): verified as marker for pancreatic cancer in [12]; also identified as marker for the stomach, liver, kidney, and breast cancer classes in [12]; Group M
22. HELO1 (homolog of yeast long chain polyunsaturated fatty acid elongation enzyme 2): highly expressed in adrenal gland and testis (tissue-specific); Group …
…
Group M: identified as a marker or repressor of a specific tumor type in the originating study. Group 1: identified as a marker or repressor of a specific tumor type in other studies. Group 2: known to either promote or inhibit tumor in general. Group 3: tissue-specific genes. Group 4: unknown sequences and genes with either still-unidentified function or general housekeeping roles.

To measure the correlation between genes i and j, the absolute value of the Pearson product-moment correlation coefficient between the two, |R(i,j)|, is used. The absolute value is used because both extreme correlation and extreme anti-correlation indicate strong similarity between a pair of genes. There are two possible schemes for measuring the correlations among the members of S, each leading to a different predictor set scoring method. The first scheme
uses redundancy directly. Redundancy, introduced in [4] for use with gene expression datasets, is defined from the sum of all possible pairwise correlations in S:

$R_S = \frac{1}{2|S|^2} \sum_{i,j \in S,\ i \neq j} |R(i,j)| \quad (4)$
Theoretically the largest possible value of R_S is

$R_{S,max} = \frac{|S|-1}{2|S|} \quad (5)$

which occurs when every pair of genes in S has perfect correlation or anti-correlation (|R(i,j)| equals 1). Conversely, the smallest possible value of R_S is zero, when every pair of genes in S has zero correlation (|R(i,j)| equals 0). As |S| approaches infinity, the limit of R_{S,max} is 0.5. Hence R_S has theoretical bounds of [0, 0.5).
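Under the pairwise-sum normalization implied by the stated bounds (so that R_S,max = (|S|-1)/(2|S|)), R_S can be sketched as follows; the function name is ours:

```python
import numpy as np

def redundancy(expr_S):
    """R_S: sum of pairwise |Pearson correlation| over unordered gene pairs,
    divided by |S|^2, so that R_S lies in [0, (|S|-1)/(2|S|)).

    expr_S: (|S|, M) array of the predictor set's expression profiles.
    (The normalization is an assumption consistent with the stated bounds.)
    """
    S = expr_S.shape[0]
    C = np.abs(np.corrcoef(expr_S))            # |R(i, j)| matrix, rows = genes
    pair_sum = (C.sum() - np.trace(C)) / 2.0   # sum over unordered pairs i < j
    return pair_sum / S**2
```

For three mutually perfectly (anti-)correlated genes this returns (3-1)/(2*3) = 1/3, the maximum for |S| = 3.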
The second scheme uses a measure called antiredundancy, which we proposed in [57]. Antiredundancy, U_S, quantifies the lack of redundancy in S:

$U_S = \frac{1}{2|S|^2} \sum_{i,j \in S,\ i \neq j} \left(1 - |R(i,j)|\right) \quad (6)$
With R_S, we have a redundancy-based scoring method in which the measure of goodness for predictor set S is given as follows:

$W_{R,S} = \frac{(V_S)^{\rho}}{(R_S)^{1-\rho}} \quad (7)$

where the power factor ρ ∈ (0, 1] denotes the DDP between maximizing relevance and minimizing redundancy.

Table 7. Top 25 genes ranked from the most frequently selected genes for the BRN14 dataset:

1. FLJ20624 (hypothetical protein FLJ20624): Gene #2 in the BRN dataset (Table 6); Group 1
2. M6PR (mannose-6-phosphate receptor, cation dependent): Gene #6 in the BRN dataset (Table 6); Group 1
3. PAX8 (paired box gene 8): Gene #13 in the BRN dataset (Table 6); Group M
4. DMXL1 (Dmx-like 1): Gene #3 in the BRN dataset (Table 6); Group 4
5. PLG (plasminogen): Gene #7 in the BRN dataset (Table 6); Group M
6. LGALS4 (lectin, galactoside-binding, soluble, 4; galectin 4): Gene #21 in the BRN dataset (Table 6); Group M
7. APCS (amyloid P component, serum): Gene #14 in the BRN dataset (Table 6); Group 4
8. GATA3 (GATA-binding protein 3): verified as marker for breast cancer in [12]; also identified as marker for the bladder cancer class in [12]; Group M
9. TACSTD1 (tumor-associated calcium signal transducer 1): Gene #5 in the BRN dataset (Table 6); Group M
10. FGA (fibrinogen, A alpha polypeptide): Gene #10 in the BRN dataset (…)
…
GRHPR (glyoxylate reductase/hydroxypyruvate reductase): one of the partners of the BCL6 (see Gene #11 in Table 6) translocation in follicular lymphoma, which leads to higher risk of transformation into aggressive lymphoma [48]; Group 1
25. HELO1 (homolog of yeast long chain polyunsaturated fatty acid elongation enzyme 2): Gene #22 in the BRN dataset (Table 6); Group 3

Group M: identified as a marker or repressor of a specific tumor type in the originating study. Group 1: identified as a marker or repressor of a specific tumor type in other studies. Group 2: known to either promote or inhibit tumor in general. Group 3: tissue-specific genes. Group 4: unknown sequences and genes with either still-unidentified function or general housekeeping roles.
With U_S, we have an antiredundancy-based scoring method in which the score of goodness for predictor set S is given as follows:

$W_{A,S} = (V_S)^{\alpha} \times (U_S)^{1-\alpha} \quad (8)$
where the power factor α ∈ (0, 1] denotes the DDP between maximizing relevance and maximizing antiredundancy [11]. Codes for both scoring methods are available as MATLAB M-files [see Additional file 1].
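The two scores can be sketched compactly. Since the displayed equations (7) and (8) are not reproduced in this extract, the forms below are inferred from the surrounding discussion (W_A,S as the product of (V_S)^α and (U_S)^{1-α}; W_R,S with (R_S)^{1-ρ} in the denominator); the eps guard is our addition:

```python
def w_a(V_S, U_S, alpha):
    """Antiredundancy-based score of goodness: (V_S)^alpha * (U_S)^(1 - alpha)."""
    return V_S**alpha * U_S**(1.0 - alpha)

def w_r(V_S, R_S, rho, eps=1e-12):
    """Redundancy-based score of goodness: (V_S)^rho / (R_S)^(1 - rho).

    The eps floor guards against the R_S -> 0 instability discussed later.
    """
    return V_S**rho / max(R_S, eps)**(1.0 - rho)
```

Raising alpha (or rho) toward 1 shifts the score's emphasis toward relevance; lowering it emphasizes antiredundancy (or penalizes redundancy) more heavily.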

Significance of the DDP
In the previous section it was stated that a predictor set is to be found based on two criteria: maximum relevance and either minimum redundancy or maximum antiredundancy. However, the quantification of the priority to be assigned to each of these two criteria remains an unexplored area.
In the antiredundancy-based scoring method, decreasing the value of α forces the search method to put more priority on maximizing antiredundancy at the cost of maximizing relevance. Raising the value of α increases the emphasis on maximizing relevance (and at the same time decreases the emphasis on maximizing antiredundancy) during the search for the predictor set.
A predictor set found using a larger value of α has more features with strong relevance to the target class vector, but also more redundancy among these features. Conversely, a predictor set obtained using a smaller value of α contains less redundancy among its member features, but at the same time also has fewer features with strong relevance to the target class vector. At α = 0.5, we get an equal-priorities scoring method.
There is also the trivial case of α = 0, where only antiredundancy is considered in forming the predictor set.
The role and significance of ρ in the redundancy-based scoring method are similar to those of α in the antiredundancy-based scoring method, which has been elucidated in the paragraphs above. At ρ = 0.5, the redundancy-based scoring method is effectively the same as one of the scoring methods presented in [4].
The case of ρ = 1 is similar to α = 1.
In the trivial case of ρ = 0, only redundancy is considered during feature selection.
We posit that different datasets will require different values of the DDP between maximizing relevance and minimizing redundancy/maximizing antiredundancy in order to come up with the most efficacious predictor set. Therefore the optimal range of α or ρ (optimal in the sense of leading to the predictor set giving the best estimate of accuracy) is dataset-specific.

Search method
For the predictor set search, the linear incremental search method [4,5] is used, where the first member of S is chosen by selecting the gene with the highest F(i) score. To find the second and subsequent members of the predictor set, the remaining genes are screened one by one for the gene that would give the maximum W_A,S or W_R,S. The procedure terminates when the predictor set size P reaches P_max, the size of the largest predictor set we wish to look for. This search method has a computational complexity of O(N·P_max), which is much lower than that of exhaustive search, $O\left(\binom{N}{P_{max}}\right)$.
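The greedy procedure can be sketched in a few lines; this is an illustrative implementation under the antiredundancy-based score (the function name, the round-off guard, and the normalization of U_S are assumptions of this sketch):

```python
import numpy as np

def incremental_search(expr, F, P_max, alpha):
    """Linear incremental search sketch using the antiredundancy-based score.

    expr:  (N, M) expression matrix (rows = genes).
    F:     (N,) relevance scores, e.g. BSS/WSS ratios.
    Returns the indices of the P_max selected genes.
    """
    C = np.abs(np.corrcoef(expr))          # pairwise |R(i, j)|, computed once
    S = [int(np.argmax(F))]                # first member: gene with highest F(i)
    while len(S) < P_max:
        best_gene, best_score = None, -np.inf
        for g in range(expr.shape[0]):     # screen every remaining gene
            if g in S:
                continue
            cand = S + [g]
            n = len(cand)
            V = F[cand].mean()                        # relevance V_S
            sub = C[np.ix_(cand, cand)]
            pair_sum = (sub.sum() - n) / 2.0          # sum of |R(i, j)| over pairs
            U = (n - 1) / (2 * n) - pair_sum / n**2   # antiredundancy U_S = chi - R_S
            U = max(U, 0.0)                # guard against tiny negative round-off
            score = V**alpha * U**(1.0 - alpha)       # W_A,S
            if score > best_score:
                best_gene, best_score = g, score
        S.append(best_gene)
    return S
```

Each of the P_max rounds screens at most N genes, giving the O(N·P_max) cost of candidate evaluations noted above.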

Differences between the roles played by antiredundancy, U_S, and redundancy, R_S
Substituting equation (4) into equation (6), we express the relationship between antiredundancy, U_S, and redundancy, R_S, as follows:

$U_S = \chi - R_S \quad (15)$

where $\chi = \frac{|S|-1}{2|S|}$, which, as indicated in equation (5), is also the largest possible value of R_S. As |S| approaches infinity, the limit of χ is 0.5. Therefore, similar to R_S, the bounds of U_S are also [0, 0.5).
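Assuming the pairwise-sum definitions with the 1/(2|S|²) normalization implied by the stated bounds, the identity in equation (15) follows directly, since the sum over ordered pairs i ≠ j contains |S|(|S|-1) terms:

```latex
U_S = \frac{1}{2|S|^2}\sum_{\substack{i,j\in S \\ i\neq j}}\bigl(1-|R(i,j)|\bigr)
    = \underbrace{\frac{|S|\,(|S|-1)}{2|S|^2}}_{\chi}
      \;-\; \underbrace{\frac{1}{2|S|^2}\sum_{\substack{i,j\in S \\ i\neq j}}|R(i,j)|}_{R_S}
    = \chi - R_S
```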
The mathematical difference between the W_A,S scoring method and the W_R,S scoring method lies in the manner in which the sum of the absolute values of the Pearson product-moment correlation coefficients over all possible pairwise combinations in the predictor set S, $\sum_{i,j \in S,\ i \neq j} |R(i,j)|$, is minimized by the two scoring methods.
In the W_R,S scoring method we minimize $\sum_{i,j \in S,\ i \neq j} |R(i,j)|$ directly, by making it part of the denominator in equation (7). In the W_A,S scoring method the minimization is done indirectly: we maximize $\sum_{i,j \in S,\ i \neq j} \left(1 - |R(i,j)|\right)$ (note the minus sign) by making W_A,S the product of V_S and U_S in equation (8).
This difference might be the key to the reason the W_A,S scoring method works better than the W_R,S scoring method. Since the minimum limit of R_S is 0, it is possible in practical situations to encounter cases where R_S approaches zero. When this occurs, the influence of the denominator in equation (7) greatly overwhelms the contribution of the numerator. This is true even for a large part of the domain of ρ (i.e. at least from 0 to 0.6). Therefore, rather than the bounds of R_S itself, the concern is with the bounds of the expression $1/R_S$ in equation (7), which are $(2, \infty]$ for all values of ρ ∈ (0, 1]. The unbounded upper limit leads to instability when redundancy is minimal (near zero), causing the search process to over-emphasize R_S at the cost of V_S even at larger values of ρ. On the other hand, the expression $(U_S)^{1-\alpha}$ in equation (8) has the stable bounds of (0, 1]. The value of $(1/R_S)^{1-\rho}$ for a large part of the domain of ρ (0 to 0.6) is at least of the order of 10, which is two orders of magnitude greater than the typical values of $(U_S)^{1-\alpha}$ and $(V_S)^{\alpha}$ in the same situation (between 0 and 1). Due to this chasm in magnitude, the contribution of $(V_S)^{\rho}$ (regardless of its value) will be subjugated by that of $(1/R_S)^{1-\rho}$ for a large range of ρ. Conversely, due to the stable limits of $(U_S)^{1-\alpha} \in (0, 1]$, the contributions of $(U_S)^{1-\alpha}$ and $(V_S)^{\alpha}$ are almost counter-balanced by each other, with the exception of the extreme cases of α = 0 and α = 1 (Figure 7(c)).
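The magnitude argument can be checked with a toy calculation; the numbers below are illustrative assumptions (near-zero redundancy, moderate relevance and antiredundancy, equal priorities), not values from the paper:

```python
# Illustrative numbers only: near-zero redundancy, equal priorities (rho = alpha = 0.5).
R_S, U_S, V_S = 1e-3, 0.45, 0.8
rho = alpha = 0.5

inv_term = (1.0 / R_S) ** (1.0 - rho)   # term from equation (7): explodes as R_S -> 0
anti_term = U_S ** (1.0 - alpha)        # term from equation (8): stays within (0, 1]

assert inv_term > 10 * V_S ** rho       # the denominator term dwarfs (V_S)^rho ...
assert 0.0 < anti_term <= 1.0           # ... while (U_S)^(1-alpha) remains bounded
```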

The F-splits evaluation procedure
In several previous studies on feature selection for microarray datasets [3-5], feature selection techniques have been applied once on the full dataset (with no splitting) before a cross-validation procedure (be it LOOCV or n-fold cross-validation) is employed to evaluate the classification performance of the resulting predictor sets. We call this evaluation procedure the internal cross validation (ICV) procedure. ICV is known to produce selection bias, which leads to overly optimistic estimates of accuracy [58]. In [59], datasets are explicitly split into training and test sets, but information from the test set is incorporated into the feature selection process, which results in over-fitting and, again, overly optimistic estimates of accuracy. To avoid this pitfall, different splits of the dataset into training and test sets should be used, with feature selection repeated for each of the splits. During each split, our feature selection techniques are applied only on the training set of that particular split. No information from the test set should be 'leaked' into the process of forming the predictor set (which is precisely what happens during the ICV procedure). A classifier trained on the predictor set and the training samples is then used to predict the classes of the test samples of the current split. The test set accuracies obtained from each split are averaged to give an estimate of the classification accuracy. We call this procedure of accuracy estimation the F-splits evaluation procedure (F being the number of splits used).
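The procedure above can be sketched end-to-end. The nearest-centroid classifier below is a simple stand-in (an assumption of this sketch, not the paper's classifier), and `select` is any feature selection function; the key point is that it sees only the training half of each split:

```python
import numpy as np

def f_splits_accuracy(expr, y, select, train_frac=0.5, F=10, seed=0):
    """F-splits evaluation sketch: feature selection is redone inside every split,
    using only that split's training samples (no test-set 'leakage').

    expr:   (N, M) expression matrix; y: (M,) integer class labels.
    select: function(train_expr, train_y) -> list of selected gene indices.
    """
    rng = np.random.default_rng(seed)
    M = expr.shape[1]
    accs = []
    for _ in range(F):
        perm = rng.permutation(M)
        n_tr = int(train_frac * M)
        tr, te = perm[:n_tr], perm[n_tr:]
        genes = select(expr[:, tr], y[tr])      # selection sees ONLY training data
        X_tr, X_te = expr[np.ix_(genes, tr)], expr[np.ix_(genes, te)]
        classes = np.unique(y[tr])
        cents = np.stack([X_tr[:, y[tr] == k].mean(axis=1) for k in classes])
        dists = ((X_te.T[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
        pred = classes[np.argmin(dists, axis=1)]  # nearest-centroid prediction
        accs.append((pred == y[te]).mean())
    return float(np.mean(accs))                 # split-averaged test accuracy
```

Running ICV instead would amount to calling `select` once on `expr` and `y` before the loop, which is exactly the leakage the F-splits procedure is designed to avoid.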
Figure 7. Limits of $(U_S)^{1-\alpha}$ and $(1/R_S)^{1-\rho}$ in extreme situations of near-maximum and near-minimum redundancy.

Figure 1 shows how we evaluate the performance of each predictor set scoring method run at a certain value of the DDP at each value of P.
The selection bias resulting from the ICV procedure makes it possible for any feature selection technique evaluated using ICV to produce accuracies which are much higher than those from the F-splits evaluation procedure. The overly optimistic estimates of accuracy from the ICV mean that predictor sets chosen based on the ICV are likely to give dismal classification performance in the face of fresh, unknown test samples. The effect of selection bias on multiclass microarray datasets has been elucidated in [11].

Summary of previous works
The concept of the DDP was first introduced in [57], where we proposed the antiredundancy-based W_A,S scoring method. However, in that study, only one dataset (the GCM dataset [2]) was analyzed, and no F-splits evaluation procedure was implemented, since only the original (single) split of training and test sets used in [2] was investigated. In a subsequent study [29], we added the redundancy-based W_R,S scoring method and made some simple comparisons between the two scoring methods (using overall accuracy and range of class accuracies). Again, only the single split of the GCM dataset was analyzed.
In our third study on the DDP feature selection techniques [11], five datasets were used as benchmark datasets: GCM [2], NCI60 [15], lung [21], MLL [23] and AML/ALL [25]. All five are included in the set of nine microarray datasets analyzed in this study. However, the study in [11] was limited to the W_A,S scoring method. It was in the same study [11] that we first implemented the F-splits evaluation procedure and demonstrated the importance of avoiding evaluation procedures prone to producing overly optimistic estimates of accuracy. All of the details concerning the W_A,S and W_R,S scoring methods and the F-splits evaluation procedure are described in the Methods section of this paper.
In our most recent study [34], we presented a procedure for computing the value of the DDP most likely to produce the optimal accuracy based on available classification results. This procedure is elaborated in The importance of the DDP sub-section under the Discussion section of this paper.