 Research article
 Open Access
Outcome prediction based on microarray analysis: a critical perspective on methods
BMC Bioinformatics volume 10, Article number: 53 (2009)
Abstract
Background
Information extraction from microarrays has not yet been widely used in diagnostic or prognostic decision-support systems, due to the diversity of results produced by the available techniques, their instability on different data sets and the inability to relate statistical significance to biological relevance. Thus, there is an urgent need to address the statistical framework of microarray analysis and identify its drawbacks and limitations, which will enable us to thoroughly compare methodologies under the same experimental setup and associate results with confidence intervals meaningful to clinicians. In this study we consider gene-selection algorithms with the aim of revealing inefficiencies in performance evaluation and addressing aspects that can reduce uncertainty in algorithmic validation.
Results
We perform a computational study of the performance of several gene selection methodologies on publicly available microarray data. Three basic types of experimental scenarios are evaluated, i.e. the independent test set and the 10-fold cross-validation (CV) using maximum and average performance measures. Feature selection methods behave differently under different validation strategies. The performance results from CV do not match well those from the independent test set, except for the support vector machines (SVM) and the least squares SVM methods. However, these wrapper methods achieve variable (often low) performance, whereas the hybrid methods attain consistently higher accuracies. The use of an independent test set within CV is important for the evaluation of the predictive power of algorithms. The optimal size of the selected gene set also appears to be dependent on the evaluation scheme. The consistency of selected genes over variation of the training set is another aspect important in reducing uncertainty in the evaluation of the derived gene signature. In all cases the presence of outlier samples can seriously affect algorithmic performance.
Conclusion
Multiple parameters can influence the selection of a gene signature and its predictive power, thus possible biases in validation methods must always be accounted for. This paper illustrates that independent test set evaluation reduces the bias of CV, and case-specific measures reveal stability characteristics of the gene signature over changes of the training set. Moreover, frequency measures on gene selection address the algorithmic consistency in selecting the same gene signature under different training conditions. These issues contribute to the development of an objective evaluation framework and aid the derivation of statistically consistent gene signatures that could eventually be correlated with biological relevance. The benefits of the proposed framework are supported by the evaluation results and methodological comparisons performed for several gene-selection algorithms on three publicly available data sets.
Background
Modern biological and biomedical research has been challenged by the relatively new high-throughput methods of genomic, proteomic and metabolomic analysis [1], such as DNA microarrays that allow the simultaneous measurement of the expression of every gene in a cellular genome. Two of the fundamental tasks in this area are the identification of differentially expressed genes between two or more conditions and the selection of a subset of features (genes) with the best predictive accuracy for a certain classifier [2]. Various statistical methods for the analysis of microarray data exist; however, the data derived from these methods are complex and hard to reproduce, and they require appropriate statistical analysis to minimize errors and avoid bias.
Despite the plethora of methods that have been developed for information extraction from microarrays, such information has not yet been widely used in diagnostic or prognostic decision-support systems [3]. This is partly due to the inconsistency of the derived results [4] and the different properties of various data sets [5–8]. For example, to correctly distinguish the two types of leukemia, Golub et al. [9] used a filter method and derived a 50-gene signature, whereas Guyon et al. [10] applied a wrapper method in combination with SVM on the same data set and derived an 8-gene signature.
Many studies have addressed the different issues involved in data analysis from different microarray studies [11–14]. Here we briefly mention some of those, namely the experimental platform used, the design of the study, the normalization techniques employed or even the different properties of the data distribution [5, 10, 15]. Also, from the point of view of the statistical analysis of the data, different study outcomes can be due to the different algorithms used, the improper use of validation techniques and the optimization techniques for the prediction model.
Additionally, various statistical issues can potentially affect the results of a study. For example, the bootstrapping strategy for the generation of random folds for training and testing is an issue of particular importance for performance comparison of gene selection approaches. Moreover, cross-validation (CV) can induce a certain bias due to mixing of training and testing samples [16]. Leave-one-out CV derives over-optimistic estimates, while a three-fold CV split may lead to a small number of training samples and hence the possibility of overtraining [17]. The low performance of hard-margin SVM in [5] as compared to [10] on the independent test sets may be partially attributed to such overtraining. Stratified resampling of data may be used to maximize the power of comparison among methods [18], but it can still modify the prior data distributions, leading to changes in performance estimates. Similarly, random data splits often induce bias in case-specific considerations because they randomly exclude samples from the training or the testing process [19, 20]. Thus, there is a need to consider issues that influence algorithmic performance and to associate results with confidence intervals meaningful to clinicians, in order to thoroughly compare methodologies under the same experimental and methodological setup [16, 20–22].
Objectives of the study
Through our study we aim to reveal shortcomings in the evaluation of gene selection approaches and address methodological issues that may lead to more objective evaluation schemes. Furthermore, we consider sample-specific effects on the performance and stability of prediction algorithms. For that purpose, we have selected three well known publicly available data sets, namely the Van't Veer et al. [23], the Golub et al. [9] and the Alon et al. [15] data sets. Further details about the data sets used are given in the following section.
Our ultimate objective, besides presenting and highlighting the advantages and disadvantages of the tested methodologies, is to explore fundamental characteristics in the analysis of microarray data sets and identify methodological aspects that may influence the evaluation of algorithms. As a performance validation scheme we consider ten-fold CV repeated ten times, i.e. 100 iterations in total. It is worth recalling that CV with multiple gene sets or different splits of data sets always carries some risk of bias [16, 24]. In order to mitigate such biases we consider testing on an independent test set at each step of the recursive validation process. Finally, in order to avoid bias due to the classification model, we estimate its parameters through optimization on the independent test set.
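The validation scheme of ten-fold CV repeated ten times can be sketched as follows. This is only a minimal illustration using the Python standard library: the function name `repeated_kfold`, the seed and the round-robin way of dealing samples into folds are assumptions made here, not details taken from the study. The sample count of 78 matches the BC training set.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        # Deal the shuffled samples round-robin into n_folds groups.
        folds = [order[f::n_folds] for f in range(n_folds)]
        for f, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield rep, f, sorted(train_idx), sorted(test_idx)

# 78 samples, 10 folds, 10 repeats: 100 train/test iterations in total.
splits = list(repeated_kfold(n_samples=78))
print(len(splits))
```

Each iteration would then run gene selection on `train_idx` only and evaluate on `test_idx` and, in the scheme described above, also on the fixed independent test set.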
Results
Data sets
Three well known and publicly available data sets, first published in [9], [15] and [23], were considered. For the breast cancer data set [23] genes were hybridized using two-color arrays, whereas for the other two one-color arrays were used. The lists of genes, after platform-specific handling of the data, were used as suggested by the corresponding authors. Thus, platform-specific effects were not considered.
The breast cancer (BC) data set [23] contains 24,481 genes and 78 samples in the training set, 34 of which are characterized as positive and 44 as negative according to whether or not a relapse occurred within a period of five years. 293 genes with missing information for all 78 patients were removed and the remaining 13,604 missing values were substituted using Expectation Maximization (EM) imputation [25]. This set is used either as a training set or for the design of the CV trials and the specification of the bootstrap training and testing subsets. The independent test set consists of 19 samples, 7 negative and 12 positive.
The leukemia data set in [9] consists of 38 bone marrow samples obtained from acute leukemia patients at the time of diagnosis. The data set is divided into 27 samples of acute lymphoblastic leukemia (ALL) and 11 samples of acute myeloid leukemia (AML). Data were analyzed using Affymetrix arrays containing 7,129 human genes. The independent test set consists of 34 samples (20 ALL and 14 AML).
The colon cancer (CC) data set in [15] consists of 40 tumour and 22 normal colon tissue samples analyzed with an Affymetrix oligonucleotide array complementary to more than 6,500 human genes. 2,000 genes were considered as suggested by [15] based on their minimal intensity across samples. To create the training set, 28 tumour and 16 normal samples were randomly selected, while the remaining samples constituted the independent set.
Data set analysis
To highlight differences between the three data sets, a three-dimensional principal component analysis (PCA), depicted in Figure 1, was performed, which could assist us in estimating a decision boundary. This analysis revealed that the BC data set perhaps demonstrates the highest degree of overlap between the two classes. The CC data set appears to be more separable than BC, while the leukemia data set appears to be the most easily distinguishable.
Design of experimental scenarios
Three basic types of experimental scenarios are conducted in order to estimate the performance of tested methodologies:

In the independent test set scenario the algorithms are trained using the complete training set together with a backward feature elimination process. Thus, a varying number of genes are recursively eliminated until 100 surviving genes are left and subsequently one gene is eliminated at a time from this point onwards [26]. At each stage of the elimination process, the classifier is trained on the training set using the selected set of genes, while its performance with respect to accuracy is measured on the independent test set using the same set of genes. Finally, the minimal set of genes which achieves the maximum classification accuracy and classifies the training set perfectly is the final set of marker genes, with the corresponding maximum accuracy performance on the independent test set. Here, we also measure the Q-statistic score [27] to investigate the level of gene differentiation among the classes of interest and gene correlation with the disease outcome. This scenario is essentially used for the estimation of algorithmic parameters and of the size of the gene signature with high accuracy and good generalization on the independent test set.

The maximum performance on a 10-fold CV scenario is used to assess optimal algorithmic performance under CV, similar to the GEMS approach [28, 29]. The initial set of training samples is randomly partitioned into ten groups. Each group, denoted by fold for the rest of this paper, is iteratively used for testing, whereas the remaining folds are used for the training of the algorithms. Thus, 90% of the set is used for training purposes, while the remaining 10% is used as the test set. This CV process is repeated ten times, resulting in a total of 100 test iterations, i.e. 10 (folds) × 10 (repeats). Each iteration proceeds independently with gene selection according to the previous cutoff strategy. The maximum performance for each run on the testing subset is reported along with the corresponding number of surviving genes. Finally, we report the grand average of maximum accuracies along with the corresponding number of gene sets and confidence intervals (CIs) on classification accuracy.

The average performance on a 10-fold CV scenario operates with fixed parameters of the tested algorithms and terminates at a fixed number of selected genes determined by the independent test set scenario. Feature selection is performed within each run of the CV process in order to avoid overestimation of the prediction accuracy [30, 31]. Testing is performed on the test folds of CV, but also on the common independent test set. In this scenario, average accuracy measures are estimated in an attempt to reduce random correlation effects and increase the confidence that the test measurements reflect the true predictive ability of each method. CIs on classification accuracy are used, where appropriate, as a stability measure revealing the variation in the performance of the tested methodologies. Furthermore, we report the number of commonly selected genes as an indicator of the stability of the tested methodologies over different training sets. The frequency of gene appearance in classification rules has also been used by [19, 20] in order to control the rate of genes selected at random.
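The backward elimination schedule and the choice of the final signature size described in these scenarios can be illustrated with a short sketch. The text only states that a "varying number" of genes is removed until 100 remain; the halving rule used below (`keep_fraction=0.5`) is an assumption for illustration, as are the function names. The additional requirement that the chosen gene set classifies the training set perfectly is omitted here for brevity.

```python
def elimination_schedule(n_genes, switch_at=100, keep_fraction=0.5):
    """Return the sequence of surviving-gene counts, from n_genes down to 1."""
    sizes = [n_genes]
    while sizes[-1] > 1:
        current = sizes[-1]
        if current > switch_at:
            # Coarse phase (assumed halving), never dropping below switch_at.
            nxt = max(switch_at, int(current * keep_fraction))
        else:
            # Fine phase: remove exactly one gene per step.
            nxt = current - 1
        sizes.append(nxt)
    return sizes

def smallest_best(sizes, accuracies):
    """Smallest gene-set size that attains the maximum accuracy."""
    best = max(accuracies)
    return min(s for s, a in zip(sizes, accuracies) if a == best)

# Example with the 2,000 genes of the CC data set.
sizes = elimination_schedule(2000)
print(sizes[:7])
```

In the independent test set scenario, `accuracies` would hold the test accuracy measured at each entry of `sizes`; in the average-performance scenario, the elimination would simply stop at the size returned by `smallest_best`.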
Apart from measures on the overall population used in CV, we also resort to case-specific considerations to increase the clinical relevance of algorithmic accuracy estimates. The accuracy, usually measured over many iterations, cannot evaluate the efficiency of the algorithm in correctly classifying individual cases. Thus, high accuracy may be due to correct classification of different cases in each run, providing an overall high score for this measure even though single cases may not be correctly classified in most runs of the algorithm. However, when considering clinical evaluation of a new case, it is important to have high confidence in the correct categorization of the individual subject based on the training examples. For that reason it is essential to consider per subject accuracy measures, which assist in increasing the efficiency of the algorithm in correctly identifying new cases with attributes similar to those involved in the training set. Through such measures, we can also consider the stability and generalization ability of each algorithm on the basis of per subject success or failure, as well as account for the influence of entire-sample measurement errors on the estimation of the prediction power of each method. All those measures are presented along with appropriate CIs derived from the CV process.
For the effective evaluation of measures a performance profile of each algorithm is computed, i.e. a table with columns reflecting all subjects and rows indicating the CV runs. For each run (row), the table captures a binary value for each case (column) if that subject is in the test set of the run. This value indicates the prediction success for that specific subject on the specific run. In this form, the average of per subject accuracies over all runs reflects the per subject accuracy of the algorithm, whereas the average of per run accuracies over all testing cases reflects its per run accuracy. The notation of such accuracy measures is specified in subsequent sections.
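A small sketch may help to make the performance profile concrete. It assumes a list-of-lists layout in which `1` marks a correct prediction, `0` an incorrect one and `None` a subject that was not in the test set of that run; the layout, the function names and the toy values are illustrative assumptions rather than the study's implementation.

```python
from statistics import mean

def per_run_accuracy(profile):
    """Average, over runs, of the fraction of tested subjects predicted correctly."""
    run_scores = []
    for row in profile:
        tested = [v for v in row if v is not None]
        if tested:
            run_scores.append(mean(tested))
    return mean(run_scores)

def per_subject_accuracy(profile):
    """Average, over subjects, of the fraction of runs in which each was predicted correctly."""
    subject_scores = []
    for j in range(len(profile[0])):
        tested = [row[j] for row in profile if row[j] is not None]
        if tested:
            subject_scores.append(mean(tested))
    return mean(subject_scores)

# Rows are CV runs, columns are subjects (toy values).
profile = [
    [1, 0, 0, None],
    [None, None, None, 1],
    [1, None, None, 1],
]
acc_run = per_run_accuracy(profile)
acc_subject = per_subject_accuracy(profile)
print(round(acc_run, 3), round(acc_subject, 3))
```

The two averages differ in this toy example because the runs test different numbers of subjects, and two subjects are never classified correctly even though two of the three runs score perfectly. When every run tests the same set of subjects, as with a fixed independent test set, both functions return the same value.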
Three classes of methods have been chosen for selecting the number of genes with best performance in sample classification, namely filter, wrapper and hybrid methods. Filter and wrapper methods have been extensively used for gene selection, whereas hybrid methods have also shown considerable success by combining the advantages of the previous classes [26]. Filter methods reveal the discriminatory ability of each gene by employing various criteria, such as Fisher's criterion, which is also used in the present study. Wrapper methods traditionally employ SVM classifiers in a recursive feature elimination (RFE) mode to reveal the prediction ability of groups of genes, but other (rather simple) predictors have also been tested. In our study wrapper methods were employed via five algorithms, in particular the RFE based on Linear Neuron Weights using Gradient Descent (RFELNWGD), the RFE based on Support Vector Machines (RFESVM), the RFE based on Least Square Support Vector Machines (RFELSSVM), the RFE based on Ridge Regression (RFERR) and the RFE based on Fisher's Linear Discriminant Analysis (RFEFLDA). Hybrid methods are examined via RFE based on Linear Neuron Weights (RFELNW1, RFELNW2), and RFE based on Fisher's ratio of Support Vectors using a 7 Degree polynomial Kernel (RFEFSVs7DK). More details can be found in the Methods section.
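To illustrate how a filter method ranks genes one at a time, the following sketch scores each gene with a commonly used form of Fisher's criterion, the squared difference of class means divided by the sum of class variances. The exact formula used in the study is given in its Methods section and may differ; the form below, the function names and the toy expression values are assumptions for illustration only.

```python
from statistics import mean, pvariance

def fisher_score(values, labels):
    """(mean_pos - mean_neg)^2 / (var_pos + var_neg) for a single gene."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pvariance(pos) + pvariance(neg)
    gap = (mean(pos) - mean(neg)) ** 2
    if spread == 0:
        # Degenerate case: constant expression within both classes.
        return float("inf") if gap > 0 else 0.0
    return gap / spread

def rank_genes(expression, labels):
    """Return gene indices ordered from most to least discriminative."""
    scores = [fisher_score(gene, labels) for gene in expression]
    return sorted(range(len(expression)), key=lambda i: scores[i], reverse=True)

labels = [1, 1, 1, 0, 0, 0]
expression = [
    [5.0, 5.2, 4.8, 1.0, 1.2, 0.8],  # well separated classes
    [2.0, 3.0, 1.0, 2.5, 1.5, 2.0],  # identical class means
    [3.0, 3.5, 2.5, 2.0, 2.5, 1.5],  # moderate separation
]
ranking = rank_genes(expression, labels)
print(ranking)
```

A filter scores every gene independently of any classifier, which is what distinguishes it from the wrapper schemes, where the ranking comes from the weights of a trained predictor.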
Maximum Performance Results
Independent Testset
The maximum performance results of filter, wrapper and integrated schemes on the independent test set are presented in Tables 1, 2 and 3. Accuracy measures, as defined in the Methods section, are presented along with sensitivity and specificity measures, reflecting the ability to correctly classify good and bad prognosis, respectively. The Q-statistic measure and the number of selected genes for achieving maximum performance on the independent test set are also assessed.
We first notice that for the BC data set (Table 1) RFEFSVs7DK is the best performer in terms of accuracy, achieving the highest success rate of 95% (only one sample is misclassified) with 73 genes selected. It also achieves the highest specificity measure with adequately high sensitivity. Notably, the same method achieves a relatively high statistical significance score, close to that of the filter method, which scores highest on the Q-statistic. RFELNW1 also demonstrates noticeable performance both in terms of accuracy and statistical significance. In contrast, the RFELNW2 and filter methods have similar results, indicating that by using a relatively low learning rate the RFELNW approach converges to the result of the pure filter method. Among the wrapper methods, RR and FLDA achieve high accuracy with better sensitivity than specificity values.
For the leukemia data set (Table 2) most algorithms (except the wrapper RR and FLDA methods) attain higher accuracy than in breast cancer (Table 1), which is explained by its well-defined classes. The specificity of all methods is remarkably high. Nevertheless, this is not true for sensitivity, with only LNW, SVM and hybrid methods achieving scores over 86%. Moreover, the hybrid methods preserve good values of the Q-statistic, slightly better than the filter method, revealing high discrimination over the selected gene signatures. The wrapper methods in general reflect lower values of the Q-statistic metric, implying weaker discrimination of genes amongst the classes of interest, and low correlation between gene expression level and prediction outcome. The wrapper LNW and SVM methods reflect better performance than the filter method.
For the CC data set (Table 3) all methods (except the wrapper RR) achieve more than 88% accuracy, with wrapper LNWGD, LSSVM and hybrid LNW1 and RFEFSVs being the best performers, achieving also the highest sensitivity rate. We point out the inferior performance of the wrapper RR method compared to previous results, indicating that this algorithm is highly affected by differences in the experimental settings, as also observed for the leukemia data set. The sensitivity achieved by all methods (except RR) is remarkably high, while at the same time the hybrid methods preserve higher values of the Q-statistic than the wrapper schemes, close to or better than (RFELNW2) that of the filter method. An exception is the wrapper LNWGD, which also achieves a high score close to that of the filter method. Overall, the hybrid schemes preserve top performance amongst the tested methods for all data sets. We may also infer that the wrapper RR and FLDA methods are most affected by the differences in the experimental settings.
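As a quick check on the figures quoted for the BC independent test set (19 samples, 12 positive and 7 negative), the sketch below computes accuracy, sensitivity and specificity using their standard definitions. Which class counts as "positive" and which single sample is misclassified are hypothetical choices made here for illustration; the study's own definitions are in its Methods section.

```python
def classification_rates(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # fraction of positives recovered
        "specificity": tn / (tn + fp),   # fraction of negatives recovered
    }

# 12 positive and 7 negative test samples, as in the BC independent test set.
y_true = [1] * 12 + [0] * 7
y_pred = list(y_true)
y_pred[0] = 0  # hypothetical: exactly one (positive) sample is misclassified

rates = classification_rates(y_true, y_pred)
print({k: round(v, 3) for k, v in rates.items()})
```

With one error in 19 samples the accuracy is 18/19, roughly 94.7%, which is consistent with the reported 95%. On a test set of this size a single sample shifts accuracy by more than five percentage points, which is worth keeping in mind when comparing methods.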
Crossvalidation
The results of this experimental scenario concerning the BC, leukemia and CC data sets are presented in Tables 4, 5 and 6, respectively. We can observe that the filter method performs well in all cases, resulting in higher accuracy and narrower CIs. There is a similar tendency for the hybrid methods for all data sets. For the wrapper methods, however, the performance varies depending on the data set. Furthermore, the wrapper methods derive wider CIs in the more complex breast cancer set, indicating larger variation of results throughout the CV iterations.
For the CV testing of BC, the wrapper schemes exhibit lower accuracy and stability, yielding wider CIs. The inferior accuracy of wrapper methods is primarily attributed to their lower sensitivity. In contrast, for the leukemia data set (Table 5) most wrapper schemes (except RR) increase their maximum accuracy measures in CV compared to the independent test set scenario, whilst attaining good sensitivity and specificity scores. In fact, the FLDA scheme reaches the maximum accuracy followed by LSSVM and LNWGD, but in this case the accuracy performance of all methods is comparably high.
For the CC data set (Table 6), even though all methods decrease their performance in comparison to the leukemia data set, they still exhibit a high accuracy level ranging from 83% to 91%. We point out the increase in the performance of the wrapper RR method compared to the previous evaluation procedure (independent test set scenario), indicating once more that the method is strongly influenced by the different experimental scenarios. As in the case of the leukemia data set, a decrease in algorithmic performance is observed when compared to the independent test set scenario. The wrapper LSSVM and hybrid FSVs and LNW1 are the top performers, with LSSVM achieving the highest sensitivity rate among the three. Generally speaking, the wrapper methods (with the exception of the LSSVM and FLDA methods) have wider CIs than the hybrid and the filter methods, indicating higher variation of results throughout the CV iterations, as was also observed in the BC data. Moreover, hybrid methods tend to select fewer genes than both the filter and wrapper approaches.
Considering these results, we may conclude that the performance of pure wrapper methods is more affected by the experimental conditions, with performance measures varying over cross-experimental evaluation. Furthermore, wrapper methods appear to benefit from integrating appropriately adapted filtering criteria (hybrid approaches) into their learning procedure. Such integration would lead to a more stable performance by preserving high levels of statistical significance under both tested scenarios and experimental frameworks.
Overall, the evaluation (or ranking) of algorithms changes under a cross-experiment evaluation or when considering different evaluation schemes such as CV or single-step evaluation with testing on an independent test set. The CV scheme appears to give a certain (positive or negative) bias to all algorithms. Furthermore, the optimal size of the selected gene set appears to be dependent on the evaluation scheme. The CV scheme, driven by its various training sets, can lead to a quite skewed distribution of performance estimates for various sizes of selected gene signatures. Thus, the final size of the signature attaining maximum performance within the multiple iterations of CV can be easily affected by several random effects. In essence, factors of bias in maximum CV performance include the small size correlation effect and the intermixing of samples within the training and test sets. Similar criticism of maximum or ranked performance schemes has been reported by other studies [24].
From this section it is clear that the ranking of methods is highly dependent on the evaluation strategy. For a more objective comparison of algorithms, we consider in the next section the average performance of algorithms at a certain cutoff point on the number of surviving genes. Thus, the size of the gene signature is specifically chosen for each algorithm based on its best performance on the independent test set (Tables 1, 2, 3), in order to test how well algorithms will perform with new cases.
Average Performance Results on Crossvalidation
The average per run and per subject accuracies (acc_{R} and acc_{P}) along with their standard deviations (stds) are tabulated in Tables 7, 8, 9. The measures of sensitivity and specificity are also reported on the basis of per run tests for the test population in CV and in the independent sets. The algorithms achieving the highest accuracy as well as the lowest stds are highlighted with bold figures. The last row of those tables presents the overall measures on all samples tested (CV test set plus the independent test set) and reflects the overall performance ability of each algorithm. When considering the same population for testing, e.g. the independent test set, the two measures, acc_{R}^{t} (per run) and acc_{P}^{t} (per subject), obtain identical values. Notice, however, that we expect a difference in stds due to the different reference cohorts, i.e. the different runs and subjects used.
The overall accuracy measures along with their CIs are graphically depicted in Figures 2, 3, 4. Furthermore, Figures 5, 6, 7 show the per subject accuracy and the CIs of classification accuracy for the independent test sets. Note the large variability among sample accuracies in the case of BC (Figure 5) and the relative consistency of estimation throughout the tested subjects in the case of leukemia (Figure 6). Concerning the consistency of algorithms in terms of selected gene signatures over the CV iterations, the consistency (or gene overlap) index is tabulated in Table 10 for all tested algorithms. With the exception of the LSSVM and RFELNWGD methods, wrapper methods appear to select different genes per iteration, resulting in quite small indices. Filter, as well as hybrid, methods yield good consistency based on their high frequencies of selecting the same genes throughout CV iterations. Nevertheless, we should stress our belief that in a further development stage we need to also associate the statistical results with the biological meaning of selected gene signatures.
Comparing the performance results, we first notice that the average accuracy measures (Tables 7, 8, 9) are significantly smaller than their maximum counterparts in the previous section (Tables 1, 2, 3). Recall that in the present scenario each algorithm is applied with fixed parameters for all CV iterations and terminates at a fixed number of surviving genes, which is specified by its point of maximum performance on the independent test set (Tables 1, 2, 3). This scenario is much closer to a clinical testing and validation setup, so that these results, even though inferior, reflect more closely the actual potential of algorithms in clinical prediction. It can also be observed that, similarly to the maximum performance measures, the average accuracy of the CV samples differs from that of the independent test samples, either on a per run or a per subject basis (Tables 7, 8, 9). This deviation is consistent for all three data sets. Furthermore, all methods demonstrate consistently lower stds when tested on the independent test set than on the testing fold of the CV set for the BC and CC data sets (Tables 7 and 9). This is a direct consequence of the fact that in the former case the testing set is kept constant across all iterations (fixed independent test set). Nevertheless, this behaviour is reversed for the leukemia data set (Table 8), which is possibly due to statistical differences in the distributions of the training (CV) and testing samples and implies insufficient training of the algorithms.
Considering the results over all testing samples (Tables 7, 8 and 9), which we consider as good estimates of the true algorithmic performances (gold standard for comparison), we notice that they are more closely approximated by the results of the independent test set than by those of CV test folds. More specific comparisons on algorithmic performance are summarized in the following sections.
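The idea of measuring how often the same genes are selected across CV iterations can be sketched as follows. The exact definition of the consistency index in Table 10 is not given in this section, so the index below (the number of genes selected in at least a given fraction of iterations, divided by the mean signature size) is purely an illustrative assumption, as are the function names and gene labels.

```python
from collections import Counter

def selection_frequency(signatures):
    """Fraction of iterations in which each gene was selected."""
    counts = Counter(g for sig in signatures for g in set(sig))
    n_iter = len(signatures)
    return {g: c / n_iter for g, c in counts.items()}

def overlap_index(signatures, min_frequency=1.0):
    """Share of the (mean) signature made up of genes selected at least
    min_frequency of the time. Illustrative definition, not Table 10's."""
    freq = selection_frequency(signatures)
    stable = [g for g, f in freq.items() if f >= min_frequency]
    mean_size = sum(len(set(s)) for s in signatures) / len(signatures)
    return len(stable) / mean_size

# Three hypothetical CV iterations, each yielding a 3-gene signature.
signatures = [
    ["g1", "g2", "g3"],
    ["g1", "g2", "g4"],
    ["g1", "g3", "g5"],
]
freq = selection_frequency(signatures)
print(freq["g1"], round(overlap_index(signatures), 3))
```

Under this definition, a method that returns the same genes in every iteration scores 1, while a method whose signatures share no genes scores 0.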
Filter and Wrapper methods
All methods reflect a positive or negative bias on the accuracy estimates of the CV set over those of the independent test set and the overall testing set, depending on the data set. The baseline filter method derives good accuracy on the CV tests for all data sets (Tables 7, 8, 9), which is in accordance with the results of [6]. In fact, its accuracy on CV tests is higher than that of all wrapper methods tested. Nevertheless, such good performance is not sustained on the independent test set (Tables 7, 8, 9). Furthermore, the std indices for the leukemia data set achieve their lowest values for the filter method, but such algorithmic stability is dependent on the data set.
The wrapper scheme based on the linear neuron (LNWGD) results in better accuracy on the independent test set, either on a per run or per subject basis (Tables 7 and 9), for the more difficult BC and CC data sets. This is reversed for the leukemia data set (Table 8), where the compact nature of samples in each class forces its performance on CV testing towards overestimated measures. Nevertheless, it maintains reasonable performance on the overall test set, with balanced sensitivity and specificity.
Considering the BC data set and the other wrapper RFE approaches on their average per run performance, we observe that on the independent test set (Table 7) RR and FLDA perform better than SVM, which is in accordance with the results of [8]. However, the SVM and its least squares variant yield better specificity than the RR and FLDA methods. This performance is reversed when considering CV (Table 7). In this case, the performance of SVM is slightly better than that of FLDA, which is in accordance with the findings of [5]. For the leukemia data set (Table 8) the SVM-based approaches perform better for both the independent test set and the CV approaches, with increased specificity but reduced sensitivity relative to the RR and FLDA methods. This improvement is attributed to the characteristic distribution of support vectors in the leukemia data set, which expresses better class separability compared to the BC data set [32]. Similarly for CC (Table 9), the SVM-based approaches perform better than the other wrapper methods, but now due to increased sensitivity. With the exception of the excellent performance of the LSSVM on the independent and the overall test sets, the performance of wrapper methods in colon cancer is in accordance with the performance ranking for the other two data sets.
Despite its high performance in CC, the LSSVM method expresses a large variation in the accuracy measures between CV testing and the independent test set, which may be due to the particular selection of the independent test set. Notice that in this case the independent set was randomly selected from the original data in [15], unlike the other two data sets considered, which supplied separate independent data sets. Furthermore, its increased performance on the independent test set is mainly due to its increased sensitivity over all other algorithms (Table 9). The sporadic nature of its results is also verified by its rather low performance on the other two data sets. Overall, for all data sets the variation of results in wrapper methods is quite large between the CV and the independent test sets, indicating a significant influence of the training and/or testing sets on the performance of these algorithms. In particular, the LSSVM achieves its best performance on colon cancer. On similar grounds, the RR method performs similarly to the other wrapper methods on breast cancer, but presents the worst performance on the other two data sets. These results further highlight the need for evaluating the ranking of wrapper algorithms on a series of tests over different data sets. An additional drawback of wrapper methods is the weak consistency of the derived gene signatures between iterations, as reflected by their low consistency index presented in Table 10. Only the LSSVM approach attains an exceptionally high consistency index, indicating a consistent selection of genes, but the selected gene signatures in most cases do not reflect high prediction power. Hybrid and filter methods combine a relatively high consistency index with increased success rate and more stable performance over the different data sets.
Hybrid methods
Concerning the hybrid approaches, the FSV scheme is one of the two best algorithms in BC and CC, with its performance being slightly inferior to that of other hybrid methods in the leukemia data set. More specifically, for the BC and CC cases FSV achieves high accuracy on either a per run or a per subject basis when testing the independent and the overall testset (Tables 7 and 9). This performance drops slightly relative to other hybrid methods in the case of leukemia, where the CV scheme yields highly overestimated measures (Table 8). A possible reason is that the optimal representation kernel of the BC data set (a 7-degree polynomial) is also used for the leukemia and CC data sets. A different kernel with a better fit to the distribution of the leukemia or CC data sets would further improve the performance of this method.
The neural network-based algorithms (LNW1 and LNW2) are among the best performers in the BC and leukemia data sets. The LNW2 algorithm exhibits highly optimistic estimates in CV (Tables 7, 8), while its overall performance is among the highest for leukemia. The LNW1 approach yields good per run accuracies for these data sets, when tested on either the independent testset (Tables 7 and 8) or the entire testset (last row in Tables 7 and 8). It also yields consistently small variances per run, reflecting good stability properties, as it appears to be less affected by variation of the training and/or testing sets. The performance rank of LNW1 and LNW2 on CC drops slightly, mainly due to the exceptional performance of the wrapper LSSVM method. Nevertheless, all hybrid methods perform better than RFELSSVM on cross validation (Table 9), while they all achieve above 80% accuracy on the independent testset (Table 9), with the best performer being RFEFSVs (85% average success rate).
Testing on all available samples (both sampling and independent testsets) through CV is regarded as a less biased test of algorithmic performance and is used as a reference for comparison (Tables 7, 8, 9). Regarding such overall performance results, the hybrid FSV approach takes the lead, followed by LNW1 and LNW2, with slight variations.
The performance of hybrid methods on the overall set is closely matched by the results of testing on the independent testset throughout the iterations, either on per run or per subject basis, further emphasizing the necessity of an independent testset in evaluation procedures and the utility of both performance and stability measures. These hybrid methods also achieve relatively high consistency index compared to wrapper methods (except the LSSVM) as illustrated in Table 10. Thus, considering several aspects of their performance, we may claim that the hybrid approaches provide consistently good performance and stability in the independent testset, both on per run and per subject basis.
Outlier Samples
Another important aspect of this work is the consideration of the influence of noisy samples. Considering the per subject stability (std measure) of algorithms in all cases, we observe relatively increased values compared to the per run measures (Tables 7, 8 and 9), which indicate quite a large variability in the algorithmic estimates for individual cases depending on the training set. This is an indication of problems caused by the limited training set used in the design of prediction models. Since the model is designed with limited information from the feature space, specific test samples may not fit its design specifications well. The variability for the independent testset is graphically presented in Figures 5, 6, and 7 for the baseline filter, wrapper SVM and hybrid LNW1 approaches and the three cases, respectively. The accuracy deviation on specific samples is more severe in breast cancer and less severe in leukemia, following the difficulty in class separability illustrated by the PCA analysis in Figure 1.
Further analyzing the BC case (Figure 5), we notice that the performance of algorithms on test samples is variable, with the exception of some samples on which the average performance of all algorithms is quite low. These specific samples are IDs 37, 38, 54, 60 and 76 from Van't Veer's training set, as well as sample ID 117 from the independent testset. Generally, these samples cannot be classified well and can be characterized as outliers. When excluding only these samples, the average performance measures are significantly affected, as presented in Table 11. Comparing the corresponding measures in Tables 7 and 11, it becomes obvious that only a few outliers due to measurement errors can drastically degrade the performance of any prediction algorithm.
In a similar consideration for the leukemia data set (Figure 6), we may conclude that sample 31 is probably an outlier since all methods examined fail to classify it. Furthermore, for CC (Figure 7) we may suspect samples 3, 4 and 6 as outliers, since most methods fail to effectively classify them. The identification and removal of such samples is of primary importance in algorithmic evaluation, especially in the area of gene selection with sparsely covered data spaces. The identification scheme proposed here is an alternative to data projection for exploratory purposes [33] for selecting outlier samples based on machine learning rather than on projective mappings of data distributions.
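The screening described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `flag_outliers`, the input layout, the use of the mean over methods, and the 0.5 cut-off are all hypothetical and are not taken from the study.

```python
from statistics import mean

def flag_outliers(per_subject_acc, threshold=0.5):
    """Return IDs of samples whose accuracy, averaged over all methods, is below threshold.

    per_subject_acc: {method_name: {sample_id: per-subject accuracy in [0, 1]}}
    threshold: hypothetical cut-off, not a value reported in the study.
    """
    sample_ids = set()
    for accs in per_subject_acc.values():
        sample_ids.update(accs)
    flagged = []
    for sid in sorted(sample_ids):
        # Only methods that actually tested this sample contribute to its mean.
        values = [accs[sid] for accs in per_subject_acc.values() if sid in accs]
        if mean(values) < threshold:
            flagged.append(sid)
    return flagged

# Toy per-subject accuracies for three methods (made-up numbers).
toy = {
    "filter": {1: 0.9, 2: 0.1, 3: 0.8},
    "RFESVM": {1: 0.8, 2: 0.2, 3: 0.4},
    "RFELNW1": {1: 1.0, 2: 0.0, 3: 0.9},
}
outliers = flag_outliers(toy)
```

In this toy input only sample 2 is misclassified by every method, so it is the only one flagged at the default cut-off. A stricter criterion, such as requiring every method to fail, could be substituted for the mean.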
Discussion
In this study, data-driven models, which depend strongly on the distribution of data within and across classes, were considered. The cut-off point on the size of the selected gene signature was determined independently for each model, based on its maximum performance on the independent testset. Even though the ranking of algorithms tested varied between data sets, the main focus was on methodological aspects of evaluation, so that the points addressed would remain valid for other data sets and algorithms.
More specifically, the examined approaches were applied to three data sets of graded difficulty as a first attempt to understand the various CV approaches and to form the basis of an algorithmic evaluation scheme that is as objective as possible. The BC data set, which had the least well-defined classes, together with its independent testset, was used as a pilot in the evaluation process. Each of the methods examined was tuned to reach its best performance on the independent testset of the pilot data set, by appropriately fine-tuning its parameters. Then, using exactly the same set of parameters, method performance was assessed with various metrics on two additional data sets, in addition to the pilot one. With this approach we aimed at assessing the sensitivity of each method to different data sets or different biomedical experimental scenarios. This approach revealed that RFERR and RFELSSVM, for instance, depend strongly on the data set, producing diverse and contradictory results across the different computational scenarios presented.
Three different evaluation schemes were used. The maximum accuracy scheme, tested on both CV and independent testsets, presented optimistic results for most algorithms and all data sets considered. Such results can be misleading and induce severe bias on the accuracy estimates for predictors, which are far away from the actual potential of each algorithm on correctly classifying new unseen cases. The use of such schemes should be avoided and the average accuracy scheme should be used instead.
Following these guidelines, the performance of all methods was assessed with a 10-fold evaluation process on the specific number of genes indicated by the independent testset evaluation. This strategy aimed at revealing the consistency of each method in deriving a gene signature with high prediction accuracy, on either a per-run or a per-patient basis. The former assessed method sensitivity to perturbations of the testset, while the latter addressed sensitivity to the training set. This evaluation scenario revealed that the use of a stable independent testset along with a 10-fold evaluation process (that uses a different training set per fold) resulted in more stable and less variable method performance than a standard 10-fold CV process. This result was verified in all three data sets on the per-run basis, while on the per-patient basis it was verified for two data sets (BC and CC). We found that complementing a stable testset with a varying one along a 10-fold CV process is a less biased estimator of method performance on the initial independent testset than a standard 10-fold CV process.
Using the per-patient approach we identified outlier samples, which usually resulted in an underestimation of method performance, while the stability index was used to assess method consistency along the iterative 10-fold CV process. Such an index, however, should be used with caution and always in association with the accuracy performance.
Overall, the wrapper methods showed large variations in performance depending on the data set. The filter method derived good results for all data sets on the maximum performance test, but its performance ranking dropped when average accuracy testing was considered, especially for the independent testset. The hybrid methods preserved consistently good results under both maximum and average performance consideration and for all data sets tested. Focusing on the average performance results in CV testing, the filter method was among the highest performers along with the hybrid approaches. On the independent testset and the overall testing scheme, which forms the gold standard for our comparisons, the best performers always included FSV and other hybrid methods, except in the last data set, where LSSVM yielded relatively high accuracy.
It becomes clear that comparison and ranking of algorithms, even on the same data set, should be based on several characteristic measures. Furthermore, the relatively small overlap index raises concerns regarding the potential influence of the small sample random correlation effect on the derived signatures and the performance estimates. A reason for such effects is the independent selection process throughout the iterations for the different data folds. Under CV, each iteration begins with the maximum number of genes and recursively eliminates them up to the minimum specified size, with its own ranking scheme driven by the data in the training set. Thus, for a specific size gene signature, the process for each iteration may select completely different genes. Such drawbacks of CV have been highlighted before, suggesting the need for double CV for metaparameter selection [34]. For the purpose of gene signature selection, we propose the use of a nested CV scheme, where all iterations operate in parallel. At every cutoff stage on the size of gene signature, all genes could be ranked based on the average of their weights from all CV iterations, so that the same genes survive for all iterations. In this way the effect of random correlation between genesets and data samples is expected to be reduced. This scheme, however, remains to be tested in practice.
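One elimination stage of the proposed scheme could be sketched as follows. This is a minimal illustration, not a tested implementation: the function name `shared_elimination_step` and the data layout are hypothetical, and ranking by the average absolute weight is an assumption, since the text only says that genes are ranked by the average of their weights.

```python
def shared_elimination_step(fold_weights, n_keep):
    """One elimination stage shared by all CV iterations.

    fold_weights: list of dicts {gene: weight}, one dict per CV iteration,
                  all defined over the same set of currently surviving genes.
    n_keep: number of genes that survive this cut-off stage.
    Returns the surviving genes, ordered from highest to lowest average weight.
    """
    genes = list(fold_weights[0])
    n_folds = len(fold_weights)
    # Assumption: weight magnitudes are averaged, so opposite signs do not cancel.
    avg = {g: sum(abs(w[g]) for w in fold_weights) / n_folds for g in genes}
    ranked = sorted(genes, key=lambda g: avg[g], reverse=True)
    return ranked[:n_keep]

# Two CV iterations with weights for three surviving genes (made-up numbers).
folds = [
    {"g1": 0.9, "g2": -0.1, "g3": 0.4},
    {"g1": 0.7, "g2": 0.3, "g3": -0.6},
]
survivors = shared_elimination_step(folds, n_keep=2)
```

Because the same list of survivors is handed back to every iteration, all folds carry an identical gene set into the next stage, which is the property the proposal relies on.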
Conclusion
The prediction accuracy reported by many studies in the field of microarray analysis reflects a certain bias due to the study design, the analysis method (model design) or the validation process [35, 16]. Most published comparative studies consider that issue as well as the ranking of algorithms on various data sets. This study has focused on the evaluation platform of algorithmic performance. The main aim was not to strictly compare algorithms, even though a ranking was attempted as a byproduct of this work, but rather to address inefficiencies of evaluation and introduce aspects that may reduce uncertainty in algorithmic evaluation.
Overall, it is concluded that the use of an independent testset is beneficial for estimating a baseline on accuracy results for all algorithms. The CV scheme by itself induces certain positive or negative bias depending on the data set and should be complemented with independent tests. Nevertheless, most often data sets do not come with a compatible independent testset obtained under the same study criteria. Hence, the isolation of a subset as an independent testset must be considered with caution, since there is a danger of inducing estimation bias due to alterations in the design of the experiment. Furthermore, to reduce bias in the estimation of measures, the design of crossvalidation splits should be carefully considered as it should not alter the entire design of the experiment. In particular, a stratified scheme for sample selection should be preferred, so that individual samples can be tested a sufficient number of times throughout the iterations.
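As an illustration of a stratified split, the sketch below assigns samples to folds class by class, so that each fold keeps roughly the class proportions of the full data set. It assumes that "stratified" is meant in this usual sense. The function name `stratified_folds` and the fixed seed are hypothetical choices for the example.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the shuffled members of this class round-robin over the folds,
        # continuing where the previous class stopped to keep fold sizes balanced.
        for pos, idx in enumerate(members):
            folds[(offset + pos) % k].append(idx)
        offset += len(members)
    return folds

# Ten samples: six of class 0 and four of class 1.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
folds = stratified_folds(labels, k=2)
```

Repeating the split with different seeds gives every sample several test appearances, which is what the per-subject measures require.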
Another concept that has been introduced is the "performance profile", which is a matrix recording accuracy results over the various iterations (for different CV folds) of the RFE process and enables the computation of subject and iterationspecific performance measures for each algorithm. Based on the performance profile, we consider casespecific measures to reveal stability of the estimate over different training sets, as well as frequency measures on gene selection to address the algorithm's consistency in selecting the same gene signature under different training conditions. These issues reveal different aspects of algorithmic performance which could influence their ranking under a crossvalidation strategy.
Besides all these algorithmic considerations and comparisons, gene selection should always be interpreted from the biologist's perspective. Statistical significance is not always accompanied by biological relevance, thus knowledge of gene function and biological pathways should always be taken into account. It seems that a more integrated scheme of statistical analysis, combined with statistical as well as biological validation, is needed in order to eliminate misclassifications before such signatures can safely be used in clinical practice and decision-making.
Methods
Considerations of the Application Field
Even though gene selection initially appears as a standard paradigm of feature selection, the application domain entails several aspects that add certain constraints to the problem [17, 36]. In such an application we have to deal with many problems including the following:
i) Noise of the data: DNA microarrays provide a vast amount of data which might contain noise along with redundant information that needs to be processed in such a way so that the real valuable and useful information is finally distilled. This resulting gene signature could then be used by an expert to search, discover and understand the hidden biological mechanisms involved in the development of cancer.
ii) Limited number of samples: Most microarray experiments have few samples because of the cost of the method and the limited number of cases in a short study period that adhere to the study protocol. As the number of variables studied is very large relative to the number of cases, overfitting can easily occur. A related problem is the small sample random correlation effect, which essentially allows a small number of features to be randomly correlated with the data, with an increased risk of obtaining an irrelevant or random solution.
iii) Random measurement effects on samples: each subject may be measured with different (unknown) CIs on each gene expression. Thus, average classification measures and CIs may not be applicable to individual samples; they might apply to gene space distributions but not to subject-specific distributions.
iv) Bias on the design of the study: altering the initial number of genes or the number of cases, may change the design conditions of the study. Several studies begin with a smaller set of genes obtained from oversimplified criteria and proceed with the proposed gene elimination approach. However, by coupling different selection methods at the various stages of recursive elimination, one may bias algorithmic performance by means of affecting the initial conditions of recursions.
Taking into consideration the above issues and using the three experimental scenarios presented before, we address several concerns regarding algorithmic evaluation and discuss potential measures for the objective comparison of geneselection approaches.
Tested Methods
Gene Selection Approaches
In order to select a specific number of features that reflects best performance in sample classification, two general families of methods exist, namely filter and wrapper methods. Filter methods directly rank genes according to their significance using various statistical measures such as Fisher's ratio, t-statistics, the χ^{2}-statistic, information gain, Pearson's correlation and many others. The top-ranked genes that yield the highest classification accuracy are then selected as the final set of markers. Wrapper and/or hybrid (or integrated) methods employ a classifier in order to assess the importance of genes in decision-making and assign weights to genes by means of the weights of a classifier trained on the data set. Subsequently, the lowest-weighted genes are eliminated on the basis of RFE and the process continues in a recursive manner. In the RFE procedure any classifier could potentially be used as a weight vector estimator, which is the main advantage of such an open and adaptive scheme. The vast majority of methods employ linear predictors for the specific problem of marker selection, due to the sparse nature of the feature space [10]. A fundamental attribute of the "philosophy" of wrapper methods is that gene weights are re-evaluated and adjusted dynamically from iteration to iteration, while in filter approaches gene weights remain fixed. Another qualitative difference between the two philosophies is that filter methods focus on intrinsic data characteristics neglecting gene interactions, while wrapper methods focus on gene interactions neglecting intrinsic data characteristics [19]. Recent studies [26, 37] also addressed the advantages of integrating these quite different "philosophies" into a single approach, leading to the so-called hybrid (or integrated) approaches.
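The recursive elimination loop can be sketched generically, with the weight estimator passed in as a function. The names `rfe` and `mean_difference_weights` are hypothetical. The stand-in estimator (difference of class means) is used only to keep the example self-contained and is not one of the classifiers studied here; a real wrapper would retrain a classifier on the surviving genes at each step.

```python
def rfe(samples, labels, weight_fn, n_final, drop_per_step=1):
    """Recursive feature elimination with a pluggable weight estimator.

    samples: list of feature vectors (one list of gene values per sample)
    labels: class labels, +1 or -1
    weight_fn(samples, labels): returns one weight per surviving feature
    Returns the original indices of the n_final surviving features.
    """
    surviving = list(range(len(samples[0])))
    while len(surviving) > n_final:
        # Re-estimate the weights on the surviving features only.
        reduced = [[row[g] for g in surviving] for row in samples]
        weights = weight_fn(reduced, labels)
        n_drop = min(drop_per_step, len(surviving) - n_final)
        order = sorted(range(len(surviving)), key=lambda i: abs(weights[i]))
        dropped = set(order[:n_drop])  # positions with the smallest |weight|
        surviving = [g for i, g in enumerate(surviving) if i not in dropped]
    return surviving

def mean_difference_weights(samples, labels):
    """Stand-in estimator (assumption): difference of class means per feature."""
    pos = [row for row, y in zip(samples, labels) if y > 0]
    neg = [row for row, y in zip(samples, labels) if y <= 0]
    n = len(samples[0])
    return [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
            for j in range(n)]

# Four samples, three genes; gene 0 separates the classes best (made-up numbers).
X = [[5.0, 1.0, 0.2], [4.0, 1.1, 0.1], [1.0, 1.0, 0.9], [0.0, 0.9, 1.0]]
y = [1, 1, -1, -1]
kept_two = rfe(X, y, mean_difference_weights, n_final=2)
kept_one = rfe(X, y, mean_difference_weights, n_final=1)
```

Any of the linear classifiers discussed in this section could be plugged in through `weight_fn`, which reflects the open nature of the RFE scheme.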
Let v be a particular vector sample of gene values with class label y. A linear classifier can be seen as a predictor of the target result y that bases its decision on a weight vector w in the form of $\widehat{y}={w}^{\prime}v$. In essence, this weight vector identifies the boundary hyperplane between two classes of interest. Based on this formulation, we can use primary characteristics of classifiers used in the various tested methods to assign feature (gene) weights. SVM [38] searches for the boundary hyperplane that provides the best separation distance between classes. It minimizes the regularized power of the weight vector subject to the condition of correct classification of the training set. Least squares support vector machines (LSSVM) [39] modify the above problem into a least squares linear problem. The ridge regression classifier (RR) uses the classification error as a regularizing factor on the minimization of the power of the weight vector [8], whereas Fisher's linear discriminant analysis (FLDA) defines the weight vector as the direction of the projection line of the mdimensional data that best separates the classes of interest according to the Fisher criterion. The gradient descent linear neuron (LNWGD) optimizes the l_{2} norm of the classification error proceeding in an iterative way imposed by its gradient.
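To make the linear predictor concrete, the following sketch trains a single linear neuron by batch gradient descent on the squared classification error and uses the sign of w'v as the class decision. It is an illustration in the spirit of the LNWGD description above, not the authors' implementation. The learning rate, the number of epochs and the function names are assumptions.

```python
def train_linear_neuron(samples, labels, lr=0.1, epochs=200):
    """Batch gradient descent on half the mean squared error of y_hat = w'v."""
    n_features = len(samples[0])
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for v, y in zip(samples, labels):
            err = sum(wj * vj for wj, vj in zip(w, v)) - y
            for j in range(n_features):
                grad[j] += err * v[j]
        # Step against the gradient averaged over all samples.
        w = [wj - lr * gj / len(samples) for wj, gj in zip(w, grad)]
    return w

def predict(w, v):
    """Class decision: the side of the hyperplane w'v = 0 on which v lies."""
    return 1 if sum(wj * vj for wj, vj in zip(w, v)) >= 0 else -1

# Toy data: the first gene carries the class signal, the second is noise.
X = [[2.0, 0.1], [1.5, -0.2], [-2.0, 0.2], [-1.5, -0.1]]
y = [1, 1, -1, -1]
w = train_linear_neuron(X, y)
predictions = [predict(w, v) for v in X]
```

The magnitude of each component of w is what an RFE step would use to rank the corresponding gene.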
The hybrid/integrated methods considered proceed in a similar iterative manner, but appropriately enrich their learning process within the wrapper scheme with a filter criterion. The linear neuron weight (LNW) approach updates the weight of each individual gene towards the signed direction of the gradient weighted by a factor that induces the effect of the Fisher's metric [26, 37]. The learning rate (or update parameter) may be used to control the importance of the Fisher's metric on the derived gene signature. Thus, we test two learning rates along with this method. More specifically, LNW2 uses smaller learning rate than LNW1, favoring the derivation of more statistically significant genes and the creation of more compact classes in the space of surviving features. In a similar framework, the Fisher's support vectors (FSV) approach appropriately integrates a variation of the Fisher's ratio within the weight update scheme of SVMs [26]. By exploiting the kernel formulation of SVM we can derive several forms of the boundary hypersurface. These recursive methods are compared with the baseline filter method of [9], which uses a variation of Fisher's ratio.
Classification Models
The evaluation of feature selection schemes is often performed in association with a subsequent classification model [5, 34]. In our study, the classifier models used for testing are obtained from the same pool of methods used for feature selection. This pool covers a wide variety of popular models, varying from linear discriminant analysis and the related ridge regression scheme to neural networks and support vector machines, all trainable on the data distribution. In order to disassociate feature selection from subsequent classification, we require the maximum performance of classification on the independent testset. Thus, each geneselection scheme is combined with all classifiers, optimized and tested through the first scenario on the independent testset. The algorithmic parameters are optimized as to maximize performance on the independent testset scenario, under the additional condition of correct classification of the training set. The classifier that achieves the best performance under this test is selected and used with its fixed parameters in all other evaluation tests. The algorithmic parameters are fixed based on the more difficult data set of breast cancer and are similarly used on all three data sets. This scenario is close to the operation of a decision support system for the categorization of new cases, where the classifier is used with fixed, already optimized parameters. Furthermore, when comparing algorithmic performance on different data sets, it is important to preserve the same algorithmic parameters in order to evaluate the stability of algorithms under a crossexperiment evaluation scheme. Table 12 summarizes the parameter values used for each approach in the conducted experiments for either gene selection or classification.
Definition and Justification of Measures
Let ν_{i} denote the measurement vector of the i^{th} individual, or the i^{th} sample of the data set, with y_{i} denoting its associated class label. A classifier C trained on a given data set D maps an unlabelled instance ν ∈ V to a label y ∈ Y. The notation C(D, ν) indicates the label assigned to an unlabelled instance ν ∈ V by a classifier built on data set D. In k-fold CV, the data set D is randomly divided into k mutually exclusive subsets (folds) D_{1},..., D_{k} of approximately equal size. The classifier is trained and tested k times; each time t (t = 1,..., k) it is trained on D\D_{t} and tested on D_{t}. The "performance profile" is the matrix S of m rows and n columns, where m is the number of runs (folds) and n the number of subjects. We define each element S_{ij} as follows:
$S_{ij}=\delta\left(C(D\backslash D_{i},\nu_{j}),\,y_{j}\right),\quad \nu_{j}\in D_{i},\quad i=1,\dots,K$
where K corresponds to the total number of runs induced by the CV scenario and δ(.,.) indicates the Dirac delta function. An entry is active only when subject j belongs to the testing set D_{i} of run i.
For matrix S we define the cardinality C_{Ri} of row i and the cardinality C_{Pj} of column j as the number of active entries (one or zero) within each row and column, respectively. Along each row of S we define the mean accuracy per CV run:
$acc_{R_{i}}=\frac{1}{C_{R_{i}}}\sum_{j:\,\nu_{j}\in D_{i}}S_{ij}$
which assesses the model's generalization on the testset, while keeping the training set fixed. We can also derive the confidence interval $C{I}_{{R}_{i}}$ for the classification of the testing set D_{i} in this run based on a completely separate training set. Each run R_{i} is a split-sample process [16], where the prediction outcomes for the samples in the testing set are truly independent Bernoulli trials, so that the derivation of a binomial confidence interval $C{I}_{{R}_{i}}$ is fully justified. Based on multiple split-sample runs, Michiels et al. [18] proposed a strategy for the estimation of CIs on the true prediction power of a method, by means of a percentile on the empirical distribution of multiple run estimates. In this study, we follow a similar approach for estimating measures of overall accuracy, considering the distribution of accuracies either per run or per subject. We employ the sample mean and standard deviation of the per run accuracies defined above to model multiple run estimates and derive measures for the mean prediction accuracy and its CI over all runs, denoted by the pair (acc_{R}, std_{R}). Notice that the accuracy measure acc_{Ri} for the i^{th} run indicates how well the specific training set can represent other (new) cases or samples (those used in the testing set). Consistently high accuracy for all CV runs (with different training sets) indicates good overall prediction power and generalization ability of the method. Thus, the pair (acc_{R}, std_{R}) is employed as an index of the algorithmic performance and stability in its learning and generalization process. The 95% confidence interval (CI_{R}) is set to 1.96 times the standard deviation std_{R}.
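The per run measures can be illustrated with a small performance profile. The sketch below is only an example under stated assumptions: inactive entries are stored as `None`, every run tests at least one subject, and the function names are hypothetical.

```python
from statistics import mean, stdev

def per_run_accuracies(profile):
    """Mean accuracy of each row (run) over its active entries.

    profile: list of rows; an entry is 1 (correct), 0 (wrong) or
             None (subject not in the testing set of that run).
    """
    accs = []
    for row in profile:
        active = [s for s in row if s is not None]
        accs.append(sum(active) / len(active))  # len(active) is the row cardinality
    return accs

def summarize(accs):
    """Return (mean, sample standard deviation, 1.96 * standard deviation)."""
    m, s = mean(accs), stdev(accs)
    return m, s, 1.96 * s

# Three runs over six subjects, two subjects tested per run (made-up outcomes).
S = [
    [1, 1, None, None, None, None],
    [None, None, 1, 0, None, None],
    [None, None, None, None, 1, 1],
]
run_accs = per_run_accuracies(S)
acc_R, std_R, ci_R = summarize(run_accs)
```

Here the second run misclassifies one of its two test subjects, which lowers acc_R and produces a non-zero std_R.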
We now turn our attention to measures along the columns of the performance profile S that reveal different issues of algorithmic performance. In this form, the mean per-subject accuracy is given by:
$acc_{P_{j}}=\frac{1}{C_{P_{j}}}\sum_{i:\,\nu_{j}\in D_{i}}S_{ij}$
We can also derive the confidence interval CI_{ pj }for the classification over all runs of subject P_{ j }. Nevertheless, the assumption of truly independent Bernoulli trials in the computation of CI breaks down, due to some overlap in the training sets across the runs [18]. Having denoted the potential of some bias in the computation of CI_{ pj }, we could use it with the due caution [16]. We also emphasize that such accuracy measures for individuals are highly dependent on the testing strategy, e.g. the design of the CV splits, which affects the number of iterations that this individual is tested for. Individuals that are randomly selected in the testing subset many times may yield more trustworthy measures on the persubject performance of the algorithm, as well as tighter bounds on the CIs of accuracy. Individuals that are tested only a few times result in wider CIs, implying that they may influence the prediction of the overall algorithmic performance either in a positive or negative trend due to the small set random correlation effect. Such samples should be removed from the process of algorithmic evaluation. In our study we exclude subjects (samples) that have been tested less than ten times within the CV procedure. Similarly, persubject measures may also be used to identify and clean up the data from irrelevant samples or samples highly affected by measurement noise.
The accuracy measure for each subject indicates how well this tested sample fits the model developed by the training process. In essence, high accuracy achieved for all CV runs indicates that the specific sample fits well to the decision space defined by all other samples. Consistently high accuracy of the algorithm for all tested samples indicates good and stable prediction characteristics. Thus, the mean accuracy acc_{ p }over all tested samples/subjects may be used as an index of the prediction power of an algorithm, similar to the overall accuracy per runs. The standard deviation std_{ p }of the persubject accuracies over all tested samples is a measure of algorithmic stability on perturbations of the training and/or testing sets. For that reason, small standard deviation of persubject accuracies of an algorithm indicates robustness to changes of the sample distributions. As in the previous case, the 95% confidence interval (CI_{ P }) is set to 1.96 times the standard deviation std_{ P }.
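The column-wise measures, including the exclusion of rarely tested subjects, can be sketched in the same way. The representation of inactive entries as `None` and the function name are assumptions. The default of ten tests follows the exclusion rule stated above, and the toy call lowers it only because the example profile is tiny.

```python
def per_subject_accuracies(profile, min_tests=10):
    """Column-wise accuracy for subjects tested at least min_tests times.

    profile: list of rows (runs); an entry is 1, 0 or None (not tested in that run).
    Returns {subject_index: accuracy} for the retained subjects only.
    """
    n_subjects = len(profile[0])
    result = {}
    for j in range(n_subjects):
        active = [row[j] for row in profile if row[j] is not None]
        if len(active) >= min_tests:  # len(active) is the column cardinality
            result[j] = sum(active) / len(active)
    return result

# Four runs over three subjects (made-up outcomes).
S = [
    [1, None, 0],
    [1, 1, None],
    [0, None, 0],
    [1, None, None],
]
subject_accs = per_subject_accuracies(S, min_tests=2)
```

Subject 1 is tested only once and is therefore dropped, while subject 2 is misclassified in every run in which it is tested, which is the pattern that would mark it as a candidate outlier.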
In order to exploit and compare performance statistics on various testing sets, we compute separately the average performances per run involving (i) the CV testing subjects, (ii) the samples in the independent testset and (iii) the total set of tested samples.
Consistency index
We use consistency index as a measure to estimate the stability of algorithms on the geneselection process. This measure is based on gene frequencies over the CV iterations, as defined below:
where [F] defines a descending-ordered list of gene frequencies and F_{i} is the frequency rate of gene i, defined as its number of occurrences (O_{i}) over the total number of runs (R). Furthermore, S is the total number of genes in the signature of each algorithm, defined on the basis of best performance on the independent testset (Tables 1, 2, 3).
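The frequency rates F_{i} = O_{i}/R and their descending ordering can be computed as sketched below. The aggregation in `top_s_mean_frequency` is an assumption made purely for illustration, because the defining equation of the index is not reproduced in this text. Both function names are hypothetical.

```python
from collections import Counter

def gene_frequencies(signatures):
    """Frequency rate F_i = O_i / R of every gene, in descending order.

    signatures: list of gene lists, one per CV run (R runs in total).
    Returns a list of (gene, frequency) pairs; ties are ordered by gene name.
    """
    runs = len(signatures)
    counts = Counter(g for sig in signatures for g in set(sig))
    freqs = {g: c / runs for g, c in counts.items()}
    return sorted(freqs.items(), key=lambda item: (-item[1], item[0]))

def top_s_mean_frequency(signatures, s):
    """ASSUMED aggregation: mean frequency of the s most frequent genes."""
    ranked = gene_frequencies(signatures)
    return sum(f for _, f in ranked[:s]) / s

# Signatures of size three selected in four CV runs (made-up gene names).
sigs = [["a", "b", "c"], ["a", "b", "d"], ["a", "c", "e"], ["a", "b", "c"]]
ranked = gene_frequencies(sigs)
index = top_s_mean_frequency(sigs, s=3)
```

Under this assumed aggregation, identical signatures in every run give a value of 1, and the value decreases as the runs select increasingly different genes.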
Abbreviations
BC: Breast Cancer
CC: Colon Cancer
CI: Confidence Interval
CV: Cross Validation
RFEFLDA: Recursive Feature Elimination based on Fisher's Linear Discriminant Analysis
RFEFSVs7DK: Recursive Feature Elimination based on Fisher's ratio of Support Vectors using a 7 Degree polynomial Kernel
RFELNW1, RFELNW2: Recursive Feature Elimination based on Linear Neuron Weights
RFELNWGD: Recursive Feature Elimination based on Linear Neuron Weights using Gradient Descent
RFELSSVM: Recursive Feature Elimination based on Least Square Support Vector Machines
RFERR: Recursive Feature Elimination based on Ridge Regression
RFESVM: Recursive Feature Elimination based on Support Vector Machines
References
 1.
Seliger H: Introduction: array technology – an overview. Methods Mol Biol 2007, 381: 1–36.
 2.
Simon R: Diagnostic and Prognostic Prediction Using Gene Expression Profiles in HighDimensional Microarray Data. British Journal of Cancer 2003, 89: 1599–1604.
 3.
Khan J, Wei JS, Ringnér M, Saal LH, Ladanyi M, Westermann F, Bardhold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 2001, 7: 673–679.
 4.
Dalton WS, Friend SH: Cancer Biomarkers An Invitation to the Table. Science 2006, 312: 1165–1168.
 5.
Niijima S, Kuhara S: Recursive gene selection based on maximum margin criterion: a comparison with SVMRFE. BMC Bioinformatics 2006, 7: 543.
 6.
Inza I, Larranaga P, Blanco R, Cerrolaza AJ: Filter versus wrapper gene selection approaches in DNA microarray domains. Artificial Intelligence in Medicine 2004, 31: 91–103.
 7.
Pirooznia M, Yang JY, Yang MQ, Deng Y: A comparative study of different machine learning methods on microarray gene expression data. BMC Genomics 2008, 9. doi:10.1186/1471–2164–9S1S13. doi:10.1186/147121649S1S13.
 8.
Li F, Yang Y: Analysis of recursive gene selection approaches from microarray data. Bioinformatics 2005, 21: 3741–3747.
 9.
Golub TR, Slonim K, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lande ES: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 1999, 286: 531–536.
 10.
Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using Support vector machines. machine learning 2002, 36: 389–422.
 11.
Allison DB, Cui X, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews 2006, 7: 55–65.
 12.
Quackenbush J: Computational Analysis of Microarray data. Nature Reviews 2001, 2: 418–427.
 13.
Smyth GK, Yang YH, Speed T: Statistical Issues in cDNA Microarray Data Analysis. Methods in Molecular Biology 2003, 224: 111–136.
 14.
Yang YH, Speed T: Design Issues for cDNA Microarray Experiments. Nature Reviews 2002, 3: 579–588.
 15.
Alon U, Barkai N, Notterman D, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal cancer tissues proposed by oligonucleotide arrays. PNAS 1999, 96: 6745–6750.
 16.
Jiang W, Varma S, Simon R: Calculating Confidence Intervals for Prediction Error in Microarray Classification Using Resampling. Stat Appl Genet Mol Biol 2008, 7(1):Article8.
 17.
Gormley M, Dampier W, Ertel A, Karacali B, Tozeren A: Prediction Potential of Candidate Biomarker Sets Identified and Validated on Gene Expression Data from Multiple Data sets. BMC Bioinformatics 2007, 8: 415.
 18.
Michiels S, Koscielny S, Hill C: Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 2005, 365: 488–492.
 19.
Baker SG, Kramer BS: Identifying genes that contribute more to good classification in microarrays. BMC Bioinformatics 2006, 7: 407.
 20.
Ein-Dor L, Domany E: Thousands of Samples are Needed to Generate a Robust Gene List for Predicting Outcome in Cancer. PNAS 2006, 103(15):5923–5928.
 21.
Dupuy A, Simon R: Critical Review of Published Microarray Studies for Cancer Outcome and Guidelines on Statistical Analysis and Reporting. J Natl Cancer Inst 2007, 99: 147–157.
 22.
Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 2005, 21: 171–178.
 23.
van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530–536.
 24.
Ioannidis JP: Is Molecular Profiling Ready for Use in Clinical Decision Making? The Oncologist 2007, 12: 301–311.
 25.
Little A, Rubin D: Statistical Analysis with Missing Data. Wiley Series in Probability and Mathematical Statistics; 1987.
 26.
Blazadonakis M, Zervakis M: Wrapper Filtering Criteria Via a Linear Neuron and Kernel Approaches. Comput Biol Med 2008, 38(8):894–912.
 27.
Goeman J, Geer S, de Kort F, Van Houwelingen H: A global test for groups of genes: testing association with clinical outcome. Bioinformatics 2003, 20(1):93–99.
 28.
Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21: 631–643.
 29.
Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF: GEMS: A system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. International Journal of Medical Informatics 2005, 74: 491–503.
 30.
Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA 2002, 99(10):6562–6566.
 31.
Simon R, Radmacher MD, Dobbin K, McShane LM: Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003, 95: 14–18.
 32.
Tan Y, Shi L, Tong W, Wang C: Multiclass cancer classification by total principal component regression (TPCR) using microarray gene expression data. Nucleic Acids Res 2005, 33(1):56–65.
 33.
Misra J, Schmitt W, Hwang D, Hsiao L, Gullans S, Stephanopoulos G, Stephanopoulos Gr: Interactive Exploration of Microarray Gene Expression Patterns in a Reduced Dimensional Space. Genome Research 2002, 12: 1112–1120.
 34.
Smit S, Hoefsloot H, Smilde A: Statistical Data Processing in Clinical Proteomics. Journal of Chromatography B 2008, 866: 77–88.
 35.
Ioannidis J: Microarrays and molecular research: noise discovery? Lancet 2005, 365(9458):354–355.
 36.
Varma S, Simon R: Bias in Error Estimation when using Cross-Validation for Model Selection. BMC Bioinformatics 2006, 7: 91.
 37.
Blazadonakis M, Zervakis M: The Linear Neuron as Marker Selector and Clinical Predictor. Comput Methods Programs Biomed 2008, 91(1):22–35.
 38.
Vapnik VN: The Nature of Statistical Learning Theory. Springer-Verlag New York; 1999.
 39.
Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J: Least Squares Support Vector Machines. World Scientific Publishing; 2002.
Acknowledgements
The present work was supported by Biopattern, an IST EU-funded project (Proposal/Contract no. 508803), and by "Gonotypos" projects funded by the Greek Secretariat for Research and Technology as well as the Hellenic Ministry of Education.
Authors' contributions
MZ and MEB contributed equally to the work. MZ drafted most of the manuscript, designed the evaluation on a per-run and a per-subject basis, and performed the data analysis for this part. MEB ran the experiments, wrote all MATLAB code, contributed to the data analysis for the independent test-set and 10-fold cross-validation scenarios, and assisted in drafting the manuscript. DK, VD, GT and MT assisted with the study design, data analysis and drafting of the manuscript. GT assisted with the statistical analysis of results. All authors read and approved the final manuscript.
Michalis Zervakis, Michalis E Blazadonakis contributed equally to this work.
Cite this article
Zervakis, M., Blazadonakis, M.E., Tsiliki, G. et al. Outcome prediction based on microarray analysis: a critical perspective on methods. BMC Bioinformatics 10, 53 (2009). https://doi.org/10.1186/1471-2105-10-53
Keywords
 Least Squares Support Vector Machine
 Algorithmic Performance
 Filter Method
 Recursive Feature Elimination
 Subject Accuracy