Instance-based concept learning from multiclass DNA microarray data

Background Various statistical and machine learning methods have been successfully applied to the classification of DNA microarray data. Simple instance-based classifiers such as nearest neighbor (NN) approaches perform remarkably well in comparison to more complex models, and are currently experiencing a renaissance in the analysis of data sets from biology and biotechnology. While binary classification of microarray data has been extensively investigated, studies involving multiclass data are rare. The question remains open whether there exists a significant difference in performance between NN approaches and more complex multiclass methods. Comparative studies in this field commonly assess different models based on their classification accuracy only; however, this approach lacks the rigor needed to draw reliable conclusions and is inadequate for testing the null hypothesis of equal performance. Comparing novel classification models to existing approaches requires focusing on the significance of differences in performance.
Results We investigated the performance of instance-based classifiers, including an NN classifier able to assign a degree of class membership to each sample. This model alleviates a major problem of conventional instance-based learners, namely the lack of confidence values for predictions. The model translates the distances to the nearest neighbors into 'confidence scores'; the higher the confidence score, the closer the considered instance is to a pre-defined class. We applied the models to three real gene expression data sets and compared them with state-of-the-art methods for classifying microarray data of multiple classes, assessing performance using a statistical significance test that took into account the data resampling strategy. Simple NN classifiers performed as well as, or significantly better than, their more intricate competitors.
Conclusion Given its highly intuitive underlying principles – simplicity, ease-of-use, and robustness – the k-NN classifier complemented by a suitable distance-weighting regime constitutes an excellent alternative to more complex models for multiclass microarray data sets. Instance-based classifiers using weighted distances are not limited to microarray data sets, but are likely to perform competitively in classifications of high-dimensional biological data sets such as those generated by high-throughput mass spectrometry.


Motivation
Because classification is crucial to diagnostic and prognostic applications, a plethora of methods has been brought to bear on microarray data classification in the field of cancer research [1][2][3]. Microarray data analysis is beset by the 'curse of dimensionality' (a.k.a. small-n-large-p problem) [4]. This problem relates to the high dimensionality, p, i.e., the number of gene expression values measured for a single sample, and the relatively small number of biological samples, n.
There is a growing number of publications on comparative studies trying to elucidate the performance of various classifiers for microarray data sets. However, the conclusions that can be drawn from these studies are often limited because of one or more of the following reasons.
(2) The study does not involve a complete re-calibration of all model parameters in each learning phase [6].
(3) The study does not incorporate an external cross-validation to avoid gene selection bias [7].
(4) The study makes inappropriate use of clustering techniques for classification tasks [8].
(5) The study assesses the differences in performance based on 'orphaned' accuracy measures (e.g., observed cross-validation error rates).
Many comparative studies include data sets involving binary problems only. One of the first studies in this field compared a nearest neighbor model, support vector machines, and boosted decision stumps on three binary microarray data sets related to cancer [9]. The recent study by Krishnapuram et al. benchmarked their model against a variety of statistical and machine learning methods using two cancer microarray data sets involving a binary classification task [10]. Tasks involving multiple classes, however, are considered substantially more challenging. Li et al. [11] and Yeang et al. [12] highlighted the importance of multiclass methodologies in this context. It is common practice to assess microarray classifiers using data resampling strategies such as bootstrapping and cross-validation strategies. Dudoit et al. have highlighted the importance of model re-calibration in each cross-validation fold [6]; however, comparative studies do not always include a complete parameter recalibration [8].
It is crucial that feature selection or weighting is performed only on the learning set and not on the test set. Otherwise, the estimation of the model's generalization ability will be overly optimistic [7]. Whereas this caveat may not have received due attention in early microarray studies, most recent comparative studies include an external cross-validation phase intended to avoid the selection bias.
One of the most common pitfalls in the analysis of microarray data is the use of clustering methods for classification tasks [8]. Clustering methods are unsupervised methods that do not take into account the class labels. The number of class-discriminating genes is usually small compared with the number of non-discriminating genes. The pair-wise distances that clustering methods compute do not necessarily reflect the influence of the discriminating genes. Hence, the resulting clusters may not be related to the phenotypes at hand. Different clustering methods can reveal different insights in the data by providing different clusters, all of which may be of interest; there is generally no 'right' or 'wrong' clustering result.
Finally, a critical problem in the aforementioned comparative studies is that the models are commonly assessed based on monolithic accuracy measures, frequently devoid of suitable confidence intervals for the true error rates (or alternatively, the true prediction accuracy). Comparing classification error rates or confidence intervals is limited in terms of the conclusions that can be drawn when comparing differences in performance. It is crucial that a comparative study assesses these differences based on suitable significance tests that also take into account the adopted resampling strategy. In an ideal world with unlimited training and test data, the comparison of classifiers would be straightforward. However, in practical settings, the number of available cases is limited, and particularly small in the context of microarray data. Therefore, the classifiers are usually compared based on their performance on resampled training and test sets. The sampling procedure introduces a random variation in the sampled data sets, which must be controlled by the statistical test [13]. For example, the classification performance of the same method can be different, depending on whether leave-one-out cross-validation, ten-fold cross-validation, or bootstrapping is adopted for data set sampling. The statistical test should conclude that two models perform significantly differently if and only if their error rate would be different, on average, when trained on a training set of a given fixed size and tested on all cases of the population of interest [13]. This is essentially the aim of comparative studies: Do the observed differences in performance provide sufficient evidence to conclude that the models perform significantly differently, or can we not exclude the possibility (with reasonable confidence) that this difference may be due to chance alone or to the random variation introduced by the sampling strategy?
This question should guide the formulation of the null hypothesis. In general, this implies that for a randomly drawn learning set of fixed size and according to a fixed probability distribution, two models will have the same error rate on a test set that is also randomly drawn from the population under investigation, and all random draws are made according to the same probability distribution [13]. Note that a 95%-confidence interval for an estimate (e.g., the true prediction accuracy) is completely different from a 95%-confidence interval for the difference of two estimates (e.g., the difference between the prediction accuracy of model A and B). Therefore, it should be noted explicitly that it is logically inadequate to use the confidence intervals derived for the individual models for assessing whether there is a significant difference in performance of the classifiers. This fact is well-established in the statistical literature, but may not have received sufficient attention in many comparative studies.
Somorjai et al. [4] identified the following key features of classifiers for microarray data: Robustness (i.e., high generalization ability and insensitivity with respect to outliers) and the simplicity of a model. A model (i) should be easy to implement and use, and (ii) its outputs should be easy to interpret. In particular in biomedical applications, we claim that such classifiers should also be able to provide a suitable measure of confidence for the predictions they make. One way of representing such a confidence measure could be a degree of class membership with respect to the predicted class. In such a framework, a sample may belong to any class with a certain degree. This is often represented by the unit interval: A value of 0 indicates complete non-membership and a value of 1 indicates complete compliance with the predefined class in question. Any value within the interval indicates a partial class membership. Providing such a value of 'confidence' for classifications can serve two purposes: (i) optimizing the model's calibration in the learning phase, and (ii) rejecting low-confidence classifications in the test phase.

Overview of nearest neighbor classifiers
Comparative studies involving various classifiers and microarray data sets have revealed that instance-based learning (a basic form of memory-based or case-based reasoning) approaches such as nearest neighbor methods perform remarkably well compared with more intricate models [14,15]. A k-nearest neighbor (k-NN) classifier is based on an instance-based learning concept, which is also referred to as lazy learning. In contrast to eager methods, which apply rule-like abstractions obtained from the learning instances, lazy methods access learning instances at application time, i.e., the time when a new case is to be classified. A nearest neighbor classifier determines the classification of a new sample on the basis of a set of k similar samples found in a database containing samples with known classification. Challenges of the k-NN approach include (a) the relative weighting of features, (b) the choice of a suitable similarity method, (c) the estimation of the optimal number of nearest neighbors, and (d) a scheme for combining the information represented by the k nearest neighbors.
In its simplest implementation, k-NN computes a measure of similarity between the test case and all pre-classified learning cases. The test case is then classified as a member of the same class as the most similar case [11]. In this simple scenario, only the single most similar case is selected for calling the class, i.e., the parameter k is set to 1. A more elaborate variant of k-NN involves cross-validation procedures that determine an optimal number, k_opt, of nearest neighbors; usually, k_opt > 1. The test case is classified based on a majority vote among the k_opt nearest neighbors [16]. For example, in leave-one-out cross-validation, each hold-out case is classified based on k ∈ {1, 2, ..., k_max} neighbors. The integer k that minimizes the cumulative error is k_opt. For more details and extensions to the k-NN classifier, see for instance [5,16,17,18] and references therein.
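The selection of k_opt by leave-one-out cross-validation can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the study: the function names, the Euclidean distance, and the tie-breaking rules (nearest neighbor wins a tied vote, smallest k wins a tied error count) are assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def majority_vote_knn(train_x, train_y, query, k, dist=euclidean):
    """Classify `query` by majority vote among its k nearest learning cases."""
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    nearest = order[:k]
    votes = Counter(train_y[i] for i in nearest)
    top = max(votes.values())
    # Assumed tie-break: among the classes with the most votes,
    # return the class of the closest neighbor.
    for i in nearest:
        if votes[train_y[i]] == top:
            return train_y[i]


def select_k_opt(xs, ys, k_max):
    """Return (k_opt, errors): the k in 1..k_max with the fewest LOOCV errors."""
    errors = {}
    for k in range(1, k_max + 1):
        wrong = 0
        for held in range(len(xs)):
            # Hold out one case and classify it using all remaining cases.
            rest_x = xs[:held] + xs[held + 1:]
            rest_y = ys[:held] + ys[held + 1:]
            if majority_vote_knn(rest_x, rest_y, xs[held], k) != ys[held]:
                wrong += 1
        errors[k] = wrong
    # Assumed tie-break: prefer the smallest k among equally good values.
    k_opt = min(errors, key=lambda k: (errors[k], k))
    return k_opt, errors
```

Here `xs` is a list of feature vectors (lists of floats) and `ys` the list of class labels; `errors` holds the cumulative LOOCV error for each candidate k.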

Paper outline
Motivated by the recent success stories of nearest neighbor methods [14,15,19,20], we investigated a model of a k-nearest neighbor classifier based on weighted voting of normed distances [5,16]. This classifier outputs a degree of class membership for each case x, 0 ≤ p̂(C | x) ≤ 1.
Wang et al. used fuzzy c-means clustering for deriving fuzzy membership values, which they used as a confidence measure for microarray data classification [21]. Recently, Asyali and Alci applied fuzzy c-means clustering for classifying microarray data of two classes [22]. In contrast to the models of Wang et al. [21] and Asyali and Alci [22], the k-NN model in the present study does not rely on unsupervised clustering approaches for deriving fuzzy class membership values.
This paper focuses on a simple and intuitive model, the k-nearest neighbor based on distance weighting, for the classification of multiclass microarray data and aims at addressing the aforementioned key limitations of previous comparative studies in this field. We apply the distance-weighted k-NN to three well-studied, publicly available microarray data sets, one based on cDNA chips and two on Affymetrix oligonucleotide arrays, and compare the classification performance with support vector machines (SVMs), decision tree C5.0 (DT), artificial neural networks (multilayer perceptrons, MLPs), and 'classic' nearest neighbor classifiers (1-NN, 3-NN, and 5-NN) that are based on majority voting. The 5-NN is not applied to the NCI60 data set because of the small number of cases per class. Using a ten-fold repeated random subsampling strategy, we assess the models' classification performance based on a 0-1 loss function, i.e., a loss of 0 for each correct classification and a loss of 1 for each misclassification. To allow for a 'crisp' classification using k-NN, a case x is classified as a member of the class C for which p̂(C | x) is maximal. We do not consider the rejection of low-confidence classifications. The statistical significance of the differences in performance is assessed using a parametric test, the variance-corrected resampled paired t-test [23].

Classification results
Let f denote the observed fraction of correctly classified test cases and let p denote the true prediction accuracy of the model. Let the total number of test cases be M. For deriving a (1 − α)100%-confidence interval for the true prediction accuracy p, we obtain Equation (1) by the de Moivre-Laplace limit theorem (assuming that the binomial distribution of the correctly classified cases can be approximated by the normal distribution), with Φ(•) being the standard normal cumulative distribution function and z = Φ⁻¹(1 − α/2), e.g., z = 1.96 for 95% confidence. Solving Equation (1) for p gives Equation (2). Table 1 shows the 95%-confidence intervals for the true prediction accuracy of the models, averaged over the ten test sets.
The 95%-confidence interval for the difference in prediction errors between two models A and B is given by Equation (3), where k = 10 is the number of folds, ε_Ai is the observed error of model A in the i-th fold, t_{9, 0.025} = 2.26 for 95% confidence, and SE is the standard error as shown in the denominator in Equation (4). Table 2 shows the 95%-CI for the differences in prediction errors.
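As a hedged illustration of the interval for the true prediction accuracy: under the normal approximation, Pr[−z ≤ (f − p)/√(p(1 − p)/M) ≤ z] = 1 − α, and solving this for p yields the bounds (f + z²/2M ± z·√(f/M − f²/M + z²/4M²)) / (1 + z²/M). The Python sketch below evaluates these bounds. The function name is hypothetical, and the formula is the standard solution under the stated approximation rather than a transcription of the original equations.

```python
import math
from statistics import NormalDist


def accuracy_confidence_interval(f, m, alpha=0.05):
    """Interval for the true accuracy p, given observed accuracy f on m test cases.

    Solves |f - p| / sqrt(p * (1 - p) / m) <= z for p, with
    z = Phi^-1(1 - alpha / 2) (about 1.96 for alpha = 0.05).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = f + z * z / (2 * m)
    half_width = z * math.sqrt(f * (1 - f) / m + z * z / (4 * m * m))
    denominator = 1 + z * z / m
    return (centre - half_width) / denominator, (centre + half_width) / denominator
```

For small test sets the interval is noticeably asymmetric around f unless f = 0.5, which is one reason to prefer it over the simpler f ± z·√(f(1 − f)/M) when M is small.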
The apparent 'best' performers in the present study are the support vector machines with a classification accuracy of 78.60 ± 6.44% on the NCI60 data set and an accuracy of 75.83 ± 3.81% on the GCM data set. However, as we will show later, this result does not necessarily imply that the differences in performance between nearest neighbor models and the support vector machines are statistically significant.
On the ALL data set, the k-NN achieved the highest classification accuracy of 77.85 ± 2.43%. The results of the present study do not match up with the results that Yeoh et al. reported [3], i.e., a best average test set accuracy of 98.67%. How can this discrepancy be explained? First, the present study assessed the models' performance in a 10-fold random subsampling procedure that entailed ten splits of learning and test sets. The study of Yeoh et al., on the other hand, comprised one split only (i.e., single hold-out approach) [3], so that the achieved classification accuracies may not reflect the true performance of their models. Second, the classification task in the present study includes all ten classes, whereas Yeoh et al. focused on the classification results for the six molecularly distinct classes [3].

Analysis of differences in performance
The statistic of the variance-corrected resampled paired t-test is then given as shown in Equation (4). This statistic approximately follows Student's t distribution with k − 1 degrees of freedom. The only difference to the standard t statistic is that the factor 1/k in the denominator has been replaced by 1/k + M/N. In cross-validation and repeated random subsampling, the learning sets L_i necessarily overlap; in repeated random subsampling, the test sets may overlap as well. Hence, the individual differences p_i are not independent of each other. Due to these violations of the basic independence assumptions, the standard paired t-test cannot be applied here. Empirical results show that the corrected statistic improves on the standard resampled t-test; the Type I error is drastically reduced [23,24]. For k = 10 folds, the null hypothesis of equal performance between two classifiers can be rejected at α = 0.05 if the absolute value of the statistic exceeds t_{9, 0.025} = 2.26.
We applied the following six classifiers to the NCI60 data set: k-NN, 1-NN, 3-NN, SVMs, DT, and MLP. The 5-NN is applied to the ALL and GCM data sets but not to the NCI60 data set because of the small number of cases per class.
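The corrected statistic and the associated confidence interval for the difference in error can be sketched in Python as follows. The sketch assumes that M is the number of test cases and N the number of learning cases in each split, and that the variance is the sample variance of the k fold-wise differences in error; the function names are hypothetical. The critical value 2.26 is the t_{9, 0.025} quoted in the text, so the default applies to k = 10 only.

```python
import math


def corrected_resampled_t(errors_a, errors_b, n_learn, n_test):
    """Variance-corrected resampled paired t statistic for two classifiers.

    errors_a, errors_b: observed error rates of models A and B on the same
    k resampled test sets. Returns (t, mean_difference, standard_error).
    """
    k = len(errors_a)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean_diff = sum(diffs) / k
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (k - 1)
    # Standard paired t-test: 1/k. Corrected version: 1/k + M/N.
    se = math.sqrt((1 / k + n_test / n_learn) * variance)
    if se == 0:
        raise ValueError("all fold-wise differences are identical")
    return mean_diff / se, mean_diff, se


def difference_confidence_interval(mean_diff, se, t_critical=2.26):
    """Interval mean_diff +/- t_critical * SE for the difference in error."""
    return mean_diff - t_critical * se, mean_diff + t_critical * se
```

With k = 10, the null hypothesis of equal performance would be rejected at the 5% level when the absolute value of the returned statistic exceeds 2.26, which is equivalent to the interval excluding zero.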
Based on the variance-corrected resampled paired t-test, we cannot reject the null hypothesis of equal performance between k-NN and the SVMs on the NCI60 data set (P = 0.38). Hence, the support vector machines did not perform significantly better than k-NN on this data set. The smallest p-value is P = 0.06 for the comparison between SVMs and 3-NN, which does not allow for the rejection of the null hypothesis of equal performance.
On the ALL data set, we observe no statistically significant difference in performance between k-NN and the support vector machines (P = 0.92), but a significant difference between k-NN and the decision tree (P = 0.007). The support vector machines performed significantly better than the decision tree (P = 1.67 × 10⁻⁶), but not significantly better than the multilayer perceptron (P = 0.11). The support vector machines did not perform significantly better than 1-NN (P = 0.63), 3-NN (P = 0.95), or 5-NN (P = 0.95). It might seem surprising that the p-value is smaller for the comparison support vector machines vs. decision tree (P = 1.67 × 10⁻⁶) than for k-NN vs. decision tree (P = 0.007), despite the fact that the confidence intervals for the true prediction accuracy of the support vector machines and decision tree are 'closer to each other'. However, we note that a 95%-confidence interval for an estimate (here, the true prediction accuracy of a model) is completely different from a 95%-confidence interval for the difference of two estimates (here, the difference between the accuracies of two models).
On the GCM data set, the difference in performance between k-NN and the decision tree is significant (P = 0.003) as well as between k-NN and the multilayer perceptron (P = 0.001). There is no significant difference between k-NN and the support vector machines (P = 0.70).
In summary, on all three data sets, there was no statistically significant difference in performance between the decision tree and the multilayer perceptron. On all data sets, there was no statistically significant difference between k-NN and the support vector machines. The k-NN outperformed the decision tree on both the ALL and the GCM data set, and the k-NN outperformed the MLP on the GCM data set.
When a comparative study comprises n classifiers, a total of κ = ½ n(n − 1) pairwise comparisons are possible. The α of each individual test is the comparison-wise error rate, while the family-wise error rate (a.k.a. overall Type I error rate), α_κ, is made up of the κ individual comparisons. To control the family-wise error rate, different approaches are possible, for example Bonferroni's correction for multiple testing, which sets α/κ as comparison-wise error rate. The corrected comparison-wise error rates are then α = 0.05/15 = 0.0033 for the NCI60 data set (six classifiers) and α = 0.05/21 = 0.0024 for the ALL and GCM data sets (seven classifiers). Taking this correction into account, the p-value for the difference in performance between k-NN and DT on the ALL data set, P = 0.007, is to be compared with α = 0.0024, and hence the null hypothesis of equal performance can no longer be rejected. However, Bonferroni's method is known to be conservative. We are currently investigating various approaches for addressing this problem in the context of multiclass microarray data.
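The arithmetic behind the Bonferroni-corrected thresholds can be reproduced with a few lines of Python. The function name is hypothetical; the numbers are those given in the text.

```python
def bonferroni_alpha(n_classifiers, family_alpha=0.05):
    """Return (kappa, corrected alpha) for all pairwise comparisons."""
    kappa = n_classifiers * (n_classifiers - 1) // 2  # number of classifier pairs
    return kappa, family_alpha / kappa


# NCI60: six classifiers; ALL and GCM: seven classifiers.
kappa_nci60, alpha_nci60 = bonferroni_alpha(6)
kappa_all, alpha_all = bonferroni_alpha(7)

# k-NN vs. decision tree on ALL: P = 0.007 is above the corrected threshold.
knn_vs_dt_significant = 0.007 < alpha_all
```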

Discussion
The design of this investigation takes into account the caveats of comparative studies by including a complete model re-calibration in each learning phase, an external cross-validation strategy, and by assessing the models' performance based on significance tests rather than relying on accuracy measures. The presented k-NN classifier alleviates a major problem of the 'classic' nearest neighbor models, i.e., the lack of confidence values for the predictions. We derived a degree of class membership without the need for clustering methods. The model is simple, intuitive, and both its implementation and application are straightforward. Despite its simple underlying principles, k-NN performed as well as or even better than established more intricate machine learning methods. In the present study, the classification results with confidence values had to be converted into crisp classifications based on the maximal p̂, because we assessed and compared the models using a 0-1 loss function. The degrees of class memberships have been used as guidance for model calibration in the learning phase, but these degrees could also be used for the rejection of low-confidence classifications in the test phase. This potential of the k-NN has not been exploited in the present study. Different quantitative criteria are possible for comparing classifiers, for example, the quadratic loss function or the informational loss function that both take into account the classifiers' confidence in the predictions, or the costs that are involved for false positive and false negative predictions. This is of particular interest for applications in the biomedical context. In an ongoing study, we compare and assess various models that are able to generate confidence values for the classification. Here, we are interested in the critical assessment of classifiers that take into account the confidences, which can also entail the rejection of classification decisions. Also, the problem of adjusting the error rate for multiple testing needs further work.
Figure 4. Sampling of learning and test set and selection of marker genes. Depicted is one fold in the ten-fold resampling procedure. From the original data set comprising n cases and p genes, ~70% of the cases are randomly selected for the learning set L_i and ~30% of the cases for the test set T_i. On the learning set L_i with unpermuted class labels, the signal-to-noise weight for each gene and each class is computed as illustrated for class B. The class labels are then randomly permuted 1,000 times and the signal-to-noise weights (for each gene and each class) are recomputed for each permutation to assess the significance of the weights for the unpermuted learning set. Both the learning and the test set are filtered to contain only those genes that are significantly differently expressed in the learning set.

Conclusion
Instance-based learning approaches are currently experiencing a renaissance for classification tasks involving high-dimensional data sets from biology and biotechnology. The k-NN performed remarkably well compared to its more intricate competitors. A significant difference in performance between k-NN and support vector machines could not be observed. Viewed from an Occam's razor perspective, we doubt that more intricate classifiers should necessarily be preferred over simple nearest neighbor approaches. This is particularly relevant in practical biomedical scenarios where life scientists have a need to understand the concepts of the methods used in order to fully accept them.

Data
The NCI60 data set comprises gene expression profiles of 60 human cancer cell lines of various origins (derived from both solid and non-solid tumors) [1]. Scherf et al. [29] used Incyte cDNA microarrays that included 3,700 named genes, 1,900 human genes homologous to those of other organisms, and 4,104 ESTs of unknown function but defined chromosome map location. The data set includes nine different cancer classes: Central nervous system (6 cases), breast (8 cases), renal (8 cases), non-small cell lung cancer (9 cases), melanoma (8 cases), prostate (2 cases), ovarian (6 cases), colorectal (7 cases), and leukemia (6 cases). The background-corrected intensity values of the remaining genes are log2-transformed prior to analysis.
The ALL data set comprises the expression profiles of 327 pediatric acute lymphoblastic leukemia samples [3]. The diagnosis of ALL was based on the morphological evaluation of bone marrow and on an antibody test. Based on immunophenotyping and cytogenetic approaches, six genetically distinct leukemia subtypes have been identified: B lineage leukemias BCR-ABL (15 cases), E2A-PBX (27 cases), TEL-AML (79 cases), rearrangements in the MLL gene on chromosome 11q23 (20 cases); hyperdiploid karyotype (> 50 chromosomes, 64 cases); and T lineage leukemias (43 cases). In total, 79 cases could not be assigned to any of the aforementioned groups; these samples were assigned to the group Others. This group comprises four subgroups: Hyperdiploid 47-50 (23 cases), Hypodiploid (9 cases), Pseudodiploid (29 cases), and Normaldiploid (18 cases). The present study follows the data pre-processing as described in [3], supplementary online material.

Study design

Dimension reduction and feature selection
We decided to focus on two widely used methods to address the high-dimensionality problem: Principal component analysis (PCA) based on singular value decomposition [26] and the signal-to-noise (S2N) metric [27]. PCA reduces dimensionality and redundancy by mapping the existing genes onto a smaller set of 'combined' genes or 'eigengenes' [28]. The S2N metric (a.k.a. Slonim's P-metric) is a simple, yet powerful approach for assigning weights to genes, thus permitting analysis to focus on a subset of important genes [2,15,27]. For the i-th gene and the j-th class, the signal-to-noise weight w_ij is determined as shown in Equation (5) (where n represents the number of cases in a class, m is the mean, and s² is the variance).
Figure 5. The distance-weighted k-NN classifier for a binary classification task. The arrows indicate the three nearest neighbors of the test case. Here it is assumed that k_opt = 3.

Data sampling strategies
The NCI60 data set [1] is pre-processed using PCA, and the first 23 'eigengenes' (explaining > 75% of the total variance) are selected. The dimensions of the data set are thus n = 60 cases, p = 23 features. The data set comprises nine classes. The data set is analyzed in ten-fold repeated random subsampling (a.k.a. repeated hold-out method). The ten data set pairs (L_i, T_i), i = 1, ..., 10, are generated by randomly sampling 45 (75%) cases for L_i and 15 (25%) cases for T_i.
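A minimal Python sketch of the repeated random subsampling step is given below. It assumes plain (unstratified) sampling without replacement within each split, which the text does not specify; the function name and the seed are hypothetical.

```python
import random


def repeated_random_subsampling(n_cases, n_learn, n_repeats=10, seed=0):
    """Generate index pairs (L_i, T_i) for the repeated hold-out method.

    Each repeat draws n_learn case indices for the learning set; the
    remaining n_cases - n_learn indices form the test set. Learning and
    test sets of different repeats may overlap.
    """
    rng = random.Random(seed)
    indices = list(range(n_cases))
    pairs = []
    for _ in range(n_repeats):
        shuffled = rng.sample(indices, n_cases)  # a random permutation
        learning = sorted(shuffled[:n_learn])
        test = sorted(shuffled[n_learn:])
        pairs.append((learning, test))
    return pairs


# NCI60 setting from the text: 60 cases, 45 for learning, 15 for testing.
nci60_pairs = repeated_random_subsampling(n_cases=60, n_learn=45)
```

With only two prostate cases among the nine classes, an unstratified split can leave a class unrepresented in a learning set; whether the original sampling guarded against this is not stated.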
For both the acute lymphoblastic leukemia (ALL) data set [3] (n = 327 cases, p = 12,600 genes, ten classes) and the Global Cancer Map (GCM) data set [2] (n = 198 cases, p = 16,063 genes, 14 classes), we apply the S2N metric for feature selection. For the weight of each gene, a p-value is derived, corresponding to the probability that this weight is obtained by chance alone. The Monte Carlo method to compute this p-value involves 1,000 random permutations of the class labels and a recomputation of the weight for each gene [29]. Feature weighting is performed only on the learning set and not on the test set.
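The permutation procedure can be sketched in Python for a single gene and a single class. Two points are assumptions rather than the study's definitions: the weight uses the widely used one-versus-rest form (mean inside the class minus mean outside, divided by the sum of the two standard deviations), since the exact form of Equation (5) is not reproduced here, and the p-value is two-sided with the common (hits + 1)/(permutations + 1) estimator. All function names are hypothetical.

```python
import random
from statistics import mean, stdev


def s2n_weight(values, labels, target):
    """ASSUMED signal-to-noise form: (m_in - m_out) / (s_in + s_out).

    values: expression values of one gene across the learning cases.
    labels: class label of each case; target: the class of interest.
    Requires at least two cases inside and outside the target class.
    """
    inside = [v for v, c in zip(values, labels) if c == target]
    outside = [v for v, c in zip(values, labels) if c != target]
    spread = stdev(inside) + stdev(outside)
    return (mean(inside) - mean(outside)) / spread if spread > 0 else 0.0


def permutation_p_value(values, labels, target, n_perm=1000, seed=0):
    """Monte Carlo p-value of the weight under random label permutations."""
    rng = random.Random(seed)
    observed = s2n_weight(values, labels, target)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute class labels, keep expression values
        permuted = s2n_weight(values, shuffled, target)
        if abs(permuted) >= abs(observed) - 1e-12:
            hits += 1
    # ASSUMED estimator: add one to numerator and denominator so that
    # the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

In keeping with the text, both functions would be called on the learning set L_i only, once per gene and class, and never on the test set T_i.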
In contrast to the original study by Yeoh et al. [3], the present study investigates whether the less distinct classes (Hyperdiploid, Hypodiploid, Pseudodiploid, and Normaldiploid) in the group Others show an expression signature that could be used for classification. This implies that instead of merging these subgroups into one single group, these four subgroups are treated as distinct groups. From the pre-processed, normalized data set, we randomly select 215 cases (65.75%) for the learning and 112 cases (34.25%) for the test set. Then, based on the learning set only, we determine the signal-to-noise weight for each gene with respect to each class. We randomly permute the class labels and perform a random permutation test to assess the importance of the signal-to-noise weights [29]. We rank the genes according to their weight and the associated p-value; the smaller the p-value and the larger the weight, the more important is the gene. We repeat this procedure ten times to generate ten pairs, each consisting of a learning set L i and a test set T i . The models are then built on the learning set L i and tested on the corresponding test set, T i .
The sampled learning and test sets from the GCM data set are generated as described for the ALL data set. The GCM learning sets include 150 (75.8%) randomly selected cases and the test sets include 48 (24.2%) cases. For each learning set, potential marker genes are identified using the signal-to-noise metric in combination with a random permutation test. Figure 4 illustrates the feature selection process that applies to both the ALL and the GCM data set; depicted is only one fold in the ten-fold sampling procedure.
In addition to the statistical evaluation, we carried out an epistemological validation to verify whether the identified marker genes are known or hypothesized to be associated with the phenotype under investigation. For example, the majority of the top-ranking genes in the GCM data set could be confirmed to be either known or hypothesized marker genes. In L 1 , for instance, the top gene (S2N of 2.84, P < 0.01) for the class colon cancer is Galectin-4, which is known to be involved in colorectal carcinogenesis [30].
In contrast, the biological interpretation of the 'eigengenes' resulting from PCA is not trivial. We decided not to apply S2N to the NCI60 data set due to the small number of cases (60) and the relatively large number of classes (9). Since feature selection must be performed in each cross-validation fold, it would be necessary to compute the S2N weight for each gene and each class based on each L_i comprising only 45 cases, and the computed values for the mean and standard deviation can be highly affected by those cases that are left out for the test set.
All models are trained in leave-one-out cross-validation (LOOCV) on the learning set L i to determine those parameters that lead to the smallest cumulative error. The models then use these parameters to classify the test cases in T i . Each learning phase encompasses a complete re-calibration of the models' parameters.

Classifiers
Distance-weighted k-nearest neighbor classifier
The similarity between two cases, x_i and x_j, is commonly defined in terms of a distance between them. The k-NN in this study operates as follows. Let n_k denote the k-th nearest neighbor of a test case x_j and let the optimal number of nearest neighbors be k_opt. Further, let the similarity, sim, between cases x_i and x_j be given by sim(x_i, x_j) = 1 − d(x_i, x_j), where d represents a distance. In the present study, we investigate various distance metrics, including Euclidean, Canberra, Manhattan, and the fractional distance [31].
The normed similarity between x_j and its nearest neighbor n_k, sim_normed(x_j, n_k), is obtained by norming the similarities of the k_opt nearest neighbors. The degree of class membership, p̂(C | x_j), is then computed from the normed similarities using the Kronecker symbol δ_k, where δ_k = 1 if n_k ∈ C and δ_k = 0 otherwise. If a crisp classification is required, then a case x_j may be classified as a member of the class C for which p̂(C | x_j) is maximal. Figure 5 illustrates the k-NN on a simplified example involving only two classes. In this example, the triangle marks the test case.
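One reading that is consistent with this description, offered as an assumption rather than as the study's exact definition, divides each neighbor's similarity by the sum of the similarities of the k_opt nearest neighbors and then sums the normed similarities of the neighbors belonging to class C. The Python sketch below implements that reading. It further assumes that the features are scaled so that distances lie in [0, 1], so that 1 − d is a non-negative similarity; all names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def class_memberships(train_x, train_y, query, k_opt, dist=euclidean):
    """Degrees of class membership for `query` from its k_opt nearest neighbors.

    ASSUMPTION: sim_normed = sim / (sum of sim over the k_opt neighbors),
    and the membership in class C is the sum of sim_normed over the
    neighbors that belong to C. Classes without a neighbor are omitted
    from the result (their membership is 0).
    """
    cases = sorted(zip(train_x, train_y), key=lambda c: dist(c[0], query))
    nearest = cases[:k_opt]
    sims = [1.0 - dist(x, query) for x, _ in nearest]  # sim = 1 - d
    total = sum(sims)
    if total <= 0:
        raise ValueError("distances must be scaled so that similarities are positive")
    memberships = {}
    for sim, (_, label) in zip(sims, nearest):
        memberships[label] = memberships.get(label, 0.0) + sim / total
    return memberships


def crisp_class(memberships):
    """Class with the maximal degree of membership."""
    return max(memberships, key=memberships.get)
```

Under this reading the memberships over all classes sum to 1, and a closer neighbor contributes more to its class than a more distant one, which is what distinguishes the scheme from plain majority voting.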

Support vector machines
The support vector machines [32] in the present study implement three different kernel functions: the linear kernel $K(x_i, x_j) = x_i \cdot x_j$, the radial kernel $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$, and the polynomial kernel. For the present study we used the implementation from [33].
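For illustration, the linear and radial kernels stated above can be written as plain functions. This is only a sketch; the function names are hypothetical and the code is unrelated to the implementation in [33].

```python
import math

def linear_kernel(a, b):
    """Linear kernel: the dot product of the two profiles."""
    return sum(p * q for p, q in zip(a, b))

def radial_kernel(a, b, sigma):
    """Radial kernel: exp(-||a - b||^2 / (2 * sigma^2))."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))
```

A larger bandwidth sigma makes the radial kernel decay more slowly with distance, which is why the bandwidth is among the parameters tuned on the learning set.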
SVMs are inherently binary classifiers, and it is not obvious how they can solve problems that comprise more than two classes. There exist two commonly adopted approaches for breaking down multiclass problems into a sequence of binary problems: (i) the one-versus-all (OVA) approach, and (ii) the all-pairs (AP) approach. For the present study, we combined the SVMs in the AP approach, which constructs k(k - 1)/2 classifiers (where k here denotes the number of classes), with each classifier trained to discriminate between a class pair i and j. The outputs of the binary classifiers are then combined in a decision directed acyclic graph (DDAG), which is a graph whose edges have an orientation and no cycles [34]. Mukherjee pointed out that the decision boundaries resulting from the all-pairs approach are, in general, more natural and intuitive, and should be more accurate in theory [35]. The SVMs are trained in LOOCV on the learning set to determine the optimal parameters, i.e., the optimal kernel function, the optimal kernel parameters (bandwidth for the Gaussian kernel and the degree of the polynomial kernel), and the optimal error penalty.
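The way a DDAG combines pairwise decisions can be sketched as follows. The example is a simplified illustration under stated assumptions: the pairwise classifiers are replaced by a toy nearest-centroid rule on one-dimensional data, and the names `make_centroid_decider` and `ddag_predict` are hypothetical. In the study, each pairwise decision would come from a trained SVM.

```python
from itertools import combinations

def make_centroid_decider(centroids):
    """Toy pairwise rule (hypothetical): pick the class with the nearer 1-D centroid."""
    def decide(class_a, class_b, x):
        if abs(x - centroids[class_a]) <= abs(x - centroids[class_b]):
            return class_a
        return class_b
    return decide

def ddag_predict(classes, decide, x):
    """Evaluate a DDAG: test the first against the last remaining class,
    eliminate the loser, and repeat until a single class remains."""
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        if decide(first, last, x) == first:
            remaining.pop()     # `last` is eliminated
        else:
            remaining.pop(0)    # `first` is eliminated
    return remaining[0]

centroids = {"A": 0.0, "B": 1.0, "C": 2.0}
classes = sorted(centroids)
n_pairwise = len(list(combinations(classes, 2)))  # k(k - 1)/2 pairwise problems
prediction = ddag_predict(classes, make_centroid_decider(centroids), 1.9)
```

Although k(k - 1)/2 pairwise classifiers are trained, classifying one case along the graph requires only k - 1 pairwise evaluations, because each evaluation eliminates exactly one class.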

Decision tree C5.0
The term 'decision tree' is derived from the presentation of the resulting model as a tree-like structure. Decision tree learning follows a top-down, divide-and-conquer strategy. The basic algorithm for decision tree learning can be described as follows [36]:
(1) Select (based on some measure of 'purity' or 'order' such as entropy, information gain, or diversity) an attribute to place at the root of the tree and create a branch for each possible value of the attribute. This splits up the underlying case set into subsets, one for every value of the considered attribute.
(2) Tree growing: Recursively repeat this process for each branch, using only those cases that actually reach that branch. If at any time most instances at a node have the same classification or if a further splitting does not lead to a significant improvement, then stop developing that part of the tree.
(3) Tree pruning: Merge some nodes to improve the model's performance, i.e., balance the bias and variance of the tree based on statistical measures regarding the node purity or based on performance assessment (e.g., cross-validation performance).
Following this top-down, divide-and-conquer strategy, learning in C5.0 involves a tree growing phase and a tree pruning phase, in which some nodes are merged to improve the generalization ability of the overall model. C5.0 builds a multi-leaf classification tree based on information gain ranking of the attributes.
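The attribute selection step can be illustrated with a generic information gain computation. This is a textbook sketch for discrete attribute values, not C5.0's internal procedure (continuous expression values would first require a split threshold); the function names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in entropy obtained by splitting the cases on an attribute.

    values: the attribute value of each case; labels: the class of each case.
    """
    n = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    # Weighted average entropy of the subsets created by the split
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

labels = ["A", "A", "B", "B"]
gain_informative = information_gain(["lo", "lo", "hi", "hi"], labels)
gain_uninformative = information_gain(["lo", "hi", "lo", "hi"], labels)
```

An attribute that separates the classes perfectly attains the full entropy of the label set as its gain, whereas an attribute that leaves every subset as mixed as the original set has a gain of zero and would be ranked last.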
The initial pruning severity of the decision tree is 90%. Then, in 10-fold cross-validation on the learning set, the average correct classification rate is determined. The pruning severity is iteratively reduced in steps of 10% (i.e., 90%, 80%, 70% etc.), and the tree is rebuilt in 10-fold cross-validation. Using this strategy, the optimal pruning severity is determined for the learning set. The decision tree is then built on the entire learning set L i and pruned with the optimal pruning severity. The resulting model is used to classify the corresponding test cases in T i .

Multilayer perceptrons
For both the decision tree and the multilayer perceptrons, the implementation in SPSS Clementine® is used. Various network topologies are investigated in the present study; the optimal architecture (number of layers and hidden neurons) is determined in the learning phase. The training algorithm for the multilayer perceptrons is backpropagation with momentum α = 0.9 and an adaptive learning rate with initial value λ = 0.3. The network is initialized with one hidden layer comprising five neurons. The number of hidden neurons is empirically adapted on the learning set L i , i.e., the network topology is chosen to yield the lowest cross-validated error rate on the learning set L i . The resulting optimal network architecture is chosen for predicting the test cases in T i .
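The role of the momentum term can be illustrated with the standard weight update for a single weight. This is a generic sketch, not Clementine's implementation: the function name `momentum_step` is hypothetical, and the learning rate is held fixed at its initial value because the adaptation schedule is not specified here.

```python
def momentum_step(weight, gradient, previous_delta,
                  learning_rate=0.3, momentum=0.9):
    """One gradient-descent update with momentum for a single weight.

    delta = -learning_rate * gradient + momentum * previous_delta
    Returns the updated weight and the delta to carry into the next step.
    """
    delta = -learning_rate * gradient + momentum * previous_delta
    return weight + delta, delta

# Two consecutive updates with the same error gradient
w1, d1 = momentum_step(1.0, 0.5, 0.0)
w2, d2 = momentum_step(w1, 0.5, d1)
```

When successive gradients point in the same direction, the momentum term enlarges the step (the second delta is larger in magnitude than the first), which speeds up progress along consistent directions of the error surface.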

Authors' contributions
DB implemented the NN models, selected and pre-processed the data sets, and carried out the comparative study. IB helped in the statistical design and interpretation. WD interpreted the results and helped in the preparation of the manuscript.