The unique characteristics of microarray data have stimulated the development of a multitude of analysis methods. Microarray data is distinguished by a very small number of samples relative to the number of features measured. Most earlier machine learning methods were developed on data where the opposite holds: the number of samples is much larger than the number of features. As a result, such analysis methods have to be adapted for microarray datasets.
An example is the common paradigm of splitting the dataset into training and test data. The training data is used for selecting features and training a classifier. Once a final classifier has been specified, it can be used to predict the classes of the test samples. The mean error on a sufficiently large (ideally infinite) test dataset gives the true error of the classifier.
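For concreteness, a minimal sketch of this split-sample paradigm in Python (synthetic data standing in for a real microarray set; the nearest centroid classifier is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

# Synthetic stand-in for a microarray dataset: 60 samples, 2000 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))
y = rng.integers(0, 2, size=60)

# The test samples play no role in feature selection or training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)
clf = NearestCentroid().fit(X_train, y_train)
test_error = np.mean(clf.predict(X_test) != y_test)
```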
When the number of samples n is small, it is important to ensure that the data used to test the classifier is not part of the data used to train it. Testing the classifier on the same samples that were used to train it gives the re-substitution estimate of the true error, which is known to give falsely low (usually zero) error estimates for small n.
With microarray data, splitting the sample into large training and test sets is usually not feasible since the number of samples is so small. Cross-validation (CV) is one solution to the lack of sufficiently large training and testing sets [1], where, instead of testing a fixed classifier (as in the split-sample case), we have a fixed classifier training algorithm. A classifier training algorithm takes a set of samples, performs feature selection and classifier training, and returns a single, well-defined classifier. In CV, part of the data is left out and the rest is used by the classifier training algorithm to develop a classifier. The classifier thus obtained is used to predict the classes of the left-out samples. This loop is repeated for different left-out portions. The average error thus obtained over the entire dataset (the CV error estimate) can be interpreted as an estimate of the true error of the classifier we would obtain by applying the classifier training algorithm to the entire dataset. In the case where the left-out data consists of one sample only (Leave-One-Out-CV), it can be shown that the CV error estimate is an almost unbiased estimate of the true error expected on an independent test set for the classifier one would obtain if the classifier training algorithm were used on the entire dataset (Theorem 10.8, [8]).
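The following sketch makes the distinction concrete: the quantity being cross-validated is a classifier training algorithm, re-run from scratch in every iteration, not a fixed classifier (names are my own; scikit-learn estimators assumed):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

def loocv_error(X, y, train_algorithm):
    """LOOCV error for a classifier *training algorithm*: the algorithm is
    re-run on the left-in samples of every iteration."""
    errors = []
    for tr, te in LeaveOneOut().split(X):
        clf = train_algorithm(X[tr], y[tr])       # all training happens here
        errors.append(clf.predict(X[te])[0] != y[te][0])
    return float(np.mean(errors))

# Example: the training algorithm "fit a 3-nearest-neighbor classifier".
# err = loocv_error(X, y, lambda X, y: KNeighborsClassifier(3).fit(X, y))
```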
However, CV methods are proven to be unbiased only if all the various aspects of classifier training take place inside the CV loop. This means that every aspect of training a classifier, e.g. feature selection, classifier type selection and classifier parameter tuning, takes place on the data not left out during each CV loop. It has been shown that violating this principle can result in severely biased estimates of the true error. One way to violate it is to use all of the training data to choose the genes that discriminate between the two classes and to only retrain the classifier parameters inside the CV loop. This breaks the requirement that feature selection be done separately for each loop, on the data that is not left out. As pointed out by Simon et al. [2], Ambroise and McLachlan [3] and Reunanen [4], this gives a very biased estimate of the true error, not much better than the resubstitution estimate: over-optimistic error estimates close to zero are obtained even for data where there is no real difference between the two classes.
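The two placements of gene selection differ by a single line of code; a hedged sketch (hypothetical names, scikit-learn's univariate F-test as the selection criterion). On data with no real class difference, the out-of-loop variant typically reports an error far below the true 50%, reproducing the bias described above, while the in-loop variant does not:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestCentroid

def cv_error(X, y, select_inside_loop, n_genes=50):
    """10-fold CV error with gene selection done correctly (inside the loop)
    or incorrectly (once, on all samples)."""
    if not select_inside_loop:
        # Biased variant: genes chosen using *all* samples, including those
        # that will later be left out for testing.
        X = SelectKBest(f_classif, k=n_genes).fit_transform(X, y)
    errs = []
    for tr, te in StratifiedKFold(n_splits=10).split(X, y):
        Xtr, Xte = X[tr], X[te]
        if select_inside_loop:
            # Correct variant: genes re-selected from the left-in data of each fold.
            sel = SelectKBest(f_classif, k=n_genes).fit(Xtr, y[tr])
            Xtr, Xte = sel.transform(Xtr), sel.transform(Xte)
        clf = NearestCentroid().fit(Xtr, y[tr])
        errs.append(np.mean(clf.predict(Xte) != y[te]))
    return float(np.mean(errs))
```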
Another violation of the principle is to perform any kind of classifier parameter selection outside the CV loop. Examples of such classifier parameters are the number of neighbors for a Nearest Neighbor classifier or the kernel parameters for a Support Vector Machine (SVM) classifier. To find the best values of these parameters for a given dataset, we can compute the CV error estimate for the dataset using different values of the parameters. The classifier parameter value with the minimum CV error estimate is then chosen, and the final classifier is trained on the entire dataset using the chosen optimal parameters.
This comes under the general term of wrapper methods, where a CV algorithm is "wrapped" inside a search algorithm that tries to minimize the CV error. Such wrapper methods have proven very useful for data-driven adaptation of classifier parameters.
However, this involves a kind of additional training of the classifier (in the form of selecting the classifier parameter) that is done outside the CV loop. This violates the assumption that all training is done within the CV loop on the data not left out. Thus the guarantee of unbiased estimation of true error is not valid and there is a possibility of bias. In other words, the CV error estimate for the classifier parameters that minimize the CV error estimate could be a biased estimate of the true error of the final classifier trained on all the data using the optimal classifier parameters.
We investigate this possibility using two wrapper algorithms. The first is the Shrunken Centroids method of Tibshirani et al. [5], where the optimal value of a classifier parameter Δ that controls the degree of shrinkage is taken to be the one that minimizes the 10-fold CV error. The second is a variant of the Support Vector Machine proposed by Peng et al. [6], which selects SVM kernel parameters that minimize the Leave-One-Out-CV (LOOCV) error. The first article uses both the CV error estimate on the training set and the error on the test set to determine the optimal parameter values, thus making the test set part of the training process. The second article presents only the minimum CV error estimate obtained on the training set. Neither reports the true error of the final classifier on an independent test set.
Since selection of classifier parameters that minimize CV error estimates is a kind of training, it should be included as part of the classifier training algorithm. The classifier training algorithm in this case is thus the complete wrapper algorithm, which trains a classifier on a given dataset in the following way. First, the CV error estimate is computed for different values of the classifier tuning parameters. Then, the parameters with the smallest CV error estimate are used to train a classifier on all the data. This satisfies the definition of a classifier training algorithm, i.e. an algorithm that takes a dataset and returns a single, well-defined classifier.
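As a sketch, the whole wrapper can be written as one function satisfying this definition (k-nearest neighbors with the number of neighbors k as the tuning parameter; names are my own):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def train_tuned_knn(X, y, k_grid=(1, 3, 5, 7, 9)):
    """A complete classifier training algorithm: dataset in, one fitted
    classifier out. The CV used here to pick k is part of *training*,
    not of error estimation."""
    inner_cv = StratifiedKFold(n_splits=10)
    cv_errors = [1.0 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                       X, y, cv=inner_cv).mean()
                 for k in k_grid]
    best_k = k_grid[int(np.argmin(cv_errors))]
    # Refit on the entire dataset using the chosen optimal parameter.
    return KNeighborsClassifier(n_neighbors=best_k).fit(X, y)
```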
Now that we have a wrapper algorithm that is a well-defined classifier training algorithm, we can use CV to get an estimate of the true error for the classifier it returns. We can embed the complete wrapper algorithm (with its own CV loop for finding the best classifier parameters) inside another CV loop that computes the error estimate. Note that this is no different from the usual CV method: instead of using CV to find an error estimate for a particular classifier (e.g. Nearest Neighbors), we use CV to find an error estimate for an optimized classifier (e.g. Nearest Neighbors with the optimal number of neighbors determined by minimizing the CV error estimate). Thus there are two CV loops; the inner loop is part of the wrapper algorithm and the outer loop computes an estimate of the true error. A similar method was used by Iizuka et al. [7]. In this article, we investigate the effect on the bias of using this nested CV approach.
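In scikit-learn this nesting can be expressed directly: GridSearchCV plays the role of the wrapper (inner CV loop plus refit on the full training portion), and an outer cross_val_score measures its true error. A sketch, assuming data arrays X and y as above:

```python
from sklearn.model_selection import GridSearchCV, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Inner loop (the wrapper): 10-fold CV over k, then refit with the best k.
wrapper = GridSearchCV(KNeighborsClassifier(),
                       {"n_neighbors": [1, 3, 5, 7, 9]}, cv=10)

# Outer loop: LOOCV around the entire wrapper, estimating its true error.
outer_accuracy = cross_val_score(wrapper, X, y, cv=LeaveOneOut())
nested_cv_error = 1.0 - outer_accuracy.mean()
```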
Shrunken centroids
This classifier, originally proposed by Tibshirani et al. [5], is an extension of the nearest centroid classifier. In the nearest centroid classifier the training set is used to calculate the centroids (mean expressions of the genes) for the two classes. A new sample is compared to the two centroids and classified according to the class of the nearest centroid. In the shrunken centroids method, a parameter Δ is used to shrink the class centroids towards the overall centroid after standardizing by the within-class standard deviation. The centroid x̄k belonging to class k is brought closer to the overall centroid x̄ (mean of samples of all classes pooled together) by

x̄'k = x̄ + mk(s + s0) d'k   (1)

where s is the vector of pooled within-class standard deviations for all genes, s0 is the median of the elements of s, mk = √(1/nk − 1/n) is a normalization constant (with nk the number of samples in class k out of n in total), and the shrunken standardized difference d'k is given by

d'k = sign(dk)(|dk| − Δ)+,   dk = (x̄k − x̄) / (mk(s + s0))   (2)

and (...)+ denotes the positive part of the quantity in the parenthesis, i.e. equal to the quantity if it is greater than zero, and zero otherwise. Thus genes which are not very differentially expressed contribute less to the classification than genes that are more discriminating. The parameter Δ can be varied to change the number of genes used.
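A minimal sketch of equations (1) and (2) in code (two classes labeled 0 and 1; the pooled standard deviation and the normalization mk = √(1/nk − 1/n) are as stated above, and the function name is my own):

```python
import numpy as np

def shrink_centroids(X, y, delta):
    """Shrunken class centroids for two classes (labels 0 and 1)."""
    n = len(y)
    overall = X.mean(axis=0)
    # Pooled within-class standard deviation per gene, plus the median offset s0.
    s = np.sqrt(sum(((X[y == k] - X[y == k].mean(axis=0)) ** 2).sum(axis=0)
                    for k in (0, 1)) / (n - 2))
    s0 = np.median(s)
    shrunken = {}
    for k in (0, 1):
        nk = np.sum(y == k)
        mk = np.sqrt(1.0 / nk - 1.0 / n)
        dk = (X[y == k].mean(axis=0) - overall) / (mk * (s + s0))
        dk_shrunk = np.sign(dk) * np.maximum(np.abs(dk) - delta, 0.0)  # soft threshold
        shrunken[k] = overall + mk * (s + s0) * dk_shrunk
    return shrunken
```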
In the original paper, 10-fold CV is used to obtain the CV error estimate for a particular choice of Δ. No objective guideline is given for selecting Δ based only on the training data. The authors vary Δ and choose the value that simultaneously minimizes the CV error estimate on the training set and the error on the test data. Thus the test set is used in the selection of the classifier parameters, which is problematic.
Support Vector Machines
Peng et al. present a method for feature selection using genetic algorithms (GA) and recursive feature elimination (RFE) in combination with a support vector machine (SVM) classifier [6]. SVMs were introduced by Vapnik [8] as linear hyperplanes that separate data belonging to different classes while maximizing the margin, i.e. the distance from the separating hyperplane to the closest training samples.
Denote the two classes by 1 and -1. For a sample x consisting of p measurements (e.g. gene expressions) the linear hyperplane classifier c(x) predicts the class according to

c(x) = sign(w'x̃)   (3)

for a weight vector w = [w0, w1, ..., wp]' and the augmented sample vector x̃ obtained by appending a constant 1 to the sample x, i.e. x̃ = [1, x1, ..., xp]'.
The margin of a sample x of class y ∈ {-1, 1} is defined as yw'x̃ and it plays a very important role in the SVM. The margin of correctly classified samples is positive and that of misclassified samples is negative. The margin of the classifier is defined as the smallest margin over all the training samples. The SVM tries to find a w such that the margin is maximized while the (squared) norm of the weight vector ⟨w, w⟩ is kept small. This is equivalent to minimizing the cost function

J(w) = ⟨w, w⟩ + C Σi ξi(w)   (4)

where

ξi(w) = (1 − yi w'x̃i)+   (5)
and (...)+ denotes the positive part of the quantity in the parenthesis, as above, i.e. equal to the quantity if it is greater than zero, and zero otherwise.
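Equations (4) and (5) translate directly into code; a sketch (names are my own; the actual minimization of the cost, usually done via the dual problem, is omitted):

```python
import numpy as np

def svm_cost(w, X, y, C):
    """Soft-margin SVM objective: <w, w> plus C times the total hinge loss.
    y has entries in {-1, +1}; X rows are samples, augmented below."""
    X_aug = np.hstack([np.ones((len(X), 1)), X])  # prepend the constant 1
    margins = y * (X_aug @ w)                     # y_i * w' x~_i
    xi = np.maximum(1.0 - margins, 0.0)           # hinge loss (1 - y_i w' x~_i)_+
    return w @ w + C * xi.sum()
```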
Since the capacity of the classifier increases with increasing norm of the weight vector, the parameter C also controls the tradeoff between the size of the margin and the capacity of the classifier.
Since the samples need to be represented only in the form of scalar products, this formulation can be extended to non-linear classifiers by introducing kernels. Kernels are functional representations of scalar products in a transformed space. It can be shown that such a transformation leaves the optimization problem unchanged except that the inner product ⟨x1, x2⟩ is replaced by the kernel K(x1, x2). When the transformed space is very high (or infinite) dimensional, evaluating the kernel is usually easier than performing the transformation followed by the scalar product.
For Gaussian kernel SVM (also called a radial basis function kernel), the kernel is given by
K(x1, x2) = exp(−γ ||x1 − x2||²)   (6)
The spread of the kernel function is given by γ, which can be varied to adapt the kernel to the data. The larger the value of γ, the more peaked the corresponding transformations of the feature vectors are, and the higher the capacity of the classifier.
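Equation (6) in code, for reference:

```python
import numpy as np

def gaussian_kernel(x1, x2, gamma):
    """Gaussian (RBF) kernel of equation (6)."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

# A larger gamma makes the kernel decay faster with distance, giving more
# peaked transformed feature vectors and a higher-capacity classifier.
```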
Peng et al. use the Leave-One-Out-CV (LOOCV) error to tune the kernel parameters. LOOCV is a CV scheme where one sample is left out during each iteration; the average classification error obtained is an almost unbiased estimate of the true error. In [6], the training data is used to select an appropriate kernel (from linear, Gaussian and polynomial) and the set of parameters that minimize the LOOCV error estimate. However, no independent test set is used and only the final LOOCV error estimate on the training set is reported.
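A sketch of this tuning protocol (a plain RBF-kernel SVC standing in for the full GA/RFE pipeline of [6], which is not reproduced here; X, y as before):

```python
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

# Tune the RBF kernel width gamma by minimizing the LOOCV error.
search = GridSearchCV(SVC(kernel="rbf"),
                      {"gamma": [1e-4, 1e-3, 1e-2, 1e-1, 1.0]},
                      cv=LeaveOneOut()).fit(X, y)

# best_score_ is the *maximum* LOOCV accuracy over the grid; quoting
# 1 - best_score_ as the true error repeats the optimistic bias discussed above.
print("chosen gamma:", search.best_params_["gamma"])
print("minimum LOOCV error:", 1.0 - search.best_score_)
```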