### Binary classifiers: PLS, PPLS and LASSO

We first briefly review the three binary classifiers, which were originally designed for regression but can be directly applied to two-class classification, even when the number of covariates (i.e. genes here) is much larger than the sample size.

We code the response variable (i.e. class label) as *Y* = 1 for class 1 and *Y* = -1 for class 2. Suppose that *x*_{i} is the expression level of gene *i*, *i* = 1, ..., *p*, with *p* the total number of genes, and that we have *n* samples in the training data. A challenge is that *n* << *p*.

The main idea of partial least squares (PLS) [21] is to seek a few linear combinations of the genes, *z*_{j} = *X*α_{j} for *j* = 1, ..., *m*, then apply ordinary least squares (OLS) to regress *Y* on the *z*_{j}'s to obtain

*Ŷ* = β_{0} + β_{1}*z*_{1} + ... + β_{m}*z*_{m}

with the β's as OLS estimates. The key of course is how to form the linear components *z*_{j}'s. It turns out that

α_{j} = argmax_{α} Corr^{2}(*y*, *X*α) Var(*X*α)

with the constraints ||α|| = 1 and α'*S*α_{l} = 0 for *l* = 1, ..., *j* - 1, where *y* is the vector of observed *Y*'s (in the training data), *X* is the design matrix (i.e. matrix of observed *x*'s), and *S* is the sample covariance matrix of the *x*'s [31]. In practice, the number of linear components *m* has to be chosen, typically by a form of cross-validation, such as leave-one-out cross-validation (LOOCV), to minimize misclassification errors.
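To make the construction concrete, here is a minimal NIPALS-style sketch of PLS1 followed by the sign rule used for classification. The function names are ours and this is only an illustration, not the implementation used in the paper; maximizing the squared covariance in NIPALS is equivalent to the Corr²·Var criterion, since Var(*y*) is fixed.

```python
import numpy as np

def pls1_fit(X, y, m):
    """PLS1 via NIPALS: extract m components, then OLS on the components."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xd, yc = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(m):
        w = Xd.T @ yc                    # weight vector alpha_j (up to scale)
        w /= np.linalg.norm(w)           # enforce ||alpha|| = 1
        t = Xd @ w                       # linear component z_j
        tt = t @ t
        p_load = Xd.T @ t / tt           # X loadings
        q.append(t @ yc / tt)            # OLS coefficient of y on z_j
        W.append(w)
        P.append(p_load)
        Xd = Xd - np.outer(t, p_load)    # deflation yields orthogonal components
    W, P = np.column_stack(W), np.column_stack(P)
    coef = W @ np.linalg.solve(P.T @ W, np.array(q))  # back to per-gene scale
    return y_mean - x_mean @ coef, coef  # (intercept, gene coefficients)

def predict_class(X_new, intercept, coef):
    """Classify by the sign of the fitted continuous response."""
    return np.where(intercept + X_new @ coef >= 0, 1, -1)
```

With the ±1 coding of *Y*, the fitted continuous response is thresholded at 0 to assign a class, even when *n* << *p*.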

PPLS is a penalized regression method in the framework of PLS [19, 32]. Suppose that we have built a PLS linear model, which can be rewritten in terms of the original genes as:

*Ŷ* = γ_{0} + γ_{1}*x*_{1} + ... + γ_{p}*x*_{p}.

Then we shrink the PLS coefficients by soft-thresholding [33, 34]

γ̃_{i} = *sign*(γ_{i})(|γ_{i}| - λ)_{+},

where *sign*(*a*) = 1 if *a* ≥ 0 and *sign*(*a*) = -1 if *a* < 0, λ is a shrinkage parameter to be determined, and *f*_{+} = *max*(*f*, 0). It is common that the shrinkage leads to many γ̃_{i} = 0, effectively eliminating gene *i* from the model; thus gene selection is automatically accomplished. Next we construct a linear component *z* = Σ_{i} γ̃_{i}*x*_{i}. Finally a PPLS model is built by regressing *Y* on *z* using OLS,

*Ŷ* = β_{0} + β_{1}*z*,

which can be re-expressed as a linear model in the *x*_{i}'s. The parameters involved in building a PPLS model, such as the shrinkage parameter λ and the number of PLS components, are estimated by LOOCV. The goal is to choose the largest shrinkage parameter and the smallest number of PLS components for which the LOOCV misclassification error estimate is minimized.
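The soft-thresholding operator is one line of NumPy; coefficients whose magnitude does not exceed λ are set exactly to zero, which is what accomplishes the gene selection (a sketch with our own function name):

```python
import numpy as np

def soft_threshold(gamma, lam):
    """Soft-thresholding: sign(gamma) * (|gamma| - lam)_+ , elementwise."""
    return np.sign(gamma) * np.maximum(np.abs(gamma) - lam, 0.0)
```

For example, with λ = 0.2, coefficients (0.5, -0.3, 0.1) shrink to (0.3, -0.1, 0), dropping the third gene from the model.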

The LASSO estimates [23] in a linear model

*Ŷ* = β_{0} + β_{1}*x*_{1} + ... + β_{p}*x*_{p}

are obtained by minimizing Σ_{i=1}^{n}(*Y*_{i} - *Ŷ*_{i})^{2} subject to Σ_{j}|β_{j}| ≤ *t*, where *Y*_{i} is the observed response for sample *i* and *Ŷ*_{i} is its LASSO estimate (fitted value), *i* = 1, ..., *n*, and *t* can be chosen by LOOCV. The constraint can often force many β_{j} = 0, leading to gene selection.
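A minimal coordinate-descent sketch of the LASSO in its equivalent Lagrangian (penalized) form; the penalty λ here plays the role of the bound *t*, and the soft-thresholding update produces the exact zeros that yield gene selection. This is our own illustration, not the R implementation used later in the paper, and it assumes a centered response and roughly standardized gene columns:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent LASSO (penalized form with penalty lam).
    Assumes y is centered; columns of X roughly on a common scale."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with gene j's contribution removed
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r / n
            # soft-threshold the univariate OLS coefficient
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta
```

At convergence, every coefficient whose partial correlation with the residual stays below λ is exactly zero, so the corresponding gene drops out of the model.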

Note that the class label (1/-1) for the response *Y* is binary, but in any of the above binary classifiers, the response *Y* is treated as a continuous variable and the estimate *Ŷ* could be any real number. To predict the class of a new sample, we use *sign*(*Ŷ*): if the estimated response *Ŷ* is greater than or equal to 0, then we classify the sample into class 1; otherwise, class 2. In particular, this direct use of PLS for binary classification (as in [35]) is different from other approaches [36–40]; a distinct advantage of our approach is its simplicity, e.g., avoiding convergence problems when the two classes are perfectly separable, which is common in microarray data with a small sample size and a large number of genes.

### Multiclass classifiers: nearest shrunken centroids and random forest

Nearest shrunken centroids (SC) is built on diagonalized linear discriminant analysis (DLDA) [26, 41]. Suppose that we have *K* classes, *x̄*_{ik} is the mean expression level of gene *i* in class *k* of the training samples, *s*_{i}^{2} is the pooled sample variance of gene *i* of the training samples, and π_{k} is the prior probability of a new sample being in class *k*. The DLDA rule for a new sample *x** = (*x*_{1}*, ..., *x*_{p}*)' is

*C*(*x**) = argmin_{k} Σ_{i}(*x*_{i}* - *x̄*_{ik})^{2}/*s*_{i}^{2} - 2 log π_{k}.

SC is motivated by the observation that many of the genes will not be predictive of the class membership and should be eliminated from the above DLDA rule. Formally, define

*d*_{ik} = (*x̄*_{ik} - *x̄*_{i})/(*m*_{k}*s*_{i}), with *m*_{k} = (1/*n*_{k} - 1/*n*)^{1/2},

where *n*_{k} is the number of training samples in class *k*, and *x̄*_{i} is the overall mean expression level of gene *i* in all the training samples. Note that by the definition, we have *x̄*_{ik} = *x̄*_{i} + *m*_{k}*s*_{i}*d*_{ik}. Let

*d*'_{ik} = *sign*(*d*_{ik})(|*d*_{ik}| - Δ)_{+}

for all *i* and *k*, where Δ is the shrinkage parameter to be chosen by LOOCV. Then substituting *x̄*_{ik} in the DLDA rule by *x̄*'_{ik} = *x̄*_{i} + *m*_{k}*s*_{i}*d*'_{ik}, we obtain an SC rule. The new sample *x** is assigned to class *k*_{0} = argmin_{k} Σ_{i}(*x*_{i}* - *x̄*'_{ik})^{2}/*s*_{i}^{2} - 2 log π_{k}.

Note that, if *d*'_{ik} = 0, then *x̄*'_{ik} = *x̄*_{i} and thus gene *i* plays no role in classifying for class *k*. Hence SC effectively accomplishes gene selection by shrinkage.
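The centroid shrinkage can be sketched as follows, assuming equal class priors and omitting any fudge constant added to *s*_{i} in some implementations; all function names are ours:

```python
import numpy as np

def shrunken_centroid_fit(X, y, delta):
    """Shrink class centroids toward the overall mean by soft-thresholding."""
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    overall = X.mean(axis=0)
    ss = np.zeros(p)
    cents, mks = [], []
    for k in classes:
        Xk = X[y == k]
        cents.append(Xk.mean(axis=0))
        ss += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)  # within-class SS
        mks.append(np.sqrt(1.0 / len(Xk) - 1.0 / n))     # m_k
    s = np.sqrt(ss / (n - K))                            # pooled s_i
    shrunk = []
    for cent, mk in zip(cents, mks):
        d = (cent - overall) / (mk * s)
        d_shr = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)
        shrunk.append(overall + mk * s * d_shr)          # shrunken centroid
    return np.array(shrunk), s, classes

def shrunken_centroid_predict(X_new, centroids, s, classes):
    """DLDA rule with equal priors: nearest standardized centroid."""
    dists = ((X_new[:, None, :] - centroids[None, :, :]) ** 2 / s ** 2).sum(axis=2)
    return classes[np.argmin(dists, axis=1)]
```

Genes whose shrunken deviation *d*'_{ik} hits zero sit at the overall mean for that class and contribute nothing to the class comparison.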

Random forest (RF) [24] is an ensemble of classification trees [42, 43], which have been shown to be useful in tumor classification with microarray data [44]. It is designed to improve on a single classification tree. Two random elements help generate the multiple classification trees in RF. First, a bootstrap sample is repeatedly drawn from the original training data and used to build each classification tree. Second, in building a classification tree, rather than choosing the best splitting variable (i.e. gene here) from all the available variables at each node, RF chooses the best from a small random subset of the variables. Each tree is grown to its maximum size with no pruning. To predict the class of a new sample, the sample is passed to each tree, each tree votes with its prediction, and the majority vote is taken as the final prediction.
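The two randomization steps and the majority vote can be illustrated with an ensemble of one-split "stumps". A real RF grows full unpruned trees; this simplified sketch, with our own function names, only shows the mechanics of bootstrapping, random feature subsets, and voting for a ±1-coded two-class problem:

```python
import numpy as np

def fit_stump(X, y, feats):
    """Best mean-threshold split restricted to a random feature subset."""
    best, best_err = (feats[0], 0.0, 1, 1), np.inf
    for j in feats:
        thr = X[:, j].mean()
        left = X[:, j] <= thr
        for ll, rl in ((1, -1), (-1, 1)):       # labels for the two sides
            err = (y[left] != ll).sum() + (y[~left] != rl).sum()
            if err < best_err:
                best_err, best = err, (j, thr, ll, rl)
    return best

def forest_fit(X, y, n_trees=200, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = max(1, int(np.sqrt(p)))                 # size of the random subset
    stumps = []
    for _ in range(n_trees):
        boot = rng.integers(0, n, n)            # bootstrap sample of rows
        feats = rng.choice(p, m, replace=False) # random subset of genes
        stumps.append(fit_stump(X[boot], y[boot], feats))
    return stumps

def forest_predict(X_new, stumps):
    votes = np.zeros(len(X_new))
    for j, thr, ll, rl in stumps:
        votes += np.where(X_new[:, j] <= thr, ll, rl)
    return np.where(votes >= 0, 1, -1)          # majority vote
```

Even though most stumps never see an informative gene, the votes of those that do dominate the aggregate.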

### Extending a binary classifier to multiclass classification

Here we describe how a multi-class (*K* > 2) classification problem can be handled by a binary classifier, such as PLS, PPLS or LASSO. This is achieved by formulating a multi-class classification problem as multiple two-class classification problems. We consider the two most popular approaches: one compares each class against all the others, and the other compares all possible pairs of classes. Applications of these two approaches can be found, among others, in [45–50]. In particular, some have considered the first approach for PLS [50].

The *one-against-others* approach reduces a *K*-class classification task to *K* two-class classification problems. Formally, a new response is defined in the *k*th binary problem as

*Y*^{(k)} = 1 if the sample is in class *k*, and *Y*^{(k)} = -1 otherwise,

for *k* = 1, ..., *K*. Then we build *K* binary classifiers. To predict for a new sample with gene expression profile *x**, we apply *x** to each binary classifier to yield a fitted value *Ŷ*^{(k)}(*x**). Finally, the class of the new sample is predicted as

*C*(*x**) = argmax_{k} *Ŷ*^{(k)}(*x**).

That is, the new sample is classified into the class maximizing the fitted response.
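A sketch of the one-against-others wrapper around any binary classifier that returns a continuous score; the function names are ours, and `fit_binary` stands for any of the binary methods above:

```python
import numpy as np

def one_vs_others_fit(X, y, fit_binary):
    """Train one binary classifier per class; fit_binary(X, y_pm) must
    accept +/-1 labels and return a function mapping samples to scores."""
    classes = np.unique(y)
    scorers = [fit_binary(X, np.where(y == k, 1.0, -1.0)) for k in classes]
    return classes, scorers

def one_vs_others_predict(X_new, classes, scorers):
    """Assign each sample to the class with the largest fitted response."""
    scores = np.column_stack([f(X_new) for f in scorers])
    return classes[np.argmax(scores, axis=1)]
```

Because the final decision takes an argmax over the *K* fitted responses, no single binary classifier needs to be individually accurate on the "others" super-class.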

The *pair-wise* approach reduces a *K*-class classification to *K*(*K* - 1)/2 two-class classification problems [45]. Specifically, for each possible pair of classes, we solve a two-class problem and then, for a new sample, combine all the pairwise decisions to form a *K*-class decision. Suppose that the new binary response in a pairwise comparison of classes *k*_{1} and *k*_{2} (with 1 ≤ *k*_{1} < *k*_{2} ≤ *K*) is defined as

*Y*^{(k1,k2)} = 1 if the sample is in class *k*_{1}, and *Y*^{(k1,k2)} = -1 if it is in class *k*_{2}.

We build a binary classifier with response *Y*^{(k1,k2)} using only samples belonging to class *k*_{1} or *k*_{2}, and denote the fitted response value for a new sample with expression profile *x** as *Ŷ*^{(k1,k2)}(*x**). As described earlier, we classify the new sample into class *k*_{1} or *k*_{2} according to the sign of *Ŷ*^{(k1,k2)}(*x**). After this is done for every 1 ≤ *k*_{1} < *k*_{2} ≤ *K*, the final decision is to assign the new sample to the class that wins the most pairwise comparisons. When there are multiple winning classes, we randomly pick one of them as the final winning class. Compared to the *one-against-others* approach, the *pair-wise* approach is computationally more expensive if *K* ≥ 4.
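The pairwise scheme, including random tie-breaking among winning classes, can be sketched as follows (function names are ours; `fit_binary` is again any binary method returning a score function):

```python
import numpy as np
from itertools import combinations

def pairwise_fit(X, y, fit_binary):
    """One classifier per class pair, trained only on that pair's samples."""
    classes = np.unique(y)
    models = {}
    for k1, k2 in combinations(classes, 2):
        mask = (y == k1) | (y == k2)
        ypm = np.where(y[mask] == k1, 1.0, -1.0)   # k1 -> +1, k2 -> -1
        models[(k1, k2)] = fit_binary(X[mask], ypm)
    return classes, models

def pairwise_predict(X_new, classes, models, rng=None):
    """Majority vote over all pairwise decisions; ties broken at random."""
    if rng is None:
        rng = np.random.default_rng()
    wins = {k: np.zeros(len(X_new)) for k in classes}
    for (k1, k2), f in models.items():
        pos = f(X_new) >= 0                        # sign rule for each pair
        wins[k1] += pos
        wins[k2] += ~pos
    tally = np.column_stack([wins[k] for k in classes])
    preds = []
    for row in tally:
        winners = np.flatnonzero(row == row.max())
        preds.append(classes[rng.choice(winners)]) # random pick among ties
    return np.array(preds)
```

For *K* = 3 the two approaches cost the same (3 classifiers each); for *K* ≥ 4 the pairwise count *K*(*K* - 1)/2 exceeds *K*.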

### Gene ranking

To explore the effect of the number of genes a model starts with on classification performance, we perform a preliminary gene ranking using the usual F-statistic. This univariate ranking is used throughout, and is obviously by no means optimal. For the purpose of the presentation in this section, we only need to consider a given gene. Suppose *x*_{ik} is the expression level of the gene in sample *i* of class *k*, *i* = 1, ..., *n*_{k} and *k* = 1, ..., *K*, where *n*_{k} is the total number of samples in class *k* and *K* is the total number of classes. Let *x̄*_{k} be the mean expression level of the class-*k* samples, *x̄* be the overall mean (across all the samples) and *n* = Σ_{k}*n*_{k} be the total number of samples. We can construct an F-statistic as the ratio of the mean sums of squares for between-class and within-class variations:

*F* = [Σ_{k}*n*_{k}(*x̄*_{k} - *x̄*)^{2}/(*K* - 1)] / [Σ_{k}Σ_{i}(*x*_{ik} - *x̄*_{k})^{2}/(*n* - *K*)].

We can rank all the genes by their corresponding *F*-statistics: a gene with a larger F-statistic indicates a stronger relationship between its expression levels and the class membership in the samples, and therefore receives a higher rank as a potential predictor of the class. We started with various models including different numbers of top-ranked genes: the top 50, 100, 200, 400, 800, 1600, 3200, 6400, 9200, 12800, 16000, 19200, and all genes (22283 and 22277 for the two datasets, respectively).
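The per-gene F-statistics and the ranking can be computed in a vectorized way (a sketch with our own function names):

```python
import numpy as np

def f_statistics(X, y):
    """One-way ANOVA F-statistic for every gene (column) of X."""
    classes = np.unique(y)
    n, K = len(y), len(classes)
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in classes:
        Xk = X[y == k]
        between += len(Xk) * (Xk.mean(axis=0) - overall) ** 2
        within += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return (between / (K - 1)) / (within / (n - K))

def top_genes(X, y, num):
    """Indices of the num top-ranked genes, largest F first."""
    return np.argsort(f_statistics(X, y))[::-1][:num]
```

Passing `num=50`, `100`, and so on reproduces the sequence of starting gene sets described above.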

It is an incorrect practice in microarray experiments to first select genes using all the samples and then perform cross-validation using only the selected genes, which gives downward-biased prediction error estimates [51, 52]. Hence, it is essential to perform cross-validation on the entire model-building process, including gene selection. In our study, we did honest cross-validation; in particular, we cross-validated gene selection (and other aspects of model building, such as parameter selection and estimation). Specifically, in LOOCV, we remove each sample from the data in turn (treating it as the test sample), carry out gene selection using the F-statistic based on the remaining samples, build a classifier with the selected genes using the remaining samples, and then test the classifier on the left-out sample.
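The honest scheme, with gene selection redone inside every fold, can be sketched as follows; the names are ours, and `rank`, `fit` and `predict` stand for any gene-ranking and classification routines:

```python
import numpy as np

def honest_loocv_error(X, y, num_genes, fit, predict, rank):
    """LOOCV that reruns gene selection inside every fold, so the
    left-out sample never influences which genes are used."""
    n = len(y)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i
        # gene selection on the training fold only
        genes = rank(X[keep], y[keep], num_genes)
        model = fit(X[keep][:, genes], y[keep])
        pred = predict(X[i:i + 1, genes], model)
        errors += int(pred[0] != y[i])
    return errors / n
```

Selecting genes once on all samples and then cross-validating only the classifier would silently leak the test sample into the selection step; the structure above makes that impossible.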

### Data preprocessing

To facilitate the application of penalized regression (i.e. PPLS and LASSO) so that their regression coefficients are in the same unit and thus can be penalized using a global penalty parameter, the expression levels of each gene were scaled to have sample variance 1.

### Evaluations

In addition to PLS/PPLS, we will consider the shrunken centroids (SC) method, the LASSO, and the random forest (RF). SC, LASSO and RF have been implemented in R [53] and are easy to use; we applied their R functions with default parameter settings. SC and RF are directly applicable to multiclass classification, while LASSO, like PLS/PPLS, is itself a binary classifier. For multi-class classification with PLS and LASSO, we used the same approaches as described for PPLS.

We use leave-one-out cross-validation (LOOCV) to estimate the prediction error for each of the methods. Within this first-level LOOCV, a second-level LOOCV is used to select the tuning parameters of each method to minimize the cross-validation error. Specifically, in PLS, the smallest number of PLS components is selected among the PLS models that give the minimum LOOCV error. In PPLS, among the models with minimum LOOCV error, we first pick the ones with the smallest number of PLS components, then pick the one with the largest shrinkage parameter. In the SC method, the largest shrinkage is selected among the models that minimize the LOOCV error; the number of candidate threshold values and the number of cross-validation folds are both set to their defaults (30 and the smallest class size, respectively). In LASSO, the maximum fraction parameter among the models that minimize the LOOCV error is selected, with the number of candidate fraction values set to 51 (equally spaced from 0 to 1) and the number of cross-validation folds set to the total sample size. In RF, every parameter is set to its default; for example, the number of trees to grow is set to 500, and the number of candidate splitting variables considered at each split is set to √*p* by default, where *p* is the total number of variables (i.e. genes).

Due to the small sample size (about 10 in each class) in each dataset, it is quite challenging to estimate the prediction error well. Although it is straightforward to apply LOOCV or other cross-validation methods, their performance may not be optimal. After submitting this work, we became aware of the recent work by Fu et al [54], where a better method than LOOCV was proposed specifically for microarray data. This new method aims to reduce the variability of LOOCV. We reason that with the use of this new method, the main conclusions drawn in this work would not change.