Simple and flexible classification of gene expression microarrays via Swirls and Ripples

Background A simple classification rule with few genes and parameters is desirable when applying a classification rule to new data. One popular simple classification rule, diagonal discriminant analysis, yields linear or curved classification boundaries, called Ripples, that are optimal when gene expression levels are normally distributed with the appropriate variance, but may yield poor classification in other situations. Results A simple modification of diagonal discriminant analysis yields smooth highly nonlinear classification boundaries, called Swirls, that sometimes outperforms Ripples. In particular, if the data are normally distributed with different variances in each class, Swirls substantially outperforms Ripples when using a pooled variance to reduce the number of parameters. The proposed classification rule for two classes selects either Swirls or Ripples after parsimoniously selecting the number of genes and distance measures. Applications to five cancer microarray data sets identified predictive genes related to the tissue organization theory of carcinogenesis. Conclusion The parsimonious selection of classifiers coupled with the selection of either Swirls or Ripples provides a good basis for formulating a simple, yet flexible, classification rule. Open source software is available for download.


Simplicity and flexibility
Simple classification rules with few variables and parameters are preferable to complicated classification rules for the following two reasons [1]. First, classification performance is primarily a function of the first few variables selected, with only slight improvements when additional variables are included. Second, only those variables that strongly predict class in the study data, namely the first few selected, are also likely to moderately or strongly predict class in new data. A popular simple classification rule for analyzing gene expression microarrays is diagonal discriminant analysis, which is discriminant analysis with a diagonal variance-covariance matrix [2]. Diagonal discriminant analysis yields linear or curved classification boundaries, given the name Ripples. Although Ripples are optimal boundaries for normally distributed expression levels in each class with the appropriate variance [3], they can perform poorly with other distributions of gene expression levels [4]. A simple modification of diagonal discriminant analysis with two classes yields a smooth highly nonlinear classification boundary, given the name Swirls. Swirls can outperform Ripples under certain scenarios. The proposed simple, yet flexible, classification rule for two classes selects either Swirls or Ripples after parsimoniously selecting the number of genes and the distance measure.
Classification rules generally have two objectives: prediction and understanding [5], which correspond to Goals 1 and 2, respectively, which are described below.

Goal 1: Rule discovery and testing
Rule discovery and testing involves splitting the data once into training and test samples, selecting the classification rule in the training sample, and evaluating the performance of this classification rule in the test sample. A univariate measure of performance when the training sample is "fixed", as in this case, is called the conditional error [6]. Two measures of performance that are more informative than the conditional error are the receiver operating characteristic (ROC) curve and the relative utility (RU) curve. The ROC curve plots true versus false positive rates. The RU curve plots the maximum expected utility of prediction as a fraction of the utility of perfect prediction versus the risk threshold, which is the risk corresponding to indifference between harms and benefits [7,8].

Goal 2. Gene discovery
Gene discovery involves identifying those genes that contribute most to good classification by repeatedly randomly splitting the data into training and test samples, computing a distribution of ROC curves in the test samples to ascertain classification performance, and tabulating the most frequently selected genes in the training sample [9,10].

Classification Rule
Let j index gene, and k = 0 and 1 index class. The following quantities specify the classification rule: Let z hj denote the set of expression level of gene j in new specimen h, and let Z h = {z hj } denote the set of expression levels for specimen h. The distance from specimen h to the centroid of class k, based on gene set G, is where v jP = {(n 0 -1) v j0 + (n 1 -1) v j1 }/(n 0 + n 1 -2) is the pooled variance over the two classes. Thus D is the distance measure with D = 1 = Euclidean distance divided by the pooled variance, D = 2 = Euclidean distance divided by the class-specific variance.
Division by the variance ensures that the distance measure is not inappropriately weighted by genes with high average levels of expression. The score for combining distance measures is Thus S indicates the score formula, either a difference of squared distances or a fraction of the total of the distances. The classification rule for specimen h, which is based on the cutpoint u of the score, is Rule assign to class if Score assign to cl ( , , ) , Diagonal linear and quadratic discriminant analysis correspond to S = 1 with D = 1 and D = 2, respectively [2]. A classification rule with S = 2 and Euclidean distance was previously used to analyze microarrays [10] but without a discussion of its implications.

Swirls and Ripples
By setting equation (3) equal to various constants and plotting the solution, one can see that the score formula S = 2 yields a classification boundary that encircles a centroid and the score formula S = 2 yields a boundary of lines and curves ( Figure 1). These boundary shapes motivate the following terminology for the score formula, If the data in each class are normally distributed, Ripples is the optimal classification boundary if the correct variance (pooled or class-specific) is specified [3]. However if the data in each class are normally distributed with class-specific variances, and one specifies a pooled variance to avoid adding parameters, then Swirls can dramatically outperform Ripples ( Figure 2). The proposed classification rule selects either Swirls or Ripples, which increases flexibility without adding parameters.

Implementation
The proposed classification rule involves randomly splitting 70% of the data into a training sample and 30% into a test sample. Classification performance in the training sample is measured via the area under the ROC curve, denoted AUC, computed assuming a normal distribution of scores in each class. Details are provided in the Methods Section.
For each score formula and distance measure D, the classification rule selects a gene set G by first identifying the 50 highest ranking genes in terms of AUC of the score and, after starting with the highest ranking gene, successively including genes from these 50 highest ranking genes until there is little improvement classification performance. A greedy algorithm, which is sometimes called forward stepwise selection, successively adds the gene that most improves classification performance given the previously selected genes in the classification rule. For the greedy algorithm, the classification rule adds a gene only if the increase in AUC of the score is at least 0.02. A wrapper algorithm selects features by "wrapping around" (invoking) the full method of selecting a classification rule that uses both training and test samples, when these samples are nested within the training sample [16]. The wrapper algorithm randomly splits half the training sample data into training-training and training-test samples, which is repeated five times. On each random split, a greedy algorithm within the wrapper algorithm formulates a classification rule based on centroid set in the training-training sample and the gene expression levels in the training-test sample, successively adding a gene to the classification rule only if the increase in AUC of the score is at least 0.01. The wrapper algorithm selects the best performing classification rule among classification rules obtained in the five random splits. (Although the wrapper algorithm makes use of a greedy algorithm, reference to a greedy algorithm, unless otherwise noted, means the greedy algorithm not embedded in the wrapper algorithm). After the classification rule computes the gene set G as described above, for each score formula S the classification rule selects D = 1 if the increase in AUC with D = 2 is less than 0.02, and selects D = 2 otherwise. Finally the classification rule selects the score formula, S = 1 =Ripples or S = 2 = Swirls, that has the highest AUC.
Computations for Goal 1 are based on a distribution of ROC curves in the test sample computed from 20 bootstrap iterations. The RU curve is derived from the concave envelope of the mean ROC curve over the bootstrap iterations. Computations for Goal 2 are based on 100 random splits into training and test samples.

Simulation
Simulations are useful for investigating some aspects of classification rules, but one should not overly rely on their results because little is known about the true distributions of gene expression levels [17]. Here a simulation was used to investigate the ability to identify informative genes in a simple setting.  (Tables 1 and 2 and Figure 3). For Goal 2, the two informative genes were selected much more frequently than the non-informative genes (Table 3), as anticipated.

Data analysis
For Goal 1, the classification rules under both greedy and wrapper algorithms performed well in all data sets except for medulloblastoma (Tables 4 and 5 and Figure   Figure 2 Swirls and Ripples applied to data generated with D = 2. Table 1 Classification rules selected from simulated data using a greedy algorithm 4). For Goal 2, there was good classification in test samples obtained by random splits in all data sets except for medulloblastoma (not shown). The most frequently occurring genes among random splits of the training sample associated with good classification were desmin, zyxin, hepsin, and HLA class II. See Table 6.

Discussion
The reason for using AUC to measure classification performance in the training sample is that it can be computed quickly under a binormal assumption and is familiar to many researchers. The threshold increase in AUC to add a gene to the classification rule of 0.02 for the greedy algorithm and 0.01 for the wrapper algorithm represents a small improvement in performance relative to the range of AUC from 0.5 to 1.00. The specified threshold increase in AUC is smaller with the wrapper than with the greedy algorithm because splitting of the training sample into a training-training sample and training-test sample with the wrapper avoids overfitting, unlike the case with the greedy algorithm in which the entire training sample is used both gene selection and evaluation. Investigating various values for the threshold increase in AUC to determine an optimal threshold increase in AUC is not desirable in this setting because it would require the use of the test sample for both rule selection and evaluation, which could bias the results. Some centroid-based classifications of microarray data shrink centroids to the mean of the centroids and select genes based on soft thresholding [18]. However this procedure is not desirable for our goals because it "makes the class centroids look more similar to each other" [19] and typically selects many more genes than with our approach. Some classification rules are based on the connectivity of each gene in a network [20]. However this approach is not desirable for our goal of identifying the few genes most directly predictive of class as some highly connected genes may be selected due to multiple associations with many moderately predictive genes and not because they are highly predictive themselves.   With Goal 1 of rule discovery and testing for purposes of prediction, one should consider baseline clinical variables as well as microarray data when formulating a classification rule. Binary variables can be coded as 0 or 1. Ordered variables created from continuous variables, such as age categories, can be assigned the midpoint of each category. An ordered variable of low, medium, and high can be treated as two binary variables, one comparing low versus medium and high, and one comparing low and medium versus high. To evaluate the use of classification rules to stratify patients for treatment, in a new sample one could select patients with the highest class probabilities based on the classification rule and randomize them to treatment.
With Goal 2 of gene discovery, the most frequently occurring genes (desmin, zyxin, hepsin, and HLA class II) among random splits of the training sample in the four data sets with good classification performance in the test samples have an interesting connection to the tissue organization field theory of carcinogenesis. Tissue organization field theory postulates that a disruption of intercellular communication between the microenvironment and the cells in which cancer arises is the proximal cause of cancer [21][22][23]. In contrast the somatic mutation theory postulates that genetic alterations in the cells in which cancer arises are the proximal cause of cancer. Desimin is associated with pericytes, cells in the blood vessel walls, that have been implicated in foreign-body carcinogenesis [24], a phenomenon that likely involves disruption of intercellular communication [25]. Zxyin is associated with morphogenesis [26]. Hepsin mediates the digestion of extracellular matrix components in initial tumor growth [27]. Lastly HLA class II is a marker for tumor-infiltrating dendritic cells [28]. These genes involve changes in the tumor microenvironment, which is important to the development of cancer under the tissue organization field theory.

Conclusion
The proposed simple and flexible classification rule that select Swirls or Ripples after parsimoniously selecting genes and a distance measure is a good basis for either rule discovery and testing or gene discovery.

Methods
Computing the centroid set in the training sample Measuring classification performance using AUC Selection of the gene G, set distance measure D, and score formula S for the training sample involve the AUC under a binormal distribution [29] applied to the score,   and Φ denotes the cumulative normal distribution function.
Selecting the gene set, distance measure, score formula in the training sample For each score formula S and each distance measure D, the classification rule selects a gene set G. To simplify notation, let F TR{a,b} = (C TR ,G = {a,b}, D, S), where a and b refer to different genes. To make computations tractable, the classification rule starts with a preliminary filter that selects the 50 genes in the training sample with the highest values of AUCS(Z TR , F TR{j} ), where j indexes genes. Subsequent calculations involve these 50 genes.
The greedy algorithm selects the gene a with the highest AUCS(Z TR, F TR{a} ) and identifies the gene b with the highest AUCS(Z TR , F TR{a,b} ). If the increase in AUC is less than 0.02, G = {a}; otherwise G includes {a,b} and this procedure continues for additional genes.
The wrapper algorithm involves five random splits, each with 50% of the training sample constituting a training-training sample (TR:TR) for formulating the classification rule and 50% constituting a training-test sample (TR:TE) for computing AUC. The algorithm selects the gene a with the highest AUCS(Z TR:TR , F TR:TE{a} ) and identifies the gene b with the highest AUCS(Z TR:TR , F TR:TE{a, b} ). If the increase in AUC is less than 0.01, G = {a}; otherwise G includes {a,b} and this procedure continues additional genes. The wrapper selects the best classification rule, in terms of AUC, among the random splits.
For each S with G already selected, the classification rule selects D = 1 if the increase in AUC for D = 2 is less than 0.02 and D = 2 otherwise. For each S with D and G already selected, the classification rule selects S with the highest AUC.

Computing ROC and RU curves in the test sample
Let F TR denote the components of the final classification rule derived from the training sample. Let z TE(ijk) denote the gene For Goal 1, confidence intervals are computed by bootstrapping the data in the test sample 20 times. For each bootstrap sample, TPR at FPR = 0.1, 0.2, ..., 0.9 is computed via linear interpolation. The ROC curve plot for the bootstrap iterations consists of the mean ROC curve and upper and lower bounds based on the standard deviation of the ROC curves. An RU curve is computed from the concave envelope of the mean ROC curve, where the risk thresholds are derived from the slopes of the ROC curve. If the concave ROC curve has only one point between (0,0) and (1,1), there are insufficient data to compute a RU curve.

Availability and requirements
Project name: Swirl Genes listed in bold occur most frequently and are discussed in the text.
Project homepage: http://prevention.cancer.gov/programs-resources/ groups/b/software/swirls Programming language: Mathematica 7.0 [30] Disclaimer: This code is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the National Cancer Institute or the individual developers be liable for any claim, damages or other liability of any kind. Use of this code by recipient is at recipient's own risk. The National Cancer Institute makes no representations that the use of the code will not infringe any patent or proprietary rights of third parties.
Reproducibility. Functions are provided to reproduce calculations.
Inputs: The user needs to specify (a) the gene expression data, consisting of two matrices, one for each class, with rows corresponding to genes and columns corresponding to specimens, (b) a list of gene names, the name of the data set, and names of the two classes, and (c) the class probabilities in the target population, if different from study population, for computing the RU curve.