Multiclass classification of microarray data samples with a reduced number of genes
- Elizabeth Tapia^{1, 2},
- Leonardo Ornella^{1},
- Pilar Bulacio^{1, 2} and
- Laura Angelone^{1, 2}
DOI: 10.1186/1471-2105-12-59
© Tapia et al; licensee BioMed Central Ltd. 2011
Received: 21 July 2010
Accepted: 22 February 2011
Published: 22 February 2011
Abstract
Background
Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes is increased. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates about the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained.
Results
A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples.
Conclusions
A comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples.
Background
A number of multiclass classification methods for microarray data have been developed in recent years [1, 2]. However, their ability to scale well with the number of classes and to provide accurate and sparse multiclass classification models essentially free of model selection bias remains a challenging issue [3, 4]. Sparse multiclass classification models of microarray data samples are useful; they involve a reduced number of input genes and thus are easy to compute with and to interpret [5].
In this paper, a new gene selection method valid for binary mediated multiclass classification approaches of microarray data samples and able to implicitly model a gene selection sparsity constraint is presented. We rely on output coding [6] methods, which allow the binary reduction of M-multiclass classification into n binary classification tasks. We assume a model of independent genes, independent binary classifiers and a principle of information content equipartition among binary classifiers to derive a bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification approaches of microarray data samples. The derived bound scales with the inverse of n, thus providing a way to tackle the computational complexity of finding accurate and sparse multiclass classification models of microarray data samples: just increase the number n of binary classifiers and perform bounded optimum gene selection on lists of predictive genes for individual binary classifiers. In other words, the blessing face of dimensionality might be a solution to the problem of accurate and sparse multiclass classifiers of microarray data samples; we just need to guarantee the induction of a large number n of independent binary classifiers. However, the induction of a large number n of independent binary classifiers by means of output coding methods may be hard to achieve when training data is scarce, as in microarray data analysis. Hence, we may be forced to accept the best n with regard to the key independence factor [7, 8] of general output coding methods. Only if the best n is sufficiently large will the design of accurate and sparse multiclass classifiers of microarray data samples be feasible.
Output coding embodies the design of well-known One Against All (OAA) [9] multiclass classifiers, which divide M-multiclass classification problems into n = M binary classification tasks, each binary task dealing with the problem of discriminating a given class against the others. A further generalization of OAA classifiers leads to the design of Error Correcting Output Coding (ECOC) classifiers [10, 11], which divide M-multiclass classification problems into n binary classification tasks, n being determined by the size of some error correcting code. ECOC classifiers can then be used to explore the feasibility of accurate and sparse multiclass classifiers of microarray data samples by letting n approach infinity. In this paper, the recently introduced [12] class of ECOC classifiers based on LDPC codes [13] is considered. Hence, ECOC classifiers based on LDPC codes of size n up to ⌈15·log_{2}M⌉ and OAA classifiers of size n = M are evaluated. For OAA as well as ECOC classifiers, binary linear Support Vector Machine (SVM) [14] classifiers are assumed. For the purpose of selecting the most important genes at core SVMs, univariate ranking information [15] based on the widely used S2N metric [16–18] is assumed. Using the above setting, a complete experimental protocol is presented for the design of accurate and sparse multiclass classifiers for microarray data samples essentially free of model selection bias [19–22]. Our approach is evaluated on 8 benchmark microarray datasets. Experimental results confirm the feasibility of our proposed method.
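As a rough illustration of the output coding idea described above, the following sketch builds the OAA code matrix for M classes and decodes a vector of binary predictions by minimum Hamming distance. The function names `oaa_code_matrix` and `hamming_decode` are illustrative and do not come from the paper, and Hamming decoding is an assumption made here for simplicity; the experiments reported in this paper rely on soft decoding instead.

```python
def oaa_code_matrix(num_classes):
    """Return the M x M One Against All code matrix with entries in {+1, -1}.

    Row i is the codeword of class i; column j defines binary task j
    (class j against all the others).
    """
    return [[1 if i == j else -1 for j in range(num_classes)]
            for i in range(num_classes)]


def hamming_decode(code_matrix, predictions):
    """Return the index of the class whose codeword is closest to the predictions."""
    def distance(codeword):
        # Number of binary tasks on which codeword and predictions disagree.
        return sum(1 for c, r in zip(codeword, predictions) if c != r)
    return min(range(len(code_matrix)), key=lambda i: distance(code_matrix[i]))


if __name__ == "__main__":
    matrix = oaa_code_matrix(4)
    # Only binary classifier 2 answers +1, so class 2 is the closest codeword.
    print(hamming_decode(matrix, [-1, -1, 1, -1]))
```

An ECOC classifier would use the same decoding interface with a code matrix of n > M columns taken from an error correcting code.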
Results and Discussion
An upper bound on the number of genes per binary classifier
How much information can a set of p independent genes convey about a set of M phenotypes? Being aware of such a fundamental limitation could be crucial in the design of accurate and sparse multiclass classifiers of microarray data samples. Let S be a microarray dataset comprising q samples from M ≥ 3 classes, each sample defined by the gene expression measurements of p genes (p ≫ q). Hence, the average information content per class sample in S can be upper bounded by H_{ M } = log_{2}M.
Bounded optimum S2N gene selection
For a fixed n, we now face the problem of finding the optimum number of genes in the list of the top p · Q_{max}(n) most discriminative genes for each binary classifier. Such an optimum will follow from a partial search scheme and thus, we provide no guarantee of identifying the optimal gene set [25]. But as n increases, finding such an optimum implies finding a sparse representation of a high dimensional feature space from a small number of training samples. Because sparsity is a key structural property of most genomic studies involving disease classification, we conjecture that the proposed gene selection method could indeed be a solution to the problem of designing accurate and sparse multiclass classifiers of microarray data samples.
Letting n approach infinity cannot be realized in practice. Hence, some bounded exploration of the n dimension must be assumed in advance. In this paper, the exploration of the n dimension from n_{min} = ⌈log_{2}M⌉ + 2 up to n_{max} = ⌈15·log_{2}M⌉ is considered. Notice that n = ⌈log_{2}M⌉ + 1 is not considered; it would entail the use of parity codes only able to detect (but not correct) binary classifier errors. For practical n ranges, the exhaustive exploration of the p · Q_{max}(n) most important genes for each binary classifier may still be too computationally demanding. Thus, a multi-scale resolution approach for the Q dimension was devised. Firstly, the Q dimension was coarsely quantized with a base 10 logarithmic scale, i.e., Q ∈ [0.001, 0.01, 0.1, 1] was assumed. Secondly, each logarithmic segment, except the last one, was linearly quantized into 10 equal parts; the last logarithmic segment was quantized into 100 equal parts. Finally, genes at each binary classifier were ranked according to their S2N value (see Methods for details) with respect to the response variable and mapped to the formerly quantized Q dimension for further selection. As a result, for a fixed computational budget, more computational effort can be put into the exploration of highly discriminative genes, i.e., top ranking genes, than into those of poor discriminative power.
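The bounded exploration just described can be sketched in a few lines. The helper names `n_range` and `q_grid` are hypothetical, the `eta` parameter simply generalizes the factor 15 (reading the η of the result tables as this factor is an assumption), and including Q = 0.001 as the first grid point is likewise an assumption of this sketch.

```python
import math


def n_range(num_classes, eta=15):
    """Return (n_min, n_max) = (ceil(log2 M) + 2, ceil(eta * log2 M))."""
    bits = math.log2(num_classes)
    return math.ceil(bits) + 2, math.ceil(eta * bits)


def q_grid():
    """Return the quantized Q values in increasing order.

    The segments [0.001, 0.01] and [0.01, 0.1] are split into 10 equal parts
    and the last segment [0.1, 1] into 100 equal parts.
    """
    grid = [0.001]
    for low, high, parts in [(0.001, 0.01, 10), (0.01, 0.1, 10), (0.1, 1.0, 100)]:
        step = (high - low) / parts
        # round() removes floating point noise at the segment boundaries.
        grid.extend(round(low + k * step, 10) for k in range(1, parts + 1))
    return grid


if __name__ == "__main__":
    print(n_range(14))    # range of n explored for M = 14 classes
    print(len(q_grid()))  # number of candidate Q values
```

Under these assumptions the grid is finest, relative to the value of Q, near the top of the ranking, which is where the paper directs most of the search effort.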
Results on Real Data
We first note that the application of the Shapiro-Wilk test to the empirical distributions of performance measures (classification error, overall fraction of selected genes and gene selection stability) of either ECOC or OAA classifiers frequently rejected the null hypothesis of normally distributed data at the α = 0.05 level of significance, thus justifying the use of the non-parametric Kolmogorov-Smirnov (KS) and Mann-Whitney (MW) U tests.
The classification performance of OAA and ECOC classifiers
Dataset | M | n | Error-ECOC (F) | Error-OAA (G) | p-value^{a} (F ≠ G) | p-value^{a} (F < G) | p-value^{a} (MW) |
---|---|---|---|---|---|---|---|
200 Montecarlo 4:1 train-test partitions at η = 5 | |||||||
Lymphoma | 3 | NA | NA | 0 | NA | NA | - |
SRBCT | 4 | 9 | 0 | 0 | 0.00437 | 0.00219 | 0.99682 |
Brain | 5 | 9 | 0.1250 | 0.1250 | 0.98741 | - | - |
NCI60 | 8 | 9 | 0.3077 | 0.2308 | 0.02222 | 0.01111 | 0.99682 |
Staunton | 9 | 12 | 0.4615 | 0.4615 | 0.71123 | - | - |
GCM RM | 11 | 11 | 0 | 0 | 0.39273 | - | - |
Su | 11 | 13 | 0.0857 | 0.0857 | 0.92282 | - | - |
GCM | 14 | 12 | 0.3625 | 0.2863 | 9.99e-16 | 4.76e-16 | 1 |
200 Montecarlo 4:1 train-test partitions at η = 10 | |||||||
Lymphoma | 3 | 11 | 0 | 0 | 0.98741 | - | - |
SRBCT | 4 | 9 | 0 | 0 | 0.00307 | 0.00153 | 0.99999 |
Brain | 5 | 15 | 0.1250 | 0.1250 | 0.99970 | - | - |
NCI60 | 8 | 14 | 0.3077 | 0.2308 | 0.00213 | 0.00106 | 0.99996 |
Staunton | 9 | 19 | 0.4615 | 0.4615 | 0.79201 | - | - |
GCM RM | 11 | 12 | 0 | 0 | 0.79201 | - | - |
Su | 11 | 17 | 0.0857 | 0.0857 | 0.32750 | - | - |
GCM | 14 | 12 | 0.3624 | 0.2863 | 9.99e-16 | 4.76e-16 | 1 |
200 Montecarlo 4:1 train-test partitions at η = 15 | |||||||
Lymphoma | 3 | 11 | 0 | 0 | 0.98741 | - | - |
SRBCT | 4 | 9 | 0 | 0 | 0.00307 | 0.00153 | 0.99999 |
Brain | 5 | 18 | 0.125 | 0.125 | 0.99999 | - | - |
NCI60 | 8 | 16 | 0.3077 | 0.2308 | 0.00045 | 0.00022 | 0.99999 |
Staunton | 9 | 19 | 0.4615 | 0.4615 | 0.62717 | - | - |
GCM RM | 11 | 12 | 0 | 0 | 0.96394 | - | - |
Su | 11 | 17 | 0.0857 | 0.0857 | 0.46532 | - | - |
GCM | 14 | 12 | 0.3666 | 0.2863 | < 2.2e-16 | < 2.2e-16 | 1 |
The overall number of genes selected by OAA and ECOC classifiers
Dataset | M | n | B-ECOC | B-OAA | G-ECOC (F) | G-OAA (G) | p-value^{a} (F ≠ G) | p-value^{a} (F < G) | p-value^{a} (MW) |
---|---|---|---|---|---|---|---|---|---|
200 Montecarlo 4:1 train-test partitions at η = 5 | |||||||||
Lymphoma | 3 | NA | NA | 4 | NA | 22 | NA | NA | NA |
SRBCT | 4 | 9 | 14.22 | 6 | 37 | 23 | < 2.2e-16 | < 2.2e-16 | 1 |
Brain | 5 | 9 | 28.1 | 19 | 177 | 109.5 | 5.08e-05 | 2.54e-05 | 0.99975 |
NCI60 | 8 | 9 | 45.11 | 34 | 310 | 326 | 9.31e-07 | 0.27804 | 0.07651 |
Staunton | 9 | 12 | 46 | 34.11 | 387 | 296 | 9.91e-08 | 4.95e-08 | 0.99993 |
GCM RM | 11 | 11 | 142 | 36 | 800 | 365.5 | < 2.2e-16 | 2.76e-08 | 1 |
Su | 11 | 13 | 126 | 62 | 1056 | 916 | 5.36e-12 | 1.15e-24 | 0.99978 |
GCM | 14 | 12 | 322 | 128 | 2096 | 1406 | < 2.2e-16 | < 2.2e-16 | 1 |
200 Montecarlo 4:1 train-test partitions at η = 10 | |||||||||
Lymphoma | 3 | 11 | 4.27 | 4 | 12 | 22 | 5.52e-08 | 1 | 9.85e-09 |
SRBCT | 4 | 9 | 12.22 | 6 | 33 | 23 | < 2.2e-16 | < 2.2e-16 | 1 |
Brain | 5 | 15 | 16.16 | 19 | 109.5 | 109.5 | 0.03970 | 0.01984 | 0.54495 |
NCI60 | 8 | 14 | 42.12 | 39 | 286.5 | 326 | 9.31e-07 | 0.95599 | 0.00105 |
Staunton | 9 | 19 | 40.03 | 34.11 | 381.5 | 296 | 6.95e-10 | 3.48e-10 | 0.99997 |
GCM RM | 11 | 12 | 72 | 36 | 570 | 365.5 | < 2.2e-16 | 1.66e-19 | 1 |
Su | 11 | 17 | 112 | 62 | 940 | 916 | 1.82e-10 | 9.11e-11 | 0.98387 |
GCM | 14 | 12 | 322 | 128 | 2078 | 1406 | < 2.2e-16 | < 2.2e-16 | 1 |
200 Montecarlo 4:1 train-test partitions at η = 15 | |||||||||
Lymphoma | 3 | 11 | 4.26 | 4 | 12 | 22 | 3.05e-08 | 1 | 3.85e-09 |
SRBCT | 4 | 9 | 12.22 | 6 | 33 | 23 | < 2.2e-16 | < 2.2e-16 | 1 |
Brain | 5 | 18 | 16.06 | 19 | 105 | 109.5 | 0.03970 | 0.01984 | 0.15586 |
NCI60 | 8 | 16 | 36.15 | 39 | 251 | 326 | 9.31e-07 | 1 | 3.23e-05 |
Staunton | 9 | 19 | 34.09 | 34.11 | 373.5 | 296 | 4.81e-09 | 2.41e-09 | 0.99989 |
GCM RM | 11 | 12 | 72 | 36 | 561 | 365.5 | < 2.2e-16 | 1.66e-19 | 1 |
Su | 11 | 17 | 112 | 62 | 924.5 | 916 | 1.34e-09 | 6.69e-10 | 0.97006 |
GCM | 14 | 12 | 322 | 128 | 2066 | 1406 | < 2.2e-16 | < 2.2e-16 | 1 |
The stability of gene selection attained by OAA and ECOC classifiers
Dataset | M | n | S-ECOC (F) | S-OAA (G) | p-value^{a} (F ≠ G) | p-value^{a} (F > G) | p-value^{a} (MW) |
---|---|---|---|---|---|---|---|
200 Montecarlo 4:1 train-test partitions at η = 5 | |||||||
Lymphoma | 3 | NA | NA | 0.5539 | NA | NA | NA |
SRBCT | 4 | 9 | 0.6835 | 0.5652 | < 2.2e-16 | 0.99979 | < 2.2e-16 |
Brain | 5 | 9 | 0.4643 | 0.4315 | < 2.2e-16 | 0.02363 | < 2.2e-16 |
NCI60 | 8 | 9 | 0.4313 | 0.4365 | < 2.2e-16 | < 2.2e-16 | 1 |
Staunton | 9 | 12 | 0.4129 | 0.4119 | < 2.2e-16 | < 2.2e-16 | 0.73628 |
GCM RM | 11 | 11 | 0.6043 | 0.6143 | < 2.2e-16 | < 2.2e-16^{b} | < 2.2e-16 |
Su | 11 | 13 | 0.6286 | 0.5461 | < 2.2e-16 | 0.99594 | < 2.2e-16 |
GCM | 14 | 12 | 0.6783 | 0.5886 | < 2.2e-16 | 1 | < 2.2e-16 |
200 Montecarlo 4:1 train-test partitions at η = 10 | |||||||
Lymphoma | 3 | 11 | 0.6093 | 0.5539 | < 2.2e-16 | 1 | < 2.2e-16 |
SRBCT | 4 | 9 | 0.6745 | 0.5652 | < 2.2e-16 | 1 | < 2.2e-16 |
Brain | 5 | 15 | 0.4582 | 0.4315 | < 2.2e-16 | 0.00213^{b} | < 2.2e-16 |
NCI60 | 8 | 14 | 0.4234 | 0.4365 | < 2.2e-16 | < 2.2e-16 | 1 |
Staunton | 9 | 19 | 0.4185 | 0.4119 | < 2.2e-16 | < 2.2e-16 | 5.93e-07 |
GCM RM | 11 | 12 | 0.6112 | 0.6143 | < 2.2e-16 | 6.83e-08^{b} | < 2.2e-16 |
Su | 11 | 17 | 0.6423 | 0.5461 | < 2.2e-16 | 0.99154 | < 2.2e-16 |
GCM | 14 | 12 | 0.6650 | 0.5886 | < 2.2e-16 | 0.42216 | < 2.2e-16 |
200 Montecarlo 4:1 train-test partitions at η = 15 | |||||||
Lymphoma | 3 | 11 | 0.6093 | 0.5539 | < 2.2e-16 | 1 | < 2.2e-16 |
SRBCT | 4 | 9 | 0.6740 | 0.5652 | < 2.2e-16 | 1 | < 2.2e-16 |
Brain | 5 | 18 | 0.4591 | 0.4315 | < 2.2e-16 | 0.00165^{b} | < 2.2e-16 |
NCI60 | 8 | 16 | 0.4170 | 0.4365 | < 2.2e-16 | < 2.2e-16 | 1 |
Staunton | 9 | 19 | 0.4168 | 0.4119 | < 2.2e-16 | < 2.2e-16 | 0.02409 |
GCM RM | 11 | 12 | 0.6124 | 0.6143 | < 2.2e-16 | 8.46e-05^{b} | < 2.2e-16 |
Su | 11 | 17 | 0.6405 | 0.5461 | < 2.2e-16 | 0.99154 | < 2.2e-16 |
GCM | 14 | 12 | 0.6578 | 0.5886 | < 2.2e-16 | 0.03809^{b} | < 2.2e-16 |
The performance of OAA and ECOC classifiers on train-test partitions
Dataset | M | n | G-ECOC | G-OAA | Error-ECOC | Error-OAA |
---|---|---|---|---|---|---|
η = 5, 10, 15 | ||||||
GCM RM | 11 | 10 | 926 | 1260 | 0.1852 | 0.1852 |
GCM | 14 | 20 | 1314 | 423 | 0.4782 | 0.3043 |
Conclusions
The divide and conquer approach to the design of multiclass classifiers for microarray data samples which we have presented offers the hope that accurate and sparse multiclass classifiers can be constructed without incurring undesirable forms of gene selection bias hidden in the selection of optimal gene subsets of restricted or unrestricted size [26]. Generalized binary reductions of M-multiclass classification problems into n binary classification tasks and bounded explorations of the resulting gene spaces are advised to accomplish this objective. At each binary classifier, the maximum number of genes that can be selected scales with the inverse of n, thus providing a way to accomplish optimum gene selection at affordable computational costs, provided n is sufficiently large.
In this paper, the power of OAA and ECOC binary reductions in the design of accurate and sparse multiclass classifiers for microarray data samples has been evaluated. Without loss of generality, we have restricted ourselves to the class of ECOC classifiers based on LDPC codes, linear SVM binary classifiers and univariate S2N gene selection. Experimental results show that dimensionality exchange between input and output domains of binary mediated multiclass classifiers of microarray data samples is indeed possible: the larger the size of candidate ECOC classifiers, the greater the chance of selecting smaller sets of genes. Although promising, the dimensionality reduction performance exhibited by ECOC (LDPC) classifiers is not enough to definitively improve on naive OAA classifiers, which remain the best practical option.
From an overall view, experimental results suggest that improving the dimensionality reduction ratio of OAA classifiers with ECOC classifiers may not be as easy as it seems. We note, however, that a consensus approach to gene selection and classification on a set of diverse ECOC classifiers under bounded optimum gene selection could finally boost their dimensionality reduction factor beyond that of OAA classifiers. Briefly, provided individual ECOC solutions are good enough compared to OAA classifiers, a consensus approach to gene selection on a set of diverse ECOC classifiers should preserve the most relevant genes and reject a great proportion of irrelevant ones. Since ECOC classifiers based on LDPC codes seem to be closely related neighbors of their OAA counterparts, this hypothesis will be the focus of future research. Finally, further dimensionality reduction improvements may still be attainable with more elaborate forms of gene selection such as SVM-RFE [27].
Overall, our results provide evidence that bounded optimum gene selection in high dimensional binary output domains induced by either OAA or ECOC classifiers may be a solution for the problem of accurate multiclass classification of microarray data samples based on a reduced number of genes.
Methods
A key problem with conventional ECOC classifiers based on random codes is that randomness inhibits the systematic control of independence between binary classifiers as n approaches infinity. A possible way to overcome this problem is to construct large ECOC classifiers from a number of small ECOC classifiers connected via shared binary classifiers. Small constituent ECOC classifiers able to locally control the key independence factor regardless of the size n of the overall ECOC classifier can be easily designed, for example with simple parity codes. Provided the connectivity profile of constituent ECOC classifiers and binary classifiers remains sparse, the overall ECOC design can be nicely interpreted in terms of the design of LDPC codes.
Briefly, LDPC codes are linear block codes obtained from sparse random bipartite graphs subject to sparsity constraints allowing a divide and conquer interpretation of generated ECOC classifiers [12]. Let G be a bipartite graph with n left nodes (called message nodes) and m right nodes (called check nodes). If the n message nodes are associated with the n coordinates of codewords c, defined as those vectors (c_{1},..., c_{n}) satisfying the constraint that, for every check node, the sum of its neighboring positions among the message nodes is zero, then G models a linear code of size n which can protect at least k = n - m bits of information and whose structure can be dissected into m simple parity codes. In addition, if the connectivity profile of G is sparse, i.e., each codeword bit is constrained by j ≪ m parity codes and each parity code constrains u ≪ n codeword bits, then the corresponding linear code turns out to be an LDPC code. The sparsity of the graph structure is a key property in the design of efficient LDPC decoding algorithms for a variety of channel models. A channel model subsumes our prior knowledge about the statistics of binary errors. In this paper, the iterative message passing decoding algorithm described in [13] for the Additive White Gaussian Noise channel is used. A factor graph [32] model of a typical LDPC code is shown in Figure 2. The construction of ECOC classifiers based on LDPC codes is straightforward once the bipartite graph model of the underlying LDPC code is given. In factor graph terms, we just need to associate left message nodes with ideal binary classifier predictions c_{i} and right check nodes with constituent ECOC classifiers constructed from simple parity codes. To complete the factor graph model of an ECOC-LDPC classifier, message nodes r_{i} modeling practical binary classifier predictions and check nodes f_{i} modeling prior statistical knowledge about pairs (c_{i}, r_{i}) ("channel functions") must be introduced.
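The parity-check view of a bipartite graph code can be made concrete with a toy example. The sketch below is not an LDPC construction from the paper: the graph, its size and the helper names `is_codeword` and `enumerate_codewords` are illustrative assumptions, chosen only to show that a word is a codeword when every check node sees an even number of ones, and that at least 2^(n - m) codewords exist.

```python
from itertools import product


def is_codeword(checks, word):
    """Return True if every parity check sums to zero modulo 2.

    `checks` lists, for each check node, the indices of its neighboring
    message nodes; `word` is a sequence of bits.
    """
    return all(sum(word[i] for i in check) % 2 == 0 for check in checks)


def enumerate_codewords(checks, n):
    """Brute-force enumeration of all codewords of length n (small n only)."""
    return [w for w in product((0, 1), repeat=n) if is_codeword(checks, w)]


if __name__ == "__main__":
    # Toy graph: n = 6 message nodes, m = 3 check nodes, 3 edges per check node.
    checks = [(0, 1, 2), (2, 3, 4), (4, 5, 0)]
    words = enumerate_codewords(checks, 6)
    # The code protects at least k = n - m = 3 bits of information.
    print(len(words) >= 2 ** (6 - 3))
```

Brute-force enumeration is only practical for tiny n; it stands in here for the iterative message passing decoder used in the paper.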
A request for an ECOC prediction on a set of input features x starts with the computation of a corrupted codeword r(x) by the set of n binary classifiers. Assuming a suitable channel model specified by check nodes f_{i}, the corrupted codeword r(x) is given to an iterative message passing decoding algorithm for the computation of a hopefully good estimate $\widehat{c}(x)$ of the unknown codeword c(x) encoding the unknown class label y associated with x. Remarkably, the computation of $\widehat{c}(x)$ can be fully described as a message passing algorithm over the ECOC-LDPC factor graph. In addition to convenient graphical $\widehat{c}(x)$ computation, ECOC-LDPC factor graphs also allow for seamless integration of general bounded gene selection strategies. We just need to add message nodes x_{k}, k = 1,..., p, modeling gene expression behavior, check nodes L_{i}, i = 1,..., n, modeling practical binary classifiers and a sparse connectivity profile ensuring that at each L_{i} the number v of incident edges (selected genes) is no more than $p\cdot Q_{max}\approx \frac{\log_{2}M}{n\cdot H\left(\frac{1}{p}\right)}$, in agreement with Eq. 2.
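To get a feeling for the size of this bound, the expression p · Q_max ≈ log2(M) / (n · H(1/p)) can be evaluated numerically. The sketch below assumes that H denotes the binary entropy function measured in bits; this reading, as well as the function names, is an assumption and is not confirmed by the text above.

```python
import math


def binary_entropy(x):
    """Binary entropy H(x) in bits, for 0 < x < 1."""
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def max_genes_per_classifier(num_classes, num_genes, n):
    """Approximate p * Q_max = log2(M) / (n * H(1/p))."""
    return math.log2(num_classes) / (n * binary_entropy(1.0 / num_genes))


if __name__ == "__main__":
    # M = 14 classes and p = 16063 genes, as in the GCM dataset.
    for n in (6, 12, 24):
        print(n, max_genes_per_classifier(14, 16063, n))
```

Doubling n halves the bound, which is the inverse scaling with n that motivates the use of large output codes.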
Microarray Datasets
Eight cancer microarray datasets were used in the evaluation of binary mediated multiclass classification with bounded optimum S2N gene selection. The Lymphoma dataset [33] consists of 62 samples of a specialized cDNA chip spanning M = 3 subtypes of Diffuse large B-cell lymphoma, each sample defined by the expression of p = 4026 genes. Samples in the Lymphoma dataset are highly imbalanced: 42 samples of diffuse large B-cell lymphoma, 9 of follicular lymphoma and 11 of chronic lymphocytic leukemia. Original data is available at http://llmpp.nih.gov/lymphoma/data/figure1. In this study, a preprocessed dataset version compiled by [34] based on [35] was used.
The Small Round Blue Cell Tumors (SRBCT) dataset [36] consists of 63 samples of a specialized cDNA chip spanning M = 4 subtypes of small round blue cell tumors of childhood, each sample defined by the expression of p = 2308 genes. Samples are distributed as follows: 12 samples of neuroblastoma, 20 samples of rhabdomyosarcoma, 8 samples of non-Hodgkin lymphoma and 23 samples of the Ewing family of tumors. In this study, a preprocessed dataset version available at http://research.nhgri.nih.gov/microarray/Supplement/index.html was used.
The Brain dataset [37] consists of 42 samples of the Affymetrix HuGeneFL chip spanning M = 5 tumor classes of the central nervous system, each sample defined by the expression of p = 5597 genes. Samples are distributed as follows: 10 medulloblastomas, 10 malignant gliomas, 10 atypical teratoid/rhabdoid tumors (AT/RTs), 8 primitive neuro-ectodermal tumors (PNETs) and 4 human cerebella. In this study, the original dataset version (Dataset A) was used. Expression values based on average difference units were computed using the Affymetrix GENECHIP MAS 4.0 analysis software. This dataset is available at http://www.broadinstitute.org/mpr/CNS/.
The NCI60 dataset [35] consists of 61 samples of a specialized cDNA chip spanning M = 8 tumor classes, each sample defined by the expression of p = 5244 genes. Samples are distributed as follows: 7 breast, 5 central nervous system, 7 colon, 6 leukemia, 8 melanoma, 9 non-small cell lung carcinoma, 6 ovarian and 9 renal tumors. Original data is available at http://genome-www.stanford.edu/nci60. In this study, a preprocessed dataset version compiled by [34] based on [35] was used.
The Staunton dataset [38] consists of 60 samples of the Affymetrix Hu6800 chip spanning M = 9 classes of tumors, each sample defined by the expression of p = 5726 genes. Expression values based on average difference units were computed using the Affymetrix GENECHIP MAS 4.0 analysis software. In this study, a preprocessed dataset version compiled by [1] involving the rescaling of gene expression measurements to the interval [0, 1] was used. This dataset is available at http://www.gems-system.org/.
The Su dataset [39] consists of 174 samples of the Affymetrix U95a chip spanning M = 11 classes of tumors, each sample defined by the expression values of p = 12533 genes. Expression values based on average difference units were computed using the Affymetrix GENECHIP MAS 4.0 analysis software. In this study, a preprocessed dataset version compiled by [1] involving the rescaling of gene expression values to the interval [0, 1] was used. This dataset is available at http://www.gems-system.org/.
The GCM dataset [18] consists of 190 samples of the Affymetrix Hu6800 and Hu35K chips spanning M = 14 classes of primary tumors, each sample defined by the expression values of p = 16063 genes. Expression values based on average difference units were computed using the Affymetrix GENECHIP MAS 4.0 analysis software. This dataset, which comes with a public train-test partition involving q = 144 samples for training and 46 for test, is available at http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi.
The GCM RM dataset [40] consists of 123 samples of the Affymetrix Hu6800 chip spanning M = 11 classes of tumors, each sample defined by the expression values of p = 7129 genes. This dataset was derived from the GCM dataset with the purpose of improving multiclass classification with variability estimates of repeated gene expression measurements. Hence, expression values were computed with the more robust log scale multi-array analysis (RMA) measure. This dataset, which comes with a public train-test partition involving q = 96 samples for training and 27 for test, is available at http://expression.washington.edu/publications/kayee/shrunken_centroid/.
Experimental Protocol
Optimum bounded gene selection over OAA and ECOC multiclass classifiers based on linear SVMs was evaluated on 8 publicly available microarray datasets (M ∈ {3, 4, 5, 8, 9, 11, 14}). Aiming at a systematic evaluation of the n dimension, we restricted ourselves to the class of ECOC classifiers based on LDPC codes. For both OAA and ECOC classifiers, binary classifier decisions were fused by means of soft-decoding techniques. Hence, OAA classifiers based on hinge loss decoding of SVM outputs and ECOC classifiers based on LDPC codes able to perform soft iterative decoding of SVM outputs were used. Owing to the constraint p ≫ q, which highly limits the diversity between induced binary classifiers, just one iterative decoding loop was allowed. The Java Weka library version 3.4.10 [41] was used to provide the implementations of OAA multiclass and binary linear SVM classifiers. An extension of the Weka library was developed to implement ECOC classifiers based on LDPC codes and bounded optimum gene selection for both OAA and ECOC classifiers.
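Hinge loss decoding of real-valued SVM outputs can be sketched as follows, in the spirit of the loss-based decoding framework of [6]. This is a generic illustration in Python rather than the Weka-based implementation used in the experiments; the function name `hinge_loss_decode` and the example values are assumptions.

```python
def hinge_loss_decode(code_matrix, outputs):
    """Pick the class whose codeword minimizes the total hinge loss.

    `code_matrix[i][j]` is the +1/-1 target of class i on binary task j and
    `outputs[j]` is the real-valued output of binary classifier j.
    """
    def total_loss(codeword):
        # Hinge loss max(0, 1 - c * f) summed over all binary tasks.
        return sum(max(0.0, 1.0 - c * f) for c, f in zip(codeword, outputs))
    return min(range(len(code_matrix)), key=lambda i: total_loss(code_matrix[i]))


if __name__ == "__main__":
    # OAA code matrix for M = 3 classes.
    oaa = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
    # No classifier is confident, but task 1 has the largest margin for +1.
    print(hinge_loss_decode(oaa, [-0.2, 0.1, -1.5]))
```

Unlike Hamming decoding, this rule uses the magnitude of each SVM output, so weakly confident binary decisions contribute less to the final class assignment.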
Assessing the classification performance
The classification performance of OAA and ECOC multiclass classifiers was evaluated by means of a randomized strategy. Based on [42] and [35], 200 Montecarlo 4:1 ($\frac{4}{5}$ for training and $\frac{1}{5}$ for testing) partitions of available data were considered. For those datasets with a public train-test partition, the specific train-test evaluation was additionally performed. The following performance metrics were considered: the test error rate, the number of binary classifiers, the number of genes per binary classifier, the overall number of selected genes and the stability of gene selection. Briefly, stability of gene selection measures how closely multiple classification models resemble one another; models may be close to each other in terms of error, but can be distant in terms of their forms (the identity of selected genes) [43]. Thus, stability of gene selection is an important requirement for ensuring reliable conclusions in microarray data analysis [44, 45]. Stability of gene selection with respect to changes in the training data was measured by means of Salton's cosine coefficient [46]. Let A_{i} and A_{j} respectively denote the sets of genes selected by classifier A in partitions i and j, i ≠ j. Hence, the similarity between sets A_{i} and A_{j} according to Salton's coefficient is given by $\frac{|A_{i}\cap A_{j}|}{\sqrt{|A_{i}|}\cdot \sqrt{|A_{j}|}}$, i.e., the number of genes in both A_{i} and A_{j} divided by the product of the square roots of the number of genes in A_{i} and in A_{j}. Using 200 random train-test partitions leads to 200 · 199/2 = 19900 pairwise similarity measurements from which the mean stability of gene selection can be reported.
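The stability measurement described above can be sketched as follows. The function names `salton` and `mean_stability` are illustrative, and returning 0 for an empty gene set is a convention of this sketch that the paper does not discuss.

```python
import math
from itertools import combinations


def salton(a, b):
    """Salton's cosine coefficient between two sets of selected genes."""
    if not a or not b:
        return 0.0  # convention of this sketch for empty selections
    return len(a & b) / math.sqrt(len(a) * len(b))


def mean_stability(selections):
    """Mean Salton coefficient over all unordered pairs of gene selections."""
    pairs = list(combinations(selections, 2))
    return sum(salton(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Gene sets selected on three hypothetical train-test partitions.
    runs = [{"g1", "g2", "g3"}, {"g2", "g3", "g4"}, {"g1", "g2", "g3"}]
    print(mean_stability(runs))
```

With 200 partitions, `combinations` yields the 200 · 199/2 pairs mentioned in the text.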
Searching the best parameters
Genes at each binary classifier were ranked with the S2N metric [16], $S2N(j)=\frac{\mu(j)_{+}-\mu(j)_{-}}{\sigma(j)_{+}+\sigma(j)_{-}}$, where μ(j)_{+}, μ(j)_{-} and σ(j)_{+}, σ(j)_{-} denote the means and standard deviations of the j-th gene in positive and negative examples in the current (binary) training set. The g most important genes under the S2N metric are defined as the first g/2 and the last g/2 genes in the ranked list of genes. For a fixed number n of binary classifiers, optimum bounded gene selection requires the estimation of the optimum number of genes g(n), or its fractional equivalent $Q(n)=\frac{g(n)}{p}$, in the list of the p · Q_{max}(n) most important genes. Such a threshold can be estimated by a nested 10-fold CV loop in the current training set using the multiscale resolution approach described in the Results section. The process must be repeated for each candidate n in the range [n_{min}, n_{max}]. Afterwards, the best performing (n, Q(n)) pair can be reported. In case of multiple solutions, the one involving the largest n, i.e., the smallest Q(n), is selected.
The best C for ECOC and OAA classifiers based on linear SVMs
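The ranking step can be sketched as follows, assuming the usual S2N definition (difference of the class means divided by the sum of the class standard deviations). The function names, the use of sample standard deviations and the toy data are assumptions of this sketch.

```python
from statistics import mean, stdev


def s2n(pos, neg):
    """Signal-to-noise score of one gene: (mu+ - mu-) / (sigma+ + sigma-).

    Raises ZeroDivisionError if the gene is constant in both classes.
    """
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))


def top_genes(expr_pos, expr_neg, g):
    """Return the indices of the g most important genes.

    `expr_pos[j]` and `expr_neg[j]` hold the expression values of gene j in
    the positive and negative training samples. The first g/2 and the last
    g/2 genes of the ranked list are kept, so g must be even and at least 2.
    """
    if g < 2 or g % 2 != 0:
        raise ValueError("g must be an even number >= 2")
    scores = [s2n(p, n) for p, n in zip(expr_pos, expr_neg)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    half = g // 2
    return ranked[:half] + ranked[-half:]


if __name__ == "__main__":
    expr_pos = [[5, 6, 7], [1, 2, 3], [1, 2, 3], [2, 3, 4]]
    expr_neg = [[1, 2, 3], [1, 2, 3], [5, 6, 7], [1, 2, 3]]
    print(top_genes(expr_pos, expr_neg, 2))
```

Keeping both ends of the ranking retains genes that are over-expressed as well as genes that are under-expressed in the positive class.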
Dataset | ECOC at η= 5^{a} | ECOC at η= 10^{a} | ECOC at η= 15^{a} | OAA^{a} |
---|---|---|---|---|
Lymphoma | NA | 1:1-1 | 1:1-1 | 1:1-1 |
SRBCT | 1:1-1 | 1:1-1 | 1:1-1 | 1:1-1 |
Brain | 1:1-1 | 1:1-1 | 1:1-1 | 1:1-1 |
NCI60 | 1:1-1 | 1:1-1 | 1:1-1 | 1:0.5-1 |
Staunton | 1:1-1 | 1:1-1 | 1:1-1 | 1:0.5-1 |
GCM RM | 1:1-1 | 1:1-1 | 1:1-1 | 1:1-1 |
Su | 1:1-1 | 1:1-1 | 1:1-1 | 1:0.5-1 |
GCM | 1:1-1 | 1:1-1 | 1:1-1 | 1:0.5-1 |
Assessing the statistical significance of results
where u and v respectively denote the sample sizes from the empirical distributions of F and G and N = u + v.
Hence, to test whether ECOC classifiers can attain better classification performance than OAA classifiers, the two-sided D (Eq. 7) and the one-sided D^{-} (Eq. 8) statistics were used (the alternative parameter of the ks.test function in the stats R package respectively set to "two.sided" and "less"). A similar approach was used to assess the statistical significance of the differences between the overall fraction of selected genes by ECOC and OAA classifiers. Finally, to assess the statistical significance of stability differences between ECOC and OAA classifiers, the D (Eq. 7) and the D^{+} (Eq. 9) statistics were used (the alternative parameter of the ks.test function in the stats R package respectively set to "two.sided" and "greater"). One-sided KS tests were supplemented with one-sided Mann-Whitney U tests (MW) for analyzing the difference between the medians of two groups. An alpha level of 0.05 was used for all statistical tests.
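For readers who want to see what the two-sample KS statistics measure, the sketch below computes them from empirical distribution functions in plain Python. It assumes the standard definitions D^{+} = max(F - G), D^{-} = max(G - F) and D = max(D^{+}, D^{-}); the correspondence of these quantities with Eq. 7 to Eq. 9 is an assumption, and no p-values are computed. The actual analysis in the paper was carried out with ks.test in R.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical distribution function of a sample."""
    ordered = sorted(sample)
    return lambda t: bisect_right(ordered, t) / len(ordered)


def ks_statistics(x, y):
    """Return (D, D_plus, D_minus) for samples x ~ F and y ~ G.

    The differences F - G are evaluated at every pooled sample point, which
    is where the two step functions can change.
    """
    f, g = ecdf(x), ecdf(y)
    diffs = [f(t) - g(t) for t in sorted(set(x) | set(y))]
    d_plus = max(0.0, max(diffs))
    d_minus = max(0.0, -min(diffs))
    return max(d_plus, d_minus), d_plus, d_minus


if __name__ == "__main__":
    errors_f = [0.10, 0.12, 0.15, 0.20]  # hypothetical error rates
    errors_g = [0.11, 0.18, 0.22, 0.25]
    print(ks_statistics(errors_f, errors_g))
```

A large D^{+} indicates that the sample x tends to take smaller values than the sample y, since its distribution function rises earlier.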
Appendix
A more formal derivation of an upper bound for the number of genes per binary classifier
Declarations
Acknowledgements
The authors would like to thank Javier De Las Rivas, member of the CIC, CSIC/USAL, Spain, for providing initial access to computational resources. The authors would also like to thank anonymous reviewers for their helpful comments. ET's, LO's, PB's and LA's work was supported by projects PICT No. 02226, SECYT, Argentina and Red Sudamericana e Iberoamericana de Bioinformática, PROSUL CNPq 011/2008, Brasil.
References
- Statnikov A, Aliferis C, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21: 631–643. 10.1093/bioinformatics/bti033
- Liu KH, Xu CG: A genetic programming-based approach to the classification of multiclass microarray datasets. Bioinformatics 2009, 25: 331–337. 10.1093/bioinformatics/btn644
- Li T, Zhang C, Ogihara M: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 2004, 20(15):2429–2437. 10.1093/bioinformatics/bth267
- Statnikov A, Wang L, Aliferis C: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008, 9: 319. 10.1186/1471-2105-9-319
- Fan J, Fan Y: High dimensional classification using features annealed independence rules. Ann Statist 2008.
- Allwein EL, Schapire RE, Singer Y: Reducing Multiclass to Binary: A Unifying Approach for Margin classifiers. In ICML '00: Proceedings of the Seventeenth International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc; 2000:9–16.
- Guruswami V, Sahai A: Multiclass learning, boosting, and error-correcting codes. In COLT '99: Proceedings of the twelfth annual conference on Computational learning theory. USA: ACM Press; 1999:145–155.
- Masulli F, Valentini G: Dependence among Codeword Bits Errors in ECOC Learning Machines: An Experimental Analysis. Multiple classifier Systems 2001, 158–167.
- Rifkin R, Klautau A: In Defense of One-Vs-All classification. Journal of Machine Learning Research 2004, 5: 101–141.
- Dietterich TG, Bakiri G: Error-correcting output codes: a general method for improving multiclass inductive learning programs. In Proceedings of the Ninth AAAI National Conference on Artificial Intelligence. Edited by: Dean TL, Mckeown K. Menlo Park, CA: AAAI Press; 1991:572–577.
- Rifkin R: Everything old is new again: A fresh look at historical approaches in machine learning. PhD thesis. Massachusetts Institute of Technology; 2002.
- Tapia E, Bulacio P, Angelone L: Recursive ECOC classification. Pattern Recognition Letters 2010, 31(3):210–215. 10.1016/j.patrec.2009.09.031
- Mackay DJC: Good error-correcting codes based on very sparse matrices. Information Theory, IEEE Transactions on 1999, 45(2):399–431. 10.1109/18.748992
- Vapnik V: The nature of statistical learning theory (Information Science and Statistics). Springer; 1999.
- Saeys Y, Inza I, Larranaga P: A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23(19):2507–2517. 10.1093/bioinformatics/btm344
- Slonim DK, Tamayo P, Mesirov JP, Golub TR, Lander ES: Class prediction and discovery using gene expression data. Recomb 2000, 263–272.
- Furey T, Cristianini N, Duffy N, Bednarski D, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16(10):906–914. 10.1093/bioinformatics/16.10.906
- Yeang CH, Ramaswamy S, Tamayo P, Mukherjee S, Rifkin RM, Angelo M, Reich M, Lander E, Mesirov J, Golub T: Molecular classification of multiple tumor types. Bioinformatics 2001., 17(Suppl 1):
- Furlanello C, Serafini M, Merler S, Jurman G: Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics 2003, 4: 54. 10.1186/1471-2105-4-54
- Dupuy A, Simon R: Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst 2007, 99: 147–157. 10.1093/jnci/djk018
- Lee S: Mistakes in validating the accuracy of a prediction classifier in high-dimensional but small-sample microarray data. Stat Methods Med Res 2008, 17: 635–642. 10.1177/0962280207084839
- Aliferis CF, Statnikov A, Tsamardinos I, Schildcrout JS, Shepherd BE, Harrell FE: Factors influencing the statistical power of complex data analysis protocols for molecular signature development from microarray data. PLoS ONE 2009, 4: 3:e4922. 10.1371/journal.pone.0004922
- Shmulevich I, Dougherty ER, Kim S, Zhang W: Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 2002, 18(2):261–274. 10.1093/bioinformatics/18.2.261
- Huang S: Non-genetic heterogeneity of cells in development: more than just noise. Development 2009, 136(23):3853–3862. 10.1242/dev.035139
- Tsamardinos I, Aliferis CF: Towards Principled Feature Selection: Relevancy, Filters and Wrappers. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics 2003.
- Zhu J, McLachlan G, Jones LBT, Wood I: On selection biases with prediction rules formed from gene expression data. Journal of Statistical Planning and Inference 2008, 138(2):374–386. 10.1016/j.jspi.2007.06.003
- Guyon I, Weston J, Barnhill S, Vapnik V: Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning 2002, 46(1–3):389–422. 10.1023/A:1012487302797
- Berger A: Error-Correcting Output Coding for Text classification. In Proceedings of IJCAI-99 Workshop on Machine Learning for Information Filtering 1999.
- James G, Hastie T: The Error Coding Method and PICTs. Journal of Computational and Graphical Statistics 1998, 7(3):377–387. 10.2307/1390710
- Lin Y: Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery 2002, 6: 259–275. 10.1023/A:1015469627679
- Cristianini N, Shawe-Taylor J: An introduction to support vector machines: and other kernel-based learning methods. 1st edition. Cambridge University Press; 2000.
- Kschischang FR, Frey BJ, Loeliger HA: Factor graphs and the sum-product algorithm. Information Theory, IEEE Transactions on 2001, 47(2):498–519. 10.1109/18.910572
- Alizadeh A, Eisen M, Davis R, Ma C, Lossos I, Rosenwald A, Boldrick J, Sabet H, Tran T, Yu X, Powell J, Yang L, Marti G, Moore T, Hudson J Jr, Lu L, Lewis D, Tibshirani R, Sherlock G, Chan W, Greiner T, Weisenburger D, Armitage J, Warnke R, Levy R, Wilson W, Grever M, Byrd J, Botstein D, Brown P, Staudt L: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403(6769):503–11. 10.1038/35000501
- Dettling M: BagBoosting for tumor classification with gene expression data. Bioinformatics 2003, 20(18):3583. 10.1093/bioinformatics/bth447
- Dudoit S, Fridlyand J, Speed TP: Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association 2002, 97(457):77–87. 10.1198/016214502753479248
- Khan J, Wei J, Ringner M, Saal L, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu C, Peterson C, Meltzer P: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001, 7(6):673–679. 10.1038/89044
- Pomeroy S, Tamayo P, Gaasenbeek M, Sturla L, Angelo M, Mclaughlin M, Kim J, Goumnerova L, Black P, Lau C, Allen J, Zagzag D, Olson J, Curran T, Wetmore C, Biegel J, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis D, Mesirov J, Lander E, Golub T: Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 2002, 415(6870):436–442. 10.1038/415436a
- Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WO, Weinstein JN, Mesirov JP, Lander ES, Golub TR: Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci USA 2001, 98: 10787–10792. 10.1073/pnas.191368598
- Su AI, Welsh JB, Sapinoso LM, Kern SG, Dimitrov P, Lapp H, Schultz PG, Powell SM, Moskaluk CA, Frierson HF, Hampton GM: Molecular Classification of Human Carcinomas by Use of Gene Expression Signatures. Cancer Res 2001, 61(20):7388–7393.
- Yeung K, Bumgarner R: Multiclass classification of microarray data with repeated measurements: application to cancer. Genome Biol 2003, 4(12):R83. 10.1186/gb-2003-4-12-r83
- Witten I, Frank E: Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann; 1999.
- Azuaje F: Genomic data sampling and its effect on classification performance assessment. BMC Bioinformatics 2003, 4: 5. 10.1186/1471-2105-4-5
- Breiman L: Statistical Modeling: The Two Cultures. Statistical Science 2001, 16(3):199–215. 10.1214/ss/1009213726
- Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y: Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics 2010, 26(3):392–398. 10.1093/bioinformatics/btp630
- Qiu X, Xiao Y, Gordon A, Yakovlev A: Assessing stability of gene selection in microarray data analysis. BMC Bioinformatics 2006, 7: 50. 10.1186/1471-2105-7-50
- Salton G: Automatic text processing: the transformation, analysis, and retrieval of information by computer. USA: Addison-Wesley Longman Publishing Co., Inc; 1989.
- Ambroise C, Mclachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA 2002, 99(10):6562–6566. 10.1073/pnas.102102699
- Hadar J, Russell WR: Rules for Ordering Uncertain Prospects. American Economic Review 1969, 59: 25–34.
- Delgado MA, Farinas JC, Ruano S: Firm productivity and export markets: a non-parametric approach. Journal of International Economics 2002, 57(2):397–422. 10.1016/S0022-1996(01)00154-4
- Hollander M, Wolfe DA: Nonparametric Statistical Methods. 2nd edition. Wiley-Interscience; 1999.
- Shapiro SS, Wilk MB: An analysis of variance test for normality (complete samples). Biometrika 1965., 3(52):
- Shannon CE: A Mathematical Theory of Communication. The Bell System Technical Journal 1948, 27: 379–423, 623–656.
- Fano RM: Transmission of information: a statistical theory of communications. M.I.T. Press & Wiley, London; 1961.
- Cover TM, Thomas JA: Elements of Information Theory. Wiley-Interscience; 1991.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.