To aggregate or not to aggregate high-dimensional classifiers
© Xu et al; licensee BioMed Central Ltd. 2011
Received: 4 November 2010
Accepted: 13 May 2011
Published: 13 May 2011
High-throughput functional genomics technologies generate large amounts of data, with hundreds or thousands of measurements per sample. The number of samples is usually much smaller, on the order of tens or hundreds. This poses statistical challenges and calls for appropriate solutions for the analysis of this kind of data.
Principal component discriminant analysis (PCDA), an adaptation of classical linear discriminant analysis (LDA) for high-dimensional data, was selected as an example of a base learner. The multiple versions of the PCDA models obtained from repeated double cross-validation were aggregated, and the final classification was performed by majority voting. The performance of this approach was evaluated on simulated, genomics, proteomics and metabolomics data sets.
The aggregating PCDA learner can improve the prediction performance, provide more stable results, and help reveal the variability of the models. The disadvantages and limitations of aggregating are also discussed.
The mining of high-dimensional data, in which the number of features is much larger than the number of samples, has become increasingly important, especially in genomics, proteomics, biomedical imaging and other areas of systems biology. The availability of high-dimensional data, along with new scientific problems, has significantly challenged traditional statistical theory and reshaped statistical thinking.
The high dimensionality of functional genomics data sets poses problems for building classifiers. Because of the sparsity of data in high-dimensional spaces, many classical classification methods break down. For example, Fisher's discriminant rule is inapplicable because the within-class scatter matrix becomes singular if the number of variables is larger than the number of samples [3, 4].
Another problem is caused by the small sample size. The number of samples is usually not adequate to be representative of the total population. Moreover, classifiers built on small sample sets are often unstable and may have a large variance in the number of misclassifications. One common approach to this problem is to aggregate many classifiers instead of using a single one. There has been considerable recent interest in the application of aggregating methods to the classification of high-dimensional data [6–11]. Perhaps the best-known method in this class of techniques is bootstrap aggregating (bagging). Breiman found that gains in accuracy could be obtained by bagging when the base learner is not stable. However, Vu and Braga-Neto argued that the use of bagging in the classification of small-sample data increases computational cost but is not likely to improve overall classification accuracy over other, simpler classification rules. Moreover, if the sample size is small, the gains achieved via a bagged ensemble may not compensate for the decrease in accuracy of the individual models.
Cross-validation is probably the most widely used method for estimating prediction error. When modeling high-dimensional data with few samples, k-fold cross-validation is often used. The k-fold cross-validation estimate is a stochastic variable that depends on the partition of the data set. Full cross-validation, that is, evaluating all possible partitions, gives an accurate estimate but is computationally too expensive. Repeating k-fold cross-validation multiple times with different splits therefore provides a good Monte Carlo estimate of full cross-validation. This repetition yields a large number of classifiers.
In this paper, we aggregated the classifiers obtained from principal component discriminant analysis (PCDA) with a double cross-validation scheme. PCDA is an adaptation of Fisher's linear discriminant analysis (FLDA) for high-dimensional data. In PCDA, the dimensionality of the data is reduced by principal component analysis (PCA). In the reduced-dimensional space the within-class scatter matrix is nonsingular and classical LDA can be performed [13–16]. A double cross-validation scheme was used to estimate both the number of principal components and the prediction error of the PCDA model. The classifiers obtained from the different cross-validation loops are aggregated into a single classifier. This approach is tested on simulated, gene expression, proteomics and metabolomics data. The results may provide insight into the use of aggregating learners for low-sample, high-dimensional biological data.
Here r is the number of classes, and class i has $m_i$ samples. $M_i$ is the index set of the samples in class i. $\bar{x}_i$ and $\bar{x}$ are the class centroids and the global centroid, respectively.
The discriminating direction d is the eigenvector corresponding to the largest eigenvalue of the matrix $S_w^{-1} S_b$. Because the number of features n is larger than the number of samples m in high-dimensional data, the matrix $S_w$ is singular. This means that $S_w^{-1}$ does not exist and FLDA cannot be applied directly.
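For reference, the quantities described in the two paragraphs above can be written in the textbook form of FLDA. This is a sketch under assumptions: the symbol $x_j$ for sample j and the exact layout of the expressions are chosen here for illustration and may differ from the paper's own equations.

```latex
\[
\begin{aligned}
\bar{x}_i &= \frac{1}{m_i} \sum_{j \in M_i} x_j ,
\qquad
\bar{x} = \frac{1}{m} \sum_{j=1}^{m} x_j , \\
S_b &= \sum_{i=1}^{r} m_i \, (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^{T} , \\
S_w &= \sum_{i=1}^{r} \sum_{j \in M_i} (x_j - \bar{x}_i)(x_j - \bar{x}_i)^{T} , \\
d &= \arg\max_{d} \frac{d^{T} S_b \, d}{d^{T} S_w \, d}
\quad\Longrightarrow\quad
S_w^{-1} S_b \, d = \lambda d .
\end{aligned}
\]
```

In this form $S_w$ is an n × n matrix of rank at most m − r, which is why it cannot be inverted when n is larger than m.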
To overcome the difficulties imposed by the singular covariance matrices, the data can first be projected onto a low-dimensional PCA subspace, in which LDA is then applicable. The main goal of PCA is to reduce the dimensionality of a data set while retaining as much as possible of the information present in the original data. This reduction is achieved by a linear transformation to a new set of variables, the principal component (PC) scores. The combination of LDA with PCA yields principal component discriminant analysis (PCDA).
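The two steps described above, projection onto PC scores followed by LDA, can be sketched for the two-class case as follows. This is a minimal illustration, not the authors' implementation: the function names `fit_pcda` and `predict_pcda`, the dictionary layout, and the midpoint decision threshold are assumptions made here.

```python
import numpy as np

def fit_pcda(X, y, n_pcs):
    """Fit a two-class PCDA model: PCA projection followed by Fisher's LDA."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # PCA via SVD; the rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_pcs].T                      # loadings, shape (n_features, n_pcs)
    T = Xc @ P                            # PC scores, shape (n_samples, n_pcs)
    classes = np.unique(y)
    centroids = np.array([T[y == c].mean(axis=0) for c in classes])
    # Within-class scatter in the reduced space (nonsingular for small n_pcs).
    Sw = sum((T[y == c] - m).T @ (T[y == c] - m)
             for c, m in zip(classes, centroids))
    # For two classes the Fisher direction is proportional to Sw^-1 (m1 - m0).
    d = np.linalg.solve(Sw, centroids[1] - centroids[0])
    # Assumed decision rule: threshold halfway between the projected centroids.
    threshold = 0.5 * (centroids[0] + centroids[1]) @ d
    return {"mean": mean, "P": P, "d": d,
            "threshold": threshold, "classes": classes}

def predict_pcda(model, X):
    """Project new samples and assign the class on the matching side."""
    scores = (X - model["mean"]) @ model["P"] @ model["d"]
    return np.where(scores > model["threshold"],
                    model["classes"][1], model["classes"][0])
```

The number of PCs must stay well below the number of training samples, otherwise the within-class scatter of the scores becomes singular again.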
Aggregating PCDA with double cross-validation
Divide the training data set into K parts.
For i = 1 to K
    For j = 1 to K-1
        Build PCDA models with different numbers of PCs
    Find the optimal number of PCs
    Build a PCDA model with the optimal number of PCs
    Obtain the cross-validation error
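The scheme above might be sketched in Python as shown below. Several things are assumptions made for illustration and are not taken from the paper: class labels coded as 0 and 1, plain random (unstratified) folds, inner-loop errors summed over the K-1 inner folds, and the helper names `fit_pcda`, `predict_pcda`, `double_cv` and `pc_grid`. The PCDA helper is a compact two-class version included only to keep the example self-contained.

```python
import numpy as np

def fit_pcda(X, y, n_pcs):
    # Compact two-class PCDA: PCA by SVD, then Fisher's direction on the scores.
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:n_pcs].T
    T = (X - mean) @ P
    m0, m1 = T[y == 0].mean(axis=0), T[y == 1].mean(axis=0)
    Sw = (T[y == 0] - m0).T @ (T[y == 0] - m0) + (T[y == 1] - m1).T @ (T[y == 1] - m1)
    d = np.linalg.solve(Sw, m1 - m0)
    return mean, P, d, 0.5 * (m0 + m1) @ d

def predict_pcda(model, X):
    mean, P, d, threshold = model
    return ((X - mean) @ P @ d > threshold).astype(int)

def double_cv(X, y, K, pc_grid, rng):
    """One run of double cross-validation: K outer models and the CV error."""
    folds = np.array_split(rng.permutation(len(y)), K)
    models, n_wrong = [], 0
    for i in range(K):                                    # outer loop
        outer_val = folds[i]
        inner_folds = [folds[k] for k in range(K) if k != i]
        inner_errors = np.zeros(len(pc_grid))
        for j in range(K - 1):                            # inner loop
            inner_val = inner_folds[j]
            inner_train = np.concatenate(
                [f for k, f in enumerate(inner_folds) if k != j])
            for g, n_pcs in enumerate(pc_grid):           # models with different PCs
                model = fit_pcda(X[inner_train], y[inner_train], n_pcs)
                inner_errors[g] += np.sum(
                    predict_pcda(model, X[inner_val]) != y[inner_val])
        best_pcs = pc_grid[int(np.argmin(inner_errors))]  # optimal number of PCs
        outer_train = np.concatenate(inner_folds)
        model = fit_pcda(X[outer_train], y[outer_train], best_pcs)
        n_wrong += np.sum(predict_pcda(model, X[outer_val]) != y[outer_val])
        models.append(model)
    return models, n_wrong / len(y)
```

Calling `double_cv` repeatedly with different random states produces the multiple versions of the classifier that are later aggregated.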
Since the cross-validation error depends on the random assignment of samples to folds, a common practice is to stratify the folds. In stratified K-fold cross-validation, the folds are created so that they contain approximately the same proportion of classes as the original data set. With randomly chosen partitions into inner and outer validation sets, the double cross-validation scheme can be repeated to produce a large number of PCDA classifiers. The multiple versions of the predictor can be aggregated by majority voting, i.e., the winning class is the one predicted by the largest number of predictors.
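Majority voting as described above can be illustrated with a few lines of NumPy. The function name `majority_vote` and the convention that class labels are the integers 0, 1, 2, ... are assumptions made for this sketch.

```python
import numpy as np

def majority_vote(predictions):
    """Aggregate class predictions by majority voting.

    predictions: array-like of shape (n_models, n_samples) holding
    integer class labels starting at 0.
    """
    predictions = np.asarray(predictions)
    n_classes = int(predictions.max()) + 1
    # votes[c, s] = number of models that predicted class c for sample s.
    votes = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    # Ties are resolved in favor of the lowest class label.
    return votes.argmax(axis=0)
```

With two classes, using an odd number of voters avoids ties altogether.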
Equations 4 and 5 ensure the separability of the two classes, and Equation 6 gives the two classes the same common covariance matrix Ω.
By following the above procedure, we obtained a simulated data set of size 200 × 590. Before building the PCDA classification model by double cross-validation, we separated the simulated data set into a training set and a test set, as shown in Figure 1. To form training sets of different sample sizes, we randomly selected 12, 30, 50, 75 or 100 objects from the 200 objects. For the test set, 100 objects were randomly selected without replacement from the data remaining after removal of the training set. The whole selection procedure was repeated 100 times. To make a reasonable comparison, we fixed the random seeds in each selection procedure. In single PCDA, a double cross-validation with ten folds in the outer loop and nine folds in the inner loop was used to obtain the optimal number of PCs and the cross-validation error. In aggregating PCDA, the PCDA approach was repeated 51 times with different cross-validation splits to obtain an aggregated classifier. The single PCDA model with double cross-validation served as the reference against which the classification performance of aggregating PCDA was compared.
Leukemia gene expression data
Leukemia data from high-density Affymetrix oligonucleotide arrays were previously analyzed by Golub and Tibshirani [21, 22], and are available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. There are 7129 genes and 72 samples from two classes: 47 in class ALL (acute lymphocytic leukemia) and 25 in class AML (acute myelogenous leukemia). Of these 72 samples, 38 (27 in class ALL and 11 in class AML) are used as training samples and 34 (20 in class ALL and 14 in class AML) as test samples. The data were mean-centered before classification. Note that pretreatment steps such as mean-centering and auto-scaling were always performed on the training data, and the test data were then pretreated with the mean and standard deviation obtained from the training set. Auto-scaling means mean-centering the data and scaling each column by its standard deviation.
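The pretreatment rule described above, in which the scaling parameters are estimated on the training set only and then reused for the test set, can be sketched as follows. The function names, the use of the sample standard deviation (`ddof=1`), and the guard for constant columns are assumptions of this example, not details given in the paper.

```python
import numpy as np

def autoscale_fit(X_train):
    """Estimate column means and standard deviations on the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    std[std == 0] = 1.0   # assumed guard: leave constant columns unscaled
    return mean, std

def autoscale_apply(X, mean, std):
    """Mean-center and scale any data set with the training-set parameters."""
    return (X - mean) / std
```

A typical use would be `mean, std = autoscale_fit(X_train)` followed by `autoscale_apply(X_train, mean, std)` and `autoscale_apply(X_test, mean, std)`, so that no information from the test set leaks into the model.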
Gaucher proteomics data
The data consist of serum protein profiles of 20 Gaucher patients and 20 controls. Serum samples were surveyed for basic proteins with SELDI-TOF-MS, making use of the anionic surface of the CM10 ProteinChip. All preprocessing (spot-to-spot calibration, baseline subtraction, peak detection) of the SELDI-TOF-MS data was performed using Ciphergen software. The data set of size 40 × 590 is available at http://www.bdagroup.nl/content/Downloads/datasets/datasets.php. One Gaucher sample (a female receiving enzyme replacement therapy) was detected as an outlier and removed. The spectral profiles were first normalized by dividing each profile by its median to arrive at comparable spectra. Subsequently, the data sets were auto-scaled before classification.
Grape extract metabolomics data
The data set is from Unilever Food and Health Research, Vlaardingen, Netherlands. Thirty-five healthy males were recruited to investigate the effect of grape extract supplementation on vascular function and other vascular health markers. The study has a double-blind, placebo-controlled, randomized full crossover design with 3 treatments, a run-in period, 3 intervention periods and 2 washout periods. 1D 1H NMR spectra of plasma:D2O (1:1 v/v) samples were recorded on a Bruker Avance 600 MHz NMR spectrometer according to a Standard Operating Procedure with a pulse sequence. All data were processed in Bruker XWIN-NMR software version 3.0 (Bruker BioSpin GmbH, Rheinstetten, Germany) and imported into AMIX software from Bruker. Because of some missing data, the final NMR data comprise 276 plasma samples, which were bucketed in the spectral region 0-9 ppm using a bucket width of 0.02 ppm.
The two-class data set of size 276 × 412 was divided into two subsets, 200 samples in the training set and 76 samples in the prediction set, using the Kennard-Stone method. The Kennard-Stone method selects objects for modeling such that they are uniformly scattered over the experimental space. The samples were assigned to the training set and test set in such a way that the ratio of class membership is similar to that of the original data. The data sets were auto-scaled before classification.
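The Kennard-Stone selection mentioned above is commonly described as a max-min procedure: start with the two most distant samples, then repeatedly add the sample whose distance to the nearest already selected sample is largest. The sketch below follows that description with Euclidean distances. The function name and the tie-breaking behavior are assumptions, and the text does not say how the class ratio was preserved (running the selection within each class would be one option).

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return the indices of n_select rows of X chosen by the max-min rule."""
    # Full pairwise Euclidean distance matrix; simple but memory-hungry,
    # since the intermediate array has shape (n_samples, n_samples, n_features).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    # min_dist[s] = distance from sample s to its nearest selected sample.
    min_dist = np.minimum(dist[i], dist[j])
    while len(selected) < n_select:
        min_dist[selected] = -1.0          # never pick a sample twice
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return selected
```

The selected indices would form the training set and the remaining samples the prediction set.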
Results and Discussion
When aggregating works
Cross-validation errors evaluated by outer validation sets with PCDA
Prediction errors evaluated by test sets with PCDA
The aggregated PCDA can make a good PCDA classifier better, since the variance of the misclassification rate can be reduced [24–27]. A heuristic explanation is that the variance of the prediction error of the aggregated classifier is equal to or smaller than that of the original classifier, since majority voting is a form of model averaging.
The dimension reduction step by PCA cannot be guaranteed to preserve all directions that contain discriminative information. In an aggregated PCDA model, however, the discriminant information discarded by one PCDA model can be recovered by other PCDA models built on different cross-validation partitions of the training data. Aggregated PCDA may therefore contain more discriminative information than a single PCDA.
Cross-validation errors evaluated by outer validation sets with SVM
Prediction errors evaluated by test sets with SVM
When aggregating does not work
Aggregating may increase the bias of a learner, since only part of the training data is sampled by cross-validation or bootstrapping for modeling. That is, the use of K-fold cross-validation may have a negative effect on the accuracy of the individual PCDA models. As shown in Figure 2, when the sample size is twelve, the performance of the PCDA classifier is relatively poor and unstable. Even after aggregating, the classifier did not reach the expected training and prediction performance, because in such a case more samples are needed to build a precise model. Another situation that does not favor aggregating is that of very weak learners. A very weak learner is one whose performance is even worse than random guessing. Aggregating such learners makes the prediction even worse, because the majority vote systematically settles on the wrong class. For example, if an observation is classified correctly only about four times out of ten, the majority vote over the individual classifiers will be wrong for that observation nearly 100% of the time.
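The effect of voting over very weak learners can be made concrete with a small calculation. The sketch below assumes that the voters are independent and that each is correct with the same probability p, which is a simplification: classifiers built on overlapping cross-validation splits are correlated. The function name is hypothetical.

```python
from math import comb

def majority_correct_prob(p, n_voters):
    """Probability that a strict majority of n independent voters is correct,
    when each voter is correct with probability p (binomial model)."""
    k_min = n_voters // 2 + 1
    return sum(comb(n_voters, k) * p**k * (1 - p)**(n_voters - k)
               for k in range(k_min, n_voters + 1))
```

Under this model, with an odd number of voters, the majority is right less often than a single voter whenever p is below 0.5 and more often whenever p is above 0.5, and the gap widens as the number of voters grows.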
Another question about aggregating PCDA is how many resampling runs are enough. Figure 6 shows the misclassification rate in training as a function of the number of aggregated models, which ranges from 20 to 1000 in steps of 20. We observe in Figure 6 that the aggregated misclassification rate remains stable after 100 replicas for the leukemia and grape data. For the Gaucher data, 200 replicas also give a reasonable estimate. In our experience, 50-200 replicas are usually enough to obtain a stable value. The aggregated learner in this paper is obtained from cross-validation, which is resampling without replacement. Conventional bagging is obtained from bootstrapping, which is resampling with replacement. As stated by Buja and Stuetzle, there is an equivalence between bagging based on resampling with and without replacement. In our opinion, the conclusions obtained in this paper therefore also hold for bagging approaches.
Another concern is whether aggregating PCDA can be applied to multi-class problems. Because the discrimination in PCDA is performed by LDA, the properties of LDA for multiple classes also hold. Since the decision boundaries in LDA are constructed in a pairwise manner, the conclusions drawn in this paper are in principle also valid for a multi-class problem. However, many discriminative methods are most accurate and efficient when dealing with two classes only, and show reduced accuracy and efficiency in multi-class settings. The effects of aggregating multi-class classifiers still need further careful study.
In addition, an interpretable model is usually required, as it is important to identify which genes, proteins and metabolites contribute most to the classifiers. The PCDA model has already been combined with rank products [13, 16, 32] to find important variables. The same strategy can be used in aggregating PCDA. For example, suppose we aggregate 100 PCDA learners. A single PCDA run yields 10 discriminant vectors in a 10-fold cross-validation, so 100 runs give 1000 discriminant vectors in total. For each feature, the product of its 1000 ranks is then calculated. After sorting, the features with the lowest rank products are the ones with the largest discriminative power.
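The rank-product step could look like the sketch below. Two choices here are assumptions rather than details from the paper: features are ranked within each discriminant vector by the absolute value of their coefficient (rank 1 for the largest), and the geometric mean of the ranks is returned instead of the raw product, which orders the features identically but avoids numerical overflow with 1000 vectors.

```python
import numpy as np

def rank_products(discriminant_vectors):
    """Geometric-mean rank per feature across discriminant vectors.

    discriminant_vectors: array-like of shape (n_models, n_features).
    """
    W = np.abs(np.asarray(discriminant_vectors, dtype=float))
    # order[m] lists the features of model m from largest to smallest |coefficient|.
    order = np.argsort(-W, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(W.shape[0])[:, None]
    # Invert the ordering so that ranks[m, f] is the rank of feature f in model m.
    ranks[rows, order] = np.arange(1, W.shape[1] + 1)
    # The mean of log-ranks sorts features exactly like the product of ranks.
    return np.exp(np.log(ranks).mean(axis=0))
```

Sorting the result in ascending order, for example with `np.argsort`, lists the features from most to least discriminative under these assumptions.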
The use of cross-validation to study the performance of a classifier is an established method. If performed properly, cross-validation provides roughly unbiased estimates of the prediction measures. However, the different partitions in cross-validation can give rise to high variability in the model predictions. In this paper we show a way to overcome this variability by building one aggregated classifier from all the classifiers that were built in the repeated cross-validations.
Aggregating learners can have several important benefits. Aggregating over a collection of fitted values can help compensate for overfitting. That is, the majority voting tends to cancel out results shaped by idiosyncratic features of the data. One can then obtain more stable and more honest assessments of how good the fit really is.
Aggregating learners also has limits. When the sample size is very small, an aggregated learner may have a large bias. It is therefore important to visualize the data to see whether aggregating will be helpful.
In conclusion, we recommend the use of aggregated learners in high-dimensional data analysis, combined with a careful look at the data structure and a comparison with the results of the base learner.
The authors thank Ewoud van Velzen, Unilever Food and Health Research Institute, Vlaardingen, Netherlands for supplying the Grape extracts data. This work was part of the BioRange programme of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI).
- Hastie T, Tibshirani R, Friedman JH: The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd edition. New York: Springer; 2009.
- Fan JQ, Li RZ: Statistical challenges with high dimensionality: feature selection in knowledge discovery. In Proceedings of the International Congress of Mathematicians. Madrid, Spain: European Mathematical Society; 2006.
- Fukunaga K: Introduction to Statistical Pattern Recognition. New York: Academic Press; 1990.
- Chen LF, Liao HYM, Ko MT, Lin JC, Yu GJ: A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognit 2000, 33(10):1713–1726.
- Skurichina M, Duin RPW: Bagging for linear classifiers. Pattern Recognit 1998, 31(7):909–930.
- Breiman L: Bagging predictors. Mach Learn 1996, 24(2):123–140.
- Geurts P, Fillet M, de Seny D, Meuwis MA, Malaise M, Merville MP, Wehenkel L: Proteomic mass spectra classification using decision tree based ensemble methods. Bioinformatics 2005, 21(14):3138–3145.
- Statnikov A, Wang L, Aliferis CF: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008, 9: 319.
- Gunther EC, Stone DJ, Gerwien RW, Bento P, Heyes MP: Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proc Natl Acad Sci USA 2003, 100(16):9608–9613.
- Vu TT, Braga-Neto UM: Is Bagging Effective in the Classification of Small-Sample Genomic and Proteomic Data? EURASIP Journal on Bioinformatics and Systems Biology 2009, 2009: 10. Article ID 158368.
- Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence (IJCAI) 1995. [http://robotics.stanford.edu/users/ronnyk/]
- Kotsiantis SB, Pintelas PE: Combining Bagging and Boosting. International Journal of Computational Intelligence 2004, 2004(1):323–333.
- Smit S, van Breemen MJ, Hoefsloot HCJ, Smilde AK, Aerts J, de Koster CG: Assessing the statistical validity of proteomics based biomarkers. Anal Chim Acta 2007, 592(2):210–217.
- Hoogerbrugge R, Willig SJ, Kistemaker PG: Discriminant-analysis by double stage principal component analysis. Anal Chem 1983, 55(11):1710–1712.
- Belhumeur PN, Hespanha JP, Kriegman DJ: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans Pattern Anal Mach Intell 1997, 19(7):711–720.
- Hoefsloot HCJ, Smit S, Smilde AK: A classification model for the Leiden proteomics competition. Stat Appl Genet Mol Biol 2008, 7(2): Article 8.
- Stone M: Cross-validatory choice and assessment of statistical predictions. J R Stat Soc B 1974, 36: 111–147.
- Vandeginste BGM, Massart DL, Buydens LMC, Jong SD, Lewi PJ, Smeyers-Verbeke J: Handbook of Chemometrics and Qualimetrics: Part B. Amsterdam: Elsevier; 1998.
- Mertens BJA, De Noo ME, Tollenaar R, Deelder AM: Mass spectrometry proteomic diagnosis: Enacting the double cross-validatory paradigm. J Comput Biol 2006, 13(9):1591–1605.
- Filzmoser P, Liebmann B, Varmuza K: Repeated double cross validation. J Chemometr 2009, 23(3–4):160–171.
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1999, 286(5439):531–537.
- Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002, 99(10):6567–6572.
- Kennard RW, Stone L: Computer aided design of experiments. Technometrics 1969, 11: 137–148.
- Friedman JH: On bias, variance, 0/1 - Loss, and the curse-of-dimensionality. Data Min Knowl Discov 1997, 1(1):55–77.
- Buhlmann P, Yu B: Analyzing bagging. Ann Stat 2002, 30(4):927–961.
- Grandvalet Y: Bagging equalizes influence. Mach Learn 2004, 55(3):251–270.
- Berk RA: Statistical Learning from a Regression Perspective. New York: Springer-Verlag; 2008.
- Yang J, Yang JY: Why can LDA be performed in PCA transformed space? Pattern Recognit 2003, 36(2):563–566.
- Vapnik V: The Nature of Statistical Learning Theory. Springer-Verlag; 1995.
- Buja A, Stuetzle W: Observations on bagging. Stat Sin 2006, 16(2):323–351.
- Ding CHQ, Dubchak I: Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 2001, 17(4):349–358.
- Breitling R, Armengaud P, Amtmann A, Herzyk P: Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett 2004, 573(1–3):83–92.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.