- Research article
- Open Access
Classification across gene expression microarray studies
© Buness et al; licensee BioMed Central Ltd. 2009
Received: 8 September 2008
Accepted: 30 December 2009
Published: 30 December 2009
The increasing number of gene expression microarray studies represents an important resource in biomedical research. As a result, gene expression based diagnosis has entered clinical practice for patient stratification in breast cancer. However, the integration and combined analysis of microarray studies remains a challenge. We assessed the potential benefit of data integration on the classification accuracy and systematically evaluated the generalization performance of selected methods on four breast cancer studies comprising almost 1000 independent samples. To this end, we introduced an evaluation framework which aims to establish good statistical practice and a graphical way to monitor differences. The classification goal was to correctly predict estrogen receptor status (negative/positive) and histological grade (low/high) of each tumor sample in an independent study which was not used for the training. For the classification we chose support vector machines (SVM), predictive analysis of microarrays (PAM), random forest (RF) and k-top scoring pairs (kTSP). Guided by considerations relevant for classification across studies, we developed a generalization of kTSP which we evaluated in addition. Our derived version (DV) aims to improve the robustness of the intrinsic invariance of kTSP with respect to technologies and preprocessing.
For each individual study the generalization error was benchmarked via complete cross-validation and was found to be similar for all classification methods. The misclassification rates were substantially higher in classification across studies, when each single study was used as an independent test set while all remaining studies were combined for the training of the classifier. However, with increasing number of independent microarray studies used in the training, the overall classification performance improved. DV performed better than the average and showed slightly less variance. In particular, the better predictive results of DV in across platform classification indicate higher robustness of the classifier when trained on single channel data and applied to gene expression ratios.
We present a systematic evaluation of strategies for the integration of independent microarray studies in a classification task. Our findings in classification across studies may guide further research aimed at the construction of more robust and reliable methods for stratification and diagnosis in clinical practice.
Transcriptional profiling studies of human diseases aim to identify causal molecular mechanisms as well as to improve diagnosis. For example, in breast cancer several prognostic gene signatures have been proposed [1–4]. To date, one has been approved for clinical diagnosis. Despite this success, molecular signatures based on microarray gene expression data may be unstable and thus still need to be considered with caution [5]. Similarly, the lack of agreement between different signatures raises doubts about the reliability and robustness of reported predictive gene lists [6]. Insufficient sample sizes and the heterogeneity of tumor samples and patient characteristics are common obstacles in microarray data analysis. The integration of multiple studies may overcome these limitations. However, different protocols and technologies hamper such attempts and the translation to clinical practice. This particularly affects predictive signatures derived from gene expression microarray data. For example, a drop in predictive accuracy across two different technology platforms measuring a common set of samples has been found [7]. The misclassification rate rose from 2 to 19.5% in this study. On the other hand, promising classification results for the integration of studies were reported [8], as well as a high level of concordance between several microarray-based and alternative technology platforms measuring gene expression [9].
Data integration and classification
The limited sample size in microarray studies always raises the question whether the sample under investigation is representative of the population. This is a major concern for any prediction task, and therefore it is highly desirable to consider all available data from related studies. Even then, it is unclear whether a meta-analysis increases the performance of a classifier, since technical and experimental settings vary between studies and introduce an additional layer of variation which might undo the benefits of a larger sample size. Therefore, an arbitrary integration of available data may not be the best choice. Nonetheless, an adequate strategy for the integration of heterogeneous data sets may help to identify those signals or genes which are not only most relevant for the classification task but also least variable with respect to the underlying experimental and technical differences. This helps to construct more robust gene signatures and may ultimately lead to classifiers of increased predictive power.
Gene expression microarrays
Experimental and technical differences of microarray gene expression studies hamper data integration. Gene expression microarray platforms are based upon the principle of hybridization. The most important difference between platforms is the set of sequences used to measure the mRNA of a given gene. Not only the length of the sequence varies but also the choice of the most representative sequence for any particular gene. Moreover, the choice of the sequence itself influences the intensity of the signal independently of the given mRNA abundance. For example, the number of physical positions which can bind a fluorescent dye varies. Besides that, the total of all sequences on the microarray as well as the experimental processing influence the cross-hybridization properties. Thus, it is not only difficult to decide which measurements or sequences correspond to the same mRNA, but even in the case of matching sequences the measurement characteristics still vary. For example, the offset, scale, dynamic range and behavior as well as the level of noise differ. Another level of complexity arises from the use of two-channel (two-color) platforms and single channel platforms. The two-channel platforms mostly quantify all measurements relative to a common reference sample. As a consequence, the offset of the measurement is completely dependent on the choice of the reference. Nonetheless, the influence of the reference approximately cancels out if one focuses on differences in mRNA abundance between classes. Here, we disregard all issues of sequence matching including exons, introns, splicing variants etc. and rely solely on the gene annotation of probes or probe sets. The gene symbol serves as the basis for the comparison of annotated probes across platforms. Note that any analysis will be hampered by different, technology-specific probe affinities for the same gene, by measurement failures as well as by wrongly annotated probes.
Classification across studies
The major issue when focusing on the experimental and technical aspect of classification across studies can be summarized as follows: different gene expression profiling platforms or studies measure the expression of the same gene with different precision and on a different scale. Nonetheless, a common representation of the gene expression measurements not only allows microarray data sets to be combined directly, but also allows the resulting classifier to be applied readily to a new data set which is represented in the same manner. To this end, [10, 11] proposed the method TSP (top scoring pair) and [12] the generalized version kTSP (k-top scoring pairs), classifiers which directly refer to the relative ranks, i.e. the ordering of the actual gene expression values within a profile. kTSP was shown to perform as well as state-of-the-art algorithms while using a relatively small number of genes for classification. In addition, enhanced types of this approach have been developed and successfully applied to integrate gene expression studies for classification [13, 14]. Independently, [8] introduced several variants to represent microarray data based upon the relative ranks of the gene expression values. Standard classification techniques were applied to binned or scaled quantile discretized data. Here, we choose the same general approach based upon the relative ranks and focus on the quantile transformed gene expression values. This approach appears to be the simplest and most intuitive. In particular, it is relatively insensitive to preprocessing, e.g. scaling and normalization.
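The insensitivity of a rank-based representation to scaling can be illustrated with a minimal Python sketch. The function name `quantile_transform`, the tie handling by average ranks and the toy values are assumptions made for illustration and are not taken from the study.

```python
def quantile_transform(profile):
    """Replace each expression value by its relative rank in (0, 1].

    Tied values share the average of the ranks they span.
    """
    n = len(profile)
    order = sorted(range(n), key=lambda g: profile[g])
    quantiles = [0.0] * n
    start = 0
    while start < n:
        end = start
        # extend the block while the next value ties with the current one
        while end + 1 < n and profile[order[end + 1]] == profile[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for position in range(start, end + 1):
            quantiles[order[position]] = average_rank / n
        start = end + 1
    return quantiles


# One hypothetical profile on two different scales (offset and scale changed).
raw = [5.0, 120.0, 33.0, 7.5]
rescaled = [2.0 * value + 10.0 for value in raw]

# Any strictly increasing transformation leaves the quantiles unchanged.
assert quantile_transform(raw) == quantile_transform(rescaled)
```

Because only the ordering within a profile enters the representation, differences in offset and scale between platforms drop out, whereas gene-specific probe affinities that change the ordering do not.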
The final classifier for estrogen receptor status which we obtained when integrating all four independent studies comprises 50 genes for the methods DV and kTSP (see Additional File 5). This gene list includes the three genes FOXA1, GATA3 and SLC39A6 which have been previously reported in a ten gene signature [15]. The ten gene signature was found to be highly predictive for estrogen receptor status. Our final PAM classifier is based upon 115 genes (see Additional File 5). Again three genes, FOXA1, GATA3 and ESR1, overlap with the ten gene signature. All three final classifiers contain several genes which are well known to be estrogen responders like AGR2, STC2, TFF1 or XBP1 . These findings support the relevance and predictivity of the gene signatures. The classifiers obtained by the methods SVM and RF contain 1000 genes since no further feature selection was applied. The final classifier for histological grade based on the DV and kTSP methods was derived after the integration of the three studies which contain grading annotation (Table 1). This classifier is based on the following 20 genes: APOC1, CENPA, CKS2, CXCL10, FST, GJA1, H2AFZ, HMGB3, HN1, KIF13B, KIF2C, NAT1, NOVA1, PCM1, PNRC2, QDPR, SCUBE2, SEC61G, STC2, UBE2C . This signature contains genes which have been strongly linked to breast cancer progression in previous studies. APOC1 has been identified in a multiprotein index based on postoperative serum proteomic profiles which is associated with metastatic relapse in high-risk breast cancer patients receiving adjuvant chemotherapy [17]. Protein expression of STC2 correlates with longer disease free survival [18]. KIF2C is overexpressed in breast cancer cells and functional analysis suggests a link between overexpression and carcinogenesis [19]. RNAi based inhibition of KIF2C expression inhibits growth in breast cancer cell lines.
Furthermore, reactivation of the potential tumor suppressor connexin GJA1 (Cx43) leads to reduced cell migration and regulates various angiogenesis linked proteins in breast cancer cell lines [20]. These findings indicate that the signature is not only predictive for the histological grade of breast cancer specimens, but can also be linked to tumor progression. Thus, the signature may contain further important candidate genes, even though predictive signatures in general need not be related to any causal molecular mechanism. The 60 genes of the PAM classifier strongly overlap with the previous list of 20 genes (see Additional File 5). Ten genes are common to both lists: CENPA, GJA1, HMGB3, KIF13B, KIF2C, NAT1, QDPR, SCUBE2, STC2, UBE2C. This concordance between the signatures for histological grade across different methods supports their relevance. Moreover, the observed partial overlap of the predictive signatures derived for estrogen receptor status and grade may relate to the aforementioned association between estrogen receptor negative tumors and high grade. For example, in the case of DV and kTSP the intersection of the corresponding signatures comprises the three genes NAT1, STC2 and SCUBE2 .
The complete cross-validation procedure separately applied to each microarray study and classification method gives an unbiased estimate and benchmark for the misclassification rate. The rather large sample sizes of the studies tend to control the variance of this estimate. Thus, the distinct drop of accuracy in classification across studies (Figures 2 and 3) has to be attributed to methodological shortcomings, underlying experimental differences and other factors which distinguish the studies and sample collectives. We present an approach to systematically evaluate and quantify the effect of data integration on the accuracy in classification across studies. A weak but consistent decrease of the misclassification rate with an increasing number of studies used in the training set was observed. To date, at least the potential shortcomings of the computational methods used for classification across studies have not been studied well. Our framework allows the assessment of any computational method for data integration and classification across studies, and our proposed method DV shows favorable characteristics. Together, these may guide further research.
The drastic increase of the dimensionality when considering pairs of genes instead of single genes adds to the complexity of the problem. However, the good performance of kTSP and DV may indicate an interesting characteristic of gene expression microarray data which accounts for their success despite the dimensionality. Recent results highlight the importance of the dependence structure in microarray gene expression data and discuss the advantage of focusing on differences of ordered, non-overlapping pairs of log-transformed gene expression values [21]. This adds evidence that a well chosen step into the space of pairs or possibly even triples of genes not only increases the dimensionality and the associated difficulties, but may also allow a focus on the relevant structure present in the data.
Our straightforward extension of the kTSP method, evaluated in the setting of classification across studies and platforms, performed comparably well. Nonetheless, further improvements of DV can be anticipated. The impact on the classification rule of failing to measure a single gene varies between genes. The loss of information on a single gene affects all related gene pairs, which are not necessarily equally distributed between the genes. To further balance the importance of the genes in the majority vote across all gene pairs, one may introduce a two stage voting scheme. For each gene, the majority vote of all pairs which contain the gene is determined, and then in a second step the majority of all resulting individual gene votes defines the prediction.
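The two stage voting scheme proposed above can be sketched in a few lines of Python. This is an illustration of the idea only, not an evaluated method: the rule format `(gene_i, gene_j, class_if_lower, class_otherwise)`, the gene names and the tie handling (the first label seen wins) are assumptions.

```python
from collections import Counter, defaultdict


def pair_vote(sample, rule):
    """Vote of one elementary classifier based on the ordering of two genes."""
    gene_i, gene_j, class_if_lower, class_otherwise = rule
    return class_if_lower if sample[gene_i] < sample[gene_j] else class_otherwise


def majority(votes):
    """Most frequent label; ties resolve to the label that was seen first."""
    return Counter(votes).most_common(1)[0][0]


def flat_vote(sample, rules):
    """Single majority vote across all gene pairs."""
    return majority([pair_vote(sample, rule) for rule in rules])


def two_stage_vote(sample, rules):
    """Stage 1: one vote per gene from its pairs. Stage 2: majority of gene votes."""
    votes_by_gene = defaultdict(list)
    for rule in rules:
        vote = pair_vote(sample, rule)
        votes_by_gene[rule[0]].append(vote)
        votes_by_gene[rule[1]].append(vote)
    gene_votes = [majority(votes) for votes in votes_by_gene.values()]
    return majority(gene_votes)


# Hypothetical rule set: gene "H" takes part in four pairs, all other genes in one.
rules = [("H", f"P{n}", "positive", "negative") for n in range(1, 5)]
rules += [(f"Q{n}", f"R{n}", "positive", "negative") for n in range(1, 4)]

sample = {"H": 1.0}
sample.update({f"P{n}": 2.0 for n in range(1, 5)})
sample.update({f"Q{n}": 5.0 for n in range(1, 4)})
sample.update({f"R{n}": 1.0 for n in range(1, 4)})

print(flat_vote(sample, rules), two_stage_vote(sample, rules))
```

In this toy case the four pairs around the shared gene "H" decide the flat vote, whereas the two stage scheme counts "H" only once, so a faulty measurement of that single gene carries less weight.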
Moreover, it is worth mentioning that the simple counting nature of the score Δ ij makes it easy to integrate any further microarray data set into an already existing classifier rule and hence to extend the sample size. However, the properties of such a classifier require further validation steps. When aiming to build a reasonable classifier, one should also recall the counting nature of the score Δ ij and ensure a minimal number of samples, possibly at least 20 in total. Besides that, the preselection of the genes, which is based upon the kTSP method itself, might be replaced by a more tailored strategy accounting for the interconnected classification rule of DV. Additionally, faster alternatives like a feature reduction which preselects all gene pairs with the highest variance in their relative difference may be evaluated. Other promising variants of kTSP also remain to be analyzed. For example, instead of building the classifier upon the relation between pairs of genes, one might consider relations between features like sets of genes, summarized networks or pathways. This can include mixtures like the matrix of all relations between genes and summarized networks.
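The counting nature of the score mentioned at the start of this paragraph can be made concrete with a short sketch. It assumes the usual TSP form of the score, i.e. the difference between the two classes in the relative frequency of observing gene i below gene j. The class name `PairScoreCounter`, the gene names and the toy data are hypothetical.

```python
class PairScoreCounter:
    """Running counts for one gene pair, updated study by study."""

    def __init__(self, gene_i, gene_j):
        self.gene_i = gene_i
        self.gene_j = gene_j
        self.samples = {}  # class label -> number of samples seen so far
        self.lower = {}    # class label -> samples with gene_i < gene_j

    def add_study(self, profiles, labels):
        """Add the samples of a further study without revisiting earlier ones."""
        for profile, label in zip(profiles, labels):
            self.samples[label] = self.samples.get(label, 0) + 1
            if profile[self.gene_i] < profile[self.gene_j]:
                self.lower[label] = self.lower.get(label, 0) + 1

    def score(self, class_a, class_b):
        """Difference of the relative frequencies of 'gene_i < gene_j'."""
        freq_a = self.lower.get(class_a, 0) / self.samples[class_a]
        freq_b = self.lower.get(class_b, 0) / self.samples[class_b]
        return freq_a - freq_b


counter = PairScoreCounter("X", "Y")
counter.add_study([{"X": 1.0, "Y": 2.0}, {"X": 3.0, "Y": 1.0}], ["pos", "neg"])
score_one_study = counter.score("pos", "neg")

# A second (hypothetical) study only increments the stored counts.
counter.add_study([{"X": 2.0, "Y": 1.0}, {"X": 1.0, "Y": 5.0}], ["pos", "neg"])
score_two_studies = counter.score("pos", "neg")
```

Only the counts per class need to be stored, so earlier studies do not have to be reprocessed. As the toy numbers show, the score can change considerably with few samples, in line with the caution on minimal sample size above.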
The consistent misclassification of a core of samples independently of the method and training set composition indirectly confirms generalization power. These misclassifications might be caused by technical or experimental issues of the mRNA processing and measurement, or by wrong sample annotations and constraints at the time of the class assignment in the clinic. Improved methods in clinical practice which profit from the feedback and findings of gene expression microarray studies may overcome such limitations in the future. For example, quantitative RT-PCR or robust low-density array platforms may become a more important alternative to existing clinical procedures. Moreover, the identification of 'marginal' samples as well as of samples which persistently failed correct classification may be helpful to rebuild and refine the classifier, which then needs to be validated in subsequent studies.
Normalization or calibration procedures often rely on the assumption of no change in expression of housekeeping genes or overall genes. This assumption may be of critical importance in diagnostic testing which is based on low-density arrays or quantitative RT-PCR where only a small number of genes is measured. Thus, classification methods solely relying on the relative quantification between pairs or sets of genes like kTSP and DV appear to be especially well suited. Despite their simplicity they show competitive performance and may play an important role in future clinical tests.
The integration and combined analysis of gene expression microarray studies remains a challenge. Mainly, different technologies hamper the integration and the translation to clinical practice. We propose a tailored classification method and a systematic approach to assess the benefit of integrating independent studies in an unbiased way. This approach aims to establish good statistical practice. We analyzed four human transcriptomic breast cancer studies on different platforms comprising almost 1000 samples, evaluated five classification methods and derived exemplary classifiers for the prediction of estrogen receptor status and histological grade. In summary, our proposed method performed favorably. In particular, it showed superior performance in the across-platform classification setting when trained on single channel data. Our results may guide further research aimed at more reliable diagnostic and prognostic gene signatures in clinical practice.
Four independent and publicly available breast cancer microarray studies were compiled and prepared for the analysis as described elsewhere [22]. We restricted ourselves to those 9765 unique Entrez gene identifiers which were common to all data sets. In addition, 85 samples in study  already present in study  were removed prior to the analysis and only tumor samples with documented histological grade (377) or estrogen receptor status (926) were included in the analysis (Table 1).
Generalization of kTSP
The kTSP (k-top scoring pairs) method is a binary classifier which is composed of k elementary classifiers. Each elementary classifier is based upon a pair of genes of 'high predictive power'. This means that gene i has a prevalently higher expression value than gene j in one class and vice versa in the other class. An unknown sample is then classified according to the relative expression values of genes i and j in that sample. The kTSP method simply collects the votes from the k best disjoint elementary classifiers and reports the majority vote as its decision. Each gene is allowed to participate in only one elementary classifier, because this makes the decision rule robust against systematic errors in the measurements of a single gene. On the other hand, an "indicator" gene that is very high in one group and very low in the other is very likely to be part of many high performance elementary classifiers. In our derived version (DV) of kTSP, we trade off robustness against performance by introducing an upper bound F that limits the number of classifiers that contain a common gene. Our derived version selects gene pairs by starting with the highest scoring gene pair and then successively adjoining the next highest scoring gene pair for which neither of its two constituents is already contained in F previously selected gene pairs. Note that the parameter choice F = 1 results in the original kTSP method when only considering all pairs of genes in the set of 2k genes which have been selected by the kTSP classifier.
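The score Δ ij of a gene pair (i, j) is defined as in the TSP approach [10, 11]: writing p ij (c) for the probability that gene i has a lower expression value than gene j in class c, Δ ij = p ij (1) − p ij (2). If these probabilities were known, the elementary classifier would assign a sample to the class in which its observed ordering of the two genes is more probable, and the Bayes error would be decreasing for increasing score |Δ ij |, see .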
The final decision of DV is again made by majority vote. An extensive and more algorithmic description of DV is given in the supplements (see Additional File 6).
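The selection rule with the upper bound F and the majority vote can be sketched as follows. This is a minimal illustration and not the implementation given in Additional File 6. It assumes that the score is estimated as the difference between the classes in the relative frequency of observing gene i below gene j. The function names, the gene names and the toy data are hypothetical, and ties in the vote are resolved in favor of the second class.

```python
from itertools import combinations


def pair_scores(profiles, labels, class_a, class_b):
    """Estimate the score of every gene pair from relative frequencies."""
    genes = sorted(profiles[0])
    n = {class_a: labels.count(class_a), class_b: labels.count(class_b)}
    scores = {}
    for gene_i, gene_j in combinations(genes, 2):
        lower = {class_a: 0, class_b: 0}
        for profile, label in zip(profiles, labels):
            if profile[gene_i] < profile[gene_j]:
                lower[label] += 1
        scores[(gene_i, gene_j)] = (lower[class_a] / n[class_a]
                                    - lower[class_b] / n[class_b])
    return scores


def select_pairs(scores, k, max_per_gene):
    """Greedily pick up to k pairs; no gene may occur in more than F pairs."""
    usage = {}
    selected = []
    for pair in sorted(scores, key=lambda p: abs(scores[p]), reverse=True):
        if all(usage.get(gene, 0) < max_per_gene for gene in pair):
            selected.append(pair)
            for gene in pair:
                usage[gene] = usage.get(gene, 0) + 1
        if len(selected) == k:
            break
    return selected


def classify(sample, selected, scores, class_a, class_b):
    """Majority vote of the selected elementary classifiers."""
    votes_a = 0
    for gene_i, gene_j in selected:
        i_lower = sample[gene_i] < sample[gene_j]
        # a positive score means 'gene_i < gene_j' is more frequent in class_a
        if i_lower == (scores[(gene_i, gene_j)] > 0):
            votes_a += 1
    return class_a if 2 * votes_a > len(selected) else class_b


# Hypothetical training data: two samples per class, four genes.
profiles = [
    {"G1": 1.0, "G2": 5.0, "G3": 3.0, "G4": 4.0},
    {"G1": 2.0, "G2": 6.0, "G3": 4.0, "G4": 3.0},
    {"G1": 5.0, "G2": 1.0, "G3": 3.0, "G4": 4.0},
    {"G1": 6.0, "G2": 2.0, "G3": 4.0, "G4": 3.0},
]
labels = ["pos", "pos", "neg", "neg"]

scores = pair_scores(profiles, labels, "pos", "neg")
disjoint = select_pairs(scores, k=2, max_per_gene=1)  # kTSP-like, F = 1
shared = select_pairs(scores, k=3, max_per_gene=2)    # DV-like, F = 2
new_sample = {"G1": 1.5, "G2": 7.0, "G3": 3.5, "G4": 2.0}
prediction = classify(new_sample, shared, scores, "pos", "neg")
```

With F = 1 the second pair has to be formed from the two remaining genes although its score is zero, while F = 2 admits three informative pairs that share genes. This is the trade-off between robustness and performance described above.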
All analyses were done using R [23] and Bioconductor [24]. We selected the following five methods for the analysis: support vector machines (SVM) with radial basis kernel, predictive analysis of microarrays (PAM), random forest (RF), kTSP and our generalization DV [12, 25–27]. The parameter k of kTSP, which controls the number of non-overlapping gene pairs, was optimized over the choices 10, 25 and 50. In the case of DV, the kTSP method was used as a feature reduction technique and k was optimized over the same choices. The parameter F of DV was set to 4.
To estimate the misclassification rate for each individual data set via complete cross-validation we used the package MCRestimate [28]. The whole process of an (outer) 5-fold cross-validation was repeated ten times. Parameters were optimized in a nested (inner) 5-fold cross-validation. The normalized data sets, constrained to the 9765 common unique Entrez gene identifiers, were used without any further preprocessing, as described elsewhere [22]. A preprocessing step was applied in each cross-validation step to reduce the number of features: the 1000 genes with the highest variance were selected.
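The structure of such a complete cross-validation can be sketched in plain Python. This is not the MCRestimate implementation. The sketch only shows that gene selection and parameter tuning are repeated inside every training fold so that the held-out samples never influence them. The k-nearest-neighbour classifier is merely a stand-in for the methods used in the study, and all function names and the toy data are hypothetical.

```python
import random
from statistics import pvariance


def split_folds(indices, n_folds, rng):
    """Shuffle the sample indices and deal them into n_folds disjoint folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[f::n_folds] for f in range(n_folds)]


def top_variance_genes(rows, n_genes):
    """Indices of the n_genes features with the highest variance in rows."""
    n_features = len(rows[0])
    variance = [pvariance([row[g] for row in rows]) for g in range(n_features)]
    ranked = sorted(range(n_features), key=lambda g: variance[g], reverse=True)
    return ranked[:n_genes]


def error_rate(fit, x, y, train, test, param, n_genes):
    """Select genes and fit on the training samples only, then test."""
    genes = top_variance_genes([x[s] for s in train], n_genes)

    def reduce(s):
        return [x[s][g] for g in genes]

    predict = fit([reduce(s) for s in train], [y[s] for s in train], param)
    return sum(predict(reduce(s)) != y[s] for s in test) / len(test)


def nested_cv(fit, x, y, params, n_genes, n_folds=5, repeats=10, seed=0):
    """Repeated outer cross-validation with an inner loop for the parameter."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        outer = split_folds(range(len(x)), n_folds, rng)
        for held_out in outer:
            train = [s for fold in outer if fold is not held_out for s in fold]
            inner = split_folds(train, n_folds, rng)

            def inner_error(param):
                total = 0.0
                for validation in inner:
                    inner_train = [s for fold in inner
                                   if fold is not validation for s in fold]
                    total += error_rate(fit, x, y, inner_train, validation,
                                        param, n_genes)
                return total

            best = min(params, key=inner_error)
            errors.append(error_rate(fit, x, y, train, held_out, best, n_genes))
    return sum(errors) / len(errors)


def fit_knn(train_x, train_y, k):
    """Stand-in classifier (k nearest neighbours), not a method of the study."""
    def predict(sample):
        def distance(s):
            return sum((a - b) ** 2 for a, b in zip(train_x[s], sample))
        nearest = sorted(range(len(train_x)), key=distance)[:k]
        votes = [train_y[s] for s in nearest]
        return max(sorted(set(votes)), key=votes.count)
    return predict


# Toy data: 20 samples, two informative features followed by four noise features.
y = [s % 2 for s in range(20)]
x = [[10.0 * y[s] + 0.1 * (s % 3), -8.0 * y[s] + 0.1 * (s % 5)]
     + [0.01 * ((7 * s + g) % 4) for g in range(4)] for s in range(20)]
estimate = nested_cv(fit_knn, x, y, params=(1, 3), n_genes=2)
```

Selecting the high-variance genes once on the full data set before splitting would let the test samples influence the feature set, which is what the complete cross-validation avoids.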
Our approach to evaluate classification across studies is outlined in Figure 1. Each study is either used as training or test set in a systematic manner. This covers all possible combinations and assures an unbiased estimate of the generalization power. As in the complete cross-validation approach 1000 genes were selected in a preprocessing step. Those 1000 genes were chosen which were found to have the highest variance across all samples in the training set after quantile normalization to the mean gene expression of the corresponding pooled quantiles. After gene selection, PAM, SVM and RF were directly trained and tested on the original quantiles i.e. on the data for each sample which resulted from the replacement of the gene expression value by its quantile prior to the gene selection. The same preprocessing for gene selection was used for kTSP and DV but the methods were then applied to the original normalized gene expression values.
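One reading of this systematic design is that every study serves in turn as the test set while every non-empty combination of the remaining studies serves as the training set. The following sketch enumerates these designs under that assumption. The study names are placeholders and the function name `across_study_designs` is hypothetical.

```python
from itertools import combinations


def across_study_designs(studies):
    """List all (training studies, test study) combinations."""
    designs = []
    for test_study in studies:
        remaining = [study for study in studies if study != test_study]
        for size in range(1, len(remaining) + 1):
            for training in combinations(remaining, size):
                designs.append((training, test_study))
    return designs


designs = across_study_designs(["study A", "study B", "study C", "study D"])

# Number of designs for each training set size (1, 2 or 3 studies).
by_size = {}
for training, test_study in designs:
    by_size[len(training)] = by_size.get(len(training), 0) + 1
```

For four studies this yields 28 designs, which allows the misclassification rate to be compared as a function of the number of studies used for training while the test study always remains independent.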
We would like to thank Holger Sültmann for his support. In addition, the work was supported by a grant of the German National Genome Research Network (NGFN) funded by the German Federal Ministry for Education and Research (RK by NGFN grant 01GR0101 and MR by NGFN grant 01GR0459).
- Chang HY, Nuyten DSA, Sneddon JB, Hastie T, Tibshirani R, Sorlie T, Dai H, He YD, van't Veer LJ, Bartelink H, Rijn M, Brown PO, Vijver MJ: Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proc Natl Acad Sci USA 2005, 102(10):3738–43. doi:10.1073/pnas.0409462102
- Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci USA 2005, 102(38):13550–5. doi:10.1073/pnas.0506230102
- van't Veer LJ, Dai H, Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530–536. doi:10.1038/415530a
- Wang Y, Klijn JGM, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, van Gelder MEM, Yu J, Jatkoe T, Berns EMJJ, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365(9460):671–9.
- Michiels S, Koscielny S, Hill C: Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 2005, 365(9458):488–92. doi:10.1016/S0140-6736(05)17866-0
- Ein-Dor L, Zuk O, Domany E: Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc Natl Acad Sci USA 2006, 103(15):5923–8. doi:10.1073/pnas.0601231103
- Stec J, Wang J, Coombes K, Ayers M, Hoersch S, Gold DL, Ross JS, Hess KR, Tirrell S, Linette G, Hortobagyi GN, Symmans WF, Pusztai L: Comparison of the predictive accuracy of DNA array-based multigene classifiers across cDNA arrays and Affymetrix GeneChips. J Mol Diagn 2005, 7(3):357–67.
- Warnat P, Eils R, Brors B: Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics 2005, 6: 265. doi:10.1186/1471-2105-6-265
- Consortium MAQC: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 2006, 24(9):1151–1161. doi:10.1038/nbt1239
- Geman D, d'Avignon C, Naiman DQ, Winslow RL: Classifying gene expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol 2004, 3: Article19.
- Xu L, Tan AC, Naiman DQ, Geman D, Winslow RL: Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data. Bioinformatics 2005, 21(20):3905–11. doi:10.1093/bioinformatics/bti647
- Tan AC, Naiman DQ, Xu L, Winslow RL, Geman D: Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics 2005, 21(20):3896–904. doi:10.1093/bioinformatics/bti631
- Xu L, Geman D, Winslow RL: Large-scale integration of cancer microarray data identifies a robust common cancer signature. BMC Bioinformatics 2007, 8: 275. doi:10.1186/1471-2105-8-275
- Xu L, Tan AC, Winslow RL, Geman D: Merging microarray data from separate breast cancer studies provides a robust prognostic test. BMC Bioinformatics 2008, 9: 125. doi:10.1186/1471-2105-9-125
- Schneider J, Ruschhaupt M, Buness A, Asslaber M, Regitnig P, Zatloukal K, Schippinger W, Ploner F, Poustka A, Sültmann H: Identification and meta-analysis of a small gene expression signature for the diagnosis of estrogen receptor status in invasive ductal breast cancer. Int J Cancer 2006, 119(12):2974–2979. doi:10.1002/ijc.22234
- Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, de Vijver MJV, Bergh J, Piccart M, Delorenzi M: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst 2006, 98(4):262–72. doi:10.1093/jnci/djj052
- Goncalves A, Esterni B, Bertucci F, Sauvan R, Chabannon C, Cubizolles M, Bardou VJ, Houvenaegel G, Jacquemier J, Granjeaud S, Meng XY, Fung ET, Birnbaum D, Maraninchi D, Viens P, Borg JP: Postoperative serum proteomic profiles may predict metastatic relapse in high-risk primary breast cancer patients receiving adjuvant chemotherapy. Oncogene 2006, 25(7):981–989. doi:10.1038/sj.onc.1209131
- Esseghir S, Kennedy A, Seedhar P, Nerurkar A, Poulsom R, Reis-Filho JS, Isacke CM: Identification of NTN4, TRA1, and STC2 as prognostic markers in breast cancer in a screen for signal sequence encoding proteins. Clin Cancer Res 2007, 13(11):3164–3173. doi:10.1158/1078-0432.CCR-07-0224
- Shimo A, Tanikawa C, Nishidate T, Lin ML, Matsuda K, Park JH, Ueki T, Ohta T, Hirata K, Fukuda M, Nakamura Y, Katagiri T: Involvement of kinesin family member 2C/mitotic centromere-associated kinesin overexpression in mammary carcinogenesis. Cancer Sci 2008, 99: 62–70.
- McLachlan E, Shao Q, Wang HL, Langlois S, Laird DW: Connexins act as tumor suppressors in three-dimensional mammary cell organoids by regulating differentiation and angiogenesis. Cancer Res 2006, 66(20):9886–9894. doi:10.1158/0008-5472.CAN-05-4302
- Klebanov L, Yakovlev A: Diverse correlation structures in gene expression data and their utility in improving statistical inference. Ann Appl Statist 2007, 1(2):538–559. doi:10.1214/07-AOAS120
- Buness A, Kuner R, Ruschhaupt M, Poustka A, Sültmann H, Tresch A: Identification of aberrant chromosomal regions from gene expression microarray studies applied to human breast cancer. Bioinformatics 2007, 23(17):2273–80. doi:10.1093/bioinformatics/btm340
- Ihaka R, Gentleman R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 1996, 5: 299–314. doi:10.2307/1390807
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. doi:10.1186/gb-2004-5-10-r80
- Liaw A, Wiener M: Classification and Regression by randomForest. R News 2002, 2(3):18–22.
- Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002, 99(10):6567–72. doi:10.1073/pnas.082099299
- Vapnik V: The nature of statistical learning theory. Springer Verlag; 1999.
- Ruschhaupt M, Huber W, Poustka A, Mansmann U: A compendium to ensure computational reproducibility in high-dimensional classification tasks. Stat Appl Genet Mol Biol 2004, 3: Article37.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.