- Research article
- Open Access
Stepwise classification of cancer samples using clinical and molecular data
© Obulkasim et al; licensee BioMed Central Ltd. 2011
- Received: 27 May 2011
- Accepted: 28 October 2011
- Published: 28 October 2011
Combining clinical and molecular data types may improve the prediction accuracy of a classifier. However, there is currently a shortage of effective and efficient statistical and bioinformatic tools for true integrative data analysis. Existing integrative classifiers have two main disadvantages. First, coarse combination may cause subtle contributions of one data type to be overshadowed by more obvious contributions of the other. Second, the need to measure both data types for all patients may be both impractical and cost-inefficient.
We introduce a novel classification method, a stepwise classifier, which takes advantage of the distinct classification power of clinical data and high-dimensional molecular data. We apply classification algorithms to two data types independently, starting with the traditional clinical risk factors. We only turn to relatively expensive molecular data when the uncertainty of prediction result from clinical data exceeds a predefined limit. Experimental results show that our approach is adaptive: the proportion of samples that needs to be re-classified using molecular data depends on how much we expect the predictive accuracy to increase when re-classifying those samples.
Our method renders a more cost-efficient classifier that is at least as good as, and sometimes better than, one based on clinical or molecular data alone. Hence our approach is not just a classifier that minimizes a particular loss function. Instead, it aims to be cost-efficient by avoiding molecular tests for a potentially large subgroup of individuals; moreover, for these individuals a test result would be quickly available, which may reduce waiting times (for diagnosis) and hence lower the patients' distress. Stepwise classification is implemented in the R package stepwiseCM, which is available at the Bioconductor website.
- Partial Least Squares
- Data Type
- Random Forest
- Molecular Data
- Classification Algorithm
Accurate prognosis of relevant cancer-related endpoints, such as relapse, recurrence or metastasis, may lead to more targeted treatment and avoid unnecessary chemotherapy or surgery. One example is breast cancer recurrence. A major clinical problem of breast cancer recurrence is that by the time the primary tumor is diagnosed, microscopic metastases may have already occurred. For this reason, patients at high risk receive more intensive chemotherapy, endocrine therapy or radiotherapy. Yet the ability to predict metastasis remains one of the greatest clinical challenges in oncology.
Classifying cancer subtypes with high precision and predicting treatment outcome are intensive research topics. Traditional cancer prognosis relies on a complex and inexact combination of assessment of clinical and histopathological data. These classic approaches, however, may fail when dealing with atypical tumors or morphologically indistinguishable tumor subtypes; most cancers are both clinically and biologically heterogeneous diseases.
Various clinical or pathological factors have been evaluated as prognostic factors. For example, the treatment of cancer is often based on factors such as age, lymph node status and tumor size. Although these factors provide valuable information about the risk of recurrence, they are generally considered insufficient to predict individual patient outcomes and to determine an individual patient's need for systemic adjuvant therapy. Recent advances in biotechnology allow us to generate various types of molecular data for the same sample, e.g. copy number aberrations as measured by array CGH, mRNA expression, SNPs and methylation. Each of these distinct data types provides one view of the molecular machinery of the cancer cell. Molecular data thus add information to the analysis of biological phenotypes. To illustrate our stepwise approach, we assume that clinical data are comparatively cheap and easy to collect, whereas the molecular data are high-dimensional and relatively expensive. This is, however, not a crucial assumption for the method as such. Moreover, our method is partly motivated by the common perception that classification results from clinical data are more stable than those from high-dimensional molecular data.
Although molecular data and clinical covariates are likely to be correlated, they also contain partly independent information. For example, the extent of lymph node metastasis is currently the key predictor of tumor state, aggressiveness and recurrence risk; so far this prognostic value cannot be replaced by any type of molecular data [1]. On the other hand, molecular data alone may supersede other non-genomic factors in prognosis, as refined molecular technologies improve the capacity to characterize complex oncogenic processes. Combining these complementary pieces of information may be expected to enhance classification accuracy.
So far, few methods have been proposed to integrate clinical and molecular data to obtain accurate cancer prognosis. In [2], microarray data and clinical variables are integrated using a modular hierarchical model to predict the outcome for diffuse large B-cell lymphoma (DLBCL). Separate modules are constructed for microarray and clinical data. The microarray predictor module is formed by a neural network classifier. For the clinical predictor, an existing clinical prognostic model is converted to a Bayesian classifier. The predictions of the two independent modules are then fused into a single prediction. In [3], a Bayesian tree-based approach for combining the two data types is proposed. At each node of a tree, the collection of metagenes and clinical factors is sampled to determine which function optimally divides the patients at the node. A split is made when significance exceeds a specified level. An integrative approach in which clinical factors are combined with gene expression data using a stepwise logistic regression procedure is introduced in [4]. The logit transformation of the patient's 7-year progression-free probability (PFP), calculated from the nomogram, is imposed as the first variable, and gene variables are added until optimal classification is achieved. In [5], a study is conducted on how to quantify the accuracy that gene classifiers add to clinical characteristics in the prognosis of cancer patients. In [6], a method is proposed that applies partial least squares (PLS) dimension reduction to the molecular data and then applies the random forest (RF) algorithm to both the clinical and the reduced molecular data. In [7], a mixture of experts model is proposed that combines clinical and gene expression data, using different functions to incorporate both types of features. Different gene selection techniques are applied before the integrative mixture of experts model is fitted.
More extensive overviews on integrating clinical and high-dimensional molecular data for the purpose of prediction are available in [8, 9].
All these approaches require the presence of molecular data for all patients, which may be costly, impractical or inefficient. Moreover, the optimal combination of these low- and high-dimensional data is still under debate. The stepwise approach we propose requires molecular data to be available for a subset of patients only. At the same time it aims to achieve high accuracy. Moreover, it applies classification to the two types of data independently, thereby eliminating the concern about optimal combination. We illustrate the performance of our method using three publicly available data sets.
What can we gain by the stepwise approach?
The stepwise classification procedure
1. Obtain a predicted label for every sample in the training set using each of the two data types separately.
2. Calculate a distance matrix for the training set using each of the two data types independently.
3. Project the test set onto the clinical feature space.
4. Estimate the re-classification score (RS) for each test sample. The RS combines the sample's local error rate in the clinical space with a score for the potential improvement when re-classifying it.
5. Rank the test samples by re-classification score in descending order and re-classify a pre-defined proportion of the top-ranked samples with the molecular data classifier.
In the following sections we give a detailed description of each step.
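The last step of the procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not the stepwiseCM implementation: the function name `stepwise_predict`, the callable `classify_molecular` and the rounding rule (ceiling) are assumptions made for the example, and the re-classification scores are taken as given inputs.

```python
import math

def stepwise_predict(clinical_pred, classify_molecular, rs, proportion):
    """Combine clinical predictions with molecular re-classification.

    clinical_pred      : list of labels from the clinical classifier
    classify_molecular : hypothetical callable, index -> label; it is only
                         invoked for samples passed to the second stage
    rs                 : re-classification score per test sample (given)
    proportion         : pre-defined fraction of samples to re-classify
    """
    if len(clinical_pred) != len(rs):
        raise ValueError("clinical_pred and rs must have the same length")
    # Number of samples passed on; rounding up is an assumption.
    n_pass = math.ceil(proportion * len(rs))
    # Rank samples by RS in descending order and keep the top ones.
    ranked = sorted(range(len(rs)), key=lambda i: rs[i], reverse=True)
    passed = set(ranked[:n_pass])
    final = [
        classify_molecular(i) if i in passed else clinical_pred[i]
        for i in range(len(rs))
    ]
    return final, sorted(passed)
```

Because the molecular classifier is a callable, molecular measurements are only requested for the samples that are actually passed to the second stage, which mirrors the cost argument of the method.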
The prediction step
To assess the performance of each of the two data types, one or more user-defined classification algorithms are applied to the two data types independently to obtain predictions for the training set. This is one of the characteristics of our method: we apply existing algorithms to construct independent prediction models from the training set.
The distance metric
Since we try to determine the bad neighborhood in both the clinical and the molecular data space, we need a distance metric that can measure the similarity between samples in heterogeneous data spaces. High-dimensional molecular data are usually on a ratio scale, while clinical data often contain continuous, binary and nominal features. This makes it difficult to find a distance calculation that is suitable for both types of data. We could discretize the continuous numeric features into categorical features to make the data homogeneous, but this may lead to loss of information. Applying different weighting parameters to the categorical and the numerical features has been proposed [10], but the behavior of the weighting parameter is not yet fully understood and it is difficult to find an optimal one.
Inspired by the work of [11], we present a method that overcomes the aforementioned problems by using the Random Forest (RF) algorithm [12] to calculate the similarity (referred to as 'proximity') between samples. Random Forest (a collection of decision trees) was originally introduced for classification problems, but at this stage we only use it to calculate the distance matrix.
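As a minimal sketch of the proximity idea, the following Python function computes the usual Random Forest proximity, i.e. the fraction of trees in which two samples end up in the same terminal node. It assumes that the terminal node (leaf) index of every sample in every tree has already been extracted from a fitted forest; the function names and the conversion to a distance as 1 minus proximity are assumptions for illustration, not necessarily the exact choice made in stepwiseCM.

```python
def proximity_matrix(leaf_ids):
    """Random Forest proximity from leaf memberships.

    leaf_ids[t][s] is the terminal node index of sample s in tree t
    (assumed to be available from an already fitted forest).
    Returns an n x n matrix with entries in [0, 1].
    """
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    counts = [[0] * n for _ in range(n)]
    for tree in leaf_ids:
        for a in range(n):
            for b in range(a, n):
                if tree[a] == tree[b]:
                    counts[a][b] += 1
                    if a != b:
                        counts[b][a] += 1
    # Proximity = share of trees in which the two samples share a leaf.
    return [[c / n_trees for c in row] for row in counts]

def distance_matrix(leaf_ids):
    """Convert proximities to distances as 1 - proximity (an assumption)."""
    return [[1.0 - p for p in row] for row in proximity_matrix(leaf_ids)]
```

Because only leaf co-membership is used, the same calculation applies unchanged to continuous, binary and nominal features, which is what makes it attractive for mixed clinical data.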
Here, the rank in question is that of the proximity value between test sample i and its j-th closest correctly (wrongly) classified neighbor. The weighted ranks of the j-th closest neighbor of test sample i are computed separately for the correctly classified sample group and for the incorrectly classified sample group.
Next, we discuss how to use a similar concept in the molecular space.
A large RS_i means that the aggregated information from the two spaces indicates that the i-th test sample is likely to benefit more when classified using molecular data. After estimating the RS for all test samples, we sort them in descending order and pass only the pre-defined top-ranked proportion of samples to the molecular data for re-classification (see additional file 1 for an example calculation of the RS). In practice, test samples often arrive one at a time. In such cases, we advise the following implementation of our procedure. First, based on the classification curve obtained from the study data and on practical (e.g. cost) considerations, decide upon the desired re-classification proportion. This proportion implies a cut point for the RS, which is then used prospectively. If the study data are a good reflection of the entire population, one may expect that this strategy indeed prospectively re-classifies the desired proportion. Naturally, one may monitor the re-classified proportion for the given cut point and adjust it if necessary.
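The prospective use of a cut point can be illustrated with a short Python sketch. The function names, the ceiling rounding and the convention that a new sample is re-classified when its RS is greater than or equal to the cut point are assumptions made for this example.

```python
import math

def rs_cut_point(study_rs, proportion):
    """RS cut point implied by a desired re-classification proportion.

    study_rs   : re-classification scores observed in the study data
    proportion : desired fraction of samples to re-classify
    """
    n_pass = math.ceil(proportion * len(study_rs))  # rounding is an assumption
    if n_pass == 0:
        return float("inf")  # nothing is ever re-classified
    # The cut point is the RS of the last sample that would be passed on.
    return sorted(study_rs, reverse=True)[n_pass - 1]

def needs_molecular_test(rs_new, cut_point):
    """Decide for a single newly arriving sample (>= is an assumption)."""
    return rs_new >= cut_point

def reclassified_fraction(new_rs, cut_point):
    """Monitor the share of prospective samples exceeding the cut point."""
    if not new_rs:
        return 0.0
    return sum(needs_molecular_test(r, cut_point) for r in new_rs) / len(new_rs)
```

If the monitored fraction drifts away from the desired proportion, the cut point can be re-estimated from more recent scores.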
In case the data produce a perfect classification result, which would set K = 0, we use K = 1. For consistency reasons we prefer to use the same number of neighbors for all the 2 * 2 = 4 instances (clinical/molecular; correct/wrong).
The stepwise classification method is evaluated on three publicly available real data sets for which both clinical and gene expression data are available. These three data sets have also been analyzed in [6] using the integrative approach. The first data set is a breast cancer data set [14] containing 256 samples, 75 samples with recurrence and 181 without recurrence metastasis within 5 years. It consists of expression levels of 5537 genes. The available clinical variables are age (nominal), number of positive nodes (nominal), tumor size (binary), tumor grade (ordinal), estrogen receptor status (binary), surgery type (binary), chemotherapy treated status (binary) and hormonal therapy treated status (binary). The second data set is a central nervous system (CNS) tumor data set [15], which has been used to predict the response of childhood malignant embryonal tumors of the CNS to therapy. The data set is composed of 60 patients, of whom 21 died and 39 survived within 24 months. The gene expression data comprise 7128 genes, and the clinical features are Chang stage (nominal), sex (binary), age (nominal), chemo Cx (binary) and chemo VP (binary). We also evaluated our method on prostate cancer data [4]. Analysis results for this data set are given in additional file 1.
To compare the accuracy and efficiency of our stepwise approach with the fully integrative classifier in the MAclinical R package [6], we apply the random forest (RF) to the clinical data and, separately, the Plsrf-x (partial least squares dimension reduction plus RF) and the Plsrf-x-pv (pre-validated PLS dimension reduction plus RF) to the molecular data. In addition, we use a variety of well-known classification algorithms, e.g. penalized logistic regression, top scoring pair (TSP) [16] and support vector machine (SVM) (see additional file 1). We use the full molecular data without any pre-filtering. To achieve more stable results, prediction accuracy is estimated using 10 times repeated 10-fold cross-validation (CV). Moreover, 5-fold inner CV is applied to each training set for each classification algorithm for the purpose of parameter tuning.
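The evaluation scheme (10 times repeated 10-fold CV with a 5-fold inner CV on each training set) can be sketched as an index generator. This is a simplified illustration: the function names and the seed are hypothetical, and the folds are drawn by plain random shuffling without stratification by class, which may differ from the setup actually used.

```python
import random

def k_fold_indices(n, k, rng):
    """Split positions 0..n-1 into k roughly equal random folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]

def repeated_nested_cv(n, repeats=10, outer_k=10, inner_k=5, seed=0):
    """Yield (repeat, fold, train, test, inner_folds) index sets.

    inner_folds partitions the outer training set and is meant for
    parameter tuning only; test is never touched during tuning.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        folds = k_fold_indices(n, outer_k, rng)
        for f, test in enumerate(folds):
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            # Inner folds are drawn over positions within the training set
            # and mapped back to the original sample indices.
            inner = [
                [train[p] for p in fold]
                for fold in k_fold_indices(len(train), inner_k, rng)
            ]
            yield r, f, train, test, inner
```

Averaging the accuracy over all outer test folds gives the estimate; keeping the inner folds inside the outer training set prevents the tuning step from seeing the test samples.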
Our approach is expected to be most useful when the molecular data classifier has higher classification accuracy than the clinical one. We also present the scenario where the clinical data classifier performs better than the molecular data classifier and the scenario where both classifiers perform equally well, to illustrate how our approach adapts to the situation. Note that the relevant scenario for a given study depends on the data, but also on the pre-specified classification algorithms for both data types. Therefore, we illustrate the three scenarios by combinations of data sets and algorithms that lead to the given scenario. To arrive at a fair comparison between approaches (clinical, molecular, fully integrated or stepwise), we fix the classification algorithm used on the clinical data and the one used on the molecular data in each illustration.
The scenario where molecular data classifier performs better than clinical data
The scenario where clinical and the molecular data classifier perform equally well
In this scenario, the desired result is that the stepwise classifier produces an accuracy somewhat higher than those of both the molecular and the clinical data classifiers. In addition, the maximum accuracy should be attained without fully using the molecular data. Since the clinical and the molecular data classifiers have equal performance, passing samples to the molecular classifier may help less in terms of accuracy than in the previous scenario. We illustrate this scenario by using the breast cancer data again, but with different classification algorithms. In the first setting we use RF on the clinical data and RF-PLS with pre-validation (PV) on the molecular data. In the second setting we use GLM on the clinical data and the same molecular data classifier as in the first setting. Here, we did use the plsrf with PV and only compared it with the stepwise approach that includes the PV (the IntegrativeME does not apply in this setting). We prefer to use the PV in this comparison because, conceptually, it should be useful (see [6, 17]). We observe in Figures 5c and 5d that in both settings the stepwise classifier accuracy curves behave as expected.
The scenario where the clinical data classifier performs better than the molecular data classifier
The results of our approach when the clinical data classifier alone performs better than the molecular data classifier are presented in additional file 1. The optimal result from the stepwise approach in this scenario is high accuracy at the beginning and decreasing accuracy following the increase of the proportion of re-classified samples with molecular data. As expected, the accuracy from the stepwise approach reaches its top at the beginning. Accuracy is close to the one from the IntegrativeME, keeping in mind that the IntegrativeME is a fully integrative approach.
The efficiency gain of the stepwise approach is considerable when the molecular data classifier performs better than the clinical data classifier. Our approach adapts to the more powerful data type in an economically efficient manner. After applying the different classification algorithms, we find that when the performances of the two data types are close, the performance of the stepwise classifier is similar to that of the integrative classifier. If the two types of data have unequal performances, then the stepwise approach may outperform the integrative classifier in terms of accuracy. If the two data types are equally powerful, then a fully integrative approach may outperform our stepwise approach in terms of accuracy. This is not surprising, because the integrative approaches treat the two data types in a more symmetric way than we do. In the latter case, one should consider whether the gain in accuracy outweighs the loss in efficiency.
The accuracy plots suggest that in many cases it is sufficient to re-classify only a part of the samples. The actual choice of the percentage to be re-classified may depend on the estimated accuracies, but also on the available budget, which might restrict the maximum percentage of samples that can be prospectively re-classified. In such a case several scenarios are possible. Suppose the accuracies of the separate clinical and molecular classifiers are available. If the clinical data classifier clearly outperforms the molecular data classifier, simply use 0%, i.e. no re-classification. If the molecular data classifier is clearly better, use the maximum proportion allowed by the budget or, to save costs, the percentage at which the accuracy curve starts to flatten. If the two are competing, take the best-performing percentage that is lower than or equal to the maximum allowed by the available budget. We are aware that the reported accuracy of the latter procedure might be slightly optimistic (because the best percentage is chosen). However, the bias should be very modest, because it concerns a maximization over very positively correlated quantities: they only differ by the portion of data re-classified.
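The decision rule in the preceding paragraph can be written down as a small Python function. The name `choose_proportion`, the representation of the accuracy curve as a dictionary and, in particular, the `margin` that decides when one classifier is "clearly" better are assumptions for illustration; the text does not prescribe a numeric threshold.

```python
def choose_proportion(curve, budget_max, margin=0.02):
    """Pick the re-classification proportion under a budget constraint.

    curve      : dict mapping proportion (0.0 .. 1.0) to estimated accuracy;
                 must contain 0.0 (clinical only) and 1.0 (molecular only)
    budget_max : largest proportion the budget allows
    margin     : hypothetical threshold for "clearly better"
    """
    acc_clinical, acc_molecular = curve[0.0], curve[1.0]
    allowed = [p for p in sorted(curve) if p <= budget_max]
    if not allowed:
        raise ValueError("budget_max excludes every proportion on the curve")
    if acc_clinical - acc_molecular > margin:
        return 0.0            # clinical clearly better: no re-classification
    if acc_molecular - acc_clinical > margin:
        return max(allowed)   # molecular clearly better: use the full budget
    # Competing classifiers: best-performing allowed proportion; on ties the
    # smallest (cheapest) proportion is returned because allowed is sorted.
    return max(allowed, key=lambda p: curve[p])
```

The variant that stops where the accuracy curve starts to flatten is not covered here, since it would need an additional, equally arbitrary, flatness criterion.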
As an alternative to the presented stepwise classifier, we also considered the case where the second stage uses both molecular and clinical data. We found that adding clinical data in the second stage may worsen the performance of the stepwise classifier, as illustrated using the case corresponding to Figure 5a (the molecular data classifier performs better than the clinical one; see additional file 1). The reason for this might be that the stepwise approach passes a sample to the second stage when the sample has a relatively high RS. A high RS means that the prediction from clinical data for this particular sample is likely to be unreliable. So, adding clinical data to the molecular data in the second stage may do more harm than good. We also ran the analysis for the case where molecular and clinical data have equal performance (corresponding to Figure 5d), but no improvement was observed either (result not shown).
In this paper, we introduce a new classification method which takes advantage of the distinct predictive power of the comparatively cheap traditional clinical risk factors and of high-dimensional molecular data. Robust proximity calculation for mixed features and the neighborhood information in the two different data spaces are used to determine a group of samples that are likely to benefit most from measuring and using the comparatively expensive molecular covariates. Our approach not only utilizes the locality information in the clinical data space, but also tries to extract information from the molecular data space by indirect mapping (IM). We believe that the IM will be useful in the integration field, as it may be used to quantify the potential benefit of molecular data without actually measuring it. All calculation steps take place in the clinical data space (for the prospective samples); there is no need to measure the molecular characteristics of a new sample unless the sample's re-classification score falls in the user-defined range. We demonstrated that the stepwise approach may spare a considerable number of samples from molecular profiling without losing accuracy. Moreover, our method is able to decrease the variation from algorithm to algorithm (adaptivity and stabilization effect). This is a very useful property when one has no prior knowledge about the most suitable algorithm for the data at hand, which is the most common case.
- Efficient while keeping reasonable classification accuracy
- Very generally applicable. It is able to work with
  - any classification algorithm
  - any type of data
The latter implies that our stepwise approach is also applicable when a cheap, standardized molecular platform is available. In such a case, it may be of interest to either reverse the role of the clinical and (cheap) molecular classifier or use the cheap molecular data instead of or in addition to the clinical data in the first stage while keeping the expensive molecular data for the second stage.
One possible drawback of the proposed approach is that the indirect mapping is based on the correlation between clinical and molecular data. If the correlation is very weak, the indirect mapping does not provide much information. Future work includes the study of further indirect mapping schemes. Another possible extension of our method is a multi-step approach in which, after estimating the re-classification scores in the clinical data space, one is allowed to choose the most suitable data type for re-classification from multiple available molecular data types.
In short, we develop a flexible and powerful classifier which is based on a multi-objective (cost efficiency and accuracy) formulation of the classification problem. It utilizes the data in a more economical way than other integrative classifiers, while still achieving relatively high accuracy.
This study was performed within the framework of CTMM, the Center for Translational Molecular Medicine, DeCoDe (Decrease Colorectal Cancer Death) project (grant 03O-101). We would like to thank the anonymous reviewers for their helpful comments and suggestions.
1. Krag D, Weaver D, Ashikaga T: The sentinel node in breast cancer a multicenter validation study. The New England Journal of Medicine 1998, 339: 941–946. 10.1056/NEJM199810013391401
2. Futschik M, Sullivan M, Reeve A, Kasabov N: Prediction of clinical behaviour and treatment for cancers. Applied Bioinformatics 2003, 2: 53–58.
3. Nevins RJ, Huang SE, Dressman H: Towards integrated clinico-genomic models for personalized medicine: combining gene expression signatures and clinical factors in breast cancer outcomes prediction. Human Molecular Genetics 2003, 43: 745–751.
4. Stephenson JA, Smit A, Katta WM: Integration of gene expression profiling and clinical variables to predict prostate carcinoma recurrence after radical prostatectomy. Cancer 2005, 104: 290–298. 10.1002/cncr.21157
5. Dunkler D, Michiels S, Schemper M: Gene expression profiling: Does it add predictive accuracy to clinical characteristics in cancer prognosis? European Journal of Cancer 2006, 12: 153–157.
6. Boulesteix AL, Porzelius C, Daumer M: Microarray-based Classification and Clinical Predictors: on Combined Classifiers and Additional Predictive Value. Bioinformatics 2008, 24: 1698–1706. 10.1093/bioinformatics/btn262
7. Cao KA, Meugnier E, McLachlan JG: Integrative Mixture of Expert to Combined Clinical Factors and Gene Markers. Bioinformatics 2008, 24: 1698–1706. 10.1093/bioinformatics/btn262
8. Bovelstad M, Nygard S, Borgan O: Survival prediction from clinico-genomic models - a comparative study. BMC Bioinformatics 2009, 10: 413. 10.1186/1471-2105-10-413
9. Boulesteix AL, Sauerbrei W: Added predictive value of high-throughput molecular data to clinical data and its validation. Briefings in Bioinformatics 2011.
10. Huang ZX: Clustering Large Data Sets With Mixed Numeric and Categorical Values. The First Pacific-Asia Conference on Knowledge Discovery and Data Mining 1997, 16–27.
11. Qi Y, Klein-Seetharaman J, Bar-Joseph Z: Random Forest Similarity for Protein-Protein Interaction Prediction from Multiple Sources. Pacific Symposium on Biocomputing 2005, 531–542.
12. Breiman L: Random Forests. Machine Learning 2001, 45: 5–32. 10.1023/A:1010933404324
13. Yong Z, Yupu Y, Liang Z: Pseudo Nearest Neighbor Rule for Pattern Classification. Expert System with Applications 2009, 36: 3587–3595. 10.1016/j.eswa.2008.02.003
14. van de Vijver M, He Y, van't Veer L: A gene-expression signature as a predictor of survival in breast cancer. The New England Journal of Medicine 2002, 347: 1999–2009. 10.1056/NEJMoa021967
15. Pomeroy SL, Tamayo P, Gaasenbeek M: Prediction of Central Nervous System Embryonal Tumour Outcome Based on Gene Expression. Nature 2002, 415: 436–442. 10.1038/415436a
16. Tan A, Daniel QN, Xu L, LW R, Geman D: Simple Decision Rules for Classifying Human Cancers from Gene Expression Profiles. Bioinformatics 2005, 21: 3896–3904. 10.1093/bioinformatics/bti631
17. Tibshirani JR, Efron B: Pre-validation and inference in microarrays. Statistical Applications in Genetics and Molecular Biology 2002, 1.
18. Boulesteix AL, Strobl C: Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction. BMC Medical Research Methodology 2009, 9: 85. 10.1186/1471-2288-9-85
19. Jelizarow M, Guillemot V, Tenenhaus A, Strimmer K, Boulesteix AL: Over-optimism In Bioinformatics: An Illustration. Bioinformatics 2010, 16: 1990–1998.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.