- Research Article
- Open Access
Subtype prediction in pediatric acute myeloid leukemia: classification using differential network rank conservation revisited
© Obulkasim et al. 2015
- Received: 11 October 2014
- Accepted: 11 September 2015
- Published: 23 September 2015
One of the most important application spectrums of transcriptomic data is cancer phenotype classification. Many characteristics of transcriptomic data, such as redundant features and technical artifacts, make over-fitting commonplace. Promising classification results often fail to generalize across datasets with different sources, platforms, or preprocessing. Recently a novel differential network rank conservation (DIRAC) algorithm to characterize cancer phenotypes using transcriptomic data. DIRAC is a member of a family of algorithms that have shown useful for disease classification based on the relative expression of genes. Combining the robustness of this family’s simple decision rules with known biological relationships, this systems approach identifies interpretable, yet highly discriminate networks. While DIRAC has been briefly employed for several classification problems in the original paper, the potentials of DIRAC in cancer phenotype classification, and especially robustness against artifacts in transcriptomic data have not been fully characterized yet.
In this study we thoroughly investigate the potentials of DIRAC by applying it to multiple datasets, and examine the variations in classification performances when datasets are (i) treated and untreated for batch effect; (ii) preprocessed with different techniques. We also propose the first DIRAC-based classifier to integrate multiple networks. We show that the DIRAC-based classifier is very robust in the examined scenarios. To our surprise, the trained DIRAC-based classifier even translated well to a dataset with different biological characteristics in the presence of substantial batch effects that, as shown here, plagued the standard expression value based classifier. In addition, the DIRAC-based classifier, because of the integrated biological information, also suggests pathways to target in specific subtypes, which may enhance the establishment of personalized therapy in diseases such as pediatric AML. In order to better comprehend the prediction power of the DIRAC-based classifier in general, we also performed classifications using publicly available datasets from breast and lung cancer. Furthermore, multiple well-known classification algorithms were utilized to create an ideal test bed for comparing the DIRAC-based classifier with the standard gene expression value based classifier. We observed that the DIRAC-based classifier greatly outperforms its rival.
Based on our experiments with multiple datasets, we propose that DIRAC is a promising solution to the lack of generalizability in classification efforts that uses transcriptomic data. We believe that superior performances presented in this study may motivate other to initiate a new aline of research to explore the untapped power of DIRAC in a broad range of cancer types.
- Relative expression
- Batch effect
- High-dimensional data
High-throughput genomic technologies have dramatically expanded the breadth of biological information available for the analysis and characterization of disease. One of the most important application fields of transcriptomic data is classification, where a multitude of promising results have been reported in classification literatures with near perfect classification accuracy for given datasets. Yet, despite this progress, the reliability of these results has been proven illusory, as the seemingly promising results often fail in reproducing, inextricable yet largely unachieved goal, in new datasets . This is mainly due to the fact that a single transcriptome contains tens of thousands of features, while a limited number of samples are available for analysis. A large number of redundant, irrelevant features and technical noise added during the data generation process make classifiers susceptible to over-fitting . In some cases, even a small change in data preprocessing can lead to substantial changes in downstream analyses . The obvious and ideal remedy of increasing the sample size is often infeasible, leaving researchers in a quandary about how to increase the models’ reliability. Thus, the search for a robust classifier in clinical application becomes a notoriously difficult endeavour.
In recent years, there has been a burgeoning interest in system methods that incorporate network information into classification algorithms for biomarker discovery in personalized medicine [4–7]. The general hope is that the application of hard-earned biological domain knowledge can improve the typical low reproducibility of the biomarkers. A promising example, the topic of this article, is the Differential Rank Conservation (DIRAC) algorithm proposed by Eddy et al. (2010) .
Here, our intention is not to re-introduce DIRAC, but rather to explore some of the undocumented characteristics of a DIRAC classifier in common genomic classification scenarios. We aim to broaden researchers’ familiarity with the pros and cons of DIRAC, especially in identifying robust cancer subtype specific network signatures that are robust against unwanted variations stemming from differences in preprocessing, platforms, batches, and other confounding variables.
DIRAC is a member of a family of algorithms that work with the relative expression of genes [8, 9]. Since DIRAC uses the relative rather the absolute expression of genes for disease classification, algorithms fall in the latter category populated the journal landscape, it has potentials to be robust against ancillary sources of variation in high-dimensional molecular data.
It tried to capture the biological system as a whole instead of reporting a list of individual parts as the conventional classifier does.
Focus on genesets (pathways) as opposed to individual genes. Hence, valuable gene-gene interaction could be harvested.
Resulting genesets (pathways) from the DIRAC based classifier is more intuitive to biologist.
We find it necessary to point out that the DIRAC is not a classification algorithm. It falls within the realm of data transformation, i.e. project samples that are originally in the gene space to a much lower dimensional pathway space. The hope is that the difference between groups under investigation will be pronounced in the new space. The transformed data can be combined with any exiting classification algorithm.
We first perform the pathway and GEV signatures identification for each cytogenetic subtype. Then, the signatures are used in later stage to evaluate their robustness against different preprocessing techniques and reproducibility in independent datasets in presence of batch effect, respectively. Performances of the two classifiers are measured by calculating their sensitivity, specificity and F-score. The F-score, that equals 2(s e n s i t i v i t y×s p e c i f i c i t y)/(s e n s i t i v i t y+s p e c i f i c i t y), is a harmonic mean of the former two. It ranges from 0 to 1, with one being the perfect classification.
Signature identification for cytogenetic subtypes using the pediatric 1 AML dataset.
MLL: C10orf140, DEXI, HOXA7.
t(8;21): RUNX1T1, POU4F1, CACNA2D2.
inv(16): MN1, TM4SF1, MYH11, CLIP3, SPARC, AK5, PTRF, FAM171A1, AK022033, LOC100506388.
t(15;17): STAB1, FGF13, LGALS12, PTGDS.
NUP98.NSD1: VENTX, PRDM16, FAM92A1.
CEPBA.dm: LRRC28, CNRIP1, ST8SIA1, MSS51, SLC14A1, LGSN.
NPM1: HOXB-AS3, HOXB3, SMC4, SEPP1, LTBP1, CLIC2, HOXB4, TWISTNB, HIST1H2BC, SIAE, PIEZO2, HLF, ELL2, LOC100507520, HOXB2, FOXC1, HOXB6, HOXB5, EMR1, SNCAIP, RPL39L, USP44, BEX1, TTC27, PTPRC, HENMT1, AK027199, COL4A5, NAP1L5, TIAM1, NPDC1, CLEC11A, SCD5, CCL1, FTO, AK093529, ENSG00000184551, CDKN1B, FAM105A, PHLDA1, HOXA-AS5, LOC100506591, GMDS, TOM1L1, IL12A, DMXL2, SDPR, FOXF1.
The rank matching scores (RMS) (see Fig. 1) of the signature pathways generated by the DIRAC-based classifier
These pathway signatures may be options for understanding specific disease mechanisms and provide a key for the design of new treatment. For example, one may seek network aware intervention to adjust the signature pathways in each subtype to its normal non-leukemic cell behavior and, consequently, to control clinical presentation.
Assessing the robustness of the DIRAC-based and GEV-based classifiers against different preprocessing techniques
Multiple preprocessing methods for gene expression data have been proposed . However, none of them are reported to be uniformly better than the other. Schmidt et al. (2011)  reported that findings in gene expression data analysis significantly depend on the preprocessing method used. Here, we examined the effect of a mundane but indispensable task of ‘data normalization’ on classification. We compared the robustness of the two competing approaches when the data was preprocessed with different methods. Specifically, a classifier was trained with the pathway and gene expression signatures obtained from the pediatric 1 dataset and were used to predict subtype labels of the same dataset that has been preprocessed using six well-known methods. This is meant to be representative of a situation where a researcher intends to combine or compare transcriptomic profiles from two studies where the preprocessing methods were different and the original unprocessed data is not available. Ideally, one would have the original data and use the same preprocessing methods on those data sources, but this is not always an option. The six preprocessing methods are: M1) CEL file transformed to raw expression data using Robust multi-array average expression measure (RMA); M2) Raw data obtained with RMA and normalized with quantile normalization; M3) CEL file transformed to raw expression data using Affymetrix’s MAS 5.0 expression measure (MAS5); M4) Raw data obtained with MAS5 and normalized with scale normalization; M5) CEL file transformed to raw expression data using Robust multi-array average expression measure with help of probe sequence (GCRMA); M6) Raw data obtained with GCRMA and normalized with quantile normalization.
Assessing the robustness of the DIRAC-based and GEV-based classifiers against batch effect
Genomic data often contain batch effects, i.e. differences in sample sources and other technical variations. Many algorithms have been proposed to remove these artifacts, but this (often) comes along with the potential cost of removing between-group biological heterogeneity and consequently salient genomic signatures . Note that, in this work we refer to the batch effect as differences between datasets that have: generated in different platforms or different labs, included in two different studies.
In order to better comprehend the prediction power of the DIRAC-based classifier in general, we conducted performance comparison using publicly available datasets from breast and lung cancer (see Additional file 1). Also, multiple well-known classification algorithms were utilized to create an ideal test bed for comparing the two competing approaches. We obtained two indenpedent breast cancer datasets with substantial batch-effect. Similar to the classification settings mentioned above, we used one of them used for signature discovery (discovery set) and the second one for testing the signature reproducibilty (test set). We observe that the DIRAC classifier performed similarly in the discovery set, but greatly outperformed the standard GEV-based classifier in test set without batch effect correction. These results corroborate our findings in the AML datasets.
Where does the power of the DIRAC-based classifier come from?
Clearly, the DIRAC-based classifier is robust against most of the major ancillary sources of variation in the data. We attribute this to the robustness of relative expression compared to absolute expression values used  and the ability to tap the orchestrated behaviour of genes within transcriptome using the domain knowledge.
According to the no-free-lunch theorem  no classification algorithm is uniformly better than the others. Hence, we do not claim that DIRAC is a panacea for all types of genomic classification problems. As we showed in this study, there are cases in which the GEV-based classifier outperforms the DIRAC-based classifier. The focal point is, however, placed on the robustness of the latter. We recommend the DIRAC-based classifier when the choices of preprocessing and batch effect correction methods are not obvious, either by limited statistical resources or due to limited data availability. It is, for example, often the case that raw data (e.g. CEL files) are not publicly available. Thus, albeit numerous databases exist to store tremendous amounts of genomic data, it is surprisingly difficult to find a dataset that has been preprocessed in exactly the same way as the one from which new findings (e.g. subtype signatures) were discovered and are in need to be validated. As we demonstrated in this study, these issues, to a large extent, are ameliorated when the DIRAC-based classifier is used. To our surprise, the trained DIRAC-based classifier even translated well to a dataset with different biological characteristics (e.g. adult AML) in the presence of substantial batch effects that, as shown here, plagued the standard GEV-based classifier. In addition, the DIRAC-based classifier, because of the integrated biological information, also suggests pathways to target in specific subtypes, which may enhance the establishment of personalized therapy in diseases such as pediatric AML.
In this study, we investigate the performance of the DIRAC-based classifier using multiple AML datasets. We believe that its superior performances presented in this study may motivate other to explore the untapped power of DIRAC in different types of cancer or may even open a new line of application.
In summary, through this study we have demonstrated the robustness of DIRAC in classification, which has previously been undocumented and underestimated. While this is not the first multi-class DIRAC method proposed , we have demonstrated the first multi-network classification method using DIRAC. We believe that the robustness, simplicity and biological interpretability of the DIRAC-based classifier make it not only an attractive alternative to existing algorithms, but often a preferred choice.
In this study we investigate the potentials of DIRAC in multiple scenarios frequently appear in analyzing high-dimensional transcriptomic data. The number of conceivable applications of DIRAC is possibly large, hence we do not claim to provide exhaustive list. Instead, the focus is on classification problem that has been the subject of bioinformatics research for long time. Reported experimental results from multiple real-world datasets serve two ends: 1) to illustrate the benefits of the DIRAC-based classifier, 2) to clarify what makes the DIRAC-based classifier superior than exiting approaches. In the following sections we describe the datasets that are considered in this study, and the experimental setups, i.e. how we compare performance of the DIRAC-based classifier with exiting approaches in an unbiased way.
Gene expression datasets
Summary of AML datasets used in this study
Cytogenetic and molecular characteristics of the AML datasets included in this study
Pediatric 1 AML
Pediatric 2 AML
In this study, 217 manually curated (comprised of 1279 unique genes) BioCarta pathways from the Molecular Signatures Database (MSigDB version 4.0, updated May 31, 2013) were used to construct the DIRAC-based classifier. For each dataset, genes were grouped according to their presence in these pathways. To reliably represent the pathways, only those pathways that have at least 30 % of associated genes presented in a given dataset were retained. The minimum fraction of genes that are present both in individual pathways and gene expression data is: 70 % in the pediatric 1 and adult AML datasets, 57 % in the pediatric 2 AML dataset. Note that, for the DIRAC-based classifier only those genes that belong to the retained pathways were used. For the GEV-based classifier the full list of genes, not just the pathway mapped subsets, was used. Note that, the number of genes used for the GEV-based classifier is far more exceeds the number of pathways used for the DIRAC-based classifier. This means, the former has a larger search space to find the discriminative features compared to the latter.
Classifier construction and signature identification
We used the pediatric 1 AML dataset (Table 2) to generate gene (pathway) signatures specific to each cytogenetic subtype (i.e., via 1-vs.-Rest multiclass classification scheme) using Support Vector Machines (SVM). Our strategy to extract both signatures is similar to the one used in Balgobind et al. (2011) . Specifically, the dataset is randomly divided, subject to stratification that maintains the proportionality of the subtypes, into a discovery set (2/3 samples) and validation set (1/3 samples). The discovery set is used to select the subtype specific signatures, and the validation set serves as independent validation cohort. Importantly, the validation set is not used at any point in classifier training.
The selected features (50 genes/pathways) were processed in following ways to obtain reduced set of GEV and pathway signatures. For each subtype, the GEV (pathways) signatures were ranked (from best to worst) according to their association with the subtype via the global test . Subsequently, an SVM classifier was trained (e.g. MLL vs. Rest) using the 50 genes (pathways) and predictive performance on the inner-test set was gauged. The size of the signature list was then reduced from 50 by removing the gene located at the bottom of the list of ranked genes (pathways), and the classification was re-run. This process was repeated until there were no more than 3 genes (pathways) remaining and mean sensitivity (across subtypes) was recorded. Finally, the list that corresponds to the median sensitivity over the 3 ICV train-test procedures is chosen.
At the end of 100 ICV iterations, a signature list of smallest size that renders highest prediction sensitivity was determined. Among the three optimal lists generated via 3-fold CV using the discovery set, the one with the median prediction sensitivity was taken as the optimal signature list for one OCV iteration. The final classifier was trained using the signature list that produced the highest prediction sensitivity in 100 OCV iterations and subsequently used to predict the labels of the validation set. Balgobind et al. (2011)  argued that this double-loop CV avoids over-fitting and leads to stable signatures with highest prediction accuracy.
Ethics approval is not required for this study.
The authors thank Ruud Delwel and Remco Hoogenboezem (Dept of Hematology, Rotterdam, The Netherlands) for kindly providing the Adult AML dataset. We are grateful for the constructive comments from John Earls and James Eddy (Institute for Systems Biology, Seattle, WA, USA). We thank Valerie de Haas (Dutch Childhood Oncology Group (DCOG), The Hague, the Netherlands), Jan Trka (Pediatric Hematology/Oncology, 2nd Medical School, Charles University, Prague, Czech Republic), and André Baruchel (Hematology, Hopital Saint- Louis, Paris, France) for providing patient material and clinical data. We thank Prof. Mark A. van de Wiel (Dept. Epidemiology & Biostatistics, VU university medical center, Amsterdam, The Netherlands) for insightful comments on the manuscript. This work was funded by KinderenKankervrij (KIKA) project 109.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Ein-Dor L, Zuk O, Domany E. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. PNAS. 2006; 103:5923–928.View ArticlePubMedPubMed CentralGoogle Scholar
- Jelizarow M, Guillemot V, Tenenhaus A, Strimmer K, Boulesteix AL. Over optimism in bioinformatics: an illustration. Bioinformatics. 2010; 26:1990–8.View ArticlePubMedGoogle Scholar
- Schmidt MT, Handschuh L, Zyprych J, Szabelska A, Olejnik-Schmidt AK, Siatkowski I, et al.Impact of DNA microarray data transformation on gene expression analysis - comparison of two normalisation methods. Acta Biochim Pol. 2011; 58:573–80.PubMedGoogle Scholar
- Eddy JA, Hood L, Price ND, Geman D. Identifying tightly regulated and variably expressed networks by differential rank conservation (DIRAC). PLoS Comput Biol. 2010; 6:5923–928.View ArticleGoogle Scholar
- Ben-Hamo R, Gidoni M, Efroni S. PhenoNet: Identification of key networks associated with disease phenotype. Bioinformatics. 2014; 30:2399–2405.View ArticlePubMedGoogle Scholar
- Hofree M, Shen JP, Carter H, Gross A, Ideker T. Network-based stratification of tumor mutations. Nat Methods. 2013; 10:1108–15.View ArticlePubMedPubMed CentralGoogle Scholar
- Nibbe RK, Koyutürk M, Chance MR. An integrative -omics approach to identify functional sub-networks in human colorectal cancer. PLoS Comput Biol. 2010;6.Google Scholar
- Tan AC, Naiman DQ, Xu L, Winslow RL, Geman D. Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics. 2005; 21:3896–904.View ArticlePubMedPubMed CentralGoogle Scholar
- Geman D, dAvignon C, Naiman DQ, Winslow RL. Classifying gene expression profiles from pairwise mRNA comparisons. Stat Appl Geneti Mol Biol. 2011;3.Google Scholar
- Balgobind BV, Raimondi SC, Harbott J, Zimmermann M, Alonzo TA, Auvrignon A, et al. Novel prognostic subgroups in childhood 11q23/MLL-rearranged acute myeloid leukemia: results of an international retrospective study. Blood. 2009; 114:2489–496.View ArticlePubMedPubMed CentralGoogle Scholar
- Quackenbush J. Microarray data normalization and transformation. Nat Genet. 2002; 32:496–501.View ArticlePubMedGoogle Scholar
- Parker HS, Leek JT, Favorov AV, Considine M, Xia X, Chavan S, et al.Preserving biological heterogeneity with a permuted surrogate variable analysis for genomics batch correction. Bioinformatics. 2014; 30:2757–2763.View ArticlePubMedPubMed CentralGoogle Scholar
- Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007; 3:1724–1735.View ArticlePubMedGoogle Scholar
- Afsari B, Geman D, Fertig EJ. Learning Dysregulated pathways in cancers from differential variability analysis. in press.Google Scholar
- Patil P, Bachant-Winner PO, Haibe-Kains B, Leek JT. Avoiding test set bias with rank-based prediction. bioRxiv. 2014.Google Scholar
- Walpert DH. The lack of a priori distinctions between learning algorithms. Neural Comput. 1996; 8:1341–90.View ArticleGoogle Scholar
- Wang C, Funk CC, Eddy JA, Price ND. Transcriptional analysis of aggressiveness and heterogeneity across grades of astrocytomas. PLoS ONE. 2013;8.Google Scholar
- Balgobind BV, Van den Heuvel-Eibrink MM, De Menezes RX, Reinhardt D, Hollink IH, Arentsen-Peters ST, et al.Evaluation of gene expression signatures predictive of cytogenetic and molecular subtypes of pediatric acute myeloid leukemia. Haematologica. 2011; 96:221–3.View ArticlePubMedGoogle Scholar
- Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004; 3:1–25.Google Scholar
- Goeman JJ, Van De Geer SA, Van Houwelingen HC. Testing against a high-dimensional alternative. J R Stat Soc, Series B. 2006; 68:477–93.View ArticleGoogle Scholar
- Radtke I, Mullighan CG, Ishii M, Su X, Cheng J, Ma J, et al.Genomic analysis reveals few genetic alterations in pediatric acute myeloid leukemia. PNAS. 2009; 106:12944–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Verhaak RG, Wouters BJ, Erpelinck CA, Abbas S, Beverloo HB, Lugthart S, et al.Prediction of molecular subtypes in acute myeloid leukemia based on gene expression profiling. Haematologica. 2009; 94:131–4.View ArticlePubMedGoogle Scholar