Subtype prediction in pediatric acute myeloid leukemia: classification using differential network rank conservation revisited

Background One of the most important application spectrums of transcriptomic data is cancer phenotype classification. Many characteristics of transcriptomic data, such as redundant features and technical artifacts, make over-fitting commonplace. Promising classification results often fail to generalize across datasets with different sources, platforms, or preprocessing. Recently a novel differential network rank conservation (DIRAC) algorithm to characterize cancer phenotypes using transcriptomic data. DIRAC is a member of a family of algorithms that have shown useful for disease classification based on the relative expression of genes. Combining the robustness of this family’s simple decision rules with known biological relationships, this systems approach identifies interpretable, yet highly discriminate networks. While DIRAC has been briefly employed for several classification problems in the original paper, the potentials of DIRAC in cancer phenotype classification, and especially robustness against artifacts in transcriptomic data have not been fully characterized yet. Results In this study we thoroughly investigate the potentials of DIRAC by applying it to multiple datasets, and examine the variations in classification performances when datasets are (i) treated and untreated for batch effect; (ii) preprocessed with different techniques. We also propose the first DIRAC-based classifier to integrate multiple networks. We show that the DIRAC-based classifier is very robust in the examined scenarios. To our surprise, the trained DIRAC-based classifier even translated well to a dataset with different biological characteristics in the presence of substantial batch effects that, as shown here, plagued the standard expression value based classifier. In addition, the DIRAC-based classifier, because of the integrated biological information, also suggests pathways to target in specific subtypes, which may enhance the establishment of personalized therapy in diseases such as pediatric AML. In order to better comprehend the prediction power of the DIRAC-based classifier in general, we also performed classifications using publicly available datasets from breast and lung cancer. Furthermore, multiple well-known classification algorithms were utilized to create an ideal test bed for comparing the DIRAC-based classifier with the standard gene expression value based classifier. We observed that the DIRAC-based classifier greatly outperforms its rival. Conclusions Based on our experiments with multiple datasets, we propose that DIRAC is a promising solution to the lack of generalizability in classification efforts that uses transcriptomic data. We believe that superior performances presented in this study may motivate other to initiate a new aline of research to explore the untapped power of DIRAC in a broad range of cancer types. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0737-3) contains supplementary material, which is available to authorized users.

1 Details of the leukemia datasets used in the paper and pre-processing procedures 1.1 The pediatric 1 AML dataset • Cancer type: Pediatric acute myeloid leukemia (AML).
• Preprocessing: CEL files were transformed to probe intensity data using the rma function in the R-package "affy". Probe sets were underwent filtering according to the following criteria: a probe either interrogates only one gene or specificity of the probe (available in the Weizmann Institute Affymetrix probe-sets annotation database, http://genecards.weizmann.ac.il/geneannot/index.shtml) is at least 0.8. This led to a data of 39634 probe sets. Then, the reduced probe level data were converted into a gene level data by collapsing those probes that interrogate the same gene to their mean. The gene level data that comprised of 22564 genes were normalized using the 'vsn' function in the R-package "VSN".
• Preprocessing: CEL file were transformed to probe intensity data using Microarray Suite version 5.0 (MAS 5.0) in the R-package "affy" with Affymetrix default analysis settings and subsequently 'vsn' normalization was applied.
• Preprocessing: CEL file were transformed to probe intensity data using Microarray Suite version 5.0 (MAS 5.0) in the R-package "affy" with Affymetrix default analysis settings and subsequently 'van' normalization was applied.
2 Visualization of the batch effect Figure S1: Visualization of the batch effect between the pediatric 1 and pediatric 2 AML datasets. The combined dataset is projected onto the space defined by the first two principal components before (left) and after (right) undergoing a batch effect correction. Figure S2: Visualization of the batch effect between the pediatric 1 and adult AML datasets. The combined dataset is projected onto the space defined by the first two principal components before (left) and after (right) undergoing a batch effect correction.
3 Performance comparison of the DIRAC-based and the GEV-based approaches when diverse classification algorithms are used To support our argument that our decision to use SVM as a classifier is unbiased, in this section, we present results from the two competing approaches when three different classification algorithms, i.e. Random Forest (RF), Logistic Regression (LR) and shrunken centroid classifier (PAM), are used. For this illustration we used the pediatric 1 AML dataset. The same as in the paper, one third of the samples are used for validation. The double-loop-cross-validation is applied on the remaining samples to obtain robust signatures. Finally, the signatures are applied to the validation set and results are reported. It can be seen from Figure S3-S5 that, variations from algorithm to algorithm are very modest. Most importantly, differences in performances of two approaches are quite stable across the algorithms. Hence, we conclude that superior performances of the DIRACbased approach over its competitor are not due to the SVM classifier we used.

Classification performance comparison using nonleukemic datasets
To demonstrate the superiority of the DIRAC-based classifier over the standard GEVbased classifier in general, we conducted a series of performance comparisons using multiple independent non-leukemic datasets. We collected publicly available datasets from breast and lung cancer (see below). We refer to the original publications and the GEO database for patient and sample characteristics. The performances of the two approaches are compared using SVM, RF, LR and PAM algorithms, separately. Note that, in the breast cancer setting, besides the dataset that is used for signatures discovery, there is an independent dataset to test the reproducibility of the signatures obtained by the two approaches. For the sake of clarity, we termed the first dataset as a Discovery set, the second one as a Test set. The same as in the paper, in the discovery set, one third of the samples are left out for validation purpose. The double-loop-cross-validation is applied on the remaining samples to obtain robust signatures. Finally, the signatures are applied to the Test set (if available) and results are reported. In the following sections we detail the datasets and performances of the two approaches on them.

The breast cancer dataset
Two publicly available breast cancer datasets are used for performances comparison. Two binary classifiers are constructed to discriminate the cases with oestrogen receptor positive from the negative ones. First, a dataset from Wang et al (2012) was used as a Discovery set to extract gene and pathways signatures, separately. Then, two classifiers were trained using the signatures (gene and pathways signatures, separately) on this dataset. Subsequently, the trained classifiers were applied to an independent dataset (Test set) from van de Vijver et al. (2002) and van't Veer et al. (2002). Note that, there is strong batch effect between the Discovery and Test sets ( Figure S7). Similar to the experimental settings given in the paper, we evaluated the reproducibility of the signatures by applying them to the Test set, with and without a batch effect correction. The batch effect correction was performed using the 'ComBat' function in the R-package sva.
As we observe from Figure S7, the two competing approaches performed similarly across different classification algorithms. This may further strengthen our argument that our decision to use SVM as a primary classification tool is free of selection bias. While the DIRAC-based classifier kept similar promising performances, the GEV-based classifier pushed all samples into one class (F-score = 0) in the Test set with batch effect as shown in Figure S8 (left). While improved performances of the GEV-based classifier are observed ( Figure S8 right), accuracies from the DIRAC-based classifier decreased substantially after batch effect correction. This example support our claim in the paper that batch effect correction sometimes do harm than good. In this particular example the batch effect correction procedure removed some important biological variations specific to each group.
The dataset used for signature discovery: • Cancer type: Breast cancer.
• Preprocessing: we downloaded the normalized dataset that has measurements on 22283 probes. By collapsing into the gene level, dataset size reduced to 13212 genes. Subsequently, the dataset was underwent log2 transformation. No missing values were detected. To match with the size of the Test set, we kept only those genes that existed in the Test set. The final dataset contains 9786 genes.
The dataset used for testing the reproducibility of signatures from the Discovery set: • Cancer type: Breast cancer.
• Preprocessing: We obtained the normalized dataset that has measurements on 24481 probes from the R-package "breastCancerNKI". By collapsing into the gene level, dataset size reduced to 13119 genes. Missing values were imputed using the 'impute.knn' function in the R-package "impute". To match with the size of the Discovery set, we kept only those genes that existed in the Discovery set. The final dataset contains 9786 genes.  Figure S8: Performances of the gene and pathway signatures generated by the two approaches on the Test set before (left) and after (right) undergoing a batch effect correction. The x-axis shows the classification algorithms used, the y-axis shows the accuracy quantified as F-score.

The lung cancer dataset
Due to difficulty in finding suitable a dataset that serves as a Test set, here, we only reported results on the discovery set. In this dataset, a binary classifier is trained to discriminate the tumor samples from the normal controls. Since no Test set is present, the results shown here only meant for two ends: first, to support our argument that the DIRAC-based classifier is invariant to the type of classification algorithms used. As shown in Figure S9, the DIRAC-based classifier performed equally across diverse classification algorithms in this dataset as well; Second, to show that a different classifier performance evaluation criteria, other than the F-score we used in this study, show similar result. Figure S10 shows performances of the two approaches when AUC (area under the curve) was used. Clearly, a different accuracy criteria would not change the fact that the DIRACbased classifier outperforms the GEV-based one.
• Preprocessing: We downloaded normalized probe level data with 54675 probes GEO. We collapsed the probes level data into gene level, by taking the mean of multiple probes that interrogated a single gene. By collapsing into the gene level, dataset size reduced to 21050 genes. No missing values were detected.