BCDForest: a boosting cascade deep forest model towards the classification of cancer subtypes based on gene expression data
- Yang Guo1,
- Shuhui Liu1,
- Zhanhuai Li1 and
- Xuequn Shang1
https://doi.org/10.1186/s12859-018-2095-4
© The Author(s). 2018
Published: 11 April 2018
Abstract
Background
The classification of cancer subtypes is of great importance to cancer diagnosis and therapy. Many supervised learning approaches have been applied to cancer subtype classification in the past few years, especially deep learning based approaches. Recently, the deep forest model was proposed as an alternative to deep neural networks; it learns hyper-level representations using cascaded ensembles of decision trees and has been shown to achieve performance competitive with, or even better than, deep neural networks in some settings. However, the standard deep forest model may face overfitting and ensemble-diversity challenges when dealing with the small sample sizes and high dimensionality of biology data.
Results
In this paper, we propose a deep learning model, called BCDForest, to address cancer subtype classification on small-scale biology datasets; it can be viewed as a modification of the standard deep forest model. BCDForest differs from the standard deep forest model in two main respects. First, we propose a multi-class-grained scanning method that trains multiple binary classifiers to encourage ensemble diversity, while taking the fitting quality of each classifier into account during representation learning. Second, we propose a boosting strategy that emphasizes more important features in the cascade forests, propagating the benefits of discriminative features across cascade layers to improve classification performance. Systematic comparison experiments on both microarray and RNA-Seq gene expression datasets demonstrate that our method consistently outperforms state-of-the-art methods in cancer subtype classification.
Conclusions
The multi-class-grained scanning and boosting strategies in our model provide an effective way to ease the overfitting challenge and improve the robustness of the deep forest model on small-scale data. Our model offers a useful approach to the classification of cancer subtypes by deep learning on high-dimensional, small-scale biology data.
Background
It is well known that cancer is a heterogeneous disease with diverse pathogeneses [1, 2]. Most cancers have multiple subtypes with distinct molecular signatures, and these subtypes are likely to have different prognoses and treatment responses [3–5]. Recently, advances in high-throughput profiling technologies have produced huge volumes of genomic data and provided unprecedented opportunities to investigate the genomic and transcriptomic changes associated with cancers, making it possible to characterize cancer subtypes at the molecular level. In the past few years, various types of large-scale genomic data have been used for cancer prognosis integrating gene function studies [6–8] and subtype outcome prediction [1, 4, 9–11], and numerous cancer subtype classification methods have been proposed [5, 12–14]. However, owing to the complexity of cancer and the limited prior knowledge of cancer subtypes [15–17], the overall performance of most current methods still needs to be improved. In general, intuitive approaches to cancer subtype classification use conventional classification algorithms to learn prediction models from various types of genomic data and prior subtype knowledge, such as gene expression, DNA methylation, or gene mutations [18–21]. However, three challenges limit the application of conventional learning models, such as SVMs and random forests, to cancer subtype classification on biology data. First, the small sample sizes and high dimensionality of biology data increase the risk of overfitting during training. Second, class imbalance is very common in biology data, which aggravates the difficulty of model learning. Third, the large sequencing bias of biology data may weaken the reliability of model estimation. Although many modified models have been proposed to ease these challenges in the past few years [5, 22], the options available for small-scale biology data are still limited, and more accurate and robust methods for cancer subtype classification remain to be developed.
In recent years, deep neural networks (DNNs) have achieved great success in various applications, especially visual and speech recognition [23, 24]. Inspired by this success, many methods have been proposed to predict cancer subtypes using variants of deep learning [25, 26]. However, a few deficiencies may limit the application of deep neural networks to cancer genomic data. On the one hand, deep neural networks are complicated models that usually require huge amounts of training data [23], and sufficiently large sample sizes are not yet available for most cancer genomic datasets. On the other hand, deep neural networks have many hyper-parameters, and model performance depends heavily on parameter-tuning skill. This makes it hard to obtain the anticipated classification performance with deep neural networks in practice, especially on small-scale biology datasets.
Fig. 1. Illustration of the cascade forest structure. Each level of the cascade consists of two random forests (black) and two completely-random forests (red). Supposing there are three classes to predict, each forest outputs a three-dimensional class vector, which is then concatenated with the original input as the representation for the next level [23]
In this paper, we propose the BCDForest (Boosting Cascade Deep Forest) model for cancer subtype classification. The main idea of BCDForest is both to encourage ensemble diversity and to account for the fitting quality of each random forest in multi-grained scanning, thereby producing more informative representations of the input. We also propose a simple strategy to boost the weights of important features in the cascade of random forests, improving the overall performance of the cascade ensemble. Our contributions can be summarized as follows. 1) We adopt a multi-class-grained scanning strategy that encourages ensemble diversity by training on different class-specific sub-datasets; the different sub-datasets are used to construct different types of forests. 2) We account for the fitting quality of each forest during sliding-window feature representation learning: the out-of-bag error [27, 28] of each forest is used to assign it a confidence weight that corrects its output predictions. 3) We propose a variance-based strategy to boost important features at each layer of the cascade forest.
Applying BCDForest to three public microarray gene expression datasets and six RNA-Seq gene expression datasets from TCGA [29, 30], we find that BCDForest predicts better than conventional methods and gcForest. In particular, BCDForest achieves higher prediction accuracy than the standard gcForest on small-scale (small sample size) and class-imbalanced datasets, which is crucial for supervised learning on cancer genomic data.
Methods
Boosting cascade forest
The cascade forest model provides an alternative to deep neural networks (DNNs) for learning high-level representations at low cost. Instead of learning hidden variables through the complex forward and backward propagation of DNNs, the cascade forest learns class-distribution features directly by assembling ensembles of decision-tree forests under the supervision of the input labels. This layer-wise supervised learning strategy makes the cascade forest easy to train, and ensembling the forests is expected to yield more precise class-distribution features, as random forests are known to be powerful classifiers in most applications. However, the standard deep cascade forest does not consider feature importance in the representation learning process across layers. As a result, the overall prediction performance can be sensitive to the number of decision trees in each forest, especially on small-scale data, since selecting discriminative features as splitting nodes is crucial when constructing decision trees. Building on the basic cascade forest architecture, in this section we introduce a modified version, which we denote BCDForest.
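To make the architecture concrete, here is a minimal sketch of one cascade level, assuming Python with scikit-learn (the paper gives no code); ExtraTreesClassifier with max_features=1 approximates the completely-random forests of [23], and out-of-fold predictions keep the augmented class vectors from overfitting the training data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict

def cascade_level(X, y, n_trees=500, random_state=0):
    """One cascade level: two random forests plus two (approximately)
    completely-random forests; each emits a class-probability vector per
    instance, concatenated with the input features for the next level."""
    forests = [
        RandomForestClassifier(n_estimators=n_trees, random_state=random_state),
        RandomForestClassifier(n_estimators=n_trees, random_state=random_state + 1),
        ExtraTreesClassifier(n_estimators=n_trees, max_features=1,
                             random_state=random_state),      # ~completely random
        ExtraTreesClassifier(n_estimators=n_trees, max_features=1,
                             random_state=random_state + 1),  # ~completely random
    ]
    class_vectors = []
    for f in forests:
        # out-of-fold class probabilities, so augmented features are unbiased
        proba = cross_val_predict(f, X, y, cv=3, method="predict_proba")
        class_vectors.append(proba)
        f.fit(X, y)  # refit on all data for use at test time
    return forests, np.hstack([X] + class_vectors)
```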
Inspired by the idea of boosting, we assign discriminative features higher weights than uninformative features during forest training. Clearly, it is hard to weight each feature of the concatenated output vector at a cascade layer directly, since that vector is a combined class distribution of global and local features. On the one hand, different random forests may give different estimations, and we have no weight information about those estimations; on the other hand, extensive per-feature weight estimation would introduce additional cost. In this study, we instead emphasize the discriminative features in the concatenated vector to raise the probability that they are selected as splitting features of the decision trees in each random forest, thereby encouraging better-fitting forests. In fact, the importance of each feature can be estimated from the structures of the decision trees in a fitted random forest: features near the top of the decision trees tend to be more important for discriminating classes on the training data, so they should be boosted as important features to be reconsidered at the next layer. Given a fitted forest, feature importance can be estimated by combining all its decision trees and taking the average height rank of each feature across the trees. We select the top-k most important features in each forest and use the standard deviation of these k features to compose a new feature. We then concatenate this new variance-based feature with the output class-distribution vector to boost its class distribution in the input vector of the next layer, thereby reducing the false discovery rate of estimation at the next layer. The reasons for using the standard deviation of the top-k features rather than the top-k features themselves are: 1) it reduces the model's sensitivity to the parameter k; and 2) the variance partially captures how instances differ on the top-k features. This boosting operation can be applied at every layer of the cascade forest and introduces no extra computational cost, since feature importance is obtained as a by-product of growing the forests.
Fig. 2. Illustration of the boosting cascade forest structure. Each level of the cascade consists of two random forests (black) and two completely-random forests (red). The standard deviation of the top-k important features of each forest composes a new feature that is concatenated into the next cascade level to emphasize the discriminative features
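The boosting step can be sketched as follows (an illustration, not the authors' code; scikit-learn's impurity-based feature_importances_ stands in here for the average-height-rank importance described above):

```python
import numpy as np

def boosting_feature(forest, X, k=10):
    """Per-instance standard deviation over a fitted forest's top-k most
    important features; appended to the class-distribution vector so the
    next cascade level is more likely to split on discriminative features.
    Note: impurity-based feature_importances_ is a stand-in for the
    paper's average-height-rank importance."""
    top_k = np.argsort(forest.feature_importances_)[::-1][:k]
    return X[:, top_k].std(axis=1, keepdims=True)

# usage sketch: one extra column per forest, fed into the next layer
# rf = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)
# X_next = np.hstack([X_aug, boosting_feature(rf, X_train)])
```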
Multi-class-grained scanning
Inspired by the way deep neural networks handle feature relationships, the cascade forest employs a multi-grained scanning strategy, a sliding-window based approach: it scans the raw input to generate a series of local low-dimensional feature vectors, then trains forests on those vectors to obtain class distributions of the input (see [23] for details). Although multi-grained scanning has proved effective for local feature recognition, a few drawbacks may affect the quality of the extracted features in applications. 1) On class-imbalanced data, the decision trees of the forests tend to favor the classes with the most instances and hinder the recognition of small classes during training, especially on high-dimensional data. 2) The diversity of the forests depends on manual hard-coded definitions and is not determined automatically in a data-driven way; this may weaken classification ability, since diversity is crucial to ensemble construction [23], especially on small-scale data. 3) All scanning forests contribute equally in multi-grained scanning, which can make the estimated class distribution sensitive to the fitting qualities of the forests. To ease these issues, we propose a multi-class-grained scanning approach that encourages the diversity of the ensemble forests by training them on different class-specific datasets.
Fig. 3. Illustration of multi-class-grained scanning. a Suppose there are four classes (A, B, C and D) in the training dataset. For each class, we produce positive and negative sub-datasets and use them to train a binary random forest classifier, so four different types of (sliding-window based) random forests are produced from the different training datasets. The out-of-bag (OOB) score of each forest is used to compute its normalized quality weight. b Using the fitted forests and their quality weights, a 500-dim instance vector is transformed into a concatenated 1604-dim representation
Formally, we normalize each 4-dimensional vector into a class-distribution space. Suppose \(X=(x_1,x_2,x_3,x_4)\) is a 4-dimensional class vector and \(W=(w_1,w_2,w_3,w_4)\) is the vector of out-of-bag fitting scores of the four scanning forests. The weight vector of the forests is \(W'=(w'_1,w'_2,w'_3,w'_4)\), where \(w'_i=w_i/\sum_{j=1}^{4}w_j\), and the normalized class vector is \(X'=(x'_1,x'_2,x'_3,x'_4)\), where \(x'_i=x_iw'_i/\sum_{j=1}^{4}x_jw'_j\). Each 100-dimensional window vector is transformed into a 4-dimensional normalized class vector, and all 401 of these class vectors are concatenated into a 401 × 4 = 1604-dimensional class vector corresponding to the original 500-dimensional raw feature vector. Figure 3 shows only one window size; multiple window sizes can be defined by the user, yielding more features in the final transformed vectors.
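This normalization is simple to implement. A minimal sketch in Python (the function name is ours, for illustration):

```python
import numpy as np

def normalize_class_vector(x, w):
    """Normalize one window's class vector x with the forests' OOB scores w:
    w'_i = w_i / sum_j(w_j), then x'_i = x_i*w'_i / sum_j(x_j*w'_j)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    w_prime = w / w.sum()          # normalized forest weights W'
    weighted = x * w_prime         # x_i * w'_i
    return weighted / weighted.sum()

# e.g. four binary forests' scores for one window, weighted by OOB quality:
# normalize_class_vector([0.9, 0.2, 0.1, 0.3], [0.95, 0.80, 0.85, 0.90])
```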
In addition, if there are many classes in the training data, we can still divide the raw data into different class groups on the principle of keeping the labels balanced, thereby simulating multiple sub-datasets and training multiple types of forests. Note that the main difference between our multi-class-grained scanning and standard multi-grained scanning is that we use different sub-datasets to train different forests to encourage the diversity of the ensemble. This is a data-driven way to train different forests and thus to extract more meaningful local features from the raw features. The three main advantages of our approach are: 1) dividing the training data into sub-datasets eases the underestimation caused by class-imbalanced data; 2) the number of forests is determined by the classes in the raw data rather than by a hyper-parameter, easing the risk of overfitting; and 3) it encourages the diversity of the forests and promises more accurate classification by assembling more simple classifiers. A sketch of the scanning procedure follows.
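A minimal sketch under the same assumptions as above (Python with scikit-learn; one window size with stride 1, so a 500-dimensional input and a 100-dimensional window yield the 401 windows of Fig. 3; each window inherits the label of its source instance, as in [23]):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def multi_class_grained_scan(X, y, window=100, stride=1, n_trees=100):
    """Multi-class-grained scanning: one binary (one-vs-rest) forest per
    class, trained on sliding-window slices of the raw features. The OOB
    score of each forest serves as its fitting-quality weight w_i."""
    starts = list(range(0, X.shape[1] - window + 1, stride))
    # stack every window of every instance into one training matrix
    X_win = np.vstack([X[:, s:s + window] for s in starts])
    forests, weights = [], []
    for c in np.unique(y):
        y_bin = np.tile((y == c).astype(int), len(starts))  # window labels
        f = RandomForestClassifier(n_estimators=n_trees, oob_score=True)
        f.fit(X_win, y_bin)
        forests.append(f)
        weights.append(f.oob_score_)  # fitting quality of this forest
    return forests, np.asarray(weights)
```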
Overall procedure of BCDForest
Fig. 4. Overall procedure of BCDForest. Suppose there are four classes and the sliding windows are 100-dim and 200-dim; two cascade layers are used to give the final prediction
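Putting the pieces together, the hypothetical sketch below reuses the helpers defined earlier (multi_class_grained_scan, normalize_class_vector, cascade_level, boosting_feature) to show how the fitted scanning forests transform data into the representation consumed by the boosted cascade levels:

```python
import numpy as np

def bcdforest_transform(forests, weights, X, window=100, stride=1):
    """Apply the fitted one-vs-rest scanning forests to data X: for each
    window, collect every forest's positive-class probability, normalize
    with the OOB weights, and concatenate (500-dim -> 1604-dim here)."""
    starts = list(range(0, X.shape[1] - window + 1, stride))
    blocks = []
    for s in starts:
        scores = np.column_stack([f.predict_proba(X[:, s:s + window])[:, 1]
                                  for f in forests])
        blocks.append(np.apply_along_axis(
            normalize_class_vector, 1, scores, np.asarray(weights)))
    return np.hstack(blocks)

# schematic training flow: scan, transform, then stack cascade levels,
# appending each level's boosting features before the next level
# forests, w = multi_class_grained_scan(X_train, y_train)
# R = bcdforest_transform(forests, w, X_train)
# level_forests, R = cascade_level(R, y_train)
# R = np.hstack([R] + [boosting_feature(f, R) for f in level_forests])
```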
Results
Datasets and parameters
Table 1 Comparison of overall accuracy on microarray gene expression datasets

| Code | Dataset | Samples | Genes | Classes | KNN | LR | RF | SVM | gcForest | BCDForest |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Adenocarcinoma | 76 | 9868 | 2 | 0.842 | 0.736 | 0.841 | 0.842 | 0.857 | 0.928 |
| 2 | Brain | 42 | 5597 | 5 | 0.784 | 0.858 | 0.796 | 0.690 | 0.892 | 0.964 |
| 3 | Colon | 62 | 2000 | 2 | 0.801 | 0.660 | 0.846 | 0.885 | 0.916 | 0.916 |
Table 2 Comparison of overall accuracy on RNA-Seq gene expression datasets

| Code | Dataset | Samples | Genes | Classes | KNN | LR | RF | SVM | gcForest | BCDForest |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | PANCANCER | 3594 | 8026 | 11 | 0.955 | 0.979 | 0.960 | 0.968 | 0.965 | 0.973 |
| 2 | BRCA | 514 | 3641 | 4 | 0.778 | 0.854 | 0.845 | 0.793 | 0.881 | 0.920 |
| 3 | GBM | 164 | 3180 | 4 | 0.694 | 0.651 | 0.702 | 0.619 | 0.741 | 0.806 |
| 4 | LUNG | 275 | 4000 | 3 | 0.710 | 0.744 | 0.791 | 0.786 | 0.830 | 0.867 |
| 5 | COAD_I | 264 | 3010 | 6 | 0.348 | 0.287 | 0.377 | 0.372 | 0.392 | 0.411 |
| 6 | COAD_N | 270 | 3006 | 3 | 0.699 | 0.631 | 0.696 | 0.700 | 0.711 | 0.730 |
| 7 | COAD_T | 282 | 3014 | 3 | 0.766 | 0.701 | 0.767 | 0.765 | 0.767 | 0.785 |
| 8 | LIHC_I | 347 | 4401 | 3 | 0.532 | 0.491 | 0.536 | 0.527 | 0.558 | 0.588 |
| 9 | LIHC_N | 400 | 4398 | 2 | 0.695 | 0.519 | 0.698 | 0.696 | 0.708 | 0.759 |
| 10 | LIHC_T | 347 | 4347 | 3 | 0.574 | 0.503 | 0.579 | 0.561 | 0.608 | 0.652 |
Overall performance on microarray datasets
We compared the classification performance of BCDForest with four conventional methods (KNN, SVM, logistic regression (LR), and random forest (RF)) and with the standard gcForest on three microarray datasets. The main challenge these three datasets pose for classification is their small sample sizes combined with high-dimensional gene features. We used 5-fold cross-validation to evaluate the overall accuracy of the different methods: for fairness, within each class we randomly selected 4/5 of the samples as training data and 1/5 as testing data. As shown in Table 1, BCDForest consistently outperforms the other methods in overall accuracy, indicating that our method is effective for cancer subtype classification on small-scale data. A likely reason is that BCDForest uses simple binary forests to learn the class-distribution features in multi-class-grained scanning, which partially eases the risk of overfitting. In addition, both gcForest and BCDForest outperform the conventional methods on all three small-scale datasets; on the adenocarcinoma dataset in particular, which includes 9868 genes but only 76 samples, gcForest (85.7%) and BCDForest (92.8%) obtain notably higher accuracy. This demonstrates that the deep forest framework is more powerful than the other methods for cancer subtype classification on small-scale microarray datasets. The evaluation protocol is sketched below.
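For concreteness, a minimal sketch of this stratified 5-fold protocol (our illustration, assuming scikit-learn; model is any classifier with fit/predict, and the stratified split keeps the 4/5 to 1/5 ratio within each class):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def evaluate(model, X, y, n_splits=5, seed=0):
    """Stratified 5-fold CV: within each class, 4/5 of the samples train
    the model and 1/5 are held out; returns mean overall accuracy."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accs = []
    for train_idx, test_idx in skf.split(X, y):  # X, y are numpy arrays
        model.fit(X[train_idx], y[train_idx])
        accs.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return np.mean(accs)
```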
Overall performance on RNA-Seq datasets
To systematically investigate the robustness of BCDForest, we examined its performance on more datasets, again comparing it with the four conventional methods and the standard gcForest. To test BCDForest on a relatively large-scale dataset, we downloaded the integrated pan-cancer dataset from TCGA, which includes 3594 samples from 11 different cancer types, with per-class sample sizes ranging from 72 to 840. We labeled each sample with its tumor cancer type and trained each model on these labels. We used 5-fold cross-validation to evaluate each method on these datasets; for fairness, within each class we randomly selected 4/5 of the samples as training data and 1/5 as testing data. Table 2 shows the overall accuracy of the different methods on the integrated pan-cancer dataset. We also downloaded gene expression and clinical data for five other cancer types (BRCA, GBM, LUNG, COAD and LIHC) from TCGA. The BRCA, GBM and LUNG datasets carry cancer subtype labels in their clinical data, which we used directly in the experiments. The COAD and LIHC clinical data contain no known subtype information, but their pathologic states are recorded; we therefore used three clinical pathologic states (pathologic stage (I), pathologic N and pathologic T) to define three sub-datasets per cancer, each with its own pathologic class labels. In particular, we filtered out pathologic subtypes with few samples in each dataset to reduce the effect of outliers. Table 2 gives the details of each dataset and the overall performance of each method.
As shown in Table 2, on the large-scale pan-cancer dataset all methods perform similarly, although LR (97.9%) and BCDForest (97.3%) achieve slightly higher accuracy than the others; a likely reason is that gene expression patterns differ markedly across cancer types. On the smaller cancer datasets, however, BCDForest is consistently better than the other methods, especially the conventional ones. For example, on the GBM data BCDForest obtains the highest accuracy (80.6%), exceeding gcForest (74.1%) by more than 6.5 percentage points and RF (70.2%) by more than 10. It is also notable that both BCDForest and gcForest beat the conventional methods on all five cancer-type datasets, which again indicates that deep forest methods are more powerful for cancer subtype classification, as they can learn more complex features to discriminate between classes.
Cancer type classification on pan-cancers dataset
Comparison of different methods on the large-scale pan-cancer dataset. Each dot represents the performance of the corresponding method on one cancer type; 11 cancer types are included in the pan-cancer dataset
Cancer subtype classification on BRCA, GBM and LUNG datasets
Comparison of BCDForest and gcForest on three cancer-type datasets (BRCA, GBM and LUNG). Each dot represents the performance of a method on one cancer subtype class
Comparison of the overall performance of BCDForest and gcForest on the BRCA, GBM and LUNG datasets. The average precision, recall and F1-score over all subtype classes of each dataset were evaluated
Pathologic cancer subtype classification on COAD and LIHC datasets
Comparison of the overall performance of BCDForest and gcForest on the COAD datasets. The average precision, recall and F1-score over all subtype classes of each dataset were evaluated

Comparison of the overall performance of BCDForest and gcForest on the LIHC datasets. The average precision, recall and F1-score over all subtype classes of each dataset were evaluated
Discussion
The deep forest framework provides a practical alternative to deep learning, but the standard deep forest model may face overfitting and ensemble-diversity challenges on small-scale biology data. BCDForest is a modification of the standard deep forest model (gcForest) that offers an effective way to ease the overfitting challenge and improve the robustness of the standard deep forest model on small-scale biology data. We compared BCDForest with the standard gcForest and several conventional classification methods on both microarray and RNA-Seq gene expression cancer datasets, and found: 1) the deep forest methods (BCDForest and gcForest) consistently outperformed the conventional classification methods on most of the cancer datasets, presumably because deep forest methods can learn more meaningful high-level features in supervised learning; and 2) BCDForest consistently outperformed the standard gcForest on most of the cancer datasets, showing that our boosting strategies effectively improve the classification ability of the standard deep forest model on small-scale biology cancer datasets and yield a robust model for cancer subtype classification. Although BCDForest tends to give better predictions than state-of-the-art methods for cancer subtypes, some challenges remain, such as handling extremely class-imbalanced, high-dimensional, small-scale datasets and further improving stability. Moreover, it has been shown in recent years that integrating multiple types of genomic data can improve cancer subtype prediction [33–35]. In this study we focused only on cancer subtype classification from gene expression data; extending the deep forest model to integrate multiple types of genomic data would be a useful way to advance cancer subtype classification.
Conclusions
The classification of cancer subtypes is vital to cancer diagnosis and therapy. In this paper, we proposed a deep learning model, called BCDForest, to address cancer subtype classification on small-scale biology data; it can be viewed as a modification of the standard gcForest recently presented in [23]. On the one hand, instead of manually defining different types of complex random forests, we proposed a multi-class-grained scanning strategy that encourages ensemble diversity by training multiple simple binary classifiers on the whole training data, while accounting for the fitting quality of each simple classifier in representation learning to improve the accuracy of the estimations. On the other hand, we proposed a boosting strategy that emphasizes important features in the cascade forests, propagating the benefits of discriminative features across cascade layers. Systematic comparison experiments on both microarray and RNA-Seq gene expression datasets demonstrate that our method consistently outperforms state-of-the-art methods. In conclusion, BCDForest provides a practical option for investigating cancer subtypes with deep learning on small-scale biology datasets.
Declarations
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No.61332014 and 61772426). We thank Pierre-Yves Lablanche for technical support.
Funding
The publication costs for this article were funded by Northwestern Polytechnical University.
Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 5, 2018: Selected articles from the Biological Ontologies and Knowledge bases workshop 2017. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-5.
Authors’ contributions
YG and XS designed the research; YG performed the method development; YG and SL performed the experiments; YG, SL, XS and ZL wrote and/or edited the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
- Stingl J, Caldas C. Molecular heterogeneity of breast carcinomas and the cancer stem cell hypothesis. Nat Rev Cancer. 2007;7(10):791–9.
- Bianchini G, Iwamoto T, Qi Y, Coutant C, Shiang CY, Wang B, Santarpia L, Valero V, Hortobagyi GN, Symmans WF, et al. Prognostic and therapeutic implications of distinct kinase expression patterns in different subtypes of breast cancer. Cancer Res. 2010;70(21):8852–62.
- Heiser LM, Sadanandam A, Kuo WL, Benz SC, Goldstein TC, Ng S, Gibb WJ, Wang NJ, Ziyad S, Tong F, et al. Subtype and pathway specific responses to anticancer compounds in breast cancer. Proc Natl Acad Sci U S A. 2012;109(8):2724–9.
- Prat A, Parker JS, Karginova O, Fan C, Livasy C, Herschkowitz JI, He X, Perou CM. Phenotypic and molecular characterization of the claudin-low intrinsic subtype of breast cancer. Breast Cancer Res. 2010;12(5):R68.
- Jahid MJ, Huang TH, Ruan J. A personalized committee classification approach to improving prediction of breast cancer metastasis. Bioinformatics. 2014;30(13):1858–66.
- Peng J, Wang H, Lu J, Hui W, Wang Y, Shang X. Identifying term relations cross different gene ontology categories. BMC Bioinformatics. 2017;18(16):573.
- Peng JJ, Xue HS, Shao YK, Shang XQ, Wang YD, Chen J. A novel method to measure the semantic similarity of HPO terms. Int J Data Min Bioinform. 2017;17(2):173–88.
- Peng J, Zhang X, Hui W, Lu J, Li Q, Shang X. Improving the measurement of semantic similarity by combining gene ontology and co-functional network: a random walk based approach. BMC Syst Biol. 2018;12(Suppl 2).
- Koboldt DC, Fulton RS, McLellan MD, Schmidt H, Kalicki-Veizer J, McMichael JF, Fulton LL, Dooling DJ, Ding L, Mardis ER, et al. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61–70.
- List M, Hauschild AC, Tan Q, Kruse TA, Mollenhauer J, Baumbach J, Batra R. Classification of breast cancer subtypes by combining gene expression and DNA methylation data. J Integr Bioinform. 2014;11(2):236.
- Peng J, Lu J, Shang X, Chen J. Identifying consistent disease subnetworks using DNet. Methods. 2017;131:104–10.
- Zheng CH, Ng TY, Zhang L, Shiu CK, Wang HQ. Tumor classification based on non-negative matrix factorization using gene expression data. IEEE Trans Nanobioscience. 2011;10(2):86–93.
- Marisa L, de Reynies A, Duval A, Selves J, Gaub MP, Vescovo L, Etienne-Grimaldi MC, Schiappa R, Guenot D, Ayadi M, et al. Gene expression classification of colon cancer into molecular subtypes: characterization, validation, and prognostic value. PLoS Med. 2013;10(5).
- Leong HS, Galletta L, Etemadmoghadam D, George J, Australian Ovarian Cancer Study, Kobel M, Ramus SJ, Bowtell D. Efficient molecular subtype classification of high-grade serous ovarian cancer. J Pathol. 2015;236(3):272–7.
- Hu Y, Zhou M, Shi H, Ju H, Jiang Q, Cheng L. Measuring disease similarity and predicting disease-related ncRNAs by a novel method. BMC Med Genomics. 2017;10(Suppl 5).
- Hu Y, Zhao L, Liu Z, Ju H, Shi H, Xu P, Wang Y, Liang L. DisSetSim: an online system for calculating similarity between disease sets. J Biomed Semantics. 2017;8(Suppl 1):28.
- Cheng L, Jiang Y, Wang Z, Shi H, Sun J, Yang H, Zhang S, Hu Y, Zhou M. DisSim: an online system for exploring significant similar diseases and exhibiting potential therapeutic drugs. Sci Rep. 2016;6:30024.
- Liu HX, Zhang RS, Luan F, Yao XJ, Liu MC, Hu ZD, Fan BT. Diagnosing breast cancer based on support vector machines. J Chem Inf Comput Sci. 2003;43(3):900–7.
- Okun O, Priisalu H. Random forest for gene expression based cancer classification: overlooked issues. In: Pattern Recognition and Image Analysis, Pt 2, Proceedings. 2007;4478:483.
- Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008;9.
- Ali HR, Rueda OM, Chin SF, Curtis C, Dunning MJ, Aparicio SAJR, Caldas C. Genome-driven integrated classification of breast cancer validated in over 7,500 samples. Genome Biol. 2014;15(8).
- Saddiki H, McAuliffe J, Flaherty P. GLAD: a mixed-membership model for heterogeneous tumor subtype classification. Bioinformatics. 2015;31(2):225–32.
- Zhou Z-H, Feng J. Deep forest: towards an alternative to deep neural networks. arXiv preprint arXiv:1702.08835v1. 2017.
- Hinton G, Deng L, Yu D, Dahl GE, Mohamed AR, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process Mag. 2012;29(6):82–97.
- Liang MX, Li ZZ, Chen T, Zeng JY. Integrative data analysis of multi-platform cancer data with a multimodal deep learning approach. IEEE/ACM Trans Comput Biol Bioinform. 2015;12(4):928–37.
- Litjens G, Sanchez CI, Timofeeva N, Hermsen M, Nagtegaal I, Kovacs I, Hulsbergen-van de Kaa C, Bult P, van Ginneken B, van der Laak J. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci Rep. 2016;6.
- Martinez-Munoz G, Suarez A. Out-of-bag estimation of the optimal sample size in bagging. Pattern Recogn. 2010;43(1):143–52.
- Bylander T. Estimating generalization error on two-class datasets using out-of-bag estimates. Mach Learn. 2002;48(1–3):287–97.
- Akbani R, Ng KS, Werner HM, Zhang F, Ju ZL, Liu WB, Yang JY, Lu YL, Weinstein JN, Mills GB. A pan-cancer proteomic analysis of the cancer genome atlas (TCGA) project. Cancer Res. 2014;74(19).
- Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, Shmulevich I, Sander C, Stuart JM, Cancer Genome Atlas Research Network. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45(10):1113–20.
- Diaz-Uriarte R, de Andres SA. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7.
- Lopez V, Fernandez A, Garcia S, Palade V, Herrera F. An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci. 2013;250:113–41.
- Bhattacharyya M, Nath J, Bandyopadhyay S. MicroRNA signatures highlight new breast cancer subtypes. Gene. 2015;556(2):192–8.
- Bediaga NG, Acha-Sagredo A, Guerra I, Viguri A, Albaina C, Diaz IR, Rezola R, Alberdi MJ, Dopazo J, Montaner D, et al. DNA methylation epigenotypes in breast cancer molecular subtypes. Breast Cancer Res. 2010;12(5).
- Cantini L, Isella C, Petti C, Picco G, Chiola S, Ficarra E, Caselle M, Medico E. MicroRNA-mRNA interactions underlying colorectal cancer molecular subtypes. Nat Commun. 2015;6.