Gene expression profiling of breast cancer survivability by pooled cDNA microarray analysis using logistic regression, artificial neural networks and decision trees
© Chou et al.; licensee BioMed Central Ltd. 2013
Received: 12 July 2012
Accepted: 26 February 2013
Published: 19 March 2013
Microarray technology can acquire information about thousands of genes simultaneously. We analyzed published breast cancer microarray databases to predict five-year recurrence and compared the performance of three data mining algorithms, artificial neural networks (ANN), decision trees (DT) and logistic regression (LR), and two composite models, DT-ANN and DT-LR. Four breast cancer microarray datasets from the Gene Expression Omnibus were pooled to predict five-year breast cancer relapse. After data compilation, 757 subjects, 5 clinical variables and 13,452 genetic variables were aggregated. The bootstrap method, the Mann-Whitney U test and 20-fold cross-validation were used to identify the 100 candidate genes with the most significant p-values. The predictive power of the DT, LR and ANN models was assessed using accuracy and the area under the ROC curve. The associated genes were evaluated using Cox regression.
The DT models exhibited the lowest predictive power and the poorest extrapolation when applied to the test samples. The ANN models displayed the best predictive power and showed the best extrapolation. The 21 most-associated genes, as determined by integrating the models, were analyzed using Cox regression and were associated with a 3.53-fold (95% CI: 2.24-5.58) increased risk of five-year breast cancer recurrence…
The 21 selected genes can predict breast cancer recurrence. Among these genes, CCNB1, PLK1 and TOP2A are in the cell cycle G2/M DNA damage checkpoint pathway. Oncologists can offer this genetic information to patients once the gene expression profiles associated with breast cancer recurrence are understood.
Keywords: Breast cancer; Microarray; Artificial neural network; Logistic regression; Decision tree
Breast cancer is one of the most common cancers in women worldwide. According to the American Cancer Society, breast cancer is the second leading cause of cancer death among women in the U.S. However, significantly different five-year recurrence rates and survival rates have been observed among breast cancer patients with the same course of disease. In other words, prognostic factors for breast cancer recurrence, such as histology and lymph node status, cannot fully predict the subsequent clinical manifestations of patients [2, 3].
Microarray technology can be used to acquire information about thousands of genes simultaneously. Traditional statistical methods, such as logistic regression, have become increasingly difficult to use for survivability prediction models because of several constraints: statistical power is low when the sample size is small, and the relationships among variables involve complex polynomial interaction terms with curvilinear effects. Data mining techniques, such as artificial neural networks and decision trees, can process thousands of independent variables without the constraints imposed by statistical assumptions and polynomial interaction terms. Compared with logistic regression, these techniques therefore have greater potential for building survivability prediction models.
Previously reported analyses of microarray data that aimed to predict breast cancer recurrence rarely selected the same groups of genes, possibly due to the small sample sizes used [4-11]. One of the objectives of the present study was to increase the sample size through the integration of samples from multiple breast cancer microarray databases. In addition, we sought to assess the capacity of logistic regression, decision tree and artificial neural network models to predict breast cancer recurrence, with the goal of developing a more predictive gene profile for breast cancer relapse within five years and identifying important risk genes that affect breast cancer recurrence.
Breast cancer microarray datasets
Number of study subjects
Desmedt et al.: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series.
Sotiriou et al.: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis.
Ivshina et al.: Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer.
Wang et al.: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer.
Preprocessing of microarray data
The four breast cancer microarray datasets included in this study all employed the HG-U133A oligonucleotide GeneChip from Affymetrix. This array comprises 22,283 probes for the simultaneous analysis of 20,000-30,000 genes. In our meta-analysis, the probe data of the four datasets were analyzed to obtain log conversions, standardized Z values, the sum of each probe score and quartile rankings for subsequent study. We used the GC Robust Multi-array Average (GCRMA) method in R, with library(gcrma) and library(preprocessCore), to remove the chip background associated with the microarray gene expression levels. The expression levels of the probe sets were then converted into gene expression levels. Because the probe expression levels showed a skewed distribution, the median probe expression in a gene was calculated to represent the gene expression level. Finally, the datasets were merged to obtain the expression levels of genes, using conversion formulae 1-3 followed by quantile normalization [16, 17] of all gene expression values.
X: original value, Y: converted value.
Following log conversion, the four datasets were further standardized into Z values with a mean value of 0 and a standard deviation of 1. Compared with the original data, the standardized Z values did not show significant differences in distribution among study objects.
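The transformations described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the study's procedure: the study used R (gcrma and preprocessCore), the exact conversion formulae are not reproduced here, and the genes-by-samples orientation, the +1 offset before the log and the per-gene standardization are all assumptions made for the example.

```python
import numpy as np

def log2_transform(m):
    """Log conversion; the +1 offset (an assumption) avoids log(0)."""
    return np.log2(np.asarray(m, dtype=float) + 1.0)

def z_standardize(m):
    """Standardize each gene (row) within one dataset to mean 0 and SD 1."""
    m = np.asarray(m, dtype=float)
    mean = m.mean(axis=1, keepdims=True)
    sd = m.std(axis=1, ddof=1, keepdims=True)
    sd[sd == 0] = 1.0  # leave constant genes at 0 instead of dividing by 0
    return (m - mean) / sd

def quantile_normalize(m):
    """Give every sample (column) the same empirical distribution.

    Each value is replaced by the mean, across samples, of the values
    that share its within-sample rank. Ties are broken arbitrarily.
    """
    m = np.asarray(m, dtype=float)
    ranks = np.argsort(np.argsort(m, axis=0), axis=0)  # rank within column
    mean_by_rank = np.sort(m, axis=0).mean(axis=1)     # average at each rank
    return mean_by_rank[ranks]

if __name__ == "__main__":
    # Toy genes x samples matrix (invented numbers, not study data).
    raw = np.array([[5.0, 4.0, 3.0],
                    [2.0, 1.0, 4.0],
                    [3.0, 4.0, 6.0],
                    [4.0, 2.0, 8.0]])
    print(quantile_normalize(raw))
    print(z_standardize(log2_transform(raw)))
```

After quantile normalization every column contains the same set of values, which is what makes samples from different datasets comparable on a common scale.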
The HG-U133A gene chip used in this study comprises 22,283 probes that cover 13,452 genes. Each gene is covered by 1-14 probes. Of the 13,452 genes, 5,107 (38%) are covered by more than two probe combinations. For genes covered by multiple probe combinations, this study adopted the median method. For example, when the expression level of the HFE gene was reflected by the levels of 13 probes, the level of the probe ranked seventh by expression (the median) was chosen to represent the expression of the HFE gene.
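The median method can be illustrated with a short Python sketch. The probe identifiers and expression values below are hypothetical, and the handling of genes with an even number of probes (averaging the two middle values, as statistics.median does) is an assumption, since the text only describes the odd-numbered case.

```python
from collections import defaultdict
from statistics import median

def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Summarize each gene by the median of its probes' expression levels.

    probe_values:  dict mapping probe id -> expression level (one sample)
    probe_to_gene: dict mapping probe id -> gene symbol
    Probes without a gene annotation are skipped.
    """
    by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            by_gene[gene].append(value)
    return {gene: median(values) for gene, values in by_gene.items()}

if __name__ == "__main__":
    # Hypothetical example: 13 probes for HFE with levels 1..13,
    # plus a single-probe gene with a placeholder name.
    probe_to_gene = {f"hfe_probe_{i}": "HFE" for i in range(1, 14)}
    probe_to_gene["other_probe"] = "GENE_X"
    probe_values = {f"hfe_probe_{i}": float(i) for i in range(1, 14)}
    probe_values["other_probe"] = 2.5
    print(collapse_probes_to_genes(probe_values, probe_to_gene))
```

With 13 probes the result is the seventh-ranked value, matching the HFE example in the text.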
Definition and selection of clinical variables
Clinical variables of each dataset (table; datasets: Wang et al., Ivshina et al., Sotiriou et al. and Desmedt et al.; recorded variables include lymph node status, relapse events, distant metastasis events and distant metastasis time)
Study subject selection
Gene prediction model
The 100 genetic variables and six clinical variables (age, tumor size, histopathological classification, estrogen receptor status, relapse occurrence within five years and relapse onset time) from the 400 training and test sample sets were subjected to statistical tests. The variables without significant differences between the training and test samples were selected to establish 20 training sets for a 20-fold cross-validation.
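The abstract states that the 100 genetic variables were the genes with the most significant Mann-Whitney U test p-values. A minimal sketch of that ranking step is shown below using SciPy. It assumes a samples-by-genes matrix and a boolean relapse label per sample, and it omits the bootstrap and cross-validation steps, so it should be read as an illustration and not as the authors' pipeline.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def top_genes_by_mwu(expr, relapse, n_top=100):
    """Rank genes by two-sided Mann-Whitney U p-value.

    expr:    samples x genes array of expression values
    relapse: boolean array, True for subjects who relapsed
    Returns (indices of the n_top genes, their p-values), smallest first.
    """
    expr = np.asarray(expr, dtype=float)
    relapse = np.asarray(relapse, dtype=bool)
    pvals = np.array([
        mannwhitneyu(expr[relapse, j], expr[~relapse, j],
                     alternative="two-sided").pvalue
        for j in range(expr.shape[1])
    ])
    order = np.argsort(pvals)[:n_top]
    return order, pvals[order]

if __name__ == "__main__":
    # Synthetic data: 60 subjects, 20 genes, gene 3 shifted in relapsers.
    rng = np.random.default_rng(0)
    expr = rng.normal(size=(60, 20))
    relapse = np.arange(60) < 30
    expr[relapse, 3] += 3.0
    idx, p = top_genes_by_mwu(expr, relapse, n_top=5)
    print(idx, p)
```

Because thousands of genes are tested, the smallest p-values are used here only for ranking; no multiple-testing correction is applied in this sketch.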
Construction of the prediction models
In this study, Clementine 10.1 was used to construct the decision tree (DT), logistic regression (LR) and artificial neural network (ANN) models. The ANN parameter of over-training prevention was set as the percent difference between the training samples and test samples. The prediction accuracy of the test group was higher than that of the training group without the setup of the over-training prevention parameter. Therefore, the ANN model of this study was set at 80% over-training prevention, or ANN80. Because LR does not have the option of over-training prevention, another ANN model, ANN100, was constructed without over-training prevention to compare the predictive power of LR and ANN.
Because DT is capable of selecting important variables from a field of many, the composite model of this study first used DT to select important variables, which were then integrated into the LR or ANN models. Three types of composite models were used: the DT-LR composite model (DL), the DT-ANN composite model with 80% over-training prevention (DA80) and the DT-ANN composite model without over-training prevention (DA100).
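The composite idea (a decision tree selects variables, which are then passed to a second model) can be sketched with scikit-learn. The study built its models in Clementine 10.1; scikit-learn, the tree depth, the "importance greater than zero" selection rule and the synthetic data are all substitutions and assumptions made for illustration. Only the DT-LR variant is shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def fit_dt_lr(X, y, max_depth=3, random_state=0):
    """Fit a DT-LR composite: tree-based variable selection, then LR.

    Step 1: fit a shallow decision tree on all variables.
    Step 2: keep the variables the tree actually split on
            (feature importance > 0; an assumed selection rule).
    Step 3: fit logistic regression on the kept variables only.
    Returns (selected column indices, fitted LogisticRegression).
    """
    tree = DecisionTreeClassifier(max_depth=max_depth,
                                  random_state=random_state).fit(X, y)
    selected = np.flatnonzero(tree.feature_importances_ > 0)
    lr = LogisticRegression(max_iter=1000).fit(X[:, selected], y)
    return selected, lr

if __name__ == "__main__":
    # Synthetic data: the outcome depends only on variable 2.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = (X[:, 2] > 0).astype(int)
    selected, lr = fit_dt_lr(X, y)
    print("selected variables:", selected)
    print("training accuracy:", lr.score(X[:, selected], y))
```

A DT-ANN composite would follow the same pattern with a neural network classifier in place of the logistic regression.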
Criteria for assessing the results of the analysis
Three indicators were adopted to evaluate the predictive ability of the models in this study. The first indicator was accuracy (ACC). For this measure, the higher the score, the better the predictive ability of the model. ACC was calculated as follows:
ACC = number of cases with correctly predicted breast cancer recurrence within five years/total number of cases.
The second indicator was AUC: the area under the ROC curve, which plots sensitivity (Y axis) against 1-specificity (X axis). This value can be used to determine the classification ability of a model: the higher the AUC, the better the predictive ability of the model.
The third indicator was extrapolation, or the difference in ACC (or AUC) between the training and test samples. This value represents the magnitude of change in the predictive ability of a prediction model toward test samples after training with the training samples. This value was calculated as follows:
ΔACC = ACC of training samples - ACC of test samples
ΔAUC = AUC of training samples - AUC of test samples
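The three indicators can be written out as a small NumPy sketch. The function names are illustrative, and the AUC is computed through its rank interpretation (the probability that a randomly chosen recurrence case receives a higher score than a randomly chosen non-recurrence case, with ties counted as one half), which equals the area under the ROC curve.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """ACC: fraction of cases whose five-year recurrence status is predicted correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def auc(y_true, scores):
    """AUC via pairwise comparison of scores for positive and negative cases."""
    y = np.asarray(y_true, dtype=bool)
    s = np.asarray(scores, dtype=float)
    diff = s[y][:, None] - s[~y][None, :]  # every positive/negative pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)

def extrapolation(train_value, test_value):
    """Delta ACC or delta AUC: training-sample value minus test-sample value."""
    return train_value - test_value

if __name__ == "__main__":
    # Toy labels and scores (invented numbers, not study data).
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))       # 3 of 4 correct
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # 3 of 4 pairs ordered correctly
    print(extrapolation(0.90, 0.80))
```

A small positive difference indicates that a model loses little predictive ability when moving from the training samples to the test samples.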
Analysis of recurrence risks and genetic and biochemical pathways
In this study, SPSS 14.0 software was used to perform Cox proportional hazards regression to analyze the relative risks of genetic characteristics with regard to breast cancer recurrence. The log-rank test was used to compare the survival curves by genetic characteristics, and the Ingenuity Pathway Analysis database was used to analyze and predict the major biochemical functions of the identified genes. The net reclassification improvement (NRI), which is available as a MATLAB function, was used in addition to AUC to compare Cox models that contained the predictors with those that did not, as an additional marker of incremental improvement in risk prediction [19-21].
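For readers unfamiliar with the NRI, the category-based definition of Pencina et al. can be sketched as follows. The study used a MATLAB function; this Python version is a substitute written for illustration, and the risk categories in the example are invented.

```python
import numpy as np

def net_reclassification_improvement(old_cat, new_cat, event):
    """Category-based NRI.

    old_cat, new_cat: integer risk categories (higher = higher risk)
                      assigned by the model without and with the new markers
    event:            boolean array, True if recurrence occurred
    NRI = [P(up | event) - P(down | event)]
        + [P(down | non-event) - P(up | non-event)]
    """
    old_cat = np.asarray(old_cat)
    new_cat = np.asarray(new_cat)
    event = np.asarray(event, dtype=bool)
    up = new_cat > old_cat      # moved to a higher risk category
    down = new_cat < old_cat    # moved to a lower risk category
    nri_events = up[event].mean() - down[event].mean()
    nri_nonevents = down[~event].mean() - up[~event].mean()
    return float(nri_events + nri_nonevents)

if __name__ == "__main__":
    # Invented example with two risk categories (0 = low, 1 = high).
    event   = [1, 1, 1, 1, 0, 0, 0, 0]
    old_cat = [0, 0, 1, 1, 0, 1, 1, 0]
    new_cat = [1, 0, 1, 1, 0, 0, 1, 0]
    print(net_reclassification_improvement(old_cat, new_cat, event))
```

In the example, one of four event cases moves up (0.25) and one of four non-event cases moves down (0.25), giving an NRI of 0.5. A positive NRI means the added predictors move subjects in the correct direction on balance.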
Assessment of the predictive power of each single model using the 100-gene profile
Assessment of the predictive power of the composite models using the 100-gene profile
The top 10 most important genes based on rankings from all of the single models tested in this study were as follows: LMCD1, DEAF1, AP2A2, LMNB1, ZFP36L2, ABCC1, PLOD2, LARS2, CDCA3 and AACS. Of these, LMCD1 was ranked first in three of the four models (LR, ANN80 and ANN100), DEAF1 was ranked among the top 10 most important genes in all models and AP2A2 was ranked among the top 10 most important genes in three models (DT, ANN80 and ANN100). The overall top 10 important genes in all models were among the top 40% of important genes in any single model.
Cox regression analysis of the five-year breast cancer recurrence of the test samples
HR (95% CI)
HR (95% CI)
21 Genes Profile
2.60 (1.44-4.68)
Estrogen Receptor b
When we investigated the reproducibility of the selected genes from each dataset, we found that only three genes, CCNE2, GTSE1 and KPNA2, were included in the lists of genes selected by the original authors, suggesting a very low reproducibility of the selected genes in different studies. The 100 genes selected in this study were compared with the genes selected by the original authors. The results showed that 5 genes from this study were also selected by Wang et al. and Desmedt et al., 19 genes were also selected by Sotiriou et al. and 20 genes were also selected by Ivshina et al. Among these genes, two genes, MLF1IP and PLK1, were selected by Wang et al., Desmedt et al., Sotiriou et al. and this study. The PLK1 gene was ranked among the top 10 most important genes in the DT and ANN100 models.
Xu et al. analyzed genes related to five-year metastasis rates of breast cancer using four breast cancer microarray datasets that are available online. Of these four datasets, two (Wang et al. and Sotiriou et al.) were also included in the present study. In addition, the Miller et al. dataset is essentially equivalent to the Ivshina et al. dataset included in this study because only three subjects differed between the two datasets. Thus, the Desmedt et al. dataset is the only dataset that was included in the present study but not included in the 2008 study by Xu et al. (the Pawitan et al. dataset was the fourth dataset used in the Xu et al. study). Of the 112 predictive genes identified by the Xu et al. study, 5 genes were also selected by Wang et al. and Desmedt et al., 19 genes were also selected by Sotiriou et al. and 19 genes were also selected by Ivshina et al.; this level of agreement is similar to that observed in the present study. We also compared our 100 selected genes with the 112 genes from the 2008 Xu et al. study and found that 13 genes were selected in both studies: AP2A2, ASPM, CDKN3, EEF1E1, IGHM, IGKC, LST1, MAD2L1, MELK, MLF1IP, PRC1, RACGAP1 and STK6. Xu et al. addressed the same question as the present study and conducted a meta-analysis using similar databases. The percentage of identified genes that overlapped with those in the original datasets is similar between our study and the Xu et al. study, suggesting that, when compared with genes selected using small sample sizes, selecting genes by meta-analysis can improve the accuracy of predicting breast cancer recurrence.
Xu et al. predicted five-year breast cancer recurrence rates using 112 selected genes, achieving a sensitivity of 88% and a specificity of 54.6%. The risk of recurrence was 9.3-fold (hazard ratio = 9.3, 95% CI: 2.9-29.9). The risk of five-year breast cancer recurrence found in this study was lower than that found in the 2008 study by Xu et al. However, this study achieved a similar result to that study with only one-fifth the number of genes, suggesting that the 21 genes in this study can effectively differentiate breast cancer patients with high and low risks of recurrence.
The present study identified a set of genes that were consistently selected from pooled microarray datasets aggregated across several studies; these genes can serve as a candidate gene expression profile for future work. It will be important to examine, in an independent replication study, how reliably the signature proposed here predicts relapse in breast cancer patients. Furthermore, the candidate genes merit further investigation of their epigenetic and genetic characteristics in breast cancer, such as DNA methylation, mRNA expression, microRNA interactions and biochemical pathways.
One limitation of this study is that the pooled microarray datasets were obtained from multiple studies. Pooling increases the sample size but may also introduce heterogeneity caused by discrepancies among studies. The datasets lacked the complete information needed to identify discrepancies among studies in all breast cancer-related variables, as well as in variables affecting the survival of breast cancer patients, such as the use of chemotherapy and radiotherapy, phenotype definition, population ethnicity and genetic heterogeneity. Therefore, this study did not effectively control for other breast cancer-related factors that could affect the selection of the genes related to breast cancer recurrence. However, we attempted to adjust for factors that differed among studies, such as age, tumor diameter, histopathologic grade and estrogen receptor status. Although four datasets were combined to increase the sample number in this study, only 757 patients remained after excluding patients with positive lymph nodes or follow-up times of less than 5 years. Additionally, several other groups of study subjects, such as those treated with tamoxifen, chemotherapy or radiotherapy and those with redundant database entries, were not excluded, in order to ensure an adequate number of samples for the study. The authors of the original datasets mentioned that the inclusion of patients who received these treatments would not affect the results of their studies; thus, in the absence of more detailed information, we assumed that the breast cancer recurrence rates and gene expression levels among the selected patients were not affected by interfering factors.
In the present study, after integrating the results of breast cancer microarray dataset analyses using several different models, we identified 21 genes that are closely related to breast cancer recurrence: LMCD1, DEAF1, AP2A2, LMNB1, ZFP36L2, ABCC1, PLOD2, LARS2, CDCA3, AACS, TNFRSF25, SMC1A, ADIPOQ, DPP3, FADD, PLK1, SDS, HSPB6, MTERFD1, CHPF and AQP1. Among these, the PLK1 gene was of particular interest because it is involved in the DNA damage checkpoint response at the G2/M phase of the cell cycle and, along with other genes (such as CCNB1 and TOP2A), plays a role in regulating cell cycle progression. Two genes, MLF1IP and PLK1, were selected by Wang et al., Desmedt et al. and Sotiriou et al. as well as by this study, and PLK1 was among the most reproducible genes. In terms of translational relevance, this gene profile may be valuable as a target for prognosis and treatment.
AUC: Area under the curve
ANN: Artificial neural network
GEO: Gene Expression Omnibus
The authors would like to thank all members of the Bioinformatics Laboratory of National Defense Medical Center, Far Eastern Memorial Hospital, Cathay General Hospital and Chang-Gung University for sharing their data mining and machine learning knowledge.
- American Cancer Society. 2011. http://www.cancer.org/docroot/home/index.asp
- Eifel P, Axelson JA, Costa J, Crowley J, Curran WJ Jr, Deshler A, Fulton S, Hendricks CB, Kemeny M, Kornblith AB: National Institutes of Health Consensus Development Conference Statement: adjuvant therapy for breast cancer: 1-3 November 2000. J Natl Cancer Inst 2001, 93: 979-989.
- McGuire WL: Breast cancer prognostic factors: evaluation guidelines. J Natl Cancer Inst 1991, 83: 154-155. doi:10.1093/jnci/83.3.154
- Davis CA, Gerick F, Hintermair V, Friedel CC, Fundel K, Kuffner R, Zimmer R: Reliable gene signatures for microarray classification: assessment of stability and performance. Bioinformatics 2006, 22: 2356-2363. doi:10.1093/bioinformatics/btl400
- Gemignani F, Perra C, Landi S, Canzian F, Kurg A, Tonisson N, Galanello R, Cao A, Metspalu A, Romeo G: Reliable detection of beta-thalassemia and G6PD mutations by a DNA microarray. Clin Chem 2002, 48: 2051-2054.
- Gutmann O, Kuehlewein R, Reinbold S, Niekrawietz R, Steinert CP, de Heij B, Zengerle R, Daub M: Fast and reliable protein microarray production by a new drop-in-drop technique. Lab Chip 2005, 5: 675-681. doi:10.1039/b418765b
- Lassmann S, Kreutz C, Schoepflin A, Hopt U, Timmer J, Werner M: A novel approach for reliable microarray analysis of microdissected tumor cells from formalin-fixed and paraffin-embedded colorectal cancer resection specimens. J Mol Med 2009, 87: 211-224. doi:10.1007/s00109-008-0419-y
- Shi L, Perkins RG, Fang H, Tong W: Reproducible and reliable microarray results through quality control: good laboratory proficiency and appropriate data analysis practices are essential. Curr Opin Biotechnol 2008, 19: 10-18. doi:10.1016/j.copbio.2007.11.003
- Stirewalt DL, Pogosova-Agadjanyan EL, Khalid N, Hare DR, Ladne PA, Sala-Torra O, Zhao LP, Radich JP: Single-stranded linear amplification protocol results in reproducible and reliable microarray data from nanogram amounts of starting RNA. Genomics 2004, 83: 321-331. doi:10.1016/j.ygeno.2003.08.008
- van der Spek PJ, Kremer A, Murry L, Walker MG: Are gene expression microarray analyses reliable? A review of studies of retinoic acid responsive genes. Genomics Proteomics Bioinformatics 2003, 1: 9-14.
- Xu X, Li Y, Zhao H, Wen SY, Wang SQ, Huang J, Huang KL, Luo YB: Rapid and reliable detection and identification of GM events using multiplex PCR coupled with oligonucleotide microarray. J Agric Food Chem 2005, 53: 3789-3794. doi:10.1021/jf048368t
- Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, d'Assignies MS: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 2007, 13: 3207-3214. doi:10.1158/1078-0432.CCR-06-2765
- Ivshina AV, George J, Senko O, Mow B, Putti TC, Smeds J, Lindahl T, Pawitan Y, Hall P, Nordgren H: Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. Cancer Res 2006, 66: 10292-10301. doi:10.1158/0008-5472.CAN-05-4414
- Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst 2006, 98: 262-272. doi:10.1093/jnci/djj052
- Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365: 671-679.
- Wang Y, Sun G, Ji Z, Xing C, Liang Y: Weighted change-point method for detecting differential gene expression in breast cancer microarray data. PLoS One 2012, 7: e29860. doi:10.1371/journal.pone.0029860
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005, 102: 15545-15550. doi:10.1073/pnas.0506580102
- Padoan A: Net Reclassification Improvement (NRI) has been proposed as an alternative to the area under the curve of the ROC. The MathWorks, Inc; 2010. http://www.mathworks.com/matlabcentral/fileexchange/28579-net-reclassification-improvement&watching=28579 Accessed 22 November 2012.
- Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, Vasan RS: Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med 2008, 27: 157-172; discussion 207-212. doi:10.1002/sim.2929
- Pencina MJ, D'Agostino RB Sr, Demler OV: Novel metrics for evaluating improvement in discrimination: net reclassification and integrated discrimination improvement for normal variables and nested models. Stat Med 2012, 31: 101-113. doi:10.1002/sim.4348
- Pencina MJ, D'Agostino RB Sr, Steyerberg EW: Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med 2011, 30: 11-21. doi:10.1002/sim.4085
- Beyer SJ, Zhang X, Jimenez RE, Lee ML, Richardson AL, Huang K, Jhiang SM: Microarray analysis of genes associated with cell surface NIS protein levels in breast cancer. BMC Res Notes 2011, 4: 397. doi:10.1186/1756-0500-4-397
- Delen D, Walker G, Kadam A: Predicting breast cancer survivability: a comparison of three data mining methods. Artif Intell Med 2005, 34: 113-127. doi:10.1016/j.artmed.2004.07.002
- Kumar R, Sharma A, Tiwari RK: Application of microarray in breast cancer: An overview. J Pharm Bioallied Sci 2012, 4: 21-26.
- Snow PB, Kerr DJ, Brandt JM, Rodvold DM: Neural network and regression predictions of 5-year survival after colon carcinoma treatment. Cancer 2001, 91: 1673-1678. doi:10.1002/1097-0142(20010415)91:8+<1673::AID-CNCR1182>3.0.CO;2-T
- Hsu YH: Investigating the Models of Logistic Regression, Decision Tree, Artificial Neural Network and Hybrid Analysis for Predicting Coronary Artery Disease. Taipei, Taiwan: Master Thesis of National Defense Medical Center; 2007.
- Xu L, Tan AC, Winslow RL, Geman D: Merging microarray data from separate breast cancer studies provides a robust prognostic test. BMC Bioinformatics 2008, 9: 125. doi:10.1186/1471-2105-9-125
- Pawitan Y, Bjohle J, Amler L, Borg AL, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res 2005, 7: R953-R964. doi:10.1186/bcr1325
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.