Core module biomarker identification with network exploration for breast cancer metastasis
© Yang et al; licensee BioMed Central Ltd. 2012
Received: 16 September 2011
Accepted: 18 January 2012
Published: 18 January 2012
In a complex disease, the expression of many genes can be significantly altered, leading to the appearance of a differentially expressed "disease module". Some of these genes directly correspond to the disease phenotype (i.e. "driver" genes), while others represent closely related first-degree neighbours in gene interaction space. The remaining genes consist of further removed "passenger" genes, which are often not directly related to the original cause of the disease. For prognostic and diagnostic purposes, it is crucial to be able to separate the group of "driver" genes and their first-degree neighbours (i.e. the "core module") from the general "disease module".
We have developed COMBINER: COre Module Biomarker Identification with Network ExploRation. COMBINER is a novel pathway-based approach for selecting highly reproducible discriminative biomarkers. We applied COMBINER to three benchmark breast cancer datasets for identifying prognostic biomarkers. COMBINER-derived biomarkers exhibited 10-fold higher reproducibility than other methods, with up to 30-fold greater enrichment for known cancer-related genes, and 4-fold enrichment for known breast cancer susceptibility genes. More than 50% and 40% of the resulting biomarkers were cancer and breast cancer specific, respectively. The identified modules were overlaid onto a map of intracellular pathways that comprehensively highlighted the hallmarks of cancer. Furthermore, we constructed a global regulatory network intertwining several functional clusters and uncovered 13 confident "driver" genes of breast cancer metastasis.
COMBINER can efficiently and robustly identify disease core module genes and construct their associated regulatory network. In the same way, it is potentially applicable in the characterization of any disease that can be probed with microarrays.
In recent years, gene expression signatures based on DNA microarray technology have proven useful for predicting the risk of breast cancer. Agendia's MammaPrint, which is built on a 70-gene signature, has become the first FDA-cleared breast cancer prognosis marker chip. Many other microarray-based biomarkers, such as Wang's 76-gene signature, have been derived using independent data sources. However, only three genes overlap between MammaPrint's 70-gene and Wang's 76-gene signatures. Furthermore, many of these markers are functionally unrelated to breast cancer. In order to identify robust, functionally relevant disease biomarkers, it is crucial to find gene signatures that are consistent across data sources.
A complex disease such as breast cancer results in many differentially expressed genes (DEGs), which together can be used to construct a "disease module" network. Some of these DEGs directly correspond to the disease phenotype (i.e. "driver" genes). The expression changes enacted on the driver genes lead to a cascade of changes in other genes: initially in their first-degree interaction neighbours, followed by downstream effects on so-called "passenger" genes. Due to their direct relevance to the biology of the disease in question, the expression changes of the driver genes and their first-degree neighbours (i.e. members of the "core module") should be more consistent than those of the passenger genes when compared across independent cohorts. However, it is often difficult to separate the core module from the passenger genes for a given disease [5, 6]. In this paper, we aim to isolate the core module from the more general disease module and further identify the driver genes using network analysis.
The most intuitive way of finding the disease core module is to identify the differentially expressed genes (DEGs) that are shared across cohorts. Unfortunately, because passenger genes typically far outnumber core module genes in each cohort, they will account for the majority of gene overlaps purely by statistical chance. A more biologically motivated technique for identifying the core module is to find overlapping differentially expressed pathways. However, a pathway may contain hundreds of genes, of which only a functional submodule (a small group of genes) is differentially expressed in the disease in question. These submodules are often overlooked in pathway enrichment analysis.
In light of the aforementioned challenges, we propose to identify Pathway Activities (PAs) from cohorts of data and use supervised classification to isolate a consistent core module. Each PA is a vector aggregating the information of a few genes expressed in a pathway [7, 8]. The use of PAs for biomarker identification has been shown to improve the reproducibility and disease-related functional enrichment of the resulting biomarkers. The main idea behind our method is to infer the most significant PAs in each data cohort, and validate these PAs using classification methods in other cohorts. If a PA also scores highly in all the other cohorts, we consider it to be consistently differentially expressed in the disease of interest. Furthermore, we consider the genes that make up the PA to belong to the disease core module.
To illustrate its utility, we apply COMBINER to three benchmark breast cancer datasets. We evaluate the resulting core module for accuracy, reproducibility, and enrichment for known cancer-related genes. We then explore the roles of the COMBINER-identified core module in the hallmarks of cancer, and we reconstruct a breast cancer-specific interaction network composed of functionally coherent modules. Finally, we summarize our analyses by identifying 13 high confidence driver genes from COMBINER markers.
Results and Discussion
COMBINER is a multi-level optimization framework for identifying core module markers (Figure 1 and Methods). Briefly, COMBINER infers candidate submodules from known pathways, identifies the reproducible "core module" using independent cohorts, and uses intracellular signaling pathways and protein networks to identify the "driver" genes from the "core module".
We applied COMBINER to three independent breast cancer datasets to evaluate its effectiveness: Netherlands, USA, and Belgium. We obtained pathway information from the MSigDB v3.0 Canonical Pathways subset. To decrease redundancy, we applied pathway filtering to remove bulky pathways such as KEGG Pathways of Cancer. This resulted in a pathway dataset containing 624 pathways with 5,155 genes assayed in all three benchmark datasets.
Core Module Inference improves reproducibility and classification accuracy
Core module markers enrich cancer-related genes
Cancer Gene Enrichment rate of various breast cancer gene signatures
Core module markers highlight the hallmarks of cancer
Core module markers in predicted protein-protein interaction networks underpin functional modules
Confident "driver" genes for breast cancer metastasis
mitogen-activated protein kinase kinase 1
E2F transcription factor 1
growth factor receptor-bound protein 2
nuclear factor of kappa light polypeptide gene enhancer in B-cells 1
breast cancer 1, early onset
v-fos FBJ murine osteosarcoma viral oncogene homolog
son of sevenless homolog 1 (Drosophila)
phosphoinositide-3-kinase, catalytic, alpha polypeptide
Janus kinase 1
SHC (Src homology 2 domain containing) transforming protein 1
v-myc myelocytomatosis viral oncogene homolog (avian)
Identifying accurate and reproducible disease biomarkers is an important challenge for gene expression analysis. To facilitate this task, we developed COMBINER, a novel pathway-based biomarker identification method that extracts the essential "core module" of disease from known biological networks. Compared to existing methods, COMBINER substantially improves the reproducibility and cancer-specific enrichment of its resulting biomarkers. We examined the identified markers in intracellular signalling networks highlighting the hallmarks of cancer. Reassembling the core module genes into a regulatory network, we found 13 "driver" genes connecting eight functional modules. We anticipate such molecular descriptions to prove even more useful when applied to diseases that are less well-characterized; our current work focuses on several such applications.
Gene expression, pathways, cancer gene databases, and interactome
We used three breast cancer datasets from different countries of origin to evaluate our method: Netherlands, USA, and Belgium. Each dataset recorded whether the assayed patients developed metastasis within 5 years after surgery. The Netherlands, USA, and Belgium datasets contain expression profiles for 295, 286, and 198 patients, respectively, with 78, 107, and 35 patients experiencing metastasis. All of the patients in the USA and Belgium datasets had lymph-node-negative disease, although their estrogen receptor (ER) types differed. The Netherlands data contained both lymph-node-positive and lymph-node-negative patients with differing ER types, 130 of whom received adjuvant systemic therapy including chemotherapy and hormonal therapy. We performed a two-tailed t-test on the gene expression values of each dataset to distinguish between metastatic and non-metastatic patients, considering genes with p-value ≤ 0.05 as differentially expressed (DE).
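The DEG selection step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Matlab code. It assumes a pooled-variance (Student's) two-sample t-test, because the text does not state whether equal variances were assumed, and it obtains the two-tailed p-value by integrating the t density numerically with Simpson's rule. All function names and the dictionary layout of `expr` are hypothetical.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 - 2 * integral of the density over [0, |t|].

    The integral uses Simpson's rule; steps must be even.
    """
    a = abs(t)
    if a == 0.0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def pooled_t_test(group_a, group_b):
    """Pooled-variance two-sample t-test (an assumption); returns (t, p)."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, two_tailed_p(t, df)


def differentially_expressed(expr, is_metastatic, alpha=0.05):
    """Return {gene: (t, p)} for genes with p <= alpha.

    expr maps a gene name to one expression value per patient;
    is_metastatic holds one boolean per patient, in the same order.
    """
    selected = {}
    for gene, values in expr.items():
        met = [v for v, m in zip(values, is_metastatic) if m]
        non = [v for v, m in zip(values, is_metastatic) if not m]
        t, p = pooled_t_test(met, non)
        if p <= alpha:
            selected[gene] = (t, p)
    return selected
```

The sketch does not guard against genes with zero variance in both groups, and for real datasets a tested statistics library would be preferable to the hand-rolled integration.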
The reference cancer genes for enrichment analysis were collected from datasets including NetPath (all cancers, http://www.netpath.org/), Atlas of Cancer Genes (all cancers, http://atlasgeneticsoncology.org/), Census Genes (all cancers), CANgenes (breast cancer), G2SBC (breast cancer, http://www.itb.cnr.it/breastcancer/), and KEGG Pathways of Cancer (all cancers, KEGG hsa05200 http://www.genome.jp/kegg/pathway/hsa/hsa05200.html).
Pathway information was obtained from the MSigDB v3.0 Canonical Pathways subset [11, 21]. This collection contains 880 pathways collected from seven hand-curated pathway databases including KEGG, Reactome, and Biocarta.
Predicted protein-protein interaction information was obtained from STRING 9.
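To make the notion of enrichment against such a reference gene list concrete, the following Python sketch computes an enrichment rate, a fold enrichment over the background rate, and a one-sided hypergeometric p-value. The exact statistic used for the reported results is not given in this text, so the three quantities and the function name `enrichment` are assumptions chosen for illustration.

```python
from math import comb


def enrichment(markers, reference, background):
    """Enrichment of a marker set for a reference gene list.

    Returns (rate, fold, p):
      rate: fraction of markers that are reference genes
      fold: rate divided by the reference fraction in the background
      p:    P(X >= hits) for X ~ Hypergeometric(N, K, n)
    Only genes present in the background are counted.
    """
    background = set(background)
    markers = set(markers) & background
    reference = set(reference) & background
    hits = len(markers & reference)
    n, K, N = len(markers), len(reference), len(background)
    rate = hits / n
    fold = rate / (K / N)
    tail = sum(comb(K, k) * comb(N - K, n - k)
               for k in range(hits, min(n, K) + 1))
    return rate, fold, tail / comb(N, n)


# Toy example: 100 assayed genes, 10 reference cancer genes,
# 5 markers of which 4 are reference genes.
background = [f"g{i}" for i in range(100)]
reference = background[:10]
markers = ["g0", "g1", "g2", "g3", "g50"]
rate, fold, p = enrichment(markers, reference, background)
```

Restricting both lists to the assayed background matters in practice, because a reference gene that was never measured cannot be selected as a marker.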
Core Module Inference
Here, $g_i$ is the $i$th DEG in descending order and $P_j$ is the PA containing $g_1$ to $g_j$. $|g_i \in \mathrm{DEGs}|$ denotes the number of DEGs in the pathway. The DEGs by default are the genes with p-value ≤ 0.05 in a two-tailed t-test. We limit the largest marker size to 20 DEGs. In fact, all marker sets have fewer than 20 components.
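The prefix construction described here, in which each candidate PA contains the top-ranked DEGs up to some position and is capped at 20 genes, can be sketched as follows. The aggregation and scoring formulas are not reproduced in this text, so the sketch substitutes two assumptions: genes are ranked by the absolute value of a pooled-variance t statistic, and a prefix is aggregated as the per-patient sum of its gene values divided by the square root of the prefix size, then scored by the absolute t statistic of that activity. It also assumes standardized expression values whose class differences point in the same direction. Every name in the block is hypothetical.

```python
import math
from statistics import mean, variance

MAX_MARKER_SIZE = 20  # upper bound on DEGs per PA, as stated in the text


def t_stat(a, b):
    """Pooled-variance two-sample t statistic (assumed ranking criterion)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def split(values, labels):
    """Split per-patient values into (metastatic, non-metastatic) lists."""
    pos = [v for v, flag in zip(values, labels) if flag]
    neg = [v for v, flag in zip(values, labels) if not flag]
    return pos, neg


def prefix_activities(expr, ranked_degs, labels):
    """Yield (j, genes, activity, score) for each prefix of the ranked DEGs."""
    running = [0.0] * len(labels)
    for j, gene in enumerate(ranked_degs[:MAX_MARKER_SIZE], start=1):
        running = [r + v for r, v in zip(running, expr[gene])]
        activity = [r / math.sqrt(j) for r in running]  # assumed aggregation
        score = abs(t_stat(*split(activity, labels)))   # assumed score
        yield j, ranked_degs[:j], activity, score


def best_prefix(expr, pathway_degs, labels):
    """Rank a pathway's DEGs and return the highest-scoring prefix."""
    ranked = sorted(pathway_degs,
                    key=lambda g: abs(t_stat(*split(expr[g], labels))),
                    reverse=True)
    return max(prefix_activities(expr, ranked, labels),
               key=lambda item: item[3])
```

Because only prefixes of a single ranking are scored, at most 20 candidates are evaluated per pathway, which keeps the search linear in the number of DEGs.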
where is the ith PA in descending order in the inference dataset, and is its corresponding PA in the validation dataset. For the breast cancer datasets, the overall reproducibility is then given by the average Cscore of the inferred pathways over all six inference-validation pairs.
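The averaging over six inference-validation pairs follows from treating each of the three cohorts in turn as the inference set and each of the other two as the validation set. A minimal Python sketch of that bookkeeping is shown below. The Cscore itself is not defined in this text, so `pair_score` is a hypothetical placeholder for whatever per-pair reproducibility score is used, and the numbers in the example are made up purely to exercise the code.

```python
from itertools import permutations
from statistics import mean


def overall_reproducibility(cohorts, pair_score):
    """Average a per-pair score over all ordered (inference, validation) pairs.

    pair_score(inference, validation) is a placeholder callable; with three
    cohorts there are 3 * 2 = 6 ordered pairs.
    """
    pairs = list(permutations(cohorts, 2))
    return mean(pair_score(inf, val) for inf, val in pairs), pairs


# Illustrative values only; these are not results from the paper.
toy_scores = {
    ("Netherlands", "USA"): 0.5, ("Netherlands", "Belgium"): 0.7,
    ("USA", "Netherlands"): 0.6, ("USA", "Belgium"): 0.4,
    ("Belgium", "Netherlands"): 0.8, ("Belgium", "USA"): 0.6,
}
average, pairs = overall_reproducibility(
    ["Netherlands", "USA", "Belgium"],
    lambda inf, val: toy_scores[(inf, val)],
)
```

Ordered pairs are used rather than unordered ones because inferring on cohort A and validating on cohort B is a different experiment from the reverse.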
Six methods were compared in this work: CMI, CORG, Mean, Median, PCA, and Individual Gene. LLR (log-likelihood ratio) was not compared here, because it is not defined in the same gene expression space as the other methods.
Consensus Feature Elimination (CFE)
with w the weight vector and b the bias value.
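The equation that introduces w and b is not preserved in this text, and the details of CFE are likewise missing. As a hedged illustration of how a weight vector and bias are typically used in this setting, the Python sketch below evaluates a linear decision function f(x) = w · x + b and orders features by the squared weight, which is the ranking criterion of the SVM-RFE method of Guyon et al. listed in the references. This is SVM-RFE's criterion under an assumed linear classifier, not a reconstruction of CFE, and the function and feature names are hypothetical.

```python
def decision_value(w, b, x):
    """Linear decision function f(x) = w . x + b; the sign gives the class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def rfe_elimination_order(w, feature_names):
    """Order features from least to most important by squared weight.

    In SVM-RFE the feature with the smallest w_i^2 is removed first and
    the classifier is retrained before the next elimination; retraining is
    not shown here.
    """
    ranked = sorted(zip((wi * wi for wi in w), feature_names))
    return [name for _, name in ranked]


# Illustrative weights for three hypothetical pathway activities.
w = [0.5, -2.0, 0.1]
b = -0.25
order = rfe_elimination_order(w, ["PA_1", "PA_2", "PA_3"])
value = decision_value(w, b, [1.0, 1.0, 1.0])
```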
Cancer gene enrichment analysis
COMBINER was implemented in Matlab R2010a with Bioinformatics toolbox v3.5. The source code is available on http://www.ruotingyang.com.
We gratefully acknowledge financial support from U.S. Army Research Office (PTSD Grant W911NF-10-2-0111).
1. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530–536. 10.1038/415530a
2. Wang Y, Klijn JGM, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EMJJ, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365(9460):671–679.
3. Barabasi AL, Gulbahce N, Loscalzo J: Network medicine: a network-based approach to human disease. Nat Rev Genet 2011, 12(1):56–68. 10.1038/nrg2918
4. Beyer A, Bandyopadhyay S, Ideker T: Integrating physical and genetic maps: from genomes to interaction networks. Nat Rev Genet 2007, 8(9):699–710. 10.1038/nrg2144
5. Li J, Lenferink AEG, Deng Y, Collins C, Cui Q, Purisima EO, O'Connor-McCourt MD, Wang E: Identification of high-quality cancer prognostic markers and metastasis network modules. Nat Commun 2010, 1: 34.
6. Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 2005, 21(2):171–178. 10.1093/bioinformatics/bth469
7. Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Mol Syst Biol 2007, 3: 140.
8. Lee E, Chuang HY, Kim JW, Ideker T, Lee D: Inferring pathway activity toward precise disease classification. PLoS Comput Biol 2008, 4(11):e1000217. 10.1371/journal.pcbi.1000217
9. van de Vijver MJ, He YD, van 't Veer LJ, Dai H, Hart AAM, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002, 347(25):1999–2009. 10.1056/NEJMoa021967
10. Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, d'Assignies MS, Bergh J, Lidereau R, Ellis P, Harris AL, Klijn JGM, Foekens JA, Cardoso F, Piccart MJ, Buyse M, Sotiriou C: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 2007, 13(11):3207–3214. 10.1158/1078-0432.CCR-06-2765
11. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102(43):15545–15550. 10.1073/pnas.0506580102
12. Hanahan D, Weinberg RA: The hallmarks of cancer. Cell 2000, 100: 57–70. 10.1016/S0092-8674(00)81683-9
13. Hanahan D, Weinberg RA: Hallmarks of cancer: the next generation. Cell 2011, 144(5):646–674. 10.1016/j.cell.2011.02.013
14. Vuaroqueaux V, Urban P, Labuhn M, Delorenzi M, Wirapati P, Benz C, Flury R, Dieterich H, Spyratos F, Eppenberger U, Eppenberger-Castori S: Low E2F1 transcript levels are a strong determinant of favorable breast cancer outcome. Breast Cancer Res 2007, 9(3):R33. 10.1186/bcr1681
15. Kandasamy K, Mohan SS, Raju R, Keerthikumar S, Kumar G, Venugopal A, Telikicherla D, Navarro JD, Mathivanan S, Pecquet C, Gollapudi S, Tattikota S, Mohan S, Padhukasahasram H, Subbannayya Y, Goel R, Jacob H, Zhong J, Sekhar R, Nanjappa V, Balakrishnan L, Subbaiah R, Ramachandra Y, Rahiman BA, Prasad TK, Lin JX, Houtman J, Desiderio S, Renauld JC, Constantinescu S: NetPath: a public resource of curated signal transduction pathways. Genome Biol 2010, 11(1):R3. 10.1186/gb-2010-11-1-r3
16. Huret JL, Minor SL, Dorkeld F, Dessen P, Bernheim A: Atlas of genetics and cytogenetics in oncology and haematology, an interactive database. Nucleic Acids Res 2000, 28(1):349–351. 10.1093/nar/28.1.349
17. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004, 4(3):177–183. 10.1038/nrc1299
18. Sjöblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, Mandelker D, Leary RJ, Ptak J, Silliman N, Szabo S, Buckhaults P, Farrell C, Meeh P, Markowitz SD, Willis J, Dawson D, Willson JKV, Gazdar AF, Hartigan J, Wu L, Liu C, Parmigiani G, Park BH, Bachman KE, Papadopoulos N, Vogelstein B, Kinzler KW, Velculescu VE: The consensus coding sequences of human breast and colorectal cancers. Science 2006, 314(5797):268–274. 10.1126/science.1133427
19. Mosca E, Alfieri R, Merelli I, Viti F, Calabria A, Milanesi L: A multilevel data integration resource for breast cancer study. BMC Syst Biol 2010, 4(1):76. 10.1186/1752-0509-4-76
20. Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 2000, 28(1):27–30. 10.1093/nar/28.1.27
21. Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP: Molecular signatures database (MSigDB) 3.0. Bioinformatics 2011, 27(12):1739–1740. 10.1093/bioinformatics/btr260
22. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, von Mering C: The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res 2011, 39(suppl 1):D561–D568.
23. Guo Z, Zhang T, Li X, Wang Q, Xu J, Yu H, Zhu J, Wang H, Wang C, Topol E, Wang Q, Rao S: Towards precise classification of cancers based on robust gene functional expression profiles. BMC Bioinformatics 2005, 6(1):58. 10.1186/1471-2105-6-58
24. Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D, Joshi MB, Harpole D, Lancaster JM, Berchuck A, Olson JA, Marks JR, Dressman HK, West M, Nevins JR: Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 2006, 439(7074):353–357. 10.1038/nature04296
25. Su J, Yoon BJ, Dougherty ER: Accurate and reliable cancer classification based on probabilistic inference of pathway activity. PLoS ONE 2009, 4(12):e8161. 10.1371/journal.pone.0008161
26. Friedman JH: Regularized discriminant analysis. J Am Stat Assoc 1989, 84(405):165–175. 10.2307/2289860
27. Vapnik V: Statistical Learning Theory. Wiley-Interscience; 1998.
28. Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Mach Learn 2002, 46(1):389–422. 10.1023/A:1012487302797
29. Davis CA, Gerick F, Hintermair V, Friedel CC, Fundel K, Küffner R, Zimmer R: Reliable gene signatures for microarray classification: assessment of stability and performance. Bioinformatics 2006, 22(19):2356–2363. 10.1093/bioinformatics/btl400
30. Duan KB, Rajapakse JC, Wang H, Azuaje F: Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans NanoBiosci 2005, 4(3):228–234. 10.1109/TNB.2005.853657
31. Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y: Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics 2010, 26(3):392–398. 10.1093/bioinformatics/btp630
32. MacDonald TJ, Brown KM, LaFleur B, Peterson K, Lawlor C, Chen Y, Packer RJ, Cogen P, Stephan DA: Expression profiling of medulloblastoma: PDGFRA and the RAS/MAPK pathway as therapeutic targets for metastatic disease. Nat Genet 2001, 29(2):143–152. 10.1038/ng731
33. Giubellino A, Burke TR, Bottaro DP: Grb2 signaling in cell motility and cancer. Expert Opin Ther Targets 2008, 12(8):1021–1033. 10.1517/14728126.96.36.1991
34. Van Laere SJ, Van der Auwera I, Van den Eynden GG, Elst HJ, Weyler J, Harris AL, van Dam P, Van Marck EA, Vermeulen PB, Dirix LY: Nuclear factor-κB signature of inflammatory breast cancer by cDNA microarray validated by quantitative real-time reverse transcription-PCR, immunohistochemistry, and nuclear factor-κB DNA-binding. Clin Cancer Res 2006, 12(11):3249–3256. 10.1158/1078-0432.CCR-05-2800
35. Hamann U, Herbold C, Costa S, Solomayer EF, Kaufmann M, Bastert G, Ulmer HU, Frenzel H, Komitowski D: Allelic imbalance on chromosome 13q: evidence for the involvement of BRCA2 and RB1 in sporadic breast cancer. Cancer Res 1996, 56(9):1988–1990.
36. Rakha EA, Reis-Filho JS, Ellis IO: Basal-like breast cancer: a critical review. J Clin Oncol 2008, 26(15):2568–2581. 10.1200/JCO.2007.13.1748
37. Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, Van de Vijver MJ, Bergh J, Piccart M, Delorenzi M: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst 2006, 98(4):262–272. 10.1093/jnci/djj052
38. Smid M, Wang Y, Klijn JGM, Sieuwerts AM, Zhang Y, Atkins D, Martens JWM, Foekens JA: Genes associated with breast cancer metastatic to bone. J Clin Oncol 2006, 24(15):2261–2267. 10.1200/JCO.2005.03.8802
39. Campbell IG, Russell SE, Choong DYH, Montgomery KG, Ciavarella ML, Hooi CSF, Cristiano BE, Pearson RB, Phillips WA: Mutation of the PIK3CA gene in ovarian and breast cancer. Cancer Res 2004, 64(21):7678–7681. 10.1158/0008-5472.CAN-04-2933
40. Woelfle U, Cloos J, Sauter G, Riethdorf L, Jänicke F, van Diest P, Brakenhoff R, Pantel K: Molecular signature associated with bone marrow micrometastasis in human breast cancer. Cancer Res 2003, 63(18):5679–5684.
41. Ursini-Siegel J, Hardy WR, Zuo D, Lam SHL, Sanguin-Gendreau V, Cardiff RD, Pawson T, Muller WJ: ShcA signalling is essential for tumour progression in mouse models of human breast cancer. EMBO J 2008, 27(6):910–920. 10.1038/emboj.2008.22
42. Wolfer A, Wittner BS, Irimia D, Flavin RJ, Lupien M, Gunawardane RN, Meyer CA, Lightcap ES, Tamayo P, Mesirov JP, Liu XS, Shioda T, Toner M, Loda M, Brown M, Brugge JS, Ramaswamy S: MYC regulation of a "poor-prognosis" metastatic cancer cell state. Proc Natl Acad Sci USA 2010, 107(8):3698–3703. 10.1073/pnas.0914203107
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.