Core module biomarker identification with network exploration for breast cancer metastasis

Yang, Ruoting; Daigle, Bernie J; Petzold, Linda R; Doyle, Francis J

doi:10.1186/1471-2105-13-12

Research article
Open access
Published: 18 January 2012

Ruoting Yang¹,
Bernie J Daigle Jr²,
Linda R Petzold^1,2,3 &
…
Francis J Doyle III^1,4

BMC Bioinformatics volume 13, Article number: 12 (2012)

Abstract

Background

In a complex disease, the expression of many genes can be significantly altered, leading to the appearance of a differentially expressed "disease module". Some of these genes directly correspond to the disease phenotype (i.e. "driver" genes), while others represent closely related first-degree neighbours in gene interaction space. The remaining genes consist of further removed "passenger" genes, which are often not directly related to the original cause of the disease. For prognostic and diagnostic purposes, it is crucial to be able to separate the group of "driver" genes and their first-degree neighbours (i.e. the "core module") from the general "disease module".

Results

We have developed COMBINER: COre Module Biomarker Identification with Network ExploRation. COMBINER is a novel pathway-based approach for selecting highly reproducible discriminative biomarkers. We applied COMBINER to three benchmark breast cancer datasets to identify prognostic biomarkers. COMBINER-derived biomarkers exhibited 10-fold higher reproducibility than other methods, with up to 30-fold greater enrichment for known cancer-related genes, and 4-fold enrichment for known breast cancer susceptibility genes. More than 50% and 40% of the resulting biomarkers were cancer and breast cancer specific, respectively. The identified modules were overlaid onto a map of intracellular pathways that comprehensively highlighted the hallmarks of cancer. Furthermore, we constructed a global regulatory network intertwining several functional clusters and uncovered 13 confident "driver" genes of breast cancer metastasis.

Conclusions

COMBINER can efficiently and robustly identify disease core module genes and construct their associated regulatory network. In the same way, it is potentially applicable in the characterization of any disease that can be probed with microarrays.

Background

In recent years, gene expression signatures based on DNA microarray technology have proven useful for predicting the risk of breast cancer. Agendia's MammaPrint, which is built on a 70-gene signature, became the first FDA-cleared breast cancer prognosis marker chip [1]. Many other microarray-based biomarkers, such as Wang's 76-gene signature [2], have been derived using independent data sources. However, only three genes overlap between MammaPrint's 70-gene signature and Wang's 76-gene signature. Furthermore, many of these markers are functionally unrelated to breast cancer. In order to identify robust, functionally relevant disease biomarkers, it is crucial to find gene signatures that are consistent across various data sources.

A complex disease such as breast cancer results in many differentially expressed genes (DEGs), which together can be used to construct a "disease module" network [3]. Some of these DEGs directly correspond to the disease phenotype (i.e. "driver" genes). The expression changes enacted on the driver genes lead to a cascade of changes of other genes: initially to their first-degree interaction neighbors [4], followed by downstream effects to so-called "passenger" genes. Due to their direct relevance to the biology of the disease in question, the expression changes of the driver genes and their first-degree neighbours (i.e. members of the "core module"), should be more consistent than those of the passenger genes when compared across independent cohorts. However, it is often difficult to separate the core module from the passenger genes for a given disease [5, 6]. In this paper, we aim to isolate the core module from the more general disease module and further identify the driver genes using network analysis.

The most intuitive way of finding the disease core module is to identify the differentially expressed genes (DEGs) that overlap across various cohorts. Unfortunately, because each cohort typically contains a much larger number of passenger genes, most of the overlapping genes will be passenger genes that coincide by statistical chance. A more biologically motivated technique for identifying the core module is to find overlapping differentially expressed pathways. However, a pathway may contain hundreds of genes, while only a functional submodule (a small group of genes) is differentially expressed in the disease in question. These submodules are often overlooked in pathway enrichment analysis.

In light of the aforementioned challenges, we propose to identify Pathway Activities (PAs) from cohorts of data and use supervised classification to isolate a consistent core module. Each PA is a vector aggregating the information of a few genes expressed in a pathway [7, 8]. The use of PAs for biomarker identification has been shown to improve the reproducibility and disease-related functional enrichment of the resulting biomarkers [7]. The main idea behind our method is to infer the most significant PAs in each data cohort, and validate these PAs using classification methods in other cohorts. If a PA also scores highly in all the other cohorts, we consider it to be consistently differentially expressed in the disease of interest. Furthermore, we consider the genes that make up the PA to belong to the disease core module.

In this work, we develop a novel biomarker identification framework entitled COre Module Biomarker Identification with Network ExploRation (COMBINER). COMBINER identifies "core modules" (Figure 1) that are consistently differentially expressed as a whole in the data cohorts of interest. COMBINER uses a Core Module Inference (CMI) component to infer candidate PAs from a pathway database, a Consensus Feature Elimination (CFE) component to filter out irreproducible PAs, and a multi-level reproducibility validation framework to find the consistent PAs, which in turn make up the complete core module. In its final step, COMBINER uses known pathways and protein networks to identify the driver genes within this core module.

To illustrate its utility, we apply COMBINER to three benchmark breast cancer datasets. We evaluate the resulting core module for accuracy, reproducibility, and enrichment for known cancer-related genes. We then explore the roles of the COMBINER-identified core module in the hallmarks of cancer, and we reconstruct a breast cancer-specific interaction network composed of functionally coherent modules. Finally, we summarize our analyses by identifying 13 high confidence driver genes from COMBINER markers.

Results and Discussion

Overview

COMBINER is a multi-level optimization framework for identifying core module markers (Figure 1 and Methods). Briefly, COMBINER infers candidate submodules from known pathways, identifies the reproducible "core module" using independent cohorts, and uses intracellular signaling pathways and protein networks to identify the "driver" genes from the "core module".

We applied COMBINER to three independent breast cancer datasets to evaluate its effectiveness: Netherlands [9], USA [2], and Belgium [10]. We obtained pathway information from the MSigDB v3.0 Canonical Pathways subset [11]. To decrease redundancy, we applied pathway filtering to remove bulky pathways such as KEGG Pathways of Cancer. This resulted in a pathway dataset containing 624 pathways with 5,155 genes assayed in all three benchmark datasets.

Core Module Inference improves reproducibility and classification accuracy

A primary challenge of pathway inference is to find pathway subsets that are reproducible between independent datasets. We compared Core Module Inference (CMI) with five other inference methods as well as individual genes (see Methods). Across a range of numbers of inferred Pathway Activities (PAs), CMI showed a two-fold increase in reproducibility over the related CORG method and about a 10-fold improvement over the other methods (Figure 2).

We then compared the classification accuracy of CMI and the other inference methods using Linear Discriminant Analysis-Consensus Feature Elimination (LDA-CFE) classifiers focused on the top 100 inferred PAs (Methods). As shown in Figure 3, COMBINER run using PA vectors identified by CMI (CMI-COMBINER) exhibits better overall accuracy than the other methods coupled with COMBINER. Similarly, CMI also shows good overall accuracy using the SVM classifier (Additional file 1, Figure S1).

Core module markers enrich cancer-related genes

We compared the enrichment of known cancer genes in the biomarkers discovered by CMI-COMBINER (93 genes); CORG-COMBINER, i.e. COMBINER run using CORG activity vectors (123 genes); subnetwork markers (1162 genes) ([7], http://www.cellcircuits.com); MammaPrint's 70-gene signature (G70) (70 genes) [1]; and Wang's 76-gene signature (G76) (76 genes) [2]. Seven known cancer gene datasets were compared (see Methods). Both CMI-COMBINER and CORG-COMBINER showed much higher enrichment of cancer-related genes in their biomarker signatures (Table 1). Specifically, CMI- and CORG-COMBINER showed up to 4-fold increased enrichment over subnetwork markers and up to 30-fold enrichment over the other gene signatures. For known breast cancer genes in Census in particular, they exhibited up to 4-fold enrichment over the other signatures. More than 50% and 40% of the resulting biomarkers are cancer and breast cancer specific, respectively. Additionally, CMI-COMBINER showed greater enrichment than CORG-COMBINER with respect to the Atlas of Cancer Genes, which is the largest cancer gene collection. Consistent with the results of Chuang et al. [7], we also found insignificant enrichment in the CANgenes dataset, which includes 122 mutated genes from 11 breast cancer cell lines. A possible explanation is that "the cancer cell lines capture a different disease state than that found in the population of patients surveyed by microarray profiling." [7] The COMBINER core module markers with associated pathways are summarized in Additional file 2, Table S1 and Additional file 3, Table S2. Additional file 4, Table S3 lists the overlaps between CMI-/CORG-COMBINER and KEGG pathways of cancer, along with up-/down-regulation information.

Table 1 Cancer Gene Enrichment rate of various breast cancer gene signatures

Core module markers highlight the hallmarks of cancer

As shown in Figure 4, the COMBINER-discovered biomarkers are overlaid on the hallmarks of cancer [12, 13], which integrate the common intracellular signalling pathways of all subtypes of cancer. The components of the core module markers from CMI and CORG along with eighteen common markers are listed in different fonts. The remaining proteins (most were not differentially expressed) in the pathways are consolidated into unlabeled nodes. Figure 4 shows that the identified core module genes comprehensively highlight the hallmarks, demonstrating the high specificity of COMBINER. In particular, 18 common markers, which we regard as the most reliable predictors, describe well-characterized processes involving growth factors, survival factors, the cell cycle, and the ExtraCellular Matrix (ECM). The modules unique to CMI-COMBINER include anti-apoptosis and JAK-STAT cascades, while pathways describing anti-growth factors and death factors were unique to CORG-COMBINER. A few well-known mutant proteins, including cyclin D1 and p53, may play an important role in connecting other signatures [7], but they showed only limited predictive ability in the three breast cancer datasets.

Core module markers in predicted protein-protein interaction networks underpin functional modules

Figure 5 shows how a regulatory network was constructed using the interactome of the core module markers. The regulatory network was divided into a few functional modules, including cell cycle and ECM. These functional modules were interconnected by 20 "hub" genes (larger pink/green nodes), 13 of which overlapped with the common marker genes (Additional file 2, Table S1). Our results imply that these 13 "hub" markers are the essential "driver" genes of breast cancer metastasis (Table 2). For example, BRCA1 is among the most well-characterized genes whose mutation gives rise to breast cancer. In addition, low E2F1 transcript levels strongly predicted good prognosis based on quantitative RT-PCR in 317 primary breast cancer patients [14]. We further enlarged the nodes of three standard breast cancer indicators TP53, BRCA1, and ERBB2, which connect many of the surrounding hub genes. Although TP53 and ERBB2 are useful for a mechanistic understanding of breast cancer, they were not identified as discriminative gene markers. A regulatory network was also created representing CORG-COMBINER (Additional file 5, Figure S2), but no additional "hub" markers were found.

Table 2 Confident "driver" genes for breast cancer metastasis

Conclusions

Identifying accurate and reproducible disease biomarkers is an important challenge for gene expression analysis. To facilitate this task, we developed COMBINER, a novel pathway-based biomarker identification method that extracts the essential "core module" of disease from known biological networks. Compared to existing methods, COMBINER substantially improves the reproducibility and cancer-specific enrichment of its resulting biomarkers. We examined the identified markers in intracellular signalling networks highlighting the hallmarks of cancer. Reassembling the core module genes into a regulatory network, we found 13 "driver" genes connecting eight functional modules. We anticipate such molecular descriptions to prove even more useful when applied to diseases that are less well-characterized; our current work focuses on several such applications.

Methods

Gene expression, pathways, cancer gene databases, and interactome

We used three breast cancer datasets from different countries of origin to evaluate our method: Netherlands [9], USA [2], and Belgium [10]. Each dataset recorded whether the assayed patients developed metastasis within 5 years after surgery. The Netherlands, USA, and Belgium datasets contain expression profiles for 295, 286, and 198 patients, respectively, with 78, 107, and 35 patients experiencing metastasis. All of the patients in the USA and Belgium datasets had lymph-node-negative disease, although their estrogen receptor (ER) types differed. The Netherlands data contained both lymph-node-positive and lymph-node-negative patients with differing ER types, 130 of whom received adjuvant systemic therapy including chemotherapy and hormonal therapy. We performed a two-tailed t-test on the gene expression values of each dataset to distinguish between metastatic and non-metastatic patients, considering genes with p-value ≤ 0.05 as differentially expressed (DE).

The reference cancer genes for enrichment analysis were collected from datasets including NetPath [15] (all cancers, http://www.netpath.org/), Atlas of Cancer Genes [16] (all cancers, http://atlasgeneticsoncology.org/), Census Genes [17] (all cancers), CANgenes [18] (breast cancer), G2SBC [19] (breast cancer, http://www.itb.cnr.it/breastcancer/), and KEGG Pathways of Cancer [20] (all cancers, KEGG hsa05200 http://www.genome.jp/kegg/pathway/hsa/hsa05200.html).

Pathway information was obtained from the MSigDB v3.0 Canonical Pathways subset [11, 21]. This collection contains 880 pathways collected from seven hand-curated pathway databases including KEGG, Reactome, and Biocarta.

Predicted protein-protein interaction information was obtained from STRING 9 [22].

Core Module Inference

The CMI method adopts the strategy of the CORG method [8] of finding the genes with the most discriminative power, but differs from it in three ways. First, the CORG method collects CORGs only from the up- or downregulated subset of genes in a pathway, so some key genes can be discarded. In contrast, CMI considers both up- and downregulation together. Second, CMI improves the greedy search for the discriminative set of genes. Third, CMI considers only differentially expressed genes. As illustrated in Figure 1, given a pathway consisting of genes {g₁, ..., g_i, ..., g_n} ranked in descending order of their absolute t-scores, with normalized expression values {z(g₁), ..., z(g_n)}, determining a core module {g₁, ..., g_K} is equivalent to finding the K-th component such that

K = \arg\max_{j} \, t_{\mathrm{score}}(P_{j}),

(1)

where

P_{j} = \begin{cases} \dfrac{\sum_{i=1}^{j} z(g_{i}) \, \mathrm{sign}(t_{\mathrm{score}}(g_{i}))}{\sqrt{j}}, \quad 1 \leq j \leq \min(|g_{i} \in \mathrm{DEGs}|, 20), & |g_{i} \in \mathrm{DEGs}| > 0, \\ 0, & |g_{i} \in \mathrm{DEGs}| = 0. \end{cases}

(2)

Here g_i is the i-th DEG in descending order of absolute t-score, and P_j is the PA containing genes g₁ through g_j. |g_i ∈ DEGs| denotes the number of DEGs in the pathway. By default, the DEGs are the genes with p-value ≤ 0.05 in a two-tailed t-test. We limit the largest marker size to 20 DEGs. In practice, all marker sets have fewer than 20 components.
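The greedy search in Equations (1) and (2) can be sketched for a single pathway as follows. This is an illustrative Python sketch rather than the authors' Matlab implementation: the function names `t_score` and `core_module`, the dictionary-based inputs, and the use of a pooled-variance t-statistic are assumptions, and the DEG list is taken as given instead of being derived from a p-value threshold.

```python
import math
from statistics import mean, variance


def t_score(pos, neg):
    """Two-sample pooled-variance t-statistic between two groups of values."""
    n1, n2 = len(pos), len(neg)
    pooled = ((n1 - 1) * variance(pos) + (n2 - 1) * variance(neg)) / (n1 + n2 - 2)
    return (mean(pos) - mean(neg)) / math.sqrt(pooled * (1 / n1 + 1 / n2))


def core_module(z, labels, deg_names, max_size=20):
    """Greedy CMI sketch for one pathway (hypothetical helper).

    z         : dict gene -> list of normalized expression values, one per sample
    labels    : list of +1 / -1 class labels, one per sample
    deg_names : genes of this pathway already called differentially expressed
    Returns (selected genes, activity vector P_K, t-score of P_K).
    """
    def split(values):
        pos = [v for v, y in zip(values, labels) if y > 0]
        neg = [v for v, y in zip(values, labels) if y < 0]
        return pos, neg

    if not deg_names:  # second case of Eq. (2): no DEGs, so the activity is 0
        return [], [0.0] * len(labels), 0.0

    gene_t = {g: t_score(*split(z[g])) for g in deg_names}
    # Rank DEGs by absolute t-score and cap the candidate list at max_size.
    ranked = sorted(deg_names, key=lambda g: abs(gene_t[g]), reverse=True)[:max_size]

    running = [0.0] * len(labels)  # running sum of sign-aligned expression
    best = None
    for j, g in enumerate(ranked, start=1):
        sign = 1.0 if gene_t[g] >= 0 else -1.0
        running = [r + sign * v for r, v in zip(running, z[g])]
        p_j = [r / math.sqrt(j) for r in running]  # Eq. (2)
        t_j = t_score(*split(p_j))
        if best is None or t_j > best[2]:  # Eq. (1): arg max over j
            best = (ranked[:j], p_j, t_j)
    return best
```

Because the single top-ranked gene is always one of the candidates, the t-score of the returned activity vector is never lower than the largest absolute t-score of any individual DEG in the pathway.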

Reproducibility power

We consider an inference-validation pair of datasets to be reproducible if their pathway activities provide similar discriminative power. First, we rank the PAs inferred from the inference dataset in descending order by their t-scores. Then, we define reproducibility by

C_{\mathrm{score}}(N) = \frac{1}{N} \sum_{i=1}^{N} t_{\mathrm{score}}(P_{I}^{i}) \cdot t_{\mathrm{score}}(P_{V}^{i}),

(3)

where $P_{I}^{i}$ is the $i$-th PA in descending order in the inference dataset, and $P_{V}^{i}$ is its corresponding PA in the validation dataset. For the breast cancer datasets, the overall reproducibility is then given by the average $C_{\mathrm{score}}$ of the inferred pathways over all six inference-validation pairs.
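As a small worked illustration of Equation (3), the sketch below computes the reproducibility score from precomputed PA t-scores. The function names and the dictionary layout are hypothetical choices made for this example, and the t-scores are assumed to have been computed beforehand.

```python
def c_score(t_inference, t_validation, n):
    """Eq. (3): mean product of paired PA t-scores over the top-n inferred PAs.

    t_inference  : dict pathway -> t-score of its PA in the inference dataset
    t_validation : dict pathway -> t-score of the same PA in the validation dataset
    """
    if not 1 <= n <= len(t_inference):
        raise ValueError("n must be between 1 and the number of inferred PAs")
    # Rank PAs by their t-score in the inference dataset, best first.
    ranked = sorted(t_inference, key=t_inference.get, reverse=True)[:n]
    return sum(t_inference[p] * t_validation[p] for p in ranked) / n


def overall_reproducibility(pairs, n):
    """Average C-score over a list of (t_inference, t_validation) pairs."""
    scores = [c_score(ti, tv, n) for ti, tv in pairs]
    return sum(scores) / len(scores)
```

With three datasets, `pairs` would hold the six ordered inference-validation combinations. A PA whose t-score changes sign between the two datasets contributes a negative term, so it lowers the score.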

Six methods were compared in this work, including CMI, CORG [8], Mean [23], Median [23], PCA [24], and Individual Gene. LLR (log-likelihood ratio [25]) was not compared here, because it is not defined in the same gene expression space.

Consensus Feature Elimination (CFE)

In this work, gene expression and activity vectors are generalized as features for classification. Given a set of features {x₁, x₂,..., x_n} with class labels {y₁, y₂,..., y_n} ∈ {-1, +1}, the task of binary classification is to find a decision function

D(x) \begin{cases} > 0 \Rightarrow x \in \mathrm{class}(+), \\ < 0 \Rightarrow x \in \mathrm{class}(-), \\ = 0 \Rightarrow x \in \text{decision boundary}. \end{cases}

(4)

We choose a linear decision function, which can be described as a separating hyperplane:

D (x) = w \cdot x + b,

(5)

with w the weight vector and b the bias value.

Linear classifiers such as Linear Discriminant Analysis (LDA) [26] and linear Support Vector Machines (SVM) [27] use differing optimization criteria to estimate the weight vector. Intuitively, the weights indicate the importance of the associated features. Guyon et al. proposed Recursive Feature Elimination (RFE), which removes features recursively based on their weights [28]. However, classical RFE exhibits a lack of stability in feature selection [29]. In contrast to binary classification tasks that emphasize maximization of classification accuracy, biomarker identification requires features that are both accurate and reproducible across multiple experiments. Thus, we propose a Consensus Feature Elimination (CFE) approach to improve the stability of RFE. As illustrated in Figure 6, we first generate 100 alternative 5-fold random splits of the samples, upon which we construct 500 classifiers and record their AUCs (Areas Under the Receiver Operating Characteristic Curve) and weight vectors. Each feature is then ranked by its average squared weight $\bar{w} = \sum_{j=1}^{500} (w^{j})^{2} / 500$. The lowest-ranking feature is removed recursively until the maximum average AUC is achieved. This process, which has also been called Multiple RFE [30] or ensemble feature selection [31], is known to increase biomarker reproducibility and accuracy by as much as 30% and 15%, respectively. For the breast cancer datasets described in this work, we found the maximum AUC to be very stable, while the corresponding biomarker set was not always unique. Thus we chose to repeat the above procedure 100 times, selecting the most frequently occurring biomarkers as the final marker set.
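The elimination loop described above can be sketched in plain Python as follows. Several parts are assumptions made to keep the example self-contained: a diagonal LDA (per-feature mean difference divided by pooled variance) stands in for the full LDA or SVM classifiers, all features are assumed to be on comparable scales, the loop runs down to a single feature and returns the subset with the highest mean AUC, and the names `train_linear`, `auc`, and `cfe` are hypothetical. The bias term b is omitted because it does not change the ranking of samples and therefore does not affect the AUC.

```python
import random


def train_linear(X, y):
    """Diagonal-LDA stand-in: w_k = (mean+_k - mean-_k) / pooled variance_k."""
    pos = [x for x, t in zip(X, y) if t > 0]
    neg = [x for x, t in zip(X, y) if t < 0]
    w = []
    for k in range(len(X[0])):
        mp = sum(x[k] for x in pos) / len(pos)
        mn = sum(x[k] for x in neg) / len(neg)
        ss = sum((x[k] - mp) ** 2 for x in pos) + sum((x[k] - mn) ** 2 for x in neg)
        w.append((mp - mn) / (ss / max(len(X) - 2, 1) + 1e-9))
    return w


def auc(scores, y):
    """Area under the ROC curve via pairwise comparisons (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y) if t > 0]
    neg = [s for s, t in zip(scores, y) if t < 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cfe(X, y, n_repeats=100, n_folds=5, seed=0):
    """Return (feature indices, mean AUC) of the best subset found."""
    rng = random.Random(seed)
    active = list(range(len(X[0])))
    best = (list(active), -1.0)
    while active:
        sq_w = [0.0] * len(active)  # summed squared weights per active feature
        aucs = []
        for _ in range(n_repeats):
            order = list(range(len(X)))
            rng.shuffle(order)
            for f in range(n_folds):
                test = order[f::n_folds]
                test_set = set(test)
                train = [i for i in order if i not in test_set]
                if len({y[i] for i in test}) < 2 or len({y[i] for i in train}) < 2:
                    continue  # a class is missing, so the AUC is undefined
                w = train_linear([[X[i][k] for k in active] for i in train],
                                 [y[i] for i in train])
                scores = [sum(wk * X[i][k] for wk, k in zip(w, active)) for i in test]
                aucs.append(auc(scores, [y[i] for i in test]))
                sq_w = [s + wk * wk for s, wk in zip(sq_w, w)]
        if not aucs:
            raise ValueError("no usable folds: both classes are needed in each split")
        mean_auc = sum(aucs) / len(aucs)
        if mean_auc > best[1]:
            best = (list(active), mean_auc)
        # Ranking by the sum equals ranking by the average squared weight.
        worst = min(range(len(active)), key=lambda k: sq_w[k])
        active.pop(worst)
    return best
```

The final consensus step in the text, repeating the whole procedure 100 times and keeping the most frequently selected features, could be layered on top by calling `cfe` with different seeds and counting how often each feature index is returned.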

Seven methods were compared in this work, including CMI, CORG [8], Mean [23], Median [23], PCA [24], LLR [25], and Individual Gene.

Cancer gene enrichment analysis

The cancer gene enrichment analysis examines over-representation of known cancer genes in a gene signature. Given N genes in total, of which M are known cancer genes, and a signature of J genes, the probability of finding more than K cancer genes in the signature follows a hypergeometric distribution:

P(\# \text{ of cancer genes} > K) = 1 - \sum_{i=0}^{K} \frac{\binom{J}{i} \binom{N-J}{M-i}}{\binom{N}{M}}.

(6)
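Equation (6) can be evaluated exactly with integer binomial coefficients. The following is a minimal sketch using Python's standard library; the function name and argument names are hypothetical, and the numbers in the usage note are toy values, not results from the paper.

```python
from math import comb


def enrichment_p_value(n_total, n_cancer, n_signature, k):
    """P(# cancer genes in the signature > k) under the hypergeometric null.

    n_total     : N, all genes considered
    n_cancer    : M, known cancer genes among them
    n_signature : J, genes in the signature
    """
    denom = comb(n_total, n_cancer)
    # Terms with i above min(J, M) are zero, so the sum can stop there.
    upper = min(k, n_signature, n_cancer)
    cdf = sum(
        comb(n_signature, i) * comb(n_total - n_signature, n_cancer - i)
        for i in range(upper + 1)
    ) / denom
    return 1.0 - cdf
```

For example, with N = 10, M = 4, and J = 3, the probability of seeing more than zero cancer genes in the signature is 1 - 35/210, which is 5/6.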

Software

COMBINER was implemented in Matlab R2010a with Bioinformatics toolbox v3.5. The source code is available on http://www.ruotingyang.com.

References

van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530–536. 10.1038/415530a
Wang Y, Klijn JGM, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EMJJ, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365(9460):671–679.
Barabasi AL, Gulbahce N, Loscalzo J: Network medicine: a network-based approach to human disease. Nat Rev Genet 2011, 12(1):56–68. 10.1038/nrg2918
Beyer A, Bandyopadhyay S, Ideker T: Integrating physical and genetic maps: from genomes to interaction networks. Nat Rev Genet 2007, 8(9):699–710. 10.1038/nrg2144
Li J, Lenferink AEG, Deng Y, Collins C, Cui Q, Purisima EO, O'Connor-McCourt MD, Wang E: Identification of high-quality cancer prognostic markers and metastasis network modules. Nat Commun 2010, 1: 34.
Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 2005, 21(2):171–178. 10.1093/bioinformatics/bth469
Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Mol Syst Biol 2007, 3: 140.
Lee E, Chuang HY, Kim JW, Ideker T, Lee D: Inferring pathway activity toward precise disease classification. PLoS Comput Biol 2008, 4(11):e1000217. 10.1371/journal.pcbi.1000217
van de Vijver MJ, He YD, van 't Veer LJ, Dai H, Hart AAM, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002, 347(25):1999–2009. 10.1056/NEJMoa021967
Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, d'Assignies MS, Bergh J, Lidereau R, Ellis P, Harris AL, Klijn JGM, Foekens JA, Cardoso F, Piccart MJ, Buyse M, Sotiriou C: Strong time dependence of the 76-Gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 2007, 13(11):3207–3214. 10.1158/1078-0432.CCR-06-2765
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102(43):15545–15550. 10.1073/pnas.0506580102
Hanahan D, Weinberg R: The hallmarks of cancer. Cell 2000, 100: 57–70. 10.1016/S0092-8674(00)81683-9
Hanahan D, Weinberg Robert A: Hallmarks of cancer: the next generation. Cell 2011, 144(5):646–674. 10.1016/j.cell.2011.02.013
Vuaroqueaux V, Urban P, Labuhn M, Delorenzi M, Wirapati P, Benz C, Flury R, Dieterich H, Spyratos F, Eppenberger U, Eppenberger-Castori S: Low E2F1 transcript levels are a strong determinant of favorable breast cancer outcome. Breast Cancer Res 2007, 9(3):R33. 10.1186/bcr1681
Kandasamy K, Mohan SS, Raju R, Keerthikumar S, Kumar G, Venugopal A, Telikicherla D, Navarro JD, Mathivanan S, Pecquet C, Gollapudi S, Tattikota S, Mohan S, Padhukasahasram H, Subbannayya Y, Goel R, Jacob H, Zhong J, Sekhar R, Nanjappa V, Balakrishnan L, Subbaiah R, Ramachandra Y, Rahiman BA, Prasad TK, Lin JX, Houtman J, Desiderio S, Renauld JC, Constantinescu S: NetPath: a public resource of curated signal transduction pathways. Genome Biol 2010, 11(1):R3. 10.1186/gb-2010-11-1-r3
Huret JL, Minor SL, Dorkeld F, Dessen P, Bernheim A: Atlas of genetics and cytogenetics in oncology and haematology, an interactive database. Nucleic Acids Res 2000, 28(1):349–351. 10.1093/nar/28.1.349
Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004, 4(3):177–183. 10.1038/nrc1299
Sjöblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, Mandelker D, Leary RJ, Ptak J, Silliman N, Szabo S, Buckhaults P, Farrell C, Meeh P, Markowitz SD, Willis J, Dawson D, Willson JKV, Gazdar AF, Hartigan J, Wu L, Liu C, Parmigiani G, Park BH, Bachman KE, Papadopoulos N, Vogelstein B, Kinzler KW, Velculescu VE: The consensus coding sequences of human breast and colorectal cancers. Science 2006, 314(5797):268–274. 10.1126/science.1133427
Mosca E, Alfieri R, Merelli I, Viti F, Calabria A, Milanesi L: A multilevel data integration resource for breast cancer study. BMC Sys Biol 2010, 4(1):76. 10.1186/1752-0509-4-76
Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 2000, 28(1):27–30. 10.1093/nar/28.1.27
Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP: Molecular signatures database (MSigDB) 3.0. Bioinformatics 2011, 27(12):1739–1740. 10.1093/bioinformatics/btr260
Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, Mering Cv: The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res 2011, 39(suppl 1):D561-D568.
Guo Z, Zhang T, Li X, Wang Q, Xu J, Yu H, Zhu J, Wang H, Wang C, Topol E, Wang Q, Rao S: Towards precise classification of cancers based on robust gene functional expression profiles. BMC Bioinformatics 2005, 6(1):58. 10.1186/1471-2105-6-58
Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D, Joshi MB, Harpole D, Lancaster JM, Berchuck A, Olson JA, Marks JR, Dressman HK, West M, Nevins JR: Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 2006, 439(7074):353–357. 10.1038/nature04296
Su J, Yoon BJ, Dougherty ER: Accurate and reliable cancer classification based on probabilistic inference of pathway activity. PLoS ONE 2009, 4(12):e8161. 10.1371/journal.pone.0008161
Friedman JH: Regularized discriminant analysis. J AM STAT ASSOC 1989, 84(405):165–175. 10.2307/2289860
Vapnik V: Statistical Learning Theory. Wiley-Interscience; 1998.
Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Mach Learn 2002, 46(1):389–422. 10.1023/A:1012487302797
Davis CA, Gerick F, Hintermair V, Friedel CC, Fundel K, Küffner R, Zimmer R: Reliable gene signatures for microarray classification: assessment of stability and performance. Bioinformatics 2006, 22(19):2356–2363. 10.1093/bioinformatics/btl400
Duan KB, Rajapakse JC, Wang H, Azuaje F: Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans NanoBiosci 2005, 4(3):228–234. 10.1109/TNB.2005.853657
Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y: Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics 2010, 26(3):392–398. 10.1093/bioinformatics/btp630
MacDonald TJ, Brown KM, LaFleur B, Peterson K, Lawlor C, Chen Y, Packer RJ, Cogen P, Stephan DA: Expression profiling of medulloblastoma: PDGFRA and the RAS/MAPK pathway as therapeutic targets for metastatic disease. Nat Genet 2001, 29(2):143–152. 10.1038/ng731
Giubellino A, Burke TR, Bottaro DP: Grb2 signaling in cell motility and cancer. Expert Opin on Ther Tar 2008, 12(8):1021–1033. 10.1517/14728222.12.8.1021
Van Laere SJ, Van der Auwera I, Van den Eynden GG, Elst HJ, Weyler J, Harris AL, van Dam P, Van Marck EA, Vermeulen PB, Dirix LY: Nuclear Factor-κB Signature of Inflammatory Breast Cancer by cDNA Microarray Validated by Quantitative Real-time Reverse Transcription-PCR, Immunohistochemistry, and Nuclear Factor-κB DNA-Binding. Clin Cancer Res 2006, 12(11):3249–3256. 10.1158/1078-0432.CCR-05-2800
Hamann U, Herbold C, Costa S, Solomayer EF, Kaufmann M, Bastert G, Ulmer HU, Frenzel H, Komitowski D: Allelic Imbalance on Chromosome 13q: Evidence for the Involvement of BRCA2 and RB1 in Sporadic Breast Cancer. Cancer Res 1996, 56(9):1988–1990.
Rakha EA, Reis-Filho JS, Ellis IO: Basal-Like Breast Cancer: A Critical Review. J Clin Oncol 2008, 26(15):2568–2581. 10.1200/JCO.2007.13.1748
Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, Van de Vijver MJ, Bergh J, Piccart M, Delorenzi M: Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. J Natl Cancer Inst 2006, 98(4):262–272. 10.1093/jnci/djj052
Smid M, Wang Y, Klijn JGM, Sieuwerts AM, Zhang Y, Atkins D, Martens JWM, Foekens JA: Genes Associated With Breast Cancer Metastatic to Bone. J Clin Oncol 2006, 24(15):2261–2267. 10.1200/JCO.2005.03.8802
Campbell IG, Russell SE, Choong DYH, Montgomery KG, Ciavarella ML, Hooi CSF, Cristiano BE, Pearson RB, Phillips WA: Mutation of the PIK3CA Gene in Ovarian and Breast Cancer. Cancer Res 2004, 64(21):7678–7681. 10.1158/0008-5472.CAN-04-2933
Woelfle U, Cloos J, Sauter G, Riethdorf L, Jänicke F, van Diest P, Brakenhoff R, Pantel K: Molecular Signature Associated with Bone Marrow Micrometastasis in Human Breast Cancer. Cancer Res 2003, 63(18):5679–5684.
Ursini-Siegel J, Hardy WR, Zuo D, Lam SHL, Sanguin-Gendreau V, Cardiff RD, Pawson T, Muller WJ: ShcA signalling is essential for tumour progression in mouse models of human breast cancer. EMBO J 2008, 27(6):910–920. 10.1038/emboj.2008.22
Wolfer A, Wittner BS, Irimia D, Flavin RJ, Lupien M, Gunawardane RN, Meyer CA, Lightcap ES, Tamayo P, Mesirov JP, Liu XS, Shioda T, Toner M, Loda M, Brown M, Brugge JS, Ramaswamy S: MYC regulation of a "poor-prognosis" metastatic cancer cell state. Proc Natl Acad Sci USA 2010, 107(8):3698–3703. 10.1073/pnas.0914203107

Acknowledgements

We gratefully acknowledge financial support from the U.S. Army Research Office (PTSD Grant W911NF-10-2-0111).

Author information

Authors and Affiliations

Institute for Collaborative Biotechnologies, University of California Santa Barbara, Santa Barbara, CA, 93106-5080, USA
Ruoting Yang, Linda R Petzold & Francis J Doyle III
Department of Computer Science, University of California Santa Barbara, Santa Barbara, CA, 93106-5110, USA
Bernie J Daigle Jr & Linda R Petzold
Department of Mechanical Engineering, University of California Santa Barbara, Santa Barbara, CA, 93106-5070, USA
Linda R Petzold
Department of Chemical Engineering, University of California Santa Barbara, Santa Barbara, CA, 93106-5080, USA
Francis J Doyle III

Corresponding author

Correspondence to Francis J Doyle III.

Additional information

Authors' contributions

RY, BJD, LRP, and FJD conceived and designed the research. RY and BJD performed the analysis and the statistical computations, and wrote the paper. RY implemented the programs. All authors read and approved the final manuscript.

Ruoting Yang and Bernie J Daigle Jr contributed equally to this work.

Electronic supplementary material

Additional file 1: Figure S1: Comparison of CMI and other pathway inference methods using SVM-CFE classifiers subject to top 100 inferred pathways. (TIFF 434 KB)

Additional file 2: Table S1: List of core module genes identified by CMI and CORG. (XLSX 20 KB)

Additional file 3: Table S2: Pathway markers identified by all methods. (XLSX 28 KB)

Additional file 4: Table S3: List of core module genes overlaid in KEGG pathway of cancers. (XLSX 14 KB)

Additional file 5: Figure S2: Unique core module of cancer pathway identified by CORG-COMBINER method. (TIFF 712 KB)

Rights and permissions

Open Access: This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Yang, R., Daigle, B.J., Petzold, L.R. et al. Core module biomarker identification with network exploration for breast cancer metastasis. BMC Bioinformatics 13, 12 (2012). https://doi.org/10.1186/1471-2105-13-12

Received: 16 September 2011
Accepted: 18 January 2012
Published: 18 January 2012
DOI: https://doi.org/10.1186/1471-2105-13-12
