Transcriptional network classifiers
© Chang and Ramoni; licensee BioMed Central Ltd. 2009
Published: 17 September 2009
Gene interactions play a central role in transcriptional networks. Many studies have performed genome-wide expression analysis to reconstruct regulatory networks and investigate disease processes. Since biological processes are outcomes of regulatory gene interactions, this paper develops a systems biology approach to infer function-dependent transcriptional networks modulating phenotypic traits; these networks serve as classifiers to identify tissue states. Because the analysis takes gene interactions into account, we achieve higher classification accuracy than existing methods.
Our systems biology approach is implemented in the Bayesian network framework. The algorithm consists of two steps: gene filtering by Bayes factor, followed by collinearity elimination via network learning. We validate our approach with two clinical data sets. In the study of lung cancer subtype discrimination, we obtain a 25-gene classifier from 111 training samples, and the test on 422 independent samples achieves 95% classification accuracy. In the study of thoracic aortic aneurysm (TAA) diagnosis, 61 samples determine a 34-gene classifier, which achieves 82% diagnostic accuracy on 33 independent samples. Performance comparisons with three other popular methods, PCA/LDA, PAM, and Weighted Voting, confirm that our approach yields superior classification accuracy and a more compact signature.
The systems biology approach presented in this paper infers function-dependent transcriptional networks, which in turn classify biological samples with high accuracy. The validation of our classifier on clinical data demonstrates the promise of the proposed approach for disease diagnosis.
Genome-wide expression analysis has revolutionized disease diagnostic models through the identification of molecular signatures, which are selected from highly ranked genes determined by statistical measures such as fold change, t statistic, signal-to-noise ratio, or subnetwork scores. Over the last decade, systems biology researchers have also exploited the comprehensive transcriptional landscape offered by microarrays to identify the transcriptional networks that unravel regulatory gene interactions and explain how diseases progress [6–8]. Although these two analysis approaches seem antithetic, they can be unified to create transcriptional network classifiers that enhance the accuracy of disease diagnosis. We can regard the transcriptional networks underpinning disease development as perturbed by the presence of disease, so the phenotype is treated as a binary perturbation of the overall transcriptional network. To construct the classifier, our task is then to infer from expression profiles the function-dependent transcriptional network that modulates phenotypic traits.
Both the graphical structure of a Bayesian network and the parameters of its conditional probabilities can be learned from the available data. Nevertheless, learning a network is computationally intensive because ideally the dependency relations between all pairs of variables must be evaluated. We reduce this computational burden with a two-stage learning process. Our algorithm begins by using the Bayes factor to select the genes that are functionally dependent on the phenotype, since only function-dependent genes have the potential to play a role in tissue discrimination. Then, we explore the detailed dependencies between the selected genes to reconstruct a transcriptional network. After the transcriptional network is learned, it can be exploited for tissue classification, again formulated in the Bayesian network framework. In the learned network, the phenotype's Markov blanket is the set of nodes composed of the phenotype's parents, its children, and its children's parents. Given the genes in the Markov blanket, the phenotype is independent of the genes outside the Markov blanket. Hence, only the genes in the Markov blanket contribute to phenotype classification, and they constitute a signature. With reference to Figure 1, genes 1, 2, and 3 are those in the phenotype's Markov blanket and constitute a signature for tissue classification.
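The Markov blanket definition above (parents, children, and children's parents) can be illustrated with a minimal sketch. The function name `markov_blanket`, the parent-set representation, and the toy network are assumptions made for illustration only; the toy network is hypothetical and is not the topology of Figure 1.

```python
def markov_blanket(parents, target):
    """Return the Markov blanket of `target` in a directed acyclic graph.

    `parents` maps every node to the set of its parent nodes.
    The blanket is: parents of target, children of target,
    and the other parents of those children.
    """
    blanket = set(parents.get(target, set()))
    # Children are the nodes that list the target among their parents.
    children = {node for node, ps in parents.items() if target in ps}
    blanket |= children
    # Add the co-parents of each child.
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(target)
    return blanket


# Hypothetical toy network: the phenotype is a parent of gene1 and gene2,
# gene3 is a co-parent of gene2, and gene4 depends only on gene2.
toy_network = {
    "phenotype": set(),
    "gene1": {"phenotype"},
    "gene2": {"phenotype", "gene3"},
    "gene3": set(),
    "gene4": {"gene2"},
    "gene5": set(),
}

signature = markov_blanket(toy_network, "phenotype")
print(sorted(signature))
```

In this toy network, gene4 and gene5 fall outside the blanket, so they would not be needed once the values of gene1, gene2, and gene3 are known.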
We validate our approach with two clinical studies: discrimination of lung cancer subtypes and diagnosis of thoracic aortic aneurysm.
Discrimination of lung cancer subtypes
The signature of 25 genes for characterizing lung cancer subtypes. Enrichment shows that there are 23 unique genes in the signature.
ATP-binding cassette, sub-family C (CFTR/MRP), member 3
bicaudal D homolog 2 (Drosophila)
Pyrimidine metabolism, Drug metabolism
Cell adhesion molecules, Tight junction, Leukocyte transendothelial migration
homogentisate 1,2-dioxygenase (homogentisate oxidase)
Tyrosine metabolism, Styrene degradation
inositol 1,4,5-trisphosphate 3-kinase A
Inositol phosphate metabolism, Calcium signaling pathway, Phosphatidylinositol signaling system
keratin 14 (epidermolysis bullosa simplex, Dowling-Meara, Koebner)
KRT6A, KRT6B, KRT6C
keratin 6A, keratin 6B, keratin 6C
mucin 3B, cell surface associated
mucin 5B, oligomeric mucus/gel-forming
nicotinamide nucleotide adenylyltransferase 2
Nicotinate and nicotinamide metabolism
neurotrophic tyrosine kinase, receptor, type 2
MAPK signaling pathway
Rh family, C glycoprotein
serpin peptidase inhibitor, clade B (ovalbumin), member 13
SRY (sex determining region Y)-box 2
serine peptidase inhibitor, Kazal type 1
small proline-rich protein 1A
tight junction protein 3 (zona occludens 3)
TOX high mobility group box family member 3
Ten-fold cross-validation achieves 98.5% accuracy. We further test the classification accuracy of the network on seven independent study populations with Gene Expression Omnibus accession numbers GSE10072, GSE7670, GSE12667, GSE4824, GSE2109, GSE4573, and GSE6253, for a total of 422 samples, 232 adenocarcinoma (AC) and 190 squamous cell carcinoma (SCC), from subjects of Caucasian, Asian and African descent representing 84.6%, 6.9%, and 2.8% of the data, respectively. On these independent samples, our transcriptional network classifier achieves an accuracy of 95.2%.
The 25-gene signature identified by the classifier discriminates AC from SCC with high accuracy. Furthermore, most of these genes have previously been reported as specific to lung cancer. ABCC3, CLDN3, DPP4, MUC3B, MUC5B, NTRK2, SPINK1, and TJP3 are specific markers of lung AC [23–29]. KRT6A, KRT6B, KRT6C, KRT17, RHCG, SPRR1A, and VSNL1 are unique to lung SCC [30–33]. BICD2, CDA, NMNAT2, SERPINB13, and TOX3 are not specific to either AC or SCC but have been associated with lung cancer in general [34–38].
Diagnosis of thoracic aortic aneurysm
The signature of 34 genes for diagnosing TAA.
ATP-binding cassette, sub-family G (WHITE), member 4
aryl-hydrocarbon receptor nuclear translocator 2
chromosome 17 open reading frame 63
calcium binding protein 2
cleavage stimulation factor, 3' pre-RNA, subunit 2, 64kDa
defensin, beta 1
deoxynucleotidyltransferase, terminal, interacting protein 1
Fas associated factor family member 2
fibrinogen gamma chain
insulin-like growth factor 2 mRNA binding protein 1
IWS1 homolog (S. cerevisiae)
keratin associated protein 17-1
keratin associated protein 23-1
mal, T-cell differentiation protein 2
matrix metallopeptidase 11 (stromelysin 3)
RNA binding motif protein 16
transmembrane 4 L six family member 1
zinc finger and BTB domain containing 4
zinc finger and BTB domain containing 9
zinc finger protein 394
Comparisons with other methods
Principal Component Analysis with Linear Discriminant Analysis (PCA/LDA): The PCA/LDA method first reduces the expression data to a small number of principal components and then searches for a discriminative linear function of these components to separate tissues.
Prediction Analysis for Microarray (PAM): PAM uses signal-to-noise ratios to select a signature and uses the ratios to determine the tissue types of test samples.
Weighted Voting: This method ranks genes by the fold change of the means of the expression values. The classification is determined by how close the test sample is to the highly ranked genes.
Performance comparisons with other methods on the lung cancer data.
Performance comparisons with other methods on the TAA data.
The clinical applications confirm the improved accuracy of our proposed systems biology approach. A literature survey of the functions of the signature genes also supports the capability of our approach to extract biologically reasonable signatures. Furthermore, the large-scale independent test on seven cohorts in the lung cancer study shows the robustness of our classifier across platforms and populations. The two studies also demonstrate that our method can analyze data assayed on microarrays from different manufacturers.
Unlike existing methods that require the operator to specify a cutoff on a statistical measure to select highly ranked genes, our method is threshold-free for signature selection, because the signature genes are determined once the transcriptional network is modelled. For phenotype classification, only the part of the network composed of the signature genes needs to be retained; the remaining network can be discarded, which saves storage in clinical use. Another feature of our transcriptional network classifier is that it visualizes the molecular dependence network, which may give biologists clues for investigating gene causality.
A recent work proposes using prior knowledge of known pathways to select gene subnetworks as features for tissue classification. However, this method discards a major portion of the data, because the functional pathways of a large number of genes have not yet been discovered. Unlike this method, our approach uses the entire data set to screen the function-dependent genes and to reconstruct the network.
This paper uses a systems biology approach to develop transcriptional network classifiers. The classifier can be thought of as a gene network perturbed by the presence of the phenotypic traits. We adopt the Bayesian network framework to model the classifier. The algorithm uses the Bayes factor for gene filtering, followed by collinearity elimination via network learning. The clinical applications of our approach to lung cancer subtype classification and TAA diagnosis demonstrate the high classification accuracy of the network-based classifiers. The biological validation of the signatures further supports the ability of the transcriptional network classifier to extract meaningful signatures.
Let Y_1, Y_2, ..., Y_N be Gaussian random variables representing the expression levels of genes 1, ..., N, and let C be a multinomial random variable indicating tissue conditions. We use uppercase to denote random variables and lowercase to denote their values. Our algorithm first uses the Bayes factor to filter function-dependent genes and then exploits Bayesian network learning to eliminate collinearity among the selected genes.
Gene filtering by Bayes factor
The genes that are functionally dependent on the phenotype are selected first. The selection is carried out with the Bayes factor, which evaluates for each gene the ratio of its likelihood of being dependent on the phenotype to its likelihood of being independent of the phenotype. When the Bayes factor is greater than one, the gene is selected because it is more likely to be dependent on the phenotype than independent of it.
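The filtering rule can be sketched as follows. The exact marginal likelihoods used by the authors are not given in this text, so this sketch substitutes the BIC (Schwarz) approximation to the log Bayes factor, with a class-specific Gaussian as the dependent model and a single Gaussian as the independent model. The function names, the toy expression values, and the choice of approximation are all assumptions for illustration, not the paper's computation.

```python
import math


def gaussian_max_loglik(values):
    """Maximized Gaussian log-likelihood of a sample (MLE mean and variance)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    var = max(var, 1e-12)  # guard against zero variance
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def log_bayes_factor_bic(expr, labels):
    """Approximate log Bayes factor (dependent vs. independent of phenotype).

    Assumption: BIC approximation, log p(data | model) ~ -BIC / 2.
    """
    n = len(expr)
    # Independent model: one mean and one variance (2 parameters).
    bic_indep = -2 * gaussian_max_loglik(expr) + 2 * math.log(n)
    # Dependent model: one mean and one variance per class.
    classes = sorted(set(labels))
    loglik_dep = 0.0
    for c in classes:
        group = [x for x, lab in zip(expr, labels) if lab == c]
        loglik_dep += gaussian_max_loglik(group)
    bic_dep = -2 * loglik_dep + 2 * len(classes) * math.log(n)
    return 0.5 * (bic_indep - bic_dep)


def is_function_dependent(expr, labels):
    """Select the gene when the Bayes factor exceeds one (log BF > 0)."""
    return log_bayes_factor_bic(expr, labels) > 0


# Toy data (hypothetical): five samples per tissue class.
labels = [0] * 5 + [1] * 5
shifted_gene = [1.0, 1.1, 0.9, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8]
flat_gene = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.9, 1.2, 0.8]

print(is_function_dependent(shifted_gene, labels))
print(is_function_dependent(flat_gene, labels))
```

The BIC penalty on the extra parameters of the dependent model is what lets the factor fall below one for a gene whose expression does not change between classes.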
Collinearity elimination via network learning
Owing to limited space, we do not present the detailed computation in this paper; it can be derived from . Finally, the determination of the best Bayesian network model is .
where H denotes the set of genes that are the children of the phenotype C in the network and that constitute a signature. Equivalently, the set H corresponds to the genes in the phenotype's Markov blanket.
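To make the classification step concrete, the sketch below computes the posterior probability of each tissue class from the signature genes. It assumes a deliberately simplified network in which the phenotype C is the only parent of every gene in H, so each gene is Gaussian given the class; any gene-to-gene edges that the learned network may contain are ignored. The function names and the toy training data are hypothetical.

```python
import math


def fit_class_gaussians(samples, labels):
    """Estimate the class prior and per-gene (mean, variance) for each class."""
    model = {}
    n = len(labels)
    for c in set(labels):
        rows = [s for s, lab in zip(samples, labels) if lab == c]
        params = []
        for g in range(len(rows[0])):
            column = [r[g] for r in rows]
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            params.append((mean, max(var, 1e-6)))  # variance floor
        model[c] = (len(rows) / n, params)
    return model


def posterior(model, sample):
    """Return P(class | signature gene values) for every class."""
    log_joint = {}
    for c, (prior, params) in model.items():
        lp = math.log(prior)
        for x, (mean, var) in zip(sample, params):
            lp += -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        log_joint[c] = lp
    # Normalize in log space for numerical stability.
    top = max(log_joint.values())
    total = sum(math.exp(v - top) for v in log_joint.values())
    return {c: math.exp(v - top) / total for c, v in log_joint.items()}


# Toy training set (hypothetical): two signature genes, three samples per class.
train = [[1.0, 5.0], [1.2, 4.8], [0.8, 5.2],
         [4.0, 1.0], [4.2, 0.8], [3.8, 1.2]]
train_labels = ["AC", "AC", "AC", "SCC", "SCC", "SCC"]

model = fit_class_gaussians(train, train_labels)
probs = posterior(model, [1.1, 5.0])
predicted = max(probs, key=probs.get)
print(predicted)
```

A network with edges between signature genes would replace each per-gene Gaussian with a conditional Gaussian given that gene's parents; the normalization step would stay the same.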
This research is supported in part by NIH/NHGRI (R01HG003354).
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 9, 2009: Proceedings of the 2009 AMIA Summit on Translational Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S9.
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286(5439):531–537. 10.1126/science.286.5439.531
- Chen Y, Dougherty ER, Bittner ML: Ratio-based decisions and the quantitative analysis of cDNA microarray images. J Biomedical Optics 1997, 2(4):364–374. 10.1117/12.281504
- Reich M, Ohm K, Angelo M, Tamayo P, Mesirov JP: GeneCluster 2.0: an advanced toolset for bioarray analysis. Bioinformatics 2004, 20(11):1797–1798. 10.1093/bioinformatics/bth138
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001, 98(9):5116–5121. 10.1073/pnas.091062498
- Lee E, Chuang HY, Kim JW, Ideker T, Lee D: Inferring pathway activity toward precise disease classification. PLoS Comput Biol 2008, 4(11):e1000217. 10.1371/journal.pcbi.1000217
- Huang E, Ishida S, Pittman J, Dressman H, Bild A, Kloos M, D'Amico M, Pestell RG, West M, Nevins JR: Gene expression phenotypic models that predict the activity of oncogenic pathways. Nature genetics 2003, 34(2):226–230. 10.1038/ng1167
- Lamb J, Ramaswamy S, Ford HL, Contreras B, Martinez RV, Kittrell FS, Zahnow CA, Patterson N, Golub TR, Ewen ME: A mechanism of cyclin D1 action encoded in the patterns of gene expression in human cancer. Cell 2003, 114(3):323–334. 10.1016/S0092-8674(03)00570-1
- Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proceedings of the National Academy of Sciences of the United States of America 2004, 101(25):9309–9314. 10.1073/pnas.0401994101
- Abdollahi A, Schwager C, Kleeff J, Esposito I, Domhan S, Peschke P, Hauser K, Hahnfeldt P, Hlatky L, Debus J, et al.: Transcriptional network governing the angiogenic switch in human pancreatic cancer. Proceedings of the National Academy of Sciences of the United States of America 2007, 104(31):12890–12895. 10.1073/pnas.0705505104
- Friedman N: Inferring cellular networks using probabilistic graphical models. Science 2004, 303:799–805. 10.1126/science.1094068
- Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 2003, 302:449–453. 10.1126/science.1087361
- Sebastiani P, Ramoni MF, Nolan V, Baldwin CT, Steinberg MH: Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia. Nature genetics 2005, 37(4):435–440. 10.1038/ng1533
- Lauritzen SL, Sheehan NA: Graphical models for genetic analysis. Statist Sci 2004, 18(4):489–514.
- Kato H, Ichinose Y, Ohta M, Hata E, Tsubota N, Tada H, Watanabe Y, Wada H, Tsuboi M, Hamajima N: A randomized trial of adjuvant chemotherapy with uracil-tegafur for adenocarcinoma of the lung. N Engl J Med 2004, 350(17):1713–1721. 10.1056/NEJMoa032792
- Thomas P, Khokha R, Shepherd FA, Feld R, Tsao MS: Differential expression of matrix metalloproteinases and their inhibitors in non-small cell lung cancer. J Pathol 2000, 190(2):150–156. 10.1002/(SICI)1096-9896(200002)190:2<150::AID-PATH510>3.0.CO;2-W
- Yu CJ, Shih JY, Lee YC, Shun CT, Yuan A, Yang PC: Sialyl Lewis antigens: association with MUC5AC protein and correlation with post-operative recurrence of non-small cell lung cancer. Lung Cancer 2005, 47(1):59–67. 10.1016/j.lungcan.2004.05.018
- Nesbitt JC, Putnam JB Jr., Walsh GL, Roth JA, Mountain CF: Survival in early-stage non-small cell lung cancer. Ann Thorac Surg 1995, 60(2):466–472. 10.1016/0003-4975(95)00169-L
- Okamoto T, Maruyama R, Suemitsu R, Aoki Y, Wataya H, Kojo M, Ichinose Y: Prognostic value of the histological subtype in completely resected non-small cell lung cancer. Interact Cardiovasc Thorac Surg 2006, 5(4):362–366. 10.1510/icvts.2005.125989
- Jamieson LA, Carey FA: Pathology of lung tumours. SURGERY 2005, 23(11):389–393.
- Wistuba II, Gazdar AF: Lung cancer preneoplasia. Annu Rev Pathol 2006, 1:331–348. 10.1146/annurev.pathol.1.110304.100103
- Nonami Y, Ohtuki Y, Sasaguri S: Study of the diagnostic difference between the clinical diagnostic criteria and results of immunohistochemical staining of multiple primary lung cancers. J Cardiovasc Surg (Torino) 2003, 44(5):661–665.
- Bild A, Yao G, Chang J, Wang Q, Potti A, Chasse D, Joshi M, Harpole D, Lancaster J, Berchuck A, et al.: Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 2006, 439(7074):353–357. 10.1038/nature04296
- Hanada S, Maeshima A, Matsuno Y, Ohta T, Ohki M, Yoshida T, Hayashi Y, Yoshizawa Y, Hirohashi S, Sakamoto M: Expression profile of early lung adenocarcinoma: identification of MRP3 as a molecular marker for early progression. J Pathol 2008, 216(1):75–82. 10.1002/path.2383
- Kuner R, Muley T, Meister M, Ruschhaupt M, Buness A, Xu EC, Schnabel P, Warth A, Poustka A, Sultmann H, et al.: Global gene expression analysis reveals specific patterns of cell junctions in non-small cell lung cancer subtypes. Lung Cancer 2009, 63(1):32–38. 10.1016/j.lungcan.2008.03.033
- Wesley UV, Tiwari S, Houghton AN: Role for dipeptidyl peptidase IV in tumor suppression of human non small cell lung carcinoma cells. Int J Cancer 2004, 109(6):855–866. 10.1002/ijc.20091
- Nguyen PL, Niehans GA, Cherwitz DL, Kim YS, Ho SB: Membrane-bound (MUC1) and secretory (MUC2, MUC3, and MUC4) mucin gene expression in human lung cancer. Tumour Biol 1996, 17(3):176–192. 10.1159/000217980
- Copin M, Buisine M, Leteurtre E, Marquette C, Porte H, Aubert J, Gosselin B, Porchet N: Mucinous bronchioloalveolar carcinomas display a specific pattern of mucin gene expression among primary lung adenocarcinomas. Hum Pathol 2001, 32(3):274–281. 10.1053/hupa.2001.22752
- Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB, et al.: Somatic mutations affect key pathways in lung adenocarcinoma. Nature 2008, 455(7216):1069–1075. 10.1038/nature07423
- Borczuk A, Kim H, Yegen H, Friedman R, Powell C: Lung adenocarcinoma global profiling identifies type II transforming growth factor-beta receptor as a repressor of invasiveness. Am J Respir Crit Care Med 2005, 172(6):729–737. 10.1164/rccm.200504-615OC
- Hawthorn L, Stein L, Panzarella J, Loewen G, Baumann H: Characterization of cell-type specific profiles in tissues and isolated cells from squamous cell carcinomas of the lung. Lung Cancer 2006, 53(2):129–142. 10.1016/j.lungcan.2006.04.015
- Fujii T, Dracheva T, Player A, Chacko S, Clifford R, Strausberg R, Buetow K, Azumi N, Travis W, Jen J: A preliminary transcriptome map of non-small cell lung cancer. Cancer Res 2002, 62(12):3340–3346.
- Chen BS, Xu ZX, Xu X, Cai Y, Han YL, Wang J, Xia SH, Hu H, Wei F, Wu M, et al.: RhCG is downregulated in oesophageal squamous cell carcinomas, but expressed in multiple squamous epithelia. Eur J Cancer 2002, 38(14):1927–1936. 10.1016/S0959-8049(02)00190-9
- Fu J, Fong K, Bellacosa A, Ross E, Apostolou S, Bassi DE, Jin F, Zhang J, Cairns P, Ibanez de Caceres I, et al.: VILIP-1 downregulation in non-small cell lung carcinomas: mechanisms and prediction of survival. PLoS ONE 2008, 3(2):e1698. 10.1371/journal.pone.0001698
- Guha U, Chaerkady R, Marimuthu A, Patterson AS, Kashyap MK, Harsha HC, Sato M, Bader JS, Lash AE, Minna JD, et al.: Comparisons of tyrosine phosphorylated proteins in cells expressing lung cancer-specific alleles of EGFR and KRAS. Proc Natl Acad Sci U S A 2008, 105(37):14112–14117. 10.1073/pnas.0806158105
- Tibaldi C, Giovannetti E, Vasile E, Mey V, Laan AC, Nannizzi S, Di Marsico R, Antonuzzo A, Orlandini C, Ricciardi S, et al.: Correlation of CDA, ERCC1, and XPD polymorphisms with response and survival in gemcitabine/cisplatin-treated advanced non-small cell lung cancer patients. Clin Cancer Res 2008, 14(6):1797–1803. 10.1158/1078-0432.CCR-07-1364
- Chari R, Lonergan KM, Ng RT, MacAulay C, Lam WL, Lam S: Effect of active smoking on the human bronchial epithelium transcriptome. BMC Genomics 2007, 8:297. 10.1186/1471-2164-8-297
- Heighway J, Knapp T, Boyce L, Brennand S, Field JK, Betticher DC, Ratschiller D, Gugger M, Donovan M, Lasek A, et al.: Expression profiling of primary non-small cell lung cancer for target identification. Oncogene 2002, 21(50):7749–7763. 10.1038/sj.onc.1205979
- Hu Z, Chen J, Tian T, Zhou X, Gu H, Xu L, Zeng Y, Miao R, Jin G, Ma H, et al.: Genetic variants of miRNA sequences and non-small cell lung cancer survival. J Clin Invest 2008, 118(7):2600–2608.
- Wang Y, Barbacioru CC, Shiffman D, Balasubramanian S, Iakoubova O, Tranquilli M, Albornoz G, Blake J, Mehmet NN, Ngadimo D, et al.: Gene expression signature in peripheral blood detects thoracic aortic aneurysm. PLoS ONE 2007, 2(10):e1050. 10.1371/journal.pone.0001050
- Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A 2002, 99(10):6567–6572. 10.1073/pnas.082099299
- Ferrazzi F, Sebastiani P, Ramoni MF, Bellazzi R: Bayesian approaches to reverse engineer cellular systems: a simulation study on nonlinear Gaussian networks. BMC Bioinformatics 2007, 8(Suppl 5):S2. 10.1186/1471-2105-8-S5-S2
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.