Gene expression based prototype for automatic tumor prediction


Automatic detection of tumors is a challenging task because of the heterogeneous phenotypic and genotypic behavior of cells within tumor types [1–3]. In recent years, a number of studies have exploited microarray gene expression data to predict tissue and tumor types with high confidence [3–14]. However, in predicting tissue types, these works explicitly considered neither the correlation among genes nor the probable subgroups within the known groups. In this work, our primary objective is to develop an automated prediction scheme for tumors based on DNA microarray gene expression measurements of tissue samples.

Material and methods

The workflow to build the tumor prototypes is shown in Fig. 1. To account for the various sources of variation in array measurements, we first estimate tumor-specific gene expression measures using a two-way ANOVA model. Marker genes are then identified using the Wilcoxon [15] and Kruskal-Wallis [16] tests. Next, we group the highly correlated marker genes together and obtain eigen-gene expression measures [10] from each gene group. At the end of this step, the original gene expression measurements are replaced with eigen-gene expression values that preserve the correlations among the strongly correlated genes. We then divide the tissue samples of each known tumor type into subgroups. The CS measure [17] is used to obtain the optimal number of gene groups and of tissue subgroups within each tissue type. The centroids of these tissue subgroups represent the prototype of the corresponding tumor type. Finally, any new tissue sample is assigned the tumor type of its closest centroid.
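As a rough illustration of the gene grouping and eigen-gene step, the sketch below greedily groups genes whose Pearson correlation with a group's first gene reaches a threshold, then summarizes each group by its first principal component, computed by power iteration. The function names, the greedy grouping rule, the 0.9 threshold, and the toy data are assumptions made for illustration only. The eigen-gene definition used in this work follows [10] and may differ in detail, and the number of gene groups is chosen with the CS measure [17], which this sketch does not implement.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def group_correlated(genes, threshold=0.9):
    """Greedy grouping: a gene joins the first group whose first gene it correlates with."""
    groups = []
    for idx, profile in enumerate(genes):
        for group in groups:
            if pearson(genes[group[0]], profile) >= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups

def eigen_gene(profiles, iterations=200):
    """Scores of the first principal component of a gene group, one value per sample."""
    centered = []
    for p in profiles:
        mean = sum(p) / len(p)
        centered.append([v - mean for v in p])
    g = len(centered)
    # Gene-by-gene scatter matrix of the centered profiles.
    cov = [[sum(a * b for a, b in zip(centered[i], centered[j])) for j in range(g)]
           for i in range(g)]
    w = [1.0] * g  # starting vector for power iteration
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(g)) for i in range(g)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    n_samples = len(centered[0])
    return [sum(w[i] * centered[i][s] for i in range(g)) for s in range(n_samples)]

# Toy data: rows are genes, columns are tissue samples (hypothetical values).
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.1, 3.9, 6.2, 8.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 5.0],
]
groups = group_correlated(genes)
eigen_genes = [eigen_gene([genes[i] for i in group]) for group in groups]
```

Each row of `eigen_genes` would then stand in for the expression values of the genes in the corresponding group.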

Figure 1

Simplified workflow to build the tumor prototypes.
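The last two steps of the workflow, subgroup centroids as prototypes and closest-centroid prediction, can be sketched as follows. This is a minimal sketch under stated assumptions: subgroups are found with a plain k-means using a deterministic start, distances are Euclidean, and the number of subgroups per tumor type is passed in by hand, whereas this work selects it with the CS measure [17]. All function names and the toy data are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, iterations=50):
    """Plain k-means; the first k points serve as the (deterministic) starting centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the previous center if a cluster ends up empty.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers

def build_prototypes(samples, labels, subgroups_per_type):
    """Return (tumor_type, centroid) pairs, one per subgroup of each tumor type."""
    prototypes = []
    for tumor_type in sorted(set(labels)):
        members = [s for s, l in zip(samples, labels) if l == tumor_type]
        for center in kmeans(members, subgroups_per_type[tumor_type]):
            prototypes.append((tumor_type, center))
    return prototypes

def predict(prototypes, sample):
    """Assign the tumor type of the closest subgroup centroid."""
    return min(prototypes, key=lambda proto: math.dist(sample, proto[1]))[0]

# Toy data: type "A" has two well-separated subgroups, type "B" sits between them.
samples = [[0, 0], [0, 1], [10, 10], [10, 11], [5, 5], [5, 6]]
labels = ["A", "A", "A", "A", "B", "B"]
prototypes = build_prototypes(samples, labels, {"A": 2, "B": 1})
```

In this toy case a single centroid for type "A" would coincide with the centroid of type "B", which is the kind of situation where subgroup prototypes are expected to help.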


To evaluate the proposed tumor prediction scheme, we use five different gene microarray datasets [3–5, 7–9], all of which were obtained using Affymetrix technology. We use leave-one-out cross-validation. Table 1 summarizes our experimental results for all the datasets, including relevant intermediate results along with the final classification accuracy. Table 2 compares the performance of our proposed prediction scheme with that of the methods discussed in the original works [3, 5, 7–9] in which the corresponding datasets were published. For completeness, we also compare our classification accuracies with those of a supervised clustering method [4].
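Leave-one-out cross-validation holds out each tissue sample in turn, fits the model on the remaining samples, and checks the prediction for the held-out one. The sketch below shows the loop with a simple one-centroid-per-class classifier as a stand-in; the function names and toy data are assumptions, and the stand-in classifier is not the full prediction scheme described here. In a faithful evaluation, `fit` would wrap every data-dependent step (marker selection, gene grouping, subgrouping) so that the held-out sample never influences them.

```python
import math

def nearest_centroid_fit(samples, labels):
    """Stand-in model: one centroid per class."""
    centroids = {}
    for label in set(labels):
        members = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def nearest_centroid_predict(centroids, sample):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))

def leave_one_out_accuracy(samples, labels, fit, predict):
    """Fraction of samples predicted correctly when each is held out in turn."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Toy data with two clearly separated classes (hypothetical values).
samples = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["A", "A", "A", "B", "B", "B"]
accuracy = leave_one_out_accuracy(
    samples, labels, nearest_centroid_fit, nearest_centroid_predict
)
```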

Table 1 Experimental results with different datasets.
Table 2 Comparison of methods.


In this work, we propose a novel, seamless, and integrated technique for automatic tumor detection using Affymetrix microarray gene expression data. We normalize the data by estimating tumor-specific gene expression measures using an ANOVA model. Furthermore, our tumor prediction scheme exploits molecular information such as probable correlations among genes and probable unknown subgroups within known tumor types. We demonstrate the efficacy of the proposed scheme using five different Affymetrix gene expression datasets.


  1. NCI Brain Tumor Progress Review Group

  2. Yang Y, Guccione S, Bednarski MD: Comparing genomic and histologic correlations to radiographic changes in tumors: A murine SCC VII model study. Academic Radiology 2003, 10(10):1165–1175. 10.1016/S1076-6332(03)00327-1

  3. Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JY, Goumnerova LC, Black PM, Lau C, Allen JC, Zagzag D, Olson JM, Curran T, Wetmore C, Biegel JA, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, Mesirov JP, Lander ES, Golub TR: Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature 2002, 415: 436–442. 10.1038/415436a

  4. Dettling M, Buhlmann P: Supervised clustering of genes. Genome Biology 2002, 3(12):1–15. 10.1186/gb-2002-3-12-research0069

  5. Alon U, Barkai N, Notterman D, Gish K, Ybarra S, Mack D, Levine A: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences 1999, 96(12):6745–6750. 10.1073/pnas.96.12.6745

  6. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburge DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403: 503–511. 10.1038/35000501

  7. Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J, Caligiuri M, Bloomfield C, Lander E: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531–537. 10.1126/science.286.5439.531

  8. West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson J, Marks J, Nevins J: Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci 2001, 98: 11462–11467. 10.1073/pnas.201162998

  9. Singh D, Febbo P, Ross K, Jackson D, Manola J, Ladd C, Tamayo P, Renshaw A, D’Amico A, Richie J: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002, 1: 203–209. 10.1016/S1535-6108(02)00030-2

  10. Shen R, Ghosh D, Chinnaiyan A, Meng Z: Eigengene-based linear discriminant model for tumor classification using gene expression microarray data. Bioinformatics 2006, 22(21):2635–2642. 10.1093/bioinformatics/btl442

  11. Sandberg R, Ernberg I: Assessment of tumor characteristic gene expression in cell lines using a tissue similarity index (TSI). Proceedings of the National Academy of Sciences USA 2005, 102(6):2052–2057. 10.1073/pnas.0408105102

  12. Poisson LM, Ghosh D: Statistical issues and analyses of in vivo and in vitro genomic data in order to identify clinically relevant profiles. Cancer Informatics 2007, 3: 231–243.

  13. Fromke C, Hothorn LA, Kropf S: Nonparametric relevance-shifted multiple testing procedures for the analysis of high-dimensional multivariate data with small sample sizes. BMC Bioinformatics 2008, 9: 54. 10.1186/1471-2105-9-54

  14. Islam A, Iftekharuddin KM, George EO: Class specific gene expression estimation and classification in microarray data. Proceedings of IEEE International Joint Conference on Neural Networks (IJCNN) 2008, 1678–1685.

  15. Wilcoxon F: Individual comparisons by ranking methods. Biometrics 1945, 1: 80–83. 10.2307/3001968

  16. NIST/SEMATECH e-Handbook of Statistical Methods

  17. Chou C, Su M, Lai E: A new cluster validity measure for clusters with different densities. IASTED International Conference on Intelligent Systems and Control 2003, 276–281.


The research in this paper is supported in part through research grants [RG-01-0125, TG-04-0026] provided by the Whitaker Foundation with Khan M. Iftekharuddin as the principal investigator.

Author information


Corresponding author

Correspondence to Khan M Iftekharuddin.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Islam, A., Iftekharuddin, K.M. & George, O.E. Gene expression based prototype for automatic tumor prediction. BMC Bioinformatics 12 (Suppl 7), A15 (2011).


  • Prediction Scheme
  • Gene Expression Dataset
  • Microarray Gene Expression Data
  • Supervise Cluster
  • Affymetrix Gene Expression