stam – a Bioconductor compliant R package for structured analysis of microarray data
© Lottaz and Spang; licensee BioMed Central Ltd. 2005
Received: 16 March 2005
Accepted: 25 August 2005
Published: 25 August 2005
Genome-wide microarray studies have the potential to unveil novel disease entities. Clinically homogeneous groups of patients can have diverse gene expression profiles. The definition of novel subclasses based on gene expression is a difficult problem that currently available software tools do not address systematically.
We present a computational tool for semi-supervised molecular disease entity detection. It automatically discovers molecular heterogeneities in phenotypically defined disease entities and suggests alternative molecular sub-entities of clinical phenotypes. This is done using both gene expression data and functional gene annotations.
We provide stam, a Bioconductor compliant software package for the statistical programming environment R. We demonstrate that our tool detects gene expression patterns that are characteristic of only a subset of patients from an established disease entity. We call such expression patterns molecular symptoms. Furthermore, stam finds novel sub-group stratifications of patients according to the absence or presence of molecular symptoms.
Our software is easy to install and can be applied to a wide range of datasets. It has the potential to reveal previously indistinguishable patient sub-groups of clinical relevance.
Microarray analysis is among the most promising clinical applications of modern genomics. It opens perspectives for more reliable and efficient diagnosis of established tumor entities [1, 2], risk group determination [3, 4], and the prediction of response to treatment [5]. In the supervised setting, various software tools implementing algorithms from statistical learning theory are available and have been evaluated in the context of microarray data (e.g. [6–10]).
All these methods aim to reproduce or predict predefined clinical phenotypes. However, clinical phenotypes are often not homogeneous from a molecular point of view. For example, when distinguishing between recurrent and non-recurrent disease, recurrence may well have various molecular backgrounds. If this is the case, one will expect different molecular changes in different patients, and purely supervised analysis is unsatisfactory.
In several studies, unsupervised clustering algorithms have been applied to patient profiles with the aim of defining novel disease entities [11–14]. However, clustering of patients is not straightforward, since the clinical relevance of a clustering result is often unclear. A given clustering may well reflect unimportant covariates such as gender and age, or even experimental artifacts. This is usually avoided by visual inspection of the clustered data and an educated manual selection of interesting genes. Automated software tools for this problem have not been available so far.
We have recently suggested a novel algorithm for semi-supervised analysis called structured analysis of microarrays [15]. We consider the setting where a disease group is to be distinguished from a set of patients with a different clinical phenotype (controls). Instead of determining a single global signature to detect all disease cases, we generate several local signatures, each of which identifies only a subset of them. We call the local signatures molecular symptoms. A special feature of the method is that it produces multiple candidate symptoms and characterizes each by a functional annotation, for example patients with poor prognosis and altered expression of apoptosis-related genes. The functional annotations stored in the Gene Ontology (GO) are used to ensure biological focus.
In GO [16], terms describing biological processes, molecular functions and cellular localizations are organized in a directed acyclic graph, where each node represents a biological process and child-terms are either members or representatives of their parent-terms. Genes are attributed to nodes according to the knowledge the biological research community has gathered so far. Molecular symptoms found by stam exclusively contain genes associated with one node of the Gene Ontology and therefore have a biological focus.
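The graph structure described above can be illustrated with a minimal Python sketch. It is not part of stam: the term names, the parent-child links and the gene assignments below are a made-up toy example, and the helper names `leaf_nodes` and `genes_below` are hypothetical.

```python
# Toy annotation graph (illustrative only, not real GO structure).
# Each term maps to its child terms; a term may have several parents,
# so the structure is a directed acyclic graph rather than a tree.
children = {
    "biological_process": ["death", "response to stress"],
    "death": ["apoptosis"],
    "response to stress": ["apoptosis", "DNA repair"],
    "apoptosis": [],
    "DNA repair": [],
}

# Genes directly annotated to a term (assignments are illustrative).
genes = {
    "apoptosis": {"CASP3", "BAX"},
    "DNA repair": {"BRCA1", "RAD51"},
    "response to stress": {"TP53"},
}


def leaf_nodes(children):
    """Terms without child terms."""
    return sorted(term for term, kids in children.items() if not kids)


def genes_below(term, children, genes):
    """Genes annotated to a term or to any of its descendants."""
    found = set(genes.get(term, ()))
    for child in children[term]:
        found |= genes_below(child, children, genes)
    return found


print(leaf_nodes(children))
print(sorted(genes_below("response to stress", children, genes)))
```

In this toy graph, "apoptosis" is reachable through two parents, which is the situation a tree could not represent.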
A detailed description of structured analysis of microarrays is given in [15]. Here we only give a brief review of the method, which proceeds in four steps:
- generate a rooted, directed classifier graph according to the Gene Ontology,
- construct leaf-node classifiers based on expression values of genes that are directly annotated to the leaf nodes,
- propagate the results through inner nodes to the root,
- and shrink the classifier graph to determine a concise set of molecular symptoms.
We have implemented the algorithm in the R environment for statistical computing [17]. Time-consuming parts of the method are written in C to improve computational performance. Furthermore, we rely on packages from the Bioconductor suite of bioinformatics tools [18].
The raw classifier graph
Gene Ontology annotations available in Bioconductor – For the microarrays listed in this table, Bioconductor meta data packages are available. The second column gives the number of leaf nodes, and the third column the number of inner nodes considered when generating classifier graphs. The last column reports the ratio of probe-sets associated with any leaf node.
Each leaf node contains a set of associated genes. The classifier for a leaf node is constructed using only these genes. For each patient, the classifier returns a number between zero and one. Zero indicates clear evidence for the control group, one indicates clear evidence for the disease group, and intermediate values represent levels of uncertainty. In the current implementation we use the shrunken centroid classifiers [9] implemented in the Bioconductor package pamr for leaf node predictions.
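To make the notion of a leaf-node score between zero and one concrete, the following Python sketch uses a plain nearest-centroid classifier as a simplified stand-in. It is not the shrunken centroid method of pamr (there is no gene-wise standardization and no shrinkage), and the names `fit_leaf` and `leaf_score` are hypothetical.

```python
import math


def centroid(rows):
    """Gene-wise mean of a list of expression profiles."""
    return [sum(col) / len(col) for col in zip(*rows)]


def fit_leaf(profiles, labels):
    """Fit class centroids; labels: 1 = disease, 0 = control."""
    disease = [p for p, y in zip(profiles, labels) if y == 1]
    control = [p for p, y in zip(profiles, labels) if y == 0]
    return centroid(disease), centroid(control)


def leaf_score(model, profile):
    """Score in [0, 1]; values near 1 favour the disease group."""
    c_disease, c_control = model
    d1 = sum((x - c) ** 2 for x, c in zip(profile, c_disease))
    d0 = sum((x - c) ** 2 for x, c in zip(profile, c_control))
    # Softmax over negative half squared distances, written as a
    # logistic function; the exponent is clamped to avoid overflow.
    z = max(-700.0, min(700.0, 0.5 * (d1 - d0)))
    return 1.0 / (1.0 + math.exp(z))


# Toy data: two genes, two disease and two control patients.
train = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 1.0]]
labels = [1, 1, 0, 0]
model = fit_leaf(train, labels)
print(leaf_score(model, [2.5, 2.5]))  # close to the disease centroid
print(leaf_score(model, [1.5, 1.5]))  # equidistant from both centroids
```

A profile that is equally far from both centroids receives the score 0.5, which corresponds to the intermediate level of uncertainty mentioned in the text.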
Propagation of classifier results
For propagating leaf node results to inner nodes, weighted sums of child classifications are used. Children with good classification performance receive more weight than those with poor performance. Thereby, stam measures performance according to the desired properties of molecular symptoms by punishing low specificity more severely than lack of sensitivity. Prediction results are propagated from the leaf nodes towards the root in a postorder traversal of inner nodes. Hence, stam always computes results for all children before it computes results for the parent node. The root naturally displays an overall classification result.
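The propagation step can be sketched in Python as follows. This is an illustration under stated assumptions, not the stam implementation: the performance measure (specificity counted twice as much as sensitivity) and the normalization of the weights are hypothetical choices, since the exact formulas are given in the method paper rather than here. The function names are hypothetical as well.

```python
def performance(sensitivity, specificity, spec_weight=2.0):
    """Hypothetical score that punishes low specificity more severely."""
    return (sensitivity + spec_weight * specificity) / (1.0 + spec_weight)


def propagate(node, children, leaf_scores, perf, cache=None):
    """Post-order traversal: all children are evaluated before the parent."""
    if cache is None:
        cache = {}
    if node in cache:  # a node with several parents is computed only once
        return cache[node]
    kids = children.get(node, [])
    if not kids:
        result = leaf_scores[node]
    else:
        results = [propagate(k, children, leaf_scores, perf, cache) for k in kids]
        total = sum(perf[k] for k in kids)
        if total == 0:
            # Assumption: fall back to the plain mean if no child has weight.
            result = sum(results) / len(results)
        else:
            result = sum(perf[k] * r for k, r in zip(kids, results)) / total
    cache[node] = result
    return result


children = {"root": ["A", "B"], "A": [], "B": []}
leaf_scores = {"A": 0.9, "B": 0.2}
perf = {"A": 3.0, "B": 1.0}
print(propagate("root", children, leaf_scores, perf))
```

Because the weights are normalized, the value at an inner node stays between zero and one and can be read in the same way as a leaf-node score.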
Classifier graph shrinkage
Many biological processes are not involved in the investigated phenotype. Therefore, stam simplifies the classifier graph by eliminating irrelevant branches. This is done in analogy to gene shrinkage in the shrunken centroid algorithm [9]. stam controls the shrinkage process by calibrating a shrinkage parameter in a cross validation setting. We define an objective function considering two independent goals: good predictive performance in the root and a set of molecular symptoms for patient stratification. For the second goal aggressive shrinkage is counterproductive, since it eliminates too many inherently heterogeneous molecular symptoms.
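The effect of the shrinkage parameter can be illustrated with a small Python sketch. It assumes that each node carries a non-negative weight, that shrinkage is a soft threshold on these weights, and that branches whose weight reaches zero are dropped. These are assumptions made for illustration in analogy to shrunken centroids; they are not the exact procedure implemented in stam, and the function names are hypothetical.

```python
def soft_threshold(value, delta):
    """Shrink a non-negative weight towards zero by delta, stopping at zero."""
    return max(value - delta, 0.0)


def shrink_graph(children, weights, delta, root="root"):
    """Return the sub-graph reachable from the root after shrinkage."""
    shrunken = {node: soft_threshold(w, delta) for node, w in weights.items()}
    kept = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node in kept:
            continue
        # Keep only children whose shrunken weight is still positive.
        kept[node] = [k for k in children.get(node, []) if shrunken[k] > 0.0]
        stack.extend(kept[node])
    return kept


children = {"root": ["A", "B"], "A": ["A1", "A2"], "B": []}
weights = {"A": 0.8, "B": 0.1, "A1": 0.5, "A2": 0.15}

print(shrink_graph(children, weights, delta=0.2))
print(shrink_graph(children, weights, delta=0.6))
```

With the larger value of `delta` only one branch survives, which shows why aggressive shrinkage leaves few molecular symptoms for stratification.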
The program's output is a classifier graph, where each node represents a molecular symptom. We have shown in [15] that the collection of these classifiers yields state-of-the-art predictive performance and allows for a resolved diagnosis. A stam-diagnosis is more resolved than the classification provided for training because molecular symptoms are usually absent in some of the disease patients. Patterns of absence and presence of molecular symptoms identify smaller groups of patients and thus provide an additional molecular stratification of patients. Due to this unsupervised aspect within our supervised method, we call our approach semi-supervised.
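Stratification by patterns of absence and presence can be sketched in a few lines of Python. The sketch assumes that a symptom counts as present when its score exceeds 0.5; this cut-off, the patient identifiers and the function name `stratify` are illustrative assumptions, not taken from stam.

```python
from collections import defaultdict


def stratify(symptom_scores, threshold=0.5):
    """Group patients by their pattern of present (1) and absent (0) symptoms."""
    groups = defaultdict(list)
    for patient, scores in symptom_scores.items():
        pattern = tuple(int(score > threshold) for score in scores)
        groups[pattern].append(patient)
    return dict(groups)


# Scores of four disease patients for two molecular symptoms (toy values).
symptom_scores = {
    "P1": [0.9, 0.1],
    "P2": [0.8, 0.2],
    "P3": [0.1, 0.9],
    "P4": [0.7, 0.8],
}

for pattern, patients in stratify(symptom_scores).items():
    print(pattern, patients)
```

All four patients belong to the same disease group, yet they fall into three sub-groups according to which symptoms they display.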
stam is installed like any other Bioconductor package, either by downloading it and installing from a local copy or directly through the internet. We provide packaged versions ready for download on the Bioconductor web site [19] as well as on our own web page [20].
Computing with stam is done on a command-line level. Gene expression matrices can be provided either as plain R matrices or as exprSet Bioconductor objects. R can read tab-delimited files written by any other software. stam provides functions for cross validation, model fit, and prediction. First, cross validation is applied to the training data to find the appropriate shrinkage level. The second function computes a classifier model given this shrinkage level. This model can then be used by the prediction function to diagnose new patients and assign them to novel molecular disease entities. For convenience, all three steps can be performed by one call of an evaluation function. This function can also randomly split patients into a training and a test set.
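The three-step workflow (calibrate by cross validation, fit, predict) can be illustrated generically in Python. The sketch below does not use or mirror the stam functions: `cross_validate`, `toy_fit` and `toy_predict` are hypothetical names, the model is a one-dimensional threshold classifier, and the level is chosen by error count alone, whereas stam uses an objective function that also rewards a useful set of molecular symptoms.

```python
def k_folds(n, k):
    """Split the indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validate(xs, ys, levels, fit, predict, k=3):
    """Return the candidate level with the fewest cross-validated errors."""
    errors = {}
    for level in levels:
        wrong = 0
        for fold in k_folds(len(xs), k):
            held_out = set(fold)
            train = [i for i in range(len(xs)) if i not in held_out]
            model = fit([xs[i] for i in train], [ys[i] for i in train], level)
            wrong += sum(predict(model, xs[i]) != ys[i] for i in fold)
        errors[level] = wrong
    return min(levels, key=lambda level: errors[level])


def toy_fit(xs, ys, level):
    """Toy model: midpoint between class means, shifted by the level.

    Assumes both classes are present in the training data.
    """
    ones = [x for x, y in zip(xs, ys) if y == 1]
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2 + level


def toy_predict(model, x):
    """Predict class 1 if the value exceeds the fitted threshold."""
    return int(x > model)


xs = [0.0, 0.2, 0.4, 2.0, 2.2, 2.4]
ys = [0, 0, 0, 1, 1, 1]

best = cross_validate(xs, ys, [0.0, 2.0], toy_fit, toy_predict)  # step 1
model = toy_fit(xs, ys, best)                                     # step 2
print(best, toy_predict(model, 1.9))                              # step 3
```

Keeping calibration, fitting and prediction as separate functions makes it possible to reuse a fitted model on new patients without repeating the cross validation.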
For further illustration, we use a data set from a microarray study on lung cancer [1]. The investigators analyzed gene expression profiles from 186 lung cancer and 17 non-tumor lung biopsies using hierarchical and probabilistic clustering, with the goal of uncovering novel molecular lung cancer entities. The study uses the HG-U95Av2 microarray from Affymetrix and contains samples from various subtypes of lung cancer. We apply stam with the squamous cancers forming the disease group of interest and all other cancers as controls. The dataset contains 21 squamous carcinomas. The 203 samples are randomly split into a training set (135 samples containing 14 squamous) and a test set (68 samples containing 7 squamous).
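The reported set sizes are consistent with a class-stratified random split that assigns two thirds of each group to the training set. The text only states that the split is random, so both the stratification and the fraction 2/3 are assumptions of the following Python sketch, and `stratified_split` is a hypothetical helper, not a stam function.

```python
import random


def stratified_split(labels, train_fraction, seed=0):
    """Randomly split sample indices, keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


# 21 squamous carcinomas (label 1) and 182 other samples (label 0).
labels = [1] * 21 + [0] * 182
train, test = stratified_split(labels, 2 / 3)

print(len(train), sum(labels[i] for i in train))  # 135 14
print(len(test), sum(labels[i] for i in test))    # 68 7
```

Which samples end up in each set depends on the seed, but the set sizes and the number of squamous samples per set do not.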
Automatic and manual calibration of graph shrinkage
Results are written on interlinked HTML pages. Links allow navigation along the edges of the classifier graph. The pages contain classification results and performance evaluation for all nodes as well as overall information on cross-validation, model fit and root diagnosis of patients. For inner nodes the propagation weights are provided and for leaf nodes the genes used for classification can be displayed. The user can further explore term definitions and probe-set annotations through external links to the Gene Ontology and the Affymetrix web sites.
Interactive use of stam
In this paper we present a software package to integrate biological annotation into statistical class prediction analysis of microarray data in an a priori fashion. We use the functional annotation collected in the Gene Ontology database to construct structured classifiers. Class predictions are computed for each term in the Gene Ontology which is related to the disease. Our method allows for biologically resolved diagnosis of patients. It is thus able to stratify complex clinical phenotypes, where different patients who show the phenotype may display different molecular characteristics.
We suggest structured analysis of microarrays for different applications. In addition to predictive performance, we also aim to make underlying disease mechanisms transparent. We do this by identifying molecular symptoms associated with subsets of patients in the disease group. Molecular symptoms are always restricted to well defined biological processes. Patients who are positive for a molecular symptom display specific gene expression in the corresponding process. Not all patients in the disease group are positive for every identified molecular symptom, but some patients can be positive for more than one of them. Using patterns of absence and presence of molecular symptoms, we define an additional molecular stratification of patients.
In summary, stam is a novel algorithm for uncovering previously unknown molecular disease sub-entities. The R package is easily accessible to all researchers working with Affymetrix® oligo chips.
Availability and requirements
The Bioconductor compliant R package stam is available through the Bioconductor web site [19]. Alternatively, we also make it available on the Computational Diagnostics Software Page at the Max Planck Institute for Molecular Genetics in Berlin [20]. There, the source package is available for download and we also run a Bioconductor compliant package repository.
Our software is written for the R environment for statistical computing, downloadable from [22]. Version 2.0.0 or later of R is needed to run stam. Our software is based on other Bioconductor packages, namely the meta data packages for the Gene Ontology annotations. We recommend installing release 1.5 of the Bioconductor suite from [19]. For the layout of classifier graphs, we rely on the graphviz package, version 1.10 or later, available at [23].
We have extensively used stam on Linux installations on i686 based machines as well as on alpha based UNIX machines running the OSF1 and Tru64 operating systems.
The authors are grateful to Florian Markowetz, Jörn Tödling, Jochen Jäger, Stefanie Scheid and Stefan Bentink from our work group as well as to our partners Renate Kirschner-Schwabe, Christian Hagemeier and Karl Seeger from the Charité Medical Center for fruitful discussions. This research has been supported by BMBF grants 031U117/217 of the German Federal Ministry of Education and the National Genome Research Network.
1. Bhattacharjee A, Richards W, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark E, Lander E, Wong W, Johnson B, Golub T, Sugarbaker D, Meyerson M: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci U S A 2001, 98(24):13790–5. doi:10.1073/pnas.191502998
2. Yeoh EJ, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, Behm FG, Raimondi SC, Relling MV, Patel A, Cheng C, Campana D, Wilkins D, Zhou X, Li J, Liu H, Pui CH, Evans WE, Naeve C, Wong L, Downing JR: Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 2002, 1: 133–45. doi:10.1016/S1535-6108(02)00032-6
3. Huang E, Cheng SH, Dressman H, Pittman J, Tsou MH, Horng C, Bild A, Iversen E, Liao M, Chen CM, West M, Nevins JR, Huang AT: Gene expression predictors of breast cancer outcomes. Lancet 2003, 361(9363):1590–6. doi:10.1016/S0140-6736(03)13308-9
4. van't Veer L, Dai H, van de Vijver M, He Y, Hart A, Mao M, Peterse H, van der Kooy K, Marton M, Witteveen A, Schreiber G, Kerkhoven R, Roberts C, Linsley P, Bernards R, Friend S: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530–6. doi:10.1038/415530a
5. Cheok MH, Yang W, Pui CH, Downing JR, Cheng C, Naeve CW, Relling MV, Evans WE: Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells. Nature Genet 2003, 34: 85–90. doi:10.1038/ng1151
6. Ben-Dor A, Bruhn L, Friedman N, Nachman I, Schummer M, Yakhini Z: Tissue classification with gene expression profiles. J Comp Biol 2000, 7: 559–83. doi:10.1089/106652700750050943
7. Dudoit S, Fridlyand J, Speed T: Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. J Amer Stat Assoc 2002, 97: 77–87. doi:10.1198/016214502753479248
8. Slonim DK, Tamayo T, Mesirov JP, Golub TR, Lander ES: Class Prediction and Discovery Using Gene Expression Data. Proc Internatl Conf Comp Biol 2000, 263–72.
9. Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types using shrunken centroids of gene expression. Proc Natl Acad Sci 2002, 99(10):6567–72. doi:10.1073/pnas.082099299
10. West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA, Marks JR, Nevins JR: Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci 2001, 98(20):11462–7. doi:10.1073/pnas.201162998
11. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403(6769):503–11. doi:10.1038/35000501
12. Monti S, Savage KJ, Kutok JL, Feuerhake F, Kurtin P, Mihm M, Wu B, Pasqualucci L, Neuberg D, Aguiar RC, Cin PD, Ladd C, Pinkus GS, Salles G, Harris NL, Dalla-Favera R, Habermann TM, Aster JC, Golub TR, Shipp MA: Molecular profiling of diffuse large B-cell lymphoma identifies robust subtypes including one characterized by host inflammatory response. Blood 2005, 105(5):1851–1861. doi:10.1182/blood-2004-07-2947
13. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lonning PE, Borresen-Dale AL, Brown PO, Botstein D: Molecular portraits of human breast tumours. Nature 2000, 406(6797):747–752. doi:10.1038/35021093
14. Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Giltnane JM, Hurt EM, Zhao H, Averett L, Yang L, Wilson WH, Jaffe ES, Simon R, Klausner RD, Powell J, Duffey PL, Longo DL, Greiner TC, Weisenburger DD, Sanger WG, Dave BJ, Lynch JC, Vose J, Armitage JO, Montserrat E, Lopez-Guillermo A, Grogan TM, Miller TP, LeBlanc M, Ott G, Kvaloy S, Delabie J, Holte H, Krajci P, Stokke T, Staudt LM: The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med 2002, 346(25):1937–1947. doi:10.1056/NEJMoa012914
15. Lottaz C, Spang R: Molecular decomposition of complex clinical phenotypes using biologically structured analysis of microarray data. Bioinformatics 2005, 21: 1971–8. doi:10.1093/bioinformatics/bti292
16. The Gene Ontology Consortium: Gene ontology: Tool for the unification of biology. Nature Genet 2000, 25: 25–9. doi:10.1038/75556
17. R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2004.
18. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. doi:10.1186/gb-2004-5-10-r80
19. The BioConductor Home Page [http://www.bioconductor.org]
20. The Computational Diagnostics Software Page [http://compdiag.molgen.mpg.de/software]
21. Gansner ER, North SC: An open graph visualization system and its applications to software engineering. Software Practice and Experience 2000, 30(11):1203–33. doi:10.1002/1097-024X(200009)30:11<1203::AID-SPE338>3.0.CO;2-N
22. The R Project for Statistical Computing [http://www.r-project.org]
23. Graphviz – Graph Visualization Software [http://www.graphviz.org]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.