Signature Evaluation Tool (SET): a Java-based tool to evaluate and visualize the sample discrimination abilities of gene expression signatures
© Jen et al; licensee BioMed Central Ltd. 2008
Received: 17 October 2007
Accepted: 28 January 2008
Published: 28 January 2008
The identification of specific gene expression signatures for distinguishing sample groups is a dominant field in cancer research. Although a number of tools have been developed to identify optimal gene expression signatures, the number of signature genes obtained is often too large to be applied clinically. Furthermore, experimental verification is sometimes limited by the availability of wet-lab materials such as antibodies and reagents. A tool to evaluate the discrimination power of candidate genes is therefore in high demand by clinical researchers.
Signature Evaluation Tool (SET) is a Java-based tool that adopts Golub's weighted voting algorithm and incorporates a visual presentation of the prediction strength for each array sample. SET provides a flexible and easy-to-follow platform to evaluate the discrimination power of a gene signature. Here, we demonstrate the application of SET for several purposes: (1) for signatures consisting of a large number of genes, SET offers the ability to rapidly narrow down the number of genes; (2) for a given signature (from third party analyses or user-defined), SET can re-evaluate and re-adjust its discrimination power by selecting/de-selecting genes repeatedly; (3) for multiple microarray datasets, SET can evaluate the classification capability of a signature among datasets; and (4) by providing a module to visualize the prediction strength for each sample, SET allows users to re-evaluate the discrimination power on mis-grouped or less-certain samples. Information obtained from the above applications could be useful in prognostic analyses or clinical management decisions.
Here we present SET to evaluate and visualize the sample-discrimination ability of a given gene expression signature. This tool provides a filtration function for signature identification and lies between clinical analyses and class prediction (or feature selection) tools. The simplicity, flexibility and brevity of SET could make it an invaluable tool for marker identification in clinical research.
Gene expression profiling based on microarray technology has been widely applied to monitor global transcriptome changes in biological samples. In cancer research, one of the major microarray applications is to identify genes, or features, whose expression patterns can discriminate samples with distinct states (usually defined by the phenotype of samples such as primary or metastatic tumour). These identified genes form an expression signature that can be used to assist clinical management decisions such as clinical trial risk assessment, treatment selection, or cancer prognosis [1–5].
To acquire a good expression signature, supervised methods are more appropriate than unsupervised approaches. A supervised prediction method consists of three common processes: 1) feature selection, 2) computation of weights for the selected features, and 3) creation of a prediction rule. The discrimination capability of a signature can be evaluated by cross-validation, such as n-fold or leave-one-out cross-validation (LOOCV). Recently, many classification algorithms (such as SVM, evolutionary algorithms and I-RELIEF) that combine cross-validation and heuristic searching to acquire an optimal expression signature have been proposed [7–9]. Furthermore, those algorithms have been incorporated into hassle-free tools to aid the acquisition of an optimal signature. For example, M@CBETH [10] is a web-based tool aimed at finding the best prediction among different classification methods. Prophet [11], another web-based tool, can automatically build classifiers using a strategy that renders unbiased cross-validated errors. The class prediction modules in GenePattern [12] also support several supervised learning methods. Moreover, to improve the efficiency and the accuracy of an acquired signature, several feature selection tools based on statistical analysis have been developed: RankGene is a feature selection suite based on statistical ranking analyses [13], and HykGene [14] and mRMR [15] are tools that minimise the redundancy of genes.
In SET, we adopted the weighted voting algorithm published by Ramaswamy et al. and LOOCV [3, 16, 17] to evaluate the discrimination power of features. The signal-to-noise score was used:
$$S_x = \frac{\mu_{GI} - \mu_{GII}}{\sigma_{GI} + \sigma_{GII}}$$
$S_x$: the weighted value for each feature $x$
$\mu_{GI}$, $\mu_{GII}$: mean of expression in group I ($GI$) or group II ($GII$)
$\sigma_{GI}$, $\sigma_{GII}$: standard deviation of expression in group I ($GI$) or group II ($GII$)
The signal-to-noise statistic reflects how well a feature correlates with a particular group distinction (numerator). It also penalises features that have high variance in both groups more than those that have high variance in one group but low variance in the other (denominator). This bias is useful for biological samples: for example, in cancer research, genes in normal tissues are strictly regulated, whereas in tumours genes are dysregulated and their expression levels vary widely. The weighted voting algorithm has been compared with other class prediction methods (standard and diagonal discriminant analysis, classification trees with or without aggregation, and nearest neighbour classification) using three microarray datasets (adult lymphoma, leukaemia, and sixty human tumour cell lines), and it showed the best performance [19].
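As an illustration of the score above, the following Python sketch computes the signal-to-noise score for each gene and ranks genes by its absolute value. It is not SET's source code: the function names, the dict-of-lists data layout, the use of the sample standard deviation, and the zero-denominator guard are all assumptions made for this example.

```python
from statistics import mean, stdev


def signal_to_noise(group1, group2):
    """S_x = (mu_GI - mu_GII) / (sigma_GI + sigma_GII) for one gene.

    Assumption: sigma is the sample standard deviation, so each group
    needs at least two values.
    """
    denom = stdev(group1) + stdev(group2)
    if denom == 0:
        return 0.0  # assumption: a gene with no spread gets zero weight
    return (mean(group1) - mean(group2)) / denom


def rank_genes(expr, labels):
    """Rank genes by |S_x| in descending order.

    expr   : dict mapping gene name -> list of expression values (one per array)
    labels : list with 1 (group I) or 2 (group II) for each array
    """
    scores = {}
    for gene, values in expr.items():
        g1 = [v for v, lab in zip(values, labels) if lab == 1]
        g2 = [v for v, lab in zip(values, labels) if lab == 2]
        scores[gene] = signal_to_noise(g1, g2)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


if __name__ == "__main__":
    toy_expr = {"a": [1, 2, 3, 7, 8, 9], "b": [4, 5, 6, 3, 5, 4]}
    toy_labels = [1, 1, 1, 2, 2, 2]
    for gene, score in rank_genes(toy_expr, toy_labels):
        print(gene, score)
```

A negative score here simply means the gene is more highly expressed in group II; the ranking uses the absolute value, so the sign does not affect the order.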
To avoid over-interpretation of the error rate produced by the weighted voting algorithm, SET includes a visualization module that presents the prediction strength (PS) information for all samples [3, 16, 17, 20]:
$$PS = \frac{V_{GI} - V_{GII}}{V_{GI} + V_{GII}}$$
$V_{GI}$ and $V_{GII}$ represent the total votes for $GI$ and $GII$ respectively.
The PS value ranges from -1 to +1, with higher absolute values reflecting stronger prediction. The prediction strength for each sample shows the margin of victory in either direction between the two supervised groups. Visualizing this uncertainty can provide important information about prognosis, such as the progression of tumour metastasis or the estimated survival time.
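The PS calculation can be sketched in Python as follows. The text above does not spell out how an individual vote is formed, so this example assumes the scheme described by Golub et al.: each gene votes with its weight multiplied by the distance of the sample's expression from the midpoint of the two group means. The function name, argument names and dict layout are hypothetical.

```python
def prediction_strength(sample, weights, boundaries):
    """Return PS = (V_GI - V_GII) / (V_GI + V_GII) for one sample.

    sample     : dict gene -> expression value in this sample
    weights    : dict gene -> signal-to-noise weight S_x
    boundaries : dict gene -> midpoint of the two group means
                 (assumption: the decision boundary used for voting)
    """
    votes_g1 = 0.0
    votes_g2 = 0.0
    for gene, weight in weights.items():
        vote = weight * (sample[gene] - boundaries[gene])
        if vote > 0:
            votes_g1 += vote   # positive vote counts for group I
        else:
            votes_g2 += -vote  # negative vote counts for group II
    total = votes_g1 + votes_g2
    if total == 0:
        return 0.0  # no gene cast a vote: no evidence either way
    return (votes_g1 - votes_g2) / total


if __name__ == "__main__":
    ps = prediction_strength(
        sample={"a": 2, "b": 1},
        weights={"a": -3.0, "b": 1.0},
        boundaries={"a": 5, "b": 2},
    )
    print(ps)
```

In this toy call, gene "a" casts 9 votes for group I and gene "b" casts 1 vote for group II, so the sample is assigned to group I with a fairly large margin.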
SET is a standalone Java application that deploys Java Web Start technology, providing a flexible platform for researchers to evaluate gene signatures based on expression datasets. It enables users to analyze unpublished profiles locally with the most up-to-date version of the program. Results are visualized by JFreeChart, an open-source Java chart library, which displays the line chart of the error rate distribution and the scatter plot of the prediction strength analysis. The software offers several distinctive displays and user-friendly elements, organized into four simple steps:
Step 1: Grouping arrays by supervised knowledge
Step 2: Error rate distribution
By default, the uploaded genes are ranked in descending order of the absolute values of their signal-to-noise scores, but the user can instead rank them by other attributes such as p-values. Genes are added to the signature one at a time in ranking order. The error rate for each new signature is estimated by the weighted voting algorithm and LOOCV and can be monitored in an error rate distribution plot (Figure 2B). Based on the error rate information, the user can then select an appropriate composition of discriminating genes, for instance, the composition with the lowest error rate.
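The error rate distribution described in this step can be sketched as below. This is a minimal Python illustration under stated assumptions, not SET's implementation: the gene ranking is taken as given, the weights and boundaries are re-estimated inside each leave-one-out fold, votes follow the Golub-style scheme (weight times distance from the midpoint of the group means), and all function names are hypothetical.

```python
from statistics import mean, stdev


def train(expr, labels, genes, skip=None):
    """Estimate (weight, boundary) per gene from all arrays except index `skip`."""
    model = {}
    for gene in genes:
        g1 = [v for i, v in enumerate(expr[gene]) if labels[i] == 1 and i != skip]
        g2 = [v for i, v in enumerate(expr[gene]) if labels[i] == 2 and i != skip]
        m1, m2 = mean(g1), mean(g2)
        denom = stdev(g1) + stdev(g2)
        weight = (m1 - m2) / denom if denom else 0.0
        model[gene] = (weight, (m1 + m2) / 2)
    return model


def predict(model, expr, index):
    """Sum the signed votes of all genes; positive -> group I, otherwise group II."""
    total = sum(w * (expr[g][index] - b) for g, (w, b) in model.items())
    return 1 if total > 0 else 2


def loocv_error(expr, labels, genes):
    """Leave-one-out cross-validated error rate for a fixed gene set."""
    wrong = 0
    for i in range(len(labels)):
        model = train(expr, labels, genes, skip=i)
        if predict(model, expr, i) != labels[i]:
            wrong += 1
    return wrong / len(labels)


def error_curve(expr, labels, ranked_genes):
    """Error rate of the top-1, top-2, ... signatures, in ranking order."""
    return [loocv_error(expr, labels, ranked_genes[:k])
            for k in range(1, len(ranked_genes) + 1)]


if __name__ == "__main__":
    toy_expr = {"good": [1, 2, 3, 7, 8, 9], "noise": [5, 1, 6, 2, 5, 3]}
    toy_labels = [1, 1, 1, 2, 2, 2]
    print(error_curve(toy_expr, toy_labels, ["good", "noise"]))
```

Each group needs at least three arrays in this sketch, so that two remain for the standard deviation when one array is left out.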
Step 3: Signature evaluation
Genes within the chosen composition are ranked and displayed by their signal-to-noise scores, and the user can manually select or de-select genes as appropriate (Figure 2C). Gene titles and gene symbols can be incorporated in this step if the annotations of an array platform are supported by our ArrayFusion database, which currently supports annotations for the majority of Affymetrix arrays and several Agilent arrays [21]. The potential of the selected genes to distinguish between the two supervised groups can be evaluated by the cross-validated error rate, where a lower error rate reflects a superior distinguishing potential. The significance of the error rate is estimated from 1,000 group permutations to ensure that the error rate is not a result of random chance. The expression signature can be arbitrarily modified during the analysis and the corresponding error rate can be recalculated repeatedly.
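The permutation step can be illustrated with a short Python sketch. The text does not state how the p-value is derived from the 1,000 permutations, so the formula below (the fraction of label shufflings whose error rate is at least as low as the observed one, with a +1 correction) is an assumption. The error function is passed in as a callable; `toy_error` is a hypothetical stand-in (a nearest-group-mean rule on one gene), not the weighted voting classifier.

```python
import random
from statistics import mean


def permutation_p_value(error_fn, labels, n_perm=1000, seed=0):
    """Estimate how often shuffled group labels give an error rate <= the observed one.

    error_fn : callable taking a list of labels and returning an error rate
    Returns (observed_error, p_value).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = error_fn(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # group sizes are preserved, membership is random
        if error_fn(shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


def toy_error(labels, values=(1, 2, 3, 7, 8, 9)):
    """Hypothetical error function: nearest-group-mean rule on a single gene."""
    m1 = mean(v for v, lab in zip(values, labels) if lab == 1)
    m2 = mean(v for v, lab in zip(values, labels) if lab == 2)
    wrong = 0
    for v, lab in zip(values, labels):
        predicted = 1 if abs(v - m1) < abs(v - m2) else 2
        if predicted != lab:
            wrong += 1
    return wrong / len(labels)


if __name__ == "__main__":
    observed, p = permutation_p_value(toy_error, [1, 1, 1, 2, 2, 2])
    print(observed, p)
```

With only six arrays, just 2 of the 20 possible groupings separate perfectly, so the p-value cannot fall much below 0.1; real datasets with more arrays allow far smaller values.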
Step 4: Prediction strength
The result of the prediction strength (PS) analysis for each sample is shown once a signature is defined. The PS values range from -1 to +1, where higher absolute values reflect stronger predictions. An overview of the results for samples in both the "Supervised" and "Testing" groups is illustrated by the PS plot for the selected signature, and the results can be used to evaluate and predict the certainty of group identity for individual samples (see Figure 2D as an example). To increase the flexibility of evaluation, samples can be re-grouped (for instance, re-allocated from the "Testing" group to the "Supervised" group) and signature genes can be re-selected repeatedly (Figures 2A and 2C). The results of the analysis provide the user with candidate genes for further experimental validation.
Further details are illustrated in the tutorial file on the website; please see the Availability and requirements section.
Results and Discussion
Serial signature evaluation
SET provides a rational way of narrowing down genes with optimal discriminative power. Unlike other feature selection tools such as HykGene [14] or mRMR [15], which select non-redundant genes based on statistical calculations, SET adopts a speedy signature evaluation approach that ranks the gene list according to the contribution value (Sx or user-defined attributes) of individual genes and, additionally, plots the distribution of the cross-validated error rate for signatures with an increasing number of genes. Based on the distribution, the user can easily narrow down the number of genes with superior discriminative power; however, the approach is not without limitations. For example, a subset of genes with lower weights could still yield a lower error rate. To avoid missing crucial genes, the user may begin by narrowing down the genes to a manageable number and subsequently select/de-select genes to further examine the power of the individual signature. Users are reminded that SET is a tool for signature evaluation rather than a machine learning tool for building an optimized prediction rule; in other words, the estimated error rate applies only to the defined signature, not to a signature-building procedure that includes the feature selection process.
A flexible evaluation platform
As described in Implementation (i) to (iii), SET provides a signature evaluation platform that can adapt signatures from a variety of sources including third party analyses or candidates of interest that are deduced by biological knowledge. The ability to re-select/de-select genes following error rate distribution analysis enables the user to further choose genes from the narrowed down list (Figure 2C), and rapidly re-evaluate and re-adjust the discriminative power of the new signature (Figures 2C and 2D).
With the accumulation of microarray experiments, researchers nowadays may have more than one gene expression dataset. To evaluate the applicability of a specific signature between different datasets, researchers can import two datasets into SET separately but select the same signature members to carry out the evaluation step. Alternatively, researchers can merge two datasets into one expression matrix, upload it into SET, and then perform the two-step evaluation procedure proposed by Golub et al. [20]. In this case, one dataset can be set up as the "Supervised" groups and the other as the "Testing" group. The first step tests the selected features by cross-validation on samples of the "Supervised" groups, and the second step applies the built signature to the "Testing" group to assess its accuracy. Both results can be shown in the PS analysis.
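The second step of this procedure, applying a signature built on the "Supervised" arrays to the "Testing" arrays, might look like the following Python sketch. It assumes the Golub-style vote (weight times distance from the midpoint of the group means) and a dict-of-lists layout; the function names are hypothetical and the code is not taken from SET.

```python
from statistics import mean, stdev


def fit(expr, labels, genes):
    """Estimate (weight, boundary) per gene from the "Supervised" arrays only."""
    model = {}
    for gene in genes:
        g1 = [v for v, lab in zip(expr[gene], labels) if lab == 1]
        g2 = [v for v, lab in zip(expr[gene], labels) if lab == 2]
        m1, m2 = mean(g1), mean(g2)
        denom = stdev(g1) + stdev(g2)
        model[gene] = ((m1 - m2) / denom if denom else 0.0, (m1 + m2) / 2)
    return model


def strength(model, expr, index):
    """PS of one array: (V_GI - V_GII) / (V_GI + V_GII)."""
    v1 = v2 = 0.0
    for gene, (weight, boundary) in model.items():
        vote = weight * (expr[gene][index] - boundary)
        if vote > 0:
            v1 += vote
        else:
            v2 += -vote
    return (v1 - v2) / (v1 + v2) if (v1 + v2) else 0.0


def evaluate_testing(supervised, sup_labels, testing, test_labels, genes):
    """Apply the signature to the "Testing" arrays; return (accuracy, [(predicted, PS), ...])."""
    model = fit(supervised, sup_labels, genes)
    results = []
    for i in range(len(test_labels)):
        ps = strength(model, testing, i)
        results.append((1 if ps > 0 else 2, ps))
    correct = sum(pred == lab for (pred, _), lab in zip(results, test_labels))
    return correct / len(test_labels), results


if __name__ == "__main__":
    supervised = {"a": [1, 2, 3, 7, 8, 9], "b": [4, 5, 6, 3, 5, 4]}
    testing = {"a": [2.5, 8.5], "b": [4, 4]}
    print(evaluate_testing(supervised, [1, 1, 1, 2, 2, 2], testing, [1, 2], ["a", "b"]))
```

Because the "Testing" arrays never contribute to the weights or boundaries, their accuracy and PS values give an independent check on the signature.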
Visualization of prediction strength (PS) for evaluation and prediction
Given that the prediction uncertainty is not revealed by estimation of the error rate, the incorporation of the PS index in the analysis is important. The PS presentation methods used by MacDonald et al. [3, 16, 17] and Ye et al. were integrated into SET, but a new way of displaying the PS information for each sample was devised (Figure 2D). The PS visualization module conveniently enables the user to trace back samples that were incorrectly grouped, or samples that have lower prediction certainty (PS value close to 0). Furthermore, the module would be of substantial value in clinical research when clinical parameters, such as disease progression, are taken into consideration. For example, in Figure 1, some primary tumour samples were grouped together with metastatic tumours. The tool not only enables re-validation of the reliability of the features used, but also allows the user to trace back to the clinical information of those primary tumours, potentially revealing patients with inferior clinical outcome or higher metastatic risk.
As described in Implementation, arrays in the same matrix can be annotated as "Supervised" or "Testing" samples in SET. Visualization of their PS information in the same plot enables the user to re-evaluate the discriminative power and validate the prediction power of a signature simultaneously.
Application on multi-class datasets
For datasets containing multiple phenotypes, one-versus-all comparisons can be performed to filter associated markers. This strategy has proven successful in several high-quality microarray experiments, and the incorporation of algorithms designed for multivariate issues into the next version of SET is currently in progress.
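A one-versus-all comparison reduces a multi-class dataset to a series of two-group problems. The Python sketch below shows one way to produce the group labels; the function name and the convention that 1 marks the target phenotype and 2 marks all others are assumptions made for this example.

```python
def one_versus_all(phenotypes):
    """Build one two-group labelling per phenotype.

    phenotypes : list with one phenotype name per array
    Returns a dict mapping each phenotype to a label list in which
    1 marks arrays of that phenotype and 2 marks all other arrays.
    """
    return {
        target: [1 if p == target else 2 for p in phenotypes]
        for target in sorted(set(phenotypes))
    }


if __name__ == "__main__":
    for target, labels in one_versus_all(["A", "B", "C", "A"]).items():
        print(target, labels)
```

Each resulting label list can then be evaluated like any other two-group comparison.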
SET and biological relevance analysis
Although it is logical to assume that signature genes are biologically related to one another (for instance, through involvement in common pathways or genetic networks), identifying the biological relevance of input or output genes is not the primary function of SET. This tool is principally aimed at providing a gene filtration threshold for gene identification. Once a gene set of interest has been identified, the candidate genes can be applied to other biologically/clinically relevant analyses (such as Gene Ontology or Gene Set Enrichment Analysis) to determine the biological significance of those genes.
SET provides a gene filtration threshold for gene identification between biological/clinical analyses and typical feature selection tools. SET is focused on the "evaluation" of input/selected genes to suggest their prediction/classification power. It rapidly narrows down candidate diagnostic markers from numerous signature genes and offers prediction information. The application of SET to filter out a smaller number of diagnostic markers from publicly accessible databases was exemplified in this report. Taken together, the flexibility and reliability of SET make it a valuable tool for various evaluations in clinical research.
Availability and requirements
Project name: SET
Project home page: http://microarray.ym.edu.tw/tools/set/
Operating system(s): Platform independent
Programming language: Java and Java Web Start
Other requirements: Java 1.5.1 or higher
We thank Miss LL Li and Mr. KC Kuo for the language revision of the manuscript. This work is supported by grants from the National Research Program for Genomic Medicine (NRPGM), National Science Council (NSC), Taiwan (NSC95-3112-B-010-012 and NSC96-3112-B-010-009), in part by other grants from NSC (NSC95-2320-B-010-018 and NSC96-2320-B-010-026), a grant from the Taipei City Hospital (95002-62-086), and a grant from Ministry of Education, Aim for the Top University Plan. We also acknowledge the efforts of IGC and expO for their valuable microarray data.
1. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530–536. doi:10.1038/415530a
2. Chen HY, Yu SL, Chen CH, Chang GC, Chen CY, Yuan A, Cheng CL, Wang CH, Terng HJ, Kao SF, Chan WK, Li HN, Liu CC, Singh S, Chen WJ, Chen JJ, Yang PC: A five-gene signature and clinical outcome in non-small-cell lung cancer. N Engl J Med 2007, 356(1):11–20. doi:10.1056/NEJMoa060096
3. Ramaswamy S, Ross KN, Lander ES, Golub TR: A molecular signature of metastasis in primary solid tumors. Nat Genet 2003, 33(1):49–54. doi:10.1038/ng1060
4. De Smet F, Pochet NL, Engelen K, Van Gorp T, Van Hummelen P, Marchal K, Amant F, Timmerman D, De Moor BL, Vergote IB: Predicting the clinical behavior of ovarian cancer from gene expression profiles. Int J Gynecol Cancer 2006, 16 Suppl 1: 147–151. doi:10.1111/j.1525-1438.2006.00321.x
5. Valk PJ, Verhaak RG, Beijen MA, Erpelinck CA, Barjesteh van Waalwijk van Doorn-Khosrovani S, Boer JM, Beverloo HB, Moorhouse MJ, van der Spek PJ, Lowenberg B, Delwel R: Prognostically useful gene-expression profiles in acute myeloid leukemia. N Engl J Med 2004, 350(16):1617–1628. doi:10.1056/NEJMoa040465
6. Simon R, Radmacher MD, Dobbin K, McShane LM: Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003, 95(1):14–18.
7. Deb K, Raji Reddy A: Reliable classification of two-class cancer data using evolutionary algorithms. Biosystems 2003, 72(1–2):111–129. doi:10.1016/S0303-2647(03)00138-2
8. Jirapech-Umpai T, Aitken S: Feature selection and classification for microarray data analysis: evolutionary methods for identifying predictive genes. BMC Bioinformatics 2005, 6: 148. doi:10.1186/1471-2105-6-148
9. Sun Y, Goodison S, Li J, Liu L, Farmerie W: Improved breast cancer prognosis through the combination of clinical and genetic markers. Bioinformatics 2007, 23(1):30–37. doi:10.1093/bioinformatics/btl543
10. Pochet NL, Janssens FA, De Smet F, Marchal K, Suykens JA, De Moor BL: M@CBETH: a microarray classification benchmarking tool. Bioinformatics 2005, 21(14):3185–3186. doi:10.1093/bioinformatics/bti495
11. Medina I, Montaner D, Tarraga J, Dopazo J: Prophet, a web-based tool for class prediction using microarray data. Bioinformatics 2007, 23(3):390–391. doi:10.1093/bioinformatics/btl602
12. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet 2006, 38(5):500–501. doi:10.1038/ng0506-500
13. Su Y, Murali TM, Pavlovic V, Schaffer M, Kasif S: RankGene: identification of diagnostic genes based on expression data. Bioinformatics 2003, 19(12):1578–1579. doi:10.1093/bioinformatics/btg179
14. Wang Y, Makedon FS, Ford JC, Pearlman J: HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data. Bioinformatics 2005, 21(8):1530–1537. doi:10.1093/bioinformatics/bti192
15. Ding C, Peng H: Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol 2005, 3(2):185–205. doi:10.1142/S0219720005001004
16. MacDonald TJ, Brown KM, LaFleur B, Peterson K, Lawlor C, Chen Y, Packer RJ, Cogen P, Stephan DA: Expression profiling of medulloblastoma: PDGFRA and the RAS/MAPK pathway as therapeutic targets for metastatic disease. Nat Genet 2001, 29(2):143–152. doi:10.1038/ng731
17. Ye QH, Qin LX, Forgues M, He P, Kim JW, Peng AC, Simon R, Li Y, Robles AI, Chen Y, Ma ZC, Wu ZQ, Ye SL, Liu YK, Tang ZY, Wang XW: Predicting hepatitis B virus-positive metastatic hepatocellular carcinomas using gene expression profiling and supervised machine learning. Nat Med 2003, 9(4):416–423. doi:10.1038/nm843
18. Reich M, Ohm K, Angelo M, Tamayo P, Mesirov JP: GeneCluster 2.0: an advanced toolset for bioarray analysis. Bioinformatics 2004, 20(11):1797–1798. doi:10.1093/bioinformatics/bth138
19. Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002, 97: 77–87. doi:10.1198/016214502753479248
20. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286(5439):531–537. doi:10.1126/science.286.5439.531
21. Yang TP, Chang TY, Lin CH, Hsu MT, Wang HW: ArrayFusion: a web application for multi-dimensional analysis of CGH, SNP and microarray data. Bioinformatics 2006, 22(21):2697–2698. doi:10.1093/bioinformatics/btl457
22. Radmacher MD, McShane LM, Simon R: A paradigm for class prediction using gene expression profiles. J Comput Biol 2002, 9(3):505–511. doi:10.1089/106652702760138592
23. Wang HW, Trotter MW, Lagos D, Bourboulia D, Henderson S, Makinen T, Elliman S, Flanagan AM, Alitalo K, Boshoff C: Kaposi sarcoma herpesvirus-induced cellular reprogramming contributes to the lymphatic endothelial gene expression in Kaposi sarcoma. Nat Genet 2004, 36(7):687–693. doi:10.1038/ng1384
24. Tran-Thang C, Kruithof E, Lahm H, Schuster WA, Tada M, Sordat B: Modulation of the plasminogen activation system by inflammatory cytokines in human colon carcinoma cells. Br J Cancer 1996, 74(6):846–852.
25. Yeatman TJ, Chambers AF: Osteopontin and colon cancer progression. Clin Exp Metastasis 2003, 20(1):85–90. doi:10.1023/A:1022502805474
26. van de Vijver MJ, He YD, van't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002, 347(25):1999–2009. doi:10.1056/NEJMoa021967
27. Yeang CH, Ramaswamy S, Tamayo P, Mukherjee S, Rifkin RM, Angelo M, Reich M, Lander E, Mesirov J, Golub T: Molecular classification of multiple tumor types. Bioinformatics 2001, 17 Suppl 1: S316–22.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.