SIGNATURE: A workbench for gene expression signature analysis
© Chang et al; licensee BioMed Central Ltd. 2011
Received: 7 September 2011
Accepted: 14 November 2011
Published: 14 November 2011
The biological phenotype of a cell, such as a characteristic visual image or behavior, reflects activities derived from the expression of collections of genes. As such, an ability to measure the expression of these genes provides an opportunity to develop more precise and varied sets of phenotypes. However, this approach requires computational methods that are difficult to implement and apply, and thus there is a critical need for intelligent software tools that can reduce the technical burden of the analysis. Tools for gene expression analyses are unusually difficult to implement in a user-friendly way because their application requires a combination of biological data curation, statistical and computational methods, and database expertise.
We have developed SIGNATURE, a web-based resource that simplifies gene expression signature analysis by providing software, data, and protocols to perform the analysis successfully. This resource uses Bayesian methods for processing gene expression data coupled with a curated database of gene expression signatures, all carried out within a GenePattern web interface for easy use and access.
SIGNATURE is available for public use at http://genepattern.genome.duke.edu/signature/.
Gene expression signatures are powerful tools that can reveal a range of biologically and clinically important characteristics of biological samples. In recent years, signatures have been developed that can differentiate distinct subtypes of tumors, identify important cellular responses to the environment (for example, hypoxia), predict clinical outcomes in cancer, and model the activation of signaling pathways. The power of gene expression signatures derives from their ability to connect an in vitro experimental state with an in vivo one in a quantitative manner. Commonly, the term gene expression signature has been used in two ways. In one, the signature consists of a set of genes that share a common pattern of expression. Sometimes this can be reported as genes that increase or decrease in expression, but the basic characteristic of the signature is the identity of the genes. Because of this, these signatures are often called gene sets. Gene sets have been curated from the literature and collected into databases such as MSigDB and GeneSigDB [2, 3]. Tools have been developed that can analyze gene sets by looking for shared function or characteristics such as Gene Ontology terms or drug sensitivity. Another tool, single-sample GSEA, has previously been applied to predict the co-regulation of gene sets from MSigDB in gene expression samples. Evidence of co-regulation is then used to infer the activation of the phenotype embodied by the gene set.
The second type of signature relates the magnitude of increase or decrease in gene expression, in the form of weighted values, to a biological phenotype using a quantitative predictive model [6–16]. These signatures are often developed from experimental conditions that precisely control the phenotype of interest, for instance, the activation of a cell signaling pathway or the response of cells to a defined stimulus. Because the signature consists of a quantitative measure of the expression levels of the genes that define the phenotype, it allows a direct measurement of the phenotype, rather than an indirect inference through co-regulation of genes in a gene set. A limitation of this approach, however, is the complexity of the methods used to derive and analyze the signatures, which makes it difficult to apply without significant multidisciplinary expertise.
Three major obstacles hinder the broad use of signatures. First, gene expression signature analysis requires the rigorous application of complex statistical methodologies to gene expression data. Second, it requires the acquisition and validation of data that properly capture the biological state of interest. Third, it requires a computational infrastructure that makes the statistical software and data available through an easy-to-use interface. In sum, gene expression signature analysis requires a confluence of expertise across a range of disciplines, including statistics, biology, and computer science.
While others have previously made use of our approach, it does require a level of expertise and computational infrastructure not always available in biological laboratories. This bioinformatic investigation, which requires the proper selection and application of statistical algorithms as well as biological curation and validation of the signatures, can be daunting. The challenge, therefore, is how to develop software tools that enable such analyses for the general user. While it has long been recognized that software can target different types of users, a set of principles for biologist-friendly software was recently described. In short, the recommendations are that the software 1) requires no knowledge of programming, 2) allows application of advanced methods, 3) can be used on different operating systems, and 4) provides a natural language description of the results. While such software has been developed for biological sequence alignment, sequence annotation, phylogenetic analysis, and comparison of prokaryotic genomes, no such platform exists for gene expression signature analysis. Because of this, and also because of the technical difficulty of performing gene expression analysis, we believe there is a need for a platform that captures a carefully refined analysis workflow, couples algorithms and data, and enables researchers to predict gene expression signatures on their samples.
To address the critical need for a platform for gene expression signature analysis, we have developed a collection of tools over the course of several years. First, we have developed BinReg, a statistical algorithm to predict the activation of a gene expression signature on a data set [23, 24]. Second, we have curated a database of signatures that predict the activation of oncogenic pathways. Now, we report on the development of a computational platform that combines these in a biologist-friendly interface, using the principles previously established. Here we describe the three components of a novel gene expression signature analysis platform, which we collectively call SIGNATURE.
where Ф is the cumulative distribution function of a normal distribution, Y is a vector of the posterior probabilities that the signature is active in each sample, and γ, the parameter to be sampled, is a k-dimensional vector of the contribution of each metagene. For the development of gene expression signatures, the number of metagenes is a configurable parameter: higher numbers of metagenes increase the complexity of the model, at the risk of overfitting the training data.
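To make the role of Ф and γ concrete, the following is a minimal Python sketch of a probit link applied to metagene scores. It assumes the linear predictor is the plain dot product of a sample's k metagene scores with γ; the function names and the numeric values are hypothetical, and this is not the BinReg implementation.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_probability(metagenes, gamma):
    """Probability that the signature is active in one sample.

    metagenes: k metagene scores for the sample (hypothetical values).
    gamma: k weights, one per metagene (in the real model these are sampled).
    """
    if len(metagenes) != len(gamma):
        raise ValueError("metagenes and gamma must have the same length")
    # Assumed linear predictor: dot product of metagene scores and weights.
    linear_predictor = sum(x * g for x, g in zip(metagenes, gamma))
    return normal_cdf(linear_predictor)

# Hypothetical sample described by k = 2 metagenes.
sample_metagenes = [1.2, -0.4]
gamma = [0.9, 0.3]
print(round(probit_probability(sample_metagenes, gamma), 3))
```

Adding a metagene adds one entry to both vectors, which is how a larger k makes the model more flexible and more prone to overfitting.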
The model is sampled using a standard Markov chain Monte Carlo algorithm. It produces the posterior probabilities Y as well as a 95% credible interval. Y should be interpreted as the probability that the pathway is active in each sample. The credible interval for Y gives the lower and upper bounds that contain the prediction with 95% probability. Tighter bounds indicate higher confidence in the posterior probability Y, and wider ones indicate lower confidence. This statistical model has previously been described in detail.
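As an illustration of how a posterior probability and its 95% credible interval can be summarized from sampler output, here is a small Python sketch. It assumes the MCMC draws of Y for one sample are already available as a list; the draws below are simulated stand-ins, and the nearest-rank percentile rule is one simple choice rather than the method used by the authors.

```python
import random

def summarize_posterior(draws, level=0.95):
    """Return (mean, lower, upper) for the posterior draws of Y in one sample."""
    if not draws:
        raise ValueError("need at least one posterior draw")
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    lower = ordered[round(tail * last)]          # nearest-rank lower bound
    upper = ordered[round((1.0 - tail) * last)]  # nearest-rank upper bound
    mean = sum(ordered) / len(ordered)
    return mean, lower, upper

# Hypothetical stand-in for MCMC output: 2000 draws concentrated near 0.8.
rng = random.Random(0)
draws = [rng.betavariate(16, 4) for _ in range(2000)]
mean, lower, upper = summarize_posterior(draws)
print(f"P(active) = {mean:.2f}, 95% credible interval [{lower:.2f}, {upper:.2f}]")
```

A narrow interval around the mean corresponds to the high-confidence case described above; a wide one signals that the prediction for that sample should be treated with caution.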
Once a signature for a phenotype is developed, it can be used to score the phenotype in a new collection of samples. In all, a gene expression signature analysis requires seven parameters: 1, 2) the train0 and train1 data, 3) the number of genes in the model, 4) the number of metagenes, 5) the algorithm used to preprocess the data set, 6) whether to apply quantile normalization, and 7) whether to apply shift-scale normalization. The first two parameters are the gene expression data that define the two cellular states. The next parameter specifies the number of genes to include in the statistical model. Then, the number of metagenes controls the complexity of the model. For parameter five, we support two methods of preprocessing, RMA and MAS5. Parameters six and seven concern methods for normalizing the data to account for technical variation between the training and test sets. Quantile normalization has been described extensively in the literature. However, we use a variation of the algorithm in which the quantiles are computed entirely from the training set, to preserve independence between the training and test data. Finally, shift-scale normalization is an additional normalization method that, in short, adjusts the centroid and variance of the test set to match those of the training set.
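The two normalization steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the SIGNATURE code: each sample is represented as a list of expression values in a fixed gene order, the quantile reference is the mean of the sorted training samples at each rank, ties are broken by gene order, and shift-scale is applied one gene at a time using the mean and standard deviation. All function names are hypothetical.

```python
import statistics

def training_reference(train_samples):
    """Mean expression at each rank, computed from the training samples only."""
    sorted_samples = [sorted(sample) for sample in train_samples]
    return [statistics.fmean(values) for values in zip(*sorted_samples)]

def quantile_normalize(sample, reference):
    """Replace each value by the reference value that has the same rank."""
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    normalized = [0.0] * len(sample)
    for rank, gene_index in enumerate(order):
        normalized[gene_index] = reference[rank]
    return normalized

def shift_scale(test_values, train_values):
    """Shift and scale one gene's test values to the training mean and SD."""
    train_mean = statistics.fmean(train_values)
    train_sd = statistics.pstdev(train_values)
    test_mean = statistics.fmean(test_values)
    test_sd = statistics.pstdev(test_values)
    if test_sd == 0:
        # A constant gene carries no spread to rescale; only shift it.
        return [train_mean for _ in test_values]
    return [(v - test_mean) / test_sd * train_sd + train_mean for v in test_values]

# Two training samples and one test sample over three genes (made-up values).
train = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
reference = training_reference(train)
print(quantile_normalize([10.0, 30.0, 20.0], reference))
print(shift_scale([10.0, 20.0, 30.0], [1.0, 2.0, 3.0]))
```

Because the reference is built from the training samples alone, adding or removing test samples never changes how any other sample is normalized.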
Over the past five years, we have developed and curated a collection of gene expression signatures that predict the activation of a large number of important cell signaling pathways, such as Ras, Myc, p53, and others. Although this work has focused on developing signatures for pathways relevant to the study of cancer biology, the conceptual framework for this signature development is applicable across a wide range of other contexts. We envision that the current database would be most directly applicable to cancer studies, but there are also clear applications to other diseases with functional aberrations in these common pathways.
To simplify the analyses for general users, we determine empirically the best values for the seven parameters described above. Using a leave-one-out cross validation approach, we classify the samples in the training set. To ensure that the model is not overfitted to artifacts or confounding factors in the original data, we then validate the selected parameters using an independent biochemical and/or genetic marker of pathway activity. The type of indicator used is specific to a pathway and depends on how it works. For example, to validate the PI3K signature, we compared against relative phosphorylated (active) p110 protein levels, and for the Estrogen Receptor (ER) pathway signature, we compared against the ER status in human breast tumors as determined by immunohistochemistry.
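The leave-one-out procedure can be sketched in Python as below. A nearest-centroid classifier is swapped in for the Bayesian model purely to keep the example short; the data values and function names are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier: assign x to the class with the closer mean profile."""
    centroids = {}
    for label in set(train_y):
        rows = [row for row, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(x, centroids[label]))

def leave_one_out_accuracy(samples, labels):
    """Hold out each training sample in turn and classify it with the rest."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Made-up training data: three train0 samples and three train1 samples.
samples = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
           [3.0, 3.1], [2.9, 3.2], [3.1, 2.8]]
labels = [0, 0, 0, 1, 1, 1]
print(leave_one_out_accuracy(samples, labels))
```

In a parameter search, a loop of this kind would be repeated for each candidate setting (for example, each number of genes and metagenes), with the independent biochemical or genetic marker then used as the external check.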
Our signature database currently consists of 18 validated signatures, and we are actively developing and curating additional ones.
For gene expression signature analysis, we have developed software tools to cover two major use cases.
To use Score Signatures, one submits a gene expression data set of interest, such as that from a collection of tumor samples. The application will then apply our Bayesian algorithm to predict the activation of the signatures in the database. The output is a series of probability scores for each signature, reflecting the extent to which the signature is represented in each sample from the test data set. These probability scores are depicted in a heatmap that shows the pattern of activation of the pathways across the data set as determined by hierarchical clustering. Furthermore, Score Signatures also provides raw data as tab-delimited text files that can be accessed with standard tools such as Microsoft Excel and used to develop additional plots. These results are summarized in a human-readable report with a detailed description of the analysis as well as guidelines for interpreting the results.
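As a sketch of how the tab-delimited output might be processed outside of a spreadsheet, the Python example below reads a table of probability scores and lists the signatures predicted to be active in each sample. The column layout, the sample and signature names, and the 0.5 threshold are all assumptions made for illustration; the actual files produced by Score Signatures may be organized differently.

```python
import csv
import io

# Hypothetical file contents: one row per sample, one column per signature.
example_text = (
    "Sample\tRAS\tMYC\n"
    "tumor_01\t0.91\t0.12\n"
    "tumor_02\t0.08\t0.77\n"
)

def read_scores(handle):
    """Parse a tab-delimited score table into {sample: {signature: probability}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    scores = {}
    for row in reader:
        sample = row.pop("Sample")
        scores[sample] = {signature: float(p) for signature, p in row.items()}
    return scores

def active_signatures(scores, threshold=0.5):
    """List, per sample, the signatures whose probability exceeds the threshold."""
    return {
        sample: sorted(sig for sig, p in by_signature.items() if p > threshold)
        for sample, by_signature in scores.items()
    }

scores = read_scores(io.StringIO(example_text))
print(active_signatures(scores))
```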
Each Score Signatures analysis consists of Bayesian regression calculations that predict the activation of each signature from the signature database. A full analysis is described using a large number of parameters, seven for each pathway in the database. The challenge here is how to make the analysis accessible to users who are not familiar with the technical details of gene expression analysis. We solve this issue by storing the validated parameters in the database. By default, the values are retrieved from the signature database, ensuring that the signature runs in precisely the manner originally defined. However, for expert users, we make it possible to refine each parameter, and if a parameter is changed, the system will document the deviation from the default. In this way, the needs of both general and expert users can be met.
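The default-plus-override behavior described above can be sketched in Python as follows. The parameter names and default values are hypothetical placeholders rather than entries from the signature database, and the function is an illustration of the idea, not the system's code.

```python
# Hypothetical validated defaults for one signature (values are made up).
DEFAULTS = {
    "num_genes": 150,
    "num_metagenes": 2,
    "preprocessing": "RMA",
    "quantile_normalize": True,
    "shift_scale_normalize": True,
}

def resolve_parameters(defaults, overrides=None):
    """Merge user overrides into the defaults and record every deviation."""
    overrides = overrides or {}
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    resolved = dict(defaults)
    deviations = []
    for name, value in overrides.items():
        if value != defaults[name]:
            deviations.append(
                f"{name}: default {defaults[name]!r} changed to {value!r}"
            )
        resolved[name] = value
    return resolved, deviations

# A general user supplies nothing; an expert user changes one parameter.
print(resolve_parameters(DEFAULTS)[1])
print(resolve_parameters(DEFAULTS, {"num_metagenes": 3})[1])
```

Returning the list of deviations alongside the resolved values makes it straightforward to include them in the analysis report.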
Score Signatures provides a convenient way to apply the signatures from our signature database on a data set. However, it does not have an ability to generate a new signature. To address this, we have produced a second application, Create Signature, to develop novel gene expression signatures.
While Score Signatures can be used by investigators with little or no knowledge of the details of the underlying methodology, Create Signature requires an understanding of the machine learning framework and the parameters used to create the signatures. The user specifies the values for a total of 15 parameters. In addition to the seven parameters for the signatures as described above, it also includes parameters that govern the MCMC simulation of the Bayesian model, and others (such as other normalization methods) that we have not used in our signatures.
Once the parameters are specified, Create Signature generates a statistical model from the training set and predicts signature activity in both the training set (using leave-one-out cross validation) and the test set (after building a model from the entire training set). Like Score Signatures, Create Signature also provides publication-ready plots, raw data, and a human-readable report of the key results, fulfilling a critical requirement of user-friendly software described above.
Our analysis tools are delivered through GenePattern. The GenePattern platform provides a web-based interface for external programs (or modules in GenePattern terminology) via a plug-in architecture. However, one limitation of GenePattern is that it does not have the means to provide the context-dependent interface that Score Signatures requires. That is, the interface for Score Signatures depends on the current state of the signature database, as well as the requirements of the user. Score Signatures currently requires a total of 74 parameters, but only two are likely to be changed by the vast majority of users. In this situation, the system needs the facility to hide rarely used parameters from novice users while allowing advanced users to tune them. This is not currently possible in GenePattern.
To address the limitations of GenePattern, we have extended GenePattern with an interface generator layer. An interface generator is an optional component of a module that is responsible for defining its interface, that is, the parameters that are provided for the user. This is implemented by modifying the GenePattern source code so that when a user accesses a module, GenePattern can retrieve the interface from the interface generator instead of its own default mechanism. Technically, interface generators are CGI scripts, which provide them the ability to access external resources, such as the signature database.
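The idea of an interface generator can be illustrated with the Python sketch below, which builds the parameter list from the current contents of a signature database and hides rarely changed parameters unless an advanced view is requested. It is a plain function rather than a CGI script, it does not use any GenePattern API, and the database contents, parameter names, and the choice of which two parameters are always shown are hypothetical.

```python
# Hypothetical snapshot of the signature database: per-signature defaults.
SIGNATURE_DB = {
    "MYC": {"num_genes": 200, "num_metagenes": 2},
    "RAS": {"num_genes": 150, "num_metagenes": 1},
}

def generate_interface(signature_db, advanced=False):
    """Return the parameters to display for the current database state."""
    # Parameters assumed to be relevant to every user.
    params = [
        {"name": "expression_data", "default": None, "advanced": False},
        {"name": "output_name", "default": "scores", "advanced": False},
    ]
    # One group of tunable parameters per signature currently in the database.
    for signature, defaults in sorted(signature_db.items()):
        for param, value in defaults.items():
            params.append(
                {"name": f"{signature}.{param}", "default": value, "advanced": True}
            )
    return [p for p in params if advanced or not p["advanced"]]

print([p["name"] for p in generate_interface(SIGNATURE_DB)])
print(len(generate_interface(SIGNATURE_DB, advanced=True)))
```

Because the list is rebuilt on every request, adding a signature to the database changes the advanced view without any change to the module definition.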
We have developed a public software platform, SIGNATURE, that simplifies gene expression signature analysis by providing an easy-to-use GenePattern interface on top of a complex infrastructure of analysis software and a signature database. Specifically, we have developed BinReg, a Bayesian probit regression algorithm that has been supplemented with metagenes and normalization functions to handle the idiosyncrasies of gene expression data. We have also curated and validated a database of 18 gene expression signatures for activated oncogenes. Finally, we have significantly extended GenePattern by developing an interface generator layer that can produce context-sensitive interfaces to fit the needs of the user.
One limitation of SIGNATURE is that the predictions are dependent upon the quality of the data. One potential factor that can confound the interpretation of the results is the presence of batch effects or other technical variation that remains after the applied normalization. In our experience, technical artifacts cause broad changes in the expression profiles that lead to homogeneous predictions. That is, the predicted scores tend to cluster around the same probability, typically around 0% or 100%. This issue highlights the fact that these predictions should be confirmed with alternate assays. Currently, the tools available within SIGNATURE require expression profiles to be annotated with probe sets from Affymetrix U133 microarrays. To apply them to microarrays from other platforms, the probes would need to be converted to these U133 probe sets. Internally, we have successfully applied SIGNATURE to gene expression data from Illumina BeadArrays (data not shown), suggesting a high degree of reproducibility in the gene expression levels between these two platforms, consistent with prior reports [29, 30]. However, we have had more limited success in converting signals from cDNA arrays, and have not tried applying these analyses to expression data from sequencing platforms. We believe the ability to apply these methods depends on the reproducibility of the expression signals across platforms.
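A simple check for the homogeneous predictions described above could look like the following Python sketch. The margin and fraction thresholds are arbitrary assumptions chosen for illustration, and the function is not part of SIGNATURE; a flag from such a check is only a prompt to inspect the data for batch effects, not evidence of one.

```python
def looks_homogeneous(probabilities, margin=0.05, fraction=0.9):
    """Flag a data set whose scores pile up near 0 or near 1.

    margin: how close to 0 or 1 a score must be to count (assumed value).
    fraction: share of samples that must fall at one extreme (assumed value).
    """
    if not probabilities:
        raise ValueError("need at least one predicted probability")
    near_zero = sum(p <= margin for p in probabilities)
    near_one = sum(p >= 1.0 - margin for p in probabilities)
    return max(near_zero, near_one) / len(probabilities) >= fraction

# Made-up scores for one signature across the samples of two data sets.
print(looks_homogeneous([0.99, 0.98, 1.00, 0.97]))
print(looks_homogeneous([0.10, 0.90, 0.50, 0.02]))
```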
Modules available in SIGNATURE
To predict activation of pathways in gene expression data.
To create a new gene expression signature using a training set.
To remove technical variation across one or more gene expression data sets.
To find subtypes within a gene expression data set.
To assign a subtype to gene expression data using a previously developed model.
To dissect a gene expression data set into modules.
To score the modules in gene expression data using a previously developed model.
SIGNATURE is available for public use, without need for a material transfer agreement, at http://genepattern.genome.duke.edu/signature/. This page includes a link to the modules available on GenePattern, as well as sample data for testing purposes. The source code and gene expression signature database are also available from this page.
Project name: SIGNATURE
Project home page: http://genepattern.genome.duke.edu/signature/
Operating system: platform independent
Programming language: Python, C, R, Matlab
Other requirements: web browser
Any restrictions to use by non-academics: none
We thank the members and collaborators of the Nevins lab for helping to test and refine the software, in particular Jenny Freedman, Ashley Chi, Jeff VanDeusen, Daphne Friedman, Eran Andrechek, Holly Dressman, and Andrea Bild. We also thank anonymous reviewers for their helpful comments. JTC is supported by NIH R00LM009837 and grant R1006 from the Cancer Prevention Research Institute of Texas. JRN is supported by NIH 5R01CA106520 and NIH U54CA112952.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.