A simple and robust method for connecting small-molecule drugs using gene-expression signatures
© Zhang and Gant. 2008
Received: 15 February 2008
Accepted: 02 June 2008
Published: 02 June 2008
Skip to main content
© Zhang and Gant. 2008
Received: 15 February 2008
Accepted: 02 June 2008
Published: 02 June 2008
Interaction of a drug or chemical with a biological system can result in a gene-expression profile or signature characteristic of the event. Using a suitably robust algorithm these signatures can potentially be used to connect molecules with similar pharmacological or toxicological properties by gene expression profile. Lamb et al first proposed the Connectivity Map [Lamb et al (2006), Science 313, 1929–1935] to make successful connections among small molecules, genes, and diseases using genomic signatures.
Here we have built on the principles of the Connectivity Map to present a simpler and more robust method for the construction of reference gene-expression profiles and for the connection scoring scheme, which importantly allows the valuation of statistical significance of all the connections observed. We tested the new method with two randomly generated gene signatures and three experimentally derived gene signatures (for HDAC inhibitors, estrogens, and immunosuppressive drugs, respectively). Our testing with this method indicates that it achieves a higher level of specificity and sensitivity and so advances the original method.
The method presented here not only offers more principled statistical procedures for testing connections, but more importantly it provides effective safeguard against false connections at the same time achieving increased sensitivity. With its robust performance, the method has potential use in the drug development pipeline for the early recognition of pharmacological and toxicological properties in chemicals and new drug candidates, and also more broadly in other 'omics sciences.
One of the most fundamental challenges in all forms of 'omic technologies is the connection of biological event signatures with others previously derived to allow the recognition of new molecule properties or biological alteration. Simple, robust, and efficient matching methods are required to connect a new gene expression signature with those in a database. This problem was first tackled by Lamb et al who introduced the Connectivity Map as a resource and tool to connect small-molecule drugs, genes, and diseases. The Connectivity Map achieved a good degree of success, but also suffered from several deficiencies, particularly an inability to apply a measure of statistical validity at the individual reference signature level to allow rational filtering of the results to exclude false connections. We took the method of Lamb et al as a basis for development and have derived a simple, robust and statistically testable method for making connections between biological event signatures. The method was tested with genomic signatures resulting from small molecule interactions in cells, but also could be applied to any form of signature such as those from proteomic or metabolomic science.
The main assumption behind the concept of a connectivity map is that a biological state, whether physiological, pathological, or induced with chemical or genomic perturbations, can be described in terms of a genomic signature, eg., the genome-wide mRNA levels as measured by DNA microarray technologies. The working of a connectivity map involves several key components. First, a large collection of pre-built reference gene-expression profiles serve as the core database, where each reference profile characterizes a well-defined biological state. Secondly, a query gene signature from some specific studies. A query gene signature is basically a short (as compared to the list of genes in a typical reference profile) list of genes most relevant and important to characterize the biological state of the researchers' interest. Finally, a pattern matching algorithm or similarity metric defined between a query gene signature and a reference gene-expression profile to quantify the closeness or connection between the two biological states. Such a connectivity map can be used by biomedical researchers to find connections between the reference biological states and those of their own interest, leading to testable new biological hypotheses. In this paper, we present a new framework for the construction of reference profiles, new connection scoring scheme and testing procedures for the observed connections. We compare our method with that of Lamb et al, and show that more robust results are achieved using our method. In particular our method not only offers a more principled statistical procedure for testing connections, but more importantly it provides effective safeguards against false connections while at the same time achieving increased sensitivity. As a consequence it can benefit the end users by saving them time and resources in pursuing new biological hypotheses based on the findings of connectivity maps.
For the first-generation connectivity map, Lamb et al carried out a series of gene-expression profiling experiments, using 164 distinct small-molecule compounds in a few selected human cell lines. Each treatment instance consisted of one treatment sample and one (or more) vehicle control samples, whose genome wide mRNA levels were measured using Affymetrix GeneChip microarrays. In total 564 samples were microarrayed, which represented 453 different treatment instances. For example, treatment instance ID988 consisted of 1 treatment sample and 6 vehicle control samples. The treatment sample was obtained by treating human MCF7 cells with 100nMestradiol for 6 hours. The control samples were obtained by treating MCF7 cells with vehicle control for 6 hours. A gene-expression profile was constructed for each treatment instance, in which the relative expression (treatment relative to the control) of all the measured genes were specified, and sorted in descending order. A query gene signature, obtained and ordered in the same manner, can be compared to each reference profile in the Connectivity Map to calculate a connectivity score. For "positive connectivity", the up-regulated genes of the query signature find matches near the top of the reference profile, and the down-regulated genes find matches near the bottom of the reference profile. For "negative connectivity", the matches are opposite.
We obtained Lamb et al's data set (Accession Number GSE5258) from the GEO (Gene Expression Omnibus) database, and rebuilt the 453 reference gene-expression profiles using a new ranking scheme based on the following guiding principles: (1) A treatment instance was defined relative to a control, thus the effect of the treatment could be characterized by the relative differential expression status of all the genes together, (2) different genes were affected to different extents by the treatment, so genes which showed a greater differential expression should have more weight in characterizing this treatment, and (3) up- and down-regulated genes should be treated equally in a unified manner. This meant that a 2-fold down-regulated gene was considered as equally important as a 2-fold up-regulated gene in defining a reference profile. There are several reasons for the choice of treating up- and down-regulated genes equally. Theoretically, unless we have a lot of further information about so many genes on the microarray it is difficult to decide whether this 2-fold up-regulated gene is more important than that 2-fold down-regulated gene or the opposite. So it is logical to assign them equal weights. Another reason is the consideration of symmetry: if a gene is 2-fold up-regulated in sample 1 versus sample 2, it can also be viewed as 2-fold down-regulated in sample 2 versus sample 1. We should emphasize that assigning two genes equal weights does not imply in any sense they share the same molecular mechanism. Even two up-regulated genes with the same fold change could be involved in very different molecular mechanisms. To adhere to the above guidelines, an obvious choice for organizing the genes is the logarithm of the expression ratio (treatment over control). Thus instead of treating the down- and up-regulated genes separately as in the method of Lamb et al, we ordered genes in a reference profile by the absolute value of their expression log-ratios. Therefore the most differentially expressed genes (either up or down) appear first in the list, and those non-differentially expressed genes appear at the bottom of the list. In this way, the genes are ordered by their importance in defining the reference profile. It is then straightforward to assign ranks to them. Suppose there are in totalNgenes, the first gene in the list will be assigned a rankNif it is up-regulated, or a rank-Nif it is down-regulated. In general theith gene in the list will be ranked with (N - i+ 1) for up-regulation or-(N - i+ 1) for down-regulation. With this new ranking method, the importance of a gene is reflected by the absolute value of its rank, while the sign of its rank indicates its regulation status. The consequence and advantage of this method for creating reference profiles is that attaching statistical significance to the connection observed is a relatively straightforward step.
A query gene signature can be an ordered gene list, or just a collection of genes without specific ordering. We will refer to these two types of query gene signatures asorderedandunorderedgene signatures respectively. For an ordered gene signature, we rank the genes in the list in the same way as for a reference profile. Namely, the most important (differentially expressed) gene in the signature will be assigned a rankmor-mdepending on its regulation status, wheremis the number of genes in the signature. While the least important gene in the signature be ranked 1 or -1.
In another equally interesting situation, where themgenes in the signature match the firstmgenes in the reference profile in the right order, but the sign of each gene in the query is different from its sign in the reference, the connection strength is , the opposite of Eq.(2).
Given a query signature gene and a reference gene-expression profile, we can calculate their connection score by
c=C(R,s)/C max (N,m).(4)
So a connection scorec= 1 means that the gene signature has the maximum positive connection strength with the reference profile, which indicates that the experimental condition that gave rise to this gene signature had the strongest possible correlation with the treatment instance that generated the reference profile. A connection scorec= -1 indicates that the two experimental perturbations were most inversely correlated. In general, a connection scorecwill be within the range of (-1, 1).
As for most biomedical experiments with unavoidable biological and technical variation, statistical significance is a crucial aid to the interpretation and subsequent validation of the result. Here we propose calculating the p-value associated with a connection score by testing the following null hypothesis.
Null hypothesisH0: For a reference gene-expression profileRand a query gene signatures of lengthm, the null hypothesisH0states that there is no underlying biological connection between the two, and that the query signaturesis merely a randomm-gene signature, as generated by Procedure 1 described below.
Procedure 1:Generation of a randomm-gene signature. LetRbe a given reference gene-expression profile ofNgenes. Selectmgenes sequentially and randomly from theNgenes (without replacement), and assign +1 (up-regulation) or -1 (down-regulation) randomly with equal probability to each of themselected genes. If this gene signature is to be used as an ordered list, its order is just the order in which themgenes are selected; or if this gene signature is to be used as merely a collection of genes, then the order is irrelevant.
where is the connection score between a random gene signature and the reference profile. To estimate the p-value, a large number (eg., 105) of random gene signatures of the same lengthmcan be generated using Procedure 1 and their connection scores to reference profileRcalculated using Eq.(4), the proportion of random scores that are no less than the observed scorescin absolute value is an estimate of the two-tailed p-value.
The 453 individual treatment instances of the data set GSE5258 were created using only 164 distinct small-molecule compounds. Some treatment instances were replication experiments using the same compound at the same or different doses. It is thus interesting to consider all the treatment instances of the same compound as a set, and to assess the overall connection of the set with a query gene signature.
wherenis the number of individual treatment instances belonging to the treatment set,c i is the connection score of theith instance. To test the significance of a treatment set as a whole. We used the following null hypothesis,
Null hypothesis : WhereTdenotes a set of treatment instances,R i the reference profile based on treatment instancei, andsa query gene signature of lengthm, the null hypothesis states that there is no underlying biological connection between the gene signaturesand any of the reference profiles in T. The query signature is merely a randomm-gene list generated by Procedure 1.
Thus the null hypothesis is an extension ofH 0to a higher level. Alternatively,H 0can be viewed as a special case of , in which there is only one treatment instance in the treatment set. Once the connection score for a set is observed, its associated p value can be estimated in a similar way: a large number of random gene signatures of the same lengthmare generated using Procedure 1, and the connection score of the set to each of the random gene signatures is calculated using Eq.(6); the proportion of random connection scores that are greater than the observed score in absolute value is an estimation of the p value.
Results for rds01 using the Connectivity Map.
Results for rds01 using the new method.
The two examples of random gene signatures above showed that on an individual treatment instance level, the Connectivity Map does not provide effective safeguards against possible false connections. On the treatment set level, the Connectivity Map provides a permutation p value when a set of treatment instances associated with the same compound were viewed as a whole. The significance of the set of treatment instances in the ordered list of all instances was estimated by permutation (see the Supporting Online Material for). Note that the null hypothesis we use in our set-level analysis is different from that of the Connectivity Map. Tian et al were among the first groups of authors who made explicit distinctions between two different null hypotheses concerning a set of entities (a set of genes in context of Tian et al's paper). Other authors also addressed this issue in some recent studies on the significance analysis of gene sets[5, 6]. A more detailed comparison between the null hypotheses used by the set-level analysis of the Connectivity Map and the method presented here can be found in the supplementary information [see Additional file2].
Throughout this paper, when we use the permutation p values from the Connectivity Map to control false connections, the same criteria discussed above for setting threshold p value are used. The full tabulated results of significance analysis on the connections between the two random gene signatures and the 164 treatments sets can be found in the supplementary data. Both our method and the Connectivity Map gave the right answers for these two random gene signatures, ie, no significant connections were found more than expected by chance. Therefore the set-level analysis of the Connectivity Map can provide a control over possible false connections by means of permutation p values. So in the subsequent analysis on the experimentally derived gene signatures, we only use the set-level results of the Connectivity Map, but not its instance-level results. We need to point out that, in the Connectivity Map set-level analysis, permutation p values were not available for all the 164 treatment sets. For those treatment sets which only contain 1 treatment instance or those sets with mean connectivity score 0, no permutation p values could be calculated, and hence no statistical significance is attached to them. This problem affects the coverage and consequently the sensitivity of the Connectivity Map, because real biological connections between a query gene signature and any of those treatment sets may not be recognized.
To test the ability of the new method for identifying real biological connections we utilized some of the same examples used in to compare with the Connectivity Map. The first example was a gene signature of histone deacetylase (HDAC) inhibitors (Lamb's gene signature sigs01), which was compiled from an independent study on the responses of T24, MDA435 and MDA468 cells respectively to three histone deacetylase (HDAC) inhibitors: vorinostat, MS-27–275, and trichostatin A. This gene signature consisted of 8 up- and 5 down-regulated genes, represented by 25 Affymetrix probe-set IDs on the Affymetrix HG-U133A microarrays.
As the Connectivity Map does not provide effective safeguards on individual treatment instance level against possible false connections, we can only use the results of the Connectivity Map on the treatment set level. In total 6 compounds (vorinostat, trichostatin A, resveratrol, geldanamycin, valproic acid, and 17AAG) were identified to have significant positive connectivity with the HDAC inhibitor gene signature; and 2 compounds (5182598 and fludrocortisone) had significant negative connectivity. Vorinostat, trichostatin A, and valproic acid are known HDAC inhibitors thus the identification of these can be regarded as a success of the Connectivity Map. However another known HDAC inhibitor HC-toxin, a reference profile of which was contained in the database, was not identified. This happened because there was only one treatment instance of the compound HC-toxin in the database and so no permutation p value could be obtained using the Connectivity Map. Based on their instance level results, Lamb et al highlighted HC-Toxin in as it had the 7th highest connectivity score (0.914) of all instances in the dataset. However, the two examples of random gene signatures already showed that at the instance level the Connectivity Map gave false connections even for the highest connectivity scores +1 and -1. So the rational choice is to disregard the instance level results from the Connectivity Map.
The second example was a gene signature (Lamb's gene signature sigs02) taken from an independent study of MCF7 cells treated with estradiol. This gene signature consisted of 40 up- and 89 down-regulated genes represented by 189 Affymetrix probe-set IDs on the Affymetrix HG-U133A microarrays. We tested the Connectivity Map and the method presented in this paper with this estrogen signature.
For the same reason given above, we only used the treatment set level results of the Connectivity Map, which identified 4 compounds (genistein, estradiol, tretinoin, and alpha-estradiol) as having significant positive connectivity with the estrogen signature; and 5 compounds (trichostatin A, fulvestrant, LY-294002, vorinostat, and geldanamycin) that had significant negative connectivity.
Using our set-level analysis 16 compounds were found to have significant positive connection, and 25 compounds had significant negative connection to the estrogen gene signature. The 16 compounds with positive connection included genistein, estradiol, and alpha-estradiol, all known to be estrogen receptor agonists. The 25 compounds with negative connection included fulvestrant, raloxifene and tamoxifen, all known to be estrogen receptor antagonists. In comparison, the Connectivity Map identified all the estrogen receptor agonists above, but missed all the estrogen receptor antagonists except fulvestrant. These results therefore indicate the sensitivity of our method is substantially increased. The Connectivity Map was able to detect the pure estrogen receptor antagonist fulvestrant, but missed the two compounds tamoxifen and raloxifene which have mixed antagonist and agonist estrogen receptor activities.
For further testing we compiled a new gene signature from an independent study on cardiac allograft rejection and response to immunosuppressive therapy, where patients were treated with standard immunosuppression with corticosteroids, antimetabolites, calcineurin inhibitors, and/or sirolimus. This gene signature consisted of 40 Affymetrix probe set IDs (see Table2 of). Our set-level analysis identified 29 compounds as having significant connections with this gene signature. The three top compounds were azathioprine, thalidomide, and rosiglitazone. Azathioprine is a commonly used immunosuppressive drug[13, 14], so its significant positive connection with the gene signature is a good indication that the new method works very well here. The second compound thalidomide, which had a positive connection score, also has known immunosuppressive activities, inhibits release of TNFαfrom monocytes, and modulates other cytokine actions. The recognized properties of these molecules therefore accord with the outcome of the connectivity matching. The third compound rosiglitazone had a negative connection with the signature suggesting it may have properties to reduce or mitigate the effects of immunosuppressive activity. Recently, rosiglitazone was reported to suppress cyclosporin-induced chronic transplant dysfunction and prolong survival of rat cardiac allografts, where cyclosporin is also a commonly used immunosuppressive drug.
At the instance level, our method identified 89 reference profiles as having significant connections to the immunosuppressive drug gene signature, representing 63 distinct compounds. The top 3 compounds were azathioprine, staurosporine, and trichostatin A, which all achieved positive connection scores with this gene signature. The second compound, staurosporine, a protein kinase C inhibitor, is classified as an antineoplastic and immunosuppressive antibiotic drug. The third compound, trichostatin A, was recently shown to have some immunosuppressive effects in leukemia T cells. Therefore the method of instance testing could be particularly valuable for the identification of pharmacological and toxicological properties in novel molecules.
Our results indicate that the method presented here can identify many significant connections to a query gene signature. Then what criteria should we use and which compounds should we choose if new biological hypotheses are to be developed? Our suggestion is to concentrate more on those compounds which have many replicate instances in the database. Because the results obtained for those big treatment sets do not depend heavily on the quality of a small number of treatment instances, as in the case of small treatment sets or singleton sets (treatment sets with only 1 instance each). Lamb et al also recognized the importance of having replicate instances, and noted that the power to detect connections might be greater for compounds with many replicates. In defining a treatment set, ideally only the treatment instances of the same compound with the same dose and the same cell type should be considered as a set. For example, the biological state of HL60 cells perturbed by raloxifene should be considered as a different biological state from that of MCF7 cells perturbed with the same compound, thus these two instances of raloxifene should not be put into the same set. In this paper for comparative purposes we adopted Lamb et al's choice in defining a treatment set, i.e., all the instances of a compound were grouped together as a set regardless of their possible differences in dose or cell type. Mixing the instances of a compound with different doses and/or different cell types increases the heterogeneity of an otherwise more homogenous treatment set. This tends to average out the distinct characteristics attributable to the cell type or dosage difference, making some set-level connections insignificant or their interpretation difficult. In such cases the instance level connections supported by statistical significance can be of great help in interpreting the results. For future connectivity maps, efforts should be made to provide as many replicate treatment instances (replicates with the same compound, the same dose, and the same cell type, etc) as possible, so that the undesirable reliance on individual instances can be minimized.
The successful identification of real biological connections between the reference profiles and a gene signature also depends on the quality of the gene signature. In this paper, all the three experimentally derived signatures were compiled from independent studies, in which the original authors had selected those genes as most relevant to characterize the biological states being studied. It is reasonable to ask though how many genes should be selected from the full list of genes on the microarray to best characterize a biological state? We believe this will depend on the specific biological condition being investigated, and is the decision of the individual investigator. It is expected that only those genes that show significant differential expression should be included in the gene signature. In our experience, as also shown by the examples here, gene signatures with length in the order of 10 to 100 work well, but this can only serve as a rough guide. A Java application implementing the method presented in this paper can be downloaded via FTP from our institution's website [see Additional file3].
We have presented in this paper a framework for a new connectivity map, with the advantages that statistical significance measures are calculated at both treatment instance level and treatment set level, thus providing effective control over false connections. This important safeguard was not available in the original Connectivity Map at the instance level, as revealed by the two examples of random gene signatures. As the connectivity maps are most useful for high throughput screening and for generating new biological hypothesis, it is crucial that false connections are tightly controlled. We compared the performance of the method here with the original Connectivity Map using two gene signatures (for HDAC inhibitors and estrogens respectively) previously compiled in and also a new gene signature for immunosuppressive drugs. All these examples demonstrated that our method is more sensitive and robust than the original. With its increased sensitivity and with false connections being properly controlled, the method presented here can potentially benefit biomedical researchers by saving them time and resources and increasing their rate of success in pursuing new biological hypothesis based on the findings of connectivity maps.
We wish to acknowledge the support of the microarray team of the MRC Toxicology Unit. SDZ thanks Qing Wen for helpful discussions on a searching algorithm in the implementation of the new method.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.