Detection of compound mode of action by computational integration of whole-genome measurements and genetic perturbations
© Hallén et al; licensee BioMed Central Ltd. 2006
Received: 15 June 2005
Accepted: 02 February 2006
Published: 02 February 2006
A key problem of drug development is to decide which compounds to evaluate further in expensive clinical trials (Phase I- III). This decision is primarily based on the primary targets and mechanisms of action of the chemical compounds under consideration. Whole-genome expression measurements have shown to be useful for this process but current approaches suffer from requiring either a large number of mutant experiments or a detailed understanding of the regulatory networks.
We have designed an algorithm, CutTree that when applied to whole-genome expression datasets identifies the primary affected genes (PAGs) of a chemical compound by separating them from downstream, indirectly affected genes. Unlike previous methods requiring whole-genome deletion libraries or a complete map of gene network architecture, CutTree identifies PAGs from a limited set of experimental perturbations without requiring any prior information about the underlying pathways. The principle for CutTree is to iteratively filter out PAGs from other recurrently active genes (RAGs) that are not PAGs. The in silico validation predicted that CutTree should be able to identify 3–4 out of 5 known PAGs (~70%). In accordance, when we applied CutTree to whole-genome expression profiles from 17 genetic perturbations in the presence of galactose in Yeast, CutTree identified four out of five known primary galactose targets (80%). Using an exhaustive search strategy to detect these PAGs would not have been feasible (>1012 combinations).
In combination with genetic perturbation techniques like short interfering RNA (siRNA) followed by whole-genome expression measurements, CutTree sets the stage for compound target identification in less well-characterized but more disease-relevant mammalian cell systems.
We then compared CutTree with the network identification method developed by Gardner et al . Only when 90% of the network architecture was known, the network identification method could define PAGs with an accuracy which could not be distinguished (P < 2.2 × 10 -11, Student t-test) from that of CutTree (Fig. 3b). However, when 50% of the network connections were known CutTree outperformed the network identification method. In conclusion, the in silico experiments clearly demonstrated that CutTree outperforms an exhaustive search strategy (Fig. 1b, middle). The network identification approach (Fig. 1b, left) can only have a similar performance as CutTree when there is a close to perfect understanding (90%) of the underlying biology. Finally, we tested whether these results from CutTree were robust against noise in the data. We found that the noise level had to be larger than 50 % (noise level > 0.5, see Methods for implementation) in order to substantially reduce the performance of CutTree.
CutTree identified four out of five PAGs in the galactose-response pathway (Fig. 4, left). Note that this excellent performance is almost identical to the predicted accuracy of 70% from the in silico experiments (Fig. 3b). Given the large number of possible combinations of five PAGs in a set of 952 differentially expressed genes (1012), an exhaustive experimental library search would not have been feasible. Interestingly, not all of the five PAGs are differentially expressed when galactose is applied demonstrating that PAGs could not be identified from inspection of the amount of fold change. In particular, the deleted genes are not as a rule, classified as PAGs by CutTree. For example, knockouts of the GAL3, GAL4, and GAL80 genes, are correctly not labeled as PAGs by CutTree. Thus, neither the fold change nor the knockouts are sufficient to reveal the identity of the PAGs (Fig. 1a). Note that CutTree identified YLR460C, a previously unidentified PAG for galactose but did not detect GAL5 which have been suggested to be a PAG. This could reflect more complex regulation of the PAGs by Gal4, such as an unknown intermediate step between Gal4 and GAL5, possible involving Ylr460c. GAL5 has in contrast to GAL1, GAL2, GAL7, GAL10, not been verified as specifically regulated by Gal4. Furthermore YLR460C is a stress-response gene  suggesting that under conditions of stress Gal4 could also regulate this gene. GCY1 (ranked 4) might also be partly controlled by Gal4 [15, 16] and could therefore be a PAG. These suggestions are amenable for further experimental validation.
We have designed an algorithm, CutTree, which identifies the mechanisms of action of chemical compounds. CutTree could speed up the drug development pipeline, alleviating the current bottleneck in which the mechanism of action and side effects of current drug candidates are as a rule uncharacterized . Our strategy was to first validate the algorithm using in silico experiments and subsequently to test CutTree on public Yeast data. The rational behind our in silico evaluation is that for chemical compounds there is yet no clear procedure on how to directly infer mechanism of action from experimental data. It is therefore essential to independently validate and benchmark any algorithm, including CutTree, which specifies an experimental design and data-analysis. Consequently, the in silico data is particularly useful since the correct mechanism of action is inherently known by construction of the experiment. Hence, the in silico evaluation ensures a proper estimation of the expected success rate and data requirement for inference of mechanism of action before applying the technique in the analysis and design of biological experiments. Our in silico analysis demonstrated that CutTree can identify PAGs better than a random search method (Fig. 2a), in particular in the case of several PAGs (Fig. 3a). Furthermore, our analysis predicted a 70% success rate when there are five unknown PAGs (Fig. 3b).
Interestingly, this prediction came in close agreement with the experimental validation where CutTree identified four out of five (80%) of the primary targets of galactose in Yeast using data from 952 differentially expressed genes generated from 17 different perturbations. We may ask how many of the suggested candidates in the list generated by CutTree we would have to consider in order to identify those five PAGs? The estimated 70 % success rate from the in silico estimation informs us that if we consider the top 8 candidates, then we can expect 5.6 true PAGs. This procedure would ensure 100 % (5/5) recall, 62.5 % (5/8) precision, thereby a false positive rate of 37.5 % (3/8). Since there is always a trade-off between precision and recall which depends on the priorities of the experimenter there is also a rationale for choosing the top five candidates from the CutTree list with an estimated 3.5 true PAGs, hence a recall of 60 % (3/5) and 60 % (3/5) precision. Now, as the validation using the experimental data demonstrates, choosing the top five candidates, gives 80 % (4/5) and 80 % (4/5) precision. The efficient performance of the CutTree algorithm relies on the fact that CutTree utilizes a simultaneous application of the chemical compound and a genetic perturbation such as a knockout. Therefore, CutTree does not depend on the availability of a priori knowledge, which are currently available only for Yeast (e.g. ). Thus, CutTree is complementary to other methods  which require large knockout libraries. Even though CutTree does not require prior knowledge, information about putative pathways in which the compound could target facilitates the choice of the suggested initial genetic perturbation in the CutTree algorithm.
A complementary approach to CutTree is to first identify the underlying regulatory network and from the interaction map calculate the PAGs (Fig. 1b). Such a network strategy has a similar performance as CutTree provided that close to 90 % of the interactions in the network have been identified (Figure 2c). In Gardner et al. a network strategy was designed and experimentally tested on E-coli for a previously well characterized small network of nine genes constituting the SOS pathway. However, the applicability of such a network-based approach has hitherto been hampered by the difficulty to identify regulatory networks even in simpler organisms including E-coli and Yeast [7, 15]. To perform an estimate of the usefulness of a network based strategy depends on to what extent regulatory networks have been characterized as of today. Clearly, any estimation of our current understanding heavily depends on the individual researcher's assumptions. For example, given that less than 60 %  of the open reading frames (ORFs) have been characterized, how large fraction of the Yeast network has thereby been identified? One scenario is that we have complete knowledge of all the interactions between those characterized open reading frames (ORF). From this follows that 64 % of the interactions are unknown, since the interactions between the ORFs only account for 36 % of the total number of interactions. This can be interpreted as an optimistic estimate since most likely there are interactions between ORF that have not yet been characterized. For example, we would only know 18 % of the Yeast network if 50 % of the interactions between the ORFs were known. On the other hand, the estimates used are slightly biased on the negative side since we may have correlation-based information on the interactions between characterized ORFs and uncharacterized ORFs as well as mutual interaction between uncharacterized ORFs. Taking all these approximations together, we believe it is reasonable to argue that no more than about 50 % of all the interactions in Yeast are known. Considering the fact that the Yeast regulatory network probably is the best characterized regulatory network as of today, CutTree should serve as a useful tool for detection of PAGs in Yeast and other organisms, let alone mammalian cells, in the nearest foreseeable future. However, in the last couple of years there has been a rapid progress on algorithms and their application to experimental data for identifying cellular networks, mainly targeting E-coli and Yeast [7, 19–22].
Indeed, a recent study  has demonstrated the success of a slightly modified network-based approach, here denoted as a mode-of-action by network identification (MNI), where the regulatory interactions in Yeast are estimated from a large knockout compendium. The MNI algorithm correctly enrich for known targets and associated pathways in the majority of compounds that were examined. However, there are a number of differences between CutTree and MNI suggesting different domains of applications for the two algorithms depending on the amount of available prior biological knowledge; (i) The MNI requires large libraries of expression profiles from treated cells in order to train the network model. For Yeast MNI used 515 different expression profiles which were available. (ii) In addition, the MNI requires a large amount of prior knowledge. The MNI network model is based on principal component regression, thus reducing the network model of genes into a network model of metagenes since 515 expression profiles are not sufficient to identify a network model of genes. Note that number of ORFs is more than 10 times larger than the number of gene expression profiles. The PAGs, which were calculated from the compound action on the metagenes, were projected back to the genes. This step is essentially underdetermined and the MNI approach therefore requires a library of pathways and relevant GO classification in order to interpret the ranking of the metagenes. A preliminary analysis showed that the MNI algorithm could not detect any of the galactose PAGs in the top 50 putative PAGs derived from the Yeast gene expression library. The additional GO classification method used in  did not produce any significant pathways.
In contrast, CutTree does not require a library of expression profiles or a well-annotated system with pathways. Hence, CutTree is adapted to precisely characterize the mechanisms of action of drugs and chemical compounds in systems where less prior knowledge is available but the disease relevance might be larger than compared to better described systems such as E-coli and Yeast. Furthermore, CutTree provides the experimentalist with a novel experimental design protocol for how to perform a sequential and simultaneous application of a compound and a genetic perturbation, such as a knockout, in order to obtain the most useful information.
CutTree is complementary to the network identification strategy, including the MNI algorithm. CutTree therefore sets the stage for compound target identification in less well-characterized but more disease-relevant mammalian systems where novel genetic perturbation technologies such as sRNAi are making rapid progress. Further biological experiments will tell whether CutTree is broadly applicable for drug evaluation using prokaryotic and eukaryotic cell lines and, more importantly, whether CutTree proves to be useful in the analysis of compound mode of action in mammalian cells.
The underlying idea of the CutTree algorithm is to iteratively define primary affected genes (PAGs) by identifying recurrently active genes (RAGs) – genes whose expression changes most frequently over several whole-genome expression measurements. CutTree provides an experimental design in which the algorithm specifies which genes should be perturbed in the next round of experiments, for a given set of micro-array measurements. In addition, CutTree analyzes the data and produces a candidate list of PAGs. Based on this, CutTree either halts or suggests another genetic perturbation. Importantly, the chemical compound introduces an unknown perturbation to the system and the additional simultaneous well defined genetic perturbation (knockout/siRNA) performed by the experimenter as suggested by CutTree, makes it possible to distinguish the effect of the compound after only a small number of experiments. This is different from current approaches to the mode-of-action problem, where only one genetic perturbation (knockout/siRNA) is used in the absence of the compound (e.g. as in datasets like Hughes et al.) The efficient performance of CutTree depends on the simultaneous application of the compound with a selective genetic perturbation, such as a knockout or an siRNA perturbation [17, 24]. In each iteration we allow the knockout/siRNA perturbation target the top RAG. A termination criterion defines and ranks the top candidate PAGs (Fig. 1b, right). This strategy assumes that PAGs are the same genes independently of the second perturbation and that PAGs are at the highest level in the network hierarchy, since the compound constitutes an external perturbation of the root of the tree (Fig. 2c). CutTree iteratively calculates the probability that a gene is a PAG, based on the information obtained by perturbing genome-wide expression. A well-defined stop criterion halts the iteration. We use a Dirichlet distribution with hyper-parameters α1 and α0 to represent the probability that each gene is a PAG. This is a natural formulation as we have taken a probabilistic approach. The parameters encode a frequency formulation since they are based on the number of times we observe a PAG or nonPAG. The expected value of the distribution follows
The CutTree algorithm
1. Choose n, the number of genes to use in the stop criterion.
Choose m, the number of times the stop criterion shall remain unchanged.
STOP = 0.
For each gene i
α1i = α0i = 1 (uniform distribution).
2. If we have a priori knowledge that a gene is likely to be/not be a PAG.
Update α1 and α0 for all genes accordingly. Here we use 1. Another way would be to set α1i = number of times i observed as a PAG/number of times i observed as a non-PAG.
If gene i is likely to be a PAG, increase α1i by 1.
If gene i is not likely to be a PAG, increase α0i by 1.
3. While STOP = 0:
Calculate the expectation values (peaks) of the Dirichlet distributions:
Choose the gene with the largest peak value as the knockout. If multiples genes have the largest peak value, choose one at random as the target for the next knockout.
Perform genome-wide expression profiling to compare the knockout with the knockout in the presence of the compound.
Calculate expression changes.
For each gene i:
If expression changed, increase α1i by 1.
If expression did not change, increase α0i by 1.
Rank genes according to their peaks.
If the n highest genes = the n highest genes from the last m iterations
STOP = 1.
Note that CutTree does not specify how to perform the differential gene expression analysis since this is general problem within micro-array statistics. Parameters in the algorithm: In the in silico experiments, we used n = 6 and m = 4. PAGs are considered to be identified if they are among the top 10 most significant genes after the algorithm halted. This increases the number of false positives but also makes it more likely that we find the PAGs. Thus, we prefer to select for false positives instead of removing PAGs from the candidate list, increasing the false negative rate. These parameters are flexible depending on the behavior of the stop criterion and the approximate number of expected PAGs, which in turn depends on the particular experimental application.
Simulated random gene networks
The Mathematica 5.0 (Wolfram Research) package was used for all simulations. The simulated networks were modeled as linear networks. For a network with n nodes we describe x i as the concentration of RNA from gene i and the dynamics follows
where wij are the weights indicating how much gene j affects gene i. This gives an equation system with the concentration vector x of length n, the n × n weight matrix A and a vector b of length n containing the perturbations, the elements of b different from 0 represent the targets of the compound:
Initial values for the simulations were set randomly, and the simulation proceeded until a steady state was reached. When a gene was knocked out, the gene was kept in the matrix but all the connections to that gene were removed and the expression set to zero.
As a rule, the network structure was generated randomly, thus allowing for loops and other network motifs. The out-degree of the network connectivity followed a power law : the probability that a randomly chosen node from the network had d outgoing interactions is P(d) = d-2.3. In the simulations, 200 nodes (n) were used in the networks, and the node weights (wij) varied between -3 and 3 with a flat distribution. All simulation data are mean values from at least 50 experiments. In evaluating the effects of different network structures (Fig X), small networks (13 genes and one PAG) were created. The structures varied from the PAG being connected to all other genes (shallow tree with many branches and short paths) to the PAG being connected to a few genes (deeper tree with fewer branches and longer paths). To evaluate the effects of noise, we used an additive noise model such as noisy_xi = xi + xi * noise level, where 0<noise level<1.
In the network identification method (Fig. 1b, left) of Gardner et al. the identified network can be represented by a connectivity matrix A. PAGs are found by applying the perturbation to the network (unknown vector p) and measuring gene expression (measured vector x). We have the equation system
x = Ap,
which can be solved.
For comparisons with the Network identification method, we generated random linear networks and performed the simulations as described above. The network was then mutated by randomly removing nodes and inserting random false nodes to reach certain coverage of the original network. The PAGs were then calculated with the Gardner method using the matrix from the mutated network.
Training data calculations
The data from Ideker et al. were recalculated to non-logarithmic values. They are also relative to the reference wild-type + galactose (wt + gal) of the form:
The data were rearranged to fit the CutTree algorithm:
A fold change of two was used as cutoff to classify a gene as differentially expressed or not. This was the only dataset we could identify that contained a dual application of a compound (galactose) and a genetic perturbation (knockout).
The work was supported by grants from The Swedish Research Council and the Swedish Foundation for Strategic Research. We thank Dr Jose Peña and Roland Nilsson for useful discussions and Dr Collins for helpful advice.
- Knowles J, Gromo G: A guide to drug discovery: Target selection in drug discovery. Nat Rev Drug Discov 2003, 2: 63–69. 10.1038/nrd986View ArticlePubMedGoogle Scholar
- Kola I, Landis J: Can the pharmaceutical industry reduce attrition rates? Nat Rev Drug Discov 2004, 3: 711–715. 10.1038/nrd1470View ArticlePubMedGoogle Scholar
- Lum PY, Armour CD, Stepaniants SB, Cavet G, Wolf MK, Butler JS, Hinshaw JC, Garnier P, Prestwich GD, Leonardson A, Garrett-Engele P, Rush CM, Bard M, Schimmack G, Phillips JW, Roberts CJ, Shoemaker DD: Discovering modes of action for therapeutic compounds using a genome-wide screen of yeast heterozygotes. Cell 2004, 116: 121–137. 10.1016/S0092-8674(03)01035-3View ArticlePubMedGoogle Scholar
- Giaever G, Flaherty P, Kumm J, Proctor M, Nislow C, Jaramillo DF, Chu AM, Jordan MI, Arkin AP, Davis RW: Chemogenomic profiling: identifying the functional interactions of small molecules in yeast. Proc Natl Acad Sci USA 2004, 101: 793–798. 10.1073/pnas.0307490100PubMed CentralView ArticlePubMedGoogle Scholar
- Marton MJ, DeRisi JL, Bennett HA, Iyer VR, Meyer MR, Roberts CJ, Stoughton R, Burchard J, Slade D, Dai H, Bassett DE, Hartwell LH, Brown PO, Friend SH: Drug target validation and identification of secondary drug target effects using DNA microarrays. Nat Med 1998, 4: 1293–1301. 10.1038/3282View ArticlePubMedGoogle Scholar
- Gardner TS, di Bernardo D, Lorenz D, Collins JJ: Inferring genetic networks and identifying compound mode of action via expression profiling. Science 2003, 301: 102–105. 10.1126/science.1081900View ArticlePubMedGoogle Scholar
- Luscombe NM, Babu MM, Yu H, Snyder M, Teichmann SA, Gerstein M: Genomic analysis of regulatory network dynamics reveals large topological changes. Nature 2004, 431: 308–312. 10.1038/nature02782View ArticlePubMedGoogle Scholar
- Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5: 101–113. 10.1038/nrg1272View ArticlePubMedGoogle Scholar
- Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL: The large-scale organization of metabolic networks. Nature 2000, 407: 651–654. 10.1038/35036627View ArticlePubMedGoogle Scholar
- Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner R, Goodlett DR, Aebersold R, Hood L: Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 2001, 292: 929–934. 10.1126/science.292.5518.929View ArticlePubMedGoogle Scholar
- Platt A, Ross HC, Hankin S, Reece RJ: The insertion of two amino acids into a transcriptional inducer converts it into a galactokinase. Proc Natl Acad Sci USA 2000, 97: 3154–3159. 10.1073/pnas.97.7.3154PubMed CentralView ArticlePubMedGoogle Scholar
- Lohr D, Venkov P, Zlatanova J: Transcriptional regulation in the yeast GAL gene family: a complex genetic network. FASEB J 1995, 9: 777–787.PubMedGoogle Scholar
- Johnston M: A model fungal gene regulatory mechanism: the GAL genes of Saccharomyces cerevisiae. Microbiol Rev 1987, 51: 458–476.PubMed CentralPubMedGoogle Scholar
- DeRisi JL, Iyer VR, Brown PO: Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997, 278: 680–686. 10.1126/science.278.5338.680View ArticlePubMedGoogle Scholar
- Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, Young RA: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 2002, 298: 799–804. 10.1126/science.1075090View ArticlePubMedGoogle Scholar
- Angermayr M, Bandlow W: Permanent nucleosome exclusion from the Gal4p-inducible yeast GCY1 promoter. J Biol Chem 2003, 278: 11026–11031. 10.1074/jbc.M210932200View ArticlePubMedGoogle Scholar
- Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH: Functional discovery via a compendium of expression profiles. Cell 2000, 102: 109–126. 10.1016/S0092-8674(00)00015-5View ArticlePubMedGoogle Scholar
- Saccharomyces Genome Database[http://www.yeastgenome.org]
- Yeung MK, Tegner J, Collins JJ: Reverse engineering gene networks using singular value decomposition and robust regression. Proc Natl Acad Sci USA 2002, 99: 6163–6168. 10.1073/pnas.092576199PubMed CentralView ArticlePubMedGoogle Scholar
- Tegner J, Yeung MK, Hasty J, Collins JJ: Reverse engineering gene networks: integrating genetic perturbations with dynamical modeling. Proc Natl Acad Sci USA 2003, 100: 5944–5949. 10.1073/pnas.0933416100PubMed CentralView ArticlePubMedGoogle Scholar
- Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A: Reverse engineering of regulatory networks in human B cells. Nat Genet 2005, 37: 382–390. 10.1038/ng1532View ArticlePubMedGoogle Scholar
- Rice JJ, Tu Y, Stolovitzky G: Reconstructing biological networks using conditional correlation analysis. Bioinformatics 2005, 21: 765–773. 10.1093/bioinformatics/bti064View ArticlePubMedGoogle Scholar
- di Bernardo D, Thompson MJ, Gardner TS, Chobot SE, Eastwood EL, Wojtovich AP, Elliott SJ, Schaus SE, Collins JJ: Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nat Biotechnol 2005, 23: 377–383. 10.1038/nbt1075View ArticlePubMedGoogle Scholar
- Mittal V: Improving the efficiency of rna interference in mammals. Nat Rev Genet 2004, 5: 355–365. 10.1038/nrg1323View ArticlePubMedGoogle Scholar
- Wagner A: Estimating coarse gene network structure from large-scale gene perturbation data. Genome Res 2002, 12: 309–315. 10.1101/gr.193902PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.