Efficient discovery of responses of proteins to compounds using active learning
© Kangas et al.; licensee BioMed Central Ltd. 2014
Received: 26 November 2013
Accepted: 7 May 2014
Published: 16 May 2014
Drug discovery and development has been aided by high throughput screening methods that detect compound effects on a single target. However, when using focused initial screening, undesirable secondary effects are often detected late in the development process after significant investment has been made. An alternative approach would be to screen against undesired effects early in the process, but the number of possible secondary targets makes this prohibitively expensive.
This paper describes methods for making this global approach practical by constructing predictive models for many target responses to many compounds and using them to guide experimentation. We demonstrate for the first time that by jointly modeling targets and compounds using descriptive features and using active machine learning methods, accurate models can be built by doing only a small fraction of possible experiments. The methods were evaluated by computational experiments using a dataset of 177 assays and 20,000 compounds constructed from the PubChem database.
An average of nearly 60% of all hits in the dataset were found after exploring only 3% of the experimental space which suggests that active learning can be used to enable more complete characterization of compound effects than otherwise affordable. The methods described are also likely to find widespread application outside drug discovery, such as for characterizing the effects of a large number of compounds or inhibitory RNAs on a large number of cell or tissue phenotypes.
Drug discovery and development is a lengthy process that begins with the identification of potential drug targets and ends after testing in clinical trials. The targets are generally identified through basic science studies as being critical components affected in a disease. Once a target protein has been identified, the goal is to identify drug-like compounds that either increase or decrease its activity. High throughput screening (HTS) and high content screening (HCS) are frequently used to ascertain the effects of many compounds on a target. However, even with automation, screening a large experimental space can be expensive (especially for HCS). One approach to reducing the need for experimentation is to generate a model for compound effects in silico, a process referred to as virtual screening. There are two common methods. In a quantitative structure activity relationship (QSAR) analysis, molecules are checked for the presence or absence of specific structural elements. The vector describing a molecule is referred to as a “fingerprint.” QSAR methods have been used to make predictions about the activity of compounds on target proteins [2, 3]. Molecular docking is an alternative method that requires knowledge of the structure of both target and compound [4, 5]. Computer simulations are run in which the target and compound are forced into contact and the interaction energy between the target and compound molecule is estimated. These methods take into consideration features of the target protein and potential drugs. Beyond virtual screening, efforts have also been made to apply machine learning techniques to the wealth of information available in the PubChem database, paying particular attention to the gross imbalance of active to inactive compounds [6, 7], in order to make accurate predictions of the effects of compounds on targets.
These predictive studies consider the effects of many compounds on one (or a small number) of targets in order to identify promising compounds for further development. However, it is not uncommon in drug development for previously unknown effects to be discovered after significant investment in a potential drug, resulting in relatively high attrition rates in later phases or even after drug release. These side effects are not discovered earlier because screening is for desired effects of compounds on a single target protein without considering whether compounds have undesired effects on other targets. This suggests that early drug screening should consider a larger portion of the compound-target effect space. Ideally, we would have knowledge of the whole experimental space of compounds and targets (which we can represent as a matrix with rows for each target (~10^4) and columns for each compound (~10^6)).
By having knowledge of all effects of all compounds, much more informed decisions could be made about which compounds to advance through the development process (including the possibility that a desired drug should have more than one effect as well as minimal side effects). However, measuring the full matrix would require on the order of 10^10 measurements, the cost of which would be prohibitive. An alternative is clearly needed.
As with single targets, predictive modeling methods for the larger space have also been described. Chemogenomic approaches have been developed that concurrently consider the similarity of compounds and the similarity of ligands to make predictions for unknown associations between proteins and compounds. Furthermore, methods have been developed that allow for the identification of compounds with a desired effect profile across multiple targets by using evolutionary methods to generate compounds to be tested. These polypharmacological methods make predictions for the effects of compounds across multiple targets. Using text mining methods, clinical outcome records have also been analyzed to predict effects. Inverse docking methods have been developed as well, which start with a compound and measure the interaction energy between the compound and multiple proteins.
Building any of these predictive models requires data for at least a subset of all possible experiments. This is typically all data currently available, or new results for a human-specified subset thought to be representative. In approaches like those mentioned above, machine learning methods are then used to predict results for a large set of compounds, and a small number of these are tested. In most cases, the process stops after this, and selected compounds are advanced to further development. However, the process can be made iterative, so that information from the additional experiments may be used to improve the model, make new predictions and select more experiments to execute. This type of approach is referred to as active learning in the machine learning literature. In active learning, rather than being chosen in advance, experiments are iteratively selected to most improve the accuracy of the predictive model. In the context of drug development, this should require fewer experiments to make accurate predictions (of both desired and undesired effects), allowing for more effective decisions and reduced late-phase attrition. While active learning is widely used in some fields, there have been only limited applications to biological problems [15–21].
The most important difference of the work described here from previous approaches is our emphasis on active machine learning to simultaneously model the effects of many compounds on many targets. To demonstrate the utility of active learning for drug discovery in the context of multi-target modeling, we combined two modeling approaches to make predictions about activities for large numbers of combinations of compounds and targets. Our model uses features developed for virtual screening to describe compounds, and features from sequence analysis to describe target proteins. As a part of this effort, we did not endeavor to make the most accurate predictive model possible. Rather, we investigated the utility of applying active learning in combination with predictive models in order to efficiently discover active compound-target pairs. In tests using data from the PubChem database, we found that active compound-target pairs could be discovered as much as twenty-four times faster using active learning than by random selection of experiments. The algorithms we describe are also computationally efficient, making application to very large experimental spaces practical.
To evaluate our proposed approaches, we chose to use existing experimental results for assays on many targets and many compounds. We therefore began by assembling a large set of compound effect scores from PubChem (http://pubchem.ncbi.nlm.nih.gov). In total, compound activity scores for 177 assays were assembled. Of these assays, 108 were from in vitro assays and 69 were from in vivo assays. Of the 600,000 compounds in PubChem across the 177 assays, an average of 30% had a reported activity score for a given assay. (We do not know but assume that the missing values are approximately missing at random.) Of these, we created a dataset of all assay data for 20,000 randomly-chosen compounds, resulting in a system with 3.5 million possible experiments (the distribution of scores across all compounds and assays is shown in Additional file 1). All combinations of target and compound with scores above 80 or below −80 were marked as hits. (Note that each PubChem assay includes its own rank score cutoff above which a chemical is considered to be “active”. Our cutoff of 80 is more stringent than that used for most assays.) Information on the assays, compounds and their respective features, can be found in the Additional files 2 and 3.
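The hit-labelling rule above can be sketched in a few lines of Python. This is an illustrative sketch only: the name `is_hit`, the use of `None` for an unmeasured pair and the strict inequality are assumptions. A later passage describes discoveries as having an absolute score "greater than or equal to 80", in which case `>` would become `>=`.

```python
# Illustrative sketch; names and the None convention are assumptions.
HIT_THRESHOLD = 80  # cutoff on the rescaled activity score, as stated in the text


def is_hit(score):
    """Return True when a measured score lies above 80 or below -80.

    An unmeasured (assay, compound) pair is passed as None and is never a hit.
    """
    return score is not None and abs(score) > HIT_THRESHOLD


# Toy scores for three measured pairs and one unmeasured pair.
scores = {("assay_1", "cpd_1"): 92.0,
          ("assay_1", "cpd_2"): -85.5,
          ("assay_2", "cpd_1"): 40.0,
          ("assay_2", "cpd_2"): None}
hits = sorted(pair for pair, s in scores.items() if is_hit(s))
```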
As an initial approach to constructing a predictive model, we explored using linear combinations of features. Given the large numbers of features involved, lasso regression was used because it allows for efficient feature selection for linear regression models. We note that while the assay scores may be non-linearly related to true activity, and while estimates of true activity may be obtained by further manipulation or testing, we expect them to be good approximate predictors of which combinations of compounds and targets will show high activity.
Three approaches to prediction of the assay scores were used. The first approach used compound features only (CFO) to predict the activity of each compound in a given assay (analogously to QSAR). Using lasso regression, compound features were selected that were strongly indicative of the activity of a compound on a single target. A regression model was learned for each individual target allowing for the selection of compound features unique to a target (Figure 1b and Equation 1). The second approach used protein features only (PFO) to predict the effect on each target of a given compound. When considering all experiments which involved a single compound, lasso regression allowed us to select features of the target protein which were indicative of the likelihood for a target to be affected by that single compound (Figure 1c and Equation 2). The third approach made a combined compound-target (CCT) prediction by averaging the two predictions for each compound-target combination (Figure 1d and Equation 3).
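The averaging step of the combined model can be illustrated as follows. This is a minimal sketch, not a reproduction of Equations 1 to 3: the function name `combine_cct` and the `None` convention are assumptions, and the fallbacks (use the single available prediction, or zero when neither exists) follow the behaviour described in Methods.

```python
# Minimal sketch of the CCT averaging step; names are hypothetical.
def combine_cct(cfo_pred, pfo_pred):
    """Average the CFO and PFO predictions for one (target, compound) pair.

    Pass None when a model could not be trained. With one prediction missing
    the other is used alone; with both missing the predicted score is zero.
    """
    available = [p for p in (cfo_pred, pfo_pred) if p is not None]
    if not available:
        return 0.0
    return sum(available) / len(available)
```

For example, `combine_cct(60.0, 100.0)` averages the two model outputs, whereas `combine_cct(None, 40.0)` simply returns the PFO prediction.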
We first sought to determine how accurately these models could predict target-compound hits as a function of how much training data was available. To do this, we randomly sampled a sequence of experiments in batches of 384 experiments until 3% of the experimental space had been sampled (note that each combination of assay and compound was considered independently when selecting random experiments). As each experiment was sampled, we combined it with all previous experiments from that sequence to train a model and evaluated its ability to predict hits for all remaining data.
We also considered which features were more informative than others. To make a single set of predictions across the entire space of 20,000 compounds and 177 targets requires the training of 20,177 lasso regression models. The final models trained at 3% of the experimental space (from Figure 2) were analyzed and the proportion of models where the coefficient for each feature was non-zero was calculated. To determine the magnitude of the effect of a feature on prediction, the mean absolute coefficient for each feature (only when it was selected) was calculated. For targets, the most frequently selected features (and those with the largest coefficients) were the amino acid compositions. For compounds, the most frequently selected feature was “Group IIa (Alkaline earth)” and the feature with the largest absolute coefficient was “4 M Ring”. Further details on other features are provided in the Additional file 3.
Given that our modeling approach performed better than random at predicting relative activity scores, we next determined whether it could be used to successfully drive an active learning process (i.e., to find hits faster than expected at random or maximize predictive accuracy rapidly). For this, simulations were run for an experimental space of all 177 assays (129 unique protein targets) and all 20,000 compounds. For this experimental space, rank scores from actual experiments executed were available in PubChem for 1,043,300 experiments out of 3,540,000 possible experiments. Experiments selected during simulations were restricted to those for which results were available; requests from an active learner for other experiments were skipped.
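The size of the experimental space and the fraction with available results follow directly from the figures quoted above. The short calculation below is only an arithmetic check; the variable names are illustrative.

```python
# Arithmetic check of the figures quoted in the text.
n_assays = 177
n_compounds = 20_000
n_possible = n_assays * n_compounds          # all (assay, compound) pairs
n_available = 1_043_300                      # pairs with a PubChem rank score

fraction_available = n_available / n_possible
print(f"{n_possible:,} possible experiments, "
      f"{fraction_available:.1%} with results available")
```

The measured fraction is a little under 30%, in line with the average per-assay coverage reported for the full PubChem set.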
To initialize a simulation, all experimental results were hidden from the active learner as if they had never been executed. A set of 384 experiments were selected randomly for “execution.” During the execution phase (Figure 1a), results from selected experiments were “revealed” and used for training of a predictive model (Figure 1b-d). A new batch of experiments was then selected using one of a number of active learning methods (illustrated in Figure 1e and described in Methods). Finally, the data for the selected experiments were added to the pool of previously selected data and the loop continued until 3% of the possible experimental space was explored. Each round consisted of the selection of 384 experiments. Ten separate simulations were run for each experiment selection method, each starting out with a different set of initial experiments. At each round, the discoveries (combinations whose absolute activity score was greater than or equal to 80) were counted, and the mean count and associated standard error recorded as a function of the fraction of experimental space so far explored.
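The simulation loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: `run_simulation`, `random_select` and the `select` callback signature are hypothetical, model training is assumed to happen inside `select`, and the budget is taken as a fraction of the supplied pool of measured experiments rather than of the full space.

```python
import random


def random_select(observed, candidates, n, rng):
    """Baseline selector: choose n unobserved experiments uniformly at random."""
    return rng.sample(sorted(candidates), n)


def run_simulation(results, select, batch_size=384, budget_fraction=0.03, seed=0):
    """Replay hidden results through an active learning loop.

    results: dict mapping (assay, compound) -> score for measured experiments.
    select:  callable(observed, candidates, n, rng) returning n experiments;
             any model training is assumed to happen inside this callback.
    Returns the observed results and the cumulative hit count after each round.
    """
    rng = random.Random(seed)
    candidates = set(results)            # only measured pairs can be "executed"
    budget = int(budget_fraction * len(results))
    observed = {}
    hits_per_round = []
    # Initial batch is chosen at random because nothing has been revealed yet.
    batch = rng.sample(sorted(candidates), min(batch_size, budget))
    while batch:
        for exp in batch:                # "execute": reveal the hidden result
            observed[exp] = results[exp]
            candidates.discard(exp)
        hits_per_round.append(sum(abs(s) >= 80 for s in observed.values()))
        n_next = min(batch_size, budget - len(observed), len(candidates))
        if n_next <= 0:
            break
        batch = select(observed, candidates, n_next, rng)
    return observed, hits_per_round
```

Swapping `random_select` for a model-driven selector changes only which experiments are revealed each round, which is what allows the selection methods to be compared on equal terms.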
The rate of discovery for the greedy method using CCT decreased as the simulations progressed. Exploration of the experimental space with the greedy algorithm was limited to regions of the feature space which were predicted to have large activities. We considered the possibility that this limited the system’s ability to learn a better model, and that this could be overcome by acquiring data in regions where few observations have been made or where the model predictions were uncertain. Therefore, a “density-based” approach was also tested which selected experiments so as to explore the experimental space efficiently without regard to predicted values or experimental results. In this approach, experiments were selected which were most similar to the unobserved experiments and least similar to observed experiments. A variation on this idea, diversity sampling, was also tested, along with uncertainty sampling, in which experiments with the highest uncertainty of their prediction were selected. Results for these approaches are shown in Additional file 4. The uncertainty-based selection method performed much better than random but not as well as CCT with greedy sampling. Density-based and diversity-based sampling performed similarly to random selection. These three classical active learning methods are generally designed to select experiments for execution which will yield the most accurate model, while the results in Additional file 4 are for finding hits. We therefore considered the accuracies of the models for each method by calculating the area under the ROC curve (as previously described for Figure 2). As shown in Additional file 5, all selection methods, except for uncertainty sampling, resulted in an initial peak accuracy followed by a slight, gradual reduction in the accuracy of the models. Uncertainty sampling yielded more accurate models than CCT with greedy sampling, the reverse of their ranking for hit discovery in Additional file 4. The two results are consistent because uncertainty sampling does not prefer finding hits over non-hits.
For reasons of computational time, we restricted our analysis to 20,000 compounds. It was therefore of interest to estimate how performance might change if more compounds were included. As a preliminary indication of this, we performed simulations for smaller sets of compounds. The results (Additional file 7) show that the learning rate is significantly worse for 5,000 compounds than for 20,000, but that it is not much different for 10,000 than 20,000. This suggests performance for larger sets might be as good or better.
We have described a pipeline for executing experiments driven by an active learning system and demonstrated that it can result in the rapid discovery of compounds which affect target proteins using a set of heterogeneous assays. We found that the selection of experiments based only on predictions calculated using compound features (predicting the effect of a compound on a single target) performed significantly better than the selection of experiments based only on predictions from target features (predicting the sensitivity of a target protein to a single compound). Decent performance of the prediction models using compound features is to be expected given past results with QSAR approaches to modeling compound activity on a given target. The comparatively poor performance of the protein models could be a result of multiple issues: poor features, limited data, and heterogeneous data sources. The system included only features that could be calculated from sequence information, and it is likely that this feature set could be improved by the inclusion of features calculated from protein structural information. Some assays utilized in this study included high content screening assays in which living cells were imaged to measure the effects of compounds. These types of experiments are inherently more complex than simple binding assays and may have been poorly represented by features only describing a single target protein within the complex system. Both types of models performed better than random prediction and combining them yielded accurate models that could be utilized to rapidly make new discoveries. Previously, ensembles of predictors have yielded good results and the performance of these combined models may be caused by the same effect. Importantly, the addition of memory limitations to these models further improves the discovery rate. In this experiment, only information from 177 assays was used. 
As information from more assays becomes available, predictive models are expected to improve.
There are at least five factors to be considered in applying active learning approaches to problems such as compound screening. First, whether to use a priori measures of similarity between compounds or targets must be decided. The advantage of using them is that predictions can be made even before any data are acquired, but the disadvantage is that they may be biased towards previously explored compounds or targets. In separate work, we have described approaches for using modeling and active learning without such features. Second, the method for choosing experiments to perform should reflect the goals of the campaign. As we have illustrated here, uncertainty sampling can be used to learn an accurate predictive model very efficiently. However, when the goal is not to learn an accurate predictive model of the whole space, but rather something such as just finding hits, we have also illustrated how hybrid experiment selection methods can prove very beneficial. With hybrid methods, a portion of the experiments are chosen so as to learn an accurate predictive model and the remainder of the experiments are chosen to take advantage of the improved predictive model to accomplish the desired goal. Further, we have shown that limiting the memory of the active learning system can result in further improvements in efficiency by avoiding exploration of areas of the experimental space in which most relevant information has already been discovered. Third, computational complexity is an important consideration in practical use of active learning methods. Methods that model the entire space at once are theoretically preferable [27, 28], but they can require prohibitively extensive computation for problems with thousands of targets and millions of compounds. In such cases, the methods we have described here can provide a faster alternative. Fourth, the logistics of the types of experimentation to be undertaken need to be considered. For example, in this study with the batch size we chose, 3% of the experimental space would have required approximately 80 rounds of experimentation. For some types of experimentation, a large number of small rounds may not be practical and thus larger batch sizes could be used in fewer rounds.
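The trade-off between batch size and number of rounds is simple arithmetic. The sketch below uses the batch size and experiment counts quoted in the text; the helper name `rounds_needed` is illustrative, and which pool the 3% budget is measured against is treated here as an open assumption, so both readings are computed.

```python
# Arithmetic sketch; the helper name is hypothetical.
def rounds_needed(pool_size, batch_size=384, percent=3):
    """Number of batches required to cover `percent` of a pool of experiments."""
    budget = pool_size * percent // 100
    return -(-budget // batch_size)      # ceiling division


rounds_full_space = rounds_needed(3_540_000)   # all possible experiments
rounds_available = rounds_needed(1_043_300)    # experiments with known results
```

The second figure, about 80 rounds, is the one that matches the number quoted above, which suggests the budget was counted against the experiments for which results were available. Doubling the batch size halves the number of rounds under either reading.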
Finally, the primary goal of the active learning process is to reduce the experimentation required to complete an objective. In order for those reductions to be realized, one needs to determine when to stop running experiments. This is an ongoing area of study, but progress has been made in our prior work. In the current study, we observed that the discovery rate was high while the first 1.5% of the experimental space was explored and then decreased (although discoveries still occurred at a substantial rate). To explore whether the learning would continue or would plateau, we continued the simulations past 3% for the best method (greedy-uncertainty hybrid with 10 round memory limit). The learning rate continued at about 2–3 times that of random sampling and did not reach a plateau (data not shown). Extrapolating the learning rate predicts that it would find all hits after sampling approximately 20% of the experimental space.
It is worth noting that while simultaneous consideration of multiple targets and multiple compounds may increase the number of experiments needed to find a compound that affects a single target, it may be expected to decrease the average number of experiments per target when used to simultaneously conduct campaigns for multiple targets.
The selection of an appropriate batch size is an important consideration for the utilization of an active learning system. If there is a significant setup cost for a set of experiments (as is typically the case for HTS and HCS), then larger batches are preferable. If on the other hand, setup costs are low and a short time is required to execute the experiments relative to computational time to update the model, a smaller batch size would be preferable.
In conclusion, the work presented here provides a practical, scalable approach to the specific problem of learning a combined model for the effects of many compounds on many targets and demonstrates that the model can be combined with active machine learning methods to dramatically reduce the number of experiments needed to find compounds with desired target effects. Many variations on the approaches described here can be considered, including different predictive models, different feature sets and different active learning algorithms (such as information-theoretic scoring [27, 28]). An exhaustive evaluation of these variations is beyond the scope of this paper, but we have firmly established that significant improvement in learning rates can be achieved. We believe active learning will be particularly important as drug development efforts increasingly consider variation among cell types and among individuals. The size of this experimental space clearly precludes exhaustive experimentation. The paradigm of exploring combinatorial experimental spaces through active learning is also widely applicable in biomedical research beyond drug discovery. This includes any study that seeks to determine the effects of large numbers of perturbations (such as genomic variation or exposure to compounds or inhibitory RNAs) on large numbers of molecular, cellular or histological behaviors (such as enzyme activities, cell shapes or motility, protein expression or localization). As the size of the experimental space grows, exhaustive experimentation becomes more impractical and active learning may be expected to provide even greater benefit.
Each assay from the PubChem database contains gene target information, chemical identifier information and activity scores for all compounds tested in the assay. Various features describing the primary structure of the target protein were calculated using ProtParam, Protein Recon (http://reccr.chem.rpi.edu/Software/Protein-Recon/Protein-Recon-index.html) and Prosite. In total, each assay was described by 388 features which described the target protein of that assay. All non-binary features were z-scored. The compounds in the assays were described with 1559 binary features calculated using OpenBabel (http://openbabel.org). Assays from PubChem targeting human proteins with more than 15,000 compounds tested were manually annotated. For each assay, it was determined what type of effect was being detected for the target (inhibition, excitation, etc.) and the nature of the activity scores reported. The selected assays are found in Additional file 8. Only assays whose activity scores were scaled with a measured effect from the compound were kept for simulation. Activity scores were rescaled if necessary to a maximum of 100. For all assays testing for inhibition, scores were made negative. From the ~600,000 possible compounds, 20,000 were selected randomly for use in simulations of the active learning processes. The selected compounds are found in Additional file 9.
All regression models were trained using the Least Angle Regression method implemented in SciKits (http://scikits.appspot.com). Penalties (s) were tested between 10^-4 and 10^4. Penalties were selected which minimized the mean squared error of five-fold cross validation within the training data for each model in each round of active learning.
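Penalty selection by five-fold cross validation can be illustrated with a deliberately small stand-in. The sketch below is not the Least Angle Regression implementation used in the study: it fits a single-feature lasso without intercept in closed form (soft-thresholding), and the function names, the decade-spaced grid and the interleaved fold assignment are all assumptions made for illustration.

```python
# Stand-in for penalty selection; a one-feature lasso replaces the LARS solver.
def soft_threshold(z, penalty):
    """Shrink z toward zero by `penalty`, clipping at zero."""
    if z > penalty:
        return z - penalty
    if z < -penalty:
        return z + penalty
    return 0.0


def fit_lasso_1d(xs, ys, penalty):
    """Minimise (1/2n) * sum((y - b*x)^2) + penalty * |b| for a single feature."""
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n
    scale = sum(x * x for x in xs) / n
    return soft_threshold(rho, penalty) / scale if scale else 0.0


def select_penalty(xs, ys, penalties, n_folds=5):
    """Return the penalty with the lowest cross-validated mean squared error."""
    best_mse, best_penalty = None, None
    for penalty in penalties:
        squared_error = 0.0
        for fold in range(n_folds):
            train = [i for i in range(len(xs)) if i % n_folds != fold]
            held_out = [i for i in range(len(xs)) if i % n_folds == fold]
            beta = fit_lasso_1d([xs[i] for i in train], [ys[i] for i in train], penalty)
            squared_error += sum((ys[i] - beta * xs[i]) ** 2 for i in held_out)
        mse = squared_error / len(xs)
        if best_mse is None or mse < best_mse:
            best_mse, best_penalty = mse, penalty
    return best_penalty


# Assumed grid: one candidate per decade across the stated range.
penalty_grid = [10.0 ** e for e in range(-4, 5)]
```

With many features the closed form no longer applies and a path algorithm such as least angle regression is needed, but the selection logic around the fit is unchanged.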
Experiments were selected which had the greatest absolute value of predicted rank score (Y(d,p)). In some cases, no information was available to make a prediction for an experiment. If no prediction could be made from available data for an experiment, that experiment was predicted to have a rank score of zero. All experiments with equivalent predicted values were treated in random order.
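A greedy selector following this description might look like the sketch below. The function name `greedy_select` and the dictionary-based interface are assumptions; the zero default for missing predictions and the random ordering of ties come from the description above.

```python
import random


def greedy_select(predictions, candidates, batch_size, rng=None):
    """Pick the unobserved experiments with the largest |predicted score|.

    predictions: dict mapping experiment -> predicted rank score; experiments
                 absent from the dict are treated as predicted zero.
    candidates:  iterable of unobserved experiments.
    """
    rng = rng or random.Random(0)
    pool = sorted(candidates)
    rng.shuffle(pool)   # random order among ties; the stable sort below keeps it
    pool.sort(key=lambda exp: abs(predictions.get(exp, 0.0)), reverse=True)
    return pool[:batch_size]
```

Shuffling before a stable sort is one simple way to break ties at random without affecting the ranking of experiments whose predictions differ.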
For each assay, five CFO models were learned by subsampling the results available from observed experiments in that assay. For each unobserved experiment in an assay, a prediction was made using each of the five CFO models. Likewise, for each compound, five PFO models were learned by subsampling the results available from observed experiments for that compound. For each unobserved assay for a compound, a prediction was made using each of the five PFO models. As a result, each unobserved experiment had five CFO predictions and five PFO predictions. Twenty-five predictions were calculated for each experiment by calculating the mean of each pair of CFO prediction and PFO prediction. If a model was impossible to calculate because there were no results from testing a compound d or no results from testing an assay p, only predictions from a single model were used. Experiments were selected which had the largest standard deviation of predictions because those were the experiments for which the model had the least confidence in prediction.
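The uncertainty score described above can be sketched as follows. The function names and the input format are assumptions, as is the use of the population standard deviation (the text does not say which form was used); training the subsampled models themselves is outside the scope of the sketch.

```python
from itertools import product
from statistics import pstdev


def prediction_spread(cfo_preds, pfo_preds):
    """Standard deviation of the pairwise-averaged ensemble predictions.

    cfo_preds, pfo_preds: lists of predictions from the subsampled models
    (five each in the text); an empty list means that model type was unavailable.
    """
    if cfo_preds and pfo_preds:
        combined = [(c + p) / 2 for c, p in product(cfo_preds, pfo_preds)]
    else:
        combined = list(cfo_preds or pfo_preds)   # fall back to a single model type
    return pstdev(combined) if len(combined) > 1 else 0.0


def uncertainty_select(ensembles, batch_size):
    """Rank experiments by prediction spread, largest first.

    ensembles: dict mapping experiment -> (cfo_preds, pfo_preds).
    """
    ranked = sorted(ensembles,
                    key=lambda exp: prediction_spread(*ensembles[exp]),
                    reverse=True)
    return ranked[:batch_size]
```

With five predictions of each type, `product` yields the twenty-five pairwise means mentioned in the text.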
Each experiment (target, compound) was represented by a single feature vector formed by concatenating the target features and the compound features for that experiment. For computational efficiency, a maximum of 2000 observed and 2000 unobserved experiments were used. Among the 2000 unobserved experiments, selections were made using a density-based sampling method which attempted to choose experiments which were most distant (Euclidean distance) from already observed experiments and least distant from unobserved experiments. No predictions from a learned model were utilized for this selection method.
Each experiment was represented by a single vector formed by concatenating the target features and the compound features for that experiment. A random set of 4000 experiments was clustered using the k-means algorithm (with k being the size of the batch desired, in our case 384). The experiment nearest to each centroid was selected for execution. No predictions from a learned model were utilized for this selection method.
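Diversity sampling as described above can be sketched with a small plain-Python k-means. This is a stand-in for whichever k-means implementation was actually used: the function names, the fixed iteration count, the random initialisation and the rule that each experiment is chosen at most once are assumptions. Feature vectors are assumed to be already concatenated.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def diversity_select(vectors, k, iterations=20, seed=0):
    """Cluster experiments into k groups and return the one nearest each centroid.

    vectors: dict mapping experiment -> concatenated feature vector.
    k:       batch size (384 in the text); must not exceed len(vectors).
    """
    rng = random.Random(seed)
    keys = sorted(vectors)
    centroids = [list(vectors[e]) for e in rng.sample(keys, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for e in keys:       # assign each experiment to its nearest centroid
            nearest = min(range(k), key=lambda i: sq_dist(vectors[e], centroids[i]))
            clusters[nearest].append(vectors[e])
        centroids = [        # recompute centroids; keep the old one if a cluster is empty
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    chosen = []
    for c in centroids:      # nearest not-yet-chosen experiment to each centroid
        best = min((e for e in keys if e not in chosen),
                   key=lambda e: sq_dist(vectors[e], c))
        chosen.append(best)
    return chosen
```

Because selection depends only on the feature vectors, the batch is spread across the feature space regardless of what any model predicts.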
For each round, half of the experiments were selected using one method and half were selected using another method.
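A half-and-half batch can be assembled from two ranked candidate lists, as in the sketch below. The function name `hybrid_select` and the handling of overlap (an experiment already taken from the first list is skipped in the second) are assumptions; the text does not say how duplicates were resolved.

```python
def hybrid_select(ranked_a, ranked_b, batch_size):
    """Fill half the batch from one ranking and the remainder from another.

    ranked_a, ranked_b: experiments ordered best-first by two selection methods,
    for example greedy and uncertainty sampling.
    """
    half = batch_size // 2
    chosen = list(ranked_a[:half])
    for exp in ranked_b:
        if len(chosen) >= batch_size:
            break
        if exp not in chosen:          # skip experiments already selected
            chosen.append(exp)
    return chosen
```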
When a predictive model was learned from observed data using Lasso regression, memory limitations were applied such that only experiments observed from the last m rounds of selection were used for training the predictive model.
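The memory limit amounts to a sliding window over rounds. In the sketch below, the function name `training_window` and the per-round list structure are assumptions; m is the number of most recent rounds retained (10 for the best-performing method mentioned in the discussion).

```python
def training_window(observed_by_round, m):
    """Return training data from only the last m rounds of selection.

    observed_by_round: list with one entry per round, each a list of
    (experiment, score) pairs revealed in that round, oldest round first.
    """
    recent = observed_by_round[-m:] if m > 0 else []
    return [item for round_results in recent for item in round_results]
```

Older results are excluded only from model training; they remain observed and are not selected again.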
The data supporting the results of this article are included within the article and its additional files.
CFO: Compound features only; model learned using information from only compound features.
PFO: Protein features only; model learned using information from only target protein features.
CCT: Combined compound-target model; model learned using information from both compound features and target protein features.
We thank Jaime Carbonell, Jeffrey Schneider, Aarti Singh and D. Lansing Taylor for helpful discussions. This work was supported in part by NIH grants R01 GM075205 and T32 EB009403.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.