Predicting clinically promising therapeutic hypotheses using tensor factorization

Background: Determining which target to pursue is a challenging and error-prone first step in developing a therapeutic treatment for a disease, and missteps are potentially very costly given the long time frames and high expense of drug development. With current informatics technology and machine learning algorithms, it is now possible to discover therapeutic hypotheses computationally by predicting clinically promising drug targets from the evidence associating drug targets with disease indications. We collected this evidence from Open Targets and additional databases, covering 17 sources of evidence for target-indication association, and represented the data as a tensor of size 21,437 × 2211 × 17.

Results: As a proof of concept, we identified examples of successes and failures of target-indication pairs in clinical trials across 875 targets and 574 disease indications to build a gold-standard data set of 6140 known clinical outcomes. We designed and executed three benchmarking strategies to examine the performance of multiple machine learning models: Logistic Regression, LASSO, Random Forest, Tensor Factorization and Gradient Boosting Machine. With 10-fold cross-validation, tensor factorization achieved AUROC = 0.82 ± 0.02 and AUPRC = 0.71 ± 0.03, comparable to or better than the other methods across multiple validation schemes.

Conclusion: In this work, we benchmarked a machine learning technique called tensor factorization for the problem of predicting clinical outcomes of therapeutic hypotheses. The results show that this method can achieve prediction performance equal to or better than a variety of baseline models. We demonstrate one application of the method: predicting outcomes of trials on novel indications of approved drug targets. This work can be expanded to targets and indications that have never been clinically tested and to proposing novel target-indication hypotheses. Our proposed biologically motivated cross-validation schemes provide insight into the robustness of the prediction performance, which has significant implications for all future methods that address this seminal problem in drug discovery.

Electronic supplementary material: The online version of this article (10.1186/s12859-019-2664-1) contains supplementary material, which is available to authorized users.

determine the number of latent factors. ARD is mainly used to infer relevant features from a large number of input features. The basic idea of ARD is to assign independent zero-mean Gaussian priors to the feature weights. The variances of these priors represent the relevance of the different input features: if a variance is zero, the corresponding weights are constrained to be zero, and that input cannot have any effect on the predictions, making it irrelevant. ARD optimizes these variances to discover which inputs are relevant. Similarly, we can treat the latent factors in a Bayesian tensor factorization model as latent features and use the same idea to determine the number of relevant latent factors. Specifically, we leverage the approximate distribution generated from MCMC samples and estimate the variances of the latent features by fitting the model with a sufficiently large number of factors. If a variance is close to zero, the corresponding latent factor is set to zero; the resulting number of factors with non-zero variance is then used to re-fit the model. In practice, it is difficult to set a threshold for deciding when a variance is close to zero. Here we chose the point before the last large gap, or "elbow", in a plot of the latent factors' variances in descending order (Figure S1a). The intuition behind this approach is that beyond this point, the latent factors with low variance no longer preserve the inherent structure in the data, and incorporating them only adds noise to the final predictions.
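The elbow heuristic above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the rule for what counts as a "large" gap (here, any drop of at least half the largest drop between consecutive sorted variances) is an assumption, since the paper selects the elbow by inspecting the variance plot.

```python
import numpy as np

def select_num_factors(variances):
    """Pick the number of latent factors from per-factor variance estimates.

    Sorts the variances in descending order and keeps the factors before
    the last large gap ("elbow"). The half-of-largest-drop threshold is
    an illustrative choice, not taken from the paper.
    """
    v = np.sort(np.asarray(variances, dtype=float))[::-1]
    gaps = v[:-1] - v[1:]                    # drops between consecutive variances
    large = np.where(gaps >= 0.5 * gaps.max())[0]
    return int(large[-1]) + 1                # count of factors kept before the elbow
```

For example, with estimated variances [5.0, 4.8, 4.5, 0.3, 0.1, 0.05], the last large drop is between 4.5 and 0.3, so three factors would be retained for re-fitting the model.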
As a comparison, we also ran the three cross-validation experiments on the Bayesian tensor factorization model with a series of latent factor counts. Interestingly, although performance generally increases with the number of factors (Figure S1b), there is a local peak around the number of factors chosen by the proposed method, especially in the leave one target class out and leave one disease cluster out settings.

De novo disease clustering
The goal of partitioning indications into disease clusters is to obtain a relatively large grouping of indications for leave-one-out validation, such that indications in a group are more similar to each other than to indications outside the group. A simple approach is to directly use the hierarchy curated in the MeSH (Medical Subject Headings) system. However, there are two problems with this approach. First, it is not uncommon for one MeSH term to be assigned to multiple MeSH trees, which prevents uniquely assigning indications to a group. Second, since the MeSH structure is human-curated, some disease relationships may not be captured in it. Given these problems, we derived a disease partitioning de novo. We clustered diseases based on a) disease-disease similarity encoded in the MeSH structure and b) disease-disease co-occurrence in the literature, to capture similarity information missed by the MeSH structure, and then merged the two results into one partition using consensus clustering.
MeSH similarity. MeSH similarity between any pair of disease terms was calculated using Lin's [2] and Resnik's [3] methods as described in Nelson [4], and the averaged similarity score of the two methods was used to construct a disease-disease similarity matrix S. We then performed hierarchical clustering using Ward's method on the disease-disease distance matrix, defined as 1 − S, and cut the hierarchy tree at the level yielding ten clusters (Figure S2a). Ten was chosen for practical reasons, so that the number of clusters for cross-validation matches the number used in the standard leave one target class out cross-validation.
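The clustering step above (Ward's method on the 1 − S distance matrix, cut at ten clusters) can be sketched with SciPy. The function name `cluster_diseases` is ours; we assume S is a symmetric similarity matrix with ones on the diagonal.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_diseases(S, n_clusters=10):
    """Ward hierarchical clustering on a disease-disease similarity matrix S.

    Distance is defined as 1 - S, as in the text; the tree is cut so
    that it yields n_clusters groups.
    """
    D = 1.0 - np.asarray(S, dtype=float)
    np.fill_diagonal(D, 0.0)                 # self-distance must be zero
    condensed = squareform(D, checks=False)  # condensed form required by linkage
    Z = linkage(condensed, method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

`fcluster` with `criterion="maxclust"` cuts the tree at the level that produces at most the requested number of clusters, matching the "cut at the level yielding ten clusters" step.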
Disease co-occurrence. We used the TERMITE platform from SciBite (www.scibite.com/products/termite) to process the scientific literature (April 29, 2016). We recorded the number of abstracts in which each disease pair co-occurred, compared these counts with the counts of abstracts in which each disease appeared with any disease, converted the resulting overrepresentation P value and odds ratio into a score (0 for random, 1 for the highest possible overrepresentation), and then clustered the scores as in the MeSH case (Figure S2b).
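One way to compute such an overrepresentation score for a disease pair is sketched below. This is an illustrative assumption, not the paper's formula: the text does not specify how the P value and odds ratio are combined, so here a one-sided Fisher's exact test on a 2x2 abstract-count table supplies the odds ratio, which is mapped onto [0, 1].

```python
from scipy.stats import fisher_exact

def cooccurrence_score(n_both, n_a, n_b, n_total):
    """Overrepresentation score for a disease pair from abstract counts.

    n_both: abstracts mentioning both diseases; n_a / n_b: abstracts
    mentioning disease A / disease B; n_total: all abstracts considered.
    The 2x2 table and the mapping of the odds ratio onto [0, 1] are
    illustrative assumptions, not the paper's exact scoring formula.
    """
    table = [[n_both, n_a - n_both],
             [n_b - n_both, n_total - n_a - n_b + n_both]]
    odds, _ = fisher_exact(table, alternative="greater")
    # 0 for no enrichment (odds <= 1), approaching 1 for strong enrichment
    return 1.0 - 1.0 / odds if odds > 1 else 0.0
```

A strongly co-occurring pair (e.g. 50 shared abstracts out of 60 mentions each in a corpus of 1000) scores close to 1, while a pair co-occurring at roughly the rate expected by chance scores close to 0.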
Merged results. We merged the two clustering results into one using consensus clustering. Specifically, each clustering result induces a disease-disease matrix in which 1 indicates that two diseases are in the same cluster and 0 otherwise. We averaged the two disease-disease matrices induced by the two clustering results, performed hierarchical clustering on the resulting matrix using Ward's method, and cut the hierarchical tree at the level corresponding to ten clusters (Figure S2c). The ten clusters are named after the most abundant MeSH root term in each cluster. We further merged cluster 8 and cluster 9 into one cluster, since most of the diseases in these two clusters are infectious diseases. As neoplasm diseases (cluster 3) were not included in this paper, the final number of disease clusters for the leave-one-out validation is eight.

Tables

Table S1. Clinical outcome statistics of target-indication pairs (TIP) grouped by target classes.
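The consensus-clustering merge described under "Merged results" can be sketched as follows; the function name `consensus_partition` is ours, and the inputs are the two cluster label vectors produced by the MeSH-similarity and co-occurrence clusterings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_partition(labels_a, labels_b, n_clusters=10):
    """Merge two disease partitions by consensus clustering.

    Each partition induces a 0/1 co-membership matrix; the two matrices
    are averaged, and Ward clustering is run on 1 minus the average,
    cutting the tree at n_clusters, as described in the text.
    """
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    co_a = (a[:, None] == a[None, :]).astype(float)
    co_b = (b[:, None] == b[None, :]).astype(float)
    consensus = (co_a + co_b) / 2.0          # 1, 0.5, or 0 for each disease pair
    D = 1.0 - consensus
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

When the two input partitions agree, the consensus matrix is binary and the merged partition simply reproduces the shared grouping; disagreements show up as pairwise values of 0.5, which Ward's method resolves when the tree is cut.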