Predicting drug-disease associations by using similarity constrained matrix factorization

Background Drug-disease associations provide important information for the drug discovery. Wet experiments that identify drug-disease associations are time-consuming and expensive. However, many drug-disease associations are still unobserved or unknown. The development of computational methods for predicting unobserved drug-disease associations is an important and urgent task. Results In this paper, we proposed a similarity constrained matrix factorization method for the drug-disease association prediction (SCMFDD), which makes use of known drug-disease associations, drug features and disease semantic information. SCMFDD projects the drug-disease association relationship into two low-rank spaces, which uncover latent features for drugs and diseases, and then introduces drug feature-based similarities and disease semantic similarity as constraints for drugs and diseases in low-rank spaces. Different from the classic matrix factorization technique, SCMFDD takes the biological context of the problem into account. In computational experiments, the proposed method can produce high-accuracy performances on benchmark datasets, and outperform existing state-of-the-art prediction methods when evaluated by five-fold cross validation and independent testing. Conclusion We developed a user-friendly web server by using known associations collected from the CTD database, available at http://www.bioinfotech.cn/SCMFDD/. The case studies show that the server can find out novel associations, which are not included in the CTD database.


Background
A drug is a chemical that treats, cures, prevents, or diagnoses diseases. Drug design has three stages: the discovery stage, the preclinical stage and the clinical development stage [1]; developing a new drug takes 15 years [2] and costs 800 million dollars [3].
The drug-disease associations refer to the events that drugs exert effects on diseases, which can be classified into two types: drug indications and drug side-effects. Some drugs could have a therapeutic role in a disease, e.g. a drug treats leukemia & lymphoma; other drugs could play a role in the etiology of a disease, e.g. exposure to a drug causes lung cancer [4]. Drug-disease associations reveal the close relations between drugs and diseases, and have gained great attention. Computational methods can screen possible drug-disease associations, and complement or guide laborious and costly wet experiments.
In recent years, a great number of computational methods have been proposed to predict drug-disease associations. As shown in Fig. 1, existing methods are roughly classified into two types. One type of method makes use of biological elements shared by drugs and diseases to predict drug-disease associations. Eichborn J et al. [5] studied drug-disease relations based on drug side effects. Wang et al. [6] and Wiegers et al. [7] considered drug-gene-disease relations. Yu et al. [8] used common protein complexes related to drugs and diseases. These methods have to use elements shared by drugs and diseases, but many drugs and diseases do not share any elements, and these methods fail to work in this case. The other type of method predicts novel drug-disease associations by using known drug-disease associations, drug features and disease features. Gottlieb et al. [9] constructed a universal predictor named PREDICT for drug repositioning to express drug-disease associations in a large-scale manner that integrated molecular structure, molecular activity and disease semantic data. Yang et al. [10] built Naive Bayes models to predict indications for diseases based on their side effects. Wang et al. [11] proposed the method "PreDR" that trained a support vector machine (SVM) model based on drug structures, drug target proteins, and drug side effects. Huang et al. [12] combined three different networks of drugs, genomic and disease phenotypes to build a heterogeneous network to predict drug-disease associations. Oh et al. [13] proposed scoring methods to obtain quantified scores as features between drugs and diseases, and built classifiers based on the extracted features to predict novel drug-disease associations. Wang et al. [14] proposed a three-layer heterogeneous network model (TL-HGBI), and applied the approach to drug repositioning by using existing omics data of diseases, drugs and drug targets. Martínez et al. [15] built a network of interconnected drugs, proteins and diseases to identify their relations. Wang et al. [16] adopted recommendation systems to predict drug-disease relations. Moghadam et al. [17] combined drug features and disease features by using kernel fusion, and then built an SVM-based prediction model. Liang et al. [18] proposed a Laplacian regularized sparse subspace learning method (LRSSL), which integrated drug chemical information, drug target domain information and target annotation information.
A great number of drug-disease associations have been identified and stored in databases. However, many associations remain unobserved and need to be discovered. In this paper, we proposed a similarity constrained matrix factorization method for the drug-disease association prediction (SCMFDD), which makes use of known drug-disease associations, drug features and disease semantic information. SCMFDD projects the drug-disease association relationship into two low-rank spaces, which uncover latent features for drugs and diseases, and then introduces drug feature-based similarity and disease semantic similarity as constraints for drugs and diseases in low-rank spaces. Different from the classic matrix factorization technique, SCMFDD can take the biological context of the problem into account. Computational experiments show that SCMFDD can produce high-accuracy performances on benchmark datasets and outperform existing state-of-the-art prediction methods, i.e. PREDICT, TL-HGBI and LRSSL when evaluated by five-fold cross validation and independent testing on the same datasets. Moreover, a web server is constructed on known associations collected from the CTD database [4], and case studies show that the web server can help to find out novel associations.
The main contributions of this paper include: 1) we proposed a novel matrix factorization approach (SCMFDD), which is different from the traditional matrix factorization methods. SCMFDD incorporates drug features and disease semantic information into the matrix factorization frame; 2) an efficient optimization algorithm is developed
Fig. 1 Two types of drug-disease association prediction methods. a Infer drug-disease associations without known associations; b Infer unobserved drug-disease associations based on known associations

Datasets
CTD database [4] is a publicly available database that intends to advance understanding about how environmental exposures affect human health. CTD database provides curated and inferred chemical-disease associations. The curated associations are real associations extracted from literature. Several databases describe features for drugs and diseases. PubChem Compound database [19] provides drug substructures. DrugBank database [20] is a comprehensive resource for drug targets, drug enzymes and drug-drug interactions. KEGG DRUG database [21] provides pathway information for approved drugs in Japan, USA and Europe. U.S. National Library of Medicine stores disease MeSH descriptors, which reflect the hierarchy of diseases. We downloaded real drug-disease associations from CTD database, and collected features for drugs and diseases to compile our datasets. In order to avoid sparsity of drug-disease associations, we selected drugs that are associated with more than 10 diseases, and also selected diseases that are associated with more than 10 drugs. Moreover, we collected drug features: substructures, targets, enzymes, pathways and drug-drug interactions as well as disease MeSH descriptors. Thus, we compiled a dataset named "SCMFDD-S", which contains 18,416 associations between 269 drugs and 598 diseases. Further, we selected drugs associated with at least one disease as well as diseases associated with at least one drug, and collected drug substructures and disease MeSH descriptors. Thus, we compiled a larger dataset named "SCMFDD-L", which contains 49,217 associations between 1323 drugs and 2834 diseases. Table 1 summarizes the datasets "SCMFDD-S" and "SCMFDD-L".
Several benchmark datasets were used in the drug-disease association prediction. Gottlieb et al. [9] compiled a dataset with 1933 associations between 593 drugs in DrugBank and 313 diseases in OMIM, and used it for the method "PREDICT". This dataset contains five types of drug-drug similarities and two types of disease-disease similarities. Three drug-drug similarities  are calculated based on drug-related genes, by using Smith-Waterman sequence alignment score [22], all-pairs shortest paths algorithm [23] and semantic similarity scores [24] respectively; other two drug-drug similarities are drug structure-based Tanimoto similarity and drug side effect-based Jaccard similarity. Two disease-disease similarity measures are semantic similarity and genetic similarity. Wang et al. [14] compiled a dataset with 1461 interactions between 1409 drugs in DrugBank database and 5080 diseases in OMIM database, and used it for the method "TL-HGBI". The dataset also contains the drug-drug structure similarity and disease semantic similarity. Liang et al. [18] obtained 3051 associations between 763 drugs and 681 diseases from the study [25], and collected drug substructures, protein domains of target proteins, gene ontology terms of target proteins to calculate three types of drug-drug similarities as well as the disease-disease semantic similarity. The dataset was used for the method "LRSSL". We name these datasets as "PREDICT dataset", "TL-HGBI dataset" and "LRSSL datasets". Therefore, we adopt SCMFDD-S dataset, SCMFDD-L dataset, PREDICT dataset, TL-HGBI dataset and LRSSL datasets as benchmark datasets.

Similarity constrained matrix factorization method
The aim of this study is to predict unobserved drug-disease associations by using drug features, disease semantic information and known associations. Figure 2 illustrates the basic idea of the similarity constrained matrix factorization method for the drug-disease association prediction (SCMFDD).

Drug-drug similarities
A feature is a set of descriptors. A drug has a subset of these descriptors, and is thus represented as a bit vector whose entries indicate the presence or absence of the corresponding descriptors with the value 1 or 0. Let P and Q denote the feature vectors of two drugs. The Jaccard similarity between the two drugs is calculated by

$$J(P, Q) = \frac{|P \cap Q|}{|P \cup Q|},$$

where $|P \cap Q|$ is the number of bits where P and Q both have the value 1, and $|P \cup Q|$ is the number of bits where either P or Q has the value 1.
When we have different features of a drug, i.e. substructures, targets, enzymes, pathways and drug-drug interactions, we can represent them as feature vectors in different feature spaces, and calculate different types of drug-drug similarities.
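The Jaccard similarity described above can be sketched for a whole set of drugs at once. This is a minimal illustration, not the authors' code: the function name `jaccard_similarity_matrix`, the toy feature matrix, and the convention that two drugs without any descriptor get similarity 0 are assumptions.

```python
import numpy as np

def jaccard_similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Pairwise Jaccard similarity between the rows of a 0/1 feature matrix."""
    bits = (np.asarray(features) != 0).astype(np.int64)
    both = bits @ bits.T                                  # |P ∩ Q| for every pair of rows
    counts = bits.sum(axis=1)                             # number of set bits per drug
    either = counts[:, None] + counts[None, :] - both     # |P ∪ Q| by inclusion-exclusion
    # Assumption: a pair of drugs with no descriptors at all gets similarity 0.
    sim = np.zeros(both.shape, dtype=float)
    np.divide(both, either, out=sim, where=either > 0)
    return sim

# Hypothetical toy data: 3 drugs described by 4 descriptors.
toy_features = np.array([[1, 1, 0, 0],
                         [1, 0, 1, 0],
                         [0, 0, 0, 0]])
W_d = jaccard_similarity_matrix(toy_features)
```

Running the same function on the substructure, target, enzyme, pathway and interaction bit vectors would give one drug-drug similarity matrix per feature type.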

Disease-disease semantic similarity
MeSH is the National Library of Medicine's controlled vocabulary thesaurus, and MeSH provides hierarchical descriptors for diseases. As described in [26][27][28], we can calculate disease-disease semantic similarity by using MeSH information.
For each disease, a directed acyclic graph (DAG) is constructed based on hierarchical descriptors, in which nodes represent disease MeSH descriptors (or disease terms) and edges represent the relationship between the current node and its ancestors. For a disease A, the DAG is denoted as DAG(A) = (N(A), E(A)), where N(A) is the set of all ancestors of A (including A itself) and E(A) is the set of their corresponding links.
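We define the contribution of a node d in DAG(A) to the semantic value of disease A as

$$D_A(d) = \begin{cases} 1, & d = A \\ \max\{\Delta \cdot D_A(d') \mid d' \in \text{children of } d\}, & d \neq A \end{cases}$$

where Δ is the semantic contribution factor, and we set Δ = 0.5 in the study.
The semantic value of disease A is defined as

$$DV(A) = \sum_{d \in N(A)} D_A(d).$$

The semantic similarity between two diseases A and B is calculated by

$$S(A, B) = \frac{\sum_{d \in N(A) \cap N(B)} \left( D_A(d) + D_B(d) \right)}{DV(A) + DV(B)}.$$

Fig. 4 The influence of parameters on SCMFDD models. a the influence of μ and λ b the influence of k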
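The DAG-based semantic similarity can be illustrated with a short sketch. It assumes the commonly used formulation in which a term contributes 1 to itself and each step towards an ancestor multiplies the contribution by Δ, keeping the largest value when several paths exist. The `parents` dictionary and the term names are hypothetical placeholders, not real MeSH records.

```python
from typing import Dict, List

DELTA = 0.5  # semantic contribution factor, set to 0.5 in the paper

def contributions(term: str, parents: Dict[str, List[str]],
                  delta: float = DELTA) -> Dict[str, float]:
    """Contribution of every node in DAG(term) to the semantic value of `term`."""
    contrib = {term: 1.0}
    frontier = [term]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, []):
            value = delta * contrib[node]
            # Keep the largest contribution over all paths and re-propagate it.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(a: str, b: str, parents: Dict[str, List[str]],
                        delta: float = DELTA) -> float:
    """Shared contributions divided by the sum of both semantic values."""
    da = contributions(a, parents, delta)
    db = contributions(b, parents, delta)
    shared = da.keys() & db.keys()
    return sum(da[t] + db[t] for t in shared) / (sum(da.values()) + sum(db.values()))

# Hypothetical three-level hierarchy (child -> list of parents).
toy_parents = {"leukemia": ["neoplasms"],
               "lymphoma": ["neoplasms"],
               "neoplasms": ["diseases"]}
sim_ab = semantic_similarity("leukemia", "lymphoma", toy_parents)
sim_aa = semantic_similarity("leukemia", "leukemia", toy_parents)
```

In this toy hierarchy each disease has semantic value 1 + 0.5 + 0.25 = 1.75, and the two diseases share the ancestors "neoplasms" and "diseases", so the similarity is 1.5 / 3.5.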

Objective Function
The observed drug-disease associations can be formulated as a bipartite network and represented by a binary matrix $A \in \mathbb{R}^{n \times m}$, where n is the number of drugs and m is the number of diseases. $a_{ij}$ is the (i, j)th entry of A. If the vertex (drug) $d_i$ and the vertex (disease) $dis_j$ are connected, $a_{ij} = 1$; otherwise $a_{ij} = 0$. The bipartite network and the association matrix are demonstrated in Fig. 3. SCMFDD factorizes the drug-disease association matrix A into two low-rank feature matrices $X \in \mathbb{R}^{n \times k}$ and $Y \in \mathbb{R}^{m \times k}$, where k is the dimension of the drug features and disease features in the low-rank spaces. The drug-disease association can be approximated by the inner product between the drug feature vector and the disease feature vector: $a_{ij} \approx x_i y_j^T$, where $x_i$ is the ith row of X and $y_j$ is the jth row of Y. The objective function is defined as:

$$\min_{X,Y} L = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( a_{ij} - x_i y_j^T \right)^2 \tag{1}$$

Then, to avoid overfitting, L2 regularization terms of $x_i$ and $y_j$ are added to the objective function (1):

$$\min_{X,Y} L = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( a_{ij} - x_i y_j^T \right)^2 + \frac{\mu}{2} \left( \sum_{i=1}^{n} \|x_i\|^2 + \sum_{j=1}^{m} \|y_j\|^2 \right) \tag{2}$$

where μ is the regularization parameter for $x_i$ and $y_j$. Recent studies on manifold learning theory [29,30], spectral graph theory [31,32] and their applications [33][34][35][36][37][38] show that the geometric and topological structure of data points may be maintained when they are mapped from a high-dimensional space into a low-dimensional space. Considering that the similarity matrices $W^d$ and $W^s$ can not only be defined to represent statistical correlation but can also be regarded as geometric properties of the data points, we introduce the similarity constraint terms $R_X$ and $R_Y$:

$$R_X = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \|x_i - x_j\|^2 \, w^d_{ij} \tag{3}$$

$$R_Y = \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \|y_i - y_j\|^2 \, w^s_{ij} \tag{4}$$

where $w^d_{ij}$ denotes the similarity between the drug $d_i$ and the drug $d_j$, which is calculated in the drug feature space; $w^s_{ij}$ denotes the similarity between the disease $dis_i$ and the disease $dis_j$, which is calculated in the disease feature space. It is generally believed that the similarity between two data points is higher if the distance between them is smaller. Therefore, $R_X$ (or $R_Y$) incurs a heavy penalty if the drugs $d_i$ and $d_j$ (or the diseases $dis_i$ and $dis_j$) are close in the drug feature space (or disease feature space) but are mapped far apart in the low-rank space; minimizing it thus ensures that similar drugs (or similar diseases) are mapped closely in the low-rank spaces. Hence, we can effectively maintain the topological structure of drug data points and disease data points by minimizing $R_X$ and $R_Y$.
By combining $R_X$ and $R_Y$ with the objective function (2), we propose the objective function of SCMFDD:

$$\min_{X,Y} L = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( a_{ij} - x_i y_j^T \right)^2 + \frac{\mu}{2} \left( \sum_{i=1}^{n} \|x_i\|^2 + \sum_{j=1}^{m} \|y_j\|^2 \right) + \frac{\lambda}{2} \left( R_X + R_Y \right) \tag{5}$$

where λ is the hyperparameter controlling the smoothness of the similarity consistency.
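To make the structure of the combined objective concrete, the sketch below evaluates it for given factor matrices. It assumes a squared reconstruction error, L2 penalties weighted by μ, and similarity-weighted pairwise distances weighted by λ; the exact constant factors (the halves in the comments) and the function names are assumptions for illustration.

```python
import numpy as np

def pairwise_penalty(Z: np.ndarray, W: np.ndarray) -> float:
    """sum_ij w_ij * ||z_i - z_j||^2, computed from the Gram matrix of Z."""
    sq = np.sum(Z ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)   # squared Euclidean distances
    return float(np.sum(W * dist2))

def scmfdd_loss(A, X, Y, W_d, W_s, mu, lam) -> float:
    """Objective value; the constant factors below are assumed, not taken from the paper."""
    fit = 0.5 * np.sum((A - X @ Y.T) ** 2)                  # reconstruction error
    ridge = 0.5 * mu * (np.sum(X ** 2) + np.sum(Y ** 2))    # L2 regularization
    r_x = 0.5 * pairwise_penalty(X, W_d)                    # drug similarity constraint
    r_y = 0.5 * pairwise_penalty(Y, W_s)                    # disease similarity constraint
    return float(fit + ridge + 0.5 * lam * (r_x + r_y))

# Hypothetical toy problem: 2 drugs, 2 diseases, perfect reconstruction.
X_toy = np.eye(2)
Y_toy = np.eye(2)
A_toy = X_toy @ Y_toy.T
W_toy = np.array([[0.0, 1.0], [1.0, 0.0]])
loss_toy = scmfdd_loss(A_toy, X_toy, Y_toy, W_toy, W_toy, mu=1.0, lam=0.5)
```

In the toy problem the reconstruction term vanishes, so the value consists only of the regularization term and the two similarity terms, which shows how λ trades fit against agreement with the similarity matrices.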

Optimization algorithm
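Here, we develop an efficient optimization algorithm to solve the objective function in (5). First, because $W^d$ and $W^s$ are symmetric, the partial derivatives of L with respect to $x_i$ and $y_j$ are

$$\nabla_{x_i} L = \left( x_i Y^T - A(i,:) \right) Y + \mu x_i + \lambda \sum_{j \neq i} (x_i - x_j) \, w^d_{ij} \tag{6}$$

$$\nabla_{y_j} L = \left( y_j X^T - A(:,j)^T \right) X + \mu y_j + \lambda \sum_{i \neq j} (y_j - y_i) \, w^s_{ij} \tag{7}$$

where A(i, :) represents the ith row of A and A(:, j) represents the jth column of A.
Then, we can calculate the second derivatives of L with respect to $x_i$ and $y_j$:

$$\nabla^2_{x_i} L = Y^T Y + \mu I + \lambda \sum_{j \neq i} w^d_{ij} \, I \tag{8}$$

$$\nabla^2_{y_j} L = X^T X + \mu I + \lambda \sum_{i \neq j} w^s_{ij} \, I \tag{9}$$

Utilizing Newton's method, we have:

$$x_i \leftarrow x_i - \nabla_{x_i} L \left( \nabla^2_{x_i} L \right)^{-1} \tag{10}$$

$$y_j \leftarrow y_j - \nabla_{y_j} L \left( \nabla^2_{y_j} L \right)^{-1} \tag{11}$$

Thus, we can obtain the updating rules:

$$x_i \leftarrow \left( A(i,:) Y + \lambda \sum_{j \neq i} w^d_{ij} x_j \right) \left( Y^T Y + \mu I + \lambda \sum_{j \neq i} w^d_{ij} \, I \right)^{-1} \tag{12}$$

$$y_j \leftarrow \left( A(:,j)^T X + \lambda \sum_{i \neq j} w^s_{ij} y_i \right) \left( X^T X + \mu I + \lambda \sum_{i \neq j} w^s_{ij} \, I \right)^{-1} \tag{13}$$

We alternately update $x_i$ and $y_j$ with Eq. (12) and Eq. (13) until convergence. The prediction matrix is given by

$$A_{predict} = X Y^T \tag{14}$$

The score $(A_{predict})_{ij}$ represents the probability that the drug $d_i$ and the disease $dis_j$ are associated. The optimization algorithm is summarized in Algorithm 1.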

Algorithm 1 Algorithm to solve objective function (5)
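Input: known drug-disease association matrix, $A \in \mathbb{R}^{n \times m}$; drug similarity matrix, $W^d \in \mathbb{R}^{n \times n}$; disease similarity matrix, $W^s \in \mathbb{R}^{m \times m}$; dimension of the low-rank feature space, k < min(m, n); regularization parameters, μ > 0, λ > 0;
Output: the prediction matrix $A_{predict}$
1 Initialize $X \in \mathbb{R}^{n \times k}$, $Y \in \mathbb{R}^{m \times k}$ as two random matrices;
2 Repeat
3 Update X: update each row $x_i$ with Eq. (12);
4 Update Y: update each row $y_j$ with Eq. (13);
5 Until convergence;
6 Return $A_{predict} = X Y^T$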
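The alternating scheme of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative reading rather than the authors' implementation: each row is set to the minimizer of a ridge-type quadratic with the other factor held fixed, self-similarities on the diagonal are ignored, and the stopping rule (`max_iter`, `tol`), the uniform random initialization and all function names are assumptions.

```python
import numpy as np

def _update_rows(P, Q, A, W, mu, lam):
    """In-place update of every row p_i of P with the other factor Q held fixed."""
    gram = Q.T @ Q
    eye = np.eye(Q.shape[1])
    for i in range(P.shape[0]):
        rhs = A[i] @ Q + lam * (W[i] @ P)               # A(i,:)Q + lam * sum_j w_ij p_j
        hess = gram + (mu + lam * W[i].sum()) * eye     # Q^T Q + mu I + lam * sum_j w_ij I
        # hess is symmetric positive definite, so this equals rhs @ inv(hess).
        P[i] = np.linalg.solve(hess, rhs)

def scmfdd_fit(A, W_d, W_s, k, mu, lam, max_iter=100, tol=1e-6, seed=0):
    """Return the score matrix X Y^T (stopping rule and initialization are assumptions)."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    n, m = A.shape
    X, Y = rng.random((n, k)), rng.random((m, k))
    # Self-similarities do not change the pairwise penalty, so they are dropped.
    W_d = W_d - np.diag(np.diag(W_d))
    W_s = W_s - np.diag(np.diag(W_s))
    for _ in range(max_iter):
        X_old = X.copy()
        _update_rows(X, Y, A, W_d, mu, lam)      # update drug factors
        _update_rows(Y, X, A.T, W_s, mu, lam)    # update disease factors
        if np.max(np.abs(X - X_old)) < tol:
            break
    return X @ Y.T

# Hypothetical toy data: drugs 0 and 1 share the same two diseases.
A_toy = np.array([[1, 0, 1], [1, 0, 1], [0, 0, 0], [0, 0, 0]])
Wd_toy = np.array([[1, .8, 0, 0], [.8, 1, 0, 0], [0, 0, 1, .8], [0, 0, .8, 1]])
Ws_toy = np.array([[1, 0, .8], [0, 1, 0], [.8, 0, 1]])
pred = scmfdd_fit(A_toy, Wd_toy, Ws_toy, k=2, mu=0.01, lam=0.01)
```

Solving the small k × k linear system per row avoids forming an explicit inverse; for larger datasets the row loop could be vectorized, which is left out here for readability.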

Evaluation metrics
In our experiments, we adopted five-fold cross validation (5-CV) to test performances of prediction models. To implement five-fold cross validation, we randomly split all known drug-disease associations into five equal-sized subsets. In each fold, we combined four subsets as the training set, and used the other subset as the testing set. We constructed the prediction model based on known associations in the training set, and predicted associations in the testing set. Training and testing were repeated five times, and the average of performances was adopted.
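The five-fold split over known associations can be sketched as follows. It assumes that held-out associations are hidden by setting their entries to 0 in the training matrix, which is a common convention but is not spelled out in the text; the function name and the toy matrix are illustrative.

```python
import numpy as np

def five_fold_splits(A, n_folds=5, seed=0):
    """Yield (train_matrix, test_pairs) with the known associations split into folds."""
    rng = np.random.default_rng(seed)
    known = np.argwhere(A == 1)                  # (drug, disease) index of each known association
    order = rng.permutation(len(known))
    for fold in np.array_split(order, n_folds):
        test_pairs = known[fold]
        train = A.copy()
        # Assumption: held-out associations are treated as unknown during training.
        train[test_pairs[:, 0], test_pairs[:, 1]] = 0
        yield train, test_pairs

# Hypothetical toy association matrix with 10 known associations.
A_cv = (np.arange(20).reshape(4, 5) % 2 == 0).astype(int)
splits = list(five_fold_splits(A_cv))
```

Each training matrix is then used to fit a model, and the scores of the held-out pairs are compared against the scores of the pairs that were never known.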
AUC and AUPR are popular metrics for evaluating prediction models. Since drug-disease pairs without known associations far outnumber known drug-disease associations, we adopted AUPR as the primary metric, which takes both recall and precision into account. We also considered several binary classification metrics, i.e. sensitivity (SN, also known as recall), specificity (SP), accuracy (ACC) and F-measure (F).
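The metrics can be illustrated with a small sketch. Two assumptions are made here: the F-measure is taken to be the harmonic mean of precision and recall, and the area under the precision-recall curve is estimated by average precision, which is one common estimator (the text does not say which one was used).

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """SN, SP, ACC and F from binary labels; assumes both classes and a positive prediction exist."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    sn = tp / (tp + fn)                          # sensitivity (recall)
    sp = tn / (tn + fp)                          # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)        # accuracy
    precision = tp / (tp + fp)
    f = 2 * precision * sn / (precision + sn)    # assumed F1 form of the F-measure
    return {"SN": sn, "SP": sp, "ACC": acc, "F": f}

def average_precision(y_true, scores):
    """Mean of the precision values at the ranks of the positive examples."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.asarray(y_true)[order].astype(float)
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision_at_k * hits) / hits.sum())

labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
metrics = binary_metrics(labels, [s >= 0.75 for s in scores])
ap = average_precision(labels, scores)
```

Unlike AUC, the precision-based score is not inflated by the large number of true negatives, which is why it suits the imbalanced setting described above.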

Performances of SCMFDD
First of all, we discussed the influence of parameters on SCMFDD models by using the SCMFDD-S dataset. SCMFDD has three parameters, i.e. the number of latent variables k, the regularization parameter μ and the regularization parameter λ. k is the dimension of drugs and diseases in the low-rank spaces; it is less than the row number and the column number of the association matrix, i.e. $k < k_0 = \min(m, n)$. For simplicity, we set k as a percentage of $k_0$. SCMFDD builds a prediction model constrained by drug-drug similarity and disease-disease semantic similarity. We have several drug features in the SCMFDD-S dataset, and can calculate several types of drug-drug similarities. Here, we used the drug interaction-based similarity and the disease semantic similarity to build SCMFDD models for analysis. We considered all combinations of the following values: $\lambda \in \{2^{-3}, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}, 2^{2}, 2^{3}\}$, $\mu \in \{2^{-3}, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}, 2^{2}, 2^{3}\}$ and k ∈ {5%, 10%, 15%, …, 50%} to build SCMFDD models, and implemented five-fold cross validation to evaluate the models. The experiments for all parameter combinations took about 12 h on a PC with an Intel i7-7700K CPU and 16 GB RAM.
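The parameter grid described above can be enumerated as in the sketch below. The rounding of the percentage to an integer dimension (rounding down, with a minimum of 1) is an assumption, since the text does not specify it; the drug and disease counts are those reported for SCMFDD-S.

```python
from itertools import product

lambdas = [2.0 ** e for e in range(-3, 4)]          # 2^-3 ... 2^3
mus = [2.0 ** e for e in range(-3, 4)]              # 2^-3 ... 2^3
k_fractions = [p / 100 for p in range(5, 55, 5)]    # 5%, 10%, ..., 50%

def latent_dimension(fraction, n_drugs, n_diseases):
    # Assumption: the percentage of k0 = min(m, n) is rounded down, keeping at least 1.
    return max(1, int(fraction * min(n_drugs, n_diseases)))

# SCMFDD-S has 269 drugs and 598 diseases.
grid = [(latent_dimension(f, 269, 598), mu, lam)
        for f, mu, lam in product(k_fractions, mus, lambdas)]
```

The grid contains 10 × 7 × 7 = 490 combinations, so the reported total of about 12 h corresponds to roughly a minute and a half per five-fold evaluation on average.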
In computational experiments, SCMFDD produced the best AUPR score when k = 45%, $\mu = 2^{0}$ and $\lambda = 2^{2}$. Then, we fixed the latent variable number k = 45%, and evaluated the influence of parameters μ and λ; the results are shown in Fig. 4a. Clearly, μ and λ have great impact on the model. When μ is a small value, greater λ could lead to better performances; when μ is a great value, greater λ contributes to poorer performances. Further, we fixed the parameters $\mu = 2^{0}$ and $\lambda = 2^{2}$, and tested the influence of the latent variable number k. The latent variable numbers and AUPR scores of corresponding models are shown in Fig. 4b. Clearly, performances of SCMFDD increase as k increases, and remain unchanged after reaching a threshold.
Further, we tested the impact of different similarity constraints on SCMFDD models. We have various features of drugs, and can calculate different types of drug-drug similarities, i.e. substructure similarity, target similarity, pathway similarity, enzyme similarity and drug interaction similarity. These similarities can be used as the constraint terms for SCMFDD models. We set k = 45%, $\mu = 2^{0}$ and $\lambda = 2^{2}$ in the experiments. As shown in Table 2, SCMFDD models using different drug-drug similarities produce high-accuracy and robust performances. Since drug structures directly influence functions and drug interactions may induce drug effects, drug substructures and drug interactions lead to better results than other features.
Known drug-disease associations are an important resource for predicting unobserved drug-disease associations. The data richness, which is the ratio of the number of associations to the number of drug-disease pairs, may influence performances of SCMFDD. Here, we used the dataset SCMFDD-L for analysis. We removed drugs that are associated with fewer than m diseases, and removed diseases that are associated with fewer than m drugs from the SCMFDD-L dataset, m ∈ {2, 3, 4, 5, 6, …, 10}. As displayed in Fig. 5, the data richness increases as the threshold m increases, which in turn improves performances of SCMFDD models. Although the data richness influences the performances, SCMFDD could still produce robust performances.
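Data richness and the threshold-based filtering can be sketched as below. The filter here is a single pass that counts degrees on the unfiltered matrix; whether the original procedure repeats the filtering until no drug or disease falls below the threshold is not stated, so this is an assumption, as are the function names.

```python
import numpy as np

def data_richness(A):
    """Ratio of known associations to all drug-disease pairs."""
    return float(A.sum()) / A.size

def filter_by_degree(A, threshold):
    # Assumption: one pass, with degrees counted on the unfiltered matrix.
    keep_drugs = A.sum(axis=1) >= threshold
    keep_diseases = A.sum(axis=0) >= threshold
    return A[np.ix_(keep_drugs, keep_diseases)]

# Hypothetical toy matrix: only drug 0 and disease 0 have at least 2 associations.
A_rich = np.array([[1, 1, 1],
                   [1, 0, 0],
                   [0, 0, 0]])
filtered = filter_by_degree(A_rich, threshold=2)

# Richness implied by the reported dataset sizes.
richness_s = 18416 / (269 * 598)      # SCMFDD-S
richness_l = 49217 / (1323 * 2834)    # SCMFDD-L
```

With the reported sizes, SCMFDD-S has a richness of about 0.114 and SCMFDD-L about 0.013, which illustrates how much sparser the larger dataset is.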

Comparison with state-of-the-art prediction methods
In this section, we compared our method with three state-of-the-art drug-disease association prediction methods: PREDICT [9], TL-HGBI [14] and LRSSL [18]. PREDICT constructed a universal predictor for drug repositioning to express drug-disease associations in a large-scale manner that integrates molecular structure, molecular activity and semantic data. TL-HGBI was a computational framework based on a three-layer heterogeneous network model, which made use of omics data about diseases, drugs and drug targets to make predictions. LRSSL was a Laplacian regularized sparse subspace learning method, which integrated drug chemical information, drug target domains and target annotation information to make predictions. We obtained the datasets of PREDICT [9] and the datasets and source code of TL-HGBI [14] from the authors. The datasets and source code of LRSSL [18] are publicly available. Therefore, we can adopt these methods as benchmark methods for fair comparison.
First, we compared our method with PREDICT based on the PREDICT dataset by using five-fold cross validation. SCMFDD uses one drug similarity constraint and one disease similarity constraint. The PREDICT dataset contains five kinds of drug-drug similarities and two kinds of disease-disease similarities. Thus, we built 10 different SCMFDD models by combining drug-drug similarities and disease-disease similarities. As shown in Table 3, SCMFDD models and PREDICT produce similar AUC scores, but SCMFDD models yield much greater AUPR scores than PREDICT. Moreover, SCMFDD models were robust to different similarities, and the models based on the drug Genes-Waterman similarity and disease Gene Signature similarity produced the best results.
Then, we compared our method with TL-HGBI by using TL-HGBI dataset. TL-HGBI dataset contains one drug chemical structure similarity and one disease phenotypic similarity. We constructed the SCMFDD model by using drug structure similarity and disease phenotypic similarity. As shown in Table 4, SCMFDD produced similar AUC score but much greater AUPR score compared with TL-HGBI.
Further, we compared SCMFDD and LRSSL by using the LRSSL dataset. The LRSSL dataset contains three features of drugs: chemical substructures, protein domains of target proteins, and gene ontology information of target proteins. Three drug similarities were calculated from these features, and the disease semantic similarity was provided as well. Therefore, we can construct three SCMFDD models by combining each of the three drug similarities with the disease semantic similarity. Table 5 shows the performances of prediction models evaluated by five-fold cross validation. Clearly, the three SCMFDD models produce better performance than LRSSL.

Independent experiments
In this section, we conducted independent experiments to test performances of our method in predicting novel drug-disease associations.
The CTD database is an up-to-date resource on experimentally determined drug-disease associations. Since the PREDICT dataset and the LRSSL dataset were compiled several years ago, we can build prediction models by using the PREDICT dataset and the LRSSL dataset, and check the predictions against the CTD database. Drugs and diseases were matched across datasets according to their names and synonyms (provided by the CTD database "Chemical vocabulary" and "Disease vocabulary"). The PREDICT dataset and the LRSSL dataset include different types of drug-drug similarities, and we built different similarity-based SCMFDD models for a comprehensive comparison. The PREDICT model and the LRSSL model respectively predict novel associations by using the PREDICT dataset and the LRSSL dataset.
We considered the top predictions from top 2 to top 1000 with a step size of 2, and counted how many predicted associations can be confirmed in the CTD database at each cutoff. Figure 6 shows the number of checked predictions and the number of confirmed associations. Clearly, our method finds more novel associations than the benchmark methods, and performs well in the independent experiments.
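The counting procedure for the independent experiments can be sketched as follows. The sketch assumes that only pairs unknown in the training data are ranked and that a 0/1 matrix marks the pairs confirmed in the newer database; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def confirmed_in_top_n(scores, train, confirmed, cutoffs):
    """Count confirmed novel associations among the top-N scored unknown pairs."""
    rows, cols = np.nonzero(train == 0)                      # candidate (unknown) pairs
    order = np.argsort(-scores[rows, cols], kind="stable")   # highest score first
    hits = confirmed[rows[order], cols[order]].astype(int)
    cumulative = np.cumsum(hits)
    return [int(cumulative[min(n, len(hits)) - 1]) for n in cutoffs]

# Hypothetical toy example with one known and three unknown pairs.
scores_toy = np.array([[0.9, 0.2], [0.8, 0.7]])
train_toy = np.array([[1, 0], [0, 0]])
confirmed_toy = np.array([[0, 1], [1, 0]])
counts = confirmed_in_top_n(scores_toy, train_toy, confirmed_toy, cutoffs=[1, 2, 3])

# Cutoffs matching the text: top 2 to top 1000 in steps of 2.
paper_cutoffs = list(range(2, 1001, 2))
```

Plotting the returned counts against the cutoffs gives a curve of confirmed associations versus checked predictions for each method.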

Web server and applications
To facilitate the drug-disease association prediction, we developed a web server named "SCMFDD" by using the dataset SCMFDD-L, available at http://www.bioinfotech.cn/SCMFDD/. Users can predict novel drug-disease associations for a given drug or a given disease, and then visualize predictions. Here, we used two case studies to illustrate the usefulness of our web server for drug-disease association prediction. Clozapine is an effective drug to treat patients with refractory schizophrenia [39,40]. Clozapine works by changing the actions of chemicals in the brain. Here, the web server predicts diseases that are associated with Clozapine. Table 6 lists the top 10 predictions among all unknown relationships between Clozapine and diseases in the SCMFDD-L dataset. Then, we analyze these predicted diseases case by case. According to https://en.wikipedia.org/wiki/Clozapine (accessed on 2018-2-1), three diseases, namely sleep initiation and maintenance disorders (also known as insomnia), status epilepticus and headache, have been reported as side effects of Clozapine, indicating that they have associations with the drug "Clozapine". Further, the study [41] found that Clozapine improved the syndrome of inappropriate antidiuretic hormone secretion (SIADH) in a patient; the studies [42,43] revealed that Clozapine can be used for the treatment of post-traumatic stress disorder (PTSD); the study [44] demonstrated that Clozapine can be used for the treatment of Parkinson's disease; the study [45] indicated that Clozapine can affect the visual memory.
Table 3 Performance of PREDICT and SCMFDD on PREDICT Dataset
Alzheimer's disease (AD) is a chronic neurodegenerative disorder that leads to disturbances of cognitive functions. The root cause and effective treatment of AD remain unclear, and AD has attracted many scientists to study its pathogenic mechanism and therapeutic function. Table 7 lists the top 10 predicted drugs associated with Alzheimer's disease, and evidence is available for six drugs. For example, the study [46] revealed that Olanzapine appears to be effective in treating psychotic and behavioral disturbances associated with AD; the study [47] found that stimulation of the dopaminergic system could improve cognitive function in a murine model and suggested that Levodopa, which works in the dopaminergic system, could ameliorate typical symptoms of AD: learning and memory deficits. The study [48] revealed that the level of Malondialdehyde is a risk factor for AD.
The study [49] confirmed that progesterone could significantly reduce and inhibit tau hyperphosphorylation, a chemical process implicated in AD. The study [50] demonstrated that Valproic Acid (VPA) could decrease β-amyloid (Aβ) production, which is the key risk factor in AD, and improve memory deficits of AD model mice. The study [51] showed that Ethanol protects neurons against Aβ-induced synapse damage and explained epidemiological reports that moderate alcohol consumption protects against the development of AD. The server can visualize the predictions. Figure 7 shows the top 100 predictions for Clozapine and top 200 predictions for Alzheimer's disease. As shown in Fig. 7a, "dark blue circle" stands for a disease, which has a known association with Clozapine, and "red square" stands for predicted diseases, which have an association with Clozapine. As shown in Fig. 7b, "dark blue circle" stands for a drug, which has a known association with Alzheimer's disease, and "red square" stands for predicted drugs, which have an association with Alzheimer's disease. Users can adjust the number of predictions for visualization.

Conclusion
In this paper, we proposed a computational method "SCMFDD" to predict unobserved drug-disease associations. SCMFDD incorporates drug feature-based similarities and disease semantic similarity into the matrix factorization frame. Experimental results show that SCMFDD can produce high-accuracy performances on five benchmark datasets when evaluated by five-fold cross validation, and SCMFDD outperforms state-of-the-art methods under fair comparison. Moreover, SCMFDD produces satisfying performances for different similarity constraints, and is also robust to the data richness. We constructed a web server based on drug-disease associations, which are collected from the CTD database. The server can predict novel drug-disease associations, and can also help researchers to quickly find associations for drugs or diseases of interest. In recent years, deep learning methods have been applied to similar tasks [52][53][54]. However, designing an effective neural network is a hard task, and the training process also costs a great amount of time. Compared to deep learning-based methods, SCMFDD is easy to implement, and SCMFDD can be applied to similar tasks in bioinformatics.
Scores are normalized by using ((score-min)/(max-min))
Fig. 7 Web Visualization of predictions for Clozapine a and predictions for Headache b
However, SCMFDD still has several limitations. First, SCMFDD has three parameters, and there is no good way of determining suitable parameters except going through all combinations. For our datasets, it costs dozens of hours to determine optimal parameters. Second, SCMFDD only uses an individual drug feature-based similarity to build prediction models. When we have multiple drug features, we can calculate different drug feature-based similarities. Combining diverse information can usually lead to improved performances [55][56][57][58][59][60], and how to integrate multiple similarities in a model is our future work. Third, the server can make predictions for the drugs and diseases in our dataset, but cannot support other drugs or diseases.
Availability of data and materials A user-friendly web server is available at: http://www.bioinfotech.cn/SCMFDD/.
Authors' contributions WZ and FL conceived the project; XY, WL and WZ designed the experiments; XY, WL and FH performed the experiments; WW and RL designed the server; WZ and XY wrote the paper. All authors read and approved the final manuscript.
Ethics approval and consent to participate Not applicable.