Skip to main content


Predicting drug-disease associations by using similarity constrained matrix factorization



Drug-disease associations provide important information for the drug discovery. Wet experiments that identify drug-disease associations are time-consuming and expensive. However, many drug-disease associations are still unobserved or unknown. The development of computational methods for predicting unobserved drug-disease associations is an important and urgent task.


In this paper, we proposed a similarity constrained matrix factorization method for the drug-disease association prediction (SCMFDD), which makes use of known drug-disease associations, drug features and disease semantic information. SCMFDD projects the drug-disease association relationship into two low-rank spaces, which uncover latent features for drugs and diseases, and then introduces drug feature-based similarities and disease semantic similarity as constraints for drugs and diseases in low-rank spaces. Different from the classic matrix factorization technique, SCMFDD takes the biological context of the problem into account. In computational experiments, the proposed method can produce high-accuracy performances on benchmark datasets, and outperform existing state-of-the-art prediction methods when evaluated by five-fold cross validation and independent testing.


We developed a user-friendly web server by using known associations collected from the CTD database, available at The case studies show that the server can find out novel associations, which are not included in the CTD database.


A drug is a chemical that treats, cures, prevents, or diagnoses diseases. The drug design has three stages: discovery stage, preclinical stage and clinical development stage [1], and the development of a new drug take 15 years [2] and cost 800 million dollars [3].

The drug-disease associations refer to the events that drugs exert effects on diseases, which can be classified into two types: drug indications and drug side-effects. Some drugs could have a therapeutic role in a disease, e.g. a drug treats leukemia & lymphoma; other drugs could play a role in the etiology of a disease, e.g. exposure to a drug causes lung cancer [4]. Drug-disease associations reveal the close relations between drugs and diseases, and have gained great attention. Computational methods can screen possible drug-disease associations, and complement or guide laborious and costly wet experiments.

In recent years, a great number of computational methods have been proposed to predict drug-disease associations. As shown in Fig. 1, existing methods are roughly classified as two types. One type of methods makes use of biological elements shared by drugs and diseases to predict drug-disease associations. Eichborn J et al. [5] studied drug-disease relations based on drug side effects. Wang et al. [6] and Wiegers et al. [7] considered drug-gene-disease relations. Yu et al. [8] used common protein complexes related to drugs and diseases. These methods have to use elements shared by drugs and diseases, but many drugs and diseases do not share any elements, and these methods fail to work in this case. The other type of methods predicts novel drug-disease associations by using known drug-disease associations, drug features and disease features. Gottlieb et al. [9] constructed a universal predictor named PREDICT for drug repositioning to express drug-disease associations in a large-scale manner that integrated molecular structure, molecular activity and disease semantic data. Yang et al. [10] built Naive Bayes models to predict indications for diseases based on their side effects. Wang et al. [11] proposed the method “PreDR” that trained a support vector machine (SVM) model based on drug structures, drug target proteins, and drug side effects. Huang et al. [12] combined three different networks of drugs, genomic and disease phenotypes to build a heterogeneous network to predict drug-disease associations. Oh et al. [13] proposed scoring methods to obtain quantified scores as features between drugs and diseases, and built classifiers based on the extracted features to predict novel drug-disease associations. Wang et al. [14] proposed a three-layer heterogeneous network model (TL-HGBI), and applied the approach on drug repositioning by using existing omics data of diseases, drugs and drug targets. Martínez et al. [15] built a network of interconnected drugs, proteins and diseases to identify their relations. Wang et al. [16] adopted recommendation systems to predict drug-disease relations. Moghadam et al. [17] combined drug features and disease features by using kernel fusion, and then built SVM-based prediction model. Liang et al. [18] proposed a Laplacian regularized sparse subspace learning method (LRSSL), which integrated drug chemical information, drug target domain information and target annotation information.

Fig. 1

Two types of drug-disease association prediction methods. a Infer drug-disease associations without known associations; b Infer unobserved drug-disease associations based on known associations

A great number of drug-disease associations have been identified and stored in databases. However, many associations remain unobserved and need to be discovered. In this paper, we proposed a similarity constrained matrix factorization method for the drug-disease association prediction (SCMFDD), which makes use of known drug-disease associations, drug features and disease semantic information. SCMFDD projects the drug-disease association relationship into two low-rank spaces, which uncover latent features for drugs and diseases, and then introduces drug feature-based similarity and disease semantic similarity as constraints for drugs and diseases in low-rank spaces. Different from the classic matrix factorization technique, SCMFDD can take the biological context of the problem into account. Computational experiments show that SCMFDD can produce high-accuracy performances on benchmark datasets and outperform existing state-of-the-art prediction methods, i.e. PREDICT, TL-HGBI and LRSSL when evaluated by five-fold cross validation and independent testing on the same datasets. Moreover, a web server is constructed on known associations collected from the CTD database [4], and case studies show that the web server can help to find out novel associations.

The main contributions of this paper include: 1) we proposed a novel matrix factorization approach (SCMFDD), which is different from the traditional matrix factorization methods. SCMFDD incorporates drug features and disease semantic information into the matrix factorization frame; 2) an efficient optimization algorithm is developed to obtain the solution of SCMFDD; 3) we developed a user-friendly web server to facilitate the drug-disease association prediction, available at



CTD database [4] is a publicly available database that intends to advance understanding about how environmental exposures affect human health. CTD database provides curated and inferred chemical-disease associations. The curated associations are real associations extracted from literature. Several databases describe features for drugs and diseases. PubChem Compound database [19] provides drug substructures. DrugBank database [20] is a comprehensive resource for drug targets, drug enzymes and drug-drug interactions. KEGG DRUG database [21] provides pathway information for approved drugs in Japan, USA and Europe. U.S. National Library of Medicine stores disease MeSH descriptors, which reflect the hierarchy of diseases.

We downloaded real drug-disease associations from CTD database, and collected features for drugs and diseases to compile our datasets. In order to avoid sparsity of drug-disease associations, we selected drugs that are associated with more than 10 diseases, and also selected diseases that are associated with more than 10 drugs. Moreover, we collected drug features: substructures, targets, enzymes, pathways and drug-drug interactions as well as disease MeSH descriptors. Thus, we compiled a dataset named “SCMFDD-S”, which contains 18,416 associations between 269 drugs and 598 diseases. Further, we selected drugs associated with at least one disease as well as diseases associated with at least one drug, and collected drug substructures and disease MeSH descriptors. Thus, we compiled a larger dataset named “SCMFDD-L”, which contains 49,217 associations between 1323 drugs and 2834 diseases. Table 1 summarizes the datasets “SCMFDD-S” and “SCMFDD-L”.

Table 1 The summary of SCMFDD-S dataset and SCMFDD-L dataset

Several benchmark datasets were used in the drug-disease association prediction. Gottlieb et al. [9] compiled a dataset with 1933 associations between 593 drugs in DrugBank and 313 diseases in OMIM, and used it for the method “PREDICT”. This dataset contains five types of drug-drug similarities and two types of disease-disease similarities. Three drug-drug similarities are calculated based on drug-related genes, by using Smith-Waterman sequence alignment score [22], all-pairs shortest paths algorithm [23] and semantic similarity scores [24] respectively; other two drug-drug similarities are drug structure-based Tanimoto similarity and drug side effect-based Jaccard similarity. Two disease-disease similarity measures are semantic similarity and genetic similarity. Wang et al. [14] compiled a dataset with 1461 interactions between 1409 drugs in DrugBank database and 5080 diseases in OMIM database, and used it for the method “TL-HGBI”. The dataset also contains the drug-drug structure similarity and disease semantic similarity. Liang et al. [18] obtained 3051 associations between 763 drugs and 681 diseases from the study [25], and collected drug substructures, protein domains of target proteins, gene ontology terms of target proteins to calculate three types of drug-drug similarities as well as the disease-disease semantic similarity. The dataset was used for the method “LRSSL”. We name these datasets as “PREDICT dataset”, “TL-HGBI dataset” and “LRSSL datasets”.

Therefore, we adopt SCMFDD-S dataset, SCMFDD-L dataset, PREDICT dataset, TL-HGBI dataset and LRSSL datasets as benchmark datasets.

Similarity constrained matrix factorization method

The aim of this study is to predict unobserved drug-disease associations by using drug features, disease semantic information and known associations. Figure 2 illustrates the basic idea of the similarity constrained matrix factorization method for the drug-disease association prediction (SCMFDD).

Fig. 2

The basic idea of similarity constrained matrix factorization

Drug-drug similarities

Actually, a feature is a set of descriptors. A drug has a subset of descriptors, and thus is represented as a bit vector, whose dimensions indicate the presence or absence of corresponding descriptors with the value 1 or 0. Let P and Q denote feature vectors of two drugs, we can calculate the Jaccard similarity between two drugs by using,

$$ \mathrm{J}\left(\mathrm{P},\mathrm{Q}\right)=\frac{\mid \mathrm{P}\cap \mathrm{Q}\mid }{\mid \mathrm{P}\cup \mathrm{Q}\mid } $$

where P ∩ Q is the number of bits where P and Q both have the value 1, and P  Q is the number of bits where either P and Q has the value 1.

When we have different features of a drug, i.e. substructures, targets, enzymes, pathways and drug-drug interactions, we can represent them as feature vectors in different feature spaces, and calculate different types of drug-drug similarities.

Disease-disease semantic similarity

MeSH is the National Library of Medicine’s controlled vocabulary thesaurus, and MeSH provides hierarchical descriptors for diseases. As described in [26,27,28], we can calculate disease-disease semantic similarity by using MeSH information.

For each disease, a directed acyclic graph (DAG) is constructed based on hierarchical descriptors, in which nodes represent disease MeSH descriptors (or disease terms) and the edges represent the relationship between the current node and its ancestors. For the disease A, the DAG is denoted as DAG(A) = (N(A), E(E)), where N(A) is the set of all ancestors of A (including itself) and E(A) is the set of their corresponding links.

We define the contribution of a node d d in DAG(A) to the semantic value of disease A:

$$ {C}_A(d)=\left\{\begin{array}{c}1\kern16.75em if\ d=A\\ {}\mathit{\max}\left\{\Delta \ast {C}_A\left({d}^{\prime}\right)|{d}^{\prime}\in children\ of\ \left.d\right\}\kern0.5em if\ d\ne A\right.\end{array}\right. $$

where is the semantic contribution factor, and we set  = 0.5 in the study.

The semantic value of disease A is defined as,

$$ DV(A)=\sum \limits_{d\in N(A)}{C}_A(d) $$

The semantic similarity between two diseases A and B is calculated by,

$$ {S}_{A,B}=\frac{\sum_{d\in N\left(\mathrm{A}\right)\cap N(B)}\left({C}_A(d)+{C}_B(d)\right)}{DV(A)+ DV(B)} $$

Objective Function

The observed drug-disease associations can be formulated as a bipartite network, and represented by a binary matrix ARn × m, where n is the number of drugs and m is the number of diseases. a ij is the (i, j)th entry of A. If the vertex (drug) d i and the vertex (disease) dis j are connected, a ij  = 1; otherwise a ij  = 0. The bipartite network and the association matrix are demonstrated in Fig. 3.

Fig. 3

The bipartite network and the association network

SCMFDD factorizes the drug-disease association matrix A into two low-rank feature matrices XRn × k and YRm × k, where k is the dimension of drug feature and disease feature in the low-rank spaces. The drug-disease association can be approximated by inner product between the drug feature vector and the disease feature vector: \( {a}_{ij}\approx {\boldsymbol{x}}_i{\boldsymbol{y}}_j^T \), where x i is the ith row of X, and y j is the jth row of Y.The objective function is defined as:

$$ \mathit{\min}\frac{1}{2}\sum \limits_{ij}{\left({a}_{ij}-{\boldsymbol{x}}_i{\boldsymbol{y}}_j^T\right)}^2 $$

Then, to avoid overfitting problem, L2 regularization terms of x i and y j are added to the objective function (1),

$$ \mathit{\min}\frac{1}{2}\sum \limits_{ij}{\left({a}_{ij}-{\boldsymbol{x}}_i{\boldsymbol{y}}_j^T\right)}^2+\frac{\mu }{2}\sum \limits_i{\left\Vert {\boldsymbol{x}}_i\right\Vert}^2+\frac{\mu }{2}\sum \limits_j{\left\Vert {\boldsymbol{y}}_j\right\Vert}^2 $$

where μ is the regularization parameter for x i  and y j .

Recent studies on manifold learning theory [29, 30], spectral graph theory [31, 32] and their applications [33,34,35,36,37,38] show that the geometric and topological structure of data points may be maintained when they are mapped from high dimensional space into low dimensional space. Considering that the similarity matrix wd and ws not only can be defined to represent statistical correlation but also can be regarded as geometric properties of the data points, we introduce the similarity constraint terms R X and R Y :

$$ {R}_X=\frac{1}{2}\sum \limits_{ij}{\left\Vert {\boldsymbol{x}}_i-{\boldsymbol{x}}_j\right\Vert}^2{w}_{ij}^d $$
$$ {R}_Y=\frac{1}{2}\sum \limits_{ij}{\left\Vert {\boldsymbol{y}}_i-{\boldsymbol{y}}_j\right\Vert}^2{w}_{ij}^s $$

where \( {w}_{ij}^d \) denotes the similarity between the drug d i and the drug d j , which is calculated in the drug feature space; \( {w}_{ij}^s \) denotes the similarity between the disease dis i and the disease dis j , which is calculated in the disease feature space. It is generally believed that the similarity between two data points is higher if the distance of them is smaller. Therefore, R X (or R Y ) incurs a heavy penalty if drug d i and the drug d j (disease dis i and the disease dis j ) are close in the drug feature space (or disease feature space) and thus minimizing it further incurs that drug d i and the drug d j (or disease dis i and the disease dis j ) are mapped closely in low-rank spaces. Hence, we could maintain effectively the topological structure of drug data points and disease data points by minimizing R X and R Y .

By combining R X and R Y with the original objective function (2), we propose the objective function of SCMFDD,

$$ \underset{X,Y}{\min }L=\frac{1}{2}\sum \limits_{ij}{\left({a}_{ij}-{\boldsymbol{x}}_i{\boldsymbol{y}}_j^T\right)}^2+\frac{\mu }{2}\sum \limits_i{\left\Vert {\boldsymbol{x}}_i\right\Vert}^2+\frac{\mu }{2}\sum \limits_j{\left\Vert {\boldsymbol{y}}_j\right\Vert}^2+\frac{\uplambda}{2}\sum \limits_{ij}{\left\Vert {\boldsymbol{x}}_i-{\boldsymbol{x}}_j\right\Vert}^2{w}_{ij}^d+\frac{\uplambda}{2}\sum \limits_{ij}{\left\Vert {\boldsymbol{y}}_i-{\boldsymbol{y}}_j\right\Vert}^2{w}_{ij}^s $$

where λ is the hyper parameter controlling the smoothness of the similarity consistency.

Optimization algorithm

Here, we develop an efficient optimization algorithm to solve the objective function in (5). First, we calculate the partial derivatives of L with respect to x i  and y j ,

$$ {\nabla}_{{\boldsymbol{x}}_i}L=\sum \limits_j\left({\boldsymbol{x}}_i{\boldsymbol{y}}_j^T-{a}_{ij}\right){y}_j+{\mu \boldsymbol{x}}_i+\uplambda \left(\sum \limits_j\left({\boldsymbol{x}}_i-{\boldsymbol{x}}_j\right){w}_{ij}^d-\sum \limits_j\left({\boldsymbol{x}}_j-{\boldsymbol{x}}_i\right){w}_{ji}^d\right)={\boldsymbol{x}}_i\left({Y}^TY+\mu I+\uplambda \left(\sum \limits_j{w}_{ij}^d+\sum \limits_j{w}_{ji}^d\right)I\right)-A\left(i,:\right)Y-\uplambda \sum \limits_j\left({w}_{ij}^d+{w}_{ji}^d\right){\boldsymbol{x}}_j $$
$$ {\nabla}_{{\boldsymbol{y}}_j}L=\sum \limits_i\left({\boldsymbol{y}}_j{\boldsymbol{x}}_i^T-{a}_{ij}\right){x}_i+\mu {\boldsymbol{y}}_j+\uplambda \left(\sum \limits_i\left({\boldsymbol{y}}_j-{\boldsymbol{y}}_i\right){w}_{ji}^s-\sum \limits_i\left({\boldsymbol{y}}_i-{\boldsymbol{y}}_j\right){w}_{ij}^s\right)={\boldsymbol{y}}_j\left({X}^TX+\mu I+\uplambda \left(\sum \limits_i{w}_{ij}^s+\sum \limits_i{w}_{ji}^s\right)I\right)-A{\left(:,j\right)}^TX-\uplambda \sum \limits_i\left({w}_{ij}^s+{w}_{ji}^s\right){\boldsymbol{y}}_i $$

A(i, :) represents the ith row of A and A(:, j) represents the jth column of A.

Then, we can calculate the second derivatives of L with respect to x i  and y j :

$$ {\nabla}_{{\boldsymbol{x}}_i}^2L={Y}^TY+\mu I+\uplambda \left(\sum \limits_j{w}_{ij}^d+\sum \limits_j{w}_{ji}^d\right)I $$
$$ {\nabla}_{{\boldsymbol{y}}_j}^2L={X}^TX+\mu I+\uplambda \left(\sum \limits_i{w}_{ij}^s+\sum \limits_i{w}_{ji}^s\right)I $$

Utilizing Newton’s method, we have:

$$ {\boldsymbol{x}}_i\leftarrow {\boldsymbol{x}}_i-{\nabla}_{{\boldsymbol{x}}_i}L{\left({\nabla}_{{\boldsymbol{x}}_i}^2L\right)}^{-1} $$
$$ {\boldsymbol{y}}_j\leftarrow {\boldsymbol{y}}_j-{\nabla}_{{\boldsymbol{y}}_j}L{\left({\nabla}_{{\boldsymbol{y}}_j}^2L\right)}^{-1} $$

Thus, we can obtain the updating rules:

$$ {\boldsymbol{x}}_i=\left(A\left(i,:\right)Y+\uplambda \sum \limits_j\left({w}_{ij}^d+{w}_{ji}^d\right){\boldsymbol{x}}_j\right){\left({Y}^TY+\mu I+\uplambda \left(\sum \limits_j{w}_{ij}^d+\sum \limits_j{w}_{ji}^d\right)I\right)}^{-1} $$
$$ {\boldsymbol{y}}_j=\left(A{\left(:,j\right)}^TX+\uplambda \sum \limits_i\left({w}_{ij}^s+{w}_{ji}^s\right){\boldsymbol{y}}_i\right){\left({X}^TX+\mu I+\uplambda \left(\sum \limits_i{w}_{ij}^s+\sum \limits_i{w}_{ji}^s\right)I\right)}^{-1} $$

We alternatively update x i and y j with Eq. (12) and Eq. (13) until convergence. The prediction matrix is given by

$$ {A}_{predict}=X{Y}^T $$

The score of (A predict ) ij represents the probability that the drug d i and the disease dis j has the association. The optimization algorithm is summarized in Algorithm 1.

Algorithm 1 Algorithm to solve objective function (5)
Input: known drug-disease association matrix, ARn × m;
drug similarity matrix, WdRn × n;
disease similarity matrix, WsRm × m;
dimension of the low-rank feature space, k < min(m, n);
regularization parameter, μ > 0, λ > 0;
Output: the prediction matrix A predict
1 Initialize XRn × kYRm × k as two random matrices;
2 Repeat
3 Update X:
4 for each i(1 ≤ i ≤ n) do
5 update x i by Eq. (12);
6 end
7 Update Y:
8 for each j(1 ≤ j ≤ m) do
9 update y j by Eq. (13);
10 end
11 Until Converges;
12 Calculate the prediction matrix A predict by Eq. (14);
13 Output A predict ;

Results and discussion

Evaluation metrics

In our experiments, we adopted five-fold cross validation (5-CV) to test performances of prediction models. To implement five-fold cross validation, we randomly split all known drug-disease associations into five equal-sized subsets. In each fold, we combined four subsets as the training set, and used the other subset as the testing set. We constructed the prediction model based on known associations in the training set, and predicted associations in the testing set. Training and testing were repeated five times, and the average of performances was adopted.

AUC and AUPR are popular metrics for evaluating prediction models. Since drug-disease pairs without associations are much more than known drug-disease associations, we adopted AUPR as the primary metric, which takes into recall and precision. We also considered several binary classification metrics, i.e. sensitivity (SN, also known as recall), specificity (SP), accuracy (ACC) and F-measure (F).

Performances of SCMFDD

First of all, we discussed the influence of parameters on SCMFDD models by using SCMFDD-S dataset. SCMFDD has three parameters, i.e. the number of latent variables k, the regularization parameter μ and the regularization parameter λ. k is the dimension of drugs and diseases in low-rank spaces, and k is less than row number and column number of the association matrix, and k < k0 = min(m, n). For simplicity, we set k as the percentage of k0.

SCMFDD builds prediction model constrained by drug-drug similarity and disease-disease semantic similarity. We have several drug features in SCMFDD-S dataset, and can calculate several types of drug-drug similarities. Here, we used the drug interaction-based similarity and the disease semantic similarity to build SCMFDD models for analysis. We considered all combinations of following values λ  {2−3, 2−2, 2−1, 20, 21, 22, 23}, μ {2−3, 2−2, 2−1, 20, 21, 22, 23} and k {5%, 10%, 15 % …, 50%} to build SCMFDD models, and implemented five-fold cross validation to evaluate models. The experiments for all parameter combinations cost about 12 h on a PC with Intel i7 7700 K CPU and 16GB RAM.

In computational experiments, SCMFDD produced the best AUPR score when k = 45 % ,  μ = 20 and λ = 22. Then, we fixed the latent variable number k = 45%, and evaluated the influence of parameters μ and λ, and results are shown in Fig. 4a. Clearly, μ and λ have great impact on the model. When μ is a small value, greater λ could lead to better performances; when μ is a great value, greater λ contributes to poorer performances. Further, we fixed the parameters μ = 20 and λ = 22, and tested the influence of the latent variable number k. The latent variable numbers and AUPR scores of corresponding models are shown in Fig. 4b. Clearly, performances of SCMFDD will increase as k increases, and remain unchanged after reaching a threshold.

Fig. 4

The influence of parameters on SCMFDD models. a the influnce of μ and λ b the influence of k

Further, we tested the impact of different similarity constraints on SCMFDD models. We have various features of drugs, and can calculate different types of drug-drug similarities, i.e. substructure similarity, target similarity, pathway similarity, enzyme similarity and drug interaction similarity. These similarities can be used as the constraint terms for SCMFDD models. We set k = 45%, μ = 20 and λ = 22 in the experiments. As shown in Table 2, SCMFDD models using different drug-drug similarities produce high-accuracy and robust performances. Since drug structures directly influence functions and drug interactions may induce drug effects, drug substructures and drug interactions lead to better results than other features.

Table 2 The performances of SCMFDD models based on different drug features

The known drug-disease association is an important resource for predicting unobserved drug-disease associations. The data richness, which is the ratio of association number vs drug-disease pair number, may influence performances of SCMFDD. Here, we used the dataset SCMFDD-L for analysis. We removed drugs that are associated with less than m diseases, and removed diseases that associated with less than m drugs from SCMFDD-L dataset, m {2,  3, 4, 5, 6…10}. As displayed in Fig. 5, the data richness will increase as the threshold m increases, and then improve performances of SCMFDD models. Although the data richness influences the performances, SCMFDD could still produce robust performances.

Fig. 5

The influence of association exclusion criteria on data richness (a) and model performance (b)

Comparison with state-of-the-art prediction methods

In this section, we compared our method with three state-of-the-art drug-disease association prediction methods: PREDICT [9], TL-HGBI [14] and LRSSL [18]. PREDICT constructed a universal predictor for drug repositioning to express drug-disease associations in a large-scale manner that integrates molecular structure, molecular activity and semantic data. TL-HGBI was a computational framework based on a three-layer heterogeneous network model, which made use of Omics data about diseases, drugs and drug targets to make predictions. LRSSL was a Laplacian regularized sparse subspace learning method, which integrated drug chemical information, drug target domains and target annotation information to make predictions. We obtained datasets of PREDICT [9], datasets and source codes of TL-HGBI [14] from authors. The datasets and source codes of LRSSL [18] are publicly available. Therefore, we can adopt these methods as benchmark methods for fair comparison.

First, we compared our method with PREDICT based on the PREDICT dataset by using five-fold cross validation. SCMFDD uses one drug similarity constraint and one disease similarity constraint. The PREDICT dataset contains five kinds of drug-drug similarities and two kinds of diseases-disease similarity. Thus, we built 10 different SCMFDD models by combining drug-drug similarities and diseases-disease similarities. As shown in Table 3, SCMFDD models and PREDICT produce similar AUC scores, but SCMFDD models yield much greater AUPR scores than PREDICT. Moreover, SCMFDD models were robust to different similarities, and the models based on the drug Genes-Waterman similarity and disease Gene Signature similarity produced the best results.

Table 3 Performance of PREDICT and SCMFDD on PREDICT Dataset

Then, we compared our method with TL-HGBI by using TL-HGBI dataset. TL-HGBI dataset contains one drug chemical structure similarity and one disease phenotypic similarity. We constructed the SCMFDD model by using drug structure similarity and disease phenotypic similarity. As shown in Table 4, SCMFDD produced similar AUC score but much greater AUPR score compared with TL-HGBI.

Further, we compared SCMFDD and LRSSL by using LRSSL dataset. Since LRSSL dataset contains three features of drugs: chemical substructures, protein domains of target proteins, gene ontology information of target proteins. Three drug similarities were calculated, and disease semantic similarity was provided as well. Therefore, we can construct three SCMFDD models by combing three drug similarities and the disease semantic similarity. Table 5 shows the performances of prediction models evaluated by five-fold cross validation. Clearly, three SCMFDD models can produce better performance than LRSSL.

Table 4 Performance of TL-HGBI and SCMFDD on TL-HGBI Dataset
Table 5 Performance of LRSSL and SCMFDD on Liang Dataset

Independent experiments

In this section, we conducted independent experiments to test performances of our method in predicting novel drug-disease associations.

CTD database is an up-to-date resource about the experimentally determined drug-disease associations. Since PREDICT dataset and LRSSL dataset were compiled several years ago, we can build prediction models by using PREDICT dataset and LRSSL dataset, and check up the predictions in the CTD database. Different drugs and diseases could be matched according to their names and synonyms (provided by CTD database “Chemical vocabulary” and “Disease vocabulary”). PREDICT dataset and LRSSL dataset include different types of drug-drug similarities, and we build different similarity-based SCMFDD models for the comprehensive comparison. The PREDICT model and the LRSSL model respectively predict novel interaction by using PREDICT dataset and LRSSL dataset.

We considered the top predictions from top 2 to top 1000 in a step size of 2, and respectively counted how many predicted associations can be confirmed in CTD database. Figure 6 shows the number of checked predictions and the number of confirmed associations. Clearly, our method finds out more novel associations than benchmark methods, and has the good performances in the independent experiments.

Fig. 6

The number of confirmed associations in top predictions of PREDICT, LRSSL, SCMFDD. (a) For drugs, Che: Chemical Similarity, SE: Chemical Similarity, GP: Genes-Perlman Similarity, GO: Genes- Ovaska Similarity, GW: Genes-Waterman Similarity. For diseases, GS: Gene Signature Similarity, Phen: Phenotypic Similarity (b) For drugs, Che: Chemical Similarity, Dom: Protein Domains Similarity, Go: Gene ontology Similarity. For diseases, Sem: Semantic Similarity

Web server and applications

To facilitate the drug-disease association prediction, we developed a web server named “SCMFDD” by using the dataset SCMFDD-L, available at Users can predict novel drug-disease associations for a given drug or a given disease, and then visualize predictions. Here, we used two case studies to illustrate the usefulness for the drug-disease association prediction of our web server.

Clozapine is an effective drug to treat patients with refractory schizophrenia [39, 40]. Clozapine works by changing the actions of chemicals in the brain. Here, the web server predicts diseases that are associated with Clozapine. Table 6 lists top 10 predictions among all unknown relationships between Clozapine and diseases in the SCMFDD-L dataset. Then, we analyze these predicted diseases case by case. From (access on 2018–2-1), three diseases: sleep initiation and maintenance disorders (also insomnia), status epilepticus and headache have been reported as side effects of Clozapine, indicating that they have associations with the drug “Clozapine”. Further, the study [41] found that Clozapine improved the syndrome of inappropriate antidiuretic hormone secretion(SIADH) in a patient; the studies [42, 43] revealed that Clozapine can be used for the treatment of post-traumatic stress disorder (PTSD); the study [44] demonstrated that Clozapine can be used for the treatment of Parkinson’s disease; the study [45] indicated that Clozapine can affect the visual memory.

Table 6 Top 10 predicted diseases associated with Clozapine

Alzheimer’s disease (AD) is a chronic neurodegenerative disorder that leads to disturbances of cognitive functions. The radical cause and effective treatment of AD remain unclear, and AD has attracted many scientists to study its pathogenic mechanism and therapeutic function. Table 7 lists top 10 predicted drugs associated with Alzheimer’s disease, and evidence is available for six drugs. For example, the study [46] revealed that Olanzapine appears to be effective in treating psychotic and behavioral disturbances associated with AD; the study [47] found that stimulation of the dopaminergic system could improve cognitive function in a murine model and suggested that Levodopa that works in the dopaminergic system could ameliorate typical symptoms of AD: learning and memory deficits. The study [48] revealed that the presence of Malondialdehyde level is a risk factor for AD. The study [49] confirmed that progesterone significantly could reduce and inhibit tau hyperphosphorylation, a chemical process implicated in AD. The study [50] demonstrated that Valproic Acid (VPA) could decrease β-amyloid(Aβ) production which is the key risk factor in AD and improve memory deficits of AD model mice. The study [51] showed that Ethanol protect neurons against Aβ-induced synapse damage and explained epidemiological reports that moderate alcohol consumption protects against the development of AD.

Table 7 Top 10 predicted drugs associated with Alzheimer’s disease

The server can visualize the predictions. Figure 7 shows the top 100 predictions for Clozapine and top 200 predictions for Alzheimer’s disease. As shown in Fig. 7a, “dark blue circle” stands for a disease, which has a known association with Clozapine, and “red square” stands for predicted diseases, which have an association with Clozapine. As shown in Fig. 7b, “dark blue circle” stands for a drug, which has a known association with Alzheimer’s disease, and “red square” stands for predicted drugs, which have an association with Alzheimer’s disease. Users can adjust the number of predictions for visualization.

Fig. 7

Web Visualization of predictions for Clozapine a and predictions for Headache b


In this paper, we proposed a computational method “SCMFDD” to predict unobserved drug-disease associations. SCMFDD incorporate drug feature-based similarities and disease semantic similarity into the matrix factorization frame. Experimental results show that SCMFDD can produce high-accuracy performances on five benchmark datasets when evaluated by five-fold cross validation, and SCMFDD outperforms state-of-the-art methods under fair comparison. Moreover, SCMFDD produces satisfying performances for different similarity constraints, and is also robust to the data richness. We constructed a web server based on drug-disease associations, which are collected from the CTD database. The server can predict novel drug-disease associations, and also can help researchers to quickly find associations for interested drugs or diseases.

In recent years, the deep learning methods have been applied to similar tasks [52,53,54]. However, designing an effective neural network is a hard task, and the training process also costs a great amount of time. Compared to deep learning-based methods, SCMFDD is easy to implement, and SCMFDD can be applied into similar tasks in bioinformatics.

However, SCMFDD still has several limitations. First, SCMFDD has three parameters, and there is no good way of determining suitable parameters except going through all combinations. For our datasets, it costs dozens of hours to determine optimal parameters. Second, SCMFDD only uses individual drug feature-based similarity to build prediction models. When we have multiple drug features, we can calculate different drug feature-based similarities. Combining diverse information can usually lead to improved performances [55,56,57,58,59,60], and how to integrate multiple similarities in a model is our future work. Third, the server can make predictions for the drugs and diseases in our dataset, but can’t support other drugs or diseases.



Area under ROC curve


Area under the precision-recall curve


Similarity constrained matrix factorization method


  1. 1.

    Wilson JF. Alterations in processes and priorities needed for new drug development. Ann Intern Med. 2006;145(10):793–6.

  2. 2.

    Dimasi JA. New drug development in the United States from 1963 to 1999. Clin Pharmacol Ther. 2001;69(5):286–96.

  3. 3.

    Adams CP, Brantner VV. Estimating the cost of new drug development: is it really 802 million dollars? Health Aff. 2006;25(2):420–8.

  4. 4.

    Davis AP, Grondin CJ, Johnson RJ, Sciaky D, King BL, McMorran R, Wiegers J, Wiegers TC, Mattingly CJ. The comparative Toxicogenomics database: update 2017. Nucleic Acids Res. 2017;45(D1):D972–8.

  5. 5.

    von Eichborn J, Murgueitio MS, Dunkel M, Koerner S, Bourne PE, Preissner R. PROMISCUOUS: a database for network-based drug-repositioning. Nucleic Acids Res. 2011;39(Database):D1060–6.

  6. 6.

    Wang L, Wang Y, Hu Q, Li S. Systematic analysis of new drug indications by drug-gene-disease coherent subnetworks. CPT: pharmacometrics & systems pharmacology. 2014;3:e146.

  7. 7.

    Wiegers TC, Davis AP, Cohen KB, Hirschman L, Mattingly CJ. Text mining and manual curation of chemical-gene-disease networks for the comparative toxicogenomics database (CTD). BMC Bioinformatics. 2009;10:326.

  8. 8.

    Yu L, Huang J, Ma Z, Zhang J, Zou Y, Gao L. Inferring drug-disease associations based on known protein complexes. BMC Med Genet. 2015;8(Suppl 2, S2)

  9. 9.

    Gottlieb A, Stein GY, Ruppin E, Sharan R. PREDICT: a method for inferring novel drug indications with application to personalized medicine. Mol Syst Biol. 2011;7:496.

  10. 10.

    Yang L, Agarwal P. Systematic drug repositioning based on clinical side-effects. PLoS One. 2011;6(12):e28025.

  11. 11.

    Wang Y, Chen S, Deng N, Wang Y. Drug repositioning by kernel-based integration of molecular structure, molecular activity, and phenotype data. PLoS One. 2013;8(11):e78518.

  12. 12.

    Huang YF, Yeh HY, Soo VW. Inferring drug-disease associations from integration of chemical, genomic and phenotype data using network propagation. BMC Med Genet. 2013;6(Suppl 3):S4.

  13. 13.

    Oh M, Ahn J, Yoon Y. A network-based classification model for deriving novel drug-disease associations and assessing their molecular actions. PLoS One. 2014;9(10):e111668.

  14. 14.

    Wang W, Yang S, Zhang X, Li J. Drug repositioning by integrating target information through a heterogeneous network model. Bioinformatics. 2014;30(20):2923–30.

  15. 15.

    Martinez V, Navarro C, Cano C, Fajardo W, Blanco A. DrugNet: network-based drug-disease prioritization by integrating heterogeneous data. Artif Intell Med. 2015;63(1):41–9.

  16. 16.

    Wang H, Gu Q, Wei J, Cao Z, Liu Q. Mining drug-disease relationships as a complement to medical genetics-based drug repositioning: where a recommendation system meets genome-wide association studies. Clin Pharmacol Ther. 2015;97(5):451–4.

  17. 17.

    Moghadam H, Rahgozar M, Gharaghani S. Scoring multiple features to predict drug disease associations using information fusion and aggregation. SAR QSAR Environ Res. 2016;27(8):609–28.

  18. 18.

    Liang X, Zhang P, Yan L, Fu Y, Peng F, Qu L, Shao M, Chen Y, Chen Z. LRSSL: predict and interpret drug–disease associations based on data integration using sparse subspace learning. Bioinformatics. 2017;33(8):1187–96.

  19. 19.

    Li Q, Cheng T, Wang Y, Bryant SH. PubChem as a public resource for drug discovery. Drug Discov Today. 2010;15(23–24):1052–7.

  20. 20.

    Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, Maciejewski A, Arndt D, Wilson M, Neveu V, et al. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 2014;42(Database issue):D1091–7.

  21. 21.

    Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010;38(Database issue):D355–60.

  22. 22.

    Smith TF, Waterman MS, Burks C. The statistical distribution of nucleic acid similarities. Nucleic Acids Res. 1985;13(2):645–56.

  23. 23.

    Perlman L, Gottlieb A, Atias N, Ruppin E, Sharan R. Combining drug and gene similarity measures for drug-target elucidation. J Comput Biol. 2011;18(2):133–45.

  24. 24.

    Ovaska K, Laakso M, Hautaniemi S. Fast gene ontology based clustering for microarray experiments. BioData Min. 2008;1(1):11.

  25. 25.

    Wang F, Zhang P, Cao N, Hu J, Sorrentino R. Exploring the associations between drug side-effects and therapeutic indications. J Biomed Inform. 2014;51:15–23.

  26. 26.

    Xuan P, Han K, Guo M, Guo Y, Li J, Ding J, Liu Y, Dai Q, Li J, Teng Z, et al. Prediction of microRNAs associated with human diseases based on weighted k most similar neighbors. PLoS One. 2013;8(8):e70204.

  27. 27.

    Chen X, Yan CC, Luo C, Ji W, Zhang Y, Dai Q. Constructing lncRNA functional similarity network based on lncRNA-disease associations and disease semantic similarity. Sci Rep. 2015;5:11338.

  28. 28.

    Wang D, Wang J, Lu M, Song F, Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–50.

  29. 29.

    Ma Y, Fu Y. Manifold learning theory and applications. Boca Raton: CRC; Taylor & Francis distributor; 2012.

  30. 30.

    Zhang W, Liu X, Chen Y, Wu W, Wang W, Li X. Feature-derived graph regularized matrix factorization for predicting drug side effects. Neurocomputing. 2018;287:154–62.

  31. 31.

    Rana B, Juneja A, Saxena M, Gudwani S, Kumaran SS, Behari M, Agrawal RK. Graph-theory-based spectral feature selection for computer aided diagnosis of Parkinson's disease using T1-weighted MRI International Journal of Imaging Systems and Technology Volume 25, Issue 3. Int J Imaging Syst Technol. 2015;25(3):245–55.

  32. 32.

    Chung FRK: Spectral graph theory. Providence, R.I.: published for the conference board of the mathematical sciences by the American Mathematical Society; 1997.

  33. 33.

    Zhang W, Chen Y, Li D. Drug-target interaction prediction through label propagation with linear neighborhood information. Molecules. 2017;22(12):2056.

  34. 34.

    Zhang W, Qu Q, Zhang Y, Wang W. The linear neighborhood propagation method for predicting long non-coding RNA–protein interactions. Neurocomputing. 2018;273:526–34.

  35. 35.

    Zhang W, Yue X, Chen Y, Lin W, Li B, Liu F, Li X. Predicting drug-disease associations based on the known association bipartite network. IEEE Int Conf Bioinformatics Biomed. 2017:503–9.

  36. 36.

    Zhang W, Chen Y, Tu S, Liu F, Qu Q. Drug side effect prediction through linear neighborhoods and multiple data source integration. IEEE Int C Bioinform. 2016:427–34.

  37. 37.

    Ruan CY, Wang Y, Zhang YC, Ma JG, Chen HJ, Aickelin U, Zhu SF, Zhang T. THCluster:herb supplements categorization for precision traditional Chinese medicine. IEEE Int Conf Bioinformatics And Biomedicine. 2017;2017:417–24.

  38. 38.

    Zhang W, Yue X, Liu F, Chen YL, Tu SK, Zhang XN. A unified frame of predicting side effects of drugs by using linear neighborhood similarity. BMC Syst Biol. 2017;11

  39. 39.

    Alvir JM, Lieberman JA, Safferman AZ, Schwimmer JL, Schaaf JA. Clozapine-induced agranulocytosis. Incidence and risk factors in the United States. N Engl J Med. 1993;329(3):162–7.

  40. 40.

    Lieberman JA, Alvir JM. A report of clozapine-induced agranulocytosis in the United States. Incidence and risk factors. Drug Saf. 1992;7(Suppl 1):1–2.

  41. 41.

    Fujimoto M, Hashimoto R, Yamamori H, Yasuda Y, Ohi K, Iwatani H, Isaka Y, Takeda M. Clozapine improved the syndrome of inappropriate antidiuretic hormone secretion in a patient with treatment-resistant schizophrenia. Psychiatry Clin Neurosci. 2016;70(10):469.

  42. 42.

    Abejuela HR, Festin FE, Lynn E. Clozapine for Treatment- Resistant Post-Traumatic Stress Disorder (PTSD). J Traum Stress Disord Treatment. 2014;3(2):1–9.

  43. 43.

    Kant R, Chalansani R, Chengappa KN, Dieringer MF. The off-label use of clozapine in adolescents with bipolar disorder, intermittent explosive disorder, or posttraumatic stress disorder. J Child Adolesc Psychopharmacol. 2004;14(1):57.

  44. 44.

    Klein C, Gordon J, Pollak L, Rabey JM. Clozapine in Parkinson's disease psychosis: 5-year follow-up review. Clin Neuropharmacol. 2003;26(1):8–11.

  45. 45.

    Mutlu O, Ulak G, Celikyurt IK, Akar FY, Erden F, Tanyeri P. Effects of olanzapine, sertindole and clozapine on MK-801 induced visual memory deficits in mice. Pharmacol Biochem Behav. 2011;99(4):557–65.

  46. 46.

    Schatz RA. Olanzapine for psychotic and behavioral disturbances in Alzheimer disease. Ann Pharmacother. 2003;37(9):1321–4.

  47. 47.

    Ambrée O, Richter H, Sachser N, Lewejohann L, Dere E, Ma DSS, Herring A, Keyvani K, Paulus W, Schäbitz WR. Levodopa ameliorates learning and memory deficits in a murine model of Alzheimer's disease. Neurobiol Aging. 2009;30(8):1192–204.

  48. 48.

    Lópezriquelme N, Alompoveda J, Vicianomorote N, Llinaresibor I, Tormodíaz C. Apolipoprotein E ε4 allele and malondialdehyde level are independent risk factors for Alzheimer’s disease. SAGE Open Med. 2016;4(2016–1-22):4.

  49. 49.

    Carroll JC, Rosario ER, Chang L, Stanczyk FZ, Oddo S, Laferla FM, Pike CJ. Progesterone and estrogen regulate Alzheimer-like neuropathology in female 3xTg-AD mice. J. Neurosci. Off. J. Soc. Neurosci. 2007;27(48):13357.

  50. 50.

    Hong Q, He G, Ly PTT, Fox CJ, Staufenbiel M, Cai F, Zhang Z, Wei S, Sun X, Chen CH. Valproic acid inhibits Aβ production, neuritic plaque formation, and behavioral deficits in Alzheimer's disease mouse models. J Exp Med. 2008;205(12):2781.

  51. 51.

    Bate C, Williams A. Ethanol protects cultured neurons against amyloid-β and α-synuclein-induced synapse damage. Neuropharmacology. 2011;61(8):1406–12.

  52. 52.

    Cohen T, Widdows D. Embedding of semantic predications. J Biomed Inform. 2017;68:150–66.

  53. 53.

    Mower J, Subramanian D, Shang N, Cohen T. Classification-by-analogy: using vector representations of implicit relationships to identify plausibly causal drug/side-effect relationships. AMIA Annu Symp Proc. 2016;2016:1940–9.

  54. 54.

    Zhang W, Zhu X, Fu Y, Tsuji J, Weng Z. Predicting human splicing branchpoints by combining sequence-derived features and multi-label learning methods. BMC Bioinformatics. 2017;18(Suppl 13):464.

  55. 55.

    Zhang W, Niu Y, Zou H, Luo L, Liu Q, Wu W. Accurate prediction of immunogenic T-cell epitopes from epitope sequences using the genetic algorithm-based ensemble learning. PLoS One. 2015;10(5):e0128194.

  56. 56.

    Zhang W, Liu F, Luo L, Zhang J. Predicting drug side effects by multi-label learning and ensemble learning. BMC Bioinformatics. 2015;16:365.

  57. 57.

    Li D, Luo L, Zhang W, Liu F, Luo F. A genetic algorithm-based weighted ensemble method for predicting transposon-derived piRNAs. BMC Bioinformatics. 2016;17(1):329.

  58. 58.

    Luo L, Li D, Zhang W, Tu S, Zhu X, Tian G. Accurate prediction of transposon-derived piRNAs by integrating various sequential and physicochemical features. PLoS One. 2016;11(4).

  59. 59.

    Zhang W, Chen YL, Liu F, Luo F, Tian G, Li XH. Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data. Bmc Bioinformatics. 2017;18:18.

  60. 60.

    Zhang W, Shi JW, Tang GF, Wu WJ, Yue X, Li DF. Predicting small RNAs in bacteria via sequence learning ensemble method. IEEE Int Conf Bioinformatics Biomed. 2017;2017:643–7.

Download references


This work is supported by the National Natural Science Foundation of China (61772381, 61572368), the Fundamental Research Funds for the Central Universities (2042017kf0219). The fundings have no role in the design of the study and collection, analysis, and interpretation of data and writing the manuscript.

Availability of data and materials

a user-friendly web server available at:

Author information

WZ and FL conceived the project; XY, WL and WZ designed the experiments; XY, WL and FH performed the experiments; WW and RL designed the server; WZ and XY wrote the paper. All authors read and approved the final manuscript.

Correspondence to Wen Zhang or Feng Liu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Zhang, W., Yue, X., Lin, W. et al. Predicting drug-disease associations by using similarity constrained matrix factorization. BMC Bioinformatics 19, 233 (2018).

Download citation


  • Drug-disease associations
  • Similarity constrained matrix factorization