 Research article
 Open Access
Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data
BMC Bioinformatics volume 18, Article number: 18 (2017)
Abstract
Background
Drug-drug interactions (DDIs) are one of the major concerns in drug discovery. Accurate prediction of potential DDIs can help to reduce unexpected interactions throughout the entire lifecycle of drugs, and is important for drug safety surveillance.
Results
Since many DDIs are not detected or observed in clinical trials, this work aims to predict unobserved or undetected DDIs. We collect a variety of drug data that may influence drug-drug interactions, i.e., drug substructure data, drug target data, drug enzyme data, drug transporter data, drug pathway data, drug indication data, drug side effect data, drug off side effect data and known drug-drug interactions. We adopt three representative methods, namely the neighbor recommender method, the random walk method and the matrix perturbation method, to build prediction models based on different data, and thereby evaluate the usefulness of different information sources for DDI prediction. Further, we present flexible frameworks for integrating different models with suitable ensemble rules, including the weighted average ensemble rule and the classifier ensemble rule, and develop ensemble models to achieve better performance.
Conclusions
The experiments demonstrate that different data sources provide diverse information, and that the DDI network based on known DDIs is one of the most important information sources for DDI prediction. The ensemble methods produce better performance than individual methods, and outperform existing state-of-the-art methods. The datasets and source code are available at https://github.com/zw9977129/drugdruginteraction/.
Background
Drugs may interact when multiple drugs are co-prescribed. Drug-drug interactions (DDIs) may exert different effects, and adverse drug-drug interactions can lead to patient death or drug withdrawal [1–4]. DDI prediction can help to reduce unexpected effects as well as optimize treatments in drug design, clinical trials, and post-marketing surveillance.
In silico methods, in vitro methods, in vivo experiments and clinical trials can identify DDIs, but they are labor-intensive and time-consuming. Statistical methods [5–9] were developed to detect whether the combination of two drugs is associated with an increased risk of certain adverse events, by analyzing spontaneous reports, insurance claim databases and electronic medical records.
In recent years, researchers have collected drug data from the literature, reports and other sources, and constructed public databases [10–17] that facilitate the development of computational prediction methods. A great number of machine learning methods have been proposed to predict DDIs. Existing methods are roughly classified into two types: similarity-based methods and classification-based methods. Similarity-based methods assume that similar drugs may interact with the same drug. Gottlieb et al. [18] built prediction models by considering seven kinds of drug-drug similarities. Vilar et al. proposed the substructure similarity-based prediction method [19] and the interaction profile fingerprint similarity-based prediction method [20]. Li et al. [21] developed a Bayesian network combining drug molecular similarity and phenotypic similarity to predict the combination efficacy of drugs. Using drug-drug similarity indirectly, Park et al. [22] applied a random walk with restart to simulate signaling propagation from drug targets and make predictions; Zhang et al. [23] adopted the label propagation method to build prediction models based on drug chemical substructures, drug side effects and drug off side effects. Classification-based methods formulate DDI prediction as a binary classification task. Cami et al. [24] represented drug-drug pairs as feature vectors, used the presence or absence of interactions as labels, and then built logistic regression models. Cheng et al. [25] applied five predictive models (naive Bayes, decision tree, k-nearest neighbor, logistic regression, and support vector machine) to build prediction models. Besides similarity-based methods and classification-based methods, there are several methods designed for specific purposes. Takarabe et al. [26] constructed a multilevel drug-drug interaction network, and analyzed, characterized and classified adverse drug-drug interactions. Huang et al. [27] developed a target-center system for each drug, which consists of drug targets and their neighbors in the PPI network and human tissue gene expression.
Since many DDIs are not detected or observed in clinical trials, this work aims to predict undetected or unobserved drug-drug interactions. Classification methods utilize two classes of data, annotated drug-drug interaction pairs and annotated non-interaction pairs, to build classification models. In binary classification, known interactions are used as positive instances, but the other drug pairs may have undetected or unobserved interactions, which are exactly what needs to be predicted. In machine learning, such problems are treated as semi-supervised learning tasks. For this reason, we build DDI prediction models under the framework of semi-supervised learning.
In this paper, we collect drug substructure data, drug target data, drug enzyme data, drug transporter data, drug pathway data, drug indication data, drug side effect data, drug off side effect data and known drug-drug interactions. Multisource data provide biological information, chemical information, phenotypic information and known interactions to characterize drug-drug interactions. To make use of diverse information, we adopt three representative methods, i.e., the neighbor recommender method [28, 29], the random walk method and the matrix perturbation method [30], to build different prediction models. According to the performance of the prediction models, we evaluate the usefulness of different information sources for DDI prediction. The study reveals that the DDI network based on known DDIs provides important information for DDI prediction. Further, we present flexible frameworks for integrating different models with suitable ensemble rules, including the weighted average ensemble rule and the classifier ensemble rule, and develop ensemble models to achieve better performance. The experiments demonstrate that ensemble methods can combine diverse information to produce high-accuracy predictions, and outperform existing state-of-the-art methods.
Methods
Datasets
The FDA Adverse Event Reporting System (FAERS) is a database that contains adverse event reports and medication error reports submitted to the FDA. Tatonetti processed adverse event reports in the AERS, and constructed a database named “TWOSIDES” [31], which contains side effects caused by combinations of drugs. There are 645 drugs and 63,473 distinct pairwise DDIs from unsafe co-prescriptions in TWOSIDES.
The biological information, chemical information and phenotypic information about drugs may be associated with drug-drug interactions. The PubChem Compound database [12, 15] provides drug structures. The DrugBank database [10, 11, 16, 17] is a bioinformatics resource with drug targets, drug enzymes and drug transporters. The KEGG database [13] is an information resource for protein pathways. Drug targets are mapped to KEGG to obtain drug pathways. The SIDER database [14] contains 1430 drugs and 5880 side effect terms, which are compiled from public documents and package inserts. Drug side effects and indications are available in SIDER. The OFFSIDES database [31] contains 1332 drugs and 10,093 “off-label” side effects.
We map drugs in TWOSIDES to SIDER, OFFSIDES, PubChem and DrugBank. As shown in Table 1, we obtain 548 drugs and 48,584 pairwise DDIs, for which substructure data, target data, enzyme data, transporter data, pathway data, indication data, side effect data and off side effect data are all available. Based on these data, we conduct a comprehensive study to evaluate the usefulness of different data sources for DDI prediction, and discuss how to combine them for high-accuracy prediction.
DDI prediction based on multisource data
Multisource data provide different information for the DDI prediction. Here, we describe how to build models based on different data.
Drug-drug similarities bring important clues for DDI prediction, and different similarities can be extracted from multisource data. Drug data are classified into four types, i.e., chemical data, biological data, phenotypic data and the drug-drug interaction network data (formed by known drug-drug interactions). On one hand, we calculate drug-drug similarities in the biological space, chemical space and phenotypic space, by using drug substructures, drug targets, drug enzymes, drug transporters, drug pathways, drug indications, drug side effects and drug off side effects. On the other hand, we calculate drug-drug similarities in the drug-drug interaction network. In order to utilize drug-drug similarities, we consider two representative methods [28, 32], the neighbor recommender method and the random walk method, and build DDI prediction models.
We take drugs as nodes and known interactions as edges in the DDI network, and transform the DDI prediction problem into a missing link prediction task. Missing link prediction is an important topic of theoretical interest and practical significance in complex networks [33]. Recently, a novel method named “matrix perturbation method” [30] was proposed, which utilizes the network to predict missing links (unobserved DDIs). Studies demonstrated that this method outperforms other missing link prediction methods. Therefore, we adopt the matrix perturbation method to predict potential DDIs based on the DDI network.
In the following, the section “Similarity-based DDI prediction based on multisource data” presents how to extract different drug-drug similarities from different data and how to develop similarity-based models; the section “Matrix perturbation method for DDI prediction” presents the missing link prediction method (matrix perturbation method).
Similarity-based DDI prediction based on multisource data
Drug-drug similarity based on biological data, chemical data and phenotypic data
A drug can be represented as a binary feature vector, by using drug substructures, drug targets, drug enzymes, drug transporters, drug pathways, drug indications, drug side effects, or drug off side effects. Each dimension of the feature vector corresponds to the presence or absence of a component, with value 1 or 0. For example, there are 881 types of drug substructures, so a drug can be represented as an 881-dimensional vector.
Given a drug \( x \) and a drug \( y \), their feature vectors are \( V_x \) and \( V_y \), and the similarity between \( x \) and \( y \) is then calculated by the Jaccard formula:
\[ J(x, y) = \frac{M_{11}}{M_{01} + M_{10} + M_{11}} \]
where \( M_{11} \) is the number of dimensions where \( V_x \) and \( V_y \) both have a value of 1; \( M_{01} \) is the number of dimensions where \( V_x \) has a value of 0 and \( V_y \) has a value of 1; \( M_{10} \) is the number of dimensions where \( V_x \) has a value of 1 and \( V_y \) has a value of 0.
Therefore, we can obtain 8 drug feature-based drug-drug similarities, including substructure-based similarity, target-based similarity, enzyme-based similarity, transporter-based similarity, pathway-based similarity, indication-based similarity, side effect-based similarity and off side effect-based similarity.
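The Jaccard calculation on binary feature vectors can be illustrated with a minimal sketch. The function name, the toy vectors and the convention that two all-zero vectors get similarity 0 are assumptions made for illustration; they are not taken from the authors' code.

```python
def jaccard_similarity(v_x, v_y):
    """Jaccard similarity between two equal-length binary feature vectors."""
    if len(v_x) != len(v_y):
        raise ValueError("feature vectors must have the same length")
    m11 = sum(1 for a, b in zip(v_x, v_y) if a == 1 and b == 1)
    m10 = sum(1 for a, b in zip(v_x, v_y) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(v_x, v_y) if a == 0 and b == 1)
    denom = m01 + m10 + m11
    # Assumption: two all-zero vectors get similarity 0 (the paper does not say).
    return m11 / denom if denom else 0.0


# Toy 6-dimensional fingerprints (made-up values, not real drug data).
drug_x = [1, 1, 0, 1, 0, 0]
drug_y = [1, 0, 0, 1, 1, 0]
sim_xy = jaccard_similarity(drug_x, drug_y)  # M11 = 2, M10 = 1, M01 = 1
```

Applying the same function to each of the eight feature types would give the eight feature-based similarity matrices.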
Drug-drug similarity based on known drug-drug interactions
By considering drugs as nodes and interactions as edges, known DDIs form a DDI network. We calculate drug-drug similarities in the DDI network [33]. The adjacency matrix of the DDI network is denoted as \( A = (a_{ij}) \), and \( \Gamma(x) \) denotes the set of nodes linked to node \( x \). Several similarities between a drug \( x \) and a drug \( y \) can be defined.
Common neighbor similarity \( S_{CN}(x, y) \) takes the number of common neighbors between two nodes,
\[ S_{CN}(x, y) = \left| \Gamma(x) \cap \Gamma(y) \right| \]
Adamic-Adar similarity \( S_{AA}(x, y) \) counts common neighbors while assigning more weight to the less connected neighbors,
\[ S_{AA}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log \left| \Gamma(z) \right|} \]
Resource Allocation similarity \( S_{RA}(x, y) \) is based on the resource allocation dynamics of complex networks,
\[ S_{RA}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\left| \Gamma(z) \right|} \]
Katz similarity \( S_{Katz}(x, y) \) sums over the collection of paths with exponential damping according to path lengths,
\[ S_{Katz}(x, y) = \sum_{l=1}^{\infty} \alpha^{l} \left( A^{l} \right)_{xy}, \qquad S_{Katz} = (I - \alpha A)^{-1} - I \]
where \( \alpha \) is a parameter, and \( I \) is the identity matrix. \( \alpha < 1/\lambda_{\max} \) is the condition for the compact form, and \( \lambda_{\max} \) is the largest eigenvalue of \( A \).
Average Commute Time similarity \( S_{ACT}(x, y) \) is based on the average number of steps required by a random walker starting from one node to reach another,
\[ S_{ACT}(x, y) = \frac{1}{l^{+}_{xx} + l^{+}_{yy} - 2 l^{+}_{xy}} \]
where \( l^{+}_{xy} \) is the corresponding entry of \( L^{+} \), the pseudoinverse of the Laplacian matrix of the network.
The random walk with restart similarity \( S_{RWR}(x, y) \) is based on the probability that a random walker starting from an initial node \( x \) reaches \( y \). The walker moves with the probability \( \mu \) of returning to the initial node and the probability \( 1 - \mu \) of going to adjacent nodes,
\[ S_{RWR}(x, y) = q_{xy} + q_{yx} \]
where \( q = (1 - \mu)(I - \mu P^{T})^{-1} A \), \( P = D^{-1} A \) is the normalized transition matrix of the adjacency matrix \( A \), and \( D \) is the degree matrix of \( A \).
Therefore, we obtain 6 DDI network-based drug-drug similarities, including common neighbor similarity, Adamic-Adar similarity, resource allocation similarity, Katz similarity, average commute time similarity and random walk with restart similarity.
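The three local indices (common neighbor, Adamic-Adar and resource allocation) can be sketched directly from neighbor sets. This is an illustrative sketch that uses the standard definitions of these indices from the link prediction literature; the function names and the five-node toy network are assumptions, not the authors' implementation.

```python
import math


def neighbor_sets(adj):
    """Map each node index to the set of its neighbors, from a 0/1 adjacency matrix."""
    n = len(adj)
    return [{j for j in range(n) if adj[i][j] and j != i} for i in range(n)]


def local_similarities(adj, x, y):
    """Return (CN, AA, RA) similarity between two distinct nodes x and y."""
    if x == y:
        raise ValueError("x and y must be different nodes")
    gamma = neighbor_sets(adj)
    common = gamma[x] & gamma[y]
    cn = len(common)
    # A common neighbor of two distinct nodes has degree >= 2, so log(degree) > 0.
    aa = sum(1.0 / math.log(len(gamma[z])) for z in common)
    ra = sum(1.0 / len(gamma[z]) for z in common)
    return cn, aa, ra


# Toy DDI network with edges 0-2, 1-2, 0-3, 1-3, 3-4 (made up for illustration).
edges = [(0, 2), (1, 2), (0, 3), (1, 3), (3, 4)]
adj = [[0] * 5 for _ in range(5)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

cn, aa, ra = local_similarities(adj, 0, 1)  # common neighbors of 0 and 1: {2, 3}
```

The global indices (Katz, ACT and RWR) need matrix inversion and are better computed with a linear algebra library.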
Similarity-based methods for DDI prediction
Given an \( N \times N \) similarity matrix \( S = (s_{ij}) \) for \( N \) drugs, known pairwise DDIs are denoted by an adjacency matrix \( A = (a_{ij}) \). The neighbor recommender method and the random walk method are briefly introduced as follows.
The neighbor recommender method [28, 34] is one of the most popular methods in recommender systems, which recommend items (movies, music, books, etc.) to users, or predict the ‘rating’ or ‘preference’ that users would give to items. The neighbor recommender method takes the similarity-weighted average of the neighbors' information for prediction. For drug \( i \) and drug \( j \) without a known interaction, we calculate \( Y_{ij} = \sum_{k=1, k \ne j}^{N} s_{ik} a_{kj} / \sum_{k=1, k \ne j}^{N} s_{ik} \), where \( s_{ik} \) is the similarity between drug \( i \) and drug \( k \), and \( a_{kj} = 1 \) or \( 0 \) means interaction or non-interaction between drug \( k \) and drug \( j \). We calculate \( Y_{ji} \) in the same way. The probability that drug \( i \) interacts with drug \( j \) is scored as \( score_{ji} = score_{ij} = Y_{ij} + Y_{ji} \).
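A vectorized sketch of the neighbor recommender score follows. It implements the formula exactly as stated (the sum skips only k = j), assumes NumPy is available, and uses a made-up 3-drug example; the function name is hypothetical.

```python
import numpy as np


def neighbor_recommender_scores(S, A):
    """score_ij = Y_ij + Y_ji, with Y_ij = sum_{k != j} s_ik a_kj / sum_{k != j} s_ik."""
    S = np.asarray(S, dtype=float)
    A = np.asarray(A, dtype=float)
    # (S @ A)_ij sums over all k; subtract the k = j term s_ij * a_jj.
    numer = S @ A - S * np.diag(A)[None, :]
    # Row sum of S minus the k = j term s_ij.
    denom = S.sum(axis=1, keepdims=True) - S
    Y = np.divide(numer, denom, out=np.zeros_like(numer), where=denom > 0)
    return Y + Y.T


# Toy similarity matrix and known interactions (drug 0 interacts with drug 1).
S = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
A = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]])
scores = neighbor_recommender_scores(S, A)
```

In this toy case drug 2 scores higher with drug 0 than with drug 1, because drug 2 is more similar to drug 1 (0.4), which is known to interact with drug 0.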
A random walk is a mathematical formalization of a path that consists of a succession of random steps. There are a great number of successful applications in network analysis [35–38]. In a random walk, a random walker starts from an initial node, moves to neighbors with the probability \( \mu \), and moves back to the initial node with the probability \( 1 - \mu \). The similarity matrix \( S \) is normalized as \( W = D^{-1} S \), where \( D \) is the degree matrix of \( S \). The matrix form of the update is \( Y = \mu W Y + (1 - \mu) A \), which converges to the solution \( Y = (1 - \mu)(I - \mu W)^{-1} A \). The probability that drug \( i \) interacts with drug \( j \) is scored as \( score_{ji} = score_{ij} = Y_{ij} + Y_{ji} \).
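The closed-form solution can be sketched in a few lines, assuming NumPy is available. The value of \( \mu \), the toy matrices and the function name are illustrative assumptions; the paper does not state its value of \( \mu \) in this section.

```python
import numpy as np


def random_walk_propagate(S, A, mu=0.5):
    """Return (Y, W), where Y = (1 - mu) (I - mu W)^{-1} A and W = D^{-1} S."""
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")
    S = np.asarray(S, dtype=float)
    A = np.asarray(A, dtype=float)
    W = S / S.sum(axis=1, keepdims=True)  # row-normalize the similarity matrix
    n = S.shape[0]
    # Solve (I - mu W) Y = (1 - mu) A instead of forming the inverse explicitly.
    Y = (1.0 - mu) * np.linalg.solve(np.eye(n) - mu * W, A)
    return Y, W


# Toy inputs (made up): three drugs, one known interaction between drugs 0 and 1.
S = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
MU = 0.5  # assumed value for illustration
Y, W = random_walk_propagate(S, A, mu=MU)
scores = Y + Y.T  # symmetric interaction scores
```

Because \( W \) is row-stochastic and \( 0 < \mu < 1 \), the matrix \( I - \mu W \) is invertible, so the linear solve is well defined.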
Matrix perturbation method for DDI prediction
The matrix perturbation method assumes that the random removal of a small proportion of links from a network will not change the network structure [30], which is reflected by the eigenvectors of its adjacency matrix.
We first introduce the notation for the matrix perturbation method. Given the drug-drug interaction network \( G(V, E) \), \( V \) is the set of nodes and \( E \) is the set of edges. The adjacency matrix is \( A = (a_{ij}) \), and the eigenvectors and eigenvalues of the adjacency matrix are denoted by \( x_k \) and \( \lambda_k \), \( k = 1, 2, \cdots, N \).
A fraction of links \( \Delta E \) is randomly removed from \( E \), and the set of remaining links is \( E^R = E - \Delta E \). Thus, we obtain the new network \( G^R(V, E^R) \) with the adjacency matrix \( A^R = A - \Delta A \), where \( \Delta A \) is the adjacency matrix of the removed links. Then, we calculate the eigenvectors \( x_k^R \) and eigenvalues \( \lambda_k^R \) of \( A^R \), \( k = 1, 2, \cdots, N \). We write \( A = A^R + \Delta A \), \( x_k = x_k^R + \Delta x_k \) and \( \lambda_k = \lambda_k^R + \Delta \lambda_k \).
In the network \( G(V, E) \), the relation between the eigenvectors, eigenvalues and the adjacency matrix is written as
\[ \left( A^R + \Delta A \right) \left( x_k^R + \Delta x_k \right) = \left( \lambda_k^R + \Delta \lambda_k \right) \left( x_k^R + \Delta x_k \right) \]
Left-multiplying the above equation by \( (x_k^R)^T \) and neglecting second-order terms, we obtain \( \Delta {\lambda}_k\approx \frac{{\left({x}_k^R\right)}^T\Delta A{x}_k^R}{{\left({x}_k^R\right)}^T{x}_k^R} \).
We estimate the eigenvalues as \( \lambda_k = \lambda_k^R + \Delta \lambda_k \), and keep the eigenvectors \( x_k^R \) unchanged. Then, we reconstruct the adjacency matrix of \( G(V, E) \) from the estimated eigenvalues and the eigenvectors,
\[ \tilde{A} = \sum_{k=1}^{N} \left( \lambda_k^R + \Delta \lambda_k \right) x_k^R \left( x_k^R \right)^T \]
The probability that drug \( i \) interacts with drug \( j \) is scored as \( score_{ij} = score_{ji} = \tilde{A}_{ij} + \tilde{A}_{ji} \). More details are available in the publication [30].
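The perturbation steps above can be sketched as follows, assuming NumPy is available. This is a single-perturbation sketch: the removed fraction, the random seed, the function name and the toy network are assumptions, and details of [30] such as averaging over repeated perturbations or handling degenerate eigenvalues are omitted.

```python
import numpy as np


def perturbation_scores(A, removed_fraction=0.1, seed=0):
    """Score drug pairs from one random perturbation of the adjacency matrix A."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    # List each undirected edge once, using the upper triangle.
    iu, ju = np.triu_indices_from(A, k=1)
    is_edge = A[iu, ju] > 0
    ei, ej = iu[is_edge], ju[is_edge]
    n_remove = max(1, int(round(removed_fraction * len(ei))))
    pick = rng.choice(len(ei), size=n_remove, replace=False)
    dA = np.zeros_like(A)  # adjacency matrix of the removed links
    dA[ei[pick], ej[pick]] = 1.0
    dA[ej[pick], ei[pick]] = 1.0
    A_R = A - dA
    # eigh returns orthonormal eigenvectors as columns for a symmetric matrix.
    lam_R, X_R = np.linalg.eigh(A_R)
    # First-order correction x_k^T dA x_k (the denominator x_k^T x_k equals 1).
    d_lam = np.einsum("ik,ij,jk->k", X_R, dA, X_R)
    # Reconstruct with corrected eigenvalues and unchanged eigenvectors.
    A_tilde = (X_R * (lam_R + d_lam)) @ X_R.T
    return A_tilde + A_tilde.T


# Toy network with 6 nodes and 7 edges (made up for illustration).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
scores = perturbation_scores(A, removed_fraction=0.15, seed=0)
```

Entries of `scores` for pairs without a known link can then be ranked to propose candidate interactions.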
Combining multisource data for DDI prediction
Since we build different prediction models based on different data, it is natural to combine them for better performance. Ensemble learning is a useful technique that aggregates multiple machine learning models to achieve overall high prediction accuracy as well as good generalization [39]. Ensemble learning has been applied to a great number of applications in bioinformatics [29, 40, 41].
An ensemble learning system usually has two components: base predictors and ensemble rules. In our ensemble system, we adopt heterogeneous models \( \{f_i\}_{i=1}^{n} \) based on multisource data as base predictors. To integrate base predictors, we consider two popular ensemble rules: the weighted average ensemble rule and the classifier ensemble rule. Figure 1 shows the flowchart of the ensemble systems.
The weighted average ensemble rule takes the weighted average of the outputs from base predictors. For a new input \( x_{new} \), base predictors give the predictions \( \{f_i(x_{new})\}_{i=1}^{n} \), and their weighted average \( \sum_{i=1}^{n} w_i f_i(x_{new}) \) is adopted as the prediction of the ensemble model, where \( \sum_{i=1}^{n} w_i = 1 \) and \( w_i \ge 0 \). We adopt the genetic algorithm (GA) to determine the weights in the ensemble model. In the GA optimization, candidate weights are represented as chromosomes, and the fitness of a chromosome is the area under the precision-recall curve (AUPR) score of the ensemble model on the validation data. The objective of the GA optimization is to maximize the AUPR score.
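The weight search can be illustrated with a simplified sketch, assuming NumPy is available. Note the substitution: a plain random search over the weight simplex stands in for the genetic algorithm used in the paper, and AUPR is estimated by average precision. All names and the synthetic validation data are hypothetical.

```python
import numpy as np


def average_precision(y_true, y_score):
    """Average precision, used here as an estimate of the AUPR."""
    y_true = np.asarray(y_true)
    if y_true.sum() == 0:
        raise ValueError("at least one positive label is required")
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="stable")
    y = y_true[order]
    precision_at_k = np.cumsum(y) / np.arange(1, len(y) + 1)
    return float((precision_at_k * y).sum() / y.sum())


def search_weights(pred_matrix, y_val, n_candidates=2000, seed=0):
    """Random search (a stand-in for the GA) for weights that maximize AUPR."""
    rng = np.random.default_rng(seed)
    n_models = pred_matrix.shape[1]
    # Candidates: uniform weights, each single model, and random simplex points.
    candidates = np.vstack([
        np.full((1, n_models), 1.0 / n_models),
        np.eye(n_models),
        rng.dirichlet(np.ones(n_models), size=n_candidates),
    ])
    best_w, best_fit = None, -1.0
    for w in candidates:
        fit = average_precision(y_val, pred_matrix @ w)  # fitness = AUPR
        if fit > best_fit:
            best_w, best_fit = w, fit
    return best_w, best_fit


# Synthetic validation data: two informative predictors and one pure-noise one.
rng = np.random.default_rng(1)
y_val = rng.integers(0, 2, size=300)
preds = np.column_stack([
    y_val + rng.normal(0.0, 0.8, 300),
    y_val + rng.normal(0.0, 1.2, 300),
    rng.normal(0.0, 1.0, 300),
])
best_w, best_fit = search_weights(preds, y_val)
```

Because every single-model weight vector is among the candidates, the selected ensemble is never worse on the validation data than the best individual predictor.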
The classifier ensemble rule seeks a classification function \( G : (f_1(x), f_2(x), \cdots, f_n(x)) \to \{0, 1\} \), which maps the outputs of \( n \) base predictors to a label. For a new input \( x_{new} \), the outputs of base predictors are \( \{f_i(x_{new})\}_{i=1}^{n} \), and the prediction of the classifier ensemble model is \( G(f_1(x_{new}), f_2(x_{new}), \cdots, f_n(x_{new})) \). Here, we adopt logistic regression as the classification function.
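A minimal sketch of the classifier ensemble rule is given below, assuming NumPy is available. To keep it self-contained, logistic regression is fitted by plain gradient descent with an L2 penalty, which is a stand-in for a library solver; the learning rate, penalty strength, function names and synthetic data are all assumptions.

```python
import numpy as np


def fit_logistic_ensemble(F, y, l2=1e-3, lr=0.5, n_iter=2000):
    """Fit logistic regression on base-predictor outputs F (n_samples x n_models)."""
    F = np.asarray(F, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.hstack([F, np.ones((F.shape[0], 1))])  # last column is the bias term
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) / len(y)
        grad[:-1] += l2 * w[:-1]  # L2 penalty on weights, not on the bias
        w -= lr * grad
    return w


def predict_proba(w, F):
    """Probability of interaction given base-predictor outputs."""
    F = np.asarray(F, dtype=float)
    X = np.hstack([F, np.ones((F.shape[0], 1))])
    return 1.0 / (1.0 + np.exp(-(X @ w)))


# Synthetic base-predictor outputs: f1 is informative, f2 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
f1 = 0.7 * y + 0.3 * rng.random(200)
f2 = rng.random(200)
F = np.column_stack([f1, f2])

w = fit_logistic_ensemble(F, y)
labels = (predict_proba(w, F) >= 0.5).astype(int)  # G maps outputs to {0, 1}
accuracy = float((labels == y).mean())
```

The learned coefficients play the role of the ensemble rule: informative base predictors receive large coefficients, and uninformative ones receive coefficients near zero.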
Results and discussion
Evaluation metrics
We adopt k-fold cross validation (k-CV) to evaluate prediction models. Known interactions are randomly split into k subsets of equal size. In each fold, one subset is used as the testing set; 80% and 20% of the other interactions (k − 1 subsets) are used as the training set and the validation set, respectively. Base predictors are constructed on the training set, and parameters in the ensemble system are tuned by using the validation set. Then, the ensemble model makes predictions for the testing set. This procedure is repeated until each subset has been used for testing. To avoid bias from the data split, we implement 20 independent runs of k-CV for each model, and the average performance is adopted.
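The splitting scheme can be sketched with the standard library alone. The function name, the seed and the toy list of drug pairs are assumptions for illustration; fold sizes differ by at most one when the number of interactions is not divisible by k.

```python
import random


def kfold_interaction_splits(interactions, k=5, val_fraction=0.2, seed=0):
    """Yield (train, validation, test) lists of known interactions for each fold."""
    rng = random.Random(seed)
    pairs = list(interactions)
    rng.shuffle(pairs)
    folds = [pairs[i::k] for i in range(k)]  # k nearly equal-sized subsets
    for test_idx in range(k):
        test = folds[test_idx]
        rest = [p for i, fold in enumerate(folds) if i != test_idx for p in fold]
        rng.shuffle(rest)
        n_val = int(round(val_fraction * len(rest)))
        # 20% of the remaining interactions for validation, 80% for training.
        yield rest[n_val:], rest[:n_val], test


# Toy set of 40 known interactions among 10 drugs (made up for illustration).
all_pairs = [(i, j) for i in range(10) for j in range(i + 1, 10)][:40]
splits = list(kfold_interaction_splits(all_pairs, k=5))
```

Repeating the call with 20 different seeds would correspond to the 20 independent runs described above.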
Here, we adopt several evaluation metrics to measure the performance of prediction models, i.e., accuracy (ACC), precision, recall, F-measure (F), area under the ROC curve (AUC) and area under the precision-recall curve (AUPR). In our task, DDIs account for a small proportion of all drug pairs, and thus AUPR, which takes into account both recall and precision, is used as the primary evaluation metric.
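The threshold-based metrics follow directly from the confusion counts, as the short sketch below illustrates. The function name and the toy labels are assumptions; AUC and AUPR need ranked scores rather than hard labels and are not covered here.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F-measure for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"ACC": accuracy, "precision": precision,
            "recall": recall, "F": f_measure}


# Toy example: 3 true interactions among 8 drug pairs (made-up labels).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)  # tp = 2, fn = 1, fp = 1, tn = 4
```

With few positives, accuracy stays high even for weak models, which is one reason a precision-recall based metric is preferred in this setting.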
Performances of different models based on multisource data
We extract 14 different similarities from multisource data, and adopt the neighbor recommender method and the random walk method, respectively, to build 28 similarity-based prediction models. By formulating the original problem as a missing link prediction task, we adopt the matrix perturbation method to build the prediction model based on known DDIs. Therefore, we construct 29 prediction models based on multisource data. Since different models utilize different information for DDI prediction, the performance of the models indicates the usefulness of the information sources.
As shown in Table 2, these models produce different performances on the benchmark dataset in the cross validation. Among the eight feature-based similarities, substructure similarity, side effect similarity, off side effect similarity and indication similarity lead to better performance than the other similarities, indicating that drug substructures, drug side effects, drug off side effects and drug indications provide important information about drug-drug interactions. Among the network topology-based similarities, RA and RWR produce better results. The comparison shows that drug feature-based similarities as well as topological similarities can provide useful information to characterize drug-drug interactions and lead to useful models. The matrix perturbation method utilizes the DDI network as a whole to make predictions. Among all prediction models, the matrix perturbation method produces the best results, indicating that known DDIs are one of the most useful information sources for identifying potential DDIs.
We also conduct 20 runs of 3-CV to evaluate prediction models, and the results are shown in Table 3. The comparison between 3-CV results and 5-CV results demonstrates that prediction models perform differently under different experimental conditions, and no single model produces the best results in all cases. For example, the matrix perturbation method assumes that the topology of a network will not change if only a small proportion of links is removed. In 3-CV, more links are held out for testing, and the predictive power may be affected. Therefore, the matrix perturbation method is not the best predictor in 3-CV experiments. For this reason, we integrate different models to make robust predictions.
Performances of ensemble models
Based on multisource data, we construct 29 prediction models, including 28 similarity-based models and one matrix perturbation model. We use these models as base predictors, and adopt the weighted average ensemble rule and the classifier ensemble rule, respectively, to build ensemble models.
We apply the genetic algorithm (GA) to determine optimal weights in the weighted average ensemble models. GA is implemented by using the Python package “deap”. The initial population has 100 chromosomes. In the population update, the elitist strategy is used for the selection operator, and default parameters are adopted for the mutation probability and crossover probability. The population update terminates when the change of the best fitness score is less than the default value of 1E-6 or the maximum generation number of 50 is reached.
To build classifier ensemble models, we train a logistic regression classifier to combine the outputs of base predictors. The logistic regression is implemented by using the Python package “scikit-learn”. Default parameters are used; L1 regularization and L2 regularization are considered separately. In the following, classifier ensemble models refer to logistic regression ensemble models.
Table 4 shows the 3-CV results and 5-CV results. In 5-CV experiments, the weighted average ensemble model, the classifier ensemble model (L1 regularization) and the classifier ensemble model (L2 regularization) produce AUPR scores of 0.795, 0.807 and 0.806; in 3-CV experiments, the three models yield AUPR scores of 0.832, 0.841 and 0.839. The comparison demonstrates that the classifier ensemble models produce better results than the weighted average ensemble model. A possible reason is that the weighted average ensemble method uses a linear function for ensemble learning, whereas the classifier ensemble method trains a nonlinear function. Moreover, the classifier ensemble method with L1 regularization produces better results than the classifier ensemble method with L2 regularization, since L1 regularization produces a sparse model and enhances the generalization capability.
Clearly, ensemble models produce better results than base predictors. In 5-CV experiments, the classifier ensemble method (L1) improves the AUPR score from 0.782 (produced by the matrix perturbation model) to 0.806. Since we implement 20 runs of 5-CV for ensemble models and matrix perturbation models, we conduct a t-test on the difference of their performance in terms of AUPR score, and statistical significance is observed (p-value = 1.21E-39). In 3-CV experiments, the classifier ensemble method (L1) enhances the AUPR score from 0.820 (produced by the indication-based random walk model) to 0.839, and we also observe a statistically significant improvement of the classifier ensemble model (L1) over the indication-based random walk model (p-value = 3.12E-41).
Further, we investigate the details of the ensemble models based on the 3-CV results and 5-CV results. First, we analyze the weights in the weighted average ensemble models determined by GA. There are 100 sets of weights for 20 runs of 5-CV, and 60 sets of weights for 20 runs of 3-CV. We calculate the average weights for each predictor, and visualize the normalized weights in Fig. 2. Base predictors with high AUPR scores tend to be assigned large weights. For example, the matrix perturbation model produces the best 5-CV results, and thus gains the greatest weight in the ensemble models. We observe that several base predictors (such as the RWR-based random walk model) are not used in the ensemble models. The classifier ensemble method (L1) produces sparse models, which integrate a subset of base predictors. According to the 5-CV results, several base predictors (index: 1, 10, 15, 21, 22, 27, 28, 29) are not used in the classifier ensemble model. From the viewpoint of computer science, multisource data provide diverse information but also bring redundant information, and combining base predictors is a combinatorial optimization problem. Therefore, the weighted average ensemble method and the classifier ensemble method (L1) use a subset of base predictors to develop ensemble models.
Comparison with existing state-of-the-art methods
Since this work is designed to predict undetected or unobserved DDIs, we adopt methods of the same type for comparison. Vilar used known interactions of the most similar drugs to predict DDIs, and proposed the substructure similarity-based model [19] and the interaction profile fingerprint (also known as common neighbors, CN) similarity-based model [20]. Zhang [23] adopted the label propagation algorithm to build the substructure similarity-based model, side effect similarity-based model and off side effect similarity-based model. We name these models Vilar’s substructure-based model, Vilar’s CN index-based model, substructure-based label propagation model, side effect-based label propagation model and off side effect-based label propagation model. These prediction models are implemented according to the details in the publications. All models are evaluated by 20 runs of cross validation under the same conditions.
As shown in Table 5, our ensemble methods produce better results than the other state-of-the-art methods in terms of different metrics. The classifier ensemble method (L1) produces the best results in both 3-CV experiments and 5-CV experiments. Further, we adopt the t-test to compare the ensemble methods with the other state-of-the-art methods in terms of AUPR scores. Table 6 demonstrates that our ensemble methods produce significantly better results (p < 0.05 in terms of AUPR scores).
In one fold of 5-fold cross validation, we adopt 80% of the interactions (38,868) as the training set and the validation set, and use the other interactions (9716) as the testing set. We build the prediction model based on the training set and the validation set, and then make predictions for the non-interaction drug-drug pairs (111,010) to identify the testing interactions (9716). Based on the result, we count how many testing DDIs are identified in the top 10,000 predictions and the top 15,000 predictions, respectively. As shown in Fig. 3, the classifier ensemble model (L1) identifies 7027 testing interactions in the top 10,000 predictions, and 7842 testing interactions in the top 15,000 predictions. In general, our ensemble models identify 300 to 400 more interactions than the other methods do.
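Counting held-out interactions among the top-ranked predictions can be sketched as follows. The function name, the toy scores and the toy held-out pairs are assumptions for illustration only.

```python
def hits_in_top_k(scores, test_pairs, k):
    """Count held-out interactions among the k highest-scoring candidate pairs.

    scores maps each candidate drug pair to its predicted score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    held_out = set(test_pairs)
    return sum(1 for pair in ranked if pair in held_out)


# Toy candidate pairs with made-up scores; three of them are held-out DDIs.
scores = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.8,
    ("a", "d"): 0.7,
    ("b", "c"): 0.3,
    ("b", "d"): 0.1,
}
test_pairs = [("a", "b"), ("a", "d"), ("b", "d")]
top2_hits = hits_in_top_k(scores, test_pairs, 2)
top3_hits = hits_in_top_k(scores, test_pairs, 3)
```

In the setting described above, the candidates would be the 111,010 pairs without a training interaction and k would be 10,000 or 15,000.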
Predicted novel interactions
In this paper, we use the benchmark dataset with 548 drugs and 48,584 pairwise drug-drug interactions from the TWOSIDES database. There are 149,878 drug-drug pairs between these drugs. Besides the 48,584 known pairwise DDIs, the 101,294 remaining drug pairs (“non-interaction pairs”) may contain undetected or unobserved DDIs, which are not available in TWOSIDES. We train the prediction models based on the 548 drugs and 48,584 known DDIs, and predict unobserved DDIs. In the prediction, high scores of drug pairs indicate high probabilities of interaction, and the prediction results are transformed into a recommendation list of unobserved or novel interactions. To confirm novel interactions, we look them up in the latest online version of the DrugBank database. Table 7 lists the top 20 novel interactions predicted by our method, and a significant fraction of them (7 out of 20) are confirmed in the DrugBank database.
Further, we compare the ensemble model and the matrix perturbation model by testing their capability of finding novel interactions. The top 1000 novel interactions predicted by the ensemble model and the matrix perturbation model are provided in the supplementary material (see Additional file 1). For each method, we look for evidence in DrugBank to confirm novel interactions. Looking up all 1000 interactions of the matrix perturbation model and the ensemble model, we can confirm 297 and 318 novel interactions, respectively (252 interactions are shared). Further, based on the top 1000 novel interactions, we use the number of predictions as the X-axis and the number of confirmed novel interactions among the predictions as the Y-axis, and visualize the performance of the two models (see Additional file 2). In general, the ensemble model finds more novel interactions than the matrix perturbation model, indicating the usefulness of integrating multisource data.
Conclusions
The prediction of drug-drug interactions is an important task in drug discovery, which helps to reduce potential risks and understand the mechanisms of drug-drug interactions. This paper collects a wide variety of drug data, and designs models based on multisource data for DDI prediction. Compared with existing DDI prediction methods, our methods produce better performance, and the statistical analysis demonstrates that the performance improvements achieved by our methods are statistically significant. In conclusion, the proposed methods are promising for DDI prediction.
Abbreviations
5-CV: 5-fold cross validation
AUC: Area under ROC curve
AUPR: Area under precision-recall curve
DDI: Drug-drug interaction
GA: Genetic algorithm
References
 1.
Nagai N. Drug interaction studies on new drug applications: current situations and regulatory views in Japan. Drug Metab Pharmacokin. 2010;25(1):3–15.
 2.
Percha B, Altman RB. Informatics confronts drug-drug interactions. Trends Pharmacol Sci. 2013;34(3):178–84.
 3.
Prueksaritanont T, Chu X, Gibson C, Cui D, Yee KL, Ballard J, Cabalu T, Hochman J. Drug-drug interaction studies: regulatory guidance and an industry perspective. AAPS J. 2013;15(3):629–45.
 4.
Kusuhara H. How far should we go? Perspective of drug-drug interaction studies in drug development. Drug Metab Pharmacokin. 2014;29(3):227–8.
 5.
Noren GN, Sundberg R, Bate A, Edwards IR. A statistical methodology for drug-drug interaction surveillance. Stat Med. 2008;27(16):3057–70.
 6.
Tatonetti NP, Denny J, Murphy S, Fernald G, Krishnan G, Castro V, Yue P, Tsau P, Kohane I, Roden D. Detecting drug interactions from adverse‐event reports: interaction between paroxetine and pravastatin increases blood glucose levels. Clin Pharmacol Ther. 2011;90(1):133–42.
 7.
Duke JD, Han X, Wang Z, Subhadarshini A, Karnik SD, Li X, Hall SD, Jin Y, Callaghan JT, Overhage MJ, et al. Literature based drug interaction prediction with clinical assessment using electronic medical records: novel myopathy associated drug interactions. PLoS Comput Biol. 2012;8(8):e1002614.
 8.
Tatonetti NP, Fernald GH, Altman RB. A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports. J Am Med Inform Assoc. 2012;19(1):79–85.
 9.
He L, Yang Z, Zhao Z, Lin H, Li Y. Extracting drug-drug interaction from the biomedical literature using a stacked generalization-based approach. 2013.
 10.
Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006;34(Database issue):D668–72.
 11.
Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008;36(Database issue):D901–6.
 12.
Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Bryant SH. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009;37(Web Server issue):W623–33.
 13.
Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010;38(Database issue):D355–60.
 14.
Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol. 2010;6:343.
 15.
Li Q, Cheng T, Wang Y, Bryant SH. PubChem as a public resource for drug discovery. Drug Discov Today. 2010;15(23–24):1052–7.
 16.
Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, et al. DrugBank 3.0: a comprehensive resource for ‘omics’ research on drugs. Nucleic Acids Res. 2011;39(Database issue):D1035–41.
 17.
Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, Maciejewski A, Arndt D, Wilson M, Neveu V, et al. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 2014;42(Database issue):D1091–7.
 18.
Gottlieb A, Stein GY, Oron Y, Ruppin E, Sharan R. INDI: a computational framework for inferring drug interactions and their associated recommendations. Mol Syst Biol. 2012;8:592.
 19.
Vilar S, Harpaz R, Uriarte E, Santana L, Rabadan R, Friedman C. Drug-drug interaction through molecular structure similarity analysis. J Am Med Inform Assoc. 2012;19(6):1066–74.
 20.
Vilar S, Uriarte E, Santana L, Tatonetti NP, Friedman C. Detection of drug-drug interactions by modeling interaction profile fingerprints. PLoS One. 2013;8(3):e58321.
 21.
Li P, Huang C, Fu Y, Wang J, Wu Z, Ru J, Zheng C, Guo Z, Chen X, Zhou W, et al. Large-scale exploration and analysis of drug combinations. Bioinformatics. 2015;31(12):2007–16.
 22.
Park K, Kim D, Ha S, Lee D. Predicting pharmacodynamic drug-drug interactions through signaling propagation interference on protein-protein interaction networks. PLoS One. 2015;10(10):e0140816.
 23.
Zhang P, Wang F, Hu J, Sorrentino R. Label propagation prediction of drug-drug interactions based on clinical side effects. Sci Rep. 2015;5:12339.
 24.
Cami A, Manzi S, Arnold A, Reis BY. Pharmacointeraction network models predict unknown drug-drug interactions. PLoS One. 2013;8(4):e61468.
 25.
Cheng F, Zhao Z. Machine learning-based prediction of drug-drug interactions by integrating drug phenotypic, therapeutic, chemical, and genomic properties. J Am Med Inform Assoc. 2014;21(e2):e278–86.
 26.
Takarabe M, Shigemizu D, Kotera M, Goto S, Kanehisa M. Network-based analysis and characterization of adverse drug-drug interactions. J Chem Inf Model. 2011;51(11):2977–85.
 27.
Huang J, Niu C, Green CD, Yang L, Mei H, Han JD. Systematic prediction of pharmacodynamic drug-drug interactions through protein-protein-interaction network. PLoS Comput Biol. 2013;9(3):e1002998.
 28.
Bobadilla J, Ortega F, Hernando A, Gutiérrez A. Recommender systems survey. Knowl-Based Syst. 2013;46:109–32.
 29.
Zhang W, Zou H, Luo L, Liu Q, Wu W, Xiao W. Predicting potential side effects of drugs by recommender methods and ensemble learning. Neurocomputing. 2016;173:979–87.
 30.
Lu L, Pan L, Zhou T, Zhang YC, Stanley HE. Toward link predictability of complex networks. Proc Natl Acad Sci U S A. 2015;112(8):2325–30.
 31.
Tatonetti NP, Ye PP, Daneshjou R, Altman RB. Data-driven prediction of drug effects and interactions. Sci Transl Med. 2012;4(125):125ra131.
 32.
Schafer JB, Konstan J, Riedl J. Recommender systems in e-commerce. In: Proceedings of the 1st ACM conference on Electronic commerce. New York: ACM; 1999. p. 158–66.
 33.
Lü L, Zhou T. Link prediction in complex networks: a survey. Physica A. 2011;390(6):1150–70.
 34.
Koren Y, Bell R. Advances in collaborative filtering. In: Recommender Systems Handbook. New York: Springer; 2015. p. 77–118.
 35.
Liu W, Lü L. Link prediction based on local random walk. EPL (Europhysics Letters). 2010;89(5):58007.
 36.
Backstrom L, Leskovec J. Supervised random walks: predicting and recommending links in social networks. In: Proceedings of the fourth ACM international conference on Web search and data mining. New York: ACM; 2011. p. 635–44.
 37.
Chen X, Liu MX, Yan GY. Drug-target interaction prediction by random walk on the heterogeneous network. Mol BioSyst. 2012;8(7):1970–8.
 38.
Seal A, Ahn YY, Wild DJ. Optimizing drug-target interaction prediction based on random walk on heterogeneous networks. J Cheminf. 2015;7:40.
 39.
Polikar R. Ensemble based systems in decision making. IEEE Circuits Syst Mag. 2006;6(3):21–45.
 40.
Zhang W, Niu Y, Xiong Y, Zhao M, Yu R, Liu J. Computational prediction of conformational B-cell epitopes from antigen primary structures by ensemble learning. PLoS One. 2012;7(8):e43575.
 41.
Zhang W, Liu F, Luo L, Zhang J. Predicting drug side effects by multi-label learning and ensemble learning. BMC Bioinf. 2015;16:365.
Acknowledgement
We would like to thank Longqiang Luo for his support during this project.
Funding
This work is supported by the National Natural Science Foundation of China (61103126, 61402340, 61572368) and the Natural Science Foundation of Hubei Province of China (2014CFB194, ZRY2014000901). The funding bodies had no role in the design of the study, in the collection, analysis, and interpretation of data, or in writing the manuscript.
Availability of data and materials
The datasets and source codes are available at https://github.com/zw9977129/drugdruginteraction/.
Authors’ contributions
WZ conceived the project; WZ, YC and FL (Feng) designed the experiments; FL (Fei) performed the experiments; WZ, GT and XL wrote the paper. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Additional files
Additional file 1:
Top 1000 novel interactions predicted by the ensemble model and the matrix perturbation model. (XLSX 75 kb)
Additional file 2:
Visualization of the number of predictions vs. number of confirmed interactions. (TIF 519 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zhang, W., Chen, Y., Liu, F. et al. Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data. BMC Bioinformatics 18, 18 (2017). https://doi.org/10.1186/s12859-016-1415-9
Keywords
 Drug-drug interaction
 Ensemble learning
 Missing link prediction
 Random walk