Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data

Zhang, Wen; Chen, Yanlin; Liu, Feng; Luo, Fei; Tian, Gang; Li, Xiaohong

doi:10.1186/s12859-016-1415-9

Research article
Open access
Published: 05 January 2017

BMC Bioinformatics volume 18, Article number: 18 (2017)

Abstract

Background

Drug-drug interactions (DDIs) are one of the major concerns in drug discovery. Accurate prediction of potential DDIs can help to reduce unexpected interactions over the entire lifecycle of drugs, and is important for drug safety surveillance.

Results

Since many DDIs are not detected or observed in clinical trials, this work aims to predict unobserved or undetected DDIs. We collect a variety of drug data that may influence drug-drug interactions, i.e., drug substructure data, drug target data, drug enzyme data, drug transporter data, drug pathway data, drug indication data, drug side effect data, drug off side effect data and known drug-drug interactions. We adopt three representative methods, the neighbor recommender method, the random walk method and the matrix perturbation method, to build prediction models based on different data, and thereby evaluate the usefulness of different information sources for DDI prediction. Further, we present flexible frameworks for integrating different models with suitable ensemble rules, including the weighted average ensemble rule and the classifier ensemble rule, and develop ensemble models to achieve better performance.

Conclusions

The experiments demonstrate that different data sources provide diverse information, and that the DDI network based on known DDIs is one of the most important information sources for DDI prediction. The ensemble methods perform better than the individual methods and outperform existing state-of-the-art methods. The datasets and source code are available at https://github.com/zw9977129/drug-drug-interaction/.

Background

Drugs may interact when multiple drugs are co-prescribed. Drug-drug interactions (DDIs) may exert different effects, and adverse drug-drug interactions can lead to patient death or drug withdrawal [1–4]. DDI prediction can help to reduce unexpected effects as well as optimize the treatments in the drug design, clinical trials, and post-marketing surveillance.

In silico methods, in vitro methods, in vivo experiments and clinical trials can identify DDIs, but the experimental approaches are labor-intensive and time-consuming. Statistical methods [5–9] were developed to detect whether the combination of two drugs is associated with an increased risk of certain adverse events, by analyzing spontaneous reports, insurance claim databases and electronic medical records.

In recent years, researchers have collected drug data from the literature, reports and other sources, and constructed public databases [10–17], which facilitate the development of computational prediction methods. A great number of machine learning methods have been proposed to predict DDIs. Existing methods can be roughly classified into two types: similarity-based methods and classification-based methods. Similarity-based methods rely on the assumption that similar drugs may interact with the same drug. Gottlieb et al. [18] built prediction models by considering seven kinds of drug-drug similarities. Vilar et al. proposed the substructure similarity-based prediction method [19] and the interaction profile fingerprint similarity-based prediction method [20]. Li et al. [21] developed a Bayesian network that combines drug molecular similarity and phenotypic similarity to predict the combination efficacy of drugs. Using drug-drug similarity indirectly, Park et al. [22] applied a random walk with restart to simulate signaling propagation from drug targets and make predictions; Zhang et al. [23] adopted the label propagation method to build prediction models based on drug chemical substructures, drug side effects and drug off side effects. Classification-based methods formulate DDI prediction as a binary classification task. Cami et al. [24] represented drug-drug pairs as feature vectors, used the presence or absence of interactions as labels, and then built logistic regression models. Cheng et al. [25] applied five classifiers (naive Bayes, decision tree, k-nearest neighbor, logistic regression, and support vector machine) to build prediction models. Besides similarity-based methods and classification-based methods, there are several methods designed for specific purposes. Takarabe et al. [26] constructed a multi-level drug-drug interaction network, and analyzed, characterized and classified adverse drug-drug interactions. Huang et al. [27] developed a target-center system for each drug, which consists of drug targets and their neighbors in the PPI network and human tissue gene expression.

Since many DDIs are not detected or observed in clinical trials, this work aims to predict undetected or unobserved drug-drug interactions. Classification methods use two classes of data, annotated drug-drug interaction pairs and annotated non-interaction pairs, to build classification models. In binary classification, known interactions are used as positive instances, but the other drug pairs cannot simply be treated as negative instances, because they may have undetected or unobserved interactions, which are exactly what needs to be predicted. In machine learning, such problems are formulated as semi-supervised learning tasks. For this reason, we build DDI prediction models within the framework of semi-supervised learning.

In this paper, we collect drug substructure data, drug target data, drug enzyme data, drug transporter data, drug pathway data, drug indication data, drug side effect data, drug off side effect data and known drug-drug interactions. Multi-source data provide biological information, chemical information, phenotypic information and known interactions to characterize drug-drug interactions. To make use of diverse information, we adopt three representative methods, i.e., the neighbor recommender method [28, 29], the random walk method and the matrix perturbation method [30], to build different prediction models. According to performances of prediction models, we evaluate the usefulness of different information sources for the DDI prediction. The study reveals that DDI network based on known DDIs can provide the important information for DDI prediction. Further, we present flexible frames of integrating different models with suitable ensemble rules, including the weighted average ensemble rule and the classifier ensemble rule, and develop ensemble models to achieve better performances. The experiments demonstrate that ensemble methods can combine diverse information to produce the high-accuracy performances, and outperform existing state-of-the-art methods.

Methods

Datasets

The FDA Adverse Event Reporting System (FAERS) is a database that contains adverse event reports and medication error reports submitted to the FDA. Tatonetti et al. processed the adverse event reports in this system, and constructed a database named “TWOSIDES” [31], which contains side effects caused by the combination of drugs. There are 645 drugs and 63,473 distinct pairwise DDIs from unsafe co-prescriptions in TWOSIDES.

The biological information, chemical information and phenotypic information about drugs may be associated with drug-drug interactions. PubChem Compound database [12, 15] can provide drug structures. DrugBank database [10, 11, 16, 17] is a bioinformatics resource with drug targets, drug enzymes and drug transporters. KEGG database [13] is an information resource for protein pathways. Drug targets are mapped to KEGG to obtain drug pathways. SIDER database [14] contains 1430 drugs and 5880 side effect terms which are compiled from public documents and package inserts. Drug side effects and indications are available in SIDER. OFFSIDES database [31] contains 1332 drugs and 10,093 “off-label” side effects.

We map drugs in TWOSIDES to SIDER, OFFSIDES, PubChem and DrugBank. As shown in Table 1, we obtain 548 drugs and 48,584 pairwise DDIs, and substructure data, target data, enzyme data, transporter data, pathway data, indication data, side effect data, off side effect data of these drugs are available. Based on the data, we conduct the comprehensive study to evaluate the usefulness of different data sources for DDI prediction, and discuss how to combine them for the high-accuracy prediction.

Table 1 The descriptions about multi-source drug data

DDI prediction based on multi-source data

Multi-source data provide different information for the DDI prediction. Here, we describe how to build models based on different data.

Drug-drug similarities bring important clues for the DDI prediction, and different similarities can be extracted from multi-source data. Drug data are classified as four types, i.e., chemical data, biological data, phenotypic data and the drug-drug interaction network data (formed by known drug-drug interactions). On one hand, we calculate the drug-drug similarities in the biological space, chemical space and phenotypic space, by using drug substructures, drug targets, drug enzymes, drug transporters, drug pathways, drug indications, drug side effects and drug off side effects. On the other hand, we calculate the drug-drug similarities in the drug-drug interaction network. In order to utilize drug-drug similarities, we consider two representative methods [28, 32]: the neighbor recommender method and random walk method, and build DDI prediction models.

We take drugs as nodes and known interactions as edges in the DDI network, and transform the DDI prediction problem into a missing link prediction task. Missing link prediction is an important topic of theoretical interest and practical significance in complex networks [33]. Recently, a novel method named the “matrix perturbation method” [30] was proposed, which utilizes the network structure to predict missing links (here, unobserved DDIs). The study demonstrated that this method outperforms other missing link prediction methods. Therefore, we adopt the matrix perturbation method to predict potential DDIs based on the DDI network.

In the following, the section “Similarity-based DDI prediction based on multi-source data” presents how to extract different drug-drug similarities from different data and how to develop similarity-based models; the section “Matrix perturbation method for DDI prediction” presents the missing link prediction method (the matrix perturbation method).

Similarity-based DDI prediction based on multi-source data

Drug-drug similarity based on biological data, chemical data and phenotypic data

A drug can be represented as a binary feature vector, by using drug substructures, drug targets, drug enzymes, drug transporters, drug pathways, drug indications, drug side effects, or drug off side effects. Each dimension of the feature vector corresponds to the presence or absence of one component, with value 1 or 0. For example, there are 881 types of drug substructures, so a drug can be represented as an 881-dimensional vector.

Given a drug $x$ and a drug $y$ with feature vectors $V_x$ and $V_y$, the similarity between $x$ and $y$ is calculated by the Jaccard formula:

$$ S\left({V}_x,{V}_y\right)=\frac{M_{11}}{M_{01}+{M}_{10}+{M}_{11}} $$

where $M_{11}$ is the number of dimensions where $V_x$ and $V_y$ both have a value of 1; $M_{01}$ is the number of dimensions where $V_x$ has a value of 0 and $V_y$ has a value of 1; $M_{10}$ is the number of dimensions where $V_x$ has a value of 1 and $V_y$ has a value of 0.

Therefore, we can obtain 8 drug feature-based drug-drug similarities, including substructure-based similarity, target-based similarity, enzyme-based similarity, transporter-based similarity, pathway-based similarity, indication-based similarity, side effect-based similarity and off side effect-based similarity.
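The Jaccard computation above can be sketched for a whole drug-by-feature matrix at once. This is a minimal illustration, not the authors' code: the function name `jaccard_similarity_matrix`, the toy matrix and the convention that two all-zero vectors get similarity 0 are assumptions made here.

```python
import numpy as np

def jaccard_similarity_matrix(features):
    """Pairwise Jaccard similarity between the rows of a binary (0/1) matrix."""
    F = np.asarray(features, dtype=float)
    m11 = F @ F.T                                   # M11: dimensions where both drugs have 1
    ones_per_drug = F.sum(axis=1)
    # M01 + M10 + M11 is the size of the union of the two feature sets
    union = ones_per_drug[:, None] + ones_per_drug[None, :] - m11
    # Assumption: if neither drug has any feature, the similarity is set to 0
    return np.divide(m11, union, out=np.zeros_like(m11), where=union > 0)

# Toy example: 3 hypothetical drugs described by 4 binary features
toy_features = np.array([[1, 1, 0, 0],
                         [1, 0, 1, 0],
                         [0, 0, 0, 1]])
S_toy = jaccard_similarity_matrix(toy_features)
```

The same function would apply unchanged to any of the eight feature types, since each is a binary drug-by-component matrix.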

Drug-drug similarity based on known drug-drug interactions

By considering drugs as nodes and interactions as edges, the known DDIs form a DDI network, in which we calculate drug-drug similarities [33]. The adjacency matrix of the DDI network is denoted as $A=(a_{ij})$, and $\Gamma(x)$ denotes the set of nodes linked to node $x$. Several similarities between a drug $x$ and a drug $y$ can be defined.

Common neighbor similarity $S_{CN}(x,y)$ is the number of common neighbors of two nodes,

$$ {S}_{CN}\left(x,y\right)=\left|\Gamma (x)\cap \Gamma (y)\right| $$

Adamic-Adar similarity $S_{AA}(x,y)$ counts the common neighbors while assigning more weight to the less connected neighbors,

$$ {S}_{AA}\left(x,y\right)=\sum_{z\in \Gamma (x)\cap \Gamma (y)}\frac{1}{\log \left|\Gamma (z)\right|} $$

Resource Allocation similarity $S_{RA}(x,y)$ is based on the resource allocation dynamics of complex networks,

$$ {S}_{RA}\left(x,y\right)=\sum_{z\in \Gamma (x)\cap \Gamma (y)}\frac{1}{\left|\Gamma (z)\right|} $$

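Katz similarity $S_{Katz}(x,y)$ sums over the collection of paths between two nodes, with exponential damping according to path length,

$$ {S}_{Katz}\left(x,y\right)=\alpha {\left(A\right)}_{xy}+{\alpha}^2{\left({A}^2\right)}_{xy}+{\alpha}^3{\left({A}^3\right)}_{xy}+\cdots ={\left[{\left(I-\alpha A\right)}^{-1}-I\right]}_{xy} $$

where $\alpha$ is a damping parameter, $I$ is the identity matrix, and ${(A^n)}_{xy}$ is the $(x,y)$ entry of the $n$-th power of $A$. The series converges to the compact form when $|\alpha| < 1/\lambda_{max}$, where $\lambda_{max}$ is the largest eigenvalue of $A$.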

Average Commute Time similarity $S_{ACT}(x,y)$ is based on the average number of steps required by a random walker starting from one node to reach another; the fewer steps are needed, the more similar the two nodes are,

$$ {S}_{ACT}\left(x,y\right)=\frac{1}{l_{xx}^{+}+{l}_{yy}^{+}-2{l}_{xy}^{+}} $$

where $l_{xy}^{+}$ is the $(x,y)$ entry of $L^{+}$, the pseudoinverse of the Laplacian matrix of the network.

The random walk with restart similarity $S_{RWR}(x,y)$ is based on the probability that a random walker starting from an initial node $x$ reaches $y$. At each step, the walker moves to adjacent nodes with probability $\mu$ and returns to the initial node with probability $1-\mu$,

$$ {S}_{RWR}\left(x,y\right)={q}_{xy}+{q}_{yx} $$

where $q_{xy}$ is the $(x,y)$ entry of $q=(1-\mu){(I-\mu P^T)}^{-1}A$, $P=D^{-1}A$ is the normalized transition matrix of the adjacency matrix $A$, and $D$ is the degree matrix of $A$.

Therefore, we obtain 6 DDI network-based drug-drug similarities, including common neighbor similarity, Adamic-Adar similarity, resource allocation similarity, Katz similarity, average commute time similarity and random walk with restart similarity.
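Five of these indices (CN, AA, RA, Katz and ACT) can be written compactly in matrix form. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the function names, the toy network and the value of `alpha` are hypothetical, the network is assumed undirected and unweighted, and diagonal entries (a drug compared with itself) are not meaningful.

```python
import numpy as np

def local_similarities(A):
    """Return (CN, AA, RA) matrices for a symmetric 0/1 adjacency matrix."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    w_aa = np.zeros_like(deg)
    w_aa[deg > 1] = 1.0 / np.log(deg[deg > 1])   # 1 / log|Gamma(z)|
    w_ra = np.zeros_like(deg)
    w_ra[deg > 0] = 1.0 / deg[deg > 0]           # 1 / |Gamma(z)|
    cn = A @ A                                   # number of common neighbors
    aa = (A * w_aa) @ A                          # sum over common neighbors z
    ra = (A * w_ra) @ A
    return cn, aa, ra

def katz_similarity(A, alpha):
    """Closed form (I - alpha*A)^-1 - I, valid only if alpha < 1/lambda_max."""
    A = np.asarray(A, dtype=float)
    lam_max = np.max(np.abs(np.linalg.eigvalsh(A)))
    if alpha * lam_max >= 1.0:
        raise ValueError("alpha must be smaller than 1 / lambda_max")
    eye = np.eye(A.shape[0])
    return np.linalg.inv(eye - alpha * A) - eye

def act_similarity(A):
    """1 / (l+_xx + l+_yy - 2 l+_xy) using the pseudoinverse of the Laplacian."""
    A = np.asarray(A, dtype=float)
    laplacian = np.diag(A.sum(axis=1)) - A
    l_plus = np.linalg.pinv(laplacian)
    d = np.diag(l_plus)
    denom = d[:, None] + d[None, :] - 2.0 * l_plus
    sim = np.zeros_like(denom)
    np.divide(1.0, denom, out=sim, where=denom > 1e-12)  # diagonal left at 0
    return sim

# Toy network: edges 0-1, 0-2, 1-2, 2-3
A_toy = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
cn, aa, ra = local_similarities(A_toy)
katz = katz_similarity(A_toy, alpha=0.1)
act = act_similarity(A_toy)
```

For a network of 548 drugs the explicit inverse and pseudoinverse are inexpensive; for much larger networks an iterative or sparse solver would be the more natural choice.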

Similarity-based methods for DDI prediction

Given an $N\times N$ similarity matrix $S=(s_{ij})$ for $N$ drugs, known pairwise DDIs are denoted by an adjacency matrix $A=(a_{ij})$. The neighbor recommender method and the random walk method are briefly introduced as follows.

The neighbor recommender method [28, 34] is one of the most popular methods in recommender systems, which recommend items (movies, music, books, etc.) to users, or predict the ‘rating’ or ‘preference’ that users would give to items. The method takes the similarity-weighted average of the information of neighbors for prediction. For $drug_i$ and $drug_j$ that do not have a known interaction, we calculate

$$ {Y}_{ij}=\frac{\sum_{k=1,k\ne j}^N{s}_{ik}{a}_{kj}}{\sum_{k=1,k\ne j}^N{s}_{ik}} $$

where $s_{ik}$ is the similarity between $drug_i$ and $drug_k$, and $a_{kj}=1$ or $0$ means interaction or non-interaction between $drug_k$ and $drug_j$. $Y_{ji}$ is calculated in the same way. The probability that $drug_i$ interacts with $drug_j$ is scored as $score_{ij}=score_{ji}=Y_{ij}+Y_{ji}$.
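As an illustration, the neighbor recommender score can be computed for all drug pairs with two matrix operations. This is a sketch under assumptions: the function name and toy matrices are hypothetical, the diagonal of `A` is assumed to be zero, the self-similarity term `k = i` is kept because the formula as written excludes only `k = j`, and pairs with a zero denominator are given a score contribution of 0.

```python
import numpy as np

def neighbor_recommender_scores(S, A):
    """Symmetric scores Y_ij + Y_ji from a similarity matrix S and adjacency A."""
    S = np.asarray(S, dtype=float)
    A = np.asarray(A, dtype=float)
    # Numerator: sum_k s_ik * a_kj (the k = j term vanishes because a_jj = 0)
    numer = S @ A
    # Denominator: sum over k != j of s_ik, i.e. the row sum minus s_ij
    denom = S.sum(axis=1, keepdims=True) - S
    Y = np.zeros_like(numer)
    np.divide(numer, denom, out=Y, where=denom > 0)
    return Y + Y.T

# Toy example with 3 hypothetical drugs; only drugs 0 and 1 are known to interact
S_nr = np.array([[1.0, 0.5, 0.2],
                 [0.5, 1.0, 0.4],
                 [0.2, 0.4, 1.0]])
A_nr = np.array([[0, 1, 0],
                 [1, 0, 0],
                 [0, 0, 0]])
scores_nr = neighbor_recommender_scores(S_nr, A_nr)
```

In this toy case the score for the pair (0, 2) comes entirely from drug 1, which is similar to drug 2 and interacts with drug 0.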

A random walk is a mathematical formalization of a path that consists of a succession of random steps, and it has many successful applications in network analysis [35–38]. In the random walk method, a random walker starts from an initial node, moves to neighbors with probability $\mu$ and moves back to the initial node with probability $1-\mu$. The similarity matrix $S$ is normalized as $W=D^{-1}S$, where $D$ is the degree matrix of $S$. The matrix form of the update is $Y=\mu WY+(1-\mu)A$, which converges to the solution $Y=(1-\mu){(I-\mu W)}^{-1}A$. The probability that $drug_i$ interacts with $drug_j$ is scored as $score_{ij}=score_{ji}=Y_{ij}+Y_{ji}$.
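The closed-form solution can be evaluated directly with a linear solve instead of iterating the update. The following is a minimal sketch: the function name, the toy matrices and the value `mu=0.5` are assumptions for illustration (the text does not give the value of $\mu$ at this point), and every row of `S` is assumed to have a positive sum.

```python
import numpy as np

def random_walk_scores(S, A, mu=0.5):
    """Return (scores, Y) with Y = (1 - mu) (I - mu W)^-1 A and W = D^-1 S."""
    S = np.asarray(S, dtype=float)
    A = np.asarray(A, dtype=float)
    W = S / S.sum(axis=1, keepdims=True)      # row-normalize the similarity matrix
    n = S.shape[0]
    # Solve (I - mu W) Y = (1 - mu) A rather than forming the inverse explicitly
    Y = np.linalg.solve(np.eye(n) - mu * W, (1.0 - mu) * A)
    return Y + Y.T, Y

S_rw = np.array([[1.0, 0.5, 0.2],
                 [0.5, 1.0, 0.4],
                 [0.2, 0.4, 1.0]])
A_rw = np.array([[0, 1, 0],
                 [1, 0, 0],
                 [0, 0, 0]], dtype=float)
scores_rw, Y_rw = random_walk_scores(S_rw, A_rw, mu=0.5)
```

Because the rows of `W` sum to 1 and `mu < 1`, the matrix `I - mu W` is invertible, which is why the iteration converges to this solution.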

Matrix perturbation method for DDI prediction

The matrix perturbation method assumes that random removal of a small proportion of links from a network will not change the network structure [30], which is reflected by the eigenvectors of its adjacency matrix.

We first introduce the notation for the matrix perturbation method. Given the drug-drug interaction network $G(V,E)$, $V$ is the set of nodes, and $E$ is the set of edges. The adjacency matrix is $A=(a_{ij})$, and the eigenvectors and eigenvalues of the adjacency matrix are denoted by $x_k$ and $\lambda_k$, $k=1,2,\cdots,N$.

A fraction of links $\Delta E$ is randomly removed from $E$, and the set of remaining links is $E^R=E-\Delta E$. Thus, we obtain the new network $G^R(V,E^R)$ with the adjacency matrix $A^R=A-\Delta A$, where $\Delta A$ is the adjacency matrix of the removed links. Then, we calculate the eigenvectors $x_k^R$ and eigenvalues $\lambda_k^R$ of $A^R$, $k=1,2,\cdots,N$. We write $A=A^R+\Delta A$, $x_k=x_k^R+\Delta x_k$ and $\lambda_k=\lambda_k^R+\Delta\lambda_k$.

In the network $G(V,E)$, the relation between the eigenvectors, the eigenvalues and the adjacency matrix is written as,

$$ \left({A}^R+\Delta A\right)\left({x}_k^R+\Delta {x}_k\right)=\left({\lambda}_k^R+\Delta {\lambda}_k\right)\left({x}_k^R+\Delta {x}_k\right) $$

By left-multiplying both sides of the above equation by ${(x_k^R)}^T$ and neglecting second-order terms, we obtain $ \Delta {\lambda}_k\approx \frac{{\left({x}_k^R\right)}^T\Delta A{x}_k^R}{{\left({x}_k^R\right)}^T{x}_k^R} $.
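The step from the eigen-equation to this approximation can be spelled out as follows. This worked derivation is a sketch that assumes $A^R$ is symmetric (the DDI network is undirected), so that ${(x_k^R)}^T A^R = \lambda_k^R {(x_k^R)}^T$, and that the perturbation is small enough for the products $\Delta A\,\Delta x_k$ and $\Delta\lambda_k\,\Delta x_k$ to be dropped.

$$ \begin{aligned} {\left({x}_k^R\right)}^T\left({A}^R+\Delta A\right)\left({x}_k^R+\Delta {x}_k\right) &= {\lambda}_k^R{\left({x}_k^R\right)}^T{x}_k^R+{\lambda}_k^R{\left({x}_k^R\right)}^T\Delta {x}_k+{\left({x}_k^R\right)}^T\Delta A{x}_k^R+{\left({x}_k^R\right)}^T\Delta A\Delta {x}_k \\ {\left({x}_k^R\right)}^T\left({\lambda}_k^R+\Delta {\lambda}_k\right)\left({x}_k^R+\Delta {x}_k\right) &= {\lambda}_k^R{\left({x}_k^R\right)}^T{x}_k^R+{\lambda}_k^R{\left({x}_k^R\right)}^T\Delta {x}_k+\Delta {\lambda}_k{\left({x}_k^R\right)}^T{x}_k^R+\Delta {\lambda}_k{\left({x}_k^R\right)}^T\Delta {x}_k \end{aligned} $$

Equating the two right-hand sides, cancelling the first two terms on each side and discarding the last (second-order) term on each side leaves ${\left({x}_k^R\right)}^T\Delta A{x}_k^R\approx \Delta {\lambda}_k{\left({x}_k^R\right)}^T{x}_k^R$, which gives the expression for $\Delta\lambda_k$.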

We estimate the eigenvalues as $\lambda_k=\lambda_k^R+\Delta\lambda_k$, and keep the eigenvectors $x_k^R$ unchanged. Then, we reconstruct the adjacency matrix of $G(V,E)$ from the estimated eigenvalues and the eigenvectors,

$$ \tilde{A}=\sum_{k=1}^N\left({\lambda}_k^R+\Delta {\lambda}_k\right){x}_k^R{\left({x}_k^R\right)}^T $$

The probability that $drug_i$ interacts with $drug_j$ is scored as $score_{ij}=score_{ji}=\tilde{A}_{ij}+\tilde{A}_{ji}$. More details are available in the publication [30].
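The procedure described in this section can be sketched for a single random removal. This is an illustration only: the function name, the removed fraction of 0.2, the seed and the toy network are assumptions, and whether the original method [30] averages over several independent removals is not stated here, so the sketch performs just one.

```python
import numpy as np

def perturbation_scores(A, removed_fraction=0.1, seed=0):
    """One-shot matrix perturbation scores for a symmetric 0/1 adjacency matrix."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    iu, ju = np.triu_indices_from(A, k=1)
    edge_ids = np.flatnonzero(A[iu, ju] > 0)
    n_remove = max(1, int(round(removed_fraction * edge_ids.size)))
    picked = rng.choice(edge_ids, size=n_remove, replace=False)
    # Delta A: adjacency matrix of the randomly removed links
    dA = np.zeros_like(A)
    dA[iu[picked], ju[picked]] = 1.0
    dA[ju[picked], iu[picked]] = 1.0
    A_R = A - dA
    # eigh returns orthonormal eigenvectors (as columns) of the symmetric A_R
    lam_R, X = np.linalg.eigh(A_R)
    # Delta lambda_k = x_k^T dA x_k / x_k^T x_k, and x_k^T x_k = 1 here
    d_lam = np.einsum("ik,ij,jk->k", X, dA, X)
    # Reconstruct with corrected eigenvalues and unchanged eigenvectors
    A_tilde = (X * (lam_R + d_lam)) @ X.T
    return A_tilde + A_tilde.T

# Toy network: a 6-node ring with one chord
A_mp = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]:
    A_mp[a, b] = A_mp[b, a] = 1.0
scores_mp = perturbation_scores(A_mp, removed_fraction=0.2, seed=0)
```

Entries of `scores_mp` for pairs without a known link would then be ranked to obtain candidate interactions.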

Combining multi-source data for DDI prediction

Since we build different prediction models based on different data, it is natural to combine them for better performance. Ensemble learning is a useful technique that aggregates multiple machine learning models to achieve overall high prediction accuracy as well as good generalization [39]. Ensemble learning has been applied to a great number of applications in bioinformatics [29, 40, 41].

An ensemble learning system usually has two components: base predictors and ensemble rules. In our ensemble system, we adopt heterogeneous models $\{f_i\}_{i=1}^n$ based on multi-source data as base predictors. To integrate the base predictors, we consider two popular ensemble rules: the weighted average ensemble rule and the classifier ensemble rule. Figure 1 shows the flowchart of the ensemble systems.

The weighted average ensemble rule takes the weighted average of the outputs of the base predictors. For a new input $x_{new}$, the base predictors give the predictions $\{f_i(x_{new})\}_{i=1}^n$, and their weighted average $\sum_{i=1}^n w_i f_i(x_{new})$ is adopted as the prediction of the ensemble model, where $\sum_{i=1}^n w_i=1$ and $w_i\ge 0$. We adopt the genetic algorithm (GA) to determine the weights in the ensemble model. In the GA optimization, candidate weights are represented as chromosomes, and the fitness of a chromosome is the area under the precision-recall curve (AUPR) score of the ensemble model on the validation data. The objective of the GA optimization is to maximize the AUPR score.

The classifier ensemble rule seeks a classification function $G:(f_1(x),f_2(x),\cdots,f_n(x))\to\{0,1\}$, which maps the outputs of $n$ base predictors to a label. For a new input $x_{new}$, the outputs of the base predictors are $\{f_i(x_{new})\}_{i=1}^n$, and the prediction of the classifier ensemble model is $G(f_1(x_{new}),f_2(x_{new}),\cdots,f_n(x_{new}))$. Here, we adopt logistic regression as the classification function.
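The weight-selection idea can be illustrated with a much simpler optimizer than a genetic algorithm. The sketch below deliberately swaps the GA for a random search over the weight simplex, and estimates AUPR by step-wise average precision; the function names, the number of trials and the toy data are all assumptions made for this example.

```python
import numpy as np

def average_precision(y_true, scores):
    """Step-wise average precision, one common estimate of AUPR."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.asarray(y_true, dtype=float)[order]
    precision_at_rank = np.cumsum(y) / np.arange(1, y.size + 1)
    # Average the precision values at the ranks of the positive instances
    return float((precision_at_rank * y).sum() / y.sum())

def search_weights(preds, y_val, n_trials=200, seed=0):
    """Random search (a stand-in for the GA) for weights maximizing AUPR.

    preds has shape (n_models, n_pairs); weights are non-negative and sum to 1.
    """
    rng = np.random.default_rng(seed)
    n_models = preds.shape[0]
    best_w = np.full(n_models, 1.0 / n_models)          # start from equal weights
    best_fit = average_precision(y_val, best_w @ preds)
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(n_models))            # random point on the simplex
        fit = average_precision(y_val, w @ preds)       # fitness on validation data
        if fit > best_fit:
            best_w, best_fit = w, fit
    return best_w, best_fit

# Toy validation data: 2 hypothetical base predictors scoring 6 drug pairs
y_val = np.array([1, 0, 1, 0, 1, 0])
preds = np.array([[0.9, 0.2, 0.8, 0.4, 0.3, 0.1],
                  [0.6, 0.7, 0.5, 0.1, 0.9, 0.2]])
w_best, fit_best = search_weights(preds, y_val)
```

A GA explores the same search space with selection, crossover and mutation; the fitness function (validation AUPR of the weighted average) plays the same role in both cases.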
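To make the classifier ensemble rule concrete, the sketch below fits a logistic regression on the outputs of base predictors using plain gradient descent with an L2 penalty. It is a self-contained stand-in, not the scikit-learn implementation mentioned later in the paper; the function names, the learning rate, the penalty strength, the 0.5 decision threshold and the toy data are assumptions.

```python
import numpy as np

def fit_logistic(F, y, l2=1e-3, lr=0.5, n_iter=2000):
    """Fit weights (last entry is the bias) by gradient descent on the log-loss."""
    F = np.asarray(F, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.hstack([F, np.ones((F.shape[0], 1))])   # append a bias column
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - y) / y.size
        grad[:-1] += l2 * w[:-1]                   # L2 penalty, bias not penalized
        w -= lr * grad
    return w

def predict_proba(F, w):
    """Probability of the positive label for each row of base-predictor outputs."""
    F = np.asarray(F, dtype=float)
    X = np.hstack([F, np.ones((F.shape[0], 1))])
    return 1.0 / (1.0 + np.exp(-X @ w))

# Toy data: outputs of 2 hypothetical base predictors for 6 drug pairs
F_train = np.array([[0.9, 0.8], [0.8, 0.7], [0.7, 0.9],
                    [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]])
y_train = np.array([1, 1, 1, 0, 0, 0])
w_lr = fit_logistic(F_train, y_train)
labels = (predict_proba(F_train, w_lr) > 0.5).astype(int)   # G maps outputs to {0, 1}
```

An L1 penalty, which the paper also considers, would instead drive some weights exactly to zero and so drop the corresponding base predictors from the ensemble.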

Results and discussion

Evaluation metrics

We adopt k-fold cross validation (k-CV) to evaluate prediction models. Known interactions are randomly split into k subsets of equal size. In each fold, one subset is used as the testing set; of the interactions in the other k−1 subsets, 80% are used as the training set and 20% as the validation set. Base predictors are constructed on the training set, and the parameters of the ensemble system are tuned on the validation set. Then, the ensemble model makes predictions for the testing set. This procedure is repeated until each subset has been used for testing. To avoid the bias of a single data split, we perform 20 independent runs of k-CV for each model, and report the average performance.
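The splitting scheme can be sketched as a generator over folds. The function name, the seed handling and the use of index arrays are assumptions for illustration; note that with 48,584 interactions the folds can only be of nearly equal size.

```python
import numpy as np

def cv_splits(n_interactions, k=5, val_fraction=0.2, seed=0):
    """Yield (train, validation, test) index arrays for each of the k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_interactions), k)
    for i in range(k):
        test = folds[i]
        rest = np.concatenate([folds[j] for j in range(k) if j != i])
        rest = rng.permutation(rest)
        n_val = int(round(val_fraction * rest.size))
        # 20% of the remaining interactions for validation, 80% for training
        yield rest[n_val:], rest[:n_val], test

splits = list(cv_splits(48584, k=5))
```

Repeating the call with 20 different seeds would correspond to the 20 independent runs described above.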

Here, we adopt several evaluation metrics to measure performances of prediction models, i.e., accuracy (ACC), precision, recall, F-measure (F), area under ROC curve (AUC) and the area under the precision-recall curve (AUPR). In our task, DDIs take a small proportion of all drug pairs, and thus AUPR, which takes into account both recall and precision, is used as the primary evaluation metric.
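For reference, the threshold-based metrics can be computed from the four confusion-matrix counts. This is a minimal sketch with hypothetical names and toy labels; it assumes binary labels and at least one predicted and one actual positive, and it uses the balanced F-measure (F1), which is an assumption since the text does not specify the variant.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F-measure for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"ACC": accuracy, "precision": precision,
            "recall": recall, "F": f_measure}

metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Unlike these four metrics, AUC and AUPR are computed from the ranking of scores and do not depend on a single decision threshold.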

Performances of different models based on multi-source data

We extract 14 different similarities from multi-source data, and respectively adopt the neighbor recommender method and the random walk method to build 28 similarity-based prediction models. By formulating the original problem as a missing link prediction task, we adopt the matrix perturbation method to build the prediction model based on known DDIs. Therefore, we construct 29 prediction models based on multi-source data. Since different models utilize different information for DDI prediction, performances of the models are indicators for the usefulness of information sources.

As shown in Table 2, these models produce different performances on the benchmark dataset in the cross validation. Among the eight feature-based similarities, substructure similarity, side effect similarity, off side effect similarity and indication similarity lead to better performances than the other similarities, indicating that drug substructures, drug side effects, drug off side effects and drug indications provide important information about drug-drug interactions. Among the network topology-based similarities, RA and RWR produce better results. The comparison shows that drug feature-based similarities as well as topological similarities provide useful information to characterize drug-drug interactions and lead to useful models. The matrix perturbation method utilizes the DDI network as a whole to make predictions. Among all prediction models, the matrix perturbation method produces the best results, indicating that known DDIs provide some of the most useful information for identifying potential DDIs.

Table 2 Performances of different models evaluated by 20 runs of 5-CV

We also conduct 20 runs of 3-CV to evaluate prediction models, and the results are shown in Table 3. The comparison between the 3-CV results and the 5-CV results demonstrates that prediction models perform differently under different experimental conditions, and that no single model produces the best results in all cases. For example, the matrix perturbation method assumes that the topology of a network will not change if only a small proportion of links are removed. In 3-CV, a larger proportion of links (one third rather than one fifth) is held out for testing, so the network is perturbed more strongly and the predictive power may be affected. Therefore, the matrix perturbation method is not the best predictor in the 3-CV experiments. For this reason, we integrate different models to make robust predictions.

Table 3 Performances of different models evaluated by 20 runs of 3-CV

Performances of ensemble models

Based on multi-source data, we construct 29 prediction models, including 28 similarity-based models and one matrix perturbation model. We use these models as base predictors, and respectively adopt the weighted average ensemble rule and the classifier ensemble rule to build ensemble models.

We apply the genetic algorithm (GA) to determine optimal weights in the weighted average ensemble models. GA is implemented by using python package “deap”. The initial population has 100 chromosomes. In the population update, the elitist strategy is used for the selection operator, and default parameters are adopted for the mutation probability and crossover probability. The population update terminates when the change of best fitness scores is less than the default value of 1E-6 or the max generation number of 50 is reached.

To build classifier ensemble models, we train a logistic regression classifier to combine the outputs of the base predictors. The logistic regression is implemented by using the Python package “scikit-learn”. Default parameters are used; L1 regularization and L2 regularization are respectively considered. In the following, classifier ensemble models refer to logistic regression ensemble models.

Table 4 shows 3-CV results and 5-CV results. In 5-CV experiments, the weighted average ensemble model, the classifier ensemble model (L1 regularization) and the classifier ensemble model (L2 regularization) produce AUPR scores of 0.795, 0.807 and 0.806; in 3-CV experiments, the three models yield AUPR scores of 0.832, 0.841 and 0.839. The comparison demonstrates that the classifier ensemble models produce better results than the weighted average ensemble model. A possible reason is that the weighted average ensemble method uses a linear function for ensemble learning, whereas the classifier ensemble method trains a nonlinear function. Moreover, the classifier ensemble method with L1 regularization produces better results than the classifier ensemble method with L2 regularization, because L1 regularization produces a sparse model and can enhance the generalization capability.

Table 4 Performances of ensemble model evaluated by 20 runs of 3-CV and 5-CV

Clearly, ensemble models produce better results than base predictors. In 5-CV experiments, the classifier ensemble method (L1) can improve the AUPR score of 0.782 (produced by the matrix perturbation model) to 0.806. Since we implement 20 runs of 5-CV for ensemble models and matrix perturbation models, we conduct t-test to test the difference of their performances in terms of AUPR score, and the statistical significance is observed (p-value =1.21E-39). In 3-CV experiments, the classifier ensemble method (L1) can enhance the AUPR score from 0.820 (produced by the indication-based random walk model) to 0.839, and we also observe the statistical significance of improvement between the classifier ensemble model (L1) and the indication-based random walk model (p-value =3.12E-41).

Further, we investigate the details of the ensemble models based on 3-CV results and 5-CV results. Firstly, we analyze the weights in the weighted average ensemble models determined by GA. There are 100 sets of weights for 20 runs of 5-CV, and 60 sets of weights for 20 runs of 3-CV. We calculate the average weights for each predictor, and visualize the normalized weights in Fig. 2. Base predictors with high AUPR scores tend to be assigned large weights. For example, the matrix perturbation model produces the best 5-CV results, and thus gains the greatest weight in the ensemble models. We observe that several base predictors (such as the RWR-based random walk model) are not used in the ensemble models. The classifier ensemble method (L1) produces sparse models, which integrate a subset of the base predictors. According to 5-CV results, several base predictors (index: 1, 10, 15, 21, 22, 27, 28, 29) are not used in the classifier ensemble model. From the viewpoint of computer science, multi-source data provide diverse information but also bring redundant information, and combining base predictors is a combinatorial optimization problem. Therefore, the weighted average ensemble method and the classifier ensemble method (L1) use a subset of base predictors to develop ensemble models.

Comparison with existing state-of-the-art methods

Since this work is designed to predict undetected or unobserved DDIs, we adopt methods of the same type for comparison. Vilar used known interactions of most similar drugs to predict DDIs, and proposed the substructure similarity-based model [19] and interaction profile fingerprint (also known as common neighbors, CN) similarity-based model [20]. Zhang [23] adopted the label propagation algorithm to build substructure similarity-based model, side effect similarity-based model and off side effect similarity-based model. We name these models as Vilar’s substructure-based model, Vilar’s CN index-based model, substructure-based label propagation model, side effect-based label propagation model and off side effect-based label propagation model. These prediction models are implemented according to details in publications. All models are evaluated by 20 runs of cross validation under the same conditions.

As shown in Table 5, our ensemble methods produce better results than other state-of-the-art methods in terms of different metrics. The classifier ensemble method (L1) produces the best results in both 3-CV experiments and 5-CV experiments. Further, we adopt t-test to compare the ensemble methods with other state-of-the-art methods in terms of AUPR scores. Table 6 demonstrates that our ensemble methods produce significantly better results (p < 0.05 in terms of AUPR scores).

Table 5 Performances of the ensemble method and benchmark methods evaluated by 20 runs of 3-CV and 5-CV

Table 6 The statistical significance of performance improvements achieved by our ensemble methods

In one fold of 5-fold cross validation, we adopt 80% of the interactions (38,868) as the training set and the validation set, and use the other interactions (9716) as the testing set. We build the prediction model based on the training set and the validation set, and then make predictions for the non-interaction drug-drug pairs (111,010) to identify the testing interactions (9716). Based on the result, we respectively count how many testing DDIs are identified in the top 10,000 predictions and the top 15,000 predictions. As shown in Fig. 3, the classifier ensemble model (L1) identifies 7027 testing interactions in the top 10,000 predictions, and 7842 testing interactions in the top 15,000 predictions. In general, our ensemble models identify 300 to 400 more interactions than other methods do.

Predicted novel interactions

In this paper, we use the benchmark dataset with 548 drugs and 48,584 pairwise drug-drug interactions from the TWOSIDES database. There are 149,878 drug-drug pairs between these drugs. Besides the 48,584 known pairwise DDIs, the 101,294 remaining drug pairs (“non-interaction pairs”) may contain undetected or unobserved DDIs, which are not available in TWOSIDES. We train the prediction models based on 548 drugs and 48,584 known DDIs, and predict unobserved DDIs. In the prediction, high scores of drug pairs indicate high probabilities of having interactions, and the prediction results are transformed into a recommendation list of unobserved or novel interactions. To confirm novel interactions, we look them up in the latest online version of the DrugBank database. Table 7 lists the top 20 novel interactions predicted by our method, and a significant fraction of them (7 out of 20) are confirmed in the DrugBank database.
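The pair counts quoted here and in the 5-fold example of the previous section follow from simple arithmetic, reproduced below as a check. The variable names are chosen for this illustration only.

```python
from math import comb

n_drugs = 548
known_ddis = 48584

total_pairs = comb(n_drugs, 2)                   # unordered drug pairs: 149,878
unlabeled_pairs = total_pairs - known_ddis       # "non-interaction pairs": 101,294

# One fold of 5-fold cross validation
test_fold = known_ddis // 5                      # 9,716 held-out interactions
train_and_val = known_ddis - test_fold           # 38,868 interactions kept
candidate_pairs = total_pairs - train_and_val    # 111,010 pairs to be scored
```

The 111,010 candidate pairs in one fold therefore consist of the 101,294 pairs without a known interaction plus the 9716 held-out testing interactions.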

Table 7 Top 20 novel interactions predicted by our method (confirmed interactions shown in bold)

Further, we compare the ensemble model and the matrix perturbation model by testing their capability of finding novel interactions. The top 1000 novel interactions predicted by the ensemble model and the matrix perturbation model are provided in the supplementary material (see Additional file 1). For each method, we search for evidence in DrugBank to confirm the novel interactions. Looking up all 1000 interactions of the matrix perturbation model and of the ensemble model, we can confirm 297 and 318 novel interactions respectively (252 confirmed interactions are shared). Further, based on the top 1000 novel interactions, we use the number of predictions as the X-axis and the number of confirmed novel interactions among the predictions as the Y-axis, and visualize the performances of the two models (see Additional file 2). In general, the ensemble model finds more novel interactions than the matrix perturbation model, indicating the usefulness of integrating multi-source data.

Conclusions

The prediction of drug-drug interactions is an important task in drug discovery; it helps to reduce potential risks and to understand the mechanisms of drug-drug interactions. This paper collects a wide variety of drug data and designs models based on multi-source data for DDI prediction. Compared with existing DDI prediction methods, our methods achieve better performance, and statistical analysis shows that the improvements are statistically significant. In conclusion, the proposed methods are promising for DDI prediction.

Abbreviations

5-CV: 5-fold cross validation
AUC: Area under ROC curve
AUPR: Area under precision-recall curve
DDI: Drug-drug interaction
GA: Genetic algorithm

Acknowledgement

We would like to thank Longqiang Luo for his support during this project.

Funding

This work is supported by the National Natural Science Foundation of China (61103126, 61402340, 61572368) and the Natural Science Foundation of Hubei Province of China (2014CFB194, ZRY2014000901). The funders had no role in the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.

Availability of data and materials

The datasets and source codes are available at https://github.com/zw9977129/drug-drug-interaction/.

Authors’ contributions

WZ conceived the project; WZ, YC and FL (Feng Liu) designed the experiments; FL (Fei Luo) performed the experiments; WZ, GT and XL wrote the paper. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Author information

Authors and Affiliations

State Key Lab of Software Engineering, Wuhan University, Wuhan, 430072, China
Wen Zhang, Fei Luo, Gang Tian & Xiaohong Li
School of Computer, Wuhan University, Wuhan, 430072, China
Wen Zhang, Fei Luo, Gang Tian & Xiaohong Li
School of Mathematics and Statistics, Wuhan University, Wuhan, 430072, China
Yanlin Chen
International School of Software, Wuhan University, Wuhan, 430072, China
Feng Liu

Corresponding author

Correspondence to Wen Zhang.

Additional files

Additional file 1:

Top 1000 novel interactions predicted by the ensemble model and the matrix perturbation model. (XLSX 75 kb)

Additional file 2:

Visualization of the number of predictions vs. number of confirmed interactions. (TIF 519 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Zhang, W., Chen, Y., Liu, F. et al. Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data. BMC Bioinformatics 18, 18 (2017). https://doi.org/10.1186/s12859-016-1415-9

Received: 20 July 2016
Accepted: 09 December 2016
Published: 05 January 2017
DOI: https://doi.org/10.1186/s12859-016-1415-9
