DDIGIP: predicting drug-drug interactions based on Gaussian interaction profile kernels

Background A drug-drug interaction (DDI) is defined as a drug effect modified by another drug, which is very common in treating complex diseases such as cancer. Many studies have evidenced that some DDIs could be an increase or a decrease of the drug effect. However, the adverse DDIs maybe result in severe morbidity and even morality of patients, which also cause some drugs to withdraw from the market. As the multi-drug treatment becomes more and more common, identifying the potential DDIs has become the key issue in drug development and disease treatment. However, traditional biological experimental methods, including in vitro and vivo, are very time-consuming and expensive to validate new DDIs. With the development of high-throughput sequencing technology, many pharmaceutical studies and various bioinformatics data provide unprecedented opportunities to study DDIs. Result In this study, we propose a method to predict new DDIs, namely DDIGIP, which is based on Gaussian Interaction Profile (GIP) kernel on the drug-drug interaction profiles and the Regularized Least Squares (RLS) classifier. In addition, we also use the k-nearest neighbors (KNN) to calculate the initial relational score in the presence of new drugs via the chemical, biological, phenotypic data of drugs. We compare the prediction performance of DDIGIP with other competing methods via the 5-fold cross validation, 10-cross validation and de novo drug validation. Conlusion In 5-fold cross validation and 10-cross validation, DDRGIP method achieves the area under the ROC curve (AUC) of 0.9600 and 0.9636 which are better than state-of-the-art method (L1 Classifier ensemble method) of 0.9570 and 0.9599. Furthermore, for new drugs, the AUC value of DDIGIP in de novo drug validation reaches 0.9262 which also outperforms the other state-of-the-art method (Weighted average ensemble method) of 0.9073. Case studies and these results demonstrate that DDRGIP is an effective method to predict DDIs while being beneficial to drug development and disease treatment.


Background
Drug-drug interactions (DDI) is defined as that a drug affects the efficacy of another drug when multi-drugs are adopted in the treatment of a disease [1]. DDIs can lead to change systemic exposure and result in variations in drug responses, which can improve the drugs efficiency and the life quality of illnesses [2]. However, DDIs also can cause serious adverse effects, drug withdrawal from the market and even the patient morality [3,4]. Meanwhile, with the medical technology development and personalized medical requirements, more and more patients were simultaneously treated by multi-drugs and between 2009 and 2012, 38.1% of U.S. adults aging 18-44 years used three or more prescription drugs during a 30 day time period [5][6][7]. Therefore, identifying the potential DDIs has become a major issue in drug development and practice process.
With the high-throughput sequencing technology development, many databases related to drugs have been constructed. DrugBank database can provide drug targets, drug enzymes, drug transporters and DDIs, which are widely used in studying drug-target associations and drug repositioning [8][9][10]. PubChem Compound database contains the chemical substructures and their biological test results [11]. In addition, SIDER and OFFSIDES databases include drug side effects and "off-label" side effects, respectively [12,13]. KEGG database contains drug pathways and chemical substructures [14]. TWO-SIDES database contains the DDIs based on the adverse event reports in the AERS (adverse effect reactions) [13,15].
The above mentioned databases related to drugs were extracted from the published literature and reports with experimental validation, and could provide the basis to the development of computational methods to predict new DDIs. Recently, many computational methods have been proposed to predict potential DDIs based on the assumption that similar drugs tend to interact with similar other drugs. These approaches usually used the biological network data, chemical substructure data and phenotypic data. Based on MACCS substructures of drugs, Vilar et al. developed a similarity-based model to predict new DDIs [16]. Liu et al. proposed a model to predict potential DDIs via random forest-based classification model, which also adopted a feature selection technique over the chemical substructures, protein-protein interactions between targets of drugs and target enrichment of KEGG pathways [17]. Cheng et al. proposed a method to infer novel DDIs via machine learning classifiers, whose major feature is integrating drug chemical, phenotypic and genomic properties [18]. IPFs (interaction profile fingerprints) method was proposed to predict hidden DDIs [19]. Logistic regression model was used to predict new DDIs by Takeda et al., which analyzed the effects of 2D structural similarities of drugs on DDI prediction with other pharmacokinetics (PK) and pharmacodynamics (PD) knowledge [20]. Via constructing the drug similarity based on their 2D and 3D molecular substructures, targets, side effects and known DDIs, Vilar et al. further proposed a method to predict new DDIs on a large scale data, where the key feature is capturing the characteristics of drugs by 3D substructures when 2D substructures are missing [21]. Herrero-Zazo et al. provided a computational method to predict DDIs by different types of DDIs and their mechanisms [22]. By integrating similarities from drug molecular and pharmacological phenotypes, Li et al. used a Bayesian network to provide large-scale exploration and analysis of drug combinations [23]. Through calculating the functional similarity from drug carriers, drug transporters, drug enzymes and drug targets, Ferdousi et al. developed an approach to discover new DDIs [24]. Based on the Probabilistic Soft Logic method, a computational framework was developed to discover new DDIs by integrating the multiple drug similarities and known DDIs [25]. The label propagation approach was also developed to discover new DDIs, which used drug chemical structures, side effects and off side effects [26]. In order to predict drug adverse drug reactions (ADRs), a systems pharmacology model called MEF (multiple evidence fusion) has been developed by integrating known DDIs and other similarities of drugs [27]. Based on the assumption that synergistic effects of drugs are usually similar, Network-based Laplacian regularized Least Square Synergistic (NLLSS) method was developed to predict novel DDIs [28]. Via calculating the similarities of chemical, biological, phenotypic and known DDIs of drugs, Zhang et al. proposed three ensemble methods to predict novel DDIs, which included a weight average ensemble method and two classifier ensemble methods (L1 and L2) [29].
In addition, many other approaches used quantitative structure-activity relationship (QSAR) model, clinical data and data mining to study DDIs. Matthews et al. developed 14 QSAR models to predict the cardiac adverse effects for generic pharmaceutical substances [30]. Zakharov et al. developed QSAR models to predict the likelihood of DDIs for any pair of drugs by radial basis functions with self-consistent regression (RBF-SCR) and random forest (RF) [31]. Cami et al. proposed a Predictive Pharmacointeraction Networks (PPINs) to predict novel DDIs by exploiting the known DDIs and other intrinsic and taxonomic properties of drugs and AEs [32]. Huang et al. developed a method to predict DDIs using proteinprotein interaction network and clinical side effects [33]. Based on information of drug metabolism, text-mining and reasoning methods were developed to infer new DDIs [34]. Iyer et al. used the textual portion Electronic health records (EHRs) to directly discover new DDIs [35]. Banda et al. also adopted a data mining method to predict new DDIs from the EHRs [36]. Based on the k-nearest neighbor algorithm, Chen et al. proposed a model to predict DDIs which integrated nine predictors by majority voting [37]. Furthermore, the drug response prediction and drug-target interaction prediction are also the important research topics about drugs. By integrating genomic/pharmaceutical data, protein-protein interaction network, and prior knowledge of drug-target interactions with the techniques of network propagation, Wang et al. have developed a dNetFS method to prioritize genetic and gene expression features of cancer cell lines that predict drug response [38]. Based on the massively collected drug-kinase interactions and drug sensitivity datasets, Liu et al. employed a sparse linear model to infer essential kinases governing the cellular responses to drug treatments in cancer cells [39].Based on the sequence information of both targets and drugs, DeepDTA is used to predict drug-target interaction binding affinities, which is a deep-learning based model (convolutional neural networks) [40].
Although the above DDI prediction methods have achieved some good prediction results of novel DDIs and provided useful information for drug development and practice process. However, these methods did not pay enough attention to new drugs which do not have any DDIs with other drugs or cannot predict novel DDIs for new drugs because known DDIs are missing.
In this study, we develop a computational method (called DDIGIP) to predict novel DDIs based on drug Gaussian interaction profile (GIP) kernel similarity and regularized least squares (RLS) classifier. We calculate the GIP similarity of drugs by known DDIs, and then adopt the RLS method to compute the related scores of any drug pairs. In addition, when predicting DDIs for new drugs, we use the KNN method to compute the initial relational scores by similarity calculated from some important chemical, biological and phenotypic information of drugs. The drug chemical structures, drug-target interactions, drug enzymes, drug transports, drug pathways, drug indications, drug side effects and drug off side effects are all used to calculate similarity of drugs. 5-fold cross validation (5CV), 10-fold cross validation (10CV) and de novo drug validation are used to systemically assess prediction performance of DDIGIP, compared with other methods. In 5-fold cross validation, the area under the ROC curve (AUC) value of DDIGIP is 0.9600 which is slightly better than the state-of-the-art method L1 classifier ensemble (L1E) method results of 0.9570. In addition, the experimental results of 10-fold cross validation also demonstrate that DDIGIP outperforms the L1E method. In de novo drug validation, DDIGIP achieves the AUC of 0.9262, which is also better than the weighted average ensemble (WAE) method result of 0.9073. Case studies further validate the prediction ability of DDIGIP method.

Materials
In this study, the benchmark dataset of DDIs composes of 548 drugs and 48,584 DDIs. This dataset is obtained from the TWOSIDES database. In addition, because we need to calculate the relational scores of new drugs, we also download other chemical, biological and phenotypic data from other databases to compute the similarity of drugs. The chemical data are PubChem substructures which are downloaded from the PubChem Compound database. Biological data include drug targets, drug transports, drug enzymes and drug pathways, the first three types are obtained from the DrugBank database and the last one is from the KEGG database. Furthermore, the phenotypic data composes of drug indications, drug side effects and drug off side effects. The SIDER database provided the drug indications and drug side effects, and OFFSIDES provided the drug off side effects.
Previous studies also provided the download links for these datasets [29]. Table 1 shows the relevant information which includes data type, data source and dimensionality.

GIP kernel similarity of drugs
The GIP kernel similarity has widely been used in other prediction issues of similar areas and achieved effective prediction performances [41][42][43][44][45][46]. RLS-Kron is provided to predict drug-target interactions based on RLS classifier of Kronecker product kernel and GIP kernel similarities of drugs and targets [41]. SDTRLS is provided to predict drug-target interactions based on integration similarity of drug GIP kernel similarity and chemical substructure similarity by the SNF method [42,47]. LDAP is used to predict lncRNA-disease associations by using a bagging SVM classifier based on lncRNA and disease similarities which include GIP kernel similarity [43]. DNRLMF-MDA is an miRNA-disease associations prediction method based on dynamic neighborhood regularized logistic matrix factorization, which also uses the GIP kernel similarity. We compute the GIP similarity of drugs via known DDIs in this study. We denote D = {d 1 , d 2 , ......, d N } as the set of N drugs. The known DDIs can be represented by an adjacency matrix Y ∈ N * N. The value of y ij is 1 if d i and d j have a known interaction, and 0 otherwise. The GIP kernel similarity between drugs d i and d j can be calculated as follows: where γ d is the regularization parameter of kernel bandwidth and γ , d is set to be 1 according to previous studies [42,44], yd i = {y i1 , y i2 , ......, y iN } is the interaction profile of drug d i .

RLS classifier and prediction dDIs
The (kernel) RLS classifier is based on the assumption that similar principal (adjuvant) drugs are tended to interact with the same adjuvant (principal) drug and has been widely used in other areas [42,48,49]. After calculating the GIP kernel similarity G sim , we adopt the RLS classifier to compute the interaction probability scores of drug pairs as follows:Ŷ where σ is the regularization parameter and set to be 1 according to previous study [41]. Furthermore, the G sim and I are the GIP similarity matrix and the identity matrix, respectively. The Y p is the final prediction result matrix, which is symmetric. The interacted probabilities of drug pairs are ranked in descending order. A candidate drug pair with the rank 1 is of the most possible drug pair.

KNN for new drugs
New drugs have no any known interaction with other drugs, which makes prediction DDIs for these drugs is impossible by existing methods. Therefore, we adopt the KNN method to calculate their initial relational scores based on the integrated feature similarity of chemical structure, biological and phenotypic information.
In order to calculate the integrated feature similarity S sim ∈ N * N, we adopt the Pearson correlation coefficient to compute similarity based on the binary vectors of drug chemical substructures, drug targets, drug transporters, drug enzymes, drug pathways, drug indications, drug side effects and drug off side effects. We can see from Table  1 that the total dimensionality of a binary vector of any drug is 21,351, whose value is 1 when the related feature is present and otherwise is 0. Specifically, the similarity of drug pair d i and d j is calculated as follows: where v d i and v d j are the feature vectors of drugs d i and d j , respectively. Cov is the covariance. E and D are the mathematical expectation and standard deviation, respectively. After obtaining the integrated feature similarity S sim , we calculate the initial scores of new drugs by the KNN method. Specifically, the interaction scores Y KNN (d i , d j ) between new drug d i and another drug d j can be calculated as follows: sim is the (i, l)-th element of the integrated similarity matrix and y lj is the (l, j)-th element of known DDIs matrix Y ∈ N * N. K set represents the set of top K nearnest neighbors according to the S sim matrix. In this study, we set the value of K by de novo drug validation.
Algorithm 1 is the description of our DDIGIP method. As the 0 vectors in the DDIs adjacency matrix Y correspond to unknown cases, we firstly compute the initial relational interaction scores for new drugs via the KNN method which uses the feature similarity S sim of drugs by integrating chemical, biological and phenotypic data. The

Algorithm 1 DDIGIP
Input: Drug set DS, DDIs adjacency matrix Y, and parameter K Output: predicted association matrix Y p DDIGIP(DS, Y , K) feature similarity S sim is calculated by Pearson correlation coefficient. After computing the GIP similarity G d of drugs, we take the RLS classifier to calculate the interaction scores of drug pairs. The final prediction result matrix is Y p . Figure 1 demonstrates the work flow of DDIGIP.

Benchmark evaluation and evaluation indices
5CV and 10CV are widely used to evaluate the performance for predicting drug-drug interactions [28,29], drug-target interactions [42,50], drug-disease interactions [51][52][53], lncRNA-disease associations [43,54], miRNA-disease associations [44,55] and so on. In this study, we evaluate the predictive performance of DDIGIP using 5CV and 10CV. In 5CV, all known DDIs are divided into 5 folds, and each fold, in turn, was left out as the test set while the rest 4 folds as the training set. In 10CV, we also divide known DDIs into 10 folds, and each fold is treated as test set in turn, while the remaining 9-folds are as the training set. We adopt 10 repeats in 5CV and 10CV. Furthermore, the actual generalization ability of predicting potential DDIs for new drugs is also an important aspect to assess the prediction performance. We thus conduct de novo drug validation to evaluate the predictive performance of DDIGIP. In de novo drug validation, we take known DDIs of each drug, in turn, and the rest DDIs of other drugs as the training set.
From a prediction method, each drug pair obtains a prediction score. Then each known interaction between two drugs in the test is ranked relative to the candidate interactions (all unknown interactions). On a specified rank threshold, TPR (true positive rate) is the fraction of known interactions that are correctly predicted, and FPR (false positive rate) is the fraction of unknown interactions that are predicted to be true interactions. The receiver operating characteristic curve (ROC) can be drawn based on various TPR and FPR values with various rank thresholds. Then we also use the area under the receiver operating curve (AUC) to measure the prediction performance of DDIGIP and other methods. The higher its AUC value is, the better prediction performance a method achieves.

Comparison with previous methods
In this study, we compare our method with other four methods: weighted average ensemble (WAE) method, L1 classifier ensemble (L1E) method, L2 classifier ensemble (L2E) method [29] and label propagation (LP) method [26], with the same validation method in the benchmark dataset. Table 2 shows that the prediction performances of five methods in 5CV. Based on the AUC values of these methods, DDIGIP is slightly better than other methods. It shows that the GIP similarity is reasonable to use known DDIs because DDIGIP only uses known DDIs in 5CV. In addition, three integrating methods (WAE, L1E, L2E) were also achieved the good results because they integrated the neighbor recommender method, random walk method and matrix perturbation method.  The ∅ represents that we did not compute the prediction performance because the prediction limit for new drugs. Table 2 also shows the prediction performances of five methods in 10CV. DDIGIP also achieved the best prediction result and its AUC value is 0.9636 which is larger than other methods WAE: 0.9530, L1E:0.9599, L2E:0.9594 and LP (max): 0.9378, respectively. By comparing the prediction performances of DDIGIP in 5CV and 10CV, DDIGIP is more effective to predict DDIs in 10CV than in 5CV. It proves that DDIGIP has better prediction ability when there are many known DDIs.

Denovo drug validation
In de novo drug validation, we compare DDIGIP with LP and WAE. We do not perform the de novo drug validation on other existing methods because of their prediction limit for new drugs. Similar to previous studies, we also obtain the weights of integrated methods (neighbor recommender method and random walk method) with drug chemical data, biological data and phenotypic data. Table 2 shows that DDIGIP also obtains the best prediction performance in terms of AUC (0.9262), compared with other methods (WAE: 0.9073, LP (max):0.8997). It also further indicates that the GIP similarity is effective to use known DDIs.

Computation time comparison
The computation time is also an important aspect to assess the performance of computational methods. In this study, we also compare the average computation time of five methods in 5CV. Figure 2 shows that the runtime of DDIGIP is less than those of other methods. In addition, since WAE, L1E and L2E are the integration method, their computation times are longer than those of LP and DDIGIP. We can see from Fig. 2 that DDIGIP runs the fastest and its computation time is 6.61 seconds in 5CV.

Parameter analysis for K
In order to verify the robustness of DDIGIP, we analyze the parameters K that is the number of the nearest neighbors in de novo drug validation. The optimal parameter value of K is selected by the grid search. Figure 3 shows the AUC values of DDIGIP under variation of K ranging from 1 to 15 in de novo validation. We can see from Fig. 3 that the prediction performance has the ascending trend when K ranges from 1 to 7, while has the descending trend when K ranges from 11 to 15. In addition, DDIGIP has a relatively stable prediction performance and achieves the best prediction result (AUC:0.9262) when K is 9. It indicates that a reasonable value of K can improve the prediction performance of DDIGIP.

Case studies
To illustrate the prediction performance of DDIGIP method, we conduct two types of case studies. The one includes the top 20 predicted DDIs under all known DDIs, in which the benchmark dataset is obtained from the TWOSIDES database while the confirmed database is DrugBanK database. Another includes top 20 the new DDIs in de novo validation of drug Ranolazine (DB00243) whose confirmed database composes of TWOSIDES database and DrugBanK database. We can see from Table 3 that 9 out of top 20 DDIs predicted by DDIGIP are validated in DrugBank. The verification success rate is 45%. Zafirlukast (DB00549) is an oral leukotriene receptor antagonist (LTRA) drug usually used in the maintenance treatment of asthma, its metabolism can be decreased by Rabeprazole (DB01129) [56,57]. Atazanavir (DB01072) is an antiretroviral drug of the protease inhibitor (PI) class, which is used to treat infection of human immunodeficiency virus (HIV) and its metabolism can be decreased when combining with Amlodipine (DB00381) [8,58]. In addition, Pantoprazole (DB00213) also decreases the metabolism of Methadone (DB00333) [59]. The risk or severity of adverse effects can be increased when Atenolol (DB00335) is combined with Nadolol (DB01203), Clotrimazole (DB00257) is combined with Pregabalin (DB00230) or Enalapril (DB00584) is combined with Perindopril (DB00790) [9,10,60,61]. The hypotensive activities of Nadolol (DB01203) can be increased by Propranolol (DB00571) [62]. The absorption of Cefpodoxime (DB01416) can be decreased when combining with Ranitidine (DB00863) [63]. Acebutolol (DB01193) also increases the serum concentration of Metoprolol (DB00264) [64]. Ranolazine is an antianginal medication used in the treatment of chronic angina [10]. Table 4 shows that top 20 predicted DDIs of Ranolazine are validated in TWOSIDES database or DrugBanK database. In addition, 12 out of top 20 DDIs are simultaneously confirmed by TWOSIDES database and DrugBanK database, while the rest are confirmed by one of them. For example, the metabolism of Levothyroxine (DB00451) and Zolpidem (DB00425) can be decreased when combining with Ranolazine [15,56]. Clopidogrel is an antiplatelet agent structurally and pharmacologically similar to ticlopidine, which is used to inhibit blood clots in a variety of conditions such as peripheral vascular disease, coronary artery disease, and cerebrovascular disease [8].   The serum concentration of Clopidogrel (DB00758) can be increased when combining with Ranolazine [15]. Similarly, the serum concentration of Simvastatin (DB00641), Acetylsalicylic (DB00945) or Metformin (DB00331) also can be increased when combining with Ranolazine [56,65]. In addition, when Ranolazine is combined with Omeprazole (DB00338) or Acetaminophen (DB00316), its serum concentration also can be increased [15,66].

Conclusion
In this study, we have proposed a computational method, called DDIGIP, for DDIs prediction. The GIP similarity of drugs is calculated by the known DDIs, which makes full use of known DDIs. To our knowledge, in the previous studies the RLS-Kron method is used to predict interaction of bipartite networks, such as drug-target interaction networks, drug-disease interaction network and so on. Experiments are conducted using two different types of cross validations: 5-fold cross validation and 10-fold cross validation. The prediction ability of DDIGIP has been illustrated by comparing it with four other competing state-of-the-art methods. Furthermore, based on Pearson correlation coefficient, we obtain a comprehensive feature similarity of drugs by integrating the chemical, biological and phenotypic data into a high dimension binary vector. In order to more effectively predict DDIs for new drugs, we also conduct de novo drug validation. We add a preprocessing step, KNN, to compute the initial relational scores according to the feature similarity of drugs. Because the vector 0 in the matrix corresponding to unknown cases or missing values rather than confirmed non-interactions, the preprocessing can improve the prediction performance.
Despite the advantages of DDIGIP as discussed above, it still has some limitations. The more effective method should be developed to integrate known DDIs with other chemical, biological and phenotypic data. In addition, other new prediction methods such as matrix completion [67], deep learning [68] and interpretable boosting model [69] could be considered. Finally, in this study, the benchmark dataset of DDIs only includes the positive samples and is an imbalanced dataset, we will also consider some other methods (SVM [70],LibD3C [71],extreme learning machine [72] and collaborative metric learning [73]) to predict DDIs when we obtain reliable negative samples in the future. We expect to develop a more effective method to predict DDIs by overcoming these limitations in the future.