DDI-PULearn: a positive-unlabeled learning method for large-scale prediction of drug-drug interactions

Background Drug-drug interactions (DDIs) are a major concern in patients’ medication. It is infeasible to identify all potential DDIs using experimental methods, which are time-consuming and expensive. Computational methods provide an effective alternative, but they face challenges due to the lack of experimentally verified negative samples.
Results To address this problem, we propose a novel positive-unlabeled learning method named DDI-PULearn for large-scale drug-drug interaction prediction. DDI-PULearn first generates seeds of reliable negatives via OCSVM (one-class support vector machine) under a high-recall constraint and via cosine-similarity-based KNN (k-nearest neighbors). Trained with all the labeled positives (i.e., the validated DDIs) and the generated seed negatives, DDI-PULearn then employs an iterative SVM to identify the entire set of reliable negatives from the unlabeled samples (i.e., the unobserved DDIs). Following that, DDI-PULearn represents all the labeled positives and the identified negatives as vectors of abundant drug properties using a similarity-based method. Finally, DDI-PULearn transforms these vectors into a lower-dimensional space via PCA (principal component analysis) and uses the compressed vectors as input for binary classification. The performance of DDI-PULearn is evaluated on simulated prediction for 149,878 possible interactions between 548 drugs, in comparison with two baseline methods and five state-of-the-art methods. The experimental results show that the proposed representation method characterizes DDIs accurately. DDI-PULearn achieves superior performance owing to the identified reliable negatives, significantly outperforming all other methods. In addition, the predicted novel DDIs suggest that DDI-PULearn is capable of identifying novel DDIs.
Conclusions The results demonstrate that positive-unlabeled learning paves a new way to tackle the problem caused by the lack of experimentally verified negatives in the computational prediction of DDIs.

There are 2 additional files in total. The files are structured as follows:
(1) Additional file 1: The implementation details and experiment results of the feature importance ranking analysis.
(2) Additional file 2: Additional results and data used in this work.

Feature importance ranking using Random Forest
In this work, we collected a variety of drug property data that may help to improve the prediction performance, i.e., drug chemical substructures, drug substituents, drug target proteins, drug side-effects, drug indications, drug-associated pathways, and drug-associated genes. We then investigated how these drug data contribute to the performance of drug-drug interaction prediction using Random Forest (RF). RF is a widely used method for feature ranking because it requires very little feature engineering and parameter tuning [1].
Similar to the subsection "Feature vector representation for DDIs", we first represent each drug according to its property data, which include 881 drug chemical substructures, 1,235 drug substituents, 722 drug targets, 1,685 drug-associated pathways, 9,000 drug-associated genes, 1,620 drug side-effects and 610 drug indications. The drug chemical substructures correspond to the 881 substructures defined in the PubChem database; the drug substituents, drug targets, and drug indications are the 1,235 unique substituents, 722 unique targets and 610 unique indications in DrugBank, respectively; the drug-associated pathways and genes are the 1,685 unique pathways and 9,000 unique genes in CTD; the side-effects are the 1,620 unique side-effects in SIDER. Each bit in a feature vector denotes the absence or presence of the corresponding substructure/substituent/target/side-effect/indication/pathway/gene by 0 or 1. Each drug is then represented as a 15,753-dimensional feature vector by concatenating the above features sequentially (881 + 1,235 + 722 + 1,685 + 9,000 + 1,620 + 610 = 15,753). Following that, we represent each drug-drug interaction as the average of the two corresponding drug feature vectors using formula 2. Finally, we treat all validated DDIs as positive samples (48,584), and all potential DDIs that are not validated as negative samples (101,294 = $\binom{548}{2}$ − 48,584, where $\binom{548}{2}$ = 149,878 is the number of unordered pairs among 548 drugs). RF is employed as the classifier to classify these samples, and the feature importance scores are produced by RF.
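The representation and sample counts described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the dictionary-of-index-sets input layout, and the assumption that the property blocks are concatenated in the order they are listed in the text are all hypothetical.

```python
from itertools import combinations
from math import comb

# Property blocks and their sizes, as listed in the text.
# The concatenation order is an assumption made for this sketch.
PROPERTY_SIZES = {
    "substructure": 881,
    "substituent": 1235,
    "target": 722,
    "pathway": 1685,
    "gene": 9000,
    "side_effect": 1620,
    "indication": 610,
}
TOTAL_DIM = sum(PROPERTY_SIZES.values())  # 15,753


def drug_vector(present):
    """Concatenate per-property binary blocks into one 0/1 vector.

    `present` maps a property name to the set of feature indices (within
    that property's block) that the drug has; this layout is hypothetical.
    """
    vec = []
    for name, size in PROPERTY_SIZES.items():
        on = present.get(name, set())
        vec.extend(1 if i in on else 0 for i in range(size))
    return vec


def pair_vector(vec_a, vec_b):
    """Represent a drug pair as the element-wise average of both drug vectors."""
    return [(a + b) / 2 for a, b in zip(vec_a, vec_b)]


def label_pairs(drug_ids, validated):
    """Label every unordered drug pair: 1 if validated, else 0.

    `validated` is a set of frozensets, each holding two drug ids.
    """
    return {
        pair: int(frozenset(pair) in validated)
        for pair in combinations(drug_ids, 2)
    }


N_DRUGS, N_POSITIVE = 548, 48_584
N_PAIRS = comb(N_DRUGS, 2)         # all unordered pairs among 548 drugs
N_NEGATIVE = N_PAIRS - N_POSITIVE  # pairs treated as negatives in this analysis
```

Each entry of a pair vector takes one of three values: 0 (neither drug has the feature), 0.5 (exactly one does) or 1 (both do). The resulting pair vectors and labels would then be passed to a Random Forest implementation to obtain the importance scores; that step is omitted here.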
The detailed results are included in Table S5 in Additional file 2. Here, we provide summary statistics for the feature importance ranking, as shown in Table 1 and Table 2. We rank all 15,753 features in descending order of their RF feature importance scores and count how many features of each drug property appear among the top 50/100/150/200/300/400/500/600/700/800/900/1000 ranked features. It can be seen from Table 1 that features from three drug properties, namely chemical substructures, indications, and side-effects, account for most of the top-ranked features. For example, they take 47 of the top 50 positions (94%) and 707 of the top 1000 positions (70.7%). However, the number of features varies considerably across drug properties, and properties with more features are more likely to take more positions among the top-ranked features. To avoid this bias, we also investigate the ratio of top-ranked features relative to the total number of features for each drug property. For each drug property, the ratio is the number of top-ranked features belonging to this property divided by the total number of features belonging to this property. The results are listed in Table 2. It can be seen that features belonging to the above three properties (i.e., chemical substructures, indications, and side-effects) achieve larger ratios as well. All the above results indicate that features from chemical substructures, indications, and side-effects play a leading role in the DDI prediction task. Therefore, we use chemical substructures, indications, and side-effects as the base properties to represent drugs for DDI-PULearn. More experiments investigating the impact of drug properties on DDI prediction performance are described in the main text.
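The two statistics discussed above (the per-property count among the top-k features, and that count divided by the property's total feature number) could be computed along the following lines. This is a sketch under stated assumptions: the function name and input layout are hypothetical, and the toy scores are made up for illustration and are not the values reported in Table 1, Table 2 or Table S5.

```python
from collections import Counter


def top_k_property_stats(scores, feature_property, k):
    """Per drug property, count its features among the k highest-scoring
    features and divide that count by the property's total feature number.

    scores: importance scores, one per feature.
    feature_property: property name of each feature, aligned with `scores`.
    Returns {property: (count_in_top_k, count_in_top_k / total_features)}.
    """
    totals = Counter(feature_property)
    # Rank feature indices by descending importance score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    in_top = Counter(feature_property[i] for i in ranked[:k])
    return {prop: (in_top[prop], in_top[prop] / totals[prop]) for prop in totals}


# Toy data, invented for this example only.
toy_scores = [0.30, 0.05, 0.20, 0.01, 0.02, 0.25, 0.10, 0.07]
toy_props = ["substructure", "substructure", "indication", "gene",
             "gene", "side_effect", "gene", "gene"]
stats = top_k_property_stats(toy_scores, toy_props, k=3)
```

In the toy data, the "gene" property has the most features but none in the top 3, so both its count and its ratio are zero. The ratio is what separates a property that ranks highly merely because it contributes many features from one whose features are individually informative.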