 Methodology Article
 Open Access
 Published:
Predicting effective drug combinations using gradient tree boosting based on features extracted from drugprotein heterogeneous network
BMC Bioinformatics volume 20, Article number: 645 (2019)
Abstract
Background
Although targeted drugs have contributed to impressive advances in the treatment of cancer patients, their clinical benefits on tumor therapies are greatly limited due to intrinsic and acquired resistance of cancer cells against such drugs. Drug combinations synergistically interfere with protein networks to inhibit the activity level of carcinogenic genes more effectively, and therefore play an increasingly important role in the treatment of complex disease.
Results
In this paper, we combined the drug similarity network, protein similarity network and known drugprotein associations into a drugprotein heterogenous network. Next, we ran random walk with restart (RWR) on the heterogenous network using the combinatorial drug targets as the initial probability, and obtained the converged probability distribution as the feature vector of each drug combination. Taking these feature vectors as input, we trained a gradient tree boosting (GTB) classifier to predict new drug combinations. We conducted performance evaluation on the widely used drug combination data set derived from the DCDB database. The experimental results show that our method outperforms seven typical classifiers and traditional boosting algorithms.
Conclusions
The heterogeneous networkderived features introduced in our method are more informative and enriching compared to the primary ontology features, which results in better performance. In addition, from the perspective of network pharmacology, our method effectively exploits the topological attributes and interactions of drug targets in the overall biological network, which proves to be a systematic and reliable approach for drug discovery.
Background
Traditional “one drug, one target” treatment can not always lead to desirable therapeutic effect on complex diseases, because biological pathways and networks are often redundant and robust to single point perturbations [1]. Drug combination perturbs the biological network through synergistic and synthetic lethal effects, and inhibits more effectively the activity level of pathogenic genes [2, 3]. Previous studies have shown that combination drugs can effectively inhibit cancer cell growth or promote cancer cell apoptosis, with reduced toxicity and side effects than single target drugs [4]. Even more promising, drug resistance can be decreased or even overcome through combination therapy [5]. Therefore, therapeutic schemes from single to multitarget drugs play an increasingly important role in the treatment of complex diseases [6].
Despite the increasing successes of combination drugs in inhibiting cancer cell proliferation, most of them are discovered by clinical experience or by occasional chances [5, 6]. Developing combinations of targeted agents is more difficult than developing a single agent [7], as inhibiting the crosstalks among multiple pathways depends on our insight into the pathway interdependencies underlying the cancer cell proliferation and survival in a specific cancer type [8, 9]. The highthroughput screening (HTS) experiments currently used to evaluate drug combinations are still time and cost consuming because they rely heavily on the search for a large number of possible target combinations [10–12]. So, there is an urgent demand for rational and systematically in silico methods to narrow down the candidates for combinatorial drugs for wetlab experimental validations [13].
Quite a few computational methods have been proposed to predict cancer sensitivity to combinatorial drugs [1, 4, 14–16]. The existing methods can be roughly divided into two categories: system biologybased methods [17] and networkbased analysis [15, 18]. System biologybased methods mathematically model the perturbation of drugs using biochemical reactions and kinetic parameters, which are often limited to small scale and wellstudied signaling pathway. Networkbased methods often exploit genomic, chemical and pharmacological properties to build an overall network composed of the associations among drugs, proteins and pathways, and then adopt scoring rules [19, 20], optimal combination searches [1, 16, 21], machine learning [4, 18] to predict potential drug combinations. As networkbased methods integrate various kinds of ontological features and interactions between different subject of interest, some of these methods achieve remarkable performance in predicting drug combinations. For example, Ligeti et al. [20] proposed socalled Target Overlap Score (TOS) prioritization function, which is defined for two drugs as the number of jointly perturbed targets divided by the number of all targets potentially affected by these two drugs, to rank candidate drug combinations. Pang et al. [1] proposed mixed integer linear programming to find balanced target set cover (BTSC) and minimum off target set cover (MOTSC) for combination therapy. Huang et al. [18] propose DrugComboRanker, which first builds a drug functional network based on their genomic profiles, and diseasespecific signaling networks based on patients genomic profiles and interactome data, and then prioritize synergistic drug combinations by searching drugs whose targets are enriched in the complementary signaling modules of the disease signaling network. Matlock et al. [21] tried to find drug combinations maximizing sensitivity over tumor cell models while minimizing toxicity over normal cell models, and then proposed a lexicographic search algorithm to find optimal target set. In addition, some methods exploit the concept of synthetic lethality to discover combinatorial drugs [3, 22, 23]. However, most of previous methods are usually limited to the ability to dissect potential molecular mechanisms, or to associate multiple drugs to one disease in huge pharmacological space.
There have been many approaches that integrate multiple heterogeneous networks to infer the associations between biological entities, including lncRNA functions [24–26], lncRNAdisease associations [27], drugdisease associations [28, 29] and gene functions inference [30]. Inspired by heterogeneous networkbased inference, we ran random walk with restart on the drugprotein heterogenous network to extract features for drug combinations, and then trained gradient tree boosting classifier using the extracted features to predict new drug combinations. Concretely, we integrated a variety of data sources, including chemical structures of the drugs, protein sequences, and known drugprotein associations, to construct a drugprotein heterogeneous network. The random walk with restart procedure was implemented on the heterogenous network using the combinatorial drug and their targets as the initial probability, respectively. The converged probability distribution was used as feature vector of the drug combination. Based on the probability distribution vectors, we subsequently trained the gradient tree boosting (GTB) classifier, which achieved the AUC of 0.949 by 10fold crossvalidation. We also compared our method to other seven typical classifiers, including kNN, SVM, Logistic regression, Naive Bayes, AdaBoost, Random Forest and LogistBoost. The performance comparison results demonstrate that our proposed model significantly outperformed other traditional methods. From the perspective of network pharmacology, our method effectively make use of the topological attributes and functional interactions of drug targets in the proteinprotein network.
Results
Drug combination dataset
The set of effective drug combinations was obtained from DCDB 2.0 [31], a typical drug combination database focused on collecting verified drug combinations to facilitate further exploration, including theoretical modeling and simulation of such beneficial drug combinations. In total, the current version(2.0) of DCDB includes 1363 drug combinations (330 approved and 1033 investigational, including 237 unsuccessful usages), covering 904 individual drugs and 805 targets. We selected those combinations that are approved or under trials in DCDB as positive samples. Note that the number of noneffective drug combinations is actually enormous, much larger than that of effective in real world. Therefore, we generated a number of negative samples of drug combinations by randomly picking up pairwise drugs to balance the positive and negative samples in our benchmark set. The strategy of generation of negative samples has been widely adopted in the prediction of drugtarget interactions and drugdisease associations [28, 32]. Importantly, the drug set that we selected pairwise combinations is expanded from the individual drugs in DCDB to their most associated 3 drugs according to STITCH, yielding 3266 drugs in total. Finally, the benchmark drug combination set contains 1359 positive combinations and 1359 negative combinations.
Performance measures
We conducted performance evaluation using 10fold cross validations. In particular, the training set were randomly divided into ten subsets and each subset had roughly equal size to others. Each subset was in turn used as the test set, and the remaining nine subsets were used as training set. This validation process was repeated ten times and each performance measure was averaged over the ten folds for performance evaluation. A couple of performance measures were used in our experiment, including precision (PRE), recall (REC), Fmeasure, Matthews correlation coefficient (MCC) and the area under the receiver operating characteristic curve (AUC). They are formally defined as below:
in which TP and TN represent the numbers of correctly predicted positive and negative samples, FP and FN represent the numbers of wrong predicted positive and negative samples, respectively. Additionally, the AUC score is computed by varying the cutoff of the predicted scores from the smallest to the greatest value.
Impact of parameters on performance
To explore the impact of parameter λ, which is the probability of random walker jumping to different type of network, We gradually increased its value from 0.1 to 0.9 at interval of 0.1. The aforementioned metrics obtained by 10fold crossvalidation are shown in Table 1, which demonstrate that λ has a moderate impact on the prediction performance of our proposed method. In terms of AUC, the values fit approximately a parabola which hit the top 0.944 at λ 0.7, which was thus adopted in our subsequent experiments. For other two parameters introduced in random walk with restart, restart probability α and tradeoff η, we conducted similar tuning to determine their optimal values that achieve the best performance. As shown in Additional file 1: Table S1 and S2, the AUC measure reached the highest value when α and tradeoff η were equal to 0.2 and 0.9. According to the results, the restart probability α has a negligible effect on the AUC. Generally, the restart probability is a heuristical parameter without any theoretical guide or justification when selecting [33]. However, the heterogeneous network is established based on drugdrug similarity, proteinprotein similarity and known drugprotein associations, resulting in a heterogeneous network with quantitative weighted edges. From this perspective, since the random walk simulates the influence of drugs in protein network, the convergence state will have a bias on higher weighted nodes. Therefore, the restart probability may have a slight effect on the final distribution. As a result, we set the three parameters λ, α and η to 0.7, 0.2 and 0.9 in the following performance comparison experiments.
Performance comparison to typical classifiers
To demonstrate the outstanding performance of our method, we carried out performance evaluation on the benchmark combination set by comparing our method with seven other typical classifiers, including kNN, SVM, Logistic regression, Naive Bayes, Random Forest, Adaboost and LogitBoost. Based on the derived feature distribution vectors, we implemented these competitive classifiers separately using R package [34] so as to conveniently reproduce our work. For Native Bayes, we adopted the R package e1071 [35] and its default setting. Also, logistic regression and SVM are implemented based on the e1071 R package, and logistic regression was run with default settings, while the misclassification penalty coefficient for SVM varied from 10 to 10000 by interval of 500 to achieve best performance. For KNN, R package kknn [36] was used to run the algorithm, in which the parameter k (k =1, 3, 5, 7 and 9) was enumerated to tune its performance. For the distance metric of kNN, we have tried Manhattan distance, Euclidean distance and Chebyshev distance and found that they yield to similar performance, thereby we adopted Chebyshev distance (q=5) in the performance evaluation. The R package randomForest [37] was used to run random forest algorithm and the number of trees varies from 60 to 500 by interval of 20. For boosting methods Adaboost and Logitboost, the R packages Adabag [38] and caTools were used, where the number of training iterations was tuned from 10 to 100 by interval of 5 and 10, respectively. The performance measures of each comparative method, including precision, recall, F1, MCC and AUC, achieved by the finetuned parameters, are shown in Table 2. Apparently, our proposed method significantly outperformed other classifiers in terms of almost all performance metrics.
To present clear performance comparison, the ROC curves of GTB and other seven classifiers are also illustrated in Fig 1. It can be demonstrated that GTB classifier greatly outperforms all other competitive methods, which achieves the highest AUC value 0.95, followed by Random forest and Adaboost at 0.86. The performance of Naive Bayes is the worst and gets only 0.508 AUC value.
Performance improvement by heterogenous networkderived features
To validate the effectiveness of the features extracted from drugprotein heterogeneous network, we conducted performance comparison between the primary ontology features and heterogenous networkbased features. Due to different number of individual drugs and target proteins involved in drug combinations, we can not directly concatenate the drug fingerprints and protein GO annotations to construct feature vectors that are inconsistent in dimension. Instead, we first unified the chemical fingerprints of individual drugs in a combination, i.e. union of individual fingerprint vectors, as well as the union of GO terms of target proteins of individual drugs. Next, we concatenated the union sets of chemical fingerprints and GO annotations for each pair of drug combinations as the input features of the GTB classifier. The performance measures are shown in Table 3. It can be demonstrated that the performance of GTB classifier with input derived from heterogeneous networkbased features is vastly superior to that with primary ontology features. For example, the AUC value increased from 0.528 to 0.949 for GTB classifier. Moreover, we conducted performance comparison for other typical classifiers to validate the advantage of our extracted feature from drugprotein heterogenous network. As shown in Tables 2 and 3, the performance of all these classifiers were greatly boosted by extracting features from the random walk with restart on the heterogenous network.
Discussion
In this paper, we proposed a computational method for predicting effective combination drugs based on features derived from drugprotein heterogenous network by random walk with restart. In order to verify our proposed method, we conducted plenty of empirical experiments to compare the performance of our method to other typical classifiers on the benchmark dataset we constructed previously, and the experimental results significantly demonstrated that our method achieves stateoftheart performance. Note that the input of the GTB classifier is the output of random walk with restart on the heterogeneous network, which is the probability distribution vector only accounting for 6,074 dimensions. Therefore, we believed that the heterogeneous networkderived features are more informative and have been dimensionreduced compared to the highdimensional primary ontology features that may lead to curse of dimensionality when performing classification. As a result, the performance of GTB and other classifiers are significantly improved. In addition, the majority of current methods to predict drug combinations are limited to their size, in which pairwise drugs are most used. However, our proposed method can expand the size of drug combinations, which appreciably increases the practicality.
It is worth noting that the protein network introduced in the random walk with restart is helpful to dig into the biological mechanism of drug combinations in vivo. In fact, the final probability distribution of certain drug combination derived by random walk with restart strongly suggests the indications of the drug combination to some extent. Taking the pairwise combination Docetaxel and Capecitabine as an example, which has been approved by FDA to treat metastatic breast cancer. We ranked the protein nodes according to the probability distribution, the protein with the highest probability is ENSP00000315644 and the third is ENSP00000252029, which are both encoded by Tyms gene. It has been shown that the polymorphisms of Tyms gene are associated with etiology of neoplasia, including breast cancer. In addition, the fourth and fifth are ENSP00000269571 and ENSP00000275493, which are encoded by Erbb2 and Egfr, respectively, are all highly linked to breast cancer. To further evaluate the potential of the protein network, we exemplified another pair of drug combination, Atorvastatin and Proguanil, which currently has no official indication. The resulting probability distribution derived by random walk with restart of this drug combination shows that the protein ENSP00000396308 encoded by Dhfr has the highest probability value. It has been demonstrated that diseases associated with Dhfr include megaloblastic anemia due to dihydrofolate reductase deficiency and megaloblastic anemia. Expectedly, quite a few works have demonstrated the pharmacological effect of Atorvastatin and Proguanil on Anemia. For example, Vahid et al. [39] validated that Atorvastatin can soften human red blood cells, and physical deformation of the red blood cells underlies pathological manifestations of sickle cell anemia and hypercholesterolemia. Another trial demonstrated the effectiveness of Proguanil in treatment of malarial anemia [40]. Therefore, anemia may be a potential indication of the drug combination Atorvastatin and Proguanil. In summary, we draw the conclusion that the probability distribution derived by random walk can effectively reveal the indication of drug combinations.
We further checked the positive samples that are falsely classified, as negative samples are randomly generated. We found that the falsely determined samples by our method have low similarity to other samples. In fact, most existing computational models, which aim at the prediction of drugtarget interactions, drugdisease associations, often hold the assumption that similar compounds are likely to interact with similar target proteins and thereby play similar therapeutic efficacy in cellular microenvironment. These computational methods have achieved superior performance, and greatly narrowed down the number of candidate drug targets and reveal new indications of approved drugs. Under this assumption, the prediction accuracy often relies on the close associations of tested samples with known samples that have been validated by wetlab experiments, such as drug combinations and drugtarget interactions. In terms of network medicine, the influence of drug molecule would perturb the cellular network via signal cascade reactions and protein interaction network. Many computational methods have taken into account this consideration, and adopted random walks and diffusion on network to capture the perturbation of the drugs.
However, there are always some samples located far from validated samples in the feature space. For instance, some new drugs have low similarity to other drugs, and some proteins have low similarity to other protein in different protein family. As a result, similaritybased or network diffusionbased computational methods tend to encounter failure in predicting drug combinations or drugtarget interactions composed of such drugs or proteins. Fortunately, the emergence of largescale experimental data derived from highthroughput screening technique can strongly motivate the novelty of methods to predict synergistic drugs or effective drug combinations.
Conclusion
In this paper, we proposed a gradient tree boosting (GTB) classifier based on heterogeneous networkderived features to predict effective drug combinations. The heterogeneous network integrates the drug similarity network, protein similarity network and known drugprotein associations. Next, we ran random walk with restart (RWR) on the heterogenous network using the combinatorial drugs and their associated targets as the initial probability, and obtained the converged probability distribution as the feature vector of each drug combination. The heterogeneous networkderived features introduced in our method are more informative and enriching compared to the primary ontology features. The GTB classifier trained based on the heterogeneous networkderived features outperforms seven typical classifiers and traditional boosting algorithms. Moreover, our case studies show that our method is helpful in revealing the indications of drug combinations. From the perspective of network pharmacology, our method effectively exploits the topological attributes and interactions of drug targets in the overall biological network, which proves to be a systematic and reliable approach for drug discovery.
Methods
Overview of our methodology
We first constructed the benchmark drug combination set composed of positive samples derived from public databases and negative samples that were randomly generated. For individual drugs included in the benchmark set, we collected a variety of related characteristics, including chemical fingerprints, drug targets and drugprotein associations, as shown in Fig. 2a. These ontology features of drugs and proteins were used to compute the drugdrug similarities and proteinprotein similarities. Together with the known drugprotein associations, we constructed the drugprotein heterogeneous network. Next, the random walk with restart on heterogeneous network proposed in our previous work [28] was conducted for each drug combinations as initial state, as illustrated in Fig. 2bc. The probability distribution when the random walk reaches steady state was used as the feature vector of the drug combination. Based on the feature representation of the drug combinations, the gradient tree boosting (GTB) classifier was trained to predict new effective drug combinations.
Drugprotein associations
We selected drugprotein associations from STITCH database [41], which is a comprehensive database that collected compoundprotein interactions from different sources: biochemical experiments, external databases, text mining and computational predictions. STITCH has computed a confidence score for each interaction ranging from 0 to 1,000, which indicates the confidence of the compoundprotein interaction supported by four types of evidences. We first used a confidence threshold 0.5 (corresponding to 500 combined score in STITCH) to remove lowconfidence target proteins, because we think too lowconfidence targets are probable unauthentic ones. Next, we selected top 3 from the rest of target proteins of each drug. If one drug has less than 3 target proteins with confidence score higher than 0.5, we then took only those targets into account. In total, we got 210,235 drugprotein associations regarding to 3,266 unique drugs (drug set are built by selecting top 3 similar drugs, see following subsection for details). Formally, denoted by D=(d_{1},d_{2},...,d_{n}) and P=(p_{1},p_{2},...,p_{m}) the drug and protein node set, and A the adjacent matrix of drugprotein associations with element a_{ij} equal to the confidence score if there is validated interaction between drug i and protein j, and a_{ij}=0 otherwise.
Drugdrug similarity network
We expanded the list of individual drugs by selecting top 10 most similar drugs to each single agent included in DCDB, according to the chemicalchemical combined scores that were derived from STITCH [41]. After removal of duplicate drugs, 3266 unique drugs were obtained. Similar compounds are likely to interact with similar target proteins and thereby play similar therapeutic efficacy in cellular microenvironment [42], allowing us to find new drug combinations by introducing similar drugs to known ones. Therefore, we believe that the expanded list of drugs can increase the opportunity for discovery of novel drug combinations.
Next, we generate the chemical fingerprint of the drugs to calculate the similarity measurement of each pair of drugs. Similar to our previous work [28], we applied PaDEL software [43] to compute the chemical fingerprints using the SMILES string of a drug, and obtain an 880d binary vector for each drug. The element 1 of the binary vector represents that the drug contains the corresponding chemical fingerprint, and 0 otherwise. Subsequently, Jaccard score, a widely used similarity measure, is calculated based on the chemical fingerprints as the chemical similarities for pairwise drugs. The Jaccard score is generally defined as the intersection size divided by the union size of two individual sets, which is shown as follows:
Further, the bipartite network projection algorithm, a method inspired by the networkbased resourceallocation dynamics [44], was adopted to compute another drug similarity measure based on known drugprotein associations. In the drugprotein bipartite network, each drug node equally allocates the original resource to its associated protein nodes, and successively the assigned resource of each protein node is equally transferred back to its neighborhood drugs. As a result, the proportion of the resource of drug d_{i} conveyed to drug d_{j} in such allocation process represents the strength of association between two drugs. Suppose the initial resource of each drug node is oneunit, the second drug similarity measure, denoted by \(S^{(d2)}_{ij}\), can be formulated as below:
in which k(d_{j}) and k(p_{l}) are the degrees of drug d_{j} and protein p_{l} in the drugprotein association network. Intuitively, more common associated protein nodes the pairwise drugs share, higher similarity the drugs have. Particularly, if the associated proteins of two drugs are not overlapped, i.e. no common associated protein exists, the similarity is denoted by 0.
Finally, these two aforementioned drugdrug similarities were integrated into a comprehensive measurement using the probability disjunction formula as below:
Proteinprotein similarity network
Correspondingly, we constructed the proteinprotein similarity network based on two different similarity measures, including protein sequence similarity and GO semantic similarity. By using R package biomaRt (2.40.4) [45, 46], the protein sequences can be readily obtained from Ensembl genome database (2018 updated), which is dedicated to curating generelated information to encourage genome analysis [47]. The sequence similarity \(S_{ij}^{(p1)}\) between protein p_{i} and protein p_{j} was computed by using the R package Protr (1.62) [48], in which the SmithWaterman algorithm is applicable.
Similar drugs are supposed to interact with proteins that act in similar biological processes or have similar molecular functions or reside in similar compartments [49]. Therefore, the GO semantic similarity \(S_{ij}^{(p2)}\) between protein p_{i} and protein p_{j} was calculated using R package GOSemSim (2.10.0) [50]. All three types of ontology features are used in the calculation of semantic similarity.
Likewise, the probability disjunction was used to integrate two aforementioned proteinprotein similarities, which is formulated as below:
where \(S_{ij}^{(p)}\) is the comprehensively integrated similarity measurement between protein p_{i} and protein p_{j}.
Random walk with restart on heterogenous network
The drugdrug similarity network, proteinprotein similarity network and drugprotein association network were combined to construct the drugprotein heterogeneous network G=(V,E). The node set V={D,P}, V is the union set of the drug and protein nodes. The edge set E={E_{dd}∪E_{dp}∪E_{pd}∪E_{pp}}, where E_{dd}, E_{pp}, E_{dp} and E_{pd} are the drugdrug edge, proteinprotein and drugprotein edge collections, respectively.
In order to obtain the feature representations of drug combinations, we extended our previous work in which the random walks with restart on the heterogeneous network was developed for single drug repurposing [28]. More precisely, for a drug combination d_{i} and drug d_{j}, we performed random walk with restart on the heterogeneous network in which these two drugs and their known target proteins act as seed nodes, as shown in Fig. 3. Actually, since the initial probability distribution can be easily extended to more drugs and their targets, the number of individual drugs involved in the combination is not limited to 2 in our method. When the random walk process reaches steady state, the probability distribution vector can be regarded as the perturbation on the protein network by the combinatorial drugs. With the drugprotein heterogeneous network, the transition matrix T can be defined as below:
where T^{(dd)} and T^{(pp)} are the probability transition matrix from drug nodes (protein) to drug nodes (protein nodes) during the random walk process; T^{(dp)} denotes the probability transition matrix that drug nodes walk to protein nodes, and T^{(pd)} denotes the probability transition matrix that protein nodes walk to drug nodes.
Suppose that the random walker starts from a drug node, and then visits one of its targeted proteins with probability λ, or visits any other drug nodes with probability (1 λ) in the heterogeneous network. If λ=0, the random walker can only stay within the networks where it starts. According to the drugdrug similarity, the transition probability from drug d_{i} to drug d_{j} can be defined as below:
where S_{ij} is the similarity between ith drug and jth drug, a_{il} is the association confidence score between ith drug and lth protein. The sum of a_{il} equaling to 0 indicates that the drug has no approved or predicted association with any proteins. Similarly, the transition probability from protein p_{i} to protein p_{j} can be defined based on the proteinprotein similarity as below:
where S_{ij} is the similarity between ith protein and jth protein, a_{li} is the association score between lth drug and ith protein.
Accordingly, the transition probability from drug d_{i} to protein p_{j} is defined as:
The transition probability from protein p_{i} to drug d_{j} is defined as:
Provided that P(t) is a (n+m)dimension probability vector at step t, in which P(t)[i] represents the probability of the random walker visiting node i(drug or protein), the random walk process can be iteratively calculated as below:
where α is the restart probability, and P_{0} is the initial probability distribution vector of a set of seed nodes consisting of a combinatorial drugs and their targeted proteins. Take the drug combination d_{i} and d_{j} as an example, d_{i} and d_{j} are employed as the seed nodes in the drug network and each seed node is given equal probability 1/2. By giving rest drug nodes probability 0, the initial probability matrix with respect to drugs can be constructed. Correspondingly, the protein nodes related to drug d_{i} and drug d_{j} are used as seed nodes in protein network and equal probabilities are allocated to these protein nodes so that the sum of the probabilities is 1. As shown in Fig. 3, there are three targeted proteins and thus each protein is given initial probability 1/3. Let \(P^{(d)}_{0}\) and \(P^{(p)}_{0}\) be the initial probability vectors of drugs and proteins separately, the initial probability P_{0} for drugcentric random walk can be defined as follows:
where η∈[0,1] is a tradeoff parameter to balance the weight of importance between the drug nodes and protein nodes. In our experiments, η is set to 0.5. If the difference between twice iteration is lower than 1e10, the random walk is supposed to reach steady state. Once the random walk process converges, the probability distribution is used as the feature vector of the drug combination.
Building gradient tree boosting classifier
Based on the feature vectors produced by the random walk on drugprotein heterogenous network for each pair of drug combination, we built a gradient tree boosting (GTB) classification model, referred to as gradient boosting regression or decision tree (GBRT or GBDT). Gradient tree boosting is an efficacious machine learning method that has achieved desirable performance in both classification and regression problems [51–53]. In fact, Caruana and NiculescuMizil have conducted comprehensive performance evaluation on eight different binary classification problems by comparing boosted trees algorithm with other nine typical classifiers, including SVMs, Neural Nets, Logistic regression, Naive Bayes, memorybased learning, Random Forests, Decision Trees, Bagged Trees and Boosted Stumps. Their conclusion showed that boosted treebased algorithm achieved best performance [54]. Another empirical performance evaluation has also demonstrated that boosted decision trees perform exceptionally well when the dimensionality of the input is not too high [55]. Therefore, we adopted the GTB algorithm to build our classification model.
Formally, the decision function of GTB is initialized as:
where N is the number of drug combinations contained in the training set. The gradient tree boosting algorithm repeatedly constructs K different classification subtrees h(x,a_{1}), h(x,a_{2}),..., h(x,a_{K}), each of which is separately trained based on a subset of randomly selected samples from the training set, and then iteratively establishes the additive function θ_{k}(x):
in which b_{k} and a_{k} are the weight and parameter vector of the kth classification subtree h(x,a_{k}). The loss function L(y,θ_{k}(x)) is defined as:
where y is a binary value representing the real class of the combination and θ(x) is the decision function. In order to minimize the loss function L(y,θ_{k}(x)), both b_{k} and a_{k} are iteratively optimized by applying grid search. In this paper, grid search strategy was adopted to tune the optimal hyperparameters of GTB by 10fold crossvalidation on the constructed drug combination dataset. Finally, the optimal number of trees of the GTB is 300, and the tuned depth of the trees is 13.
Availability of data and materials
The source codes, datasets and additional files used in this work are all available at https://github.com/hliu2016/SynerDrug.
Abbreviations
 GO:

Gene ontology
 GTB:

Gradient tree boosting
 kNN:

KNearestNeighbor
 RWR:

Random walk with restart
 SVM:

Support vector machine
References
Pang K, Wan YW, Choi WT, Donehower LA, Sun J, Pant D, Liu Z. Combinatorial therapy discovery using mixed integer linear programming. Bioinformatics. 2014; 30(10):1456–63.
Lee MJ, Albert SY, Gardino AK, Heijink AM, Sorger PK, MacBeath G, Yaffe MB. Sequential application of anticancer drugs enhances cell death by rewiring apoptotic signaling networks. Cell. 2012; 149(4):780–94.
Guo J, Liu H, Zheng J. Synlethdb: synthetic lethality database toward discovery of selective and sensitive anticancer drug targets. Nucleic Acids Res. 2015; 44(D1):1011–7.
Preuer K, Lewis RP, Hochreiter S, Bender A, Bulusu KC, Klambauer G. Deepsynergy: predicting anticancer drug synergy with deep learning. Bioinformatics. 2017; 34(9):1538–46.
Jia J, Zhu F, Ma X, Cao Z, Cao Z, Li Y, Li Y, Chen Y. Mechanisms of drug combinations: interaction and network perspectives. Nat Rev Drug Discov. 2009; 8(2):111–28.
AlLazikani B, Banerji U, Workman P. Combinatorial drug therapy for cancer in the postgenomic era. Nat Biotechnol. 2012; 30(7):679.
Gioeli D, Wunderlich W, SeboltLeopold J, Bekiranov S, Wulfkuhle JD, Petricoin EF, Conaway M, Weber MJ. Compensatory pathways induced by mek inhibition are effective drug targets for combination therapy against castrationresistant prostate cancer. Mol Cancer Ther. 2011; 10(9):1581–90.
Liu T, Yacoub R, TaliaferroSmith LD, Sun SY, Graham TR, Dolan R, Lobo C, Tighiouart M, Yang L, Adams A, et al. Combinatorial effects of lapatinib and rapamycin in triplenegative breast cancer cells. Mol Cancer Ther. 2011; 10(8):1460–9.
Smalley KS, Haass NK, Brafford PA, Lioni M, Flaherty KT, Herlyn M. Multiple signaling pathways must be targeted to overcome drug resistance in cell lines derived from melanoma metastases. Mol Cancer Ther. 2006; 5(5):1136–44.
Nelander S, Wang W, Nilsson B, She QB, Pratilas C, Rosen N, Gennemark P, Sander C. Models from experiments: combinatorial drug perturbations of cancer cells. Mol Syst Biol. 2008; 4(1):216.
Sun X, Vilar S, Tatonetti NP. Highthroughput methods for combinatorial drug discovery. Sci Transl Med. 2013; 5(205):205–12051.
Li P, Huang C, Fu Y, Wang J, Wu Z, Ru J, Zheng C, Guo Z, Chen X, Zhou W, et al. Largescale exploration and analysis of drug combinations. Bioinformatics. 2015; 31(12):2007–16.
Costello JC, Heiser LM, Georgii E, Gönen M, Menden MP, Wang NJ, Bansal M, Hintsanen P, Khan SA, Mpindi JP, et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nat Biotechnol. 2014; 32(12):1202.
Csermely P, Agoston V, Pongor S. The efficiency of multitarget drugs: the network approach might help drug design. Trends Pharmacol Sci. 2005; 26(4):178–82.
Iadevaia S, Lu Y, Morales FC, Mills GB, Ram PT. Identification of optimal drug combinations targeting cellular networks: integrating phosphoproteomics and computational network analysis. Cancer Res. 2010; 70(17):6704–14.
Tang J, Karhinen L, Xu T, Szwajda A, Yadav B, Wennerberg K, Aittokallio T. Target inhibition networks: predicting selective combinations of druggable targets to block cancer survival pathways. PLoS Comput Biol. 2013; 9(9):1003226.
Ryall KA, Tan AC. Systems biology approaches for advancing the discovery of effective drug combinations. J Cheminform. 2015; 7(1):7.
Huang L, Li F, Sheng J, Xia X, Ma J, Zhan M, Wong ST. Drugcomboranker: drug combination discovery based on target network analysis. Bioinformatics. 2014; 30(12):228–36.
Zhao XM, Iskar M, Zeller G, Kuhn M, Van Noort V, Bork P. Prediction of drug combinations by integrating molecular and pharmacological data. PLoS Comput Biol. 2011; 7(12):1002323.
Ligeti B, Pénzváltó Z, Vera R, Győrffy B, Pongor S. A networkbased target overlap score for characterizing drug combinations: high correlation with cancer clinical trial results. PLoS ONE. 2015; 10(6):0129267.
Matlock K, Berlow N, Keller C, Pal R. Combination therapy design for maximizing sensitivity and minimizing toxicity. BMC Bioinformatics. 2017; 18(4):116.
De Raedt T, Walton Z, Yecies JL, Li D, Chen Y, Malone CF, Maertens O, Jeong SM, Bronson RT, Lebleu V, et al. Exploiting cancer cell vulnerabilities to develop a combination therapy for rasdriven tumors. Cancer cell. 2011; 20(3):400–13.
Roller DG, Axelrod M, Capaldo BJ, Jensen K, Mackey A, Weber MJ, Gioeli D. Synthetic lethal screening with smallmolecule inhibitors provides a pathway to rational combination therapies for melanoma. Mol Cancer Ther. 2012; 11:2505–15.
Zhang Z, Zhang J, Fan C, Tang Y, Deng L. Katzlgo: largescale prediction of lncrna functions by using the katz measure based on multiple networks. IEEE/ACM Trans Comput Biol Bioinform. 2017. https://doi.org/10.1109/tcbb.2017.2704587.
Deng L, Wu H, Liu C, Zhan W, Zhang J. Probing the functions of long noncoding rnas by exploiting the topology of global association and interaction network. Comput Biol Chem. 2018; 74:360–7.
Deng L, Wang J, Xiao Y, Wang Z, Liu H. Accurate prediction of proteinlncrna interactions by diffusion and hetesim features across heterogeneous network. BMC Bioinformatics. 2018; 19(1):370.
Zhang J, Zhang Z, Chen Z, Deng L. Integrating multiple heterogeneous networks for novel lncrnadisease association inference. IEEE/ACM Trans Comput Biol Bioinform. 2019; 16:396–406.
Liu H, Song Y, Guan J, Luo L, Zhuang Z. Inferring new indications for approved drugs via random walk on drugdisease heterogenous networks. BMC Bioinformatics. 2016; 17:539.
Luo H, Wang J, Li M, Luo J, Peng X, Wu FX, Pan Y. Drug repositioning based on comprehensive similarity measures and birandom walk algorithm. Bioinformatics. 2016; 32(17):2664–71.
Zhang J, Deng L. Integrating multiple interaction networks for gene function inference. Molecules. 2019; 24(1):30.
Liu Y, Wei Q, Yu G, Gai W, Li Y, Chen X. Dcdb 2.0: a major update of the drug combination database. Database. 2014; 2014. https://doi.org/10.1093/database/bau124.
Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M. Prediction of drugtarget interaction networks from the integration of chemical and genomic spaces. Bioinformatics. 2008; 24(13):232–40.
Jin W, Jung J, Kang U. Supervised and extended restart in random walks for ranking and link prediction in networks. PLoS ONE. 2019; 14(3):0213857.
R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing: 2014. http://www.Rproject.org/.
Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F, ChihChung C, ChihChen C. e1071: misc functions of the department of statistics, probability theory group 1.73. R package e1071. 2019.
Hechenbichler K, Schliep K. Weighted kNearestNeighbor Techniques and Ordinal Classification. LudwigMaximilians University Munich: 2004. p. SFB 386. http://nbnresolving.de/urn/resolver.pl?urn=nbn:de:bvb:19epub17699.
Liaw A, Wiener M. Classification and regression by randomforest. R News. 2002; 2(3):18–22.
Alfaro E, Gamez M, Garcia N, et al. Adabag: An r package for classification with boosting and bagging. J Stat Softw. 2013; 54(2):1–35.
SheikhHasani V, Babaei M, Azadbakht A, PazokiToroudi H, Mashaghi A, MoosaviMovahedi AA, Reihani SNS. Atorvastatin treatment softens human red blood cells: an optical tweezers study. Biomed Opt Expr. 2018; 9(3):1256–61.
Mulenga M, Malunga F, Bennett S, Thuma PE, Shulman C, Fielding K, Alloueche A, Greenwood BM. A randomised, doubleblind, placebocontrolled trial of atovaquone–proguanil vs. sulphadoxine–pyrimethamine in the treatment of malarial anaemia in zambian children. Trop Med Int Health. 2006; 11(11):1643–52.
Szklarczyk D, Santos A, von Mering C, Jensen LJ, Bork P, Kuhn M. Stitch 5: augmenting proteinchemical interaction networks with tissue and affinity data. Nucleic Acids Res. 2015; 44(D1):380–4.
Johnson MA, Maggiora GM. Concepts and applications of molecular similarity: Wiley; 1990.
Yap CW. Padeldescriptor: An open source software to calculate molecular descriptors and fingerprints. J Comput Chem. 2011; 32(7):1466–74.
Zhou T, Ren J, Medo M, Zhang YC. Bipartite network projection and personal recommendation. Phys Rev E. 2007; 76(4):046115.
Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W. Biomart and bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005; 21(16):3439–40.
Durinck S, Spellman PT, Birney E, Huber W. Mapping identifiers for the integration of genomic datasets with the r/bioconductor package biomart. Nat Protocol. 2009; 4(8):1184.
Zerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, Billis K, Cummins C, Gall A, Girón CG, et al. Ensembl 2018. Nucleic Acids Res. 2017; 46(D1):754–61.
Xiao N, Cao DS, Zhu MF, Xu QS. protr/protrweb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics. 2015; 31(11):1857–9.
Liu H, Sun J, Guan J, Zheng J, Zhou S. Improving compound–protein interaction prediction by building up highly credible negative samples. Bioinformatics. 2015; 31(12):221–9.
Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. Gosemsim: an r package for measuring semantic similarity among go terms and gene products. Bioinformatics. 2010; 26(7):976–8.
Pan Y, Liu D, Deng L. Accurate prediction of functional effects for variants by combining gradient tree boosting with optimal neighborhood properties. PLoS ONE. 2017; 12(6):0179314.
Fan C, Liu D, Huang R, Chen Z, Deng L. Predrsa: a gradient boosted regression trees approach for predicting protein solvent accessibility. BMC Bioinformatics. 2016; 17:8. BioMed Central.
Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal. 2002; 38(4):367–78.
Caruana R, NiculescuMizil A. An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning. ACM: 2006. p. 161–8. https://doi.org/10.1145/1143844.1143865.
Caruana R, Karampatziakis N, Yessenalina A. An empirical evaluation of supervised learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning. ACM: 2008. p. 96–103. https://doi.org/10.1145/1390156.1390169.
Acknowledgements
The authors thank DCDB and STITCH for providing the data.
Funding
The motivation, design and data analysis and validations of the study was supported by National Natural Science Foundation of China under grants No. 61672113; The analysis and interpretation of data and writing the manuscript is partially supported by the project of Jiangsu Provincial Science and Technology Department (Grant No. BE2018638) and Jiangsu provincial 333 project; The writing of the manuscript is also partially supported by Social Development Projects of Jiangsu Province (BE2018643) and Scientific Program of Changzhou (CE20195048; ZD201919).
Author information
Authors and Affiliations
Contributions
HL, WZ and LN designed the study and conducted experiments. HL and WZ performed statistical analyses. HL and WZ drafted the manuscript. WZ, LN and XD prepared the experimental materials and benchmarks. JL and LZ administrated the project and acquired funding. All authors have read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
The drug combination data used to evaluate the GTB model are available at DCDB database which is an announced public data source. The drugprotein associations and drugdrug links are derived from STITCH on reasonable request. The STITCH is supported by European Molecular Biology Laboratory(EMBL), Swiss Institute of Bioinformatics(SIB) and NNF Center for Protein Research(CPR).
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Table S1. Impact of the restart probability α on the performance of GTB classifer. Table S2. Impact of the tradeoff parameter η on the performance of GTB classifer. Table S3. Impact of significant parameters involved in GTB classifier. Table S4S8. Parameter tuning for competitive methods.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Liu, H., Zhang, W., Nie, L. et al. Predicting effective drug combinations using gradient tree boosting based on features extracted from drugprotein heterogeneous network. BMC Bioinformatics 20, 645 (2019). https://doi.org/10.1186/s1285901932881
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901932881
Keywords
 Drug combination
 Random walk
 Heterogenous network