EDST: a decision stump based ensemble algorithm for synergistic drug combination prediction

Introduction There are countless possibilities for drug combinations, which makes it expensive and time-consuming to rely solely on clinical trials to determine the effects of each possible drug combination. In order to screen out the most effective drug combinations more quickly, scholars began to apply machine learning to drug combination prediction. However, most of them are of low interpretability. Consequently, even though they can sometimes produce high prediction accuracy, experts in the medical and biological fields can still not fully rely on their judgments because of the lack of knowledge about the decision-making process. Related work Decision trees and their ensemble algorithms are considered to be suitable methods for pharmaceutical applications due to their excellent performance and good interpretability. We review existing decision trees or decision tree ensemble algorithms in the medical field and point out their shortcomings. Method This study proposes a decision stump (DS)-based solution to extract interpretable knowledge from data sets. In this method, a set of DSs is first generated to selectively form a decision tree (DST). Different from the traditional decision tree, our algorithm not only enables a partial exchange of information between base classifiers by introducing a stump exchange method but also uses a modified Gini index to evaluate stump performance so that the generation of each node is evaluated by a global view to maintain high generalization ability. Furthermore, these trees are combined to construct an ensemble of DST (EDST). Experiment The two-drug combination data sets are collected from two cell lines with three classes (additive, antagonistic and synergistic effects) to test our method. Experimental results show that both our DST and EDST perform better than other methods. Besides, the rules generated by our methods are more compact and more accurate than other rule-based algorithms. Finally, we also analyze the extracted knowledge by the model in the field of bioinformatics. Conclusion The novel decision tree ensemble model can effectively predict the effect of drug combination datasets and easily obtain the decision-making process.


Introduction
In recent decades, great advances in drug development have been made with the discovery and application of new therapeutic targets [1].However, because many diseases are complex and involve multiple target genes, a single therapy cannot cure the disease completely.Moreover, drug resistance in many cases is a major barrier to effective treatment due to the complexity of the disease.To overcome the limitations of monotherapy, combination therapy is considered a promising approach to achieving better disease control.While high-throughput screening methods can be effective in speeding up the identification of synergistic drug combinations, the increasing number of drugs every year makes it expensive and time-consuming to rely solely on experiments to determine the effect of each possible drug combination [2].So academics have begun to use methods in the field of statistics or computers to predict the most effective combination of drugs [3].
Having a complete drug combination database is the primary condition for analyzing efficacy, so some scholars have collected various existing experimental data and summarized them into the database [4,5].And to better help others understand the relationship between drug resistance and drug signature attributes, they also provide visualization tools for drug combinations in the database [6][7][8].With the support of databases, various technologies began to emerge.More and more machine learning techniques-based applications had been proposed from different aspects [9][10][11][12].For example, Julkunen et al. [13] proposed comboFM, which modeled multidirectional interactions between cell lines and dose-response matrices of two drugs.Shi et al. [14] combined one-class SVM to design a two-layer multi-class classification system integrating five types of features that can discover potential drug pairs among unknown drugs.There are also some scholars use semi-supervised heterogeneous network algorithms based on graph embedding to predict the combination patterns of drugs [15].However, these methods based on traditional machine learning often do not achieve better prediction results.To this end, some scholars have tried to give up interpretability in exchange for improved accuracy.
DeepSynergy [16] which was constructed by a feed-forward neural network with two hidden layers is one of the early deep learning algorithms used in the drug combination.MatchMaker [17] which contained three neural subnetworks was proposed by Kuru et al.Both DeepSynergy and MatchMaker have been used to predict drug synergies in recent years and proved popular.There are many similar deep learning-based algorithms used in drug prediction, and they usually have good performance [18][19][20][21][22].However, the hundreds of neurons and the complex network structure make their internal logic incomprehensible.People therefore unable to conclude whether the predictions given by these models in practical applications can be trusted, which is considered very dangerous.
In order to balance the performance and interpretability of the model, scholars have made many attempts, including the ensemble algorithm based on decision trees.This algorithm not only improves the model performance by means of ensemble, but also maintains the interpretability of the decision tree.That's why our algorithmic framework uses this approach.In addition, to avoid the problem of reduced interpretability of knowledge after integration due to excessive depth of decision trees.We replace the binary splitting of traditional decision trees with multi-branched decision stump merging.
This study proposes an interpretable method for drug combination analysis.In detail, an ensemble algorithm is designed using DS-based tree structures as base learners.Our method is applied to classifying the drug combination data and extracting interpretable rules simultaneously.The main contributions of the paper are: • The construction of a 1956-dimensional dataset of the drug combination in two cell lines.• An effective ensemble algorithm, which not only eliminates the class imbalance problem by balancing the sampling probability and pairwise classification scheme, but also makes the decision information more global by introducing the stump exchange strategy.• A new tree generation algorithm based on the combination of DSs, taking the influence of the entire data set into consideration with the trees growing in shallower depth and fewer leaf nodes to maintain high interpretability.• The knowledge extracted by the algorithm analyzed in the field of bioinformatics.

Application of decision trees
As a typical interpretable model, the decision tree has been successfully applied to different situations [23].For example, Deelder et al. [24] developed a customized decision tree method called Treesist-TB, which can detect genomic variants in individual studies within aggregated datasets and model variant interactions to predict TB drug resistance.Tayefi et al. [25] extracted rules from the decision trees to maintain high accuracy and strong ability in biomarker discovery.Narayanan et al. [26] proposed a new multivariate statistical algorithm, Decision Tree-PLS (DTPLS), which improves the prediction and understanding ability of models based on local partial least squares regression (PLSR).Azagury et al. [27] developed a decision tree-based machine learning model to capture drug pairs with biological synergy as well as synergistic chemical self-assembly.

Application of tree ensemble
The data distribution has a great impact on the decision tree structure, and the ensemble of decision trees tends to provide better stability.For example, Lu et al. [28] proposed a hybrid ensemble algorithm combining AdaBoost with a genetic algorithm for cancer gene expression data analysis.An et al. [29] developed a Network EmbeDding framework in mulTiPlex networks (NEDTP) and used it to predict novel drug-target interactions.This method first applies a random walk algorithm to the similarity network of drugs and proteins to extract and merge features in the network.Finally, drug-target interactions prediction is made using the GBDT model implemented by LightGBM using drug and protein signatures.Xuan et al. [30] proposed a new gradient boosting decision tree-based method named DTIGBDT and used it to predict drug candidate-target interactions.The algorithm divides the path between the drug and target into multiple classes through the topological information of the drug-target heterogeneous network and constructs a model based on gradient a boosting decision tree.Ma et al. [31] combined a random forest algorithm and Shapley Additive exPlanation to predict the response of hepatocellular carcinoma under combination therapy.Hadi et al. [32] investigated the combination of quantitative computed tomography parametric imaging with the AdaBoost decision tree to predict how LABC tumors respond to NAC.
However, some of these algorithms take lots of time because of their complexity, and some do not achieve ideal results in pursuit of interpretability.So we propose an algorithm with faster running speed and better interpretability under the premise of ensuring good performance.

Method
This section gives the details about the proposed adaptive ensemble method, as shown in Fig. 1.The right side of the figure shows the process of dividing the subdatasets through the one-vs-one method and training their decision trees for final voting.On the left is the process of data preprocessing and how to integrate the decision-making results of each subdatasets.The specific steps are as follows.Assume D = {(x p1 , x p2 , . . ., x pv , y p )} n p=1 represents the data set containing v features and n samples, where y p denotes the label of the p-th sample.Algorithm 1 describes the specific contents of the entire ensemble algorithm.In order to reduce redundancy and speed up operation, our algorithm first selects the features of the dataset in step 1.Unlike simple random sampling in traditional random forests, our algorithm generates subdatasets by the following sampling probability α: where N j represents the number of samples of the j-th class in dataset D and C repre- sents the number of classes in dataset D. By assigning different sampling probabilities to samples of different classes, the samples in minority classes get higher sampling probability, so as to overcome the class imbalance problem.In this way, 2 × m replicates give 2 × m different subsets of samples.Then, a stump-based decision tree (DST) is trained on each subset of samples and the top m trees that work best on the training set D are retained.Finally, the results of these m trees probabilistically vote to obtain the final output.
After introducing the overall framework of the integration algorithm, the specific process of generating a complete decision tree (DST) will be followed (steps 4-24).To increase accuracy for multi-classification problems, our algorithm first divides the dataset into several sub-datasets by sampling in the one-vs-one scheme, which pairs up c classes to generate c(c − 1)/2 binary classification tasks (steps 4-5).Then, the algorithm needs to generate a corresponding stump for each feature (step 6).Fig. 2 shows the general process of generating a DS on f i , where {(x pi , y p )|x pi = a} represents a set of sam- ples taking value a and L represents the proportion of relevant labels in the group.For the continuous feature f i = {x 1i , x 2i , . . ., x ni } T which denotes the i-th feature vector, all training samples are sorted by their values in f i , forming a set of intervals.Samples are (1)  divided into different intervals from smallest to largest according to their distribution.
The label that appears most often in each interval is set as the final label.After that, the labels of all the intervals are checked, and the adjacent intervals with the same label are combined to form a group.For example, the top green square contains all samples with the i-th feature value a, where label 1 and label 2 are 10 to 1.Then, by comparing the proportion of samples in the adjacent green squares, the algorithm determines that the samples contained in the x pi = a and x pi = b squares are of the same category, and can be combined.
Next, we draw on ideas from the Genetic Algorithm (GA) to cross-mutate stumps in different subsets of samples (step 7).Based on the classification results of the stumps, the similarity of the two stumps is calculated by Eq 2. The 50% stump with the highest sum of similarities is screened out, which may have less information.Since there are a total of 2 × m subsets of data, 2 × m sets containing the highest similarity stumps can be obtained.If the stump generated by the same feature appears in multiple sets at the same time, the algorithm randomly exchanges them so that information can be passed between different subsets of data.
After the stump exchange is completed, the algorithm evaluates the classification performance of their stumps and selects the stump with the best performance as the root node stump (steps 8-13).Here use the Gini index as the evaluation method for each node of the DS-based decision tree, and its formula is as follows: where c is the number of classes in the present sub-dataset, u is the number of leaves of the feature tree stump, and p ( j, z) represents the ratio of the number of samples in the j- th class to the total number of samples in the z-th leaf node.This Gini index is modified by removing the penalty on the number of samples in the leaf to split the samples better.Finally, the stump is spliced in a loop iteration to generate an entire decision tree (steps 14-21) until the sample contained in a branch is smaller than θ , which is set to n/10 in our algorithm.
In our algorithm, the use of DSs as the elements to build a decision tree makes the tree gain global information instead of local information on the data at each split, to enhance the tree's generalization ability.The results of each tree are fused by the majority soft voting strategy to get the final decision.In this way, the base classifier DST required for the ensemble is obtained. (2)

Datasets
In the dataset, a sample represent a drug combination on a particular cancer cell line, i.e., a drug combination-cell line pair.The samples are collected from DrugComb database (v1.4) [4].The feature data are cell line-specific drug-inducible gene expression data extracted from L1000 project of the LINCS database.LINCS L1000 is the expanded CMap (Connectivity Map) that can be used to discover mechanism of action of small molecules, functionally annotate genetic variants of disease genes, and inform clinical trials [33].In 2006, Lamb et al. piloted the CMap concept by treating cells with 164 drugs and tool compounds, and then performing mRNA expression profiling using Affymetrix microarrays [34].However, the small scale of CMap limited its utility [33].Therefore, the CMap team proposed a new approach, L1000, to obtain gene expression profiling based on a reduced representation of the transcriptome.L1000 is a low-cost, high-throughput method that only needs 1,058 probes for 978 landmark transcripts and 80 control transcripts, making it well-suited for a large-scale Connectivity Map.The first release of 1,319,138 L1000 profiles are termed CMap-L1000v1, which serves as the data source for our study.In this study, the cell line-specific drug-inducible gene expression data of 978 landmark genes from the LINCS L1000 database are used to construct the feature dataset.The 978 landmarks have been shown to be sufficient to recover 80% of the information in the full transcriptome by Subramanian et al.We first obtain the Level 5 data from LINCS L1000 project, which contain the z-scores of gene expressions with multiple doses and times.For the data of different doses and times of the same drug in Level 5, we get the unique gene expression by applying the moderated z-score approach, which is used to derive the consensus replicate signatures from Level 4 in LINCS L1000 project [33].More specifically, the z-scores are weighted and averaged according to Spearman correlations.Finally, to obtain the feature of a sample, we splice together the 978-dimensional gene expression profiles of two drugs in combination, resulting in a 1956-dimensional feature vector.Excessive dimensionality of features is also a common problem in drug combination datasets [9].This may increase the complexity of data processing and affect the prediction performance of the model.
The above-mentioned two major problems of unbalanced dataset samples and too high feature dimensions can be well solved in our algorithm through one-vs-one classification, probability sampling, and feature selection.

Classification performance
In this section, we verify the classification effect of our method with the above-mentioned dataset and compare it with other common algorithms and some tree-based methods, including decision tree, XGboost, traditional random forest, SVM, KNN, MLP, etc.The neural network model that is difficult to explain is not the focus of our attention.So we only choose Deep Synergy and MatchMaker, which are widely used for drug synergy prediction among the classifiers based on deep learning.In addition, in order to facilitate comparison, we also built a simple single-layer network structure with only the minimum parameter configuration.These machine learning methods are implemented through the scikit-learn.When the two deep learning methods of Deep Synergy and MatchMaker are used, we do not modify the default parameters except for changing the activation function from linear to softmax.To ensure fairness, we use the same sample set for all methods.We used 80% of the samples in the combination drug data of the two cell lines as the training set and the remaining 20% as the test set and adopted a five-fold cross-validation method.The specific files of the dataset and algorithm implementation can be found at https:// github.com/ chenj ing13/ EDST.Since the dataset contains a large number of addition effect samples, which are often the ones that we do not need to pay too much attention to, we focus on the performance evaluation indicators such as AUC, F1 score, and the minority class metrics of recall and precision, which are also calculated through metrics function in the scikit-learn library, the formula is as follows: where TP/FP indicates that the positive predictions are true/false, and TN/FN indicates that the negative predictions are true/false.rank i indicates the position of the i-th sam- ple sorted by probability from smallest to largest.These indicators are used because all of the above datasets are unbalanced, and the more important synergies and antagonisms account for a small number of samples in the total sample size.
Tables 2 and 3 show the results obtained from different cell lines.The F1 score* is calculated by the algorithm on the recall and precision metrics in the Antagonism and Synergy categories.For the classification results of different categories, the EDST model proposed in this paper outperforms other methods on most metrics, especially on AUC and F1 score.Furthermore, our method yields higher recall for both antagonism and synergy samples.Although performance on precision is not always the optimal, precision and recall have some degree of conflict, especially with unbalanced datasets.From the table, we can see that SVM, KNN, and so on all get a recall of the additive higher than 0.8.That is because they classify more samples as additive, which results in fewer minority classes being identified.It can partly explain why these algorithms may show slightly higher in the precision of minority classes than ours.So AUC and F1 score can be the good criterion in this case.Finally, to observe the effect of each algorithm more clearly, we rank it, where Rank represents the ranking of each algorithm on AUC, F1 score, and all classes of recall and precision, and Rank* indicates the ranking of each algorithm on AUC, F1 score*, and the recall and precision of the class with fewer samples.he smaller the Rank or Rank* of an algorithm, the better it is.We can see that whether on Rank or Rank*, our algorithm is ahead of others.All the hyperparameters used in the model are shown in Table 4.
(4)  While our method does not always perform better on summed samples, it can achieve better performance in the minority class.That is, our method slightly sacrifices the performance of the majority class to guarantee results for the minority class.The method achieves the highest AUC and F1 score, which further confirms that the method can handle the class imbalance problem well.

Performance of DS-based tree
Again, we are using the F1 score as a judging metric to compare the performance differences between the DS-based decision tree and the traditional one without an ensemble.The results are shown in Fig. 3.The F1 score of the DS-based tree is always better than that of the traditional decision tree on the dataset of both cell lines.This means that our algorithm performs better than traditional decision trees on the balance of predictions between the two cell line samples.

Advantages of the one-vs-one scheme
Table 5 gives the classification results of DST with and without the one-vs-one scheme.Whether it is in the F1 score or the recall and precision of minority classes, those that have used the one-vs-one scheme have better results than those that have not used the one-vs-one scheme, which indicates that some samples not identified by the feature stump can be partially correctly identified after using the one-vs-one scheme.

Table 4 Model Hyperparameters Number of trees 100
Proportion of the exchange of stumps 10% Maximum depth of the decision tree 10 Minimum number of samples for leaf nodes 10%

Advantages of the stump exchange module
Table 6 shows the performance comparison of the algorithm EDST with or without using the stump exchange module.We can see from the table that the EDST algorithm using the stump exchange module performs significantly better on the F1 score and the recall and precision indicators of one minority class, but does not significantly improve on the other minority class.We suspect that this may be due to the data of the Synergy class being insensitive to distribution.

Determination of feature selection number
To solve the high-dimensional problem of the data and reduce the time and storage costs of the algorithm, we added a feature selection module and found the optimal number of features through experiments.We used chi-square as the criterion for feature selection and used five-fold cross-validation to look at the relationship between the number of selected features and the F1 score, respectively.In the experiment, we first worked out the correlation between the F1 score and the number of selected features and found that increasing the number of features did not significantly improve the performance of the algorithm (Fig. 4).This shows that our algorithm can perform well using only very few features.On the other hand, more features would greatly reduce the efficiency of the algorithm.Then we found by ANOVA that when the number of features is set to around 300, the variance of the algorithm is small (Figs. 5, 6).We believe that too many or too few features may increase or decrease the selection range of tree node features, thereby increasing the instability of the tree.Therefore, 300 is finally determined as the number of features selected by the algorithm.

Evaluation of interpretability
To compare the interpretability of two trees with different growth patterns, we calculated the average maximum depth and number of leaf nodes for the two trees under fivefold cross-validation, respectively.The depth of the tree is used to represent the number of features used in a single decision, and a shallower tree means a shorter length of the generated rules.The number of leaf nodes can represent the complexity of the model to a certain extent.As can be seen from Table 7, the average maximum depth and number of leaf nodes of trees grown using our method are half of those of traditional decision trees.This means that DS-Based Trees have better interpretability than traditional decision trees.Next, we extract the rules of each tree according to the decision path and calculate the accuracy and coverage of each rule using the following formulas: Fig. 4 The relationship between the number of features selected and the F1 score in the EDST algorithm Fig. 5 The relationship between the number of features selected and the variance in the EDST algorithm on HT29 where R coverage and R accuracy represent the coverage and the accuracy of the rule R. The extracted rules are shown as "IF 0.59 < feature 1695 ≤ 0.66 AND 0.52 < feature 431 ≤ 1.00 THEN PREDICT Additive THAN Synergy; IF feature 1604 ≤ 0.37 AND feature 404 ≤ 0.71 THEN PREDICT Additive THAN Antagonism", which means features 1695 and 431 distinguish the sample as the additive instead of the synergy and features 1604 and 404 distinguish it as the additive instead of the antagonism.Finally, we rank all the rules using Eq. 8 and select the top to analyze their biological significance.For example, in A375 melanoma cell line, the most important rule extracted for classifying synergy and other classes is "IF 0.2088 < feature 1237 ≤ 0.3578 AND 0.3618 < feature 142 ≤ 0.6430 THEN PREDICT Synergy THAN Additive; IF 1.0 < fea- ture 1850 THEN PREDICT Synergy THAN Antagonism".It is observed that the differential expression of features 1237, 142, and 1850 is crucial in the A375 cell line.The features 1237, 142 and 1850 represent the expression value of gene KIT, STX1A, and UFM1, respectively.Among them, genes KIT functions in the regulation processes of cell proliferation, migration, stem cell maintenance, differentiation and the occurrence of melanoma and some other cancers [35].It is reported that KIT might (8) score = 2 * R coverage * R accuracy R coverage + R accuracy , Fig. 6 The relationship between the number of features selected and the variance in the EDST algorithm on A375 be an important tumor-promoting factor that associated with metastasis and overall poor prognosis in the A375 cell line [36].The differential expression rule of gene KIT extracted by EDST might be the key factor in the study of the A375 melanoma cell line.In HT29 colorectal cancer (CRC) cell line, the rule of "IF 0.6469 < feature 1305 ≤ 0.7266 THEN PREDICT Synergy THAN Additive; IF 0.7260 < feature 1225 ≤ 1.0 THEN PREDICT Synergy THAN Antagonism" is extracted as an important rule to the synergy prediction process.The features 1305 and 1225 represent the expression value of genes DDR1 and NFKB2.DDR1 has been shown to be highly expressed in most colon adenocarcinomas and appears as an indicator of worse event-free survival [37].Arfi et al. suggested that the frequent high expression of DDR1 in colon cancer can be explored as a potential therapeutic target in this indication [37].

Gene ontology (GO) biological processes and KEGG pathway enrichment
To explore the influence of drug-induced gene expression on synergy, the key features affecting the prediction process are selected, and the genes involved in the key feature subsets are further investigated in this subsection.The importance of each feature for prediction is determined by the selected frequency in the EDST.Based on the contribution value, the features ranked among the top 217 or 223 are selected on A375 or HT29 cell line respectively.The extracted features make about 90% contribution to the two cell lines.To summarize the characteristics of genes involved in these contributing features, the Gene Ontology biological processes and KEGG pathway enrichment among these features are investigated.The top 10 enrichment results are shown in Fig. 7

Leave drug combinations out
The leave drug combinations out method is a common cross-validation strategy in the field of bioinformatics [16].In this method, the dataset is first divided into t groups according to drug types.Then, when dividing the training data and test data for each group, all samples containing the drugs in the group are used as the test data, and the others are used as the training data.In this way, we can evaluate the performance of the model in the presence of unknown drugs.In this experiment, we set t to 6.The results of our algorithm are compared with those of other algorithms, which are shown in Tables 8 and 9. From these tables, we can see that our algorithm is comparable to or better than other algorithms in the AUC and F1 score and also ahead of other algorithms in the ranking of multiple indicators.This experiment illustrates from another perspective how our algorithm performance is due to other algorithms.• thorough peer review by experienced researchers in your field • rapid publication on acceptance • support for research data, including large and complex data types • gold Open Access which fosters wider collaboration and increased citations maximum visibility for your research: over 100M website views per year

•
At BMC, research is always in progress.

Learn more biomedcentral.com/submissions
Ready to submit your research Ready to submit your research ?Choose BMC and benefit from: ?Choose BMC and benefit from:

Fig. 1
Fig. 1 Overall flow chart of EDST algorithm

Fig. 2
Fig.2 An example of the tree stump generation

Fig. 3
Fig. 3 DS-based tree and traditional decision tree F1 score comparison (adjusted P-value < 0.05 ).On the A375 cell line, the most enriched biological processes are response to nutrient levels, axonogenesis and positive regulation of neurogenesis.The most enriched pathways are the Chemokine signaling pathway, Shigellosis and Focal adhesion.But on the HT29 cell line, the enrichment results are different.The most enriched biological processes are regulation of cell cycle phase transition, positive regulation of cell cycle and cell cycle G2/M phase transition.The most enriched pathways are the PI3K-Akt signaling pathway, Epstein-Barr virus infection and Human papillomavirus.
To predict cell line-specific synergistic drug combinations, samples from different cell lines are modeled separately.For a sample, DrugComb contains four types of synergy scores, including Bliss Independence, Highest Single Agent, Loewe Additivity, and Zero Interaction Potency, where positive values represent synergy and negative values represent the antagonism of the drug combination.In this study, samples are divided into three categories (Synergy, Antagonism, and Additive) based on all mentioned four synergy scores.Synergy or Antagonism represent those samples whose four synergy scores are both positive or negative numbers, respectively, while the remaining samples are classified as Additive.Then, the samples with missing features are removed.Finally, two typical cell lines, HT29 colorectal cell line and A375 melanoma cell line, that have the largest sample capacity are selected to construct the data set.Table1lists the number of samples in each class and each cell line.It can be observed that the number of samples in the Additive class is greater than the sum of the samples of the other two classes.This problem will affect model training and prediction.

Table 1
Data distribution in each cell line

Table 2
Classification results of algorithms on the colorectal cell line dataset HT29

Table 3
Classification results of algorithms on the melanoma cell line dataset A375

Table 5
Dst module with and without using the one-vs-one scheme on HT29 and A375

Table 6
EDST algorithm with and without using the stump exchange module on HT29 and A375

Table 7
Size comparison of the two trees

Table 8
Classification results of algorithms using leave drug combinations out on HT29

Table 9
Classification results of algorithms using leave drug combinations out on A375The symbol [bold] indicates the highest value in the same evaluation indicator