DPI_CDF: druggable protein identifier using cascade deep forest

Background Drug targets in living beings perform pivotal roles in the discovery of potential drugs. Conventional wet-lab characterization of drug targets is although accurate but generally expensive, slow, and resource intensive. Therefore, computational methods are highly desirable as an alternative to expedite the large-scale identification of druggable proteins (DPs); however, the existing in silico predictor’s performance is still not satisfactory. Methods In this study, we developed a novel deep learning-based model DPI_CDF for predicting DPs based on protein sequence only. DPI_CDF utilizes evolutionary-based (i.e., histograms of oriented gradients for position-specific scoring matrix), physiochemical-based (i.e., component protein sequence representation), and compositional-based (i.e., normalized qualitative characteristic) properties of protein sequence to generate features. Then a hierarchical deep forest model fuses these three encoding schemes to build the proposed model DPI_CDF. Results The empirical outcomes on 10-fold cross-validation demonstrate that the proposed model achieved 99.13 % accuracy and 0.982 of Matthew’s-correlation-coefficient (MCC) on the training dataset. The generalization power of the trained model is further examined on an independent dataset and achieved 95.01% of maximum accuracy and 0.900 MCC. When compared to current state-of-the-art methods, DPI_CDF improves in terms of accuracy by 4.27% and 4.31% on training and testing datasets, respectively. We believe, DPI_CDF will support the research community to identify druggable proteins and escalate the drug discovery process. Availability The benchmark datasets and source codes are available in GitHub: http://github.com/Muhammad-Arif-NUST/DPI_CDF.


Introduction
The Human Genome Project has enabled the discovery of new drug targets by identifying macromolecules likewise genes and proteins that are often involved in disease processes [1].Proteins are the most common druggable targets for drug development [2] because approximately 95% of known drug targets are proteins and over 92% of known drug-target interactions involve these organic molecules [3].The involvement of proteins in biological processes are essential for the understanding of cellular functions.Proteins can be broadly categorized into enzymes, receptors, ion channels, transporters and structural proteins based on their highly diverse structure and functions [4].Among these protein classes druggable proteins (DPs) has unique properties that make them attractive target for the drug discovery for the treatment of many diseases, including cancer, genetic disorders, chronic disease, blood pressure, cardiovascular diseases, etc [5].Analyzing the biochemical characteristics DP sequences, for example its binding affinity or enzymatic activity could help the researcher to provide insights about the protein interaction with other molecules inside the cell [6].Thus, investigating about DPs and non-DPs is crucial to accelerate the drug development process for curing multiple diseases.
In recent decades, the researchers have been characterizing the DPs through wet-lab experiments such as mass spectrometry, X-ray crystallography, nuclear magnetic resonance (NMR), etc [7,8].These wet lab experiments for determining DPs and non-DPs are precise, but time-consuming, resource intensive and expensive due to the nature of experiments as well as the huge abundance of un-annotated proteins in databases.Moreover, development pipeline for a novel drug can be a long and expensive process, with an average development time of over 12 years and a cost of around 2.6 billion USD [9].Furthermore, only a small percentage of drug development plans eventually result in licensed drugs, with estimates ranging from 4% to 12% [10,11].Hence, computational techniques led the researchers to use machine learning and deep learning algorithms as an alternative for analyzing large-scale druggable proteins data with improved accuracy.
Over the past few decades, considerable research attention has been directed toward a variety of computational methods to identify the distinctive characteristics of DPs vs. non-DPs.For example, DrugMiner [12], Sun et al. [13], GA-Bagging-SVM [5], DrugHy-brid_BS [14], Yu et al. [15], XGBDrugPred [16], MS Iraji et al. [17] and SPIDER [18] are the proposed predictors for discriminating DPs from non-DPs.The pioneering work along this line was conducted by Jamali et al. [12] in 2016 and constructed a bioinformatics protocol called DrugMiner for the prediction of DPs, using multiple discrete features in conjunction with neural network classifier.The proposed model achieved over 92.10% accuracy (ACC).However, loss of sequence order information and sequencelength effects are the main shortcomings of the proposed method [19].Afterword, Lin et al. [5] enhanced the performance by developing GA-Bagging-SVM for DPs prediction.Lin et al. first extracted the local and global feature vectors using reduced sequences, pseudo amino-acid-composition (PseAAC) and dipeptide composition (DPC).Then optimal features were selected through genetic algorithm (GA) and proposed support vector machine (SVM) based model by bagging-based ensemble strategy.Similarly, Gong et al. [14] designed hybrid-based predictor called DrugHybrid_BS using grouped amino-acid-composition, monoDIKgap and cross-variance with ensemble learning engine and achieved 97% of accuracy.Furthermore, Yu et al. [15] developed the first deep learning model to improve the overall performance of the DPs by incorporating sequence and dictionary features with ensemble convolutional recurrent neural network model(CNN-RNNs).The Yu's model attained the 89.80% of ACC and 0.799 MCC on independent dataset.Recently, R.Sikander et al. [16], proposed machine learning model called XGBDrugPred by utilizing group di-peptide composition (DPC), reduced amino acid composition and serial-PseAAC features with extreme gradient boosting classifier.More recently, P Charoenkwan et al. proposed an effective meta-learning based classifier SPIDER, stacked predictor of druggable proteins, which predicted DPs and non-DPs with high accuracy than other existing methods [18].Table 1 summarized the precedents of druggable protein prediction from literature.
Although each of the above mentioned predictors in (Table 1) has demonstrated a significant contribution to the prediction of DPs, however challenges remain unsolved.For instance, many existing predictors have relied upon the conventional sequence-based feature including mono-, di-, tri-peptide composition, and physicochemical properties.But these methods were unable to explore the evolutionary profile and structural properties of druggable protein sequence.Secondly, only two previous studies (Yu et al. [15], SPIDER [18]) performed an independent test evaluation to verify the generalization capability of their proposed methods.Thirdly, the overall performance of the previous models for DP prediction was not satisfactory indicating the room for improvement in the prediction capability.In the present article, we proposed a machine learning based predictor DPI_CDF for highly accurate identification of DPs and non-DPs based on novel combination of evolutionary-, physicochemical-and sequence-based feature of protein sequence.Our contribution can be briefly summarized as follows: (a) We proposed new set of feature descriptor to capture the evolutionary-, sequential-and physicochemical-based patterns from a given protein sequence.

Materials and methods
To develop DPI_CDF, we considered existing benchmark datasets of protein sequence that are already published in literature.Then we encoded the biological protein sequences into fixed length feature vector based on the compositional, physicochemical, and evolutionary properties of amino acids.Then, machine learning models were developed for the prediction of druggable protein.Finally, we evaluate the proposed model based on cross-validation and compared the performance of the proposed model against exiting methods.Figure 1 depicts the schematic diagram of the workflow for the development of DPI_CDF.

Dataset collection
We considered the available dataset from Jamali et al. [12].This dataset contains 1223 sequence that were considered as druggable protein sequence.It also contains 1319 sequence that are considered as non-druggable.We consider this dataset to develop machine learning model for DPI_CDF.Moreover, we consider another dataset from Yu et al. study [15] as independent dataset to determine the generalization power of the trained predictor.The independent dataset contains total of 461 sequence, of which 224 were labelled as druggable and 237 were labeled as non-druggable.Table 2 summarizes the number of samples from both datasets.

Feature encoding
Feature encoding schemes are challenging task used to formulate a biological sequence into fixed length numerical feature [19][20][21].In the present work, we considered physiochemical, compositional, and evolutionary-based algorithms to tackle this problem.The details of each feature descriptor are explained below.

Position-specific scoring matrix representation of druggable protein
The evolutionary conserved reign of amino acid residues is encoded by a technique called position-specific scoring matrix (PSSM).It has been observed that PSSM has been successfully improved the model prediction in divers bioinformatics problems for example prediction of protein folding [22], antifreeze protein identification [23] and prediction of DNA-binding protein [24].Motivated by these precedents, we considered PSSM to encode DP sequences into feature vector.PSSM generates the corresponding featurespace of 20 attributes and M rows for an input sequence in PSI-BLAST [25] PSI-BLAST compare the DP protein evolutionary information with default parameters in Swiss-Prot databank [26].Then, the obtained PSSM is normalized by the following mathematical formula: where a represent the actual value of PSSM.Then we considered the PSSM from each protein to generate feature vector.

Histogram of oriented gradient-based extraction of PSSM
Histogram of Oriented Gradient (HOG) is widely used as a feature extractor for object detection problem in computer vision [27,28].The HOG-based methods provided better results compared to the existing wavelet-based methods for extracting feature from input image [29,30].In this study, we consider the PSSM to retain the biological evolutionary information of a sequence and applied HOG encoding method for transforming the PSSM into an HOG-PSSM.We briefly describe the steps for generating HOG based feature from protein sequence below: Firstly, we calculated the horizontal gradient G x (i, j) and vertical gradient G y (i, j) of the PSSM image by following equations: (1) Then, we determined the magnitude as well as the direction of gradient based on the following equation: where G(i, j) and �(i, j) are the gradient magnitude and gradient direction matrices of L × 20 size.Then, we decomposed the image into fixed sized connected region called "cells".Each cell retained the feature set of magnitude and direction of graident inside the sub-matrix.

Normalized qualitative characteristic feature
Qualitative characteristics feature (QLC) [31] considers the physicochemical properties of proteins which are distributed globally in a protein sequence.QLC considered hydrophobicity, charge, predicted secondary structure, polarizability, polarity, normalized Van der Waals volume, and solvent accessibility as the seven physicochemical attributes of the AA residues to categorize them into three groups [31] (Details are in Additional file 1: Table T1).The QLC descriptor encode the composition, distribution and residue wise transition of the protein based on three indexes, namely C (composition) index, D (distribution) index, and T (transition) index.Therefore it is also named as composition, transition and distribution (CTD) [32].The C index characterizes the percent composition of each group of AA residues (based on physicochemical properties) in protein sequence; the T index (transition) signifies the transition likelihood between two adjacent residues of proteins associated with dissimilar properties; and the D index computes the distribution of AA residues along the sequence of each group in percent (25%, (2) 50%, and 75% or 100%), respectively [33].For each protein, C, T and D index generated 21, 21, and 105 dimensions of features, respectively.Thus, the resultant dimension of the feature was 147 for each protein sequence [32].Then we normalized the values within the range of [0, 1] using the following formulation to generate normalized QLC (NQLC): Where x i denotes the physicochemical features values of j_th (j=1, 2, 3 â€¦20) AA resi- dues.x¯ denote the mean value and std(x) denote the deviation from mean of 20 AAs.y i represent the resultant normalized value.

Composite protein sequence representation
Composite Protein Sequence Representation (CPSR) descriptor is adopted to encode the prominent physicochemical properties from DPs sequences.The AA residues in proteins possesses unique physiochemical properties [34] that play a vital role in different protein function prediction problems [35,36].CPSR-derived method has also been used in our previous studies for encoding anticancer proteins, membrane proteins etc [37,38].We have used seven different types of physicochemical properties of DP sequence (Table 3 ).
(a) Amino Acid Composition (AAC) For encoding protein sequence, AAC is considered the simplest formulation method.AAC counts the frequency of 20 residues in a proteins sequence and normalized its values.Resultantly, ACC generated a 20D vector of protein sequence.(b) Sequence Length (L) The total number of native AAs in the given protein sequence is defined as L. (c) 2-Gram Exchange Group Frequency The composition of the bi-gram exchange group plays a crucial functions in encoding the protein sequence.The exchange groups consider broad categories of AA residues that form clusters based on evolutionary effects [39].Thus, by computing the frequency of each possible bi-gram pair, thirty six features of 36-D were generated from its equivalent 6-letter exchange group of AAs.We have provided more detail about cluster pairs of AA in Additional file 1: Table T2.
( (d) Electron Group Based on the electron properties of AA, the 20 AA molecules can be broadly divided into six groups, i.e., acceptor or donor, electrically special and neutral AA, weak electron acceptor or donor, electron acceptor or donor electron donor or acceptor [38].For each protein sequence, we counted the number of AA from each group and represent it as a 6-D protein feature vector.(e) Rigidity We encode the rigidity of each protein sequence to describe protein structure static attributes under the impact of extrinsic factors.For each AA of a protein sequence, we summed the rigidity score and normalized by protein length, generating a 1-D feature vector.(f ) Flexibility "The flexibility of protein occurs universally at the level of AA sidechains and crucial for catalysis and binding function" [39].For each AA of a protein sequence, we summed the flexibility score and then normalized by protein length, generating a 1-D feature vector.(g) Irreplaceability The irreplaceability is a response to mutation deterioration during the evolution of life.To compute the irreplaceability of AA residues in protein, we summed the flexibility score and then normalized by protein length, generating a 1-D feature vector.(h) R-group The AA residues in a protein sequence possess a unique chemical side chain but similar functional group.The R-group categorize the druggable protein sequence based on sub-families of AA's and generate a 5D feature vector.The five categories are provided in the Additional file 1: Table T3.

Hybrid feature composition
Single set of feature may fail to capture enough attributes from protein sequence to build a generalized model [40].In order to bring better complimentary information from several sets of feature vectors, feature hybridization is a crucial strategy [41].Inspired by this, we adopted a serial feature hybridization technique to enhance the prediction capability of the learning algorithm.We merged HOG-PSSM, NQLC, and CPSR encoders to propose hybrid features for the model development.We considered three different hybrid features sets namely Hybrid1, Hybrid2 and Hybrid3.Hybrid1 combined the evolutionary profile and physicochemical based-feature of CPSR and HOG-PSSM and generate a feature space of 327D.Hybrid2 combined the compositional and evolutionary profile-based feature of NQLC and HOG-PSSM to form a feature space of 403D.Hybrid3 considered all features from CPSR, NQLC and HOG-PSSM to encode the protein sequence generating feature vector of 474D.

Cascade deep forest-based predictor development
The cascade deep forest (CDF) is an ensemble-based framework inspired by Zhou et al. [42] model, to the serves as a substitute for deep neural networks (DNNs) [43].In recent research, CDF model became a has become a dominant learning algorithm in wide range of domains like pattern recognition [44,45], and bioinformatics [46].CDF model structure is an ensemble of trees hierarchically sequenced in multiple layers [47].The top-down architecture of CDF enables the classifier ideal for training even limited number of samples [48].Furthermore, Zhou and Feng pinpointed in their pioneering work that DF is much easier in tuning the hyperparameter compare to DNN [48].Considering this, an improved version of CDF were developed containing an ensemble of RF [49], XGBoost [50], and extremely randomized trees (ERT) classifiers [51] to build DPI_CDF.Each layer of DPI_ CDF is composed of four learners of XGBoost, RF and ERT machine learning classifiers that take the feature-vector of the previous layer.The previous level's class probability is then passed on to the next layer.In order to produce the augmented attributes, the related heterogeneous feature vectors are merged, averaged and the maximum probability values is generated as output.The hyper parameter of the models were tuned using GridSearchCV and the selected parameters are added in Additional file 1: Table T6.The node split attributes were selected by randomly selecting features, where √ d is the total number of features.Figure 2 shows the layer-by-layer architecture of the DPI_CDF.

Performance evaluation metrics
To measure the prediction performance of binary class, we use four performance evaluation metrics: sensitivity (SEN), accuracy (ACC), specificity (SPE), and Matthew's correlation coefficient (MCC).These measures are mathematically formulated as follows:  In the above Eqs.(09-12), tp represents the correctly predicted DPs,tn represents the correctly predicted non-DPs.Whereas fn represent the incorrectly predicted DPs and in contrast the fp represent the non-DPs predicted incorrectly by the model.The abovementioned performance metrics are threshold dependent.In order to shed light the proposed model efficacy an independent evaluation measure receiver operating characteristic (ROC) curve (AUC) was adopted [52].

Model evaluation
The performance of machine learning models were evaluated based on k-fold cross-validation (CV).In k-fold CV, benchmark dataset was divided into k subsets (folds) of nearly same size.Of the all the k folds, the k-1 folds were used for training the model and the remaining one is taken for testing the model [53].In this work, we used 10 fold CV to evaluate the generalization power of the model based on Jamal et al. dataset [12] and an independent dataset (Yu's dataset [15] ) was used for examining the model performance.

Results and discussion
Performance of DPI_CDF using various feature descriptors on training (DP train ) and testing In this section, we analyzed the efficacy of four classification algorithms including multilayer perceptron (MLP), ERT( Extra Tree Classifier), XGBoost and DPI_CDF using various single-view descriptors, i.e., CPSR, NQLC, and HOG-PSSM and series combination of multi-view descriptors i.e., Hybrid1, Hybrid2 and Hybrid3.The classifiers were trained using 10-fold CV on DP train dataset and evaluated on DP ind independent dataset with five evaluation measures AUC, SEN, SPE, MCC and ACC.We can comprehensively analyze several observations from Table 4 as follows; First, in case of individual feature space, HOG-PSSM produces outstanding prediction results on cascade deep forest classifier which are mean ACC of 94.77% and MCC of 0.895.The second-best performer on HOG-PSSM is XGBoost learning engine which attain 93.63% of ACC and 0.876 of MCC respectively.However, in contrast it achieves worst predictions on MLP classifier i.e., ACC=79.23% and MCC =0.594.The CPSR encoding method comparatively generates satisfactory results on classifiers.Secondly, to improve the prediction performance of the proposed model, feature fusion strategy was employed.It is clear from empirical results in Table 4 that our proposed DPI_CDF model train on hybrid features particularly Hybrid3 (HOG-PSSM+CPSR+NQLC) features produce superior results than single-view descriptors on all evaluation indicators ACC, MCC, SEN and SPE.The highest success rates in terms of ACC=99.23% and MCC=0.99 are obtained by DPI_CDF using Hybrid5 feature set.On the other hand, ERT classifier performed over all poor predictions on the hybrid feature sets.We also performed the 5-,-6,and -8fold cross validation (11 on the training dataset and the results are highlighted in Additional file 1: Table T4.Using 10-fold CV we got the best results.In order to determine the proposed model prediction power, an independent or blind test is generally conducted.Table 5 illustrates the performance of all classifiers using various feature encoding methods on independent dataset.We can easily see that DPI_CDF with Hybrid1 (HOG-PSSM + CPSR) feature set achieve best performance on all evaluation metrics i.e., ACC=95.01%,MCC=0.900,SPE=93.24,AUC=0.986, and SEN=96.87%.The second best model is XGBoost classifier that achieved comparatively consistent results than MLP and ERT on various feature encoding schemes.Moreover, we have added the confusion matrix(TP, TN, FP, and FN) predictions of the proposed DPI-CDF model in Additional file 1: Table T4 and T5.
We also generated the receiver operating characteristics (ROC) curve for the proposed DPI_CDF model on training and independent set (Fig. 3).We can observe that the model with Hybrid3 based feature combination achieved the highest AUC for training and test set with 0.998 and 0.979, respectively.

Comparison with previous predictors
We comprehensively compared DPI_CDF with previously developed sequence-based computational models including DrugMiner [12], Sun's Method [13], GA-Bagging-SVM [5], YU's Method [15], XGB-DrugPred [16], and SPIDER [18] for characterizing and identifying DPs and non-DPs.It is worth noting that among these approaches, only two predictors i.e.Yu's Method [15] and SPIDER [18] were examined on both training DP train and testing DP ind datasets.The comparison outcomes of past studies over training and testing datasets are reported in Tables 6 and 7, respectively.The high prediction value of each criterion is presented in bold fonts.It is clear from  From aforementioned discussion, it can be concluded that the proposed method for determining the proteins druggability is far superior to all the available computational methods.Figure 4 highlights the performance evaluation metrics for DPI_CDF along with other existing methods for predicting druggable proteins.We can observe that both in training and independent set, DPI_CDF outperformed the existing methods.The proposed model attained the highest ACC and MCC of 95.01% and 0.949 respectively on independent test.

Visual analysis and explanation of the proposed features
In order to interpret the impact of engineered features, we used two dimension scatters plot t-SNE [54] and SHAP to visualize the distribution of extracted single-view features (CPSR, NQLC and HOG-PSSM) and multi-view features (Hybrid1, Hybrid2, and Hybrid3) on training dataset (Fig. 5).
In Fig. 5, the green dots represent the non-DPs and red dots represent DPs.Furthermore, SHAP (Shapley Additive exPlanations) method [55] was used to elucidate the relative contribution of each feature in model performance (Fig. 6).It is clear from the Fig. 6 that the positive and negative SHAP values for the top ranked features favored the prediction performance of DPs and non-DPs, respectively.Majority of the  top ranked features particularly HOG-PSSM51 and HOG-PSSM180 captured the key DPs attributes and had positive SAHP values, the model predicted a protein sequence as DP; otherwise a protein sequence was predicted as non-DP for negative SHAP values.We also noticed that among the toped ranked attributes, CPSR17 and NQLC75 from CPSR and NQLC, respectively contributed to boosting the performance of DPI_ CDF.Thus, it is evident that feature fusion strategy helped to enhance the prediction capability of the proposed DPI_CDF model.

Conclusion
Identification of drug targets is crucial for pharmaceutical industries to design new efficacious drugs.In this work, we have designed a novel high-throughput model, DPI_CDF, for screening proteins with druggable activity.To our best knowledge, DPI_CDF is the first ensemble-based method based on evolutionary, physiochemical and compositional feature vectors for characterizing and discriminating DPs and non-DPs.Experiment outcomes on the benchmark datasets anticipate that our proposed predictor attained superior performance in druggable target prediction and surpassed all the existing sequence-based DP prediction tools.Additionally, the DPI_ CDF protocol shows excellent efficacy due to multiple reasons.(a) The new encoding schemes were designed to dig out the prominent information from the biological Our work has some limitations that need to be mentioned.We considered hand curated features to encode protein sequences which requires domain expertise.We did not provide our methods as a web server or binary tool for the users.In future we will consider different combinations of novel features to encode proteins which may further improve the performance of DPI_CDF.In future we will try to collect more data and manually curate them to provide a high-quality larger dataset for this particular problem.We will also explore deep learning-based models, e.g., RNN, LSTM, B-LSTM, etc. to improve the performance of the predictor on large-scale un-annotated proteins.

Fig. 2
Fig. 2 The proposed architecture of DPI_CDF classifier Figure 5A-C are single-view descriptors, indicating that HOG-PSSM shows sharp distinction between the distribution of green and red dots (Fig. 5C) which significantly contribute to predicting DPs.Similarly, Fig. 5D-F are mixing different feature combination of (evolutionary +physicochemical) Hybrid1, (evolutionary + compositional) Hybrid2, and (evolutionary +physicochemical + compositional) Hybrid3.The plotting distribution of Hybrid1 feature in Fig. 5D seems more distinguishable than the other fused features indicate that Hybrid1 explore the biological region of DPs.The combination of physicochemical and evolutionary-based attributes is more effective in designing DPI_CDF model for DPs and non-DPs classification.

Fig. 4
Fig. 4 Performance comparison of DPI_CDF with existing DPs predictors over training (A) and testing (B) dataset

Fig. 6
Fig. 6 SHAP analysis for the top ranked 25 features for DPI_CDF

Table 1
Summary of the existing works on druggable protein prediction

Table 2
Dataset summarya N_Pos, N_Neg represent the total number of positive and negative sequences, respectively

Table 3
CPSR-based feature encoding

Table 4
Performance of various feature descriptors on DPtrain benchmark dataset using 10-fold CV test

Table 6
that DPI_CDF achieved highest performance on training dataset in terms of ACC 99.13%, MCC of 0.982, SPE

Table 5
Performance of various feature descriptors on independent dataset DPind ROC curves of DPI_CDF model using various feature encoding methods on the training (A) and testing (B) datasets of 99.69%, F-score of 0.999 and SEN of 98.52% which are 7.23%, 14.3%, 5.49%, 8.5% and 9.02% higher than recent state-of-the-art SPIDER method.Furthermore, to demonstrate the generalization power of DPI_CDF on unseen data, independent test set results are reported in Table6.The prediction outcomes in Table6reveals that DPI_CDF attained

Table 6
Performance comparison of DPI-CDF predictor with existing methods on training dataset

Table 7
Performance comparison of DPI-CDF predictor with existing state-of-the-art methods on independent dataset DP ind