PDRLGB: precise DNA-binding residue prediction using a light gradient boosting machine

Background Identifying the specific residues involved in protein-DNA interactions is of considerable importance for understanding the binding mechanism of protein-DNA complexes. Although many computational DNA-binding residue prediction approaches have been developed, there is still significant room for improvement in overall performance and availability. Results Here, we present an efficient approach termed PDRLGB that uses a light gradient boosting machine (LightGBM) to predict binding residues in protein-DNA complexes. First, we extract 913 sequence and structure features with a sliding window of 11. Then, we apply the random forest algorithm to sort the features in descending order of importance and obtain the optimal subset of features using incremental feature selection. Based on the selected feature set, we use a light gradient boosting machine to build the prediction model for DNA-binding residues. PDRLGB shows better overall predictive accuracy and shorter training time than other widely used machine learning (ML) methods such as random forest (RF), AdaBoost and support vector machine (SVM). We further compare PDRLGB with various existing approaches on independent test datasets and show improved results over existing state-of-the-art approaches. Conclusions PDRLGB is an efficient approach to predict specific residues for protein-DNA interactions.


Introduction
The protein-DNA interaction is one of the central issues in molecular biology and occurs widely in biological activities in living organisms, such as DNA replication, repair, and modification. To understand the recognition mechanism of protein-DNA complexes, researchers often focus on protein-DNA binding sites, especially the interface residues that bind DNA. Experimental approaches such as electrophoretic mobility shift assays (EMSAs) [1,2], conventional chromatin immunoprecipitation (ChIP) [3], X-ray crystallography [4], PNA (peptide nucleic acid)-assisted identification of RNA binding proteins (PAIR) [5], and NMR spectroscopy [6] have been applied to identify DNA-binding amino acids. However, these laboratory methods are expensive and time-consuming. Low-cost and efficient computational methods are therefore particularly important for discovering specific interface residues of protein-DNA complexes.
*Correspondence: hliu@cczu.edu.cn. Lab of Information Management, Changzhou University, 213164, Changzhou, China. Full list of author information is available at the end of the article.
A number of computational approaches have applied machine learning algorithms to build prediction models based on sequence and structural information. Wei [7] proposed novel evolutionary features for the prediction of DNA-binding proteins. Jones and coworkers [8] proposed a simple method to identify DNA-binding residues using the positive electrostatic patches on the protein surface. Ahmad et al. [9] developed a neural network classifier to predict DNA-binding residues using a variety of composition, sequence and structural information. Wang et al. [10] built SVM-based models to predict DNA-binding residues from data examples represented with three sequence characteristics. Ferrer-Costa et al. [11] implemented an effective linear predictor to determine the DNA-binding sites in protein sequences. Yan and coworkers [12] trained a Naive Bayes classifier to predict whether a given amino acid is a DNA-binding site based on its characteristics and the features of its sequence neighbors. Wang and Yang [13] developed a random forest (RF) classifier that uses evolutionary information to detect DNA-binding sites. Song et al. [14] employed imbalanced classification techniques for this problem. Carson et al. [15] combined the C4.5 algorithm with bootstrap aggregation and cost-sensitive learning to identify binding residues in protein-RNA complexes. Zou et al. [16] focused on feature selection techniques and improved the performance. Ozbek et al. [17] presented a prediction method based on residue variations in high frequency modes using the Gaussian network. Other protein-DNA binding residue prediction tools such as DR_bind [18] and PreDNA [19] have also been developed.
Although many studies have been performed, there is still considerable room for improvement in accurately identifying protein-DNA binding sites. Firstly, effective features for distinguishing DNA-binding interface residues from non-binding amino acids have not been fully exploited. Secondly, the numbers of DNA-binding and non-binding amino acids in proteins are highly imbalanced, which can cause over-fitting and poor performance in the prediction of DNA-binding amino acids.
In this work, we develop an innovative computational pipeline, named PDRLGB, for predicting interface residues in protein-DNA complexes. We extract many sequence and structure features and use random forest to select a subset of optimal features. Based on the selected features, we train the DNA-binding residue prediction models using a new implementation of gradient boosting decision trees (LightGBM) [20]. Our experiments show that PDRLGB significantly outperforms other state-of-the-art DNA-binding residue prediction approaches.

Datasets
To assess the performance of the PDRLGB method and other existing approaches, two benchmark datasets (PDNA-62 and PDNA-224) and two independent datasets (TS-72 and TS-61) are used. PDNA-62 was built by Ahmad et al. [9]. It consists of 67 sequences obtained from 62 protein-DNA complexes in the Protein Data Bank (PDB) [21], and the sequence identity between any two sequences is ≤ 25%. PDNA-224 was generated by Li et al. [19]; it contains 224 proteins, and redundant sequences were removed using a sequence identity cutoff of 25%. The independent test dataset TS-72 was extracted by Ma et al. [22] and contains 72 protein chains. TS-61 was constructed by Zhou et al. [23]. Redundant proteins were removed using CD-HIT [24], and the remaining 61 non-redundant DNA-binding protein sequences have ≤ 30% sequence identity with the protein sequences in PDNA-62, PDNA-224, and TS-72.
Consistent with previous studies [10,18], a residue of a protein is defined as a binding amino acid if the closest distance between its atoms and the atoms of the bound DNA is ≤ 3.5 Å. The numbers of positive and negative samples in the four datasets are summarized in Table 1.
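The distance-based labeling rule can be illustrated with a minimal sketch. It assumes that atom coordinates have already been parsed from a PDB file into plain (x, y, z) tuples; the function names and the toy coordinates are hypothetical and are not part of PDRLGB.

```python
import math

CUTOFF = 3.5  # distance threshold in angstroms, as stated in the text


def min_distance(atoms_a, atoms_b):
    """Smallest Euclidean distance between any atom of atoms_a and any atom of atoms_b."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def label_residues(residues, dna_atoms, cutoff=CUTOFF):
    """Label each residue 1 (binding) or 0 (non-binding).

    residues: list of residues, each a list of (x, y, z) atom coordinates.
    dna_atoms: list of (x, y, z) coordinates of the DNA atoms.
    """
    return [1 if min_distance(atoms, dna_atoms) <= cutoff else 0
            for atoms in residues]


# Toy example with made-up coordinates (not real structural data).
residues = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # closest DNA atom is 3.0 angstroms away
    [(10.0, 0.0, 0.0)],                  # closest DNA atom is 6.0 angstroms away
]
dna_atoms = [(4.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
labels = label_residues(residues, dna_atoms)
print(labels)
```

The brute-force pairwise search is adequate for a sketch; for full complexes a spatial index would normally be used instead.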

Performance measures
To evaluate the performance, we use several typical measures, including accuracy (ACC), sensitivity (SN/Recall), specificity (SP), strength (ST), precision (PRE), F1-score (F1), and the Matthews Correlation Coefficient (MCC). These measures are defined as:

$$
\begin{aligned}
\mathrm{ACC} &= \frac{TP + TN}{TP + FP + TN + FN}, \\
\mathrm{SN} &= \frac{TP}{TP + FN}, \\
\mathrm{SP} &= \frac{TN}{TN + FP}, \\
\mathrm{ST} &= \frac{\mathrm{SN} + \mathrm{SP}}{2}, \\
\mathrm{PRE} &= \frac{TP}{TP + FP}, \\
\mathrm{F1} &= \frac{2 \times \mathrm{PRE} \times \mathrm{SN}}{\mathrm{PRE} + \mathrm{SN}}, \\
\mathrm{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.
\end{aligned}
$$

In these equations, TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively. Because the datasets are imbalanced, the strength (ST), which is the average of sensitivity and specificity, is used to obtain a fair measure of the model. Additionally, two broadly employed measures are used to estimate prediction performance: the receiver operating characteristic (ROC) curve [25] and the area under the ROC curve (AUC) [26]. The ROC curve plots the true positive rate against the false positive rate. An AUC of 1, the maximum value, represents a perfect predictor, whereas the AUC of random guessing is close to 0.5.
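The threshold-based measures listed above can be computed directly from the four confusion-matrix counts. The sketch below assumes the standard textbook definitions of each measure, with ST taken as the average of SN and SP as stated in the text; the function name and the example counts are illustrative only.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute ACC, SN, SP, ST, PRE, F1 and MCC from confusion-matrix counts."""
    sn = _ratio(tp, tp + fn)            # sensitivity (recall)
    sp = _ratio(tn, tn + fp)            # specificity
    pre = _ratio(tp, tp + fp)           # precision
    acc = _ratio(tp + tn, tp + fp + tn + fn)
    st = (sn + sp) / 2                  # strength: mean of SN and SP
    f1 = _ratio(2 * pre * sn, pre + sn)
    mcc = _ratio(tp * tn - fp * fn,
                 math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"ACC": acc, "SN": sn, "SP": sp, "ST": st,
            "PRE": pre, "F1": f1, "MCC": mcc}


# Made-up counts for an imbalanced problem (50 positives, 200 negatives).
m = binary_metrics(tp=40, fp=20, tn=180, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy (0.88) is higher than the strength (0.85), which illustrates why a class-balanced measure is reported alongside ACC on imbalanced data.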

The prediction pipeline
The pipeline of PDRLGB is shown in Fig. 1. It consists of three steps: A) feature extraction: a total of 83 sequence and structure features are extracted for each residue, and the feature vectors are generated using a sliding window of size w; B) feature selection: the features are ranked with random forest and the optimal feature set is selected using the incremental feature selection approach; C) building prediction classifiers: the DNA-binding residue prediction models are built using the light gradient boosting machine. These steps are described in detail in the following subsections.
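Step A can be sketched as follows: each residue is represented by the concatenated features of the residues in a window centred on it, so 83 features per residue and a window of 11 give 83 × 11 = 913 values. The text does not say how positions beyond the chain termini are handled; zero padding is assumed here, and the function name is hypothetical.

```python
def window_features(per_residue, w=11):
    """Concatenate the feature vectors of the w residues centred on each position.

    per_residue: list of equal-length feature lists, one per residue.
    Positions outside the sequence are zero-padded (an assumption, see lead-in).
    """
    if w % 2 == 0:
        raise ValueError("window size must be odd so the target residue is centred")
    half = w // 2
    n = len(per_residue)
    pad = [0.0] * len(per_residue[0])
    encoded = []
    for i in range(n):
        row = []
        for j in range(i - half, i + half + 1):
            row.extend(per_residue[j] if 0 <= j < n else pad)
        encoded.append(row)
    return encoded


# Toy protein of 30 residues; residue i carries 83 copies of the value i + 1.
protein = [[float(i + 1)] * 83 for i in range(30)]
encoded = window_features(protein, w=11)
print(len(encoded), len(encoded[0]))
```

Each row of `encoded` is the feature vector for one residue, with the target residue's own features in the middle block.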

Feature extraction
We extract a variety of features, including position-specific scoring matrices (PSSMs) (20 features), physicochemical properties (10 features), and the further sequence and structure features described below.

Sequence features:

1) Position-specific scoring matrices (PSSMs): PSSM-based evolutionary information is obtained from the multiple sequence alignment calculated by PSI-BLAST [27] searching against the NCBI non-redundant (NR) database, with 3 iterations and an e-value of 0.001.

2) Physicochemical properties: The physicochemical properties of a residue include atom numbers, electrostatic charge numbers, potential hydrogen bonds, molecular mass (Mmass), hydrophobicity, hydrophilicity, polarity, polarizability, propensities and average accessible surface area [28]. The original values of the ten physicochemical attributes for each residue are obtained from the AAindex database [29].

3) Disordered regions: Predicted disordered regions within a protein are also a significant property. Avoiding possibly disordered fragments in protein expression constructs can enhance the expression, foldability, and stability of the protein. DisEMBL [30] is a useful tool for identifying disordered regions, which is needed for many biochemical studies, particularly structural biology and structural genomics projects. In this study, DisEMBL is used to identify dynamically disordered regions of the protein sequence.

4) Side-chain environment (pKa): The pKa value is an effective metric for describing the environmental features of a protein. The side-chain pKa values are taken from Nelson and Cox [31]; they represent protein side-chain environmental factors and have been broadly used in previous studies.

5) Identity vector: A 20-dimensional vector with 1 at the position corresponding to the residue type and 0 for the remaining amino acid types.

6) Net charge of a residue: The twenty amino acids can be divided into non-polar, polar charged, and polar uncharged amino acids. The DNA backbone is negatively charged, so stretches of positively charged amino acids are thought to be characteristic of DNA binding. A charge of +1 is assigned to Arg and Lys and -1 to Asp and Glu. His is assigned a charge of +0.5, and all other residues are regarded as neutral.
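The identity vector and net charge features are simple enough to illustrate directly. The sketch below uses the charge values given in the text; the alphabetical ordering of the one-letter codes in the identity vector is an assumption, since the text does not specify an ordering, and the function names are hypothetical.

```python
# Assumed ordering of the 20 standard amino acids (alphabetical one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Charges as stated in the text: Arg/Lys +1, Asp/Glu -1, His +0.5, all others 0.
CHARGE = {"R": 1.0, "K": 1.0, "D": -1.0, "E": -1.0, "H": 0.5}


def identity_vector(residue):
    """20-dimensional one-hot vector marking the residue type."""
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]


def net_charge(residue):
    """Net charge assigned to a residue; residues not listed in CHARGE are neutral."""
    return CHARGE.get(residue, 0.0)


sequence = "MKRHDEA"
charges = [net_charge(r) for r in sequence]
one_hot_k = identity_vector("K")
print(charges)
print(one_hot_k)
```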
Structure features:

1) Features from DSSP: We use DSSP [32] to obtain the secondary structures, along with the solvent accessible surface area (ASA), hydrogen bonds, atom coordinates and backbone torsion angles.

2) Features from NACCESS: We use NACCESS [33] to compute the absolute and relative ASA of all atoms, total side chain, main chain, non-polar side chain and all-polar side chain, respectively. ASA-related features have been shown to be important for identifying protein functional sites [34][35][36][37]. [38].

4) B-factor of a residue: The B-factor [39] of protein crystal structures, including the B-factors of the Cα and Cβ atoms of the amino acids in the sequence, was adopted.

Features selection
We encode the features with a sliding window of size w and generate a large feature vector. To eliminate uninformative variables and obtain more cost-effective models, a reliable feature selection approach is applied. Firstly, we use the random forest algorithm [40] to rank the features by their mean decrease Gini index (MDGI) Z-Score [41], which measures the importance of individual features. Features with higher MDGI Z-Scores are more sensitive to random shuffling of their values, and are thus more important for correctly classifying a residue as a DNA-binding or non-DNA-binding site. After ranking the features in descending order of MDGI Z-Score, we use the incremental feature selection approach to select the top-k features. We construct the feature subset by incrementally adding features from the ranked list, and evaluate the performance of each top-k subset using the LightGBM classifier with 5-fold cross-validation. We use a comprehensive evaluation score (R_c) to measure the performance of the feature subset. The R_c score is computed from the cross-validation results, where n is the number of times the 5-fold cross-validation is repeated.
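The incremental feature selection loop described above can be sketched independently of any particular classifier. In the sketch below, `evaluate` is a hypothetical callback standing in for "train LightGBM on this subset with 5-fold cross-validation and return the evaluation score", and the toy importances and scoring function are invented purely to make the example runnable; they are not the scores used by PDRLGB.

```python
def incremental_feature_selection(ranked_features, evaluate, step=1):
    """Evaluate growing top-k prefixes of a ranked feature list and keep the best.

    ranked_features: feature names sorted by decreasing importance.
    evaluate: callable mapping a list of features to a score (higher is better).
    step: number of features added per iteration.
    Returns (best_subset, best_score, history), where history is a list of (k, score).
    """
    n = len(ranked_features)
    sizes = list(range(step, n + 1, step))
    if not sizes or sizes[-1] != n:
        sizes.append(n)  # always evaluate the full feature list as well
    best_k, best_score, history = 0, float("-inf"), []
    for k in sizes:
        score = evaluate(ranked_features[:k])
        history.append((k, score))
        if score > best_score:
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score, history


# Toy ranking: invented importances standing in for a random forest ranking.
importance = {"f3": 0.9, "f1": 0.7, "f4": 0.4, "f2": 0.05, "f5": 0.01}
ranked = sorted(importance, key=importance.get, reverse=True)


def toy_score(subset):
    """Stand-in scorer: informative features help, every feature has a small cost."""
    return sum(importance[f] for f in subset) - 0.1 * len(subset)


selected, best, history = incremental_feature_selection(ranked, toy_score)
print(selected, round(best, 3))
```

Passing the scorer as a callback keeps the selection loop separate from the model, so the same loop could be reused with any classifier and any evaluation score.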

Parameter selection
The sliding window describes the target residue's sequence neighborhood, and the window size w should be optimized. As shown in Fig. 2, the ST score increases when the window size increases from 1 to 11, and the highest ST score is achieved when the window size is 11. We therefore select 11 as the optimal window size in the proposed PDRLGB method.
Algorithm (fragment): compute absolute values of gradients; compute information gains; get a new decision tree m(χ) on set D.
The number of features (k) is another important parameter. We build LightGBM classifiers for each top-k subset and calculate the performance with 5-fold cross-validation. The results are shown in Fig. 3. The highest R_c score of 0.85 is obtained when using the top 800 features. We therefore select the top 800 features, which contribute the most to the classification, as the optimal feature set.

Performance comparison with other machine learning techniques
In this section, we compare LightGBM with existing machine learning techniques, including Support Vector Machine (SVM) [49], Random Forest (RF) [40] and AdaBoost [50]. The performance of these classifiers is listed in Table 2. All classifiers are trained on the same benchmark with the same feature set. The ROC curves are shown in Fig. 4. LightGBM achieves a clear performance improvement over these classifiers on both PDNA-62 and PDNA-224. Concretely, on the PDNA-62 dataset, LightGBM obtains increases of at least 3.1% in ST, 2.2% in F1, 3.7% in MCC and 2.9% in AUC compared with SVM, RF and AdaBoost. On the PDNA-224 dataset, LightGBM achieves increases of at least 3.1% in ST, 0.6% in F1, 3.2% in MCC and 3.0% in AUC. Because both datasets are imbalanced, the ROC curve is regarded as a useful estimate of overall performance, and a higher ROC curve denotes better prediction performance. Figure 4a and b also show that LightGBM obtains the best ROC curves on the two datasets (PDNA-62 and PDNA-224).
The results imply that the LightGBM algorithm we used is superior to the other widely used classifiers.

Performance comparison on the independent test dataset
To further assess the performance, we compare PDRLGB with seven existing state-of-the-art protein-DNA binding site prediction methods, DNABR [22], BindN [10], BindN-RF [13], BindN+ [53], EL_PSSM-RT [23], DRNApred [55] and CNNsite [56], on the TS-72 dataset. DNABR [22] and BindN-RF [13] are built using random forest (RF). BindN [10] and BindN+ [53] are trained using support vector machines (SVMs).
We also compare our PDRLGB method with DP-Bind [57], EL_PSSM-RT [23] and DRNApred [55] on the independent dataset TS-61. DP-Bind is implemented using machine learning algorithms including SVM, kernel logistic regression and penalized logistic regression. DP-Bind also implements two ensemble classifiers that use majority voting (MAJ) and unanimity voting (STR), respectively. Here we only compare with DP-Bind (STR), since the unanimity voting approach achieves the best performance according to Hwang et al. [57]. The results are shown in Table 5. PDRLGB achieves the highest AUC score of 0.850. Although DRNApred has the highest specificity, PDRLGB has a better balance between recall and specificity.

Computing time comparison
The training time comparison is shown in Fig. 6. Our experiments on the two datasets show that LightGBM trains up to more than 20 times faster than SVM and is also faster than AdaBoost. Although random forest (RF) and LightGBM have similar training speed, the LightGBM-based method performs considerably better than the RF classifier. PDRLGB is therefore an accurate and fast model for predicting protein-DNA binding residues.

Case study
To further validate the usefulness of PDRLGB for DNA-binding residue prediction, we apply PDRLGB trained on PDNA-62 to distinguish binding residues from non-binding residues in the ISDra2 transposase/IS end complex (PDB ID: 2XMA) [58], which is not in the training set. Here, we use PDRLGB to investigate the DNA-binding residues of chain A (2XMA:A). PDRLGB achieves an ACC of 87.05%, an MCC of 0.67, an SP of 86.67%, an SN of 87.16%, and an ST of 86.91%, in close agreement with the experimental data available in the PDB database. The experimentally determined DNA-binding sites and the sites predicted by PDRLGB for complex 2XMA are shown in Fig. 7. Figure 7a shows the experimentally determined binding sites of protein 2XMA:A, where the red spheres represent real DNA-binding sites. Figure 7b presents the predicted binding sites of protein 2XMA:A. The results show that the majority of the DNA-binding residues are correctly predicted by the PDRLGB model.

Discussion
Existing methods for predicting DNA-binding sites are mainly divided into sequence-based methods, structure-based methods and hybrid methods. In this study, we integrate both sequence and structural features to effectively predict DNA-binding residues. A limitation of our PDRLGB approach is that it requires protein structural information, which may limit its application. However, with the growing number of solved protein structures, protein homology modeling projects and predicted 3D structures, and with the increasing availability of high-quality protein-DNA complex structures, we expect that PDRLGB can serve as a powerful tool for accurately identifying DNA-binding residues.

Conclusion
Targeting specific DNA-binding amino acids that contribute to the strength and specificity of protein-DNA interactions has broad applications ranging from rational drug design to the investigation of metabolic and signal transduction networks. In this paper, we have developed a novel LightGBM-based algorithm, termed PDRLGB, for DNA-binding residue prediction. Sequence features and structural characteristics are combined to construct the feature space, and random forest combined with incremental feature selection is applied to select features. The prediction performance on the two datasets PDNA-62 and PDNA-224 with five-fold cross-validation demonstrates that PDRLGB accelerates the training process and performs better than other widely used machine learning classifiers. In addition, performance comparisons between PDRLGB and other existing state-of-the-art DNA-binding site prediction methods demonstrate that our PDRLGB approach achieves the best performance. We have also applied PDRLGB to identify binding sites in the protein-DNA complex 2XMA and obtained satisfactory results.