- Open Access
PDRLGB: precise DNA-binding residue prediction using a light gradient boosting machine
© The Author(s) 2018
- Published: 31 December 2018
Identifying specific residues for protein-DNA interactions is of considerable importance for better understanding the binding mechanism of protein-DNA complexes. Although many computational approaches for DNA-binding residue prediction have been developed, there is still significant room for improvement in overall performance and availability.
Here, we present an efficient approach termed PDRLGB that uses a light gradient boosting machine (LightGBM) to predict binding residues in protein-DNA complexes. Initially, we extract a wide variety of 913 sequence and structure features with a sliding window of 11. Then, we apply the random forest algorithm to sort the features in descending order of importance and obtain the optimal subset of features using incremental feature selection. Based on the selected feature set, we use a light gradient boosting machine to build the prediction model for DNA-binding residues. Our PDRLGB method shows better overall predictive accuracy and relatively less training time than other widely used machine learning (ML) methods such as random forest (RF), Adaboost and support vector machine (SVM). We further compare PDRLGB with various existing approaches on the independent test datasets and show improvement in results over the existing state-of-the-art approaches.
PDRLGB is an efficient approach to predict specific residues for protein-DNA interactions.
- DNA-binding residue
- Light gradient boosting
- Random forest
- Incremental feature selection
The protein-DNA interaction is one of the central issues in molecular biology and occurs widely in biological activities in living organisms, such as DNA replication, repair, and modification. To understand the recognition mechanism of protein-DNA complexes, researchers often focus on protein-DNA binding sites, especially the interface residues that bind DNA. Experimental approaches such as electrophoretic mobility shift assays (EMSAs) [1, 2], conventional chromatin immunoprecipitation (ChIP), X-ray crystallography, PNA (peptide nucleic acid)-assisted identification of RNA binding proteins (PAIR), and NMR spectroscopy have been applied to reveal the DNA-binding amino acids. However, these laboratory methods are expensive and time-consuming. Low-cost and efficient computational methods are therefore particularly important for discovering specific interface residues of protein-DNA complexes.
A number of computational approaches have focused on applying machine learning algorithms to build prediction models based on sequence and structural information. Wei et al. proposed novel evolutionary features for DNA-binding protein prediction. Jones and coworkers proposed a simple method to identify DNA-binding residues using positive electrostatic patches on the protein surface. Ahmad et al. developed a neural network classifier to predict DNA-binding residues using a variety of composition, sequence and structural information. Wang et al. built SVM-based models to predict DNA-binding residues from data examples represented with three sequence characteristics. Ferrer-Costa et al. implemented an effective linear predictor to determine the DNA-binding sites in protein sequences. Yan and coworkers trained a Naive Bayes classifier to predict whether a given amino acid is a DNA-binding site based on its characteristics and the features of its sequence neighbors. Wang and Yang developed a random forest (RF) classifier based on evolutionary information to detect DNA-binding sites. Song et al. employed imbalanced classification techniques for this problem. Carson et al. combined the C4.5 algorithm with bootstrap aggregation and cost-sensitive learning to identify binding residues in protein-RNA complexes. Zou et al. focused on feature selection techniques and improved the performance. Ozbek et al. presented a prediction method based on residue fluctuations in high-frequency modes using the Gaussian network model. Other protein-DNA binding residue prediction tools such as DR_bind and PreDNA have also been developed.
Although many studies have been performed, the problem of accurately identifying protein-DNA binding sites still has considerable room for improvement. Firstly, effective features for distinguishing DNA-binding interface residues from non-binding amino acids have not been fully exploited. Secondly, the numbers of DNA-binding and non-binding amino acids in proteins are extremely unbalanced, and this class imbalance can cause over-fitting and poor performance in the prediction of DNA-binding amino acids.
In this work, we develop an innovative computational pipeline, named PDRLGB, for predicting interface residues in protein-DNA complexes. We extract many sequence and structure features and use the random forest to select a subset of optimal features. Based on the selected features, we train the DNA-binding residue prediction models using LightGBM, a recent implementation of the gradient boosting decision tree. Our experiments show that PDRLGB significantly outperforms other state-of-the-art DNA-binding residue prediction approaches.
To assess the performance of the PDRLGB method and other existing approaches, two benchmark datasets (PDNA-62 and PDNA-224) and two independent datasets (TS-72 and TS-61) are used. PDNA-62 was built by Ahmad et al. It consists of 67 sequences obtained from 62 protein-DNA complexes in the Protein Data Bank (PDB), and the sequence identity between any two sequences is ≤ 25%. PDNA-224 was generated by Li et al.; it contains 224 proteins, and redundant sequences were removed using a sequence identity cutoff of 25%. The independent test dataset TS-72 was extracted by Ma et al. and contains 72 protein chains. TS-61 was constructed by Zhou et al. Redundant proteins were removed using CD-HIT, and the remaining 61 non-redundant DNA-binding protein sequences have ≤ 30% sequence identity with the protein sequences in PDNA-62, PDNA-224, and TS-72.
Number of positive samples (binding sites) and negative samples (non-binding sites) of the four datasets
In these equations, TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively. Because the data sets are imbalanced, the strength (ST), defined as the average of sensitivity and specificity, is used to obtain a fair measure of the model. Additionally, two broadly employed measures of prediction performance are the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). The ROC curve plots the true positive rate against the false positive rate. An AUC of 1 represents a perfect predictor, whereas the AUC of random guessing is close to 0.5.
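As a minimal sketch of the count-based measures discussed here, the function below computes sensitivity, specificity, and strength from the four confusion-matrix counts. The function name `binary_metrics` is hypothetical, and the accuracy and Matthews correlation coefficient entries are included as an assumption about which other standard measures the original equations covered; their formulas are the textbook definitions, not taken from the source.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix measures for a binary predictor."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Strength (ST): average of sensitivity and specificity, as described in the text.
    strength = (sensitivity + specificity) / 2.0
    # Assumption: accuracy and MCC are shown as commonly reported companions.
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "strength": strength,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Toy counts for illustration only, not results from the paper.
    print(binary_metrics(tp=80, fp=30, tn=170, fn=20))
```

Because strength weights the two classes equally, it is less sensitive than accuracy to the large excess of non-binding residues.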
The prediction pipeline
We extract a variety of features including position-specific scoring matrices (PSSMs) (20 features), physicochemical properties (10 features), disordered features (3 features), side-chain environment (pKa) (2 features), identity vector (20 features), net charge (1 feature), the information from DSSP (15 features), the information from NACCESS (10 features), H-bonds (1 feature) and B-factor (1 feature). These features can be grouped into two categories: sequence and structure features.
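The per-residue counts listed above sum to 83, and with the sliding window of 11 mentioned in the abstract this gives 83 × 11 = 913 features per residue. A minimal sketch of such a window construction follows; the function name `window_features` is hypothetical, and zero-padding at the chain termini is an assumption, since the text does not say how positions beyond the ends of the sequence are handled.

```python
def window_features(per_residue, window=11):
    """Concatenate the feature vectors of each residue and its neighbors.

    per_residue: list of equal-length feature vectors, one per residue.
    Returns one vector of length window * n_features per residue.
    """
    half = window // 2
    n_feat = len(per_residue[0])
    pad = [0.0] * n_feat  # assumption: zero-pad beyond the chain termini
    rows = []
    for i in range(len(per_residue)):
        row = []
        for j in range(i - half, i + half + 1):
            inside = 0 <= j < len(per_residue)
            row.extend(per_residue[j] if inside else pad)
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Toy protein of 5 residues, 83 placeholder features each.
    toy = [[float(i)] * 83 for i in range(5)]
    rows = window_features(toy)
    print(len(rows), len(rows[0]))
```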
Position-specific scoring matrices (PSSMs): PSSM-based evolutionary information is obtained from the multiple sequence alignment computed by PSI-BLAST searching against the NCBI non-redundant (NR) database, with 3 iterations and an e-value of 0.001.
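The PSI-BLAST settings above could be expressed with the BLAST+ `psiblast` command roughly as follows. This is a sketch under assumptions: the source does not say which PSI-BLAST release or command line was used, the database name `nr` and the file paths are placeholders, and whether the 0.001 threshold was applied as `-evalue` or as the inclusion threshold is not stated. The helper `psiblast_command` is hypothetical and only builds the argument list; it does not run the search.

```python
def psiblast_command(fasta_path, pssm_path, db="nr", iterations=3, evalue=0.001):
    """Build a BLAST+ psiblast argument list that writes an ASCII PSSM."""
    return [
        "psiblast",
        "-query", fasta_path,
        "-db", db,  # assumption: a local copy of the NCBI NR database
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),  # assumption: applied as the report threshold
        "-out_ascii_pssm", pssm_path,
    ]


if __name__ == "__main__":
    # Placeholder file names for illustration.
    print(" ".join(psiblast_command("protein.fasta", "protein.pssm")))
```

The resulting list can be passed to `subprocess.run` on a machine where BLAST+ and the database are installed.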
Physicochemical properties: The physicochemical properties of a residue include atom numbers, electrostatic charge numbers, potential hydrogen bonds, molecular mass (Mmass), hydrophobicity, hydrophilicity, polarity, polarizability, propensities and average accessible surface area. The original values of the ten physicochemical attributes for each residue are obtained from the AAindex database.
Disordered regions: Predicted disordered regions within a protein are also a significant property. Avoiding possibly disordered fragments in protein expression constructs can enhance the expression, foldability, and stability of the protein. DisEMBL is a useful tool for identifying disordered regions, which is needed in many biochemical studies, particularly in structural biology and structural genomics projects. In this study, DisEMBL is used to identify dynamically disordered regions of the protein sequence.
Side-chain environment (pKa): The pKa value is an effective metric for characterizing the environment of a residue in a protein. The side-chain pKa values are taken from Nelson and Cox; they represent protein side-chain environmental factors and have been broadly used in previous studies.
Identity vector: Each residue is encoded as a 20-feature vector with 1 at the position corresponding to its amino acid type and 0 for the remaining amino acid types.
Net charge of a residue: The twenty amino acids can be divided into non-polar, polar charged, and polar uncharged amino acids. The DNA backbone is negatively charged, so positively charged amino acids are thought to be characteristic of DNA binding. A charge of +1 is assigned to Arg and Lys and -1 to Asp and Glu. His is assigned a charge of +0.5, and all other residues are regarded as neutral.
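The identity vector and net charge encodings can be sketched in a few lines. The names `AMINO_ACIDS`, `identity_vector`, and `net_charge` are hypothetical, and the alphabetical ordering of the one-letter codes is an assumption (any fixed ordering works as long as it is used consistently); the charge values are those given in the text.

```python
# Assumption: one-letter codes in alphabetical order define the vector positions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Charges as stated in the text: Arg/Lys +1, Asp/Glu -1, His +0.5, others 0.
CHARGE = {"R": 1.0, "K": 1.0, "D": -1.0, "E": -1.0, "H": 0.5}


def identity_vector(residue):
    """20-dimensional one-hot encoding of a one-letter amino acid code."""
    code = residue.upper()
    return [1 if aa == code else 0 for aa in AMINO_ACIDS]


def net_charge(residue):
    """Net charge feature of a residue; neutral for all unlisted types."""
    return CHARGE.get(residue.upper(), 0.0)


if __name__ == "__main__":
    for res in "RKHDEA":
        print(res, identity_vector(res), net_charge(res))
```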
Features from DSSP: We use DSSP to obtain the secondary structure and related structural information, including solvent accessible surface area (ASA), hydrogen bonds, atom coordinates and backbone torsion angles.
Features from NACCESS: We use NACCESS to compute the absolute and relative ASA of all atoms, total side chain, main chain, non-polar side chain and all polar side chain, respectively. ASA-related features have been shown to be important in identifying protein functional sites [34–37].
Number of H-bonds: The number of hydrogen bonds (Hbond) is computed by HBPLUS.
B-factor of a residue: The B-factor of protein crystal structures, including the B-factor of the Cα and that of the Cβ of the amino acids in the sequence, was adopted.
where n is the number of repetitions of the 5-fold cross-validation.
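The feature selection procedure summarized in the abstract (rank features by random forest importance, then add them one at a time and keep the best-scoring subset) can be sketched as below. The function names and the toy scoring function are hypothetical. In practice `evaluate` would train a classifier on the candidate subset and return a cross-validated score; the exact scoring criterion used by PDRLGB is not recoverable from this text, so it is left as a caller-supplied function.

```python
def rank_by_importance(importances):
    """Sort feature names by importance, highest first."""
    return sorted(importances, key=importances.get, reverse=True)


def incremental_feature_selection(ranked_features, evaluate):
    """Grow the feature set in ranked order and keep the best-scoring prefix.

    evaluate: callable taking a list of features and returning a score
    (higher is better), e.g. a cross-validated performance measure.
    """
    best_subset, best_score = [], float("-inf")
    subset = []
    for feature in ranked_features:
        subset.append(feature)
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = list(subset), score
    return best_subset, best_score


if __name__ == "__main__":
    # Toy importances and scores for illustration only.
    toy_importances = {"pssm": 0.5, "asa": 0.3, "noise": 0.01}
    toy_scores = {1: 0.60, 2: 0.80, 3: 0.75}
    ranked = rank_by_importance(toy_importances)
    print(incremental_feature_selection(ranked, lambda s: toy_scores[len(s)]))
```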
Building prediction classifiers
Gradient boosting decision tree (GBDT) is a widely used algorithm that can be applied to both classification and regression problems [42–47]. Recently, Ke et al. proposed a novel GBDT algorithm named LightGBM, which utilizes two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which deal with a large number of data samples and a large number of features, respectively. GOSS keeps all the examples with large gradients and performs random sampling on the examples with small gradients. EFB bundles mutually exclusive features into a much smaller number of dense features, which avoids unnecessary computation for zero feature values. Here we apply LightGBM to build the DNA-binding residue prediction models. The detailed steps of the LightGBM algorithm are shown in Algorithm 1.
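To make the GOSS step concrete, here is a minimal pure-Python sketch of the sampling rule as described by Ke et al.: keep the top fraction of examples by absolute gradient, randomly sample a fraction of the remainder, and up-weight the sampled small-gradient examples by (1 - a) / b so that the gradient statistics stay approximately unbiased. The function name `goss_sample` and the default rates are illustrative assumptions; this is not LightGBM's internal implementation, and it is not Algorithm 1 of the paper.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {example index: weight} for one GOSS sampling round."""
    n = len(gradients)
    # Order examples by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    # Randomly sample from the small-gradient examples.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    # Amplify sampled small-gradient examples to compensate for the sampling.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


if __name__ == "__main__":
    toy_gradients = [(-1) ** i * i for i in range(100)]  # illustrative values
    weights = goss_sample(toy_gradients)
    print(len(weights), sorted(set(weights.values())))
```

With 100 examples and the default rates, 20 large-gradient examples are kept with weight 1 and 10 small-gradient examples are kept with the amplified weight, so each boosting round sees 30 examples instead of 100.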
Performance comparison with other machine learning techniques
Performance comparison of LightGBM with other machine learning methods
Performance comparison with other state-of-the-art predictors
Performance comparison of various prediction methods on PDNA-62 with 5-fold cross-validation
Performance of PDRLGB compared with PreDNA and EL_PSSM-RT on PDNA-224 with 5-fold cross-validation
Performance comparison on the independent test dataset
Performance comparison on TS-61
Computing time comparison
Existing methods for predicting DNA-binding sites are mainly divided into sequence-based methods, structure-based methods and hybrid methods. In this study, we integrate both sequence and structural features to predict DNA-binding residues. A limitation of our PDRLGB approach is that it requires protein structural information, which may restrict its application. However, with the increasing number of solved protein structures, protein homology modeling projects and predicted 3D structures, and with the growing availability of high-quality protein-DNA complex structures, we expect that PDRLGB can serve as an effective tool for accurately identifying DNA-binding residues.
Targeting specific DNA-binding amino acids that contribute to the strength and specificity of protein-DNA interactions has broad applications ranging from rational drug design to the investigation of metabolic and signal transduction networks. In this paper, we have developed a novel LightGBM-based algorithm, termed PDRLGB, for DNA-binding residue prediction. Sequence features and structural characteristics are combined to construct the feature space, and random forest combined with incremental feature selection is applied to select features. The prediction performance on the two datasets PDNA-62 and PDNA-224 with five-fold cross-validation demonstrates that PDRLGB trains faster and performs better than other widely used machine learning classifiers. In addition, performance comparisons between PDRLGB and other existing state-of-the-art DNA-binding site prediction methods demonstrate that our PDRLGB approach achieves the best performance. We have also employed PDRLGB to identify binding sites on the protein-DNA complex 2XMA and obtained satisfactory results.
This work was supported by National Natural Science Foundation of China under grants No. 61672541 and No. 61672113, and Natural Science Foundation of Hunan Province under grant No. 2017JJ3287.
Publication costs are funded by National Natural Science Foundation of China under grant No. 61672541.
Availability of data and materials
The datasets used in this study are available at http://denglab.org/PDRLGB/.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 19, 2018: Proceedings of the 29th International Conference on Genome Informatics (GIW 2018): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-19.
LD, JP, XX, WY, CL and HL designed the study and conducted experiments. LD, JP, XX, WY and CL performed statistical analyses. LD, JP and HL drafted the manuscript. JP prepared the experimental materials and benchmarks. All authors have read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Jones S, Heyningen PV, Berman HM, Thornton JM. Protein-dna interactions: a structural analysis. Nucleic Acids Res. 1999; 29(4):943–54.
- Jones S, Barker JA, Nobeli I, Thornton JM. Using structural motif templates to identify proteins with dna binding function. Nucleic Acids Res. 2003; 31(11):2811.
- Kono H, Sarai A. Structure-based prediction of dna target sites by regulatory proteins. Proteins Struct Funct Bioinforma. 2015; 35(1):114–31.
- Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. Cath-a hierarchic classification of protein domain structures. Structure. 1997; 5(8):1093–108.
- Olson WK, Gorin AA, Lu XJ, Hock LM, Zhurkin VB. Dna sequence-dependent deformability deduced from protein-dna crystal complexes. Proc Natl Acad Sci U S A. 1998; 95(19):11163–8.
- Ponting CP, Schultz J, Milpetz F, Bork P. Smart: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Res. 1999; 27(1):229–32.
- Wei L, Tang J, Zou Q. Local-dpp: An improved dna-binding protein prediction method by exploring local evolutionary information. Inf Sci. 2017; 384:135–44.
- Jones S, Shanahan HP, Berman HM, Thornton JM. Using electrostatic potentials to predict dna-binding sites on dna-binding proteins. Nucleic Acids Res. 2003; 31(24):7189–98.
- Ahmad S, Gromiha MM, Sarai A. Analysis and prediction of dna-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics. 2004; 20(4):477–86.
- Wang L, Brown SJ. Bindn: a web-based tool for efficient prediction of dna and rna binding sites in amino acid sequences. Nucleic Acids Res. 2006; 34(Web Server issue):243–8.
- Ferrer-Costa C, Shanahan HP, Jones S, Thornton JM. Hthquery: a method for detecting dna-binding proteins with a helix-turn-helix structural motif. Bioinformatics. 2005; 21(18):3679–80.
- Yan C, Terribilini M, Wu F, Jernigan RL, Dobbs D, Honavar V. Predicting dna-binding sites of proteins from amino acid sequence. BMC Bioinformatics. 2006; 7(1):262.
- Wang L, Yang MQ, Yang JY. Prediction of dna-binding residues from protein sequence information using random forests. BMC Genomics. 2009; 10(S1):1.
- Song L, Li D, Zeng X, Wu Y, Guo L, Zou Q. ndna-prot: identification of dna-binding proteins based on unbalanced classification. BMC Bioinformatics. 2014; 15(1):298.
- Carson MB, Langlois R, Lu H. Naps: a residue-level nucleic acid-binding prediction server. Nucleic Acids Res. 2010; 38(Web Server issue):431–5.
- Zou Q, Wan S, Ju Y, Tang J, Zeng X. Pretata: predicting tata binding proteins with novel features and dimensionality reduction strategy. BMC Syst Biol. 2016; 10(4):114.
- Ozbek P, Soner S, Erman B, Haliloglu T. Dnabindprot: fluctuation-based predictor of dna-binding residues within a network of interacting residues. Nucleic Acids Res. 2010; 38(Web Server issue):417–23.
- Chen YC, Wright JD, Lim C. Dr_bind: a web server for predicting dna-binding residues from the protein structure based on electrostatics, evolution and geometry. Nucleic Acids Res. 2012; 40(Web Server issue):249–56.
- Li T, Li QZ, Liu S, Fan GL, Zuo YC, Peng Y. Predna: accurate prediction of dna-binding sites in proteins by integrating sequence and geometric structure information. Bioinformatics. 2013; 29(6):678–85.
- Friedman JH. Greedy function approximation: A gradient boosting machine. Ann Stat. 2001; 29(5):1189–232.
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The protein data bank, 1999–. Int Tables Crystallogr. 2000; 67(Suppl):675–84.
- Ma X, Guo J, Liu HD, Xie JM, Sun X. Sequence-based prediction of dna-binding residues in proteins with conservation and correlation information. IEEE/ACM Trans Biol Bioinforma. 2012; 9(6):1766–75.
- Zhou J, Lu Q, Xu R, He Y, Wang H. El_pssm-rt: Dna-binding residue prediction by integrating ensemble learning with pssm relation transformation. BMC Bioinformatics. 2017; 18(1):379.
- Fu L, Niu B, Zhu Z, Wu S, Li W. Cd-hit: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012; 28(23):3150–2.
- Swets JA. Measuring the accuracy of diagnostic systems. Science. 1988; 240(4857):1285–93.
- Bradley AP. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recog. 1997; 30(7):1145–59. https://doi.org/10.1016/S0031-3203(96)00142-2.
- Altschul S, Madden T, Schäffer A, Zhang J, Zhang Z, Miller W, Lipman D. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25(17):3389–402. https://doi.org/10.1093/nar/25.17.3389.
- Miller S, Lesk AM, Janin J, Chothia C. The accessible surface area and stability of oligomeric proteins. Nature. 1987; 328(6133):834–6.
- Kawashima S, Ogata H, Kanehisa M. Aaindex: Amino acid index database. Nucleic Acids Res. 1999; 27(1):368–9.
- Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB. Protein disorder prediction: implications for structural proteomics. Structure. 2003; 11(11):1453.
- Re A, Joshi T, Kulberkyte E, Morris Q, Workman CT. Rna-protein interactions: an overview. Methods Mol Biol. 2014; 1097(1097):491.
- Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983; 22(12):2577–637.
- Hubbard SJ, Thornton JM. Naccess. Computer Program. London: Department of Biochemistry and Molecular Biology, University College of London; 1993.
- Deng L, Zhang QC, Chen Z, Meng Y, Guan J, Zhou S. Predhs: a web server for predicting protein–protein interaction hot spots by using structural neighborhood properties. Nucleic Acids Res. 2014; 42(W1):290–5.
- Tang Y, Liu D, Wang Z, Wen T, Deng L. A boosting approach for prediction of protein-rna binding residues. BMC Bioinformatics. 2017; 18(13):465.
- Pan Y, Wang Z, Zhan W, Deng L. Computational identification of binding energy hot spots in protein–rna complexes using an ensemble approach. Bioinformatics. 2017; 34(9):1473–80.
- Nie L, Deng L, Fan C, Zhan W, Tang Y. Prediction of protein s-sulfenylation sites using a deep belief network. Curr Bioinforma. 2018; 13(5):461–7.
- Mcdonald IK, Thornton JM. Satisfying hydrogen bonding potential in proteins. J Mol Biol. 1994; 238(5):777–93.
- Yuan Z, Bailey TL, Teasdale RD. Prediction of protein b-factor profiles. Proteins Struct Funct Bioinforma. 2005; 58(4):905–12.
- Breiman L. Random forests. Mach Learn. 2001; 45(1):5–32.
- Liaw A, Wiener M. Classification and regression by random forest. R News. 2002; 2:18–22.
- Pan Y, Liu D, Deng L. Accurate prediction of functional effects for variants by combining gradient tree boosting with optimal neighborhood properties. PloS ONE. 2017; 12(6):0179314.
- Kuang L, Yu L, Huang L, Wang Y, Ma P, Li C, Zhu Y. A personalized qos prediction approach for cps service recommendation based on reputation and location-aware collaborative filtering. Sensors. 2018; 18(5):1556.
- Fan C, Liu D, Huang R, Chen Z, Deng L. Predrsa: a gradient boosted regression trees approach for predicting protein solvent accessibility. BMC Bioinf. 2016; 17(Suppl 1):8.
- Liao Z, Wan S, He Y, Zou Q. Classification of small gtpases with hybrid protein features and advanced machine learning techniques. Curr Bioinforma. 2018; 13(5):492–500.
- Li C, Zheng X, Yang Z, Kuang L. Predicting short-term electricity demand by combining the advantages of arma and xgboost in fog computing environment. Wirel Commun Mob Comput. 2018; 2018:5018053.
- Gan Y, Tao H, Zou G, Yan C, Guan J. Dynamic epigenetic mode analysis using spatial temporal clustering. BMC Bioinformatics. 2016; 17(17):537.
- Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y. Lightgbm: A highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017; 30:3146–54.
- Cai YD, Lin SL. Support vector machines for predicting rrna-, rna-, and dna-binding proteins from amino acid sequence. Biochim Biophys Acta. 2003; 1648(1-2):127.
- Rätsch G, Onoda T, Müller K-R. Soft margins for adaboost. Mach Learn. 2001; 42(3):287–320.
- Ahmad S, Sarai A. Pssm-based prediction of dna binding sites in proteins. BMC Bioinformatics. 2005; 6(1):1–6.
- Kuznetsov IB, Gou Z, Li R, Hwang S. Using evolutionary and structural information to predict dna-binding sites on dna-binding proteins. Proteins Struct Funct Bioinforma. 2006; 64(1):19.
- Wang L, Huang C, Yang MQ, Yang JY. Bindn+ for accurate prediction of dna and rna-binding residues from protein sequence features. BMC Syst Biol. 2010; 4(S1):3.
- Dietterich TG. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 1998; 10(7):1895–923.
- Yan J, Kurgan L. Drnapred, fast sequence-based method that accurately predicts and discriminates dna-and rna-binding residues. Nucleic Acids Res. 2017; 45(10):84.
- Zhou J, Lu Q, Xu R, Gui L, Wang H. Cnnsite: Prediction of dna-binding residues in proteins using convolutional neural network with sequence features. In: Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference On. Shenzhen: IEEE: 2016. p. 78–85.
- Hwang S, Gou Z, Kuznetsov IB. Dp-bind: a web server for sequence-based prediction of dna-binding residues in dna-binding proteins. Bioinformatics. 2007; 23(5):634–6.
- Hickman AB, James JA, Barabas O, Pasternak C, Ton-Hoang B, Chandler M, Sommer S, Dyda F. Dna recognition and the precleavage state during single-stranded dna transposition in d. radiodurans. EMBO J. 2010; 29(22):3840–52.