nDNA-prot: identification of DNA-binding proteins based on unbalanced classification
© Song et al.; licensee BioMed Central Ltd. 2014
Received: 1 June 2014
Accepted: 3 September 2014
Published: 8 September 2014
DNA-binding proteins are vital for the study of cellular processes. In recent genome engineering studies, the identification of proteins with certain functions has become increasingly important and needs to be performed rapidly and efficiently. In recent years, several approaches have been developed to improve the identification of DNA-binding proteins. However, the currently available resources are insufficient to identify these proteins accurately. As a result, previous research has been limited by the unbalanced accuracy rates and the low identification success of the current methods.
In this paper, we explored the practicality of modelling DNA-binding identification with an ensemble classifier and designed a new predictor, nDNA-Prot. The presented framework comprises two stages: a 188-dimension feature extraction method to represent the protein and an ensemble classifier designated as imDC. Experiments using different datasets showed that our method is more successful than the traditional methods in identifying DNA-binding proteins. Feature selection was conducted using the minimum Redundancy Maximum Relevance (mRMR) method. An accuracy rate of 95.80% and an Area Under the Curve (AUC) value of 0.986 were obtained in a cross validation. On a test dataset, our method achieved an accuracy of 86%, versus 76% for iDNA-Prot and 68% for DNA-Prot.
Our method can help to accurately identify DNA-binding proteins, and the web server is accessible at http://datamining.xmu.edu.cn/~songli/nDNA. In addition, we also predicted possible DNA-binding protein sequences in all of the sequences from the UniProtKB/Swiss-Prot database.
Keywords: DNA-binding protein; Ensemble classifier; Unbalanced dataset; Bioinformatics
A DNA-binding protein is a type of composite protein that is composed of a combination of structural proteins and is found in the chromosomes and DNA. These proteins perform an important role in the combination and separation of single-stranded DNA and in the detection of DNA damage. Other functions of DNA-binding proteins include stimulation of the nuclease, helicase and strand exchange proteins; transcription at the initiation site; and protein-protein interactions. Currently, an increasing number of researchers are attempting to distinguish DNA-binding proteins from the many other types of proteins, and the number of proteins being extracted is rapidly increasing. In 2011, the number of protein sequences in the Swiss-Prot database was more than 100 times greater than in 1986. Unfortunately, extremely unbalanced data have caused multiple drawbacks in the recent methods for the identification of DNA-binding proteins. Because of this, a quick and effective approach for the identification of DNA-binding proteins is required.
In recent years, an increasing number of feature extraction methods have been tested in the fields of machine learning and biology. Lin and Zou et al. used a 188-dimensional (188D) feature extraction method, which considers the composition, physicochemical properties, and distribution of the amino acids. A physicochemical distance transformation (PDT) approach, which is related to the physicochemical properties of amino acids, has also been proposed. In the 188D method, the first 20 features are obtained from the frequency with which each amino acid appears in a given protein sequence. The remaining 168 features are then derived from the protein's physicochemical properties. Patel et al. improved the sequence similarity matrices and used an artificial neural network (ANN) trained with the standard back-propagation algorithm for feed-forward neural networks. Among 1,000 proteins, which included only 62 sequence features, a total accuracy of 72.99% was obtained. Analogously, Cheng et al. proposed a recurrent neural network that was designed to solve the non-smooth convex optimisation problem. Bhardwaj et al. studied the DNA-binding residues that appear on the protein surface using the residue features that differentiate DNA-binding proteins from non-DNA-binding proteins, and a management alternative was applied as a follow-up to improve the prediction results. Studies have also demonstrated some of the available feature extraction means. According to the protein position-specific scoring matrix, Zou et al. extracted a 20D feature from protein sequences, and in 1992, Brown et al. proposed the n-gram natural language algorithm. This type of algorithm, also applied in another previous study, obtains the feature vectors by using a probability calculation. The Basic Local Alignment Search Tool (BLAST), which is based on a position-specific scoring matrix, has also been applied to detect remote protein homology.
The abovementioned approaches have all been used to distinguish DNA-binding proteins from non-DNA-binding proteins. In 1999, Nordhoff et al. described the use of mass spectrometry to identify DNA-binding proteins. Gao et al. developed a knowledge-based method (i.e., DNA-binding domain hunter) and demonstrated how to deduce DNA-binding protein residues according to the corresponding templates. Nanni and Lumini, via a genetic algorithm, discussed the combination of feature extraction approaches with a group of amino acid alphabets. Langlois et al. compared BLAST with a standard sequence alignment technique and discussed the method by which general mechanisms were captured by concrete rules. In 2011, Lin et al., using the grey model, introduced a method for differentiating large-scale DNA-binding proteins by analysing the modality of the pseudo amino acid composition. Many approaches have also been used to categorise the experimental data in the bioinformatics field. These methods can be categorised as follows: Random Forest (RF) [14–17], Support Vector Machine (SVM) [9, 18–22], Dynamic selection and Circulating Combination-based ensemble Clustering (LibD3C) [23, 24], ANN [25–29], the k-nearest neighbours (KNN) algorithm, and bagging.
The founding recognition rate of DNA-binding proteins has also been obtained, at a lower accuracy, using the existing methods rather than by using methods from the other two categories. Additionally, DNA-binding protein classification is an unresolved issue because previous research that introduced a number-based sampling strategy showed a high false-positive rate in the extended dataset. As a result, new DNA-binding proteins were not identified. Ahmad and Sarai demonstrated that using the charge and moment information in a hybrid predictor resulted in an 83.9% accuracy via a cross validation. The quadrupole moment, using single-variable predictors, resulted in a 73.7% accuracy. Qian et al. verified the association between the DNA-binding preference and the endogenous transcription factors and reached an accuracy rate of 76.6% in the Jackknife cross-validation test. However, all of these approaches have disadvantages. For example, only some predictors are available on websites where their functions are demonstrated. Thus, an insufficient amount of data contributes to the difficulty in analysing and comparing the results. Currently, the results of many previous studies have not been independently verified, which impedes the research and development of bioinformatics to some extent. Therefore, an enhanced accuracy rate is a significant research goal.
In light of the current problems, we developed a predictor that addresses the drawbacks of the previously developed predictors. We conducted a series of experiments following a preparation process involving a general selection in addition to data processing. All of the training datasets were obtained from the Universal Protein (UniProt) KB/Swiss-Prot database, which provides high-quality and comprehensive protein sequence resources. We developed a complete dataset that includes an integrated negative-sample dataset. Subsequently, we determined a suitable feature extraction method to reinforce the predictor. We chose the 188D feature extraction method, which is based on the physicochemical properties of proteins. Due to the unsatisfactory performance of the current single classifiers, we applied an ensemble classification prediction algorithm designated as “imDC” to our classification. imDC is based on research into unbalanced data and on machine-learning algorithms. Choosing a cross-validation approach to inspect the test dataset was the next important step in the process. Inappropriate cross-validation methods may lead to a deviation in the results and the subsequent failure of the predictor. Finally, a user-friendly web server that effectively discerns DNA-binding proteins was developed for checking and verification and for further academic exchanges. Detailed descriptions are provided in the Methods Section.
We selected our DNA-binding protein sequences from the website http://www.uniprot.org/ and obtained the data from the UniProtKB/Swiss-Prot database. We used the keyword “DNA-binding” to search for and select the datasets. More than 3,000,000 protein sequences were obtained initially, so we reduced the number of sequences by adding restrictions. The number of protein sequences was reduced to 44,996 when we added the restrictions “sequence length from 50 to 6,000” and “reviewed: yes.” Protein sequences with lengths less than 50 amino acids may be incomplete, whereas those with lengths greater than 6,000 amino acids may be too complex. The sequences that were obtained using the abovementioned limits comprised the initial positive dataset. Sequences containing any of the six letters (B, J, O, U, X, and Z) that do not denote one of the twenty standard amino acids were removed. The data downloaded from the database are normally very similar, and such similarities could affect our experimental results. Therefore, we removed any redundant data using Cluster Database at High Identity with Tolerance (CD-HIT) with a threshold of 40%. After this step, our positive dataset contained 9,676 protein sequences. Every sequence belongs to one or two protein families (PFAMs), and similar sequences belong to the same family. We identified all of the PFAMs from the positive datasets and deleted the redundant PFAMs. We extracted the longest sequence from every PFAM and entered it in the positive dataset, which then contained 1,353 protein sequences. We created a file named “PF_all” and deleted the PFAMs to which the positive datasets belonged. We also obtained a negative dataset that contained 9,361 protein sequences.
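The length and alphabet restrictions described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names are hypothetical, the length bounds are assumed to be inclusive, and the CD-HIT and PFAM steps are not reproduced.

```python
# Hypothetical helper names; bounds assumed inclusive (not stated in the paper).
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def keep_sequence(seq, min_len=50, max_len=6000):
    """Return True if the sequence passes the length and alphabet filters."""
    seq = seq.upper()
    # Reject sequences that are too short or too long.
    if not (min_len <= len(seq) <= max_len):
        return False
    # Reject sequences containing letters such as B, J, O, U, X or Z.
    return set(seq) <= STANDARD_AMINO_ACIDS


def filter_sequences(records):
    """Keep only the entries of a {name: sequence} dict that pass the filters."""
    return {name: seq for name, seq in records.items() if keep_sequence(seq)}
```

For example, a 60-residue sequence built only from standard letters is kept, whereas a 3-residue fragment or a sequence containing X is dropped.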
In the field of machine learning, which is relatively mature, single classifiers have gradually begun to show drawbacks. Traditional machine-learning algorithms tend to sacrifice part of the data, and a single classifier is limited in what it can learn. The minority samples are inevitably ignored, and a high false-positive rate is obtained. Solutions to this problem fall into two categories: data-level approaches and algorithm-level approaches. Currently, the majority of the protein sequence datasets in the field of bioinformatics are extremely unbalanced. To use the minority samples efficiently and to avoid the associated lack of data and information, we propose an improved algorithm based on ensemble learning.
UniProt is a database that supplies the bioinformatics field with comprehensive, high-quality resources; users can freely access protein sequences and information on protein functions. We obtained the majority of our datasets from Swiss-Prot in the UniProt Knowledgebase, which contained 519,348 entries in the version released in August 2010. We selected 44,996 protein sequences for our initial dataset. Many of the sequences downloaded from the website had similar amino acid compositions or analogous protein functions and structures. Similarity in amino acid composition is measured by “sequence identity”, which describes the extent to which the same amino acid or nucleotide residues occur at the same positions in two aligned sequences.
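As an illustration of sequence identity, the sketch below computes the percentage of identical positions between two sequences that have already been aligned to the same length. The function name and the choice of denominator (alignment length) are assumptions made for this example; CD-HIT applies its own definition and alignment heuristics, which are not reproduced here.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns holding the same residue in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("sequences must not be empty")
    # A column counts as a match only if both residues are equal and neither is a gap.
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * matches / len(aligned_a)
```

Under this definition, two aligned sequences that agree in three of four columns have an identity of 75%, which would exceed a 40% redundancy threshold.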
The original dataset and the datasets following the threshold removal
Feature analysis and classification performance
In this subsection, we evaluate the classification and choice of feature information. We selected a series of specific metrics to measure the results from our method and those from other existing methods. Because the accuracy rate provides a satisfactory description of the results, it is a good measure for identifying a dataset and showing the classification status. However, when the dataset is extremely unbalanced, the accuracy value may not properly represent the quality of the classifier. For example, in a dataset with 100 samples, a certain classifier may label all 100 as negative, when in fact the dataset contains ten positive samples. Because of its 90% accuracy, one may think that the classifier shows excellent performance, yet it failed to identify any of the positive samples. Therefore, additional criteria are needed to judge how well the positive samples are identified. In this paper, we chose the F-measure and the receiver operating characteristic (ROC) as the criteria.
where TP denotes the number of positive samples that are classified correctly; FP denotes the number of negative samples that are incorrectly classified as positive; and TN and FN denote the numbers of correctly classified negative samples and of positive samples incorrectly classified as negative, respectively.
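The 100-sample example can be checked numerically with the sketch below. It uses the standard textbook definitions of accuracy, precision, recall and F-measure for the positive class, which are assumed here to match the criteria intended in the text; the function names are hypothetical, and the weighted averaging over classes used for the reported results is not reproduced.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN and FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def scores(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, F-measure) for the positive class."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Ratios with an empty denominator are reported as 0.0 by convention here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return accuracy, precision, recall, f_measure


# Ten positives, ninety negatives, and a classifier that predicts "negative" for all.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100
accuracy, precision, recall, f_measure = scores(*confusion_counts(y_true, y_pred))
```

Here the accuracy is 0.9 while the recall and F-measure of the positive class are both 0, which is why accuracy alone is not a sufficient criterion on unbalanced data.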
Performance with different parameters
Comparison of the F-measure of the ensemble classifier imDC and the other classifiers using each of the thresholds
Comparison of the AUC value of the ensemble classifier imDC and the other classifiers using each of the thresholds
Our tables and figures show only the weighted averages. The separate results for the positive and negative classes are not listed. After a series of experiments, a commonly preferred classifier was trained, and all of the comparison results demonstrated that the imbalance classifier, imDC, was very efficient for processing the unbalanced data. Compared with a common single classifier, our ensemble classifier requires more running time. However, the gain in performance compensates for the additional time.
Performance of the features
In the field of bioinformatics, transforming amino acid sequences into feature values is a critical process, and several methods for this were mentioned in the introduction. Here, we describe the three types of approaches used in the next section in detail.
The first 20D features in 188D were exactly the same as the features from the n-gram when n = 1. It is necessary to determine the contribution rate under this condition. Therefore, we obtained an attribute ranking by using principal component analysis (PCA). The contribution of the first 31 dimension features reached 40.576%. Among the 31 features, only one feature belonged to 1-gram. The PCA results verified that our 188D features were significant.
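The 20 composition features shared by the 1-gram representation and the first part of the 188D vector can be sketched as below. The function name and the alphabetical ordering of the amino acids are assumptions made for illustration; the remaining 168 physicochemical features and the PCA ranking are not reproduced.

```python
# Assumed ordering of the twenty standard amino acids (alphabetical by one-letter code).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(seq):
    """Return the 20 relative frequencies of the standard amino acids in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Each feature is the fraction of positions occupied by one amino acid.
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
```

For a sequence made up only of standard residues, the 20 values sum to 1, so each feature can be read as the empirical probability of that amino acid.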
Results and discussion
The rank of features in the mRMR feature selection
Comparison with other software
A comparison of the three predictor methods
There are more than 56 million sequences in the UniProt knowledgebase. We downloaded 545,388 protein sequences that have been reviewed in Swiss-Prot and used our generated model to predict the sequences. A total of 119 protein sequences were identified as DNA-binding proteins. Information about these sequences is listed in Additional file 1. This work will aid in the discovery of more potential DNA-binding proteins.
Interest in the recognition of DNA-binding proteins is rapidly increasing. In this paper, we emphasised the analysis of unbalanced DNA-binding protein data and designed an ensemble classification algorithm to address this imbalance. The new ensemble classifier imDC was shown to make it easier to discriminate DNA-binding proteins from other complex proteins.
After a series of feature extraction comparisons, the 188D feature extraction method proved superior on our unbalanced dataset, although the higher dimensionality increased the running time. Feature selection is necessary to reduce the running time and increase the efficiency of feature extraction. The feature selection method mRMR efficiently solves this problem. Our paper presents the results of the feature selection, and Table 4 summarises the following points: (1) amino acids such as C, H, K, and S are important in recognising DNA-binding proteins, and (2) the features extracted based on hydrophobicity account for 30% of the top ten features, which indicates the importance of hydrophobicity. Finding an appropriate feature dimension to achieve the maximum performance of a classifier at all thresholds will be considered in the future. In addition, using a simple test to compare our model with the other software, we showed that our method has a greater advantage for processing an unbalanced dataset. A user-friendly recognition prediction system is provided at http://datamining.xmu.edu.cn/~songli/nDNA, where users can submit protein sequences for prediction in a particular format. A quick prediction has already been performed on the protein sequences in the UniProtKB/Swiss-Prot database.
The model built in this paper improves the identification of DNA-binding proteins. We expect the results of our research to be useful for future studies in this field.
This work was supported by the Natural Science Foundation of China (No. 61370010, No. 61202011, No. 81101115, No. 61301251).
1. Boutet E, Lieberherr D, Tognolli M, Schneider M, Bairoch A: UniProtKB/Swiss-Prot. Plant Bioinformatics. Humana Press. 2007, 406: 89-112.
2. Lin W-Z, Fang JA, Xiao X, Chou KC: iDNA-Prot: identification of DNA binding proteins using random forest with grey model. PLoS One. 2011, 6 (9): e24756.
3. Lin C, Zou Y, Qin J, Liu X, Jiang Y, Ke C, Zou Q: Hierarchical classification of protein folds using a novel ensemble classifier. PLoS One. 2013, 8 (2): e56499.
4. Chen W, Liu X, Huang Y, Jiang Y, Zou Q, Lin C: Improved method for predicting the protein fold pattern with ensemble classifiers. Genet Mol Res. 2012, 11 (1): 174-181.
5. Liu B, Wang X, Chen Q, Dong Q, Lan X: Using amino acid physicochemical distance transformation for fast protein remote homology detection. PLoS One. 2012, 7 (9): e46633.
6. Patel AK, Patel S, Naik PK: Binary classification of uncharacterized proteins into DNA binding/non-DNA binding proteins from sequence derived features using ANN. Dig J Nanomaterials & Biostructures (DJNB). 2009, 4 (4): 775-782.
7. Cheng L, Hou Z, Lin Y, Tan M, Zhang W, Wu F: Recurrent neural network for non-smooth convex optimization problems with application to the identification of genetic regulatory networks. IEEE Trans Neural Netw. 2011, 22 (5): 714-726.
8. Bhardwaj N, Lu H: Residue-level prediction of DNA-binding sites and its application on DNA-binding protein predictions. FEBS Lett. 2007, 581 (5): 1058-1066.
9. Zou Q, Li X, Jiang Y, Zhao Y, Wang G: BinMemPredict: a web server and software for predicting membrane protein types. Curr Proteomics. 2013, 10 (1): 2-9.
10. Brown PF, Della Pietra VJ, de Souza PV, Lai JC, Mercer RL: Class-based n-gram models of natural language. Comput Linguist. 1992, 18 (4): 467-479.
11. Nordhoff E, Krogsdam AM, Jorgensen HF, Kallipolitis BH, Clark BF, Roepstorff P, Kristiansen K: Rapid identification of DNA-binding proteins by mass spectrometry. Nat Biotechnol. 1999, 17 (9): 884-888.
12. Nanni L, Lumini A: An ensemble of reduced alphabets with protein encoding based on grouped weight for predicting DNA-binding proteins. Amino Acids. 2009, 36 (2): 167-175.
13. Nimrod G, Schushan M, Szilágyi A, Leslie C, Ben-Tal N: iDBPs: a web server for the identification of DNA binding proteins. Bioinformatics. 2010, 26 (5): 692-693.
14. Langlois RE, Lu H: Boosting the prediction and understanding of DNA-binding domains from sequence. Nucleic Acids Res. 2010, 38 (10): 3149-3158.
15. Ma X, Guo J, Liu HD, Xie JM, Sun X: Sequence-based prediction of DNA-binding residues in proteins with conservation and correlation information. IEEE/ACM Trans Comput Biol Bioinform. 2012, 9 (6): 1766-1775.
16. Brown J, Akutsu T: Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology. BMC Bioinformatics. 2009, 10 (1): 25.
17. Fang Y, Guo Y, Feng Y, Li M: Predicting DNA-binding proteins: approached from Chou’s pseudo amino acid composition and other specific sequence features. Amino Acids. 2008, 34 (1): 103-109.
18. Cai YD, Lin SL: Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. Biochim et Biophys Acta (BBA)-Proteins and Proteomics. 2003, 1648 (1): 127-133.
19. Cai C, Han L, Ji Z, Chen X, Chen Y: SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 2003, 31 (13): 3692-3697.
20. Kumar M, Gromiha MM, Raghava GP: Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinformatics. 2007, 8 (1): 463.
21. Rashid M, Saha S, Raghava GP: Support Vector Machine-based method for predicting subcellular localization of mycobacterial proteins using evolutionary information and motifs. BMC Bioinformatics. 2007, 8 (1): 337.
22. Liu B, Xu J, Zou Q, Xu R, Wang X, Chen Q: Using distances between Top-n-gram and residue pairs for protein remote homology detection. BMC Bioinformatics. 2014, 15 (Suppl 2): S3.
23. Zou Q, Wang Z, Wu Y, Liu B, Lin Z, Guan X: An approach for identifying cytokines based on a novel ensemble classifier. BioMed Res Int. 2013, 2013: 686090.
24. Lin C, Chen W, Qiu C, Wu Y, Krishnan S, Zou Q: LibD3C: ensemble classifiers with a clustering and dynamic selection strategy. Neurocomputing. 2014, 123: 424-435.
25. Schneider G, Wrede P: Artificial neural networks for computer-based molecular design. Prog Biophys Mol Biol. 1998, 70 (3): 175-222.
26. Molparia B, Goyal K, Sarkar A, Kumar S, Sundar D: ZiF-Predict: a web tool for predicting DNA-binding specificity in C2H2 zinc finger proteins. Genomics Proteomics Bioinformatics. 2010, 8 (2): 122-126.
27. Ahmad S, Sarai A: Moment-based prediction of DNA-binding proteins. J Mol Biol. 2004, 341 (1): 65-71.
28. Keil M, Exner TE, Brickmann J: Pattern recognition strategies for molecular surfaces: III. Binding site prediction with a neural network. J Comput Chem. 2004, 25 (6): 779-789.
29. Xu R, Zhou J, Liu B, Yao L, He Y, Zou Q, Wang X: enDNA-Prot: identification of DNA-Binding Proteins by applying ensemble learning. BioMed Res Int. 2014, 2014: 10.
30. Cai Y, He J, Li X, Lu L, Yang X, Feng K, Lu W, Kong X: A novel computational approach to predict transcription factor DNA binding preference. J Proteome Res. 2008, 8 (2): 999-1003.
31. Breiman L: Bagging predictors. Machine Learn. 1996, 24 (2): 123-140.
32. Qian Z, Cai Y-D, Li Y: A novel computational method to predict transcription factor DNA binding preference. Biochem Biophys Res Commun. 2006, 348 (3): 1034-1037.
33. Li W, Jaroszewski L, Godzik A: Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein Eng. 2002, 15 (8): 643-649.
34. Cheng X-Y, Huang WJ, Hu SC, Zhang HL, Wang H, Zhang JX, Lin HH, Chen YZ, Zou Q, Ji ZL: A global characterization and identification of multifunctional enzymes. PLoS One. 2012, 7 (6): e38979.
35. Krogh A, Vedelsby J: Neural network ensembles, cross validation, and active learning. Adv Neural Inf Process Syst. 1995, 7: 231-238.
36. Zhang Y, Ding C, Li T: Gene selection algorithm by combining reliefF and mRMR. BMC Genomics. 2008, 9 (Suppl 2): S27.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.