Identification of DNA-binding proteins using support vector machines and evolutionary profiles

Background: Identification of DNA-binding proteins is one of the major challenges in the field of genome annotation, as these proteins play a crucial role in gene regulation. In this paper, we developed various SVM modules for predicting DNA-binding domains and proteins. All models were trained and tested on multiple datasets of non-redundant proteins.
Results: SVM models were developed on DNAaset, which consists of 1153 DNA-binding proteins and an equal number of non-DNA-binding proteins, and achieved maximum accuracies of 72.42% and 71.59% using amino acid and dipeptide compositions, respectively. The performance of the SVM model improved from 72.42% to 74.22% when evolutionary information in the form of PSSM profiles was used as input instead of amino acid composition. In addition, SVM models were developed on DNAset, which consists of 146 DNA-binding and 250 non-binding chains/domains, and achieved maximum accuracies of 79.80% and 86.62% using amino acid composition and PSSM profiles, respectively. The SVM models developed in this study perform better than existing methods on a blind dataset.
Conclusion: A highly accurate method has been developed for predicting DNA-binding proteins using SVM and PSSM profiles. This is the first study in which evolutionary information in the form of PSSM profiles has been used successfully for predicting DNA-binding proteins. A web server, DNAbinder, has been developed for identifying DNA-binding proteins and domains from query amino acid sequences.


Background
DNA-binding proteins (DNA-BPs) are important constituents of both eukaryotic and prokaryotic proteomes. It has been reported that approximately 2-3% of prokaryotic and 6-7% of eukaryotic proteins can bind to DNA [1,2]. These proteins play important roles in DNA packaging, replication, transcription regulation and other activities associated with DNA. Hence, proteins that target specific DNA sequences are potential therapeutics for genetic diseases and cancers. In the form of restriction enzymes, DNA-BPs play a crucial role in prokaryotic host defence. Because of their different functions, DNA-BPs are a diverse group of proteins in terms of both amino acid sequences and three-dimensional structures. Hence, identification of DNA-BPs can play a vital role in proteome annotation and in understanding an important class of proteins.
In the past, several methods have been developed for predicting DNA-BPs. Broadly, these methods can be divided into two categories: (i) prediction from protein structure and (ii) prediction from amino acid sequence. Structure-based prediction methods discriminate DNA-binding and non-binding proteins with high accuracy on the basis of positively charged electrostatic patches [2], DNA-binding structural motifs [3], protein sequence composition, solvent accessibility and secondary structure [4], net charge, dipole and quadrupole moments [5], and size of the largest positive surface patch and amino acid composition [6]. Unfortunately, these methods cannot be used in high-throughput annotation, as they require the structure of a protein for prediction. Cai and Lin (2003) used pseudo-amino acid composition as input for a support vector machine (SVM) to discriminate RNA-, rRNA- and DNA-binding proteins from non-binding proteins (NBPs) [7]. Recently, Yu et al. (2006) [8] developed an SVM-based method for prediction of rRNA-, RNA- and DNA-binding proteins. They used a feature vector of dimension 132, which includes amino acid composition and the composition of physico-chemical properties.
In this work, a systematic attempt has been made to predict DNA-BPs from their amino acid sequences using various features of proteins, such as amino acid composition. First, we analyzed the amino acid composition of DNA-binding proteins and, based on these observations, developed SVM models using amino acid, dipeptide and four-part amino acid compositions of proteins. Besides composition, we also developed SVM models using PSSM profiles obtained from PSI-BLAST. We also examined the performance of similarity search (BLAST and PSI-BLAST) and motif-finding (MEME/MAST) approaches. All models developed in this study were evaluated using the five-fold cross-validation technique.

SVM models
Prediction of DNA-binding domains/chains
SVM models were developed on DNAset (the main dataset), which contains DNA-binding and non-binding chains obtained from PDB. First, a composition-based SVM model was developed for predicting DNA-binding domains; it achieved an accuracy of 79.80% with an MCC of 0.58 (Table 1). In order to understand the high success rate, we compared the amino acid compositions of DNA-binding and non-binding domains in DNAset (Figure 1). As shown in Figure 1, a few residues such as Lys, Arg and Glu are abundant in DNA-binding domains, whereas other residues such as Gly are less frequent in DNA-binding domains. Because of these significant compositional differences, DNA-binding domains can be predicted with high accuracy. In general, SVM models based on dipeptides perform better than amino acid composition based models. However, in this study the dipeptide-based model performed poorly. In addition, earlier studies showed that the performance of split-amino acid composition was better than that of amino acid composition. Hence, we developed SVM models using the amino acid composition of four non-overlapping parts of a protein. As shown in Table 1, the SVM model developed with four-part composition was not as efficient as the amino acid composition based model. Further, it has been well documented that evolutionary information in the form of PSSM profiles provides more information, which has significantly improved the accuracy of prediction in several studies, such as those on RNA-binding sites, subcellular localization and β-turns [9][10][11][12][13]. Thus, we developed SVM models using PSSM profiles and achieved an overall accuracy of 86.62% with an MCC of 0.72. The performance of all SVM modules developed using DNAset is shown as a ROC plot in Figure 2. We also performed a self-consistency test and achieved very high accuracy (see Additional file 1, Table S1).

Prediction on DNA-binding proteins
We developed SVM models on DNAaset, which consists of 1153 DNA-binding proteins and an equal number of non-binding proteins. This dataset has full-length proteins extracted from Swiss-Prot. As shown in Table 2, with amino acid composition we achieved an accuracy of 72.42% with an MCC of 0.45. The performance of the SVM model improved when evolutionary information obtained from PSSM profiles was used, which raised the accuracy and MCC to 74.22% and 0.49, respectively. The performance of all SVM modules is shown as a ROC plot in Figure S1.

Quality of PSSM profiles
As shown in Table 1, the PSSM based SVM models perform better than other models. In order to examine the  Table  S2). We examined the reason for the poor performance and found that a large number of query proteins did not have any PSI-BLAST hit when searched against DNAset. Thus, we evaluated the performance of the SVM model separately on proteins having PSI-BLAST hits (24 DNA-BPs and 56 non-binding proteins at an e-value of 0.1) and proteins having no PSI-BLAST hits (122 DNA-BPs and 194 non-binding proteins). We achieved an accuracy of 83.75% for proteins having PSI-BLAST hits and an accuracy of 78.48% for proteins having no PSI-BLAST hits (Table 3). The PSSM based SVM model performed worse than the SVM model based on amino acid composition on proteins having no PSI-BLAST hit. The SVM model based on PSSMs generated from the "nr" database performed better than that based on PSSMs generated from our dataset (even when there were PSI-BLAST hits). This demonstrates that the performance of a PSSM based SVM model is affected by the quality of the PSSM and is better if the PSSM is generated from similar sequences. In the absence of similarity its performance will be poor (see Figure S2).

Performance on blind dataset
It is important to examine the performance of a newly developed model on an independent dataset. In this study, we evaluated the performance of our SVM models (trained on DNAset) on an independent dataset called DNAiset, which consists of 100 NBPs (BindN_testsp) and 92 DNA-BPs (BindN_testpdb) obtained from Wang and Brown (2006) [14]. S3). This demonstrates that our SVM model performs equally well on an independent dataset.

Performance on realistic dataset
In real life, the number of non-binding proteins is significantly higher than the number of DNA-BPs. Thus, it is important to build and evaluate SVM models on more realistic data rather than on equal numbers of DNA-BPs and NBPs. Hence, we developed a realistic dataset (DNArset), which has 146 DNA-binding domains and 1500 NBPs. First, we developed an SVM model using amino acid composition on DNArset and achieved a maximum MCC of 0.40 with an accuracy of 89.31%. Then we developed a PSSM based SVM model and achieved a maximum MCC of 0.57 with an accuracy of 92.59%. As shown in Table 4, the PSSM based model performed better than the composition-based model in terms of both sensitivity and specificity. These results further confirm the importance of evolutionary information in predicting DNA-BPs.

Comparison with existing methods
It is important to compare the performance of a newly developed method with existing methods in order to demonstrate its capabilities. We compared the performance of the above SVM models with similarity and motif-based approaches, which are commonly used for functional annotation of proteomes. As shown in Additional file 1, Table S4, the sensitivity of both modules, DNA-BLAST [15] and DNA-PSIBLAST [16], was about 10% at an e-value of 0.1 when evaluated on DNAset using the five-fold cross-validation technique. We also evaluated the performance of MEME/MAST on DNAset and observed poor performance, where only 15 domains showed a motif at an e-value cut-off of 1. These results demonstrate that our SVM models perform better than commonly used techniques such as BLAST and MEME/MAST.
Most of the existing methods predict DNA-binding proteins from protein structures. These structure-based prediction methods are not suitable for high-throughput genome annotation, as they require the structure of a protein. To the best of the authors' knowledge, three methods have been developed in the past for predicting DNA-BPs from their amino acid sequences. Ahmad et al. (2004) [4] developed a neural network based method using amino acid composition and reported an accuracy of 64.5%. They used a small but clean and non-redundant dataset, which consists of 62 DNA-BPs obtained from PDB and 915 non-binding proteins obtained from Swiss-Prot. Their dataset was similar to our main dataset (DNAset), where we achieved a maximum accuracy of 79.80% using amino acid composition and 86.62% using PSSM profiles. Cai and Lin (2003) [7] developed an SVM based method using pseudo-amino acid composition and obtained an average accuracy of 81%. Although they compiled a large dataset, it was neither non-redundant nor clean because it contained all proteins, including probable DNA-BPs. Yu    method performs better than other existing methods. The use of SVM, which performs better than ANN, particularly on small datasets, and of evolutionary information in the form of PSSM profiles improved the performance of the present method.

Prediction of DNA-binding domains vs proteins
In this study, we developed modules for predicting DNA-binding proteins using two types of datasets: (i) DNAset, which consists of partial sequences (binding regions) or DNA-binding domains, and (ii) DNAaset, which consists of full-length DNA-binding proteins. The question arises whether modules trained on DNAset (domains or partial sequences) are applicable to predicting full DNA-binding proteins, or vice versa. We predicted the proteins in DNAaset using the amino acid composition based SVM module trained on DNAset and achieved an accuracy of about 55%. Similarly, we predicted the proteins in DNAset using the amino acid composition based SVM module trained on DNAaset and achieved an accuracy of about 63%. This revealed that a model trained on partial sequences or domains is not valid for predicting full DNA-binding proteins, and vice versa. In order to understand the reason for this failure, we computed and compared the amino acid compositions of DNA-binding and non-binding proteins in DNAaset (Figure S3). The DNA-binding and non-DNA-binding proteins have significantly different amino acid compositions in DNAset, whereas such a trend is not observed in DNAaset (Figures 1 and S3). This is because DNAaset has full-length DNA-binding proteins, which may also contain other domains, including non-DNA-binding domains. Thus, methods trained and tested on DNAset are more accurate than those trained and tested on DNAaset. It also explains the failure of methods trained on full proteins (DNAaset) and tested on the domain dataset (DNAset), and vice versa. This shows that separate methods are necessary for predicting DNA-binding domains and DNA-binding proteins.
It is also possible that our models were overtrained. In order to rule out this possibility, we compared their performance with that of other existing methods. We found two publicly available methods: (i) DBS-PRED, developed on DNA-binding domains [4], and (ii) SVM-Prot, developed on full DNA-binding proteins [17]. The performance of DBS-PRED was evaluated on a dataset of 100 DNA-binding and 100 non-binding proteins extracted from Swiss-Prot. As shown in Additional file 1,

Webserver
The prediction method described in this paper is implemented in the form of a web server, DNAbinder (for a detailed description, please see the section Availability and Requirements).

Discussion
DNA-binding proteins are one of the major classes of proteins playing a central role in cellular metabolism. Because of their importance in the regulation of gene expression and other processes, several methods have been developed for predicting DNA-BPs. Most of them predict DNA-BPs using structural information. These methods have limited scope because the structures are unknown for most proteins. In this paper, we developed a highly accurate method for predicting DNA-BPs from their amino acid sequences. The reliability of any prediction method depends mainly on a clean and valid dataset. In the past, two types of datasets have been used: (i) small and clean datasets with experimentally validated DNA-BPs obtained from PDB [4] and (ii) large datasets, where sequences were obtained from Swiss-Prot [7,8]. In this study we developed our models on two datasets, DNAset (the main dataset) and DNAaset (the alternate dataset), in order to benchmark our newly developed method against existing methods. Most of our study was based on DNAset because it has clean and experimentally validated DNA-BPs, whereas DNAaset may also include fragments and putative proteins.
In any functional annotation, the primary step is to search a query protein against a database of annotated proteins (e.g. Swiss-Prot) and assign the function if the query protein has significant similarity with target proteins. Similarity-based annotation is highly accurate if an experimentally annotated homologous protein is found. The major challenge, however, is to predict the function of a protein in the absence of significant similarity. Thus, we developed a method using a non-redundant dataset, where similarity between proteins in the test and training datasets is very low. We applied similarity search and motif-finding techniques to our dataset and found poor performance, as expected. Hence, we developed ANN and SVM models on DNAset using amino acid and dipeptide compositions for predicting DNA-BPs. We achieved an accuracy of 68.46% from the ANN model using amino acid composition, which is slightly better than that of Ahmad et al. (2004) [4] (see Additional file 1, Table S6). This may be due to the increase in the number of DNA-BPs in DNAset. It is interesting to note that the performance of SVM was significantly better than that of ANN. Thus, we developed the rest of the models using SVM. We achieved a high accuracy of 79.80% using the SVM model based on simple amino acid composition. This demonstrates the importance of machine learning techniques in the prediction of DNA-BPs. In order to enhance the performance further, SVM models were developed using dipeptide and four-part compositions. Unexpectedly, the performance of the dipeptide and four-part composition based SVM models was poorer than that of the composition-based SVM model. Standard techniques such as BLAST, PSI-BLAST and MEME/MAST failed to predict DNA-BPs when tested on DNAset because of the low similarity among domains/chains in DNAset.
This is the first study that used evolutionary information to discriminate DNA-BPs from non-binding proteins. We extracted the evolutionary information of a protein from PSSM profiles obtained from a PSI-BLAST search against "nr". The accuracy of the method improved significantly, from 79.80% to 86.62%, using PSSM profiles when evaluated on DNAset. The quality of the PSSM profile of a query sequence depends on the similar sequences present in the target database. As shown in Table 3, the performance of the SVM based model decreases significantly when the PSSM is generated from DNAset instead of the "nr" database. This is because "nr" is a very large database in comparison to DNAset, and hence the chances of finding similar sequences are very high when a sequence is searched against "nr". Thus, the quality of PSSM profiles will be superior in the case of "nr". We also examined PSI-BLAST hits and observed that each protein in DNAset has BLAST hits when searched against the "nr" database. We also demonstrate that PSSM based models perform better if there are significant BLAST hits (Table 3). As the "nr" database is growing exponentially owing to the number of sequencing projects, the quality of PSSM profiles will improve over time, which will eventually improve the performance of PSSM based models. In DNAbinder server, first we examine whether a query sequence has any BLAST hit in "nr"; if yes, PSSM based It is well known that similarity/motif based methods perform better than ab-initio classifiers if a query protein has significant similarity with target proteins. The major challenge is to develop a method that can identify a novel DNA-binding protein even if it has no sequence similarity with any of the known DNA-BPs. Hence, we took non-redundant proteins in our main dataset. Had we taken redundant proteins, the similarity-based method would have performed better than other methods.
Our SVM model based on amino acid composition performed well on a dataset where the similarity-based method failed to detect DNA-BPs. This demonstrates that the models developed in this study are capable of identifying novel DNA-BPs. The SVM model based on PSSM further improved the accuracy by 6-7%. The question arises why we generate the PSSM from the "nr" protein database instead of our own dataset, whereas we assess BLAST/PSI-BLAST and MEME/MAST on DNAset. There is a fundamental difference between BLAST/PSI-BLAST searching and an SVM model using PSSM profiles. In the case of BLAST/PSI-BLAST, one needs to search only against well-annotated proteins because the same function is assigned to the query protein if it has similarity with a target protein. In the case of PSSM, it is not important to search a query sequence against annotated proteins only, because we do not assign function based on similarity. For generating PSSM profiles, it does not matter whether the query sequence has similarity with DNA-BPs, NBPs or un-annotated proteins. Thus, one may create the PSSM from any database such as "nr", in contrast to similarity-based methods, where we need to search against well-annotated proteins only.
The main objective of this study is to develop a prediction method for the identification of new DNA-binding proteins, particularly for annotating newly sequenced genomes. Most of the existing methods have been trained on experimentally annotated DNA-binding protein chains/domains obtained from PDB. Models developed on a dataset of protein chains such as DNAset are suitable for predicting DNA-binding domains but not for predicting DNA-BPs. In order to develop a method for predicting DNA-binding proteins, we also developed an SVM model using DNAaset, which consists of full-length DNA-binding and non-binding proteins. Thus, models developed on DNAaset will be suitable for predicting DNA-BPs. The performance of our SVM model was slightly better than that of the SVM model of Yu et al. (2006) [8] using amino acid composition on the same dataset. The difference in performance may be due to the optimization of learning parameters, as Yu et al. (2006) [8] used the default parameters of SVM_light whereas we used optimized parameters. Our SVM model using PSSM profiles performs better, by 3%, than the SVM model of Yu et al. (2006) [8] based on physicochemical properties. These results suggest that evolutionary information is important for predicting DNA-binding proteins.

Conclusion
We developed a highly accurate method for predicting DNA-BPs using the machine learning technique SVM. For the first time, evolutionary information has been used to predict DNA-binding proteins. It has been observed that PSSM based models perform better than any other models by 3-7% on all the datasets, including the independent and realistic datasets. The SVM models developed in this study perform better than other existing methods. One of the major features of this study is that we developed a publicly available web server and stand-alone software, which allow users to identify the DNA-BPs in their own datasets of proteins. It was observed that models trained on DNA-binding domains or partial sequences are not suitable for predicting DNA-binding proteins, and vice versa. Our server, DNAbinder, allows users to identify DNA-binding domains using the model trained on DNAset and to predict DNA-binding proteins using the model trained on DNAaset. We hope this study will assist biologists in the annotation of genomes.

Datasets
In this study, we used the following datasets to develop various models for predicting DNA-BPs and for evaluating SVM models.

DNAset
We extracted 2435 DNA-BPs from the Protein Data Bank (PDB) [18] using the keywords "Protein-DNA complex", "DNA binding" and "DNA binding proteins". All proteins having no DNA chain or having high similarity with other proteins were filtered out. Finally, we obtained 146 non-redundant DNA-BPs in which no two proteins have a sequence identity of more than 25%. A non-redundant set of 250 non-binding proteins was obtained from Stawiski et al. (2003) [2]. They used the following criteria: (i) no two protein chains have similarity of more than 25% and (ii) the approximate size and electrostatics are similar to those of DNA-BPs. The final dataset, called DNAset (also the main dataset or domain dataset), consists of 146 DNA-binding and 250 non-binding protein chains or domains. For convenience we refer to protein chains as domains, in order to distinguish these PDB chains from the full-length DNA-binding proteins obtained from Swiss-Prot.

DNAaset
In addition to the main dataset (DNAset), we also created an alternate dataset called DNAaset. This dataset consists of 1153 DNA-BPs and 1153 NBPs extracted from Yu et al. (2006) [8]. The parent dataset has 88 rRNA-BPs, 377 RNA-BPs, 1153 DNA-BPs and 17779 non-binding proteins. We randomly picked 1153 NBPs and took all 1153 DNA-BPs to constitute the alternate dataset (DNAaset). This is a non-redundant dataset in which no two proteins have more than 25% similarity.

DNAiset
In order to evaluate the performance of our models on a dataset not used for training or testing, we created an independent dataset called DNAiset. This dataset has 92 DNA-binding protein chains obtained from PDB and 100 non-binding proteins picked from Swiss-Prot. These proteins were obtained from the BindN server [14].

DNArset
Equal numbers of negative and positive examples are important for developing an efficient classifier. They are also important for evaluating a prediction model, because accuracy alone can then be used to measure performance. All of the above datasets have nearly equal numbers of DNA-binding and non-binding proteins. However, in real life DNA-BPs are far fewer than non-binding proteins. This raises the question of whether models developed on equal numbers will be effective in real life. Thus, we created a more realistic dataset called DNArset. This dataset has the 146 DNA-BPs used in DNAset and 1500 NBPs. These 1500 NBPs were extracted from the 17779 non-binding proteins used by Yu et al. (2006) [8] after removing the proteins whose DNA-binding property is not experimentally validated.
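As a small worked check of why accuracy alone becomes hard to interpret on an unbalanced set, the sketch below computes the accuracy of a trivial predictor that always outputs the larger class, using only the class sizes stated in the text. The function name is hypothetical and the calculation is illustrative; it is not part of the published method.

```python
def majority_baseline_accuracy(n_pos, n_neg):
    """Accuracy (percent) of a predictor that always outputs the larger class."""
    return 100.0 * max(n_pos, n_neg) / (n_pos + n_neg)

# Class sizes as stated in the text.
balanced = majority_baseline_accuracy(1153, 1153)   # DNAaset
realistic = majority_baseline_accuracy(146, 1500)   # DNArset

print(f"DNAaset baseline accuracy: {balanced:.2f}%")   # 50.00%
print(f"DNArset baseline accuracy: {realistic:.2f}%")  # 91.13%
```

On DNArset such a predictor would reach about 91% accuracy while detecting no DNA-BPs at all, which is one reason a class-balance-aware measure such as MCC is more informative there.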

Evaluation of models
We adopted the five-fold cross-validation approach to evaluate the performance of all models developed in this study. In this procedure, the whole dataset is randomly divided into five equal parts. Four parts are used for training and the remaining one for testing. The procedure is repeated five times in such a way that each part is tested once. This type of sub-sampling test (e.g., 5- or 7-fold cross-validation) is often used to validate the prediction performance of statistical methods. On the other hand, the jack-knife test is deemed the most rigorous and objective, as analyzed in a comprehensive review [19], and has been increasingly adopted by investigators to test the power of various prediction methods (see, e.g., [20][21][22][23][24][25][26][27][28]). In order to assess the performance of a model, we computed different parameters: sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC) [29].
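The procedure above can be sketched as follows. This is a minimal illustration, assuming a simple random partition with a fixed seed and the standard textbook definitions of sensitivity, specificity, accuracy and MCC; the function names are hypothetical, and the exact equations used by the authors are not reproduced in this section.

```python
import math
import random

def five_fold_indices(n_items, n_folds=5, seed=0):
    """Randomly partition indices 0..n_items-1 into n_folds near-equal parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]

def train_test_splits(folds):
    """Yield (train, test) index lists so that each fold is tested exactly once."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

def performance(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy (percent) and MCC from confusion counts.

    Assumes at least one positive and one negative example are present.
    """
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc

# Example: split the 396 chains of DNAset into five folds.
folds = five_fold_indices(396)
splits = list(train_test_splits(folds))
```

In use, a model would be trained on each `train` list and scored on the matching `test` list, with the confusion counts pooled or averaged over the five rounds before calling `performance`.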

DNA-BLAST and DNA-PSI-BLAST search
In order to assess the performance of similarity search approaches, we evaluated the performance of BLAST [15] and PSI-BLAST [16] on DNAset. We searched the proteins of the test set against the training set proteins using BLAST/PSI-BLAST and assigned a query protein as DNA-binding or non-binding if the first hit was a DNA-BP or an NBP, respectively. We labelled a protein "unknown" if it had no significant similarity with any target protein.
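The assignment rule can be sketched as below. This is an illustration only: it assumes the hits for a query have already been parsed into (label, e-value) pairs, takes the "first hit" to be the one with the lowest e-value, and uses the 0.1 e-value threshold quoted in the Results; the function name and the label strings are hypothetical.

```python
def assign_by_first_hit(hits, evalue_cutoff=0.1):
    """Label a query from its best hit against the training set.

    hits: list of (target_label, evalue) tuples, where target_label is
    "DNA-BP" or "NBP" for the training-set protein that was hit.
    Returns "unknown" when no hit passes the e-value cut-off.
    """
    significant = [h for h in hits if h[1] <= evalue_cutoff]
    if not significant:
        return "unknown"
    best_label, _ = min(significant, key=lambda h: h[1])
    return best_label

print(assign_by_first_hit([("NBP", 1e-3), ("DNA-BP", 1e-10)]))  # DNA-BP
print(assign_by_first_hit([("DNA-BP", 5.0)]))                   # unknown
```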

Support vector machine (SVM)
In this study SVM_light, a freely available software package, was used to implement SVM. SVM is a supervised machine-learning method based on the structural risk minimization principle from statistical learning theory. It takes a set of feature vectors as input, along with their known output labels, which are used for training the model. After training, the learned model can be used for prediction of unknown examples [30,31]. A detailed description of SVM can be found in Vapnik (1995) [32]. In this work, SVM training was carried out by optimizing various kernel function parameters and the value of the regularization parameter C.
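For readers preparing their own input, the sketch below writes feature vectors in the sparse text format that SVM_light reads: one example per line, a target of +1 or -1, then index:value pairs with feature indices starting at 1 in increasing order. The helper name and the toy vectors are hypothetical; the kernel and C values actually selected by the authors are not stated in this section and are not assumed here.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...' with 1-based indices.

    Zero-valued features are omitted, which the sparse format allows.
    """
    target = "+1" if label > 0 else "-1"
    pairs = [f"{i}:{v:.6g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([target] + pairs)

# Toy example: one DNA-BP (+1) and one NBP (-1) with 3-dimensional features.
examples = [(1, [0.5, 0.0, 2.0]), (-1, [0.0, 1.25, 0.0])]
lines = [to_svmlight_line(label, feats) for label, feats in examples]
print("\n".join(lines))
```

A file of such lines would then be given to the `svm_learn` program of SVM_light for training, and a test file in the same format to `svm_classify`.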

Protein features and vector encoding

Amino acid and dipeptide compositions
The aim of calculating the composition of proteins is to transform variable-length protein sequences into fixed-length feature vectors. This is a crucial step in classifying proteins with machine-learning techniques because they require fixed-length input patterns. Converting a protein sequence into a vector of 20 dimensions using amino acid composition encapsulates the overall residue content of the protein. In addition to amino acid composition, dipeptide composition was also used for classification, which gives a fixed pattern length of 400. The advantage of dipeptide composition over amino acid composition is that it encapsulates information about the fraction of amino acids as well as their local order. The amino acid and dipeptide compositions were calculated as described below.
$$\mathrm{comp}(i) = \frac{R_i}{N} \times 100$$
where comp(i) is the percent composition of residue type i, R_i is the number of residues of type i, and N is the total number of residues in the protein (the length of the protein).
$$\mathrm{dpep}(i) = \frac{D_i}{N}$$
where dpep(i) is the fraction (composition) of dipeptide type i, D_i is the number of dipeptides of type i, and N is the total number of overlapping dipeptides in the protein.
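A minimal sketch of both encodings follows, taking comp(i) as a percentage and dpep(i) as a fraction, as the definitions above word them. The alphabetical residue ordering, the function names and the choice to ignore dipeptides containing non-standard residues are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residue types
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 types

def amino_acid_composition(seq):
    """20-dimensional percent composition: 100 * R_i / N."""
    seq = seq.upper()
    n = len(seq)
    return [100.0 * seq.count(aa) / n for aa in AMINO_ACIDS]

def dipeptide_composition(seq):
    """400-dimensional composition: D_i / N over overlapping residue pairs."""
    seq = seq.upper()
    pairs = [seq[k:k + 2] for k in range(len(seq) - 1)]
    counts = {dp: 0 for dp in DIPEPTIDES}
    for pair in pairs:
        if pair in counts:  # pairs with non-standard residues are not counted
            counts[pair] += 1
    n = len(pairs)
    return [counts[dp] / n for dp in DIPEPTIDES]

aac = amino_acid_composition("AAKK")   # 50% Ala, 50% Lys
dpc = dipeptide_composition("AAKK")    # AA, AK and KK each occur once in 3 pairs
```

Whatever the sequence length, the two functions return vectors of length 20 and 400, which is what makes them usable as fixed-length SVM inputs.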

Evolutionary information
In the past, evolutionary information in the form of a position-specific scoring matrix (PSSM) has been used for prediction of protein secondary structure [33][34][35]. Recently, evolutionary information has been used for predicting the subcellular localization of proteins [12,13]. In this study, PSSM has been used for predicting DNA-binding proteins. The PSSM for each sequence was generated by a PSI-BLAST search against the "nr" database using three iterations with an e-value cut-off of 0.001. The PSSM contains the probability of occurrence of each type of amino acid at each residue position of the protein sequence. The evolutionary information in the PSSM is represented by a matrix of dimension L × 21 (L rows and 21 columns) for a protein of length L, where the 21 columns represent the occurrence/substitution of each of the 20 amino acids and a dummy residue 'X' for insertion/deletion. We generated three vectors of dimension 21, 420 and 400, called PSSM-21, PSSM-420 and PSSM-400 respectively, from the PSSM matrix. PSSM-21 is the simple composition of occurrences of each type of amino acid, calculated by summing each column of the PSSM. PSSM-420 is the composition of occurrences of each type of amino acid corresponding to each type of amino acid in the protein sequence; that is, for each column we have 20 values instead of one. Hence, we obtain a vector of dimension 20 × 21 for the PSSM matrix. PSSM-400 is similar to PSSM-420 except that the dummy residue 'X' is ignored, which reduces the dimension to 20 × 20. We normalized the values of the PSSM to the range 0-1 using the formula (value - minimum)/(maximum - minimum) before computing the PSSM-400 and PSSM-420 vectors. The process of converting the L × 21 matrix into PSSM-400 is shown diagrammatically in Figure 3. In this study we mainly used PSSM-400 (or simply PSSM) for developing models.
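The conversion to PSSM-400 can be sketched as follows. This is an interpretation under stated assumptions, not the authors' code: the PSSM is assumed to be already parsed into L rows of 20 numeric columns (dummy column dropped) ordered as in `AMINO_ACIDS`, normalization uses the minimum and maximum of the whole matrix, and rows are summed (not averaged) by the residue type found at each position. All names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column and row ordering

def min_max_normalize(pssm):
    """Scale all matrix values into [0, 1] via (value - min) / (max - min)."""
    flat = [v for row in pssm for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant matrix
    return [[(v - lo) / span for v in row] for row in pssm]

def pssm_400(seq, pssm):
    """Collapse an L x 20 PSSM into a 400-dimensional vector.

    Rows of the normalized PSSM are accumulated according to the residue
    type at that sequence position, giving a 20 x 20 table that is
    flattened row by row.
    """
    if len(seq) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    norm = min_max_normalize(pssm)
    index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    table = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(seq.upper(), norm):
        r = index.get(residue)
        if r is None:  # skip non-standard residues such as 'X'
            continue
        for c in range(20):
            table[r][c] += row[c]
    return [v for row in table for v in row]

# Toy two-residue example: sequence "AC" with a hand-made 2 x 20 matrix.
toy_pssm = [[4] + [0] * 19, [0, 2] + [0] * 18]
vec = pssm_400("AC", toy_pssm)
```

Keeping the 21st column and adding a matching slot in each row would give the 20 × 21 layout described for PSSM-420 under the same assumptions.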

Four-part amino acid compositions
Each protein sequence was divided into four non-overlapping sub-sequences of equal length. The amino acid composition of each sub-sequence was then computed, and the four compositions were concatenated to form the final input vector of dimension 80.
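A minimal sketch of this encoding is given below. How the authors handled sequence lengths that are not a multiple of four is not stated, so the integer-division boundaries used here (segment lengths differing by at most one residue) are an assumption, as are the function name and the residue ordering.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residue types

def four_part_composition(seq, parts=4):
    """Concatenate percent amino acid compositions of consecutive segments."""
    seq = seq.upper()
    n = len(seq)
    if n < parts:
        raise ValueError("sequence is shorter than the number of parts")
    # Integer boundaries: segment lengths differ by at most one residue.
    bounds = [k * n // parts for k in range(parts + 1)]
    vector = []
    for start, end in zip(bounds, bounds[1:]):
        segment = seq[start:end]
        vector.extend(100.0 * segment.count(aa) / len(segment) for aa in AMINO_ACIDS)
    return vector

fp = four_part_composition("AAAAKKKKCCCCDDDD")  # 4 segments x 20 values = 80
```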

Licence: No restriction;
Any restriction to use by non-academics: No restriction.

Figure 3: Schematic representation of the algorithm used to convert the 21*N dimensional PSSM into PSSM-400.